An S3 Storage Experiment

My team at SUSE is working on a new S3-compatible storage solution for Kubernetes, based on Ceph’s RADOS Gateway (RGW), except without any of the RADOS bits. The idea is that you can deploy our s3gw container on top of Longhorn (which provides the underlying replicated storage), and all this is running in your Kubernetes cluster, along with your applications which thus have convenient access to a local S3-compatible object store.

We’ve done this by adding a new storage backend to RGW. The approach we’ve taken is to use SQLite for metadata, with object data stored as files in a regular filesystem. This works quite neatly in a Kubernetes cluster with Longhorn, because Longhorn can provide a persistent volume (think: an ext4 filesystem), on which s3gw can store its SQLite database and object data files. If you’d like to kick the tyres, check out Giuseppe’s deployment tutorial for the 0.2.0 release, but bear in mind that as I’m writing this we’re all the way up to 0.4.0 so some details may have changed.

While s3gw on Longhorn on Kubernetes remains our primary focus for this project, the fact that this thing only needs a filesystem for backing storage means it can be run on top of just about anything. Given “just about anything” includes an old-school two-node Pacemaker cluster with DRBD for replicated storage, why not give that a try? I kinda like the idea of a good solid highly available S3-compatible storage solution that you could shove into the bottom of a rack somewhere without too much difficulty.

It’s probably eight years since I last deployed Pacemaker and DRBD, so to refresh my memory I ran with SUSE’s latest Highly Available NFS Storage with DRBD and Pacemaker document, but skipped all the NFS bits. That gives a filesystem mounted on one node, which will fail over to the other node if something breaks. On top of that, we need to run the s3gw container, the s3gw-ui container, an nginx HTTPS reverse proxy to smoosh those two together, and a virtual/floating IP, so the whole lot is accessible to the outside world.

Here are the interesting parts of my Pacemaker configuration:

# crm configure show
[...]
primitive drbd_s3 ocf:linbit:drbd \
        params drbd_resource=s3 drbdconf="/etc/drbd.conf" \
        op monitor interval=29s role=Master \
        op monitor interval=31s role=Slave
primitive fs_s3 Filesystem \
        params device="/dev/drbd0" directory="/data" fstype=ext4 \
        meta target-role=Started \
        op start timeout=60s interval=0 \
        op stop timeout=60s interval=0 \
        op monitor interval=20s timeout=40s
primitive https nginx \
        op start timeout=40s interval=0 \
        op stop timeout=60s interval=0 \
        op monitor timeout=30s interval=10s \
        op monitor timeout=30s interval=30s \
        op monitor timeout=60s interval=20s
primitive s3-ip IPaddr2 \
        params ip=192.168.100.50 \
        op monitor interval=10 timeout=20
primitive s3gw podman \
        params image="ghcr.io/aquarist-labs/s3gw:latest" run_opts="-p 7480:7480 -v/data:/data" \
        op start interval=0 timeout=90s \
        op stop interval=0 timeout=90s \
        op monitor interval=30s timeout=30s
primitive s3gw-ui podman \
        params image="ghcr.io/aquarist-labs/s3gw-ui:latest" run_opts="-p 8080:8080 -e RGW_SERVICE_URL=https://s3gw.sleha.test" \
        op start interval=0 timeout=90s \
        op stop interval=0 timeout=90s \
        op monitor interval=30s timeout=30s
group g-s3 fs_s3 s3gw s3gw-ui https s3-ip
ms ms-drbd_s3 drbd_s3 \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
colocation col-s3_on_drbd inf: g-s3 ms-drbd_s3:Promoted
order o-drbd_before_fs Mandatory: ms-drbd_s3:promote g-s3:start
[...]

The g-s3 group ensures that the ext4 filesystem (fs_s3), s3gw container (s3gw), s3gw-ui container (s3gw-ui), nginx instance (https) and virtual IP (s3-ip) all run on the same node, and start one after another. The colocation and ordering constraints ensure that g-s3 runs on whichever node is currently the DRBD (ms-drbd_s3) primary.

The important pieces of glue here are:

The fs_s3 resource mounts /dev/drbd0 on /data
The s3gw resource passes -p 7480:7480 -v/data:/data to podman, so the container can write to /data on the host, and the S3 service is accessible via HTTP on port 7480.
The s3gw-ui resource passes -p 8080:8080 -e RGW_SERVICE_URL=https://s3gw.sleha.test to podman, so the UI is accessible via HTTP on port 8080, and it expects the S3 service to be externally available via https://s3gw.sleha.test.
nginx is configured to reverse proxy https://s3gw.sleha.test to http://localhost:7480, and https://s3gw-ui.sleha.test to http://localhost:8080.
I’ve got an entry in /etc/hosts to point s3gw.sleha.test and s3gw-ui.sleha.test at the virtual IP (192.168.100.50).
I’m using self-signed certificates (openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout cert.key -out cert.pem) for s3gw and s3gw-ui, so I had to go visit both https://s3gw.sleha.test and https://s3gw-ui.sleha.test in my browser and accept the SSL certificate before the UI would work.
The DRBD config, nginx config and SSL certificates and keys need to be present on all nodes. I used csync2 for this.
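The nginx config further down refers to two certificate pairs (cert.pem/cert.key and cert-ui.pem/cert-ui.key), so the openssl command above has to be run once per hostname. A minimal sketch of that, assuming the certificates only need a matching CN (the -subj values and the CERT_DIR variable are illustrative additions, not part of the original setup):

```shell
#!/bin/sh
# Sketch: generate one self-signed certificate per hostname.
# CERT_DIR is a hypothetical variable; it defaults to a scratch directory.
set -e
CERT_DIR="${CERT_DIR:-$(mktemp -d)}"

# hostname:basename pairs, matching the file names used in nginx.conf
for pair in "s3gw.sleha.test:cert" "s3gw-ui.sleha.test:cert-ui"; do
    host="${pair%%:*}"   # text before the colon
    name="${pair##*:}"   # text after the colon
    openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
        -subj "/CN=${host}" \
        -keyout "${CERT_DIR}/${name}.key" \
        -out "${CERT_DIR}/${name}.pem"
done

ls -l "${CERT_DIR}"
```

The resulting files still need to be copied to wherever nginx expects them, and synced to the other node. The matching /etc/hosts entry on the client would be a single line mapping 192.168.100.50 to both s3gw.sleha.test and s3gw-ui.sleha.test.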

Here’s my /etc/nginx/nginx.conf. I’m not entirely convinced I’ve got everything 100% right here, but it seems to work (this is, incredibly, my first time doing anything with nginx, and my first time dealing with CORS):

worker_processes  1;

events {
    worker_connections  1024;
    use epoll;
}

http {
    include       mime.types;
    default_type  application/octet-stream;

    sendfile        on;
    keepalive_timeout  65;

    server {
        listen       80;
        return       301 https://$host$request_uri; 
    }

    server {
        listen       443 ssl;
        server_name  s3gw.sleha.test;

        access_log /var/log/nginx/s3gw.access.log;

        location / {
            proxy_set_header        Host $host;
            proxy_set_header        X-Real-IP $remote_addr;
            proxy_set_header        X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header        X-Forwarded-Proto $scheme;

            add_header Access-Control-Allow-Origin 'https://s3gw-ui.sleha.test';
            add_header Access-Control-Allow-Methods 'GET,HEAD,PUT,POST,DELETE';
            add_header Access-Control-Allow-Headers '*';
            add_header 'Access-Control-Allow-Credentials' 'true';

            if ($request_method = 'OPTIONS') {
                add_header Access-Control-Allow-Origin 'https://s3gw-ui.sleha.test';
                add_header Access-Control-Allow-Methods 'GET,HEAD,PUT,POST,DELETE';
                add_header Access-Control-Allow-Headers '*';
                add_header 'Access-Control-Allow-Credentials' 'true';
                add_header 'Content-Type' 'text/plain; charset=UTF-8';
                add_header 'Content-Length' 0;
                return 204;
            }

            proxy_pass          http://localhost:7480;
            proxy_read_timeout  90;
            proxy_redirect      http://localhost:7480 https://s3gw.sleha.test;
        }

        ssl_certificate      cert.pem;
        ssl_certificate_key  cert.key;
        ssl_protocols        TLSv1.2;
        ssl_session_cache    shared:SSL:1m;
        ssl_session_timeout  5m;
        ssl_ciphers  HIGH:!aNULL:!MD5;
        ssl_prefer_server_ciphers  on;
    }

    server {
        listen       443 ssl;
        server_name  s3gw-ui.sleha.test;

        access_log /var/log/nginx/s3gw-ui.access.log;

        location / {
            proxy_set_header        Host $host;
            proxy_set_header        X-Real-IP $remote_addr;
            proxy_set_header        X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header        X-Forwarded-Proto $scheme;

            proxy_pass          http://localhost:8080;
            proxy_read_timeout  90;

            proxy_redirect      http://localhost:8080 https://s3gw-ui.sleha.test;
        }

        ssl_certificate      cert-ui.pem;
        ssl_certificate_key  cert-ui.key;
        ssl_protocols        TLSv1.2;
        ssl_session_cache    shared:SSL:1m;
        ssl_session_timeout  5m;
        ssl_ciphers  HIGH:!aNULL:!MD5;
        ssl_prefer_server_ciphers  on;
    }
}

A couple of important points about Pacemaker’s support for running containers with podman:

You have to manually pull the containers on both nodes (podman pull ghcr.io/aquarist-labs/s3gw ; podman pull ghcr.io/aquarist-labs/s3gw-ui) before you can run them, because by default the podman resource agent won’t do it for you. There is an option (allow_pull) which you could turn on, but doing so “can drastically increase the time required to start the container if the image repository is pulled over the network”. This sounds like a bad idea to me.
Monitoring of processes inside containers is a bit sketchy. By default, it will run podman exec $CONTAINER /bin/true, but that really only proves that the container is alive. You can override that command with something else, but it’s apparently better to engineer your container to die quickly and well if something goes wrong.
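For reference, the override mentioned above is the podman resource agent's monitor_cmd parameter, which is run inside the container via podman exec. A hedged sketch of what that might look like for the s3gw resource follows. The curl command is purely hypothetical: it assumes the s3gw image ships curl and that an unauthenticated request to port 7480 returns a success status, neither of which is confirmed here.

```
primitive s3gw podman \
        params image="ghcr.io/aquarist-labs/s3gw:latest" run_opts="-p 7480:7480 -v/data:/data" \
                monitor_cmd="curl -sf -o /dev/null http://localhost:7480/" \
        op start interval=0 timeout=90s \
        op stop interval=0 timeout=90s \
        op monitor interval=30s timeout=30s
```

If the command exits non-zero, the monitor operation fails and Pacemaker treats the resource as failed, so whatever goes in monitor_cmd needs to be cheap and reliable.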

So what was the end result? TL;DR: It pretty much All Just Worked™, which is exactly what you’d hope for when running a new application on a mature HA stack. I can use s3cmd to mess around with the S3 service, and use my web browser to play with the UI. Failover is nice and quick (think: a few seconds) if I kill a node. For the sake of convenience I did this experiment on a couple of VMs using the external/libvirt STONITH plugin, but I don’t expect a real deployment to be hugely different in behaviour. Also, I’d forgotten how good Pacemaker is at highlighting poorly behaved applications – prior to this experiment the s3gw-ui container didn’t stop well, but we weren’t aware of that until I tried a manual failover which took too long and resulted in an unexpected STONITH due to a stop timeout. Moritz has since fixed that.

One thing I tripped over when doing this deployment was the correct values to use for the access_key and secret_key of the default user when talking to the S3 service. These are actually settable for the s3gw container via the RGW_DEFAULT_USER_ACCESS_KEY and RGW_DEFAULT_USER_SECRET_KEY environment variables, but if left unset, they default to “test” and “test” respectively. The interesting bits of my s3cmd.cfg are thus:

access_key = test
secret_key = test
host_base = https://s3gw.sleha.test/
host_bucket = https://s3gw.sleha.test/%(bucket)

In retrospect I probably should have added -e RGW_DEFAULT_USER_ACCESS_KEY=tserong -e RGW_DEFAULT_USER_SECRET_KEY=do_not_tell_anyone_this_is_your_password to the run_opts parameter of the s3gw resource in the Pacemaker config.
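Spelled out, the s3gw primitive with those two variables added to run_opts would look roughly like this (a sketch only; everything other than the two -e options is unchanged from the configuration shown earlier):

```
primitive s3gw podman \
        params image="ghcr.io/aquarist-labs/s3gw:latest" run_opts="-p 7480:7480 -v/data:/data -e RGW_DEFAULT_USER_ACCESS_KEY=tserong -e RGW_DEFAULT_USER_SECRET_KEY=do_not_tell_anyone_this_is_your_password" \
        op start interval=0 timeout=90s \
        op stop interval=0 timeout=90s \
        op monitor interval=30s timeout=30s
```

The access_key and secret_key lines in s3cmd.cfg would then need to change to the same values.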

Ourobengr

An engineer eating his own tail
