Migrating Kubernetes and Ceph into another Network

By Christian Hüning | June 2, 2018

Introduction

Here at the University of Applied Sciences in Hamburg we operate a multi-tenant cluster for more than 1,000 users, who use it to conduct research and to teach computer science. We call the cluster the Informatik Compute Cloud, or ICC. A central part of that cluster is our storage solution, which is backed by the Ceph distributed storage system. Since we run Kubernetes as our cluster orchestrator, we decided to use the rook.io project to manage and run our Ceph cluster within Kubernetes.

The network at the University is shared by many parties, which is why we have multiple VLANs spanning the entire campus to separate the various users. When we started the cloud project two years ago, we didn't know whether it would be adopted widely enough to become a production system. For that reason the cluster was initially placed in a project network together with other experimental systems.

As it turned out, the project became a huge success and lots of users jumped onto the cloud solution. This caused several problems for the VLAN we were operating in. For instance, we started talking to a number of systems in external VLANs (e.g. etcd was running in a vSphere infrastructure in another part of the campus), which introduced additional overhead in the network components and limited the speed at which we could effectively operate the cluster. The ICC also quickly became a noisy neighbor to a number of other participants in the experimental VLAN.

Finally we made the decision to move all parts of the ICC into its own VLAN and co-locate them physically to mitigate any network overhead and VLAN routing problems.

Preparations

In preparation for the switch we made sure that all relevant traffic between the two networks was temporarily allowed in our Access Control Lists (ACLs) until the move was complete. Our Kubernetes control plane currently runs on VMs, and we decided to keep it there until the move of all other nodes was complete.

Moving Compute Nodes

Moving the compute nodes over was the simple part. As we boot our nodes with Red Hat's CoreOS Container Linux via iPXE through Matchbox, a ‘move’ means powering down the node, changing the switchport's VLAN, adjusting the PXE boot configuration to use a new IP address, and rebooting the node. We did that one by one, and the nodes all migrated over to the new network.
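
From the Kubernetes side, the per-node part of such a move is the usual cordon/drain/uncordon sequence. A minimal sketch, assuming a hypothetical compute node name and the kubectl drain flags of that Kubernetes generation:

  # Evict all pods and mark the node unschedulable before powering it down
  kubectl drain compute-node-1.icc.informatik.haw-hamburg.de --ignore-daemonsets --delete-local-data

  # ...power down, change the switchport VLAN, adjust the Matchbox/iPXE profile, boot the node...

  # Allow workloads to be scheduled on the node again after it has rejoined
  kubectl uncordon compute-node-1.icc.informatik.haw-hamburg.de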

Moving Ceph/rook Storage Nodes

Hardware Setup

The storage nodes were a bit of a different story. We're running 6 storage nodes, each sporting 8 OSDs backed by 4 TB drives. Here are the full specs:

2x 10-core Xeon & 96 GB RAM
8x 4 TB HDD @ 7200 rpm
1x 2 TB NVMe PCIe SSD
1x 8 GB USB stick

You may wonder what we used the USB stick for. In fact, it was a bit of a hacky solution to store the Ceph monitor data in a persistent way.
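
For context: rook's dataDirHostPath (see the cluster spec below) is just a directory on the host, so persisting the monitor data means mounting a dedicated device there. On Container Linux this can be expressed as a systemd mount unit in the provisioning config; the snippet below is only a sketch, with the device name and filesystem type as assumptions:

systemd:
  units:
    - name: var-lib-rook.mount
      enabled: true
      contents: |
        [Unit]
        Description=Persistent device for rook's dataDirHostPath

        [Mount]
        # Device name is an assumption; enumeration can change between boots
        What=/dev/nvme0n1
        Where=/var/lib/rook
        Type=ext4

        [Install]
        WantedBy=local-fs.target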

ICC Hardware

Prior to moving the storage we decided we wanted to get rid of the USB sticks and replace them with proper M.2 NVMe drives with more storage capacity, as we had learned through conversations on the Rook Slack that the monitor map data can easily grow to 50-100 GB during migration or recovery phases of the cluster. So we bought 6x 512 GB M.2 Samsung 960 Pro drives. The resulting specs are:

2x 10-core Xeon & 96 GB RAM
8x 4 TB HDD @ 7200 rpm
1x 2 TB NVMe PCIe SSD
1x 512 GB M.2 Samsung 960 Pro

Ceph Cluster Spec

Our rook.io Ceph cluster specification looks like this:

apiVersion: rook.io/v1alpha1
kind: Cluster
metadata:
  name: rook
  namespace: rook
spec:
  placement:
    all:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: rook
              operator: In
              values:
              - "true"
  versionTag: v0.7.1
  dataDirHostPath: /var/lib/rook
  monCount: 5
  storage:
    useAllNodes: false
    useAllDevices: false
    metadataDevice: "nvme0n1"
    storeConfig:
      storeType: bluestore
      databaseSizeMB: 204800
      walSizeMB: 5670
    nodes:
    - name: storage-node-1.icc.informatik.haw-hamburg.de
      devices:
      - name: "sda"
      - name: "sdb"
      - name: "sdc"
      - name: "sdd"
      - name: "sde"
      - name: "sdf"
      - name: "sdg"
      - name: "sdh"
    # repeats 5 times for each node
    # omitted for readability

As you can see, we use the new BlueStore storage engine. We specify our metadata device to be the 2 TB NVMe drive and provide DB and WAL sizes that nicely utilize almost the entire NVMe metadata device (labelled nvme0n1 on our Container Linux hosts). We also specify that we want 5 monitors and that we want to run only on nodes labeled with rook=true. For every node we then explicitly list the devices to use for actual data storage, to avoid accidental device stealing.
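
The rook=true label used in the nodeAffinity above is an ordinary Kubernetes node label, so preparing a node for rook boils down to a single command (node name taken from the spec above):

  # Make the node eligible for rook pods, matching the nodeAffinity in the cluster spec
  kubectl label node storage-node-1.icc.informatik.haw-hamburg.de rook=true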

To The Migration We Go

The Plan

We thought long and hard about how to approach the migration. We knew we would have to switch to the new drive config and also wanted to maintain operations: production workloads ideally shouldn't be affected by the migration, since all ACLs were in place and Ceph should keep functioning. As our 6-node cluster could theoretically tolerate the loss of 2 nodes, we came up with the following procedure (a command-level sketch of steps 1 and 2 follows the list):

  1. Remove a node from the cluster.yaml spec and let the rook-operator do its work of removing the OSDs
  2. Drain the node and shut it down
  3. Move the node to the new network (see Moving Compute Nodes above)
  4. Alter the config to use the M.2 drive for /var/lib/rook
  5. Boot the node
  6. Add the node with its new hostname to the cluster.yaml spec and let the rook-operator do its work
  7. Repeat until done
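
In Kubernetes/rook terms, steps 1 and 2 look roughly like this; the manifest filename is ours and the drain flags match the kubectl versions of that era:

  # Step 1: delete the node's entry from the Cluster spec and re-apply it;
  # the rook-operator then removes that node's OSDs from the Ceph cluster
  kubectl apply -f cluster.yaml

  # Step 2: once the OSDs are gone, evict the remaining pods and shut the node down
  kubectl drain storage-node-1.icc.informatik.haw-hamburg.de --ignore-daemonsets --delete-local-data
  sudo systemctl poweroff    # run on the node itself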

The First and Second Node

We started with the first node. It went out of the cluster, we added the new drive, adjusted the config, rebooted, and the node was back. Upon re-entering rook it immediately started to add OSDs and reorganize the Ceph cluster. The only small issue: we had to restart the rook-operator after it finished removing the node and its OSDs so that the noscrub and nodeep-scrub flags got removed again. The other thing we noticed was that the M.2 drive now showed up as nvme0n1, whereas the drive formerly known by that name was now nvme1n1. Since our configuration stated that we wanted to use nvme0n1 as the metadata drive, we had to adjust that. We did it in such a way that we kept the default for the non-migrated hosts and overrode it for all migrated nodes. So in an intermediate step the cluster.yaml looked like this:

apiVersion: rook.io/v1alpha1
kind: Cluster
metadata:
  name: rook
  namespace: rook
spec:
  placement:
    all:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: rook
              operator: In
              values:
              - "true"
  versionTag: v0.7.1
  dataDirHostPath: /var/lib/rook
  monCount: 5
  storage:
    useAllNodes: false
    useAllDevices: false
    metadataDevice: "nvme0n1"
    storeConfig:
      storeType: bluestore
      databaseSizeMB: 204800
      walSizeMB: 5670
    nodes:
    - name: storage-node-1.icc.informatik.haw-hamburg.de # already migrated
      metadataDevice: "nvme1n1"   # override to account for new m.2 device
      devices:
      - name: "sda"
      - name: "sdb"
      - name: "sdc"
      - name: "sdd"
      - name: "sde"
      - name: "sdf"
      - name: "sdg"
      - name: "sdh"
    - name: storage-node-2.icc.informatik.haw-hamburg.de # not yet migrated
      devices:
      - name: "sda"
      - name: "sdb"
      - name: "sdc"
      - name: "sdd"
      - name: "sde"
      - name: "sdf"
      - name: "sdg"
      - name: "sdh"
    # repeats 5 times for each node
    # omitted for readability

With that in mind we kept going and repeated with node 2. All went well.
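
For reference, the manual cleanup mentioned above consisted of standard Ceph commands run from the rook-tools pod, plus a restart of the operator pod. The namespaces and label below are the defaults of the v0.7 manifests, so adjust them to your deployment:

  # Check which cluster flags are currently set
  kubectl -n rook exec rook-tools -- ceph osd dump | grep flags

  # Clear the scrub flags once the node removal has finished
  kubectl -n rook exec rook-tools -- ceph osd unset noscrub
  kubectl -n rook exec rook-tools -- ceph osd unset nodeep-scrub

  # Restart the rook-operator so it re-evaluates the cluster state
  kubectl -n rook-system delete pod -l app=rook-operator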

The Third Node

After going through the list for node 3, the cluster suddenly went into the HEALTH_ERR state. The output of ceph status looked like this:

  cluster:
    id:     df18a2db-23a2-43f0-9e7f-0404ea4d7e09
    health: HEALTH_ERR
            1 full osd(s)
            1 pool(s) full
            104946/932172 objects misplaced (11.258%)

  services:
    mon: 5 daemons, quorum rook-ceph-mon44,rook-ceph-mon43,rook-ceph-mon48,rook-ceph-mon50,rook-ceph-mon51
    mgr: rook-ceph-mgr0(active)
    osd: 48 osds: 48 up, 48 in; 517 remapped pgs

  data:
    pools:   1 pools, 2048 pgs
    objects: 303k objects, 1100 GB
    usage:   12950 GB used, 167 TB / 180 TB avail
    pgs:     104946/932172 objects misplaced (11.258%)
             1528 active+clean
             513  active+remapped+backfill_wait
             7    active+remapped+backfilling

  io:
    recovery: 84838 kB/s, 1 keys/s, 23 objects/s

So it appeared an OSD was full, and thus the entire pool. This was concerning. One of our first assumptions was that a drive had somehow been formatted wrongly and exposed too small a partition to the cluster. It turned out this wasn't the case; we had simply forgotten about the USB sticks in the nodes. They only had 8 GB of capacity, and on the third node the stick suddenly got announced as /dev/sda instead of /dev/sdi. The result was an OSD with only 8 GB, which in our case was way too small for the ~300 GB being stored on every OSD at that time. The immediate solution was to remove that OSD from the cluster via ceph osd out <ID> (a short sketch follows below). The cluster went back to HEALTH_WARN. The real solution was to simply add the step 'Pull USB stick' to the plan:

  1. Remove a node from the cluster.yaml spec and let the rook-operator do its work of removing the OSDs
  2. Drain the node and shut it down
  3. Move the node to the new network (see Moving Compute Nodes above)
  4. Alter the config to use the M.2 drive for /var/lib/rook
  5. Pull the USB stick
  6. Boot the node
  7. Add the node with its new hostname to the cluster.yaml spec and let the rook-operator do its work
  8. Repeat until done

Those were ~10 minutes in which the Ceph cluster was not available to users!
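
For reference, the mitigation for the full 8 GB OSD boiled down to two standard Ceph commands; the OSD ID below is illustrative, not the real one:

  # ceph osd df lists size and utilization per OSD, which makes the 8 GB stick easy to spot
  ceph osd df

  # Mark the offending OSD out so its placement groups are redistributed to the other OSDs
  ceph osd out 42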

The Third to Sixth Node

For the remainder of the nodes everything went smoothly. We just followed our plan and repeated each step on every node. Removing a node from the cluster took about 1 hour; re-adding it took about 1.5 hours.
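
While waiting for a node to drain or backfill, the progress can be followed with standard tooling; a rough sketch:

  # Cluster health and backfill progress, refreshed every 30 seconds
  watch -n 30 ceph status

  # Watch the OSD pods disappear and reappear while rook removes and re-adds a node
  kubectl -n rook get pods -o wide --watch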

Conclusion

Ceph and especially the rook.io project enabled us to move the nodes to a new network without having to stop production operations. rook.io was especially helpful because we didn't have to remove a node and its OSDs from the Ceph cluster by hand. The rook-operator could get better at recognizing changes, but we have no doubt that the upcoming version of rook (0.8) will improve a lot on that! The community around rook is also awesome. Without the helpful people on the rook Slack team, we wouldn't be running Ceph in production the way we do today!

Moving an in-production Kubernetes + Ceph cluster to another network is possible, but complex. The overall preparation took us 4-5 weeks, and we couldn't perform the migration without some minor issues (e.g. ACLs we had overlooked). Overall the technology proved it can be done, but we conclude:

Don’t try this at home!