Remember our big migration for Private Cloud customers from our infrastructure to xCloud on Terremark? It's been over a year since that announcement and our migrations are long complete. But how did they work? This post, and another coming soon, will give the technical rundown of the process and tools we used.
To understand how and why we made choices, it's helpful to know what we started with. For running virtual machines (VMs) we used Xen with all VMs running paravirtualized. This means they used a special kernel that knew it was being virtualized, which provided some shortcuts and performance benefits for storage and network access. In our case it also meant that the VMs were missing some items you would find on a fully virtualized VM or physical machine, such as a boot loader.
For storage, there were two aspects: storage available locally to VMs and storage shared among sets of VMs. VM local storage was used for local filesystems such as root and swap. Shared storage was used for data that sets of VMs performing the same function would need to access together, such as public assets.
Both kinds were managed at the Xen host level using clustered LVM which carved up LUNs presented by our Coraid ATA over Ethernet (AoE) shelves. The resulting logical volumes (LVs) were then attached directly to the VMs as block devices. For example, an LV for a root filesystem would be attached to a VM and be used from within that VM as
Local filesystems used ReiserFS, which worked well for us. Shared filesystems used GFS, part of the Red Hat Cluster Suite. GFS, while often misunderstood, also worked well for us, but it required that all VMs using it be able to access the same backing LV simultaneously.
On the networking side, each VM had two network interfaces: one for VM external traffic and another for the cluster communication needed for using GFS. VM external traffic included normal VM to VM traffic such as database access as well as getting traffic from the Internet into the VM for processing. When using the VM external interface, all traffic was routed; VMs were not in the same broadcast domain as any others.
Requirements and options
Once the decision to migrate was made, we began pondering options. Some of our requirements included:
- Minimal customer involvement
- Minimal customer application impact, including downtime during whatever the cutover to Terremark would end up meaning
- Minimal moving/changing parts
- As much automation as possible
- Ability for customers to test and verify operation of the new Terremark setup without impacting their application still running on our infrastructure
Ultimately, we decided the copy-and-transform method best met our requirements. Re-deploying would have been nice, since we could have reviewed and improved existing setups, but it would have been too invasive for both customers and the staff who would be performing the migrations.
For moving the data, we could either ship physical media (probably entire Coraid shelves) with archives on them or transfer data over the Internet. Sending data over the network won handily since it was much faster to start developing with and we could easily change the archiving process as needed.
Knowing our requirements and how we wanted to move data, we then looked at the changes we were facing going from our infrastructure and its assumptions to Terremark.
Terremark uses VMware for virtualization, running VMs fully virtualized instead of paravirtualized. Since our VMs assumed they were paravirtualized, we would need to get them self-bootable with a proper kernel and drivers for storage and network access.
On the storage side, we would be using disks instead of LVs. In our infrastructure using LVM behind the scenes meant we could snapshot VM local filesystems at the Xen host level by snapshotting the backing LV. To maintain that ability, we would need to have each VM using LVM. For shared VM storage, GFS would not be an option since disks could not be accessed by multiple VMs simultaneously.
The networking changes were less substantial. We decided each VM would only need one interface since we would not be using GFS and the related clustering. At Terremark, internal networking is done with VLANs which segment traffic at layers 2 and 3. This wouldn't require a large change in operation but we would need to get each customer's VMs into a separate VLAN.
Step 1: What would you like to migrate today?
With all that, we were ready to start. From here we'll take a look at the steps for a migration and what was involved for each.
First, we needed a way to describe a migration. Some questions that needed to be answered:
- What VMs from our infrastructure are migrating and how should they be assembled at Terremark?
- What should they look like as far as memory, CPU, and storage?
- Are there databases involved? If so, where will the data be coming from?
- How many public IPs are needed and what ports on the outside should map to some set of VMs on the inside?
Migration documents revolve around document-unique identifiers (“duid”) as a way to reference the yet-unknown new entity at Terremark. For example, lines 7-28 describe a VM at Terremark that will be created at some point. Since it doesn't yet have a name and we need to also describe other items related to it, such as what VM on our infrastructure it should be migrated from (lines 259 and 260), we used its duid instead. They're also used for environments (groups of VMs doing similar things), public IPs and their port mappings, and groups of database servers.
The primary creator of these documents was our migration UI. It was a Sinatra application that walked our customers through reviewing their old and new configurations and, once they were ready, created an appropriate migration document and submitted it to our migration service for processing.
Some migrations had to be handled outside the migration UI due to extra complexity. Fortunately, while not ideal, the migration document format was usable by humans as well for describing situations outside what the migration UI was expected to accommodate.
Step 2: Migration document submitted
Once a migration document was submitted, the migration service went to work. The migration service was responsible for taking the intent described by the document and getting a new setup at Terremark provisioned and ready for testing and verification by our migration staff and the customer.
The first step was processing the document and turning the placeholder duids into real entities. For VMs, new names were assigned using our standard formatting and sequencing. For public IPs, the migration service allocated the quantity required by the document and marked them as used. If the customer didn't already have an internal VLAN assigned to them, one was chosen and marked as used.
Thus, using the example document above, the migration service might have determined that:
- The new VMs (duids vm_1 through vm_6) would be named tm21-s00101 through tm21-s00106
- The new environments (duids env_1 and env_2) would be named tm21-e00050 and tm21-e00051
- The new public IP address (duid ip_1) would be 184.108.40.206
- The VMs for this customer would be placed on VLAN 10.100.0.0/26. Additionally, unused IPs were assigned to each VM.
- Lines 157-166 now define a mapping on new IP 220.127.116.11, port 443, going to VMs tm21-s00101 through tm21-s00104, also on port 443
- Lines 224-233 declare that tm21-s00105 and tm21-s00106 will be the new database master and replica VMs, respectively, running MySQL 5.0.51
From the migration section of the document:
- The current shared filesystem (GFS) from old environment ey04-e00075 mounted at /data would be moved to new environment tm21-e00050, specifically VM tm21-s00101 since it has a filesystem with a matching mountpoint on lines 25-27
- The current IP address 18.104.22.168 would be equivalent to new IP address 22.214.171.124 as far as, for example, what backing VMs traffic to port 80 went to
- The foobar_production database, currently on mysql-mega-2-master and its replica, would be moving to tm21-s00105 and tm21-s00106 by following the database server group mapping
- The local filesystems (mainly /) for each of tm21-s00101 through tm21-s00104 would be coming from ey04-s50000 through ey04-s50003, respectively
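The placeholder resolution walked through above can be sketched in a few lines of Python. This is only an illustration and not the migration service's actual code: the function names, the starting sequence numbers, and the idea of passing in a set of already-used addresses are all assumptions made for the example.

```python
import ipaddress
from itertools import count


def make_namer(prefix, kind, start, width=5):
    """Return a function handing out sequential names such as tm21-s00101."""
    seq = count(start)
    return lambda: f"{prefix}-{kind}{next(seq):0{width}d}"


def resolve_duids(vm_duids, env_duids, subnet, used_ips, next_vm, next_env):
    """Map placeholder duids to real names and give each VM an unused IP.

    used_ips is a set of ipaddress objects and is updated in place
    (hypothetical bookkeeping; the real service tracked this elsewhere).
    """
    names = {duid: next_env() for duid in env_duids}
    free = (ip for ip in ipaddress.ip_network(subnet).hosts()
            if ip not in used_ips)
    addresses = {}
    for duid in vm_duids:
        ip = next(free, None)
        if ip is None:
            raise RuntimeError(f"no unused addresses left in {subnet}")
        used_ips.add(ip)
        names[duid] = next_vm()
        addresses[names[duid]] = str(ip)
    return names, addresses


# Hypothetical usage: 10.100.0.1 is assumed to be taken by a gateway.
names, addresses = resolve_duids(
    ["vm_1", "vm_2"], ["env_1"], "10.100.0.0/26",
    {ipaddress.ip_address("10.100.0.1")},
    make_namer("tm21", "s", 101), make_namer("tm21", "e", 50),
)
```

Keeping the duid-to-name map around is what lets later sections of the document (port mappings, database groups, filesystem sources) be rewritten in terms of real entities.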
A bit about the migration service. It started as a Sinatra application backed by MongoDB using MongoMapper, which worked pretty well for prototyping, but there were a couple of major problems, mostly of our own making. Most glaring was that we made the mistake of coupling the migration document format too closely to the backend model structure. This made it difficult to separate validation and relationships for the migration document itself from the model and data that was added or changed as a migration went through the process. That, combined with the way MongoMapper handled validation and saving with embedded documents, made things especially difficult; saving a deeply embedded object would run validations on the entire document, causing an epic storm of thrashing.
Additionally, we ran into concurrency issues with MongoDB. With our documents too deeply nested, containing many embedded documents, we often had parts of the migration service working on different embedded documents within the same parent document. It was especially problematic with our filesystem archiving process, which we'll get into in the next step, since it was long-running and needed to update state information as it worked. We ended up with data being effectively rolled back unexpectedly. A long-running worker would get the current version of the document and hang on to it, while in the meantime another part of the system would fetch the same document and make a change. When the long-running worker finished and saved its result, it would save back an updated version of the original document it fetched, thus reverting the newer changes that had been made.
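What we hit here is commonly called a lost update. The sketch below reproduces it with a toy in-memory store in Python. It is not MongoDB or MongoMapper code, and every class and field name is invented for the example. The optional version check at the end is one common mitigation (optimistic locking), shown only for contrast; it is not something we implemented.

```python
import copy


class StaleDocument(Exception):
    """Raised when a save is based on an out-of-date copy."""


class DocumentStore:
    """Toy store that always saves whole documents."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc_id, body):
        self._docs[doc_id] = {"version": 1, "body": copy.deepcopy(body)}

    def fetch(self, doc_id):
        return copy.deepcopy(self._docs[doc_id])

    def save(self, doc_id, doc, check_version=False):
        current = self._docs[doc_id]
        if check_version and doc["version"] != current["version"]:
            raise StaleDocument(doc_id)
        self._docs[doc_id] = {"version": current["version"] + 1,
                              "body": copy.deepcopy(doc["body"])}


store = DocumentStore()
store.insert("m1", {"archive": "pending", "vlan": None})

worker_copy = store.fetch("m1")   # long-running archiver grabs the document
other_copy = store.fetch("m1")    # another part of the service does too
other_copy["body"]["vlan"] = "assigned"
store.save("m1", other_copy)      # the newer change lands first

worker_copy["body"]["archive"] = "done"
store.save("m1", worker_copy)     # whole-document save overwrites it
lost = store.fetch("m1")["body"]  # vlan is back to None
```

Passing `check_version=True` on the worker's save would raise `StaleDocument` instead of silently reverting the other change, at the cost of the worker having to re-fetch and retry.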
Given all this trouble and that we were still in prototype mode, we decided to make a change for the real version of the service. We went with the soon-to-be-released Rails 3 and PostgreSQL. The primary benefits we got from Rails were the baked-in migration support in ActiveRecord and an already defined application structure. Going with PostgreSQL addressed our concurrency issues, if only by not letting us do any more embedding. We could have worked more to solve the concurrency issues with MongoDB but perhaps we didn't really need a document store anyway.
Both the prototype and real versions of the service used the outstanding Resque library for running jobs asynchronously. The bulk of what those jobs did will be covered in the next step.
Step 3: Archiving
If you're not familiar with nbd, or Network Block Device, it's a couple little programs and a kernel module that let you export block devices over a TCP connection. Say you have a block device, /dev/foo, on a server and you want to make it available to another server. On the server with the block device, you run nbd-server against it. On the server that wants to access /dev/foo remotely, you load the nbd kernel module and use nbd-client to connect to nbd-server. nbd-client works with the kernel module to make a local block device, say /dev/nbd0, operate as /dev/foo. You can then use /dev/nbd0 as if it were /dev/foo. There is naturally added latency involved but it works as expected.
How did nbd fit in the archiving process? We'll walk through how filesystem migrations worked to see.
After processing a migration document, the migration service would enqueue jobs to archive each VM's local filesystems as well as each involved shared GFS filesystem. Local filesystems were archived differently than GFS filesystems; we'll cover the local filesystem case first.
The first thing we needed with a local filesystem was something to archive. With VMs running, we couldn't mount their in-use filesystems elsewhere, and running something like rsync running-vm:/ /archives/running-vm would give us something that was both inconsistent and tainted by other factors, such as the special /dev filesystem being mounted on top of the real /dev on the root filesystem. We could have shut down each VM in turn and archived its filesystems but that would have violated a few of our overall requirements listed previously.
Since local filesystems were backed by LVs at the Xen host level, we decided to take snapshots of each filesystem and use that as our archive source. There would be artifacts from the system still running, such as files in /tmp and pid files of running services, but the directories those are in are cleaned up at boot anyway. A problem with this approach in our environment, however, was that our clustered LVM service (part of the Red Hat Cluster Suite, like GFS) didn't support snapshotting LVs visible to the cluster. Since we knew LVs for VM local filesystems would only be used by one VM and that VM would only be running on one Xen host at a time, we found a workaround relying on the fact that LVs are really just block devices created with device mapper: we used raw device mapper manipulations to manually create copy-on-write snapshots, effectively doing what creating a snapshot with non-clustered LVM would do.
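To give a feel for what "raw device mapper manipulations" involve, here is a small Python sketch that builds the two device mapper tables behind a copy-on-write snapshot. The table formats follow the kernel's snapshot and snapshot-origin targets. The device names, the chunk size, and the helper itself are assumptions for illustration and not the commands we actually ran.

```python
def snapshot_tables(origin_real, cow_device, sectors,
                    chunk_sectors=16, persistent=True):
    """Build device mapper tables for a copy-on-write snapshot.

    origin_real: device holding the real data (hypothetical name)
    cow_device:  device that receives copied-out original chunks
    sectors:     size of the origin in 512-byte sectors
    """
    mode = "P" if persistent else "N"
    return {
        # Writes through this device copy original chunks to the COW first.
        "origin": f"0 {sectors} snapshot-origin {origin_real}",
        # Reads here see the origin as it was when the snapshot was made.
        "snapshot": (f"0 {sectors} snapshot {origin_real} "
                     f"{cow_device} {mode} {chunk_sectors}"),
    }


# Hypothetical device names for a 1GiB root LV.
tables = snapshot_tables("/dev/mapper/vg-root-real",
                         "/dev/mapper/vg-root-cow", 2097152)
```

Each table would then be handed to something like `dmsetup create <name> --table "<table>"`. The fiddly part, which this sketch leaves out, is swapping the in-use LV over to the snapshot-origin table while it is briefly suspended.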
Next, what to do with the snapshot. The easiest option would have been to mount it in the Xen host's management domain (dom0) and run tar/gzip against it, but that had at least several drawbacks that violated our overall requirements. The biggest was that it would place the processing cost of running gzip too close to VMs, causing their performance to suffer. Another possible problem was the actual mounting of the snapshot; if anything went wrong that caused kernel-level trouble within the dom0 the VMs on that host could be affected. This is where nbd came in.
Using nbd, we could do the mounting and tar/gzip of the snapshot on a completely different system as long as the network latency was low. Additionally, we could (and did) tunnel the nbd TCP connection over ssh for wire encryption of the data being archived. This method kept processing impact on the Xen hosts to a minimum while also avoiding any trouble that might come from mounting the snapshot. We set up two such archiving systems, one per datacenter. While it was possible (and fun!) to mount snapshots cross-country from one datacenter to the other, it wasn't very practical.
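As a rough sketch of the moving parts, the Python helper below assembles the commands involved in exporting a snapshot from a Xen host and attaching it on an archiving system through an ssh tunnel. It uses the older port-based nbd-server and nbd-client invocations; newer nbd releases use named exports instead. The host names, ports, and device paths are made up, and this is not our actual tooling.

```python
def nbd_export_commands(device, port, xen_host, local_port,
                        nbd_device="/dev/nbd0"):
    """Return the commands for a read-only nbd export tunneled over ssh.

    All arguments are illustrative. "on_host" runs on the Xen host;
    the rest run on the archiving system.
    """
    return {
        # Export the snapshot read-only on the Xen host.
        "on_host": ["nbd-server", str(port), device, "-r"],
        # Forward a local port to the nbd-server port for wire encryption.
        "tunnel": ["ssh", "-N", "-L",
                   f"{local_port}:127.0.0.1:{port}", xen_host],
        "load_module": ["modprobe", "nbd"],
        # Attach the remote device through the local end of the tunnel.
        "attach": ["nbd-client", "127.0.0.1", str(local_port), nbd_device],
        # Detach once the archive has been written.
        "detach": ["nbd-client", "-d", nbd_device],
    }


cmds = nbd_export_commands("/dev/mapper/vg-root-snap", 2000,
                           "xen-host-01", 12000)
```

Once attached, the local nbd device can be mounted read-only and archived like any other block device.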
nbd served us very well for local filesystems. For GFS filesystems, we had to go a different route. Since they were accessed by multiple VMs on multiple Xen hosts simultaneously, there was no single IO path over which to manage writes while a snapshot was in place. We could have moved all VMs using the GFS to one Xen host for the purpose of archiving but that would have had a number of potential issues, namely the loss of Xen host redundancy.
Instead, we went with just straight tar over ssh, picking one VM that had the GFS mounted and running tar there. While this wasn't guaranteed to produce a consistent result since files could come and go while tar was running, it was good enough. Inconsistencies probably wouldn't show up in testing and would be cleared up anyway in the cutover process. Like with local filesystem archiving, we pushed the processing impact off to our archiving systems by having data piped into gzip there.
Databases were done in a similar fashion to GFS: we ran mysqldump on the appropriate replica, piped the output over ssh, and sent that into gzip running on the archiving system in the relevant datacenter. This gave us consistent database data with which we could seed the new database VMs for testing. The database data could be inconsistent with, say, the file assets stored on the GFS, but as with the internal consistency trouble we had with GFS any inconsistencies were minute enough for testing and would be cleared up on final cutover.
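Both the GFS and database cases boil down to the same pattern: run a producer remotely, stream its output over ssh, and compress on the archiving system. A minimal Python sketch of that pattern follows. The function name and the example host names are hypothetical, and it uses Python's gzip module where we simply piped into the gzip command.

```python
import gzip
import shutil
import subprocess


def stream_to_gzip(producer_argv, dest_path, level=6):
    """Run a command and gzip its stdout locally; return its exit status.

    Compression happens in this process, so the machine running the
    producer (for example via ssh) only pays for generating the data.
    """
    with subprocess.Popen(producer_argv, stdout=subprocess.PIPE) as proc:
        with gzip.open(dest_path, "wb", compresslevel=level) as out:
            shutil.copyfileobj(proc.stdout, out)
    return proc.returncode


# Hypothetical producers, not run here:
gfs_tar = ["ssh", "some-vm", "tar", "-cf", "-", "-C", "/data", "."]
db_dump = ["ssh", "db-replica", "mysqldump", "foobar_production"]
```

Something like `stream_to_gzip(gfs_tar, "/archives/data.tar.gz")` would then leave the compressed archive on the archiving system, and a nonzero return value signals that the remote side failed.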
For future VMs that were to be database servers, we did come up with one minor optimization. We had the database data archived but needed a root filesystem for to-be-created VMs. Instead of starting them from scratch and installing MySQL/PostgreSQL, we archived the root filesystem of the relevant current replica. This worked for both existing shared and dedicated database servers as long as we did proper cleanup of anything not specific to the customer the new VMs would be created for. As a bonus, the migration service knew how to reuse archives so if ten customers were on the same shared database server the relevant replica's root would only be archived once.
Once archives were staged on the archiving systems, they were transferred to our migration server at Terremark. As we were dealing with large files and the Internet, we built retry into the transfer phase. We started with scp, but it quickly proved a poor fit for retrying since it started from the beginning of the file every time. Eventually we settled on rsync with the --inplace option, tunneled over ssh.
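The retry loop can be sketched as below in Python. The attempt count, delay, and function name are assumptions for illustration. The `run` and `sleep` parameters exist only so the logic can be exercised without a network.

```python
import subprocess
import time


def transfer_with_retry(src, dest, attempts=5, delay=30,
                        run=subprocess.call, sleep=time.sleep):
    """Copy src to dest with rsync over ssh, retrying on failure.

    Returns the number of the attempt that succeeded. With --inplace,
    data already written to the destination file stays there, so a
    retry can pick up from it rather than starting over.
    """
    argv = ["rsync", "--inplace", "-e", "ssh", src, dest]
    for attempt in range(1, attempts + 1):
        if run(argv) == 0:
            return attempt
        if attempt < attempts:
            sleep(delay)  # give a flaky link a moment before retrying
    raise RuntimeError(f"transfer of {src} failed after {attempts} attempts")
```

A call might look like `transfer_with_retry("/archives/root.tar.gz", "migration-server:/incoming/")`, where both paths are made up.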
Some archive statistics:
- Total archive count: 1474
- Local filesystem archives: 808
- GFS filesystem archives: 294
- Database archives: 372
- Largest local filesystem archive: 185GB raw used, compressed to 154GB
- Largest GFS filesystem archive: 318GB raw used, compressed to 287GB
- Largest database archive: 170GB on disk, dumped and compressed to 16GB
Where “raw used” was Used as reported by df.
- 95% of our local filesystems had raw used sizes under 20GB.
- 93% of our GFS filesystems had raw used sizes under 20GB.
- 95% of our databases had on-disk sizes under 20GB, 97% under 30GB.
The decision to transfer archives over the Internet instead of shipping physical media worked out well as the copy of the largest archive (of the 287GB compressed GFS) took 12 hours, averaging about 7 MB/s.
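The average rate is easy to check with a couple lines of Python, assuming decimal units (1GB = 1000MB); the helper name is just for illustration.

```python
def average_rate_mb_s(size_gb, hours):
    """Average transfer rate in MB/s for size_gb moved in the given hours."""
    return size_gb * 1000 / (hours * 3600)


# 287GB in 12 hours works out to roughly 6.6 MB/s, or "about 7".
rate = average_rate_mb_s(287, 12)
```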
In my next post, coming soon, we'll continue on to discuss VM creation and configuration, database bootstrapping, testing and verification, and final cutover to the new setup at Terremark.