10 Years of Virtual Machine Performance (Semi) Demystified

There are many opinions in the air about the impact that virtualization has on performance, so I thought a short blog would be good to explain (as best I can) virtual machine performance characteristics with pointers to relevant benchmarks and technical papers.

My background is that I was an early product manager on VMware ESX Server (from version 1.5) and, among other things, ran product management at VMware for a few years. As a product management guy I kept track of the output of the engineering performance group, and as a result had a reasonably high-level (although never code-level) understanding of the whys and wherefores of virtualization performance. Although I’m not as fresh on virtualization as I once was, I’ll do my best here. I also want to thank Steve Herrod at VMware and Simon Crosby at Citrix for providing a technical sanity check on the blog contents, although I retain responsibility for any mistakes and oversights.

First, a solid statement: virtualization has always levied a CPU “tax.” Early on it was very high; recently, not so much. Probably the most comprehensive recent non-vendor benchmark of virtualized vs. native performance is AnandTech’s, which showed anywhere from a 2% to a 7% CPU tax on a fully loaded system running mixed-workload 4-CPU virtual machines on recent hardware.

The virtualization tax has always varied a lot with the type of workload, the number of virtual machines, the number of virtual CPUs per machine, and your hypervisor type. The reason people have been willing to pay the tax is that virtualization is just a better way to manage systems: system utilization is higher because you can pack workloads together while still maintaining hardware-guaranteed security isolation; hardware upgrades are trivial because the guest OSes always run on a consistent virtual hardware layer; image management is trivial; and neat tricks like shared copy-on-write memory mean that you can actually use fewer resources in a virtualized environment. Best of all, you get a consistent container for managing your workload, no matter what you end up having to put in it.
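The copy-on-write memory sharing mentioned above can be sketched in miniature. This is a toy illustration of content-based page deduplication, not VMware’s actual implementation; the page contents and hashing scheme here are invented:

```python
import hashlib

PAGE = 4096  # bytes per page

def share_pages(pages):
    """Back a list of guest pages with a deduplicated store of machine pages.
    Identical pages get one shared copy; a later write would trigger a
    private copy (copy-on-write), which is not modeled here."""
    store = []    # distinct machine pages actually allocated
    index = {}    # content digest -> slot in store
    mapping = []  # per guest page: which machine page backs it
    for page in pages:
        digest = hashlib.sha256(page).digest()
        if digest not in index:   # first copy of this content: keep it
            index[digest] = len(store)
            store.append(page)
        mapping.append(index[digest])
    return mapping, store

# Three guest pages, two with identical contents (e.g. zeroed pages):
pages = [b"\x00" * PAGE, b"shared code".ljust(PAGE, b"\x00"), b"\x00" * PAGE]
mapping, store = share_pages(pages)
# Only two machine pages back three guest pages: mapping == [0, 1, 0]
```

Across many VMs running the same guest OS, a scheme like this lets identical kernel and library pages be stored once, which is why a virtualized host can sometimes back more guest memory than it physically has.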

However, many people still look at the Googles and Yahoos of the world, who designed their architectures when the virtualization tax was high, and say “Google doesn’t believe in virtualization, so maybe I shouldn’t.” So let’s dive into the issue of the virtualization “tax.”

There are really two aspects of performance to consider when you look at virtualization. The first: for a given workload, how much work gets done in a virtualized environment vs. a native one? The second: for a given level of work done, how much flexibility do you lose?

Take the example of a highly i/o-bound workload. Say a native environment performs 10M disk i/os per time period and a virtual environment performs 9.8M. Should you consider that a 2% overhead? Yes. But what if I told you that under the native environment CPU utilization was 20%, while under the virtual environment it was 30%? Should you consider that a 50% overhead? Which is the right number, 2% or 50%?

The rule here is that you always look at your limiting factor. If being virtualized burns more of a non-limiting factor, you don’t really care; it wasn’t being used anyway. So 2% is the right number. But there’s a caveat: what if your workload changes so that another thread becomes CPU-intensive? Should you care then that an extra 10 points of CPU utilization are being burned by virtualization?
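The arithmetic above can be made concrete. A minimal sketch, using the hypothetical numbers from the example:

```python
# Hypothetical numbers from the example above.
native_iops = 10_000_000   # disk i/os per period, native
virt_iops   = 9_800_000    # disk i/os per period, virtualized
native_cpu  = 0.20         # CPU utilization, native
virt_cpu    = 0.30         # CPU utilization, virtualized

# Measured on the limiting factor (disk i/o), the tax is 2%.
throughput_tax = 1 - virt_iops / native_iops   # 0.02

# Measured on the non-limiting factor (CPU), it looks like 50%.
cpu_tax = virt_cpu / native_cpu - 1            # 0.50

# The rule: the throughput number is the one that matters, as long as
# the extra CPU (30% vs. 20%) still leaves headroom for the rest of
# the load. The caveat kicks in when that headroom is needed.
```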

The history of modern virtualization is a history of engineers eating an elephant one bite at a time: taking each performance bottleneck in turn and tackling it, knowing the things they could change and the things they couldn’t, and having the wisdom to know the difference. Over time, as virtualization-friendly features have spread to every part of the IT stack, the most stubborn barriers to virtualization performance (oddities in the Intel architecture, OS limitations, uncooperative NICs) have been addressed one by one, until finally this year (yes, just this year) the last serious performance barriers to virtualization were addressed.

But first, the history. Before the dawn of modern virtualization, there were lots of emulators out there that emulated one operating system on top of another. But because every OS call had to be emulated in software, they were slow. More importantly, if they ran on top of Windows, they depended on Microsoft Windows not changing its behavior from patch to patch, which of course was a terrible bet. Virtualization was different: most instructions didn’t have to be emulated at all. If an instruction wasn’t accessing memory or an i/o device (and wasn’t one of a handful of badly behaved instructions), it could simply be passed straight to the CPU, drastically increasing performance and, just as critically, removing the dependence on the host operating system’s API.

In the Beginning Was Disco

The seminal project inaugurating this generation of x86 virtual machines was the Disco project at Stanford, which published its key paper in 1997. That project (three of the four authors were future founders of VMware) built a virtual machine monitor for the IRIX operating system running on the FLASH research multiprocessor.

The performance characteristics were reasonable for the systems of the day: 3% to 36% overhead for a single VM on memory- and CPU-intensive tasks. But the really interesting thing about the paper was that total system output with eight VMs on an 8-CPU system almost doubled vs. native, because on native hardware IRIX was not very effective at scheduling work across eight cores.

The Stone Age: VMware Workstation

VMware was founded in 1998, and in 1999 it released VMware Workstation 1.0. A desktop product, it ran on Windows and Linux and allowed people to run other operating systems on top of either. By this stage, VMware engineering had tuned the core virtual machine monitor so that memory- and CPU-intensive workloads were pretty fast compared to native, with some exceptions. Networking-intensive workloads, on the other hand, had fairly terrible performance.

The reasons for the overheads were outlined by some of the VMware engineering team in a 2001 Usenix paper. The paper gamely showed that with several optimizations it was possible to get full native throughput for networking workloads (10/100BaseT), although the CPU work spent to process that workload was about 4x what was required in the native environment. The paper also pointed out several possible further optimizations.

The Bronze Age: Hypervisors

One of the optimizations suggested was a custom kernel that would cut the amount of interrupt handling (a major cause of CPU overhead) in half by bypassing the host operating system. The ESX Server project was already in full swing by that stage, and when the product came out it had two big innovations: the vmkernel, a kernel built from scratch to run guest OSes, and VMFS, a highly simplified extent-based file system for fast disk access.

The benchmarks for ESX Server were a huge improvement over the host-based Workstation: ESX Server 1.0 could process a 10/100 networking workload with about a 10-20% CPU burn. One thing that has always worked in virtualization’s favor is that Moore’s Law delivers more CPU cycles every year, so the fixed overhead of processing a particular workload decreases proportionally over time. However, as customers shifted to GigE networking during 2003, benchmarks vs. native took a nose-dive. On the server hardware of the time, GigE workloads became CPU-limited at about 300 Mb/s for an average packet size: basically, you could saturate your CPU just processing network traffic. (To be fair, on the hardware of the time the CPU burn was also very high natively.)
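The Moore’s-Law effect on a fixed overhead is simple arithmetic. All the numbers in this sketch are made up for illustration; they are not measurements from any benchmark:

```python
# A fixed per-packet cycle cost shrinks as a share of the CPU with each
# hardware generation. The numbers below are invented for illustration.
cycles_per_packet = 10_000   # hypervisor + network stack cost per packet
packets_per_sec   = 10_000   # a busy fast-Ethernet link, average packet size

def cpu_fraction(cpu_hz):
    """Fraction of one CPU burned just processing the packet stream."""
    return cycles_per_packet * packets_per_sec / cpu_hz

burn_old = cpu_fraction(1.0e9)  # ~1 GHz CPU: 10% of the machine
burn_new = cpu_fraction(3.0e9)  # ~3 GHz CPU: same cycles, a third the share
```

The flip side, as the GigE story shows, is that when the workload itself grows 10x (10/100 to GigE), the fixed per-packet cost comes roaring back until hardware catches up again.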

The Iron Age: Paravirtualization and Virtual SMP

The next jump in performance was the introduction of paravirtualization by the Xen open source team and of multi-CPU virtual machines by VMware. The Xen team patched Linux to get rid of some of the instructions that were most problematic to virtualize. The first Xen software also included a high-performance networking system (although I believe that system was later abandoned due to other issues; hopefully someone with better Xen knowledge can chip in with details).

Meanwhile, VMware was introducing the first multi-CPU guest virtual machines. This was a long performance-optimization task: in the early stages of development in 2002, Virtual SMP achieved only about 5% of native performance, but over 18 months of steady optimization it got to about 75% of native, and it shipped at around that level. Around the same time (about early 2004), Intel shipped the first generation of VT technology, slightly ahead of AMD’s equivalent. Ironically, this initially decreased VMware performance on some workloads, and VT did not enjoy a lot of adoption. A great backgrounder on the impact of VT technology is Ole Agesen’s primer from VMworld 2007.

The Silicon Age: Virtual I/O

Since 2005, VMware and Xen have gradually reduced the performance overheads of virtualization, aided by the Moore’s Law doubling in transistor count, which inexorably shrinks overheads over time. AMD’s Rapid Virtualization Indexing (RVI, 2007) and Intel’s Extended Page Tables (EPT, 2009) substantially improved performance for a class of recalcitrant workloads by moving the mapping between guest OS “physical” memory pages and machine-level pages from software to silicon. For operations that stress the MMU, like an Apache compile with lots of short-lived processes and intensive memory access, performance doubled with RVI/EPT. (Xen showed similar challenges prior to RVI/EPT on compilation benchmarks.)
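The MMU work that RVI/EPT moved into hardware can be illustrated with a toy two-level translation. The page numbers and mappings below are invented, and real page tables are multi-level radix trees rather than flat dictionaries:

```python
PAGE = 4096  # bytes per page

# The guest OS's page table: guest-virtual page -> guest-"physical" page.
guest_pt = {0x10: 0x2, 0x11: 0x7}

# The hypervisor's map: guest-"physical" page -> real machine page.
phys_to_machine = {0x2: 0x40, 0x7: 0x41}

def translate(gva):
    """Compose both levels: guest-virtual address -> machine address.
    With RVI/EPT, the MMU hardware performs this two-level walk itself."""
    gppn = guest_pt[gva // PAGE]    # level 1: the guest's own page table
    mpn = phys_to_machine[gppn]     # level 2: the hypervisor's mapping
    return mpn * PAGE + gva % PAGE  # offset within the page is unchanged

# Before RVI/EPT, hypervisors pre-composed the two levels into a software
# "shadow" page table so a single hardware walk sufficed; the cost was
# keeping it in sync every time the guest edited its own page tables,
# which is exactly what hurt fork/exec-heavy workloads like compiles.
shadow_pt = {gvpn: phys_to_machine[gppn] for gvpn, gppn in guest_pt.items()}
```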

Some of the other performance advances have included interrupt coalescing, IPv6 TCP segmentation offload, and NAPI support in the new VMware vmxnet3 driver. But the last year has also seen two big advances: direct device mapping, enabled by this generation of CPUs (e.g. Intel VT-d, first described back in 2006), and the first generation of i/o adapters that are truly virtualization-aware.

Before Intel VT-d, 10GigE workloads became CPU-limited at around 3.5 Gb/s of throughput. Afterwards (and with appropriate support in the hypervisor), throughputs above 9.6 Gb/s have been achieved. More important, however, is the next generation of i/o adapters, which actually spin up mini virtual NICs in hardware and connect them directly into virtual machines, eliminating the need to copy networking packets around. This is one of the gems in Cisco’s UCS hardware, which tightly couples a new NIC design with matching switch hardware. We’re now at the stage where, if you’re using this year’s VMware or Xen technology, Intel Nehalems and AMD Shanghai Opterons, and the new i/o adapters, virtualization has most performance issues pretty much beat.

Common Attribution Problems

So why, then, do people attribute chronic performance problems to virtualization? Well, sometimes they’re comparing apples and oranges: new hardware to old. Sometimes they’re not comparing limiting factors: a sysadmin will pack virtual machines onto a host until CPU utilization hits 75%, without realizing that he ran out of i/o capacity well before that. And sometimes it’s true: running hundreds of multi-CPU VMs on a single machine still probably wastes a lot of CPU cycles, but in that case the alternative of putting all those guest operating systems on separate servers is probably a very expensive idea. And I have to imagine (without evidence, just looking at trends) that performance overheads for 8+ vCPU virtual machines are still not all that great. But in most cases, the tax seems to be worth it.
