The Engine Yard PaaS products are highly complex, and span thousands of servers to serve our customers. Ensuring that our applications run quickly and reliably requires a good understanding of concurrency and parallelization. This blog post is a recap of my journey down the rabbit hole of concurrency in an attempt to solve this problem in my own development efforts. Part II will be the continuation of this trek, only with more of a focus on parallelization.
We can try throwing hardware at the problem. However, if CPU is the bottleneck, then increasing the speed per core will only go so far before it no longer helps. A similar statement holds true for adding more servers: if the processing happens in serial, one task after another, each added server only buys us a fixed amount of additional throughput.
Become a “Serial” Killer!
The name of the game these days is not so much core speed as core density. CPUs aren't getting much faster - they are getting more cores. However, unless you can break your application into “bite-sized” chunks that can all run at the same time (that is, “parallelize” it), you cannot take advantage of this multi-core goodness.
Concurrency vs Parallelism
When tackling such a complex problem, it is important to know the difference between concurrency and parallelism. Concurrency is the ability of a program to make progress on a task that is spread out over different slices of time. It allows us to run multiple “threads” at the same time: as one thread sleeps or waits on I/O, another thread may take priority and start working, making optimal use of available CPU time. When you think of concurrency, think of “threads”.

Parallelism is the ability to accomplish multiple tasks simultaneously. Concurrency may give the illusion of parallelism, but it is not parallelism in and of itself, as your threads are still deferring to one another, competing for the same CPU resources. For example, MRI has something called a Global Interpreter Lock (GIL). Even though your code uses multiple threads, those threads run one at a time due to the GIL. Therefore, your application is not truly multithreaded! Instead, what MRI does is switch quickly between running threads, giving each a small window of processing time. Your I/O may run in parallel, but the Ruby code still runs one thread at a time.

Fortunately, there are a few options for you! The option I advocate is Rubinius. Rubinius is a dynamic language platform implemented from scratch for the Ruby programming language. It includes a bytecode interpreter, a modern garbage collector, a just-in-time compiler to machine code, and native threads with no global interpreter lock, so multiple Ruby threads will run in parallel if the hardware supports it. Rubinius was conceived by Evan Phoenix and is maintained by Dirkjan Bussink and Brian Ford of Engine Yard. Most importantly for our discussion here, Rubinius has no GIL and thus supports true multithreading: each Ruby thread is mapped to a native operating system lightweight process (LWP). Rubinius is the Ruby interpreter we use for the project mentioned above.
Even if you correctly implement threading, you won't get the full benefit of your available hardware until you look at other options, such as adding multi-process and multi-machine capabilities to your application. These techniques will be discussed in Part II. Please read Evan Phoenix's post for a more in-depth discussion on Ruby concurrency and parallelism.
When you create a Thread, the code contained within the thread block runs concurrently with the main thread that instantiated it.
For our example application, let's start with a simple currency downloader:
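A minimal sketch of such a downloader follows. The currency list and the `fetch_rate` helper are illustrative stand-ins (not from the original post): a real version would make an HTTP call to an exchange-rate API, which `fetch_rate` simulates here with a short sleep.

```ruby
# Hypothetical currency list and rate fetcher; the sleep stands in for
# the network round trip to a real exchange-rate API.
CURRENCIES = %w[USD EUR JPY GBP AUD CAD CHF CNY SEK NZD MXN SGD HKD]

def fetch_rate(currency)
  sleep 0.05 # simulate I/O wait for the API response
  "#{currency}: 1.00" # placeholder rate
end

# Serial version: one download at a time.
CURRENCIES.each do |currency|
  puts fetch_rate(currency)
end
```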
It works fine as is. However, since we process each currency one at a time, it takes a while. Let's make this faster by running the downloads concurrently. A more concurrent example might look like this:
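Here is one way the threaded version might look, reusing the same hypothetical `CURRENCIES` list and `fetch_rate` helper as the serial sketch:

```ruby
CURRENCIES = %w[USD EUR JPY GBP AUD CAD CHF CNY SEK NZD MXN SGD HKD]

def fetch_rate(currency)
  sleep 0.05 # simulate I/O wait for the API response
  "#{currency}: 1.00"
end

threads = []

CURRENCIES.each do |currency|
  # Spawn one thread per currency; the downloads now overlap instead of
  # running one after another.
  threads << Thread.new do
    puts fetch_rate(currency)
  end
end

# Wait for every child thread to finish ("join") before moving on.
threads.each(&:join)
```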
Notice that we are using an array to keep track of the threads we spawn. We usually want to wait for all threads to finish (or “join”) before moving on, rather than letting them run amok. If you fail to make the parent thread join (wait) on its children, the children can be killed when the main thread exits, before they finish running - without the join loop, some or all of the currencies never get printed. With the join loop included, every currency appears in the output.
Threads Gone Wild!
OK, great - we now know how to use threads. We still have a problem: in our example, let's say we expand our application to use over 500 currencies (if 500 currencies do indeed exist...). Does this mean we can take the entire application, put these calls into 500 simultaneous threads, and let it run? While we absolutely can do so, it is not a good idea unless you have a 100-core CPU. Here's why:
In any multithreaded environment, context switching occurs. Context switching is the process by which a thread is stopped and its state and context are stored allowing other threads to use CPU cycles. Once the competing thread has been interrupted in a similar manner, the context and state of the original thread are loaded and the original thread has the opportunity to run if it gets priority.
One other thing to note: the act of context switching itself takes CPU cycles. If you simply spawn as many threads as you can, your system can spend more time context switching than processing your data, which slows down your application.
So how do we properly schedule our threads? The common solution that I use is the producer-consumer model. In this model, we have 2 main threads: The Producer thread provides units of work to be done, and the Consumer thread takes those units of work and spins up threads to do the work.
There are a few avenues of communication that need to be laid out for this to work:
- The Producer needs a way to give units of work to the Consumer.
- The Consumer needs to know which threads are working and which ones are done so others can be scheduled.
- The Producer needs a way to tell the Consumer that it has given out all available work.
Let's apply this model to our example:
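A sketch of one common shape of this model in Ruby follows, using the thread-safe `Queue` from the standard library. The details are assumptions for illustration: the consumer side is modeled as a fixed pool of worker threads sized by `thread_count`, and the producer signals “all work handed out” by pushing one `nil` sentinel per worker.

```ruby
CURRENCIES = %w[USD EUR JPY GBP AUD CAD CHF CNY SEK NZD MXN SGD HKD]

def fetch_rate(currency)
  sleep 0.05 # stand-in for the real API call
  "#{currency}: 1.00"
end

queue = Queue.new # thread-safe: hands units of work from producer to consumers
thread_count = 4  # tune this while monitoring your running system

# Producer: hands out all units of work, then one nil per consumer
# as the "no more work is coming" signal.
producer = Thread.new do
  CURRENCIES.each { |currency| queue << currency }
  thread_count.times { queue << nil }
end

# Consumers: a fixed pool of workers that pull work until they see
# the nil sentinel, so only thread_count threads ever run at once.
consumers = thread_count.times.map do
  Thread.new do
    while (currency = queue.pop)
      puts fetch_rate(currency)
    end
  end
end

producer.join
consumers.each(&:join)
```

This covers the three avenues of communication above: the queue carries work from producer to consumers, the fixed pool bounds how many threads work at once, and the sentinels announce that all work has been given out.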
In this case, if we were to add hundreds of currencies to the list, it would not overwhelm our system. You would want to adjust the “thread_count” variable as you monitor your running system until you get the maximum performance available to your application.
As threads execute, you are no longer assured of the order in which your code runs. Therefore, anything that alters a shared structure (such as a variable) could potentially put your application into an inconsistent state.
Instead of simply writing our results to the console, let us instead change our example to write the results to a shared array for printing later:
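A sketch of that change, with the same hypothetical helpers as before - note that every thread now pushes into one shared array with no coordination (on MRI the GIL happens to mask this, but on a GIL-less runtime like Rubinius it is a race):

```ruby
CURRENCIES = %w[USD EUR JPY GBP AUD CAD CHF CNY SEK NZD MXN SGD HKD]

def fetch_rate(currency)
  sleep 0.05 # stand-in for the real API call
  "#{currency}: 1.00"
end

results = []

threads = CURRENCIES.map do |currency|
  Thread.new do
    # Every thread appends to the same array with no coordination -
    # on a runtime without a GIL, appends can occasionally be lost.
    results << fetch_rate(currency)
  end
end

threads.each(&:join)
puts results.size
```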
So we run our application, and make a discovery: only 12 currencies are returned, even though we are asking for 13. If we run it again... we get 13. On other tries, sometimes 12 again. Why do we randomly get different results?
The trouble here lies in the fact that all of our child threads are writing to one shared array. When two or more threads append a result to the array at the exact same time, one may overwrite another as both threads insert their item into the same indexed spot in the array. This means that at some point during the execution of the code above, one of our returned currencies is being overwritten.
To address this issue, we can add a Mutex, which ensures that whenever a thread runs code inside the mutex's “synchronize” block, it is the only thread writing to the array:
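A sketch of the mutex-protected version, again with the illustrative helpers from earlier. Note that the slow fetch happens outside the lock, so only the cheap append is serialized:

```ruby
CURRENCIES = %w[USD EUR JPY GBP AUD CAD CHF CNY SEK NZD MXN SGD HKD]

def fetch_rate(currency)
  sleep 0.05 # stand-in for the real API call
  "#{currency}: 1.00"
end

results = []
mutex = Mutex.new

threads = CURRENCIES.map do |currency|
  Thread.new do
    rate = fetch_rate(currency)           # do the slow work outside the lock
    mutex.synchronize { results << rate } # serialize only the append
  end
end

threads.each(&:join)
puts results.size # 13 on every run
```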
Using the mutex, we now consistently get all 13 currencies back as expected.
As we write applications that do many things in parallel in a short amount of time, we have to be careful how we use memory. There are a couple of pitfalls you can run into:
Bloat typically occurs when we keep stuffing objects into memory without ever giving the memory back. In a language like Ruby, this means you are still holding references to the objects, perhaps in a hash or an array. Your process takes up more and more memory - “bloating” - forcing the operating system to move part of it into swap (the disk partition for memory overflow - slow!). A process that continues to bloat until all available RAM and swap have been used up causes a segmentation fault or other undesired side effects, such as the operating system killing the process.
One way to reduce bloat is to ensure that you allocate objects within the thread in which they are intended to be used. In a multithreaded application, all threads share the process's memory, so objects held by long-lived references in the parent thread can never be collected.
In this example, we've wrapped our application into a class function, and are now running it in a loop to get arrays of currencies every 5 minutes (300 seconds):
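A sketch of what that bloating loop might look like - `CurrencyDownloader` and its helpers are illustrative names, and the loop is bounded to three passes here (with the 300-second sleep omitted) so the sketch terminates:

```ruby
CURRENCIES = %w[USD EUR JPY GBP AUD CAD CHF CNY SEK NZD MXN SGD HKD]

def fetch_rate(currency)
  sleep 0.01 # stand-in for the real API call
  "#{currency}: 1.00"
end

# Our application wrapped into a class function.
class CurrencyDownloader
  def self.run
    CURRENCIES.map { |c| fetch_rate(c) }
  end
end

all_results = []

# In the real app this is `loop do ... sleep 300 ... end`, running forever;
# bounded here so the example finishes.
3.times do
  # Every pass appends another array to a structure that is never cleared,
  # so no pass's results can ever be garbage collected: classic bloat.
  all_results << CurrencyDownloader.run
end
```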
This is a fairly contrived example that you wouldn't typically write. The point is to show how as subsequent threads run, the array keeps growing until we run out of memory.
Now, let's rewrite the loop to be more conscientious of memory:
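One way to sketch the memory-conscious rewrite, using the same hypothetical `CurrencyDownloader` (again bounded to three passes, with the 300-second sleep omitted, so it terminates): the results are allocated and consumed entirely inside the worker thread, and nothing long-lived holds on to them.

```ruby
CURRENCIES = %w[USD EUR JPY GBP AUD CAD CHF CNY SEK NZD MXN SGD HKD]

def fetch_rate(currency)
  sleep 0.01 # stand-in for the real API call
  "#{currency}: 1.00"
end

class CurrencyDownloader
  def self.run
    CURRENCIES.map { |c| fetch_rate(c) }
  end
end

# The real loop would run forever with `sleep 300` between passes.
3.times do
  worker = Thread.new do
    # The results array lives only inside this thread. Once the thread
    # terminates and nothing references it, it is eligible for GC.
    results = CurrencyDownloader.run
    results.each { |r| puts r }
  end
  worker.join
end
```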
As each thread terminates, the structures created within it become eligible for garbage collection, freeing up memory for future threads and fighting memory bloat.
Memory leaks are a different animal, and are often harder to diagnose. In this case, your application has allocated memory, which for some reason has become inaccessible and cannot be freed back up by the Garbage Collector (GC).
Tools for Memory Analysis
There are many tools and gems out there that will help you “plug the leaks” and figure out where you need to optimize your code.
One “quick and dirty” way I found to figure out where you're bloating is to use the following code:
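One way such a census might look, using Ruby's built-in `ObjectSpace` - the `object_census` helper name is illustrative, not from the original post:

```ruby
# A quick-and-dirty census of live objects, grouped by class.
def object_census
  GC.start # collect garbage first so the counts reflect live objects
  counts = Hash.new(0)
  ObjectSpace.each_object(Object) { |obj| counts[obj.class] += 1 }
  counts
end

# In a long-running app you would take the census in a background thread:
#
#   Thread.new do
#     loop do
#       puts object_census.sort_by { |_klass, n| -n }.first(10).inspect
#       sleep 300 # every 5 minutes
#     end
#   end
#
# Here we take a single snapshot of the ten most numerous classes.
puts object_census.sort_by { |_klass, n| -n }.first(10).inspect
```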
This code runs on a set interval (in this case, every 5 minutes) and catalogues how many objects of each class are currently allocated in your application. When you compare the results from one run to the next, pay attention to the classes whose object counts consistently increase; those are the objects to look at for optimization. This code is not something you want to run in production, as it adds overhead to your application. However, it's probably something you want to bake into a “debug mode” or activate with a command-line switch.
An article that I find especially helpful in understanding this was written by the Rubinius team. In it, they go over the tools included in Rubinius, as well as how to use OS X's “leaks” command and gdb to do additional diagnosis.
Manage Your Dependencies
When writing a concurrent application, you must ensure that your gems and other dependencies have been written to be “thread-safe”. Here are a couple of examples:
Be aware of how up to date both your OS libraries and gems are. Some newer versions have been better optimized for multithreaded applications, and even Rubinius itself is being diligently improved in this area. For example, I recently found a memory leak (using the leaks command) in which the SSL certificate for an API I was calling was being allocated over and over again. Recompiling Rubinius against OpenSSL v1.0.1c resolved this leak.
A layer of abstraction may reduce the amount of code needed to write your application, but may add overhead. For example: In one of our applications, we were tracking down a thread safety and performance issue. We narrowed the problem down to a gem we created that inserts high volumes of data into a Postgres database. Other gems that performed similar tasks did not have the problem in question - the difference was that this particular gem was using an ORM instead of the pg gem to connect to Postgres. Removing the ORM gave our application a significant performance boost, and drastically reduced the amount of memory used.
So the main takeaways from this post are:
- Look for ways to run multiple things at once instead of just throwing hardware at the problem.
- Make sure you know the difference between concurrency and parallelism.
- Pick the right interpreter if you really want a concurrent / parallelized application (use Rubinius!)
- When using threads...
- Make sure you wait for your threads to finish executing before you exit your application.
- Use a control mechanism, such as the Producer/Consumer model, to make sure your threads are mostly working instead of just context-switching.
- Protect structures in memory that are commonly edited by your threads - use a mutex or monitor.
- Be conscious of how your application uses memory
- Be sure to select gems that are thread-safe.
This concludes Part I. In Part II, we will discuss processes and queueing mechanisms - options to use when threads are not enough.