Background job processing is all the rage lately, with numerous folks speaking and blogging about it, and rightly so. Since response time is a critical factor when scaling a web application, it makes sense to keep response times low even when the app has heavy tasks to perform. Moving the heavy lifting out of the request/response cycle is key to scaling a web application with high performance.
There's been a fair bit of good coverage of available background job frameworks recently, but I'm not going to do a technology review here. Instead, I'll walk through some of the deployment best practices we've come to agree on here at Engine Yard.
1. Know Your Limits
Most background jobs do heavy number crunching, content fetching, video transcoding, etc. It makes sense, then, that the user has to wait. That said, we want to keep that wait time to a reasonable limit. If you need to transcode an average of 15 videos every hour with a single worker, the maximum average execution time for each job is four minutes (60 minutes divided by 15 jobs); anything slower and the queue grows without bound. Knowing how many jobs you have to process at peak will help you provide enough resources to make sure that your jobs complete in time.
Be sure you're using technology you understand. Just because something is getting lots of attention on Hacker News doesn't mean it's right for you. If you don't understand the tool you're using, you will likely have difficulty installing and operating it correctly.
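The arithmetic above is easy to keep around as a small helper. This is only a back-of-the-envelope sketch: the method names are made up for illustration, and it assumes jobs arrive evenly and every worker runs one job at a time.

```ruby
# Rough time budget per job: seconds available per hour across all workers,
# divided by the number of jobs arriving per hour.
def max_average_job_seconds(jobs_per_hour, workers = 1)
  raise ArgumentError, "jobs_per_hour must be positive" unless jobs_per_hour > 0
  raise ArgumentError, "workers must be positive" unless workers > 0

  (3600.0 * workers) / jobs_per_hour
end

# Minimum number of workers needed to keep up, given a measured
# average job duration in seconds.
def workers_needed(jobs_per_hour, average_job_seconds)
  (jobs_per_hour * average_job_seconds / 3600.0).ceil
end

puts max_average_job_seconds(15)  # 240.0 seconds, i.e. four minutes
puts workers_needed(15, 600)      # 3 workers if each job takes ten minutes
```

Run the numbers against your peak hour rather than your daily average, since the peak is when the queue backs up.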
2. You Did Benchmark Your Jobs... Right?
Far too often in the rush to scale, folks leave out load testing. Because most background jobs are heavy tasks, understanding the resource utilization of your jobs is extremely important. First you want to know how much memory and CPU your most frequently utilized job consumes. You can figure this out by submitting one job and watching its resource utilization in `top`. Alternatively, you can run benchmarks on the code being run in the background (please use "bmbm" benchmarks!).
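For reference, "bmbm" is `Benchmark.bmbm` from Ruby's standard library, which runs a rehearsal pass before the measured pass. Here is a minimal sketch; `process_records` and the sample data are placeholders invented for the example, so swap in your real job code and a snapshot of production data.

```ruby
require "benchmark"

# Hypothetical stand-in for the work a background job performs.
def process_records(records)
  records.map { |n| n * n }.reduce(0, :+)
end

# Placeholder data; use a production snapshot in practice.
records = (1..50_000).to_a

# bmbm runs every report twice: a rehearsal first, then the measured run,
# which reduces the effect of allocation and GC warm-up on the results.
results = Benchmark.bmbm do |bm|
  bm.report("map + reduce") { process_records(records) }
  bm.report("each") do
    total = 0
    records.each { |n| total += n * n }
    total
  end
end
```

`Benchmark.bmbm` prints both passes and returns an array of `Benchmark::Tms` objects, so you can also log or compare the timings programmatically.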
When benchmarking your jobs, you must use production data and libraries that match the production environment, be it 32-bit or 64-bit. If you have slow benchmarks and high memory usage, review your code for areas that may create large objects (use `Array.each`, for example).
3. Track Job Failures and Queue Length
Knowing that your jobs are performant and that you've deployed adequate resources is a start, but you're not quite there yet. At some point your jobs will fail due to resource starvation, bugs in your code or other unexpected monkey wrenches. The last thing you want is an inbox full of emails from your hard-earned customers informing you that 'the site is broken.'
Use an exception notifier (such as Hoptoad) to instrument your jobs. If your background job implementation defaults to automatically deleting failed jobs, disable this feature. Make sure you can easily pause the processing of specific jobs when you receive numerous failure alerts. Implement a "failure" job to test that your notification system works. Generate graphs that make it easy to visualize your job queue.
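One way to wire up notifications and a "failure" job is sketched below. Everything here is hypothetical: `RecordingNotifier` stands in for whatever exception service you use, and the only assumption about jobs is that they respond to `perform`. It is not tied to any particular queue library's API.

```ruby
# Hypothetical notifier: stands in for your real exception service.
class RecordingNotifier
  attr_reader :notifications

  def initialize
    @notifications = []
  end

  def notify(error, context = {})
    @notifications << { error: error, context: context }
  end
end

# Wraps any object that responds to #perform. It reports the failure and then
# re-raises, so the queue still records the job as failed instead of losing it.
class NotifyingJob
  def initialize(job, notifier)
    @job = job
    @notifier = notifier
  end

  def perform
    @job.perform
  rescue StandardError => e
    @notifier.notify(e, job: @job.class.name)
    raise
  end
end

# A job that always fails, used only to check that alerts actually arrive.
class FailureJob
  def perform
    raise "deliberate failure to test notifications"
  end
end

notifier = RecordingNotifier.new
begin
  NotifyingJob.new(FailureJob.new, notifier).perform
rescue RuntimeError
  # Expected: the wrapper re-raises after notifying.
end
puts notifier.notifications.length
```

Enqueue the failure job after every deploy of your notification setup; if no alert shows up, you know the plumbing is broken before a real job tells you.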
4. Know the Hidden Pitfalls
- Make sure your [Monit or God config](http://gist.github.com/26196) starts one or two fewer workers than you expect your instance to be able to handle. This leaves headroom, so if a job or two hangs (or gets backed up) your system will stay out of swap and the jobs will complete.
- Delayed Job uses UTC time; if your workers use local time and your database uses UTC (or vice versa), this may cause issues with jobs failing to be executed.
- Loading the full Rails environment increases resource costs and job processing time; only load what you need when you need it. If you have a job that is particularly memory or resource intensive, split it into multiple jobs so the work can be divided over additional servers.
- Make sure you set timeouts for your jobs; you don't want to pull an RSS feed and wait forever just for the HTTP timeout.
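For the timeout advice in the last bullet, a minimal sketch using Ruby's standard `Timeout` module follows. The `with_job_timeout` helper and `JobTimeoutError` class are names invented for this example. Note that `Timeout.timeout` interrupts the block from another thread, so for HTTP calls specifically it is usually better to also set `open_timeout` and `read_timeout` on `Net::HTTP`.

```ruby
require "timeout"

# Raised when a job exceeds its time budget (hypothetical name).
class JobTimeoutError < StandardError; end

# Runs the block and gives up after `seconds`, raising JobTimeoutError.
def with_job_timeout(seconds)
  Timeout.timeout(seconds, JobTimeoutError) { yield }
end

# A fast job finishes normally and returns its result.
result = with_job_timeout(1) { :fetched }

# A slow job (simulated with sleep) is cut off instead of hanging the worker.
timed_out = begin
  with_job_timeout(0.05) { sleep 1 }
  false
rescue JobTimeoutError
  true
end

puts result
puts timed_out
```

Pick the budget from your benchmarks rather than guessing: a limit a little above the slowest healthy run catches hung jobs without killing legitimate ones.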
5. Don't Use BackgrounDRb
Friends don't let friends use BackgrounDRb. BackgrounDRb served its purpose with excellence when it was created (at a time when there were no alternatives). Now there are many alternatives that work reliably and don't leak memory (like our recommended backgroundjob and delayed_job).
So what are the takeaways?
Measure the scale of your work both in time and resources. Benchmark your jobs with live data in a cloned production environment. Don't wait for your jobs to fail; expect them to. Make sure you're notified before your users see trouble. And last, don't forget to periodically review your job performance. You are planning on doing some heavy lifting, aren't you?