Note: This article was originally included in the April issue of the Engine Yard Newsletter. To read more posts like this, subscribe to receive the Engine Yard Newsletter.
This post comes from Vitaly Sedelnik and Renat Khasanshyn of Altoros Systems, Inc. Altoros specializes in helping technology companies identify, recruit and support teams of uber-talented Ruby on Rails engineers across Eastern Europe. Altoros currently operates 120+ engineers for 21 consumer Internet, SaaS and gaming companies.
APRIL 26, 2011 UPDATE: This blog post has received some negative attention, the ferocity of which surprised us.
After reaching out to some of the detractors, it's clear that their argument is absolutely correct: the post, as written, is neither objective nor scientific.
Thank you, community, for keeping us on our toes. We'll work internally to avoid such incidents in the future.
-Tom Mornini, co-founder and CTO
Choosing a Key-Value Solution for Logging Events Data
"It is not necessary to change. Survival is not mandatory."
The words of W. Edwards Deming, an American statistician and author, perhaps best describe the idea behind the latest Ruby on Rails project I participated in. Change was the only hope.
Need to scale
Last year, we built a Web application on Ruby on Rails backed by a MySQL database, and it met all of the customer's requirements. Later, the customer decided to make the project more social by adding a new module that let users diversify their activities within the app. The good news was that the new features made users more interactive, attracted new users, and helped the company expand its services to new countries. The bad news was that the amount of data to be logged increased immensely and was likely to keep growing at an even greater pace. That was too much for the initial MySQL database, which couldn't cope with the increased load. Thus, change was inevitable.
While looking for a better database solution, we had to consider the special character of logging data, since logging was the main function required. Typically, logging implies that data is written frequently and read rarely. There are few references between tables, and those that exist are weak. There is no need for complex joins, as records are usually accessed by a primary key or another unique index. The records have a simple but non-uniform structure.
Key-Value Solutions Analysis
Since our team was previously involved in the Memcached project, we already had NoSQL-related experience. It was obvious to us that a key-value solution could be a good fit. But which one? To identify the most suitable one for our app, we reviewed popular open source solutions, such as Cassandra, CouchDB, MongoDB, etc.
As performance was critical, we dismissed any solution that could hurt the efficiency of the application. Thus, we crossed out Cassandra, since it's written in Java, would require a JVM installation, and could need substantial memory. The application was built with Ruby on Rails, so the key-value solution had to work well with Rails. Here, CouchDB and MongoDB looked like good candidates, as both have the required Rails gems. Considering future plans and how fast the application had evolved up to that moment, we also had to think about further development of the system and its maintenance. The easier it would be to find experts who could support the technology, the better. Sorry, CouchDB, Erlang is not that popular compared to C++ or Java.
This is how we decided that the most suitable fit in our case was MongoDB. It's written in C++, which we expected to be good for the performance of our Rails app. It's highly compatible with a Ruby on Rails application, since it comes with a Ruby driver. And it's very easy to install: just one step, versus configuring a separate environment. In addition, it offers cross-platform support, which is good in case the customer chooses to adapt our Linux-based app to work on several operating systems.
Though the analysis of the key-value systems took some time, the choice to go with MongoDB proved to be the right one for a Rails application dealing with lots of logging data.
Replacing MySQL with MongoDB helped:
- Enhance performance: MongoDB turned out to be significantly faster than MySQL for both write and read operations on logging data. MongoDB makes extensive use of RAM, much as Memcached does, while also persisting data to disk.
- Ensure easy backups: Data is stored in flat files using a custom binary format, which is compact and efficient. Files are allocated incrementally with extensions like .0, .1, .2, etc., which enables simple incremental backup. All you have to do is back up the new files. Note again that this logic works due to the specifics of logging data.
- Provide easy replication: Replication was clear and easy to set up, with a small performance overhead. We have no precise numbers, but it was roughly 5%.
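The incremental backup idea from the list above can be sketched in a few lines of Ruby. This is only an illustration, not the script we used: the function name `backup_new_files`, the directory paths, and file names such as `logs.0` are hypothetical, and the sketch assumes that a file which has already been copied no longer changes. The most recently allocated file may still be receiving writes, so a real backup would need to handle it separately.

```ruby
require "fileutils"

# Copy every regular file from data_dir that is not yet present in
# backup_dir. Returns the names of the files copied during this run.
# Assumption: files that were already backed up are never modified again.
def backup_new_files(data_dir, backup_dir)
  FileUtils.mkdir_p(backup_dir)
  Dir.children(data_dir).sort.each_with_object([]) do |name, copied|
    source = File.join(data_dir, name)
    target = File.join(backup_dir, name)
    next unless File.file?(source)  # skip subdirectories
    next if File.exist?(target)     # already backed up earlier
    FileUtils.cp(source, target, preserve: true)
    copied << name
  end
end

# Hypothetical usage:
#   backup_new_files("/data/db", "/backup/db")
```

Running the function repeatedly copies only the files that appeared since the previous run, which is what makes the backup incremental.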
While implementing MongoDB, we discovered several peculiarities you should pay attention to if you choose this solution for a similar case:
- The record structure is flexible, so you can store whatever you need within each record. There is no need to recreate the entire database or the table to add a new field. Just store it in new records and that's it. However, no schema means that each record needs to store its field names. By shortening field names, you can save a lot of disk space and memory. We saved ~900 MB on a table of 5,000,000 rows by using “dt” instead of “date_time,” “un” instead of “user_name,” etc.
- Migrating was a bit time-consuming, mostly because of renaming the fields.
- Deleting documents locks the entire database. This wasn't an issue for us because of the nature of logging data. Just keep this in mind in case you want to use MongoDB for other types of data.
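The field renaming mentioned above can be kept out of the rest of the application with a small translation layer. The following Ruby sketch is one possible approach, not the code from our project: only the aliases “dt” and “un” come from this post, while the constant `FIELD_ALIASES`, the helper names, and the sample record are hypothetical.

```ruby
# Mapping from descriptive field names to the short keys that get stored.
# Only "dt" and "un" are taken from the article; extend as needed.
FIELD_ALIASES = {
  "date_time" => "dt",
  "user_name" => "un"
}.freeze

# Replace long keys with their short aliases before a record is stored.
# Keys without an alias are kept unchanged.
def shorten_keys(record)
  record.each_with_object({}) do |(key, value), out|
    out[FIELD_ALIASES.fetch(key.to_s, key.to_s)] = value
  end
end

# Restore the descriptive names when a record is read back.
def expand_keys(record)
  long_names = FIELD_ALIASES.invert
  record.each_with_object({}) do |(key, value), out|
    out[long_names.fetch(key.to_s, key.to_s)] = value
  end
end

# Bytes saved per record on key names alone, since every record
# repeats its own field names.
def bytes_saved_per_record
  FIELD_ALIASES.sum { |long, short| long.bytesize - short.bytesize }
end

# Hypothetical log record
event = { "date_time" => Time.now.utc, "user_name" => "alice", "action" => "login" }
stored = shorten_keys(event)
puts stored.keys.inspect
puts bytes_saved_per_record * 5_000_000
```

These two renames alone save 14 bytes per record, or about 70 MB over 5,000,000 rows, so the total saving depends heavily on how many fields are shortened.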
For our application, the new functionality and the growing popularity caused an increase in traffic that could have been predicted but was not planned for. In some other cases (for example, embedded devices that continually generate and write data), large amounts of data to be logged can be planned for in advance. So, it might be a good idea to think about implementing a key-value solution while the application is still being developed.
However, it's also possible that you might prefer something other than MongoDB. If your application needs to preserve the structured character of data (such as in a CMS), or if it focuses on indexing, then you might want to consider other solutions. Besides, MongoDB doesn't provide transaction support, so beware of moving your billing app to MongoDB. ;-) Just try, analyze, and don't be afraid of change.
Note: Engine Yard AppCloud customers, keep an eye out for a follow-up post coming soon from Altoros on installing and configuring MongoDB on AppCloud. Engine Yard is also working on formalizing our support for MongoDB. Stay tuned.