Tokyo Cabinet (Key-Value Stores Part 2)

A few weeks ago I started a series of posts on Key-Value Stores with a general piece on the Key-Value Store concept, and why you should be using one. This week, I'm following it up with a focus on one of our favorite tools for the job here at Engine Yard: Tokyo Cabinet. Tokyo Cabinet was written by Mikio Hirabayashi, and was originally created for mixi -- the most popular social networking site in Japan. So it's proven in production already, and as far as we've seen -- nice and mature.

If you're just beginning to experiment with Key-Value Stores, you've likely played with PStore, and while you probably found it interesting, you probably also ran into its shortcomings pretty early on. As a general diagnosis, I'd guess that you need something faster, more feature rich, and almost as importantly, something that leaves you with a vague tingly feeling after using it for the first time. This brings me back to my original thought: I'd like to introduce you to Tokyo Cabinet.

Tokyo Cabinet is designed specifically to be a robust, high performance, high efficiency Key-Value Store. It supports three storage formats: hash tables, B+ tree tables and fixed length arrays. Via additional components called Tokyo Tyrant and Tokyo Dystopia, it also supports network based access to the Key-Value Store and full text indexing.

As if that's not enough, Tokyo Cabinet also provides a table format for data. Using its underlying data structures, Tokyo Cabinet lets you create something that looks a lot like a relational database table, except that it needs no predefined schema. You can designate arbitrary indexes and then perform queries against them.

In the previous post, I illustrated how you'd work with PStore using a script that loads stock pricing data from a CSV file. Let's revisit that piece of code, except that this time it will be loaded into a Tokyo Cabinet hash table. For this example, we'll use the Ruby API provided in conjunction with Tokyo Cabinet.

stock_loader_tc1.rb

require 'tokyocabinet'
require 'csv'
require 'date'

# Open (or create) an on-disk hash database.
store = TokyoCabinet::HDB.new
store.open("fund_data.tch", TokyoCabinet::HDB::OWRITER | TokyoCabinet::HDB::OCREAT)

# Load any existing records, unmarshaling each stored string.
funds = {}
store.iterinit
while key = store.iternext
  funds[key] = Marshal.load(store[key])
end

# Skip rows without a ticker, then store each record under its ticker.
CSV.read(ARGV.first).reject {|x| x.first.nil? }.each do |row|
  row[0] = row[0].to_s.strip

  record = {
    :ticker => row[0],
    :rate_date => Date.parse(row[2]),
    :price => row[3],
    :price_change => row[4],
    :percent_change => row[5]
  }
  store[row[0]] = Marshal.dump(record)
end

# Close the database so pending writes reach the file.
store.close

With methods like #iterinit and #iternext to iterate over the keys, the API isn't always my favorite thing to look at. That said, this is just one of the available Ruby interfaces for Tokyo Cabinet, so I don't usually have to. I myself prefer rufus-tokyo, which I find to be a bit more mature, and certainly easier to look at. It uses Ruby-FFI, which means that the same code can be used on multiple Ruby implementations. It also supports a more conventional Ruby style API.

stock_loader_tc2.rb

require 'rubygems'
require 'rufus/tokyo'
require 'csv'
require 'date'

# The .tch suffix selects an on-disk hash database.
store = Rufus::Tokyo::Cabinet.new('fund_data.tch')

# Load any existing records, unmarshaling each stored string.
funds = {}
store.each { |key, value| funds[key] = Marshal.load(value) }

CSV.read(ARGV.first).reject {|x| x.first.nil? }.each do |row|
  row[0] = row[0].to_s.strip

  record = {
    :ticker => row[0],
    :rate_date => Date.parse(row[2]),
    :price => row[3],
    :price_change => row[4],
    :percent_change => row[5]
  }
  store[row[0]] = Marshal.dump(record)
end

store.close

One important thing to note about Tokyo Cabinet is that the values are all stored as strings. With PStore you can store arbitrary Ruby data types, and it handles the serialization tasks. With Tokyo Cabinet, if you store data that is not a String, you need to handle that yourself.

Furthermore, different Ruby interfaces have different behaviors. The rufus-tokyo library calls to_s on data (such as Hash) before storing it, so while the above code works if the Marshaling is left out, the value retrieved is a mashed string. Something like this:

"tickerXXXrate_date2009-08-12price10.01price_change0.01percent_change0.01"

On the other hand, the Tokyo Cabinet library will just throw an exception if given a complex data type like Hash.
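
To make the serialization point concrete, here is a minimal sketch of the Marshal round trip. It assumes a plain Ruby Hash as a stand-in for the store, so it runs without Tokyo Cabinet installed; the sample record is illustrative only.

```ruby
require 'date'

# Assumption: a plain Hash stands in for a store that only accepts Strings.
store = {}

record = {
  :ticker => 'GOOG',
  :rate_date => Date.new(2009, 8, 12),
  :price => '10.01'
}

# Serialize to a String before storing, as the loaders above do.
store['GOOG'] = Marshal.dump(record)

# Deserialize on the way out to recover the original Ruby objects.
restored = Marshal.load(store['GOOG'])

puts store['GOOG'].class         # String
puts restored[:rate_date].class  # Date
puts restored == record          # true
```

The stored value is an opaque String, but the Date comes back as a Date rather than as text, which is what the mashed string above loses.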

If you store table type data, you can go beyond simple Key-Value storage semantics with Tokyo Cabinet, and use its built-in support for tables. Tokyo Cabinet tables resemble tables in a relational database. A primary key is required for each record, but you can insert hash-like data into records without having to predefine a schema. The keys of the hash correspond to column names. You can then define indexes for these keys, perform queries on them, and do other fun activities.

stock_loader_tc3.rb

require 'rubygems'
require 'rufus/tokyo'
require 'csv'
require 'date'

# The .tct suffix selects the table format.
store = Rufus::Tokyo::Table.new('fund_data.tct')

# Existing rows come back as hashes, so no unmarshaling is needed.
funds = {}
store.each { |key, value| funds[key] = value }

CSV.read(ARGV.first).reject {|x| x.first.nil? }.each do |row|
  row[0] = row[0].to_s.strip

  record = {
    :ticker => row[0],
    :rate_date => Date.parse(row[2]),
    :price => row[3],
    :price_change => row[4],
    :percent_change => row[5]
  }
  store[row[0]] = record
end

store.close

Notice that Marshaling is no longer required when using a Tokyo Cabinet table. Also, you're now able to do more sophisticated queries on the data.

store.query do |q|
  q.add 'ticker', :equals, 'GOOG'
end

store.query do |q|
  q.add 'price', :gte, 100.0
  q.add 'percent_change', :lte, 1.0
  q.order_by 'ticker'
end
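
To spell out what the second query asks for, here is a rough plain-Ruby equivalent over an in-memory collection. This is only a sketch of the semantics, not how Tokyo Cabinet evaluates queries; the sample rows are hypothetical, and it assumes column values come back as strings that must be converted before a numeric comparison.

```ruby
# Hypothetical sample rows keyed by ticker (made-up values, all strings).
rows = {
  'IBM'  => { 'ticker' => 'IBM',  'price' => '118.50', 'percent_change' => '0.90' },
  'GOOG' => { 'ticker' => 'GOOG', 'price' => '460.00', 'percent_change' => '0.50' },
  'AAPL' => { 'ticker' => 'AAPL', 'price' => '165.30', 'percent_change' => '1.40' },
  'F'    => { 'ticker' => 'F',    'price' => '7.90',   'percent_change' => '0.20' }
}

# price >= 100.0 AND percent_change <= 1.0, ordered by ticker.
matches = rows.values.select { |r|
  r['price'].to_f >= 100.0 && r['percent_change'].to_f <= 1.0
}.sort_by { |r| r['ticker'] }

puts matches.map { |r| r['ticker'] }.inspect  # ["GOOG", "IBM"]
```

The difference in practice is that the table query runs inside the database and can use an index, while this version has to load and scan every row in Ruby.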

If you're deploying an application that you expect to have a non-trivial user load, you may want to be able to support multiple processes across multiple machines, accessing the same central data store. You can use the previously mentioned Tokyo Tyrant together with Tokyo Cabinet to make this happen.

Tokyo Tyrant is a network server for a Tokyo Cabinet database, capable of handling high concurrency. It also has useful production features like database replication and failover. Here's what a stock loader looks like that uses Tokyo Tyrant to load data into a remote database:

stock_loader_tc4.rb

require 'rubygems'
require 'rufus/tokyo/tyrant'
require 'csv'
require 'date'

# Connect to the Tokyo Tyrant server by host and port.
store = Rufus::Tokyo::TyrantTable.new('db1.yakinyouryard.com', 1978)

funds = {}
store.each { |key, value| funds[key] = value }

CSV.read(ARGV.first).reject {|x| x.first.nil? }.each do |row|
  row[0] = row[0].to_s.strip

  record = {
    :ticker => row[0],
    :rate_date => Date.parse(row[2]),
    :price => row[3],
    :price_change => row[4],
    :percent_change => row[5]
  }
  store[row[0]] = record
end

store.close

That's pretty nice! The only changes from the previous example are in which library is required, and the creation of the store object. The store object must be pointed at the host and port that the Tokyo Tyrant server is running on. All other mechanics for interacting with the data store are the same as if it were local.

As you can see from these examples, Tokyo Cabinet offers a substantial step up in capabilities from what you'd find in a simple Key-Value store like PStore. Additionally, Tokyo Cabinet is quite fast. Using rufus-tokyo on a small Engine Yard slice, it inserted more than 220,000 records per second into an in-memory hash table, and more than 190,000 records per second into a disk backed hash table. If you wanted to forgo the nicer API of rufus-tokyo for the raw performance of the non-FFI bindings, it can do a million records to a disk backed store on an Engine Yard slice in a second.
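
If you'd like to take similar measurements on your own hardware, a rough sketch along these lines would do. It is not the script behind the numbers above. The helper name inserts_per_second is hypothetical, and a plain Hash stands in for the store so the sketch runs anywhere; anything that supports store[key] = value could be passed in instead.

```ruby
require 'benchmark'

# Hypothetical helper: times `count` inserts into any object that supports
# store[key] = value, and returns the rate in records per second.
def inserts_per_second(store, count)
  elapsed = Benchmark.realtime do
    count.times { |i| store["key#{i}"] = "value#{i}" }
  end
  count / elapsed
end

# Assumption: a plain Hash stands in for a real store here.
store = {}
rate = inserts_per_second(store, 100_000)
puts "#{rate.round} inserts/sec"
```

Numbers from a sketch like this vary a lot with key and value sizes, disk, and Ruby implementation, so treat them as a relative comparison between stores rather than an absolute figure.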

The Tokyo family of tools has a myriad of other capabilities that could be great to have in your developer toolbox (such as data store compression), and we haven't even touched on the full text indexing and searching capabilities of Tokyo Dystopia. If your work ever involves writing code to store or retrieve data, it is well worth spending an hour or two playing with these libraries to get a feel for them.

I've shown only rufus-tokyo as an alternative to the original Tokyo Cabinet interface because there isn't space in a single post for more. There are many other excellent projects that support Tokyo Cabinet now, such as Moneta, which offers a backend-agnostic interface.

Stay tuned for further adventures in Key-Value stores.