Engine Yard Blog RSS Feed

Note: Our friend and CEO at Crowd Interactive, David Padilla, wrote this great piece about hash lookup in Ruby. Be sure to check out MagmaConf!

Have you ever noticed that in Ruby looking for a specific key-value pair in a Hash is very fast?

Allow me to explain the logic behind hashes in Ruby with a language that you probably understand: Ruby.

Let's imagine for a second that we want to emulate the functionality in hashes because, for some strange reason, they have not been implemented yet.

If we want to have some sort of key-value structure, we'll have to implement it ourselves, so let's get to work.

First, we'll need a Struc to represent a HashEntry, or the key-value objects that we will add to our hashes.


HashEntry = Struct.new(:key, :value)

Now, we'll need a class that represents our Hashes or (to avoid conflicts with the original Hash class) HashTable.


class HashTable
  attr_accessor :bins

  def initialize
    self.bins = []
  end
end

We add the bins attribute to the class. bins will be an array where we'll store our HashEntry elements.

Now, let's write a method to add a HashEntry to a HashTable. To follow convention, we'll use the traditional << as method name.


class HashTable
  attr_accessor :bins

  def initialize
    self.bins = []
  end

  def <<(entry)
    self.bins << entry
  end
end

Great, now we can add HashEntry elements to our HashTable like so:


entry = HashEntry.new :foo, :bar
table = HashTable.new
table << entry

What if we want to look for an entry by key? Let's write the [] method on the HashTable class to handle that.


def [](key)
  self.bins.detect { |entry| entry.key == key }
end

What we're doing here is simply going element by element comparing the given key until we find what we're looking for. Efficient? Let's figure it out.

Benchmarking

We'll use Ruby's benchmarking tools to figure out how much time we're spending looking for elements on our hash tables.


require 'benchmark'

#
# HashTable instance
#
table = HashTable.new

#
# CREATE 1,000,000 entries and add them to the table
#

(1..1000000).each do |i|
  entry = HashEntry.new i.to_s, "bar#{i}"

  table << entry
end

#
# Look for an element at the beginning, middle and end of the HashTable.
# Benchmark it
#
%w(100000 500000 900000).each do |key|
  time = Benchmark.realtime do
    table[key]
  end

  puts "Finding #{key} took #{time * 1000} ms"
end

When we run this benchmark, we get the following results:


Finding 100000 took 33.641 ms
Finding 500000 took 192.678 ms
Finding 900000 took 345.329 ms

What we see here is that lookup times increase depending on the amount of entries and its position within the array or bins. This is obviously very inefficient and unacceptable for real life scenarios.

Now, let's see how Ruby tackles this problem internally.

Bins

Instead of using a single array to store all its entries, hashes in Ruby use an array of arrays or "bins".

First, it calculates a unique integer value for each entry. For this example we will use Object#hash. Then, Ruby divides this hash integer by the total number of bins and obtains the remainder or modulus. This modulus will be used as the bin index for that specific entry.

When you lookup for a key, you calculate its bin index again using the same algorithm and you look for the corresponding object directly on that bin.

Let's add an attribute on the HashTable class that will determine how many bins each HashTable will have, and we'll initialize it with 500.


class HashTable
  # ...

  attr_accessor :bin_count

  def initialize
    self.bin_count = 500
    self.bins = []
  end

  # ...
end

(Now, let's write a method that calculates the bin for a specific entry depending on the number of bins. )


class HashTable
  # ...

  def bin_for(key)
    key.hash % self.bin_count
  end

  # ...
end

When storing the HashEntry in the HashTable, we won't just store it on an array, we'll store it on an array that corresponds to the bins index depending on what the bin_for method returns:


class HashTable
  # ...

  def <<(entry)
    index = bin_for(entry.key)
    self.bins[index] ||= []
    self.bins[index] << entry
  end

  # ...
end

And last, whenever we want to retrieve a HashEntry, we'll recalculate the bin index again using the bin_for method and once we have that, we'll know exactly where to look for our entry.


def [](key)
  index = bin_for(key)
  self.bins[index].detect do |entry|
    entry.key == key
  end
end

When we run the same benchmark that we used earlier, we can see times improve dramatically:


Finding 100000 took 0.025 ms
Finding 500000 took 0.094 ms
Finding 900000 took 0.112 ms

(Not only did times improve, but we got rid of the variance that we used to have depending on the position of the element in the bin pool.

There's still room for improvement here. Let's add more bins and see what happens.

)


class HashTable
  # ...

  def initialize
    self.bin_count = 300000
    self.bins = []
  end

  # ...
end

When we run the benchmark we get:


Finding 100000 took 0.014 ms
Finding 500000 took 0.016 ms
Finding 900000 took 0.005 ms

Even more improvement. This mean that the more bins, the less time spent looking for a specific key in a bin.

How many bins does Ruby actually use?

Ruby manages the size of the bins dynamically. It starts with 11 and as soon as one of the bins has 5 or more elements, the bin size is increased and all hash elements are reallocated to their new corresponding bin.

At some point you pay an exponentially increased time penalty while Ruby resizes the bin pool, but if you think about it, its worth the time since this will keep lookup times and memory usage as low as possible.

Further reading

If you want to learn where this algorithm came from and a little more about Ruby internals, I really recommend that you read Pat Shaughnessy's Ruby Under a Microscope book. Pat explains how the Ruby VM works in a way that anyone can understand. No C knowledge required, I really enjoyed reading it.

You can find the working example I used for this example on this gist.

You could also read some of the Rubinius source code. Take a look at their implementation of the Hash class, you'll probably understand a little more of the logic they used after you've read this post and Pat's book.

Thanks for reading.


Tagged:

comments powered by Disqus