Have you ever noticed that in Ruby looking for a specific key-value pair in a Hash is very fast?
Allow me to explain the logic behind hashes in Ruby with a language that you probably understand: Ruby.
Let's imagine for a second that we want to emulate the functionality in hashes because, for some strange reason, they have not been implemented yet.
If we want to have some sort of key-value structure, we'll have to implement it ourselves, so let's get to work.
First, we'll need a Struc to represent a HashEntry, or the key-value objects that we will add to our hashes.
Now, we'll need a class that represents our Hashes or (to avoid conflicts with the original Hash class) HashTable.
We add the
bins attribute to the class.
bins will be an array where we'll store our HashEntry elements.
Now, let's write a method to add a HashEntry to a HashTable. To follow convention, we'll use the traditional
<< as method name.
Great, now we can add HashEntry elements to our HashTable like so:
What if we want to look for an entry by key? Let's write the
 method on the HashTable class to handle that.
What we're doing here is simply going element by element comparing the given key until we find what we're looking for. Efficient? Let's figure it out.
We'll use Ruby's benchmarking tools to figure out how much time we're spending looking for elements on our hash tables.
When we run this benchmark, we get the following results:
What we see here is that lookup times increase depending on the amount of entries and its position within the array or
bins. This is obviously very inefficient and unacceptable for real life scenarios.
Now, let's see how Ruby tackles this problem internally.
Instead of using a single array to store all its entries, hashes in Ruby use an array of arrays or "bins".
First, it calculates a unique integer value for each entry. For this example we will use
Object#hash. Then, Ruby divides this hash integer by the total number of bins and obtains the remainder or modulus. This modulus will be used as the bin index for that specific entry.
When you lookup for a key, you calculate its bin index again using the same algorithm and you look for the corresponding object directly on that bin.
Let's add an attribute on the HashTable class that will determine how many bins each HashTable will have, and we'll initialize it with 500.
(Now, let's write a method that calculates the bin for a specific entry depending on the number of bins.)
When storing the HashEntry in the HashTable, we won't just store it on an array, we'll store it on an array that corresponds to the
bins index depending on what the
bin_for method returns:
And last, whenever we want to retrieve a HashEntry, we'll recalculate the bin index again using the
bin_for method and once we have that, we'll know exactly where to look for our entry.
When we run the same benchmark that we used earlier, we can see times improve dramatically:
(Not only did times improve, but we got rid of the variance that we used to have depending on the position of the element in the bin pool.
There's still room for improvement here. Let's add more bins and see what happens.)
When we run the benchmark we get:
Even more improvement. This mean that the more bins, the less time spent looking for a specific key in a bin.
How many bins does Ruby actually use?
Ruby manages the size of the bins dynamically. It starts with 11 and as soon as one of the bins has 5 or more elements, the bin size is increased and all hash elements are reallocated to their new corresponding bin.
At some point you pay an exponentially increased time penalty while Ruby resizes the bin pool, but if you think about it, its worth the time since this will keep lookup times and memory usage as low as possible.
If you want to learn where this algorithm came from and a little more about Ruby internals, I really recommend that you read Pat Shaughnessy's Ruby Under a Microscope book. Pat explains how the Ruby VM works in a way that anyone can understand. No C knowledge required, I really enjoyed reading it.
You can find the working example I used for this example on this gist.
You could also read some of the Rubinius source code. Take a look at their implementation of the Hash class, you'll probably understand a little more of the logic they used after you've read this post and Pat's book.
Thanks for reading.