Memoization and id2ref

This article was originally included in the February issue of the Engine Yard Newsletter. To read more posts like this one, subscribe to the Engine Yard Newsletter.

In this series, Evan Phoenix, Rubinius creator and Ruby expert, presents tips and tricks to help you improve your knowledge of Ruby.

The performance of a library or application is one of the key factors in getting it accepted, so it should come as no surprise that Ruby programmers have many tricks for squeezing more performance out of their code.

One of the most common is memoization. This is the technique of calculating a value once, then saving the result and transparently substituting it for the code that calculated the original value.

Here’s a short example:

def size_of_universe
  @size ||= Universe.find.size
end

Here, we’ve calculated the size of the universe and then saved the result into the @size ivar. This way, the next time size_of_universe is called, the previously calculated value is returned.

The example above is one of the simplest and most basic techniques. It uses the ||= operator to run the right hand side if, and only if, the left hand side is nil or false. It’s short and sweet, and it rarely confuses the reader.
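To see that the right hand side really runs only once, here is a minimal sketch. The Universe class below is a hypothetical stand-in (the article never defines it) that counts how often the expensive lookup runs, and Observer is likewise an illustrative name.

```ruby
# Hypothetical stand-in for the article's Universe; counts expensive lookups.
class Universe
  @lookups = 0

  class << self
    attr_reader :lookups

    def find
      @lookups += 1
      new
    end
  end

  def size
    93_000_000_000 # placeholder value, for illustration only
  end
end

# Illustrative class holding the memoized method.
class Observer
  def size_of_universe
    # The right hand side runs only while @size is nil or false.
    @size ||= Universe.find.size
  end
end

observer = Observer.new
first  = observer.size_of_universe
second = observer.size_of_universe
puts Universe.lookups # the lookup ran once, not twice
```

One caveat with this idiom: if the computed value is itself nil or false, the right hand side runs again on every call, so ||= suits values that are always truthy.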

Another technique that has been seen in production code uses ObjectSpace._id2ref. While this is becoming a common technique, it has a number of problems that we’ll look at today.

Here is an example of using this technique:

obj = Universe.find.size
eval <<CODE
def size_of_universe
  ObjectSpace._id2ref(#{obj.object_id})
end
CODE

This technique is used frequently with metaprogramming, when you want to embed a specific object directly into a generated method. People use it because, at first glance, it removes any kind of data dependency between the generated code and obj. There is no ivar that has to be in scope, no constant, etc. But, in fact, this technique masks some rather terrible bugs.

This technique basically uses the whole Ruby process as a big table, leveraging the ability to easily get the table index for an object and convert that table index back into the object.
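The "table index" round trip can be sketched as follows. This is only an illustration on an object that is still referenced, which is the one case where the lookup is reliable; the variable names are made up for the example.

```ruby
obj = "size of the universe"

# object_id plays the role of the table index...
id = obj.object_id

# ...and _id2ref turns that index back into the object, as long as it is alive.
same = ObjectSpace._id2ref(id)

puts same.equal?(obj) # true while obj is still referenced
```

Note that id is just an Integer. Holding on to it does nothing to keep obj alive.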

The primary issue stems from the fact that Ruby is a garbage collected language. Even though the code has requested the object_id for an object, that is not enough to keep the object alive. So if obj was the only reference to the return value from #size, then once the code that generated the method returns, the object becomes garbage.

So what happens when you run #size_of_universe and obj has been garbage collected? Well, one of two things can happen:

  1. _id2ref will raise a RangeError, saying that the id no longer points to an object.
  2. A random object will be returned.

The second scenario is probably the strangest, but it can be observed. This bizarre _id2ref behavior occurs because the return value from #object_id is actually derived from the address in memory of the object itself. This means that when the GC runs and collects the object, and the allocator then puts another object in the same place (reusing freed memory is exactly what a GC is for), whatever object happens to be there is returned. This is essentially the same as a dangling pointer bug in C.

Lastly, the implementation of #_id2ref varies wildly between different Ruby implementations, each with different performance characteristics and different potential bugs. These factors make using #_id2ref in production even riskier.

So what’s a simple alternative?

UNIVERSE_SIZES << Universe.find.size
eval <<-CODE
def size_of_universe
  UNIVERSE_SIZES[#{UNIVERSE_SIZES.size - 1}]
end
CODE

This seems silly if there is just a single value in UNIVERSE_SIZES, but the expectation here is that you might be generating many methods with values that need to be memoized. In the example above, we’re storing the values in an Array held by a constant (assumed to have been defined once as an empty Array), which keeps each value alive from a GC standpoint. This avoids the bugs that #_id2ref has.
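The "many generated methods" case might look like the following sketch. Catalog, VALUES, and memoize_value are hypothetical names chosen for illustration; the point is only that the generated source embeds an integer index, while the constant holds the real reference.

```ruby
# Hypothetical example: one generated reader method per stored value.
class Catalog
  # Values live in a constant, so the GC always sees a reference to them.
  VALUES = []

  def self.memoize_value(name, value)
    VALUES << value
    index = VALUES.size - 1
    # Embed only the integer index in the generated source, never an object id.
    class_eval <<-CODE, __FILE__, __LINE__ + 1
      def #{name}
        VALUES[#{index}]
      end
    CODE
  end
end

big = "observable universe"
Catalog.memoize_value(:universe_name, big)
Catalog.memoize_value(:galaxy_count, 200)

GC.start # a collection cannot reclaim anything still held by VALUES
catalog = Catalog.new
puts catalog.universe_name
```

The trade-off is that values stored this way are never collected for the life of the process, which is usually what you want from a memoized value.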

So hopefully, if you need to memoize, you won’t use _id2ref. There are a number of alternatives, and most of them are better than worrying about the bugs that #_id2ref can easily introduce.
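As one further alternative, not covered in the article and offered only as a sketch, you can skip string eval entirely and let a closure hold the reference. Cosmos and embed are hypothetical names; define_method is standard Ruby.

```ruby
# Hypothetical example: the generated method's block keeps the object alive.
class Cosmos
  def self.embed(name, obj)
    # The block closes over obj, so the method itself references the object
    # and the GC cannot collect it while the method exists.
    define_method(name) { obj }
  end
end

size = "unimaginably large"
Cosmos.embed(:size_of_universe, size)

GC.start # obj stays reachable through the method's closure
puts Cosmos.new.size_of_universe.equal?(size) # true: same object, no id lookup
```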