
Web applications often need to run text searches against data stored in a database. While the built-in MySQL and PostgreSQL functions for full-text searching work, they are often not the best solution for fast and complex full-text searching.

This is where dedicated search engines come into play, and Sphinx is our favorite tool for the job. Written in C++ by Andrew Aksyonoff, and originally released to open source in 2001, Sphinx is a blazing fast search engine. Considering that fast and complex full-text searching is a somewhat frequent need, I've put together this post with my top five tips for implementing Sphinx.

1) Use thinking_sphinx

There are several plugins out there for Sphinx integration into Rails. UltraSphinx, ThinkingSphinx, and acts_as_sphinx (no longer under active development) are the most commonly used plugins.

We recommend ThinkingSphinx over UltraSphinx for several reasons. UltraSphinx, unfortunately, can throw cryptic errors because of the way it preloads your indexed models. If you have patches in your lib directory, they must be explicitly required. And while the UltraSphinx way of defining indexes is simpler for simple cases, in more advanced cases it can become far less readable, and you'll hit those advanced cases before you know it.

2) Know When to Index, and How Often

This next tip is really three tips all bundled into one (because who wants to read seven tips for anything? ;) )

  1. Know your requirements. What's an acceptable lag between the data being updated and becoming available in search results? Does it need to be instantaneous, or is it acceptable to wait 5, 10, 15 or even 30 minutes? The longer you wait between index updates, the fewer resources (and the less CPU time) your search engine consumes.
  2. Know how long your indexing takes. If it takes three minutes to complete an index run, but you kick off a full re-index every 60 seconds, that's not going to work out well. If you absolutely need an updated index every 60 seconds, then you need to consider alternatives like a bigger instance for your search engine or other strategies like delta indexing (below).
  3. Test your new indexes or new data in staging. When making changes to data being indexed or adding new indexes, do a test run of your indexing in a staging environment with a snapshot of production data. Sometimes small changes to data or indexing result in an enormous increase in index size. If your changes create gigantic indexes, it's best to learn that on staging, instead of running out of space in your production environment.
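
The first two points boil down to simple arithmetic: an index run has to finish comfortably inside the interval between runs. Here is a minimal sketch in plain Ruby. The helper name `reindex_fits?` and the 50% headroom figure are illustrative assumptions, not something Sphinx or ThinkingSphinx provides.

```ruby
# Hypothetical helper: does an index run that takes `index_seconds` fit
# into a schedule that starts a new run every `interval_seconds`?
# `headroom` is an assumed safety margin (0.5 = use at most half the interval).
def reindex_fits?(index_seconds, interval_seconds, headroom = 0.5)
  index_seconds <= interval_seconds * headroom
end

reindex_fits?(180, 60)   # three-minute run kicked off every 60 seconds => false
reindex_fits?(180, 900)  # three-minute run kicked off every 15 minutes => true
```

Measure the real duration of an index run on staging, then plug it in before you settle on a cron interval.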

3) Use Delta Indexes (When You Need To)

Without a reindex, your search won't be up to date; the question is when to reindex.

When indexing small data sets, a full reindex can be done frequently. But as the data grows, so does the index, and with it the time it takes to index. This is when delta indexing comes into play. A delta index is nothing more than a second index containing entries for only the documents that changed since your last full index run. There are three main methods of delta indexing built into ThinkingSphinx: the default behavior, timestamped deltas, and delayed_job integration.
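
To make the idea concrete, here is a toy model in plain Ruby. It is only a sketch: real Sphinx indexes are files on disk, and the hashes and the `search` helper below are illustrative assumptions rather than how Sphinx stores or queries anything.

```ruby
# Toy model only: document id => indexed text.
main_index  = { 1 => "first post", 2 => "second post" }          # built by the last full index
delta_index = { 2 => "second post, edited", 3 => "third post" }  # documents changed since then

# A search consults both indexes; for a document present in both,
# the fresher delta entry wins.
def search(term, main_index, delta_index)
  main_index.merge(delta_index).select { |_id, text| text.include?(term) }.keys
end

search("edited", main_index, delta_index)  # => [2]
search("post", main_index, delta_index)    # => [1, 2, 3]
```

The small delta is cheap to rebuild often, while the large main index is rebuilt rarely.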

The first method—the default behavior of thinking_sphinx—is big on convenience. On every save it fires off the delta indexer and you get near instantaneous index updating. However, while this works great on development environments, and most staging environments, in production this can be problematic.

One problem is that the indexer is now part of the request cycle, which means that with each save comes a reindex. This method will cause scaling problems—with increased traffic, the indexer will fire more frequently. This puts increased load on the database as well as the filesystem.

Another problem is that in a production environment with many instances, the delta index is only created on the instance that handles the request. This results in instances with out of date information until the next full index. We deal with this by adding a cron entry to run the delta index cron task on all machines that run the search daemon. This has the effect of keeping your indexes in sync to the interval that the cron job runs at.

The second method—the timestamped version—works by adding a time threshold to the define_index block e.g.

set_property :delta => :datetime, :threshold => 1.hour

The threshold should match how often you run the rake task that reindexes the delta, so in this example your deltas are updated every hour. While this is a nice improvement on the default, your indexes stay out of date until the next rake task runs, so you need to set the frequency according to user expectations (or reset expectations).
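
Conceptually, a timestamped delta picks up every record touched within the threshold window. The sketch below shows that selection in plain Ruby. It assumes an `updated_at` timestamp on each record; the `Record` struct and `delta_candidates` helper are hypothetical and are not ThinkingSphinx's implementation.

```ruby
Record = Struct.new(:id, :updated_at)

# Assumption: the delta covers every record whose timestamp falls within
# `threshold` seconds before `now`.
def delta_candidates(records, now, threshold)
  records.select { |record| record.updated_at > now - threshold }
end

now = Time.utc(2009, 1, 1, 12, 0, 0)  # arbitrary fixed time for the example
records = [
  Record.new(1, now - 2 * 3600),  # edited two hours ago: covered by the main index
  Record.new(2, now - 10 * 60)    # edited ten minutes ago: belongs in the delta
]

delta_candidates(records, now, 3600).map(&:id)  # => [2]
```

If the rake task runs less often than the threshold, a record can age out of the window before the delta ever includes it, which is why the two values should be kept in step.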

The third method uses the delayed_job gem and pushes a job onto the delayed_job queue that tells the indexer to run as needed. This is more immediate than the threshold option, while still running outside of the request cycle. This is the most promising setup, although it lends itself to a single searchd server setup. Specifically, a single machine running the indexer and search daemon with each instance sending reindex tasks to the queue when needed.

The drawbacks to this third approach are:

  1. You lose availability. If the searchd server goes down, your search goes down. Allowing each instance to have its own instance of searchd builds redundancy into the setup.
  2. As mentioned in the official documentation, "because the delta indexing requests are queued, they will not be processed immediately - and so your search results will not be accurate straight after a change. Delayed_Job is pretty fast at getting through the queue though, so it shouldn't take too long."
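
The queued approach and its second drawback can be sketched with a toy queue in plain Ruby. Everything here is a stand-in: `jobs` plays the delayed_job queue and `indexed` plays the Sphinx index. None of it is the real delayed_job or ThinkingSphinx API.

```ruby
require "thread"

jobs    = Queue.new  # stand-in for the delayed_job queue
indexed = []         # stand-in for what the search index currently knows about

# Saving a record only enqueues work, so the request returns without reindexing.
save = lambda { |doc_id| jobs << doc_id }

# A separate worker drains the queue some time later.
work_off = lambda { indexed << jobs.pop until jobs.empty? }

save.call(1)
save.call(2)
indexed        # => [] (search results are stale until the worker runs)
work_off.call
indexed        # => [1, 2]
```

The gap between `save` and `work_off` is exactly the window in which search results lag behind the database.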

4) Know Your Bottleneck: Database or Filesystem

When maintaining your indexes, you have a choice of merging delta indexes into your main index or doing a full reindex. Merging can save you a database hit, but it requires twice the I/O of the two indexes being merged, and it hits the filesystem hard. On the other hand, reindexing hits the database hard. So you have to know your bottleneck. Most Rails developers are acutely aware of their database load. We optimize queries, we index tables, and we even use methods to read exclusively from the replica and write only to the master. So, instinctively, most developers choose to merge their delta index into the main index, rather than perform a full reindex, in order to take load off the database. But this isn't always right.

If your application processes a lot of uploads or has poor cacheability (and you're serving directly from the filesystem a lot), then you probably want to avoid putting more load on the filesystem. In these cases, reindexing will make more sense than merging delta indexes.

5) If you're on ultrasphinx, switch to thinking_sphinx

Rein Henrichs wrote a great blog post which included the steps to make the switch. I'll expand on those here, and include some real world code samples.

Switching is actually relatively simple and in these four steps you can convert an Ultrasphinx application to a ThinkingSphinx one.

1. Uninstall UltraSphinx and install ThinkingSphinx:

Run:

script/plugin remove ultrasphinx

and add this line to your environment.rb:

config.gem('freelancing-god-thinking-sphinx', :lib => 'thinking_sphinx')

2. Translate your is_indexed declaration into a define_index block and change your search actions to use the ThinkingSphinx API:

# Before: the Ultrasphinx is_indexed declaration
class Post < ActiveRecord::Base
  belongs_to :blog
  belongs_to :category

  is_indexed :conditions => "posts.state = 'published'",
             :fields     => [{:field => 'title', :sortable => true},
                             {:field => 'body'},
                             {:field => 'cached_tag_list'}],
             :include    => [{:association_name => "blog",
                              :field            => "title",
                              :as               => "blog",
                              :sortable         => true},
                             {:association_name => "blog",
                              :field            => "description",
                              :as               => "blog_description"},
                             {:association_name => "category",
                              :field            => "title",
                              :as               => "category",
                              :sortable         => true}]
end

# After: the equivalent ThinkingSphinx define_index block
class Post < ActiveRecord::Base
  belongs_to :blog
  belongs_to :category

  define_index do
    # Fields on the model itself
    indexes title, :sortable => true
    indexes body, cached_tag_list

    # Fields pulled in through associations
    indexes blog.description, :as => :blog_description
    indexes blog.title,       :as => :blog,     :sortable => true
    indexes category.title,   :as => :category, :sortable => true

    # Only index published posts
    where "posts.state = 'published'"
  end
end

Your old search call might look like:

Ultrasphinx::Search.new(:query => params[:query])

Where your new one would look like (assuming you've indexed the model Post):

Post.search(params[:query])

3. Rewrite your deployment tasks to run the ThinkingSphinx rake tasks:

namespace :sphinx do
  desc "Stop the sphinx server"
  task :stop, :roles => [:app], :only => {:sphinx => true} do
    run "cd #{latest_release} && RAILS_ENV=#{rails_env} rake thinking_sphinx:stop"
  end

  desc "Reindex the sphinx server"
  task :index, :roles => [:app], :only => {:sphinx => true} do
    run "cd #{latest_release} && RAILS_ENV=#{rails_env} rake thinking_sphinx:index"
  end

  desc "Configure the sphinx server"
  task :configure, :roles => [:app], :only => {:sphinx => true} do
    run "cd #{latest_release} && RAILS_ENV=#{rails_env} rake thinking_sphinx:configure"
  end

  desc "Start the sphinx server"
  task :start, :roles => [:app], :only => {:sphinx => true} do
    run "cd #{latest_release} && RAILS_ENV=#{rails_env} rake thinking_sphinx:start"
  end

  desc "Restart the sphinx server"
  task :restart, :roles => [:app], :only => {:sphinx => true} do
    run "cd #{latest_release} && RAILS_ENV=#{rails_env} rake thinking_sphinx:running_start"
  end
end

and you'll probably want to add these as well to automate the reindexing and starting on deploy:

after "deploy:symlink_configs", "sphinx:configure"
after "sphinx:configure", "sphinx:index"
after "sphinx:index", "sphinx:restart"

If you're not running on Engine Yard Slices, you can still get the benefit of prewritten Capistrano tasks by adding:

require "vendor/plugins/thinking-sphinx/lib/thinking_sphinx/deploy/capistrano"

to the top of your deploy.rb file. If you're running on the latest release from GitHub, as a plugin, the tasks should be included automatically. This is of course assuming that you are running Capistrano from a working repository and by itself (a non-developer deploying code for example—thanks for the tip, commenter Josh!)

Because of a custom script that exists on Engine Yard Slices, if you are using our eycap gem, these tasks are included for you as:

sphinx:configure
sphinx:reindex
sphinx:restart
sphinx:start
sphinx:stop
thinking_sphinx:configure
thinking_sphinx:reindex

4. Stop searchd and then run your new configure, index and start tasks:

cap sphinx:stop && cap sphinx:configure && cap sphinx:index && cap sphinx:start

Solid searching is key in numerous applications; Sphinx is a great tool for many cases, and I hope this post helped convince you!

