Engine Yard Blog RSS Feed

Hello again from your friends on the Engine Yard Data Team. We've been quite busy over the past few months, and we wanted to share some exciting news with you. In this post, we'll go over some cool new features in our latest data release and also give you an idea of what our next big projects are. So, find a comfortable spot because there is a lot to cover!

PostgreSQL 9.1 is now GA

As of today, PostgreSQL 9.1.3 is available on Engine Yard Cloud and Engine Yard Managed as a fully supported GA release. This milestone is very dear to us since it includes significant changes in our overall release process. We've established a new and improved QA cycle for all databases, and we used this process to confirm the readiness of this release. We are very happy with the results and hope you will be too!

Taking advantage of PostgreSQL's extensions

When you select PostgreSQL 9.1 on your Engine Yard environment, you get an extremely reliable and robust database server. Did you also know that PostgreSQL comes with extensions that allow you to do fancy string searches?

To follow the examples on this post you'll need:

  • To deploy the Engine Yard [to-do application](https://github.com/engineyard/todo) on a solo environment
  • To clone the [ey-cloud-recipes](https://github.com/engineyard/ey-cloud-recipes) repo locally and enable the following extensions in the postgresql9_extensions cookbook:
    • fuzzystrmatch
    • pg_trgm

  • Apply the custom recipes to your test environment (instructions here)

_Note: _Code examples are meant to be run against your postgres shell. You can access the shell by executing the following command from your solo instance.

psql -U postgres todo

You should see the following output

psql (9.1.3)
Type "help" for help.
todo=#

Levenshtein Matching

The fuzzystrmatch extension gives you the Levenshtein string comparison algorithm. This algorithm compares how similar two strings are by counting how many steps are required to change one string into another. Each change is counted as one step and the distance is the total number of steps required for strings to match. This extension is particularly helpful when searching for strings like titles or names that can be easily misspelled.

For example, say we are terrible at spelling, and we want to search for the string Engine Yard but provide EngneYrd instead. The Levenshtein distance between these two strings is 3 because only three changes (adding an ‘i', an ‘a' and a space) are needed to make both strings equal.

todo=# select levenshtein('Engine Yard','EngneYrd') as distance;
distance
----------
3
(1 row)

As you can see, we can make our searches tolerant of people with bad spelling by relaxing the match to a given number of steps.


todo=# select id, name, done from tasks where levenshtein(lower(name),lower('r0kc on')) < 5;
id |   name   | done
----+----------+------
5 | Rock on! |
(1 row)

Metaphones

Along with Levenshtein matching, the fuzzystrmatch extension gives you a way to compare strings based on the way words sound. The metaphone algorithm constructs representative code for a given input string that you can then use for phonetic comparison.

For example, here is the 5 character representation of the word ‘Engine Yard'

todo=# select metaphone('Engine Yard',5);
metaphone
-----------
ENJNY
(1 row)

Say we wanted to search again for the ‘Rock on!' task, and we don't know how the word is spelled but we've heard it pronounced. We can issue the following statement and still get a match:

todo=# select id, name, done from tasks where metaphone(name,7) = metaphone ('rukon',7);
id |   name   | done
----+----------+------
5 | Rock on! |
(1 row)

Check the fuzzystrmatch documentation for double metaphones, soundexes, and more examples.

Trigrams

The pg_trgnm extension gives you trigrams. A trigram is a group of three consecutive characters taken from a string. We can find matches between strings by counting the number of trigrams shared. The string with the most matches is the most similar.

Trigram matching is a very useful tool when used in conjunction with a GIST (Generalized Index Search Tree) index. Let's create an index on task names:

todo=# create index task_name_trigram on tasks using gist(name gist_trgm_ops);CREATE INDEX

Now, let's query the task table with another misspelling of the “our blog” to do item. We still get the correct result:

todo=# select id, name, done from tasks where name % 'We blog enineyrd bolg';
id |                  name                   | done
----+-----------------------------------------+------
4 | We blog http://www.engineyard.com/blog/ |
(1 row)

As you can see, this extension is also useful for doing searches when you want to be tolerant of slight misspellings or even minor words missing. Our query returned all tasks that are sufficiently similar to our provided criteria of ‘We blog enineyrd bolg', sorted from best match to worst. The index we created will be used to make this a fast operation, even over very large datasets.

There is a lot more to learn about trigrams, so make sure you check the documentation for more information and examples.

This is not all that PostgreSQL can do! We'll continue to introduce features in future posts. Stay tuned for information on how to enable full-text search on a PostgreSQL database.

Sneak peek into our work queue

We hope you enjoy the PostgreSQL GA release. Customer feedback is always very important to us, and we use it to fine tune our QA process, so let us know if you experience any problems.

Here is a preview of our current roadmap:

  • Enhancing database slaves on Engine Yard Cloud for both MySQL and PostgreSQL
  • Removing the db-master requirement for customers who are using hosted solutions
  • Working towards a GA release for MySQL 5.5 (currently in beta)

Additionally, we are actively working on expanding the supported NoSQL stack on Engine Yard Managed. Are you an Engine Yard Managed customer interested in Riak or Cassandra? Let us know!

Remember that you can always influence our roadmap by submitting data feature requests.


Tagged:

comments powered by Disqus