Our first OSS data sponsorship: meet Oleg, Teodor, and Alexander

Engine Yard has been a long-time friend and sponsor of open source projects. Over the last few years we have made contributions to projects like JRuby, Travis CI, RailsInstaller, Vagrant, and many more. But in our long OSS sponsoring history there was something missing: a data-related open source project. That is no longer the case.

We are delighted to announce our support of the work that Oleg Bartunov, Teodor Sigaev, and Alexander Korotkov are doing to improve JSON support and full text search performance in PostgreSQL. We are happy to help PostgreSQL and its community to continue to lead the way as the most adaptable and robust open source database in the market.

Oleg and Teodor present their work at PGCon13

Meet Oleg, Teodor, and Alexander

Oleg Bartunov started developing for PostgreSQL in 1996 and is a major developer and member of the PostgreSQL Global Development Group (PGDG). Oleg is also an active member of the Russian PostgreSQL community and an advocate for the adoption of PostgreSQL by astronomers and their community.

Teodor Sigaev began developing for PostgreSQL in 2000 and is also a member of the PGDG. Teodor enjoys hacking the internals of PostgreSQL related to indexing, data types, and full text search.

During their tenure at the Sternberg Astronomical Institute of Moscow State University, Oleg and Teodor developed the infrastructure for implementing user-defined index access methods (GiST and GIN), the built-in full-text search facilities in PostgreSQL (formerly known as tsearch2), and a number of popular extensions such as intarray, ltree, hstore, and pg_trgm.

Alexander Korotkov, our newest sponsorship recipient, contributes to PostgreSQL in the areas of indexing and statistics and uses PostgreSQL in various projects. Alexander recently defended his PhD thesis in Computer Science at NRNU MEPhI. His thesis included plenty of PostgreSQL love, presenting a new page-splitting algorithm for GiST and Levenshtein distance calculation with a threshold (the levenshtein_less_equal function in contrib/fuzzystrmatch).

Collectively they bring incredible expertise and understanding of PostgreSQL and its internals. We are very excited to help advance the state of data by supporting their work.

Features under development

We have identified areas of improvement that we believe will help developers get more out of their PostgreSQL databases, and luckily for us, the interests of Oleg, Teodor, and Alexander are aligned with ours. Here are the two features they are working on under our sponsorship:

True JSON data type support for PostgreSQL

A JSON data type was included in PostgreSQL 9.2. Internally, it is stored as a text string, with validity checking for stored values and some related functions. To be a real data type, it must have a binary representation, and developing one from scratch would be a big project. Oleg and Teodor developed hstore over a decade ago to facilitate working with semi-structured data in PostgreSQL. Hstore is a very mature and widely used data type with indexing support.
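To make the difference between text storage and a parsed representation concrete, here is a minimal Python sketch. It is an illustration only and says nothing about PostgreSQL's actual internals: the class names `TextJson` and `ParsedJson` are hypothetical, and an in-memory Python structure stands in for a binary on-disk format.

```python
import json


class TextJson:
    """Hypothetical text-backed value: validated on input, kept as the original string."""

    def __init__(self, text: str) -> None:
        json.loads(text)  # validity check only; raises ValueError on malformed input
        self.text = text

    def get(self, key: str):
        # Every access has to parse the whole string again.
        return json.loads(self.text)[key]


class ParsedJson:
    """Hypothetical pre-parsed value: parsed once on input, then navigated directly."""

    def __init__(self, text: str) -> None:
        self.value = json.loads(text)

    def get(self, key: str):
        # No reparsing: the structure is already available.
        return self.value[key]


doc = '{"name": "hstore", "tags": ["key", "value"], "meta": {"nested": true}}'
text_value = TextJson(doc)
parsed_value = ParsedJson(doc)
```

Both objects reject malformed input and return the same answers. The difference is that the text-backed one pays the parsing cost on every lookup, which is the kind of overhead a shared binary representation is meant to avoid.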

Oleg and Teodor are currently extending hstore to be nested (with values that can themselves contain hstore data) and are adding support for arrays, so that its binary representation can be shared with JSON. They presented a working prototype of their new hstore data type at PGCon 2013. See Oleg’s blog post on their design decisions, implementation issues, and the feedback they received at PGCon. They continue to make great performance improvements and have submitted their work to the PostgreSQL 9.4 commitfest.

GIN Indexing Improvements

Oleg, Teodor, and Alexander have also set out to significantly improve GIN indexes in PostgreSQL. Their primary goal is to make full-text search (FTS) in PostgreSQL as fast as standalone solutions like Sphinx and Solr.

Improvements to GIN indexes will also enable compression of item pointers in index storage (they aim to make GIN indexes about 2x smaller!) and allow additional information to be stored in posting trees and posting lists. Using that additional information for filtering would allow new features in GIN operator classes, including better phrase search, improved array similarity search, inverse FTS search, inverse regex search, and enhanced string similarity using positioned n-grams. Using it for sorting inside the index speeds up ranking in FTS and dramatically reduces its I/O. See their presentation on the evolution of FTS in PostgreSQL and the preliminary results of their work.
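For readers wondering how compressing item pointers can shrink an index, the Python sketch below shows one generic technique: store a sorted posting list as the gaps between neighbouring entries, with each gap written as a variable-length integer. This is an assumption chosen for illustration, not a description of the encoding the authors use. Item pointers are simplified to plain integers, and the function names are hypothetical.

```python
def encode_varbyte(number: int) -> bytes:
    """Encode a non-negative integer in 7-bit groups, low bits first."""
    if number < 0:
        raise ValueError("varbyte encoding needs a non-negative integer")
    out = bytearray()
    while True:
        low_bits = number & 0x7F
        number >>= 7
        if number:
            out.append(low_bits | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low_bits)
            return bytes(out)


def compress_postings(item_ids: list[int]) -> bytes:
    """Store a sorted list of ids as varbyte-encoded gaps between neighbours."""
    out = bytearray()
    previous = 0
    for item_id in item_ids:
        out += encode_varbyte(item_id - previous)  # raises if the list is unsorted
        previous = item_id
    return bytes(out)


def decompress_postings(data: bytes) -> list[int]:
    """Invert compress_postings and return the original ids."""
    item_ids = []
    current = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # continuation byte: keep accumulating this gap
        else:
            current += gap
            item_ids.append(current)
            gap = 0
            shift = 0
    return item_ids


postings = [1000, 1003, 1010, 1500, 90000]
packed = compress_postings(postings)
```

Because neighbouring entries in a sorted posting list tend to be close together, most gaps fit in one or two bytes rather than a fixed-width slot, and `decompress_postings(packed)` recovers the original list exactly.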

Come hear them talk!

Oleg, Alexander, and Teodor will be speaking at the PostgreSQL Europe Conference, held October 29 to November 1, 2013, in Dublin, Ireland.

Don’t miss their talk, “The next generation of GIN,” where they go in depth on their benchmark results for the PostgreSQL and Sphinx full-text search engines, using several datasets (6 million and 15 million documents) and a real-life load. They set out to demonstrate that improved PostgreSQL FTS (with all its ACID overhead) outperforms the standalone Sphinx search engine! Make sure to also see their presentation, “Binary storage for nested data structures and its application to the hstore data type.”