austin entrepreneur

Real Time Storage and Search Fun

Posted in Uncategorized by Eric Falcão on March 25, 2010

I have been toying with methods for improving how we store streaming content and make it searchable, both inside TweetRiver and in something else I am prototyping.

Today we have MySQL + Sphinx. Things work OK, although writes to MySQL are starting to slow down because we check that no duplicate tweets are inserted. Also, Sphinx delta indexing runs only once a minute, so new tweets are not available for search as quickly as we’d like.

Attempt #1: Build it like Facebook inbox search

Store a reverse index in Cassandra, where the customer_id is the row key, each term is a super column, and a time-sorted list of columns under each term points to tweet ids. Raw tweets are stored in another Cassandra column family.
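To make the layout concrete, here is a minimal in-memory sketch in Python. It only mimics the shape of the data (customer_id, then term, then time-sorted columns pointing at tweet ids) with plain dictionaries. It does not talk to Cassandra, and the names `index`, `tweets`, `add_tweet`, and `search` are made up for illustration. The whitespace tokenizer is also an assumption.

```python
from collections import defaultdict

# Hypothetical stand-ins for the two column families (not a Cassandra client).
# index[customer_id][term] -> {(timestamp, tweet_id): ""}  (the "columns")
index = defaultdict(lambda: defaultdict(dict))
tweets = {}  # tweet_id -> raw tweet text


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenizer.
    return set(text.lower().split())


def add_tweet(customer_id, tweet_id, timestamp, text):
    """Store the raw tweet and add one column per term to the reverse index."""
    tweets[tweet_id] = text
    for term in tokenize(text):
        # The column name carries the timestamp so columns sort by time.
        index[customer_id][term][(timestamp, tweet_id)] = ""


def search(customer_id, term, limit=10):
    """Single-term search: newest matching tweets first."""
    columns = index.get(customer_id, {}).get(term.lower(), {})
    newest_first = sorted(columns, reverse=True)[:limit]
    return [tweets[tweet_id] for _, tweet_id in newest_first]


add_tweet("c1", 1, 100, "Cassandra is fast")
add_tweet("c1", 2, 200, "Lucene on cassandra")
print(search("c1", "cassandra"))
```

A single-term lookup is just one read of one term's columns, which is why this case is cheap. Anything with multiple words would need intersection and ranking logic layered on top.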

Pros: Excellent write speed; worked extremely well for single-term searches.

Cons: I don’t feel like building a search server for complex searches (multiple words, etc.). Also, SuperColumns are currently only practical up to a few thousand terms.

Attempt #2: Lucandra!

Lucandra makes Cassandra the back-end for Lucene. This gives you the write performance of Cassandra and fancy Lucene queries that read from Cassandra insertions in real time. The big drawback here is that Solr support is not here yet (big frown) and I don’t really want to muck with finding ways to talk to Lucene from Ruby.

Attempt #3: Zoie Solr Plugin

I tried to detach myself from the hotness of Cassandra and focus on near-real-time Solr. After all, I just want to be able to query tweets as they are being inserted, period. I found the Zoie Solr plugin, which appears to have been released just a few days ago by the SNA team at LinkedIn. Basically, the Zoie plugin replaces the Solr index reader and update handler and promises to make documents available to search immediately. This was fairly easy to get going and seemed to work quite well. I was very excited to hear that the Zoie team is looking towards “forward rolling indexes” next. That would be ideal for logs and tweets, since only the most recent N documents are kept in the index…awesome stuff.
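The “only the most recent N documents” idea can be pictured with a small Python sketch. This is a toy illustration of a rolling window over an in-memory term index, not how Zoie implements or plans to implement it. The class name `RollingIndex` and its methods are hypothetical.

```python
from collections import OrderedDict, defaultdict


class RollingIndex:
    """Toy index that keeps only the most recent max_docs documents."""

    def __init__(self, max_docs):
        self.max_docs = max_docs
        self.docs = OrderedDict()         # doc_id -> text, oldest first
        self.postings = defaultdict(set)  # term -> set of doc_ids

    @staticmethod
    def _terms(text):
        # Assumption: naive lowercase whitespace tokenizer.
        return set(text.lower().split())

    def add(self, doc_id, text):
        if doc_id in self.docs:
            return  # ignore duplicates
        self.docs[doc_id] = text
        for term in self._terms(text):
            self.postings[term].add(doc_id)
        # Roll forward: evict the oldest documents once over capacity.
        while len(self.docs) > self.max_docs:
            old_id, old_text = self.docs.popitem(last=False)
            for term in self._terms(old_text):
                self.postings[term].discard(old_id)
                if not self.postings[term]:
                    del self.postings[term]

    def search(self, term):
        return sorted(self.postings.get(term.lower(), ()))


idx = RollingIndex(max_docs=2)
idx.add(1, "solr zoie")
idx.add(2, "zoie lucene")
idx.add(3, "lucene tweets")  # pushes document 1 out of the window
print(idx.search("zoie"))
```

The appeal for tweets and logs is that the index size stays bounded no matter how fast documents arrive, at the cost of older documents dropping out of search.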

Is there a winner?

I don’t really know where to go from here, because I have a feeling that Solr integration will be coming to Lucandra very soon. I want to wait until then to see whether Lucandra or Zoie performs best. I’m looking for ease of setup, convenient storage of original document terms, great search performance regardless of quantity and rate of insertion, and decently powerful querying.
