
It seems the whole index is kept in RAM, so the index size is limited by the amount of RAM available. This explains the impressive indexing and search performance (1M blog posts, 500 MB of data, indexed in 28 seconds; 1.65 ms search response time; 19K search QPS). Persistent storage is only written to the hard disk when the program closes, and the data is restored from disk when the program restarts ( https://github.com/go-ego/riot/blob/master/docs/zh/persisten... ). This is a limited approach compared to Lucene/Solr/Elasticsearch, which handle high-volume inserts to their indexes with a log-structured merge-tree (LSM) and where the index size is limited only by the available hard disk space.
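The persist-on-close approach described above can be sketched roughly like this (a hypothetical Python illustration of the idea, not Riot's actual Go code; the file name and helpers here are made up):

```python
import atexit
import pickle

INDEX_PATH = "riot_demo_index.pkl"  # hypothetical path, not Riot's layout

def load_index(path=INDEX_PATH):
    """Restore the in-memory inverted index from disk on startup."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return {}

index = load_index()

def persist_index(path=INDEX_PATH):
    """Write the whole index to disk; happens only at shutdown."""
    with open(path, "wb") as f:
        pickle.dump(index, f)

# Flush the index to disk when the process exits, mirroring the
# shutdown-only persistence the linked docs describe.
atexit.register(persist_index)

def add_document(doc_id, text):
    """Index every whitespace-separated token of the document."""
    for token in text.lower().split():
        index.setdefault(token, set()).add(doc_id)

def search(token):
    """Pure in-RAM lookup: no disk I/O on the query path."""
    return index.get(token.lower(), set())
```

The query path never touches disk, which is where the sub-millisecond latencies come from; the trade-off is that the whole index must fit in RAM and an unclean shutdown loses unflushed data.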


1.65 ms for what kind of queries? Also, is that 1M blog posts weighing 500 MB in total size (characters)?


I wonder if they're using succinct data structures.

I'm in bioinformatics, and the first time I implemented a wavelet tree to reduce the in-memory size of genomes, it was just breathtaking.


Neat, TIL about wavelet trees. https://en.m.wikipedia.org/wiki/Wavelet_Tree


That's a beautiful structure. Thanks for sharing.


Very interesting, can you elaborate on it a little more?

I need quick fuzzy search on a low-end embedded device with limited storage (both RAM and disk). I was thinking about putting the index on a server with plenty of RAM and then using WebSocket or RPC calls to query it.


There's a very good blog post on the implementation details here: http://alexbowe.com/wavelet-trees/ I had a decent implementation in Python, but it's on my old MacBook that I would need to dig up. If you're interested, you can add me on Telegram: @rightcheek.

To go along with wavelet trees, you may also need to know about suffix arrays and optimal suffix array construction. Take a look at this: https://en.wikipedia.org/wiki/Suffix_array This is what gives you space efficiency in combination with a wavelet tree, and the wavelet tree also gives you good rank/select efficiency.
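For readers unfamiliar with the structure, here's a minimal pointer-based wavelet tree supporting the rank queries mentioned above (a sketch for illustration only; production implementations store succinct bitvectors so the per-node rank is O(1) rather than a linear scan):

```python
class WaveletTree:
    """Minimal pointer-based wavelet tree with rank support (a sketch;
    real implementations use succinct bitvectors with O(1) rank)."""

    def __init__(self, seq, alphabet=None):
        if alphabet is None:
            alphabet = sorted(set(seq))
        if len(alphabet) <= 1:           # leaf: a single distinct symbol
            self.bits = None
            return
        mid = len(alphabet) // 2
        self.low = set(alphabet[:mid])   # lower half of the alphabet
        # one bit per position: 1 if the symbol falls in the upper half
        self.bits = [0 if c in self.low else 1 for c in seq]
        self.left = WaveletTree([c for c in seq if c in self.low],
                                alphabet[:mid])
        self.right = WaveletTree([c for c in seq if c not in self.low],
                                 alphabet[mid:])

    def rank(self, ch, i):
        """Occurrences of ch among the first i symbols.
        Assumes ch actually occurs in the indexed alphabet."""
        if self.bits is None:
            return i
        ones = sum(self.bits[:i])  # O(i) here; succinct bitvectors: O(1)
        if ch in self.low:
            return self.left.rank(ch, i - ones)
        return self.right.rank(ch, ones)
```

Rank descends one level per alphabet halving, so queries cost O(log sigma) bitvector ranks, and the bitvectors together take roughly n log sigma bits, which is where the space savings on genomic alphabets come from.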

Edit: Here's a suffix array construction algorithm I implemented (not sure if it's fully correct): https://github.com/ethanwillis/comp7295_final/blob/master/sa... It is based on this paper: https://local.ugene.unipro.ru/tracker/secure/attachment/1214...


FWIW, a few years back I was averaging <5 ms on a benchmark of live Pinterest traffic (many thousands of queries per second) against their full dataset, on a single EC2 box with a week's worth of customization on top of Lucene.

The existing stuff isn't slow.


Trinity - depending on the execution mode - is over 100% faster than Lucene for certain queries, and Lucene is already very fast. It all comes down to the postings list codecs and the iterator design/implementation anyway.
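For context on what a postings list codec does: a common baseline is delta-encoding the sorted doc IDs and varint-compressing the gaps (an illustrative sketch of that baseline, not Trinity's or Lucene's actual codec):

```python
def encode_postings(doc_ids):
    """Delta-encode sorted doc IDs, then varint-encode each gap:
    7 bits per byte, high bit set means 'more bytes follow'."""
    out = bytearray()
    prev = 0
    for doc in doc_ids:
        gap = doc - prev           # gaps are small, so they pack tightly
        prev = doc
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)

def decode_postings(data):
    """Inverse of encode_postings: varint-decode gaps, prefix-sum back
    to absolute doc IDs."""
    doc_ids, gap, shift, prev = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:            # continuation bit: keep accumulating
            shift += 7
        else:
            prev += gap
            doc_ids.append(prev)
            gap, shift = 0, 0
    return doc_ids
```

Dense terms produce tiny gaps that fit in one byte each, which is why codec choice (and how fast the iterator can skip through the decoded stream) dominates query performance.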


The data is stored to disk in real time, so the index size is limited by the available hard disk space.


Yep, the current index is kept entirely in memory anyway; I am working on a branch that caches the index on disk.


The documentation is improving.



