
It seems the whole index is kept in RAM, so the index size is limited by the amount of RAM available. This explains the impressive indexing and search performance (1M blog posts, 500 MB of data, indexed in 28 seconds; 1.65 ms search response time; 19K search QPS). Persistent storage is only written to the hard disk when the program closes, and the data is restored from disk when the program restarts ( https://github.com/go-ego/riot/blob/master/docs/zh/persisten... ). This is a limited approach compared to Lucene/Solr/Elasticsearch, which handle high-volume inserts to their indexes with a log-structured merge-tree (LSM) and where the index size is limited only by the available hard disk space.
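The persist-on-close approach described above can be sketched roughly like this (a hypothetical Python illustration of the idea, not Riot's actual Go code; the file name and helpers here are made up):

```python
import atexit
import pickle

INDEX_PATH = "riot_demo_index.pkl"  # hypothetical path, not Riot's layout

def load_index(path=INDEX_PATH):
    """Restore the in-memory inverted index from disk on startup."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return {}

index = load_index()

def persist_index(path=INDEX_PATH):
    """Write the whole index to disk; happens only at shutdown."""
    with open(path, "wb") as f:
        pickle.dump(index, f)

# Flush the index to disk when the process exits, mirroring the
# shutdown-only persistence the linked docs describe.
atexit.register(persist_index)

def add_document(doc_id, text):
    """Index every whitespace-separated token of the document."""
    for token in text.lower().split():
        index.setdefault(token, set()).add(doc_id)

def search(token):
    """Pure in-RAM lookup: no disk I/O on the query path."""
    return index.get(token.lower(), set())
```

The query path never touches disk, which is where the sub-millisecond latencies come from; the trade-off is that the whole index must fit in RAM and an unclean shutdown loses unflushed data.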


1.65 ms for what kind of queries? Also, is that 1M blog posts weighing 500 MB in total size (characters)?


I wonder if they're using succinct data structures.

I'm in bioinformatics, and the first time I implemented a wavelet tree to reduce the in-memory size of genomes, it was just breathtaking.


Neat, TIL about wavelet trees. https://en.m.wikipedia.org/wiki/Wavelet_Tree


That's a beautiful structure. Thanks for sharing.


Very interesting, can you elaborate on it a little more?

I need quick fuzzy search on a low-end embedded device with limited storage (both RAM and disk). I was thinking about putting the index on a server with plenty of RAM and then using WebSocket or RPC calls to query it.


There's a very good blog post on the implementation details here: http://alexbowe.com/wavelet-trees/ I had a decent implementation in Python, but it's on my old MacBook that I would need to dig up. If you're interested, you can add me on Telegram: @rightcheek.

To go along with wavelet trees, you may also need to know about suffix arrays and optimal suffix array construction. Take a look at this: https://en.wikipedia.org/wiki/Suffix_array This is what gives you space efficiency in combination with a wavelet tree, and the wavelet tree also gives you good rank/select efficiency.
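For readers unfamiliar with the structure, here's a minimal pointer-based wavelet tree supporting the rank queries mentioned above (a sketch for illustration only; production implementations store succinct bitvectors so the per-node rank is O(1) rather than a linear scan):

```python
class WaveletTree:
    """Minimal pointer-based wavelet tree with rank support (a sketch;
    real implementations use succinct bitvectors with O(1) rank)."""

    def __init__(self, seq, alphabet=None):
        if alphabet is None:
            alphabet = sorted(set(seq))
        if len(alphabet) <= 1:           # leaf: a single distinct symbol
            self.bits = None
            return
        mid = len(alphabet) // 2
        self.low = set(alphabet[:mid])   # lower half of the alphabet
        # one bit per position: 1 if the symbol falls in the upper half
        self.bits = [0 if c in self.low else 1 for c in seq]
        self.left = WaveletTree([c for c in seq if c in self.low],
                                alphabet[:mid])
        self.right = WaveletTree([c for c in seq if c not in self.low],
                                 alphabet[mid:])

    def rank(self, ch, i):
        """Occurrences of ch among the first i symbols.
        Assumes ch actually occurs in the indexed alphabet."""
        if self.bits is None:
            return i
        ones = sum(self.bits[:i])  # O(i) here; succinct bitvectors: O(1)
        if ch in self.low:
            return self.left.rank(ch, i - ones)
        return self.right.rank(ch, ones)
```

Rank descends one level per alphabet halving, so queries cost O(log sigma) bitvector ranks, and the bitvectors together take roughly n log sigma bits, which is where the space savings on genomic alphabets come from.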

Edit: Here's a suffix array construction algorithm I implemented (not sure if it's fully correct): https://github.com/ethanwillis/comp7295_final/blob/master/sa... It is based on this paper: https://local.ugene.unipro.ru/tracker/secure/attachment/1214...


FWIW, a few years back I was averaging <5 ms on a benchmark of live Pinterest traffic (many thousands of queries per second) against their full dataset, on a single EC2 box with a week's worth of customization on top of Lucene.

The existing stuff isn't slow.


Trinity - depending on the execution mode - is over 100% faster than Lucene for certain queries, and Lucene is already very fast. It all comes down to the postings list codecs and the iterator design/implementation anyway.
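For context on what a postings list codec does: a common baseline is delta-encoding the sorted doc IDs and varint-compressing the gaps (an illustrative sketch of that baseline, not Trinity's or Lucene's actual codec):

```python
def encode_postings(doc_ids):
    """Delta-encode sorted doc IDs, then varint-encode each gap:
    7 bits per byte, high bit set means 'more bytes follow'."""
    out = bytearray()
    prev = 0
    for doc in doc_ids:
        gap = doc - prev           # gaps are small, so they pack tightly
        prev = doc
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)

def decode_postings(data):
    """Inverse of encode_postings: varint-decode gaps, prefix-sum back
    to absolute doc IDs."""
    doc_ids, gap, shift, prev = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:            # continuation bit: keep accumulating
            shift += 7
        else:
            prev += gap
            doc_ids.append(prev)
            gap, shift = 0, 0
    return doc_ids
```

Dense terms produce tiny gaps that fit in one byte each, which is why codec choice (and how fast the iterator can skip through the decoded stream) dominates query performance.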


The data is stored to disk in real time, so the index size is limited by the available hard disk space.


Yep, the current index is kept entirely in memory anyway; I am working on a branch that caches the index on disk.


The documentation is improving.



