It seems the whole index is kept in RAM, so the index size is limited by the amount of RAM available. This explains the impressive indexing and search performance (1M blog posts, 500M of data: indexing finished in 28 seconds, 1.65 ms search response time, 19K search QPS).
The persistent storage data is written to the hard disk only when the program closes, and is restored from the hard disk when the program restarts ( https://github.com/go-ego/riot/blob/master/docs/zh/persisten... ). This is a limited approach compared to Lucene/Solr/Elasticsearch, which handle high-volume inserts to their indexes with a log-structured merge-tree (LSM) and whose index size is limited only by the available hard disk space.
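To make the trade-off concrete, here is a minimal Python sketch of the "keep everything in RAM, snapshot on close" pattern described above. It is not riot's code or API: the `InMemoryIndex` class, its methods, and the JSON file format are all hypothetical, chosen only to illustrate the idea.

```python
import json
import os
from collections import defaultdict


class InMemoryIndex:
    """Toy inverted index (hypothetical): lives in RAM, saved only on close()."""

    def __init__(self, path):
        self.path = path
        self.postings = defaultdict(set)  # term -> set of document ids
        # Restore the previous snapshot, if one exists, at startup.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                for term, doc_ids in json.load(f).items():
                    self.postings[term].update(doc_ids)

    def add(self, doc_id, text):
        # Updates touch memory only; nothing is written to disk here.
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), ()))

    def close(self):
        # The only point at which data reaches the disk. A crash before
        # this call loses every update made since the last snapshot.
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump({t: sorted(ids) for t, ids in self.postings.items()}, f)
```

Under this design the whole index has to fit in memory and durability depends on a clean shutdown, which is the limitation being pointed out; a segment-based or LSM-style index instead flushes batches to disk as it goes.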
Very interesting, can you elaborate on it a little more?
I need quick fuzzy search on a low-end embedded device that has limited storage (both RAM and HDD). I was thinking about putting the index on a server with plenty of RAM and then querying it over WebSocket or RPC.
There's a very good blog post on the implementation details here: http://alexbowe.com/wavelet-trees/ I had a decent implementation in Python, but it's on my old MacBook, which I would need to dig up. If you're interested, you can add me on Telegram: @rightcheek.
Now, to go with wavelet trees you may or may not need to know about suffix arrays and optimal suffix array construction. Take a look at this: https://en.wikipedia.org/wiki/Suffix_array This is what's going to give you space efficiency in combination with a wavelet tree, and the wavelet tree also gives you efficient rank/select queries.
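For a feel of what a rank query on a wavelet tree looks like, here is a small Python sketch. It is not the implementation mentioned above: the `WaveletTree` class is hypothetical, and it uses plain Python lists of prefix counts instead of succinct bitvectors, so it shows the structure and the rank walk but not the space savings.

```python
class WaveletTree:
    """Pointer-based wavelet tree over a sequence of comparable symbols."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.left = self.right = None
        # prefix[i] = number of 1-bits among the first i positions of this node
        self.prefix = [0]
        if len(self.alphabet) <= 1:
            return  # leaf: a single symbol (or an empty sequence)
        mid = len(self.alphabet) // 2
        # Symbols in the upper half of the alphabet get bit 1 and go right.
        self.right_syms = set(self.alphabet[mid:])
        for s in seq:
            self.prefix.append(self.prefix[-1] + (s in self.right_syms))
        self.left = WaveletTree(
            [s for s in seq if s not in self.right_syms], self.alphabet[:mid])
        self.right = WaveletTree(
            [s for s in seq if s in self.right_syms], self.alphabet[mid:])

    def rank(self, symbol, i):
        """Count occurrences of symbol in the first i positions of the sequence."""
        node = self
        while len(node.alphabet) > 1:
            ones = node.prefix[i]
            if symbol in node.right_syms:
                i, node = ones, node.right      # keep only the 1-bits
            else:
                i, node = i - ones, node.left   # keep only the 0-bits
        # At a leaf, i is the count, provided the leaf holds this symbol.
        return i if node.alphabet and node.alphabet[0] == symbol else 0


wt = WaveletTree("abracadabra")
print(wt.rank("a", 11))  # occurrences of "a" in the whole string
```

Each level of the walk halves the alphabet, so a rank query visits about log2 of the alphabet size nodes. A space-efficient version would replace the `prefix` lists with bitvectors that support constant-time rank.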
FWIW, a few years back, I was averaging <5ms on a benchmark of live traffic (many thousands of queries per second) for Pinterest queries against their full dataset, on a single EC2 box, with a week's worth of customization on top of Lucene.
Trinity, depending on the execution mode, is over 100% faster than Lucene for certain queries, and Lucene is already very fast. It all comes down to the postings list codecs and the iterator design/implementation anyway.