10x speedup in write performance in Riak Innostore based on keyname (basho.com)
63 points by LiveTheDream on May 22, 2011 | hide | past | favorite | 11 comments


Be careful with this trick. Although it appears to work well for Riak, it can cause lots of problems for BigTable style stores, like HBase, since it bottlenecks all the writes through one node.

See: http://ikaisays.com/2011/01/25/app-engine-datastore-tip-mono...

HBase (and I've heard BigTable) work best with purely random row keys - like the author of this post was using initially.


It should be noted that Riak uses consistent hashing (where you route a key based on the Murmur or FNV hash of its md5 checksum) and virtual nodes. That means that even if keys are next to each other, they will get routed to different virtual nodes. And even if a single virtual node is "hot", virtual nodes do not map 1-to-1 onto physical nodes.
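A toy sketch of that idea (consistent hashing with virtual nodes) in Python, with hypothetical node names standing in for Riak's actual Erlang ring implementation:

```python
import bisect
import hashlib

def ring_position(s):
    # Position on the ring: first 8 bytes of the MD5 digest as an integer
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, nodes, vnodes=64):
        # Each physical node claims many virtual positions on the ring
        self.ring = sorted((ring_position(f"{n}#{i}"), n)
                           for n in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def node_for(self, key):
        # Walk clockwise to the first virtual node at or past the key's position
        i = bisect.bisect(self.points, ring_position(key)) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
# Adjacent keys hash to unrelated ring positions, so sequential keys
# spread across all physical nodes instead of piling onto one:
owners = {ring.node_for(f"user:{i}") for i in range(100)}
```

Because the key's ring position comes from a hash rather than the key itself, `user:1` and `user:2` land nowhere near each other, which is why the monotonic-key hotspot problem doesn't arise the way it does with token ranges.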

BigTable uses token ranges, which allows for range queries but makes it vulnerable to this kind of situation. The trick shouldn't be needed with BigTable or HBase anyway, since they use LSMs instead of a conventional B-tree (which InnoDB -- the engine Innostore wraps -- is): all writes and updates are strictly sequential, so this kind of "trick" is not needed.

[Disclaimer: Voldemort developer here, we use consistent hashing, virtual nodes and -- by default -- a log structured B+Tree from BerkeleyDB Java Edition]


Yeah, I realized that the first time I was looking at someone else's HBase and they had a primary key of timestamp.toString.reverse. ;)
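The `timestamp.toString.reverse` trick being alluded to, sketched in Python (illustrative values, not the original code):

```python
# Five sequential timestamps -- as row keys, these would all sort into
# the same region of a range-partitioned store and hotspot one node.
ts = [1306000000 + i for i in range(5)]

# Reversing the string moves the fast-changing digit to the front,
# so consecutive timestamps get wildly different key prefixes.
keys = [str(t)[::-1] for t in ts]
```

The cost, of course, is that you lose the ability to do efficient range scans over time, which is often the whole reason the timestamp was the key in the first place.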


This is also a best practice in CouchDB: as long as you use the UUIDs that Couch generates for you (via the /_uuids API endpoint), you'll get keys designed to minimize the work the B-tree has to do to insert them.


There are basically two common types of UUIDs: version 1 and version 4.

UUID1 is generated from the MAC address of the machine plus a timestamp plus random bits, while UUID4 is completely random. Sometimes you want one, sometimes the other.

You can try these in python as:

    >>> import uuid
    >>> uuid.uuid1()
    >>> uuid.uuid4()
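The difference matters for key ordering. Version-1 UUIDs embed a 60-bit timestamp, so successive values carry increasing time fields, while version-4 values have no relationship to insertion order. (Note that CouchDB's "sequential" UUIDs use their own scheme; this just illustrates the two standard versions.)

```python
import uuid

# Version 1: embeds a timestamp (100 ns ticks since 1582) in the .time field,
# so values generated in sequence are time-ordered within a process.
a, b = uuid.uuid1(), uuid.uuid1()

# Version 4: 122 random bits, no ordering relationship at all.
c, d = uuid.uuid4(), uuid.uuid4()
```

Time-ordered keys cluster inserts near the rightmost leaf of a B-tree (good for InnoDB-style storage); random keys scatter inserts across the whole tree but spread load evenly in a range-partitioned store.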


IMHO, for this whole class of problems -- needing to log a big amount of data over time, for years -- the way to go is not Riak, nor Redis, nor <put your preferred DB name here>, but simply writing to files in append-only mode (and, when you can, using fixed-size records for fast access later).

There are good reasons, IMHO, for writing a small networked C server to do this work.
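A minimal sketch of the append-only, fixed-size-record approach (in Python rather than C, with a made-up 64-byte record layout):

```python
import os
import struct
import tempfile

# 64-byte record: timestamp (f64), value (i32), zero-padded 52-byte payload
REC = struct.Struct("<di52s")

def append(path, ts, val, msg):
    with open(path, "ab") as f:               # append-only: writes land at EOF
        f.write(REC.pack(ts, val, msg.encode()[:52]))

def read_nth(path, n):
    with open(path, "rb") as f:
        f.seek(n * REC.size)                  # fixed size => O(1) random access
        ts, val, raw = REC.unpack(f.read(REC.size))
        return ts, val, raw.rstrip(b"\x00").decode()

path = os.path.join(tempfile.mkdtemp(), "events.log")
for i in range(3):
    append(path, 1306000000.0 + i, i, f"event-{i}")
```

The fixed record size is what makes the "fast access later" part work: record `n` lives at byte offset `n * 64`, so no index is needed.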


> There are good reasons IMHO for writing a small networked C server doing this

Right, because:

https://github.com/cloudera/flume

https://github.com/facebook/scribe/wiki

http://sna-projects.com/kafka/

http://www.freebsd.org/cgi/man.cgi?query=syslogd&sektion...

... don't exist?

(Formulation courtesy of abhay, plug for Kafka mine)


If there is something already great at doing this, sure, no need. I haven't checked, however, so I can't speak to these specific projects.


Well, yes. Exactly.

Big ups to the Homo Sapiens posse.

- Lil' B


I've recently discovered that when you know almost nothing about the problem someone is trying to solve it is both easy and attractive to speculate about how much better your solution is than the one reached by the people who have to solve it. Full time. For money. I see you have discovered the same. Great minds think alike?

- Lil' B


Good point... but another good one, related to yours, is that people often do everything to avoid relaxing the specifications, and end up with complex designs.



