I needed a really good hash function for the distributed indexing in SolrCloud. Since it would be used for partitioning documents, it needed to be really high quality (well distributed), because we don’t want uneven shards. It also needed to be cross-platform, so that a client could calculate the hash value itself if desired to determine which partition a given document belongs on.
MurmurHash3 is one of the top favorite new hash functions these days, being both really fast and high quality. Unfortunately it’s written in C++, and a quick google did not yield any suitable high quality Java port (this was back in 2011). So I took 15 minutes (it’s small!) to port the 32 bit version, which should be faster than the other variants for small keys like document ids. It works in 32 bit chunks and produces a 32 bit hash – more than enough for partitioning documents by hash code.
It would be nice to save others from having to do the same thing. Since stuff like this is so small, I simply put it in the public domain and uploaded it to github. This way anyone can copy the file or the function into their project and avoid extra dependencies and license hassles.
Here’s the code, copy away!
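The canonical file lives in the github repo; for reference, here is a sketch of what the 32 bit port boils down to. This is the standard MurmurHash3 x86_32 algorithm, with the constants and structure following the reference C++ implementation:

```java
// Sketch of the 32 bit MurmurHash3 port (the x86_32 variant).
// Constants and structure follow the reference C++ implementation.
public final class MurmurHash3 {

  public static int murmurhash3_x86_32(byte[] data, int offset, int len, int seed) {
    final int c1 = 0xcc9e2d51;
    final int c2 = 0x1b873593;

    int h1 = seed;
    int roundedEnd = offset + (len & 0xfffffffc);  // round down to a 4 byte block

    // body: mix the key 4 bytes (one little-endian int) at a time
    for (int i = offset; i < roundedEnd; i += 4) {
      int k1 = (data[i] & 0xff) | ((data[i + 1] & 0xff) << 8)
             | ((data[i + 2] & 0xff) << 16) | (data[i + 3] << 24);
      k1 *= c1;
      k1 = Integer.rotateLeft(k1, 15);
      k1 *= c2;

      h1 ^= k1;
      h1 = Integer.rotateLeft(h1, 13);
      h1 = h1 * 5 + 0xe6546b64;
    }

    // tail: the remaining 1-3 bytes (cases fall through intentionally)
    int k1 = 0;
    switch (len & 0x03) {
      case 3:
        k1 = (data[roundedEnd + 2] & 0xff) << 16;
        // fall through
      case 2:
        k1 |= (data[roundedEnd + 1] & 0xff) << 8;
        // fall through
      case 1:
        k1 |= data[roundedEnd] & 0xff;
        k1 *= c1;
        k1 = Integer.rotateLeft(k1, 15);
        k1 *= c2;
        h1 ^= k1;
    }

    // finalization: avalanche the remaining bits
    h1 ^= len;
    h1 ^= h1 >>> 16;
    h1 *= 0x85ebca6b;
    h1 ^= h1 >>> 13;
    h1 *= 0xc2b2ae35;
    h1 ^= h1 >>> 16;

    return h1;
  }
}
```

Note the little-endian reads and the deliberate switch fall-through in the tail – both are needed to match the reference implementation bit for bit.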
2015 Update: MurmurHash3 128
Fast forward to 2015, and we’re implementing hyperloglog based distributed cardinality count for the new Facet Analytics Module. That algorithm requires excellent 64 bit hashes.
Google Guava MurmurHash3 implementation
The first step was to evaluate the Google Guava implementation of the 128 bit MurmurHash3 algorithm, since Solr already uses the guava library. After a quick inspection of the source code, I was disappointed. Their implementation is part of a larger hashing framework that introduces all sorts of inefficiencies.
- The guava implementation does not match the reference C++ implementation for all seeds!
- The implementation allocates *multiple* new objects for every hash
- Even for hashing primitives (like int and long), the implementation creates a new byte buffer and copies in the value, then calls hash
It looks as if the Google implementation was focused on hashing streams and large amounts of data, and it is completely inappropriate for hashing large numbers of small values.
New MurmurHash3 128 bit Java implementation
I searched for another suitable implementation, but did not see one that had an appropriate license and did not do any object allocation. So I reinvented the wheel again and implemented a port myself, starting from the reference implementation.
- Matches the reference MurmurHash3 implementation for all seeds, so it can safely be used in multi-language scenarios where hashes must match.
- Does not allocate any objects.
- Public domain, so you can just copy the file into your project and not have to worry about extra licenses or extra JAR dependencies.
I also added in fmix32 and fmix64 from MurmurHash3 for quickly hashing ints and longs respectively.
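For reference, these finalizers are just the MurmurHash3 avalanche steps; a sketch of the relevant part of the class:

```java
// The MurmurHash3 finalization mixes: cheap, high quality avalanching of a
// single int or long. Both are bijections (xorshifts and multiplications by
// odd constants are invertible), so no input bits are ever lost.
public final class MurmurHash3 {

  public static int fmix32(int h) {
    h ^= h >>> 16;
    h *= 0x85ebca6b;
    h ^= h >>> 13;
    h *= 0xc2b2ae35;
    h ^= h >>> 16;
    return h;
  }

  public static long fmix64(long k) {
    k ^= k >>> 33;
    k *= 0xff51afd7ed558ccdL;
    k ^= k >>> 33;
    k *= 0xc4ceb9fe1a85ec53L;
    k ^= k >>> 33;
    return k;
  }
}
```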
If you need 32 bit hashes:
- int – use MurmurHash3.fmix32(val)
- long – use (int)MurmurHash3.fmix64(val)
- float – use MurmurHash3.fmix32(Float.floatToRawIntBits(value))
- double – use (int)MurmurHash3.fmix64(Double.doubleToRawLongBits(value))
- bytes – use MurmurHash3.murmurhash3_x86_32
If you need 64 bit hashes:
- int – use MurmurHash3.fmix64((long)val)
- long – use MurmurHash3.fmix64(val)
- float – use MurmurHash3.fmix64((long)Float.floatToRawIntBits(value))
- double – use MurmurHash3.fmix64(Double.doubleToRawLongBits(value))
- bytes – use MurmurHash3.murmurhash3_x64_128(value) and then just use one half (one long) of the 128 bit result
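Putting the 64 bit recipes above together in a standalone sketch (fmix64 is copied inline here so the snippet runs on its own; in a real project you’d just call it on the MurmurHash3 class):

```java
public class Hash64Demo {
  // Copy of the MurmurHash3 fmix64 finalizer, inlined so this snippet is standalone.
  static long fmix64(long k) {
    k ^= k >>> 33;
    k *= 0xff51afd7ed558ccdL;
    k ^= k >>> 33;
    k *= 0xc4ceb9fe1a85ec53L;
    k ^= k >>> 33;
    return k;
  }

  public static void main(String[] args) {
    int i = 42;
    long l = 12345678901L;
    float f = 1.5f;
    double d = 2.5;

    // Widen or reinterpret the value to a long, then avalanche with fmix64.
    long hInt    = fmix64((long) i);
    long hLong   = fmix64(l);
    long hFloat  = fmix64((long) Float.floatToRawIntBits(f));
    long hDouble = fmix64(Double.doubleToRawLongBits(d));

    System.out.printf("int:    %016x%n", hInt);
    System.out.printf("long:   %016x%n", hLong);
    System.out.printf("float:  %016x%n", hFloat);
    System.out.printf("double: %016x%n", hDouble);
  }
}
```

For byte arrays there is no shortcut: hash with murmurhash3_x64_128 and keep either long of the 128 bit result.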
This implementation is public domain, so just copy the code into your project!