A decent stand-alone Java Bloom Filter implementation

Update (11th Jan 2010): Magnus Skjegstad has started a project with a Bloom Filter implementation inspired by this one, but which corrects a number of deficiencies. I recommend that you use this implementation instead of the one shown here, you can find it on Github.

Update (3rd July 2015): These days if you need a Bloom Filter I’d recommend using the implementation in Google’s Guava library.

They say necessity is the mother of invention, and a few days ago I needed a stand-alone Java Bloom Filter implementation. A Bloom Filter is a relatively simple way to efficiently remember whether you’ve seen something before, without actually having to store everything you’ve seen. Wikipedia can fill you in on the details.

Anyway, I did some hunting around, and the best available option seemed to be this implementation in the Apache Hadoop project.

Fortunately, I took a look at the source code before going ahead and using it. The first warning sign was that at the heart of their implementation they use an array of boolean values (boolean[]). This is a serious problem, because most Java implementations I know will use a byte to represent a boolean in a boolean array. This means that for every actual bit stored, there are seven wasted bits – and this is in a datastructure that is supposed to be saving you memory!

Some other parts of their code also concerned me, such as how they were using SHA1 as their hash function, it seemed somewhat haphazard to say the least. Oh, the other minor problem was that in my initial testing, I couldn’t actually get their implementation to work.

So, I decided to implement my own, and as luck would have it I hit on a very simple method to do it, using the java.util.Random class. The two key methods, add() and contains(), are 7 and 8 short lines of very simple readable code respectively.

Of course, crypto geeks will point out that java.util.Random is not a very secure hash function, but the worst that can happen is an increased false-positive rate, and I think even this is unlikely for most applications (its certainly good enough for anything I’ve ever needed a Bloom Filter for).

Some other nice features:

    • Implements Java’s generic Set interface, since this should be familiar to people
    • You just tell its constructor how many elements you expect to insert into it, and it will automatically configure itself for optimal operation
    • It will calculate its expected false-positive rate for you
    • Its serializable
    • I’ve included a reasonably thorough unit test
    • Unlike the Hadoop implementation, SimpleBloomFilter uses BitSet, which should be nice and compact
    • I’m releasing it under a very liberal “do what you want, just remember who wrote it” license

Here is a simple example usage:

// Create a bloom filter 100 bits in size to contain Strings, optimized to
// contain 4 items
SimpleBloomFilter mammals = new SimpleBloomFilter(100, 4);

// Add some things to the bloom filter

// Test to see if the bloom filter remembers
if (mammals.contains(“dog”)) {
System.out.println(“A dog is a mammal with probability ”
+ (1 – mammals.expectedFalsePositiveProbability()));
} else {
System.out.println(“A dog is definitely not a mammal”);

If you are wondering how many bits to use, here is a table of the false-positive rates you should expect given specific ratios between number of items, and number of bits.

Ratio (items:bits) False-positive probability
1:1 0.63212055882856
1:2 0.39957640089373
1:4 0.14689159766038
1:8 0.02157714146322
1:16 0.00046557303372
1:32 0.00000021167340
1:64 0.00000000000004

Download the source code to the SimpleBloomFilter, and a unit test here. If you would like to review the code, you can do so here with nice syntax highlighting.

9 thoughts on “A decent stand-alone Java Bloom Filter implementation

  1. Pingback: Bloom filter inside Map Reduce « Vanja Komadinovic

  2. coder

    could u pls kindly tell us:
    why sha-1 or sha-512 is not good for hash functions in bloom filter? can you do a math proof or strict algorithm analysis on your claim?

    1. ian Post author

      coder, they’re not good because they’re simply too slow. It’s overkill to use cryptographically secure hash functions for a bloom filter.

  3. John

    Hi ian,

    I am just curious. why did your SimpleBloomFilter implement the serializable interface? I have read alot of reasons why it shouldn’t be used and the only reason I found for using it in most cases is the persist the class but I don’t see why you want to persist your SimpleBloomFilter. So, am I missing something? Please, enlighten me.


  4. George

    FYI: Array of boolean[] is not serious problem, compiler optimization will take care about it, it is not so stupid 🙂

    1. ian Post author

      George, I’m afraid you’re wrong about that. In Java a boolean array uses a byte for each boolean, so it uses 8 times as much memory as an efficient data-structure like java.util.BitSet.


Leave a Reply

Your email address will not be published.