A decent stand-alone Java Bloom Filter implementation

Update (11th Jan 2010): Magnus Skjegstad has started a Google Code project with a Bloom Filter implementation inspired by this one, but which corrects a number of deficiencies. I recommend that you use this implementation instead of the one shown here, you can find it on its Google Code page.

They say necessity is the mother of invention, and a few days ago I needed a stand-alone Java Bloom Filter implementation. A Bloom Filter is a relatively simple way to efficiently remember whether you’ve seen something before, without actually having to store everything you’ve seen. Wikipedia can fill you in on the details.

Anyway, I did some hunting around, and the best available option seemed to be this implementation in the Apache Hadoop project.

Fortunately, I took a look at the source code before going ahead and using it. The first warning sign was that at the heart of their implementation they use an array of boolean values (boolean[]). This is a serious problem, because most Java implementations I know will use a byte to represent a boolean in a boolean array. This means that for every actual bit stored, there are seven wasted bits – and this is in a datastructure that is supposed to be saving you memory!

Some other parts of their code also concerned me, such as how they were using SHA1 as their hash function, it seemed somewhat haphazard to say the least. Oh, the other minor problem was that in my initial testing, I couldn’t actually get their implementation to work.

So, I decided to implement my own, and as luck would have it I hit on a very simple method to do it, using the java.util.Random class. The two key methods, add() and contains(), are 7 and 8 short lines of very simple readable code respectively.

Of course, crypto geeks will point out that java.util.Random is not a very secure hash function, but the worst that can happen is an increased false-positive rate, and I think even this is unlikely for most applications (its certainly good enough for anything I’ve ever needed a Bloom Filter for).

Some other nice features:

  • Implements Java’s generic Set interface, since this should be familiar to people
  • You just tell its constructor how many elements you expect to insert into it, and it will automatically configure itself for optimal operation
  • It will calculate its expected false-positive rate for you
  • Its serializable
  • I’ve included a reasonably thorough unit test
  • Unlike the Hadoop implementation, SimpleBloomFilter uses BitSet, which should be nice and compact
  • I’m releasing it under a very liberal “do what you want, just remember who wrote it” license

Here is a simple example usage:

// Create a bloom filter 100 bits in size to contain Strings, optimized to
// contain 4 items
SimpleBloomFilter mammals = new SimpleBloomFilter(100, 4);

// Add some things to the bloom filter

// Test to see if the bloom filter remembers
if (mammals.contains("dog")) {
System.out.println("A dog is a mammal with probability "
+ (1 - mammals.expectedFalsePositiveProbability()));
} else {
System.out.println("A dog is definitely not a mammal");

If you are wondering how many bits to use, here is a table of the false-positive rates you should expect given specific ratios between number of items, and number of bits.

Ratio (items:bits) False-positive probability
1:1 0.63212055882856
1:2 0.39957640089373
1:4 0.14689159766038
1:8 0.02157714146322
1:16 0.00046557303372
1:32 0.00000021167340
1:64 0.00000000000004

Download the source code to the SimpleBloomFilter, and a unit test here. If you would like to review the code, you can do so here with nice syntax highlighting.

Update (11th Jan 2010): As mentioned at the top of the article, I now recommend that you use Magnus Skjegstad’s Java bloom filter, which is inspired by this code, as it corrects a number of deficiencies with my approach. Its available under the LGPL, and you can find it on its Google Code page.

6 thoughts on “A decent stand-alone Java Bloom Filter implementation

  1. Pingback: Bloom filter inside Map Reduce « Vanja Komadinovic

  2. coder

    could u pls kindly tell us:
    why sha-1 or sha-512 is not good for hash functions in bloom filter? can you do a math proof or strict algorithm analysis on your claim?

    1. ian Post author

      coder, they’re not good because they’re simply too slow. It’s overkill to use cryptographically secure hash functions for a bloom filter.

  3. John

    Hi ian,

    I am just curious. why did your SimpleBloomFilter implement the serializable interface? I have read alot of reasons why it shouldn’t be used and the only reason I found for using it in most cases is the persist the class but I don’t see why you want to persist your SimpleBloomFilter. So, am I missing something? Please, enlighten me.


  4. Toby

    Awesome post. You should use social sites to increase
    traffic and make your site go viral. There are tools which automate this
    time consuming process.Visitors can flood your page in no time, just type in google for:
    Rixisosa’s Social Automation

  5. www.gamesmembership.com

    Animal Quickly pull is definitely an online playground for
    children exactly where they could purchase natural globe while playing with
    buddies. National Geographic Pet Jam is definitely an exciting
    online recreation space for kids which enjoy animals as well as
    the outside. Virtual online games are addicting, plus
    overseeing their own content material to get a
    younger market is a huge parental job. The state facts the astonishing virtual world associated with Naitonal Geographic Children Pet Jam, this particular colorful, fun friend guide provides newbies
    plus professional gamers as well all they have to understand.
    Trying to find therefore swept up within so many other
    activities I’ve experienced virtually no time
    to focus on AJ or the weblog… and to end up being honest, We’ve lost almost all
    interest in Pet Quickly pull plus blogging about it.
    With college coming back within Sept, I’ve got much more things to focus on. The particular Perform Outrageous app brings the dog Quickly pull experience within an enjoyable new way,


Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>