A decent stand-alone Java Bloom Filter implementation

Update (11th Jan 2010): Magnus Skjegstad has started a Google Code project with a Bloom Filter implementation inspired by this one, but which corrects a number of deficiencies. I recommend that you use this implementation instead of the one shown here, you can find it on its Google Code page.

They say necessity is the mother of invention, and a few days ago I needed a stand-alone Java Bloom Filter implementation. A Bloom Filter is a relatively simple way to efficiently remember whether you’ve seen something before, without actually having to store everything you’ve seen. Wikipedia can fill you in on the details.

Anyway, I did some hunting around, and the best available option seemed to be this implementation in the Apache Hadoop project.

Fortunately, I took a look at the source code before going ahead and using it. The first warning sign was that at the heart of their implementation they use an array of boolean values (boolean[]). This is a serious problem, because most Java implementations I know will use a byte to represent a boolean in a boolean array. This means that for every actual bit stored, there are seven wasted bits – and this is in a datastructure that is supposed to be saving you memory!

Some other parts of their code also concerned me, such as how they were using SHA1 as their hash function, it seemed somewhat haphazard to say the least. Oh, the other minor problem was that in my initial testing, I couldn’t actually get their implementation to work.

So, I decided to implement my own, and as luck would have it I hit on a very simple method to do it, using the java.util.Random class. The two key methods, add() and contains(), are 7 and 8 short lines of very simple readable code respectively.

Of course, crypto geeks will point out that java.util.Random is not a very secure hash function, but the worst that can happen is an increased false-positive rate, and I think even this is unlikely for most applications (its certainly good enough for anything I’ve ever needed a Bloom Filter for).

Some other nice features:

  • Implements Java’s generic Set interface, since this should be familiar to people
  • You just tell its constructor how many elements you expect to insert into it, and it will automatically configure itself for optimal operation
  • It will calculate its expected false-positive rate for you
  • Its serializable
  • I’ve included a reasonably thorough unit test
  • Unlike the Hadoop implementation, SimpleBloomFilter uses BitSet, which should be nice and compact
  • I’m releasing it under a very liberal “do what you want, just remember who wrote it” license

Here is a simple example usage:

// Create a bloom filter 100 bits in size to contain Strings, optimized to
// contain 4 items
SimpleBloomFilter mammals = new SimpleBloomFilter(100, 4);

// Add some things to the bloom filter
mammals.add("dog");
mammals.add("cat");
mammals.add("mouse");
mammals.add("dolphin");

// Test to see if the bloom filter remembers
if (mammals.contains("dog")) {
System.out.println("A dog is a mammal with probability "
+ (1 - mammals.expectedFalsePositiveProbability()));
} else {
System.out.println("A dog is definitely not a mammal");
}

If you are wondering how many bits to use, here is a table of the false-positive rates you should expect given specific ratios between number of items, and number of bits.

Ratio (items:bits) False-positive probability
1:1 0.63212055882856
1:2 0.39957640089373
1:4 0.14689159766038
1:8 0.02157714146322
1:16 0.00046557303372
1:32 0.00000021167340
1:64 0.00000000000004

Download the source code to the SimpleBloomFilter, and a unit test here. If you would like to review the code, you can do so here with nice syntax highlighting.

Update (11th Jan 2010): As mentioned at the top of the article, I now recommend that you use Magnus Skjegstad’s Java bloom filter, which is inspired by this code, as it corrects a number of deficiencies with my approach. Its available under the LGPL, and you can find it on its Google Code page.

22 thoughts on “A decent stand-alone Java Bloom Filter implementation

  1. Pingback: Bloom filter inside Map Reduce « Vanja Komadinovic

  2. coder

    could u pls kindly tell us:
    why sha-1 or sha-512 is not good for hash functions in bloom filter? can you do a math proof or strict algorithm analysis on your claim?

    Reply
    1. ian Post author

      coder, they’re not good because they’re simply too slow. It’s overkill to use cryptographically secure hash functions for a bloom filter.

      Reply
  3. Dcaideoidoedf.Com

    Basically excellent writeup. It was a new leisure account the item. Search complicated to far more included agreeable from you finding out! However, the best way could possibly all of us connect?

    Reply
  4. Oma

    Hi, Neat post. There is aan issue together with your webb site in internet explorer,
    could check this? IE nonetheoess is the market chieef and a
    huge component of other folks will miss your great writing due to this
    problem.

    Reply
  5. Barley Mow Brewing Co

    Hi, Neat post. There is an issue with your site in web explorer, would check this?
    IE still is the marketplace chief and a large portion of people will omit your fantastic writing
    because of this problem.

    Reply
  6. to

    Thanks , I’ve recently been looking for information approximately this topic for a while and
    yours is the best I’ve came upon so far. However, what concerning
    the conclusion? Are you sure in regards to the supply?

    Reply
  7. Jesus

    Hi! Thiss post couldn’t be written any better!
    Reading through this post reminds me of my good old room mate!
    He always kept chatting about this. I will forward this page to him.
    Pretty sure he will have a good read. Many thanks for sharing!

    my webpage :: Dsa helpline; Jesus,

    Reply
  8. hp color cartridge

    This is the perfect webpage for anybody who hopes to understand this topic.

    You realize so much its almost tough to argue with you (not that I really will
    need to…HaHa). You certainly put a fresh spin on a topic
    which has been written about for ages. Excellent stuff, just excellent!

    Reply
  9. Marilyn

    alwayys i usaed to read smaller posts that also
    clear their motive, and that is also happening with this article which I am
    reading now.

    Feel free to visit my webpage – speak to Child Benefit directly (Marilyn)

    Reply
  10. bulk sms

    It’s actually a cool and helpful piece of information. I’m satisfied that you simply shared this useful information with
    us. Please keep us informed like this. Thank you for sharing.

    Reply
  11. Terri

    Your style iis very unique compared to other folks I’ve read stuff from.
    I appreciate yyou for posting when you’ve got the opportunity,
    Guess I’ll just book mark this page.

    Also visit myy web blog :: T-Mobile customer services (Terri)

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>