A decent stand-alone Java Bloom Filter implementation

Update (11th Jan 2010): Magnus Skjegstad has started a Google Code project with a Bloom Filter implementation inspired by this one, but which corrects a number of deficiencies. I recommend that you use this implementation instead of the one shown here, you can find it on its Google Code page.

They say necessity is the mother of invention, and a few days ago I needed a stand-alone Java Bloom Filter implementation. A Bloom Filter is a relatively simple way to efficiently remember whether you’ve seen something before, without actually having to store everything you’ve seen. Wikipedia can fill you in on the details.

Anyway, I did some hunting around, and the best available option seemed to be this implementation in the Apache Hadoop project.

Fortunately, I took a look at the source code before going ahead and using it. The first warning sign was that at the heart of their implementation they use an array of boolean values (boolean[]). This is a serious problem, because most Java implementations I know will use a byte to represent a boolean in a boolean array. This means that for every actual bit stored, there are seven wasted bits – and this is in a datastructure that is supposed to be saving you memory!

Some other parts of their code also concerned me, such as how they were using SHA1 as their hash function, it seemed somewhat haphazard to say the least. Oh, the other minor problem was that in my initial testing, I couldn’t actually get their implementation to work.

So, I decided to implement my own, and as luck would have it I hit on a very simple method to do it, using the java.util.Random class. The two key methods, add() and contains(), are 7 and 8 short lines of very simple readable code respectively.

Of course, crypto geeks will point out that java.util.Random is not a very secure hash function, but the worst that can happen is an increased false-positive rate, and I think even this is unlikely for most applications (its certainly good enough for anything I’ve ever needed a Bloom Filter for).

Some other nice features:

  • Implements Java’s generic Set interface, since this should be familiar to people
  • You just tell its constructor how many elements you expect to insert into it, and it will automatically configure itself for optimal operation
  • It will calculate its expected false-positive rate for you
  • Its serializable
  • I’ve included a reasonably thorough unit test
  • Unlike the Hadoop implementation, SimpleBloomFilter uses BitSet, which should be nice and compact
  • I’m releasing it under a very liberal “do what you want, just remember who wrote it” license

Here is a simple example usage:

// Create a bloom filter 100 bits in size to contain Strings, optimized to
// contain 4 items
SimpleBloomFilter mammals = new SimpleBloomFilter(100, 4);

// Add some things to the bloom filter
mammals.add("dog");
mammals.add("cat");
mammals.add("mouse");
mammals.add("dolphin");

// Test to see if the bloom filter remembers
if (mammals.contains("dog")) {
System.out.println("A dog is a mammal with probability "
+ (1 - mammals.expectedFalsePositiveProbability()));
} else {
System.out.println("A dog is definitely not a mammal");
}

If you are wondering how many bits to use, here is a table of the false-positive rates you should expect given specific ratios between number of items, and number of bits.

Ratio (items:bits) False-positive probability
1:1 0.63212055882856
1:2 0.39957640089373
1:4 0.14689159766038
1:8 0.02157714146322
1:16 0.00046557303372
1:32 0.00000021167340
1:64 0.00000000000004

Download the source code to the SimpleBloomFilter, and a unit test here. If you would like to review the code, you can do so here with nice syntax highlighting.

Update (11th Jan 2010): As mentioned at the top of the article, I now recommend that you use Magnus Skjegstad’s Java bloom filter, which is inspired by this code, as it corrects a number of deficiencies with my approach. Its available under the LGPL, and you can find it on its Google Code page.

40 thoughts on “A decent stand-alone Java Bloom Filter implementation

  1. Pingback: Bloom filter inside Map Reduce « Vanja Komadinovic

  2. coder

    could u pls kindly tell us:
    why sha-1 or sha-512 is not good for hash functions in bloom filter? can you do a math proof or strict algorithm analysis on your claim?

    Reply
    1. ian Post author

      coder, they’re not good because they’re simply too slow. It’s overkill to use cryptographically secure hash functions for a bloom filter.

      Reply
  3. Dcaideoidoedf.Com

    Basically excellent writeup. It was a new leisure account the item. Search complicated to far more included agreeable from you finding out! However, the best way could possibly all of us connect?

    Reply
  4. Oma

    Hi, Neat post. There is aan issue together with your webb site in internet explorer,
    could check this? IE nonetheoess is the market chieef and a
    huge component of other folks will miss your great writing due to this
    problem.

    Reply
  5. Barley Mow Brewing Co

    Hi, Neat post. There is an issue with your site in web explorer, would check this?
    IE still is the marketplace chief and a large portion of people will omit your fantastic writing
    because of this problem.

    Reply
  6. to

    Thanks , I’ve recently been looking for information approximately this topic for a while and
    yours is the best I’ve came upon so far. However, what concerning
    the conclusion? Are you sure in regards to the supply?

    Reply
  7. Jesus

    Hi! Thiss post couldn’t be written any better!
    Reading through this post reminds me of my good old room mate!
    He always kept chatting about this. I will forward this page to him.
    Pretty sure he will have a good read. Many thanks for sharing!

    my webpage :: Dsa helpline; Jesus,

    Reply
  8. hp color cartridge

    This is the perfect webpage for anybody who hopes to understand this topic.

    You realize so much its almost tough to argue with you (not that I really will
    need to…HaHa). You certainly put a fresh spin on a topic
    which has been written about for ages. Excellent stuff, just excellent!

    Reply
  9. Marilyn

    alwayys i usaed to read smaller posts that also
    clear their motive, and that is also happening with this article which I am
    reading now.

    Feel free to visit my webpage – speak to Child Benefit directly (Marilyn)

    Reply
  10. bulk sms

    It’s actually a cool and helpful piece of information. I’m satisfied that you simply shared this useful information with
    us. Please keep us informed like this. Thank you for sharing.

    Reply
  11. Terri

    Your style iis very unique compared to other folks I’ve read stuff from.
    I appreciate yyou for posting when you’ve got the opportunity,
    Guess I’ll just book mark this page.

    Also visit myy web blog :: T-Mobile customer services (Terri)

    Reply
  12. app

    This design is steller! You most certainly know how to keep a reader amused.
    Between your wit and your videos, I was almost moved to start my own blog (well, almost…HaHa!) Wonderful job.

    I really loved what you had to say, and more than that,
    how you presented it. Too cool!

    My site – app

    Reply
  13. shisha pens refillable

    Smoking is a physical habit, not just a drug habit. From there, it draws out the voltage and gives the continuous feel of smoking.
    Also referred to as the e-cigarette or the e-cig, these cigarettes have seen a current rise in reputation by people who find themselves making an attempt to kick the behavior of smoking tobacco cigarettes.

    Reply
  14. Laptop Deals

    You can present one or two good options to the customer.
    Dell laptops have achieved a large amount of success with theiir line of notebook computers.
    This provides purchasers the benefit of
    getting in a location too stor near to earlier to producing a
    decision. If youu are in the Uk and searching for Asus laptops, Laptrops House has has a
    huge collection from leading retailers for you too compare so you
    can get the best deal.

    Reply
  15. chipper repair Dayton NV

    ) An out-of-season landscape lets you see what’s missing.
    A lot of people cannot afford to create all the changes at
    the same time. Very first of all, you’ll need to commence out having a thorough program which would
    plainly define your desired goals and expectations.

    Reply
  16. fruitwise

    Magnificent goods from you, man. I’ve understand your stuff previous to and you are just extremely wonderful.
    I really like what you have acquired here, certainly like what you are
    stating and the way in which you say it.
    You make it enjoyable and you still take care of to keep it
    wise. I can not wait to read far more from you. This is
    really a tremendous website.

    Reply
  17. bawdyrope

    Have you ever considered publishing an e-book or guest authoring
    on other websites? I have a blog based upon on the same subjects you discuss and would really like to
    have you share some stories/information. I know my audience would appreciate your work.
    If you’re even remotely interested, feel free to send me
    an e-mail.

    Reply
  18. bicycle

    I am not suure the place you are getting your info,
    however great topic. I must spend a while studying much more or figuring out more.

    Thank you for magnificent information I was looking for this information for my mission.

    Reply
  19. neck lift

    For most up-to-date information you have to pay a visit the web and on world-wide-web I found this web site as a most excellent website
    for most recent updates.

    Reply
  20. SEO Services Philly

    I like the helpful information you provide in your articles.
    I’ll bookmark your blog and check agaiin here regularly. I’m quite certain Iwill learn lots of new stuff right here!
    Good luck for the next!

    Reply
  21. www.xieshoue.cn

    Hey, I think your website might be having browser compatibility issues.
    When I look at your website in Ie, it looks fine but when opening in Internet Explorer,
    it has some overlapping. I just wanted to give you a quick heads up!
    Other then that, great blog!

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>