Category Archives: Programming

Swarm now known as “Vega”, and has its own Google Code project

Avid readers of my blog (hi Mum!) may recall my proposal for a Distributed Programming Language a few weeks ago.

In the hope of stimulating progress, I’ve now created a Google Code Project for it, and renamed the concept “Vega” (after nothing in particular).  If interested, please sign up for the Google Code project, join the discussion group etc etc.

Great software heads for branding suicide

A few weeks ago I purchased an Asus Eee PC, and I have to say I’m impressed. It’s small and incredibly cheap, yet very capable of running a modern operating system with a variety of useful apps.

My version came with a custom version of Linux that, frankly, isn’t great. So I installed the absolutely excellent Ubuntu Eee, and have been extremely impressed by it.

So the Ubuntu Eee project has been asked by Ubuntu to change their name. Unfortunately, it seems that they’ve opted for “Easy Peasy”, which is (to be blunt) cringeworthy.

Now, it’s not that I consider myself any kind of branding expert, but as they say: “You don’t need to be able to lay eggs to tell a bad one”.

The funny thing is that they are clearly well aware that the name is problematic. In a recent blog post they say:

Some love the new name while some think it is childish. That is why we need a really good looking and professional logo and website.

Please guys, no amount of professionalism in the logo or website is going to make “Easy Peasy” any less childish!

Whoever within your organization is pushing for this name (and, knowing human nature, it’s probably the guy who thought of it) needs to take a step back and actually listen to what people are saying.  Don’t just fob it off with “oh well, there are arguments on both sides”.  There are arguments on both sides of creationism too – that doesn’t get you off the hook!

Do the right thing, don’t burden a great project with a crappy name.

A plan for better source code diffs and merging

I’ve been using Subversion for years, but a few months ago I really started to feel its limitations around creating and merging branches. I’d heard that Git made this very easy indeed, so I decided to try it.

Anyway, this isn’t yet another “I discovered Git and now I’ve achieved self-actualization” blog post, so to cut a long story short, I now use git for everything (together with the excellent GitHub).

Even though merging is a lot better with Git than Subversion, I’ve still found myself getting into situations where it requires a lot of work to merge a branch back into another branch, and it got me thinking about better ways to do merging.

While I’m no merging expert, it seems that most merging algorithms do it on a line-by-line basis, treating source code as nothing but a list of lines of text.  It got me thinking, what if the merging algorithm understood the structure of the source code it is trying to merge?

So the idea is this:

Provide the merge algorithm with the grammar of the programming language, perhaps in the form of a Bison, Yacc, or JavaCC grammar file.

The merge algorithm then uses this to parse the files to be diffed and/or merged into trees, and then the diff and merge are treated as operations on these trees.  These operations may include creating, deleting, or moving nodes or branches, renaming nodes, etc.  There has been quite a bit (pdf) of academic research on this topic, although I haven’t yet found off-the-shelf code that will do what we need.  Still, it shouldn’t be terribly hard to implement.
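To make the intuition concrete, here is a toy Java sketch (an illustration, not a real diff tool): a line-based comparison treats a pure re-formatting change as a modification, while even a crude whitespace-insensitive token view – the very first step towards parsing into trees – sees no change at all.

```java
import java.util.Arrays;

public class StructureAwareDiff {

    // Line-based view: the unit of comparison is the raw line of text
    static boolean sameLines(String a, String b) {
        return Arrays.asList(a.split("\n")).equals(Arrays.asList(b.split("\n")));
    }

    // Token-based view: whitespace and line breaks are irrelevant
    static boolean sameTokens(String a, String b) {
        return Arrays.asList(a.trim().split("\\s+"))
                .equals(Arrays.asList(b.trim().split("\\s+")));
    }

    public static void main(String[] args) {
        String original    = "if (x > 0) { y = 1; }";
        String reformatted = "if (x > 0) {\n    y = 1;\n}";
        // A line diff reports a change; the token view correctly sees none
        System.out.println("line diff sees a change:  " + !sameLines(original, reformatted));
        System.out.println("token diff sees a change: " + !sameTokens(original, reformatted));
    }
}
```

A real implementation would of course compare parse trees rather than flat token lists, so that it could also describe *where* in the program's structure a change occurred.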

The beauty of this approach is that the merge algorithm should be far less likely to be confused by formatting changes, and much more likely to correctly identify what has changed.

I can’t think of any reason that such a tool wouldn’t work in the exact same way as existing diff/merge tools from the programmer’s perspective. The tool would automatically select the correct grammar based on the file extension, or fall back to line-based diffs if the extension is unrecognized (or the file isn’t valid for the selected grammar). Thus, it should be trivial to use this new tool with existing version control systems.

I’d love to have the time to implement this, but regretfully it is at the bottom of a very large pile of “some day” projects.  Still, I think the idea is interesting enough, and would be immediately useful enough, that if I put it out there someone somewhere might be able to make it a reality.

Any takers? I’ve set up a Google Group for further discussion, please join if interested.

Swarm: A true distributed programming language


The fundamental concept behind Swarm is that we should “move the computation, not the data”.

The Swarm prototype is a simple stack-based language, akin to a primitive version of the Java bytecode interpreter. I wanted the proof of concept to be quick to implement, while demonstrating that the concept could work for a popular runtime like the JVM or Microsoft’s CLR.

Update (Sept 17th 09): Swarm is now implemented as a Scala library, so you program in normal Scala, rather than a custom stack-based library as with the prototype described here.  It uses the Scala 2.8 Continuations plugin to achieve this.  See end of blog post for further information.

The Prototype

The prototype is implemented in Scala, and I will use snippets of Scala code below, but a knowledge of Scala won’t be required to understand the rest of this article. I chose Scala because I wanted to learn it, and because its rich semantics tend to make coding easier and faster than in Java (my normal language of choice).

As with the JVM, there are three places to store data in the Swarm VM: the stack, a local variable array, and the store. The stack is used for intermediate values in computations; data here tends to be very short-lived. In the prototype it is implemented as a List[Any]. The local variable array is for data that is used within a block of code (it’s implemented as a Map[Int, Any]).

The “Store”

The “store” is somewhat analogous to the JVM heap. It is used for long-term storage of data; indeed, in an actual implementation it might be persistent and/or transactional, but in the prototype it is in-memory. The store contains “objects”, each of which is a list of key-value pairs. The values may be references to other objects. The store is implemented as a Map[Int, Map[String, Any]].

The store is the key to how Swarm is distributed. A store can be spread across multiple computers, and objects in a store can point to objects on remote computers. While different computers store different objects, every computer contains the same computer program (or must have access to it).

When code running on a computer wants to read or write an object stored locally, this works just as it would on a single computer – a very efficient operation. But what if the code needs to access a remote object? In this case, rather than moving the remote object across the network to the local computer, we freeze the computation and move it to the remote computer, where it resumes execution – now able to access the object directly.


Freezing the code execution in this way is called a “continuation”. A continuation is akin to using the “Save Game” feature in a computer game. It allows you to start the game from the same place in the future, or even to email the saved game to a friend of yours and have them continue where you left off.

Continuations are supported by some advanced programming languages like Haskell and Ruby, but not other more common languages like Java and Python (at least, not directly).

In practice, the continuation must contain the current stack, the current local variables, and the program counter. These three things represent the “state” of a Swarm computer program. If you can transmit these three things to another computer that also has access to the same computer program, then it can pick up where you left off, and the author of the computer program doesn’t need to know or care about this little bit of magic.
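A minimal sketch of what such a continuation might look like, written here in Java with hypothetical type and field names (Swarm itself is written in Scala): the three pieces of state are bundled into one serializable object that can be frozen to bytes and thawed on another node.

```java
import java.io.*;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class Continuation implements Serializable {
    final Deque<Object> stack;          // intermediate values in computations
    final Map<Integer, Object> locals;  // the local variable array
    final int programCounter;           // where to resume execution

    Continuation(Deque<Object> stack, Map<Integer, Object> locals, int programCounter) {
        this.stack = stack;
        this.locals = locals;
        this.programCounter = programCounter;
    }

    // Freeze the computation into bytes that can cross the network
    static byte[] freeze(Continuation c) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(c);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Thaw on the remote node and pick up exactly where we left off
    static Continuation thaw(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (Continuation) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Deque<Object> stack = new ArrayDeque<>();
        stack.push(42);
        Map<Integer, Object> locals = new HashMap<>();
        locals.put(0, "hello");
        byte[] frozen = freeze(new Continuation(stack, locals, 7));
        System.out.println("frozen size in bytes: " + frozen.length);
        System.out.println("resumes at pc " + thaw(frozen).programCounter);
    }
}
```

Note how small the frozen form is – consistent with the point below that a continuation will typically fit in a single IP packet.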

Of course, sending this state across the network, even though it will typically be quite a small amount of data (probably fitting in a single IP packet), will still introduce a significant delay relative to the normal speed of code execution. For this reason, Swarm must try to ensure that objects are distributed among the computers in the cluster in such a way that over-the-network traffic is minimized.

Deciding where objects should go

Our goal is to minimise the number of times continuations must be packaged up and sent across the network. We can do this by ensuring that objects which are “tightly coupled”, meaning that the references between them are frequently used by the code, reside on the same computer.

For now we use a very simple approach to this. When an object is created, we attempt to put it on the same computer as the last object accessed. This is easy, because this is the computer that will be executing the code. The only situation under which we don’t do this is if the current server has too many objects relative to the other servers, in which case we create the object on the least-full server. This helps to balance objects across the servers.

Of course, this is a very simple approach; in a real-world implementation we’d want much more sophisticated load-balancing that monitored actual usage and moved objects as necessary.

The simulation

To test these ideas I wrote a simulation in Scala. My goal was to create something (to quote Einstein) as simple as possible, but not simpler.

First I created a simple stack-based language. It supports primitive control commands, Goto and If. It also supports mathematical operations like addition, generation of random numbers, and greater than comparison. Lastly, it supports commands to create, modify, and retrieve objects in the store. Recall that each object is a list of key-value pairs, where the key is a string, and the value can be a number, string, or a reference to another object. At the time of writing the language doesn’t support subroutines, but these could be added with relative ease. You can find the definitions of these commands in Instruction.scala.
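For readers unfamiliar with stack machines, here is a toy Java sketch of the general idea (a hypothetical mini instruction set for illustration only – the real definitions live in Instruction.scala): values are pushed onto a stack, operations pop their arguments from it, and Goto/If simply reassign the program counter.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class TinyVm {
    // A toy instruction set: an opcode plus one optional operand
    enum Op { PUSH, ADD, GT, IF, GOTO, HALT }

    static final class Instr {
        final Op op;
        final int arg;
        Instr(Op op, int arg) { this.op = op; this.arg = arg; }
    }

    // Execute until HALT, then return the top of the stack
    static int run(Instr[] program) {
        Deque<Integer> stack = new ArrayDeque<>();
        int pc = 0;  // the program counter: part of the VM state, alongside the stack
        while (program[pc].op != Op.HALT) {
            Instr i = program[pc++];
            switch (i.op) {
                case PUSH: stack.push(i.arg); break;
                case ADD:  stack.push(stack.pop() + stack.pop()); break;
                case GT:   stack.push(stack.pop() < stack.pop() ? 1 : 0); break; // a > b
                case GOTO: pc = i.arg; break;
                case IF:   if (stack.pop() != 0) pc = i.arg; break; // jump if non-zero
                default:   break;
            }
        }
        return stack.pop();
    }

    public static void main(String[] args) {
        // Computes 2 + 3
        Instr[] program = {
            new Instr(Op.PUSH, 2),
            new Instr(Op.PUSH, 3),
            new Instr(Op.ADD, 0),
            new Instr(Op.HALT, 0),
        };
        System.out.println(run(program));  // prints 5
    }
}
```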

I used immutable data structures for the state of the execution: a List for the stack (Scala’s Stack implementation is buggy – yay Scala!), and an IntMap for the local variables.

The simulated Store

Next I implemented the Store class. This is a map of integers to objects. Each object is itself a map of strings to arbitrary values, which could include integers, strings, ObjectRef objects, among others.

Overseeing the various Stores is a Cluster. Its role is to tell us where we should create new objects, using its loadBalance() function. It will generally create the object on the current computer (aka “node”), but will sometimes say that it must be created on another node to ensure objects are reasonably well distributed.

The load-balancing algorithm is simple: if the current store has less than half the free space of the least-full store, the object is created on the least-full store instead.
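In Java pseudocode (the names here are my own; the actual simulation is in Scala), that rule looks something like this:

```java
public class LoadBalance {
    // Decide which store a new object should go to, given the free space
    // (in object slots) of each store in the cluster.
    static int loadBalance(int current, int[] freeSpace) {
        // Find the least-full store, i.e. the one with the most free space
        int emptiest = 0;
        for (int i = 1; i < freeSpace.length; i++) {
            if (freeSpace[i] > freeSpace[emptiest]) emptiest = i;
        }
        // Prefer locality: keep the new object where the code is running,
        // unless this store has under half the emptiest store's free space
        if (freeSpace[current] * 2 >= freeSpace[emptiest]) return current;
        return emptiest;
    }

    public static void main(String[] args) {
        // Balanced cluster: the object stays on the current store (0)
        System.out.println(loadBalance(0, new int[] {10, 12, 11}));
        // Store 0 nearly full: the object goes to the emptiest store (1)
        System.out.println(loadBalance(0, new int[] {3, 20, 11}));
    }
}
```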

A test program

Next I needed to write a simple program to see how the Swarm prototype performs. I needed something simple, but which would hopefully be representative of the way real-world software creates and uses objects.

I decided to implement a simple binary-search tree, and then insert random numbers into it. It took me a while, and was rather tedious, but eventually I got something working. You can see the end result in Test2.scala. Initially line numbers had to be specified explicitly, which was a royal PITA, so I later implemented a label mechanism and a “compilation” pre-processing stage (implemented in Interpreter.scala) that assigns line numbers to the labels.

The experiment

I created 5 virtual computers, each with a capacity of 20 objects. Of course, this is a ludicrously small number of objects (in the real world, a node would be able to handle millions, even billions of objects), but I wanted to stress the system.


Here is a graph of the resultant tree, the color indicates which of the computers the object is stored on – as you can see the root of the tree was stored on the pink computer.


In this experiment, I show the percentage of instructions that result in a continuation being transmitted between computers as the size of the tree is increased:


As can be seen, it increases rapidly before leveling out around 3.25%.


Even the simplistic load-balancing approach employed in the simulation seems to be reasonably effective: only around 3.5% of instructions required communication between nodes. A number of factors conspired to make this a difficult task:

  • The insertion of data at random points in the tree makes it more difficult for the algorithm to group branches of the tree together on the same computer
  • Each node had very limited capacity

Immediate future

Continuous re-allocation of objects between computers in response to actual usage is key. This will need to be a trade-off between efficiency of the re-balancing process, and its effectiveness. Near-term work on the simulation is likely to focus on this.

Long term

I guesstimate that turning this concept into an actual working piece of software would be a 5 man-year effort. Among the major questions/challenges are:

Can we work with an existing virtual machine, or do we need a new language?

Ideally we would be able to make Swarm support a widely-used bytecode language such as the Java Virtual Machine, Microsoft’s Common Language Runtime, or the LLVM compiler infrastructure. Swarm’s simple bytecode is a simplification of the Java Virtual Machine for this purpose. However, it is possible that we will uncover reasons to need a new language. That would be unfortunate, as it would create a significant barrier to entry.

What about persistence, transactions, and replication?

If we want Swarm to have the potential to replace databases, then it will need to be able to store objects persistently (i.e. on disk). It will also need to support transactions (so that we can be sure that the stored data is always in a consistent state). Additionally, we will need a solution to ensure that it is fault-tolerant, meaning some kind of replication scheme.

Wrap up

Right now Swarm is still more of a thought-experiment for me, as I am still very much focused on my day job. That being said, I have had conversations with a number of people who have expressed an interest in providing funding to make Swarm a reality. As I mentioned before, it would be a big task, probably requiring around 5 very smart people working full-time for at least a year. The potential, however, is huge – just imagine being able to write massively scalable applications without having to think about scalability and persistence!

Source code

You can find the Scala source code in a public git repository here. The code is not that well commented and, at the time of writing, it will spit out a Graphviz file to produce the tree diagram above; it’s not really intended for public consumption, but it may be interesting for people to browse.

Update (13th Jan 09): I have created a Google Code project for Swarm with a discussion mailing list. If you are interested please head on over and sign up!

Update 2 (17th Sept 09): I’ve done a lot more work on this, see here.

How can A.(x-B)^2+C help you avoid a curry overdose?

You might love curry, but no matter how much you love it, you probably don’t want to eat it every night.  We’ve spent the last few months working on a powerful collaborative filter called SenseArray.  A collaborative filter’s job is to quickly figure out what people want, and then give it to them, but most don’t account for the fact that people need some diversity.  Indeed, the Netflix Prize (pretty much the Olympics of collaborative filters) doesn’t even measure it!

One fundamental question we need to answer is just how much diversity users want.  Too little and users get bored, too much and the quality of your recommendations is too low, so it’s a balance.  This will be different for different applications of SenseArray.  So how do we figure out how much diversity SenseArray should use in its recommendations in any particular situation?

Well, the high-level answer isn’t complicated, it’s “trial and error”.  You try a certain parameter, and you measure the aggregate user response to that parameter.  If more users like what they see with parameter A versus parameter B, then you go with parameter A.

There are a few problems though.  These patterns of user response only become meaningful after you’ve collected a lot of data, so it can take hours, perhaps even days to collect enough data to properly evaluate a particular parameter.

Also, since you are testing this on a live system, you really want to settle on a good parameter as quickly as possible so that users are getting the best possible experience as quickly as possible.  How can we do this?

We should start by getting some data.  Let’s say that we know our parameter must be between 0 and 1.  Initially, since we know nothing, let’s run 3 experiments, testing performance when the input is 0.0, 0.5, and 1.0.

So now we have these initial 3 samples, what do we do with them?  I’ll start by making an assumption: we expect there to be an input value at which performance is optimal, where performance gradually deteriorates the further you get from this value.  This means that if you were to graph performance as the parameter changes, where a lower value means better performance, it would look something like the image to the right, a U-shaped curve.

Turns out that you can produce a simple U-shaped curve with the equation y = A·(x − B)² + C.

You can adjust A to change the steepness of the slope, B determines the middle of the curve (the lowest point), and C is the height of the lowest point.

So the next step, given our initial 3 samples, is to find values for A, B, and C that will produce a curve that most closely fits these sample points.  The technique used to do this is a rudimentary genetic algorithm optimization.  We choose some initial values for A, B, and C.  We then measure how good the “fit” is by measuring the output of our curve equation for the sample points, and squaring the difference with the actual values we got for these samples.  We then adjust A, B, and C, measure the fit again, and if it is better, we keep the new values, if not we revert to the old values for A, B, and C and try again.

Typically this approach will find a good curve fit in just a few hundred iterations.  There are undoubtedly more efficient ways to do this, but this is something we need to do quite infrequently, so a “good enough” approach is just fine.
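As a sketch of that fitting loop in Java (my reconstruction of the idea, not SenseArray’s actual code; the step size, iteration count, and starting values are illustrative guesses): repeatedly nudge A, B, or C, and keep the nudge only if the squared error against the samples improves.

```java
import java.util.Random;

public class CurveFit {
    // Fit y = A*(x - B)^2 + C to the samples by keep-if-better perturbation
    static double[] fit(double[] xs, double[] ys, int iterations) {
        Random rand = new Random(0);
        double[] p = {1, 0.5, 0};  // initial guesses for A, B, C
        double best = error(p, xs, ys);
        for (int i = 0; i < iterations; i++) {
            double[] candidate = p.clone();
            // Nudge one of the three parameters by a small random amount
            candidate[rand.nextInt(3)] += rand.nextGaussian() * 0.1;
            double e = error(candidate, xs, ys);
            if (e < best) { p = candidate; best = e; }  // keep only improvements
        }
        return p;
    }

    // Sum of squared differences between the curve and the actual samples
    static double error(double[] p, double[] xs, double[] ys) {
        double sum = 0;
        for (int i = 0; i < xs.length; i++) {
            double predicted = p[0] * (xs[i] - p[1]) * (xs[i] - p[1]) + p[2];
            sum += (predicted - ys[i]) * (predicted - ys[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Three noiseless samples taken from y = 2*(x - 0.6)^2 + 0.1
        double[] xs = {0.0, 0.5, 1.0};
        double[] ys = {0.82, 0.12, 0.42};
        double[] p = fit(xs, ys, 20000);
        // p[1] is B, the estimated optimum input -- it should land near 0.6
        System.out.printf("A=%.2f B=%.2f C=%.2f%n", p[0], p[1], p[2]);
    }
}
```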

You will recall that B represents the lowest point of the curve; this means that whatever our curve-fitting told us B should be is our guess for the input that will produce the best output. Cool!

So what do we do now?  Well, chances are that B won’t be a perfect guess; after all, it’s based on only 3 samples, which probably incorporate a lot of noise, and our simple curve almost certainly isn’t an exact match even if there were no noise.  But it has given us an approximation; it’s given us a clue as to where we should focus our attention next.

We started by testing across the entire range of possible input values, so we should now narrow this range.  I’ve found that narrowing it to 1/4th of its previous size works well, although this is a fairly arbitrary number.  We then repeat the process using this new range centered on our best guess from the previous iteration.

Another refinement is to track whether our guesses appear to be “homing in”, meaning that each successive change in our best guess is smaller and smaller.  While the successive changes are getting smaller, we continue to decrease the range by 1/4th.  If a change is larger than the previous change, then we increase the range again – I arbitrarily chose 1/3rd, and it seems to work (I didn’t want to choose 1/4th because then we’d just re-test the same set of samples we just tested).

So let’s see how this works in practice.  Let’s say we have a mystery function; you and I know it’s defined as follows:
[sourcecode language='java']
public double mysteryFunction(double input) {
    return Math.sin(input + 4) + 1.1 + (random.nextDouble() / 3 - 0.11);
}
[/sourcecode]
This is a simple sine curve with some random salt thrown in to make things a little more interesting (real sample data is rarely as neat as a clean curve). The curve itself (without randomness) looks like this:


We will focus on the part of the curve between 0 and 3, because this only has a single lowest point.

I’ve created a class called ParameterOptimizer which implements the algorithm described above.  Here is how we use it:
[sourcecode language='java']
// df is a DecimalFormat field, e.g. new DecimalFormat("0.00")
public void testMystery() {
    // Create our ParameterOptimizer object and tell it that valid inputs
    // are in the range 0 to 3
    ParameterOptimizer po = new ParameterOptimizer(0, 3);
    // Output the initial best guess.  Since we don't have any samples yet,
    // this will be mid-way between 0 and 3
    System.out.println(" Best guess: " + df.format(po.getBestGuess()));
    // Now we run a few tests
    for (int x = 0; x < 3; x++) {
        // Get three sample points from the ParameterOptimizer
        List<Double> probes = po.getNextSamples(3);
        for (Double probe : probes) {
            // For each of these points, obtain the output provided by
            // the mysteryFunction.  Note that in real-life scenarios,
            // this might be a time-consuming process
            double output = mysteryFunction(probe);
            // Tell the ParameterOptimizer what we learned
            po.addSample(probe, output);
            // And output the information so we can see what is going on
            System.out.print(" " + df.format(probe) + ":" + df.format(output));
        }
        // Find out the current best guess, along with what we predict
        // the output will be at this point
        System.out.println(" Best guess: " + po.getBestGuess());
    }
}

private double mysteryFunction(double input) {
    return Math.sin(input + 4) + 1.1 + (random.nextDouble() / 3 - 0.11);
}
[/sourcecode]
Hopefully this code is relatively self-explanatory.

So how does it perform?  Well, here is the output from a typical run:

 Best guess: 1.50
0.00:0.51 1.50:0.56 3.00:1.70 Best guess: 0.72:0.39
0.34:0.28 0.72:0.12 1.09:0.34 Best guess: 0.72:0.28
0.63:0.03 0.72:0.10 0.81:0.32 Best guess: 0.72:0.23
0.44:0.33 0.72:0.09 1.00:0.11 Best guess: 0.76:0.21
0.00:0.47 0.80:0.28 1.61:0.41 Best guess: 0.76:0.21
0.55:0.17 0.76:0.32 0.97:0.35 Best guess: 0.73:0.22
0.10:0.37 0.73:0.07 1.36:0.48 Best guess: 0.70:0.22

On the first iteration, using just three samples, we get extremely close to the lowest point on the curve:


Here the red circles indicate the sample points, and the blue X indicates where the algorithm guesses the lowest point will be. As you can see, it slightly overestimates the actual height of this lowest point, but this is because two of the samples are slightly above the curve due to the random salt.

On the second iteration it still thinks that 0.72 is the optimal input, but its estimate of the output at this optimum has improved, going from 0.39 to 0.28:


You may wonder why it is still guessing so high for the optimal point given that these three samples would suggest it is much lower.  The reason is that when it fits the curve, it will use all available samples, not just the three most recent.

Also note that the three sample points used are now much closer to the optimal than in the first iteration. This means that our “guinea pig” users should be getting a better experience than on the first iteration.

I’ve made this code available under the GNU Lesser General Public License, so you are welcome to take it, play with it, and perhaps improve it.  All I ask is that if you do improve it, you let me know about your improvements.  Browse the code here, get the details for pulling it from Subversion here.

Rediscovering Standard ML

Back when I was studying computer science in Edinburgh (1995-99), we all learned the ML programming language, specifically Standard ML, which was developed in my CS department in the 70s.  Even though I think most people on my course quickly decided they didn’t like it, it grew on me as I became more familiar with it (and it seemed pretty simple relative to Prolog, which we had to learn in our AI courses).

I hadn’t used it since I left university, but given the recent resurgence of interest in higher-level programming languages, with features like type inference, such as with Scala and F#, I decided it was time to indulge in a bit of nostalgia, and refresh my memory.

At university we had a book called ML for the Working Programmer, which I remember being pretty good, but I’d long-since lost it, so I went onto Amazon and purchased it again.

I then downloaded Standard ML of New Jersey, a popular and powerful SML compiler/interpreter. I had previously considered Caml, an alternative dialect of ML, but I couldn’t really get used to it, and I liked the fact that SML supports continuations, a feature useful for building highly concurrent applications.  Furthermore, there is MLton, an extremely powerful “whole-program” SML compiler that reputedly produces extremely efficient code.

So what does SML look like? Here is a simple recursive function to count the elements in a list:

fun count([]) = 0
| count(h::t) = 1 + count(t);

SML relies on pattern matching. The first line tells SML to return 0 for an empty list. The second line will only be reached if the first line doesn’t match, and will only match if the list has at least one element. It splits the list into its first element, called ‘h’, and the remaining elements, called ‘t’, using the ‘::’ operator. It then returns 1 plus the number of elements in the tail.

After typing this into SML’s interpreter, SML produces this:

val count = fn : 'a list -> int

This is the type inference at work. Without specifying a single type, SML has inferred that the function count takes a list of type ‘a (pronounced “alpha”) which can represent any type, and returns an integer.  Note that SML has correctly inferred that the type of the elements in the list passed to count is irrelevant for the purpose of counting them.

Let’s try something else, a sum function to add up the elements in a list:

fun sum([]) = 0
| sum(h::t) = h + sum(t);

The general structure is similar to count, but this time SML infers different types:

val sum = fn : int list -> int

As expected, it has inferred that the list must be a list of integers.

SML is a powerful language, and I’d recommend it for anyone keen to broaden their knowledge of programming languages.

Unlike Haskell it is “impure”, meaning that functions can have side-effects. Some people seem really hung up on this idea of purity, but so far as I can tell it is more of a fetish which makes life more complicated, with few practical benefits (whenever I ask people to explain what is so desirable about purity, the response is generally rather theoretical or hand-wavy).

I also think it’s nicer than Scala in some ways; for example, in ML you can do pattern matching in function definitions, rather than having to use match. And of course, it’s way more mature than Scala.

If you are going to try it, I suggest using Emacs together with sml-mode.

Don’t try to serialize large ConcurrentSkipListMaps

Today I learned the hard way that Java can’t serialize large ConcurrentSkipListMaps (and probably ConcurrentSkipListSets too). The issue is that there doesn’t seem to be any custom serialization implementation for this class, and the default serialization mechanism ends up joyfully recursing its way along the map’s internal object graph – eventually causing a StackOverflowError. Nice.

The root issue here is, of course, Java’s default serialization mechanism, which really shouldn’t operate recursively like this. Turns out that this has been a known issue for an embarrassing length of time (see this 9-year-old bug report).

The workaround is to subclass ConcurrentSkipListMap, providing a more efficient serialization mechanism – actually it’s not too hard; the code is here. It’s public domain, so enjoy!
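For the curious, the general shape of such a workaround looks something like this serialization-proxy sketch (my own illustration of the technique, not the code linked above): at write time the map is replaced by a flat array of entries, and at read time the map is rebuilt with plain put() calls, so serialization never walks the skip list’s internal node graph.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

public class SerializableSkipListMap<K, V> extends ConcurrentSkipListMap<K, V> {

    // Swap the map for a flat proxy when the map is serialized
    private Object writeReplace() {
        return new Proxy(this);
    }

    private static class Proxy implements Serializable {
        private final Object[][] entries;  // flat {key, value} pairs

        Proxy(Map<?, ?> map) {
            entries = map.entrySet().stream()
                    .map(e -> new Object[] { e.getKey(), e.getValue() })
                    .toArray(Object[][]::new);
        }

        // Rebuild the map iteratively on deserialization
        private Object readResolve() {
            SerializableSkipListMap<Object, Object> map = new SerializableSkipListMap<>();
            for (Object[] e : entries) map.put(e[0], e[1]);
            return map;
        }
    }

    // Convenience helper: serialize and deserialize a map in memory
    @SuppressWarnings("unchecked")
    static <K, V> SerializableSkipListMap<K, V> roundTrip(SerializableSkipListMap<K, V> m) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(m);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (SerializableSkipListMap<K, V>) in.readObject();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

The nice property of the proxy approach is that the serialized form is just a list of entries: no recursion depth to worry about, regardless of how large the map grows.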

A solution for all those that hate regular expressions

Regular expressions are hideous, near-unintelligible strings of code that are hard to write, and almost impossible to read. They are used to extract things from text, such as a date, or a person’s name.

Just this morning I was thinking about how one might create a tool which would generate regular expressions automatically from some example text, when I stumbled on a site that does just that. You type in an example of the type of text you want to match, and a few clicks later you have your regular expression! Even better, it will generate code for you in a variety of languages, including my favorite, Java.

Definitely one of my better finds this month…

Giving up on FaceBook’s Thrift

I’m working on some commercial software (implemented in Java) that needs to expose an API that a wide variety of programming languages can interface with over the network.

A few months ago my attention was drawn to FaceBook’s Thrift framework, a tool that allows you to define an API in a language-neutral way, and which will then generate code in a variety of languages (C, Java, Python, Ruby, etc) to communicate with the API, either as a client or a server. Because Thrift uses a compact binary protocol, it promises to be more efficient than approaches like XML-RPC, SOAP, or JSON-RPC. Also, because it generates both client and server code automatically, the likelihood of bugs in how the client and server communicate is greatly reduced.

Thrift’s documentation is dismal, and what little of it there is, is hard to find. There is an example API definition file on the Thrift homepage, and this serves as an adequate explanation of how to write one because it is well commented, but still, some proper documentation would be nice here.

Once you’ve created your API definition and actually want to integrate the bindings that Thrift automatically generates into your code, the lack of documentation becomes a much bigger problem. I stumbled around blindly for quite a while before discovering that some example files are included with the Thrift source code in a subdirectory. Of course, there isn’t a single comment in any of the example code, at least the Java code I looked at. This just isn’t good enough – even if Thrift had never been released to the public, it’s amazing that whoever managed the project internally at Facebook didn’t insist on at least some documentation.

So the next challenge is that I knew at least one user of this code required C# bindings. These aren’t present in the current Thrift release, but I discovered that code to support this had been committed back in January, so I went ahead and attempted to build the latest Thrift code directly from their Subversion repository.

And here I ran into yet more roadblocks: I couldn’t even get the ./configure script to run (this is what checks all the dependencies before you begin compiling). I emailed Thrift’s mailing list, and here I have to say that those on the list, mostly Facebook employees from what I could see, were extremely responsive, often replying within minutes. Their answers weren’t always as direct as I would have liked; on more than one occasion they just explained what the problem was, rather than telling me how to solve it.

Nonetheless, within a few hours I’d resolved the ./configure script issue, and ran into another issue during compilation. Again, after a few (very promptly answered) emails to the list, I was able to compile the thing, and generate the C# bindings I needed. Of course, there weren’t even example files explaining how to use the C# bindings, and it later transpired that they would only work with .NET 3.x – this was a dealbreaker for the user that needed the C# bindings in the first place. Groan.

Anyway, so after all of these issues, I’ve given up on Thrift, and will probably create a simple REST interface using something like Restlet.

To summarize my Thrift experience:

Concept: A+
Simplicity: B-
Documentation: F
Support: B+
Maturity: F
Overall: D

A decent stand-alone Java Bloom Filter implementation

Update (11th Jan 2010): Magnus Skjegstad has started a project with a Bloom Filter implementation inspired by this one, but which corrects a number of deficiencies. I recommend that you use this implementation instead of the one shown here; you can find it on Github.

Update (3rd July 2015): These days if you need a Bloom Filter I’d recommend using the implementation in Google’s Guava library.

They say necessity is the mother of invention, and a few days ago I needed a stand-alone Java Bloom Filter implementation. A Bloom Filter is a relatively simple way to efficiently remember whether you’ve seen something before, without actually having to store everything you’ve seen. Wikipedia can fill you in on the details.

Anyway, I did some hunting around, and the best available option seemed to be this implementation in the Apache Hadoop project.

Fortunately, I took a look at the source code before going ahead and using it. The first warning sign was that at the heart of their implementation they use an array of boolean values (boolean[]). This is a serious problem, because most Java implementations I know of use a whole byte to represent each boolean in a boolean array. This means that for every actual bit stored, seven bits are wasted – and this is in a data structure that is supposed to save you memory!
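To illustrate the overhead (the heap figures here assume a typical HotSpot-style JVM layout, so treat them as rough estimates): a boolean[] spends a whole byte per flag, while java.util.BitSet packs 64 flags into each long of its backing array:

```java
import java.util.BitSet;

public class BitSetVsBooleanArray {
    public static void main(String[] args) {
        // A boolean[] typically stores one byte per entry, so a million
        // flags costs roughly 1 MB of heap on most JVMs.
        boolean[] flags = new boolean[1000000];

        // A BitSet backs the same million flags with a long[], roughly
        // 125 KB -- an 8x saving, with identical get/set semantics.
        BitSet bits = new BitSet(1000000);

        flags[999999] = true;
        bits.set(999999);
        System.out.println(flags[999999] && bits.get(999999)); // prints "true"
    }
}
```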

Some other parts of their code also concerned me, such as their use of SHA1 as the hash function, which seemed haphazard to say the least. Oh, and the other minor problem: in my initial testing, I couldn’t actually get their implementation to work.

So, I decided to implement my own, and as luck would have it I hit on a very simple way to do it using the java.util.Random class. The two key methods, add() and contains(), are just 7 and 8 short lines of simple, readable code respectively.
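The core of the trick can be sketched like this (this is written from memory to illustrate the idea, not copied from the actual SimpleBloomFilter source; the class name, field names, and choice of hash count are all illustrative): seed a java.util.Random with the element’s hashCode(), then draw a sequence of pseudo-random bit indices from it. Because the same seed always yields the same sequence, add() and contains() agree on which bits represent a given element.

```java
import java.util.BitSet;
import java.util.Random;

// Illustrative sketch of the java.util.Random hashing trick --
// not the actual SimpleBloomFilter source.
public class RandomHashBloomSketch<E> {
    private final BitSet bits;
    private final int bitArraySize;
    private final int hashCount; // bits set per element

    public RandomHashBloomSketch(int bitArraySize, int expectedElements) {
        this.bits = new BitSet(bitArraySize);
        this.bitArraySize = bitArraySize;
        // One draw per bit of per-element capacity; the real
        // implementation may choose this differently.
        this.hashCount = Math.max(1, bitArraySize / expectedElements);
    }

    public void add(E o) {
        // Seeding Random with the element's hashCode makes the
        // sequence of bit indices deterministic for that element.
        Random r = new Random(o.hashCode());
        for (int i = 0; i < hashCount; i++) {
            bits.set(r.nextInt(bitArraySize));
        }
    }

    public boolean contains(E o) {
        Random r = new Random(o.hashCode());
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(r.nextInt(bitArraySize))) {
                return false; // definitely never added
            }
        }
        return true; // probably added (false positives possible)
    }
}
```

Since equal elements produce equal hashCode() values, both methods derive the same index sequence for the same element, which is all a Bloom Filter needs from its hash functions.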

Of course, crypto geeks will point out that java.util.Random is not a cryptographically strong hash function, but the worst that can happen is an increased false-positive rate, and I think even that is unlikely for most applications (it’s certainly good enough for anything I’ve ever needed a Bloom Filter for).

Some other nice features:

    • Implements Java’s generic Set interface, since this should be familiar to people
    • You just tell its constructor how many elements you expect to insert into it, and it will automatically configure itself for optimal operation
    • It will calculate its expected false-positive rate for you
    • It’s serializable
    • I’ve included a reasonably thorough unit test
    • Unlike the Hadoop implementation, SimpleBloomFilter uses BitSet, which should be nice and compact
    • I’m releasing it under a very liberal “do what you want, just remember who wrote it” license

Here is a simple example usage:

// Create a bloom filter 100 bits in size to contain Strings, optimized to
// contain 4 items
SimpleBloomFilter<String> mammals = new SimpleBloomFilter<String>(100, 4);

// Add some things to the bloom filter
mammals.add("dog");
mammals.add("cat");
mammals.add("mouse");

// Test to see if the bloom filter remembers
if (mammals.contains("dog")) {
	System.out.println("A dog is a mammal with probability "
			+ (1 - mammals.expectedFalsePositiveProbability()));
} else {
	System.out.println("A dog is definitely not a mammal");
}
If you are wondering how many bits to use, here is a table of the false-positive rates you should expect given specific ratios between number of items, and number of bits.

Ratio (items:bits) False-positive probability
1:1 0.63212055882856
1:2 0.39957640089373
1:4 0.14689159766038
1:8 0.02157714146322
1:16 0.00046557303372
1:32 0.00000021167340
1:64 0.00000000000004

Download the source code to the SimpleBloomFilter, and a unit test here. If you would like to review the code, you can do so here with nice syntax highlighting.