Category Archives: Java

Object Relational Mappers (ORMs) are a terrible idea

ORMs (like Hibernate in Java) are a flawed solution to an imagined problem, and they should be consigned to the wastebasket of history.

They are predicated on the idea that relational databases are old-fashioned and should be avoided at all costs, and that if you can’t avoid them, you should use some kind of convoluted wrapper that tries (and generally fails) to pretend the relational database is something it isn’t: an object-oriented datastore.

Often what people are really looking for when they use an ORM is some kind of database abstraction layer.  It is reasonable to abstract the database somehow, but I recommend not using an abstraction layer that seeks to pretend the database is something that it isn’t.

In Java, the best such database abstraction layer I’ve found is Jooq; I highly recommend it.
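To give a flavour of the approach, here is a rough sketch of what a query through Jooq can look like (the BOOK table and its columns are invented for illustration, the sketch assumes a reasonably recent jOOQ version, and in real use you would normally go through jOOQ’s generated schema classes rather than raw string names):

[sourcecode language='java']
import java.sql.Connection;
import java.sql.DriverManager;

import org.jooq.DSLContext;
import org.jooq.Record;
import org.jooq.Result;
import org.jooq.SQLDialect;
import org.jooq.impl.DSL;

public class JooqSketch {
    public static void main(String[] args) throws Exception {
        // An in-memory H2 database, purely for illustration
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:demo")) {
            DSLContext create = DSL.using(conn, SQLDialect.H2);

            // The query reads like SQL; there is no pretence that BOOK rows are objects
            Result<Record> books = create.select()
                    .from(DSL.table("BOOK"))
                    .where(DSL.field("PUBLISHED_YEAR").gt(2000))
                    .orderBy(DSL.field("TITLE"))
                    .fetch();

            for (Record book : books) {
                System.out.println(book.get(DSL.field("TITLE")));
            }
        }
    }
}
[/sourcecode]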

A great tip for diagnosing Java memory issues

My good friend (and creator of Apache Wicket) Jonathan Locke just gave me a great tip.  I was aware of the jmap utility, but didn’t know it could do this:

Just type:

jmap -histo:live <pid>

Where <pid> is the process id of a Java process. It will print a histogram of the objects on the heap, grouped by class and sorted so that the classes consuming the most memory come first.

The handy thing is that the Java process doesn’t need to have been started with any kind of special command-line options; it will work with any running Java process.

I wish I had known this a few weeks ago while debugging a memory leak on a remote server…

Polynomial regression on a large dataset in Java

For an internal project at my day job I needed to find the n degree polynomial that best fits a (potentially very large) dataset.

I asked a question on Reddit about this, and Paul Lutus was kind enough to respond with a link to some code that did something close to what I needed; however, Paul’s code was not well suited to very large amounts of data.

So with his help, I modified his code to decouple it from the original application it was a part of, make it follow good Java coding practices a bit more closely, and in doing so make it more flexible and well suited to handling very large datasets.
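For anyone curious about the underlying technique, here is a minimal sketch (my own illustration, not the PolynomialFitter code itself) of how an n-degree least-squares fit can cope with arbitrarily large datasets: you only need to accumulate a handful of running sums, so memory use is independent of the number of data points.

[sourcecode language='java']
/**
 * Minimal sketch: streaming least-squares polynomial fit via the normal equations.
 * Hypothetical class, for illustration only; numerical stability suffers at high degrees.
 */
public class StreamingPolynomialFit {
    private final int degree;
    private final double[] sumX;   // running sums of x^0 .. x^(2*degree)
    private final double[] sumXY;  // running sums of y*x^0 .. y*x^degree

    public StreamingPolynomialFit(int degree) {
        this.degree = degree;
        this.sumX = new double[2 * degree + 1];
        this.sumXY = new double[degree + 1];
    }

    /** Accumulate one data point; memory stays O(degree) however large the dataset is. */
    public void add(double x, double y) {
        double p = 1;
        for (int i = 0; i < sumX.length; i++) { sumX[i] += p; p *= x; }
        p = 1;
        for (int i = 0; i < sumXY.length; i++) { sumXY[i] += y * p; p *= x; }
    }

    /** Solve the (degree+1)x(degree+1) normal equations with Gauss-Jordan elimination. */
    public double[] coefficients() {
        int n = degree + 1;
        double[][] a = new double[n][n + 1];
        for (int r = 0; r < n; r++) {
            for (int c = 0; c < n; c++) a[r][c] = sumX[r + c];
            a[r][n] = sumXY[r];
        }
        for (int col = 0; col < n; col++) {
            // Partial pivoting: move the largest remaining entry in this column to the diagonal
            int pivot = col;
            for (int r = col + 1; r < n; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) pivot = r;
            }
            double[] tmp = a[col]; a[col] = a[pivot]; a[pivot] = tmp;
            // Eliminate this column from every other row
            for (int r = 0; r < n; r++) {
                if (r == col) continue;
                double factor = a[r][col] / a[col][col];
                for (int c = col; c <= n; c++) a[r][c] -= factor * a[col][c];
            }
        }
        double[] coeffs = new double[n];
        for (int r = 0; r < n; r++) coeffs[r] = a[r][n] / a[r][r];
        return coeffs; // coeffs[i] is the coefficient of x^i
    }
}
[/sourcecode]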

The code is under the GNU Public License, which is unfortunate: it’s a library, the GPL is viral, and Paul is unwilling to relicense it under the LGPL. This means that I (or someone else) will need to re-implement it under a different license if the library is to be used within non-GPL’d code :-(

Here is the source (collaborative free software at work), and please comment if you find it useful: PolynomialFitter.java

A good pretty-printer for Google Gson

I decided to do something about the one thing I wasn’t happy about in Google Gson, the lack of good pretty-printing.

I created a class called GsonPrettyPrinter which outputs nice readable JSON in a reasonably intelligent way; you can grab the source code here.

Note, in particular:

  • Requires at least Gson 1.4 (the current version at the time of writing)
  • It will try to represent JSON objects or arrays on a single line if it can
  • It sorts JSON objects by their keys
  • The code is released under the Lesser Gnu Public License.

Which is the best Java JSON library?

I may not have tried them all, but I’ve tried quite a few, including:

  • The default libraries from json.org
  • Jackson
  • XStream
  • Google Gson
  • JsonMarshaller
  • JSON.simple

I ended up going with Google Gson; the library is simple, efficient, intuitive, and flexible. It allows you to convert JSON objects directly into equivalent Java POJOs[1] and back again, which makes for clean and simple code.
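For example, a round trip between a POJO and its JSON representation looks something like this (the Person class here is just a made-up example):

[sourcecode language='java']
import com.google.gson.Gson;

public class GsonRoundTrip {
    // A plain old Java object; the field names become the JSON keys
    static class Person {
        String name;
        int age;

        Person(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    public static void main(String[] args) {
        Gson gson = new Gson();

        // POJO -> JSON
        String json = gson.toJson(new Person("Ada", 36));
        System.out.println(json); // prints something like {"name":"Ada","age":36}

        // JSON -> POJO
        Person person = gson.fromJson(json, Person.class);
        System.out.println(person.name + " is " + person.age);
    }
}
[/sourcecode]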

One annoyance with Gson is that its JSON pretty-printer isn’t verbose enough for my tastes, but at the time of writing (Oct 09) this is on their todo list. Edit (Sep 2011): I created my own JSON pretty-printer that works with Gson – see here for more information.

[1] “Plain Old Java Object”

How can A.(x-B)^2+C help you avoid a curry overdose?

You might love curry, but no matter how much you love it, you probably don’t want to eat it every night. We’ve spent the last few months working on a powerful collaborative filter called SenseArray. A collaborative filter’s job is to quickly figure out what people want, and then give it to them, but most don’t account for the fact that people need some diversity. Indeed, the Netflix Prize (pretty much the Olympics of collaborative filters) doesn’t even measure it!

One fundamental question we need to answer is just how much diversity users want.  Too little and users get bored, too much and the quality of your recommendations is too low, so it’s a balance.  This will be different for different applications of SenseArray.  So how do we figure out how much diversity SenseArray should use in its recommendations in any particular situation?

Well, the high-level answer isn’t complicated: it’s “trial and error”. You try a certain parameter, and you measure the aggregate user response to that parameter. If more users like what they see with parameter A versus parameter B, then you go with parameter A.

There are a few problems though.  These patterns of user response only become meaningful after you’ve collected a lot of data, so it can take hours, perhaps even days to collect enough data to properly evaluate a particular parameter.

Also, since you are testing this on a live system, you really want to settle on a good parameter as quickly as possible so that users are getting the best possible experience as quickly as possible.  How can we do this?

We should start by getting some data.  Let’s say that we know our parameter must be between 0 and 1.  Initially, since we know nothing, let’s run 3 experiments, testing performance when the input is 0.0, 0.5, and 1.0.

[Image: graph of a U-shaped curve]

So now we have these initial 3 samples; what do we do with them? I’ll start by making an assumption: we expect there to be an input value at which performance is optimal, with performance gradually deteriorating the further you get from this value. This means that if you were to graph performance as the parameter changes (where a lower value means better performance), it would look something like the image to the right: a U-shaped curve.

Turns out that you can produce a simple U-shaped curve with the equation:

y = A(x-B)^2 + C

You can adjust A to change the steepness of the slope, B to move the middle of the curve (the lowest point), and C to set the height of that lowest point.

So the next step, given our initial 3 samples, is to find values for A, B, and C that produce a curve most closely fitting these sample points. The technique used to do this is a rudimentary genetic-algorithm-style optimization. We choose some initial values for A, B, and C. We then measure how good the “fit” is by evaluating our curve equation at each sample point and squaring the difference from the actual value we got for that sample. We then adjust A, B, and C and measure the fit again; if the fit is better we keep the new values, and if not we revert to the old values for A, B, and C and try again.

Typically this approach will find a good curve fit in just a few hundred iterations.  There are undoubtedly  more efficient ways to do this, but this is something we need to do quite infrequently, so a “good enough” approach is just fine.
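To make this concrete, here is a minimal sketch of that fitting loop (my own illustration, not the SenseArray code): it repeatedly perturbs A, B, and C at random and keeps a change only when the sum of squared errors shrinks.

[sourcecode language='java']
import java.util.Random;

/** Sketch of fitting y = A*(x-B)^2 + C to sample points by random perturbation. */
public class CurveFitSketch {

    /** Returns {A, B, C}; params[1] (B) is the estimated optimal input. */
    public static double[] fit(double[] xs, double[] ys, int iterations) {
        Random rand = new Random();
        double[] params = {1.0, 0.5, 0.0};              // arbitrary starting values for A, B, C
        double bestError = squaredError(params, xs, ys);
        for (int i = 0; i < iterations; i++) {
            double[] candidate = params.clone();
            // Nudge one randomly chosen parameter by a small random amount
            candidate[rand.nextInt(3)] += rand.nextGaussian() * 0.1;
            double error = squaredError(candidate, xs, ys);
            if (error < bestError) {                    // keep the change only if the fit improved
                params = candidate;
                bestError = error;
            }
        }
        return params;
    }

    /** Sum of squared differences between the curve and the actual sample outputs. */
    private static double squaredError(double[] p, double[] xs, double[] ys) {
        double sum = 0;
        for (int i = 0; i < xs.length; i++) {
            double predicted = p[0] * (xs[i] - p[1]) * (xs[i] - p[1]) + p[2];
            double diff = predicted - ys[i];
            sum += diff * diff;
        }
        return sum;
    }
}
[/sourcecode]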

You will recall that the value B represents the lowest point in the curve. This means that whatever our curve-fitting told us B should be is our guess for the input that will produce the best output. Cool!

So what do we do now?  Well, chances are that B won’t be a perfect guess; after all, it’s based on only 3 samples, which probably incorporate a lot of noise, and our simple curve almost certainly isn’t an exact match even if there were no noise. But it has given us an approximation; it’s given us a clue as to where we should focus our attention next.

We started by testing across the entire range of possible input values, so we should now narrow this range.  I’ve found that narrowing it to 1/4th of its previous size works well, although this is a fairly arbitrary number.  We then repeat the process using this new range centered on our best guess from the previous iteration.

Another refinement is to track whether our guesses appear to be “homing in”, meaning that each successive change in our best guess is smaller and smaller. While the successive changes are getting smaller, we continue to narrow the range to 1/4th of its size. If a change is larger than the previous change, then we increase the range again; I arbitrarily chose 1/3rd, and it seems to work (I didn’t want to choose 1/4th because then we’d just re-test the same set of samples we just tested).

So let’s see how this works in practice. Let’s say we have a mystery function; you and I know it’s defined as follows:
[sourcecode language='java']
public double mysteryFunction(double input) {
    return Math.sin(input + 4) + 1.1 + (random.nextDouble() / 3 - 0.11);
}
[/sourcecode]
This is a simple sine curve with some random salt thrown in to make things a little more interesting (real sample data is rarely as neat as a clean curve). The curve itself (without randomness) looks like this:

[Image: plot of the sine curve]

We will focus on the part of the curve between 0 and 3, because this only has a single lowest point.

I’ve created a class called ParameterOptimizer which implements the algorithm described above.  Here is how we use it:
[sourcecode language='java']
public void testMystery() {
    // Create our ParameterOptimizer object and tell it that valid inputs
    // are in the range 0 to 3
    ParameterOptimizer po = new ParameterOptimizer(0, 3);
    // Output the initial best guess. Since we don't have any samples yet,
    // this will be mid-way between 0 and 3
    System.out.println(" Best guess: " + df.format(po.getBestGuess()));
    // Now we run a few tests
    for (int x = 0; x < 3; x++) {
        // Get three sample points from the ParameterOptimizer
        List<Double> probes = po.getNextSamples(3);
        for (Double probe : probes) {
            // For each of these points, obtain the output provided by
            // the mysteryFunction. Note that in real-life scenarios,
            // this might be a time-consuming process
            double output = mysteryFunction(probe);
            // Tell the ParameterOptimizer what we learned
            po.addSample(probe, output);
            // And output the information so we can see what is going on
            System.out.print(" " + df.format(probe) + ":" + df.format(output));
        }
        // Find out the current best guess, along with what we predict the output
        // will be at this point
        System.out.println(" Best guess: "
                + df.format(po.getBestGuess()) + ":" + df.format(po.bestFit.getYAt(po.getBestGuess())));
    }
}

private double mysteryFunction(double input) {
    return Math.sin(input + 4) + 1.1 + (random.nextDouble() / 3 - 0.11);
}
[/sourcecode]
Hopefully this code is relatively self-explanatory.

So how does it perform?  Well, here is the output from a typical run:

 Best guess: 1.50
0.00:0.51 1.50:0.56 3.00:1.70 Best guess: 0.72:0.39
0.34:0.28 0.72:0.12 1.09:0.34 Best guess: 0.72:0.28
0.63:0.03 0.72:0.10 0.81:0.32 Best guess: 0.72:0.23
0.44:0.33 0.72:0.09 1.00:0.11 Best guess: 0.76:0.21
0.00:0.47 0.80:0.28 1.61:0.41 Best guess: 0.76:0.21
0.55:0.17 0.76:0.32 0.97:0.35 Best guess: 0.73:0.22
0.10:0.37 0.73:0.07 1.36:0.48 Best guess: 0.70:0.22

On the first iteration, using just three samples, we get extremely close to the lowest point on the curve:

[Image: first iteration, showing the three sample points (red circles) and the algorithm's guess of the lowest point (blue X)]

Here the red circles indicate the sample points, and the blue X indicates where the algorithm guesses the lowest point will be. As you can see, it slightly overestimates the actual height of this lowest point, but this is because two of the samples are slightly above the curve due to the random salt.

On the second iteration it still thinks that 0.72 is the optimum, but its estimate of the output at this optimum has improved, going from 0.39 to 0.28:

[Image: second iteration, with the sample points now clustered much closer to the optimum]

You may wonder why it is still guessing so high for the optimal point given that these three samples would suggest it is much lower.  The reason is that when it fits the curve, it will use all available samples, not just the three most recent.

Also note that the three sample points used are now much closer to the optimal than in the first iteration. This means that our “guinea pig” users should be getting a better experience than on the first iteration.

I’ve made this code available under the Lesser GNU Public License, so you are welcome to take it, play with it, and perhaps improve it.  All I ask is that if you do improve it, you let me know about your improvements.  Browse the code here, or get the details for pulling it from Subversion here.

Don’t try to serialize large ConcurrentSkipListMaps

Today I learned the hard way that Java can’t serialize large ConcurrentSkipListMaps (and probably ConcurrentSkipListSets too). The issue is that there doesn’t seem to be any custom serialization implementation for this class, and the default serialization mechanism ends up joyfully recursing its way along the map’s internal object graph – eventually causing a StackOverflowError. Nice.

The root issue here is, of course, Java’s default serialization mechanism, which really shouldn’t operate recursively like this. Turns out that this has been a known issue for an embarrassing length of time (see this 9 year old bug report).

The workaround is to subclass ConcurrentSkipListMap and provide a more efficient serialization mechanism. It’s actually not too hard; the code is here. It’s public domain, so enjoy!
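If you’re curious what such a subclass might look like, here is a rough sketch (not the original code): implementing Externalizable bypasses the default field-by-field serialization entirely, so the map can be written as a flat sequence of key/value pairs rather than by recursing over its internal node graph.

[sourcecode language='java']
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

public class SerializableSkipListMap<K, V> extends ConcurrentSkipListMap<K, V>
        implements Externalizable {

    public SerializableSkipListMap() {
        // Externalizable requires a public no-arg constructor
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        // Write the entries iteratively: no deep recursion, no stack overflow
        out.writeInt(size());
        for (Map.Entry<K, V> entry : entrySet()) {
            out.writeObject(entry.getKey());
            out.writeObject(entry.getValue());
        }
    }

    @Override
    @SuppressWarnings("unchecked")
    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        int size = in.readInt();
        for (int i = 0; i < size; i++) {
            K key = (K) in.readObject();
            V value = (V) in.readObject();
            put(key, value);
        }
    }
}
[/sourcecode]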

Seeking new Maintainer for Dijjer

Dijjer is a really cool little piece of software that I initially developed as a skunk works project within Revver back before we’d really figured out exactly what Revver was going to be.

It was born of a few frustrations I had with the then (as now) dominant P2P content distribution technology, BitTorrent.

In no particular order, these frustrations were:

  • BitTorrent downloads pieces of files in a random order, meaning you need to wait for the entire file to download before it can be presented to the user (i.e. no streaming video or audio)
  • To be fully effective, BitTorrent requires you to tinker with your firewall, a task beyond the capabilities of many users
  • BitTorrent creates a separate P2P network for each file download, limiting the benefits of a network effect
  • Distributing a file over BitTorrent requires use of a “tracker”, which can be a pain to set up

Dijjer addressed these issues in what I still consider to be a very elegant way:

  • To publish a file on Dijjer, you just put it on an ordinary web-server, and point people to it through Dijjer (a bit like Coral but P2P).
  • Dijjer streams files directly to an HTTP client, starting at the beginning. This means that you can start watching a video or listening to an audio file as it is downloaded from Dijjer. It also means that things embedded in web pages can be downloaded through Dijjer seamlessly.
  • Dijjer employs a novel “small world”-based routing algorithm to create one global unified P2P network for file distribution, rather than separate networks for each file. This means that suddenly popular new files can be distributed quickly.
  • Dijjer employs simple firewall “hole-punching” techniques to seamlessly work through firewalls without having to re-configure them.

We released the first version of Dijjer in early 2005, but by then it was clear that Revver wouldn’t need a P2P system; we’d just be distributing files from a centralized web server like everyone else. As a result, we were forced to shift our focus away from Dijjer, and nobody has really done much with it since.

Recently I decided it was time to revive the project, but I knew I’d need to find someone else to take the reins, as I’ve just got too much going on myself. To ease the transition, I’ve now migrated the Dijjer project over to the excellent Google Code system for hosting free software projects.

If you are a Java coder who might be interested in the fun challenge of taking over the Dijjer project, please get in touch with me (ian AT locut DOT us).

Need Java 1.6 collections in 1.5? Look no further

Java 1.6 has some nice new classes, in particular some useful additions to the concurrent collections packages like ConcurrentSkipListMap. The problem? I’m stuck with Java 1.5 because I use OSX 10.5 on a Mac, and Java 1.6 isn’t properly supported yet (curse you Jooooobs!).

Anyway, only slightly daunted, I made my own using the source code from the original JSR166 proposal (which is public domain – no licensing headaches).

Find the jar file here: jsr166x.jar.

A decent stand-alone Java Bloom Filter implementation

Update (11th Jan 2010): Magnus Skjegstad has started a project with a Bloom Filter implementation inspired by this one, but which corrects a number of deficiencies. I recommend that you use this implementation instead of the one shown here, you can find it on Github.

Update (3rd July 2015): These days if you need a Bloom Filter I’d recommend using the implementation in Google’s Guava library.

They say necessity is the mother of invention, and a few days ago I needed a stand-alone Java Bloom Filter implementation. A Bloom Filter is a relatively simple way to efficiently remember whether you’ve seen something before, without actually having to store everything you’ve seen. Wikipedia can fill you in on the details.

Anyway, I did some hunting around, and the best available option seemed to be this implementation in the Apache Hadoop project.

Fortunately, I took a look at the source code before going ahead and using it. The first warning sign was that at the heart of their implementation they use an array of boolean values (boolean[]). This is a serious problem, because most Java implementations I know will use a whole byte to represent a boolean in a boolean array. This means that for every actual bit stored, there are seven wasted bits – and this is in a data structure that is supposed to be saving you memory!

Some other parts of their code also concerned me, such as the way they were using SHA1 as their hash function; it seemed somewhat haphazard to say the least. Oh, the other minor problem was that in my initial testing, I couldn’t actually get their implementation to work.

So, I decided to implement my own, and as luck would have it I hit on a very simple method to do it, using the java.util.Random class. The two key methods, add() and contains(), are 7 and 8 short lines of very simple readable code respectively.

Of course, crypto geeks will point out that java.util.Random is not a very secure hash function, but the worst that can happen is an increased false-positive rate, and I think even this is unlikely for most applications (it’s certainly good enough for anything I’ve ever needed a Bloom Filter for).
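For the curious, the core idea looks something like this (a simplified sketch with made-up names, not the actual SimpleBloomFilter source): seed a java.util.Random with the element’s hashCode and let it choose which bits of a BitSet to set or test.

[sourcecode language='java']
import java.util.BitSet;
import java.util.Random;

public class BloomSketch<E> {
    private final BitSet bits;
    private final int bitCount;
    private final int hashCount;

    public BloomSketch(int bitCount, int hashCount) {
        this.bits = new BitSet(bitCount);
        this.bitCount = bitCount;
        this.hashCount = hashCount;
    }

    public void add(E element) {
        // The seeded Random produces the same bit positions every time for this element
        Random r = new Random(element.hashCode());
        for (int i = 0; i < hashCount; i++) {
            bits.set(r.nextInt(bitCount));
        }
    }

    public boolean contains(E element) {
        Random r = new Random(element.hashCode());
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(r.nextInt(bitCount))) {
                return false; // a clear bit means the element was definitely never added
            }
        }
        return true; // all bits set: probably added (false positives are possible)
    }
}
[/sourcecode]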

Some other nice features:

    • Implements Java’s generic Set interface, since this should be familiar to people
    • You just tell its constructor how many elements you expect to insert into it, and it will automatically configure itself for optimal operation
    • It will calculate its expected false-positive rate for you
    • It’s serializable
    • I’ve included a reasonably thorough unit test
    • Unlike the Hadoop implementation, SimpleBloomFilter uses BitSet, which should be nice and compact
    • I’m releasing it under a very liberal “do what you want, just remember who wrote it” license

Here is a simple example usage:

// Create a bloom filter 100 bits in size to contain Strings, optimized to
// contain 4 items
SimpleBloomFilter<String> mammals = new SimpleBloomFilter<String>(100, 4);

// Add some things to the bloom filter
mammals.add("dog");
mammals.add("cat");
mammals.add("mouse");
mammals.add("dolphin");

// Test to see if the bloom filter remembers
if (mammals.contains("dog")) {
    System.out.println("A dog is a mammal with probability "
            + (1 - mammals.expectedFalsePositiveProbability()));
} else {
    System.out.println("A dog is definitely not a mammal");
}

If you are wondering how many bits to use, here is a table of the false-positive rates you should expect given specific ratios between the number of items and the number of bits.

Ratio (items:bits)    False-positive probability
1:1                   0.63212055882856
1:2                   0.39957640089373
1:4                   0.14689159766038
1:8                   0.02157714146322
1:16                  0.00046557303372
1:32                  0.00000021167340
1:64                  0.00000000000004

Download the source code to the SimpleBloomFilter, and a unit test here. If you would like to review the code, you can do so here with nice syntax highlighting.