Category Archives: SenseArray

Polynomial regression on a large dataset in Java

For an internal project at my day job I needed to find the n degree polynomial that best fits a (potentially very large) dataset.

I asked a question on Reddit about this, and Paul Lutus was kind enough to respond with a link to some code that did something close to what I needed, however Paul’s code was not well suited to very large amounts of data.

So with his help, I modified his code to decouple it from the original application it was a part of, make it follow good Java coding practices a bit more closely, and in doing so make it more flexible and well suited to handling very large datasets.

The code is under the GNU Public License, which is unfortunate since it’s a library and the GPL is viral and Paul is unwilling to relicense under the LGPL, meaning that I or someone else will need to re-implement under a different license if the library is to be used within non-GPL’d code :-(

Here is the source, collaborative free software at work, and please comment if you find it useful!: PolynomialFitter.java

Apple misses an opportunity with Genius sidebar

When I heard about Apple’s new Genius tool, which claims to review your music library, and make music recommendations, I had high hopes.

Having installed the new version of iTunes I’m very disappointed, they should really have called it “Dunce”.  It seems that they’ve gone with an item-based collaborative filter, all it does is let you select a song, and then it gives you a list of songs that other people who liked that song also liked.

There are several problems with this.  Because of human nature, it tends to just recommend a bunch of other songs by the same artist, something I really don’t need a “genius” to do for me since this functionality has been in iTunes for ages.  The input to Genius about what interests me is just a single song – come on Apple, we had smarter collaborative filters than that a decade ago!

The real missed opportunity here is that it could so-easily have been a proper user-based collaborative filter.  A system that looked at everything I liked and didn’t like, and tried to build an accurate picture of my musical tastes, and then use this to make intelligent recommendations.

I don’t need them to use SenseArray (although it would be great if they did!), but the least they could do is make Genius a semi-respectable collaborative filter – something they’ve failed to do.

:-(

Defcon talk highlights

Talks at Defcon are sometimes very entertaining (especially if Dan Kaminsky is involved), and sometimes rather dull. I decided I’d try to ensure that my talk fell into the former category. Hopefully I succeeded, I found at least one good review, which is gratifying. Update: Found another one (scroll down).

One of the techniques I use in SenseArray is called gradient descent, and I wanted a fun way to illustrate how powerful it is.

I found a really cool video online which shows a humanoid figure “evolving” using a Genetic Algorithm (a flavor of gradient descent) to jump higher and higher. It was missing one thing though – the right soundtrack, so I added one. It got more than a few laughs when I showed it during my talk, but judge for yourself:

Digg to deploy recommendation engine – but will it work?

According to TechCrunch, Digg is about to deploy a recommendation engine (aka “collaborative filter” or CF).  I’m rather surprised that its taken this long, I spoke with Kevin Rose about it way back around August 2006 – so they’ve been working on this for almost 2 years!

From the description, it sounds like a fairly standard user-based CF, not unlike Daedalus, the CF I built for Revver (which we deployed on Reddit for a while before my friend Aaron Swartz’ departure).  It works by finding users who have rated the same things as you similarly to you, and recommending things they liked, but which you haven’t seen yet.

The problem with this approach, which we found with Daedalus, is that for the similarity between users to be meaningful, it has to be based on a lot of overlap between your ratings, and the ratings of other users.  Typically this won’t be the case except for extremely committed Digg users.  This is made worse by the fact that, apparently, they will only be looking at user data from the last 30 days.  Its really going to be hard to infer meaningful relationships between users that way.

This is why I went for a different approach with SenseArray, because I specifically wanted it to perform well without large amounts of rating data.  It works by, rather than just looking for similar users, finding a mathematical formula that successfully predicts your behavior.  This has the potential to allow the algorithm to have a much more nuanced understanding of your interests, and it can do this with much less overlap between user rating behavior.

The other major difference with SenseArray is that it augments simple user-rating data with other metadata about users, such as their geographic location, their referring website, and their choice of operating system and web browser.

Newsgator announces SenseArray-based recommendations

Newsgator, the well known RSS company, has just announced that they are using SenseArray to power article recommendations on their website, and later in their RSS reader software.

We’ve received some great coverage including articles in CNet, Mashable, and the Dallas Morning News, and I’m pretty excited to see how Newsgator’s new functionality evolves.  A personalized RSS reader is something I’ve wanted long before I got personally involved in collaborative filtering.

How can A.(x-B)^2+C help you avoid a curry overdose?

picture of curryYou might love curry, but no matter how much you love it, you probably don’t want to eat it every night.  We’ve spent the last few months working on a powerful collaborative filter called SenseArray.  A collaborative filter’s job is to quickly figure out what people want, and then give it to them, but most  don’t account for the fact that people need some diversity.  Indeed, the Netflix Prize (pretty-much the olympics of collaborative filters), doesn’t even measure it!

One fundamental question we need to answer is just how much diversity users want.  Too little and users get bored, too much and the quality of your recommendations is too low, so it’s a balance.  This will be different for different applications of SenseArray.  So how do we figure out how much diversity SenseArray should use in its recommendations in any particular situation?

Well, the high-level answer isn’t complicated, it’s “trial and error”.  You try a certain parameter, and you measure the aggregate user response to that parameter.  If more users like what they see with parameter A versus parameter B, then you go with parameter A.

There are a few problems though.  These patterns of user response only become meaningful after you’ve collected a lot of data, so it can take hours, perhaps even days to collect enough data to properly evaluate a particular parameter.

Also, since you are testing this on a live system, you really want to settle on a good parameter as quickly as possible so that users are getting the best possible experience as quickly as possible.  How can we do this?

We should start by getting some data.  Let’s say that we know our parameter must be between 0 and 1.  Initially, since we know nothing, let’s run 3 experiments, testing performance when the input is 0.0, 0.5, and 1.0.

Graph of a curveSo now we have these initial 3 samples, what do we do with them?  I’ll start by making an assumption: we expect there to be an input value at which performance is optimal, where performance gradually deteriorates the further you get from this value.  This means that if you were to graph performance as the parameter changes, where a lower value means better performance, it would look something like the image to the right, a U-shaped curve.

Turns out that you can produce a simple U-shaped curve with the equation:

You can adjust A to change the steepness of the slope, B determines the middle of the curve (the lowest point), and C is the height of the lowest point.

So the next step, given our initial 3 samples, is to find values for A, B, and C that will produce a curve that most closely fits these sample points.  The technique used to do this is a rudimentary genetic algorithm optimization.  We choose some initial values for A, B, and C.  We then measure how good the “fit” is by measuring the output of our curve equation for the sample points, and squaring the difference with the actual values we got for these samples.  We then adjust A, B, and C, measure the fit again, and if it is better, we keep the new values, if not we revert to the old values for A, B, and C and try again.

Typically this approach will find a good curve fit in just a few hundred iterations.  There are undoubtedly  more efficient ways to do this, but this is something we need to do quite infrequently, so a “good enough” approach is just fine.

You will recall that value B represents the lowest point in the curve, this means that whatever our curve-fitting told us B should be, is our guess for the input that will produce the best output, cool!

So what do we do now?  Well, chances are that B won’t be a perfect guess, after all it’s based on only 3 samples, which probably incorporate a lot of noise, and our simple curve almost certainly isn’t an exact match even if there were no noise.  But it has given us an approximation, its given us a clue as to where we should focus our attention next.

We started by testing across the entire range of possible input values, so we should now narrow this range.  I’ve found that narrowing it to 1/4th of its previous size works well, although this is a fairly arbitrary number.  We then repeat the process using this new range centered on our best guess from the previous iteration.

Another refinement is to track whether our guesses appear to be “homing in”, meaning that each successive change in our best guess is smaller and smaller.  While the successive changes are getting smaller, we continue to decrease the range by 1/4th.  If a change is larger than the previous change, then we increase the range again – I arbitrarily chose 1/3rd, and it seems to work (I didn’t want to choose 1/4th because then we’d just re-test the same set of samples we just tested).

So lets see how this works in practice.  Lets say we have a mystery function, you and I know its defined as follows:
[sourcecode language=’java’]
public double mysteryFunction(double input) {
return Math.sin(input+4)+1.1+(random.nextDouble()/3-0.11);
}
[/sourcecode]
This is a simple sine curve with some random salt thrown in to make things a little more interesting (real sample data is rarely as neat as a clean curve). The curve itself (without randomness) looks like this:

Untitled

We will focus on the part of the curve between 0 and 3, because this only has a single lowest point.

I’ve created a class called ParameterOptomizer which implements the algorithm described above.  Here is how we use it:
[sourcecode language=’java’]
public void testMystery() {
// Create our ParameterOptimizer object and tell it that valid inputs
// are in the range 0 to 3
ParameterOptimizer po = new ParameterOptimizer(0, 3);
// Output the initial best guess. Since we don’t have any samples yet,
// this will be mid-way between 0 and 3
System.out.println(” Best guess: “+df.format(po.getBestGuess()));
// Now we run a few tests
for (int x = 0; x < 3; x++) {
// Get three sample points from the ParameterOptimizer
List probes = po.getNextSamples(3);
for (Double probe : probes) {
// For each of these points, obtain the output provided by
// the mysteryFunction. Note that in real-life scenarios,
// this might be a time-consuming process
double output = mysteryFunction(probe);
// Tell the ParameterOptimizer what we learned
po.addSample(probe, output);
// And output the information so we can see what is going on
System.out.print(” “+df.format(probe)+”:”+df.format(output));
}
// Find out the current best guess, along with what we predict the output will be
// at this point
System.out.println(” Best guess: “
+df.format(po.getBestGuess())+”:”+df.format(po.bestFit.getYAt(po.getBestGuess())));
}
}

private double mysteryFunction(double input) {
return Math.sin(input+4)+1.1+(random.nextDouble()/3-0.11);
}
[/sourcecode]
Hopefully this code is relatively self-explanatory.

So how does it perform?  Well, here is the output from a typical run:

 Best guess: 1.50
0.00:0.51 1.50:0.56 3.00:1.70 Best guess: 0.72:0.39
0.34:0.28 0.72:0.12 1.09:0.34 Best guess: 0.72:0.28
0.63:0.03 0.72:0.10 0.81:0.32 Best guess: 0.72:0.23
0.44:0.33 0.72:0.09 1.00:0.11 Best guess: 0.76:0.21
0.00:0.47 0.80:0.28 1.61:0.41 Best guess: 0.76:0.21
0.55:0.17 0.76:0.32 0.97:0.35 Best guess: 0.73:0.22
0.10:0.37 0.73:0.07 1.36:0.48 Best guess: 0.70:0.22

On the first iteration, using just three samples, we get extremely close to the lowest point on the curve:

Untitled

Here the red circles indicate the sample points, and the blue X indicates where the algorithm guesses the lowest point will be. As you can see, it slightly overestimates the actual height of this lowest point, but this is because two of the samples are slightly above the curve due to the random salt.

On the second iteration it still thinks that 0.72 is the optimum, but its estimate of where this optimum is has improved, going from 0.39 to 0.28:

Untitled

You may wonder why it is still guessing so high for the optimal point given that these three samples would suggest it is much lower.  The reason is that when it fits the curve, it will use all available samples, not just the three most recent.

Also note that the three sample points used are now much closer to the optimal than in the first iteration. This means that our “guinea pig” users should be getting a better experience than on the first iteration.

I’ve made this code available under the Lesser GNU Public License, so you are welcome to take it, play with it, and perhaps improve it.  All I ask is that if you do improve it, you let me know about your improvements.  Browse the code here, get the details for pulling it from Subversion here.

Why the Bay Area isn’t necessarily the place to be

I really thought this was a great recent post on the 37 Signals blog, explaining some of the reasons you might not want to build your tech startup in the Bay Area, and its something I wholehartedly agree with.

The dominant Bay Area business approach is raising a few million in Venture Capital, hiring 15 employees and getting a nice office, not worrying much about actually generating revenue, all in the hope of getting acquired in 18 months by Google or Yahoo. I know because I’ve been down that path more than once.

This model obviously works spectacularly well for some people, but they are not recipes for building sustainable businesses.  I won’t repeat the entire post, suffice to say that it does a good job of articulating why I’m bootstrapping Uprizer Labs in Austin, and quite happy to stay here for the time being.

Don’t get me wrong, I’m not saying that there is anything wrong with venture capitalists, all the VCs I’ve worked with over the years have been good, honest people who genuinely want you to succeed.  Obviously their model works too, otherwise their limited partners wouldn’t be entrusting them with billions of dollars, as they do.  And some business can only work with the kind of cash injection at the outset that venture capital provides.

But the problem is that venture capital can allow entrepeneurs to delay or bypass the critical initial stages during which you must really validate whether there are customers out there for what you are building.  If you do that, then really all the venture captical does for you is to pay you a salary while you march towards failure.

With my current project I made the decision early-on that I would go out and find customers for what I was building before I even started to build it.  That is what I did, and I was fortunate enough to find two great customers who have worked with me as I build SenseArray to ensure that it meets their needs.

Knowing that there are people out there ready to write you a check once you’ve built what you are building gives you a great sense of comfort, which I’ve sometimes lacked even after I’ve raised venture capital.  I can’t guarantee that I’ll be successful this time around, but so far I couldn’t ask for things to have gone any better.

My latest commercial project: SenseArray

I’ve just been putting the finishing touches to the website for my latest project, its called SenseArray.  Its a stand-alone software application that companies can license, which allows them to easily integrate personalized recommendations into their service or website.  I’ve been working with several companies over the past few weeks and months that are deploying the SenseArray technology, and I’m now ready to cast a wider net.

The basic idea behind SenseArray is simple: the success of almost any website or service depends on your ability to figure out what your users want, and then give it to them as expeditiously as possible.  This can be the difference between success and failure, or between some success, and huge success.  The problem is that its hard to figure out what your users want, and often they all don’t want the same thing.

I’m not the first to try to tackle this problem, indeed, its not the first, nor the second time I’ve tackled this problem, as I developed two collaborative filters for Revver, and one for Thoof.

But building collaborative filters isn’t easy, it takes time, and a lot of trial and error.  I’ve done the hard work, and successfully navigated the minefield.  The end result is that SenseArray scores more than 5% better than Netflix’ own algorithm with the Netflix Prize dataset.

Furthermore, all collaborative filters suffer from a common problem, they all suck when it comes to new users, because they don’t know anything about those users yet.  SenseArray doesn’t suffer from this problem, because it can utilize all kinds of data about users, much of which is available right from the very first moment they visit a website (such as their IP address, their operating system, and the referring website).

Anyway, take a look at the new SenseArray website, let me know what you think, and tell your friends about it if you think it might be useful to them!