Author Archives: ian

Tungle: A wasted opportunity

Apparently Tungle has shut down development, although they still allow people to sign up. Turns out their acquisition by RIM last year must have been an acquahire (technically an acquisition, but really an admission of defeat).

Tungle had an incredibly viral business model, perhaps the most viral I’ve seen since Plaxo, solving a problem I and many others encounter on a near-daily basis:  Help people schedule meetings and calls with each-other.

So what went wrong? Their usability SUCKED. I desperately wanted Tungle to work, but almost every time I tried using it to schedule a meeting with someone, something would screw up and we’d have to resort to manually scheduling via email.  This was embarrassing when it happened, but even so I tried over and over again.  Every time I did could have been an opportunity for Tungle to sign up a new user, if their usability wasn’t so bad.

So if there is anyone out there looking for an idea that could be the next LinkedIn-scale viral phenomenon, all you have to do is reimplement Tungle, but this time get the usability right.  If I weren’t already rather busy I’d be doing this myself.

Microsoft probably just killed “Do Not Track”

Update (27th Oct 2012): I told you so!  Yahoo will ignore DNT from IE10 for exactly the reason I cite below.

Microsoft just announced that the “do not track” opt-out would be on by default in Internet Explorer 10.  This is a boneheaded move.

“Do not track” is a standard through which a web browser can inform a web page that the user does not wish to be tracked by third-party websites for the purpose of advertising.  So far as I can tell, respecting this is entirely voluntary on the part of the advertisers.

Advertisers often use browser cookies to track users, this allows them to target advertising specifically to people who’ve visited their website, for example.  Google and Microsoft both do it, it’s fairly standard practice these days.  Typically the advertiser isn’t tracking you as an individual, all they know is that you may have previously visited a particular website.

To explain why Microsoft’s move is boneheaded, I’ll relate a story from the early days of Revver, the online video sharing website that I co-founded back in 2004.

We had decided to let video uploaders tell us whether the video contained any content that is not appropriate for children as part of the upload process.  The vast majority of our users did exactly this and all was well, until at some point we realized that people were uploading some pretty serious pornography that we weren’t comfortable with even if it was marked as “adult” by the uploader.

Our panicked solution was to simply remove all videos marked as “adult” from the site, and prevent any further uploads where the videos were so-marked.

Of course you can predict the result: people immediately stopped marking videos as “adult”, making our task vastly more difficult.

The moral?  Don’t expect people to do something voluntarily if you are then going to use it against them.

I think Microsoft has just made exactly the same mistake.  Previously I think there was a reasonable chance that advertisers would choose to respect this, since only a minority of users are likely to enable it, and those are the people that really care about not being tracked.

But if it is enabled by default in Internet Explorer 10, advertisers now have no idea whether the user really cares about being tracked, and as a result they are far less likely to respect it.

Looking at it a different way, Microsoft just gave advertisers the perfect excuse to ignore DNT, because they can correctly claim that in most instances the user will have made no conscious decision to enable it.

Object Relational Mappers (ORMs) are a terrible idea

ORMs (like Hibernate in Java) are a flawed solution to an imagined problem that should be consigned to the wastebasket of history.

They are predicated on the idea that relational databases are old fashioned and should be avoided at all costs, but if you can’t avoid them, use some kind of convoluted wrapper that tries (and generally fails) to pretend that the relational database is something it isn’t – an object oriented datastore.

Often what people are really looking for when they use an ORM is some kind of database abstraction layer.  It is reasonable to abstract the database somehow, but I recommend not using an abstraction layer that seeks to pretend the database is something that it isn’t.

In Java, the best such database abstraction layer I’ve found is Jooq, I highly recommend it.

LastCalc: A powerful calculator meets Quora meets Siri

For the past month or so my main spare-time project has been a crazy idea called LastCalc (link at bottom).

I’ve been having trouble figuring out how to describe it, but here goes: Imagine a powerful web-based calculator that can answer your questions, a little like Google Calculator, Siri, or Wolfram Alpha, but where anyone can teach it how to calculate new things.

Additionally, rather than asking one question, you can ask a series of questions, each potentially referring to previous answers (programmers know this is a Read-Eval-Print-Loop or REPL).

Just like the others it supports basic math and unit conversions, like this (note: the highlighting is automatic and happens as you type – you type the bit before the big silver = and hit return, the answer appears after it):

But it goes a lot further. You can assign the result of a calculation to a variable, and then use it in subsequent calculations:

Internally LastCalc treats all numbers as rationals (x/y where x and y are integers) if possible, even if they are displayed as floating point numbers.  This means that it will not lose precision regardless of how many calculations you do (this can be a problem if using normal floating point numbers which are imprecise).

It’s not just simple numbers, LastCalc understands lists and associative arrays too, using a syntax very similar to JSON:

LastCalc is extensible, so if you find yourself repeating the same calculation over and over again, you can teach LastCalc how to do it (note: parameters are denoted by capitalization, like Prolog):

And it goes further, supporting pattern matching and recursion using these datastructures, just like languages like ML and Haskell:

Then use it with:

You can also pattern-match on maps.  Here I define a function that takes a map and returns a list of its keys:

Currently I’m working on a tutorial and help system so I don’t need to explain all of this before sending people to the site :-)

Right now you can only use functions that you define yourself, but in due course people will be able to share functions, much like they can share answers to questions with Quora.

So far it has only been tested in Chrome and Safari, and it definitely doesn’t work yet in Internet Explorer.  I’m waiting for the Javascript to stabilize before climbing that particular mountain.

Check it out at LastCalc.com.

It’s obviously a work in progress, if you’d like to follow discussion and provide me with feedback please join the LastCalc Google Group, or follow @LastCalc on Twitter.

The perils of treating a model’s predictions as actual probabilities

With our work on SenseArray we are in the fortunate position to apply machine learning to a wide variety of fascinating real-world problems, as diverse as predicting the likelihood that two people will have a successful date, the probability that someone will click on an ad, or the likelihood that a credit card transaction is fraudulent.

Sometimes all you care about is that predictions are correct relative to each-other. You’re looking for the best ad, or the best match on a dating website.

But in other situations you need an actual probability, not just a “score”. We quickly learned that it is dangerous, and frequently just plain wrong to assume that the output of your predictive model is a reliable probability.

As a specific example of this, imagine that you are using your predictive model to bid for the right to show someone an ad through an ad exchange. To do this you need to determine how much you expect to make if you show them the ad, and this depends on the probability of them click on it, and perhaps the probability of them making a purchase later.

In this situation it is critical that your probabilities are really probabilities or you’ll end up bidding too much, or not enough.

The first time we saw this we identified it as a form of “selection bias”, which is described in this paper (#1): Tweedie’s Formula and Selection Bias.

Essentially the paper states that if you are using predictions to pick things, perhaps the likelihood that someone will click on an ad, then the mean of the predictions that you pick will be higher than the actual rate at which people click on the ads. The reason for this, briefly, is that you are more likely to pick predictions with a positive error rather than a negative error, which creates an overall positive bias in the errors of what you’ve picked. Paper #1 does a better job of explaining.

Yesterday during a discussion on Reddit the aptly named Mr_Smartypants directed my attention to paper #2: Transforming Classifier Scores into Accurate Multiclass Probability Estimates, which describes another reason that you can’t treat the predictions of many learning algorithms as probabilities (see Figure 1 from the paper).

This particular problem doesn’t affect SenseArray because it operates in a fundamentally different way to most machine learning algorithms, but it does affect most other supervised learning systems that produce a probability as output.

This is fine if all you care about is that the ordering of the predictions is correct (ie. the relationship between the predictions and reality is isotonic), but if you want to treat them as accurate probabilities you’re in trouble.

Paper #2′s solution, or the one that most appealed to me, was to use a simple algorithm called “pair-adjacent violators” or PAV to determine a mapping from the predictions to the actual observed probabilities. PAV is a form of isotonic regression.  This seemed like it would be an improvement on our current approach. PAV is described in section 3 of paper #2, but the basic approach is this:

Create an list of pairs of predictions and the actual outcome, 1.0 or 0.0, sorted by prediction lowest to highest, like this:

(0.3, 0), (0.5, 1), (0.6, 0), (0.8, 1) …

Now, working through the list starting at the beginning, find a prediction pair where the outcome of the second is lower than the outcome of the first.  This is a “violator”, because we know that a higher prediction should mean a higher outcome.  In the brief example above the 2nd and 3rd prediction-outcome pairs are violators because while 0.6 is greater than 0.5, 0 is not greater than 1.

To resolve this, we remove this pair of violators, and replace them with a new pair consisting of the average (this isn’t quite correct, keep reading) of both the prediction and the outcome.  So now, our new list looks like this:

(0.3, 0), (0.55, 0.5), (0.8, 1) …

We carry on through the list doing this every time we find a pair of violators.  At the end of this process, we are guaranteed to have an isotonic mapping from predictions to outcomes.  Yay!

Actually, not so “yay”.  I decided to test the algorithm with this code:

 

The basic idea was for the “real life” probabilities to be the square of the prediction. So, if the prediction was 0.5, the real probability would be 0.25, and so-on.

Here I compare the curve determined by the PAV algorithm to what it should be:

As you can see, it is a terrible fit, completely useless for our purposes.  What went wrong? (edit: Turns out that paper #2′s description of the PAV algorithm was incorrect, my modification to it was actually the correct implementation).

After some thought I had an idea. Recall that the PAV algorithm works by “merging” pairs of samples repeatedly, using the average prediction and outcome of each to produce the replacement sample.

Using the average in this way assumes that both of the “parent” samples are of equal value, but are they? Well, no. One parent sample might have hundreds of “ancestor” samples, while another might not have any parents at all. PAV ignores this.

The result is that samples in parts of the curve that have a lot of these violations effectively get “watered down”, while samples in parts of the curve with few violations do not. The effect is to distort the shape of the curve, something that is very obvious from my experiment above.

The solution? We need to keep track of a “weight” for every sample, which basically counts the number of “original” samples that they are composed of.

Now, instead of doing a normal averaging of two parent samples, we do a weighted average, and the resulting child weight is the sum of the two parent’s weights.

The effect? Take a look:

Wow, looks like we have solved the problem! Frankly, I’m surprised that people haven’t noticed this flaw in the PAV algorithm in the past (edit: it wasn’t a flaw in the algorithm, it was a flaw in paper #2′s description of the algorithm). I assume that they were using it in situations where the violations were evenly distributed throughout the prediction curve, in which case the distortion may not be so pronounced.

I’ve emailed the authors of paper #2 and will update this post once I hear back from them.

Update: Ronny commented below directing me to another paper that describes the PAV algorithm, but it describes it with my “fix”. It appears that the description of the algorithm’s implementation in paper #2 was incorrect, and the correct implementation is the weighted approach I describe above. I suspected this might be the case, nice that it’s been confirmed.

Remembering Oliver Schmelzle

On Wednesday night my good friend Oliver Schmelzle passed away from a rare blood infection. He was only admitted to hospital earlier this week, and to my knowledge was perfectly healthy before this, so this was very sudden. Oliver was only 38 years old, and leaves behind his wife Lacey and young son Ryder.

Oliver was an uncommon combination of intelligence and genuine niceness. Once every month or two he and I might meet to chat about business, technology, where we grew up (Germany in Oliver’s case), and all manner of interesting stuff. Normally Oliver would suggest meeting for coffee, I’d suggest meeting for beers, and we’d end up going for beers.

Oliver would patiently listen as I bounced my various harebrained ideas off him, and would always give me valuable and insightful feedback. He would often surprise me with his deep understanding of both business and technology.

My wife Janie would sometimes come out with us. At one point Janie was managing a small team of programmers, but her real aspiration was to become a Product Manager, which was Oliver’s role at that time. Oliver was more than happy to answer her questions and provide advice.

In recent weeks I had seen more of Oliver than usual. He had started working at a new company, Vast, and recognized that my company might be able to help them. I had a meeting there on Wednesday, and had hoped to see Oliver there, but his colleagues said he was out sick.

Obviously nobody realized how serious it was at that time, his colleagues expected him to be back in the office in a day or two. We joked that once he got back we would tell Oliver that we’d given up on online advertising and decided to use billboards instead, knowing that he would think this was idiotic. Of course being the nice guy that he is, Oliver wouldn’t have said so, he would have listened to our plan and politely tried to persuade us not to do it.

I will sincerely miss my conversations with Oliver, and the opportunity to work with him. Janie’s and my deepest sympathies are with Lacey and Ryder.

Oliver’s remembrance services will be on October 21st at 5pm at the Weed-Corley-Fish Funeral Home, Lamar location, and at 11am on October 22nd at Tarrytown United Methodist Church.

update: Oliver’s obituary in the Statesman newspaper is now available here.

update 2: Some other friends and colleagues have written about their memories of Oliver, read them here and here.

Bitcoin value versus search volume

I overlayed a graph of the Bitcoin-USD exchange rate (red/green/black), with a graph of search volume for the phrase “bitcoin” (blue), between January and September 2011:

A striking correlation, no?

In fact, its what you’d expect, surprisingly so in fact. Since the supply of new Bitcoins is regulated such that it is essentially constant, you’d expect the value of a Bitcoin to grown and shrink in proportion to the rate at which people are seeking to acquire Bitcoins.

Proportionate A/B testing

More than once I’ve seen people ask questions like “In A/B Testing, how long should you wait before knowing which option is the best?”.

We’ve found that the best solution is to avoid a binary question of whether or not to use a variation. Instead, randomly select the variation to present to a user in proportion to the probability that this variation is the best based on the data (if any) you have so-far.

How? The key is the beta distribution. Let’s say you have a variation with 30 impressions, and 5 conversions. A beta distribution with (alpha=5, beta=30-5) describes the probability distribution for the actual conversion rate. It looks like this:

A graph showing a beta distribution

This shows that while the most likely conversion rate is about 1 in 6 (approx 0.16), the curve is relatively broad indicating that there is still a lot of uncertainty about what the conversion rate will be.

Let’s try the same for 50 conversions and 300 impressions (alpha=50, beta=300-50)”):

You’ll see that while the curve’s peak remains the same, the curve gets a lot narrower, meaning that we’ve got a lot more certainty about what the actual conversion rate is – as you would expect given 10X the data.

Let’s say we have 5 variations, each with various numbers of impressions and conversions. A new user arrives at our website, and we want to decide which variation we show them.

To do this we employ a random number generator, which will pick random numbers according to a beta distribution we provide to it. You can find open source implementations of such a random number generator in most programming languages, here is one for Java.

So we go through each of our variations, and pick a random number within the beta distribution we’ve calculated for that variation. Whichever variation gets the highest random number is the one we show.

The beauty of this approach is that it achieves a really nice, perhaps optimal compromise between sending traffic to new variations to test them, and sending traffic to variations that we know to be good. If a variation doesn’t perform well this algorithm will gradually give it less and less traffic, until eventually it’s getting none. Then we can remove it secure in the knowledge that we aren’t removing it prematurely, no need to set arbitrary significance thresholds.

This approach is easily extended to situations where rather than a simple impression-conversion funnel, we have funnels with multiple steps.

One question is, before you’ve collected any data about a particular variation, what should you “initialize” the beta distribution with. The default answer is (1, 1), since you can’t start with (0, 0). This effectively starts with a “prior expectation” of a 50% conversion rate, but as you collect data this will rapidly converge on reality.

Nonetheless, we can do better. Let’s say that we know that variations tend to have a 1% conversion rate, so you could start with (1,99).

If you really want to take this to an extreme (which is what we do in our software!), let’s say you have an idea of the normal distribution of the conversion rates, let’s say its 1% with a standard deviation of 0.5%.

Note that starting points of (1,99), or (2,198), or (3,297) will all give you a starting mean of 1%, but the higher the numbers, the longer they’ll take to converge away from the mean. If you plug these into Wolfram Alpha (“beta distribution (3,297)”) it will show you the standard deviation for each of them. (1,99) is 0.0099, (2,198) is 0.007, (3, 297) is 0.00574, (4, 396) is 0.005 and so on.

So, since we expect the standard deviation of the actual conversion rates to be 0.5% or 0.005, we know that starting with (4, 396) is about right.

You could find a smart way to find the starting beta parameters with the desired standard deviation, but it’s easier and effective to just do it experimentally as I did.

Note that while we developed this technique independently, we later discovered that it is known as “Thompson sampling” and was originally developed in the 1930s.