Author Archives: ian

The perils of treating a model’s predictions as actual probabilities

With our work on SenseArray we are in the fortunate position to apply machine learning to a wide variety of fascinating real-world problems, as diverse as predicting the likelihood that two people will have a successful date, the probability that someone will click on an ad, or the likelihood that a credit card transaction is fraudulent.

Sometimes all you care about is that predictions are correct relative to each other. You’re looking for the best ad, or the best match on a dating website.

But in other situations you need an actual probability, not just a “score”. We quickly learned that it is dangerous, and frequently just plain wrong to assume that the output of your predictive model is a reliable probability.

As a specific example of this, imagine that you are using your predictive model to bid for the right to show someone an ad through an ad exchange. To do this you need to determine how much you expect to make if you show them the ad, and this depends on the probability of them clicking on it, and perhaps the probability of them making a purchase later.

In this situation it is critical that your probabilities are really probabilities or you’ll end up bidding too much, or not enough.

The first time we saw this we identified it as a form of “selection bias”, which is described in this paper (#1): Tweedie’s Formula and Selection Bias.

Essentially the paper states that if you are using predictions to pick things, perhaps the likelihood that someone will click on an ad, then the mean of the predictions that you pick will be higher than the actual rate at which people click on the ads. The reason for this, briefly, is that you are more likely to pick predictions with a positive error rather than a negative error, which creates an overall positive bias in the errors of what you’ve picked. Paper #1 does a better job of explaining.
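A quick way to see this effect is with a small simulation. The numbers below (20 candidate ads per auction, the range of true click-through rates, the amount of noise in the predictions) are made up purely for illustration:

```python
import random

random.seed(0)

# Each ad has a true click-through rate, but we only see a noisy prediction
# of it.  In every "auction" we pick the ad whose prediction is highest.
trials = 10000
picked_predictions, picked_true_rates = [], []
for _ in range(trials):
    ads = []
    for _ in range(20):  # 20 candidate ads per auction (arbitrary)
        true_rate = random.uniform(0.01, 0.05)
        prediction = true_rate + random.gauss(0, 0.01)  # unbiased but noisy
        ads.append((prediction, true_rate))
    best_prediction, best_true_rate = max(ads)  # highest prediction wins
    picked_predictions.append(best_prediction)
    picked_true_rates.append(best_true_rate)

print(sum(picked_predictions) / trials)  # mean prediction of the ads we picked
print(sum(picked_true_rates) / trials)   # mean true rate of those same ads
```

Even though each individual prediction is unbiased, the first number comes out noticeably higher than the second, because picking the maximum systematically selects predictions with a positive error.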

Yesterday during a discussion on Reddit the aptly named Mr_Smartypants directed my attention to paper #2: Transforming Classifier Scores into Accurate Multiclass Probability Estimates, which describes another reason that you can’t treat the predictions of many learning algorithms as probabilities (see Figure 1 from the paper).

This particular problem doesn’t affect SenseArray because it operates in a fundamentally different way to most machine learning algorithms, but it does affect most other supervised learning systems that produce a probability as output.

This is fine if all you care about is that the ordering of the predictions is correct (i.e. the relationship between the predictions and reality is isotonic), but if you want to treat them as accurate probabilities you’re in trouble.

Paper #2’s solution, or the one that most appealed to me, was to use a simple algorithm called “pair-adjacent violators” or PAV to determine a mapping from the predictions to the actual observed probabilities. PAV is a form of isotonic regression.  This seemed like it would be an improvement on our current approach. PAV is described in section 3 of paper #2, but the basic approach is this:

Create a list of pairs of predictions and actual outcomes (1.0 or 0.0), sorted by prediction from lowest to highest, like this:

(0.3, 0), (0.5, 1), (0.6, 0), (0.8, 1) …

Now, working through the list starting at the beginning, find two adjacent prediction-outcome pairs where the outcome of the second is lower than the outcome of the first.  These are “violators”, because we know that a higher prediction should mean a higher outcome.  In the brief example above the 2nd and 3rd prediction-outcome pairs are violators: the prediction rises from 0.5 to 0.6, but the outcome falls from 1 to 0.

To resolve this, we remove this pair of violators, and replace them with a single new pair consisting of the average (this isn’t quite correct, keep reading) of both the prediction and the outcome.  So now, our new list looks like this:

(0.3, 0), (0.55, 0.5), (0.8, 1) …

We carry on through the list doing this every time we find a pair of violators.  At the end of this process, we are guaranteed to have an isotonic mapping from predictions to outcomes.  Yay!
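Here is a minimal Python sketch of the procedure described above, using plain (unweighted) averages exactly as paper #2 describes it; the function name and data representation are just illustrative:

```python
def pav_unweighted(pairs):
    """Pair-adjacent violators with plain averages, as described above.

    pairs: list of (prediction, outcome) tuples, sorted by prediction.
    Returns an isotonic list of (prediction, outcome) pairs.
    """
    points = list(pairs)
    merged = True
    while merged:              # keep sweeping until no violators remain
        merged = False
        i = 0
        while i < len(points) - 1:
            (p1, o1), (p2, o2) = points[i], points[i + 1]
            if o2 < o1:        # violation: higher prediction, lower outcome
                # replace the two violators with a single averaged pair
                points[i:i + 2] = [((p1 + p2) / 2.0, (o1 + o2) / 2.0)]
                merged = True
            else:
                i += 1
    return points

print(pav_unweighted([(0.3, 0), (0.5, 1), (0.6, 0), (0.8, 1)]))
# -> [(0.3, 0), (0.55, 0.5), (0.8, 1)], matching the worked example above
```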

Actually, not so “yay”.  I decided to test the algorithm; a sketch of the test code appears below.


The basic idea was for the “real life” probabilities to be the square of the prediction. So, if the prediction was 0.5, the real probability would be 0.25, and so on.
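The original embedded snippet isn’t reproduced here, but a test along those lines might look something like the following sketch, which reuses the pav_unweighted function from above (the sample count and seed are arbitrary):

```python
import random

random.seed(42)

# Synthetic data: the "real life" probability of a positive outcome
# is the square of the prediction.
samples = []
for _ in range(5000):
    prediction = random.random()
    outcome = 1.0 if random.random() < prediction ** 2 else 0.0
    samples.append((prediction, outcome))
samples.sort()  # PAV requires the pairs to be sorted by prediction

calibration = pav_unweighted(samples)

# Compare the curve PAV produces against the true curve (prediction squared)
step = max(1, len(calibration) // 10)
for prediction, estimated in calibration[::step]:
    print(f"prediction={prediction:.2f}  PAV={estimated:.3f}  truth={prediction ** 2:.3f}")
```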

Here I compare the curve determined by the PAV algorithm to what it should be:

As you can see, it is a terrible fit, completely useless for our purposes.  What went wrong? (edit: Turns out that paper #2’s description of the PAV algorithm was incorrect; my modification to it was actually the correct implementation).

After some thought I had an idea. Recall that the PAV algorithm works by “merging” pairs of samples repeatedly, using the average prediction and outcome of each to produce the replacement sample.

Using the average in this way assumes that both of the “parent” samples are of equal value, but are they? Well, no. One parent sample might have hundreds of “ancestor” samples, while another might have none at all. PAV ignores this.

The result is that samples in parts of the curve that have a lot of these violations effectively get “watered down”, while samples in parts of the curve with few violations do not. The effect is to distort the shape of the curve, something that is very obvious from my experiment above.

The solution? We need to keep track of a “weight” for every sample, which counts the number of “original” samples it is composed of.

Now, instead of doing a normal averaging of two parent samples, we do a weighted average, and the resulting child’s weight is the sum of the two parents’ weights.
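Here is the earlier sketch with that change applied: every point carries a weight (the number of original samples it represents), and merging uses a weighted average:

```python
def pav_weighted(pairs):
    """Pair-adjacent violators where each point carries a weight equal to
    the number of original samples it represents."""
    # (prediction, outcome, weight) triples; every original sample starts at weight 1
    points = [(p, o, 1.0) for p, o in pairs]
    merged = True
    while merged:
        merged = False
        i = 0
        while i < len(points) - 1:
            p1, o1, w1 = points[i]
            p2, o2, w2 = points[i + 1]
            if o2 < o1:                        # violation
                w = w1 + w2                    # child weight = sum of parents' weights
                points[i:i + 2] = [(
                    (p1 * w1 + p2 * w2) / w,   # weighted average prediction
                    (o1 * w1 + o2 * w2) / w,   # weighted average outcome
                    w,
                )]
                merged = True
            else:
                i += 1
    return [(p, o) for p, o, _ in points]
```

Re-running the test sketch above with pav_weighted in place of pav_unweighted should produce a curve that tracks the true prediction-squared relationship far more closely.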

The effect? Take a look:

Wow, looks like we have solved the problem! Frankly, I’m surprised that people haven’t noticed this flaw in the PAV algorithm in the past (edit: it wasn’t a flaw in the algorithm, it was a flaw in paper #2’s description of the algorithm). I assume that they were using it in situations where the violations were evenly distributed throughout the prediction curve, in which case the distortion may not be so pronounced.

I’ve emailed the authors of paper #2 and will update this post once I hear back from them.

Update: Ronny commented below directing me to another paper that describes the PAV algorithm, but describes it with my “fix”. It appears that the description of the algorithm’s implementation in paper #2 was incorrect, and that the correct implementation is the weighted approach I describe above. I suspected this might be the case; it’s nice that it’s been confirmed.

Remembering Oliver Schmelzle

On Wednesday night my good friend Oliver Schmelzle passed away from a rare blood infection. He was only admitted to hospital earlier this week, and to my knowledge was perfectly healthy before this, so this was very sudden. Oliver was only 38 years old, and leaves behind his wife Lacey and young son Ryder.

Oliver was an uncommon combination of intelligence and genuine niceness. Once every month or two he and I might meet to chat about business, technology, where we grew up (Germany in Oliver’s case), and all manner of interesting stuff. Normally Oliver would suggest meeting for coffee, I’d suggest meeting for beers, and we’d end up going for beers.

Oliver would patiently listen as I bounced my various harebrained ideas off him, and would always give me valuable and insightful feedback. He would often surprise me with his deep understanding of both business and technology.

My wife Janie would sometimes come out with us. At one point Janie was managing a small team of programmers, but her real aspiration was to become a Product Manager, which was Oliver’s role at that time. Oliver was more than happy to answer her questions and provide advice.

In recent weeks I had seen more of Oliver than usual. He had started working at a new company, Vast, and recognized that my company might be able to help them. I had a meeting there on Wednesday, and had hoped to see Oliver there, but his colleagues said he was out sick.

Obviously nobody realized how serious it was at that time, his colleagues expected him to be back in the office in a day or two. We joked that once he got back we would tell Oliver that we’d given up on online advertising and decided to use billboards instead, knowing that he would think this was idiotic. Of course being the nice guy that he is, Oliver wouldn’t have said so, he would have listened to our plan and politely tried to persuade us not to do it.

I will sincerely miss my conversations with Oliver, and the opportunity to work with him. Janie’s and my deepest sympathies are with Lacey and Ryder.

Oliver’s remembrance services will be on October 21st at 5pm at the Weed-Corley-Fish Funeral Home, Lamar location, and at 11am on October 22nd at Tarrytown United Methodist Church.

update: Oliver’s obituary in the Statesman newspaper is now available here.

update 2: Some other friends and colleagues have written about their memories of Oliver, read them here and here.

Bitcoin value versus search volume

I overlaid a graph of the Bitcoin-USD exchange rate (red/green/black) with a graph of search volume for the phrase “bitcoin” (blue), between January and September 2011:

A striking correlation, no?

In fact, it’s what you’d expect, surprisingly so. Since the supply of new Bitcoins is regulated such that it is essentially constant, you’d expect the value of a Bitcoin to grow and shrink in proportion to the rate at which people are seeking to acquire Bitcoins.

Proportionate A/B testing

More than once I’ve seen people ask questions like “In A/B Testing, how long should you wait before knowing which option is the best?”.

I’ve found that the best solution is to avoid a binary question of whether or not to use a variation. Instead, randomly select the variation to present to a user in proportion to the probability that this variation is the best, based on the data (if any) you have so far.

How? The key is the beta distribution. Let’s say you have a variation with 30 impressions, and 5 conversions. A beta distribution with (alpha=5, beta=30-5) describes the probability distribution for the actual conversion rate. It looks like this:

A graph showing a beta distribution

This shows that while the most likely conversion rate is about 1 in 6 (approx 0.16), the curve is relatively broad indicating that there is still a lot of uncertainty about what the conversion rate will be.

Let’s try the same for 50 conversions and 300 impressions (alpha=50, beta=300-50):

You’ll see that while the curve’s peak remains the same, the curve gets a lot narrower, meaning that we’ve got a lot more certainty about what the actual conversion rate is – as you would expect given 10X the data.
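You can check this numerically with the standard mean and standard deviation formulas for a beta distribution (this little helper isn’t part of any particular library, just the textbook formulas):

```python
def beta_mean_sd(alpha, beta):
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    mean = alpha / (alpha + beta)
    variance = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    return mean, variance ** 0.5

print(beta_mean_sd(5, 30 - 5))     # 5 conversions out of 30 impressions
print(beta_mean_sd(50, 300 - 50))  # 50 conversions out of 300 impressions
```

Both print a mean of about 0.17, while the standard deviation drops from roughly 0.067 to 0.021.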

Let’s say we have 5 variations, each with various numbers of impressions and conversions. A new user arrives at our website, and we want to decide which variation we show them.

To do this we employ a random number generator, which will pick random numbers according to a beta distribution we provide to it. You can find open source implementations of such a random number generator in most programming languages; here is one for Java.

So we go through each of our variations, and pick a random number within the beta distribution we’ve calculated for that variation. Whichever variation gets the highest random number is the one we show.
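A minimal sketch of that selection step in Python, using the standard library’s beta sampler (the variation counts below are made up for illustration):

```python
import random

# impressions and conversions per variation -- illustrative numbers only
variations = {
    "A": (300, 50),
    "B": (120, 14),
    "C": (40, 3),
}

def pick_variation(variations):
    """Draw one random number from each variation's beta distribution and
    return the variation whose draw is highest."""
    best_name, best_draw = None, -1.0
    for name, (impressions, conversions) in variations.items():
        # alpha = conversions, beta = impressions - conversions, as above
        # (both must be positive; what to do before any data arrives is covered below)
        draw = random.betavariate(conversions, impressions - conversions)
        if draw > best_draw:
            best_name, best_draw = name, draw
    return best_name

print(pick_variation(variations))  # the variation to show this user
```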

The beauty of this approach is that it achieves a really nice, perhaps optimal, compromise between sending traffic to new variations to test them, and sending traffic to variations that we know to be good. If a variation doesn’t perform well this algorithm will gradually give it less and less traffic, until eventually it’s getting none. Then we can remove it, secure in the knowledge that we aren’t removing it prematurely; there is no need to set arbitrary significance thresholds.

This approach is easily extended to situations where rather than a simple impression-conversion funnel, we have funnels with multiple steps.

One question is: before you’ve collected any data about a particular variation, what should you “initialize” the beta distribution with? The default answer is (1, 1), since you can’t start with (0, 0). This effectively starts with a “prior expectation” of a 50% conversion rate, but as you collect data this will rapidly converge on reality.

Nonetheless, we can do better. Let’s say that we know that variations tend to have a 1% conversion rate, so you could start with (1,99).

If you really want to take this to an extreme (which is what we do in our software!), let’s say you have an idea of the normal distribution of the conversion rates; say it’s 1% with a standard deviation of 0.5%.

Note that starting points of (1,99), or (2,198), or (3,297) will all give you a starting mean of 1%, but the higher the numbers, the longer they’ll take to converge away from the mean. If you plug these into Wolfram Alpha (“beta distribution (3,297)”) it will show you the standard deviation for each of them: (1,99) is 0.0099, (2,198) is 0.007, (3,297) is 0.00574, (4,396) is 0.005, and so on.

So, since we expect the standard deviation of the actual conversion rates to be 0.5% or 0.005, we know that starting with (4, 396) is about right.

You could find a smarter way to derive the starting beta parameters from the desired standard deviation, but it’s easier, and effective, to just do it experimentally as I did.
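For example, here is a sketch of that experimental search, reusing the beta_mean_sd helper from the earlier sketch and assuming the 1% mean and 0.5% standard deviation from above:

```python
target_mean, target_sd = 0.01, 0.005

best = None
for a in range(1, 50):
    b = round(a * (1 - target_mean) / target_mean)  # keeps the mean at exactly 1%
    _, sd = beta_mean_sd(a, b)
    if best is None or abs(sd - target_sd) < abs(best[2] - target_sd):
        best = (a, b, sd)

print(best)  # -> (4, 396, 0.00497...), i.e. the (4, 396) starting point above
```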

Note that while I discovered this technique independently, I later learned that it is known as “Thompson sampling” and was originally developed in the 1930s (although to my knowledge this is the first time it has been applied to A/B testing).

Comparison of RealFlight versus Phoenix radio controlled flightsims

There are two main R/C flight simulators on the market, RealFlight and Phoenix.  I purchased RealFlight over a year ago, and recently purchased Phoenix, so I’m in a good position to compare them.

Both Phoenix and RealFlight are Windows software, although I use both of them on a Mac using Parallels.  This works reasonably well provided that I’m careful to avoid running any other CPU intensive software at the same time.  Even then both software packages exhibit occasional slow-downs during which the visuals and audio can “stutter”.  Of the two, Phoenix seems less vulnerable to this.

RealFlight comes with its own dedicated controller, whereas Phoenix comes with a cable that will connect to most common controllers including Spektrum.  If your controller isn’t supported, you can buy a cable that will work with other controller types.

I prefer the Phoenix approach as it lets you fly with the same controller that you fly your aircraft with, which helps you get accustomed to the feel of it.

Phoenix also comes with a better selection of models, including popular brands like eFlite and Align T-Rex.  More recent versions of RealFlight may now come with a better selection, but when I bought it (version 4.5) I didn’t recognize any of the brands it supported out of the box.

One thing RealFlight can do that Phoenix cannot is to offer a wider variety of viewing angles, in particular a “cockpit” view.  This is probably not especially useful for most people, unless you need to practice FPV flying (something I could see myself getting into sooner or later).

And price?  At the time of writing RealFlight is $199.98 and Phoenix is $129.99.  The price difference may be partially explained by the need for RealFlight to bundle its own controller.  Still, I think Phoenix wins on pricing.

Conclusion?  Phoenix is the better option in almost every way.  Click HERE to find it on Amazon.

Bitcoin’s Most Serious Challenge Yet

MtGox, the most popular way to convert US dollars to and from Bitcoins, has just been hacked, resulting in an immediate market crash and the leak of usernames, email addresses, and information that can be used to determine people’s passwords (but not the passwords themselves).

It appears that a hacker gained access to an MtGox account holding a very large number of coins.  The hacker sold these coins, and took advantage of the resultant market crash to buy bitcoins very cheaply.  It is likely that the hacker was able to withdraw thousands of dollars’ worth of these bitcoins.

This is likely to be a fatal blow to MtGox, who some estimate were making $2m/year in revenue from transaction fees. An exchange relies on people entrusting them with money and bitcoins, and it is hard to see that trust surviving this incident.

MtGox have said that they will roll-back transactions from when the incident began, but it seems unlikely they’ll be able to put the toothpaste back in the tube completely, which may result in a dramatic and lasting drop in value for Bitcoins.

While the security principles behind Bitcoin itself appear to be sound, there have been repeated security issues with the various tools and services around Bitcoin.  For example, the official Bitcoin client does not yet encrypt the user’s wallet, meaning that anyone that can access this file can effectively steal that user’s entire balance in a relatively untraceable way, given simple precautions.

However, this incident is perhaps the most serious.  MtGox is probably the most popular mechanism to both purchase and sell Bitcoins, and its credibility is now in ruins.

It isn’t necessarily the case that this will destroy Bitcoins themselves.  It will, however, demand dramatically better security for the various tools and services that grew up while Bitcoins remained an obscure pursuit of enthusiasts.

The list of accounts and their email addresses and password hashes can be found on Freenet at CHK@nQPmGQwCzInR1hYef3I4SYYfT3yfkBobBu0hiwOOmLw,72t6NbXIUnKDELYdFP8Y6LuAe-A6-0yiwnlKAdkyEN8,AAIC--8/mtgox-accounts.csv.gz (this link will only work if Freenet is installed and running).

Would you keep $500k of untraceable cash in your bedroom?

Probably not, but reportedly a user of Bitcoin kept about half a million dollars worth of the new decentralized cryptographic currency on their Windows laptop, and somebody stole it.

Misappropriated Bitcoins are, by design, difficult to trace, and with appropriate precautions, almost impossible.

To steal your Bitcoins, all someone needs is access to your “Bitcoin wallet”, a small file that, by default, is stored unprotected on your hard disk by the official Bitcoin software.  Having a backup of your wallet doesn’t help; anyone who can read your wallet can empty it.  They don’t even need to modify your wallet file to do this.

If someone gains access to your wallet, your only defense is to empty it before they do.

Even simple precautions, like storing your Bitcoin wallet in an encrypted disk, will be scant defense against someone who can gain physical or digital access to your computer (as they can use a keylogger to discover your passwords). Worse, with the large dollar values we’re talking about, extortion also becomes a real threat.

Indeed, the ease with which someone can steal something so valuable, with so little threat of getting caught, is almost unmatched. The very things that make Bitcoin such a powerful concept are the very things that make it a tempting target for smart thieves.

Additionally, as the value of Bitcoins has skyrocketed since the online currency’s initial creation 2 years ago, many early adopters now own hundreds of thousands, even millions of dollars worth of Bitcoins. Many of these people probably have nothing like the kind of protection that would be employed to protect any other commodity of this value.

At this point it is difficult to know what to do, except perhaps rely on safety in numbers.

So if you are one of the “Bitcoin wealthy”, don’t tell ANYONE!

p.s. Oh and unfortunately for me I’m not one of those people, honest!

A (possibly) novel approach to image rescaling

We’ve all seen fictional examples of increasing the resolution of images to reveal previously unseen detail; here is a reminder:

Unfortunately, all of these examples are basically impossible, because they imply revealing information in the image that simply isn’t there.

Turns out that there is a way to do exactly this, described by this paper: Scene Completion Using Millions of Photographs.

So, anyone want to try applying this approach to “filling in” the missing pixels in an image that has been scaled up? The results would be really interesting.