The perils of treating a model’s predictions as actual probabilities

With our work on SenseArray we are in the fortunate position of applying machine learning to a wide variety of fascinating real-world problems, as diverse as predicting the likelihood that two people will have a successful date, the probability that someone will click on an ad, or the likelihood that a credit card transaction is fraudulent.

Sometimes all you care about is that predictions are correct relative to each other. You’re looking for the best ad, or the best match on a dating website.

But in other situations you need an actual probability, not just a “score”. We quickly learned that it is dangerous, and frequently just plain wrong, to assume that the output of your predictive model is a reliable probability.

As a specific example, imagine that you are using your predictive model to bid for the right to show someone an ad through an ad exchange. To do this you need to determine how much you expect to make if you show them the ad, which depends on the probability of them clicking on it, and perhaps the probability of them making a purchase later.

In this situation it is critical that your probabilities are real probabilities, or you’ll end up bidding too much or too little.
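To make that concrete, here is a toy illustration (all numbers invented) of how an inflated “probability” feeds straight through into an inflated bid:

```python
# Toy example: how a miscalibrated "probability" distorts a bid.
# The numbers are illustrative, not from any real campaign.
value_per_click = 0.50   # dollars earned if the user clicks
true_p_click = 0.02      # the rate at which clicks actually happen
model_score = 0.05       # model output treated, wrongly, as a probability

fair_bid = true_p_click * value_per_click   # $0.0100 per impression
naive_bid = model_score * value_per_click   # $0.0250, overbidding 2.5x

print(f"fair bid: ${fair_bid:.4f}, naive bid: ${naive_bid:.4f}")
```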

The first time we saw this we identified it as a form of “selection bias”, which is described in this paper (#1): Tweedie’s Formula and Selection Bias.

Essentially, the paper states that if you are using predictions to pick things, perhaps the people most likely to click on an ad, then the mean of the predictions you pick will be higher than the actual rate at which people click on the ads. The reason, briefly, is that you are more likely to pick predictions with a positive error than a negative error, which creates an overall positive bias in the errors of what you’ve picked. Paper #1 does a better job of explaining this.
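You can see the effect in a quick simulation (a sketch with invented noise levels, not data from any real system): give each candidate ad an unbiased but noisy prediction of its true click probability, always pick the highest prediction, and compare the mean prediction of the winners with their mean true rate.

```python
import random

# Selection bias in a nutshell: when you pick the highest of many noisy,
# individually unbiased estimates, the winner tends to have won because
# of positive noise, so its prediction overstates its true rate.
random.seed(42)

trials = 10_000
sum_pred = 0.0
sum_true = 0.0

for _ in range(trials):
    # 20 candidate ads, each with a true click probability and a noisy,
    # unbiased prediction of it
    ads = []
    for _ in range(20):
        p_true = random.uniform(0.01, 0.05)
        p_pred = p_true + random.gauss(0.0, 0.01)
        ads.append((p_pred, p_true))
    best_pred, best_true = max(ads)  # pick the ad with the highest prediction
    sum_pred += best_pred
    sum_true += best_true

print(f"mean prediction of picked ads: {sum_pred / trials:.4f}")
print(f"mean true rate of picked ads:  {sum_true / trials:.4f}")
# The first number comes out noticeably higher than the second, even
# though no individual prediction was biased.
```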

Yesterday, during a discussion on Reddit, the aptly named Mr_Smartypants directed my attention to paper #2: Transforming Classifier Scores into Accurate Multiclass Probability Estimates, which describes another reason that you can’t treat the predictions of many learning algorithms as probabilities (see Figure 1 from the paper).

This particular problem doesn’t affect SenseArray because it operates in a fundamentally different way to most machine learning algorithms, but it does affect most other supervised learning systems that produce a probability as output.

This is fine if all you care about is that the ordering of the predictions is correct (i.e. the relationship between the predictions and reality is isotonic), but if you want to treat them as accurate probabilities you’re in trouble.

Paper #2’s solution, or the one that most appealed to me, was to use a simple algorithm called “pair-adjacent violators”, or PAV, to determine a mapping from the predictions to the actual observed probabilities. PAV is a form of isotonic regression. This seemed like it would be an improvement on our current approach. PAV is described in section 3 of paper #2, but the basic approach is this:

Create a list of pairs of predictions and actual outcomes (1.0 or 0.0), sorted by prediction from lowest to highest, like this:

(0.3, 0), (0.5, 1), (0.6, 0), (0.8, 1) …

Now, working through the list starting at the beginning, find a pair of predictions where the outcome of the second is lower than the outcome of the first. This is a “violator”, because we know that a higher prediction should mean a higher outcome. In the brief example above, the 2nd and 3rd prediction-outcome pairs are violators: although 0.6 is greater than 0.5, its outcome of 0 is lower than 1.

To resolve this, we remove the pair of violators and replace them with a single new pair consisting of the average (this isn’t quite correct, keep reading) of their predictions and of their outcomes. So now our new list looks like this:

(0.3, 0), (0.55, 0.5), (0.8, 1) …

We carry on through the list, doing this every time we find a pair of violators. At the end of this process we are guaranteed to have an isotonic mapping from predictions to outcomes. Yay!
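In code, the procedure as just described looks roughly like this (a minimal sketch of the unweighted version; the step backwards after a merge is a practical necessity, because pooling can create a new violation against the left neighbour):

```python
def pav_unweighted(pairs):
    """Pair-adjacent violators as paper #2 describes it: replace each
    adjacent violating pair with the plain average of its two members.
    `pairs` is a list of (prediction, outcome) sorted by prediction."""
    result = list(pairs)
    i = 0
    while i < len(result) - 1:
        (p1, o1), (p2, o2) = result[i], result[i + 1]
        if o2 < o1:  # violator: higher prediction, lower outcome
            result[i:i + 2] = [((p1 + p2) / 2, (o1 + o2) / 2)]
            i = max(i - 1, 0)  # merging can create a new violation to the left
        else:
            i += 1
    return result

print(pav_unweighted([(0.3, 0), (0.5, 1), (0.6, 0), (0.8, 1)]))
# -> [(0.3, 0), (0.55, 0.5), (0.8, 1)]
```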

Actually, not so “yay”. I decided to test the algorithm with a quick simulation.

The basic idea was for the “real life” probabilities to be the square of the prediction: if the prediction was 0.5, the real probability would be 0.25, and so on.
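A minimal sketch of such a test, reusing pav_unweighted from above (the names and sample counts are illustrative):

```python
import random

random.seed(7)

# Simulated "real life": the true probability of a positive outcome is
# the square of the prediction, so a prediction of 0.5 means a 0.25
# chance of a 1.
samples = sorted(
    (p, 1 if random.random() < p * p else 0)
    for p in (random.random() for _ in range(5_000))
)

calibrated = pav_unweighted(samples)  # the sketch from above

# Spot-check the recovered mapping against the known truth, prediction**2.
step = max(1, len(calibrated) // 10)
for pred, rate in calibrated[::step]:
    print(f"prediction {pred:.2f} -> PAV estimate {rate:.3f}, "
          f"truth {pred * pred:.3f}")
```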

Here I compare the curve determined by the PAV algorithm to what it should be:

As you can see, it is a terrible fit, completely useless for our purposes. What went wrong? (Edit: it turns out that paper #2’s description of the PAV algorithm was incorrect; my modification to it, described below, is actually the correct implementation.)

After some thought I had an idea. Recall that the PAV algorithm works by repeatedly “merging” pairs of samples, using the average prediction and outcome of each pair to produce the replacement sample.

Using a plain average in this way assumes that both “parent” samples are of equal value, but are they? Well, no. One parent might itself be the result of merging hundreds of “ancestor” samples, while the other might be a single original sample with no ancestors at all. PAV, as described in paper #2, ignores this.

The result is that samples in parts of the curve with many violations effectively get “watered down”, while samples in parts of the curve with few violations do not. The effect is to distort the shape of the curve, something that is very obvious in my experiment above.

The solution? We need to keep track of a “weight” for every sample, which counts the number of original samples it is composed of.

Now, instead of taking a plain average of two parent samples, we take a weighted average, and the resulting child’s weight is the sum of the two parents’ weights.
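Concretely, only the merge step changes (again a sketch, mirroring the unweighted version above):

```python
def pav_weighted(pairs):
    """PAV with weights: each pooled sample remembers how many original
    samples it contains, and merges use a weighted average.
    `pairs` is a list of (prediction, outcome) sorted by prediction."""
    # (prediction, outcome, weight); every original sample starts at weight 1
    blocks = [(p, o, 1) for p, o in pairs]
    i = 0
    while i < len(blocks) - 1:
        p1, o1, w1 = blocks[i]
        p2, o2, w2 = blocks[i + 1]
        if o2 < o1:  # violator
            w = w1 + w2
            blocks[i:i + 2] = [((p1 * w1 + p2 * w2) / w,
                                (o1 * w1 + o2 * w2) / w, w)]
            i = max(i - 1, 0)  # re-check against the new left neighbour
        else:
            i += 1
    return [(p, o) for p, o, _ in blocks]

print(pav_weighted([(0.3, 0), (0.5, 1), (0.6, 0), (0.8, 1)]))
# -> [(0.3, 0), (0.55, 0.5), (0.8, 1)]: identical to before on this tiny
# example, because both samples still have weight 1 when they merge.
```

Substituting pav_weighted for pav_unweighted in the prediction-squared test above is the only change needed.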

The effect? Take a look:

Wow, it looks like we have solved the problem! Frankly, I’m surprised that people haven’t noticed this flaw in the PAV algorithm before (edit: it wasn’t a flaw in the algorithm, it was a flaw in paper #2’s description of the algorithm). I assume they were using it in situations where the violations were evenly distributed throughout the prediction curve, in which case the distortion may not be so pronounced.

I’ve emailed the authors of paper #2 and will update this post once I hear back from them.

Update: Ronny commented below, directing me to another paper that describes the PAV algorithm, but that paper describes it with my “fix”. It appears that the description of the algorithm’s implementation in paper #2 was incorrect, and the correct implementation is the weighted approach I describe above. I suspected this might be the case; it’s nice that it’s been confirmed.

Remembering Oliver Schmelzle

On Wednesday night my good friend Oliver Schmelzle passed away from a rare blood infection. He had only been admitted to hospital earlier in the week, and to my knowledge was perfectly healthy before that, so this was very sudden. Oliver was only 38 years old, and leaves behind his wife Lacey and young son Ryder.

Oliver was an uncommon combination of intelligence and genuine niceness. Once every month or two he and I might meet to chat about business, technology, where we grew up (Germany in Oliver’s case), and all manner of interesting stuff. Normally Oliver would suggest meeting for coffee, I’d suggest meeting for beers, and we’d end up going for beers.

Oliver would patiently listen as I bounced my various harebrained ideas off him, and would always give me valuable and insightful feedback. He would often surprise me with his deep understanding of both business and technology.

My wife Janie would sometimes come out with us. At one point Janie was managing a small team of programmers, but her real aspiration was to become a Product Manager, which was Oliver’s role at that time. Oliver was more than happy to answer her questions and provide advice.

In recent weeks I had seen more of Oliver than usual. He had started working at a new company, Vast, and recognized that my company might be able to help them. I had a meeting there on Wednesday, and had hoped to see Oliver there, but his colleagues said he was out sick.

Obviously nobody realized how serious it was at that time; his colleagues expected him to be back in the office in a day or two. We joked that once he got back we would tell Oliver that we’d given up on online advertising and decided to use billboards instead, knowing that he would think this was idiotic. Of course, being the nice guy that he was, Oliver wouldn’t have said so; he would have listened to our plan and politely tried to persuade us not to do it.

I will sincerely miss my conversations with Oliver, and the opportunity to work with him. Janie’s and my deepest sympathies are with Lacey and Ryder.

Oliver’s remembrance services will be on October 21st at 5pm at the Weed-Corley-Fish Funeral Home, Lamar location, and at 11am on October 22nd at Tarrytown United Methodist Church.

update: Oliver’s obituary in the Statesman newspaper is now available here.

update 2: Some other friends and colleagues have written about their memories of Oliver, read them here and here.