With our work on SenseArray we are in the fortunate position to apply machine learning to a wide variety of fascinating real-world problems, as diverse as predicting the likelihood that two people will have a successful date, the probability that someone will click on an ad, or the likelihood that a credit card transaction is fraudulent.
Sometimes all you care about is that predictions are correct relative to each-other. You’re looking for the best ad, or the best match on a dating website.
But in other situations you need an actual probability, not just a “score”. We quickly learned that it is dangerous, and frequently just plain wrong to assume that the output of your predictive model is a reliable probability.
As a specific example of this, imagine that you are using your predictive model to bid for the right to show someone an ad through an ad exchange. To do this you need to determine how much you expect to make if you show them the ad, and this depends on the probability of them click on it, and perhaps the probability of them making a purchase later.
In this situation it is critical that your probabilities are really probabilities or you’ll end up bidding too much, or not enough.
The first time we saw this we identified it as a form of “selection bias”, which is described in this paper (#1): Tweedie’s Formula and Selection Bias.
Essentially the paper states that if you are using predictions to pick things, perhaps the likelihood that someone will click on an ad, then the mean of the predictions that you pick will be higher than the actual rate at which people click on the ads. The reason for this, briefly, is that you are more likely to pick predictions with a positive error rather than a negative error, which creates an overall positive bias in the errors of what you’ve picked. Paper #1 does a better job of explaining.
Yesterday during a discussion on Reddit the aptly named Mr_Smartypants directed my attention to paper #2: Transforming Classifier Scores into Accurate Multiclass Probability Estimates, which describes another reason that you can’t treat the predictions of many learning algorithms as probabilities (see Figure 1 from the paper).
This particular problem doesn’t affect SenseArray because it operates in a fundamentally different way to most machine learning algorithms, but it does affect most other supervised learning systems that produce a probability as output.
This is fine if all you care about is that the ordering of the predictions is correct (ie. the relationship between the predictions and reality is isotonic), but if you want to treat them as accurate probabilities you’re in trouble.
Paper #2′s solution, or the one that most appealed to me, was to use a simple algorithm called “pair-adjacent violators” or PAV to determine a mapping from the predictions to the actual observed probabilities. PAV is a form of isotonic regression. This seemed like it would be an improvement on our current approach. PAV is described in section 3 of paper #2, but the basic approach is this:
Create an list of pairs of predictions and the actual outcome, 1.0 or 0.0, sorted by prediction lowest to highest, like this:
(0.3, 0), (0.5, 1), (0.6, 0), (0.8, 1) …
Now, working through the list starting at the beginning, find a prediction pair where the outcome of the second is lower than the outcome of the first. This is a “violator”, because we know that a higher prediction should mean a higher outcome. In the brief example above the 2nd and 3rd prediction-outcome pairs are violators because while 0.6 is greater than 0.5, 0 is not greater than 1.
To resolve this, we remove this pair of violators, and replace them with a new pair consisting of the average (this isn’t quite correct, keep reading) of both the prediction and the outcome. So now, our new list looks like this:
(0.3, 0), (0.55, 0.5), (0.8, 1) …
We carry on through the list doing this every time we find a pair of violators. At the end of this process, we are guaranteed to have an isotonic mapping from predictions to outcomes. Yay!
Actually, not so “yay”. I decided to test the algorithm with this code:
The basic idea was for the “real life” probabilities to be the square of the prediction. So, if the prediction was 0.5, the real probability would be 0.25, and so-on.
Here I compare the curve determined by the PAV algorithm to what it should be:
As you can see, it is a terrible fit, completely useless for our purposes. What went wrong? (edit: Turns out that paper #2′s description of the PAV algorithm was incorrect, my modification to it was actually the correct implementation).
After some thought I had an idea. Recall that the PAV algorithm works by “merging” pairs of samples repeatedly, using the average prediction and outcome of each to produce the replacement sample.
Using the average in this way assumes that both of the “parent” samples are of equal value, but are they? Well, no. One parent sample might have hundreds of “ancestor” samples, while another might not have any parents at all. PAV ignores this.
The result is that samples in parts of the curve that have a lot of these violations effectively get “watered down”, while samples in parts of the curve with few violations do not. The effect is to distort the shape of the curve, something that is very obvious from my experiment above.
The solution? We need to keep track of a “weight” for every sample, which basically counts the number of “original” samples that they are composed of.
Now, instead of doing a normal averaging of two parent samples, we do a weighted average, and the resulting child weight is the sum of the two parent’s weights.
The effect? Take a look:
Wow, looks like we have solved the problem! Frankly, I’m surprised that people haven’t noticed this flaw in the PAV algorithm in the past (edit: it wasn’t a flaw in the algorithm, it was a flaw in paper #2′s description of the algorithm). I assume that they were using it in situations where the violations were evenly distributed throughout the prediction curve, in which case the distortion may not be so pronounced.
I’ve emailed the authors of paper #2 and will update this post once I hear back from them.
Update: Ronny commented below directing me to another paper that describes the PAV algorithm, but it describes it with my “fix”. It appears that the description of the algorithm’s implementation in paper #2 was incorrect, and the correct implementation is the weighted approach I describe above. I suspected this might be the case, nice that it’s been confirmed.