# Netflix Prize – is RMSE a good measurement?

The Netflix Prize is a pretty cool competition where Netflix has made available 100 million user ratings of movies, and you must use these to try to predict what movies users will like. The idea is to stimulate innovation in the field of “Collaborative Filters”. I won’t go into detail, you can read more at the link provided.

One important question is how one measures the success of a collaborative filtering algorithm. Netflix has opted to look at the average difference between the what algorithm predicts a user will rate a movie, and what they actually rate the movie. To be more precise, they look at the square root of the average of the squared differences (aka Root Mean Squared Error or RMSE). This is similar to a normal average, except it is swayed more by larger values.

A naive approach that just looks at a user’s average rating, and the average rating on each movie gets an RMSE of about 1.02. Netflix themselves score around 0.95, and if you can get it down (lower is better) to 0.85, you are in line to win \$1 million. Currently, the best performers are around 0.87, but as they get closer to the goal, progress gets much more difficult.

Like seemingly every other geek on the planet, I decided I would try my hand at this. My goal was not to win, since most of the current top entrants work by blending predictions from a bunch of different algorithms. While this is well within the rules of the competition, it does rely much more on perspiration than inspiration, and thus I find it rather dull. Also, I would question whether the approach is fast enough for practical use (something the competition doesn’t consider).

My attempt is not a terribly original approach, I’m using a perceptron learning algorithm (perceptrons are simple precursors to what are today known as “neural networks”), and my best score to-date is 0.905. I consider this respectable (especially given the simplicity of my approach and that its about two weekends of effort), but its not going to win any time soon (its around 300th on the leader-board).

Then I started to think about what it means for an algorithm to score 0.905, or 0.95, or 0.85, or 1.02 – what bearing do these different scores have on the actual user experience?

Well, user experience is a hard thing to measure, but if we make some simplifying assumptions, we can get an idea. Most collaborative filtering algorithms are used to choose things (in Netflix’ case the things are movies), predict a user’s rating of those things, and then present the things with the highest predicted ratings to the user.

If the user finds the things they are presented with to be appealing, then they should rate them highly, if not, they won’t. Thus, if we look at user’s actual ratings of things with the highest predicted ratings, we should get a better idea of how much difference the algorithm makes to the user’s experience.

Here is a graph which shows actual versus predicted ratings for two algorithms (click to see it full size), the top one is my algorithm half way through its training with an RMSE of 0.913, the bottom one is the algorithm based on simple averages of user and item ratings, this has an RMSE of 1.022.

You can see the predicted ratings along the bottom, and for each, you can see the proportion of ratings of 1 star, 2 star, 3 star, and so on. As one would hope, when the algorithm is predicting a rating of 1, the majority of the actual ratings are indeed 1, and when the algorithm is predicting a rating of 5, the majority of those ratings are actually 5.

But when you compare the two sets of predictions you see something a bit surprising, there isn’t really a huge difference between the two algorithms, even though one is using a fairly naive average-based approach that doesn’t even consider a user’s individual likes and dislikes, and the other is an algorithm that does significantly better than Netflix’ own approach.

The pie charts at the right show the proportion of actual ratings when the predicted rating is over 3.5 stars. This is intended to approximate what the average user might see.

For the smart algorithm, the actual ratings over 3 stars represent 98.9% of those ratings that were predicted to be over 3.5 stars.

But for the naive algorithm, the percentage is 94.4%, a 4.5% difference. Is 4.5% really that impressive? I think a user might have a hard time noticing such a difference in performance. I’m not the first to express concerns of this nature.

You can find the raw data in the form of an Excel spreadsheet here: