5-star ratings are dangerous

…or “Why 99designs is sometimes telling people that the worst design is actually the best”

I’m a big fan of the website 99designs. For those unfamiliar with the site, you can pay for designers to compete with each other to design your logo or website. I’ve used it for a number of my websites, including http://33mail.com/, and we’re currently using it for a sorely needed redesign of http://freenetproject.org/.

A critical part of 99designs is to provide as much feedback to designers as possible during the selection process. While this can be fun, it can also be quite laborious. Fortunately, 99designs allows you to enlist your friends and co-workers to help, by allowing the creation of polls.

The process is simple: you select the designs you’re most interested in and create a poll. You’re then given a link which you can share. Anyone who clicks on the link can rate each design between 1 and 5 stars.

99designs then averages the votes and tells you which design is best. Of course you don’t need to agree with the poll, but I’m sure it has a significant influence on most people.

I’ve never been a big fan of 5-star ratings. I first thought about them in depth when Netflix offered a $1 million prize to see who could predict people’s 5-star ratings most accurately. It was one of the first large-scale machine learning “contests”, and a precursor to hugely popular websites like Kaggle. The prize was won, but tellingly, Netflix never actually used the winning entry.

The main problem with 5-star ratings is how people often use the results. Many people, like 99designs, simply take the average rating, but this is very dangerous because it relies on a bunch of assumptions about how people come up with ratings, and those assumptions almost certainly aren’t accurate.

For example, if someone rates two of the four designs 2 stars each but doesn’t rate the other two at all, this could change the ordering, even though this person hasn’t said which design they prefer.
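To make that concrete, here is a minimal Python sketch with made-up votes. The design names A to D, the star values, and the `rank_by_average` helper are all hypothetical, and I’m assuming that a design someone didn’t rate is simply left out of that design’s average (I don’t know exactly how 99designs handles this).

```python
from statistics import mean

def rank_by_average(votes):
    """Order designs best-to-worst by mean star rating.

    Assumption: a design a voter did not rate is skipped for that voter.
    """
    designs = sorted({design for vote in votes for design in vote})
    averages = {d: mean(v[d] for v in votes if d in v) for d in designs}
    return sorted(designs, key=lambda d: averages[d], reverse=True)

# Hypothetical poll: two voters rate all four designs.
votes = [
    {"A": 4, "B": 2, "C": 4, "D": 1},
    {"A": 4, "B": 2, "C": 3, "D": 1},
]
before = rank_by_average(votes)

# A third voter gives A and B the same 2 stars and skips C and D.
partial_vote = {"A": 2, "B": 2}
after = rank_by_average(votes + [partial_vote])

print(before)  # ['A', 'C', 'B', 'D']
print(after)   # ['C', 'A', 'B', 'D']
```

The third voter never said that C is better than A, yet their vote drags A’s average from 4.0 down to about 3.33, below C’s 3.5.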

So when we recently ran a poll to select the final designer for Freenet’s website, I was fairly suspicious of what 99designs was telling me was the winning design, and was curious about whether a different, more sensible approach to interpreting the results might yield a significantly different outcome.

So what is someone saying when they vote that design A is 2 stars and design B is 3 stars? Really all they’re saying is that B is better than A. Humans are far better at making these relative judgements than absolute judgements.

So, if we take everyone’s votes and assume that a higher rating for one design than another means only that the voter thinks the first design is better, then we can ask a better question:

What ordering would lead to the fewest contradictions with the preferences people have expressed?
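One way to answer that question is brute force: try every possible ordering and count how many of the voters’ pairwise preferences it contradicts. Here is a rough Python sketch of that idea. The function names and the example votes are hypothetical, and this is an illustration of the approach rather than the code linked from this post.

```python
from itertools import combinations, permutations

def disagreements(order, votes):
    """Count pairwise preferences in `votes` that `order` contradicts.

    `order` lists designs best-first. Equal ratings and unrated designs
    express no preference, so they are ignored.
    """
    position = {design: i for i, design in enumerate(order)}
    count = 0
    for vote in votes:
        for a, b in combinations(vote, 2):
            if vote[a] == vote[b]:
                continue
            preferred, other = (a, b) if vote[a] > vote[b] else (b, a)
            if position[preferred] > position[other]:
                count += 1
    return count

def rank_orderings(votes):
    """Return (disagreements, ordering) pairs, fewest disagreements first."""
    designs = sorted({design for vote in votes for design in vote})
    return sorted((disagreements(o, votes), o) for o in permutations(designs))

# Hypothetical poll: two voters slightly prefer Y, one strongly prefers X.
votes = [
    {"X": 2, "Y": 3, "Z": 1},
    {"X": 2, "Y": 3, "Z": 1},
    {"X": 5, "Y": 2, "Z": 1},
]
best_score, best_order = rank_orderings(votes)[0]
print(best_score, best_order)  # 1 ('Y', 'X', 'Z')
```

With these made-up votes, averaging would put X first (3.0 stars against about 2.67 for Y), even though two of the three voters prefer Y. Trying every ordering is only practical for a handful of designs, since the number of orderings grows factorially, but with four designs there are just 24 to check.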

99designs’ average-based approach ordered the designs 120 (best), 10, 41, 114 (worst).

My approach produced a list of orderings ranked by how often each one disagrees with how people rated. The top ordering (10, 114, 41, 120) is the best, with just 4 disagreements.

A few things to note:

  • This new approach suggests that design 10 is the best, followed by design 114 (the worst design per 99designs’ approach).
  • The first ordering where design 120 is considered the best has twice as many “disagreements” as the best order.
  • The specific ordering shown by 99designs has 11 disagreements, making it one of the three worst orders!

My approach isn’t perfect and can certainly be refined, but it does make me wonder how many people have picked the wrong design due to 99designs’ current approach.

Here is the code, feel free to rip, mix, and burn.
