Category Archives: Programming

5-star ratings are dangerous

…or “Why 99designs is sometimes telling people that the worst design is actually the best”

I’m a big fan of the website 99designs. For those unfamiliar with the site, you can pay for designers to compete with each-other to design your logo or website. I’ve used it for a number of my websites including, and we’re currently using it for a sorely needed redesign of

A critical part of 99designs is to provide as much feedback to designers as possible during the selection process. While this can be fun, it can also be quite laborious. Fortunately, 99designs allows you to enlist your friends and co-workers to help, by allowing the creation of polls.

The process is simple: you select the designs you’re most interested in, and create a poll. You’re then given a link which you can share. Anyone that clicks on the link can rate each design between 1 and 5 stars.

99designs then averages up the votes and tells you which design is best. Of course you don’t need to agree with the poll, but I’m sure it has a significant influence for most people.

I’ve never been a big fan of 5-star ratings. I first thought about them in-depth when Netflix offered a $1 million prize to see who could predict people’s 5 star rating most accurately. It was one of the first large-scale machine learning “contests”, and a precursor to hugely popular websites like Kaggle. The prize was won, but tellingly, Netflix never actually used the winning entry.

The main problem with 5-star ratings is how people often use the results. Many people, like 99designs, simply take the average rating, but this is very dangerous because it implies a bunch of assumptions about how people come up with ratings that almost certainly aren’t accurate.

For example, if someone rates two of the designs 2 stars, but doesn’t rate the other two at all, this could change the ordering — even though this person hasn’t conveyed any useful information about their preferences.

So when we recently ran a poll to select the final designer for Freenet’s website, I was fairly suspicious of what 99designs was telling me was the winning design, and was curious about whether a different more sensible approach to interpreting the results might yield a significantly different outcome.

So what is someone saying when the vote that design A is 2-stars, and design B is 3-stars? Really all they’re saying is that B is better than A. Humans are far better at making these relative judgements than absolute judgements.

So, if we take everyone’s vote, and assume that if someone assigns a higher rating to one design than another, that it only means they think the first design is better, then we can ask a better question:

What ordering would lead to the fewest contradictions with the preferences people have expressed?

99design’s average-based approach ordered the designs 120 (best), 10, 41, 114 (worst).

My approach came up with this ordering, where the top-most order — 10, 114, 41, 120, is the best with just 4 disagreements with how people rated.

A few things to note:

  • This new approach suggests that design 10 is the best, followed by design 114 (the worst design per 99design’s approach)
  • The first ordering where design 120 is considered the best has twice as many “disagreements” as the best order.
  • The specific ordering shown by 99designs has 11 disagreements — one of the three worst orders!

My approach isn’t perfect and can certainly be refined, but it does make me wonder how many people have picked the wrong design due to 99design’s current approach.

Here is the code, feel free to rip, mix, and burn.

The purpose of software project management

I recently read the article The sad graph of software death by Gregory Brown.

Brown describes a software project wherein tasks are being opened faster than they are being closed in the project’s task tracker.  The author describes this as “broken”, “sad”, and “wasteful.”  The assumption behind the article seems to be that there is something inherently bad about tasks being opened faster than they are being closed.

The author doesn’t explain why this is bad, and to me this article and the confused discussion it prompted on Reddit are symptomatic of the fact that most people don’t have a clear idea of the purpose of software project management.

Another symptom is that so many software projects run into problems, causing tension between engineering, product, and other parts of the company.  It is also the reason there is such a proliferation of tools that purport to help solve the problem of project management, but none of them do because they don’t start from a clear view of what exactly this problem is.

Two complimentary goals

In my view, the two core goals of project management are prioritization and predictability.

Prioritization ensures that at any given time, the project’s developers are working on the tasks with the highest ratio of value to effort

Predictability means accurately estimating what will get done and by when, and communicating that with the rest of the company.

A task tracker maintains a record of who is currently working on specific tasks, which tasks are completed, and the future tasks that could be tackled. As such, the trackers do not address the two core goals of project management directly.

I have actually thought about building a project management tool that addresses these goals, i.e. prioritization and predictability, much more directly than is currently the case with existing systems.  Unfortunately, to date the value to effort ratio hasn’t been high enough relative to other projects 🙂

When a task is created or “opened” in a task tracker, this simply means “here is something we may want to do at some point in the future.”

Opening a task isn’t, or shouldn’t be, an assertion that it must get done, or must get done by a specific time. Although this might imply that some tasks may never be finished, that’s ok. Besides, a row in a modern database table is very cheap indeed.

Therefore, the faster rate at which tasks are opened rather than closed is not an indication of a project’s impending demise; rather, it merely reflects the normal tendency of people to think of new tasks for the project faster than developers are able to complete those tasks.

Once created, tasks should then go through a prioritization or triage process; however, the output isn’t simply “yes, we’ll do it” or “no, we won’t.”  Rather, the output should be an estimate of the value provided to complete the task, as well as an estimate of the effort or resources required to complete it. Based on these two estimates, we can calculate the value/effort for the tasks.  It is only then that we can stack-rank the tasks.

Estimating value and effort

Of course, this makes it sound much simpler than it is.  Accurately estimating the value of a task is a difficult process that may require input from sales, product, marketing, and many other parts of a business.  Similarly, accurately estimating the effort required to complete a task can be challenging for even the most experienced engineer.

There are processes designed to help with these estimates.  Most of these processes, such as planning poker, rely on the wisdom of crowds.  These are steps toward the right direction.

I believe the ultimate solution to estimation will exploit the fact that people are much better at making relative, rather than absolute, estimates. For example, it is easier to guess that an elephant is 4 times heavier than a horse, than to estimate that the absolute weight of an elephant is 8000 pounds.

This was recently supported by a simple experiment that I conducted.  First, I asked a group to individually assign a number of these relative or comparative estimates.  Then, I used a constraint solver to turn these into absolute estimates.  The preliminary results are very promising.  This approach would almost certainly be part of any project management tool that I might build.

Once we have good estimates for value/effort, we can then prioritize the tasks.  Using our effort estimate, combined with an understanding of the resources available, we can come up with better time estimates.  This will enhance predictability that can be shared with the rest of the company.

Pivotal Tracker

I have had quite a bit of experience with Pivotal Tracker, which I would describe as the “least bad” project management tool. Pivotal Tracker doesn’t solve the prioritization problem, but it does attempt to help with the predictability problem.  Unfortunately, it does so in a way that is so simplistic as to make it almost useless.  Let me explain.

Pivotal Tracker assumes that for each task, you have assigned effort estimates which are in the form of “points” (you are responsible for defining what a point means).   It also assumes that you have correctly prioritized the tasks, which are then placed in the “backlog” in priority order.

Pivotal Tracker then monitors how many points are “delivered” within a given time period.  It then uses these points to project when future tasks will be completed.

The key problem with this tool is that it pretends that the backlog is static, i.e. that new tasks won’t be added to the backlog before tasks are prioritized. In reality, tasks are constantly being added to any active project, and these new tasks might go straight to the top of the priority list.

Nevertheless, the good news is that Pivotal Tracker could probably be improved to account for this addition of new tasks without much difficulty.  Perhaps a third party could make these improvements by using the Java library I created for integrating with PT’s API.   🙂

Breaking down tasks

Most tasks start out as being quite large, and need to be broken down into smaller tasks, both to make it easier to divide tasks among developers, but also to improve the accuracy of estimates.

However, there isn’t much point in breaking down tasks when nobody is going to start work on them for weeks or months.  For this reason, I advise setting time-horizon limits for task sizes.  For example, you might say that a task that is estimated to be started within three months can’t be larger than 2 man-weeks, and a task to be started within 1 month cannot be larger than 4 man-days.

As a task crosses each successive time-horizon, it may need to be broken into smaller tasks (each of which will, presumably, be small enough until they hit the next time horizon).  In practice this can be accomplished with a weekly meeting, that can be cancelled if there are no tasks to be broken down.  We would assign one developer to break down each oversized task and then the meeting would break up so that they could go and do that.  Typically each large task would be broken down into 3-5 smaller tasks.

This approach has the additional advantage that it spreads out the process of breaking down tasks over time and among developers.

Resource allocation

So how do you decide who works on what?  This is fairly simple under this approach.  Developers simply pick the highest priority task that they can work on (depending on skill set or interdependencies).

At OneSpot, when we broke down tasks, we left the subtasks in the same position in the priority stack as the larger task they replaced.  Since developers pull new tasks off the top of the priority list, this has the tendency to encourage as many people as possible to be working on related tasks at any given time, which minimizes the number of projects (large tasks) in-flight at any given time.


To conclude, without a clear view of the purpose of successful project management, it is not surprising that so many projects flounder with many project management tools failing to hit the mark. I hope I was able to provide the beginnings of a framework to think about project management in a goal-driven way.

Add a command to list git branches in order of last commit

This is based on this SO answer.

Wouldn’t it be useful if you could order git branches in order of the most recently used, so the ones you are likely to be most interested in are at the top? Here is how, just type:

$ git config --global alias.branches 'for-each-ref --sort=-committerdate refs/heads/ --format=\'%(committerdate:short) %09%(authorname) %09%(refname:short)\''
Now, just type:
$ git branches

And you’ll get something like:

2013-09-21 Ian Clarke gh-pages
2013-09-15 Ian Clarke master
2013-07-14 Ravi Tejasvi contactBook
2013-06-15 Ravi Tejasvi android
2013-06-08 Ian Clarke web-look-and-feel
2013-03-23 Ian Clarke cleanup-topology-maint
2012-06-12 Kieran Donegan topologyMaintenance
2012-05-28 Ian Clarke vaadin
2012-04-27 Ian Clarke refactor-of-peer-info
2011-07-07 Ian Clarke tr-remote-address-everything-needed

Note the date of the last commit, the committer, and the branch name.

Tungle: A wasted opportunity

Apparently Tungle has shut down development, although they still allow people to sign up. Turns out their acquisition by RIM last year must have been an acquahire (technically an acquisition, but really an admission of defeat).

Tungle had an incredibly viral business model, perhaps the most viral I’ve seen since Plaxo, solving a problem I and many others encounter on a near-daily basis:  Help people schedule meetings and calls with each-other.

So what went wrong? Their usability SUCKED. I desperately wanted Tungle to work, but almost every time I tried using it to schedule a meeting with someone, something would screw up and we’d have to resort to manually scheduling via email.  This was embarrassing when it happened, but even so I tried over and over again.  Every time I did could have been an opportunity for Tungle to sign up a new user, if their usability wasn’t so bad.

So if there is anyone out there looking for an idea that could be the next LinkedIn-scale viral phenomenon, all you have to do is reimplement Tungle, but this time get the usability right.  If I weren’t already rather busy I’d be doing this myself.

Microsoft probably just killed “Do Not Track”

Update (27th Oct 2012): I told you so!  Yahoo will ignore DNT from IE10 for exactly the reason I cite below.

Microsoft just announced that the “do not track” opt-out would be on by default in Internet Explorer 10.  This is a boneheaded move.

“Do not track” is a standard through which a web browser can inform a web page that the user does not wish to be tracked by third-party websites for the purpose of advertising.  So far as I can tell, respecting this is entirely voluntary on the part of the advertisers.

Advertisers often use browser cookies to track users, this allows them to target advertising specifically to people who’ve visited their website, for example.  Google and Microsoft both do it, it’s fairly standard practice these days.  Typically the advertiser isn’t tracking you as an individual, all they know is that you may have previously visited a particular website.

To explain why Microsoft’s move is boneheaded, I’ll relate a story from the early days of Revver, the online video sharing website that I co-founded back in 2004.

We had decided to let video uploaders tell us whether the video contained any content that is not appropriate for children as part of the upload process.  The vast majority of our users did exactly this and all was well, until at some point we realized that people were uploading some pretty serious pornography that we weren’t comfortable with even if it was marked as “adult” by the uploader.

Our panicked solution was to simply remove all videos marked as “adult” from the site, and prevent any further uploads where the videos were so-marked.

Of course you can predict the result: people immediately stopped marking videos as “adult”, making our task vastly more difficult.

The moral?  Don’t expect people to do something voluntarily if you are then going to use it against them.

I think Microsoft has just made exactly the same mistake.  Previously I think there was a reasonable chance that advertisers would choose to respect this, since only a minority of users are likely to enable it, and those are the people that really care about not being tracked.

But if it is enabled by default in Internet Explorer 10, advertisers now have no idea whether the user really cares about being tracked, and as a result they are far less likely to respect it.

Looking at it a different way, Microsoft just gave advertisers the perfect excuse to ignore DNT, because they can correctly claim that in most instances the user will have made no conscious decision to enable it.

Object Relational Mappers (ORMs) are a terrible idea

ORMs (like Hibernate in Java) are a flawed solution to an imagined problem that should be consigned to the wastebasket of history.

They are predicated on the idea that relational databases are old fashioned and should be avoided at all costs, but if you can’t avoid them, use some kind of convoluted wrapper that tries (and generally fails) to pretend that the relational database is something it isn’t – an object oriented datastore.

Often what people are really looking for when they use an ORM is some kind of database abstraction layer.  It is reasonable to abstract the database somehow, but I recommend not using an abstraction layer that seeks to pretend the database is something that it isn’t.

In Java, the best such database abstraction layer I’ve found is Jooq, I highly recommend it.

LastCalc: A powerful calculator meets Quora meets Siri

For the past month or so my main spare-time project has been a crazy idea called LastCalc (link at bottom).

I’ve been having trouble figuring out how to describe it, but here goes: Imagine a powerful web-based calculator that can answer your questions, a little like Google Calculator, Siri, or Wolfram Alpha, but where anyone can teach it how to calculate new things.

Additionally, rather than asking one question, you can ask a series of questions, each potentially referring to previous answers (programmers know this is a Read-Eval-Print-Loop or REPL).

Just like the others it supports basic math and unit conversions, like this (note: the highlighting is automatic and happens as you type – you type the bit before the big silver = and hit return, the answer appears after it):

But it goes a lot further. You can assign the result of a calculation to a variable, and then use it in subsequent calculations:

Internally LastCalc treats all numbers as rationals (x/y where x and y are integers) if possible, even if they are displayed as floating point numbers.  This means that it will not lose precision regardless of how many calculations you do (this can be a problem if using normal floating point numbers which are imprecise).

It’s not just simple numbers, LastCalc understands lists and associative arrays too, using a syntax very similar to JSON:

LastCalc is extensible, so if you find yourself repeating the same calculation over and over again, you can teach LastCalc how to do it (note: parameters are denoted by capitalization, like Prolog):

And it goes further, supporting pattern matching and recursion using these datastructures, just like languages like ML and Haskell:

Then use it with:

You can also pattern-match on maps.  Here I define a function that takes a map and returns a list of its keys:

Currently I’m working on a tutorial and help system so I don’t need to explain all of this before sending people to the site 🙂

Right now you can only use functions that you define yourself, but in due course people will be able to share functions, much like they can share answers to questions with Quora.

So far it has only been tested in Chrome and Safari, and it definitely doesn’t work yet in Internet Explorer.  I’m waiting for the Javascript to stabilize before climbing that particular mountain.

Check it out at

It’s obviously a work in progress, if you’d like to follow discussion and provide me with feedback please join the LastCalc Google Group, or follow @LastCalc on Twitter.

Proportionate A/B testing

More than once I’ve seen people ask questions like “In A/B Testing, how long should you wait before knowing which option is the best?”.

I’ve found that the best solution is to avoid a binary question of whether or not to use a variation. Instead, randomly select the variation to present to a user in proportion to the probability that this variation is the best based on the data (if any) you have so-far.

How? The key is the beta distribution. Let’s say you have a variation with 30 impressions, and 5 conversions. A beta distribution with (alpha=5, beta=30-5) describes the probability distribution for the actual conversion rate. It looks like this:

A graph showing a beta distribution

This shows that while the most likely conversion rate is about 1 in 6 (approx 0.16), the curve is relatively broad indicating that there is still a lot of uncertainty about what the conversion rate will be.

Let’s try the same for 50 conversions and 300 impressions (alpha=50, beta=300-50)”):

You’ll see that while the curve’s peak remains the same, the curve gets a lot narrower, meaning that we’ve got a lot more certainty about what the actual conversion rate is – as you would expect given 10X the data.

Let’s say we have 5 variations, each with various numbers of impressions and conversions. A new user arrives at our website, and we want to decide which variation we show them.

To do this we employ a random number generator, which will pick random numbers according to a beta distribution we provide to it. You can find open source implementations of such a random number generator in most programming languages, here is one for Java.

So we go through each of our variations, and pick a random number within the beta distribution we’ve calculated for that variation. Whichever variation gets the highest random number is the one we show.

The beauty of this approach is that it achieves a really nice, perhaps optimal compromise between sending traffic to new variations to test them, and sending traffic to variations that we know to be good. If a variation doesn’t perform well this algorithm will gradually give it less and less traffic, until eventually it’s getting none. Then we can remove it secure in the knowledge that we aren’t removing it prematurely, no need to set arbitrary significance thresholds.

This approach is easily extended to situations where rather than a simple impression-conversion funnel, we have funnels with multiple steps.

One question is, before you’ve collected any data about a particular variation, what should you “initialize” the beta distribution with. The default answer is (1, 1), since you can’t start with (0, 0). This effectively starts with a “prior expectation” of a 50% conversion rate, but as you collect data this will rapidly converge on reality.

Nonetheless, we can do better. Let’s say that we know that variations tend to have a 1% conversion rate, so you could start with (1,99).

If you really want to take this to an extreme (which is what we do in our software!), let’s say you have an idea of the normal distribution of the conversion rates, let’s say its 1% with a standard deviation of 0.5%.

Note that starting points of (1,99), or (2,198), or (3,297) will all give you a starting mean of 1%, but the higher the numbers, the longer they’ll take to converge away from the mean. If you plug these into Wolfram Alpha (“beta distribution (3,297)”) it will show you the standard deviation for each of them. (1,99) is 0.0099, (2,198) is 0.007, (3, 297) is 0.00574, (4, 396) is 0.005 and so on.

So, since we expect the standard deviation of the actual conversion rates to be 0.5% or 0.005, we know that starting with (4, 396) is about right.

You could find a smart way to find the starting beta parameters with the desired standard deviation, but it’s easier and effective to just do it experimentally as I did.

Note that while I discovered this technique independently, I later learned that it is known as “Thompson sampling” and was originally developed in the 1930s (although to my knowledge this is the first time it has been applied to A/B testing).

A (possibly) novel approach to image rescaling

We’ve all seen fictional examples of increasing the resolution of images to reveal previously unseen detail, here is a reminder:

Unfortunately, all of these examples are basically impossible, because they imply revealing information in the image that simply isn’t there.

Turns out that there is a way to do exactly this, described by this paper: Scene Completion Using Millions of Photographs.

So, anyone want to try applying this approach to “filling in” the missing pixels in an image that has been scaled up? The results would be really interesting.

Why doesn’t this exist yet: Syntax-aware merge

Hey programmers – what if I told you that there was a fairly interesting problem out there, that would probably be just a few weekends of work, and which if you implemented it you could find your code part of the toolset of a significant proportion of software developers worldwide (especially if you open sourced it)?

Well, just such a problem has been floating around for several years, and is as-yet unanswered.  I proposed it two years ago, and others have suggested it several years before that.

Anyone familiar with source control systems like svn and git will be familiar with the occasional need to “merge” changes.  This is to take changes to one version of a file, and apply the same changes to a different (but not too different) version of that same file.

Today’s merge tools, at least to the best of my knowledge, basically treat all text files the same.  Whether its Java, HTML, C++, or CSS, as far as the merge tool is concerned, they are all just lines of text.

This works ok in many situations, but it can get tripped up by situations like this:

   Book book = new Book(
        String name,
        int length);

Let’s say we change this to:

   Book book = new Book(String name, int length);

A programmer knows that all we’ve changed here is formatting, but a merge algorithm will see this has two lines deleted, and one line changed – quite a major alteration.

It is true that passing both files through a formatter would help in this case, but there are many other situations where it wouldn’t.  Consider renaming a variable throughout a file, inlining a variable, and other common refactoring operations.

My proposal is fairly simple.  Rather than treat Java files as simple text, we parse the files into an “Abstract Syntax Tree” using a Java parser, maybe even matching identifier names like variables, and treat the merge operation as a merge of these trees, rather than simple lines of text.

After the merge we convert the resultant tree back to text, perhaps even formatting it as we go.

The result of this, I suspect, would be dramatically more robust merge operations.

So what would be involved here?  Well, it shouldn’t be necessary to write a bunch of parsers, because these already exist for most languages.  The main thing would be the tree merge algorithm, which would be a neat self-contained computer science project.  Perhaps there is even an off-the-shelf open source tree merge algorithm out there.

Regrettably I’ve already got a bunch of projects on my plate already, but perhaps there is someone out there interested in taking on this challenge?  If they did I could almost guarantee that the resultant software (which I assume they’d open source) would be very widely adopted.

If you are interested I’ll be happy to help and advise. You can email me at