Category Archives: Programming

Tungle: A wasted opportunity

Apparently Tungle has shut down development, although they still allow people to sign up. Turns out their acquisition by RIM last year must have been an acquahire (technically an acquisition, but really an admission of defeat).

Tungle had an incredibly viral business model, perhaps the most viral I’ve seen since Plaxo, solving a problem I and many others encounter on a near-daily basis:  Help people schedule meetings and calls with each-other.

So what went wrong? Their usability SUCKED. I desperately wanted Tungle to work, but almost every time I tried using it to schedule a meeting with someone, something would screw up and we’d have to resort to manually scheduling via email.  This was embarrassing when it happened, but even so I tried over and over again.  Every time I did could have been an opportunity for Tungle to sign up a new user, if their usability wasn’t so bad.

So if there is anyone out there looking for an idea that could be the next LinkedIn-scale viral phenomenon, all you have to do is reimplement Tungle, but this time get the usability right.  If I weren’t already rather busy I’d be doing this myself.

Microsoft probably just killed “Do Not Track”

Update (27th Oct 2012): I told you so!  Yahoo will ignore DNT from IE10 for exactly the reason I cite below.

Microsoft just announced that the “do not track” opt-out would be on by default in Internet Explorer 10.  This is a boneheaded move.

“Do not track” is a standard through which a web browser can inform a web page that the user does not wish to be tracked by third-party websites for the purpose of advertising.  So far as I can tell, respecting this is entirely voluntary on the part of the advertisers.

Advertisers often use browser cookies to track users, this allows them to target advertising specifically to people who’ve visited their website, for example.  Google and Microsoft both do it, it’s fairly standard practice these days.  Typically the advertiser isn’t tracking you as an individual, all they know is that you may have previously visited a particular website.

To explain why Microsoft’s move is boneheaded, I’ll relate a story from the early days of Revver, the online video sharing website that I co-founded back in 2004.

We had decided to let video uploaders tell us whether the video contained any content that is not appropriate for children as part of the upload process.  The vast majority of our users did exactly this and all was well, until at some point we realized that people were uploading some pretty serious pornography that we weren’t comfortable with even if it was marked as “adult” by the uploader.

Our panicked solution was to simply remove all videos marked as “adult” from the site, and prevent any further uploads where the videos were so-marked.

Of course you can predict the result: people immediately stopped marking videos as “adult”, making our task vastly more difficult.

The moral?  Don’t expect people to do something voluntarily if you are then going to use it against them.

I think Microsoft has just made exactly the same mistake.  Previously I think there was a reasonable chance that advertisers would choose to respect this, since only a minority of users are likely to enable it, and those are the people that really care about not being tracked.

But if it is enabled by default in Internet Explorer 10, advertisers now have no idea whether the user really cares about being tracked, and as a result they are far less likely to respect it.

Looking at it a different way, Microsoft just gave advertisers the perfect excuse to ignore DNT, because they can correctly claim that in most instances the user will have made no conscious decision to enable it.

Object Relational Mappers (ORMs) are a terrible idea

ORMs (like Hibernate in Java) are a flawed solution to an imagined problem that should be consigned to the wastebasket of history.

They are predicated on the idea that relational databases are old fashioned and should be avoided at all costs, but if you can’t avoid them, use some kind of convoluted wrapper that tries (and generally fails) to pretend that the relational database is something it isn’t – an object oriented datastore.

Often what people are really looking for when they use an ORM is some kind of database abstraction layer.  It is reasonable to abstract the database somehow, but I recommend not using an abstraction layer that seeks to pretend the database is something that it isn’t.

In Java, the best such database abstraction layer I’ve found is Jooq, I highly recommend it.

LastCalc: A powerful calculator meets Quora meets Siri

For the past month or so my main spare-time project has been a crazy idea called LastCalc (link at bottom).

I’ve been having trouble figuring out how to describe it, but here goes: Imagine a powerful web-based calculator that can answer your questions, a little like Google Calculator, Siri, or Wolfram Alpha, but where anyone can teach it how to calculate new things.

Additionally, rather than asking one question, you can ask a series of questions, each potentially referring to previous answers (programmers know this is a Read-Eval-Print-Loop or REPL).

Just like the others it supports basic math and unit conversions, like this (note: the highlighting is automatic and happens as you type – you type the bit before the big silver = and hit return, the answer appears after it):

But it goes a lot further. You can assign the result of a calculation to a variable, and then use it in subsequent calculations:

Internally LastCalc treats all numbers as rationals (x/y where x and y are integers) if possible, even if they are displayed as floating point numbers.  This means that it will not lose precision regardless of how many calculations you do (this can be a problem if using normal floating point numbers which are imprecise).

It’s not just simple numbers, LastCalc understands lists and associative arrays too, using a syntax very similar to JSON:

LastCalc is extensible, so if you find yourself repeating the same calculation over and over again, you can teach LastCalc how to do it (note: parameters are denoted by capitalization, like Prolog):

And it goes further, supporting pattern matching and recursion using these datastructures, just like languages like ML and Haskell:

Then use it with:

You can also pattern-match on maps.  Here I define a function that takes a map and returns a list of its keys:

Currently I’m working on a tutorial and help system so I don’t need to explain all of this before sending people to the site :-)

Right now you can only use functions that you define yourself, but in due course people will be able to share functions, much like they can share answers to questions with Quora.

So far it has only been tested in Chrome and Safari, and it definitely doesn’t work yet in Internet Explorer.  I’m waiting for the Javascript to stabilize before climbing that particular mountain.

Check it out at LastCalc.com.

It’s obviously a work in progress, if you’d like to follow discussion and provide me with feedback please join the LastCalc Google Group, or follow @LastCalc on Twitter.

Proportionate A/B testing

More than once I’ve seen people ask questions like “In A/B Testing, how long should you wait before knowing which option is the best?”.

We’ve found that the best solution is to avoid a binary question of whether or not to use a variation. Instead, randomly select the variation to present to a user in proportion to the probability that this variation is the best based on the data (if any) you have so-far.

How? The key is the beta distribution. Let’s say you have a variation with 30 impressions, and 5 conversions. A beta distribution with (alpha=5, beta=30-5) describes the probability distribution for the actual conversion rate. It looks like this:

A graph showing a beta distribution

This shows that while the most likely conversion rate is about 1 in 6 (approx 0.16), the curve is relatively broad indicating that there is still a lot of uncertainty about what the conversion rate will be.

Let’s try the same for 50 conversions and 300 impressions (alpha=50, beta=300-50)”):

You’ll see that while the curve’s peak remains the same, the curve gets a lot narrower, meaning that we’ve got a lot more certainty about what the actual conversion rate is – as you would expect given 10X the data.

Let’s say we have 5 variations, each with various numbers of impressions and conversions. A new user arrives at our website, and we want to decide which variation we show them.

To do this we employ a random number generator, which will pick random numbers according to a beta distribution we provide to it. You can find open source implementations of such a random number generator in most programming languages, here is one for Java.

So we go through each of our variations, and pick a random number within the beta distribution we’ve calculated for that variation. Whichever variation gets the highest random number is the one we show.

The beauty of this approach is that it achieves a really nice, perhaps optimal compromise between sending traffic to new variations to test them, and sending traffic to variations that we know to be good. If a variation doesn’t perform well this algorithm will gradually give it less and less traffic, until eventually it’s getting none. Then we can remove it secure in the knowledge that we aren’t removing it prematurely, no need to set arbitrary significance thresholds.

This approach is easily extended to situations where rather than a simple impression-conversion funnel, we have funnels with multiple steps.

One question is, before you’ve collected any data about a particular variation, what should you “initialize” the beta distribution with. The default answer is (1, 1), since you can’t start with (0, 0). This effectively starts with a “prior expectation” of a 50% conversion rate, but as you collect data this will rapidly converge on reality.

Nonetheless, we can do better. Let’s say that we know that variations tend to have a 1% conversion rate, so you could start with (1,99).

If you really want to take this to an extreme (which is what we do in our software!), let’s say you have an idea of the normal distribution of the conversion rates, let’s say its 1% with a standard deviation of 0.5%.

Note that starting points of (1,99), or (2,198), or (3,297) will all give you a starting mean of 1%, but the higher the numbers, the longer they’ll take to converge away from the mean. If you plug these into Wolfram Alpha (“beta distribution (3,297)”) it will show you the standard deviation for each of them. (1,99) is 0.0099, (2,198) is 0.007, (3, 297) is 0.00574, (4, 396) is 0.005 and so on.

So, since we expect the standard deviation of the actual conversion rates to be 0.5% or 0.005, we know that starting with (4, 396) is about right.

You could find a smart way to find the starting beta parameters with the desired standard deviation, but it’s easier and effective to just do it experimentally as I did.

Note that while we developed this technique independently, we later discovered that it is known as “Thompson sampling” and was originally developed in the 1930s.

A (possibly) novel approach to image rescaling

We’ve all seen fictional examples of increasing the resolution of images to reveal previously unseen detail, here is a reminder:

Unfortunately, all of these examples are basically impossible, because they imply revealing information in the image that simply isn’t there.

Turns out that there is a way to do exactly this, described by this paper: Scene Completion Using Millions of Photographs.

So, anyone want to try applying this approach to “filling in” the missing pixels in an image that has been scaled up? The results would be really interesting.

Why doesn’t this exist yet: Syntax-aware merge

Hey programmers – what if I told you that there was a fairly interesting problem out there, that would probably be just a few weekends of work, and which if you implemented it you could find your code part of the toolset of a significant proportion of software developers worldwide (especially if you open sourced it)?

Well, just such a problem has been floating around for several years, and is as-yet unanswered.  I proposed it two years ago, and others have suggested it several years before that.

Anyone familiar with source control systems like svn and git will be familiar with the occasional need to “merge” changes.  This is to take changes to one version of a file, and apply the same changes to a different (but not too different) version of that same file.

Today’s merge tools, at least to the best of my knowledge, basically treat all text files the same.  Whether its Java, HTML, C++, or CSS, as far as the merge tool is concerned, they are all just lines of text.

This works ok in many situations, but it can get tripped up by situations like this:

   Book book = new Book(
        String name,
        int length);

Let’s say we change this to:

   Book book = new Book(String name, int length);

A programmer knows that all we’ve changed here is formatting, but a merge algorithm will see this has two lines deleted, and one line changed – quite a major alteration.

It is true that passing both files through a formatter would help in this case, but there are many other situations where it wouldn’t.  Consider renaming a variable throughout a file, inlining a variable, and other common refactoring operations.

My proposal is fairly simple.  Rather than treat Java files as simple text, we parse the files into an “Abstract Syntax Tree” using a Java parser, maybe even matching identifier names like variables, and treat the merge operation as a merge of these trees, rather than simple lines of text.

After the merge we convert the resultant tree back to text, perhaps even formatting it as we go.

The result of this, I suspect, would be dramatically more robust merge operations.

So what would be involved here?  Well, it shouldn’t be necessary to write a bunch of parsers, because these already exist for most languages.  The main thing would be the tree merge algorithm, which would be a neat self-contained computer science project.  Perhaps there is even an off-the-shelf open source tree merge algorithm out there.

Regrettably I’ve already got a bunch of projects on my plate already, but perhaps there is someone out there interested in taking on this challenge?  If they did I could almost guarantee that the resultant software (which I assume they’d open source) would be very widely adopted.

If you are interested I’ll be happy to help and advise. You can email me at syntaxmerge@ian.33barfmail.com.

The game theory of sealed-bid auctions

According to Wikipedia, a “frenemy” is someone who is simultaneously a partner and a competitor.

The term is typically used in social relationships, but frenemies can also exist in economic relationships, and I was fortunate enough to encounter such a situation in a project I’m working on, and I decided to explore it. The conclusions are directly applicable to at least one specific situation where real money is at stake (I describe this at the end). More interestingly, this may suggest a selfish rationale for unselfish behavior in a wider class of situations.

First, a little background:

Sealed Bid Auction

Consider a simplified version of eBay where everyone bids once on an item, nobody sees each-other’s bid, and the highest bid wins. This is called a “First-price sealed-bid auction”.

One day you find a trustworthy guy called Bill that promises that they can pay you $200 for a Nexus One phone. You discover that Nexus One phones frequently sell for less than $200 on eBay, and several of these phones are auctioned off every day. Bill hates to use eBay and won’t ever use it no-matter what, and he doesn’t really care what you bid for the phones, so long as he gets them. You realize that there is an opportunity to make some money here.

So what do you bid when you see one of these phones? Its a compromise, since if you bid $200 you’ll definitely win the phone, but you’ll make no money. If you bid less, then you’ll make more money if you win, but you’ll win less frequently. If you bid too little then you’ll never win and you’ll make no money.

This is not a hard question to answer if you are reasonably smart and have a decent amount of information about past winning bids, since you’ll be able to discover, given a bid b1, a function f1(b1) that will tell you the probability of winning the auction, given whatever you bid is. Let’s assume that you are smart enough and you do have enough information. Let’s say your cut of the profit is c1, then, applying some high school math, your expected profit is $200*c1*f1($200*(1-c1)).

From this equation you are able to decide the optimal bid (and therefore how big your cut is) to maximize your own profit. You can either do this through some fancy math, or just experimentally plug in different values for c1 until you find what works best, the latter being my preferred approach (because I’m lazy and computers aren’t).

Bill gets clever

Then Bill comes to you and tells you that he may not always be able to pay you $200 for the phone, sometimes he’ll pay more, sometimes he’ll pay less, but he will tell you what he will pay before you must place your bid. Fair enough you think, you’ve got your function f1(b1), you can determine the optimal bid depending on whatever Bill is willing to pay.

Weeks go by and you are making good money off this relationship. One night you go out for a beer with Bill, and he drops a bombshell. He tells you that actually, he is playing the same game you are. He knows a guy, Jim, who is buying the phones from him. He won’t tell you who Jim is (Bill isn’t an idiot), but it turns out that Jim is paying even more for these Nexus Ones than Bill is! Worse still, it turns out that not only is Bill no idiot, he is at least as smart as you are. He has determined the probability of you winning your bid, and is choosing his cut to optimize his total profit, in the exact same way that you are (although his function, f2(b2), won’t be the same as your’s because you are reducing his bid by your cut).

You go home drunk and don’t think about it much, but the next day you have a real headache, and its not just all the beer you had with Bill last night. What exactly is the relationship between you and Bill? In one sense, you are on the same side. If your cut combined with Bill’s cut is too big, then both of you will make less money. But in another sense, you are on opposite sides, the smaller his cut, the bigger your cut can be, and vice versa.

Given this relationship, and assuming that you can’t deal directly with Jim, Bill can’t deal directly with eBay, and Bill is at least as smart as you are, how do you maximize your profit?

My Simulation

A mathematician or game theorist at this point would probably go into a dark room for a week, month, or year, and come out with the mathematically optimal answer. Since I don’t have the time or patience required for a rigorous mathematical treatment, I decided to do a few experiments instead. You can find the code here.

The Auction

In the results described here, Bill’s buyer pays $1 (ok, we’re somewhat abandoning the Nexus One analogy here) to Bill for a phone. I wanted to create a reasonably realistic auction, which I do in the bidWinProb() method on line 30. Basically I simulate 100,000 auctions, each consisting of 10 bidders whose bids are spread according to a Gaussian distribution with mean of $0.50, and standard distribution of 0.1. I then record all of the winning bids, which allows me to quickly estimate a probability of winning for any given bid.

Yes, I realize that I could do this much more efficiently mathematically, but speed doesn’t really matter here, and this solution is simple enough that bugs are unlikely.

It has the added benefit that you can just provide it with actual data from a real auction, rather than simulated data.

Optimize Cut

The first step is that we want to be able to determine the optimal cut for ourselves, given whatever cut Bill has, or the optimal cut for Bill given whatever our cut is. This is done in the optimizeCut() method on line 78. The approach is basic trial and error. While a more sophisticated, accurate, and efficient approach is certainly possible, the current implementation is more than adequate for our needs (not to mention being easier to be confident that it does what its supposed to).

Experiment #1: Best combined cut

So what happens if Bill and I were entirely transparent with each-other, and agreed that we’d find the best cut for both of us, and then split it down the middle?

In this situation it turns out that Bill and I should collectively take $0.28 of the dollar Bill is paid if the auction is won (which when we consider the possibility of losing the auction, is an expected revenue of $0.24). Split down the middle this is an expected revenue of $0.12 each.

Experiment #2: Iterated selfish cut

But what if Bill and I can’t agree on an even split, and we are basically each left to our own devices to make as much as we each can?

One approach to do this would be for me to optimize my cut based on whatever Bill is paying me (which will depend on his cut), and for Bill to optimize his cut based on his perspective of his probability given various payments to me (which will depend on my cut).

When I tried this the results were interesting. I started from a point where Bill isn’t taking any cut at all, then each of us take turns to optimize our cuts based on the other one:

% Cut          $ Cut
Bill    Me      Bill    Me      Ttl $   1/2
0.000   0.280   0.000   0.244   0.244   0.122
0.100   0.220   0.080   0.159   0.239   0.120
0.140   0.190   0.109   0.128   0.237   0.118
0.160   0.180   0.119   0.112   0.231   0.115
0.160   0.180   0.119   0.112   0.231   0.115

This is interesting because they converge to where Bill takes a 16% cut, and I take an 18% cut. Further, when you look at the expected revenue, it is close to evenly split between us, but not exactly. Note that after this iterative process Bill and I collectively are making only $0.231, whereas we would be making $0.244 if we had worked together and split the difference.

Bill makes 3% more money if we work together and split the profit, and I make 9% more!

Experiment #3: Machiavellianism

My last two experiments I call “Evil Me” and “Evil Bill”. In “Evil Me” I exploit the fact that Bill will optimize his cut based on my cut, so rather than selecting my cut based on Bill’s cut, I select my cut to maximize my profit knowing that Bill will optimize for whatever is in his best interests given my cut. Evil Bill is the same but the other way around.

% Cut          $ Cut
Bill    Me      Bill    Me      Ttl $   1/2
0.100   0.270   0.055   0.134   0.189   0.094  <-- I'm evil
0.240   0.120   0.151   0.057   0.208   0.104  <-- Bill is evil

Wow, so in this situation being Machiavellian allows me to make $0.134 while poor Bill only makes $0.055! Bill reverses this if he is Machiavellian and I am not.

What is fascinating is that, while I increase my profit by 20%, Bill's drops to less than half of what it was before! One of us being Machiavellian helps that person a bit, but it hurts the other a lot.

Conclusions

So what are our conclusions based on this one experiment?

  1. If our sole motivation is to maximize our own profit, we should adopt the Machiavellian approach, even though this will totally screw Bill.
  2. If we at least have some semblance of compassion for Bill, the next best approach is to pre-agree a split with Bill, and then optimize together, as in Experiment #1.
  3. The pre-agreed split should probably be 50:50, because if Bill and I can't agree on a split and were to optimize against each other as in Experiment #2, then we wind up making approximately the same amount anyway, but we both make less than if the split is pre-agreed.

An important caveat is that these conclusions are based on a single experiment that makes a variety of assumptions. These assumptions seem realistic enough that hopefully these results do generalize, but we can't say that for sure.

Unanswered questions

Some may feel dissatisfied by the fact that I've answered these questions experimentally, rather than explore the math behind why we get these results. I would strongly encourage such people to feel free to explore themselves, especially if they enjoy algebra and statistics. I'll gladly link to any serious attempts here.

Further, I've explored a situation where one participant is manipulative, while the other tries to make the best of whatever situation they are in. But what if both participants are manipulative? My suspicion is that in that case it degenerates into a Game of Chicken with an ultimate outcome that is far worse for both participants, but further investigation is definitely warranted.

Why is this problem relevant?

Realtime bidding is a recent innovation in advertising where advertisers bid against each-other for the right to show you an advertisement. The analogy with the situation described above is that the sealed-bid eBay is an ad exchange, I'm an ad network, and Bill is the advertiser. Bill is selling something where he has a (hopefully) known profit margin, and he must decide how much of this profit margin he will spend to sell his product. Similarly, I as the ad network must decide on my cut.

Note 26th July 2010 - This is a republish of an article from 24th July with significant additions and modifications.

A great tip for diagnosing Java memory issues

My good friend (and creator of Apache Wicket) Jonathan Locke just gave me a great tip.  I was aware of the jmap utility, but didn’t know it could do this:

Just type:

jmap -histo:live <pid>

Where <pid> is the process id of a Java process, and it will dump out all objects, starting with whichever is taking up the most memory.

The handy thing is that the Java process doesn’t need to have been started with any kind of special command line options, it will work with any running Java process.

I wish I knew this a few weeks ago while debugging a memory leak on a remote server…

The Guardian writes about Freenet

A few weeks ago I spoke to Andy Beckett of The Guardian in the UK, he was writing an article about “the darknet”, and wanted to talk to me about Freenet.  I was quite pleased about this because when I’m in the UK the Guardian is by far my favorite newspaper.

We had a good conversation, I talked about the original motivation behind Freenet (read about it here), our challenges, like balancing the very theoretical issues we face with the need to write software that non-techies can use, and other things.

At one point he mentioned the dangers of “bad” uses of Freenet, but he did so in a very coy almost apologetic way.  I immediately assumed he was thinking of child pornography, and made it pretty clear that I was more than happy to discuss that, indeed of the hundreds of conversations I’ve had with journalists about Freenet over the past 10 years, I can’t think of one where I didn’t discuss the CP issue!

I told him what I tell everyone, which is that like most people I wish CP didn’t exist, but there are many ways to get it other than Freenet, and I don’t think people should be denied the freedom to communicate just because a small minority might use it for something we don’t agree with.

My impression at the time was that he really wanted to get off the subject, which I thought was a bit surprising since journalists love controversy, and there are few things as controversial as child porn.

Anyway, Google Alerts just told me that the article has appeared on the Guardian’s website (skip to the bottom for the link).

Overall I think it was quite a good article, I was particularly excited to read:

Installing the software takes barely a couple of minutes and requires minimal computer skills

I think Andy may be the first mainstream journalist in the history of Freenet to actually install and use the software before writing about it!

I was a little surprised that it did focus almost exclusively on the negatives implications of an absolutist “free communication” philosophy.  For example the subtitle is:

In the ‘deep web’, Freenet software allows users complete anonymity as they share viruses, criminal contacts and child pornography

I was neither surprised nor annoyed that a journalist would want to talk about computer viruses, criminals, and child porn, after all - controversy attracts clicks and lets face it, our newspapers need all the help they can get attracting revenue these days.  Further, it is perfectly legitimate to talk about the fact that freedom of communication implies freedom for people to communicate data and ideas we don’t like.

I was a little surprised, however, that Andy didn’t seem terribly interested in discussing these issues when we spoke by phone – even though I made it very clear that I was happy to and I even tried to steer the conversation in that direction.

Regardless of this slight surprise, I haven’t yet noticed any major errors in it, at least no errors on Andy’s part (although the [sic] after the American spelling of “pedophile” is not exactly in the spirit of cross-Atlantic harmony).  I am curious about how he knew I was a pasty teenager (although he was quite right)!  Not so much now that I’ve been living in Texas for a few years.

I did find this interesting though:

According to the police, for criminal users of services such as Freenet, the end is coming anyway. The PCeU spokesman says, “The anonymity things, there are ways to get round them, and we do get round them.

If, by “the anonymity things” they are referring to stuff like simple referrers, or even just using your browser in “incognito” mode, then they are correct.  But if they are referring to technologies like Freenet and Tor – then either they are mistaken, lying, or they know something about Freenet and Tor that neither I, nor anyone I’ve ever met or heard from, has discovered.  This is extremely unlikely, I know many of the best security people in the world, and none of them work for the British police.

This isn’t to say that identifying a user of software like Freenet or Tor is impossible, but it would require either an impractical expenditure of resources (bugging computers, etc), or for the user to do something like accidentally disclosing their identity on Freenet.  We can guard against many things, but we can’t guard against stupidity :-)

Anyway, as I said its a pretty good article, and despite its focus on the negative – it will hopefully bring some new users to Freenet.  Read it here: The dark side of the internet