Social and professional networking

View Ian Clarke's profile on LinkedIn
Shameless plug

Does your company's revenue depend on being able to predict the future based on past data?  SenseArray may be able to help.

RSS
Links
« BeanShell conveniently packaged for the Mac | Main | Complexity from simplicity »
Saturday
Nov292008

A plan for better source code diffs and merging

I've been using Subversion for years, but a few months ago I was really starting to feel the limitations of being able to create and merge branches easily. I'd heard that Git made this very easy indeed, and so I decided to try it.

Anyway, this isn't yet another "I discovered Git and now I've achieved self-actualization" blog post, so to cut a long story short, I now use git for everything (together with the excellent GitHub).

Even though merging is a lot better with Git than Subversion, I've still found myself getting into situations where it requires a lot of work to merge a branch back into another branch, and it got me thinking about better ways to do merging.

While I'm no merging expert, it seems that most merging algorithms do it on a line-by-line basis, treating source code as nothing but a list of lines of text.  It got me thinking, what if the merging algorithm understood the structure of the source code it is trying to merge?

So the idea is this:

Provide the merge algorithm with the grammar of the programming language, perhaps in the form of a Bison, Yacc, or JavaCC grammar file

The merge algorithm then uses this to parse the files to be diffed and/or merged into trees, and then the diff and merge are treated as operations on these trees.  These operations may include creating, deleting, or moving nodes or branches, renaming nodes, etc.  There has been quite a bit (pdf) of academic research on this topic, although I haven't yet found off-the-shelf code that will do what we need.  Still, it shouldn't be terribly hard to implement.

The beauty of this approach is that the merge algorithm should be far less likely to be confused by formatting changes, and much more likely to correctly identify what has changed.

I can't think of any reason that such a tool wouldn't work in the exact same way as existing diff/merge tools from the programmer's perspective. The tool would automatically select the correct grammar based on the file extension, or fall-back to line-based diffs if the extension is unrecognized (or the file isn't valid for the selected grammar). Thus, it should be trivial to use this new tool with existing version control systems.

I'd love to have the time to implement this, although regretfully it is at the bottom of a very large pile of "some day" projects.  I think this is an interesting enough idea, and one that would be immediately useful, that if I put it out there someone somewhere might be able to make it a reality.

Any takers? I've set up a Google Group for further discussion, please join if interested.

Reader Comments (17)

Monticello in Squeak, a Smalltalk, does this.

November 29, 2008 | Unregistered CommenterJosh ben Jore

Isn't this what some tokenizer projects try to achieve? Lexical analyzation of source code?

November 29, 2008 | Unregistered Commentermarkus

markus, tokenizers do parse source code, but that is only the first stage of my proposal. The important bit is the tree merging.

November 29, 2008 | Unregistered CommenterIan Clarke

Charles Simonyi has started a company called Intentional Software. Their objective is to develop a system that stores source as an AST (abstract syntax tree) or semantic tree (I don't recall which is the better description), rather than text source code.

You use their system to develop domain specific languages. One of the problems they they had to solve is this kind of diffing/merging, but applied to trees. When discussing it, they stated that specific merging policies must be developed for each language, its not just a matter of comparing trees using a general algorithm.

It would seem that using trees may not make this problem any easier, as the resulting merged tree may not be valid. Imagine diffing/merging XML documents at the tree/node level. The two documents are valid according to the schema, but depending on how the merge algorithm shuffles the nodes around, the resulting merged document may not be valid.

A lot of the meaning of an AST is determined by the semantic model used to process it (compiler/interpreter/etc). Some tree constructs are valid, others not, and there is probably no way of producing a general tree diffing/merging algorithm that will generate a valid tree all the time.

In addition to this, the algorithms presented in the paper that you mentioned, run in O( |T|.|T2|.d.d2 ) time (T = tree 1, T2 = tree 2, |T| and |T2| are the # of nodes in the trees, and d/d2 are the depths of the trees (I think)). In a project of my own, a Python source file that was about 1k lines long resulted in an AST of about 15k nodes, not sure of the depth. Comparing documents of that size could take some time....

November 29, 2008 | Unregistered CommenterGeoffrey French

Check out Harmony (now Boomerang) - research project with code by Benjamin C. Pierce and others.

http://www.seas.upenn.edu/~harmony

November 29, 2008 | Unregistered CommenterRobert Cowham

In addition to Geoffrey's comment, you'd have to deal with the tree-diff being confused by syntactially erroneous source code which may have been submitted to source control.

November 29, 2008 | Unregistered CommenterJRV

NetBeans and Eclipse both have features where you can 'replay' refactorings built in one IDE on a source code base in another IDE. It might be interesting to think about how these two concepts meet.

November 29, 2008 | Unregistered CommenterAndrew

That was my first thought too, where you have a submission of intentionally broken and partially completed code because the author hasn't figured everything out yet. So maybe the use case is to ensure that everything passes a set of sanity checks first.

November 29, 2008 | Unregistered CommenterBen

I've been wondering recently if it's possible to annotate changes per diff chunk, somewhere. And later, see it while diffing. Also I'm not talking about the commit message.

Adding it as comments to source code is the obvious way but this way it will get full with useless comments. I don't care why x lines changed between r333 and r334, I only care why the code does what it does.

So, is there a tool (probably a diff tool) which allows you to annotate each diff chunk? Or do you like the idea?

November 29, 2008 | Unregistered CommenterGabriel

Great idea, but why stop with version control? Why should source code be text in the first place? It could be written directly in tree form. There would be no formatting, only cells with restricted values. Like intellisense on steriods. Python's logical whitespace can be seen as a babystep in this direction. Repositories would be much smaller, merges lightning fast. And by uncoupling formatting from content, programmers would be free to invent new visualizations that might make code easier to read and navigate (after all, text evolved to suit prose). Find-and-replace would be logical too -- no more horseplay. In fact, if you needed to change the name of a class or variable, it would be updated everywhere all at once, without any extra steps. Moreover, the diff would know that no change took place, since cell names are not logical.

November 29, 2008 | Unregistered CommenterCarl Lumma

I have already had that same idea some years ago and looking in the internet i did find others who had it before too.
Too bad we don't have the tools for that available yet.

November 29, 2008 | Unregistered CommenterYetAnother

@Gabriel: Such tools exist, and they're generally called code reviewing software. An open source example is reviewboard.

November 30, 2008 | Unregistered CommenterMuhammad Haggag

My life would be so much easier if there was a way for diff to just see that

<foo>
<something/>
</foo>

was the same as
<foo><something/></foo>

Of course, whitespace can't be ignored in every case... guess that's where the parsers/trees would come in ;)


Off topic, but I just wanted to mention that I found it really difficult to tell your links apart from plain text since there's no underline and #002157 and #666666 look really similar on my monitor for some reason. I didn't even realize you had links throughout your post until one of them had "(pdf)" after them!

December 2, 2008 | Unregistered CommenterCarolN

Here's a thought: How about running both files (old and new) through a pretty-printer, noting the original location for each lexeme, sorting members first, then methods and by name. Then do a regular text diff on the two resulting files, and map the diffs back to the original code. You probably would have to display the result in a GUI like Meld or something since diffs could now be great length apart. At least you would be able to extract more of the meaning of the difference instead of insignificant syntax (such as whitespace differences).

December 4, 2008 | Unregistered CommenterRoland Kaufmann

This post prompted a look at Linus's presentation of git to the assembled googloids (http://www.youtube.com/watch?v=4XpnKHJAok8).

What I found interesting was his comment that SVN people are "stupid and ugly" because they tried to do CVS better. Rather ironic given that regardless of git's (and the distributed approach to source control's) merits, he is also failing to recognize the fundamental issue that gives rise to the (clearly) very difficult problem of merging software artifacts. (Per his presentation, it doesn't sound like git has solved the merge problem.)

We are practitioners of a discipline that is still decidedly in its 'arts and crafts' stage, yet is subject to industrial consumer demands. We are not an industry and software engineering is a somewhat presumptuous term, when we consider other engineering disciplines.

The 'merge problem' will simply disappear, become irrelevant, once we finally turn the corner and have a proper industrial revolution for software, where parts are machine fabricated and the essential human input is at the design stage.

Until that time, one supposes we will be faced with this very difficult problem and pursue solutions (such as the AST approach) which is somewhat akin to a molecular structure analysis on merging candidate physical artifacts of a complex system.

Given that, wouldn't it be simpler to simply attack this problem at the methodology level?

/R

December 4, 2008 | Unregistered CommenterJoubin Houshyar

A sail veering about the blank bay waiting for a swollen designer sunglasses wholesale bundle to bob up, roll over to the sun a puffy replica sunglasses wholesale face, salt white. Here I am. They followed the winding path down to the creek. Buck Mulligan stood on a stone, in shirtsleeves, his wholesale cheap sunglasses unclipped tie rippling over his shoulder. A young man clinging to a spur of rock near him moved slowly frogwise his green sunglass wholesale legs in the deep jelly of the water. Buck Mulligan sat down to unlace his wholesale oakley sunglasses boots. An elderly man shot up near the spur of rock a blowing red sunglasses for wholesale face. He scrambled up by the stones, water glistening on his fashion sunglasses wholesale pate and on its garland of grey hair, water rilling over his chest and paunch and spilling jets out of his wholesale sunglasses china black sagging loincloth.

July 13, 2010 | Unregistered Commentersunglass wholesale

A sail veering about the blank bay waiting for a swollen designer sunglasses wholesale bundle to bob up, roll over to the sun a puffy replica sunglasses wholesale face, salt white. Here I am. They followed the winding path down to the creek. Buck Mulligan stood on a stone, in shirtsleeves, his wholesale cheap sunglasses unclipped tie rippling over his shoulder. A young man clinging to a spur of rock near him moved slowly frogwise his green sunglass wholesale legs in the deep jelly of the water. Buck Mulligan sat down to unlace his wholesale oakley sunglasses boots. An elderly man shot up near the spur of rock a blowing red sunglasses for wholesale face. He scrambled up by the stones, water glistening on his fashion sunglasses wholesale pate and on its garland of grey hair, water rilling over his chest and paunch and spilling jets out of his wholesale sunglasses china black sagging loincloth.

July 13, 2010 | Unregistered Commentersunglass wholesale

PostPost a New Comment

Enter your information below to add a new comment.

My response is on my own website »
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>