Why doesn’t this exist yet: Syntax-aware merge

Hey programmers – what if I told you that there was a fairly interesting problem out there, that would probably be just a few weekends of work, and which if you implemented it you could find your code part of the toolset of a significant proportion of software developers worldwide (especially if you open sourced it)?

Well, just such a problem has been floating around for several years, and is as-yet unanswered.  I proposed it two years ago, and others have suggested it several years before that.

Anyone familiar with source control systems like svn and git will be familiar with the occasional need to “merge” changes.  This is to take changes to one version of a file, and apply the same changes to a different (but not too different) version of that same file.

Today’s merge tools, at least to the best of my knowledge, basically treat all text files the same.  Whether its Java, HTML, C++, or CSS, as far as the merge tool is concerned, they are all just lines of text.

This works ok in many situations, but it can get tripped up by situations like this:

   Book book = new Book(
        String name,
        int length);

Let’s say we change this to:

   Book book = new Book(String name, int length);

A programmer knows that all we’ve changed here is formatting, but a merge algorithm will see this has two lines deleted, and one line changed – quite a major alteration.

It is true that passing both files through a formatter would help in this case, but there are many other situations where it wouldn’t.  Consider renaming a variable throughout a file, inlining a variable, and other common refactoring operations.

My proposal is fairly simple.  Rather than treat Java files as simple text, we parse the files into an “Abstract Syntax Tree” using a Java parser, maybe even matching identifier names like variables, and treat the merge operation as a merge of these trees, rather than simple lines of text.

After the merge we convert the resultant tree back to text, perhaps even formatting it as we go.

The result of this, I suspect, would be dramatically more robust merge operations.

So what would be involved here?  Well, it shouldn’t be necessary to write a bunch of parsers, because these already exist for most languages.  The main thing would be the tree merge algorithm, which would be a neat self-contained computer science project.  Perhaps there is even an off-the-shelf open source tree merge algorithm out there.

Regrettably I’ve already got a bunch of projects on my plate already, but perhaps there is someone out there interested in taking on this challenge?  If they did I could almost guarantee that the resultant software (which I assume they’d open source) would be very widely adopted.

If you are interested I’ll be happy to help and advise. You can email me at syntaxmerge@ian.33barfmail.com.

19 thoughts on “Why doesn’t this exist yet: Syntax-aware merge

  1. Jim O'Flaherty


    I agree. I have been wanting something like this for around 20 years. Back then, I did have to do all the parsing logic and it was very slow to do for every file.

    However, given that Eclipse keeps an AST of the current file, it sure seems that some better way to save the AST (as opposed to it’s expressed text) to the source control system.

    IOW, it would be really nice if one could write entire Java projects where all the existed in the source control system were ASTs. And then, only when a particular user wanted to view/edit the source file, it could be realized as text IN THEIR PARTICULAR FORMATTING STYLE (i.e. two character tabs, curly brace on different line from method definition, specific line-wrapping for long strings, etc.). After edits, the check-in would then move it right back to the real data structure, an AST, as opposed to the more error prone formatting-flame-war inducing text format.

    I wonder how difficult it would be to get something like Eclipse re-configured to store and load ASTs for Java source files as opposed to using text files?

    My two beefs are that writing code still requires text source in the form of OS files as the baseline for all tool development. My strong preference would be a more sophisticated AST in the form of DB records as the baseline, instead. It substantially improves system portability while also eliminating lots of “flame-war” surface area.

  2. ian Post author

    Jim, I agree that it would be nice to see that in Eclipse. Only thing is that many Java projects contain XML, HTML, Javascript, CSS, and other file types. It would be nice to see the benefits of this over a variety of file types, not just Java.

  3. Daniel Ehrenberg

    I remember seeing some relatively old academic papers about algorithms for merging source code, but I can’t find them right now. For the specific case of XML (which isn’t all that specific), I wrote a little possibly-inaccurate survey article on the topic: http://www.scribd.com/doc/14482474/XML-diff-survey . Algorithms there are from the specific domain of XML comparison, and from the more general domain of tree edits. There are a couple issues with deploying something like this in practice:

    – You have to have a special implementation for each programming language, and now the version control system has to know a lot more about the languages.
    – It’s computationally expensive to do really good diffs and merges, if you want to take transpositions into account.
    – You have to be able to present conflicts in a meaningful way.
    – You have to be careful to have a good heuristic for not merging things which might be bad semantically. I think diff3 takes a line above and below a change into account for context, so if you change two things on adjacent lines in two branches and merge it, then it’ll report a conflict. How will this work in merging trees?
    – Should the merge take semantic information into account, like types or XML schemas? You might be more accurate and even more efficient if you do, but it’s hard (and what if the types/schemas change?).
    – Can you made a O(n log n) or better algorithm that’ll have good enough results? If you have an algorithm that’s quadratic or cubic in the length of the file, it’ll probably be impractical.

    And this is just for tree edits. Other refactorings like renaming things are even harder to figure out (though I think there have been papers about this). If you have an editor which performs the refactorings in an automated way, then you can make that editor output a log of the changes it made, and do a merge based on that–this *might* be easier and more accurate, but obviously has implementation challenges of coordinating everything.

    Something to do that’s even simpler than this, but could get more accurate merge results, is if we did diff/merge by characters instead of by lines. But lines have several advantages in practice:
    – It’s computationally simpler, since there are fewer lines than characters
    – We have well-publicized and de facto standardized file formats for patches and merges
    – It’s easy to present patches and merge conflicts with lines in a way that people can read
    – You don’t get many false positives for what correlates with what if the lines are the same.

    An intermediate possibility is to compare token streams rather than small characters or huge lines. This could be insensitive to insignificant changes in whitespace. It has the disadvantage that it doesn’t understand tree operations (for example, moving some code from a higher level to a branch of an if statement is treated like adding braces on both sides, rather than a single operation of lowering it by a level in a tree) but I’m not sure if this is useful for programming language stuff anyway.

    I’d be very happy to see someone figure out these various practical problems and design a real system to solve this problem! But until then, line-based diffs and merges aren’t that bad.

  4. Matchu

    Hm. Sounds like it’d be really useful, and really, really slow. Probably wouldn’t be a big enough deal to merit the tax on our patience :/

    But if it were about as fast as line-based merges, then, yes, please!

  5. ian Post author

    Matchu, I can’t see any reason why it would need to be slow. Parsing is fast, and I’m sure the tree merge algorithm could be fast too if intelligently implemented.

  6. Pingback: Tweets that mention Why doesn’t this exist yet: Syntax-aware merge | Hypergraphia Indulged -- Topsy.com

  7. twowaystreet

    ian: Alright, let’s say the parsing takes your syntax and converts it to something similar to the output of the closure compiler for javascript. For performance, this is not an unreasonable assumption.

    Now try to map the output of that to the original input. It’s not a particularly easy task.

  8. Cesar

    you don’t need to invent a tree merge algorithm. serialize your ast so that each production you are interested in results in one textual line. you need to add to each production a uid or other marker that allows you to recover where it came from. do the merge as a textual merge behind the scenes, and map back to the original surface syntax using the uids.

    if you have a lisp background, think of pretty-printing before the merge, and rendering the diff without pretty-printing.

  9. Ruudjah

    Speed can be improved by keeping an AST from the previous build, make a copy and then update the copy using the changeset. No need to build a complete AST from the ground up each time. For a weekend hack, it would be quite nice to start with a complete AST buildup, but later finetuning can add AST instance comparison.

    To support multiple languages, an abstracted set of interfaces is needed to perform the comparison. Quite the cool job to design!

    Another great benefit of this idea. Formatting of the text in the codefile could be just a setting in the IDE. No more codestyle guides, the submitted code can be formatted as the programmer wishes. Not adhering to the codestyle does not matter, the IDE can take the AST of the codefile and then present it. (hmm, can’t this be done already? IDE should has an AST instance)

    Minor remark about the blogpost: horrible example. A Book with a “name”? A book with “length”? I do know books with titles and amountPages though ;).

  10. Neil Fraser

    The XML merge you are looking for is described in the DocEng 2009 paper “Efficient Change Control of XML Documents” by Sebastian Rönnau, Geraint Philipp and Uwe M. Borghoff. Highly recommended, it won the best paper award that year.

  11. Chris Lesner

    I have my auto formatting rules/preferences break up syntax as much as possible across lines (for example every import is on it’s own line in java and in SQL every table and every boolean where clause is on its own line). With this method merging using existing tools give you a good approximation to a syntax merge for most changes that happen in practice. My guess this is why no one has yet bothered with a full syntax merge — existing approximations are good enough.

  12. Jim Rush

    I agree, but it should go a step further. I would like to see the entire source code tool chain store source logically instead of text files. The view (source formatting) really should be an end user choice.

    I’ve seen a few development groups attempt to do things like set a style conversion on check-in, but this doesn’t prevent diff tools from thinking that there have been changes due to formatting differences.

    Code is not art (contrary to a few people’s view of the world). We do not necessarily need to see code the way the author intended.

  13. Prakash Vishnu

    It is the easy job indeed to deploy the VB string search function. In this way you can check the insides for the syntaxes of the string to be which you are working.

    Simple was is to make color string by the color function of the VB and the rich text control. You don’t forget to be setting the dimensions of the control no no.

  14. zimbricchio

    Remapping the ADT to code is tough, better to rewrite it back via our favourite formatter as several folks here said.

    For this reason there needs to be a well defined syntax for comments, that tells to which block of code the comment belongs to.
    I don’t know any programming language that has such a syntax.

    Nice idea and nice thread.


Leave a Reply

Your email address will not be published.