Why doesn’t this exist yet: Syntax-aware merge

Hey programmers – what if I told you that there was a fairly interesting problem out there, that would probably be just a few weekends of work, and which if you implemented it you could find your code part of the toolset of a significant proportion of software developers worldwide (especially if you open sourced it)?

Well, just such a problem has been floating around for several years, and is as-yet unanswered.  I proposed it two years ago, and others have suggested it several years before that.

Anyone familiar with source control systems like svn and git will be familiar with the occasional need to “merge” changes.  This is to take changes to one version of a file, and apply the same changes to a different (but not too different) version of that same file.

Today’s merge tools, at least to the best of my knowledge, basically treat all text files the same.  Whether its Java, HTML, C++, or CSS, as far as the merge tool is concerned, they are all just lines of text.

This works ok in many situations, but it can get tripped up by situations like this:

   Book book = new Book(
        String name,
        int length);

Let’s say we change this to:

   Book book = new Book(String name, int length);

A programmer knows that all we’ve changed here is formatting, but a merge algorithm will see this has two lines deleted, and one line changed – quite a major alteration.

It is true that passing both files through a formatter would help in this case, but there are many other situations where it wouldn’t.  Consider renaming a variable throughout a file, inlining a variable, and other common refactoring operations.

My proposal is fairly simple.  Rather than treat Java files as simple text, we parse the files into an “Abstract Syntax Tree” using a Java parser, maybe even matching identifier names like variables, and treat the merge operation as a merge of these trees, rather than simple lines of text.

After the merge we convert the resultant tree back to text, perhaps even formatting it as we go.

The result of this, I suspect, would be dramatically more robust merge operations.

So what would be involved here?  Well, it shouldn’t be necessary to write a bunch of parsers, because these already exist for most languages.  The main thing would be the tree merge algorithm, which would be a neat self-contained computer science project.  Perhaps there is even an off-the-shelf open source tree merge algorithm out there.

Regrettably I’ve already got a bunch of projects on my plate already, but perhaps there is someone out there interested in taking on this challenge?  If they did I could almost guarantee that the resultant software (which I assume they’d open source) would be very widely adopted.

If you are interested I’ll be happy to help and advise. You can email me at syntaxmerge@ian.33barfmail.com.

51 thoughts on “Why doesn’t this exist yet: Syntax-aware merge

  1. Jim O'Flaherty


    I agree. I have been wanting something like this for around 20 years. Back then, I did have to do all the parsing logic and it was very slow to do for every file.

    However, given that Eclipse keeps an AST of the current file, it sure seems that some better way to save the AST (as opposed to it’s expressed text) to the source control system.

    IOW, it would be really nice if one could write entire Java projects where all the existed in the source control system were ASTs. And then, only when a particular user wanted to view/edit the source file, it could be realized as text IN THEIR PARTICULAR FORMATTING STYLE (i.e. two character tabs, curly brace on different line from method definition, specific line-wrapping for long strings, etc.). After edits, the check-in would then move it right back to the real data structure, an AST, as opposed to the more error prone formatting-flame-war inducing text format.

    I wonder how difficult it would be to get something like Eclipse re-configured to store and load ASTs for Java source files as opposed to using text files?

    My two beefs are that writing code still requires text source in the form of OS files as the baseline for all tool development. My strong preference would be a more sophisticated AST in the form of DB records as the baseline, instead. It substantially improves system portability while also eliminating lots of “flame-war” surface area.

  2. ian Post author

    Jim, I agree that it would be nice to see that in Eclipse. Only thing is that many Java projects contain XML, HTML, Javascript, CSS, and other file types. It would be nice to see the benefits of this over a variety of file types, not just Java.

  3. Daniel Ehrenberg

    I remember seeing some relatively old academic papers about algorithms for merging source code, but I can’t find them right now. For the specific case of XML (which isn’t all that specific), I wrote a little possibly-inaccurate survey article on the topic: http://www.scribd.com/doc/14482474/XML-diff-survey . Algorithms there are from the specific domain of XML comparison, and from the more general domain of tree edits. There are a couple issues with deploying something like this in practice:

    – You have to have a special implementation for each programming language, and now the version control system has to know a lot more about the languages.
    – It’s computationally expensive to do really good diffs and merges, if you want to take transpositions into account.
    – You have to be able to present conflicts in a meaningful way.
    – You have to be careful to have a good heuristic for not merging things which might be bad semantically. I think diff3 takes a line above and below a change into account for context, so if you change two things on adjacent lines in two branches and merge it, then it’ll report a conflict. How will this work in merging trees?
    – Should the merge take semantic information into account, like types or XML schemas? You might be more accurate and even more efficient if you do, but it’s hard (and what if the types/schemas change?).
    – Can you made a O(n log n) or better algorithm that’ll have good enough results? If you have an algorithm that’s quadratic or cubic in the length of the file, it’ll probably be impractical.

    And this is just for tree edits. Other refactorings like renaming things are even harder to figure out (though I think there have been papers about this). If you have an editor which performs the refactorings in an automated way, then you can make that editor output a log of the changes it made, and do a merge based on that–this *might* be easier and more accurate, but obviously has implementation challenges of coordinating everything.

    Something to do that’s even simpler than this, but could get more accurate merge results, is if we did diff/merge by characters instead of by lines. But lines have several advantages in practice:
    – It’s computationally simpler, since there are fewer lines than characters
    – We have well-publicized and de facto standardized file formats for patches and merges
    – It’s easy to present patches and merge conflicts with lines in a way that people can read
    – You don’t get many false positives for what correlates with what if the lines are the same.

    An intermediate possibility is to compare token streams rather than small characters or huge lines. This could be insensitive to insignificant changes in whitespace. It has the disadvantage that it doesn’t understand tree operations (for example, moving some code from a higher level to a branch of an if statement is treated like adding braces on both sides, rather than a single operation of lowering it by a level in a tree) but I’m not sure if this is useful for programming language stuff anyway.

    I’d be very happy to see someone figure out these various practical problems and design a real system to solve this problem! But until then, line-based diffs and merges aren’t that bad.

  4. Matchu

    Hm. Sounds like it’d be really useful, and really, really slow. Probably wouldn’t be a big enough deal to merit the tax on our patience :/

    But if it were about as fast as line-based merges, then, yes, please!

  5. ian Post author

    Matchu, I can’t see any reason why it would need to be slow. Parsing is fast, and I’m sure the tree merge algorithm could be fast too if intelligently implemented.

  6. Pingback: Tweets that mention Why doesn’t this exist yet: Syntax-aware merge | Hypergraphia Indulged -- Topsy.com

  7. twowaystreet

    ian: Alright, let’s say the parsing takes your syntax and converts it to something similar to the output of the closure compiler for javascript. For performance, this is not an unreasonable assumption.

    Now try to map the output of that to the original input. It’s not a particularly easy task.

  8. Cesar

    you don’t need to invent a tree merge algorithm. serialize your ast so that each production you are interested in results in one textual line. you need to add to each production a uid or other marker that allows you to recover where it came from. do the merge as a textual merge behind the scenes, and map back to the original surface syntax using the uids.

    if you have a lisp background, think of pretty-printing before the merge, and rendering the diff without pretty-printing.

  9. Ruudjah

    Speed can be improved by keeping an AST from the previous build, make a copy and then update the copy using the changeset. No need to build a complete AST from the ground up each time. For a weekend hack, it would be quite nice to start with a complete AST buildup, but later finetuning can add AST instance comparison.

    To support multiple languages, an abstracted set of interfaces is needed to perform the comparison. Quite the cool job to design!

    Another great benefit of this idea. Formatting of the text in the codefile could be just a setting in the IDE. No more codestyle guides, the submitted code can be formatted as the programmer wishes. Not adhering to the codestyle does not matter, the IDE can take the AST of the codefile and then present it. (hmm, can’t this be done already? IDE should has an AST instance)

    Minor remark about the blogpost: horrible example. A Book with a “name”? A book with “length”? I do know books with titles and amountPages though ;).

  10. Neil Fraser

    The XML merge you are looking for is described in the DocEng 2009 paper “Efficient Change Control of XML Documents” by Sebastian Rönnau, Geraint Philipp and Uwe M. Borghoff. Highly recommended, it won the best paper award that year.

  11. Chris Lesner

    I have my auto formatting rules/preferences break up syntax as much as possible across lines (for example every import is on it’s own line in java and in SQL every table and every boolean where clause is on its own line). With this method merging using existing tools give you a good approximation to a syntax merge for most changes that happen in practice. My guess this is why no one has yet bothered with a full syntax merge — existing approximations are good enough.

  12. Jim Rush

    I agree, but it should go a step further. I would like to see the entire source code tool chain store source logically instead of text files. The view (source formatting) really should be an end user choice.

    I’ve seen a few development groups attempt to do things like set a style conversion on check-in, but this doesn’t prevent diff tools from thinking that there have been changes due to formatting differences.

    Code is not art (contrary to a few people’s view of the world). We do not necessarily need to see code the way the author intended.

  13. Prakash Vishnu

    It is the easy job indeed to deploy the VB string search function. In this way you can check the insides for the syntaxes of the string to be which you are working.

    Simple was is to make color string by the color function of the VB and the rich text control. You don’t forget to be setting the dimensions of the control no no.

  14. Brian Peterson

    Isn’t there the possibility of logic or flow-control errors arising? Sometimes placement within a file matters.

  15. bkfwarszawa

    It’s really a cool and helpful piece of info. I’m glad
    that you shared this helpful information with us.
    Please keep us up to date like this. Thanks for

  16. Frankie

    Turn the heat off and allow the treats to cool thoroughly before removing.
    If you would like to own pasta, you can add veggies as side
    dishes or make a salad utilizing pasta. Place a frying pan or iron skillet
    over medium to high heat.

    Feel free to surf to my page :: good recipes for church
    dinners (Frankie)

  17. klikurl.nl

    > Fix Dll and also other registry corruption by means of Reginout scan and Replace Microsoft Direct -X.

    In that same article, Khalaf writes, ‘there appears to be more of a middle class in the
    App Store; that is, more companies bringing in respectable revenues.
    Here are 4 must-have tips to get appointments with your crazy-busy clients.

  18. Local Breads: Sourdough and Whole-Grain Recipes from Europe's Best Artisan Bakers pdf

    Hey! This is kind of off topic but I need some help from an established blog.
    Is it very hard to set up your own blog? I’m not very techincal but I can figure things out pretty quick.
    I’m thinking about setting up my own but I’m not sure where to begin. Do you have any tips or suggestions?
    With thanks

    My web blog Local Breads: Sourdough and Whole-Grain Recipes from Europe’s Best Artisan Bakers pdf

  19. กระดาษทรายขัดน้ำ

    On the nose, it’s what should be expected for a Double IPA.
    Some natural remedies will help whiten teeth, such as apple
    cider vinegar. A modern jet airliner cruises at an altitude
    of 12,000 metres, consequently jets flying
    through contaminated airspace would need to fly through the ash cloud when attaining their cruising altitude and when descending to land.

  20. health

    Please let me know if you’re looking for a author for your weblog.
    You have some really great posts and I feel I would be a
    good asset. If you ever want to take some of the load off,
    I’d absolutely love to write some articles for your blog in exchange for a link back to
    mine. Please send me an e-mail if interested. Regards!

  21. Florencia

    Do you understand what the diminished car
    value from an accident means. Home based auto repair shops give
    services for usual upkeep or repairs, which is why simpler tools are enough for their use.
    Many owners pop the hood of their truck and
    blast at their engine with a shiny cleaning solution.

    my website; Local Auto Shops Allentown (Florencia)

  22. Tyson

    I think that everything wrote made a ton of sense.
    However, consider this, what if you added a little content?
    I am not saying your content isn’t solid., however what if you
    added something that makes people want more? I mean Why doesn

  23. http://romimoki.tumblr.com/

    I’m really enjoying the design and layout of your site.
    It’s a very easy on the eyes which makes it much more enjoyable for me to come here
    and visit more often. Did you hire out a developer to create
    your theme? Exceptional work!

  24. Dawn

    If you want that “Best Lawn” title, Sergio’s Landscaping is the
    way to go. The organic matter that lawns produce gives rich nutrients to
    the soil, which are actually important in keeping its structure healthy.
    So, it is better to refer to experts for you to
    save time and money as well.

    My webpage … yard maintenance services carlsbad [Dawn]

  25. video camera

    So now I will give you a step to step guide about how to
    recover formatted photos. you’ll find out the Canon Digital Rebel XSi is a really powerful Canon DSLR after you view the
    Canon Digital Rebel XSi Review, Canon Digital Camera Review.
    This past Christmas Santa agreed that I had been very, very good.

    Feel free to visit my website – video camera

  26. corporate gifts singapore cheap

    May I just say what a relief to uncover someone that truly
    knows what they’re talking about online. You certainly know how to bring an issue to
    light and make it important. More and more people need to read this and understand this side of the story.
    I was surprised that you’re not more popular because
    you certainly possess the gift.

  27. zimbricchio

    Remapping the ADT to code is tough, better to rewrite it back via our favourite formatter as several folks here said.

    For this reason there needs to be a well defined syntax for comments, that tells to which block of code the comment belongs to.
    I don’t know any programming language that has such a syntax.

    Nice idea and nice thread.

  28. Victor

    Hello my family member! I want to say that this post is amazing, great written and include almost all important infos.
    I’d like to look extra posts like this .

    Review my web blog: xvideos.com (Victor)

  29. http://lynneisdamson017.bravesites.com/

    I do not know whether it’s just me or if perhaps everyone else encountering issues with your blog.
    It appears like some of the written text within your
    posts are running off the screen. Can somebody else please provide feedback and let me
    know if this is happening to them too? This might be
    a problem with my internet browser because I’ve had this happen before.

    Feel free to visit my weblog Aleuriaaurantia (http://lynneisdamson017.bravesites.com/)

  30. Hildegard

    I see you share interesting things here, you can earn some extra cash, your blog has big potential, for
    the monetizing method, just search in google – K2 advices how to monetize a website

  31. www.pinpimp.com

    If Nunchuck is left attached to the consoles, the battery drains out
    at a lightening speed. He is seen pulling himself on to a legde, ala Sonic Adventure 2.
    You won’t have to listen to music on a separate system thanks to homebrew.

  32. شركة تنظيف خزانات بجده

    Fantastic items from you, man. I’ve understand your stuff previous to and you are simply extremely wonderful.
    I actually like what you’ve acquired here, certainly like what you are
    stating and the best way wherein you say it. You’re making it entertaining and you
    continue to care for to keep it sensible.
    I can’t wait to learn much more from you. That is actually a great web site.

  33. kviklan

    Hello! Would you mind if I share your blog with my twitter group?
    There’s a lot of people that I think would really appreciate your content.
    Please let me know. Thanks

  34. Moonshine Still

    Contrast their skill set to my Washington, DC neighbor who claimed that a properly conjured stream of profanity and
    a swift blow with a wrench could fix almost anything.
    For heating purposes: If you want to heat a room for
    you or your family, there is a ceramicor clay pot, you can buy, that gets heated up by the alcohol stoves and warms
    your room. Indeed in those types of situations, they are the only real entrepreneurs, even if their system is full
    of corruption.


Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>