Comments:"ydiff: Structural Comparison of Programs | Surely I Am Joking"
URL:http://yinwang0.wordpress.com/2012/01/03/ydiff/
(Click on the above picture to see it in action. See the end of the post for more demos)
Motivation
I have been imagining a world where programs are not represented as text, but as data structures. They will be edited not with text editors, but with structural editors, which create and manipulate the abstract syntax trees (AST) of programs directly. In such a programming environment, the line-by-line diff utilities and version control systems will stop to work, because we no longer have “lines” or “words” in our programs.
ydiff is a proof-of-concept for solving this problem. It is a diff tool (and hopefully evolving into a full-fledged version control system) designed for “structural programming”.
Currently ydiff takes pains to first parse program text into ASTs and do structural comparison on the trees, but hopefully in the future, programs will be stored directly as data structures, so that the parsing step will be skipped, and ydiff will be readily extensible to all programming languages.
The following is the introduction to ydiff and the demos taken from my homepage
Why ydiff?
We very often want to know how our code is modified. This is normally done by using the diff program. While effective for finding differences between close revisions, the result from diff can be very hard to read for revisions over a long time span. The worst case is when the code is massively rearranged or refactored, where diff may consider every line to have been changed.
ydiff can help in this kind of situation. Different from line-based or character-based diff, ydiff is “language-based”. It parses the programs and then performs structural comparison on the parse trees. In addition to the usual functionalities of diff, it has the following properties:
- Language-awareness. ydiff truly understands language constructs and will not make non-sensical comparisons. For example it will never compare a string “10000″ with an integer 10000 even though they look similar. Also, it tries to find definitions with the same name before it attempts to destruct and compare definitions of different names.
- Refactor detection. It can find renamed, moved, reordered, wrapped, lifted, combined or fragmented code. Refactored code can be detected however deep they are into the structures.
- Format insensitivity. The comparison result will not be affected by linebreaks or whitespaces.
- Comprehensible output. The output of ydiff is designed for human understanding. The interactive UI helps the user navigate and understand changes efficiently.
These properties make ydiff helpful for understanding changes. It may also be possibly used for detecting plagiarism in programming classes or copyright infringement of code.
Supported Languages
Currently ydiff supports the following languages:
- Scheme
- Emacs Lisp
- Python
- Javascript
- C++
The supporting part of a language is just a parser for that language, into a unified parse tree data structure. I have designed a parser combinator library to make parser-writing less painful, but they are still quite tricky to write. Hopefully in the future, the programs will be edited by structural editors and stored as data structures, so no parsers will be needed any more.
Demos
Before going further into the design details, you are invited to take a look at the demos. They are interactive HTML files with a nice UI design. The left and right windows are always locked in their relative position. A mouse click on changed, moved or unchanged nodes will highlight the matched nodes and scroll the other window to match. After that, the windows will be locked into their new relative position for browsing.
Okay, here are the demos.
- Scheme. Comparison of the original miniKanren from Professor Dan Friendman and the version I modified in order to support condc, a “negation operator”. Pay attention to the function
unify
, whose major part is moved intounify-good
. - Emacs Lisp. Comparison of two versions (v20 and v22) of Taylor Campbell’s paredit-mode, a structural editing mode of emacs for Lisp programs. (example removed for non-technical reasons)
- Arbitrary S-expressions. Trying to find a bug in an optimizing pass of my Scheme compiler by comparing the IRs (represented as S-expressions).
- Python. Comparison of two implementations of a small list library that I wrote, which implements Lisp-style lists. The first implementation uses recursive procedures while the second uses Python’s generator syntax and is iterative. Pay some attention to
append
, whose code is moved inside another functionappendAll
. - Javascript. Comparison between two major revisions of the UI code of ydiff itself.
- C++ demo1 and C++ demo2. There are two demos for C++. The first demo compares two versions of the d8 Javascript debugger from the V8 project(v3404 from 2009 and v8424 from 2011). The second demo compares V8′s simulators for two different processors (MIPS and ARM).The d8 demo is especially interesting because by clicking on the lines of the method
Shell::Initialize
in the old version, it can be clearly observed that its body has been distributed into several procedures in the new version:Shell::Initialize Shell::CreateGlobalTemplate Shell::RenewEvaluationContext Shell::InstallUtilityScript
Also the major part of
Shell::Main
is moved into the helperShell::RunMain
.
Algorithms and Technical Details
There are basically two main ideas inside ydiff: Tree Editing Distance and Substructure Extraction, plus the use of the neat idea of parser combinators (similar to Parsec).
- Tree Editing Distance. Tree Editing Distance measures difference between two trees. In our case, it is the abstract syntax tree. When we encounter two trees, how do we compare them? If the first nodes of them appear to be the same, then we simply proceed to the next right? But what if the first nodes are different? There are three possibilities:
The programmer inserteda node into the first tree.
The programmer deleteda node from the first tree.
The programmer modified a node in the first tree.
We have to consider those possibilities for each level of nodes, so this naturally becomes a dynamic programming problem. Normally dynamic programming is done by cleverly find out the order to evaluate the problem space, but ydiff uses another important technique: memoization. There is a two dimensional hash table which stores the “delta” (changes) of two nodes, indexed by the two nodes themselves. The dynamic programming is then done in a recursive way, instead of the usual iterative way. The hash table automatically reduces the cost of recursion to the magnitude of iteration, by avoiding redundant evaluation of the nodes that we already compared.
- Substructure Extraction. Tree Editing Distance can only find the “edits” — insertions, deletions and modifications, but it cannot find refactorization of code. If a piece of code is lifted out of a scope, the Tree Editing Distance function will produce two big changes: a deletion of a big node (for example, an if-statement), and an insertion of a slightly smaller node (with the “if (condition) { … }” wrap stripped off). Notice that although we only made a small change to the program, the resulting delta can be large. What we really want is that it shows a deletion of the wrap (“if (condition) { … }”), but a movement of the parts inside the wrap. This is called Substructure Extraction.Substructure Extraction works by recursively compare the substructures of two nodes, finding possible matches between one node and the subnodes of another. This substructural comparison is only done on the pairs of the resulting insertions and deletions from the Tree Editing Distance function. This little trick captures almost all possible refactorizations between two programs, however deep they are. This is demonstrated in the following picture, where ydiff successfully detects the movement of the function
append
into another functionappendAll
. - Parser Combinators. The parsers of ydiff are written with the help of a parser combinator library (in Scheme). In addition to the common features of Parsec, it detects left recursion and reports a “stack trace” of the left recursion, so that the parser writer can find the origin of left-recursion and fix the problem accordingly. This helps a lot when constructing the parsers. For examples, I constructed the JavaScript parser in one day, and the C++ parser in just two days. There are corner cases missing from the C++ parser and especially the macro part, but it parses significant pieces of code correctly (for example Google’s V8 project code).
Get the code
ydiff is an open source project. You can follow its development on github: yinwang0/ydiff or get its source code from there.
Ideas about structural version control
I have thought about extending ydiff to a structural version control system and gave a talk on that. You can find the talk slides from another post