by Phil Freeman on 2012/12/21

Text-based diff is of limited usefulness for analysing changes to a large code base over time. I've wondered for a while how one might generalize the `diff`

and `patch`

functions on lists to more general data structures such as abstract syntax trees.

It's pretty easy to write down a version of the `diff`

function for, say, binary trees, but less simple to write a function which works generically across multiple data types. As usual, I spent a while thinking about this before realizing the problem had already been solved ([1], [2]). I'm still quite happy with the approach I came up with, and I think it's sufficiently interesting to write about here anyway.

```
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE StandaloneDeriving #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE UndecidableInstances #-}
module Diff where
import Data.Function (on)
import Data.List (maximumBy)
```

The `diff`

function can be specified as follows: for lists `xs`

and `ys`

, find the shortest edit sequence taking `xs`

to `ys`

. We can reduce the problem to finding the longest common subsequence of `xs`

and `ys`

: to obtain the least common subsequence, first remove a subset of elements from `xs`

, and then to obtain `ys`

, insert a subset of elements of `ys`

into the least common subsequence.

Ignoring optimization for the time being, let's take this as our starting point. We'll generalize the longest common subsequence to the largest common substructure.

First, let's examine the inductive definition of all subsequences of a list:

```
subsequences :: [a] -> [[a]]
subsequences [] = [[]]
subsequences (x:xs) = [ xs'' | xs' <- subsequences xs, xs'' <- [ xs', x:xs' ] ]
```

In words: if the input list has no elements, return just the input, otherwise, for each subsequence of the tail, yield two lists: the first including the head of the input, and the second excluding it.

Let's recast that definition in the more generic language of containers:

If the input has no subcontainers, return just the input, otherwise, for each substructure of each subcontainer, yield two substructures: the first including the structure of the input, and the second including only the substructure.

Let's define a type class:

```
newtype Rec f = Rec { runRec :: f (Rec f) }
class Container f where
data Context f :: * -> *
children :: f a -> [(a, Context f a)]
plugIn :: a -> Context f a -> f a
childAt :: f a -> Context f a -> a
```

The type `Rec f`

is the usual type of recursive data structures of shape `f`

. The `Container`

class contains a few methods which deserve explanation.

The associated data type `Context f`

is the type of one-hole contexts of `f`

[3]. The `children`

function takes an input of type `f a`

and returns an array of the contained `a`

's along with their one-hole contexts. The `plugIn`

method takes an `a`

and a context, and plugs the `a`

into the hole defined by context. Finally, the `childAt`

function takes an input of type `f a`

and a context, and returns the `a`

in the input at that context.

There are some fairly obvious laws which instances of the `Container`

class should satisfy:

```
childAt (plugIn a c) c = a
plugIn (childAt a c) c = a
(childAt a c) `elem` (children a)
```

Let's define the type of changes for structures of shape `f`

:

```
type Step f = Context f (Rec f)
data Change f = Skip (Step f) | Take (Step f) | Replace (Rec f) (Rec f)
type Path f = [Change f]
```

and derive some instances:

```
deriving instance (Eq (f (Rec f))) => Eq (Rec f)
deriving instance (Show (f (Rec f))) => Show (Rec f)
deriving instance (Show (Context f (Rec f)), Show (Rec f)) => Show (Change f)
```

With that, let's define the generalization of `subsequences`

:

```
type InContext f = (Rec f, Path f)
substructures :: (Container f) => Rec f -> [InContext f]
substructures (Rec x)
| null $ children x = [(Rec x, [])]
| otherwise =
[ substructure
| (x', ctx) <- children x,
(x'', ctxs) <- substructures x',
substructure <- [ (x'', Skip ctx:ctxs), (Rec $ plugIn x'' ctx, Take ctx:ctxs) ] ]
```

Here, the `plugIn`

and `children`

methods are used to generalize the inclusion/exclusion which took place when we calculated subsequences of a list.

We can now easily find the largest common substructure, and therefore the difference of two structures.

```
type ChangeSet f = (Path f, Path f)
common :: (Container f, Eq (Rec f)) => [InContext f] -> [InContext f] -> [ChangeSet f]
common xs ys = [ (p1, p2) | (s1, p1) <- xs, (s2, p2) <- ys, s1 == s2 ]
takes :: Path f -> Path f
takes = filter take where
take (Take _) = True
take _ = False
size :: ChangeSet f -> Int
size (xs, ys) = length (takes xs) + length (takes ys)
largest :: [ChangeSet f] -> ChangeSet f
largest = maximumBy (compare `on` size)
diff :: (Container f, Eq (Rec f)) => Rec f -> Rec f -> ChangeSet f
diff old new = largest $ (common `on` substructures) old new
```

Let's verify the code on the simple case of lists.

```
data ListF a x = Nil | Cons a x deriving (Show, Eq)
type List a = Rec (ListF a)
instance Container (ListF a) where
data Context (ListF a) x = ListContext a deriving (Show, Eq)
children Nil = []
children (Cons a x) = [(x, ListContext a)]
plugIn x (ListContext a) = Cons a x
childAt (Cons _ x) (ListContext _) = x
nil :: List a
nil = Rec Nil
cons :: a -> List a -> List a
cons a x = Rec $ Cons a x
fromList :: [a] -> List a
fromList = foldr cons nil
```

In GHCi:

```
> diff (fromList [1,2,3]) (fromList [2,3,4])
([Skip (ListContext 1), Take (ListContext 2), Take (ListContext 3)],
[Take (ListContext 2), Take (ListContext 3), Skip (ListContext 4)])
```

The result is as expected: from the first list we skip the first element and take the rest, and from the second list we skip the last element and take the rest.

Now here's a more exciting example - the type of untyped lambda terms:

```
data ExprF x = App x x | Abst String x | Var String deriving (Show, Eq)
type Expr a = Rec ExprF
instance Container ExprF where
data Context ExprF x = AppContext (Either x x) | AbstContext String deriving (Show, Eq)
children (App f x) = [(f, AppContext $ Right x), (x, AppContext $ Left f)]
children (Abst i x) = [(x, AbstContext i)]
children (Var _) = []
plugIn x (AppContext (Left f)) = App f x
plugIn f (AppContext (Right x)) = App f x
plugIn x (AbstContext i) = Abst i x
childAt (App _ x) (AppContext (Left _)) = x
childAt (App f _) (AppContext (Right _)) = f
childAt (Abst _ x) (AbstContext _) = x
abst :: Int -> Expr -> Expr
abst i x = Rec $ Abst i x
app :: Expr -> Expr -> Expr
app f x = Rec $ App f x
var :: Int -> Expr
var i = Rec $ Var i
k :: Expr
k = abst "x" $ abst "y" $ var "x"
s :: Expr
s = abst "x" $ abst "y" $ abst "z" $ app (app (var "x") (var "z")) (app (var "y") (var "z"))
```

And again, we can test in GHCi:

```
diff s k
> ([Take (AbstContext "x"), Take (AbstContext "y"),
Skip (AbstContext "z"), Skip (AppContext (Right (Rec {runRec = App (Rec { runRec = Var "y" }) ... }))),
Skip (AppContext (Right (Rec { runRec = Var "z" })))],
[Take (AbstContext "x"), Take (AbstContext "y")])
```

The diff algorithm has identified `k`

as the largest common substructure, and the path through `s`

which picks out this substructure discards the third abstraction over the variable `z`

and the corresponding applications.

We can also define a function `patch`

which applies the result of `diff`

to a structure:

```
patch :: (Container f, Eq (Rec f)) => Rec f -> ChangeSet f -> Rec f
patch old (inserts, deletes) = foldr wrap (unwrap old inserts) deletes where
wrap (Skip _) x = x
wrap (Take ctx) x = Rec $ plugIn x ctx
wrap (Replace x y) _ = y
unwrap x [] = x
unwrap x (Skip ctx:cs) = Rec $ plugIn (unwrap (childAt (runRec x) ctx) cs) ctx
unwrap x (Take ctx:cs) = unwrap (childAt (runRec x) ctx) cs
unwrap _ (Replace x y:cs) = unwrap x cs
```

The functions `plugIn`

and `childAt`

play a key role, since we need to be able to glue structures together and take them apart again based on the content of the changeset.

We can verify using the examples above (and others) that `diff`

and `patch`

satisfy the required laws `patch x $ diff x y = y`

and `diff x $ patch x p = p`

.

The function `diff`

above does its job, but has exponential complexity in the size of its input, due to the need to generate all substructures.

The key observation is that if two substructures are unequal, then two larger structures can never be equal, so we can prune large subtrees from the search tree by testing equality of contexts.

Here is the optimized version of `diff`

, which is observably much faster than the original:

```
diff2 :: (Container f, Eq (Context f (Rec f))) => Rec f -> Rec f -> ChangeSet f
diff2 old new
| null $ children $ runRec old = ([], [Replace old new])
| null $ children $ runRec new = ([Replace old new], [])
| otherwise = let matches = [ (x', y', ctx1)
| (x', ctx1) <- children $ runRec old
, (y', ctx2) <- children $ runRec new
, ctx1 == ctx2 ] in
if null matches then
largest $ [ let (xs, ys) = diff2 x' new in (Skip ctx:xs, ys)
| (x',ctx) <- children $ runRec old ] ++
[ let (xs, ys) = diff2 old y' in (xs, Skip ctx:ys)
| (y',ctx) <- children $ runRec new ]
else
largest [ let (xs, ys) = diff2 x' y' in (Take ctx:xs, Take ctx:ys)
| (x', y', ctx) <- matches ]
```

Notice that the new constraints only require equality of contexts, not of structures.

Compare this implementation with the dynamic programming implementation of the longest common subsequence on lists:

```
diffList :: (Eq a) => [a] -> [a] -> [a]
diffList [] new = []
diffList old [] = []
diffList old@(x:xs) new@(y:ys)
| x == y = (x:diffList xs ys)
| otherwise = maximumBy (compare `on` length) [ diff old ys, diff xs new ]
```

Hopefully the structural similarity of the two algorithms is clear.

One-hole contexts give a pleasant generalization of `diff`

and `patch`

to containers.

There are still some issues remaining with this implementation, such as the lack of memoization and the incorrect handling of containers with multiple non-recursive constructors, or structures with no common substructures.

It would also be interesting to explore the extension to mutually recursive types, such as the types of statements and expressions in an abstract syntax tree.

- Package Data.Generic.Diff on Hackage
- Generic type-safe diff and patch for families of datatypes by Eelco Lempsick, 2009
- The Derivative of a Regular Type is its Type of One-Hole Contexts by Conor McBride

Copyright Phil Freeman 2010-2015