Text-based diff is of limited usefulness for analysing changes to a large code base over time. I've wondered for a while how one might generalize the diff
and patch
functions on lists to more general data structures such as abstract syntax trees.
It's pretty easy to write down a version of the diff
function for, say, binary trees, but less simple to write a function which works generically across multiple data types. As usual, I spent a while thinking about this before realizing the problem had already been solved ([1], [2]). I'm still quite happy with the approach I came up with, and I think it's sufficiently interesting to write about here anyway.
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE StandaloneDeriving #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE UndecidableInstances #-}
module Diff where
import Data.Function (on)
import Data.List (maximumBy)
Longest Common Subsequences and Largest Common Substructures
The diff
function can be specified as follows: for lists xs
and ys
, find the shortest edit sequence taking xs
to ys
. We can reduce the problem to finding the longest common subsequence of xs
and ys
: to obtain the least common subsequence, first remove a subset of elements from xs
, and then to obtain ys
, insert a subset of elements of ys
into the least common subsequence.
Ignoring optimization for the time being, let's take this as our starting point. We'll generalize the longest common subsequence to the largest common substructure.
First, let's examine the inductive definition of all subsequences of a list:
subsequences :: [a] -> [[a]]
subsequences [] = [[]]
subsequences (x:xs) = [ xs'' | xs' <- subsequences xs, xs'' <- [ xs', x:xs' ] ]
In words: if the input list has no elements, return just the input, otherwise, for each subsequence of the tail, yield two lists: the first including the head of the input, and the second excluding it.
Let's recast that definition in the more generic language of containers:
If the input has no subcontainers, return just the input, otherwise, for each substructure of each subcontainer, yield two substructures: the first including the structure of the input, and the second including only the substructure.
Let's define a type class:
newtype Rec f = Rec { runRec :: f (Rec f) }
class Container f where
data Context f :: * -> *
children :: f a -> [(a, Context f a)]
plugIn :: a -> Context f a -> f a
childAt :: f a -> Context f a -> a
The type Rec f
is the usual type of recursive data structures of shape f
. The Container
class contains a few methods which deserve explanation.
The associated data type Context f
is the type of one-hole contexts of f
[3]. The children
function takes an input of type f a
and returns an array of the contained a
's along with their one-hole contexts. The plugIn
method takes an a
and a context, and plugs the a
into the hole defined by context. Finally, the childAt
function takes an input of type f a
and a context, and returns the a
in the input at that context.
There are some fairly obvious laws which instances of the Container
class should satisfy:
childAt (plugIn a c) c = a
plugIn (childAt a c) c = a
(childAt a c) `elem` (children a)
Let's define the type of changes for structures of shape f
:
type Step f = Context f (Rec f)
data Change f = Skip (Step f) | Take (Step f) | Replace (Rec f) (Rec f)
type Path f = [Change f]
and derive some instances:
deriving instance (Eq (f (Rec f))) => Eq (Rec f)
deriving instance (Show (f (Rec f))) => Show (Rec f)
deriving instance (Show (Context f (Rec f)), Show (Rec f)) => Show (Change f)
With that, let's define the generalization of subsequences
:
type InContext f = (Rec f, Path f)
substructures :: (Container f) => Rec f -> [InContext f]
substructures (Rec x)
| null $ children x = [(Rec x, [])]
| otherwise =
[ substructure
| (x', ctx) <- children x,
(x'', ctxs) <- substructures x',
substructure <- [ (x'', Skip ctx:ctxs), (Rec $ plugIn x'' ctx, Take ctx:ctxs) ] ]
Here, the plugIn
and children
methods are used to generalize the inclusion/exclusion which took place when we calculated subsequences of a list.
We can now easily find the largest common substructure, and therefore the difference of two structures.
type ChangeSet f = (Path f, Path f)
common :: (Container f, Eq (Rec f)) => [InContext f] -> [InContext f] -> [ChangeSet f]
common xs ys = [ (p1, p2) | (s1, p1) <- xs, (s2, p2) <- ys, s1 == s2 ]
takes :: Path f -> Path f
takes = filter take where
take (Take _) = True
take _ = False
size :: ChangeSet f -> Int
size (xs, ys) = length (takes xs) + length (takes ys)
largest :: [ChangeSet f] -> ChangeSet f
largest = maximumBy (compare `on` size)
diff :: (Container f, Eq (Rec f)) => Rec f -> Rec f -> ChangeSet f
diff old new = largest $ (common `on` substructures) old new
Example 1 - Lists
Let's verify the code on the simple case of lists.
data ListF a x = Nil | Cons a x deriving (Show, Eq)
type List a = Rec (ListF a)
instance Container (ListF a) where
data Context (ListF a) x = ListContext a deriving (Show, Eq)
children Nil = []
children (Cons a x) = [(x, ListContext a)]
plugIn x (ListContext a) = Cons a x
childAt (Cons _ x) (ListContext _) = x
nil :: List a
nil = Rec Nil
cons :: a -> List a -> List a
cons a x = Rec $ Cons a x
fromList :: [a] -> List a
fromList = foldr cons nil
In GHCi:
> diff (fromList [1,2,3]) (fromList [2,3,4])
([Skip (ListContext 1), Take (ListContext 2), Take (ListContext 3)],
[Take (ListContext 2), Take (ListContext 3), Skip (ListContext 4)])
The result is as expected: from the first list we skip the first element and take the rest, and from the second list we skip the last element and take the rest.
Example 2 : Lambda Terms
Now here's a more exciting example - the type of untyped lambda terms:
data ExprF x = App x x | Abst String x | Var String deriving (Show, Eq)
type Expr a = Rec ExprF
instance Container ExprF where
data Context ExprF x = AppContext (Either x x) | AbstContext String deriving (Show, Eq)
children (App f x) = [(f, AppContext $ Right x), (x, AppContext $ Left f)]
children (Abst i x) = [(x, AbstContext i)]
children (Var _) = []
plugIn x (AppContext (Left f)) = App f x
plugIn f (AppContext (Right x)) = App f x
plugIn x (AbstContext i) = Abst i x
childAt (App _ x) (AppContext (Left _)) = x
childAt (App f _) (AppContext (Right _)) = f
childAt (Abst _ x) (AbstContext _) = x
abst :: Int -> Expr -> Expr
abst i x = Rec $ Abst i x
app :: Expr -> Expr -> Expr
app f x = Rec $ App f x
var :: Int -> Expr
var i = Rec $ Var i
k :: Expr
k = abst "x" $ abst "y" $ var "x"
s :: Expr
s = abst "x" $ abst "y" $ abst "z" $ app (app (var "x") (var "z")) (app (var "y") (var "z"))
And again, we can test in GHCi:
diff s k
> ([Take (AbstContext "x"), Take (AbstContext "y"),
Skip (AbstContext "z"), Skip (AppContext (Right (Rec {runRec = App (Rec { runRec = Var "y" }) ... }))),
Skip (AppContext (Right (Rec { runRec = Var "z" })))],
[Take (AbstContext "x"), Take (AbstContext "y")])
The diff algorithm has identified k
as the largest common substructure, and the path through s
which picks out this substructure discards the third abstraction over the variable z
and the corresponding applications.
Applying Patches
We can also define a function patch
which applies the result of diff
to a structure:
patch :: (Container f, Eq (Rec f)) => Rec f -> ChangeSet f -> Rec f
patch old (inserts, deletes) = foldr wrap (unwrap old inserts) deletes where
wrap (Skip _) x = x
wrap (Take ctx) x = Rec $ plugIn x ctx
wrap (Replace x y) _ = y
unwrap x [] = x
unwrap x (Skip ctx:cs) = Rec $ plugIn (unwrap (childAt (runRec x) ctx) cs) ctx
unwrap x (Take ctx:cs) = unwrap (childAt (runRec x) ctx) cs
unwrap _ (Replace x y:cs) = unwrap x cs
The functions plugIn
and childAt
play a key role, since we need to be able to glue structures together and take them apart again based on the content of the changeset.
We can verify using the examples above (and others) that diff
and patch
satisfy the required laws patch x $ diff x y = y
and diff x $ patch x p = p
.
Optimization
The function diff
above does its job, but has exponential complexity in the size of its input, due to the need to generate all substructures.
The key observation is that if two substructures are unequal, then two larger structures can never be equal, so we can prune large subtrees from the search tree by testing equality of contexts.
Here is the optimized version of diff
, which is observably much faster than the original:
diff2 :: (Container f, Eq (Context f (Rec f))) => Rec f -> Rec f -> ChangeSet f
diff2 old new
| null $ children $ runRec old = ([], [Replace old new])
| null $ children $ runRec new = ([Replace old new], [])
| otherwise = let matches = [ (x', y', ctx1)
| (x', ctx1) <- children $ runRec old
, (y', ctx2) <- children $ runRec new
, ctx1 == ctx2 ] in
if null matches then
largest $ [ let (xs, ys) = diff2 x' new in (Skip ctx:xs, ys)
| (x',ctx) <- children $ runRec old ] ++
[ let (xs, ys) = diff2 old y' in (xs, Skip ctx:ys)
| (y',ctx) <- children $ runRec new ]
else
largest [ let (xs, ys) = diff2 x' y' in (Take ctx:xs, Take ctx:ys)
| (x', y', ctx) <- matches ]
Notice that the new constraints only require equality of contexts, not of structures.
Compare this implementation with the dynamic programming implementation of the longest common subsequence on lists:
diffList :: (Eq a) => [a] -> [a] -> [a]
diffList [] new = []
diffList old [] = []
diffList old@(x:xs) new@(y:ys)
| x == y = (x:diffList xs ys)
| otherwise = maximumBy (compare `on` length) [ diff old ys, diff xs new ]
Hopefully the structural similarity of the two algorithms is clear.
Conclusion
One-hole contexts give a pleasant generalization of diff
and patch
to containers.
There are still some issues remaining with this implementation, such as the lack of memoization and the incorrect handling of containers with multiple non-recursive constructors, or structures with no common substructures.
It would also be interesting to explore the extension to mutually recursive types, such as the types of statements and expressions in an abstract syntax tree.
References
- Package Data.Generic.Diff on Hackage
- Generic type-safe diff and patch for families of datatypes by Eelco Lempsick, 2009
- The Derivative of a Regular Type is its Type of One-Hole Contexts by Conor McBride