In this post, I'm going to implement a simple markup language including a parser and type checker. The language's syntax will be based on Haskell's syntax, and the type system will look a little bit like System F.
Why am I interested in implementing such a markup language? Well, markup languages like JSON are good at describing the values in a dataset, but not their types, and XML extends this by allowing you to specify a schema, which corresponds to a set of typing rules. Languages like XAML allow you to extend this idea even futher by using an actual type system as a sort of global schema, and then extending the expressiveness of the language with things like subclass polymorphism.
In my opinion, XAML is great for quickly bootstrapping a DSL without having to worry about parsing, but the lack of generics makes it impossible to encode some simple constraints in the schema. I often find myself wanting a markup language which looks like Haskell and which uses similar typing rules.
The simplicity of the language below allows for a very simple implementation: typing is a simple bottom-up pass over a term.
With that, let's include some modules:
module Markup where
import Text.Parsec
import qualified Text.Parsec.Token as P
import qualified Text.Parsec.Language as L
import qualified Text.Parsec.Expr as E
import Data.Char
import Data.List (groupBy, nub, (\\))
import Data.Function (on)
import Control.Arrow ((***))
import Control.Applicative ((<$>), (<*>))
import Control.Monad (guard, zipWithM)
import qualified Control.Monad.State as S
Language Definition and Examples
I would like separate languages for type declarations and values. Here are some examples of the sorts of things I would like to be able to write (note, not all of these are expected to pass the typechecker):
exampleSchema = unlines
[ "Nil :: List Z a;"
, "Singleton :: a -> List (S Z) a;"
, "Cons :: a -> List n a -> List (S n) a;"
, "Reverse :: List n a -> List n a;"
, "Zip :: List n a -> List n b -> List n (Pair a b);"
, "Tip :: Tree Z a;"
, "Branch :: Tree n a -> a -> Tree n a -> Tree (S n) a" ]
example1 = "x = Zip (Cons 1 $ Cons 2 Nil) (Cons 1 Nil)"
example2 = "x = Zip (Cons 1 $ Reverse $ Cons 2 $ Cons 3 Nil) (Cons \"a\" $ Cons \"b\" $ Singleton \"c\")"
example3 = "x = Branch (Branch Tip 1 Tip) 2 (Branch Tip 3 Tip)"
example4 = "x = Branch (Branch Tip 1 Tip) 2 Tip"
example5 = "x = Cons (Branch Tip 1 Tip) Nil"
example6 = "x = 1 `Cons` (2 `Cons` (3 `Cons` Nil))"
example7 = "x = 1 `Cons` Nil `Zip` (2 `Cons` Nil)"
example8 = "x = Nil `Cons` Nil `Zip` ((Cons 1 $ Nil) `Cons` Nil)"
example9 = unlines
[ "x = Nil;"
, "y = Cons x Nil;"
, "z = Cons y Nil" ]
Notice that the type definition language supports things like type constructors and phantom types to allow me to encode simple constraints in the type of a declaration. For example, I can enforce the fact that the Zip constructor takes two lists of the same length. Also, there are no higher order functions, since there are no user-defined functions other than constructors, and no data declarations: all types are open for extension.
The value language copies several ideas from Haskell: infix constructors using backticks, and application using both juxtaposition and $
. Other than that, the syntax is very simple.
Here are the data types which represent terms and types, along with their pretty printing functions:
data Tm
= TmName String
| TmAtom String
| TmStr String
| TmInt Integer
| TmApp Tm Tm deriving (Show, Eq)
prettyTm :: Tm -> String
prettyTm (TmName s) = s
prettyTm (TmAtom s) = s
prettyTm (TmStr s) = show s
prettyTm (TmInt n) = show n
prettyTm (TmApp t1 t2@(TmApp _ _)) = prettyTm t1 ++ " (" ++ prettyTm t2 ++ ")"
prettyTm (TmApp t1 t2) = prettyTm t1 ++ " " ++ prettyTm t2
data Ty
= TyCon String
| TyApp Ty Ty
| TyArr Ty Ty
| TyVar String deriving (Show, Eq)
prettyTy :: Ty -> String
prettyTy (TyCon s) = s
prettyTy (TyVar v) = v
prettyTy (TyApp t1 t2@(TyCon _)) = prettyTy t1 ++ " " ++ prettyTy t2
prettyTy (TyApp t1 t2) = prettyTy t1 ++ " (" ++ prettyTy t2 ++ ")"
prettyTy (TyArr t1 t2) = prettyTy t1 ++ " -> (" ++ prettyTy t2 ++ ")"
Parsing Types and Values Using Parsec
Now let's implement a parser. To keep things simple, I'm going to use the lexer generator from Text.Parsec.Token
.
languageDef :: P.LanguageDef st
languageDef = L.haskellStyle
{ P.opStart = fail "Operators are not supported"
, P.opLetter = fail "Operators are not supported"
, P.reservedNames = []
, P.reservedOpNames = []
, P.caseSensitive = True }
lexer = P.makeTokenParser languageDef
identifier = P.identifier lexer
whiteSpace = P.whiteSpace lexer
stringLiteral = P.stringLiteral lexer
integer = P.integer lexer
symbol = P.symbol lexer
lexeme = P.lexeme lexer
parens = P.parens lexer
semi = P.semi lexer
identu :: Parsec String u String
identu = lookAhead upper >> identifier
identl :: Parsec String u String
identl = lookAhead lower >> identifier
There are parsers for terms:
tmName :: Parsec String u Tm
tmName = lexeme $ identl >>= return . TmName
tmAtom :: Parsec String u Tm
tmAtom = lexeme $ identu >>= return . TmAtom
tmStr :: Parsec String u Tm
tmStr = lexeme $ stringLiteral >>= return . TmStr
tmInt :: Parsec String u Tm
tmInt = lexeme $ integer >>= return . TmInt
tmBracket :: Parsec String u Tm
tmBracket = lexeme $ parens tm
tm :: Parsec String u Tm
tm = E.buildExpressionParser
[ [ E.Infix (return TmApp) E.AssocLeft ]
, [ E.Infix (((.) TmApp . TmApp) <$> between (symbol "`") (symbol "`") tmAtom) E.AssocLeft ]
, [ E.Infix (symbol "$" >> return TmApp) E.AssocRight ] ]
(tmBracket <|> tmStr <|> tmInt <|> try tmName <|> tmAtom)
and parsers for types:
tyVar :: Parsec String u Ty
tyVar = lexeme $ identl >>= return . TyVar
tyCon :: Parsec String u Ty
tyCon = lexeme $ identu >>= return . TyCon
tyBracket :: Parsec String u Ty
tyBracket = parens ty
ty :: Parsec String u Ty
ty = E.buildExpressionParser
[ [ E.Infix (return TyApp) E.AssocLeft ]
, [ E.Infix (symbol "->" >> return TyArr) E.AssocRight ] ]
(tyBracket <|> try tyVar <|> tyCon)
The parsers tm
and ty
are then used to construct top-level parsers schema
for type declarations and doc
for value declarations:
valueDecl :: Parsec String u (String, Tm)
valueDecl = do
name <- identl
symbol "="
t <- tm
return (name, t)
doc :: Parsec String u [(String, Tm)]
doc = do
whiteSpace
ds <- sepBy valueDecl (lexeme semi)
eof
return ds
typeDecl :: Parsec String u (String, Ty)
typeDecl = do
name <- identu
symbol "::"
t <- ty
return (name, t)
schema :: Parsec String u [(String, Ty)]
schema = do
whiteSpace
ds <- sepBy typeDecl (lexeme semi)
eof
return ds
There is quite a bit going on here, but for the most part, the parser is assembled from reusable pieces in Text.Parsec.Token
and some utility functions such as sepBy
. The parser serves as a fairly good definition of the language syntax: the tm
and ty
parsers are just constructed using buildExpressionParser
from the precedence tables for the term and type constructors respectively.
Implementing a Typechecker
The next step is to write the typechecker. This is actually quite simple since there are no function types other than constructors, and all quantification over free type variables implicitly happens at the top level.
The first step is to implement a unification function. This is very easy: types are either equal modulo variable substitution, or they do not unify at all:
unify :: Ty -> Ty -> Either String [(String, Ty)]
unify (TyVar s) (TyVar t) = return []
unify (TyVar s) t = return [(s, t)]
unify t (TyVar s) = return [(s, t)]
unify (TyCon ct1) (TyCon ct2)
| ct1 == ct2 = return []
unify (TyApp ty1 ty2) (TyApp ty3 ty4) = (++) <$> unify ty1 ty3 <*> unify ty2 ty4
unify (TyArr ty1 ty2) (TyArr ty3 ty4) = (++) <$> unify ty1 ty3 <*> unify ty2 ty4
unify t1 t2 = Left $ "Cannot unify " ++ prettyTy t1 ++ " with " ++ prettyTy t2
Type variable substitution is also a simple pattern match:
subst :: [(String, Ty)] -> Ty -> Ty
subst vs v@(TyVar s) = maybe v id $ lookup s vs
subst vs (TyApp t1 t2) = TyApp (subst vs t1) (subst vs t2)
subst vs (TyArr t1 t2) = TyArr (subst vs t1) (subst vs t2)
subst _ t = t
One important function which is used in the typechecker is specialize
, which is used to apply a polymorphic function to an argument. The idea is that specialize f x
should return the most general type of f x
. The helper function isFunction
is used to ensure that each type variable was unified with at most one type.
isFunction :: (Eq dom, Eq cod) => [(dom, cod)] -> Bool
isFunction = all ((== 1) . length . nub . map snd) . groupBy ((==) `on` fst)
specialize :: Ty -> Ty -> Either String Ty
specialize (TyArr t1 t2) t3 = do
vs <- unify t1 t3
guard $ isFunction vs
return $ subst vs t2
specialize t _ = Left $ "Expected a function, found " ++ prettyTy t
We need one more helper function, which is used to rename free variables in a polymorphic type. This is used to make sure type variables do not overlap when a function application gets typed.
The renameAll
function gives new names to all of the free type variables appearing in a type, using an association list to store the currently-generated names:
renameAll :: (Eq a) => a -> Ty -> S.State [((String, a), String)] Ty
renameAll _ t@(TyCon _) = return t
renameAll key (TyVar v) = do
m <- S.get
case lookup (v, key) m of
Nothing -> do
let name = unusedName (map snd m)
S.modify $ (:) ((v, key), name)
return $ TyVar name
Just name -> return $ TyVar name
renameAll key (TyApp t1 t2) = TyApp <$> renameAll key t1 <*> renameAll key t2
renameAll key (TyArr t1 t2) = TyArr <$> renameAll key t1 <*> renameAll key t2
unusedName :: [String] -> String
unusedName used = head $ map (("t" ++) . show) [0..] \\ used
We can now write the typechecker. It is a simple application of the specialize
and renameAll
helper functions:
typeOf :: [(String, Ty)] -> Tm -> Either String Ty
typeOf ctx (TmName s) = maybe (Left $ "Unknown name " ++ s) Right $ lookup s ctx
typeOf ctx (TmAtom s) = maybe (Left $ "Unknown constructor " ++ s) Right $ lookup s ctx
typeOf _ (TmInt _) = return (TyCon "Int")
typeOf _ (TmStr _) = return (TyCon "String")
typeOf ctx (TmApp t1 t2) = do
tyArg <- typeOf ctx t2
tyFun <- typeOf ctx t1
flip S.evalState [] $ do
tyFun' <- renameAll True tyFun
tyArg' <- renameAll False tyArg
return $ specialize tyFun' tyArg'
typesOf :: [(String, Ty)] -> [(String, Tm)] -> Either String [(String, Ty)]
typesOf _ [] = return []
typesOf ctx ((name, tm):tms) = do
ty <- typeOf ctx tm
let first = (name, ty)
rest <- typesOf (first:ctx) tms
return (first:rest)
Let's test the typechecker and parser on some of the examples at the top of the post:
parseErr p = either (Left . show) Right . parse p ""
test sch val = do
s <- parseErr schema sch
d <- parseErr doc val
tys <- typesOf s d
return $ map (id *** prettyTy) tys
ghci> test exampleSchema example1
Left "Cannot unify S Z with Z"
ghci> test exampleSchema example2
Right [("x","List (S (S (S Z))) (Pair Int String)")]
ghci> test exampleSchema example3
Right [("x","Tree (S (S Z)) Int")]
This is good - the ill-typed expressions returned a typing error, and the well-typed expressions returned the correct types, including the correctly inferred placeholders for phantom types.
There are a few other small additions to the language that I would like to make such as infix operators, kind checking, better error messages and support for other primitive types.
A Mini Language For Relational Queries
As a more useful example, I've put together a minimal DSL for partially typed relational queries. The type system is used to make sure pairs of columns equated in joins have the same type, and to track the tables used in a query. Try replacing some of the constructors to see the ways in which the type system is used!
relationSchema = unlines
[ "UserID :: Column User PK;"
, "UserName :: Column User String;"
, "UserCompanyID :: Column User PK;"
, "CompanyID :: Column Company PK;"
, "CompanyName :: Column Company String;"
, "Table :: HList (Column t) -> Query (Table t);"
, "Join :: Query r1 -> Query r2 -> Expr (Cross r1 r2) Bool -> Query (Cross r1 r2);"
, "Column :: Column t ty -> Expr (Table t) ty;"
, "Left :: Expr t1 ty -> Expr (Cross t1 t2) ty;"
, "Right :: Expr t2 ty -> Expr (Cross t1 t2) ty;"
, "Eq :: Expr t ty -> Expr t ty -> Expr t Bool;"
, "Nil :: HList f;"
, "Cons :: f t -> HList f -> HList f" ]
relationExample = unlines
[ "query = Join"
, " (Table (Cons UserName $ Cons UserCompanyID Nil))"
, " (Table (Cons CompanyID $ Cons CompanyName Nil))"
, " ((Left $ Column UserCompanyID) `Eq` (Right $ Column CompanyID))" ]
ghci> test relationSchema relationExample
Right [("query","Query (Cross (Table User) (Table Company))")]
Implementing a Kind Checker
Ensuring terms are well-typed is pretty useless if the type language is itself typeless, so let's implement a kind checker for type definitions in a schema.
The kind checker below is a modified version of a type checker from an earlier post. It works by generating kind constraints by performing a bottom-up traversal of a type, and then solving tose constraints by substitution.
Here is the type of kinds, including placeholders for unknown kinds:
type Unknown = Int
data Kind
= KindUnknown Unknown
| KindStar
| KindArr Kind Kind deriving (Show, Eq)
prettyKind :: Kind -> String
prettyKind (KindUnknown n) = "k" ++ show n
prettyKind KindStar = "*"
prettyKind (KindArr k1@(KindArr _ _) k2) = "(" ++ prettyKind k1 ++ ") -> " ++ prettyKind k2
prettyKind (KindArr k1 k2) = prettyKind k1 ++ " -> " ++ prettyKind k2
I'll start as usual with some basic utility functions which will be used during kind checking. Two crucial functions are the replacement of an unknown with another kind, and an occurs-check function:
replace :: Unknown -> Kind -> Kind -> Kind
replace i x u@(KindUnknown j)
| i == j = x
| otherwise = u
replace _ _ KindStar = KindStar
replace i x (KindArr k1 k2) = KindArr (replace i x k1) (replace i x k2)
occursCheck :: Unknown -> Kind -> Either String ()
occursCheck i (KindUnknown j) | i == j = return ()
occursCheck i x = occursCheck' i x where
occursCheck' i (KindUnknown j) | i == j = fail "Occurs check failed during kind checking"
occursCheck' i (KindArr k1 k2) = occursCheck' i k1 >> occursCheck' i k2
occursCheck' _ _ = return ()
Constraint generation happens by unification. The unify
function takes two kinds and generates the constraints imposed by their equality, or returns a unification error.
type KindConstraint = (Unknown, Kind)
type SolutionSet = Unknown -> Kind
unifyKinds :: Kind -> Kind -> Either String [KindConstraint]
unifyKinds (KindUnknown i) x = return [(i, x)]
unifyKinds x (KindUnknown i) = return [(i, x)]
unifyKinds KindStar KindStar = return []
unifyKinds (KindArr k1 k2) (KindArr k3 k4) = (++) <$> unifyKinds k1 k3 <*> unifyKinds k2 k4
unifyKinds k1 k2 = fail $ "Cannot unify " ++ show k1 ++ " with " ++ show k2
The general constraint generation function works bottom-up over the structure of a type:
generateConstraints :: (Monad m, S.MonadState ([(String, Unknown)], Unknown) m) => Ty -> m [KindConstraint]
generateConstraints ty = do
(cs, n) <- generateConstraints' ty
return $ (n, KindStar) : cs
where
generateConstraints' :: (Monad m, S.MonadState ([(String, Unknown)], Unknown) m) => Ty -> m ([KindConstraint], Unknown)
generateConstraints' (TyVar v) = do
(m, _) <- S.get
case lookup v m of
Nothing -> do
n <- newUnknown
S.modify $ (:) (v, n) *** id
return ([], n)
Just unk -> return ([], unk)
generateConstraints' (TyCon s) = do
(m, _) <- S.get
case lookup s m of
Nothing -> do
n <- newUnknown
S.modify $ (:) (s, n) *** id
return ([], n)
Just unk -> return ([], unk)
generateConstraints' (TyApp ty1 ty2) = do
(c1, n1) <- generateConstraints' ty1
(c2, n2) <- generateConstraints' ty2
thisName <- newUnknown
let newConstraint = (n1, KindArr (KindUnknown n2) (KindUnknown thisName))
return $ ((newConstraint:c1) ++ c2, thisName)
newUnknown :: (Monad m, S.MonadState (a, Unknown) m) => m Unknown
newUnknown = S.modify (id *** (+1)) >> S.get >>= return . snd
Constraints are solved by substitution using replace
:
solve :: [KindConstraint] -> Either String SolutionSet
solve cs = solve' cs KindUnknown where
solve' [] ss = return ss
solve' (c@(i, x):cs) ss = do
occursCheck i x
cs' <- substConstraints c cs
let ss' = substSolutionSet c ss
solve' cs' ss'
substConstraints c [] = return []
substConstraints (i, x) ((j, y): cs)
| i == j = (++) <$> substConstraints (i, x) cs <*> unifyKinds x y
| otherwise = (:) (j, replace i x y) <$> substConstraints (i, x) cs
substSolutionSet (i, x) = (.) (replace i x)
Now we can write the kind checker. The applications
function splits an arrow type into parts, so that kind checking can work on each part.
applications :: Ty -> [Ty]
applications t = applications' t [] where
applications' (TyArr t1 t2) = applications' t1 . applications' t2
applications' t = (:) t
kindCheck :: [Ty] -> Either String [(String, Kind)]
kindCheck = uncurry kindCheck'
. (solve *** id)
. (concat *** fst)
. flip S.runState ([], 0)
. flip forM generateConstraints
. concatMap applications
where
kindCheck' :: Either String SolutionSet -> [(String, Unknown)] -> Either String [(String, Kind)]
kindCheck' ss unks = do
ss' <- ss
return $ map (id *** ss') unks
Let's try it out.
ghci> parseErr schema exampleSchema >>= fmap (map (id *** prettyKind)) . kindCheck . map snd
Right [
("Tree", "k10 -> * -> *"),
("Pair", "* -> * -> *"),
("b", "*"),
("n", "k10"),
("S", "k10 -> k10"),
("a", "*"),
("Z", "k10"),
("List", "k10 -> * -> *") ]