PEG: Parsing Expression Grammar
by Jon Rafkind (rafkind at cs dot utah dot edu)
This package provides a syntactic form for writing a BNF which will act as a parsing expression grammar. The result is a packrat parser as described by Bryan Ford in Packrat Parsing: a Practical Linear-Time Algorithm with Backtracking.
1 Example: Reading a’s and b’s
In this short example a peg is defined that will parse a string which can contain any number of a’s and any number of b’s in any order. That is, it is equivalent to the regular expression (a|b)*.
Examples: | ||||||||
| ||||||||
> (parse parser "aabba") | ||||||||
(the-a the-a the-b the-b the-a) |
2 Provided functions and syntax
(peg (start s) (grammar (nonterminal ((pattern ) action) ) )) |
Defines a peg that will start parsing using the nonterminal s. Patterns and actions are described in the Patterns section. The result is a function of the following type
(peg-result input) → (any/c) |
input : (lambda/c) |
input :: (lambda (column) ...) = a function that accepts one argument: an integer representing the current position in the input stream. The peg will start parsing from column 0.
nonterminal can also take arguments. To do this use the following form
(nonterminal args ) |
So inside a grammar it would look like
(peg (start s) (grammar ((s a b) ((pattern ...) action))))
Where a and b are bound within pattern and action. Nonterminals defined in this way can only be used with the apply pattern.
(parse parser obj) → (any/c) |
parser : parser? |
obj : (or/c string? path?) |
Parse obj using parser which should be created using the peg syntax. If obj is a string then the parser will retrieve the data from the obj itself. If obj is a path then the parser will retrieve data from the file specified by obj.
(parse-file parser file) |
Parse file using parser which should be created using the peg syntax.
3 Patterns
pattern in the peg form can be one of the following
literal
Matches literal data, either a string or a char.
Examples:
(define parser
(peg
(start s)
(grammar (s (("ab") 'a)
((#\c) 'c)))))
> (parse parser "ab")
a
> (parse parser "c")
c
> (parse parser "d")
#f
_
Matches any single character.
Examples:
(define parser
(peg
(start s)
(grammar (s ((_) _)))))
> (parse parser "a")
#\a
> (parse parser "b")
#\b
> (parse parser "c")
#\c
id
Jump to nonterminal id.
Examples:
(define parser
(peg
(start s)
(grammar (s ((nt1) 'ok))
(nt1 ((nt2 nt3) _))
(nt2 (("a") _))
(nt3 (("b") _)))))
> (parse parser "ab")
ok
eof
Only recognizes the end of stream.
Examples:
(define parser
(peg
(start s)
(grammar (s (("a" eof) "complete")
(("a") "incomplete")))))
> (parse parser "aa")
"incomplete"
> (parse parser "a")
"complete"
(bind variable pattern )
Binds variable to the result of pattern. variable is bound in the scope of the rest of the patterns that come after the bind.
Examples:
(define parser
(peg
(start s)
(grammar (s (((bind c x) (bind d y))
(string-append c d)))
(x (("x") "a"))
(y (("y") "b")))))
> (parse parser "xy")
"ab"
(* pattern )
Will match pattern zero or more times until it doesn’t match anymore. The result is always a list.
Examples:
(define parser
(peg
(start s)
(grammar (s (("x" (* "a" "b")) _)
(("y" (* "a")) _)))))
> (parse parser "yaaaa")
("a" "a" "a" "a")
> (parse parser "y")
()
> (parse parser "xabab")
("b" "b")
(+ pattern )
Will match pattern one or more times until it doesn’t match anymore. The result is always a list or #f.
Examples:
(define parser
(peg
(start s)
(grammar (s (("f" (+ "a" "b")) _)))))
> (parse parser "f")
#f
> (parse parser "fabab")
("b" "b")
(? pattern)
Will match pattern zero or one times. The result is always a list, '() if pattern didn’t match and (list result-of-pattern) if it did.
Examples:
(define parser
(peg
(start s)
(grammar (s (("f" (? "a" "b") eof) 'ok)))))
> (parse parser "f")
ok
> (parse parser "fab")
ok
> (parse parser "fabab")
#f
(or pattern )
Will match the first pattern in pattern , trying each pattern in sequence if the previous patterns failed.
Examples:
(define parser
(peg
(start s)
(grammar (s (((or "a" "b" "c")) _)))))
> (parse parser "a")
"a"
> (parse parser "b")
"b"
> (parse parser "c")
"c"
(foreign peg nonterminal)
Use a different peg starting at nonterminal nonterminal.
Examples:
(define a-parser
(peg
(start s)
(grammar (s ((a-s) _))
(a-s (((+ "a")) _)))))
(define parser
(peg
(start s)
(grammar (s (((* "b") (foreign a-parser s)) _)))))
> (parse parser "bbbaa")
("a" "a")
> (parse parser "aa")
("a" "a")
> (parse parser "bb")
#f
(predicate code)
If code evaluates to anything besides #f then the parse will continue, otherwise if the result is #f the current pattern will fail. This does not consume any input.
Examples:
(define parser
(peg
(start s)
(grammar (s (((bind a (or "a" "b")) (predicate (string=? "a" a)))
a)))))
> (parse parser "a")
"a"
> (parse parser "b")
#f
(apply nonterminal args )
Pass args to nonterminal and then jump to that nonterminal.
Examples:
(define parser
(peg
(start s)
(grammar (s (((apply foos #\a) (apply foos #\b))
'ok))
((foos x) (((bind c _) (predicate (if (char=? c x) x #f)))
_)))))
> (parse parser "ab")
ok
> (parse parser "aa")
#f
> (parse parser "bb")
#f
(not pattern )
The parse will continue only if pattern fails. This does not consume any input.
Examples:
(define parser
(peg
(start s)
(grammar (s (((not "a") "b") 'no-a)
(("a" "b") 'yes-a)))))
> (parse parser "ab")
yes-a
> (parse parser "b")
no-a
(ensure pattern )
The parse will continue only if pattern succeeds. In other words, exactly the opposite of not. This does not consume any input.
Examples:
(define parser
(peg
(start s)
(grammar (s (("f" "b" (ensure "a") "a") 'yes-a)
(("f" "b") 'no-a)))))
> (parse parser "fba")
yes-a
> (parse parser "fb")
no-a
(except pattern )
Matches one character only if pattern does not succeed. Only consumes the one character if the pattern succeeds.
Examples:
(define parser
(peg
(start s)
(grammar (s (((except "french")) _)))))
eval:152:0: except: bad syntax in: (except "french")
> (parse parser "french")
#f
> (parse parser "b")
#f
4 Actions
action can be an arbitrary piece of code or the special _ syntax. When _ is used the result from the last pattern in the production is used as the return value.
Examples: | |||||||||||
| |||||||||||
> (parse (parser) "abc") | |||||||||||
3 | |||||||||||
> (parse (parser) "cba") | |||||||||||
1 | |||||||||||
> (parse (parser) "acb") | |||||||||||
happy-birthday |
5 Parsing
The PEG parser will start at the nonterminal given by (start s). Each production of a nonterminal will be tried in sequence until one of the productions succeeds or all the productions fail. The result of parsing with a nonterminal is either #f or an arbitrary datum as the result of action.
The result of a nonterminal is stored in a table for future lookups so that if the same nonterminal is used to parse the input stream at the same position the result that was computed for the first time will be returned. This situation occurs frequently in a PEG parser. Consider the following nonterminal.
(foo (("hello" " " world " " "bob") "bob's world") |
(("hello" " " world " " "george") "george's world")) |
(world (("world") "world")) |
If this set of productions is tried on the input string "hello world george", the first production of foo will not match because "bob" does not match "george" in the input stream. world already matched, however, so in the second production of foo, world will not have to be recomputed because at position 6 it is known to parse correctly and return the result "world".
6 Best practices
Generating a new parser for each object you would like to parse reduces unused memory. The PEG memoizes intermediate results which can go unused if the same object is not parsed twice. Consider the following.
(define my-parser (peg ...))
(let loop ()
(parse my-parser "blah blah blah")
(loop))
Will slowly use up all available memory. This table summarizes the memory usage of this program over a few seconds.
"mzscheme 15m""\n"
"mzscheme 28m""\n"
"mzscheme 34m""\n"
"mzscheme 47m""\n"
"mzscheme 58m"
However if the parser is created when it is needed memory stays constant.
(define (my-parser) (peg ...))
(let loop ()
(parse (my-parser) "blah blah blah")
(loop))
"mzscheme 15m""\n"
"mzscheme 15m""\n"
"mzscheme 15m""\n"
"mzscheme 15m"
Use not and ensure to force the parser to choose one production over another. In the following example we see a case where the parser will match one nonterminal when it should have matched another but by that time it is too late. We can use not to be sure that the parser won’t match a production that matches similar text as the proper nonterminal.
Examples:
(define (parser)
(peg
(start s)
(grammar (s (((bind s stmt) eof) s))
(stmt (((bind f function) (not "(")) 'no-args)
(((bind f function) "(" (* arg (? ",")) ")") 'args))
(function (("x") _))
(arg (("y") _)))))
> (parse (parser) "x")
no-args
> (parse (parser) "x(y,y)")
args