4 Simple Text Parser
(require (planet orseau/lazy-doc:1:2/simple-parser))
This module provides a simple text parser that can read strings and turn them into data without first building lexemes (although it can be used to either lex or parse).
More complex or faster parsers may require the use of the parser-tools library integrated in Scheme.
A parser is given a list of matcher procedures and their associated action procedures. A matcher is generally a regexp; its associated action turns the matched text into something else. On the input string, the parser recursively looks for the matcher that matches at the earliest character and applies its action. no-match-proc is applied to the portion of the string that precedes the first matched character, i.e. the portion that no matcher has matched.
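The earliest-match loop described above can be sketched with plain regexps. This is only an illustration of the idea, not the library's implementation: toy-parse and earliest-match are hypothetical names, rules are simple (regexp . action) pairs, phases are ignored, results are joined with string-append, and patterns are assumed never to match the empty string.

```racket
#lang racket/base

;; Hypothetical helper: among all rules, find the match that starts earliest
;; at or after `start`. Returns (list match-start match-end action) or #f.
;; In this sketch, ties go to the rule listed first.
(define (earliest-match rules text start)
  (for/fold ([best #f]) ([rule (in-list rules)])
    (let ([m (regexp-match-positions (car rule) text start)])
      (cond
        [(not m) best]
        [(or (not best) (< (caar m) (car best)))
         (list (caar m) (cdar m) (cdr rule))]
        [else best]))))

;; Hypothetical toy parser: applies the action to each matched portion and
;; no-match-proc to the text in between, then appends all the pieces.
(define (toy-parse rules no-match-proc text)
  (let loop ([start 0] [acc '()])
    (let ([best (earliest-match rules text start)])
      (if (not best)
          ;; nothing matches any more: the rest of the text is unmatched
          (apply string-append
                 (reverse (cons (no-match-proc (substring text start)) acc)))
          (let ([unmatched (substring text start (car best))]
                [matched (substring text (car best) (cadr best))])
            (loop (cadr best)
                  (list* ((caddr best) matched)
                         (no-match-proc unmatched)
                         acc)))))))

(define rules
  (list (cons #px"[0-9]+" (lambda (s) (string-append "<" s ">")))
        (cons #px"[a-z]+" string-upcase)))

(toy-parse rules values "ab 12,cd")   ; => "AB <12>,CD"
```

Here values plays the role of an identity no-match-proc, so the separators " " and "," pass through unchanged.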
The parser has an internal state, the "phase", where it is possible to define local parsers that only work when the parser is in that phase. Actions can make the parser switch to a given phase. Automata transitions can then easily be defined.
Instead of switching to another phase, it is also possible to set the parser into a "sub-parser" mode, and to provide the sub-parser with a callback that will be applied only once the sub-parser has returned.
The fastest and easiest way to understand how it works is probably to look at the examples in the "examples" directory. Some simple examples are also given at the end of this page. See also the "defs-parser.ss" source file for a more complex usage.
4.1 Priorities
4.2 Main Functions
no-match-proc : procedure? = identity
phase : any/c = 'start
appender : procedure? = string-append
(add-item parser phase? in out) → void?
parser : parser?
phase? : any/c
in : (or/c #t procedure? list? symbol? string?)
out : (or/c procedure? symbol? string?)
If phase? is a procedure, it will be used as is to match the parser’s phase. If phase? equals #t, it will be changed to (λ args #t) so that it matches any phase. Any other value of phase? will be turned into a procedure that matches this value with equal?.
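The three rules for phase? can be summarized by a small normalization function. This is a minimal sketch of the described behavior; phase->predicate is a hypothetical name and is not claimed to exist in the module.

```racket
#lang racket/base

;; Hypothetical helper: turn a phase? argument into a predicate on phases.
(define (phase->predicate phase?)
  (cond
    [(procedure? phase?) phase?]                   ; used as is
    [(eq? phase? #t) (lambda args #t)]             ; matches any phase
    [else (lambda (p) (equal? p phase?))]))        ; compared with equal?

((phase->predicate #t) 'start)          ; => #t
((phase->predicate 'start) 'start)      ; => #t
((phase->predicate 'start) 'comment)    ; => #f
((phase->predicate symbol?) 'comment)   ; => #t
```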
If in is a string, it will be turned into a procedure that matches the corresponding pregexp. If in is a symbol, it will be turned into a procedure that matches the corresponding pregexp with word boundaries on both sides (useful for matching names or programming language keywords). If in is a list, then add-item is called recursively on each member of in with the same parser, phase? and out. If in equals #t, it will modify the no-match-proc procedure to add the corresponding action when phase? applies to the parser. In the end, in must return the same kind of values as regexp-match-positions.
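For reference, this is the kind of value a matcher is expected to produce, shown with the standard regexp-match-positions. The word-boundary pattern for a symbol is an assumption about how the conversion roughly behaves, not the module's actual code; kw-define is a hypothetical name.

```racket
#lang racket/base

;; regexp-match-positions returns #f, or a list of (start . end) pairs:
;; one for the whole match, then one per parenthesized group.
(regexp-match-positions (pregexp "b+") "aabbbcc")
;; => '((2 . 5))

;; Assumption: a symbol such as 'define behaves roughly like this pregexp,
;; with a word boundary (\b) on both sides.
(define kw-define (pregexp "\\bdefine\\b"))

(regexp-match-positions kw-define "(define x 1)")    ; => '((1 . 7))
(regexp-match-positions kw-define "(redefined x)")   ; => #f
```

The second call returns #f because "define" inside "redefined" is not delimited by word boundaries.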
out must be a procedure that accepts the same number of arguments as the number of values returned by the matcher in. For example, if in is "aa(b+)c(d+)e", then out must take 3 arguments (one for the whole string, and two for the b’s and the d’s). If out is not a procedure, it will be turned into a procedure that accepts any number of arguments and returns out.
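The arity rule can be checked with the standard regexp-match. In this sketch, out and constant-action are hypothetical names, and it is assumed that the action receives the matched substrings.

```racket
#lang racket/base

;; Pattern from the text: one whole match plus two groups, so three arguments.
(define rx (pregexp "aa(b+)c(d+)e"))

;; Hypothetical action: receives the whole match, the b's and the d's.
(define (out whole bs ds)
  (list (string-length bs) (string-length ds)))

(define m (regexp-match rx "xxaabbbcddeyy"))
m               ; => '("aabbbcdde" "bbb" "dd")
(apply out m)   ; => '(3 2)

;; A non-procedure out behaves like a procedure that ignores its arguments.
(define (constant-action v) (lambda args v))
(apply (constant-action "KEYWORD") m)   ; => "KEYWORD"
```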
(add-items parser [phase? [search-proc output-proc] ...] ...)
(parse-text parser [#:phase phase] text ...) → (listof any/c)
parser : parser?
phase : any/c = ((parser-phase parser))
text : string?
It is thus possible to call the parser inside the parsing phase, i.e., once a portion of the text has been parsed, it can be given to the parser itself in some phase to make further transformations. This is not the same as sub-parsing because there is no callback.
4.3 Matchers
(re s) → procedure?
s : string?
(txt s) → procedure?
s : string?
(kw s) → procedure?
s : string?
4.4 Actions
(switch-phase phase) → string?
phase : any/c
new-phase : any/c
callback : procedure? = identity
appender : procedure? = (parser-appender (current-parser))
Sub-parsers can be called recursively, either from within a sub-parsing mode or from the callback.
Returns "".
(sub-parse-return [out]) → any
out : any/c = #f
4.5 Examples
Examples:
"YaïCaïDaï -glitch- CaïDaï -gloutch- \nTaïPaïCHaï"
(tree: (root (node1 (leaf1 leaf2) leaf3) (node2 leaf4 (node3 leaf5) leaf6) leaf7))
Note that the result of the last example is Scheme data, not a string.