HtmlPrag: Pragmatic Parsing and Emitting of HTML using SXML and SHTML
Version 0.13, 2005-02-23
by Neil W. Van Dyke
Copyright (C) 2003 - 2005 Neil W. Van Dyke. This program is Free
Software; you can redistribute it and/or modify it under the terms
of the GNU Lesser General Public License as published by the Free
Software Foundation; either version 2.1 of the License, or (at
your option) any later version. This program is distributed in
the hope that it will be useful, but without any warranty; without
even the implied warranty of merchantability or fitness for a
particular purpose. See the GNU Lesser General Public License
[LGPL] for details. For other license options and commercial
consulting, contact the author.
HtmlPrag provides permissive HTML parsing and emitting capability to
Scheme programs. The parser is useful for software agent extraction of
information from Web pages, for programmatically transforming HTML
files, and for implementing interactive Web browsers. HtmlPrag emits
"SHTML," which is an encoding of HTML in [SXML], so that conventional
HTML may be processed with XML tools such as [SXPath] and [SXML-Tools].
Like [SSAX-HTML], HtmlPrag provides a permissive tokenizer, but also
attempts to recover structure. HtmlPrag also includes procedures for
encoding SHTML in HTML syntax.
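As an illustration of the information-extraction use case, the following
is a minimal sketch that collects the `href' attribute values from an
SHTML tree. The procedure name `shtml-hrefs' is hypothetical and is not
part of HtmlPrag. The sketch assumes a reader that can read `@' as a
symbol, and operates on a literal SHTML datum so that it runs without
HtmlPrag loaded.

```scheme
;; Sketch: collect every href attribute value found in an SHTML tree.
;; `shtml-hrefs' is a hypothetical helper, not an HtmlPrag procedure.
(define (shtml-hrefs node)
  (cond ((not (pair? node)) '())
        ((eq? (car node) 'href)
         ;; An attribute such as (href "url"); an attribute with no
         ;; value, such as (compact), contributes nothing.
         (if (and (pair? (cdr node)) (string? (cadr node)))
             (list (cadr node))
             '()))
        (else
         ;; Visit every member of the list, including the (@ ...)
         ;; attribute list.  Symbols and strings yield the empty list.
         (apply append (map shtml-hrefs node)))))

(define example-shtml
  '(*TOP* (html (body (a (@ (href "url")) "link")
                      (p (a (@ (href "two")) "x"))))))

(shtml-hrefs example-shtml)
```

In practice the argument would be the result of `html->shtml', and a
tool such as [SXPath] may be a better fit for larger queries.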
The HtmlPrag parsing behavior is permissive in that it accepts
erroneous HTML, handling several classes of HTML syntax errors
gracefully, without yielding a parse error. This is crucial for
parsing arbitrary real-world Web pages, since many pages actually
contain syntax errors that would defeat a strict or validating parser.
HtmlPrag's handling of errors is intended to generally emulate popular
Web browsers' interpretation of the structure of erroneous HTML. We
euphemistically term this kind of parse "pragmatic."
HtmlPrag also has some support for [XHTML], although XML namespace
qualifiers [XML-Names] are currently accepted but stripped from the
resulting SHTML. Note that valid XHTML input is of course better
handled by a validating XML parser like [SSAX].
To receive notification of new versions of HtmlPrag, and to be
polled for input on changes to HtmlPrag being considered, ask the
author to add you to the `scheme-announce' moderated email list.
HtmlPrag requires R5RS, [SRFI-6], and [SRFI-23].
SHTML is a variant of [SXML], with two minor but useful extensions:
1. The SXML keyword symbols, such as `*TOP*', are defined to be in all
uppercase, regardless of the case-sensitivity of the reader of the
hosting Scheme implementation in any context. This avoids several
pitfalls.
2. Since not all character entity references used in HTML can be
converted to Scheme characters in all R5RS Scheme implementations,
nor represented in conventional text files or other common
external text formats to which one might wish to write SHTML,
SHTML adds a special `&' syntax for non-ASCII (or
non-Extended-ASCII) characters. The syntax is `(& VAL)', where
VAL is a symbol or string giving the symbolic name of the
character, or an integer with the numeric value of the character.
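The shape of the `(& VAL)' form can be sketched with a small
recognizer. The names `amp-entity?' and `amp-entity-kind' below are
hypothetical and are not part of HtmlPrag; they only illustrate the
three permitted kinds of VAL described above.

```scheme
;; Sketch: recognize the `(& VAL)' form, where VAL is a symbol, a
;; string, or an integer.  Hypothetical helpers, not HtmlPrag procedures.
(define (amp-entity? obj)
  (and (pair? obj)
       (eq? (car obj) '&)
       (pair? (cdr obj))
       (null? (cddr obj))
       (let ((val (cadr obj)))
         (or (symbol? val) (string? val) (integer? val)))))

;; Classify an entity as named or numeric, or yield #f for other objects.
(define (amp-entity-kind obj)
  (cond ((not (amp-entity? obj)) #f)
        ((integer? (cadr obj)) 'numeric)
        (else 'named)))
```

Programs using HtmlPrag itself would normally call `make-shtml-entity'
and `shtml-entity-value' rather than inspecting the list structure.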
> shtml-comment-symbol
> shtml-decl-symbol
> shtml-empty-symbol
> shtml-end-symbol
> shtml-entity-symbol
> shtml-pi-symbol
> shtml-start-symbol
> shtml-text-symbol
> shtml-top-symbol
These variables are bound to the following case-sensitive symbols
used in SHTML, respectively: `*COMMENT*', `*DECL*', `*EMPTY*',
`*END*', `*ENTITY*', `*PI*', `*START*', `*TEXT*', and `*TOP*'.
These can be used in lieu of the literal symbols in programs read
by a case-insensitive Scheme reader.(1)
> shtml-named-char-id
> shtml-numeric-char-id
These variables are bound to the SHTML entity public identifier
strings used in SHTML `*ENTITY*' named and numeric character entity
references.
> (make-shtml-entity val)
Yields an SHTML character entity reference for VAL. For example:
(make-shtml-entity "rArr") => (& rArr)
(make-shtml-entity (string->symbol "rArr")) => (& rArr)
(make-shtml-entity 151) => (& 151)
> (shtml-entity-value obj)
Yields the value for the SHTML entity OBJ, or `#f' if OBJ is not a
recognized entity. Values of named entities are symbols, and
values of numeric entities are numbers. An error may be raised if OBJ
is an entity with system ID inconsistent with its public ID. For
example:
(define (f s) (shtml-entity-value (cadr (html->shtml s))))
(f "&nbsp;") => nbsp
(f "&#2000;") => 2000
The tokenizer is used by the higher-level structural parser, but can
also be called directly for debugging purposes or unusual applications.
Some of the list structure of tokens, such as for start tag tokens, is
mutated and incorporated into the SHTML list structure emitted by the
parser.
> (make-html-tokenizer in normalized?)
Constructs an HTML tokenizer procedure on input port IN. If
boolean NORMALIZED? is true, then tokens will be in a format
conducive to use with a parser emitting normalized SXML. Each
call to the resulting procedure yields a successive token from the
input. When the tokens have been exhausted, the procedure returns
the null list. For example:
(define input (open-input-string "<a href=\"foo\">bar</a>"))
(define next (make-html-tokenizer input #f))
(next) => (a (@ (href "foo")))
(next) => "bar"
(next) => (*END* a)
(next) => ()
(next) => ()
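The calling convention shown above, in which the procedure is called
repeatedly until it yields the null list, can be sketched as a small
loop. Both `drain-tokens' and `make-list-tokenizer' are hypothetical
names; the latter is only a stand-in for a real tokenizer procedure so
that the sketch runs without HtmlPrag loaded.

```scheme
;; Sketch: call the tokenizer-style thunk NEXT until it yields the null
;; list, and return the tokens in order.  Hypothetical helper.
(define (drain-tokens next)
  (let loop ((tokens '()))
    (let ((token (next)))
      (if (null? token)
          (reverse tokens)
          (loop (cons token tokens))))))

;; Hypothetical stand-in for a tokenizer: yields the members of LST one
;; at a time, then the null list on every later call.
(define (make-list-tokenizer lst)
  (lambda ()
    (if (null? lst)
        '()
        (let ((token (car lst)))
          (set! lst (cdr lst))
          token))))

(drain-tokens (make-list-tokenizer '("bar" "baz")))
```

With HtmlPrag loaded, the procedure returned by `make-html-tokenizer'
could presumably be passed as NEXT in the same way.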
> (tokenize-html in normalized?)
Returns a list of tokens from input port IN, normalizing according
to boolean NORMALIZED?. This is probably most useful as a
debugging convenience. For example:
(tokenize-html (open-input-string "<a href=\"foo\">bar</a>") #f)
=> ((a (@ (href "foo"))) "bar" (*END* a))
> (shtml-token-kind token)
Returns a symbol indicating the kind of tokenizer TOKEN:
`*COMMENT*', `*DECL*', `*EMPTY*', `*END*', `*ENTITY*', `*PI*',
`*START*', `*TEXT*'. This is used by higher-level parsing code.
For example:
(map shtml-token-kind
(tokenize-html (open-input-string "<a<b>><c</</c") #f))
Most applications will call a parser procedure such as `html->shtml'
rather than calling the tokenizer directly.
> (parse-html/tokenizer tokenizer normalized?)
Emits a parse tree like `html->shtml' and related procedures,
except using TOKENIZER as a source of tokens, rather than
tokenizing from an input port. This procedure is used internally,
and generally should not be called directly.
> (html->sxml-0nf input)
> (html->sxml-1nf input)
> (html->sxml-2nf input)
> (html->sxml input)
> (html->shtml input)
Permissively parse HTML from INPUT, which is either an input port
or a string, and emit an SHTML equivalent or approximation. To
borrow and slightly modify an example from [SSAX-HTML]:
(html->shtml
"<html><head><title></title><title>whatever</title></head><body>
<a href=\"url\">link</a><p align=center><ul compact style=\"aa\">
<p>BLah<!-- comment <comment> --> <i> italic <b> bold <tt> ened</i>
still &lt; bold </b></body><P> But not done yet...")
=>
(*TOP* (html (head (title) (title "whatever"))
(body "\n"
(a (@ (href "url")) "link")
(p (@ (align "center"))
(ul (@ (compact) (style "aa")) "\n"))
(p "BLah"
(*COMMENT* " comment <comment> ")
" "
(i " italic " (b " bold " (tt " ened")))
"still < bold "))
(p " But not done yet...")))
Note that in the emitted SHTML the text token `"still < bold"' is
_not_ inside the `b' element, which represents an unfortunate
failure to emulate all the quirks-handling behavior of some
popular Web browsers.
The procedures `html->sxml-Nnf' for N 0 through 2 correspond to
0th through 2nd normal forms of SXML as specified in [SXML], and
indicate the minimal requirements of the emitted SXML.
`html->sxml' and `html->shtml' are currently aliases for
`html->sxml-0nf', and can be used in scripts and interactively,
when terseness is important and any normal form of SXML would
suffice.
Emitting HTML
Two procedures encode the SHTML representation as conventional HTML,
`write-shtml-as-html' and `shtml->html'. These are perhaps most useful
for emitting the result of parsed and transformed input HTML. They can
also be used for emitting HTML from generated or handwritten SHTML.
> (write-shtml-as-html shtml [out [foreign-filter]])
Writes a conventional HTML transliteration of the SHTML SHTML to
output port OUT. If OUT is not specified, the default is the
current output port. HTML elements of types that are always empty
are written using HTML4-compatible XHTML tag syntax.
If FOREIGN-FILTER is specified, it is a procedure of two arguments
that is applied to any non-SHTML ("foreign") object encountered in
SHTML, and should yield SHTML. The first argument is the object,
and the second argument is a boolean for whether or not the object
is part of an attribute value.
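A foreign filter might look like the following minimal sketch. The name
`simple-foreign-filter' is hypothetical, and the choice to render
numbers as strings and to drop everything else is an assumption made
for illustration, not HtmlPrag's default behavior.

```scheme
;; Sketch of a foreign filter: a procedure of two arguments that yields
;; SHTML (here, always a string).  Hypothetical, not part of HtmlPrag.
(define (simple-foreign-filter obj in-attribute-value?)
  ;; IN-ATTRIBUTE-VALUE? is ignored in this sketch; a real filter might
  ;; use it to restrict what it yields inside attribute values.
  (cond ((number? obj) (number->string obj))
        (else "")))

;; Intended use, assuming HtmlPrag is loaded:
;;   (write-shtml-as-html '(p "Total: " 42)
;;                        (current-output-port)
;;                        simple-foreign-filter)
```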
No inter-tag whitespace or line breaks are emitted unless explicit in SHTML.
emitted. The SHTML should normally include a newline at the end of
the document. For example:
(write-shtml-as-html
'((html (head (title "My Title"))
(body (@ (bgcolor "white"))
(h1 "My Heading")
(p "This is a paragraph.")
(p "This is another paragraph.")))))
-| <html><head><title>My Title</title></head><body bgcolor="whi
-| te"><h1>My Heading</h1><p>This is a paragraph.</p><p>This is
-| another paragraph.</p></body></html>
> (shtml->html shtml)
Yields an HTML encoding of SHTML SHTML as a string. For example:
(shtml->html
(html->shtml
"<P>This is<br<b<I>bold </foo>italic</ b > text.</p>"))
=> "<p>This is<br /><b><i>bold italic</i></b> text.</p>"
Note that, since this procedure constructs a string, it should
normally only be used when the HTML is relatively small. When
encoding HTML documents of conventional size and larger,
`write-shtml-as-html' is much more efficient.
As HtmlPrag evolves towards version 1.0, some identifiers have been
deprecated. In the equivalences below, the code on the left is
deprecated and should be replaced with the code on the right.
sxml->html == shtml->html
write-sxml-html == write-shtml-as-html
The HtmlPrag test suite can be enabled by editing the source code file
and loading [Testeez]; the test suite is disabled by default.
Version 0.13 -- 2005-02-23
HtmlPrag now requires `syntax-rules', and a reader that can read
`@' as a symbol. SHTML now has a special `&' element for
character entities, and it is emitted by the parser rather than
the old `*ENTITY*' kludge. `shtml-entity-value' supports both the
new and the old character entity representations.
`shtml-entity-value' now yields `#f' on invalid SHTML entity,
rather than raising an error. `write-shtml-as-html' now has a
third argument, `foreign-filter'. `write-shtml-as-html' now emits
SHTML `&' entity references. Changed `shtml-named-char-id' and
`shtml-numeric-char-id', as previously warned. Testeez is now
used for the test suite. Test procedure is now the internal
`%htmlprag:test'. Documentation changes. Notably, much
documentation about using HtmlPrag under various particular Scheme
implementations has been removed.
Version 0.12 -- 2004-07-12
Forward-slash in an unquoted attribute value is now considered a
value constituent rather than an unconsumed terminator of the
value (thanks to Maurice Davis for reporting and a suggested fix).
`xml:' is now preserved as a namespace qualifier (thanks to Peter
Barabas for reporting). Output port term of `write-shtml-as-html'
is now optional. Began documenting loading for particular
implementation-specific packagings.
Version 0.11 -- 2004-05-13
To reduce likely namespace collisions with SXML tools, and in
anticipation of a forthcoming set of new features, introduced the
concept of "SHTML," which will be elaborated upon in a future
version of HtmlPrag. Renamed `sxml-X-symbol' to `shtml-X-symbol',
`sxml-html-X' to `shtml-X', and `sxml-token-kind' to
`shtml-token-kind'. `html->shtml', `shtml->html', and
`write-shtml-as-html' have been added as names. Considered
deprecated but still defined (see the "Deprecated" section of this
documentation) are `sxml->html' and `write-sxml-html'. The
growing pains should now be all but over. Internally,
`htmlprag-internal:error' introduced for Bigloo portability. SISC
returned to the test list; thanks to Scott G. Miller for his
help. Fixed a new character `eq?' bug, thanks to SISC.
Version 0.10 -- 2004-05-11
All public identifiers have been renamed to drop the "`htmlprag:'"
prefix. The portability identifiers have been renamed to begin
with an `htmlprag-internal:' prefix, are now considered strictly
internal-use-only, and have otherwise been changed. `parse-html'
and `always-empty-html-elements' are no longer public.
`test-htmlprag' now tests `html->sxml' rather than `parse-html'.
SISC temporarily removed from the test list, until an open source
Java that works correctly is found.
Version 0.9 -- 2004-05-07
HTML encoding procedures added. Added
`htmlprag:sxml-html-entity-value'. Upper-case `X' in hexadecimal
character entities is now parsed, in addition to lower-case `x'.
Added `htmlprag:always-empty-html-elements'. Added additional
portability bindings. Added more test cases.
Version 0.8 -- 2004-04-27
Entity references (symbolic, decimal numeric, hexadecimal numeric)
are now parsed into `*ENTITY*' SXML. SXML symbols like `*TOP*'
are now always upper-case, regardless of the Scheme
implementation. Identifiers such as `htmlprag:sxml-top-symbol'
are bound to the upper-case symbols. Procedures
`htmlprag:html->sxml-0nf', `htmlprag:html->sxml-1nf', and
`htmlprag:html->sxml-2nf' have been added. `htmlprag:html->sxml'
now an alias for `htmlprag:html->sxml-0nf'. `htmlprag:parse' has
been refashioned as `htmlprag:parse-html' and should no longer be
called directly. A number of identifiers have been renamed to be more
appropriate when the `htmlprag:' prefix is dropped in some
implementation-specific packagings of HtmlPrag:
`htmlprag:make-tokenizer' to `htmlprag:make-html-tokenizer',
`htmlprag:parse/tokenizer' to `htmlprag:parse-html/tokenizer',
`htmlprag:html->token-list' to `htmlprag:tokenize-html',
`htmlprag:token-kind' to `htmlprag:sxml-token-kind', and
`htmlprag:test' to `htmlprag:test-htmlprag'. Verbatim elements
with empty-element tag syntax are handled correctly. New versions
of Bigloo and RScheme tested.
Version 0.7 -- 2004-03-10
Verbatim pair elements like `script' and `xmp' are now parsed
correctly. Two Scheme implementations have temporarily been
dropped from regression testing: Kawa, due to a Java bytecode
verifier error likely due to a Java installation problem on the
test machine; and SXM 1.1, due to hitting a limit on the number of
literals late in the test suite code. Tested newer versions of
Bigloo, Chicken, Gauche, Guile, MIT Scheme, PLT MzScheme, RScheme,
SISC, and STklos. RScheme no longer requires the "`(define
get-output-string close-output-port)'" workaround.
Version 0.6 -- 2003-07-03
Fixed uses of `eq?' in character comparisons, thanks to Scott G.
Miller. Added `htmlprag:html->normalized-sxml' and
`htmlprag:html->nonnormalized-sxml'. Started to add
`close-output-port' to uses of output strings, then reverted due to
bug in one of the supported dialects. Tested newer versions of
Bigloo, Gauche, PLT MzScheme, RScheme.
Version 0.5 -- 2003-02-26
Removed uses of `call-with-values'. Re-ordered top-level
definitions, for portability. Now tests under Kawa 1.6.99,
RScheme, Scheme 48 0.57, SISC 1.7.4, STklos 0.54, and SXM
Version 0.4 -- 2003-02-19
Apostrophe-quoted element attribute values are now handled. A bug
that incorrectly assumed left-to-right term evaluation order has
been fixed (thanks to MIT Scheme for confronting us with this).
Now also tests OK under Gauche 0.6.6 and MIT Scheme 7.7.1.
Portability improvement for implementations (e.g., RScheme, Stalin 0.9) that cannot read `@' as a symbol (although
those implementations tend to present other portability issues, as
yet unresolved).
Version 0.3 -- 2003-02-05
A test suite with 66 cases has been added, and necessary changes
have been made for the suite to pass on five popular Scheme
implementations. XML processing instructions are now parsed.
Parent constraints have been added for `colgroup', `tbody', and
`thead' elements. Erroneous input, including invalid hexadecimal
entity reference syntax and extraneous double quotes in element
tags, is now parsed better. `htmlprag:token-kind' emits symbols
more consistent with SXML.
Version 0.2 -- 2003-02-02
Portability improvements.
Version 0.1 -- 2003-01-31
Dusted off old Guile-specific code from April 2001, converted to
emit SXML, mostly ported to R5RS and SRFI-6, added some XHTML
support and documentation. A little preliminary testing has been
done, and the package is already useful for some applications, but
this release should be considered a preview to invite comments.
[HTML] Dave Raggett, Arnaud Le Hors, Ian Jacobs, eds., "HTML 4.01
Specification," W3C Recommendation, 1999-12-24.
[LGPL] Free Software Foundation, "GNU Lesser General Public License,"
Version 2.1, 1999-02, 59 Temple Place, Suite 330, Boston, MA
02111-1307 USA.
[SRFI-6] William D. Clinger, "Basic String Ports," SRFI 6, 1999-07-01.
[SRFI-23] Stephan Houben, "Error reporting mechanism," SRFI 23, 2001-04-26.
[SSAX] Oleg Kiselyov, "A functional-style framework to parse XML
documents," 2002-09-05.
[SSAX-HTML] Oleg Kiselyov, "Permissive parsing of perhaps invalid HTML,"
Version 1.1, 2001-11-03.
[SXML] Oleg Kiselyov, "SXML," revision 3.0.
[SXML-Tools] Kirill Lisovsky, "SXPath and SXPointer."
[SXPath] Oleg Kiselyov, "SXPath," version 3.5, 2001-01-12.
[Testeez] Neil W. Van Dyke, "Testeez: Simple Test Mechanism for Scheme,"
Version 0.1.
[XHTML] "XHTML 1.0: The Extensible HyperText Markup Language: A
Reformulation of HTML 4 in XML 1.0," W3C Recommendation.
[XML-Names] Tim Bray, Dave Hollander, Andrew Layman, eds., "Namespaces in
XML," W3C Recommendation, 1999-01-14.
---------- Footnotes ----------
(1) Scheme implementors who have not yet made `read'
case-sensitive by default are encouraged to do so.