xml-pull: pull-style parsing for large xml files
xml-pull: pull-style parsing for large xml files
Quick Example
-------------
> (require (planet "xml-pull.ss" ("dyoo" "xml-pull.plt" 1 0)))
> (define a-taffy
(start-xml-pull (open-input-string #<<EOF
<test-xml>
<person>
<name>Sue Rhee</name>
</person>
<person>
<name>Dan Garcia</name>
</person>
<person>
<name>Mike Clancy</name>
</person>
</test-xml>
EOF
)))
We can consume this XML structure by morsels:
> (pull-morsel a-taffy)
#3(struct:start-element test-xml ())
> (pull-morsel a-taffy)
#3(struct:characters "\n" "")
> (pull-morsel a-taffy)
#3(struct:start-element person ())
At this point, we are at the start-element of a person. When we
see a start-element that is interesting to us, we can _pull-sexp_
the rest of that element as a normalized SXML fragment:
> (pull-sexp a-taffy)
(person (@) (name (@) "Sue Rhee"))
What's nice about this is that we only consume as much of the XML
from our input-stream as we need, and moreover, memory usage is
bounded to the amount of memory needed to represent the
fragment.
Structures
----------
There are two structures in this module: taffy and morsel.
* taffy
A _taffy_ is a core structure that maintains the state of
the XML parse. Conceptually, a taffy is an iterator of morsels
and SXML fragments.
* morsel
A _morsel_ is one of the following:
* (make-start-element name attributes)
where name is a symbol and attributes is
a (listof (list symbol string))
* (make-end-element n a)
where name is a symbol and attributes is
a (listof (list symbol string))
* (make-characters s1 s2)
where s1 and s2 are strings
* (make-exhausted)
Most of these are self-explanatory. We produce an _exhausted_
structure when there are no more elements in the xml
to parse.
The expected predicates and selectors are also available:
> taffy? : any -> boolean
> morsel? : any -> boolean
> start-element? : any -> boolean
> end-element? : any -> boolean
> characters? : any -> boolean
> exhausted? : any -> boolean
> start-element-name : start-element -> symbol
> start-element-attributes : start-element -> (listof (list symbol string))
> end-element-name: end-element -> symbol
> end-element-attributes end-element -> (listof (list symbol string))
Functions
---------
> start-xml-pull: input-port -> taffy
Given an input-port, starts the XML parse and returns a taffy.
> pull-morsel: taffy -> morsel
Takes a taffy and rips off a morsel.
> pull-sexp: taffy -> (union sexp exhausted)
Assuming that the very last morsel that is pulled off is a start-element,
pulls enough morsels to reproduce that element. If the last morsel is not
a start-event, raises an error.
> pull-sexps/g: taffy symbol -> (generatorof sexp)
The result is a _generator_ whose elements are s-expressions those
names match the given input symbol.
See http://planet.plt-scheme.org/#generator.plt2.0 for more details.
Parameters
----------
> current-namespace-translate: symbol -> symbol
If provided, this is used to translate the namespace portion of
element names in an XML document. By default, this is bound to the
identity function. (This is experimental --- I might remove this in
a later release of this software in favor of a simpler substitution
map similar to what ssax:xml->sxml takes in.)
More extenstive example
-----------------------
Here is code that takes a large XML document --- the collection of common ontology
terms used in bioinformatics --- and prints out the first hundred terms:
(module test-xml-pull-2 mzscheme
(require (lib "url.ss" "net")
(lib "inflate.ss")
(lib "pretty.ss")
(planet "xml-pull.ss" ("dyoo" "xml-pull.plt" 1 0))
(planet "generator.ss" ("dyoo" "generator.plt" 2 0)))
;; wrap-gunzip: input-port -> input-port
;; Wraps an uncompressor around the input stream.
(define (wrap-gunzip original-ip)
(define-values (ip op) (make-pipe 32768))
(thread (lambda () (gunzip-through-ports original-ip op)))
ip)
(define my-url
(string->url "http://archive.godatabase.org/latest-termdb/go_daily-termdb.rdf-xml.gz"))
(define my-input-port (wrap-gunzip (get-pure-port my-url)))
(define my-taffy (start-xml-pull my-input-port))
(define generated-terms (pull-sexps/g my-taffy 'http://www.geneontology.org/dtds/go.dtd#:term))
;; pretty-print the first 100 terms in the Gene Ontology
(let loop ([i 0])
(when (< i 100)
(pretty-print (generator-next generated-terms))
(loop (add1 i)))))
Thanks
------
Thanks to the PLT folks for writing tools that are very enjoyable
to play with. Special thanks to the bioinformaticians at
TAIR (http://arabidopsis.org) who taught me to appreciate
very large XML datasets.
References
----------
SSAX (http://ssax.sourceforge.net/)
SXML (http://okmij.org/ftp/Scheme/SXML.html)
About Pulldom and Minidom (http://www.prescod.net/python/pulldom.html)