diff --git a/doc/context/sources/general/manuals/mk/mk-xml.tex b/doc/context/sources/general/manuals/mk/mk-xml.tex
new file mode 100644
index 000000000..41398a365
--- /dev/null
+++ b/doc/context/sources/general/manuals/mk/mk-xml.tex
@@ -0,0 +1,613 @@
+% language=uk
+
+% \startluacode
+% xml.trace_lpath = true
+% \stopluacode
+
+\startcomponent mk-xml
+
+\environment mk-environment
+
+\chapter{XML revisioned}
+
+{\em The code dealing with \XML\ is evolving and the following
+text might be outdated. So, in case of doubt, check the manual.}
+
+\subject{the parser}
+
+For quite a while \CONTEXT\ has had built-in support for \XML\ processing and
+at \PRAGMA\ we use this extensively. One of the first things I tried to deal
+with in \LUA\ was \XML, and now that we have \LUATEX\ up and running it's
+time to investigate this a bit more. First we'll have a look at the basic
+functions, the \LUA\ side of the game.
+
+We load an \XML\ file as follows (the \type {document} namespace
+is predefined in \CONTEXT):
+
+\startbuffer
+\startluacode
+ document.xml = document.xml or { } -- define namespace
+ document.xml = xml.load("mk-xml.xml") -- load the file
+\stopluacode
+\stopbuffer
+
+\typebuffer \getbuffer
+
+The loader constructs a table representing the document structure, including
+whitespace, so let's serialize the code and see what shows up:
+
+\startbuffer
+\startluacode
+ local prn = xml.newhandlers { handle = tex.sprint }
+ tex.sprint("\\starttyping")
+ xml.serialize(document.xml, prn)
+ tex.sprint("\\stoptyping")
+\stopluacode
+\stopbuffer
+
+\typebuffer
+
+In the first version of the serializer, we could pass extra function
+arguments that controlled the way content was processed. This method
+has now been replaced by handlers. In this example we create a
+simple handler where the \type {handle} function is responsible
+for the final print.
+
+\getbuffer
+
+This already gives us a rather basic way to manipulate documents, and
+this method is not even that slow because we bypass \TEX\ reading from
+file.
+
+\startbuffer
+\startluacode
+ local str = "<l> <w>hello</w> <w>world</w> </l>"
+ local prn = xml.newhandlers { handle = tex.sprint }
+ tex.sprint("\\starttyping")
+ xml.serialize(xml.convert(str),prn)
+ tex.sprint("\\stoptyping")
+\stopluacode
+\stopbuffer
+
+\typebuffer
+
+Watch the extra print argument; we need it because otherwise the
+verbatim mode will not work out well.
+
+\getbuffer
+
+You need to keep in mind that in these examples we print to \TEX\ under
+the current catcode regime.
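+
+If you want to be independent of the current regime you can pass an
+explicit catcode table to the print functions, as happens in some of
+the examples near the end of this chapter. A minimal sketch, where the
+forced regime makes sure that the embedded command is interpreted as
+such:
+
+\starttyping
+\startluacode
+    -- force the ConTeXt catcode regime instead of the current one
+    tex.sprint(tex.ctxcatcodes, "a \\quote{controlled} print")
+\stopluacode
+\stoptyping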
+
+You can save an \XML\ table with the following command:
+
+\starttyping
+\startluacode
+ xml.save(document.xml,"newfile.xml")
+\stopluacode
+\stoptyping
+
+These examples show that you have access to \XML\ files from
+within your document. If you want to convert the table to just a
+string, you can use \type {xml.tostring}. Actually, this method is
+automatically used for occasions where \LUA\ wants to print an
+\XML\ table or wants to join string snippets. However, as we are
+inside \TEX, we need to print to \TEX\ instead of the console or
+file. For this we use specialized handlers.
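+
+As a minimal sketch of \type {xml.tostring}, assuming the table loaded
+before is still around, we can flatten the tree and just report the
+size of the result:
+
+\starttyping
+\startluacode
+    -- convert the parsed tree back into a single string
+    local str = xml.tostring(document.xml)
+    tex.sprint(tex.ctxcatcodes, "serialized size: " .. #str .. " bytes")
+\stopluacode
+\stoptyping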
+
+The reason why I wrote the \XML\ parser is that we need it in the
+utilities (so it has to provide access to the content of elements)
+as well as in the text processing (so it needs to provide some
+manipulation features). To serve both we have implemented a subset
+of what standard \XML\ tools qualify as path based searching.
+
+\startbuffer
+\startluacode
+ xml.sprint(xml.first(document.xml, "/one/three/some"))
+\stopluacode
+\stopbuffer
+
+\typebuffer
+
+The result of this snippet is the content of the first element
+that matches the specification: \quote{\getbuffer}. As you can
+see, this comes out rather verbose. The reason for this is that we
+need to enter \XML\ mode in order to get such a snippet
+interpreted.
+
+Below we give a few more variants; this time
+we use a generic filter:
+
+\startbuffer
+\startluacode
+ xml.sprint(xml.filter(document.xml, "/one/three/some"))
+\stopluacode
+\stopbuffer
+
+\typebuffer result: \astype{\getbuffer}
+
+\startbuffer
+\startluacode
+ xml.sprint(xml.filter(document.xml, "/one/three/some/first()"))
+\stopluacode
+\stopbuffer
+
+\typebuffer result: \astype{\getbuffer}
+
+\startbuffer
+\startluacode
+ xml.sprint(xml.filter(document.xml, "/one/three/some[1]"))
+\stopluacode
+\stopbuffer
+
+\typebuffer result: \astype{\getbuffer}
+
+\startbuffer
+\startluacode
+ xml.sprint(xml.filter(document.xml, "/one/three/some[-1]"))
+\stopluacode
+\stopbuffer
+
+\typebuffer result: \astype{\getbuffer}
+
+\startbuffer
+\startluacode
+ xml.sprint(xml.filter(document.xml, "/one/three/some/texts()"))
+\stopluacode
+\stopbuffer
+
+\typebuffer result: \astype{\getbuffer}
+
+\startbuffer
+\startluacode
+ xml.sprint(xml.filter(document.xml, "/one/three/some[2]/text()"))
+\stopluacode
+\stopbuffer
+
+\typebuffer result: \astype{\getbuffer}
+
+The previous lines show some more variants. There are more than these
+and we will extend the repertoire over time. If needed you can define
+additional handlers.
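+
+Such a handler does not even have to print. The following sketch (the
+names are arbitrary and only serve as an illustration) collects the
+serialized snippets in a \LUA\ table so that they can be postprocessed
+before anything is passed on to \TEX:
+
+\starttyping
+\startluacode
+    local snippets  = { }
+    local collector = xml.newhandlers {
+        -- store whatever the serializer hands us instead of printing it
+        handle = function(...) snippets[#snippets+1] = table.concat { ... } end,
+    }
+    xml.serialize(document.xml, collector)
+    tex.sprint(tex.ctxcatcodes, #snippets .. " snippets collected")
+\stopluacode
+\stoptyping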
+
+\subject{performance}
+
+Before we continue with more examples, a few remarks about the
+performance. The first version of the parser was an enhanced
+version of the one presented in the \LUA\ book: support for
+namespaces, processing instructions, comments, cdata and doctype,
+remapping and a few more things. When playing with the parser I
+was quite satisfied with the performance. However, when I started
+experimenting with 40~megabyte files, the preprocessing (needed
+for the special elements) started to become more noticeable. For
+smaller files its 40\% overhead is not that disturbing, but for
+large files \unknown\
+
+The current version uses \LPEG. We follow the same approach as
+before, stack and top and such, but this time parsing is about
+twice as fast, which is mostly due to the fact that we don't have
+to prepare the stream for cdata, doctype, etc. Loading the
+mentioned large file took 12.5 seconds (1.5 for file io and the
+rest for tree building) on my laptop (a 2.3 GHz Core Duo running
+Windows Vista). With the \LPEG\ implementation we got that down to
+less than 7.3 seconds. Loading the 14 interface definition files
+(2.6 megabytes) went down from 1.05 seconds to 0.55 seconds.
+Namespace related issues take some 10\% of this.
+
+Of course these numbers might change over time. For instance, we
+now have the second implementation of the filter mechanism which
+is more advanced and maybe somewhat slower on some tasks.
+
+\subject{patterns}
+
+We will not implement complete \XPATH\ functionality, but only the
+features that make sense for documents that are well structured
+and need to be typeset. In addition we (will) implement text
+manipulation functions. Of course speed is also a consideration
+when implementing such mechanisms.
+
+The following list is not complete (after all, here we only give an
+impression of the development) but it gives a good idea of what is supported.
+
+\nonknuthmode
+
+\starttabulate[|l|c|l|]
+\NC \bf pattern \NC \bf supported \NC \bf comment \NC \NR
+\HL
+\NC \type{a} \NC \star \NC not anchored \NC \NR
+\NC \type{!a} \NC \star \NC not anchored, negated \NC \NR
+\NC \type{a/b} \NC \star \NC anchored on preceding \NC \NR
+\NC \type{/a/b} \NC \star \NC anchored (current root) \NC \NR
+\NC \type{^a/c} \NC \star \NC anchored (current root) \NC \NR
+\NC \type{^^/a/c} \NC todo \NC anchored (document root) \NC \NR
+\NC \type{a/*/b} \NC \star \NC one wildcard \NC \NR
+\NC \type{a//b} \NC \star \NC many wildcards \NC \NR
+\NC \type{a/**/b} \NC \star \NC many wildcards \NC \NR
+\NC \type{.} \NC \star \NC self (ignored) \NC \NR
+\NC \type{..} \NC \star \NC parent \NC \NR
+\NC \type{a[5]} \NC \star \NC index upwards \NC \NR
+\NC \type{a[-5]} \NC \star \NC index downwards \NC \NR
+\NC \type{a[position()=5]} \NC maybe \NC \NC \NR
+\NC \type{a[first()]} \NC maybe \NC \NC \NR
+\NC \type{a[last()]} \NC maybe \NC \NC \NR
+\NC \type{(b|c|d)} \NC \star \NC alternates (one of) \NC \NR
+\NC \type{b|c|d} \NC \star \NC alternates (one of) \NC \NR
+\NC \type{!(b|c|d)} \NC \star \NC not one of \NC \NR
+\NC \type{a/(b|c|d)/e/f} \NC \star \NC anchored alternates \NC \NR
+\NC \type{(c/d|e)} \NC not likely \NC nested subpaths \NC \NR
+\NC \type{a/b[@bla]} \NC \star \NC any value of \NC \NR
+\NC \type{a/b/@bla} \NC \star \NC any value of \NC \NR
+\NC \type{a/b[@bla='oeps']} \NC \star \NC equals value \NC \NR
+\NC \type{a/b[@bla=='oeps']} \NC \star \NC equals value \NC \NR
+\NC \type{a/b[@bla<>'oeps']} \NC \star \NC different value \NC \NR
+\NC \type{a/b[@bla!='oeps']} \NC \star \NC different value \NC \NR
+\TB
+\NC \type{...../attribute(id)} \NC \star \NC \NC \NR
+\NC \type{...../attributes()} \NC \star \NC \NC \NR
+\NC \type{...../text()} \NC \star \NC \NC \NR
+\NC \type{...../texts()} \NC \star \NC \NC \NR
+\NC \type{...../first()} \NC \star \NC \NC \NR
+\NC \type{...../last()} \NC \star \NC \NC \NR
+\NC \type{...../index(n)} \NC \star \NC \NC \NR
+\NC \type{...../position(n)} \NC \star \NC \NC \NR
+\TB
+\NC \type{root::} \NC \star \NC \NC \NR
+\NC \type{parent::} \NC \star \NC \NC \NR
+\NC \type{child::} \NC \star \NC \NC \NR
+\NC \type{ancestor::} \NC \star \NC \NC \NR
+\NC \type{preceding-sibling::} \NC not soon \NC \NC \NR
+\NC \type{following-sibling::} \NC not soon \NC \NC \NR
+\NC \type{preceding-sibling-of-self::} \NC not soon \NC \NC \NR
+\NC \type{following-sibling-or-self::} \NC not soon \NC \NC \NR
+\NC \type{descendent::} \NC \star \NC \NC \NR
+\NC \type{descendent-or-self::} \NC \star \NC \NC \NR
+\NC \type{preceding::} \NC not soon \NC \NC \NR
+\NC \type{following::} \NC not soon \NC \NC \NR
+\NC \type{self::node()} \NC not soon \NC \NC \NR
+\NC \type{id("tag")} \NC not soon \NC \NC \NR
+\NC \type{node()} \NC not soon \NC \NC \NR
+\stoptabulate
+
+This list shows that it is also possible to ask for more matches at
+once. Namespaces are supported (including a wildcard) and there are
+mechanisms for namespace remapping.
+
+\startbuffer
+\startluacode
+ lxml.concat(document.xml,"/one/(three|five)/some",", "," and ")
+\stopluacode
+\stopbuffer
+
+\typebuffer
+
+We get: \astype{\getbuffer} and if we say:
+
+\startbuffer
+\startluacode
+ lxml.concat(document.xml,"/one/(three|five)/some",", "," and ",
+ true)
+\stopluacode
+\stopbuffer
+
+\typebuffer
+
+We get: \quote {\getbuffer}.
+
+Watch how we use the \type {lxml} namespace here! This is where the
+functions live that pipe their result to \TEX.
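+
+The difference matters: the \type {xml} functions return \LUA\ values
+that we print ourselves, while their \type {lxml} counterparts push
+their result directly into the \TEX\ input stream. A sketch, assuming
+that \type {lxml.first} behaves like the \type {\xmlfirst} command
+shown later:
+
+\starttyping
+\startluacode
+    -- explicit: filter returns a result that we print ourselves
+    xml.sprint(xml.filter(document.xml,"/one/three/some"))
+    -- implicit: the lxml variant pipes its result to TeX itself
+    lxml.first(document.xml,"/one/three/some")
+\stopluacode
+\stoptyping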
+
+\startbuffer
+\startluacode
+ lxml.count(document.xml,"/one/(three|five)/some")
+\stopluacode
+\stopbuffer
+
+There are several helper functions, like \type {xml.count} and its
+\type {lxml} counterpart used here, which in this case returns~\getbuffer.
+
+\typebuffer
+
+Functions like this give us the opportunity to loop over lists of
+elements by index.
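+
+A sketch of such a loop, assuming that \type {xml.count} returns the
+number of matches as a plain number (the pattern is the one used
+above):
+
+\starttyping
+\startluacode
+    local pattern = "/one/(three|five)/some"
+    local n = xml.count(document.xml, pattern)
+    for i=1,n do
+        -- append an index to the pattern to fetch the i-th match
+        xml.sprint(xml.filter(document.xml, pattern .. "[" .. i .. "]"))
+        tex.sprint(tex.ctxcatcodes, "\\par")
+    end
+\stopluacode
+\stoptyping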
+
+\subject{manipulations}
+
+We can manipulate elements too. The next code will add some elements
+at specific locations.
+
+\startbuffer
+\startluacode
+ xml.before(document.xml,"xml:///one/three/some","<be>ok</be>")
+ xml.after (document.xml,"xml:///one/three/some","<af>ok</af>")
+ tex.sprint("\\starttyping")
+ xml.sprint(lxml.filter(document.xml,"/one/three"))
+ tex.sprint("\\stoptyping")
+\stopluacode
+\stopbuffer
+
+\typebuffer
+
+And indeed, we suddenly have a couple of \quote {ok}'s there:
+
+\getbuffer
+
+Of course we can also delete elements:
+
+\startbuffer
+\startluacode
+ xml.delete(document.xml,"/one/three/some")
+ xml.delete(document.xml,"/one/three/af")
+ tex.sprint("\\starttyping")
+ xml.sprint(lxml.filter(document.xml,"/one/three"))
+ tex.sprint("\\stoptyping")
+\stopluacode
+\stopbuffer
+
+\typebuffer
+
+Now we have:
+
+\getbuffer
+
+Replacing an element is also possible. The replacement can be a
+table (representing elements) or a string which is then converted
+into a table first.
+
+\startbuffer
+\startluacode
+ xml.replace(document.xml,"/one/three/be","<mid>done</mid>")
+ tex.sprint("\\starttyping")
+ xml.sprint(lxml.filter(document.xml,"/one/three"))
+ tex.sprint("\\stoptyping")
+\stopluacode
+\stopbuffer
+
+\typebuffer
+
+And indeed we get:
+
+\getbuffer
+
+These are just a few features of the library. I will add some more (rather) generic
+manipulators and extend the functionality of the existing ones. Also, there will
+be a few manipulation functions that come in handy when preparing texts for
+processing with \TEX\ (most of the \XML\ that I deal with is rather dirty and needs
+some cleanup).
+
+\subject{streaming trees}
+
+Eventually we will provide a series of convenient macros that form an
+alternative for most of the \MKII\ code. In \MKII\ we have a streaming parser, which
+boils down to attaching macros to elements. This includes a mechanism for saving
+and restoring data, but this is not always convenient because one also has to
+intercept elements that need to be hidden.
+
+In \MKIV\ we do things differently. First we load the complete document in memory (a
+\LUA\ table). Then we flush the elements that we want to process. We can associate
+setups with elements using the filters mentioned before, and we can use either \TEX\
+or \LUA\ to manipulate content. Instead of a streaming parser we now have a mixture
+of streaming and tree manipulation available. It is interesting that the \XML\ loader
+is pretty fast and piping data to \TEX\ is also efficient. Since we no longer need to
+manipulate the elements in \TEX\ we gain processing time too, so in practice we
+now have much faster \XML\ processing available.
+
+To give you an idea we show a few commands:
+
+\startbuffer
+\xmlload {main}{mk-xml.xml}
+\stopbuffer
+
+\typebuffer \getbuffer
+
+So that we can do things like the following (there are, and will be, a few more):
+
+\starttabulate[|l|l|l|]
+\NC \bf command \NC \bf arguments \NC \bf result \NC \NR
+\NC \type {\xmlfirst} \NC \type {{main} {/one/three/some}} \NC \xmlfirst{main}{/one/three/some} \NC \NR
+\NC \type {\xmllast } \NC \type {{main} {/one/three/some}} \NC \xmllast {main}{/one/three/some} \NC \NR
+\NC \type {\xmlindex} \NC \type {{main} {/one/three/some} {2}} \NC \xmlindex{main}{/one/three/some}{2} \NC \NR
+\stoptabulate
+
+There is a set of about 30 commands that operate on the tree: loading, flushing,
+filtering, and associating setups and module code with elements. For instance, when
+one uses so called cals||tables, the processing is automatically activated when the
+namespace can be resolved. Processing is collected in setups, and those registered
+are processed after loading the tree. In the following example we register
+a handler for content that needs to end up bold.
+
+\starttyping
+\startxmlsetups xml:mysetups
+ \xmlsetsetup{\xmldocument}{bold|bf}{xml:handlebold}
+\stopxmlsetups
+
+\xmlregistersetup{xml:mysetups}
+
+\startxmlsetups xml:handlebold
+ \dontleavehmode
+ \bgroup
+ \bf
+ \xmlflush{#1}
+ \egroup
+\stopxmlsetups
+\stoptyping
+
+In this example \type {#1} represents the root of the subtree. Say that we
+want to process an index entry which is coded as follows:
+
+\starttyping
+<index>
+ <entry>whatever</entry>
+ <key>whatever</key>
+</index>
+\stoptyping
+
+We register an additional handler (here the \type {*} is a shortcut for
+using the element's tag as setup name):
+
+\starttyping
+\startxmlsetups xml:mysetups
+ \xmlsetsetup{\xmldocument}{bold|bf}{xml:handlebold}
+ \xmlsetsetup{\xmldocument}{index}{*}
+\stopxmlsetups
+
+\xmlregistersetup{xml:mysetups}
+
+\startxmlsetups index
+ \index[\xmlfirst{#1}{key}]{\xmlfirst{#1}{entry}}
+\stopxmlsetups
+\stoptyping
+
+In practice \MKIV\ definitions are more compact than the comparable
+\MKII\ ones, especially for more complex constructs (tables and such).
+
+\starttyping
+\defineXMLenvironment
+ [index]
+ {\bgroup
+ \defineXMLsave[key]%
+ \defineXMLsave[entry]}
+ {\index[\XMLflush{key}]{\XMLflush{entry}}%
+ \egroup}
+\stoptyping
+
+This looks compact, but keep in mind that we also need to get rid of
+spurious spaces, and when the code grows, we usually use setups to separate
+the definition from the code. In any case, the \MKII\ solution involves
+a few definitions as well as saving the content of elements. This is often
+much more costly than the \MKIV\ method where we only locate and flush
+content. Of course the document is stored in memory, but that happens
+pretty fast: storing the 14~files (2~per interface) that define the \CONTEXT\
+user interface takes 0.85 seconds on a 2.3 GHz Core Duo (Windows Vista) which
+is not that bad if you take into account that we're talking about 2.7 megabytes
+of highly structured data (many elements and attributes, not that much text).
+Loading one of these files using \MKII\ code (for storing elements) takes
+many more seconds.
+
+I haven't done extensive speed tests yet, but for normal streamed
+processing of simple documents the penalty of loading the tree can be
+neglected. When comparing traditional \MKII\ code like:
+
+\starttyping
+\defineXMLargument [title][id=] {\subject[\XMLop{at}]}
+\defineXMLenvironment[p] {} {\par}
+
+\starttext
+ \processXMLfilegrouped{testspeed.xml}
+\stoptext
+\stoptyping
+
+with its \MKIV\ counterpart:
+
+\starttyping
+\startxmlsetups document
+ \xmlsetsetup\xmldocument{title|p}{*}
+\stopxmlsetups
+
+\xmlregistersetup{document}
+
+\startxmlsetups title
+ \section[\xmlatt{#1}{id}]{\xmlcontent{#1}{/}}
+\stopxmlsetups
+
+\startxmlsetups p
+ \xmlflush{#1}\endgraf
+\stopxmlsetups
+
+\starttext
+ \processXMLfilegrouped{testspeed.xml}
+\stoptext
+\stoptyping
+
+I found that processing a one megabyte file with some 400 sections
+takes the same runtime for both approaches. However, as soon as more
+complex manipulations enter the game the \MKIV\ method starts taking
+less time. Think of the manipulations needed for \MATHML\ or converting
+tables into something that \CONTEXT\ can handle. Also, when we deal
+with documents where we need to ignore large portions or shuffle content
+around, the traditional method also has to store data in memory, and in
+that case \MKII\ code always loses out to \MKIV\ code. Of course any speed
+we gain in handling \XML\ is lost on processing complex fonts and
+attributes but there we gain in quality.
+
+Another advantage of the \MKIV\ mechanisms is that we suddenly have so called
+fully expandable \XML\ handling. All manipulations take place in \LUA\ and
+there is no interfering code at the \TEX\ end.
+
+\subject{examples}
+
+For the path freaks we now show what patterns lead to. For this we will
+use the following \XML\ data:
+
+\startbuffer[xml]
+<?xml version='1.0' ?>
+<a>
+ <?what is this?>
+ <b>
+ <c n='x'>c1</c><d>d1</d>
+ </b>
+ <b>
+ <c n='y'>c2</c><d>d2</d>
+ </b>
+ <?what is that?>
+ <c><d>d3</d></c>
+ <c n='y'><d>d4</d></c>
+ <c><d>d5</d></c>
+</a>
+\stopbuffer
+
+\typebuffer[xml]
+
+\xmlloadbuffer{xml}{xml}
+
+\startluacode
+ function document.ShowResultOfPattern(root,pattern)
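+        -- typeset every element that matches the given pattern and
+        -- keep track of whether we found anything at all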
+ local ok = false
+ for r,d,k in xml.elements(lxml.id(root),pattern) do
+ tex.print(xml.tostring(d[k]))
+ tex.sprint(tex.ctxcatcodes,"\\par")
+ ok = true
+ end
+ if not ok then
+ tex.sprint("no match")
+ tex.sprint(tex.ctxcatcodes,"\\par")
+ end
+ end
+\stopluacode
+
+Here come the examples:
+
+\definehead[example][subsubject]
+\setuphead[example][style=\tt,before=\blank,after=\nowhitespace]
+
+\def\ShowResultOfPattern#1#2%
+ {\example{#2}
+ \startpacked \tttf
+ \ctxlua{document.ShowResultOfPattern("#1","#2")}
+ \stoppacked}
+
+\ShowResultOfPattern{xml}{a/b/c}
+\ShowResultOfPattern{xml}{/a/b/c}
+\ShowResultOfPattern{xml}{b/c}
+\ShowResultOfPattern{xml}{c}
+\ShowResultOfPattern{xml}{a/*/c}
+\ShowResultOfPattern{xml}{a/**/c}
+\ShowResultOfPattern{xml}{a//c}
+\ShowResultOfPattern{xml}{a/*/*/c}
+\ShowResultOfPattern{xml}{*/c}
+\ShowResultOfPattern{xml}{**/c}
+\ShowResultOfPattern{xml}{a/../*/c}
+\ShowResultOfPattern{xml}{a/../c}
+\ShowResultOfPattern{xml}{c[@n='x']}
+\ShowResultOfPattern{xml}{c[@n]}
+\ShowResultOfPattern{xml}{c[@n='y']}
+\ShowResultOfPattern{xml}{c[1]}
+\ShowResultOfPattern{xml}{b/c[1]}
+\ShowResultOfPattern{xml}{a/c[1]}
+\ShowResultOfPattern{xml}{a/c[-1]}
+\ShowResultOfPattern{xml}{c[1]}
+\ShowResultOfPattern{xml}{c[-1]}
+\ShowResultOfPattern{xml}{pi::}
+\ShowResultOfPattern{xml}{pi::what}
+
+\stopcomponent