diff options
Diffstat (limited to 'doc/context/sources/general/manuals/mk/mk-xml.tex')
-rw-r--r-- | doc/context/sources/general/manuals/mk/mk-xml.tex | 613 |
1 files changed, 613 insertions, 0 deletions
diff --git a/doc/context/sources/general/manuals/mk/mk-xml.tex b/doc/context/sources/general/manuals/mk/mk-xml.tex new file mode 100644 index 000000000..41398a365 --- /dev/null +++ b/doc/context/sources/general/manuals/mk/mk-xml.tex @@ -0,0 +1,613 @@ +% language=uk + +% \startluacode +% xml.trace_lpath = true +% \stopluacode + +\startcomponent mk-xml + +\environment mk-environment + +\chapter{XML revisioned} + +{\em The code dealing with \XML\ is evolving and the following +text might be outdated. So, in case of doubt, check the manual.} + +\subject{the parser} + +For quite a while \CONTEXT\ has built-in support for \XML\ processing and +at \PRAGMA\ we use this extensively. One of the first things I tried to deal +with in \LUA\ was \XML, and now that we have \LUATEX\ up and running it's +time to investigate this a bit more. First we'll have a look at the basic +functions, the \LUA\ side of the game. + +We load an \XML\ file as follows (the \type {document} namespace +is predefined in \CONTEXT): + +\startbuffer +\startluacode + document.xml = document.xml or { } -- define namespace + document.xml = xml.load("mk-xml.xml") -- load the file +\stopluacode +\stopbuffer + +\typebuffer \getbuffer + +The loader constructs a table representing the document structure, including +whitespace, so let's serialize the code and see what shows up: + +\startbuffer +\startluacode + local prn = xml.newhandlers { handle = tex.sprint } + tex.sprint("\\starttyping") + xml.serialize(document.xml, prn) + tex.sprint("\\stoptyping") +\stopluacode +\stopbuffer + +\typebuffer + +In the first version of the serializer, we could pass extra function +arguments that controlled the way content was processed. This method +has now been replaced by handlers. In this example we create a +simple handler where the \type {handle} function is responsible +for the final print. + +\getbuffer + +This already gives us a rather basic way to manipulate documents and +this method is even not that slow because we bypass \TEX\ reading from +file. + +\startbuffer +\startluacode + local str = "<l> <w>hello</w> <w>world</w> </l>" + local prn = xml.newhandlers { handle = tex.sprint } + tex.sprint("\\starttyping") + xml.serialize(xml.convert(str),prn) + tex.sprint("\\stoptyping") +\stopluacode +\stopbuffer + +\typebuffer + +Watch the extra print argument, we need this because otherwise the +verbatim mode will not work out well. + +\getbuffer + +You need to keep in mind that in these examples we print to \TEX\ under +the current catcode regime. + +You can save a \XML\ table with the command: + +\starttyping +\startluacode + xml.save(document.xml,"newfile.xml") +\stopluacode +\stoptyping + +These examples show that you have access to \XML\ files from +within your document. If you want to convert the table to just a +string, you can use \type {xml.tostring}. Actually, this method is +automatically used for occasions where \LUA\ wants to print an +\XML\ table or wants to join string snippets. However, as we are +inside \TEX, we need to print to \TEX\ instead of the console or +file. For this we use specialized handlers. + +The reason why I wrote the \XML\ parser is that we need it in the +utilities (so it has to provide access to the content of elements) +as well as in the text processing (so it needs to provide some +manipulation features). To serve both we have implemented a subset +of what standard \XML\ tools qualify as path based searching. + +\startbuffer +\startluacode + xml.sprint(xml.first(document.xml, "/one/three/some")) +\stopluacode +\stopbuffer + +\typebuffer + +The result of this snippet is the content of the first element +that matches the specification: \quote{\getbuffer}. As you can +see, this comes out rather verbose. The reason for this is that we +need to enter \XML\ mode in order to get such a snippet +interpreted. + +Below we give a few more variants, this time +we use a generic filter: + +\startbuffer +\startluacode + xml.sprint(xml.filter(document.xml, "/one/three/some")) +\stopluacode +\stopbuffer + +\typebuffer result: \astype{\getbuffer} + +\startbuffer +\startluacode + xml.sprint(xml.filter(document.xml, "/one/three/some/first()")) +\stopluacode +\stopbuffer + +\typebuffer result: \astype{\getbuffer} + +\startbuffer +\startluacode + xml.sprint(xml.filter(document.xml, "/one/three/some[1]")) +\stopluacode +\stopbuffer + +\typebuffer result: \astype{\getbuffer} + +\startbuffer +\startluacode + xml.sprint(xml.filter(document.xml, "/one/three/some[-1]")) +\stopluacode +\stopbuffer + +\typebuffer result: \astype{\getbuffer} + +\startbuffer +\startluacode + xml.sprint(xml.filter(document.xml, "/one/three/some/texts()")) +\stopluacode +\stopbuffer + +\typebuffer result: \astype{\getbuffer} + +\startbuffer +\startluacode + xml.sprint(xml.filter(document.xml, "/one/three/some[2]/text()")) +\stopluacode +\stopbuffer + +\typebuffer result: \astype{\getbuffer} + +The next lines shows some more variants. There are more than these and +we will extend the repertoire over time. If needed you can define +additional handlers. + +\subject{performance} + +Before we continue with more examples, a few remarks about the +performance. The first version of the parser was an enhanced +version of the one presented in the \LUA\ book: support for +namespaces, processing instructions, comments, cdata and doctype, +remapping and a few more things. When playing with the parser I +was quite satisfied about the performance. However, when I started +experimenting with 40~megabyte files, the preprocessing (needed +for the special elements) started to become more noticeable. For +smaller files its 40\% overhead is not that disturbing, but for +large files \unknown\ + +The current version uses \LPEG. We follow the same approach as +before, stack and top and such but this time parsing is about +twice as fast which is mostly due to the fact that we don't have +to prepare the stream for cdata, doctype etc. Loading the +mentioned large file took 12.5 seconds (1.5 for file io and the +rest for tree building) on my laptop (a 2.3 Ghz Core Duo running +Windows Vista). With the \LPEG\ implementation we got that down to +less 7.3 seconds. Loading the 14 interface definition files (2.6 +meg) went down from 1.05 seconds to 0.55 seconds. Namespace +related issues take some 10\% of this. + +Of course these numbers might change over time. For instance, we +now have the second implementation of the filter mechanism which +is more advanced and maybe somewhat slower on some tasks. + +\subject{patterns} + +We will not implement complete \XPATH\ functionality, but only the +features that make sense for documents that are well structured +and needs to be typeset. In addition we (will) implement text +manipulation functions. Of course speed is also a consideration +when implementing such mechanisms. + +The following list is not complete (after all here we only give an +impression of the development) but it gives a good impression. + +\nonknuthmode + +\starttabulate[|l|c|l|] +\NC \bf pattern \NC \bf supported \NC \bf comment \NC \NR +\HL +\NC \type{a} \NC \star \NC not anchored \NC \NR +\NC \type{!a} \NC \star \NC not anchored,negated \NC \NR +\NC \type{a/b} \NC \star \NC anchored on preceding \NC \NR +\NC \type{/a/b} \NC \star \NC anchored (current root) \NC \NR +\NC \type{^a/c} \NC \star \NC anchored (current root) \NC \NR +\NC \type{^^/a/c} \NC todo \NC anchored (document root) \NC \NR +\NC \type{a/*/b} \NC \star \NC one wildcard \NC \NR +\NC \type{a//b} \NC \star \NC many wildcards \NC \NR +\NC \type{a/**/b} \NC \star \NC many wildcards \NC \NR +\NC \type{.} \NC \star \NC ignored self \NC \NR +\NC \type{..} \NC \star \NC parent \NC \NR +\NC \type{a[5]} \NC \star \NC index upwards \NC \NR +\NC \type{a[-5]} \NC \star \NC index downwards \NC \NR +\NC \type{a[position()=5]} \NC maybe \NC \NC \NR +\NC \type{a[first()]} \NC maybe \NC \NC \NR +\NC \type{a[last()]} \NC maybe \NC \NC \NR +\NC \type{(b|c|d)} \NC \star \NC alternates (one of) \NC \NR +\NC \type{b|c|d} \NC \star \NC alternates (one of) \NC \NR +\NC \type{!(b|c|d)} \NC \star \NC not one of \NC \NR +\NC \type{a/(b|c|d)/e/f} \NC \star \NC anchored alternates \NC \NR +\NC \type{(c/d|e)} \NC not likely \NC nested subpaths \NC \NR +\NC \type{a/b[@bla]} \NC \star \NC any value of \NC \NR +\NC \type{a/b/@bla} \NC \star \NC any value of \NC \NR +\NC \type{a/b[@bla='oeps']} \NC \star \NC equals value \NC \NR +\NC \type{a/b[@bla=='oeps']} \NC \star \NC equals value \NC \NR +\NC \type{a/b[@bla<>'oeps']} \NC \star \NC different value \NC \NR +\NC \type{a/b[@bla!='oeps']} \NC \star \NC different value \NC \NR +\TB +\NC \type{...../attribute(id)} \NC \star \NC \NC \NR +\NC \type{...../attributes()} \NC \star \NC \NC \NR +\NC \type{...../text()} \NC \star \NC \NC \NR +\NC \type{...../texts()} \NC \star \NC \NC \NR +\NC \type{...../first()} \NC \star \NC \NC \NR +\NC \type{...../last()} \NC \star \NC \NC \NR +\NC \type{...../index(n)} \NC \star \NC \NC \NR +\NC \type{...../position(n)} \NC \star \NC \NC \NR +\TB +\NC \type{root::} \NC \star \NC \NC \NR +\NC \type{parent::} \NC \star \NC \NC \NR +\NC \type{child::} \NC \star \NC \NC \NR +\NC \type{ancestor::} \NC \star \NC \NC \NR +\NC \type{preceding-sibling::} \NC not soon \NC \NC \NR +\NC \type{following-sibling::} \NC not soon \NC \NC \NR +\NC \type{preceding-sibling-of-self::} \NC not soon \NC \NC \NR +\NC \type{following-sibling-or-self::} \NC not soon \NC \NC \NR +\NC \type{descendent::} \NC \star \NC \NC \NR +\NC \type{descendent-or-self::} \NC \star \NC \NC \NR +\NC \type{preceding::} \NC not soon \NC \NC \NR +\NC \type{following::} \NC not soon \NC \NC \NR +\NC \type{self::node()} \NC not soon \NC \NC \NR +\NC \type{id("tag")} \NC not soon \NC \NC \NR +\NC \type{node()} \NC not soon \NC \NC \NR +\stoptabulate + +This list shows that it is also possible to ask for more matches at +once. Namespaces are supported (including a wildcard) and there are +mechanisms for namespace remapping. + +\startbuffer +\startluacode + lxml.concat(document.xml,"/one/(three|five)/some",", "," and ") +\stopluacode +\stopbuffer + +\typebuffer + +We get: \astype{\getbuffer} and if we say: + +\startbuffer +\startluacode + lxml.concat(document.xml,"/one/(three|five)/some",", "," and ", + true) +\stopluacode +\stopbuffer + +\typebuffer + +We get: \quote {\getbuffer}. + +Watch how we use the \type {lxml} namespace here! Here live the +functions that pipe the result to \TEX. + +\startbuffer +\startluacode + lxml.count(document.xml,"/one/(three|five)/some") +\stopluacode +\stopbuffer + +There a several helper functions, like \type {xml.count} which in this case +returns~\getbuffer. + +\typebuffer + +Functions like this gives the opportunity to loop over lists of elements +by index. + +\subject{manipulations} + +We can manipulate elements too. The next code will add some elements +at specific locations. + +\startbuffer +\startluacode + xml.before(document.xml,"xml:///one/three/some","<be>ok</be>") + xml.after (document.xml,"xml:///one/three/some","<af>ok</af>") + tex.sprint("\\starttyping") + xml.sprint(lxml.filter(document.xml,"/one/three")) + tex.sprint("\\stoptyping") +\stopluacode +\stopbuffer + +\typebuffer + +And indeed, we suddenly have a couple of \quote {ok}'s there: + +\getbuffer + +Of course wel can also delete elements: + +\startbuffer +\startluacode + xml.delete(document.xml,"/one/three/some") + xml.delete(document.xml,"/one/three/af") + tex.sprint("\\starttyping") + xml.sprint(lxml.filter(document.xml,"/one/three")) + tex.sprint("\\stoptyping") +\stopluacode +\stopbuffer + +\typebuffer + +Now we have: + +\getbuffer + +Replacing an element is also possible. The replacement can be a +table (representing elements) or a string which is then converted +into a table first. + +\startbuffer +\startluacode + xml.replace(document.xml,"/one/three/be","<mid>done</mid>") + tex.sprint("\\starttyping") + xml.sprint(lxml.filter(document.xml,"/one/three")) + tex.sprint("\\stoptyping") +\stopluacode +\stopbuffer + +\typebuffer + +And indeed we get: + +\getbuffer + +These are just a few features of the library. I will add some more (rather) generic +manipulaters and extend the functionality of the existing ones. Also, there will +be a few manipulation functions that come in handy when preparing texts for +processing with \TEX\ (most of the \XML\ that I deal with is rather dirty and needs +some cleanup). + +\subject{streaming trees} + +Eventually we will provies series of convenient macros that will provide an +alternative for most of the \MKII\ code. In \MKII\ we have a streaming parser, which +boils down to attaching macros to elements. This includes a mechanism for saving +an restoring data, but this is not always convenient because one also has to +intercept elements that needs to be hidden. + +In \MKIV\ we do things different. First we load the complete document in memory (a +\LUA\ table). Then we flush the elements that we want to process. We can associate +setups with elements using the filters mentioned before. We can either use \TEX\ or +use \LUA\ to manipulate content. Instead if a streaming parser we now have a mixture +of streaming and tree manipulation available. Interesting is that the \XML\ loader +is pretty fast and piping data to \TEX\ is also efficient. Since we no longer need to +manipulate the elements in \TEX\ we gain processing time too, so in practice we have +now much faster \XML\ processing available. + +To give you an idea we show a few commands: + +\startbuffer +\xmlload {main}{mk-xml.xml} +\stopbuffer + +\typebuffer \getbuffer + +So that we can do things like (there are and will be a few more): + +\starttabulate[|l|l|l|] +\NC \bf command \NC \bf arguments \NC \bf result \NC \NR +\NC \type {\xmlfirst} \NC \type {{main} {/one/three/some}} \NC \xmlfirst{main}{/one/three/some} \NC \NR +\NC \type {\xmllast } \NC \type {{main} {/one/three/some}} \NC \xmllast {main}{/one/three/some} \NC \NR +\NC \type {\xmlindex} \NC \type {{main} {/one/three/some} {2}} \NC \xmlindex{main}{/one/three/some}{2} \NC \NR +\stoptabulate + +There is a set of about 30 commands that operates on the tree: loading, flushing, +filtering, associating setups and code in modules to elements. For instance when +one uses so called cals||tables, the processing is automatically activates when the +namespace can be resolved. Processing is collected in setups and those registered +are these are processed after loading the tree. In the following example we register +a handler for content that needs to end up bold. + +\starttyping +\startxmlsetups xml:mysetups + \xmlsetsetup{\xmldocument}{bold|bf}{xml:handlebold} +\stopxmlsetups + +\xmlregistersetup{xml:mysetups} + +\startxmlsetups xml:handlebold + \dontleavehmode + \bgroup + \bf + \xmlflush{#1} + \egroup +\stopxmlsetups +\stoptyping + +In this example \type {#1} represents the root of the subtree. Say that we +want to process an index entry which is coded as follows: + +\starttyping +<index> + <entry>whatever</entry> + <key>whatever</key> +</index> +\stoptyping + +We register an additional handler (here the \type {*} is a shortcut for +using the element's tag as setup name): + +\starttyping +\startxmlsetups xml:mysetups + \xmlsetsetup{\xmldocument}{bold|bf}{xml:handlebold} + \xmlsetsetup{\xmldocument}{index}{*} +\stopxmlsetups + +\xmlregistersetup{xml:mysetups} + +\startxmlsetups index + \index[\xmlfirst{#1}{key}]{\xmlfirst{#1}{entry}} +\stopxmlsetups +\stoptyping + +In practice \MKIV\ definitions are more compact than the comparable +\MKII\ ones, especially for more complex constructs (tables and such). + +\starttyping +\defineXMLenvironment + [index] + {\bgroup + \defineXMLsave[key]% + \defineXMLsave[entry]} + {\index[\XMLflush{key}]{\XMLflush{entry}}% + \egroup} +\stoptyping + +This looks compact, but keep in mind that we also need to get rid of +spurry spaces and when the code grows, we usually use setups to separate +the definition from the code. In any case, the \MKII\ solution involves +a few definitions as well as saving the content of elements. This is often +much more costly than the \MKIV\ method where we only locate and flush +content. Of course the document is stored in memory, but that happens +pretty fast: storing the 14~files (2~per interface) that define the \CONTEXT\ +user interface takes .85 seconds on a 2.3 Ghz Core Duo (Windows Vista) which +is not that bad if you take into account that we're talking of 2.7 megabytes +of highly structured data (many elements and attributes, not that much text). +Loading one of these files using \MKII\ code (for storing elements) takes +many more seconds. + +I didn't do extensive speed tests yet but for normal streamed +processing of simple documents the penalty of loading the tree can be +neglected. When comparing traditional \MKII\ code like: + +\starttyping +\defineXMLargument [title][id=] {\subject[\XMLop{at}]} +\defineXMLenvironment[p] {} {\par} + +\starttext + \processXMLfilegrouped{testspeed.xml} +\stoptext +\stoptyping + +with its \MKIV\ counterpart: + +\starttyping +\startxmlsetups document + \xmlsetsetup\xmldocument{title|p}{*} +\stopxmlsetups + +\xmlregistersetup{document} + +\startxmlsetups title + \section[\xmlatt{#1}{id}]{\xmlcontent{#1}{/}} +\stopxmlsetups + +\startxmlsetups p + \xmlflush{#1}\endgraf +\stopxmlsetups + +\starttext + \processXMLfilegrouped{testspeed.xml} +\stoptext + +I found that processing a one megabyte file with some 400 sections +takes the same runtime for both approaches. However, as soon as more +complex manipulations enter the game the \MKIV\ method starts taking +less time. Think of the manipulations needed for \MATHML\ or converting +tables into something that \CONTEXT\ can handle. Also, when we deal +with documents where we need to ignore large portions of shuffle content +around, the traditional method also has to store data in memory and in +that case \MKII\ code always loses from \MKIV\ code. Of course any speed +we gain in handling \XML\ is lost on processing complex fonts and +attributes but there we gain in quality. + +\stoptyping + +Another advantage of the \MKIV\ mechanisms is that we suddenly have so called +fully expandable \XML\ handling. All manipulations take place in \LUA\ and +there is no interfering code at the \TEX\ end. + +\subject{examples} + +For the path freaks we now show what patterns lead to. For this we will +use the following \XML\ data: + +\startbuffer[xml] +<?xml version='1.0' ?> +<a> + <?what is this?> + <b> + <c n='x'>c1</c><d>d1</d> + </b> + <b> + <c n='y'>c2</c><d>d2</d> + </b> + <?what is that?> + <c><d>d3</d></c> + <c n='y'><d>d4</d></c> + <c><d>d5</d></c> +</a> +\stopbuffer + +\typebuffer[xml] + +\xmlloadbuffer{xml}{xml} + +\startluacode + function document.ShowResultOfPattern(root,pattern) + local ok = false + for r,d,k in xml.elements(lxml.id(root),pattern) do + tex.print(xml.tostring(d[k])) + tex.sprint(tex.ctxcatcodes,"\\par") + ok = true + end + if not ok then + tex.sprint("no match") + tex.sprint(tex.ctxcatcodes,"\\par") + end + end +\stopluacode + +Here come the examples: + +\definehead[example][subsubject] +\setuphead[example][style=\tt,before=\blank,after=\nowhitespace] + +\def\ShowResultOfPattern#1#2% + {\example{#2} + \startpacked \tttf + \ctxlua{document.ShowResultOfPattern("#1","#2")} + \stoppacked} + +\ShowResultOfPattern{xml}{a/b/c} +\ShowResultOfPattern{xml}{/a/b/c} +\ShowResultOfPattern{xml}{b/c} +\ShowResultOfPattern{xml}{c} +\ShowResultOfPattern{xml}{a/*/c} +\ShowResultOfPattern{xml}{a/**/c} +\ShowResultOfPattern{xml}{a//c} +\ShowResultOfPattern{xml}{a/*/*/c} +\ShowResultOfPattern{xml}{*/c} +\ShowResultOfPattern{xml}{**/c} +\ShowResultOfPattern{xml}{a/../*/c} +\ShowResultOfPattern{xml}{a/../c} +\ShowResultOfPattern{xml}{c[@n='x']} +\ShowResultOfPattern{xml}{c[@n]} +\ShowResultOfPattern{xml}{c[@n='y']} +\ShowResultOfPattern{xml}{c[1]} +\ShowResultOfPattern{xml}{b/c[1]} +\ShowResultOfPattern{xml}{a/c[1]} +\ShowResultOfPattern{xml}{a/c[-1]} +\ShowResultOfPattern{xml}{c[1]} +\ShowResultOfPattern{xml}{c[-1]} +\ShowResultOfPattern{xml}{pi::} +\ShowResultOfPattern{xml}{pi::what} + +\stopcomponent |