From 7b271baae19db1528fbe6621bdf50af89a5a336b Mon Sep 17 00:00:00 2001
From: Hans Hagen
Date: Fri, 22 Feb 2019 20:29:46 +0100
Subject: 2019-02-22 19:43:00
---
 .../general/manuals/xml/xml-mkiv-expressions.tex | 645 +++++++++++++++++++++
 1 file changed, 645 insertions(+)
 create mode 100644 doc/context/sources/general/manuals/xml/xml-mkiv-expressions.tex
(limited to 'doc/context/sources/general/manuals/xml/xml-mkiv-expressions.tex')
diff --git a/doc/context/sources/general/manuals/xml/xml-mkiv-expressions.tex b/doc/context/sources/general/manuals/xml/xml-mkiv-expressions.tex
new file mode 100644
index 000000000..0c126f2f8
--- /dev/null
+++ b/doc/context/sources/general/manuals/xml/xml-mkiv-expressions.tex
@@ -0,0 +1,645 @@
+\environment xml-mkiv-style
+
+\startcomponent xml-mkiv-expressions
+
+\startchapter[title={Expressions and filters}]
+
+\startsection[title={path expressions}]
+
+In the previous chapters we used \cmdinternal {cd:lpath} expressions, which are a variant
+on \type {xpath} expressions as in \XSLT\ but in this case more geared towards
+usage in \TEX. This mechanism will be extended when the demand is there.
+
+A path is a sequence of matches. A simple path expression is:
+
+\starttyping
+a/b/c/d
+\stoptyping
+
+Here each \type {/} goes one level deeper. We can go backwards in a lookup with
+\type {..}:
+
+\starttyping
+a/b/../d
+\stoptyping
+
+We can also combine lookups, as in:
+
+\starttyping
+a/(b|c)/d
+\stoptyping
+
+A negated lookup is preceded by a \type {!}:
+
+\starttyping
+a/(b|c)/!d
+\stoptyping
+
+A wildcard is specified with a \type {*}:
+
+\starttyping
+a/(b|c)/!d/e/*/f
+\stoptyping
+
+In addition to these tag based lookups we can use attributes:
+
+\starttyping
+a/(b|c)/!d/e/*/f[@type=whatever]
+\stoptyping
+
+An \type {@} as first character means that we are dealing with an attribute.
+Within the square brackets there can be boolean expressions:
+
+\starttyping
+a/(b|c)/!d/e/*/f[@type=whatever and @id>100]
+\stoptyping
+
+You can use functions as in:
+
+\starttyping
+a/(b|c)/!d/e/*/f[something(text()) == "oeps"]
+\stoptyping
+
+There are a couple of predefined functions:
+
+\starttabulate[|l|l|p|]
+\NC \type{rootposition} \type{order} \NC number \NC the index of the matched root element (kind of special) \NC \NR
+\NC \type{position} \NC number \NC the current index of the matched element in the match list \NC \NR
+\NC \type{match} \NC number \NC the current index of the matched element sub list with the same parent \NC \NR
+\NC \type{first} \NC number \NC \NC \NR
+\NC \type{last} \NC number \NC \NC \NR
+\NC \type{index} \NC number \NC the current index of the matched element in its parent list \NC \NR
+\NC \type{firstindex} \NC number \NC \NC \NR
+\NC \type{lastindex} \NC number \NC \NC \NR
+\NC \type{element} \NC number \NC the element's index \NC \NR
+\NC \type{firstelement} \NC number \NC \NC \NR
+\NC \type{lastelement} \NC number \NC \NC \NR
+\NC \type{text} \NC string \NC the textual representation of the matched element \NC \NR
+\NC \type{content} \NC table \NC the node of the matched element \NC \NR
+\NC \type{name} \NC string \NC the full name of the matched element: namespace and tag \NC \NR
+\NC \type{namespace} \type{ns} \NC string \NC the namespace of the matched element \NC \NR
+\NC \type{tag} \NC string \NC the tag of the matched element \NC \NR
+\NC \type{attribute} \NC string \NC the value of the attribute with the given name of the matched element \NC \NR
+\stoptabulate
+
+There are fundamental differences between \type {position}, \type {match} and
+\type {index}. Each step results in a new list of matches. The \type {position}
+is the index in this new (possibly intermediate) list. The \type {match} is also
+an index in this list but related to the specific match of element names.
+The \type {index} refers to the location in the parent element.
+
+Say that we have:
+
+\starttyping
+
+
+
+ .1.
+ .1.
+
+
+ .2.
+ .2.
+
+
+
+
+ .3.
+ .3.
+
+
+
+\stoptyping
+
+The following then applies:
+
+\starttabulate[|l|l|]
+\NC \type {collection/resources/manual[position()==1]/paper} \NC \type{.1.} \NC \NR
+\NC \type {collection/resources/manual[match()==1]/paper} \NC \type{.1.} \type{.3.} \NC \NR
+\NC \type {collection/resources/manual/paper[index()==1]} \NC \type{.2.} \NC \NR
+\stoptabulate
+
+In most cases the \type {position} test is more restrictive than the \type
+{match} test.
+
+You can pass your own functions too. Such functions are defined in the \type
+{xml.expressions} namespace. We have defined a few shortcuts:
+
+\starttabulate[|l|l|]
+\NC \type {find(str,pattern)} \NC \type{string.find} \NC \NR
+\NC \type {contains(str)} \NC \type{string.find} \NC \NR
+\NC \type {oneof(str,...)} \NC is \type{str} in list \NC \NR
+\NC \type {upper(str)} \NC \type{characters.upper} \NC \NR
+\NC \type {lower(str)} \NC \type{characters.lower} \NC \NR
+\NC \type {number(str)} \NC \type{tonumber} \NC \NR
+\NC \type {boolean(str)} \NC \type{toboolean} \NC \NR
+\NC \type {idstring(str)} \NC removes leading hash \NC \NR
+\NC \type {name(index)} \NC full tag name \NC \NR
+\NC \type {tag(index)} \NC tag name \NC \NR
+\NC \type {namespace(index)} \NC namespace of tag \NC \NR
+\NC \type {text(index)} \NC content \NC \NR
+\NC \type {error(str)} \NC quit and show error \NC \NR
+\NC \type {quit()} \NC quit \NC \NR
+\NC \type {print()} \NC print message \NC \NR
+\NC \type {count(pattern)} \NC number of matches \NC \NR
+\NC \type {child(pattern)} \NC take child that matches \NC \NR
+\stoptabulate
+
+
+You can also use normal \LUA\ functions as long as you make sure that you pass
+the right arguments. There are a few predefined variables available inside such
+functions.
+
+\starttabulate[|Tl|l|p|]
+\NC \type{list} \NC table \NC the list of matches \NC \NR
+\NC \type{l} \NC number \NC the current index in the list of matches \NC \NR
+\NC \type{ll} \NC element \NC the current element that matched \NC \NR
+\NC \type{order} \NC number \NC the position of the root of the path \NC \NR
+\stoptabulate
+
+The given expression between \type {[]} is converted to a \LUA\ expression so you
+can use the usual operators:
+
+\starttyping
+== ~= <= >= < > not and or ()
+\stoptyping
+
+In addition, \type {=} equals \type {==} and \type {!=} is the same as \type
+{~=}. If you mess up the expression, you quite likely get a \LUA\ error message.
+
+\stopsection
+
+\startsection[title={css selectors}]
+
+\startbuffer[selector-001]
+
+
+
+ b.one
+ b.two
+ b.one.two
+ b.three
+ b#first
+ c
+ d e
+ d e
+ d e e
+ d f
+ @foo = bar
+ @bar = foo
+ @bar = foo1
+ @bar = foo2
+ @bar = foo3
+ @bar = foo+4
+ g
+ g gg d
+ g gg f
+ g gg f.one
+ g
+ g gg f.two
+ g gg f.three
+ g f.one
+ g f.three
+ @whatever = four five six
+
+\stopbuffer
+
+\xmlloadbuffer{selector-001}{selector-001}
+
+\startxmlsetups xml:selector:demo
+    \advance\scratchcounter\plusone
+    \inleftmargin{\the\scratchcounter}\ignorespaces\xmlverbatim{#1}\par
+\stopxmlsetups
+
+\unexpanded\def\showCSSdemo#1#2%
+  {\blank
+   \textrule{\tttf#2}
+   \startlines
+   \dontcomplain
+   \tttf \obeyspaces
+   \scratchcounter\zerocount
+   \xmlcommand{#1}{#2}{xml:selector:demo}
+   \stoplines
+   \blank}
+
+The \CSS\ approach to filtering is a bit different from the path based one and is
+supported too. In fact, you can combine both methods. Depending on what you
+select, the \CSS\ one can be a little bit faster too. It has the advantage that
+one can select more in one go but at the same time looks a bit less attractive.
+This method was added just to show that it can be done but might be useful too. A
+selector is given between curly braces (after all \CSS\ uses them and they have no
+function yet in the parser):
+
+\starttyping
+\xmlall{#1}{{foo bar .whatever, bar foo .whatever}}
+\stoptyping
+
+The following methods are supported:
+
+\starttabulate[|T||]
+\NC element \NC all tags element \NC \NR
+\NC element-1 > element-2 \NC all tags element-2 with parent tag element-1 \NC \NR
+\NC element-1 + element-2 \NC all tags element-2 preceded by tag element-1 \NC \NR
+\NC element-1 ~ element-2 \NC all tags element-2 preceded by tag element-1 \NC \NR
+\NC element-1 element-2 \NC all tags element-2 inside tag element-1 \NC \NR
+\NC [attribute] \NC has attribute \NC \NR
+\NC [attribute=value] \NC attribute equals value\NC \NR
+\NC [attribute\lettertilde =value] \NC attribute contains value (space is separator) \NC \NR
+\NC [attribute\letterhat ="value"] \NC attribute starts with value \NC \NR
+\NC [attribute\letterdollar="value"] \NC attribute ends with value \NC \NR
+\NC [attribute*="value"] \NC attribute contains value \NC \NR
+\NC .class \NC has class \NC \NR
+\NC \letterhash id \NC has id \NC \NR
+\NC :nth-child(n) \NC the child at index n \NC \NR
+\NC :nth-last-child(n) \NC the child at index n from the end \NC \NR
+\NC :first-child \NC the first child \NC \NR
+\NC :last-child \NC the last child \NC \NR
+\NC :nth-of-type(n) \NC the match at index n \NC \NR
+\NC :nth-last-of-type(n) \NC the match at index n from the end \NC \NR
+\NC :first-of-type \NC the first match \NC \NR
+\NC :last-of-type \NC the last match \NC \NR
+\NC :only-of-type \NC the only match or nothing \NC \NR
+\NC :only-child \NC the only child or nothing \NC \NR
+\NC :empty \NC only when empty \NC \NR
+\NC :root \NC the whole tree \NC \NR
+\stoptabulate
+
+The next pages show some examples. For that we use the demo file:
+
+\typebuffer[selector-001]
+
+The class and id selectors often only make sense in \HTML\ like documents but they
+are supported nevertheless. They are after all just shortcuts for filtering by
+attribute.
+The class filtering is special in the sense that it checks for a class
+in a list of classes given in an attribute.
+
+\showCSSdemo{selector-001}{{.one}}
+\showCSSdemo{selector-001}{{.one, .two}}
+\showCSSdemo{selector-001}{{.one, .two, \letterhash first}}
+
+Attributes can be filtered by presence, value, partial value and such. Quotes are
+optional but we advise using them.
+
+\showCSSdemo{selector-001}{{[foo], [bar=foo]}}
+\showCSSdemo{selector-001}{{[bar\lettertilde=foo]}}
+\showCSSdemo{selector-001}{{[bar\letterhat="foo"]}}
+\showCSSdemo{selector-001}{{[whatever\lettertilde="five"]}}
+
+You can of course combine the methods as in:
+
+\showCSSdemo{selector-001}{{g f .one, g f .three}}
+\showCSSdemo{selector-001}{{g > f .one, g > f .three}}
+\showCSSdemo{selector-001}{{d + e}}
+\showCSSdemo{selector-001}{{d ~ e}}
+\showCSSdemo{selector-001}{{d ~ e, g f .one, g f .three}}
+
+You can also negate the result by using \type {:not} on a simple expression:
+
+\showCSSdemo{selector-001}{{:not([whatever\lettertilde="five"])}}
+\showCSSdemo{selector-001}{{:not(d)}}
+
+The child and match selectors are also supported:
+
+\showCSSdemo{selector-001}{{a:nth-child(3)}}
+\showCSSdemo{selector-001}{{a:nth-last-child(3)}}
+\showCSSdemo{selector-001}{{g:nth-of-type(3)}}
+\showCSSdemo{selector-001}{{g:nth-last-of-type(3)}}
+\showCSSdemo{selector-001}{{a:first-child}}
+\showCSSdemo{selector-001}{{a:last-child}}
+\showCSSdemo{selector-001}{{e:first-of-type}}
+\showCSSdemo{selector-001}{{gg d:only-of-type}}
+
+Instead of numbers you can also give the \type {an} and \type {an+b} formulas
+as well as the \type {odd} and \type {even} keywords:
+
+\showCSSdemo{selector-001}{{a:nth-child(even)}}
+\showCSSdemo{selector-001}{{a:nth-child(odd)}}
+\showCSSdemo{selector-001}{{a:nth-child(3n+1)}}
+\showCSSdemo{selector-001}{{a:nth-child(2n+3)}}
+
+There are a few special cases:
+
+\showCSSdemo{selector-001}{{g:empty}}
+\showCSSdemo{selector-001}{{g:root}}
+\showCSSdemo{selector-001}{{*}}
+
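+The \type {an+b} formulas are easiest to read by filling in values for \type {n}.
+As a sketch, assuming the usual \CSS\ convention that \type {n} runs over 0, 1, 2
+and so on and that children are counted from 1, the \type {nth-child} arguments
+used above select the following child indices:
+
+\starttyping
+odd  (same as 2n+1) : 1 3 5 7 ...
+even (same as 2n)   : 2 4 6 8 ...
+3n+1                : 1 4 7 10 ...
+2n+3                : 3 5 7 9 ...
+\stoptyping
+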
+Combining the \CSS\ methods with the regular ones is possible:
+
+\showCSSdemo{selector-001}{{g gg f .one}}
+\showCSSdemo{selector-001}{g/gg/f[@class='one']}
+\showCSSdemo{selector-001}{g/{gg f .one}}
+
+\startbuffer[selector-002]
+
+
+
+ title 1
+ title 2
+ title 3
+ title 4
+
+\stopbuffer
+
+In the next examples we use this file:
+
+\typebuffer[selector-002]
+
+\xmlloadbuffer{selector-002}{selector-002}
+
+When we filter from this (not too well structured) tree we can use both
+methods to achieve the same:
+
+\showCSSdemo{selector-002}{{document title .one, document title .three}}
+
+\showCSSdemo{selector-002}{/document/title[(@class='one') or (@class='three')]}
+
+However, imagine this file:
+
+\startbuffer[selector-003]
+
+
+
+ title 1
+ title 1.1
+ title 2
+ title 2.1
+ title 3
+ title 3.1
+ title 4
+ title 4.1
+
+\stopbuffer
+
+\typebuffer[selector-003]
+
+\xmlloadbuffer{selector-003}{selector-003}
+
+The next filter is easier with the \CSS\ selector methods because these accumulate
+independent (simple) expressions:
+
+\showCSSdemo{selector-003}{{document title .one + subtitle, document title .two + subtitle}}
+
+Watch how we get the output in document order. Because we render a sequential document,
+a combined filter will trigger a sorting pass.
+
+\stopsection
+
+\startsection[title={functions as filters}]
+
+At the \LUA\ end a whole \cmdinternal {cd:lpath} expression results in a (set of) node(s)
+with its environment, but that is hardly usable in \TEX. Think of code like:
+
+\starttyping
+for e in xml.collected(xml.load('text.xml'),"title") do
+    -- e = the element that matched
+end
+\stoptyping
+
+The older variant is still supported, but it is best to use the previous one.
+
+\starttyping
+for r, d, k in xml.elements(xml.load('text.xml'),"title") do
+    -- r = root of the title element
+    -- d = data table
+    -- k = index in data table
+end
+\stoptyping
+
+Here \type {d[k]} points to the \type {title} element and in this case all titles
+in the tree pass by. In practice this kind of code is encapsulated in function
+calls, like those returning elements one by one, or returning the first or last
+match. The result is then fed back into \TEX, possibly after being altered by an
+associated setup. We've seen the wrappers to such functions already in a previous
+chapter.
+
+In addition to the previously discussed expressions, one can add so called
+filters to the expression, for instance:
+
+\starttyping
+a/(b|c)/!d/e/text()
+\stoptyping
+
+In a filter, the last part of the \cmdinternal {cd:lpath} expression is a
+function call. The previous example returns the text of each element \type {e}
+that results from matching the expression. When running \TEX\ the following
+functions are available. Some are also available when using pure \LUA. In \TEX\
+you can often use one of the macros like \type {\xmlfirst} instead of a \type
+{\xmlfilter} with finalizer \type {first()}. The filter can be somewhat faster
+but that is hardly noticeable.
+
+\starttabulate[|l|l|p|]
+\NC \type {context()} \NC string \NC the serialized text with \TEX\ catcode regime \NC \NR
+%NC \type {ctxtext()} \NC string \NC \NC \NR
+\NC \type {function()} \NC string \NC depends on the function \NC \NR
+%
+\NC \type {name()} \NC string \NC the (remapped) namespace \NC \NR
+\NC \type {tag()} \NC string \NC the name of the element \NC \NR
+\NC \type {tags()} \NC list \NC the names of the element \NC \NR
+%
+\NC \type {text()} \NC string \NC the serialized text \NC \NR
+\NC \type {upper()} \NC string \NC the serialized text uppercased \NC \NR
+\NC \type {lower()} \NC string \NC the serialized text lowercased \NC \NR
+\NC \type {stripped()} \NC string \NC the serialized text stripped \NC \NR
+\NC \type {lettered()} \NC string \NC the serialized text only letters (cf. \UNICODE) \NC \NR
+%
+\NC \type {count()} \NC number \NC the number of matches \NC \NR
+\NC \type {index()} \NC number \NC the matched index in the current path \NC \NR
+\NC \type {match()} \NC number \NC the matched index in the preceding path \NC \NR
+%
+%NC \type {lowerall()} \NC string \NC \NC \NR
+%NC \type {upperall()} \NC string \NC \NC \NR
+%
+\NC \type {attribute(name)} \NC content \NC returns the attribute with the given name \NC \NR
+\NC \type {chainattribute(name)} \NC content \NC idem, but backtracks till one is found \NC \NR
+\NC \type {command(name)} \NC content \NC expands the setup with the given name for each found element \NC \NR
+\NC \type {position(n)} \NC content \NC processes the \type {n}\high{th} instance of the found element \NC \NR
+\NC \type {all()} \NC content \NC processes all instances of the found element \NC \NR
+%NC \type {default} \NC content \NC all \NC \NR
+\NC \type {reverse()} \NC content \NC idem in reverse order \NC \NR
+\NC \type {first()} \NC content \NC processes the first instance of the found element \NC \NR
+\NC \type {last()} \NC content \NC processes the last instance of the found element \NC \NR
+\NC \type {concat(...)}
+\NC content \NC concatenates the match \NC \NR
+\NC \type {concatrange(from,to,...)} \NC content \NC concatenates a range of matches \NC \NR
+\stoptabulate
+
+The extra arguments of the concatenators are: \type {separator} (string), \type
+{lastseparator} (string) and \type {textonly} (a boolean).
+
+These filters are in fact \LUA\ functions which means that if needed more of them
+can be added. Indeed this happens in some of the \XML\ related \MKIV\ modules,
+for instance in the \MATHML\ processor.
+
+\stopsection
+
+\startsection[title={example}]
+
+The number of commands is rather large and if you want to avoid them this is
+often possible. Take for instance:
+
+\starttyping
+\xmlall{#1}{/a/b[position()>3]}
+\stoptyping
+
+Alternatively you can use:
+
+\starttyping
+\xmlfilter{#1}{/a/b[position()>3]/all()}
+\stoptyping
+
+and actually this is also faster as internally it avoids a function call. Of
+course in practice this is hardly measurable.
+
+In previous examples we've already seen quite a few expressions, and it might be
+good to point out that the syntax is modelled after \XSLT\ but is not quite the
+same. The reason is that we started with a rather minimal system and already have
+styles in use that depend on compatibility.
+
+\starttyping
+namespace:// axis node(set) [expr 1]..[expr n] / ... / filter
+\stoptyping
+
+When we are inside a \CONTEXT\ run, the namespace is \type {tex}. However, if
+you do not want to print back to \TEX\ you need to be more explicit. Say that we
+typeset exams and have a (not that logical) structure like:
+
+\starttyping
+ ...
+
+ one
+ two
+ three
+
+
+ true
+ 1
+
+
+ false
+ 0
+
+
+ true
+ 2
+
+
+\stoptyping
+
+Say that we typeset the questions with:
+
+\starttyping
+\startxmlsetups question
+    \blank
+    score: \xmlfunction{#1}{totalscore}
+    \blank
+    \xmlfirst{#1}{text}
+    \startitemize
+        \xmlfilter{#1}{/answer/item/command(answer:item)}
+    \stopitemize
+    \endgraf
+    \blank
+\stopxmlsetups
+\stoptyping
+
+Each item in the answer results in a call to:
+
+\starttyping
+\startxmlsetups answer:item
+    \startitem
+        \xmlflush{#1}
+        \endgraf
+        \xmlfilter{#1}{../../alternative[position()=rootposition()]/
+            condition/command(answer:condition)}
+    \stopitem
+\stopxmlsetups
+\stoptyping
+
+\starttyping
+\startxmlsetups answer:condition
+    \endgraf
+    condition: \xmlflush{#1}
+    \endgraf
+\stopxmlsetups
+\stoptyping
+
+Now, there are two rather special filters here. The first one involves
+calculating the total score. As we look forward we use a function to deal with
+this.
+
+\starttyping
+\startluacode
+function xml.functions.totalscore(root)
+    local score = 0
+    for e in xml.collected(root,"/alternative") do
+        -- parentheses are needed: + binds tighter than or in Lua
+        score = score + (xml.filter(e,"xml:///score/number()") or 0)
+    end
+    tex.write(score)
+end
+\stopluacode
+\stoptyping
+
+Watch how we use the namespace to keep the results at the \LUA\ end.
+
+The second special trick shown here is to limit a match using the current
+position of the root (\type {#}) match.
+
+As you can see, a path expression can be more than just filtering a few nodes. At
+the end of this manual you will find a bunch of examples.
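+
+To make both tricks concrete, here is a worked reading of the example. It is only
+a sketch: it assumes that the sample holds three \type {item} elements (one, two,
+three) and three \type {alternative} elements in the order shown, and that
+\type {rootposition()} yields the index of the item that triggered the setup.
+
+\starttyping
+item 1 (one)   : alternative[position()=1] : condition true  : score 1
+item 2 (two)   : alternative[position()=2] : condition false : score 0
+item 3 (three) : alternative[position()=3] : condition true  : score 2
+
+totalscore = 1 + 0 + 2 = 3
+\stoptyping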
+
+\stopsection
+
+\startsection[title={tables}]
+
+If you want to know how the internal \XML\ tables look you can print such a
+table:
+
+\starttyping
+print(table.serialize(e))
+\stoptyping
+
+This produces for instance:
+
+% s = xml.convert("<demo label='whatever'>some text</demo>")
+% print(table.serialize(xml.filter(s,"demo")[1]))
+
+\starttyping
+t={
+ ["at"]={
+  ["label"]="whatever",
+ },
+ ["dt"]={ "some text" },
+ ["ns"]="",
+ ["rn"]="",
+ ["tg"]="demo",
+}
+\stoptyping
+
+The \type {rn} entry is the renamed namespace (when renaming is applied). If you
+see tags like \type {@pi@} this means that we don't have an element, but (in this
+case) a processing instruction.
+
+\starttabulate[|l|p|]
+\NC \type {@rt@} \NC the root element \NC \NR
+\NC \type {@dd@} \NC document definition \NC \NR
+\NC \type {@cm@} \NC comment, like \type {<!-- whatever -->} \NC \NR
+\NC \type {@cd@} \NC so called \type {CDATA} \NC \NR
+\NC \type {@pi@} \NC processing instruction, like \type {<?whatever ?>} \NC \NR
+\stoptabulate
+
+There are many ways to deal with the content, but in the perspective of \TEX\
+only a few matter.
+
+\starttabulate[|l|p|]
+\NC \type {xml.sprint(e)} \NC print the content to \TEX\ and apply setups if needed \NC \NR
+\NC \type {xml.tprint(e)} \NC print the content to \TEX\ (serialize elements verbose) \NC \NR
+\NC \type {xml.cprint(e)} \NC print the content to \TEX\ (used for special content) \NC \NR
+\stoptabulate
+
+Keep in mind that anything low level that you uncover is not part of the official
+interface unless mentioned in this manual.
+
+\stopsection
+
+\stopchapter
+
+\stopcomponent
--
cgit v1.2.3