2016-08-01 14:21:00

author: Context Git Mirror Bot <phg42.2a@gmail.com> 2016-08-01 16:40:14 +0200
committer: Context Git Mirror Bot <phg42.2a@gmail.com> 2016-08-01 16:40:14 +0200
commit: 96f283b0d4f0259b7d7d1c64d1d078c519fc84a6 (patch)
tree: e9673071aa75f22fee32d701d05f1fdc443ce09c /doc/context/sources/general/manuals/mk/mk-nodes.tex
parent: c44a9d2f89620e439f335029689e7f0dff9516b7 (diff)
download: context-96f283b0d4f0259b7d7d1c64d1d078c519fc84a6.tar.gz
1 files changed, 462 insertions, 0 deletions
diff --git a/doc/context/sources/general/manuals/mk/mk-nodes.tex b/doc/context/sources/general/manuals/mk/mk-nodes.tex
new file mode 100644
index 000000000..fb59ec05c
--- /dev/null
+++ b/doc/context/sources/general/manuals/mk/mk-nodes.tex
@@ -0,0 +1,462 @@
+% language=uk
+
+\startcomponent mk-nodes
+
+\environment mk-environment
+
+\chapter {Nodes and attributes}
+
+\subject{introduction}
+
+Here we will tell a bit about the development of node access in \LUATEX. We
+will also introduce attributes, a feature closely related to nodes. We assume
+that you are somewhat familiar with \TEX's nodes: glyphs, kerns, glue, penalties,
+whatsits and friends.
+
+\subject{tables}
+
+Access to node lists has been implemented rather early in the development because
+we needed it to fulfil the objectives of the Oriental \TEX\ project.  The first
+implementation used nested tables, indexed by number. In that approach, the first
+entry in each node indicated the type in string format. At that time a horizontal
+list looked as follows:
+
+\starttyping
+list = {
+    [1] = "hlist",
+    [2] = 0,
+    ...
+    [8] = {
+        [1] = {
+            [1] = "glyph",
+            ...
+        },
+        [2] = {
+            ...
+        }
+}
+\stoptyping
+
+Processing such lists is rather convenient since we can use the normal table
+iterators. Because in practice only a few entries of a node are accessed, working
+with numbers is no real problem: in slot~1 we have the type, en in the
+case of a horizontal or vertical list, we know that slot~8 is either empty or
+a table. Looping over the list is done with:
+
+\starttyping
+for i, node in ipairs(list) do
+    if node[1] == "glyph" then
+        list[i][5] = string.byte(string.upper(string.char(node[5])))
+    end
+end
+\stoptyping
+
+Node processing code hooks into the box packagers and paragraph builder and
+a few more places. This means that when using the table approach a lot of
+callbacks take place where \TEX\ has to convert to and from \LUA. Apart from
+processing time, we also have to deal with garbage collection then and on an older
+machine with insufficient memory interesting bottlenecks show up. Therefore some
+following optimizations were implemented at the \TEX\ end of the game.
+
+Side note concerning speed: when memory of processing speed is low, runtime can
+increase five to tenfold compared to \PDFTEX\ when one does intensive node
+manipulations. This is due to garbage collection at the \LUA\ end and memory
+(de)allocation at the \TEX\ end. There is not much we can do about that. Interfacing
+has a price and hardware is more powerful than when \TEX\ was written. Processing
+the \TEX\ book using no callbacks is not that much slower than using a traditional
+\TEX\ engine. However, nowadays fonts are more extensive, demands for special
+features more pressing and that comes at a price.
+
+When the list is not changed, the callback function can return the value \type
+{true}. This signals \TEX\ that it can keep the original list. When the list is
+empty, the callback function can return the value \type {false}. This signals
+\TEX\ that the list can be discarded.
+
+In order to minimize conversions and redundant processing, nested lists were
+not passed as table but as a reference. One could expand such a list when
+needed. For instance, when one hooks the same function in the \type
+{hpack_filter} and \type {pre_linebreak_filter} callbacks, this way one can be
+pretty sure that each node is only processed once. Boxed material that is part
+of the paragraph stream first enters the box packers and then already is
+processed before it enters the paragraph callback. Of course one can decide the
+expand the referred sublist and process it again. Keep in mind that we're still
+talking of a table approach, but we're slowly moving away from big conversions.
+
+In principle one can insert and delete nodes in such a list but given that
+the average length of a list representing a page is around 4000, you can
+imagine that moving around a large amount of data is not that efficient. In order
+to cope with this, we experimented a lot and came to solutions which will be
+discussed later on.
+
+At the \LUA\ end some tricks were used to avoid the mentioned insertion and
+deletion penalty. When a node was deleted, we simply set its value to \type
+{false}. Deleting all glyphs then became:
+
+\starttyping
+for i, node in ipairs(list) do
+    if node[1] == "glyph" then
+        list[i] = false
+    end
+end
+\stoptyping
+
+When \TEX\ converted a \LUA\ table back into its internal representation, it
+ignored such false nodes.
+
+For insertion a dummy node was introduced at the \LUA\ end. The next code
+duplicates the glyphs.
+
+\starttyping
+for i, node in ipairs(list) do
+    if node[1] == "glyph" then
+        list[i] = { 'inline', 0, nil, { node, node } }
+    end
+end
+\stoptyping
+
+Just before we passed the resulting list back to \TEX\ we collapsed these
+inline pseudo nodes. This was a rather fast operation.
+
+So far so good. But then we introduced attributes and keeping track of them
+as well as processing them takes quite some processing power. Nodes with
+attributes then looked like:
+
+\starttyping
+someglyph = {
+    [1] = "glyph",               -- type
+    [2] = 0,                     -- subtype
+    [3] = { [1] = 5, [4] = 10 }, -- attributes
+    [4] = 88,                    -- slot
+    [5] = 32                     -- font
+}
+\stoptyping
+
+Constructing attribute tables for each node is costly in terms of memory usage
+and processing time and we found out that the garbage collector was becoming
+a bottleneck, especially when resources are thin. We will go into more detail
+about attributes elsewhere.
+
+\subject{lists}
+
+At the same time that we discussed these issues, new Dutch word lists (adapted
+spelling) were published and we started wondering if we could use such lists
+directly for hyphenation purposes instead of relying on traditional patterns. Here
+the first observation was that handling these really huge lists is no problem at
+all. Okay, it costs some memory but we only need to load one of maybe a few of
+these lists. Hyphenating a paragraph using tables with hyphenated words and
+processing the paragraph related node list is not only fast, it also gives us the
+opportunity to cross font boundaries. Of course there are kerns and ligatures to
+deal with but this is no big deal. At least it can be an alternative or addendum
+to the current hyphenator. Some languages have very small pattern files or a very
+systematic approach to hyphenation so there is no reason to abandon the traditional
+ways in all cases. Take your choice.
+
+When experimenting with the new implementation we tested the performance by letting
+\LUA\ take care of hyphenation, spell checking (marking words) and adding
+inter||character kerns. When playing with big lists of words we found out that the
+caching mechanism could not be used due to some limitations in the \LUA\ byte code
+interpreter, so eventually we ended up with a dedicated loader.
+
+However, again we ran into performance problems when lists became more complex. And so,
+instead of converting \TEX\ datastructures into \LUA\ tables userdata types came into
+view. Taco already had reimplemented the node memory management, so a logical next step
+was to reimplement the callbacks and box related code to deal with nodes as linked lists.
+Since this is now the fashion in \LUATEX, you may forget the previous examples, although
+it is not that hard to introduce table representations again once we need them.
+
+Of course this resulted in an adaption to the regular \TEX\ code but a nice side effect
+was that we could now use fields instead of indexes into the node data structure. There
+is a small price to pay in terms of performance, but this can be compensated by clever
+programming.
+
+\starttyping
+someglyph = {
+    type = 41,
+    subtype = 0,
+    attributes = <attributes>,
+    char = 88,
+    font = 32
+}
+\stoptyping
+
+Attributes themselves are userdata. The same is true for components that are present
+when we're for instance dealing with ligatures.
+
+As you can see, in the field variant, a type is a number. In practice, because \LUA\
+hashes strings, working with strings is as fast when comparing, but since we now have
+the more abstract type indicator, we stick with the numbers, which saves a few conversions.
+When dealing with tables we get code like:
+
+\starttyping
+function loop_over_nodes(list)
+    for i, n in ipairs(list)
+        local kind = n[1]
+        if kind == "hlist" or kind == "vlist" then
+            ...
+        end
+    end
+end
+\stoptyping
+
+But now that we have linked lists, we get the following. Node related methods
+are available in the \type {node} namespace.
+
+\starttyping
+function loop_over_nodes(head)
+    local hlist, vlist = node.id('hlist'), node.id('vlist')
+    while head do
+        local kind = head.type
+        if kind == hlist or kind == vlist then
+            ...
+        end
+        head = head.next
+    end
+end
+\stoptyping
+
+Using an abstraction (i.e.\ a constant representing \type {hlist} looks
+nice here, which is why numbers instead of strings are used. The indexed
+variant is still supported and there we have strings.
+
+Going from a node list (head node) to a table is not that complex. Sometimes
+this can be handy because manipulating tables is more convenient that messing
+around with userdata when it comes down to debugging or tracing.
+
+\starttyping
+function nodes.totable(n)
+    function totable(n)
+        local f, tt = node.fields(n.id,n.subtype), { }
+        for _,v in ipairs(f) do
+            local nv = n[v]
+            if nv then
+                local tnv = type(nv)
+                if tnv == "string" or tnv == "number" then
+                    tt[v] = nv
+                else -- userdata
+                    tt[v] = nodes.totable(nv)
+                end
+            end
+        end
+        return tt
+    end
+    local t = { }
+    while n do
+        t[#t+1] = totable(n)
+        n = n.next
+    end
+    return t
+end
+\stoptyping
+
+It will be clear that here we collect data in \LUA\ while treating nodes
+as userdata keeps most of it at the \TEX\ side and this is where the gain in
+speed comes from.
+
+\subject{side effects}
+
+While experimenting with node lists Taco and I ran into a peculiar side effect.
+One of the tests involved adding kerns between glyphs (inter character spacing
+as sometimes uses in titles in a large print). When applied to a whole document
+we noticed that at some places (words) the added kerning was gone. We used
+the subtype zero kern (which is most efficient) and in the process of hyphenating
+\TEX\ removes these kerns and inserts them later (but then based on the
+information stored in the font.
+
+The reason why \TEX\ removes the font related kerns, is the following. Consider
+the code:
+
+\starttyping
+\setbox0=\hbox{some text} the text \unhcopy0 has width \the\wd0
+\stoptyping
+
+While constructing the \type {\hbox}, \TEX\ will apply kerning as dictated
+by the font. Otherwise the width of the box would not be correct. This means
+that the node list entering the linebreak machinery contains such kerns.
+Because hyphenating works on words \TEX\ will remove these kerns in the
+process of identifying the words. It creates a string, removes the original
+sequence of nodes, determines hyphenation points, and add the result to
+the node list.  For efficiency reasons \TEX\ will only look at places
+where hyphenation makes  sense.
+
+Now, imagine that we add those kerns in the callback. This time, all characters
+are surrounded by kerns (which we gave subtype zero). When \TEX\ is determining
+feasable breakpoints (hyphenation), it will remove those kerns, but only at
+certain places. Because our kerns are way larger than the normal interglyph
+kerns, we suddenly end up with an intercharacter spaced paragraph that has
+some words without such spacing but the font dictated kerns.
+
+\blank
+m o s t\quad w o r d s\quad a r e\quad s p a c e d\quad b u t\quad
+some words\quad a r e\quad n o t
+\blank
+
+Of course a solution is to use a different kern, but at least this shows that
+the moment of processing nodes as well as the kind of manipulations need
+to be chosen with care.
+
+Kerning is a nasty business anyway. Imagine the following word:
+
+\starttyping
+effe
+\stoptyping
+
+When typeset this turns into three characters, one of them being a ligature.
+
+\starttyping
+[char e] [liga ff (components f f)] [char e]
+\stoptyping
+
+However, in Dutch, such a word hyphenates as:
+
+\starttyping
+ef-fe
+\stoptyping
+
+This means that in the node list we eventually find something:
+
+\starttyping
+[char e] [disc (f-) (f) (skip 1)] [liga ff (components f f)] [char e]
+\stoptyping
+
+So, eventually we need to kern between the character sequences [e,f-],
+[e,ff], [ff,e] and [f,e].
+
+\subject {attributes}
+
+We now arrive at attributes, a new property of nodes. Before we explain a
+bit more what can be done with them, we show how to define a new attribute
+and toggle it. In the following example the \type {\visualizenextnodes} macro
+is part of \CONTEXT\ \MKIV.
+
+\startbuffer
+\newattribute\aa
+\newattribute\ab
+\visualizenextnodes \hbox {\aa1 T{\ab3\aa2 E}X}
+\stopbuffer
+
+\typebuffer
+
+\placefigure
+  [page]
+  []
+  {\type{\hbox {\aa1 T{\ab3\aa2 E}X \ab 4}}}
+  {\switchtobodyfont[7pt]%
+   \scale[width=.9\textwidth]{\framed
+     [offset=2ex,foregroundcolor=red]
+     {\startsimplecolumns[n=2]
+        \resetglobalattributes
+        \resetlocalattributes
+        \getbuffer
+      \stopsimplecolumns}}}
+
+For the sake of this example, we start the allocation at 2000 because we don't
+want to interfere with attributes already defined in \CONTEXT. The node list
+resulting from the box is shown at the next page. As you can see here, internally
+attributes become a linked list assigned to the \type {attr} field. This means
+that one has to do some work in order to inspect attributes.
+
+\starttyping
+function has_attribute(n,a)
+    if n and n.attr then
+        n = n.attr.next
+        while n do
+            if n.number == a then
+                return n.value
+            end
+            n = n.next
+        end
+    else
+        return false
+    end
+end
+\stoptyping
+
+The previous function can be used in tests like:
+
+\starttyping
+local total = 0
+while n do
+    if has_attribute(n,2000) then
+        total = total + 1
+    end
+    n = n.next
+end
+texio.write_nl(string.format(
+    "attribute 2000 has been seen % times", total
+))
+\stoptyping
+
+When implementing nodes and attributes we did rather extensive tests and
+one of the test documents implemented some preliminary color mechanism
+based on attributes. When handling the colors the previous function was
+called some 300.000 times and the total node processing time (which also
+involved font handling) was some 2.9 seconds. Implementing this function
+as a helper brought down node processing time to 2.4 seconds. Of course
+the gain depends on the complexity of the list (nesting) and the number
+of attributes that are set (upto 5 per node in this test). A few more helper
+functions are available, some are for convenience, some gain us some speed.
+
+The nice thing about attributes is that they obey grouping. This means that
+in the following sequence:
+
+\starttyping
+x {\aa1 x \ab2 x} x
+\stoptyping
+
+the attributes are assigned like:
+
+\starttyping
+x x(201=1) x(201=1,202=2) x
+\stoptyping
+
+Internally \LUATEX\ does some optimizations with respect to assigning
+a sequence of similar attributes, but you should keep in mind that in practice
+the memory usage will be larger when using many attributes.
+
+We played with color and other properties, hyphenation based on word lists
+(and tracking languages with attributes) and or special algorithms (url
+hyphenation), spell checking (marking words as being spelled wrongly), and
+a few more things. This involved handling attributes in several callbacks
+resulting in the insertion or deletion of nodes.
+
+When using attributes for color support, we have to insert \type {pdfliteral} whatsit
+nodes at some point depending on the current color. This also means that the
+time spent with color support at the \TEX\ end will be compensated by
+time spent at the \LUA\ side. It also means that because housekeeping to do
+with colors spanning pages and columns is gone because from now on color
+information travels with the nodes. This saves quite some ugly code.
+
+Because most of the things that we want to do with attributes (and we have
+quite an agenda) are already nicely isolated in \CONTEXT, attributes will
+find their way rather soon in \CONTEXT\ \MKIV.
+
+Let's end with an observation. Attributes themselves are not something
+revolutionary. However, if you had to deal with them in \TEX, i.e.\
+associate them with for instance actions in during shipout, quite some
+time would have been spent on getting things right. Even worse: it would
+have lead to never ending discussions in the \TEX\ community and as
+such it's no surprise that something like this never showed up. The fact that
+we can use \LUA\ and manipulate node lists in many ways frees us from
+much discussion.
+
+We are even considering in future versions of \LUATEX\ to turn font, language
+and direction related information into attributes (in some private range) so this
+story is far from finished. As a teaser, consider the following line of thinking.
+
+Currently when a character enters the machinery, it becomes a glyph node. Among
+other characteristics, this node contains information about the font and the
+slot in that font which is used to represent that character. In a similar fashion,
+a space becomes glue with a measure probably related to the current font.
+
+However, with access to nodes and attributes, you can imagine the following
+scenario. Instead of a font (internally represented by a font id), you use an
+attribute referring to a font. At that time, the font field us just pointing to
+\TEX's null font. In a pass over the node list, you resolve the character and their
+attributes to a fonts and (maybe) other characters. Spacing can be postponed as well
+and instead of real glue values we can use multipliers and again attributes point
+the way to resolve them.
+
+Of course the question is if this is worth the trouble. After all typesetting is
+about fonts and there is no real reason not to give them a special place.
+
+\stopcomponent
author	Context Git Mirror Bot <phg42.2a@gmail.com>	2016-08-01 16:40:14 +0200
committer	Context Git Mirror Bot <phg42.2a@gmail.com>	2016-08-01 16:40:14 +0200
commit	96f283b0d4f0259b7d7d1c64d1d078c519fc84a6 (patch)
tree	e9673071aa75f22fee32d701d05f1fdc443ce09c /doc/context/sources/general/manuals/mk/mk-nodes.tex
parent	c44a9d2f89620e439f335029689e7f0dff9516b7 (diff)
download	context-96f283b0d4f0259b7d7d1c64d1d078c519fc84a6.tar.gz