% language=uk \startcomponent mk-nodes \environment mk-environment \chapter {Nodes and attributes} \subject{introduction} Here we will tell a bit about the development of node access in \LUATEX. We will also introduce attributes, a feature closely related to nodes. We assume that you are somewhat familiar with \TEX's nodes: glyphs, kerns, glue, penalties, whatsits and friends. \subject{tables} Access to node lists has been implemented rather early in the development because we needed it to fulfil the objectives of the Oriental \TEX\ project. The first implementation used nested tables, indexed by number. In that approach, the first entry in each node indicated the type in string format. At that time a horizontal list looked as follows: \starttyping list = { [1] = "hlist", [2] = 0, ... [8] = { [1] = { [1] = "glyph", ... }, [2] = { ... } } \stoptyping Processing such lists is rather convenient since we can use the normal table iterators. Because in practice only a few entries of a node are accessed, working with numbers is no real problem: in slot~1 we have the type, en in the case of a horizontal or vertical list, we know that slot~8 is either empty or a table. Looping over the list is done with: \starttyping for i, node in ipairs(list) do if node[1] == "glyph" then list[i][5] = string.byte(string.upper(string.char(node[5]))) end end \stoptyping Node processing code hooks into the box packagers and paragraph builder and a few more places. This means that when using the table approach a lot of callbacks take place where \TEX\ has to convert to and from \LUA. Apart from processing time, we also have to deal with garbage collection then and on an older machine with insufficient memory interesting bottlenecks show up. Therefore some following optimizations were implemented at the \TEX\ end of the game. Side note concerning speed: when memory of processing speed is low, runtime can increase five to tenfold compared to \PDFTEX\ when one does intensive node manipulations. This is due to garbage collection at the \LUA\ end and memory (de)allocation at the \TEX\ end. There is not much we can do about that. Interfacing has a price and hardware is more powerful than when \TEX\ was written. Processing the \TEX\ book using no callbacks is not that much slower than using a traditional \TEX\ engine. However, nowadays fonts are more extensive, demands for special features more pressing and that comes at a price. When the list is not changed, the callback function can return the value \type {true}. This signals \TEX\ that it can keep the original list. When the list is empty, the callback function can return the value \type {false}. This signals \TEX\ that the list can be discarded. In order to minimize conversions and redundant processing, nested lists were not passed as table but as a reference. One could expand such a list when needed. For instance, when one hooks the same function in the \type {hpack_filter} and \type {pre_linebreak_filter} callbacks, this way one can be pretty sure that each node is only processed once. Boxed material that is part of the paragraph stream first enters the box packers and then already is processed before it enters the paragraph callback. Of course one can decide the expand the referred sublist and process it again. Keep in mind that we're still talking of a table approach, but we're slowly moving away from big conversions. In principle one can insert and delete nodes in such a list but given that the average length of a list representing a page is around 4000, you can imagine that moving around a large amount of data is not that efficient. In order to cope with this, we experimented a lot and came to solutions which will be discussed later on. At the \LUA\ end some tricks were used to avoid the mentioned insertion and deletion penalty. When a node was deleted, we simply set its value to \type {false}. Deleting all glyphs then became: \starttyping for i, node in ipairs(list) do if node[1] == "glyph" then list[i] = false end end \stoptyping When \TEX\ converted a \LUA\ table back into its internal representation, it ignored such false nodes. For insertion a dummy node was introduced at the \LUA\ end. The next code duplicates the glyphs. \starttyping for i, node in ipairs(list) do if node[1] == "glyph" then list[i] = { 'inline', 0, nil, { node, node } } end end \stoptyping Just before we passed the resulting list back to \TEX\ we collapsed these inline pseudo nodes. This was a rather fast operation. So far so good. But then we introduced attributes and keeping track of them as well as processing them takes quite some processing power. Nodes with attributes then looked like: \starttyping someglyph = { [1] = "glyph", -- type [2] = 0, -- subtype [3] = { [1] = 5, [4] = 10 }, -- attributes [4] = 88, -- slot [5] = 32 -- font } \stoptyping Constructing attribute tables for each node is costly in terms of memory usage and processing time and we found out that the garbage collector was becoming a bottleneck, especially when resources are thin. We will go into more detail about attributes elsewhere. \subject{lists} At the same time that we discussed these issues, new Dutch word lists (adapted spelling) were published and we started wondering if we could use such lists directly for hyphenation purposes instead of relying on traditional patterns. Here the first observation was that handling these really huge lists is no problem at all. Okay, it costs some memory but we only need to load one of maybe a few of these lists. Hyphenating a paragraph using tables with hyphenated words and processing the paragraph related node list is not only fast, it also gives us the opportunity to cross font boundaries. Of course there are kerns and ligatures to deal with but this is no big deal. At least it can be an alternative or addendum to the current hyphenator. Some languages have very small pattern files or a very systematic approach to hyphenation so there is no reason to abandon the traditional ways in all cases. Take your choice. When experimenting with the new implementation we tested the performance by letting \LUA\ take care of hyphenation, spell checking (marking words) and adding inter||character kerns. When playing with big lists of words we found out that the caching mechanism could not be used due to some limitations in the \LUA\ byte code interpreter, so eventually we ended up with a dedicated loader. However, again we ran into performance problems when lists became more complex. And so, instead of converting \TEX\ datastructures into \LUA\ tables userdata types came into view. Taco already had reimplemented the node memory management, so a logical next step was to reimplement the callbacks and box related code to deal with nodes as linked lists. Since this is now the fashion in \LUATEX, you may forget the previous examples, although it is not that hard to introduce table representations again once we need them. Of course this resulted in an adaption to the regular \TEX\ code but a nice side effect was that we could now use fields instead of indexes into the node data structure. There is a small price to pay in terms of performance, but this can be compensated by clever programming. \starttyping someglyph = { type = 41, subtype = 0, attributes = , char = 88, font = 32 } \stoptyping Attributes themselves are userdata. The same is true for components that are present when we're for instance dealing with ligatures. As you can see, in the field variant, a type is a number. In practice, because \LUA\ hashes strings, working with strings is as fast when comparing, but since we now have the more abstract type indicator, we stick with the numbers, which saves a few conversions. When dealing with tables we get code like: \starttyping function loop_over_nodes(list) for i, n in ipairs(list) local kind = n[1] if kind == "hlist" or kind == "vlist" then ... end end end \stoptyping But now that we have linked lists, we get the following. Node related methods are available in the \type {node} namespace. \starttyping function loop_over_nodes(head) local hlist, vlist = node.id('hlist'), node.id('vlist') while head do local kind = head.type if kind == hlist or kind == vlist then ... end head = head.next end end \stoptyping Using an abstraction (i.e.\ a constant representing \type {hlist} looks nice here, which is why numbers instead of strings are used. The indexed variant is still supported and there we have strings. Going from a node list (head node) to a table is not that complex. Sometimes this can be handy because manipulating tables is more convenient that messing around with userdata when it comes down to debugging or tracing. \starttyping function nodes.totable(n) function totable(n) local f, tt = node.fields(n.id,n.subtype), { } for _,v in ipairs(f) do local nv = n[v] if nv then local tnv = type(nv) if tnv == "string" or tnv == "number" then tt[v] = nv else -- userdata tt[v] = nodes.totable(nv) end end end return tt end local t = { } while n do t[#t+1] = totable(n) n = n.next end return t end \stoptyping It will be clear that here we collect data in \LUA\ while treating nodes as userdata keeps most of it at the \TEX\ side and this is where the gain in speed comes from. \subject{side effects} While experimenting with node lists Taco and I ran into a peculiar side effect. One of the tests involved adding kerns between glyphs (inter character spacing as sometimes uses in titles in a large print). When applied to a whole document we noticed that at some places (words) the added kerning was gone. We used the subtype zero kern (which is most efficient) and in the process of hyphenating \TEX\ removes these kerns and inserts them later (but then based on the information stored in the font. The reason why \TEX\ removes the font related kerns, is the following. Consider the code: \starttyping \setbox0=\hbox{some text} the text \unhcopy0 has width \the\wd0 \stoptyping While constructing the \type {\hbox}, \TEX\ will apply kerning as dictated by the font. Otherwise the width of the box would not be correct. This means that the node list entering the linebreak machinery contains such kerns. Because hyphenating works on words \TEX\ will remove these kerns in the process of identifying the words. It creates a string, removes the original sequence of nodes, determines hyphenation points, and add the result to the node list. For efficiency reasons \TEX\ will only look at places where hyphenation makes sense. Now, imagine that we add those kerns in the callback. This time, all characters are surrounded by kerns (which we gave subtype zero). When \TEX\ is determining feasable breakpoints (hyphenation), it will remove those kerns, but only at certain places. Because our kerns are way larger than the normal interglyph kerns, we suddenly end up with an intercharacter spaced paragraph that has some words without such spacing but the font dictated kerns. \blank m o s t\quad w o r d s\quad a r e\quad s p a c e d\quad b u t\quad some words\quad a r e\quad n o t \blank Of course a solution is to use a different kern, but at least this shows that the moment of processing nodes as well as the kind of manipulations need to be chosen with care. Kerning is a nasty business anyway. Imagine the following word: \starttyping effe \stoptyping When typeset this turns into three characters, one of them being a ligature. \starttyping [char e] [liga ff (components f f)] [char e] \stoptyping However, in Dutch, such a word hyphenates as: \starttyping ef-fe \stoptyping This means that in the node list we eventually find something: \starttyping [char e] [disc (f-) (f) (skip 1)] [liga ff (components f f)] [char e] \stoptyping So, eventually we need to kern between the character sequences [e,f-], [e,ff], [ff,e] and [f,e]. \subject {attributes} We now arrive at attributes, a new property of nodes. Before we explain a bit more what can be done with them, we show how to define a new attribute and toggle it. In the following example the \type {\visualizenextnodes} macro is part of \CONTEXT\ \MKIV. \startbuffer \newattribute\aa \newattribute\ab \visualizenextnodes \hbox {\aa1 T{\ab3\aa2 E}X} \stopbuffer \typebuffer \placefigure [page] [] {\type{\hbox {\aa1 T{\ab3\aa2 E}X \ab 4}}} {\switchtobodyfont[7pt]% \scale[width=.9\textwidth]{\framed [offset=2ex,foregroundcolor=red] {\startsimplecolumns[n=2] \resetglobalattributes \resetlocalattributes \getbuffer \stopsimplecolumns}}} For the sake of this example, we start the allocation at 2000 because we don't want to interfere with attributes already defined in \CONTEXT. The node list resulting from the box is shown at the next page. As you can see here, internally attributes become a linked list assigned to the \type {attr} field. This means that one has to do some work in order to inspect attributes. \starttyping function has_attribute(n,a) if n and n.attr then n = n.attr.next while n do if n.number == a then return n.value end n = n.next end else return false end end \stoptyping The previous function can be used in tests like: \starttyping local total = 0 while n do if has_attribute(n,2000) then total = total + 1 end n = n.next end texio.write_nl(string.format( "attribute 2000 has been seen % times", total )) \stoptyping When implementing nodes and attributes we did rather extensive tests and one of the test documents implemented some preliminary color mechanism based on attributes. When handling the colors the previous function was called some 300.000 times and the total node processing time (which also involved font handling) was some 2.9 seconds. Implementing this function as a helper brought down node processing time to 2.4 seconds. Of course the gain depends on the complexity of the list (nesting) and the number of attributes that are set (upto 5 per node in this test). A few more helper functions are available, some are for convenience, some gain us some speed. The nice thing about attributes is that they obey grouping. This means that in the following sequence: \starttyping x {\aa1 x \ab2 x} x \stoptyping the attributes are assigned like: \starttyping x x(201=1) x(201=1,202=2) x \stoptyping Internally \LUATEX\ does some optimizations with respect to assigning a sequence of similar attributes, but you should keep in mind that in practice the memory usage will be larger when using many attributes. We played with color and other properties, hyphenation based on word lists (and tracking languages with attributes) and or special algorithms (url hyphenation), spell checking (marking words as being spelled wrongly), and a few more things. This involved handling attributes in several callbacks resulting in the insertion or deletion of nodes. When using attributes for color support, we have to insert \type {pdfliteral} whatsit nodes at some point depending on the current color. This also means that the time spent with color support at the \TEX\ end will be compensated by time spent at the \LUA\ side. It also means that because housekeeping to do with colors spanning pages and columns is gone because from now on color information travels with the nodes. This saves quite some ugly code. Because most of the things that we want to do with attributes (and we have quite an agenda) are already nicely isolated in \CONTEXT, attributes will find their way rather soon in \CONTEXT\ \MKIV. Let's end with an observation. Attributes themselves are not something revolutionary. However, if you had to deal with them in \TEX, i.e.\ associate them with for instance actions in during shipout, quite some time would have been spent on getting things right. Even worse: it would have lead to never ending discussions in the \TEX\ community and as such it's no surprise that something like this never showed up. The fact that we can use \LUA\ and manipulate node lists in many ways frees us from much discussion. We are even considering in future versions of \LUATEX\ to turn font, language and direction related information into attributes (in some private range) so this story is far from finished. As a teaser, consider the following line of thinking. Currently when a character enters the machinery, it becomes a glyph node. Among other characteristics, this node contains information about the font and the slot in that font which is used to represent that character. In a similar fashion, a space becomes glue with a measure probably related to the current font. However, with access to nodes and attributes, you can imagine the following scenario. Instead of a font (internally represented by a font id), you use an attribute referring to a font. At that time, the font field us just pointing to \TEX's null font. In a pass over the node list, you resolve the character and their attributes to a fonts and (maybe) other characters. Spacing can be postponed as well and instead of real glue values we can use multipliers and again attributes point the way to resolve them. Of course the question is if this is worth the trouble. After all typesetting is about fonts and there is no real reason not to give them a special place. \stopcomponent