% language=uk

\startcomponent mk-performance

\environment mk-environment

\chapter{How about performance}

\subject{remark}

The previous chapters already spent some words on performance and memory usage. By the time that Taco and I were implementing, discussing and testing the callbacks related to node lists, we were already convinced that in all areas covered so far (file management, handling input characters, dealing with fonts, conversion to tokens, string and table manipulation, etc.) the \TEX||\LUA\ pair was up to the task. And so we were quite confident that processing nodes was not only an important aspect of \LUATEX\ but also quite feasible in terms of performance (after all, we needed it in order to deal with advanced typesetting of Arabic). While Taco was dealing with the \TEX\ side of the story, I was experimenting with possible mechanisms at the \LUA\ end. At the same time I got the opportunity to speed up the \METAPOST\ to \PDF\ converter, and both activities involved some timing. Here I report some of the observations that we made in this process.

\subject{parsing}

Expressions in \LUA\ are powerful and definitely faster than the regular expressions found in other languages, but they have some limits. Most noticeable is the lack of alternation. In \RUBY\ one can say:

\starttyping
str = "there is no gamma in here, just a beta"
if str =~ /(alph|bet|delt)a/ then
    print($1)
end
\stoptyping

but in \LUA\ you need a few more lines:

\starttyping
str = "there is no gamma in here, just a beta"
for _, v in pairs({'alpha','beta','delta'}) do
    local s = str:match(v)
    if s then
        print(s)
        break
    end
end
\stoptyping

Interestingly, up to now I haven't really missed alternation, but it may well be that the lack of it drove me to come up with different solutions. For \CONTEXT\ \MKIV\ the \METAPOST\ to \PDF\ converter has been rewritten in \LUA. This is a prelude to direct \LUA\ output from \METAPOST, but I needed the exercise. It was among the first \LUA\ code in \MKIV.

Progressive (sequential) parsing of the data is an option, and this is what the \MKII\ converter does in pure \TEX: we collect words, compare them to \POSTSCRIPT\ directives and act accordingly. The messy parts are scanning the preamble, which has specials to be dealt with as well as lots of unpredictable code to skip, and the \type {fshow} command which adds text to a graphic. But really dirty are the code fragments that deal with setting the line width and penshapes, so cleaning those up takes some time.

In \LUA\ a different approach is taken. There is an \type {mp} table which collects a lot of functions that more or less reflect the output of \METAPOST. The functions take care of generating the right \PDF\ code and also handle the transformations needed because of the differences between \POSTSCRIPT\ and \PDF. The sequential \POSTSCRIPT\ that comes from \METAPOST\ is collected in one string and converted using \type {gsub} into a sequence of \LUA\ function calls. Before this can be done, some cleanup takes place. The resulting string is then executed as \LUA\ code. As an example:

\starttyping
1 0 0 2 0 0 curveto
\stoptyping

becomes

\starttyping
mp.curveto(1,0,0,2,0,0)
\stoptyping

which results in:

\starttyping
\pdfliteral{1 0 0 2 0 0 c}
\stoptyping

In between, the path is stored and transformed, which is needed in the case of penshapes, where a \POSTSCRIPT\ feature is used that is not available in \PDF.

During the development of \LUATEX\ a new feature was added to \LUA: \type {lpeg}. With \type {lpeg} you can define text scanners.
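As a teaser, here is a minimal sketch (my own, not code from the converter) of how the alternation example given earlier can be expressed; the \type {anywhere} grammar is the search idiom from the \type {lpeg} manual:

\starttyping
local P, C, V = lpeg.P, lpeg.C, lpeg.V

-- alternation is simply the + operator on patterns
local word = C((P("alph") + P("bet") + P("delt")) * P("a"))

-- a grammar that matches the pattern anywhere in the subject
local anywhere = P { word + 1 * V(1) }

print(anywhere:match("there is no gamma in here, just a beta"))
\stoptyping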
In fact, you can quite conveniently build parsers for complete languages with it, so without doubt we will see it show up all over \MKIV. Since I needed an exercise to get accustomed to \type {lpeg}, I rewrote the mentioned converter. I'm sure that a better implementation than mine is possible (after all, \POSTSCRIPT\ is a language), but I went for a speedy solution. The following table shows some timings.

\starttabulate[|c|c|l|]
\HL
\NC \tt gsub \NC \tt lpeg \NC \NC \NR
\HL
\NC 2.5 \NC 0.5 \NC 100 times test graphic \NC \NR
\NC 9.2 \NC 1.9 \NC 100 times big graphic \NC \NR
\HL
\stoptabulate

The test graphic has about everything that \METAPOST\ can output, including special tricks that deal with transparency and shading. The big one is just four copies of the test graphic. So, the \type {lpeg} based variant is about 5~times faster than the original variant. I'm not saying that the original implementation is that brilliant, but a factor~5 improvement is rather nice, especially when you consider that \type {lpeg} is still experimental and each version performs better. The tests were done with \type {lpeg} version 0.5, which performs slightly faster than its predecessor.

It's worth mentioning that the original \type {gsub} based variant was already a bit improved compared to its first implementation. There we collected the \TEX\ (\PDF) code in a table and passed it in its concatenated form to \TEX. Because the \LUA\ to \TEX\ interface is by now quite efficient, we can just pass the intermediate results directly to \TEX.

\subject{file io}

The repertoire of functions that deal with individual characters in \LUA\ is small. This does not bother us too much, because the individual character is not what \TEX\ is mostly dealing with. A character or sequence of characters becomes a token (internally represented by a table) and tokens result in nodes (again tables, but larger). There are many more tokens involved than nodes: in \CONTEXT\ a ratio of 200~tokens to 1~node is not uncommon. A letter like \type {x} becomes a token, but the control sequence \type {\command} also ends up as one token. Later on, this \type {x} may become a character node, possibly surrounded by some kerning. The input characters \type {width} result in 5~tokens, but may not end up as nodes at all, for instance when they are part of a key|/|value pair in the argument to a command.

Just as there is no guaranteed one||to||one relationship between input characters and tokens, there is no straight relation between tokens and nodes. When dealing with input it is good to keep in mind that, because of these interpretation stages, one can never say that 1~megabyte of input characters ends up as 1~million somethings in memory. Just think of how many megabytes of macros get stored in a format file that is much smaller than the sum of the input bytes.

We only deal with characters, or sequences of bytes, when reading from an input medium. There are many ways to deal with the input. For instance, one can process the input lines as \TEX\ sees them, in which case \TEX\ takes care of the \UTF\ input. When we're dealing with other input encodings we can hook code into the file openers and readers and convert the raw data ourselves.
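To give an impression of what such a hook can look like, here is a sketch of my own (using the \type {open_read_file} callback as documented in the \LUATEX\ manual; the naive recoding of latin||1 bytes to \UTF\ is just for illustration):

\starttyping
callback.register('open_read_file', function(name)
    local f = io.open(name,'rb')
    if not f then
        return nil
    end
    return {
        reader = function(env)
            local line = f:read('*l')
            if not line then
                return nil
            end
            -- recode each byte above 127 into a two byte utf sequence
            return (line:gsub("[\128-\255]", function(c)
                local b = string.byte(c)
                return string.char(0xC0 + math.floor(b/64), 0x80 + b % 64)
            end))
        end,
        close = function(env)
            f:close()
        end,
    }
end)
\stoptyping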
Alternatively, we can read a file as a whole, convert it using the normal expression handlers or the byte(pair) iterators that \LUATEX\ provides, or we can go real low level using native \LUA\ code, as in:

\starttyping
do
    -- return the next byte (as a one character string) or nil
    local function nextbyte(f)
        return f:read(1)
    end

    function io.bytes(f)
        return nextbyte, f
    end
end

f = io.open("somefile.dat")
for b in io.bytes(f) do
    do_something(b)
end
f:close()
\stoptyping

Of course in practice one will need to integrate this into one of the reader callbacks, but the principle stays the same. In case you wonder if calling a function for each byte is fast enough \unknown\ it's more than fast enough for normal purposes, especially if we keep in mind that other tasks, like reading, preparing and dealing with fonts or processing token lists, take way more time. You can be sure that when half a second of runtime is spent on reading a file, processing may take minutes. If one wants to squeeze more performance out of this part, it's always an option to write special libraries for that, but this is beyond standard \LUATEX. We found out that the speed of loading data from files in \LUA\ is mostly related to the small size of \LUA's file buffer. Reading data stored in tables is extremely fast, and even faster when precompiled into bytecode.

\subject{tables}

When Taco and I were experimenting with the callbacks that intercept tokens and nodes, we wondered what the impact would be on performance. Although in \MKIV\ we allocate quite some memory due to font handling, we were pretty sure that handling \TEX's internal lists could also have its impact. Data related to fonts is not always subjected to garbage collection, simply because it has to be available permanently. List processing on the other hand involves a lot of temporarily allocated tables.

During a run a really huge amount of tokens passes through the machinery. When digested, they become nodes. For testing we normally use this document (with the name \type {mk.tex}) and at the time of writing this, it has some 48 pages. This document is of moderate complexity, but not as complex as the documents that I normally process; those come with lots of graphics, layers, structural elements, maybe a bit of \XML\ parsing, etc. Nevertheless, we're talking of some 24~million tokens entering the engine for these roughly 50 pages of text. Contrary to this, the number of nodes is small: only 120~thousand, but the tables making up the nodes are more complex than token tables (which hold three numbers per token). When all tokens are intercepted and returned unchanged, on my machine the run becomes progressively slower and memory usage grows from 75M to 112M. There is room for improvement there, especially in the garbage collector.

Side note: quite some of these tokens result from macro expansion. Also, when a \type {\command} is used in the input, the callback passes it as one token. A command stored in a format is already tokenized, but a command read from the input is tokenized when read, so behind each reported token there can be a few more input characters, but their number can be neglected compared to the tokens originating from the macro package.

The token callback is rather slow when used for a whole document. However, this is typically a callback that will only be used in very special situations and for a controlled number of tokens. The node callback on the other hand can be set permanently. Fortunately the number of nodes is relatively small.
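A minimal example of such a permanent node callback (a sketch of my own, not the code used for the measurements below) is a filter that merely counts what passes by and leaves the list untouched:

\starttyping
local n = 0

callback.register('pre_linebreak_filter', function(head)
    for _ in node.traverse(head) do
        n = n + 1
    end
    return true -- true means: keep the list as it is
end)
\stoptyping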
The overhead of a simple node handler like this, one that just counts nodes, is around 5\%, but most common manipulations of node lists don't take much more time. For instance, experiments with adding kerns around punctuation (a French speciality) hardly take time, resolving ligatures is not really noticeable, and applying inter||character spacing to a whole document is not that slow either. Actually, the last example is kind of special, because it more than doubles the size of the node lists. Inserting or removing table elements is relatively slow when tables are large, but there are some ways around this.

One of the reasons why whole||document token handling is slow is that each token is a three||element table, so the garbage collector has to work rather hard. The efficiency of this process is also platform dependent (or maybe compiler specific). Manipulating the garbage collector parameters does not improve performance, unless one forces the collector to be inefficient at the cost of a lot of memory. However, when we started dealing with nodes, I gave tuning the collector another try, and on the mentioned test document the following observations were made when manipulating the step multiplier:

\starttabulate[|c|c|c|]
\HL
\NC \bf step \NC \bf runtime \NC \bf memory \NC \NR
\HL
\NC 200 \NC 24.0 \NC 80.5M \NC \NR
\NC 175 \NC 21.0 \NC 78.2M \NC \NR
\NC 150 \NC 22.0 \NC 74.6M \NC \NR
\NC 160 \NC 22.0 \NC 74.6M \NC \NR
\NC 165 \NC 21.0 \NC 77.6M \NC \NR
\NC 125 \NC 21.5 \NC 89.2M \NC \NR
\NC 100 \NC 21.5 \NC 88.4M \NC \NR
\HL
\stoptabulate

As a result, I decided to set the \type {stepmul} variable to~165.

\starttyping
\ctxlua{collectgarbage("setstepmul", 165)}
\stoptyping

However, when we were testing the new \type {lpeg} based \METAPOST\ converter, we ran into problems. For table||intensive operations, temporarily disabling the garbage collector gave a significant boost in speed. While testing performance we used the following loop:

\starttyping
\dorecurse {2000} {
    \setbox \scratchbox \hbox \bgroup
        \convertMPtoPDF{test-mps-procset.mps}{1}{1}
    \egroup
}
\stoptyping

In such a loop, turning the garbage collector on and off is disastrous: because no other \LUA\ calls happen between these calls, the garbage collector is never invoked at all. As a result, memory grew from the baseline of 45M to 120M and processing became increasingly slow. I found out that restarting the collector before each conversion kept memory usage low while the speed also remained okay.

\starttyping
\ctxlua{collectgarbage("restart")}
\stoptyping

Further experiments taught us that it makes sense to restart the collector at each shipout and before table||intensive operations. On \type {mk.tex} this results in a memory usage of 74M (at the end of the run) and a runtime of 21~seconds.
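When tuning the collector like this, it helps to keep an eye on the \LUA\ heap between runs. A trivial helper (my own, not part of \LUATEX; \type {collectgarbage("count")} reports the heap size in kilobytes) can look like:

\starttyping
local function report_memory(tag)
    texio.write_nl(string.format("%s: %.1f MB",
        tag, collectgarbage("count")/1024))
end
\stoptyping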
Concerning nodes and speed|/|allocation issues, we need to be aware of the fact that all this was still somewhat experimental; in the final version of \LUATEX\ callbacks may occur at different places, and lists may be subjected to parsing multiple times, at different moments and locations (for instance when we start dealing with attributes, an upcoming new feature).

Back to tokens. The reason why applying the callback to every token takes a while is that each token goes through the associated function. If you want to have an idea of what this means for 24~million tokens, just run the following \LUA\ code:

\starttyping
for i=1,24 do
    print(i)
    for j=1,1000*1000 do
        local t = { 1, 2, 3 }
    end
end
print(os.clock())
\stoptyping

This takes some 60 seconds on my machine. The following code runs about three times faster, because the table does not have to be allocated each time:

\starttyping
t = { 1, 2, 3 }
for i=1,24 do
    print(i)
    for j=1,1000*1000 do
        t[1]=4 t[2]=5 t[3]=6
    end
end
print(os.clock())
\stoptyping

Imagine this code being interwoven with other code, and \TEX\ doing things with the tokens it gets back. The memory pool will be scattered and garbage collecting will become more difficult. However, in practice one will only apply token handling to a marked piece of the input data. It is for this reason that the callback is not:

\starttyping
callback.register('token_filter', function(t)
    return t
end)
\stoptyping

but instead

\starttyping
callback.register('token_filter', function()
    return token.get_next()
end)
\stoptyping

This gives us the opportunity to fetch more than one token and keep fetching until a criterion is met (for instance a sentinel). Because \type {token.get_next} is not bound to the callback, you can fetch tokens any time you want and use the callback only to feed tokens back into \TEX. In \CONTEXT\ \MKIV\ some collect and flush machinery for tokens is present. Here is a trivial example:

\starttyping
\def\SwapChars{\directlua 0 {
    do
        local t = { token.get_next(), token.get_next() }
        callback.register('token_filter', function()
            callback.register('token_filter', nil)
            return { t[2], t[1] }
        end)
    end
}}

\SwapChars HH \SwapChars TH
\stoptyping

Collecting tokens can take place inside the callback but also outside it. This also gives you the opportunity to collect them in efficient ways and keep an eye on the memory demands. Of course using \TEX\ directly takes less code:

\starttyping
\def\SwapChars#1#2{#2#1}
\stoptyping

The example shown here involves so few tokens that running it takes no noticeable time. Here we show this definition in tokenized form:

\starttokens[demo]\def\SwapChars#1#2{#2#1}\stoptokens

\setups{ShowCollect}

\stopcomponent