% language=uk

\startcomponent mk-performance

\environment mk-environment

\chapter{How about performance}

\subject{remark}

The previous chapters already spent some words on performance
and memory usage. By the time that Taco and I were implementing,
discussing and testing the callbacks related to node lists, we were
already convinced that in all areas covered so far (file management,
handling input characters, dealing with fonts, conversion to tokens,
string and table manipulation, etc.) the \TEX||\LUA\ pair was up to
the task. And so we were quite confident that processing nodes was
not only an important aspect of \LUATEX\ but also quite feasible
in terms of performance (after all, we needed it in order to deal
with advanced typesetting of Arabic). While Taco was dealing with the
\TEX\ side of the story, I was experimenting with possible
mechanisms at the \LUA\ end.

At the same time I got the opportunity to speed up the \METAPOST\
to \PDF\ converter, and both activities involved some timing. Here
I report some of the observations that we made in this process.

\subject{parsing}

Expressions in \LUA\ are powerful and definitely faster than the regular
expressions found in other languages, but they have some limits. Most
noticeable is the lack of alternation. In \RUBY\ one can say:

\starttyping
str = "there is no gamma in here, just a beta"

if str =~ /(alph|bet|delt)a/ then
    print($1)
end
\stoptyping

but in \LUA\ you need a few more lines:

\starttyping
str = "there is no gamma in here, just a beta"

for _, v in pairs({'alpha','beta','delta'}) do
    local s = str:match(v)
    if s then
        print(s)
        break
    end
end
\stoptyping

Interestingly, up to now I haven't really missed alternation, but it
may well be that the lack of it drove me to come up with different
solutions. For \CONTEXT\ \MKIV\ the \METAPOST\ to \PDF\
converter has been rewritten in \LUA. This is a prelude to direct
\LUA\ output from \METAPOST, but I needed the exercise. It was
among the first \LUA\ code in \MKIV.

Progressive (sequential) parsing of the data is an option, and this is
done in \MKII\ using pure \TEX: we collect words and compare them
to \POSTSCRIPT\ directives and act accordingly. The messy parts
are scanning the preamble, which has specials to be dealt with as
well as lots of unpredictable code to skip, and the \type
{fshow} command which adds text to a graphic. But really dirty are
the code fragments that deal with setting the line width and
penshapes, so cleaning those up takes some time.

In \LUA\ a different approach is taken. There is an \type {mp} table
which collects a lot of functions that more or less reflect the
output of \METAPOST. The functions take care of generating the right
\PDF\ code and also handle the transformations needed because of
the differences between \POSTSCRIPT\ and \PDF.

The sequential \POSTSCRIPT\ that comes from \METAPOST\ is
collected in one string and converted using \type {gsub} into a
sequence of \LUA\ function calls. Before this can be done, some
cleanup takes place. The resulting string is then executed as
\LUA\ code.

As an example:

\starttyping
1 0 0 2 0 0 curveto
\stoptyping

becomes

\starttyping
mp.curveto(1,0,0,2,0,0)
\stoptyping

which results in:

\starttyping
\pdfliteral{1 0 0 2 0 0 c}
\stoptyping

In between, the path is stored and transformed, which is needed in
the case of penshapes, where some \POSTSCRIPT\ feature is used that
is not available in \PDF.
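To make this a bit more concrete, here is a much simplified sketch
of such a \type {gsub} based rewrite. The dummy \type {mp.curveto}
handler is made up for the occasion and the pattern covers just this
one directive; the real converter deals with many more directives
and does some cleanup first.

\starttyping
-- a dummy handler; the real mp table generates \pdfliteral code
mp = { }

function mp.curveto(...)
    print("curveto", ...)
end

local ps = "1 0 0 2 0 0 curveto"

-- rewrite "<numbers> curveto" into "mp.curveto(<numbers>)"
local code = ps:gsub("([%d%.%- ]+)curveto", function(args)
    return "mp.curveto(" .. args:gsub("%s+$",""):gsub("%s+",",") .. ")"
end)

-- code now holds "mp.curveto(1,0,0,2,0,0)", so we can run it
assert(loadstring(code))()
\stoptyping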
During the development of \LUATEX\ a new feature was added to
\LUA: \type {lpeg}. With \type {lpeg} you can define text scanners. In
fact, you can build parsers for languages quite conveniently, so
without doubt we will see it show up all over \MKIV.

Since I needed an exercise to get accustomed to \type {lpeg}, I
rewrote the mentioned converter. I'm sure that a better implementation
than mine is possible (after all, \POSTSCRIPT\ is a language) but
I went for a speedy solution. The following table shows some timings
(in seconds).

\starttabulate[|c|c|l|]
\HL
\NC \tt gsub \NC \tt lpeg \NC \NC \NR
\HL
\NC 2.5 \NC 0.5 \NC 100 times test graphic \NC \NR
\NC 9.2 \NC 1.9 \NC 100 times big graphic \NC \NR
\HL
\stoptabulate

The test graphic has about everything that \METAPOST\ can output,
including special tricks that deal with transparency and shading. The
big one is just four copies of the test graphic.

So, the \type {lpeg} based variant is about 5~times faster than the
original variant. I'm not saying that the original implementation is
that brilliant, but a fivefold improvement is rather nice, especially
when you consider that \type {lpeg} is still experimental and each
version performs better. The tests were done with \type {lpeg}
version 0.5, which performs slightly faster than its predecessor.

It's worth mentioning that the original \type {gsub} based variant
was already a bit improved compared to its first implementation.
There we collected the \TEX\ (\PDF) code in a table and passed it
in its concatenated form to \TEX. Because the \LUA\ to \TEX\
interface is by now quite efficient, we can just pass the
intermediate results directly to \TEX.
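To give an impression of what \type {lpeg} code looks like, here is a
small sketch (made up for this purpose, not code taken from the
converter) that revisits the alternation example from the beginning
of this chapter. Ordered choice comes built in:

\starttyping
-- lpeg comes built into luatex; the + operator is the ordered
-- choice that normal lua patterns lack
local P, C = lpeg.P, lpeg.C

local greek  = C((P("alph") + P("bet") + P("delt")) * P("a"))
local search = (1 - greek)^0 * greek

print(lpeg.match(search, "there is no gamma in here, just a beta"))
-- prints: beta
\stoptyping

The \type {(1 - greek)^0} part simply skips characters until the
pattern matches somewhere, which gives us the behaviour of an
unanchored search.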
\subject{file io}

The repertoire of functions that deal with individual characters
in \LUA\ is small. This does not bother us too much because the
individual character is not what \TEX\ is mostly dealing with. A
character or sequence of characters becomes a token (internally
represented by a table) and tokens result in nodes (again
tables, but larger). There are many more tokens involved than
nodes: in \CONTEXT\ a ratio of 200~tokens to 1~node is not
uncommon. A letter like \type {x} becomes a token, but the control
sequence \type {\command} also ends up as one token. Later on,
this \type {x} may become a character node, possibly surrounded
by some kerning. The input characters \type {width} result in
5~tokens, but may not end up as nodes at all, for instance when
they are part of a key|/|value pair in the argument to a command.

Just as there is no guaranteed one||to||one relationship between
input characters and tokens, there is no straight relation between
tokens and nodes. When dealing with input it is good to keep in mind
that because of these interpretation stages one can never say that
1~megabyte of input characters ends up as 1~million somethings in
memory. Just think of how many megabytes of macros get stored in a
format file much smaller than the sum of bytes.

We only deal with characters or sequences of bytes when reading
from an input medium. There are many ways to deal with the input.
For instance one can process the input lines as \TEX\ sees them,
in which case \TEX\ takes care of the \UTF\ input. When we're
dealing with other input encodings we can hook code into the file
openers and readers and convert the raw data ourselves. We can for
instance read in a file as a whole, convert it using the normal
expression handlers or the byte(pair) iterators that \LUATEX\
provides, or we can go really low level using native \LUA\ code,
as in:

\starttyping
do
    local function nextbyte(f)
        return f:read(1)
    end

    function io.bytes(f)
        return nextbyte, f
    end
end

f = io.open("somefile.dat")
for b in io.bytes(f) do
    do_something(b)
end
f:close()
\stoptyping

Of course in practice one will need to integrate this into one of the
reader callbacks, but the principle stays the same. In case you wonder
if calling a function for each byte is fast enough \unknown\ it's more than
fast enough for normal purposes, especially if we keep in mind that other
tasks, like reading, preparing and dealing with fonts, or processing
token lists, take way more time. You can be sure that when half a second
of runtime is spent on reading a file, processing may take minutes. If one
wants to squeeze more performance out of this part, it's always an option
to write special libraries for that, but this is beyond standard \LUATEX.
We found out that the speed of loading data from files in \LUA\ is
mostly related to the small size of \LUA's file buffer. Reading data stored
in tables is extremely fast, and even faster when precompiled into bytecode.
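For the record, here is a sketch of the read||it||as||a||whole
alternative mentioned above; as in the previous example, \type
{do_something} stands in for the real processing:

\starttyping
-- read the file in one go and loop over its bytes; this
-- sidesteps the small file buffer and the per-byte read calls
local f = assert(io.open("somefile.dat", "rb"))
local data = f:read("*a")
f:close()

for i=1,#data do
    do_something(data:byte(i))
end
\stoptyping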
\subject{tables}

When Taco and I were experimenting with the callbacks that intercept
tokens and nodes, we wondered what the impact would be on performance.
Although in \MKIV\ we allocate quite some memory due to font handling,
we were pretty sure that handling \TEX's internal lists could also
have an impact. Data related to fonts is not always subjected to
garbage collection, simply because it has to be available permanently.
List processing on the other hand involves a lot of temporarily
allocated tables. During a run a really huge number of tokens passes
through the machinery. When digested, they become nodes. For testing
we normally use this document (with the name \type {mk.tex}) and at
the time of writing this, it has some 48 pages.

This document is of moderate complexity, but not as complex as the
documents that I normally process; those come with lots of graphics,
layers, structural elements, maybe a bit of \XML\ parsing, etc.
Nevertheless, we're talking of some 24~million tokens entering the engine
for 50 pages of text. Contrary to this, the number of nodes is small:
only 120~thousand, but the tables making up the nodes are more complex
than token tables (which hold three numbers per token). When all tokens
are intercepted and returned unchanged, on my machine the run becomes
progressively slower and memory usage grows from 75M to 112M. There is
room for improvement there, especially in the garbage collector.

Side note: quite a few of these tokens result from macro expansion. Also,
when in the input a \type {\command} is used, the callback passes it as one
token. A command stored in a format is already tokenized, but a command read
from the input is tokenized when read, so behind each token reported there
can be a few more input characters, but their number is negligible compared
to the tokens originating from the macro package.

The token callback is rather slow when used for a whole document. However,
this is typically a callback that will only be used in very special
situations and for a controlled number of tokens. The node callback on the
other hand can be set permanently. Fortunately the number of nodes is
relatively small. The overhead of a simple handler that just counts
nodes is around 5\%, but most common manipulations of node lists don't
take much more time. For instance, experiments with adding kerns around
punctuation (a French speciality) hardly take any time, resolving ligatures
is not really noticeable, and applying inter||character spacing to a whole
document is not that slow either. Actually, the last example is kind of
special because it more than doubles the size of the node lists. Inserting
or removing table elements is relatively slow when tables are large, but
there are some ways around this.

One of the reasons why whole||document token handling is slow is that
each token is a three||element table, so the garbage collector has to work
rather hard. The efficiency of this process is also platform dependent (or
maybe compiler specific). Manipulating the garbage collector parameters
does not improve performance, unless this forces the collector to be
inefficient at the cost of a lot of memory.

However, when we started dealing with nodes, I gave tuning the collector
another try and on the mentioned test document the following observations
were made when manipulating the step multiplier:

\starttabulate[|c|c|c|]
\HL
\NC \bf step \NC \bf runtime (s) \NC \bf memory \NC \NR
\HL
\NC 200 \NC 24.0 \NC 80.5M \NC \NR
\NC 175 \NC 21.0 \NC 78.2M \NC \NR
\NC 150 \NC 22.0 \NC 74.6M \NC \NR
\NC 160 \NC 22.0 \NC 74.6M \NC \NR
\NC 165 \NC 21.0 \NC 77.6M \NC \NR
\NC 125 \NC 21.5 \NC 89.2M \NC \NR
\NC 100 \NC 21.5 \NC 88.4M \NC \NR
\HL
\stoptabulate

As a result, I decided to set the \type {stepmul} variable to~165.

\starttyping
\ctxlua{collectgarbage("setstepmul", 165)}
\stoptyping

However, when we were testing the new \type {lpeg} based \METAPOST\
converter, we ran into problems. For table intensive operations, temporarily
disabling the garbage collector gave a significant boost in speed. While
testing performance we used the following loop:

\starttyping
\dorecurse {2000} {
    \setbox \scratchbox \hbox \bgroup
        \convertMPtoPDF{test-mps-procset.mps}{1}{1}
    \egroup
}
\stoptyping

In such a loop, turning the garbage collector on and off is disastrous. Because
no other \LUA\ calls happen between these calls, the garbage collector is never
invoked at all. As a result, memory grew from the baseline of 45M to 120M and
processing became incrementally slow. I found out that restarting the collector
before each conversion kept memory usage low and the speed also remained okay.

\starttyping
\ctxlua{collectgarbage("restart")}
\stoptyping

Further experiments taught us that it makes sense to restart the collector at
each shipout and before table intensive operations. On \type {mk.tex} this
results in a memory usage of 74M (at the end of the run) and a runtime of
21~seconds.
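When experimenting along these lines it helps to watch the collector at
work. The following sketch uses the standard \type {collectgarbage}
interface; the loop is just a stand||in for a table intensive operation:

\starttyping
-- report memory usage (in kilobytes) around an allocation burst
print(string.format("before: %.0fK", collectgarbage("count")))

local t = { }
for i=1,1000*1000 do
    t[i] = { i, i+1, i+2 }
end
t = nil

collectgarbage("collect")
print(string.format("after: %.0fK", collectgarbage("count")))
\stoptyping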
Concerning nodes and speed|/|allocation issues, we need to be aware of
the fact that all this was still somewhat experimental and that in the final
version of \LUATEX\ callbacks may occur at different places and lists may be
subjected to parsing multiple times at different moments and locations (for
instance when we start dealing with attributes, an upcoming new feature).

Back to tokens. The reason why applying the callback to every token takes a
while has to do with the fact that each token goes through the associated
function. If you want to have an idea of what this means for 24~million
tokens, just run the following \LUA\ code:

\starttyping
for i=1,24 do
    print(i)
    for j=1,1000*1000 do
        local t = { 1, 2, 3 }
    end
end
print(os.clock())
\stoptyping

This takes some 60 seconds on my machine. The following code
runs about three times faster because the table does not have to
be allocated each time.

\starttyping
t = { 1, 2, 3 }
for i=1,24 do
    print(i)
    for j=1,1000*1000 do
        t[1]=4 t[2]=5 t[3]=6
    end
end
print(os.clock())
\stoptyping

Imagine this code interwoven with other code and \TEX\ doing
things with the tokens it gets back. The memory pool will get
scattered and garbage collecting will become more difficult.

However, in practice one will only apply token handling
to a marked piece of the input data. It is for this reason
that the callback is not:

\starttyping
callback.register('token_filter', function(t)
    return t
end )
\stoptyping

but instead

\starttyping
callback.register('token_filter', function()
    return token.get_next()
end )
\stoptyping

This gives us the opportunity to fetch more than one token and
keep fetching until a criterion is met (for instance a sentinel).

Because \type {token.get_next} is not bound to the callback you
can fetch tokens anytime you want and only use the callback to
feed tokens back into \TEX. In \CONTEXT\ \MKIV\ some collect and
flush token machinery is present. Here is a trivial example:

\starttyping
\def\SwapChars{\directlua 0 {
    do
        local t = { token.get_next(), token.get_next() }
        callback.register('token_filter', function()
            callback.register('token_filter', nil)
            return { t[2], t[1] }
        end )
    end
}}

\SwapChars HH \SwapChars TH
\stoptyping

Collecting tokens can take place inside the callback but also
outside it. This also gives you the opportunity to collect them
in efficient ways and keep an eye on memory demands.

Of course using \TEX\ directly takes less code:

\starttyping
\def\SwapChars#1#2{#2#1}
\stoptyping

The example shown here involves so few tokens that running
it takes no noticeable time. Here we show this definition in
tokenized form:

\starttokens[demo]\def\SwapChars#1#2{#2#1}\stoptokens \setups{ShowCollect}

\stopcomponent