author | Context Git Mirror Bot <phg42.2a@gmail.com> | 2016-08-01 16:40:14 +0200 |
---|---|---|
committer | Context Git Mirror Bot <phg42.2a@gmail.com> | 2016-08-01 16:40:14 +0200 |
commit | 96f283b0d4f0259b7d7d1c64d1d078c519fc84a6 (patch) | |
tree | e9673071aa75f22fee32d701d05f1fdc443ce09c /doc/context/sources/general/manuals/about/about-speed.tex | |
parent | c44a9d2f89620e439f335029689e7f0dff9516b7 (diff) | |
download | context-96f283b0d4f0259b7d7d1c64d1d078c519fc84a6.tar.gz |
2016-08-01 14:21:00
Diffstat (limited to 'doc/context/sources/general/manuals/about/about-speed.tex')
-rw-r--r-- | doc/context/sources/general/manuals/about/about-speed.tex | 732 |
1 files changed, 732 insertions, 0 deletions
diff --git a/doc/context/sources/general/manuals/about/about-speed.tex b/doc/context/sources/general/manuals/about/about-speed.tex
new file mode 100644
index 000000000..4b4a376e8
--- /dev/null
+++ b/doc/context/sources/general/manuals/about/about-speed.tex
@@ -0,0 +1,732 @@
+% language=uk
+
+\startcomponent about-speed
+
+\environment about-environment
+
+\startchapter[title=Speed]
+
+\startsection[title=Introduction]
+
+In the \quote {mk} and \type {hybrid} progress reports I have spent some words
+on speed. Why is speed this important?
+
+In the early days of \CONTEXT\ I often had to process documents with thousands of
+pages and hundreds of thousands of hyperlinks. You can imagine that this took a
+while, especially when all kinds of ornaments had to be added to the page:
+backgrounds, buttons with their own backgrounds and offsets, hyperlink colors
+dependent on their state, etc. Given that multiple runs were needed, this could
+mean that you'd leave the machine running all night in order to get the final
+document.
+
+It was the time when computers got twice the speed with each iteration of
+hardware, so I suppose that it would run substantially faster on my current
+laptop, an old Dell M90 workhorse. Of course a recently added SSD drive adds a
+boost as well. But still, processing such documents on a machine with an 8 MHz 286
+processor and 640 kilobytes of memory was close to impossible. But, when I
+compare the speed of the Core Duo M90 with for instance an M4600 with an i5 \CPU\
+running at the same clock speed as the M90, I see a factor 2 improvement at most. Of
+course going for an extremely high clocked desktop will be much faster, but we're no
+longer seeing a tenfold speedup every few years. On the contrary: we see a shift to
+multiple cores, often running at a lower clock speed, with the assumption that
+threaded applications are used. This scales perfectly for web services and
+graphic manipulations but not so much for \TEX.
If we want to go faster, we need to
+see where we can be more efficient within more or less frozen clock speeds.
+
+Of course there are some developments that help us. First of all, for programs
+like \TEX\ clever caching of files by the operating system helps a lot. Memory
+still becomes faster and \CPU\ caches become larger too. For large documents with
+lots of resources an SSD works out great. As \LUA\ uses floating point, speedups
+in that area also help with \LUATEX. We use virtual machines for \TEX\ related
+services and for some reason that works out quite well, as the underlying
+operating system does lots of housekeeping in parallel. But, with all maxing out,
+we finally end up at the software itself, and in \TEX\ this boils down to a core
+of compiled code along with lots of macro expansions and interpreted \LUA\ code.
+
+In the end, the question remains what causes excessive runtimes. Is it the nature
+of the \TEX\ expansion engine? Is it bad macro writing? Is there too much
+overhead? If you notice how fast processing the \TEX\ book goes on modern
+hardware it is clear that the core engine is not the problem. It's no big deal to
+get 100 pages per second on documents that use a relatively simple page builder and
+have macros that lack a flexible user interface.
+
+Take the following example:
+
+\starttyping
+\starttext
+\dorecurse{1000}{test\page}
+\stoptext
+\stoptyping
+
+We do nothing special here. We use the default Latin Modern fonts and process
+single words. No burden is put on the pagebuilder either. This way we get on a
+2.33 GHz T7600 \CPU\ a performance of 185 pages per second. \footnote {In this
+case the mingw version was used. A version using the native \WINDOWS\ compiler
+runs somewhat faster, although this depends on the compiler options.
\footnote
+{We've noticed that sometimes the mingw binaries are faster than native binaries,
+but sometimes they're slower.} With \LUAJITTEX\ the 185 pages per second
+become 195 on a 1000 page document.} The estimated \LUA\ overhead in this 1000
+page document is some 1.5 to 2 seconds. The following table shows the performance
+on such a test document with different numbers of pages in pps (reported pages per
+second).
+
+\starttabulate[|r|r|]
+\HL
+\NC \bf \# pages \NC \bf pps \NC \NR
+\HL
+\NC 1 \NC 2 \NC \NR
+\NC 10 \NC 15 \NC \NR
+\NC 100 \NC 90 \NC \NR
+\NC 1000 \NC 185 \NC \NR
+\NC 10000 \NC 215 \NC \NR
+\HL
+\stoptabulate
+
+The startup time, measured on a zero page document, is 0.5 seconds. This includes
+loading the format, loading the embedded \LUA\ scripts and initializing them,
+initializing and loading the file database, locating and loading some runtime
+files and loading the absolute minimum number of fonts: a regular and math Latin
+Modern. A few years before this writing that was more than a second, and the gain
+is due to a slightly faster \LUA\ interpreter as well as improvements in
+\CONTEXT.
+
+So why does this matter at all, if on a larger document the startup time can be
+neglected? It does because when I have to implement a style for a project or am
+developing some functionality a fast edit||run||preview cycle is a must, if only
+because even a wait of a few seconds feels uncomfortable. On the other hand, when I
+process a manual of say 150 pages, which uses some tricks to explain matters, I
+don't care if the processing rate is between 5 and 15 pages per second, simply
+because you get (done) what you asked for. It mostly has to do with feeling
+comfortable.
+
+There is one thing to keep in mind: such measurements can vary over time, as they
+depend on several factors.
Even in the trivial case we need to:
+
+\startitemize[packed]
+\startitem
+    load macros and \LUA\ code
+\stopitem
+\startitem
+    load additional files
+\stopitem
+\startitem
+    initialize the system, think of fonts and languages
+\stopitem
+\startitem
+    package the pages, which includes reverting to global document states
+\stopitem
+\startitem
+    create the final output stream (\PDF)
+\stopitem
+\stopitemize
+
+The simple one word per page test is not that slow, and normally for 1000 pages we
+measure around 200 pps. However, due to some small speedups (that somehow add up)
+in three months' time I could gain a lot:
+
+\starttabulate[|r|r|r|r|]
+\HL
+\NC \bf \# pages \NC \bf January \NC \bf April \NC \bf May\rlap{\quad(2013)} \NR
+\HL
+\NC 1 \NC 2 \NC 2 \NC 2 \NC \NR
+\NC 10 \NC 15 \NC 17 \NC 17 \NC \NR
+\NC 100 \NC 90 \NC 109 \NC 110 \NC \NR
+\NC 1000 \NC 185 \NC 234 \NC 259 \NC \NR
+\NC 10000 \NC 215 \NC 258 \NC 289 \NC \NR
+\HL
+\stoptabulate
+
+Among the improvements in April were a faster output to the console (first
+prototyped in \LUA, later done in the \LUATEX\ engine itself), and a couple of
+low level \LUA\ optimizations. In May a dirty (maybe too tricky) global document
+state restore trick has been introduced. Although these changes give a nice speed
+bump, they will mostly go unnoticed in more realistic documents. There we are
+happy if we end up in the 20 pps range. So, in practice a more than 10 percent
+speedup between January and April is just a dream. \footnote {If you wonder why I
+still bother with such things: sometimes speedups are just a side effect of
+trying to accomplish something else, like less verbose output in full tracing
+mode.}
+
+There are many cases where it does matter to squeeze out every second possible.
+We run workflows where some six documents are generated from one source.
If we
+forget about the initial overhead of fetching the source from a remote server
+\footnote {In the user interface we report the time it takes to fetch the source
+so that the typesetter can't be blamed for delays.} gaining half a second per
+document (if we start fresh each needs two runs at least) means that the user
+will see the first result one second faster and have them all six seconds sooner than
+before. In that case it makes sense to identify bottlenecks in the more high
+level mechanisms.
+
+And this is why during the development of \CONTEXT\ and the transition from
+\MKII\ to \MKIV\ quite some time has been spent on avoiding bottlenecks. And, at
+this point we can safely conclude that, in spite of more advanced functionality,
+the current version of \MKIV\ runs faster than the \MKII\ versions in most cases,
+especially if you take the additional functionality into account (like \UNICODE\
+input and fonts).
+
+\stopsection
+
+\startsection[title=The \TEX\ engine]
+
+Writing inefficient macros is not that hard. If they are used only a few times,
+for instance in setting up properties, it plays no role. But if they're expanded
+many times it may make a difference. Because use and development of \CONTEXT\
+went hand in hand we always made sure that the overhead was kept at a minimum.
+
+\startsubject[title=The parbuilder]
+
+There are a couple of places where document processing in a traditional \TEX\
+engine gets a performance hit. Let's start with the parbuilder. Although the
+paragraph builder is quite fast it can be responsible for a decent amount of runtime.
+It is also a fact that the parbuilders of the engines derived from original \TEX\
+are more complex. For instance, \OMEGA\ adds bidirectionality to the picture
+which involves some extra checking as well as more nodes in the list. The \PDFTEX\
+engine provides protrusion and expansion, and as that feature was primarily a
+topic of research it was never optimized.
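+
+For readers who want to try protrusion and expansion themselves, the following
+is a minimal sketch of how they are commonly switched on in \MKIV. It assumes
+the usual interface: both are font features, so they have to be set before the
+bodyfont is loaded, and \type {\setupalign} tells the parbuilder to use them.
+The \type {quality} values and the Dejavu bodyfont are illustrative choices, not
+the exact setup behind the timings in this chapter.
+
+\starttyping
+% extend the default feature set before any font gets loaded
+\definefontfeature
+  [default]
+  [default]
+  [protrusion=quality,
+   expansion=quality]
+
+% hz enables expansion, hanging enables protrusion
+\setupalign
+  [hz,hanging]
+
+\setupbodyfont
+  [dejavu]
+\stoptyping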
+
+In \LUATEX\ the parbuilder is a mixture of the \PDFTEX\ and \OMEGA\ builders and
+adapted to the fact that we have split the hyphenation, ligature building,
+kerning and breaking a paragraph into lines. The protrusion and expansion code is
+still there but for a few years already I have had alternative code for \LUATEX\ that
+simplifies the implementation and could in principle give a speed boost as well
+but till now we never found time to adapt the engine. Take the following test code:
+
+\ifdefined\tufte \else \let\tufte\relax \fi
+
+\starttyping
+\testfeatureonce{100}{\setbox0\hbox{\tufte \par}} \tufte \par
+\stoptyping
+
+In \MKIV\ we use \LUA\ for doing fonts so when we measure this bit we get the
+used time for typesetting our \type {\tufte} quote without breaking it into
+lines. A normal \LUATEX\ run needs 0.80 seconds and a \LUAJITTEX\ run takes 0.47
+seconds. \footnote {All measurements are on a Dell M90 laptop running Windows 8.
+I keep using this machine because it has a decent high res 4:3 screen. It's the
+same machine Luigi Scarso and I used when experimenting with \LUAJITTEX.}
+
+\starttyping
+\testfeatureonce{100}{\setbox0\vbox{\tufte \par}} \tufte \par
+\stoptyping
+
+In this case \LUATEX\ needs 0.80 seconds and \LUAJITTEX\ needs 0.50 seconds and
+as we break the list into lines, we can deduce that close to zero seconds are
+needed to break 100 samples. This (often used) sample text has the interesting
+property that it has many hyphenation points and always gives multiple hyphenated
+lines. So, the parbuilder, if no protrusion and expansion are used, is real fast!
+
+\starttyping
+\startparbuilder[basic]
+    \testfeatureonce{100}{\setbox0\vbox{\tufte \par}} \tufte \par
+\stopparbuilder
+\stoptyping
+
+Here we kick in our \LUA\ version of the par builder. This takes 1.50 seconds for
+\LUATEX\ and 0.90 seconds for \LUAJITTEX. So, \LUATEX\ needs 0.70 seconds to
+break the quote into lines while \LUAJITTEX\ needs 0.43.
If we stick to stock
+\LUATEX, this means that a medium complex paragraph needs 0.007 seconds of \LUA\
+time and that is not a time to be worried about. Of course these
+numbers are not that accurate but the measurements are consistent over multiple
+runs for a specific combination of \LUATEX\ and \MKIV. On a more modern machine
+it's probably also close to zero.
+
+These measurements demonstrate that we should add some nuance to the assumption
+that parbuilding takes time. For this we need to distinguish between traditional
+\TEX\ and \LUATEX. In traditional \TEX\ you build a horizontal box or vertical
+box. In \TEX\ speak these are called horizontal and vertical lists. The main text
+flow is a special case and called the main vertical list, but in this perspective
+you can consider it to be like a vertical box.
+
+Each vertical box is split into lines. These lines are packed into horizontal
+boxes. In traditional \TEX\ constructing a list starts with turning references to
+characters into glyphs and ligatures. Kerns get inserted between characters if
+the font requests that. When a vertical box is split into lines, discretionary
+nodes get inserted (hyphenation) and when font expansion or protrusion is enabled
+extra fonts with expanded dimensions get added.
+
+So, in the case of a vertical box, building the paragraph is not really
+distinguished from ligaturing, kerning and hyphenation which means that the
+timing of this process is somewhat fuzzy. Also, after the lines are
+identified some final packing of lines happens and the result gets added to a
+vertical list.
+
+In \LUATEX\ all these stages are split into hyphenation, ligature building,
+kerning, line breaking and finalizing. When the callbacks are not enabled the
+normal machinery kicks in but still the stages are clearly separated. In the case
+of \CONTEXT\ the font ligaturing and kerning get preceded by so called node mode
+font handling.
This means that we have extra steps and there can be even more
+steps before and afterwards. And, hyphenation always happens on the whole list,
+contrary to traditional \TEX\ that interweaves this. Keep in mind that because we
+can box and unbox and in that process add extra text the whole process can get
+repeated several times for the same list. Of course already treated glyphs and
+kerns are normally kept as they are.
+
+So, because in \LUATEX\ the process of splitting into lines is separated we can
+safely conclude that it is real fast. Definitely compared to all the font related
+steps. So, let's go back to the tests and let's do the following:
+
+\starttyping
+\testfeatureonce{1000}{\setbox0\hbox{\tufte}}
+
+\testfeatureonce{1000}{\setbox0\vbox{\tufte}}
+
+\startparbuilder[basic]
+    \testfeatureonce{1000}{\setbox0\vbox{\tufte}}
+\stopparbuilder
+\stoptyping
+
+We've put the text into a macro so that we don't have interference from reading
+files. The test wrapper does the timing. The following measurements are somewhat
+rough but repetition gives similar results.
\footnote {Before and between runs
+we do a garbage collection.}
+
+\starttabulate[|c|c|c|c|c|]
+\HL
+\NC \NC \bf engine \NC \bf method \NC \bf normal \NC \bf hz \NC \NR % comment
+\HL
+\NC 1 \NC luatex \NC tex hbox \NC ~9.64 \NC ~9.64 \NC \NR % baseline font feature processing, hyphenation etc: 9.64
+\NC 2 \NC \NC tex vbox \NC ~9.84 \NC 10.16 \NC \NR % 0.20 linebreak / 0.52 with hz -> 0.32 hz overhead (150pct more)
+\NC 3 \NC \NC lua vbox \NC 17.28 \NC 18.43 \NC \NR % 7.64 linebreak / 8.79 with hz -> 1.15 hz overhead ( 20pct more)
+\HL
+\NC 4 \NC luajittex \NC tex hbox \NC ~6.33 \NC ~6.33 \NC \NR % baseline font feature processing, hyphenation etc: 6.33
+\NC 5 \NC \NC tex vbox \NC ~6.53 \NC ~6.81 \NC \NR % 0.20 linebreak / 0.48 with hz -> 0.28 hz overhead (expected 0.32)
+\NC 6 \NC \NC lua vbox \NC 11.06 \NC 11.81 \NC \NR % 4.53 linebreak / 5.28 with hz -> 0.75 hz overhead
+\HL
+\stoptabulate
+
+In line~1 we see the baseline: hyphenation, processing fonts and hpacking takes
+9.64 seconds. In the second line we see that breaking the 1000 paragraphs costs
+some 0.20 seconds and when expansion is enabled an extra 0.32 seconds is needed.
+This means that expansion takes 150\% more runtime. If we delegate the task to
+\LUA\ we need 7.64 seconds for breaking into lines which cannot be neglected
+but is still ok given the fact that we break 1000 paragraphs. But it is interesting
+to see that our alternative expansion routine only adds 1.15 seconds which is
+less than 20\%. It must be said that the built|-|in method is not that efficient
+by design if only because it started out differently as part of research.
+
+When measured three months later, the numbers for regular \LUATEX\ (at that time
+version 0.77) with the latest \CONTEXT\ were: 8.52, 8.72 and 15.40 seconds for
+the normal run, which demonstrates that we should not draw too many conclusions
+from such measurements. It's the overall picture that matters.
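+
+The overheads are obtained by subtracting rows of the table. As a worked
+example, using only the \LUATEX\ numbers from the table above (the row labels
+are mine):
+
+\starttabulate[|l|l|r|]
+\HL
+\NC \bf step \NC \bf computation \NC \bf seconds \NC \NR
+\HL
+\NC line breaking in \TEX\ \NC $9.84 - 9.64$ \NC 0.20 \NC \NR
+\NC expansion in \TEX\ \NC $10.16 - 9.84$ \NC 0.32 \NC \NR
+\NC line breaking in \LUA\ \NC $17.28 - 9.64$ \NC 7.64 \NC \NR
+\HL
+\stoptabulate
+
+So for the built|-|in parbuilder expansion costs roughly one and a half times
+as much as the line breaking itself, while both are small compared to the 9.64
+seconds of font processing and hyphenation.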
+
+As with earlier timings, if we use \LUAJITTEX\ we see that the runtime of \LUA\
+is much lower (due to the virtual machine). Of course we're still 20 times slower
+than the built|-|in method but only 10 times slower when we use expansion. To put
+these numbers in perspective: 5 seconds for 1000 paragraphs.
+
+\starttyping
+\setupbodyfont[dejavu]
+
+\starttext
+    \dontcomplain \dorecurse{1000}{\tufte\par}
+\stoptext
+\stoptyping
+
+This results in 295 pages in the default layout and takes 17.8 seconds or 16.6
+pages per second. Expansion is not enabled.
+
+\starttyping
+\starttext
+\startparbuilder[basic]
+    \dontcomplain \dorecurse{1000}{\tufte\par}
+\stopparbuilder
+\stoptext
+\stoptyping
+
+That one takes 24.7 seconds and runs at 11.9 pages per second. This is indeed
+slower but on a bit more modern machine I expect better results. We should also
+realize that with Dejavu being a relatively large font a difficult paragraph like
+the tufte example gives overfull boxes which in turn is an indication that quite
+some alternative breaks are tried.
+
+When typeset with Latin Modern we don't get overfull boxes and it is interesting
+that the native method needs less time (15.9 seconds or 14.1 pages per second)
+while the \LUA\ variant also runs a bit faster: 23.4 seconds or 9.5 pages per second. The
+number of pages is 223 because this font is smaller by design.
+
+When we disable hyphenation the Dejavu variant takes 16.5 (instead of 17.8)
+seconds and 23.1 (instead of 24.7) seconds for \LUA, so this process is not that
+demanding.
+
+For typesetting so many paragraphs without anything special it makes no sense to
+bother with using a \LUA\ based parbuilder. I must admit that I never had to typeset
+novels so all my 300 page runs take much longer anyway. When at some point
+we introduce alternative parbuilding to \CONTEXT, the speed penalty is probably
+acceptable.
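+
+The page rates quoted for these runs are just the page count divided by the
+runtime. As a check on the two Dejavu runs, native parbuilder first and the
+\LUA\ variant second:
+
+\startformula
+\frac{295}{17.8} \approx 16.6
+\qquad
+\frac{295}{24.7} \approx 11.9
+\stopformula
+
+Both are in pages per second.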
+
+Just to indicate that predictions are fuzzy: when we put a \type {\blank} between
+the paragraphs we end up with 313 pages and the traditional method takes 18.3 seconds
+while \LUA\ needs 23.6 seconds. One reason for this is that the whitespace is
+also handled by \LUA\ and in the pagebuilder we do some finalizing, so we
+suddenly get interference of other processes (as well as the garbage collector).
+Again an indication that we should not bother too much about speed. I try to make
+sure that the \LUA\ (as well as \TEX) code is reasonably efficient, so in
+practice it's the document style that is a more important factor than the
+parbuilder, be it the traditional one or the \LUA\ variant.
+
+\stopsubject
+
+\startsubject[title=Copying boxes]
+
+As soon as in \CONTEXT\ you start enhancing the page with headers and footers and
+backgrounds you will see that the pps rate drops. This is partly due to the fact
+that suddenly quite some macro expansion takes place in order to check what needs
+to happen (like font and color switches, offsets, overlays etc). But what has
+more impact is that we might end up with copying boxes and that takes time. Also,
+by wrapping and repackaging boxes, we add additional levels of recursion in
+postprocessing code.
+
+\stopsubject
+
+\startsubject[title=Macro expansion]
+
+Taco and I once calculated that \MKII\ spends some 4\% of the time in accessing
+the hash table. This is a clear indication that quite some macro expansion goes
+on. Due to the fact that when I rewrote \MKII\ into \MKIV\ I no longer had to
+take memory and other limitations into account, the codebase looks quite
+different. There we do have more expansion in the mechanism that deals with
+settings but the body of macros is much smaller and fewer parameters are passed.
+So, the overall performance is better.
+
+\stopsubject
+
+\startsubject[title=Fonts]
+
+Using a font has several aspects. First you have to define an instance.
Then, when
+you use it for the first time, the font gets loaded from storage, initialized and
+is passed to \TEX. All these steps are quite optimized. If we process the following
+file:
+
+\starttyping
+\setupbodyfont[dejavu]
+
+\starttext
+    regular, {\it italic}, {\bf bold ({\bi italic})} and $m^a_th$
+\stoptext
+\stoptyping
+
+we get reported:
+
+\starttabulate[||T|]
+\NC \type{loaded fonts} \NC xits-math.otf xits-mathbold.otf \NC \NR
+\NC \NC dejavuserif-bold.ttf dejavuserif-bolditalic.ttf \NC \NR
+\NC \NC dejavuserif-italic.ttf dejavuserif.ttf \NC \NR
+\NC \type{fonts load time} \NC 0.374 seconds \NR
+\NC \type{runtime} \NC 1.014 seconds, 0.986 pages/second \NC \NR
+\stoptabulate
+
+So, six fonts are loaded and because XITS is used we also preload the math bold
+variant. Loading of text fonts is delayed but in order to initialize math we need to
+preload the math fonts.
+
+If we don't define a bodyfont, a default set gets loaded: Latin Modern. In that
+case we get:
+
+\starttabulate[||T|]
+\NC \type{loaded fonts} \NC latinmodern-math.otf \NC \NR
+\NC \NC lmroman10-bolditalic.otf lmroman12-bold.otf \NC \NR
+\NC \NC lmroman12-italic.otf lmroman12-regular.otf \NC \NR
+\NC \type{fonts load time} \NC 0.265 seconds \NR
+\NC \type{runtime} \NC 0.874 seconds, 1.144 pages/second \NC \NR
+\stoptabulate
+
+Before we had native \OPENTYPE\ Latin Modern math fonts, it took slightly longer
+because we had to load many small \TYPEONE\ fonts and assemble a virtual math font.
+
+As soon as you start mixing more fonts and/or load additional weights and styles
+you will see these times increase. But if you use an already loaded font with
+a different featureset or scaled differently, the burden is rather low. It is
+safe to say that at this moment loading fonts is not a bottleneck.
+
+Applying fonts can be more demanding. For instance if you typeset Arabic or
+Devanagari the amount of node and font juggling definitely influences the total
+runtime.
As the code is rather optimized there is not much we can do about it.
+It's the price that comes with flexibility. As far as I can tell getting the same
+results with \PDFTEX\ (if possible at all) or \XETEX\ is not taking less time. If
+you've split up your document in separate files you will seldom run more than a
+dozen pages which is then still bearable.
+
+If you are for instance typesetting a dictionary like document, it does not make
+sense to do all font switches by switching body fonts. Just defining a couple of
+font instances makes more sense and comes at no cost. As the font code is already quite
+efficient given the complexity, you should not expect impressive speedups in this
+area.
+
+\stopsubject
+
+\startsubject[title=Manipulations]
+
+The main manipulation that I have to do is to process \XML\ into something
+readable. Using the built||in parser and mapper already has some advantages
+and if applied in the right way it's also rather efficient. The more you restrict
+your queries, the better.
+
+Text manipulations using \LUA\ are often quite fast and seldom the reason for
+seeing slow processing. You can do lots of things at the \LUA\ end and still have
+all the \CONTEXT\ magic by using the \type {context} namespace and function.
+
+\stopsubject
+
+\startsubject[title=Multipass]
+
+You can try to save 1 second on a 20 second run but that is not that impressive
+if you need to process the document three times in order to get your cross
+references right. Okay, you'd save 3 seconds, but to get the result you still need
+close to 60 seconds (unless you already have run the document before). If you have a
+predictable workflow you might know in advance that you only need two runs, in
+which case you can enforce that with \type {--runs=2}. Furthermore you can try to
+optimize the style by getting rid of redundant settings and inefficient font
+switches.
But no matter what we optimize, unless we have a document with no cross
+references, sectioning and positioning, you often end up with the extra run,
+although \CONTEXT\ tries to minimize the number of runs needed.
+
+\stopsubject
+
+\startsubject[title=Trial runs]
+
+Some mechanisms, like extreme tables, need multiple passes and all but the last
+one are tagged as trial runs. Because in many cases only dimensions matter, we
+can disable some time consuming code in such cases. For instance, at some point
+Alan Braslau and I found out that the new chemical manual ran real slow, mainly
+due to the tens of thousands of \METAPOST\ graphics. Adding support for trial
+runs to the chemical structure macros gave a fourfold improvement. The manual is
+still a slow|-|runner, but that is simply because it has so many runtime
+generated graphics.
+
+\stopsubject
+
+\stopsection
+
+\startsection[title=The \METAPOST\ library]
+
+When the \METAPOST\ library got included we saw a drastic speedup in processing
+documents with lots of graphics. However, when \METAPOST\ got a different number
+system (native, double and decimal) the changed memory model immediately led to
+a slowdown. On one 150 page manual with a graphic on each page I saw the
+\METAPOST\ runtime go up from about half a second to more than 5 seconds. In
+this case I was able to rewrite some core \METAFUN\ macros to better suit the new
+model, but you might not be so lucky. So more careful coding is needed. Of course
+if you only have a few graphics, you can just ignore the change.
+
+\stopsection
+
+\startsection[title=The \LUA\ interpreter]
+
+Where the \TEX\ part of \LUATEX\ is compiled, the \LUA\ code gets interpreted,
+converted into bytecode, and run by the virtual machine. \LUA\ is by design quite
+portable, which means that the virtual machine is not optimized for a specific
+target.
The \LUAJIT\ interpreter on the other hand is written in assembler and
+available for only some platforms, but the virtual machine is about twice as
+fast. The just||in||time part of \LUAJIT\ is not of much help and even can slow
+down processing.
+
+When we moved from \LUA~5.1 to 5.2 we found out that there was some speedup but
+it's hard to say why. There have been changes in the way strings are dealt with
+(\LUA\ hashes strings) and we use lots of strings, really lots. There have been
+changes in the garbage collection and during a run lots of garbage needs to be
+collected. There are some fundamental changes in so called environments and who
+knows what impact that has.
+
+If you ever tried to measure the performance of \LUA, you probably have noticed
+that it is quite fast. This means that it makes no sense to optimize code that
+gets visited only occasionally. But some of the \CONTEXT\ code gets exercised a
+lot, for instance all code that deals with fonts. We use attributes a lot and
+checking them is for good reason not the fastest code. But given the often
+advanced functionality that it makes possible we're willing to pay the price.
+It's also functionality that you seldom need all at the same time and for
+straightforward text only documents all that code is never executed.
+
+When writing \TEX\ or \LUA\ code I spend a lot of time making it as efficient as
+possible in terms of performance and memory usage. The sole reason for that is
+that we happen to process documents where a lot of functionality is combined, so
+if many small speed||ups accumulate to a noticeable performance gain it's worth
+the effort.
+
+So, where does \LUA\ influence runtime? First of all we use \LUA\ to deal with all
+in- and output as well as locating files in the \TEX\ directory structure. Because
+that code is partly shared with the script manager (\type {mtxrun}) it is optimized
+but some more is possible if needed.
It is already not the easiest code to read,
+so I don't want to introduce even more obscurity.
+
+Quite some code deals with loading, preparing and caching fonts. That code is
+mostly optimized for memory usage although speed is also okay. This code is only
+called when a font is loaded for the first time (after an update). After that
+loading is a matter of milliseconds. When a text gets typeset and when fonts are
+processed in so called node mode, depending on the script and|/|or enabled
+features, a substantial amount of time is spent in \LUA. There is still some
+complex code dealing with inserting kerns but future \LUATEX\ will carry kerning
+in the glyph node so there we can gain some runtime.
+
+If a page has 4000 characters and if font features as well as other manipulations
+demand 10 runs over the text, we have 40.000 checks of nodes and potential
+actions. Each involves an id check, maybe a subtype check, maybe some attribute
+checking and possibly some action. So, if we have 200.000 (or more) function
+calls to the \TEX\ end per page it might add up to a lot. Around the time that we
+went to \LUA~5.2 and played with \LUAJITTEX, the node accessors have been sped
+up. This gave indeed a measurable speedup but not on an average document, only on
+the more extreme documents or features. Because the \MKIV\ \LUA\ code goes from
+experimental to production to final, some improvements are made in the process
+but there is not much to gain there. We just have to wait till computers get
+faster, \CPU\ caches get bigger, branch prediction improves, floating point
+calculations take less time, memory is speedy, and flash storage is the standard.
+
+The \LUA\ code is plugged into the \TEX\ machinery via callbacks. For
+instance each time a box is built several callbacks are triggered, even if it's
+an empty box or just an extra wrapper.
Take for instance this:
+
+\starttyping
+\hbox \bgroup
+    \hskip \zeropoint
+    \hbox \bgroup
+        test
+    \egroup
+    \hskip \zeropoint
+\egroup
+\stoptyping
+
+Of course you won't come up with this code as it doesn't do much good but macros
+that you use can definitely produce this. For instance, the zero skip can be a
+left and right margin that happen to be zero. For 10.000 iterations I measured 0.78
+seconds while the next one takes 0.62 seconds:
+
+\starttyping
+\hbox \bgroup
+    \hbox \bgroup
+        test
+    \egroup
+\egroup
+\stoptyping
+
+Why is this? One reason is that a zero skip results in a node and the more nodes
+we have the more memory (de)allocation takes place and the more nodes in the list
+need to be checked. Of course the relative difference is less when we have more
+text. So how can we improve this? The following variant, at the cost of some
+testing, takes just as much time.
+
+\starttyping
+\hbox \bgroup
+    \hbox \bgroup
+        \scratchdimen\zeropoint
+        \ifdim\scratchdimen=\zeropoint\else\hskip\scratchdimen\fi
+        test
+        \ifdim\scratchdimen=\zeropoint\else\hskip\scratchdimen\fi
+    \egroup
+\egroup
+\stoptyping
+
+As does this one, but the longer the text, the slower it gets as one of the two
+copies needs to be skipped.
+
+\starttyping
+\hbox \bgroup
+    \hbox \bgroup
+        \scratchdimen\zeropoint
+        \ifdim\scratchdimen=\zeropoint
+            test%
+        \else
+            \hskip\scratchdimen
+            test%
+            \hskip\scratchdimen
+        \fi
+    \egroup
+\egroup
+\stoptyping
+
+Of course most speedup is gained when we don't package at all, so if we test
+before we package, but such an optimization is seldom realistic because much more
+goes on and we cannot check for everything. Also, 10.000 is a lot while 0.10
+seconds is something we can live with.
+By the way, compare the following
+
+\starttyping
+\hbox \bgroup
+    \hskip\zeropoint
+    test%
+    \hskip\zeropoint
+\egroup
+
+\hbox \bgroup
+    \kern\zeropoint
+    test%
+    \kern\zeropoint
+\egroup
+\stoptyping
+
+The first variant is less efficient than the second one, because a skip
+effectively is a glue node pointing to a specification node while a kern is just
+a simple node with the width stored in it. \footnote {On the \LUATEX\ agenda is
+moving the glue spec into the glue node.} I must admit that I seldom remember
+to use kerns instead of skips when possible, if only because one needs to be sure
+to be in the right mode, horizontal or vertical, so additional commands might be
+needed.
+
+\stopsection
+
+\startsection[title=Macros]
+
+Are macros a bottleneck? In practice not really. Of course we have optimized the
+core \CONTEXT\ macros pretty well, but one reason for that is that we have a
+rather extensive set of configuration and definition mechanisms that rely heavily
+on inheritance. Where possible all that code is written in a way that macro
+expansion won't hurt too much. Because of this users themselves can be more
+liberal in coding. There is a lot going on deep down and if you turn on macro
+tracing you can be horrified. But, not all shown code paths are entered. During the
+move (and rewrite) from \MKII\ to \MKIV\ quite some bottlenecks that resulted from
+limitations of machines and memory have been removed and as a result the macro
+expansion part is somewhat faster, which nicely compensates for the fact that we
+have a more advanced but slower inheritance subsystem. Readability of code and
+speed are probably nicely balanced by now.
+
+Once a macro is read in, its internal representation is pretty efficient. For
+instance references to macro names are just pointers into a hash table. Of
+course, when a macro is seen in your source, that name has to be looked up, but
+that's a fast action.
+Using short names in the running text for instance really
+doesn't speed up processing much. Switching font sets on the other hand does have
+an impact, as then quite some checking happens and the related macros are pretty
+extensive. However, once a font is loaded references to it are pretty fast. Just
+keep in mind that if you define something inside a group, in most cases it gets
+forgotten. So, if you need something more often, just define it at the outer level.
+
+\stopsection
+
+\startsection[title=Optimizing code]
+
+Optimizing only makes sense for code that is called very often or when the
+problem to solve is demanding. An example of code that gets run often is page
+building, where we pack together many layout elements. Font switches can also be
+time consuming, if defined wrong. These can happen for instance for formulas,
+marked words, cross references, margin notes, footnotes (often a complete
+bodyfont switch), table cells, etc. Yet another is clever vertical spacing that
+happens between structural elements. All these mechanisms are reasonably
+optimized.
+
+I can safely say that deep down \CONTEXT\ is not that inefficient, given what it
+has to do. But when a style for instance does redundant or unnecessary massive
+font switches you are wasting runtime. I dare to say that instead of trying to
+speed up code (for instance by redefining macros) you had better spend the time
+on making styles efficient. For instance having 10 \type {\blank}'s in a row
+will work out rather well but takes time. If you know that a section head has no
+raised or lowered text and no math, you can consider using \type {\definefont} to
+define the right size (especially if it is a special size) instead of defining
+an extra bodyfont size and switching to that, as it includes setting up related
+sizes and math.
+
+It might sound like using \LUA\ for some tasks makes \CONTEXT\ slower, but this
+is not true.
+Of course it's hard to prove because by now we also have more
+advanced font support, cleaner math mechanisms, additional features, especially
+in structure related mechanisms, etc. There are also mechanisms that are
+faster, for instance extreme tables (a follow up on natural tables) and mixed
+column modes. Of course on the previously mentioned 300 pages of simple paragraphs
+with simple Latin text the \PDFTEX\ engine is much faster than \LUATEX, also
+because simple fonts are used. But for many of today's documents this engine is no
+longer an option. For instance in our \XML\ processing in multiple languages,
+\LUATEX\ beats \PDFTEX. There is not that much left to optimize, so most speedup
+has to come from faster machines. And this is not much different from the past:
+processing a 300 page document on a 4.7MHz 8086 architecture was not much fun and
+we're not even talking of advanced macros here. Faster machines made more clever
+and user friendly systems possible but at the cost of runtime, so even if
+machines have become many times faster, processing still takes time. On the other
+hand, \CONTEXT\ will not become more complex than it is now, so from now on we
+can benefit from faster \CPU's, memory and storage.
+
+\stopsection
+
+\stopchapter
+
+\stopcomponent