Diffstat (limited to 'doc/context/sources/general/manuals/hybrid/hybrid-parbuilder.tex')
-rw-r--r-- | doc/context/sources/general/manuals/hybrid/hybrid-parbuilder.tex | 340 |
1 files changed, 340 insertions, 0 deletions
diff --git a/doc/context/sources/general/manuals/hybrid/hybrid-parbuilder.tex b/doc/context/sources/general/manuals/hybrid/hybrid-parbuilder.tex
new file mode 100644
index 000000000..06519b2fb
--- /dev/null
+++ b/doc/context/sources/general/manuals/hybrid/hybrid-parbuilder.tex
@@ -0,0 +1,340 @@
+% language=uk
+
+\startcomponent hybrid-parbuilder
+
+\startbuffer[MyAbstract]
+\StartAbstract
+    In this article I will summarize some experiences with converting the \TEX\
+    par builder to \LUA. In due time there will be a plugin mechanism in
+    \CONTEXT, and this is a prelude to that.
+\StopAbstract
+\stopbuffer
+
+\doifmodeelse {tugboat} {
+    \usemodule[tug-01,abr-01]
+    \setvariables
+      [tugboat]
+      [columns=yes]
+    \setvariables
+      [tugboat]
+      [year=2010,
+       volume=99,
+       number=9,
+       page=99]
+    \setvariables
+      [tugboat]
+      [title=Building paragraphs,
+       subtitle=,
+       keywords=,
+       author=Hans Hagen,
+       address=PRAGMA ADE\\Ridderstraat 27\\8061GH Hasselt NL,
+       email=pragma@wxs.nl]
+    %
+    % we use a buffer as abstracts themselves are buffers and
+    % inside macros we lose line endings and such
+    \getbuffer[MyAbstract]
+    %
+    \StartArticle
+} {
+    \environment hybrid-environment
+    \startchapter[title={Building paragraphs}]
+}
+
+\startsection [title={Introduction}]
+
+You enter the den of the Lion when you start messing around with the parbuilder.
+Actually, as \TEX\ does a pretty good job on breaking paragraphs into lines I
+never really looked into the code that does it all. However, the Oriental \TEX\
+project kind of forced it upon me. In the chapter about font goodies an optimizer
+is described that works per line. This method is somewhat similar to expansion
+level~one support (hz) in the sense that it acts independently of the par builder:
+the split off (best) lines are postprocessed. Where expansion involves horizontal
+scaling, the goodies approach does with (Arabic) words what the original HZ
+approach does with glyphs.
+
+It would be quite some challenge (at least for me) to come up with solutions that
+look at the whole paragraph and as the per-line approach works quite well, there
+is no real need for an alternative. However, in September 2008, when we were
+exploring solutions for Arabic par building, Taco converted the parbuilder into
+\LUA\ code and stripped away all code related to hyphenation, protrusion,
+expansion, last line fitting, and some more. As we had enough on our plate at
+that time, we never got around to really testing it. There was even less reason to
+explore this route because in the Oriental \TEX\ project we decided to follow the
+\quotation {use advanced \OPENTYPE\ features} route which in turn led to the
+\quote {replace words in lines by narrower or wider variants} approach.
+
+However, as the code was lying around and as we want to explore further I
+decided to pick up the parbuilder thread. In this chapter some experiences will
+be discussed. The following story is as much Taco's as mine.
+
+\stopsection
+
+\startsection [title={Cleaning up}]
+
+In retrospect, we should not have been too surprised that the first approximation
+was broken in many places, and for good reason. The first version of the code was
+a conversion of the \CCODE\ code that in turn was a conversion from the original
+interwoven \PASCAL\ code. That first conversion still looked quite \CCODE||ish
+and carried interesting bits and pieces of \CCODE||macros, \CCODE||like pointer
+tests, interesting magic constants and more.
+
+When I took the code and \LUA-fied it nearly every line was changed and it took
+Taco and me a bit of reverse engineering to sort out all problems (thank you
+Skype). Why was it not an easy task? There are good reasons for this.
+
+\startitemize
+
+\startitem The parbuilder (and related hpacking) code is derived from traditional
+\TEX\ and has bits of \PDFTEX, \ALEPH\ (\OMEGA), and of course \LUATEX.
+\stopitem
+
+\startitem The advocated approach to extending \TEX\ has been to use change files
+which means that a coder does not see the whole picture. \stopitem
+
+\startitem Originally the code is programmed in the literate way which means that
+the resulting functions are built stepwise. However, the final functions can (and
+have) become quite large. Because \LUATEX\ uses the woven (merged) code, we indeed
+have large functions. Of course this relates to the fact that successive \TEX\
+engines have added functionality. Eventually the source will be webbed again, but
+in a more sequential way. \stopitem
+
+\startitem This is normally no big deal, but the \ALEPH\ (\OMEGA) code has added
+a level of complexity due to directional processing and additional begin and end
+related boxes. \stopitem
+
+\startitem Also the \ETEX\ extension that deals with last line fitting is
+interwoven and uses goto's for the control flow. Fortunately the extensions are
+driven by parameters, which makes the related code sections easy to recognize.
+\stopitem
+
+\startitem The \PDFTEX\ protrusion extension adds code to glyph handling and
+discretionary handling. The expansion feature does that too and in addition also
+messes around with kerns. Extra parameters are introduced (and adapted) that
+influence the decisions for breaking lines. There is also code originating in
+\PDFTEX\ which deals with poor man's grid snapping although that is quite isolated
+and not interwoven. \stopitem
+
+\startitem Because it uses a slightly different way to deal with hyphenation,
+\LUATEX\ itself also adds some code. \stopitem
+
+\startitem Tracing is sort of interwoven in the code. As it uses goto's to share
+code instead of functions, one needs to keep a good eye on what gets skipped or
+not. \stopitem
+
+\stopitemize
+
+I'm pretty sure that the code that we started with looks quite different from the
+original \TEX\ code if it had been translated into \CCODE.
+Actually, in modern \TEX, compiling involves a translation into \CCODE\ first but
+the intermediate form is not meant for human eyes. As the \LUATEX\ project started
+from that merged code, Taco and Hartmut already spent quite some time on making it
+more readable. Of course the original comments are still there.
+
+Cleaning up such code takes a while. Because both languages are similar but also
+quite different it took some time to get compatible output. Because the \CCODE\
+code uses macros, careful checking was needed. Of course \LUA's table model and
+local variables brought some work as well. And still the code looks a bit
+\CCODE||ish. We could not deviate too much from the original model simply because
+it's well documented.
+
+When moving code around, redundant tests and orphan code have been removed. Future
+versions (or variants) might well look quite different as I want more hooks and
+clearly split stages, and want to convert some linked list based mechanisms to
+\LUA\ tables. On the other hand, as already much code has been written for
+\CONTEXT\ \MKIV, making it all reasonably fast was no big deal.
+
+\stopsection
+
+\startsection [title={Expansion}]
+
+The original \CCODE||code related to protrusion and expansion is not that
+efficient as many (redundant) function calls take place in the linebreaker and
+packer. As most work related to fonts is done in the backend, we can simply stick
+to width calculations here. Also, it is no problem at all that we use floating
+point calculations (as \LUA\ has only floats). The final result will look okay as
+the original hpack routine will nicely compensate for rounding errors as it will
+normally distribute the content well enough. We are currently compatible with the
+regular par builder and protrusion code, but expansion gives different results
+(actually not worse).
+
+The \LUA\ hpacker follows a different approach. And let's admit it: most \TEX ies
+won't see the difference anyway.
+As long as we're cross platform compatible it's
+fine.
+
+It is a well known fact that character expansion slows down the parbuilder. There
+are good reasons for this in the \PDFTEX\ approach. Each glyph and intercharacter
+kern is checked a few times for stretch or shrink using a function call. Also
+each font reference is checked. This is a side effect of the way the \PDFTEX\
+backend works as there each variant has its own font. However, in \LUATEX, we scale
+inline and therefore don't really need the fonts. Even better, we can get rid of
+all that testing and only need to pass the eventual \type {expansion_ratio} so
+that the backend can do the right scaling. We will prototype this in the \LUA\
+version \footnote {For this Hartmut has adapted the backend code to honour
+this field in the glyph and kern nodes.} and if we feel confident about this
+approach it will be backported into the \CCODE\ code base. So eventually the
+\CCODE\ might become a bit more readable and efficient.
+
+Intercharacter kerning is dealt with in a somewhat strange way. If a kern of
+subtype zero is seen, and if its neighbours are glyphs from the same font, the
+kern gets replaced by a scaled one looked up in the font's kerning table. In the
+parbuilder no real replacement takes place but as each line ends up in the hpack
+routine (where all work is simply duplicated and done again) it really gets
+replaced there. When discussing the current approach we decided that manipulating
+intercharacter kerns while leaving regular spacing untouched is not really a
+good idea so there will be an extra level of configuration added to \LUATEX:
+\footnote {As I more and more run into books typeset (not by \TEX) with a
+combination of character expansion and additional intercharacter kerning I've
+been seriously thinking of removing support for expansion from \CONTEXT\ \MKIV.
+Not all is progress especially if it can be abused.}
+
+\starttabulate
+\NC 0 \NC no character and kern expansion \NC \NR
+\NC 1 \NC character and kern expansion applied to complete lines \NC \NR
+\NC 2 \NC character and kern expansion as part of the par builder \NC \NR
+\NC 3 \NC only character expansion as part of the par builder (new) \NC \NR
+\stoptabulate
+
+You might wonder what happens when you unbox such a list: the original font
+references have been replaced as were the kerns. However, when repackaged again,
+the kerns are replaced again. In traditional \TEX, indeed rekerning might happen
+when a paragraph is repackaged (as different hyphenation points might be chosen
+and ligature rebuilding etc.\ has taken place) but in \LUATEX\ we have clearly
+separated stages. An interesting side effect of the conversion is that we really
+have to wonder what certain code does and if it's still needed.
+
+\stopsection
+
+\startsection [title={Performance}]
+
+% timeit context ...
+
+We had already noticed that the \LUA\ variant was not that slow. So after the
+first cleanup it was time to do some tests. We used our regular \type {tufte.tex}
+test file. This happens to be a worst case example because each broken line ends
+with a comma or hyphen and these will hang into the margin when protruding is
+enabled. So the solution space is rather large (an example will be shown later).
+
+Here are some timings of the March 26, 2010 version. The test is typeset in a box
+so no shipout takes place. We're talking of 1000 typeset paragraphs. The times
+are in seconds and between parentheses the speed relative to the regular
+parbuilder is given.
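+
+The relative speed is simply the time of a variant divided by the time of the
+native builder in the same row of the table. As a worked example, with symbol
+names that are only chosen here for illustration, the \LUA\ builder on a normal
+paragraph gives:
+
+\startformula
+    \frac{t_{\mathrm{lua}}}{t_{\mathrm{native}}} = \frac{8.4}{1.6} = 5.25 \approx 5.3
+\stopformula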
+
+\startmode[mkiv]
+
+\startluacode
+    local times = {
+        { 1.6,  8.4,  9.8 }, --  6.7 reported in statistics
+        { 1.7, 14.2, 15.6 }, -- 13.4
+        { 2.3, 11.4, 13.3 }, --  9.5
+        { 2.9, 19.1, 21.5 }, -- 18.2
+    }
+
+    local NC, NR, b, format = context.NC, context.NR, context.bold, string.format
+
+    local function v(i,j)
+        if times[i][j]<10 then -- This is a hack. The font that we use has no table
+            context.dummydigit() -- digits (tnum) so we need this hack. Not nice anyway.
+        end
+        context.equaldigits(format("%0.01f",times[i][j]))
+        if j > 1 then
+            context.enspace()
+            context.equaldigits(format("(%0.01f)",times[i][j]/times[i][1]))
+        end
+    end
+
+    context.starttabulate { "|l|c|c|c|" }
+        NC()                 NC() b("native") NC() b("lua") NC() b("lua + hpack") NC() NR()
+        NC() b("normal")     NC() v(1,1)      NC() v(1,2)   NC() v(1,3)           NC() NR()
+        NC() b("protruding") NC() v(2,1)      NC() v(2,2)   NC() v(2,3)           NC() NR()
+        NC() b("expansion")  NC() v(3,1)      NC() v(3,2)   NC() v(3,3)           NC() NR()
+        NC() b("both")       NC() v(4,1)      NC() v(4,2)   NC() v(4,3)           NC() NR()
+    context.stoptabulate()
+\stopluacode
+
+\stopmode
+
+\startnotmode[mkiv]
+
+% for the tugboat article
+
+\starttabulate[|l|c|c|c|]
+\NC                \NC \bf native \NC \bf lua    \NC \bf lua + hpack \NC \NR
+\NC \bf normal     \NC 1.6        \NC 8.4 (5.3)  \NC 9.8 (6.1)       \NC \NR
+\NC \bf protruding \NC 1.7        \NC 14.2 (8.4) \NC 15.6 (9.2)      \NC \NR
+\NC \bf expansion  \NC 2.3        \NC 11.4 (5.0) \NC 13.3 (5.8)      \NC \NR
+\NC \bf both       \NC 2.9        \NC 19.1 (6.6) \NC 21.5 (7.4)      \NC \NR
+\stoptabulate
+
+\stopnotmode
+
+For a regular paragraph the \LUA\ variant (currently) is 5~times slower and about
+6~times when we use the \LUA\ hpacker, which is not that bad given that it's
+interpreted code and that each access to a field in a node involves a function
+call.
+Actually, we can make a dedicated hpacker as some code can be omitted. The
+reason why the protruding is relatively slow is that we have quite some
+protruding characters in the test text (many commas and potential hyphens) and
+therefore we have quite some lookups and calculations. In the \CCODE\ variant
+much of that is inlined by macros.
+
+Will things get faster? I'm sure that I can boost the protrusion code and
+probably the rest as well but it will always be slower than the built in
+function. This is no problem as we will only use the \LUA\ variant for
+experiments and special purposes. For that reason more \MKIV\ like tracing will
+be added (some is already present) and more hooks will be provided once the
+builder is more compartmentalized. Also, future versions of \LUATEX\ will pass
+around paragraph related parameters differently so that will have an impact on the
+code as well.
+
+\stopsection
+
+\startsection[title=Usage]
+
+The basic parbuilder is enabled and disabled as follows:\footnote {I'm not
+sure yet if the parbuilder has to do automatic grouping.}
+
+\startbuffer[example]
+\definefontfeature[example][default][protrusion=pure]
+\definedfont[Serif*example]
+\setupalign[hanging]
+
+\startparbuilder[basic]
+    \startcolor[blue]
+        \input tufte
+    \stopcolor
+\stopparbuilder
+
+\stopbuffer
+
+\typebuffer[example]
+
+\startmode[mkiv]
+    This results in: \par \getbuffer[example]
+\stopmode
+
+There are a few tracing options in the \type {parbuilders} namespace but these
+are not stable yet.
+
+\stopsection
+
+\startsection[title=Conclusion]
+
+The module started working quite well around the time that Peter Gabriel's
+\quotation {Scratch My Back} ended up in my Squeezecenter: modern classical
+interpretations of some of his favourite songs. I must admit that I scratched the
+back of my head a couple of times when looking at the code below.
+It made me
+realize that a new implementation of a known problem indeed can come out quite
+different but at the same time has much in common. As with music it's a matter of
+taste which variant a user likes most.
+
+At the time of this writing there is still work to be done. For instance the
+large functions need to be broken into smaller steps. And of course more testing
+is needed.
+
+\stopsection
+
+\doifmodeelse {tugboat} {
+    \StopArticle
+} {
+    \stopchapter
+}
+
+\stopcomponent