author | Context Git Mirror Bot <phg42.2a@gmail.com> | 2015-10-07 14:15:06 +0200 |
---|---|---|
committer | Context Git Mirror Bot <phg42.2a@gmail.com> | 2015-10-07 14:15:06 +0200 |
commit | ee1c809d23ce322e7946f941545f7e0fa27ae5c6 (patch) | |
tree | 3e32a64b19cf9706e5ff0df289eb56e77571a5ca /doc/context/sources/general/manuals/luatex/luatex-languages.tex | |
parent | 961f357ef202a44da1f4b315c82ef143a6f51497 (diff) | |
2015-10-07 12:05:00
Diffstat (limited to 'doc/context/sources/general/manuals/luatex/luatex-languages.tex')
-rw-r--r-- | doc/context/sources/general/manuals/luatex/luatex-languages.tex | 514 |
1 files changed, 514 insertions, 0 deletions
diff --git a/doc/context/sources/general/manuals/luatex/luatex-languages.tex b/doc/context/sources/general/manuals/luatex/luatex-languages.tex new file mode 100644 index 000000000..56978b0fd --- /dev/null +++ b/doc/context/sources/general/manuals/luatex/luatex-languages.tex @@ -0,0 +1,514 @@ +\environment luatex-style +\environment luatex-logos + +\startcomponent luatex-languages + +\startchapter[reference=languages,title={Languages and characters, fonts and glyphs}] + +\LUATEX's internal handling of the characters and glyphs that eventually become +typeset is quite different from the way \TEX82 handles those same objects. The +easiest way to explain the difference is to focus on unrestricted horizontal mode +(i.e.\ paragraphs) and hyphenation first. Later on, it will be easy to deal +with the differences that occur in horizontal and math modes. + +In \TEX82, the characters you type are converted into \type {char_node} records +when they are encountered by the main control loop. \TEX\ attaches and processes +the font information while creating those records, so that the resulting \quote +{horizontal list} contains the final forms of ligatures and implicit kerning. +This packaging is needed because we may want to get the effective width of for +instance a horizontal box. + +When it becomes necessary to hyphenate words in a paragraph, \TEX\ converts (one +word at a time) the \type {char_node} records into a string array by replacing +ligatures with their components and ignoring the kerning. Then it runs the +hyphenation algorithm on this string, and converts the hyphenated result back +into a \quote {horizontal list} that is consecutively spliced back into the +paragraph stream. Keep in mind that the paragraph may contain unboxed horizontal +material, which then already contains ligatures and kerns and the words therein +are part of the hyphenation process. 
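The \TEX82 step described above (ligatures expanded back into their components, kerns ignored) can be pictured with a small Python model. This is only a sketch: the `Char`, `Ligature` and `Kern` classes and the function name are hypothetical stand-ins for illustration, not \TEX's actual node records.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    char: str  # a single character in some font


@dataclass
class Ligature:
    glyph: str             # name of the ligature glyph, e.g. "<ffi>"
    components: List[str]  # the characters the ligature was built from


@dataclass
class Kern:
    amount: float  # implicit font kern, irrelevant for hyphenation


def word_for_hyphenation(nodes):
    """Rebuild the plain word: expand ligatures, skip kerns."""
    letters = []
    for node in nodes:
        if isinstance(node, Ligature):
            letters.extend(node.components)
        elif isinstance(node, Char):
            letters.append(node.char)
        # Kern nodes contribute nothing to the string.
    return "".join(letters)


nodes = [Char("o"), Ligature("<ffi>", ["f", "f", "i"]), Kern(-0.5), Char("c"), Char("e")]
print(word_for_hyphenation(nodes))
```

The hyphenation algorithm then runs on the resulting string, after which the list has to be rebuilt, ligatures and kerns included.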
+ +The \type {char_node} records are somewhat misnamed, as they are glyph positions +in specific fonts, and therefore not really \quote {characters} in the linguistic +sense. There is no language information inside the \type {char_node} records. +Instead, language information is passed along using \type {language whatsit} +records inside the horizontal list. + +In \LUATEX, the situation is quite different. The characters you type are always +converted into \type {glyph_node} records with a special subtype to identify them +as being intended as linguistic characters. \LUATEX\ stores the needed language +information in those records, but does not do any font|-|related processing at +the time of node creation. It only stores the index of the current font. + +When it becomes necessary to typeset a paragraph, \LUATEX\ first inserts all +hyphenation points right into the whole node list. Next, it processes all the +font information in the whole list (creating ligatures and adjusting kerning), +and finally it adjusts all the subtype identifiers so that the records are \quote +{glyph nodes} from now on. + +That was the broad overview. The rest of this chapter will deal with the minutiae +of the new process. + +\section[charsandglyphs]{Characters and glyphs} + +\TEX82 (including \PDFTEX) differentiates between \type {char_node}s and \type +{lig_node}s. The former are simple items that contained nothing but a \quote +{character} and a \quote {font} field, and they lived in the same memory as +tokens did. The latter also contained a list of components, and a subtype +indicating whether this ligature was the result of a word boundary, and it was +stored in the same place as other nodes like boxes and kerns and glues. + +In \LUATEX, these two types are merged into one, somewhat larger structure called +a \type {glyph_node}. 
Besides having the old character, font, and component +fields, and the new special fields like \quote {attr} +(see~\in{section}[glyphnodes]), these nodes also contain: + +\startitemize + +\startitem A subtype, split into four main types: + + \startitemize + \startitem + \type {character}, for characters to be hyphenated: the lowest bit + (bit 0) is set to 1. + \stopitem + \startitem + \type {glyph}, for specific font glyphs: the lowest bit (bit 0) is + not set. + \stopitem + \startitem + \type {ligature}, for ligatures (bit 1 is set) + \stopitem + \startitem + \type {ghost}, for \quote {ghost objects} (bit 2 is set) + \stopitem + \stopitemize + + The latter two make further use of two extra fields (bits 3 and 4): + + \startitemize + \startitem + \type {left}, for ligatures created from a left word boundary and for + ghosts created from \type {\leftghost} + \stopitem + \startitem + \type {right}, for ligatures created from a right word boundary and + for ghosts created from \type {\rightghost} + \stopitem + \stopitemize + + For ligatures, both bits can be set at the same time (in case of a + single|-|glyph word). + +\stopitem + +\startitem + \type {glyph_node}s of type \quote {character} also contain language data, + split into four items that were current when the node was created: the + \type {\setlanguage} (15 bits), \type {\lefthyphenmin} (8 bits), \type + {\righthyphenmin} (8 bits), and \type {\uchyph} (1 bit). +\stopitem + +\stopitemize + +Incidentally, \LUATEX\ allows 16383 separate languages, and words can be 256 +characters long. + +The new primitive \type {\hyphenationmin} can be used to signal the minimal length +of a word. This value is stored with the (current) language. + +Because the \type {\uchyph} value is saved in the actual nodes, its handling is +subtly different from \TEX82: changes to \type {\uchyph} become effective +immediately, not at the end of the current partial paragraph. 
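The subtype bits listed earlier in this section can be read as a small bit field. The following Python sketch decodes such a value. The constant names and the function are assumptions made for illustration; only the bit positions come from the text above.

```python
# Bit positions as described in the text; the names are hypothetical.
CHARACTER = 1 << 0  # set: character to be hyphenated; clear: font glyph
LIGATURE = 1 << 1
GHOST = 1 << 2
LEFT = 1 << 3       # left word boundary or \leftghost
RIGHT = 1 << 4      # right word boundary or \rightghost


def describe_subtype(subtype):
    """Return the names of the flags that are set in a glyph subtype."""
    names = ["character" if subtype & CHARACTER else "glyph"]
    if subtype & LIGATURE:
        names.append("ligature")
    if subtype & GHOST:
        names.append("ghost")
    if subtype & LEFT:
        names.append("left")
    if subtype & RIGHT:
        names.append("right")
    return names


# A ligature spanning a single-glyph word has both boundary bits set.
print(describe_subtype(LIGATURE | LEFT | RIGHT))
```

Treating bit 0 as a plain character/glyph switch is one reading of the list above; the engine itself may combine the bits differently.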
+ +Typeset boxes now always have their language information embedded in the nodes +themselves, so there is no longer a possible dependency on the surrounding +language settings. In \TEX82, a mid-paragraph statement like \type {\unhbox0} would +process the box using the current paragraph language unless there was a +\type {\setlanguage} issued inside the box. In \LUATEX, all language variables are +already frozen. + +\section{The main control loop} + +In \LUATEX's main loop, almost all input characters that are to be typeset are +converted into \type {glyph} node records with subtype \quote {character}, but +there are a few exceptions. + +First, the \type {\accent} primitive creates nodes with subtype \quote {glyph} +instead of \quote {character}: one for the actual accent and one for the +accentee. The primary reason for this is that \type {\accent} in \TEX82 is +explicitly dependent on the current font encoding, so it would not make much +sense to attach a new meaning to the primitive's name, as that would invalidate +many old documents and macro packages. A secondary reason is that in \TEX82, +\type {\accent} prohibits hyphenation of the current word. Since in \LUATEX\ +hyphenation only takes place on \quote {character} nodes, it is possible to +achieve the same effect. + +This change of meaning did happen with \type {\char}, that now generates \quote +{glyph} nodes with a character subtype. In traditional \TEX\ there was a strong +relationship between the 8|-|bit input encoding, hyphenation and glyphs taken +from a font. In \LUATEX\ we have \UTF\ input, and in most cases this maps +directly to a character in a font, apart from glyph replacement in the font +engine. If you want to access arbitrary glyphs in a font directly you can always +use \LUA\ to do so, because fonts are available as \LUA\ tables. + +Second, all the results of processing in math mode eventually become nodes with +\quote {glyph} subtypes. 
+ +Third, the \ALEPH|-|derived commands \type {\leftghost} and \type {\rightghost} +create nodes of a third subtype: \quote {ghost}. These nodes are ignored +completely by all further processing until the stage where inter|-|glyph kerning +is added. + +Fourth, automatic discretionaries are handled differently. \TEX82 inserts an +empty discretionary after sensing an input character that matches the \type +{\hyphenchar} in the current font. This test is wrong, in our opinion: whether or +not hyphenation takes place should not depend on the current font, it is a +language property. + +In \LUATEX, it works like this: if \LUATEX\ senses a string of input characters +that matches the value of the new integer parameter \type {\exhyphenchar}, it will +insert an explicit discretionary after that series of nodes. Initex sets the \type +{\exhyphenchar=`\-}. Incidentally, this is a global parameter instead of a +language-specific one because it may be useful to change the value depending on +the document structure instead of the text language. + +The insertion of discretionaries after a sequence of explicit hyphens happens at +the same time as the other hyphenation processing, {\it not\/} inside the main +control loop. + +The only use \LUATEX\ has for \type {\hyphenchar} is in the check whether a word +should be considered for hyphenation at all. If the \type {\hyphenchar} of the font +attached to the first character node in a word is negative, then hyphenation of +that word is abandoned immediately. {\bf This behavior is added for backward +compatibility only, and \type {\hyphenchar=-1} should not be used as a means of +preventing hyphenation in new \LUATEX\ documents.} + +Fifth, \type {\setlanguage} no longer creates whatsits. The meaning of \type +{\setlanguage} is changed so that it is now an integer parameter like all others. +That integer parameter is used in \type {glyph_node} creation to add language +information to the glyph nodes. 
In conjunction, the \type {\language} primitive is +extended so that it always also updates the value of \type {\setlanguage}. + +Sixth, the \type {\noboundary} command (this command prohibits word boundary +processing where that would normally take place) now does create whatsits. These +whatsits are needed because the exact place of the \type {\noboundary} command in +the input stream has to be retained until after the ligature and font processing +stages. + +Finally, there is no longer a \type {main_loop} label in the code. Remember that +\TEX82 did quite a lot of processing while adding \type {char_nodes} to the +horizontal list? For speed reasons, it handled that processing code outside of +the \quote {main control} loop, and only the first character of any \quote {word} +was handled by that \quote {main control} loop. In \LUATEX, there is no longer a +need for that (all hard work is done later), and the (now very small) bits of +character|-|handling code have been moved back inline. When \type +{\tracingcommands} is on, this is visible because the full word is reported, +instead of just the initial character. + +\section[patternsexceptions]{Loading patterns and exceptions} + +The hyphenation algorithm in \LUATEX\ is quite different from the one in \TEX82, +although it uses essentially the same user input. + +After expansion, the argument for \type {\patterns} has to be proper \UTF8 with +individual patterns separated by spaces, no \type {\char} or \type {\chardef}d +commands are allowed. The current implementation is even more strict, and will +reject all non|-|\UNICODE\ characters, but that will be changed in the future. +For now, the generated errors are a valuable tool in discovering font-encoding +specific pattern files. 
+ +Likewise, the expanded argument for \type {\hyphenation} also has to be proper +\UTF8, but here a tiny little bit of extra syntax is provided: + +\startitemize[n] +\startitem + Three sets of arguments in curly braces (\type {{}{}{}}) indicates a desired + complex discretionary, with arguments as in \type {\discretionary}'s command in + normal document input. +\stopitem +\startitem + A \type {-} indicates a desired simple discretionary, cf.\ \type {\-} and \type + {\discretionary{-}{}{}} in normal document input. +\stopitem +\startitem + Internal command names are ignored. This rule is provided especially for \type + {\discretionary}, but it also helps to deal with \type {\relax} commands that + may sneak in. +\stopitem +\startitem + An \type {=} indicates a (non|-|discretionary) hyphen in the document input. +\stopitem +\stopitemize + +The expanded argument is first converted back to a space-separated string while +dropping the internal command names. This string is then converted into a +dictionary by a routine that creates key|-|value pairs by converting the other +listed items. It is important to note that the keys in an exception dictionary +can always be generated from the values. Here are a few examples: + +\starttabulate[|l|l|l|] +\NC \ssbf value \NC \ssbf implied key (input) \NC \ssbf effect \NC\NR +\NC \type {ta-ble} \NC table \NC \type {ta\-ble} ($=$ \type {ta\discretionary{-}{}{}ble}) \NC\NR +\NC \type {ba{k-}{}{c}ken} \NC backen \NC \type {ba\discretionary{k-}{}{c}ken} \NC\NR +\stoptabulate + +The resultant patterns and exception dictionary will be stored under the language +code that is the present value of \type {\language}. + +In the last line of the table, you see there is no \type {\discretionary} command +in the value: the command is optional in the \TEX-based input syntax. 
The +underlying reason for that is that it is conceivable that a whole dictionary of +words is stored as a plain text file and loaded into \LUATEX\ using one of the +functions in the \LUA\ \type {lang} library. This loading method is quite a bit +faster than going through the \TEX\ language primitives, but some (most?) of that +speed gain would be lost if it had to interpret command sequences while doing so. + +It is possible to specify extra hyphenation points in compound words by using +\type {{-}{}{-}} for the explicit hyphen character (replace \type {-} by the +actual explicit hyphen character if needed). For example, this matches the word +\quote {multi|-|word|-|boundaries} and allows an extra break in between \quote +{boun} and \quote {daries}: + +\starttyping +\hyphenation{multi{-}{}{-}word{-}{}{-}boun-daries} +\stoptyping + +The motivation behind the \ETEX\ extension \type {\savinghyphcodes} was that +hyphenation heavily depended on font encodings. This is no longer true in +\LUATEX, and the corresponding primitive is ignored pending complete removal. The +future semantics of \type {\uppercase} and \type {\lowercase} are still under +consideration, no changes have taken place yet. + +\section{Applying hyphenation} + +The internal structures \LUATEX\ uses for the insertion of discretionaries in +words is very different from the ones in \TEX82, and that means there are some +noticeable differences in handling as well. + +First and foremost, there is no \quote {compressed trie} involved in hyphenation. +The algorithm still reads \PATGEN-generated pattern files, but \LUATEX\ uses a +finite state hash to match the patterns against the word to be hyphenated. This +algorithm is based on the \quote {libhnj} library used by \OPENOFFICE, which in +turn is inspired by \TEX. The memory allocation for this new implementation is +completely dynamic, so the \WEBC\ setting for \type {trie_size} is ignored. 
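To give a feel for matching patterns through a hash rather than a compressed trie, here is a minimal Python sketch of Liang-style pattern matching with a plain dictionary. It is not libhnj's state machine and not \LUATEX's code: it looks up every substring of the word, the one-pattern tables in the usage lines are made up, and the `left_min`/`right_min` defaults merely mimic typical \type {\lefthyphenmin} and \type {\righthyphenmin} settings.

```python
def parse_pattern(pattern):
    """Split a pattern such as 'f1f' into its letters and inter-letter weights."""
    letters = []
    weights = [0]  # weights[k] is the value in front of letter k
    for ch in pattern:
        if ch.isdigit():
            weights[-1] = int(ch)
        else:
            letters.append(ch)
            weights.append(0)
    return "".join(letters), weights


def build_table(patterns):
    """Build a hash from pattern letters to weights (space-separated input)."""
    return dict(parse_pattern(p) for p in patterns.split())


def hyphenate(word, table, left_min=2, right_min=3):
    """Insert '-' at every position where the highest matching weight is odd."""
    padded = "." + word.lower() + "."
    scores = [0] * (len(padded) + 1)  # scores[i]: value in front of padded[i]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            weights = table.get(padded[start:end])
            if weights:
                for offset, weight in enumerate(weights):
                    scores[start + offset] = max(scores[start + offset], weight)
    pieces, last = [], 0
    for pos in range(left_min, len(word) - right_min + 1):
        if scores[pos + 1] % 2 == 1:  # gap after the first `pos` letters
            pieces.append(word[last:pos])
            last = pos
    pieces.append(word[last:])
    return "-".join(pieces)


print(hyphenate("office", build_table("f1f")))
print(hyphenate("table", build_table("a1b")))
```

Because the table is an ordinary dictionary, adding patterns later is just another insertion, which is in the spirit of the remark that memory allocation is completely dynamic.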
+ +Differences between \LUATEX\ and \TEX82 that are a direct result of that: + +\startitemize +\startitem + \LUATEX\ happily hyphenates the full \UNICODE\ character range. +\stopitem +\startitem + Pattern and exception dictionary size is limited by the available memory + only, all allocations are done dynamically. The trie|-|related settings in + \type {texmf.cnf} are ignored. +\stopitem +\startitem + Because there is no \quote {trie preparation} stage, language patterns never + become frozen. This means that the primitive \type {\patterns} (and its \LUA\ + counterpart \type {lang.patterns}) can be used at any time, not only in + ini\TEX. +\stopitem +\startitem + Only the string representation of \type {\patterns} and \type {\hyphenation} is + stored in the format file. At format load time, they are simply + re|-|evaluated. It follows that there is no real reason to preload languages + in the format file. In fact, it is usually not a good idea to do so. It is + much smarter to load patterns no sooner than the first time they are actually + needed. +\stopitem +\startitem + \LUATEX\ uses the language-specific variables \type {\prehyphenchar} and \type + {\posthyphenchar} in the creation of implicit discretionaries, instead of + \TEX82's \type {\hyphenchar}, and the values of the language|-|specific variables + \type {\preexhyphenchar} and \type {\postexhyphenchar} for explicit + discretionaries (instead of \TEX82's empty discretionary). +\stopitem +\startitem + The value of the two counters related to hyphenation, \type {hyphenpenalty} + and \type {exhyphenpenalty}, are now stored in the discretionary nodes. This + permits a local overload for explicit \type {\discretionary} commands. The + value current when the hyphenation pass is applied is used. When no callbacks + are used this is compatible with traditional \TEX. When you apply the \LUA\ + \type {lang.hyphenate} function the current values are used. 
+\stopitem +\stopitemize + +Inserted characters and ligatures inherit their attributes from the nearest glyph +node item (usually the preceding one, but the following one for the items +inserted at the left-hand side of a word). + +Word boundaries are no longer implied by font switches, but by language switches. +One word can have two separate fonts and still be hyphenated correctly (but it +cannot have two different languages, the \type {\setlanguage} command forces a +word boundary). + +All languages start out with \type {\prehyphenchar=`\-}, \type {\posthyphenchar=0}, +\type {\preexhyphenchar=0} and \type {\postexhyphenchar=0}. When you assign the +values of one of these four parameters, you are actually changing the settings +for the current \type {\language}; this behavior is compatible with \type {\patterns} +and \type {\hyphenation}. + +\LUATEX\ also hyphenates the first word in a paragraph. Words can be up to 256 +characters long (up from 64 in \TEX82). Longer words generate an error right now, +but eventually either the limitation will be removed or perhaps it will become +possible to silently ignore the excess characters (this is what happens in +\TEX82, but there the behavior cannot be controlled). + +If you are using the \LUA\ function \type {lang.hyphenate}, you should be aware +that this function expects to receive a list of \quote {character} nodes. It will +not operate properly in the presence of \quote {glyph}, \quote {ligature}, or +\quote {ghost} nodes, nor does it know how to deal with kerning. In the near +future, it will be able to skip over \quote {ghost} nodes, and we may add a less +fuzzy function you can call as well. + +The hyphenation exception dictionary is maintained as a key|-|value hash, and that +is also dynamic, so the \type {hyph_size} setting is not used either. 
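The earlier remark that the keys of the exception dictionary can always be generated from the values can be illustrated with a short Python sketch. The function name and the regular expression are assumptions for illustration. Only the two rules shown in the examples table are modelled (a bare `-` is dropped, a `{pre}{post}{replace}` group contributes its replacement text); the `=` syntax and command names are not handled.

```python
import re

# Either a three-part {pre}{post}{replace} group or a bare hyphen.
_TOKEN = re.compile(r"\{([^{}]*)\}\{([^{}]*)\}\{([^{}]*)\}|-")


def exception_key(value):
    """Derive the dictionary key (the unbroken word) from an exception value."""
    def replacement(match):
        if match.group(0) == "-":
            return ""          # simple discretionary: nothing in the unbroken word
        return match.group(3)  # complex discretionary: keep the replace text
    return _TOKEN.sub(replacement, value)


exceptions = {}
for value in ["ta-ble", "ba{k-}{}{c}ken", "multi{-}{}{-}word{-}{}{-}boun-daries"]:
    exceptions[exception_key(value)] = value

print(sorted(exceptions))
```

Storing the pairs in an ordinary hash, as the sketch does with a Python dictionary, means the table simply grows as exceptions are added.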
+ +\section{Applying ligatures and kerning} + +After all possible hyphenation points have been inserted in the list, \LUATEX\ +will process the list to convert the \quote {character} nodes into \quote {glyph} +and \quote {ligature} nodes. This is actually done in two stages: first all +ligatures are processed, then all kerning information is applied to the result +list. But those two stages are somewhat dependent on each other: If the used font +makes it possible to do so, the ligaturing stage adds virtual \quote {character} +nodes to the word boundaries in the list. While doing so, it removes and +interprets \type {noboundary} nodes. The kerning stage deletes those word +boundary items after it is done with them, and it does the same for \quote +{ghost} nodes. Finally, at the end of the kerning stage, all remaining \quote +{character} nodes are converted to \quote {glyph} nodes. + +This work separation is worth mentioning because, if you overrule from \LUA\ only +one of the two callbacks related to font handling, then you have to make sure you +perform the tasks normally done by \LUATEX\ itself in order to make sure that the +other, non|-|overruled, routine continues to function properly. + +Work in this area is not yet complete, but most of the possible cases are handled +by our rewritten ligaturing engine. We are working hard to make sure all of the +possible inputs will become supported soon. 
+ +For example, take the word \type {office}, hyphenated \type {of-fice}, using a +\quote {normal} font with all the \type {f}-\type {f} and \type {f}-\type {i} +type ligatures: + +\starttabulate[|l|l|] +\NC Initial: \NC \type {{o}{f}{f}{i}{c}{e}} \NC\NR +\NC After hyphenation: \NC \type {{o}{f}{{-},{},{}}{f}{i}{c}{e}} \NC\NR +\NC First ligature stage: \NC \type {{o}{{f-},{f},{<ff>}}{i}{c}{e}} \NC\NR +\NC Final result: \NC \type {{o}{{f-},{<fi>},{<ffi>}}{c}{e}} \NC\NR +\stoptabulate + +That's bad enough, but let us assume that there is also a hyphenation point +between the \type {f} and the \type {i}, to create \type {of-f-ice}. Then the +final result should be: + +\starttyping +{o}{{f-}, + {{f-}, + {i}, + {<fi>}}, + {{<ff>-}, + {i}, + {<ffi>}}}{c}{e} +\stoptyping + +with discretionaries in the post-break text as well as in the replacement text of +the top-level discretionary that resulted from the first hyphenation point. + +Here is that nested solution again, in a different representation: + +\starttabulate[|l|l|l|l|] +\NC \NC pre \NC post \NC replace \NC \NR +\NC topdisc \NC \type {f-}$^1$ \NC sub1 \NC sub2 \NC \NR +\NC sub1 \NC \type {f-}$^2$ \NC \type {i}$^3$ \NC \type {<fi>}$^4$ \NC \NR +\NC sub2 \NC \type {<ff>-}$^5$\NC \type {i}$^6$ \NC \type {<ffi>}$^7$ \NC \NR +\stoptabulate + +When line breaking is choosing its breakpoints, the following fields will +eventually be selected: + +\starttabulate[|l|l|l|] +\NC \type {of-f-ice} \NC \type {f-}$^1$ \NC \NR +\NC \NC \type {f-}$^2$ \NC \NR +\NC \NC \type {i}$^3$ \NC \NR +\NC \type {of-fice} \NC \type {f-}$^1$ \NC \NR +\NC \NC \type {<fi>}$^4$ \NC \NR +\NC \type {off-ice} \NC \type {<ff>-}$^5$ \NC \NR +\NC \NC \type {i}$^6$ \NC \NR +\NC \type {office} \NC \type {<ffi>}$^7$ \NC \NR +\stoptabulate + +The current solution in \LUATEX\ is not able to handle nested discretionaries, +but it is in fact smart enough to handle this fictional \type {of-f-ice} example. 
+It does so by combining two sequential discretionary nodes as if they were a +single object (where the second discretionary node is treated as an extension of +the first node). + +One can observe that the \type {of-f-ice} and \type {off-ice} cases both end with +the same actual post replacement list (\type {i}), and that this would be the +case even if that \type {i} was the first item of a potential following ligature +like \type {ic}. This allows \LUATEX\ to do away with one of the fields, and thus +make the whole stuff fit into just two discretionary nodes. + +The mapping of the seven list fields to the six fields in this discretionary node +pair is as follows: + +\starttabulate[|l|p|] +\NC \bf field \NC \bf description \NC \NR +\NC \type {disc1.pre} \NC \type {f-}$^1$ \NC \NR +\NC \type {disc1.post} \NC \type {<fi>}$^4$ \NC \NR +\NC \type {disc1.replace} \NC \type {<ffi>}$^7$ \NC \NR +\NC \type {disc2.pre} \NC \type {f-}$^2$ \NC \NR +\NC \type {disc2.post} \NC \type {i}$^{3{,}6}$\NC \NR +\NC \type {disc2.replace} \NC \type {<ff>-}$^5$\NC \NR +\stoptabulate + +What is actually generated after ligaturing has been applied is therefore: + +\starttyping +{o}{{f-}, + {<fi>}, + {<ffi>}} + {{f-}, + {i}, + {<ff>-}}{c}{e} +\stoptyping + +The two discretionaries have different subtypes from a discretionary appearing on +its own: the first has subtype 4, and the second has subtype 5. The need for +these special subtypes stems from the fact that not all of the fields appear in +their \quote {normal} location. The second discretionary especially looks odd, +with things like the \type {<ff>-} appearing in \type {disc2.replace}. The fact +that some of the fields have different meanings (and different processing code +internally) is what makes it necessary to have different subtypes: this enables +\LUATEX\ to distinguish this sequence of two joined discretionary nodes from the +case of two standalone discretionaries appearing in a row. 
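The field mapping for the joined discretionary pair can be checked with a small Python model of the \type {office} example. The dictionaries and the selection function below are a reading of the tables above, written for illustration only; they are not how \LUATEX's line breaker is implemented.

```python
# The two joined discretionaries from the example (subtypes 4 and 5).
disc1 = {"pre": "f-", "post": "<fi>", "replace": "<ffi>"}
disc2 = {"pre": "f-", "post": "i", "replace": "<ff>-"}


def selected_fields(break_first, break_second):
    """Return the fields that end up in the output for a given break choice."""
    if break_first and break_second:   # of-f-ice
        return [disc1["pre"], disc2["pre"], disc2["post"]]
    if break_first:                    # of-fice
        return [disc1["pre"], disc1["post"]]
    if break_second:                   # off-ice
        return [disc2["replace"], disc2["post"]]
    return [disc1["replace"]]          # office, no break


def render(break_first, break_second):
    """Assemble the word around the discretionary pair; '-' marks a line end."""
    return "o" + "".join(selected_fields(break_first, break_second)) + "ce"


for choice in [(True, True), (True, False), (False, True), (False, False)]:
    print(choice, render(*choice))
```

Note how `disc2["post"]` serves both the \type {of-f-ice} and the \type {off-ice} case, which is the field sharing that lets seven list fields fit into six.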
+ +Of course there is still that relationship with fonts: ligatures can be implemented by +mapping a sequence of glyphs onto one glyph, but also by selective replacement and +kerning. This means that the above examples are just representing the traditional +approach. + +\section{Breaking paragraphs into lines} + +This code is still almost unchanged, but because of the above|-|mentioned changes +with respect to discretionaries and ligatures, line breaking will potentially be +different from traditional \TEX. The actual line breaking code is still based on +the \TEX82 algorithms, and it does not expect there to be discretionaries inside +of discretionaries. + +But that situation is now fairly common in \LUATEX, due to the changes to the +ligaturing mechanism. And also, the \LUATEX\ discretionary nodes are implemented +slightly differently from the \TEX82 nodes: the \type {no_break} text is now +embedded inside the disc node, where previously these nodes kept their place in +the horizontal list (the discretionary node contained a counter indicating how +many nodes to skip). + +The combined effect of these two differences is that \LUATEX\ does not always use +all of the potential breakpoints in a paragraph, especially when fonts with many +ligatures are used. + +\stopchapter + +\stopcomponent |