diff options
author | Context Git Mirror Bot <phg42.2a@gmail.com> | 2016-07-30 01:22:07 +0200 |
---|---|---|
committer | Context Git Mirror Bot <phg42.2a@gmail.com> | 2016-07-30 01:22:07 +0200 |
commit | 5135aef167bec739fe429e1aa987671768b237bc (patch) | |
tree | bd9f9696704e57c45f453bb7dc6becd5501cb657 /doc/context/sources/general/manuals/luatex/luatex-languages.tex | |
parent | 9d7c4ba8449bec1da920c01e24a17c41bbf2211d (diff) | |
download | context-5135aef167bec739fe429e1aa987671768b237bc.tar.gz |
2016-07-30 00:31:00
Diffstat (limited to 'doc/context/sources/general/manuals/luatex/luatex-languages.tex')
-rw-r--r-- | doc/context/sources/general/manuals/luatex/luatex-languages.tex | 770 |
1 files changed, 0 insertions, 770 deletions
diff --git a/doc/context/sources/general/manuals/luatex/luatex-languages.tex b/doc/context/sources/general/manuals/luatex/luatex-languages.tex deleted file mode 100644 index ad7b7b9d6..000000000 --- a/doc/context/sources/general/manuals/luatex/luatex-languages.tex +++ /dev/null @@ -1,770 +0,0 @@ -% language=uk - -\environment luatex-style -\environment luatex-logos - -\startcomponent luatex-languages - -\startchapter[reference=languages,title={Languages, characters, fonts and glyphs}] - -\LUATEX's internal handling of the characters and glyphs that eventually become -typeset is quite different from the way \TEX82 handles those same objects. The -easiest way to explain the difference is to focus on unrestricted horizontal mode -(i.e.\ paragraphs) and hyphenation first. Later on, it will be easy to deal -with the differences that occur in horizontal and math modes. - -In \TEX82, the characters you type are converted into \type {char_node} records -when they are encountered by the main control loop. \TEX\ attaches and processes -the font information while creating those records, so that the resulting \quote -{horizontal list} contains the final forms of ligatures and implicit kerning. -This packaging is needed because we may want to get the effective width of for -instance a horizontal box. - -When it becomes necessary to hyphenate words in a paragraph, \TEX\ converts (one -word at time) the \type {char_node} records into a string by replacing ligatures -with their components and ignoring the kerning. Then it runs the hyphenation -algorithm on this string, and converts the hyphenated result back into a \quote -{horizontal list} that is consecutively spliced back into the paragraph stream. -Keep in mind that the paragraph may contain unboxed horizontal material, which -then already contains ligatures and kerns and the words therein are part of the -hyphenation process. - -Those \type {char_node} records are somewhat misnamed, as they are glyph -positions in specific fonts, and therefore not really \quote {characters} in the -linguistic sense. There is no language information inside the \type {char_node} -records at all. Instead, language information is passed along using \type -{language whatsit} records inside the horizontal list. - -In \LUATEX, the situation is quite different. The characters you type are always -converted into \type {glyph_node} records with a special subtype to identify them -as being intended as linguistic characters. \LUATEX\ stores the needed language -information in those records, but does not do any font|-|related processing at -the time of node creation. It only stores the index of the current font and a -reference to a character in that font. - -When it becomes necessary to typeset a paragraph, \LUATEX\ first inserts all -hyphenation points right into the whole node list. Next, it processes all the -font information in the whole list (creating ligatures and adjusting kerning), -and finally it adjusts all the subtype identifiers so that the records are \quote -{glyph nodes} from now on. - -\section[charsandglyphs]{Characters and glyphs} - -\TEX82 (including \PDFTEX) differentiates between \type {char_node}s and \type -{lig_node}s. The former are simple items that contained nothing but a \quote -{character} and a \quote {font} field, and they lived in the same memory as -tokens did. The latter also contained a list of components, and a subtype -indicating whether this ligature was the result of a word boundary, and it was -stored in the same place as other nodes like boxes and kerns and glues. - -In \LUATEX, these two types are merged into one, somewhat larger structure called -a \type {glyph_node}. Besides having the old character, font, and component -fields, and the new special fields like \quote {attr} (see~\in {section} -[glyphnodes]), these nodes also contain: - -\startitemize - -\startitem A subtype, split into four main types: - - \startitemize - \startitem - \type {character}, for characters to be hyphenated: the lowest bit - (bit 0) is set to 1. - \stopitem - \startitem - \type {glyph}, for specific font glyphs: the lowest bit (bit 0) is - not set. - \stopitem - \startitem - \type {ligature}, for ligatures (bit 1 is set) - \stopitem - \startitem - \type {ghost}, for \quote {ghost objects} (bit 2 is set) - \stopitem - \stopitemize - - The latter two make further use of two extra fields (bits 3 and 4): - - \startitemize - \startitem - \type {left}, for ligatures created from a left word boundary and for - ghosts created from \type {\leftghost} - \stopitem - \startitem - \type {right}, for ligatures created from a right word boundary and - for ghosts created from \type {\rightghost} - \stopitem - \stopitemize - - For ligatures, both bits can be set at the same time (in case of a - single|-|glyph word). - -\stopitem - -\startitem - \type {glyph_node}s of type \quote {character} also contain language data, - split into four items that were current when the node was created: the - \type {\setlanguage} (15 bits), \type {\lefthyphenmin} (8 bits), \type - {\righthyphenmin} (8 bits), and \type {\uchyph} (1 bit). -\stopitem - -\stopitemize - -Incidentally, \LUATEX\ allows 16383 separate languages, and words can be 256 -characters long. The language is stored with each character. You can set -\type {\firstvalidlanguage} to for instance~1 and make thereby language~0 -an ignored hyphenation language. - -The new primitive \type {\hyphenationmin} can be used to signal the minimal length -of a word. This value stored with the (current) language. - -Because the \type {\uchyph} value is saved in the actual nodes, its handling is -subtly different from \TEX82: changes to \type {\uchyph} become effective -immediately, not at the end of the current partial paragraph. - -Typeset boxes now always have their language information embedded in the nodes -themselves, so there is no longer a possible dependency on the surrounding -language settings. In \TEX82, a mid-paragraph statement like \type {\unhbox0} would -process the box using the current paragraph language unless there was a -\type {\setlanguage} issued inside the box. In \LUATEX, all language variables are -already frozen. - -In traditional \TEX\ the process of hyphenation is driven by \type {lccode}s. In -\LUATEX\ we made this dependency less strong. There are several strategies -possible. When you do nothing, the currently used \type {lccode}s are used, when -loading patterns, setting exceptions or hyphenating a list. - -When you set \type {\savinghyphcodes} to a value larger than zero the current set -of \type {lccode}s will be saved with the language. In that case changing a \type -{lccode} afterwards has no effect. However, you can adapt the set with: - -\starttyping -\hjcode`a=`a -\stoptyping - -This change is global which makes sense if you keep in mind that the moment that -hyphenation happens is (normally) when the paragraph or a horizontal box is -constructed. When \type {\savinghyphcodes} was zero when the language got -initialized you start out with nothing, otherwise you already have a set. - -When a \type {\hjcode} is larger than $0$ but smaller than $32$ is indicates the -to be used length. In the following example we map a character (\type {x}) onto -another one in the patterns and tell the engine that \type {œ} counts as one -character. Because traditionally zero itself is reserved for inhibiting -hyphenation, a value of $32$ counts as zero. - -\starttyping -% assuming french patterns: -foobar % foo-bar - -\hjcode`x=`o - -fxxbar % fxx-bar - -\lefthyphenmin3 - -œdipus % œdi-pus - -\lefthyphenmin4 - -œdipus % œdipus - -\hjcode`œ=2 - -œdipus % œdi-pus - -\hjcode`i=32 -\hjcode`d=32 - -œdipus % œdipus -\stoptyping - -Carrying all this information with each glyph would give too much overhead and -also make the process of setting up thee codes more complex. A solution with -\type {hjcode} sets was considered but rejected because in practice the current -approach is sufficient and it would not be compatible anyway. - -Beware: the values are always saved in the format, independent of the setting -of \type {\savinghyphcodes} at the moment the format is dumped. - -A boundary node normally would mark the end of a word which interferes with for -instance discretionary injection. For this you can use the \type {\wordboundary} -as trigger. Here are a few examples of usage: - -\startbuffer - discrete---discrete -\stopbuffer -\typebuffer \start \dontcomplain \hsize 1pt \getbuffer \par \stop -\startbuffer - discrete\discretionary{}{}{---}discrete -\stopbuffer -\typebuffer \start \dontcomplain \hsize 1pt \getbuffer \par \stop -\startbuffer - discrete\wordboundary\discretionary{}{}{---}discrete -\stopbuffer -\typebuffer \start \dontcomplain \hsize 1pt \getbuffer \par \stop -\startbuffer - discrete\wordboundary\discretionary{}{}{---}\wordboundary discrete -\stopbuffer -\typebuffer \start \dontcomplain \hsize 1pt \getbuffer \par \stop -\startbuffer - discrete\wordboundary\discretionary{---}{}{}\wordboundary discrete -\stopbuffer -\typebuffer \start \dontcomplain \hsize 1pt \getbuffer \par \stop - -We only accept an explicit hyphen when there is a preceding glyph and we skip a -sequence of explicit hyphens as that normally indicates a \type {--} or \type -{---} ligature in which case we can in a worse case usage get bad node lists -later on due to messed up ligature building as these dashes are ligatures in base -fonts. This is a side effect of the separating the hyphenation, ligaturing and -kerning steps. - -The start and end of a characters is signalled by a glue, penalty, kern or boundary -node. But by default also a hlist, vlist, rule, dir, whatsit, ins, and adjust node -indicate a start or end. You can omit the last set from the test by setting -\type {\hyphenationbounds} to a non|-|zero value: - -\starttabulate[|Tl|l|] -\NC 0 \NC not strict \NC \NR -\NC 1 \NC strict start \NC \NR -\NC 2 \NC strict end \NC \NR -\NC 3 \NC strict start and strict end \NC \NR -\stoptabulate - -\section{The main control loop} - -In \LUATEX's main loop, almost all input characters that are to be typeset are -converted into \type {glyph} node records with subtype \quote {character}, but -there are a few exceptions. - -First, the \type {\accent} primitives creates nodes with subtype \quote {glyph} -instead of \quote {character}: one for the actual accent and one for the -accentee. The primary reason for this is that \type {\accent} in \TEX82 is -explicitly dependent on the current font encoding, so it would not make much -sense to attach a new meaning to the primitive's name, as that would invalidate -many old documents and macro packages. \footnote {Of course, modern packages will -not use the \type {\accent} primitive at all but try to map directly on composed -characters.} A secondary reason is that in \TEX82, \type {\accent} prohibits -hyphenation of the current word. Since in \LUATEX\ hyphenation only takes place -on \quote {character} nodes, it is possible to achieve the same effect. - -This change of meaning did happen with \type {\char}, that now generates \quote -{glyph} nodes with a character subtype. In traditional \TEX\ there was a strong -relationship between the 8|-|bit input encoding, hyphenation and glyphs taken -from a font. In \LUATEX\ we have \UTF\ input, and in most cases this maps -directly to a character in a font, apart from glyph replacement in the font -engine. If you want to access arbitrary glyphs in a font directly you can always -use \LUA\ to do so, because fonts are available as \LUA\ table. - -Second, all the results of processing in math mode eventually become nodes with -\quote {glyph} subtypes. - -Third, the \ALEPH|-|derived commands \type {\leftghost} and \type {\rightghost} -create nodes of a third subtype: \quote {ghost}. These nodes are ignored -completely by all further processing until the stage where inter|-|glyph kerning -is added. - -Fourth, automatic discretionaries are handled differently. \TEX82 inserts an -empty discretionary after sensing an input character that matches the \type -{\hyphenchar} in the current font. This test is wrong in our opinion: whether or -not hyphenation takes place should not depend on the current font, it is a -language property. \footnote {When \TEX\ showed up we didn't have \UNICODE\ yet -and being limited to eight bits meant that one sometimes had to compromise -between supporting character input, glyph rendering, hyphenation.} - -In \LUATEX, it works like this: if \LUATEX\ senses a string of input characters -that matches the value of the new integer parameter \type {\exhyphenchar}, it will -insert an explicit discretionary after that series of nodes. Initex sets the \type -{\exhyphenchar=`\-}. Incidentally, this is a global parameter instead of a -language-specific one because it may be useful to change the value depending on -the document structure instead of the text language. - -The insertion of discretionaries after a sequence of explicit hyphens happens at -the same time as the other hyphenation processing, {\it not\/} inside the main -control loop. - -The only use \LUATEX\ has for \type {\hyphenchar} is at the check whether a word -should be considered for hyphenation at all. If the \type {\hyphenchar} of the -font attached to the first character node in a word is negative, then hyphenation -of that word is abandoned immediately. This behaviour is added for backward -compatibility only, and the use of \type {\hyphenchar=-1} as a means of -preventing hyphenation should not be used in new \LUATEX\ documents. - -Fifth, \type {\setlanguage} no longer creates whatsits. The meaning of \type -{\setlanguage} is changed so that it is now an integer parameter like all others. -That integer parameter is used in \type {\glyph_node} creation to add language -information to the glyph nodes. In conjunction, the \type {\language} primitive is -extended so that it always also updates the value of \type {\setlanguage}. - -Sixth, the \type {\noboundary} command (that prohibits word boundary processing -where that would normally take place) now does create nodes. These nodes are -needed because the exact place of the \type {\noboundary} command in the input -stream has to be retained until after the ligature and font processing stages. - -Finally, there is no longer a \type {main_loop} label in the code. Remember that -\TEX82 did quite a lot of processing while adding \type {char_nodes} to the -horizontal list? For speed reasons, it handled that processing code outside of -the \quote {main control} loop, and only the first character of any \quote {word} -was handled by that \quote {main control} loop. In \LUATEX, there is no longer a -need for that (all hard work is done later), and the (now very small) bits of -character|-|handling code have been moved back inline. When \type -{\tracingcommands} is on, this is visible because the full word is reported, -instead of just the initial character. - -\section[patternsexceptions]{Loading patterns and exceptions} - -The hyphenation algorithm in \LUATEX\ is quite different from the one in \TEX82, -although it uses essentially the same user input. - -After expansion, the argument for \type {\patterns} has to be proper \UTF8 with -individual patterns separated by spaces, no \type {\char} or \type {\chardef}d -commands are allowed. The current implementation quite strict and will reject all -non|-|\UNICODE\ characters. - -Likewise, the expanded argument for \type {\hyphenation} also has to be proper -\UTF8, but here a bit of extra syntax is provided: - -\startitemize[n] -\startitem - Three sets of arguments in curly braces (\type {{}{}{}}) indicates a desired - complex discretionary, with arguments as in \type {\discretionary}'s command in - normal document input. -\stopitem -\startitem - A \type {-} indicates a desired simple discretionary, cf.\ \type {\-} and \type - {\discretionary{-}{}{}} in normal document input. -\stopitem -\startitem - Internal command names are ignored. This rule is provided especially for \type - {\discretionary}, but it also helps to deal with \type {\relax} commands that - may sneak in. -\stopitem -\startitem - An \type {=} indicates a (non|-|discretionary) hyphen in the document input. -\stopitem -\stopitemize - -The expanded argument is first converted back to a space-separated string while -dropping the internal command names. This string is then converted into a -dictionary by a routine that creates key|-|value pairs by converting the other -listed items. It is important to note that the keys in an exception dictionary -can always be generated from the values. Here are a few examples: - -\starttabulate[|l|l|l|] -\NC \bf value \NC \bf implied key (input) \NC \bf effect \NC\NR -\NC \type {ta-ble} \NC table \NC \type {ta\-ble} ($=$ \type {ta\discretionary{-}{}{}ble}) \NC\NR -\NC \type {ba{k-}{}{c}ken} \NC backen \NC \type {ba\discretionary{k-}{}{c}ken} \NC\NR -\stoptabulate - -The resultant patterns and exception dictionary will be stored under the language -code that is the present value of \type {\language}. - -In the last line of the table, you see there is no \type {\discretionary} command -in the value: the command is optional in the \TEX-based input syntax. The -underlying reason for that is that it is conceivable that a whole dictionary of -words is stored as a plain text file and loaded into \LUATEX\ using one of the -functions in the \LUA\ \type {lang} library. This loading method is quite a bit -faster than going through the \TEX\ language primitives, but some (most?) of that -speed gain would be lost if it had to interpret command sequences while doing so. - -It is possible to specify extra hyphenation points in compound words by using -\type {{-}{}{-}} for the explicit hyphen character (replace \type {-} by the -actual explicit hyphen character if needed). For example, this matches the word -\quote {multi|-|word|-|boundaries} and allows an extra break inbetween \quote -{boun} and \quote {daries}: - -\starttyping -\hyphenation{multi{-}{}{-}word{-}{}{-}boun-daries} -\stoptyping - -The motivation behind the \ETEX\ extension \type {\savinghyphcodes} was that -hyphenation heavily depended on font encodings. This is no longer true in -\LUATEX, and the corresponding primitive is basically ignored. Because we now -have \type {hjcode}, the case relate codes can be used exclusively for \type -{\uppercase} and \type {\lowercase}. - -\section{Applying hyphenation} - -The internal structures \LUATEX\ uses for the insertion of discretionaries in -words is very different from the ones in \TEX82, and that means there are some -noticeable differences in handling as well. - -First and foremost, there is no \quote {compressed trie} involved in hyphenation. -The algorithm still reads \PATGEN-generated pattern files, but \LUATEX\ uses a -finite state hash to match the patterns against the word to be hyphenated. This -algorithm is based on the \quote {libhnj} library used by \OPENOFFICE, which in -turn is inspired by \TEX. - -There are a few differences between \LUATEX\ and \TEX82 that are a direct result -of the implementation: - -\startitemize -\startitem - \LUATEX\ happily hyphenates the full \UNICODE\ character range. -\stopitem -\startitem - Pattern and exception dictionary size is limited by the available memory - only, all allocations are done dynamically. The trie|-|related settings in - \type {texmf.cnf} are ignored. -\stopitem -\startitem - Because there is no \quote {trie preparation} stage, language patterns never - become frozen. This means that the primitive \type {\patterns} (and its \LUA\ - counterpart \type {lang.patterns}) can be used at any time, not only in - ini\TEX. -\stopitem -\startitem - Only the string representation of \type {\patterns} and \type {\hyphenation} is - stored in the format file. At format load time, they are simply - re|-|evaluated. It follows that there is no real reason to preload languages - in the format file. In fact, it is usually not a good idea to do so. It is - much smarter to load patterns no sooner than the first time they are actually - needed. -\stopitem -\startitem - \LUATEX\ uses the language-specific variables \type {\prehyphenchar} and \type - {\posthyphenchar} in the creation of implicit discretionaries, instead of - \TEX82's \type {\hyphenchar}, and the values of the language|-|specific variables - \type {\preexhyphenchar} and \type {\postexhyphenchar} for explicit - discretionaries (instead of \TEX82's empty discretionary). -\stopitem -\startitem - The value of the two counters related to hyphenation, \type {\hyphenpenalty} - and \type {\exhyphenpenalty}, are now stored in the discretionary nodes. This - permits a local overload for explicit \type {\discretionary} commands. The - value current when the hyphenation pass is applied is used. When no callbacks - are used this is compatible with traditional \TEX. When you apply the \LUA\ - \type {lang.hyphenate} function the current values are used. -\stopitem -\stopitemize - -Because we store penalties in the disc node the \type {\discretionary} command has -been extended to accept an optional penalty specification, so you can do the -following: - -\startbuffer -\hsize1mm -1:foo{\hyphenpenalty 10000\discretionary{}{}{}}bar\par -2:foo\discretionary penalty 10000 {}{}{}bar\par -3:foo\discretionary{}{}{}bar\par -\stopbuffer - -\typebuffer - -This results in: - -\blank \start \getbuffer \stop \blank - -Inserted characters and ligatures inherit their attributes from the nearest glyph -node item (usually the preceding one, but the following one for the items -inserted at the left-hand side of a word). - -Word boundaries are no longer implied by font switches, but by language switches. -One word can have two separate fonts and still be hyphenated correctly (but it -can not have two different languages, the \type {\setlanguage} command forces a -word boundary). - -All languages start out with \type {\prehyphenchar=`\-}, \type {\posthyphenchar=0}, -\type {\preexhyphenchar=0} and \type {\postexhyphenchar=0}. When you assign the -values of one of these four parameters, you are actually changing the settings -for the current \type {\language}, this behaviour is compatible with \type {\patterns} -and \type {\hyphenation}. - -\LUATEX\ also hyphenates the first word in a paragraph. Words can be up to 256 -characters long (up from 64 in \TEX82). Longer words generate an error right now, -but eventually either the limitation will be removed or perhaps it will become -possible to silently ignore the excess characters (this is what happens in -\TEX82, but there the behaviour cannot be controlled). - -If you are using the \LUA\ function \type {lang.hyphenate}, you should be aware -that this function expects to receive a list of \quote {character} nodes. It will -not operate properly in the presence of \quote {glyph}, \quote {ligature}, or -\quote {ghost} nodes, nor does it know how to deal with kerning. - -The hyphenation exception dictionary is maintained as key|-|value hash, and that -is also dynamic, so the \type {hyph_size} setting is not used either. - -\section{Applying ligatures and kerning} - -After all possible hyphenation points have been inserted in the list, \LUATEX\ -will process the list to convert the \quote {character} nodes into \quote {glyph} -and \quote {ligature} nodes. This is actually done in two stages: first all -ligatures are processed, then all kerning information is applied to the result -list. But those two stages are somewhat dependent on each other: If the used font -makes it possible to do so, the ligaturing stage adds virtual \quote {character} -nodes to the word boundaries in the list. While doing so, it removes and -interprets \type {\noboundary} nodes. The kerning stage deletes those word -boundary items after it is done with them, and it does the same for \quote -{ghost} nodes. Finally, at the end of the kerning stage, all remaining \quote -{character} nodes are converted to \quote {glyph} nodes. - -This work separation is worth mentioning because, if you overrule from \LUA\ only -one of the two callbacks related to font handling, then you have to make sure you -perform the tasks normally done by \LUATEX\ itself in order to make sure that the -other, non|-|overruled, routine continues to function properly. - -Work in this area is not yet complete, but most of the possible cases are handled -by our rewritten ligaturing engine. At some point all of the possible inputs will -become supported. \footnote {Not all of this makes sense because we nowadays have -\OPENTYPE\ fonts and ligature building can happen in ,any different ways there.} - -For example, take the word \type {office}, hyphenated \type {of-fice}, using a -\quote {normal} font with all the \type {f}-\type {f} and \type {f}-\type {i} -type ligatures: - -\starttabulate[|l|l|] -\NC Initial: \NC \type {{o}{f}{f}{i}{c}{e}} \NC\NR -\NC After hyphenation: \NC \type {{o}{f}{{-},{},{}}{f}{i}{c}{e}} \NC\NR -\NC First ligature stage: \NC \type {{o}{{f-},{f},{<ff>}}{i}{c}{e}} \NC\NR -\NC Final result: \NC \type {{o}{{f-},{<fi>},{<ffi>}}{c}{e}} \NC\NR -\stoptabulate - -That's bad enough, but let us assume that there is also a hyphenation point -between the \type {f} and the \type {i}, to create \type {of-f-ice}. Then the -final result should be: - -\starttyping -{o}{{f-}, - {{f-}, - {i}, - {<fi>}}, - {{<ff>-}, - {i}, - {<ffi>}}}{c}{e} -\stoptyping - -with discretionaries in the post-break text as well as in the replacement text of -the top-level discretionary that resulted from the first hyphenation point. - -Here is that nested solution again, in a different representation: - -\starttabulate[|l|l|l|l|] -\NC \NC pre \NC post \NC replace \NC \NR -\NC topdisc \NC \type {f-}$^1$ \NC sub1 \NC sub2 \NC \NR -\NC sub1 \NC \type {f-}$^2$ \NC \type {i}$^3$ \NC \type {<fi>}$^4$ \NC \NR -\NC sub2 \NC \type {<ff>-}$^5$\NC \type {i}$^6$ \NC \type {<ffi>}$^7$ \NC \NR -\stoptabulate - -When line breaking is choosing its breakpoints, the following fields will -eventually be selected: - -\starttabulate[|l|l|l|] -\NC \type {of-f-ice} \NC \type {f-}$^1$ \NC \NR -\NC \NC \type {f-}$^2$ \NC \NR -\NC \NC \type {i}$^3$ \NC \NR -\NC \type {of-fice} \NC \type {f-}$^1$ \NC \NR -\NC \NC \type {<fi>}$^4$ \NC \NR -\NC \type {off-ice} \NC \type {<ff>-}$^5$ \NC \NR -\NC \NC \type {i}$^6$ \NC \NR -\NC \type {office} \NC \type {<ffi>}$^7$ \NC \NR -\stoptabulate - -The current solution in \LUATEX\ is not able to handle nested discretionaries, -but it is in fact smart enough to handle this fictional \type {of-f-ice} example. -It does so by combining two sequential discretionary nodes as if they were a -single object (where the second discretionary node is treated as an extension of -the first node). - -One can observe that the \type {of-f-ice} and \type {off-ice} cases both end with -the same actual post replacement list (\type {i}), and that this would be the -case even if that \type {i} was the first item of a potential following ligature -like \type {ic}. This allows \LUATEX\ to do away with one of the fields, and thus -make the whole stuff fit into just two discretionary nodes. - -The mapping of the seven list fields to the six fields in this discretionary node -pair is as follows: - -\starttabulate[|l|p|] -\NC \bf field \NC \bf description \NC \NR -\NC \type {disc1.pre} \NC \type {f-}$^1$ \NC \NR -\NC \type {disc1.post} \NC \type {<fi>}$^4$ \NC \NR -\NC \type {disc1.replace} \NC \type {<ffi>}$^7$ \NC \NR -\NC \type {disc2.pre} \NC \type {f-}$^2$ \NC \NR -\NC \type {disc2.post} \NC \type {i}$^{3{,}6}$\NC \NR -\NC \type {disc2.replace} \NC \type {<ff>-}$^5$\NC \NR -\stoptabulate - -What is actually generated after ligaturing has been applied is therefore: - -\starttyping -{o}{{f-}, - {<fi>}, - {<ffi>}} - {{f-}, - {i}, - {<ff>-}}{c}{e} -\stoptyping - -The two discretionaries have different subtypes from a discretionary appearing on -its own: the first has subtype 4, and the second has subtype 5. The need for -these special subtypes stems from the fact that not all of the fields appear in -their \quote {normal} location. The second discretionary especially looks odd, -with things like the \type {<ff>-} appearing in \type {disc2.replace}. The fact -that some of the fields have different meanings (and different processing code -internally) is what makes it necessary to have different subtypes: this enables -\LUATEX\ to distinguish this sequence of two joined discretionary nodes from the -case of two standalone discretionaries appearing in a row. - -Of course there is still that relationship with fonts: ligatures can be implemented by -mapping a sequence of glyphs onto one glyph, but also by selective replacement and -kerning. This means that the above examples are just representing the traditional -approach. - -\section{Breaking paragraphs into lines} - -This code is still almost unchanged, but because of the above|-|mentioned changes -with respect to discretionaries and ligatures, line breaking will potentially be -different from traditional \TEX. The actual line breaking code is still based on -the \TEX82 algorithms, and it does not expect there to be discretionaries inside -of discretionaries. - -But that situation is now fairly common in \LUATEX, due to the changes to the -ligaturing mechanism. And also, the \LUATEX\ discretionary nodes are implemented -slightly different from the \TEX82 nodes: the \type {no_break} text is now -embedded inside the disc node, where previously these nodes kept their place in -the horizontal list. In traditional \TEX\ the discretionary node contains a -counter indicating how many nodes to skip, but in \LUATEX\ we store the pre, post -and replace text in the discretionary node. - -The combined effect of these two differences is that \LUATEX\ does not always use -all of the potential breakpoints in a paragraph, especially when fonts with many -ligatures are used. Of course kerning also complicates matters here. - -\section{The \type {lang} library} - -This library provides the interface to \LUATEX's structure -representing a language, and the associated functions. - -\startfunctioncall -<language> l = lang.new() -<language> l = lang.new(<number> id) -\stopfunctioncall - -This function creates a new userdata object. An object of type \type {<language>} -is the first argument to most of the other functions in the \type {lang} -library. These functions can also be used as if they were object methods, using -the colon syntax. - -Without an argument, the next available internal id number will be assigned to -this object. With argument, an object will be created that links to the internal -language with that id number. - -\startfunctioncall -<number> n = lang.id(<language> l) -\stopfunctioncall - -returns the internal \type {\language} id number this object refers to. - -\startfunctioncall -<string> n = lang.hyphenation(<language> l) -lang.hyphenation(<language> l, <string> n) -\stopfunctioncall - -Either returns the current hyphenation exceptions for this language, or adds new -ones. The syntax of the string is explained in~\in {section} -[patternsexceptions]. - -\startfunctioncall -lang.clear_hyphenation(<language> l) -\stopfunctioncall - -Clears the exception dictionary (string) for this language. - -\startfunctioncall -<string> n = lang.clean(<language> l, <string> o) -<string> n = lang.clean(<string> o) -\stopfunctioncall - -Creates a hyphenation key from the supplied hyphenation value. The syntax of the -argument string is explained in~\in {section} [patternsexceptions]. This function -is useful if you want to do something else based on the words in a dictionary -file, like spell|-|checking. - -\startfunctioncall -<string> n = lang.patterns(<language> l) -lang.patterns(<language> l, <string> n) -\stopfunctioncall - -Adds additional patterns for this language object, or returns the current set. -The syntax of this string is explained in~\in {section} [patternsexceptions]. - -\startfunctioncall -lang.clear_patterns(<language> l) -\stopfunctioncall - -Clears the pattern dictionary for this language. - -\startfunctioncall -<number> n = lang.prehyphenchar(<language> l) -lang.prehyphenchar(<language> l, <number> n) -\stopfunctioncall - -Gets or sets the \quote {pre|-|break} hyphen character for implicit hyphenation -in this language (initially the hyphen, decimal 45). - -\startfunctioncall -<number> n = lang.posthyphenchar(<language> l) -lang.posthyphenchar(<language> l, <number> n) -\stopfunctioncall - -Gets or sets the \quote {post|-|break} hyphen character for implicit hyphenation -in this language (initially null, decimal~0, indicating emptiness). - -\startfunctioncall -<number> n = lang.preexhyphenchar(<language> l) -lang.preexhyphenchar(<language> l, <number> n) -\stopfunctioncall - -Gets or sets the \quote {pre|-|break} hyphen character for explicit hyphenation -in this language (initially null, decimal~0, indicating emptiness). - -\startfunctioncall -<number> n = lang.postexhyphenchar(<language> l) -lang.postexhyphenchar(<language> l, <number> n) -\stopfunctioncall - -Gets or sets the \quote {post|-|break} hyphen character for explicit hyphenation -in this language (initially null, decimal~0, indicating emptiness). - -\startfunctioncall -<boolean> success = lang.hyphenate(<node> head) -<boolean> success = lang.hyphenate(<node> head, <node> tail) -\stopfunctioncall - -Inserts hyphenation points (discretionary nodes) in a node list. If \type {tail} -is given as argument, processing stops on that node. Currently, \type {success} -is always true if \type {head} (and \type {tail}, if specified) are proper nodes, -regardless of possible other errors. - -Hyphenation works only on \quote {characters}, a special subtype of all the glyph -nodes with the node subtype having the value \type {1}. Glyph modes with -different subtypes are not processed. See \in {section~} [charsandglyphs] for -more details. - -The following two commands can be used to set or query hj codes: - -\startfunctioncall -lang.sethjcode(<language> l, <number> char, <number> usedchar) -<number> usedchar = lang.gethjcode(<language> l, <number> char) -\stopfunctioncall - -When you set a hjcode the current sets get initialized unless the set was already -initialized due to \type {\savinghyphcodes} being larger than zero. - -\stopchapter - -\stopcomponent - -% \parindent0pt \hsize=1.1cm -% 12-34-56 \par -% 12-34-\hbox{56} \par -% 12-34-\vrule width 1em height 1.5ex \par -% 12-\hbox{34}-56 \par -% 12-\vrule width 1em height 1.5ex-56 \par -% \hjcode`\1=`\1 \hjcode`\2=`\2 \hjcode`\3=`\3 \hjcode`\4=`\4 \vskip.5cm -% 12-34-56 \par -% 12-34-\hbox{56} \par -% 12-34-\vrule width 1em height 1.5ex \par -% 12-\hbox{34}-56 \par -% 12-\vrule width 1em height 1.5ex-56 \par - |