author | Hans Hagen <pragma@wxs.nl> | 2020-01-26 19:35:43 +0100 |
---|---|---|
committer | Context Git Mirror Bot <phg@phi-gamma.net> | 2020-01-26 19:35:43 +0100 |
commit | 43fc66771a0c9d27cc0b7fe7a69392ea313bd0ca (patch) | |
tree | 9b339c63cd28528e5062fe980e964808df619374 /doc/context/sources/general/manuals/luametatex/luametatex-languages.tex | |
parent | 5189b2143a30a39cd3533569cbef3f06422cc1d9 (diff) | |
download | context-43fc66771a0c9d27cc0b7fe7a69392ea313bd0ca.tar.gz |
2020-01-26 18:37:00
Diffstat (limited to 'doc/context/sources/general/manuals/luametatex/luametatex-languages.tex')
-rw-r--r-- | doc/context/sources/general/manuals/luametatex/luametatex-languages.tex | 1113 |
1 files changed, 1113 insertions, 0 deletions
diff --git a/doc/context/sources/general/manuals/luametatex/luametatex-languages.tex b/doc/context/sources/general/manuals/luametatex/luametatex-languages.tex new file mode 100644 index 000000000..19112a7f1 --- /dev/null +++ b/doc/context/sources/general/manuals/luametatex/luametatex-languages.tex @@ -0,0 +1,1113 @@ +% language=uk + +\environment luametatex-style + +\startcomponent luametatex-languages + +\startchapter[reference=languages,title={Languages, characters, fonts and glyphs}] + +\startsection[title={Introduction}] + +\topicindex {languages} + +\LUATEX's internal handling of the characters and glyphs that eventually become +typeset is quite different from the way \TEX82 handles those same objects. The +easiest way to explain the difference is to focus on unrestricted horizontal mode +(i.e.\ paragraphs) and hyphenation first. Later on, it will be easy to deal +with the differences that occur in horizontal and math modes. + +In \TEX82, the characters you type are converted into \type {char} node records +when they are encountered by the main control loop. \TEX\ attaches and processes +the font information while creating those records, so that the resulting \quote +{horizontal list} contains the final forms of ligatures and implicit kerning. +This packaging is needed because we may want to get the effective width of, for +instance, a horizontal box. + +When it becomes necessary to hyphenate words in a paragraph, \TEX\ converts (one +word at a time) the \type {char} node records into a string by replacing ligatures +with their components and ignoring the kerning. Then it runs the hyphenation +algorithm on this string, and converts the hyphenated result back into a \quote +{horizontal list} that is consecutively spliced back into the paragraph stream. +Keep in mind that the paragraph may contain unboxed horizontal material, which +then already contains ligatures and kerns, and the words therein are part of the +hyphenation process. 
+ +Those \type {char} node records are somewhat misnamed, as they are glyph +positions in specific fonts, and therefore not really \quote {characters} in the +linguistic sense. There is no language information inside the \type {char} node +records at all. Instead, language information is passed along using \type +{language whatsit} nodes inside the horizontal list. + +In \LUATEX, the situation is quite different. The characters you type are always +converted into \nod {glyph} node records with a special subtype to identify them +as being intended as linguistic characters. \LUATEX\ stores the needed language +information in those records, but does not do any font|-|related processing at +the time of node creation. It only stores the index of the current font and a +reference to a character in that font. + +When it becomes necessary to typeset a paragraph, \LUATEX\ first inserts all +hyphenation points right into the whole node list. Next, it processes all the +font information in the whole list (creating ligatures and adjusting kerning), +and finally it adjusts all the subtype identifiers so that the records are \quote +{glyph nodes} from now on. + +\stopsection + +\startsection[title={Characters, glyphs and discretionaries},reference=charsandglyphs] + +\topicindex {characters} +\topicindex {glyphs} +\topicindex {hyphenation} + +\TEX82 (including \PDFTEX) differentiates between \type {char} nodes and \type +{lig} nodes. The former are simple items that contained nothing but a \quote +{character} and a \quote {font} field, and they lived in the same memory as +tokens did. The latter also contained a list of components, and a subtype +indicating whether this ligature was the result of a word boundary, and it was +stored in the same place as other nodes like boxes and kerns and glues. + +In \LUATEX, these two types are merged into one, somewhat larger structure called +a \nod {glyph} node. 
Besides having the old character, font, and component +fields there are a few more, like \quote {attr} that we will see in \in {section} +[glyphnodes], these nodes also contain a subtype, that codes four main types and +two additional ghost types. For ligatures, multiple bits can be set at the same +time (in case of a single|-|glyph word). + +\startitemize + \startitem + \type {character}, for characters to be hyphenated: the lowest bit + (bit 0) is set to 1. + \stopitem + \startitem + \nod {glyph}, for specific font glyphs: the lowest bit (bit 0) is + not set. + \stopitem + \startitem + \type {ligature}, for constructed ligatures bit 1 is set. + \stopitem + \startitem + \type {ghost}, for so called \quote {ghost objects} bit 2 is set. + \stopitem + \startitem + \type {left}, for ligatures created from a left word boundary and for + ghosts created from \lpr {leftghost} bit 3 gets set. + \stopitem + \startitem + \type {right}, for ligatures created from a right word boundary and + for ghosts created from \lpr {rightghost} bit 4 is set. + \stopitem +\stopitemize + +The \nod {glyph} nodes also contain language data, split into four items that +were current when the node was created: the \prm {setlanguage} (15~bits), \prm +{lefthyphenmin} (8~bits), \prm {righthyphenmin} (8~bits), and \prm {uchyph} +(1~bit). + +Incidentally, \LUATEX\ allows 16383 separate languages, and words can be 256 +characters long. The language is stored with each character. You can set +\prm {firstvalidlanguage} to for instance~1 and make thereby language~0 +an ignored hyphenation language. + +The new primitive \lpr {hyphenationmin} can be used to signal the minimal length +of a word. This value is stored with the (current) language. + +Because the \prm {uchyph} value is saved in the actual nodes, its handling is +subtly different from \TEX82: changes to \prm {uchyph} become effective +immediately, not at the end of the current partial paragraph. 
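+
+As a minimal sketch of how these parameters might be combined (the language
+number and the minimal word length are arbitrary values chosen for illustration,
+not recommended settings):
+
+\starttyping
+\firstvalidlanguage 1 % language 0 is no longer hyphenated
+\language 2           % hypothetical language number
+\hyphenationmin 6     % assumed threshold, stored with language 2
+lower {\uchyph 0 Upper} lower \par % only the glyphs of Upper carry uchyph 0
+\stoptyping
+
+If we read the description above correctly, the grouped \prm {uchyph} setting
+only affects the word typed inside the group, because the value is stored in
+the glyph nodes at the moment they are created.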
+ +Typeset boxes now always have their language information embedded in the nodes +themselves, so there is no longer a possible dependency on the surrounding +language settings. In \TEX82, a mid|-|paragraph statement like \type {\unhbox0} +would process the box using the current paragraph language unless there was a +\prm {setlanguage} issued inside the box. In \LUATEX, all language variables +are already frozen. + +In traditional \TEX\ the process of hyphenation is driven by \type {lccode}s. In +\LUATEX\ we made this dependency less strong. There are several strategies +possible. When you do nothing, the currently used \type {lccode}s are used when +loading patterns, setting exceptions or hyphenating a list. + +When you set \prm {savinghyphcodes} to a value greater than zero the current set +of \type {lccode}s will be saved with the language. In that case changing a \type +{lccode} afterwards has no effect. However, you can adapt the set with: + +\starttyping +\hjcode`a=`a +\stoptyping + +This change is global, which makes sense if you keep in mind that the moment that +hyphenation happens is (normally) when the paragraph or a horizontal box is +constructed. When \prm {savinghyphcodes} was zero when the language got +initialized you start out with nothing, otherwise you already have a set. + +When a \lpr {hjcode} is greater than 0 but less than 32 it indicates the +length to be used. In the following example we map a character (\type {x}) onto +another one in the patterns and tell the engine that \type {œ} counts as two +characters. Because traditionally zero itself is reserved for inhibiting +hyphenation, a value of 32 counts as zero. 
+ +Here are some examples (we assume that French patterns are used): + +\starttabulate[||||] +\NC \NC \type{foobar} \NC \type{foo-bar} \NC \NR +\NC \type{\hjcode`x=`o} \NC \type{fxxbar} \NC \type{fxx-bar} \NC \NR +\NC \type{\lefthyphenmin3} \NC \type{œdipus} \NC \type{œdi-pus} \NC \NR +\NC \type{\lefthyphenmin4} \NC \type{œdipus} \NC \type{œdipus} \NC \NR +\NC \type{\hjcode`œ=2} \NC \type{œdipus} \NC \type{œdi-pus} \NC \NR +\NC \type{\hjcode`i=32 \hjcode`d=32} \NC \type{œdipus} \NC \type{œdipus} \NC \NR +\NC +\stoptabulate + +Carrying all this information with each glyph would give too much overhead and +also make the process of setting up these codes more complex. A solution with +\type {hjcode} sets was considered but rejected because in practice the current +approach is sufficient and it would not be compatible anyway. + +Beware: the values are always saved in the format, independent of the setting +of \prm {savinghyphcodes} at the moment the format is dumped. + +A boundary node normally would mark the end of a word which interferes with for +instance discretionary injection. For this you can use the \prm {wordboundary} +as a trigger. 
Here are a few examples of usage: + +\startbuffer + discrete---discrete +\stopbuffer +\typebuffer \startnarrower \dontcomplain \hsize 1pt \getbuffer \par \stopnarrower +\startbuffer + discrete\discretionary{}{}{---}discrete +\stopbuffer +\typebuffer \startnarrower \dontcomplain \hsize 1pt \getbuffer \par \stopnarrower +\startbuffer + discrete\wordboundary\discretionary{}{}{---}discrete +\stopbuffer +\typebuffer \startnarrower \dontcomplain \hsize 1pt \getbuffer \par \stopnarrower +\startbuffer + discrete\wordboundary\discretionary{}{}{---}\wordboundary discrete +\stopbuffer +\typebuffer \startnarrower \dontcomplain \hsize 1pt \getbuffer \par \stopnarrower +\startbuffer + discrete\wordboundary\discretionary{---}{}{}\wordboundary discrete +\stopbuffer +\typebuffer \startnarrower \dontcomplain \hsize 1pt \getbuffer \par \stopnarrower + +We only accept an explicit hyphen when there is a preceding glyph and we skip a +sequence of explicit hyphens since that normally indicates a \type {--} or \type +{---} ligature in which case we can in a worse case usage get bad node lists +later on due to messed up ligature building as these dashes are ligatures in base +fonts. This is a side effect of separating the hyphenation, ligaturing and +kerning steps. + +The start and end of a sequence of characters is signalled by a \nod {glue}, \nod +{penalty}, \nod {kern} or \nod {boundary} node. But by default also a \nod +{hlist}, \nod {vlist}, \nod {rule}, \nod {dir}, \nod {whatsit}, \nod {ins}, and +\nod {adjust} node indicate a start or end. 
You can omit the last set from the +test by setting \lpr {hyphenationbounds} to a non|-|zero value: + +\starttabulate[|c|l|] +\DB value \BC behaviour \NC \NR +\TB +\NC \type{0} \NC not strict \NC \NR +\NC \type{1} \NC strict start \NC \NR +\NC \type{2} \NC strict end \NC \NR +\NC \type{3} \NC strict start and strict end \NC \NR +\LL +\stoptabulate + +The word start is determined as follows: + +\starttabulate[|l|l|] +\DB node \BC behaviour \NC \NR +\TB +\BC boundary \NC yes when wordboundary \NC \NR +\BC hlist \NC when hyphenationbounds 1 or 3 \NC \NR +\BC vlist \NC when hyphenationbounds 1 or 3 \NC \NR +\BC rule \NC when hyphenationbounds 1 or 3 \NC \NR +\BC dir \NC when hyphenationbounds 1 or 3 \NC \NR +\BC whatsit \NC when hyphenationbounds 1 or 3 \NC \NR +\BC glue \NC yes \NC \NR +\BC math \NC skipped \NC \NR +\BC glyph \NC exhyphenchar (one only) : yes (so no -- ---) \NC \NR +\BC otherwise \NC yes \NC \NR +\LL +\stoptabulate + +The word end is determined as follows: + +\starttabulate[|l|l|] +\DB node \BC behaviour \NC \NR +\TB +\BC boundary \NC yes \NC \NR +\BC glyph \NC yes when different language \NC \NR +\BC glue \NC yes \NC \NR +\BC penalty \NC yes \NC \NR +\BC kern \NC yes when not italic (for some historic reason) \NC \NR +\BC hlist \NC when hyphenationbounds 2 or 3 \NC \NR +\BC vlist \NC when hyphenationbounds 2 or 3 \NC \NR +\BC rule \NC when hyphenationbounds 2 or 3 \NC \NR +\BC dir \NC when hyphenationbounds 2 or 3 \NC \NR +\BC whatsit \NC when hyphenationbounds 2 or 3 \NC \NR +\BC ins \NC when hyphenationbounds 2 or 3 \NC \NR +\BC adjust \NC when hyphenationbounds 2 or 3 \NC \NR +\LL +\stoptabulate + +\in {Figures} [hb:1] upto \in [hb:5] show some examples. In all cases we set the +min values to 1 and make sure that the words hyphenate at each character. 
+ +\hyphenation{o-n-e t-w-o} + +\def\SomeTest#1#2% + {\lefthyphenmin \plusone + \righthyphenmin \plusone + \parindent \zeropoint + \everypar \emptytoks + \dontcomplain + \hbox to 2cm {% + \vtop {% + \hsize 1pt + \hyphenationbounds#1 + #2 + \par}}} + +\startplacefigure[reference=hb:1,title={\type{one}}] + \startcombination[4*1] + {\SomeTest{0}{one}} {\type{0}} + {\SomeTest{1}{one}} {\type{1}} + {\SomeTest{2}{one}} {\type{2}} + {\SomeTest{3}{one}} {\type{3}} + \stopcombination +\stopplacefigure + +\startplacefigure[reference=hb:2,title={\type{one\null two}}] + \startcombination[4*1] + {\SomeTest{0}{one\null two}} {\type{0}} + {\SomeTest{1}{one\null two}} {\type{1}} + {\SomeTest{2}{one\null two}} {\type{2}} + {\SomeTest{3}{one\null two}} {\type{3}} + \stopcombination +\stopplacefigure + +\startplacefigure[reference=hb:3,title={\type{\null one\null two}}] + \startcombination[4*1] + {\SomeTest{0}{\null one\null two}} {\type{0}} + {\SomeTest{1}{\null one\null two}} {\type{1}} + {\SomeTest{2}{\null one\null two}} {\type{2}} + {\SomeTest{3}{\null one\null two}} {\type{3}} + \stopcombination +\stopplacefigure + +\startplacefigure[reference=hb:4,title={\type{one\null two\null}}] + \startcombination[4*1] + {\SomeTest{0}{one\null two\null}} {\type{0}} + {\SomeTest{1}{one\null two\null}} {\type{1}} + {\SomeTest{2}{one\null two\null}} {\type{2}} + {\SomeTest{3}{one\null two\null}} {\type{3}} + \stopcombination +\stopplacefigure + +\startplacefigure[reference=hb:5,title={\type{\null one\null two\null}}] + \startcombination[4*1] + {\SomeTest{0}{\null one\null two\null}} {\type{0}} + {\SomeTest{1}{\null one\null two\null}} {\type{1}} + {\SomeTest{2}{\null one\null two\null}} {\type{2}} + {\SomeTest{3}{\null one\null two\null}} {\type{3}} + \stopcombination +\stopplacefigure + +% (Future versions of \LUATEX\ might provide more granularity.) + +In traditional \TEX\ ligature building and hyphenation are interwoven with the +line break mechanism. In \LUATEX\ these phases are isolated. 
As a consequence we +deal differently with (a sequence of) explicit hyphens. We already have added +some control over aspects of the hyphenation and yet another one concerns +automatic hyphens (e.g.\ \type {-} characters in the input). + +When \lpr {automatichyphenmode} has a value of 0, a hyphen will be turned into +an automatic discretionary. The snippets before and after it will not be +hyphenated. A side effect is that a leading hyphen can lead to a split but one +will seldom run into that situation. Setting a pre and post character makes this +more prominent. A value of 1 will prevent this side effect and a value of 2 will +not turn the hyphen into a discretionary. Experiments with other options, like +permitting hyphenation of the words on both sides were discarded. + +\startbuffer[a] +before-after \par +before--after \par +before---after \par +\stopbuffer + +\startbuffer[b] +-before \par +after- \par +--before \par +after-- \par +---before \par +after--- \par +\stopbuffer + +\startbuffer[c] +before-after \par +before--after \par +before---after \par +\stopbuffer + +\startbuffer[demo] +\startcombination[nx=4,ny=3,location=top] + {\framed[align=normal,strut=no,top=\vskip.5ex,bottom=\vskip.5ex]{\automatichyphenmode\zerocount \hsize6em \getbuffer[a]}} {A~0~6em} + {\framed[align=normal,strut=no,top=\vskip.5ex,bottom=\vskip.5ex]{\automatichyphenmode\zerocount \hsize2pt \getbuffer[a]}} {A~0~2pt} + {\framed[align=normal,strut=no,top=\vskip.5ex,bottom=\vskip.5ex]{\automatichyphenmode\plusone \hsize2pt \getbuffer[a]}} {A~1~2pt} + {\framed[align=normal,strut=no,top=\vskip.5ex,bottom=\vskip.5ex]{\automatichyphenmode\plustwo \hsize2pt \getbuffer[a]}} {A~2~2pt} + {\framed[align=normal,strut=no,top=\vskip.5ex,bottom=\vskip.5ex]{\automatichyphenmode\zerocount \hsize6em \getbuffer[b]}} {B~0~6em} + {\framed[align=normal,strut=no,top=\vskip.5ex,bottom=\vskip.5ex]{\automatichyphenmode\zerocount \hsize2pt \getbuffer[b]}} {B~0~2pt} + 
{\framed[align=normal,strut=no,top=\vskip.5ex,bottom=\vskip.5ex]{\automatichyphenmode\plusone \hsize2pt \getbuffer[b]}} {B~1~2pt} + {\framed[align=normal,strut=no,top=\vskip.5ex,bottom=\vskip.5ex]{\automatichyphenmode\plustwo \hsize2pt \getbuffer[b]}} {B~2~2pt} + {\framed[align=normal,strut=no,top=\vskip.5ex,bottom=\vskip.5ex]{\automatichyphenmode\zerocount \hsize6em \getbuffer[c]}} {C~0~6em} + {\framed[align=normal,strut=no,top=\vskip.5ex,bottom=\vskip.5ex]{\automatichyphenmode\zerocount \hsize2pt \getbuffer[c]}} {C~0~2pt} + {\framed[align=normal,strut=no,top=\vskip.5ex,bottom=\vskip.5ex]{\automatichyphenmode\plusone \hsize2pt \getbuffer[c]}} {C~1~2pt} + {\framed[align=normal,strut=no,top=\vskip.5ex,bottom=\vskip.5ex]{\automatichyphenmode\plustwo \hsize2pt \getbuffer[c]}} {C~2~2pt} +\stopcombination +\stopbuffer + +\startplacefigure[locationreference=automatichyphenmode:1,title={The automatic modes \type {0} (default), \type {1} and \type {2}, with a \prm {hsize} +of 6em and 2pt (which triggers a linebreak).}] + \dontcomplain \tt \getbuffer[demo] +\stopplacefigure + +\startplacefigure[reference=automatichyphenmode:2,title={The automatic modes \type {0} (default), \type {1} and \type {2}, with \lpr {preexhyphenchar} and \lpr {postexhyphenchar} set to characters \type {A} and \type {B}.}] + \postexhyphenchar`A\relax + \preexhyphenchar `B\relax + \dontcomplain \tt \getbuffer[demo] +\stopplacefigure + +In \in {figure} [automatichyphenmode:1] \in {and} [automatichyphenmode:2] we show +what happens with three samples: + +Input A: \typebuffer[a] +Input B: \typebuffer[b] +Input C: \typebuffer[c] + +As with primitive companions of other single character commands, the \prm {-} +command has a more verbose primitive version in \lpr {explicitdiscretionary} +and the normally intercepted in the hyphenator character \type {-} (or whatever +is configured) is available as \lpr {automaticdiscretionary}. 
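+
+A small sketch of the verbose forms, assuming the default value of \prm
+{exhyphenchar}; the sample words are arbitrary:
+
+\starttyping
+\automatichyphenmode 1 % avoid the split at a leading hyphen
+well\explicitdiscretionary known \par  % roughly as if we had typed well\-known
+well\automaticdiscretionary known \par % roughly as if we had typed well-known
+\stoptyping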
+ +\stopsection + +\startsection[title={The main control loop}] + +\topicindex {main loop} +\topicindex {hyphenation} + +In \LUATEX's main loop, almost all input characters that are to be typeset are +converted into \nod {glyph} node records with subtype \quote {character}, but +there are a few exceptions. + +\startitemize[n] + +\startitem + The \prm {accent} primitive creates nodes with subtype \quote {glyph} + instead of \quote {character}: one for the actual accent and one for the + accentee. The primary reason for this is that \prm {accent} in \TEX82 is + explicitly dependent on the current font encoding, so it would not make much + sense to attach a new meaning to the primitive's name, as that would + invalidate many old documents and macro packages. A secondary reason is that + in \TEX82, \prm {accent} prohibits hyphenation of the current word. Since + in \LUATEX\ hyphenation only takes place on \quote {character} nodes, it is + possible to achieve the same effect. Of course, modern \UNICODE\ aware macro + packages will not use the \prm {accent} primitive at all but try to map + directly on composed characters. + + This change of meaning did happen with \prm {char}, that now generates + \quote {glyph} nodes with a character subtype. In traditional \TEX\ there was + a strong relationship between the 8|-|bit input encoding, hyphenation and + glyphs taken from a font. In \LUATEX\ we have \UTF\ input, and in most cases + this maps directly to a character in a font, apart from glyph replacement in + the font engine. If you want to access arbitrary glyphs in a font directly + you can always use \LUA\ to do so, because fonts are available as \LUA\ + table. +\stopitem + +\startitem + All the results of processing in math mode eventually become nodes with + \quote {glyph} subtypes. In fact, the result of processing math is just + a regular list of glyphs, kerns, glue, penalties, boxes etc. 
+\stopitem + +\startitem + The \ALEPH|-|derived commands \lpr {leftghost} and \lpr {rightghost} + create nodes of a third subtype: \quote {ghost}. These nodes are ignored + completely by all further processing until the stage where inter|-|glyph + kerning is added. +\stopitem + +\startitem + Automatic discretionaries are handled differently. \TEX82 inserts an empty + discretionary after sensing an input character that matches the \prm + {hyphenchar} in the current font. This test is wrong in our opinion: whether + or not hyphenation takes place should not depend on the current font, it is a + language property. \footnote {When \TEX\ showed up we didn't have \UNICODE\ + yet and being limited to eight bits meant that one sometimes had to + compromise between supporting character input, glyph rendering and hyphenation.} + + In \LUATEX, it works like this: if \LUATEX\ senses a string of input + characters that matches the value of the new integer parameter \prm + {exhyphenchar}, it will insert an explicit discretionary after that series of + nodes. Initially \TEX\ sets the \type {\exhyphenchar=`\-}. Incidentally, this + is a global parameter instead of a language-specific one because it may be + useful to change the value depending on the document structure instead of the + text language. + + The insertion of discretionaries after a sequence of explicit hyphens happens + at the same time as the other hyphenation processing, {\it not\/} inside the + main control loop. + + The only use \LUATEX\ has for \prm {hyphenchar} is in the check whether a + word should be considered for hyphenation at all. If the \prm {hyphenchar} + of the font attached to the first character node in a word is negative, then + hyphenation of that word is abandoned immediately. This behaviour is added + for backward compatibility only, and \type {\hyphenchar=-1} should not be + used as a means of preventing hyphenation in new \LUATEX\ documents. 
+\stopitem + +\startitem + The \prm {setlanguage} command no longer creates whatsits. The meaning of + \prm {setlanguage} is changed so that it is now an integer parameter like all + others. That integer parameter is used in \type {\glyph_node} creation to add + language information to the glyph nodes. In conjunction, the \prm {language} + primitive is extended so that it always also updates the value of \prm + {setlanguage}. +\stopitem + +\startitem + The \prm {noboundary} command (that prohibits word boundary processing + where that would normally take place) now does create nodes. These nodes are + needed because the exact place of the \prm {noboundary} command in the + input stream has to be retained until after the ligature and font processing + stages. +\stopitem + +\startitem + There is no longer a \type {main_loop} label in the code. Remember that + \TEX82 did quite a lot of processing while adding \type {char_nodes} to the + horizontal list? For speed reasons, it handled that processing code outside + of the \quote {main control} loop, and only the first character of any \quote + {word} was handled by that \quote {main control} loop. In \LUATEX, there is + no longer a need for that (all hard work is done later), and the (now very + small) bits of character|-|handling code have been moved back inline. When + \prm {tracingcommands} is on, this is visible because the full word is + reported, instead of just the initial character. 
+\stopitem + +\stopitemize + +Because we tend to make hard coded behaviour configurable, a few new primitives +have been added: + +\starttyping +\hyphenpenaltymode +\automatichyphenpenalty +\explicithyphenpenalty +\stoptyping + +The first parameter has the following consequences for automatic discs (the ones +resulting from an \prm {exhyphenchar}) and explicit discs: + +\starttabulate[|c|l|l|] +\DB mode \BC automatic disc \type {-} \BC explicit disc \prm{-} \NC \NR +\TB +\NC \type{0} \NC \prm {exhyphenpenalty} \NC \prm {exhyphenpenalty} \NC \NR +\NC \type{1} \NC \prm {hyphenpenalty} \NC \prm {hyphenpenalty} \NC \NR +\NC \type{2} \NC \prm {exhyphenpenalty} \NC \prm {hyphenpenalty} \NC \NR +\NC \type{3} \NC \prm {hyphenpenalty} \NC \prm {exhyphenpenalty} \NC \NR +\NC \type{4} \NC \lpr {automatichyphenpenalty} \NC \lpr {explicithyphenpenalty} \NC \NR +\NC \type{5} \NC \prm {exhyphenpenalty} \NC \lpr {explicithyphenpenalty} \NC \NR +\NC \type{6} \NC \prm {hyphenpenalty} \NC \lpr {explicithyphenpenalty} \NC \NR +\NC \type{7} \NC \lpr {automatichyphenpenalty} \NC \prm {exhyphenpenalty} \NC \NR +\NC \type{8} \NC \lpr {automatichyphenpenalty} \NC \prm {hyphenpenalty} \NC \NR +\LL +\stoptabulate + +Other values do what we always did in \LUATEX: insert \prm {exhyphenpenalty}. + +\stopsection + +\startsection[title={Loading patterns and exceptions},reference=patternsexceptions] + +\topicindex {hyphenation} +\topicindex {hyphenation+patterns} +\topicindex {hyphenation+exceptions} +\topicindex {patterns} +\topicindex {exceptions} + +Although we keep the traditional approach towards hyphenation (which is still +superior) the implementation of the hyphenation algorithm in \LUATEX\ is quite +different from the one in \TEX82. + +After expansion, the argument for \prm {patterns} has to be proper \UTF8 with +individual patterns separated by spaces, no \prm {char} or \prm {chardef}d +commands are allowed. The current implementation is quite strict and will reject +all non|-|\UNICODE\ characters. 
Likewise, the expanded argument for \prm +{hyphenation} also has to be proper \UTF8, but here a bit of extra syntax is +provided: + +\startitemize[n] +\startitem + Three sets of arguments in curly braces (\type {{}{}{}}) indicate a desired + complex discretionary, with arguments as in \prm {discretionary}'s command in + normal document input. +\stopitem +\startitem + A \type {-} indicates a desired simple discretionary, cf.\ \type {\-} and + \type {\discretionary{-}{}{}} in normal document input. +\stopitem +\startitem + Internal command names are ignored. This rule is provided especially for \prm + {discretionary}, but it also helps to deal with \prm {relax} commands that + may sneak in. +\stopitem +\startitem + An \type {=} indicates a (non|-|discretionary) hyphen in the document input. +\stopitem +\stopitemize + +The expanded argument is first converted back to a space|-|separated string while +dropping the internal command names. This string is then converted into a +dictionary by a routine that creates key|-|value pairs by converting the other +listed items. It is important to note that the keys in an exception dictionary +can always be generated from the values. Here are a few examples: + +\starttabulate[|l|l|l|] +\DB value \BC implied key (input) \BC effect \NC\NR +\TB +\NC \type {ta-ble} \NC table \NC \type {ta\-ble} ($=$ \type {ta\discretionary{-}{}{}ble}) \NC\NR +\NC \type {ba{k-}{}{c}ken} \NC backen \NC \type {ba\discretionary{k-}{}{c}ken} \NC\NR +\LL +\stoptabulate + +The resultant patterns and exception dictionary will be stored under the language +code that is the present value of \prm {language}. + +In the last line of the table, you see there is no \prm {discretionary} command +in the value: the command is optional in the \TEX-based input syntax. 
The +underlying reason for that is that it is conceivable that a whole dictionary of +words is stored as a plain text file and loaded into \LUATEX\ using one of the +functions in the \LUA\ \type {lang} library. This loading method is quite a bit +faster than going through the \TEX\ language primitives, but some (most?) of that +speed gain would be lost if it had to interpret command sequences while doing so. + +It is possible to specify extra hyphenation points in compound words by using +\type {{-}{}{-}} for the explicit hyphen character (replace \type {-} by the +actual explicit hyphen character if needed). For example, this matches the word +\quote {multi|-|word|-|boundaries} and allows an extra break in between \quote +{boun} and \quote {daries}: + +\starttyping +\hyphenation{multi{-}{}{-}word{-}{}{-}boun-daries} +\stoptyping + +The motivation behind the \ETEX\ extension \prm {savinghyphcodes} was that +hyphenation heavily depended on font encodings. This is no longer true in +\LUATEX, and the corresponding primitive is basically ignored. Because we now +have \lpr {hjcode}, the case related codes can be used exclusively for \prm +{uppercase} and \prm {lowercase}. + +The three curly brace pair pattern in an exception can be somewhat unexpected so +we will try to explain it by example. The pattern \type {foo{}{}{x}bar} +creates a lookup \type {fooxbar} and the pattern \type {foo{}{}{}bar} creates +\type {foobar}. Then, when a hit happens there is a replacement text (\type {x}) +or none. Because we introduced penalties in discretionary nodes, the exception +syntax now also can take a penalty specification. The value between square brackets +is a multiplier for \lpr {exceptionpenalty}. Here we have set it to 10000 so +effectively we get 30000 in the example. 
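+
+To spell out the arithmetic, here is a hedged sketch with made up words; the
+position of the bracketed multiplier is taken from the samples in this section:
+
+\starttyping
+\exceptionpenalty=10000
+\hyphenation{foo{a-}{-b}{x}bar}    % lookup fooxbar, no multiplier given
+\hyphenation{baz{a-}{-b}{x}[3]qux} % lookup bazxqux, penalty 3 * 10000 = 30000
+\stoptyping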
+ +\def\ShowSample#1#2% + {\startlinecorrection[blank] + \hyphenation{#1}% + \exceptionpenalty=10000 + \bTABLE[foregroundstyle=type] + \bTR + \bTD[align=middle,nx=4] \type{#1} \eTD + \eTR + \bTR + \bTD[align=middle] \type{10em} \eTD + \bTD[align=middle] \type {3em} \eTD + \bTD[align=middle] \type {0em} \eTD + \bTD[align=middle] \type {6em} \eTD + \eTR + \bTR + \bTD[width=10em]\vtop{\hsize 10em 123 #2 123\par}\eTD + \bTD[width=10em]\vtop{\hsize 3em 123 #2 123\par}\eTD + \bTD[width=10em]\vtop{\hsize 0em 123 #2 123\par}\eTD + \bTD[width=10em]\vtop{\setupalign[verytolerant,stretch]\rmtf\hsize 6em 123 #2 #2 #2 #2 123\par}\eTD + \eTR + \eTABLE + \stoplinecorrection} + +\ShowSample{x{a-}{-b}{}x{a-}{-b}{}x{a-}{-b}{}x{a-}{-b}{}xx}{xxxxxx} +\ShowSample{x{a-}{-b}{}x{a-}{-b}{}[3]x{a-}{-b}{}[1]x{a-}{-b}{}xx}{xxxxxx} + +\ShowSample{z{a-}{-b}{z}{a-}{-b}{z}{a-}{-b}{z}{a-}{-b}{z}z}{zzzzzz} +\ShowSample{z{a-}{-b}{z}{a-}{-b}{z}[3]{a-}{-b}{z}[1]{a-}{-b}{z}z}{zzzzzz} + +\stopsection + +\startsection[title={Applying hyphenation}] + +\topicindex {hyphenation+how it works} +\topicindex {hyphenation+discretionaries} +\topicindex {discretionaries} + +The internal structures \LUATEX\ uses for the insertion of discretionaries in +words is very different from the ones in \TEX82, and that means there are some +noticeable differences in handling as well. + +First and foremost, there is no \quote {compressed trie} involved in hyphenation. +The algorithm still reads pattern files generated by \PATGEN, but \LUATEX\ uses a +finite state hash to match the patterns against the word to be hyphenated. This +algorithm is based on the \quote {libhnj} library used by \OPENOFFICE, which in +turn is inspired by \TEX. + +There are a few differences between \LUATEX\ and \TEX82 that are a direct result +of the implementation: + +\startitemize +\startitem + \LUATEX\ happily hyphenates the full \UNICODE\ character range. 
+\stopitem +\startitem + Pattern and exception dictionary size is limited by the available memory + only: all allocations are done dynamically. The trie|-|related settings in + \type {texmf.cnf} are ignored. +\stopitem +\startitem + Because there is no \quote {trie preparation} stage, language patterns never + become frozen. This means that the primitive \prm {patterns} (and its \LUA\ + counterpart \type {lang.patterns}) can be used at any time, not only in + ini\TEX. +\stopitem +\startitem + Only the string representation of \prm {patterns} and \prm {hyphenation} is + stored in the format file. At format load time, they are simply + re|-|evaluated. It follows that there is no real reason to preload languages + in the format file. In fact, it is usually not a good idea to do so. It is + much smarter to load patterns no sooner than the first time they are actually + needed. +\stopitem +\startitem + \LUATEX\ uses the language-specific variables \lpr {prehyphenchar} and \lpr + {posthyphenchar} in the creation of implicit discretionaries, instead of + \TEX82's \prm {hyphenchar}, and the values of the language|-|specific + variables \lpr {preexhyphenchar} and \lpr {postexhyphenchar} for explicit + discretionaries (instead of \TEX82's empty discretionary). +\stopitem +\startitem + The values of the two counters related to hyphenation, \prm {hyphenpenalty} + and \prm {exhyphenpenalty}, are now stored in the discretionary nodes. This + permits a local overload for explicit \prm {discretionary} commands. The + value that is current when the hyphenation pass is applied is used. When no callbacks + are used this is compatible with traditional \TEX. When you apply the \LUA\ + \type {lang.hyphenate} function the current values are used. +\stopitem +\startitem + The hyphenation exception dictionary is maintained as a key|-|value hash, and + that is also dynamic, so the \type {hyph_size} setting is not used either. 
+\stopitem +\stopitemize + +Because we store penalties in the disc node the \prm {discretionary} command has +been extended to accept an optional penalty specification, so you can do the +following: + +\startbuffer +\hsize1mm +1:foo{\hyphenpenalty 10000\discretionary{}{}{}}bar\par +2:foo\discretionary penalty 10000 {}{}{}bar\par +3:foo\discretionary{}{}{}bar\par +\stopbuffer + +\typebuffer + +This results in: + +\blank \start \getbuffer \stop \blank + +Inserted characters and ligatures inherit their attributes from the nearest glyph +node item (usually the preceding one, but the following one for the items +inserted at the left-hand side of a word). + +Word boundaries are no longer implied by font switches, but by language switches. +One word can have two separate fonts and still be hyphenated correctly (but it +cannot have two different languages: the \prm {setlanguage} command forces a +word boundary). + +All languages start out with \type {\prehyphenchar=`\-}, \type {\posthyphenchar=0}, +\type {\preexhyphenchar=0} and \type {\postexhyphenchar=0}. When you assign a +value to one of these four parameters, you are actually changing the settings +for the current \prm {language}; this behaviour is compatible with \prm {patterns} +and \prm {hyphenation}. + +\LUATEX\ also hyphenates the first word in a paragraph. Words can be up to 256 +characters long (up from 64 in \TEX82). Longer words are ignored right now, but +eventually either the limitation will be removed or perhaps it will become +possible to silently ignore the excess characters (this is what happens in +\TEX82, but there the behaviour cannot be controlled). + +If you are using the \LUA\ function \type {lang.hyphenate}, you should be aware +that this function expects to receive a list of \quote {character} nodes. It will +not operate properly in the presence of \quote {glyph}, \quote {ligature}, or +\quote {ghost} nodes, nor does it know how to deal with kerning.
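+
+As a rough sketch of the remark above that word boundaries follow language
+switches rather than font switches (the word and the narrow \type {\hsize} are
+only illustrative choices, and the actual break points depend on the loaded
+patterns), one could try:
+
+\starttyping
+\hsize 1mm
+% one word set in two fonts: it should still be seen as a single word
+hyphen{\bf ation}\par
+\stoptyping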
+ +\stopsection + +\startsection[title={Applying ligatures and kerning}] + +\topicindex {ligatures} +\topicindex {kerning} + +After all possible hyphenation points have been inserted in the list, \LUATEX\ +will process the list to convert the \quote {character} nodes into \quote {glyph} +and \quote {ligature} nodes. This is actually done in two stages: first all +ligatures are processed, then all kerning information is applied to the result +list. But those two stages are somewhat dependent on each other: If the used font +makes it possible to do so, the ligaturing stage adds virtual \quote {character} +nodes to the word boundaries in the list. While doing so, it removes and +interprets \prm {noboundary} nodes. The kerning stage deletes those word +boundary items after it is done with them, and it does the same for \quote +{ghost} nodes. Finally, at the end of the kerning stage, all remaining \quote +{character} nodes are converted to \quote {glyph} nodes. + +This word separation is worth mentioning because, if you overrule from \LUA\ only +one of the two callbacks related to font handling, then you have to make sure you +perform the tasks normally done by \LUATEX\ itself in order to make sure that the +other, non|-|overruled, routine continues to function properly. + +Although we could improve the situation, the reality is that in modern \OPENTYPE\ +fonts ligatures can be constructed in many ways: by replacing a sequence of +characters by one glyph, or by selectively replacing individual glyphs, or by +kerning, or any combination of these. Add to that contextual analysis and it will +be clear that we have to let \LUA\ do that job instead. The generic font handler +that we provide (which is part of \CONTEXT) distinguishes between base mode +(which essentially is what we describe here and which delegates the task to \TEX) +and node mode (which deals with more complex fonts). + +Let's look at an example.
Take the word \type {office}, hyphenated \type +{of-fice}, using a \quote {normal} font with all the \type {f}-\type {f} and +\type {f}-\type {i} type ligatures: + +\starttabulate[|l|l|] +\NC initial \NC \type {{o}{f}{f}{i}{c}{e}} \NC\NR +\NC after hyphenation \NC \type {{o}{f}{{-},{},{}}{f}{i}{c}{e}} \NC\NR +\NC first ligature stage \NC \type {{o}{{f-},{f},{<ff>}}{i}{c}{e}} \NC\NR +\NC final result \NC \type {{o}{{f-},{<fi>},{<ffi>}}{c}{e}} \NC\NR +\stoptabulate + +That's bad enough, but let us assume that there is also a hyphenation point +between the \type {f} and the \type {i}, to create \type {of-f-ice}. Then the +final result should be: + +\starttyping +{o}{{f-}, + {{f-}, + {i}, + {<fi>}}, + {{<ff>-}, + {i}, + {<ffi>}}}{c}{e} +\stoptyping + +with discretionaries in the post-break text as well as in the replacement text of +the top-level discretionary that resulted from the first hyphenation point. + +Here is that nested solution again, in a different representation: + +\testpage[4] + +\starttabulate[|l|c|c|c|c|c|c|] +\DB \BC pre \BC \BC post \BC \BC replace \BC \NC \NR +\TB +\NC topdisc \NC \type {f-} \NC (1) \NC \NC sub 1 \NC \NC sub 2 \NC \NR +\NC sub 1 \NC \type {f-} \NC (2) \NC \type {i} \NC (3) \NC \type {<fi>} \NC (4) \NC \NR +\NC sub 2 \NC \type {<ff>-} \NC (5) \NC \type {i} \NC (6) \NC \type {<ffi>} \NC (7) \NC \NR +\LL +\stoptabulate + +When line breaking is choosing its breakpoints, the following fields will +eventually be selected: + +\starttabulate[|l|c|c|] +\NC \type {of-f-ice} \NC \type {f-} \NC (1) \NC \NR +\NC \NC \type {f-} \NC (2) \NC \NR +\NC \NC \type {i} \NC (3) \NC \NR +\NC \type {of-fice} \NC \type {f-} \NC (1) \NC \NR +\NC \NC \type {<fi>} \NC (4) \NC \NR +\NC \type {off-ice} \NC \type {<ff>-} \NC (5) \NC \NR +\NC \NC \type {i} \NC (6) \NC \NR +\NC \type {office} \NC \type {<ffi>} \NC (7) \NC \NR +\stoptabulate + +The current solution in \LUATEX\ is not able to handle nested discretionaries, +but it is in fact smart enough to handle 
this fictional \type {of-f-ice} example. +It does so by combining two sequential discretionary nodes as if they were a +single object (where the second discretionary node is treated as an extension of +the first node). + +One can observe that the \type {of-f-ice} and \type {off-ice} cases both end with +the same actual post replacement list (\type {i}), and that this would be the +case even if \type {i} was the first item of a potential following ligature like +\type {ic}. This allows \LUATEX\ to do away with one of the fields, and thus make +the whole stuff fit into just two discretionary nodes. + +The mapping of the seven list fields to the six fields in this discretionary node +pair is as follows: + +\starttabulate[|l|c|c|] +\DB field \BC description \NC \NC \NR +\TB +\NC \type {disc1.pre} \NC \type {f-} \NC (1) \NC \NR +\NC \type {disc1.post} \NC \type {<fi>} \NC (4) \NC \NR +\NC \type {disc1.replace} \NC \type {<ffi>} \NC (7) \NC \NR +\NC \type {disc2.pre} \NC \type {f-} \NC (2) \NC \NR +\NC \type {disc2.post} \NC \type {i} \NC (3,6) \NC \NR +\NC \type {disc2.replace} \NC \type {<ff>-} \NC (5) \NC \NR +\LL +\stoptabulate + +What is actually generated after ligaturing has been applied is therefore: + +\starttyping +{o}{{f-}, + {<fi>}, + {<ffi>}} + {{f-}, + {i}, + {<ff>-}}{c}{e} +\stoptyping + +The two discretionaries have different subtypes from a discretionary appearing on +its own: the first has subtype 4, and the second has subtype 5. The need for +these special subtypes stems from the fact that not all of the fields appear in +their \quote {normal} location. The second discretionary especially looks odd, +with things like the \type {<ff>-} appearing in \type {disc2.replace}. 
The fact +that some of the fields have different meanings (and different processing code +internally) is what makes it necessary to have different subtypes: this enables +\LUATEX\ to distinguish this sequence of two joined discretionary nodes from the +case of two standalone discretionaries appearing in a row. + +Of course there is still that relationship with fonts: ligatures can be implemented by +mapping a sequence of glyphs onto one glyph, but also by selective replacement and +kerning. This means that the above examples are just representing the traditional +approach. + +\stopsection + +\startsection[title={Breaking paragraphs into lines}] + +\topicindex {linebreaks} +\topicindex {paragraphs} +\topicindex {discretionaries} + +This code is almost unchanged, but because of the above|-|mentioned changes +with respect to discretionaries and ligatures, line breaking will potentially be +different from traditional \TEX. The actual line breaking code is still based on +the \TEX82 algorithms, and it does not expect there to be discretionaries inside +of discretionaries. But, as patterns evolve and font handling can influence +discretionaries, you need to be aware of the fact that long term consistency is not +an engine matter only. + +Discretionaries inside of discretionaries are, however, now fairly common in +\LUATEX, due to the changes to the ligaturing mechanism. Also, the \LUATEX\ +discretionary nodes are implemented slightly differently from the \TEX82 nodes: +the \type {no_break} text is now +embedded inside the disc node, where previously these nodes kept their place in +the horizontal list. In traditional \TEX\ the discretionary node contains a +counter indicating how many nodes to skip, but in \LUATEX\ we store the pre, post +and replace text in the discretionary node. + +The combined effect of these two differences is that \LUATEX\ does not always use +all of the potential breakpoints in a paragraph, especially when fonts with many +ligatures are used.
Of course kerning also complicates matters here. + +\stopsection + +\startsection[title={The \type {lang} library}][library=lang] + +\subsection {\type {new} and \type {id}} + +\topicindex {languages+library} + +\libindex {new} +\libindex {id} + +This library provides the interface to \LUATEX's structure representing a +language, and the associated functions. + +\startfunctioncall +<language> l = lang.new() +<language> l = lang.new(<number> id) +\stopfunctioncall + +This function creates a new userdata object. An object of type \type {<language>} +is the first argument to most of the other functions in the \type {lang} library. +These functions can also be used as if they were object methods, using the colon +syntax. Without an argument, the next available internal id number will be +assigned to this object. With argument, an object will be created that links to +the internal language with that id number. + +\startfunctioncall +<number> n = lang.id(<language> l) +\stopfunctioncall + +The number returned is the internal \prm {language} id number this object refers to. + +\subsection {\type {hyphenation}} + +\libindex {hyphenation} + +The hyphenation exceptions of a language can be queried or extended with: + +\startfunctioncall +<string> n = lang.hyphenation(<language> l) +lang.hyphenation(<language> l, <string> n) +\stopfunctioncall + +This either returns the current hyphenation exceptions for this language, or adds +new ones. The syntax of the string is explained in~\in {section} +[patternsexceptions]. + +\subsection {\type {clear_hyphenation} and \type {clean}} + +\libindex {clear_hyphenation} +\libindex {clean} + +\startfunctioncall +lang.clear_hyphenation(<language> l) +\stopfunctioncall + +This call clears the exception dictionary (string) for this language. + +\startfunctioncall +<string> n = lang.clean(<language> l, <string> o) +<string> n = lang.clean(<string> o) +\stopfunctioncall + +This function creates a hyphenation key from the supplied hyphenation value.
The +syntax of the argument string is explained in \in {section} [patternsexceptions]. +This function is useful if you want to do something else based on the words in a +dictionary file, like spell|-|checking. + +\subsection {\type {patterns} and \type {clear_patterns}} + +\libindex {patterns} +\libindex {clear_patterns} + +\startfunctioncall +<string> n = lang.patterns(<language> l) +lang.patterns(<language> l, <string> n) +\stopfunctioncall + +This adds additional patterns for this language object, or returns the current +set. The syntax of this string is explained in \in {section} +[patternsexceptions]. + +\startfunctioncall +lang.clear_patterns(<language> l) +\stopfunctioncall + +This can be used to clear the pattern dictionary for a language. + +\subsection {\type {hyphenationmin}} + +\libindex {hyphenationmin} + +This function sets (or gets) the value of the \TEX\ parameter +\type {\hyphenationmin}. + +\startfunctioncall +<number> n = lang.hyphenationmin(<language> l) +lang.hyphenationmin(<language> l, <number> n) +\stopfunctioncall + +\subsection {\type {[pre|post][ex|]hyphenchar}} + +\libindex {prehyphenchar} +\libindex {posthyphenchar} +\libindex {preexhyphenchar} +\libindex {postexhyphenchar} + +\startfunctioncall +<number> n = lang.prehyphenchar(<language> l) +lang.prehyphenchar(<language> l, <number> n) + +<number> n = lang.posthyphenchar(<language> l) +lang.posthyphenchar(<language> l, <number> n) +\stopfunctioncall + +These two are used to get or set the \quote {pre|-|break} and \quote +{post|-|break} hyphen characters for implicit hyphenation in this language. The +initial values are decimal 45 (hyphen) and decimal~0 (indicating emptiness).
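+
+A minimal sketch of how these accessors might be used from \LUA, assuming a
+fresh language object and a font that provides \UNICODE\ slot \type {0x2010}
+(both are assumptions made for the sake of the example):
+
+\starttyping
+local l = lang.new()           -- next available internal language id
+lang.prehyphenchar(l, 0x2010)  -- character used before an implicit break
+lang.posthyphenchar(l, 0)      -- 0 means: nothing after the break
+print(lang.prehyphenchar(l))   -- query the value stored for this language
+\stoptyping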
+ +\startfunctioncall +<number> n = lang.preexhyphenchar(<language> l) +lang.preexhyphenchar(<language> l, <number> n) + +<number> n = lang.postexhyphenchar(<language> l) +lang.postexhyphenchar(<language> l, <number> n) +\stopfunctioncall + +These get or set the \quote {pre|-|break} and \quote {post|-|break} hyphen +characters for explicit hyphenation in this language. Both are initially +decimal~0 (indicating emptiness). + +\subsection {\type {hyphenate}} + +\libindex {hyphenate} + +The next call inserts hyphenation points (discretionary nodes) in a node list. If +\type {tail} is given as argument, processing stops on that node. Currently, +\type {success} is always true if \type {head} (and \type {tail}, if specified) +are proper nodes, regardless of possible other errors. + +\startfunctioncall +<boolean> success = lang.hyphenate(<node> head) +<boolean> success = lang.hyphenate(<node> head, <node> tail) +\stopfunctioncall + +Hyphenation works only on \quote {characters}, that is, glyph nodes whose +subtype has the value \type {1}. Glyph nodes with +different subtypes are not processed. See \in {section} [charsandglyphs] for +more details. + +\subsection {\type {[set|get]hjcode}} + +\libindex {sethjcode} +\libindex {gethjcode} + +The following two commands can be used to set or query hj codes: + +\startfunctioncall +lang.sethjcode(<language> l, <number> char, <number> usedchar) +<number> usedchar = lang.gethjcode(<language> l, <number> char) +\stopfunctioncall + +When you set a hjcode the current sets get initialized unless the set was already +initialized due to \prm {savinghyphcodes} being larger than zero.
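+
+As an illustration, and under the assumption that one wants a curly apostrophe
+to be treated like the straight one during hyphenation (the code points are
+example choices, not something prescribed here), the calls could look like:
+
+\starttyping
+local l = lang.new()
+lang.sethjcode(l, 0x2019, 0x0027)  -- map U+2019 onto U+0027 for hyphenation
+local used = lang.gethjcode(l, 0x2019)
+\stoptyping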
+ +\stopsection + +\stopchapter + +\stopcomponent + +% \parindent0pt \hsize=1.1cm +% 12-34-56 \par +% 12-34-\hbox{56} \par +% 12-34-\vrule width 1em height 1.5ex \par +% 12-\hbox{34}-56 \par +% 12-\vrule width 1em height 1.5ex-56 \par +% \hjcode`\1=`\1 \hjcode`\2=`\2 \hjcode`\3=`\3 \hjcode`\4=`\4 \vskip.5cm +% 12-34-56 \par +% 12-34-\hbox{56} \par +% 12-34-\vrule width 1em height 1.5ex \par +% 12-\hbox{34}-56 \par +% 12-\vrule width 1em height 1.5ex-56 \par +