Diffstat (limited to 'doc/context/sources/general/manuals/luatex/luatex-languages.tex')
-rw-r--r-- doc/context/sources/general/manuals/luatex/luatex-languages.tex | 793
1 files changed, 470 insertions, 323 deletions
diff --git a/doc/context/sources/general/manuals/luatex/luatex-languages.tex b/doc/context/sources/general/manuals/luatex/luatex-languages.tex
index 365e87f26..d4a7bda60 100644
--- a/doc/context/sources/general/manuals/luatex/luatex-languages.tex
+++ b/doc/context/sources/general/manuals/luatex/luatex-languages.tex
@@ -1,19 +1,22 @@
% language=uk
\environment luatex-style
-\environment luatex-logos
\startcomponent luatex-languages
\startchapter[reference=languages,title={Languages, characters, fonts and glyphs}]
+\startsection[title={Introduction}]
+
+\topicindex {languages}
+
\LUATEX's internal handling of the characters and glyphs that eventually become
typeset is quite different from the way \TEX82 handles those same objects. The
easiest way to explain the difference is to focus on unrestricted horizontal mode
(i.e.\ paragraphs) and hyphenation first. Later on, it will be easy to deal
with the differences that occur in horizontal and math modes.
-In \TEX82, the characters you type are converted into \type {char_node} records
+In \TEX82, the characters you type are converted into \type {char} node records
when they are encountered by the main control loop. \TEX\ attaches and processes
the font information while creating those records, so that the resulting \quote
{horizontal list} contains the final forms of ligatures and implicit kerning.
@@ -21,7 +24,7 @@ This packaging is needed because we may want to get the effective width of for
instance a horizontal box.
When it becomes necessary to hyphenate words in a paragraph, \TEX\ converts (one
-word at time) the \type {char_node} records into a string by replacing ligatures
+word at a time) the \type {char} node records into a string by replacing ligatures
with their components and ignoring the kerning. Then it runs the hyphenation
algorithm on this string, and converts the hyphenated result back into a \quote
{horizontal list} that is consecutively spliced back into the paragraph stream.
@@ -29,14 +32,14 @@ Keep in mind that the paragraph may contain unboxed horizontal material, which
then already contains ligatures and kerns and the words therein are part of the
hyphenation process.
-Those \type {char_node} records are somewhat misnamed, as they are glyph
+Those \type {char} node records are somewhat misnamed, as they are glyph
positions in specific fonts, and therefore not really \quote {characters} in the
-linguistic sense. There is no language information inside the \type {char_node}
+linguistic sense. There is no language information inside the \type {char} node
records at all. Instead, language information is passed along using \type
-{language whatsit} records inside the horizontal list.
+{language whatsit} nodes inside the horizontal list.
In \LUATEX, the situation is quite different. The characters you type are always
-converted into \type {glyph_node} records with a special subtype to identify them
+converted into \nod {glyph} node records with a special subtype to identify them
as being intended as linguistic characters. \LUATEX\ stores the needed language
information in those records, but does not do any font|-|related processing at
the time of node creation. It only stores the index of the current font and a
@@ -48,93 +51,83 @@ font information in the whole list (creating ligatures and adjusting kerning),
and finally it adjusts all the subtype identifiers so that the records are \quote
{glyph nodes} from now on.
-\section[charsandglyphs]{Characters and glyphs}
+\stopsection
+
+\startsection[title={Characters, glyphs and discretionaries},reference=charsandglyphs]
+
+\topicindex {characters}
+\topicindex {glyphs}
+\topicindex {hyphenation}
-\TEX82 (including \PDFTEX) differentiates between \type {char_node}s and \type
-{lig_node}s. The former are simple items that contained nothing but a \quote
+\TEX82 (including \PDFTEX) differentiates between \type {char} nodes and \type
+{lig} nodes. The former are simple items that contained nothing but a \quote
{character} and a \quote {font} field, and they lived in the same memory as
tokens did. The latter also contained a list of components, and a subtype
indicating whether this ligature was the result of a word boundary, and it was
stored in the same place as other nodes like boxes and kerns and glues.
In \LUATEX, these two types are merged into one, somewhat larger structure called
-a \type {glyph_node}. Besides having the old character, font, and component
-fields, and the new special fields like \quote {attr} (see~\in {section}
-[glyphnodes]), these nodes also contain:
+a \nod {glyph} node. Besides the old character, font, and component fields
+there are a few more, like \quote {attr}, which we will see in \in {section}
+[glyphnodes]. These nodes also contain a subtype that encodes four main types
+and two additional ghost types. For ligatures, multiple bits can be set at the
+same time (in case of a single|-|glyph word).
\startitemize
-
-\startitem A subtype, split into four main types:
-
- \startitemize
- \startitem
- \type {character}, for characters to be hyphenated: the lowest bit
- (bit 0) is set to 1.
- \stopitem
- \startitem
- \type {glyph}, for specific font glyphs: the lowest bit (bit 0) is
- not set.
- \stopitem
- \startitem
- \type {ligature}, for ligatures (bit 1 is set)
- \stopitem
- \startitem
- \type {ghost}, for \quote {ghost objects} (bit 2 is set)
- \stopitem
- \stopitemize
-
- The latter two make further use of two extra fields (bits 3 and 4):
-
- \startitemize
- \startitem
- \type {left}, for ligatures created from a left word boundary and for
- ghosts created from \type {\leftghost}
- \stopitem
- \startitem
- \type {right}, for ligatures created from a right word boundary and
- for ghosts created from \type {\rightghost}
- \stopitem
- \stopitemize
-
- For ligatures, both bits can be set at the same time (in case of a
- single|-|glyph word).
-
-\stopitem
-
-\startitem
- \type {glyph_node}s of type \quote {character} also contain language data,
- split into four items that were current when the node was created: the
- \type {\setlanguage} (15 bits), \type {\lefthyphenmin} (8 bits), \type
- {\righthyphenmin} (8 bits), and \type {\uchyph} (1 bit).
-\stopitem
-
+ \startitem
+ \type {character}, for characters to be hyphenated: the lowest bit
+ (bit 0) is set to 1.
+ \stopitem
+ \startitem
+ \nod {glyph}, for specific font glyphs: the lowest bit (bit 0) is
+ not set.
+ \stopitem
+ \startitem
+ \type {ligature}, for constructed ligatures: bit 1 is set.
+ \stopitem
+ \startitem
+ \type {ghost}, for so called \quote {ghost objects}: bit 2 is set.
+ \stopitem
+ \startitem
+ \type {left}, for ligatures created from a left word boundary and for
+ ghosts created from \lpr {leftghost}: bit 3 is set.
+ \stopitem
+ \startitem
+ \type {right}, for ligatures created from a right word boundary and
+ for ghosts created from \lpr {rightghost}: bit 4 is set.
+ \stopitem
\stopitemize
+The \nod {glyph} nodes also contain language data, split into four items that
+were current when the node was created: the \prm {setlanguage} (15~bits), \prm
+{lefthyphenmin} (8~bits), \prm {righthyphenmin} (8~bits), and \prm {uchyph}
+(1~bit).
+
Incidentally, \LUATEX\ allows 16383 separate languages, and words can be 256
characters long. The language is stored with each character. You can set
-\type {\firstvalidlanguage} to for instance~1 and make thereby language~0
+\prm {firstvalidlanguage} to for instance~1 and thereby make language~0
an ignored hyphenation language.
-The new primitive \type {\hyphenationmin} can be used to signal the minimal length
-of a word. This value stored with the (current) language.
+The new primitive \lpr {hyphenationmin} can be used to signal the minimal length
+of a word. This value is stored with the (current) language.
-Because the \type {\uchyph} value is saved in the actual nodes, its handling is
-subtly different from \TEX82: changes to \type {\uchyph} become effective
+Because the \prm {uchyph} value is saved in the actual nodes, its handling is
+subtly different from \TEX82: changes to \prm {uchyph} become effective
immediately, not at the end of the current partial paragraph.
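+
+As a minimal sketch of the settings discussed above (the values are arbitrary
+and only meant as an illustration), one could write:
+
+\starttyping
+\firstvalidlanguage 1 % language 0 is now ignored by the hyphenator
+\hyphenationmin     5 % minimal word length, stored with the current language
+\uchyph 0 Capitalized % stored in the nodes of this word right away
+\uchyph 1 Capitalized % and this word gets the new value
+\stoptyping
+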
Typeset boxes now always have their language information embedded in the nodes
themselves, so there is no longer a possible dependency on the surrounding
-language settings. In \TEX82, a mid-paragraph statement like \type {\unhbox0} would
-process the box using the current paragraph language unless there was a
-\type {\setlanguage} issued inside the box. In \LUATEX, all language variables are
-already frozen.
+language settings. In \TEX82, a mid|-|paragraph statement like \type {\unhbox0}
+would process the box using the current paragraph language unless there was a
+\prm {setlanguage} issued inside the box. In \LUATEX, all language variables
+are already frozen.
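+
+The following sketch illustrates this. The language numbers 1 and 2 are
+hypothetical and we assume that patterns have been loaded for both:
+
+\starttyping
+\setbox0 \hbox {\language 2 text that keeps language 2}
+\language 1
+Text in language 1 \unhbox0 \space and more text in language 1.
+\stoptyping
+
+The unboxed material carries its own language data, so it is not affected by
+the language that is active in the surrounding paragraph.
+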
In traditional \TEX\ the process of hyphenation is driven by \type {lccode}s. In
\LUATEX\ we made this dependency less strong. There are several strategies
possible. When you do nothing, the currently used \type {lccode}s are used, when
loading patterns, setting exceptions or hyphenating a list.
-When you set \type {\savinghyphcodes} to a value larger than zero the current set
+When you set \prm {savinghyphcodes} to a value greater than zero the current set
of \type {lccode}s will be saved with the language. In that case changing a \type
{lccode} afterwards has no effect. However, you can adapt the set with:
@@ -144,96 +137,88 @@ of \type {lccode}s will be saved with the language. In that case changing a \typ
This change is global which makes sense if you keep in mind that the moment that
hyphenation happens is (normally) when the paragraph or a horizontal box is
-constructed. When \type {\savinghyphcodes} was zero when the language got
+constructed. When \prm {savinghyphcodes} was zero when the language got
initialized you start out with nothing, otherwise you already have a set.
-When a \type {\hjcode} is larger than $0$ but smaller than $32$ is indicates the
+When a \lpr {hjcode} is greater than 0 but less than 32 it indicates the
length to be used. In the following example we map a character (\type {x}) onto
another one in the patterns and tell the engine that \type {œ} counts as one
character. Because traditionally zero itself is reserved for inhibiting
-hyphenation, a value of $32$ counts as zero.
-
-\starttyping
-% assuming french patterns:
-foobar % foo-bar
-
-\hjcode`x=`o
-
-fxxbar % fxx-bar
-
-\lefthyphenmin3
-
-œdipus % œdi-pus
-
-\lefthyphenmin4
-
-œdipus % œdipus
-
-\hjcode`œ=2
-
-œdipus % œdi-pus
-
-\hjcode`i=32
-\hjcode`d=32
-
-œdipus % œdipus
-\stoptyping
+hyphenation, a value of 32 counts as zero.
+
+Here are some examples (we assume that French patterns are used):
+
+\starttabulate[||||]
+\NC \NC \type{foobar} \NC \type{foo-bar} \NC \NR
+\NC \type{\hjcode`x=`o} \NC \type{fxxbar} \NC \type{fxx-bar} \NC \NR
+\NC \type{\lefthyphenmin3} \NC \type{œdipus} \NC \type{œdi-pus} \NC \NR
+\NC \type{\lefthyphenmin4} \NC \type{œdipus} \NC \type{œdipus} \NC \NR
+\NC \type{\hjcode`œ=2} \NC \type{œdipus} \NC \type{œdi-pus} \NC \NR
+\NC \type{\hjcode`i=32 \hjcode`d=32} \NC \type{œdipus} \NC \type{œdipus} \NC \NR
+\NC
+\stoptabulate
Carrying all this information with each glyph would give too much overhead and
-also make the process of setting up thee codes more complex. A solution with
+also make the process of setting up these codes more complex. A solution with
\type {hjcode} sets was considered but rejected because in practice the current
approach is sufficient and it would not be compatible anyway.
Beware: the values are always saved in the format, independent of the setting
-of \type {\savinghyphcodes} at the moment the format is dumped.
+of \prm {savinghyphcodes} at the moment the format is dumped.
A boundary node normally would mark the end of a word which interferes with for
-instance discretionary injection. For this you can use the \type {\wordboundary}
-as trigger. Here are a few examples of usage:
+instance discretionary injection. For this you can use the \prm {wordboundary}
+as a trigger. Here are a few examples of usage:
\startbuffer
discrete---discrete
\stopbuffer
-\typebuffer \start \dontcomplain \hsize 1pt \getbuffer \par \stop
+\typebuffer \startnarrower \dontcomplain \hsize 1pt \getbuffer \par \stopnarrower
\startbuffer
discrete\discretionary{}{}{---}discrete
\stopbuffer
-\typebuffer \start \dontcomplain \hsize 1pt \getbuffer \par \stop
+\typebuffer \startnarrower \dontcomplain \hsize 1pt \getbuffer \par \stopnarrower
\startbuffer
discrete\wordboundary\discretionary{}{}{---}discrete
\stopbuffer
-\typebuffer \start \dontcomplain \hsize 1pt \getbuffer \par \stop
+\typebuffer \startnarrower \dontcomplain \hsize 1pt \getbuffer \par \stopnarrower
\startbuffer
discrete\wordboundary\discretionary{}{}{---}\wordboundary discrete
\stopbuffer
-\typebuffer \start \dontcomplain \hsize 1pt \getbuffer \par \stop
+\typebuffer \startnarrower \dontcomplain \hsize 1pt \getbuffer \par \stopnarrower
\startbuffer
discrete\wordboundary\discretionary{---}{}{}\wordboundary discrete
\stopbuffer
-\typebuffer \start \dontcomplain \hsize 1pt \getbuffer \par \stop
+\typebuffer \startnarrower \dontcomplain \hsize 1pt \getbuffer \par \stopnarrower
We only accept an explicit hyphen when there is a preceding glyph and we skip a
-sequence of explicit hyphens as that normally indicates a \type {--} or \type
+sequence of explicit hyphens since that normally indicates a \type {--} or \type
{---} ligature, in which case we can in the worst case get bad node lists
later on due to messed up ligature building as these dashes are ligatures in base
-fonts. This is a side effect of the separating the hyphenation, ligaturing and
+fonts. This is a side effect of separating the hyphenation, ligaturing and
kerning steps.
-The start and end of a characters is signalled by a glue, penalty, kern or boundary
-node. But by default also a hlist, vlist, rule, dir, whatsit, ins, and adjust node
-indicate a start or end. You can omit the last set from the test by setting
-\type {\hyphenationbounds} to a non|-|zero value:
+The start and end of a sequence of characters is signalled by a \nod {glue}, \nod
+{penalty}, \nod {kern} or \nod {boundary} node. But by default also a \nod
+{hlist}, \nod {vlist}, \nod {rule}, \nod {dir}, \nod {whatsit}, \nod {ins}, and
+\nod {adjust} node indicate a start or end. You can omit the last set from the
+test by setting \lpr {hyphenationbounds} to a non|-|zero value:
-\starttabulate[|l|l|]
+\starttabulate[|c|l|]
+\DB value \BC behaviour \NC \NR
+\TB
\NC \type{0} \NC not strict \NC \NR
\NC \type{1} \NC strict start \NC \NR
\NC \type{2} \NC strict end \NC \NR
\NC \type{3} \NC strict start and strict end \NC \NR
+\LL
\stoptabulate
The word start is determined as follows:
\starttabulate[|l|l|]
+\DB node \BC behaviour \NC \NR
+\TB
\BC boundary \NC yes when wordboundary \NC \NR
\BC hlist \NC when hyphenationbounds 1 or 3 \NC \NR
\BC vlist \NC when hyphenationbounds 1 or 3 \NC \NR
@@ -244,11 +229,14 @@ The word start is determined as follows:
\BC math \NC skipped \NC \NR
\BC glyph \NC exhyphenchar (one only) : yes (so no -- ---) \NC \NR
\BC otherwise \NC yes \NC \NR
+\LL
\stoptabulate
The word end is determined as follows:
\starttabulate[|l|l|]
+\DB node \BC behaviour \NC \NR
+\TB
\BC boundary \NC yes \NC \NR
\BC glyph \NC yes when different language \NC \NR
\BC glue \NC yes \NC \NR
@@ -261,10 +249,11 @@ The word end is determined as follows:
\BC whatsit \NC when hyphenationbounds 2 or 3 \NC \NR
\BC ins \NC when hyphenationbounds 2 or 3 \NC \NR
\BC adjust \NC when hyphenationbounds 2 or 3 \NC \NR
+\LL
\stoptabulate
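+
+A quick way to compare the settings is a narrow box, as in the following
+sketch. It is not the test macro used for the figures; the box setup is only
+an assumption made for the sake of a short example:
+
+\starttyping
+\hyphenation {o-n-e t-w-o}
+\lefthyphenmin 1 \righthyphenmin 1
+\hyphenationbounds 0 \vbox {\hsize 1pt \noindent one\null two\par}
+\hyphenationbounds 3 \vbox {\hsize 1pt \noindent one\null two\par}
+\stoptyping
+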
-\in{Figures}[hb:1] upto \in[hb:5] show some examples. In all cases we set the min
-values to 1 and make sure that the words hyphenate at each character.
+\in {Figures} [hb:1] up to \in [hb:5] show some examples. In all cases we set the
+min values to 1 and make sure that the words hyphenate at each character.
\hyphenation{o-n-e t-w-o}
@@ -283,12 +272,13 @@ values to 1 and make sure that the words hyphenate at each character.
\startplacefigure[reference=hb:1,title={\type{one}}]
\startcombination[4*1]
- {\SomeTest{0}{one}} {\type{0}}
- {\SomeTest{1}{one}} {\type{1}}
- {\SomeTest{2}{one}} {\type{2}}
- {\SomeTest{3}{one}} {\type{3}}
+ {\SomeTest{0}{one}} {\type{0}}
+ {\SomeTest{1}{one}} {\type{1}}
+ {\SomeTest{2}{one}} {\type{2}}
+ {\SomeTest{3}{one}} {\type{3}}
\stopcombination
\stopplacefigure
+
\startplacefigure[reference=hb:2,title={\type{one\null two}}]
\startcombination[4*1]
{\SomeTest{0}{one\null two}} {\type{0}}
@@ -297,6 +287,7 @@ values to 1 and make sure that the words hyphenate at each character.
{\SomeTest{3}{one\null two}} {\type{3}}
\stopcombination
\stopplacefigure
+
\startplacefigure[reference=hb:3,title={\type{\null one\null two}}]
\startcombination[4*1]
{\SomeTest{0}{\null one\null two}} {\type{0}}
@@ -305,6 +296,7 @@ values to 1 and make sure that the words hyphenate at each character.
{\SomeTest{3}{\null one\null two}} {\type{3}}
\stopcombination
\stopplacefigure
+
\startplacefigure[reference=hb:4,title={\type{one\null two\null}}]
\startcombination[4*1]
{\SomeTest{0}{one\null two\null}} {\type{0}}
@@ -313,6 +305,7 @@ values to 1 and make sure that the words hyphenate at each character.
{\SomeTest{3}{one\null two\null}} {\type{3}}
\stopcombination
\stopplacefigure
+
\startplacefigure[reference=hb:5,title={\type{\null one\null two\null}}]
\startcombination[4*1]
{\SomeTest{0}{\null one\null two\null}} {\type{0}}
@@ -330,7 +323,7 @@ deal differently with (a sequence of) explicit hyphens. We already have added
some control over aspects of the hyphenation and yet another one concerns
automatic hyphens (e.g.\ \type {-} characters in the input).
-When \type {\automatichyphenmode} has a value of 0, a hyphen will be turned into
+When \lpr {automatichyphenmode} has a value of 0, a hyphen will be turned into
an automatic discretionary. The snippets before and after it will not be
hyphenated. A side effect is that a leading hyphen can lead to a split but one
will seldom run into that situation. Setting a pre and post character makes this
@@ -359,12 +352,6 @@ before--after \par
before---after \par
\stopbuffer
-We show three samples:
-
-Input A: \typebuffer[a]
-Input B: \typebuffer[b]
-Input C: \typebuffer[c]
-
\startbuffer[demo]
\startcombination[nx=4,ny=3,location=top]
{\framed[align=normal,strut=no,top=\vskip.5ex,bottom=\vskip.5ex]{\automatichyphenmode\zerocount \hsize6em \getbuffer[a]}} {A~0~6em}
@@ -382,104 +369,139 @@ Input C: \typebuffer[c]
\stopcombination
\stopbuffer
-\startplacefigure[reference=automatic:1,title={The automatic modes \type {0} (default), \type {1} and \type {2}, with a \type {\hsize}
+\startplacefigure[reference=automatichyphenmode:1,title={The automatic modes \type {0} (default), \type {1} and \type {2}, with a \prm {hsize}
of 6em and 2pt (which triggers a linebreak).}]
\dontcomplain \tt \getbuffer[demo]
\stopplacefigure
-\startplacefigure[reference=automatic:2,title={The automatic modes \type {0} (default), \type {1} and \type {2}, with \type
-{\preexhyphenchar} and \type {\postexhyphenchar} set to characters \type {A} and \type {B}.}]
+\startplacefigure[reference=automatichyphenmode:2,title={The automatic modes \type {0} (default), \type {1} and \type {2}, with \lpr {preexhyphenchar} and \lpr {postexhyphenchar} set to characters \type {A} and \type {B}.}]
\postexhyphenchar`A\relax
\preexhyphenchar `B\relax
\dontcomplain \tt \getbuffer[demo]
\stopplacefigure
-As with primitive companions of other single character commands, the \type {\-}
-command has a more verbose primitive version in \type {\explicitdiscretionary}
+In \in {figure} [automatichyphenmode:1] \in {and} [automatichyphenmode:2] we show
+what happens with three samples:
+
+Input A: \typebuffer[a]
+Input B: \typebuffer[b]
+Input C: \typebuffer[c]
+
+As with primitive companions of other single character commands, the \prm {-}
+command has a more verbose primitive version in \lpr {explicitdiscretionary}
and the \type {-} character that is normally intercepted in the hyphenator (or whatever
-is configured) is available as \type {\automaticdiscretionary}.
+is configured) is available as \lpr {automaticdiscretionary}.
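+
+So, going by the description above, the lines in each of the following pairs
+should be equivalent (a sketch, not a tested example):
+
+\starttyping
+hyphen\-ation                       % the short form
+hyphen\explicitdiscretionary ation  % the verbose primitive
+
+before-after                        % a hyphen character in the input
+before\automaticdiscretionary after % the same, spelled out
+\stoptyping
+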
-\section{The main control loop}
+\stopsection
+
+\startsection[title={The main control loop}]
+
+\topicindex {main loop}
+\topicindex {hyphenation}
In \LUATEX's main loop, almost all input characters that are to be typeset are
-converted into \type {glyph} node records with subtype \quote {character}, but
+converted into \nod {glyph} node records with subtype \quote {character}, but
there are a few exceptions.
-First, the \type {\accent} primitives creates nodes with subtype \quote {glyph}
-instead of \quote {character}: one for the actual accent and one for the
-accentee. The primary reason for this is that \type {\accent} in \TEX82 is
-explicitly dependent on the current font encoding, so it would not make much
-sense to attach a new meaning to the primitive's name, as that would invalidate
-many old documents and macro packages. \footnote {Of course, modern packages will
-not use the \type {\accent} primitive at all but try to map directly on composed
-characters.} A secondary reason is that in \TEX82, \type {\accent} prohibits
-hyphenation of the current word. Since in \LUATEX\ hyphenation only takes place
-on \quote {character} nodes, it is possible to achieve the same effect.
-
-This change of meaning did happen with \type {\char}, that now generates \quote
-{glyph} nodes with a character subtype. In traditional \TEX\ there was a strong
-relationship between the 8|-|bit input encoding, hyphenation and glyphs taken
-from a font. In \LUATEX\ we have \UTF\ input, and in most cases this maps
-directly to a character in a font, apart from glyph replacement in the font
-engine. If you want to access arbitrary glyphs in a font directly you can always
-use \LUA\ to do so, because fonts are available as \LUA\ table.
-
-Second, all the results of processing in math mode eventually become nodes with
-\quote {glyph} subtypes.
-
-Third, the \ALEPH|-|derived commands \type {\leftghost} and \type {\rightghost}
-create nodes of a third subtype: \quote {ghost}. These nodes are ignored
-completely by all further processing until the stage where inter|-|glyph kerning
-is added.
-
-Fourth, automatic discretionaries are handled differently. \TEX82 inserts an
-empty discretionary after sensing an input character that matches the \type
-{\hyphenchar} in the current font. This test is wrong in our opinion: whether or
-not hyphenation takes place should not depend on the current font, it is a
-language property. \footnote {When \TEX\ showed up we didn't have \UNICODE\ yet
-and being limited to eight bits meant that one sometimes had to compromise
-between supporting character input, glyph rendering, hyphenation.}
-
-In \LUATEX, it works like this: if \LUATEX\ senses a string of input characters
-that matches the value of the new integer parameter \type {\exhyphenchar}, it will
-insert an explicit discretionary after that series of nodes. Initex sets the \type
-{\exhyphenchar=`\-}. Incidentally, this is a global parameter instead of a
-language-specific one because it may be useful to change the value depending on
-the document structure instead of the text language.
-
-The insertion of discretionaries after a sequence of explicit hyphens happens at
-the same time as the other hyphenation processing, {\it not\/} inside the main
-control loop.
-
-The only use \LUATEX\ has for \type {\hyphenchar} is at the check whether a word
-should be considered for hyphenation at all. If the \type {\hyphenchar} of the
-font attached to the first character node in a word is negative, then hyphenation
-of that word is abandoned immediately. This behaviour is added for backward
-compatibility only, and the use of \type {\hyphenchar=-1} as a means of
-preventing hyphenation should not be used in new \LUATEX\ documents.
-
-Fifth, \type {\setlanguage} no longer creates whatsits. The meaning of \type
-{\setlanguage} is changed so that it is now an integer parameter like all others.
-That integer parameter is used in \type {\glyph_node} creation to add language
-information to the glyph nodes. In conjunction, the \type {\language} primitive is
-extended so that it always also updates the value of \type {\setlanguage}.
-
-Sixth, the \type {\noboundary} command (that prohibits word boundary processing
-where that would normally take place) now does create nodes. These nodes are
-needed because the exact place of the \type {\noboundary} command in the input
-stream has to be retained until after the ligature and font processing stages.
-
-Finally, there is no longer a \type {main_loop} label in the code. Remember that
-\TEX82 did quite a lot of processing while adding \type {char_nodes} to the
-horizontal list? For speed reasons, it handled that processing code outside of
-the \quote {main control} loop, and only the first character of any \quote {word}
-was handled by that \quote {main control} loop. In \LUATEX, there is no longer a
-need for that (all hard work is done later), and the (now very small) bits of
-character|-|handling code have been moved back inline. When \type
-{\tracingcommands} is on, this is visible because the full word is reported,
-instead of just the initial character.
-
-Because we tend to make hard codes behaviour configurable a few new primitives
+\startitemize[n]
+
+\startitem
+ The \prm {accent} primitive creates nodes with subtype \quote {glyph}
+ instead of \quote {character}: one for the actual accent and one for the
+ accentee. The primary reason for this is that \prm {accent} in \TEX82 is
+ explicitly dependent on the current font encoding, so it would not make much
+ sense to attach a new meaning to the primitive's name, as that would
+ invalidate many old documents and macro packages. A secondary reason is that
+ in \TEX82, \prm {accent} prohibits hyphenation of the current word. Since
+ in \LUATEX\ hyphenation only takes place on \quote {character} nodes, it is
+ possible to achieve the same effect. Of course, modern \UNICODE\ aware macro
+ packages will not use the \prm {accent} primitive at all but try to map
+ directly on composed characters.
+
+ This change of meaning did happen with \prm {char}, which now generates
+ \quote {glyph} nodes with a character subtype. In traditional \TEX\ there was
+ a strong relationship between the 8|-|bit input encoding, hyphenation and
+ glyphs taken from a font. In \LUATEX\ we have \UTF\ input, and in most cases
+ this maps directly to a character in a font, apart from glyph replacement in
+ the font engine. If you want to access arbitrary glyphs in a font directly
+ you can always use \LUA\ to do so, because fonts are available as \LUA\
+ tables.
+\stopitem
+
+\startitem
+ All the results of processing in math mode eventually become nodes with
+ \quote {glyph} subtypes. In fact, the result of processing math is just
+ a regular list of glyphs, kerns, glue, penalties, boxes etc.
+\stopitem
+
+\startitem
+ The \ALEPH|-|derived commands \lpr {leftghost} and \lpr {rightghost}
+ create nodes of a third subtype: \quote {ghost}. These nodes are ignored
+ completely by all further processing until the stage where inter|-|glyph
+ kerning is added.
+\stopitem
+
+\startitem
+ Automatic discretionaries are handled differently. \TEX82 inserts an empty
+ discretionary after sensing an input character that matches the \prm
+ {hyphenchar} in the current font. This test is wrong in our opinion: whether
+ or not hyphenation takes place should not depend on the current font, it is a
+ language property. \footnote {When \TEX\ showed up we didn't have \UNICODE\
+ yet and being limited to eight bits meant that one sometimes had to
+ compromise between supporting character input, glyph rendering and hyphenation.}
+
+ In \LUATEX, it works like this: if \LUATEX\ senses a string of input
+ characters that matches the value of the new integer parameter \prm
+ {exhyphenchar}, it will insert an explicit discretionary after that series of
+ nodes. Initially \TEX\ sets the \type {\exhyphenchar=`\-}. Incidentally, this
+ is a global parameter instead of a language|-|specific one because it may be
+ useful to change the value depending on the document structure instead of the
+ text language.
+
+ The insertion of discretionaries after a sequence of explicit hyphens happens
+ at the same time as the other hyphenation processing, {\it not\/} inside the
+ main control loop.
+
+ The only use \LUATEX\ has for \prm {hyphenchar} is in the check whether a
+ word should be considered for hyphenation at all. If the \prm {hyphenchar}
+ of the font attached to the first character node in a word is negative, then
+ hyphenation of that word is abandoned immediately. This behaviour is added
+ for backward compatibility only, and \type {\hyphenchar=-1} should not be
+ used as a means of preventing hyphenation in new \LUATEX\ documents.
+\stopitem
+
+\startitem
+ The \prm {setlanguage} command no longer creates whatsits. The meaning of
+ \prm {setlanguage} is changed so that it is now an integer parameter like all
+ others. That integer parameter is used in \nod {glyph} node creation to add
+ language information to the glyph nodes. In conjunction, the \prm {language}
+ primitive is extended so that it always also updates the value of \prm
+ {setlanguage}.
+\stopitem
+
+\startitem
+ The \prm {noboundary} command (that prohibits word boundary processing
+ where that would normally take place) now does create nodes. These nodes are
+ needed because the exact place of the \prm {noboundary} command in the
+ input stream has to be retained until after the ligature and font processing
+ stages.
+\stopitem
+
+\startitem
+ There is no longer a \type {main_loop} label in the code. Remember that
+ \TEX82 did quite a lot of processing while adding \type {char} nodes to the
+ horizontal list? For speed reasons, it handled that processing code outside
+ of the \quote {main control} loop, and only the first character of any \quote
+ {word} was handled by that \quote {main control} loop. In \LUATEX, there is
+ no longer a need for that (all hard work is done later), and the (now very
+ small) bits of character|-|handling code have been moved back inline. When
+ \prm {tracingcommands} is on, this is visible because the full word is
+ reported, instead of just the initial character.
+\stopitem
+
+\stopitemize
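+
+A few of the parameters mentioned in this list can be set as in the following
+sketch. The slash and the language number are hypothetical choices, made only
+for the sake of the example:
+
+\starttyping
+\exhyphenchar `\-     % the initial value
+\exhyphenchar `\/     % assumption: a slash now acts as explicit hyphen
+\hyphenchar\font = -1 % legacy way to disable hyphenation, avoid in new documents
+\language 2           % hypothetical number, this also updates \setlanguage
+\stoptyping
+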
+
+Because we tend to make hard coded behaviour configurable, a few new primitives
have been added:
\starttyping
@@ -489,50 +511,59 @@ have been added:
\stoptyping
The first parameter has the following consequences for automatic discs (the ones
-resulting from an \type {\exhyphenchar}:
+resulting from an \prm {exhyphenchar}):
\starttabulate[|c|l|l|]
-\BC mode \BC automatic disc \type{-} \BC explicit disc \type{\-} \NC \NR
-\HL
-\NC \type{0} \NC \type {\exhyphenpenalty} \NC \type {\exhyphenpenalty} \NC \NR
-\NC \type{1} \NC \type {\hyphenpenalty} \NC \type {\hyphenpenalty} \NC \NR
-\NC \type{2} \NC \type {\exhyphenpenalty} \NC \type {\hyphenpenalty} \NC \NR
-\NC \type{3} \NC \type {\hyphenpenalty} \NC \type {\exhyphenpenalty} \NC \NR
-\NC \type{4} \NC \type {\automatichyphenpenalty} \NC \type {\explicithyphenpenalty} \NC \NR
-\NC \type{5} \NC \type {\exhyphenpenalty} \NC \type {\explicithyphenpenalty} \NC \NR
-\NC \type{6} \NC \type {\hyphenpenalty} \NC \type {\explicithyphenpenalty} \NC \NR
-\NC \type{7} \NC \type {\automatichyphenpenalty} \NC \type {\exhyphenpenalty} \NC \NR
-\NC \type{8} \NC \type {\automatichyphenpenalty} \NC \type {\hyphenpenalty} \NC \NR
+\DB mode \BC automatic disc \type {-} \BC explicit disc \prm{-} \NC \NR
+\TB
+\NC \type{0} \NC \prm {exhyphenpenalty} \NC \prm {exhyphenpenalty} \NC \NR
+\NC \type{1} \NC \prm {hyphenpenalty} \NC \prm {hyphenpenalty} \NC \NR
+\NC \type{2} \NC \prm {exhyphenpenalty} \NC \prm {hyphenpenalty} \NC \NR
+\NC \type{3} \NC \prm {hyphenpenalty} \NC \prm {exhyphenpenalty} \NC \NR
+\NC \type{4} \NC \lpr {automatichyphenpenalty} \NC \lpr {explicithyphenpenalty} \NC \NR
+\NC \type{5} \NC \prm {exhyphenpenalty} \NC \lpr {explicithyphenpenalty} \NC \NR
+\NC \type{6} \NC \prm {hyphenpenalty} \NC \lpr {explicithyphenpenalty} \NC \NR
+\NC \type{7} \NC \lpr {automatichyphenpenalty} \NC \prm {exhyphenpenalty} \NC \NR
+\NC \type{8} \NC \lpr {automatichyphenpenalty} \NC \prm {hyphenpenalty} \NC \NR
+\LL
\stoptabulate
-other values do what we always did in \LUATEX: insert \type {\exhyphenpenalty}.
+Other values do what we always did in \LUATEX: insert \prm {exhyphenpenalty}.
-\section[patternsexceptions]{Loading patterns and exceptions}
+\stopsection
-The hyphenation algorithm in \LUATEX\ is quite different from the one in \TEX82,
-although it uses essentially the same user input.
+\startsection[title={Loading patterns and exceptions},reference=patternsexceptions]
-After expansion, the argument for \type {\patterns} has to be proper \UTF8 with
-individual patterns separated by spaces, no \type {\char} or \type {\chardef}d
-commands are allowed. The current implementation quite strict and will reject all
-non|-|\UNICODE\ characters.
+\topicindex {hyphenation}
+\topicindex {hyphenation+patterns}
+\topicindex {hyphenation+exceptions}
+\topicindex {patterns}
+\topicindex {exceptions}
-Likewise, the expanded argument for \type {\hyphenation} also has to be proper
-\UTF8, but here a bit of extra syntax is provided:
+Although we keep the traditional approach towards hyphenation (which is still
+superior) the implementation of the hyphenation algorithm in \LUATEX\ is quite
+different from the one in \TEX82.
+
+After expansion, the argument for \prm {patterns} has to be proper \UTF8 with
+individual patterns separated by spaces; no \prm {char} or \prm {chardef}d
+commands are allowed. The current implementation is quite strict and will reject
+all non|-|\UNICODE\ characters. Likewise, the expanded argument for \prm
+{hyphenation} also has to be proper \UTF8, but here a bit of extra syntax is
+provided:
\startitemize[n]
\startitem
- Three sets of arguments in curly braces (\type {{}{}{}}) indicates a desired
- complex discretionary, with arguments as in \type {\discretionary}'s command in
+ Three sets of arguments in curly braces (\type {{}{}{}}) indicate a desired
+ complex discretionary, with arguments as in the \prm {discretionary} command in
normal document input.
\stopitem
\startitem
- A \type {-} indicates a desired simple discretionary, cf.\ \type {\-} and \type
- {\discretionary{-}{}{}} in normal document input.
+ A \type {-} indicates a desired simple discretionary, cf.\ \type {\-} and
+ \type {\discretionary{-}{}{}} in normal document input.
\stopitem
\startitem
- Internal command names are ignored. This rule is provided especially for \type
- {\discretionary}, but it also helps to deal with \type {\relax} commands that
+ Internal command names are ignored. This rule is provided especially for \prm
+ {discretionary}, but it also helps to deal with \prm {relax} commands that
may sneak in.
\stopitem
\startitem
@@ -540,22 +571,24 @@ Likewise, the expanded argument for \type {\hyphenation} also has to be proper
\stopitem
\stopitemize
-The expanded argument is first converted back to a space-separated string while
+The expanded argument is first converted back to a space|-|separated string while
dropping the internal command names. This string is then converted into a
dictionary by a routine that creates key|-|value pairs by converting the other
listed items. It is important to note that the keys in an exception dictionary
can always be generated from the values. Here are a few examples:
\starttabulate[|l|l|l|]
-\BC value \BC implied key (input) \NC effect \NC\NR
+\DB value \BC implied key (input) \BC effect \NC\NR
+\TB
\NC \type {ta-ble} \NC table \NC \type {ta\-ble} ($=$ \type {ta\discretionary{-}{}{}ble}) \NC\NR
\NC \type {ba{k-}{}{c}ken} \NC backen \NC \type {ba\discretionary{k-}{}{c}ken} \NC\NR
+\LL
\stoptabulate
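+
+As a rough illustration of the relation between values and keys, the following
+\LUA\ sketch loads the two values from the table into a fresh language object and
+asks for their keys. The variable name is made up for this example, and the
+printed keys are expected, not guaranteed, to match the middle column of the
+table.
+
+\starttyping
+local l = lang.new()                          -- a new language object
+lang.hyphenation(l, "ta-ble ba{k-}{}{c}ken")  -- add the two exceptions
+print(lang.clean(l, "ta-ble"))                -- key derived from the first value
+print(lang.clean(l, "ba{k-}{}{c}ken"))        -- key derived from the second value
+\stoptyping
+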
The resultant patterns and exception dictionary will be stored under the language
-code that is the present value of \type {\language}.
+code that is the present value of \prm {language}.
-In the last line of the table, you see there is no \type {\discretionary} command
+In the last line of the table, you see there is no \prm {discretionary} command
in the value: the command is optional in the \TEX-based input syntax. The
underlying reason for that is that it is conceivable that a whole dictionary of
words is stored as a plain text file and loaded into \LUATEX\ using one of the
@@ -573,20 +606,64 @@ actual explicit hyphen character if needed). For example, this matches the word
\hyphenation{multi{-}{}{-}word{-}{}{-}boun-daries}
\stoptyping
-The motivation behind the \ETEX\ extension \type {\savinghyphcodes} was that
+The motivation behind the \ETEX\ extension \prm {savinghyphcodes} was that
hyphenation heavily depended on font encodings. This is no longer true in
\LUATEX, and the corresponding primitive is basically ignored. Because we now
-have \type {hjcode}, the case relate codes can be used exclusively for \type
-{\uppercase} and \type {\lowercase}.
-
-\section{Applying hyphenation}
+have \lpr {hjcode}, the case related codes can be used exclusively for \prm
+{uppercase} and \prm {lowercase}.
+
+The three curly brace pair pattern in an exception can be somewhat unexpected so
+we will try to explain it by example. The pattern \type {foo{}{}{x}bar}
+creates a lookup \type {fooxbar} and the pattern \type {foo{}{}{}bar} creates
+\type {foobar}. Then, when a hit happens there is a replacement text (\type {x})
+or none. Because we introduced penalties in discretionary nodes, the exception
+syntax now also can take a penalty specification. The value between square brackets
+is a multiplier for \lpr {exceptionpenalty}. Here we have set it to 10000 so
+effectively we get 30000 in the example.
+
+\def\ShowSample#1#2%
+ {\startlinecorrection[blank]
+ \hyphenation{#1}%
+ \exceptionpenalty=10000
+ \bTABLE[foregroundstyle=type]
+ \bTR
+ \bTD[align=middle,nx=4] \type{#1} \eTD
+ \eTR
+ \bTR
+ \bTD[align=middle] \type{10em} \eTD
+ \bTD[align=middle] \type {3em} \eTD
+ \bTD[align=middle] \type {0em} \eTD
+ \bTD[align=middle] \type {6em} \eTD
+ \eTR
+ \bTR
+ \bTD[width=10em]\vtop{\hsize 10em 123 #2 123\par}\eTD
+ \bTD[width=10em]\vtop{\hsize 3em 123 #2 123\par}\eTD
+ \bTD[width=10em]\vtop{\hsize 0em 123 #2 123\par}\eTD
+ \bTD[width=10em]\vtop{\setupalign[verytolerant,stretch]\rmtf\hsize 6em 123 #2 #2 #2 #2 123\par}\eTD
+ \eTR
+ \eTABLE
+ \stoplinecorrection}
+
+\ShowSample{x{a-}{-b}{}x{a-}{-b}{}x{a-}{-b}{}x{a-}{-b}{}xx}{xxxxxx}
+\ShowSample{x{a-}{-b}{}x{a-}{-b}{}[3]x{a-}{-b}{}[1]x{a-}{-b}{}xx}{xxxxxx}
+
+\ShowSample{z{a-}{-b}{z}{a-}{-b}{z}{a-}{-b}{z}{a-}{-b}{z}z}{zzzzzz}
+\ShowSample{z{a-}{-b}{z}{a-}{-b}{z}[3]{a-}{-b}{z}[1]{a-}{-b}{z}z}{zzzzzz}
+
+\stopsection
+
+\startsection[title={Applying hyphenation}]
+
+\topicindex {hyphenation+how it works}
+\topicindex {hyphenation+discretionaries}
+\topicindex {discretionaries}
The internal structures \LUATEX\ uses for the insertion of discretionaries in
words are very different from the ones in \TEX82, and that means there are some
noticeable differences in handling as well.
First and foremost, there is no \quote {compressed trie} involved in hyphenation.
-The algorithm still reads \PATGEN-generated pattern files, but \LUATEX\ uses a
+The algorithm still reads pattern files generated by \PATGEN, but \LUATEX\ uses a
finite state hash to match the patterns against the word to be hyphenated. This
algorithm is based on the \quote {libhnj} library used by \OPENOFFICE, which in
turn is inspired by \TEX.
@@ -605,12 +682,12 @@ of the implementation:
\stopitem
\startitem
Because there is no \quote {trie preparation} stage, language patterns never
- become frozen. This means that the primitive \type {\patterns} (and its \LUA\
+ become frozen. This means that the primitive \prm {patterns} (and its \LUA\
counterpart \type {lang.patterns}) can be used at any time, not only in
ini\TEX.
\stopitem
\startitem
- Only the string representation of \type {\patterns} and \type {\hyphenation} is
+ Only the string representation of \prm {patterns} and \prm {hyphenation} is
stored in the format file. At format load time, they are simply
re|-|evaluated. It follows that there is no real reason to preload languages
in the format file. In fact, it is usually not a good idea to do so. It is
@@ -618,23 +695,27 @@ of the implementation:
needed.
\stopitem
\startitem
- \LUATEX\ uses the language-specific variables \type {\prehyphenchar} and \type
- {\posthyphenchar} in the creation of implicit discretionaries, instead of
- \TEX82's \type {\hyphenchar}, and the values of the language|-|specific variables
- \type {\preexhyphenchar} and \type {\postexhyphenchar} for explicit
+ \LUATEX\ uses the language-specific variables \lpr {prehyphenchar} and \lpr
+ {posthyphenchar} in the creation of implicit discretionaries, instead of
+ \TEX82's \prm {hyphenchar}, and the values of the language|-|specific
+ variables \lpr {preexhyphenchar} and \lpr {postexhyphenchar} for explicit
discretionaries (instead of \TEX82's empty discretionary).
\stopitem
\startitem
- The value of the two counters related to hyphenation, \type {\hyphenpenalty}
- and \type {\exhyphenpenalty}, are now stored in the discretionary nodes. This
- permits a local overload for explicit \type {\discretionary} commands. The
+ The value of the two counters related to hyphenation, \prm {hyphenpenalty}
+ and \prm {exhyphenpenalty}, are now stored in the discretionary nodes. This
+ permits a local overload for explicit \prm {discretionary} commands. The
value current when the hyphenation pass is applied is used. When no callbacks
are used this is compatible with traditional \TEX. When you apply the \LUA\
\type {lang.hyphenate} function the current values are used.
\stopitem
+\startitem
+ The hyphenation exception dictionary is maintained as a key|-|value hash, and
+ that is also dynamic, so the \type {hyph_size} setting is not used either.
+\stopitem
\stopitemize
-Because we store penalties in the disc node the \type {\discretionary} command has
+Because we store penalties in the disc node the \prm {discretionary} command has
been extended to accept an optional penalty specification, so you can do the
following:
@@ -657,18 +738,18 @@ inserted at the left-hand side of a word).
Word boundaries are no longer implied by font switches, but by language switches.
One word can have two separate fonts and still be hyphenated correctly (but it
-can not have two different languages, the \type {\setlanguage} command forces a
+can not have two different languages, the \prm {setlanguage} command forces a
word boundary).
All languages start out with \type {\prehyphenchar=`\-}, \type {\posthyphenchar=0},
\type {\preexhyphenchar=0} and \type {\postexhyphenchar=0}. When you assign the
values of one of these four parameters, you are actually changing the settings
-for the current \type {\language}, this behaviour is compatible with \type {\patterns}
-and \type {\hyphenation}.
+for the current \prm {language}, this behaviour is compatible with \prm {patterns}
+and \prm {hyphenation}.
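+
+A minimal sketch of such assignments, assuming that language number~1 has been
+set up elsewhere (the number is only illustrative). The values shown simply
+restate the defaults mentioned above:
+
+\starttyping
+\language=1            % select the language whose settings we change
+\prehyphenchar=`\-     % implicit discretionary: hyphen before the break
+\posthyphenchar=0      % implicit discretionary: nothing after the break
+\preexhyphenchar=0     % explicit discretionary: nothing before the break
+\postexhyphenchar=0    % explicit discretionary: nothing after the break
+\stoptyping
+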
\LUATEX\ also hyphenates the first word in a paragraph. Words can be up to 256
-characters long (up from 64 in \TEX82). Longer words generate an error right now,
-but eventually either the limitation will be removed or perhaps it will become
+characters long (up from 64 in \TEX82). Longer words are ignored right now, but
+eventually either the limitation will be removed or perhaps it will become
possible to silently ignore the excess characters (this is what happens in
\TEX82, but there the behaviour cannot be controlled).
@@ -677,10 +758,12 @@ that this function expects to receive a list of \quote {character} nodes. It wil
not operate properly in the presence of \quote {glyph}, \quote {ligature}, or
\quote {ghost} nodes, nor does it know how to deal with kerning.
-The hyphenation exception dictionary is maintained as key|-|value hash, and that
-is also dynamic, so the \type {hyph_size} setting is not used either.
+\stopsection
-\section{Applying ligatures and kerning}
+\startsection[title={Applying ligatures and kerning}]
+
+\topicindex {ligatures}
+\topicindex {kerning}
After all possible hyphenation points have been inserted in the list, \LUATEX\
will process the list to convert the \quote {character} nodes into \quote {glyph}
@@ -689,24 +772,28 @@ ligatures are processed, then all kerning information is applied to the result
list. But those two stages are somewhat dependent on each other: If the used font
makes it possible to do so, the ligaturing stage adds virtual \quote {character}
nodes to the word boundaries in the list. While doing so, it removes and
-interprets \type {\noboundary} nodes. The kerning stage deletes those word
+interprets \prm {noboundary} nodes. The kerning stage deletes those word
boundary items after it is done with them, and it does the same for \quote
{ghost} nodes. Finally, at the end of the kerning stage, all remaining \quote
{character} nodes are converted to \quote {glyph} nodes.
-This work separation is worth mentioning because, if you overrule from \LUA\ only
+This word separation is worth mentioning because, if you overrule from \LUA\ only
one of the two callbacks related to font handling, then you have to make sure you
perform the tasks normally done by \LUATEX\ itself in order to make sure that the
other, non|-|overruled, routine continues to function properly.
-Work in this area is not yet complete, but most of the possible cases are handled
-by our rewritten ligaturing engine. At some point all of the possible inputs will
-become supported. \footnote {Not all of this makes sense because we nowadays have
-\OPENTYPE\ fonts and ligature building can happen in ,any different ways there.}
+Although we could improve the situation, the reality is that in modern \OPENTYPE\
+fonts ligatures can be constructed in many ways: by replacing a sequence of
+characters by one glyph, or by selectively replacing individual glyphs, or by
+kerning, or any combination of these. Add to that contextual analysis and it will
+be clear that we have to let \LUA\ do that job instead. The generic font handler
+that we provide (which is part of \CONTEXT) distinguishes between base mode
+(which essentially is what we describe here and which delegates the task to \TEX)
+and node mode (which deals with more complex fonts).
-For example, take the word \type {office}, hyphenated \type {of-fice}, using a
-\quote {normal} font with all the \type {f}-\type {f} and \type {f}-\type {i}
-type ligatures:
+Let's look at an example. Take the word \type {office}, hyphenated \type
+{of-fice}, using a \quote {normal} font with all the \type {f}-\type {f} and
+\type {f}-\type {i} type ligatures:
\starttabulate[|l|l|]
\NC initial \NC \type {{o}{f}{f}{i}{c}{e}} \NC\NR
@@ -734,11 +821,15 @@ the top-level discretionary that resulted from the first hyphenation point.
Here is that nested solution again, in a different representation:
+\testpage[4]
+
\starttabulate[|l|c|c|c|c|c|c|]
-\NC \BC pre \BC \BC post \BC \BC replace \BC \NC \NR
+\DB \BC pre \BC \BC post \BC \BC replace \BC \NC \NR
+\TB
\NC topdisc \NC \type {f-} \NC (1) \NC \NC sub 1 \NC \NC sub 2 \NC \NR
\NC sub 1 \NC \type {f-} \NC (2) \NC \type {i} \NC (3) \NC \type {<fi>} \NC (4) \NC \NR
\NC sub 2 \NC \type {<ff>-} \NC (5) \NC \type {i} \NC (6) \NC \type {<ffi>} \NC (7) \NC \NR
+\LL
\stoptabulate
When line breaking is choosing its breakpoints, the following fields will
@@ -763,21 +854,23 @@ the first node).
One can observe that the \type {of-f-ice} and \type {off-ice} cases both end with
the same actual post replacement list (\type {i}), and that this would be the
-case even if that \type {i} was the first item of a potential following ligature
-like \type {ic}. This allows \LUATEX\ to do away with one of the fields, and thus
-make the whole stuff fit into just two discretionary nodes.
+case even if \type {i} was the first item of a potential following ligature like
+\type {ic}. This allows \LUATEX\ to do away with one of the fields, and thus make
+the whole stuff fit into just two discretionary nodes.
The mapping of the seven list fields to the six fields in this discretionary node
pair is as follows:
\starttabulate[|l|c|c|]
-\BC field \BC description \NC \NC \NR
+\DB field \BC description \NC \NC \NR
+\TB
\NC \type {disc1.pre} \NC \type {f-} \NC (1) \NC \NR
\NC \type {disc1.post} \NC \type {<fi>} \NC (4) \NC \NR
\NC \type {disc1.replace} \NC \type {<ffi>} \NC (7) \NC \NR
\NC \type {disc2.pre} \NC \type {f-} \NC (2) \NC \NR
\NC \type {disc2.post} \NC \type {i} \NC (3,6) \NC \NR
\NC \type {disc2.replace} \NC \type {<ff>-} \NC (5) \NC \NR
+\LL
\stoptabulate
What is actually generated after ligaturing has been applied is therefore:
@@ -806,13 +899,21 @@ mapping a sequence of glyphs onto one glyph, but also by selective replacement a
kerning. This means that the above examples are just representing the traditional
approach.
-\section{Breaking paragraphs into lines}
+\stopsection
+
+\startsection[title={Breaking paragraphs into lines}]
-This code is still almost unchanged, but because of the above|-|mentioned changes
+\topicindex {linebreaks}
+\topicindex {paragraphs}
+\topicindex {discretionaries}
+
+This code is almost unchanged, but because of the above|-|mentioned changes
with respect to discretionaries and ligatures, line breaking will potentially be
different from traditional \TEX. The actual line breaking code is still based on
the \TEX82 algorithms, and it does not expect there to be discretionaries inside
-of discretionaries.
+of discretionaries. But, as patterns evolve and font handling can influence
+discretionaries, you need to be aware of the fact that long term consistency is not
+an engine matter only.
But that situation is now fairly common in \LUATEX, due to the changes to the
ligaturing mechanism. And also, the \LUATEX\ discretionary nodes are implemented
@@ -826,10 +927,19 @@ The combined effect of these two differences is that \LUATEX\ does not always us
all of the potential breakpoints in a paragraph, especially when fonts with many
ligatures are used. Of course kerning also complicates matters here.
-\section{The \type {lang} library}
+\stopsection
+
+\startsection[title={The \type {lang} library}][library=lang]
+
+\subsection {\type {new} and \type {id}}
-This library provides the interface to \LUATEX's structure
-representing a language, and the associated functions.
+\topicindex {languages+library}
+
+\libindex {new}
+\libindex {id}
+
+This library provides the interface to \LUATEX's structure representing a
+language, and the associated functions.
\startfunctioncall
<language> l = lang.new()
@@ -837,106 +947,141 @@ representing a language, and the associated functions.
\stopfunctioncall
This function creates a new userdata object. An object of type \type {<language>}
-is the first argument to most of the other functions in the \type {lang}
-library. These functions can also be used as if they were object methods, using
-the colon syntax.
-
-Without an argument, the next available internal id number will be assigned to
-this object. With argument, an object will be created that links to the internal
-language with that id number.
+is the first argument to most of the other functions in the \type {lang} library.
+These functions can also be used as if they were object methods, using the colon
+syntax. Without an argument, the next available internal id number will be
+assigned to this object. With argument, an object will be created that links to
+the internal language with that id number.
\startfunctioncall
<number> n = lang.id(<language> l)
\stopfunctioncall
-returns the internal \type {\language} id number this object refers to.
+The number returned is the internal \prm {language} id number this object refers to.
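+
+The following sketch shows both ways of calling \type {new} together with \type
+{id}. The variable names are hypothetical and the comments describe what the text
+above leads us to expect:
+
+\starttyping
+local a = lang.new()     -- takes the next available internal id number
+local n = lang.id(a)     -- the id number that was assigned
+local b = lang.new(n)    -- a second object linked to the same internal language
+print(n, b:id())         -- colon syntax; both numbers should be the same
+\stoptyping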
+
+\subsection {\type {hyphenation}}
+
+\libindex {hyphenation}
+
+You can query or extend the hyphenation exceptions of a language with:
\startfunctioncall
<string> n = lang.hyphenation(<language> l)
lang.hyphenation(<language> l, <string> n)
\stopfunctioncall
-Either returns the current hyphenation exceptions for this language, or adds new
-ones. The syntax of the string is explained in~\in {section}
+\subsection {\type {clear_hyphenation} and \type {clean}}
+
+\libindex {clear_hyphenation}
+\libindex {clean}
+
+This either returns the current hyphenation exceptions for this language, or adds
+new ones. The syntax of the string is explained in~\in {section}
[patternsexceptions].
\startfunctioncall
lang.clear_hyphenation(<language> l)
\stopfunctioncall
-Clears the exception dictionary (string) for this language.
+This call clears the exception dictionary (string) for this language.
\startfunctioncall
<string> n = lang.clean(<language> l, <string> o)
<string> n = lang.clean(<string> o)
\stopfunctioncall
-Creates a hyphenation key from the supplied hyphenation value. The syntax of the
-argument string is explained in~\in {section} [patternsexceptions]. This function
-is useful if you want to do something else based on the words in a dictionary
-file, like spell|-|checking.
+This function creates a hyphenation key from the supplied hyphenation value. The
+syntax of the argument string is explained in \in {section} [patternsexceptions].
+This function is useful if you want to do something else based on the words in a
+dictionary file, like spell|-|checking.
+
+\subsection {\type {patterns} and \type {clear_patterns}}
+
+\libindex {patterns}
+\libindex {clear_patterns}
\startfunctioncall
<string> n = lang.patterns(<language> l)
lang.patterns(<language> l, <string> n)
\stopfunctioncall
-Adds additional patterns for this language object, or returns the current set.
-The syntax of this string is explained in~\in {section} [patternsexceptions].
+This adds additional patterns for this language object, or returns the current
+set. The syntax of this string is explained in \in {section}
+[patternsexceptions].
\startfunctioncall
lang.clear_patterns(<language> l)
\stopfunctioncall
-Clears the pattern dictionary for this language.
+This can be used to clear the pattern dictionary for a language.
+
+\subsection {\type {hyphenationmin}}
+
+\libindex {hyphenationmin}
+
+This function sets (or gets) the value of the \TEX\ parameter
+\type {\hyphenationmin}.
\startfunctioncall
-<number> n = lang.prehyphenchar(<language> l)
-lang.prehyphenchar(<language> l, <number> n)
+n = lang.hyphenationmin(<language> l)
+lang.hyphenationmin(<language> l, <number> n)
\stopfunctioncall
-Gets or sets the \quote {pre|-|break} hyphen character for implicit hyphenation
-in this language (initially the hyphen, decimal 45).
+\subsection {\type {[pre|post][ex|]hyphenchar}}
+
+\libindex {prehyphenchar}
+\libindex {posthyphenchar}
+\libindex {preexhyphenchar}
+\libindex {postexhyphenchar}
\startfunctioncall
+<number> n = lang.prehyphenchar(<language> l)
+lang.prehyphenchar(<language> l, <number> n)
+
<number> n = lang.posthyphenchar(<language> l)
lang.posthyphenchar(<language> l, <number> n)
\stopfunctioncall
-Gets or sets the \quote {post|-|break} hyphen character for implicit hyphenation
-in this language (initially null, decimal~0, indicating emptiness).
+These two are used to get or set the \quote {pre|-|break} and \quote
+{post|-|break} hyphen characters for implicit hyphenation in this language. The
+initial values are decimal 45 (hyphen) and decimal~0 (indicating emptiness).
\startfunctioncall
<number> n = lang.preexhyphenchar(<language> l)
lang.preexhyphenchar(<language> l, <number> n)
-\stopfunctioncall
-Gets or sets the \quote {pre|-|break} hyphen character for explicit hyphenation
-in this language (initially null, decimal~0, indicating emptiness).
-
-\startfunctioncall
<number> n = lang.postexhyphenchar(<language> l)
lang.postexhyphenchar(<language> l, <number> n)
\stopfunctioncall
-Gets or sets the \quote {post|-|break} hyphen character for explicit hyphenation
-in this language (initially null, decimal~0, indicating emptiness).
+These get or set the \quote {pre|-|break} and \quote {post|-|break} hyphen
+characters for explicit hyphenation in this language. Both are initially
+decimal~0 (indicating emptiness).
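+
+As a small sketch, assuming a freshly created language object, the four
+characters can be inspected and changed as follows. The choice of \type {0x2010}
+(the \UNICODE\ hyphen) is only an example and presumes that the font in use
+provides that character:
+
+\starttyping
+local l = lang.new()
+print(lang.prehyphenchar(l))      -- the text above suggests 45 initially
+print(lang.posthyphenchar(l))     -- and 0 here
+lang.prehyphenchar(l, 0x2010)     -- use U+2010 for implicit hyphenation
+lang.preexhyphenchar(l, 0x2010)   -- and also for explicit hyphenation
+\stoptyping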
+
+\subsection {\type {hyphenate}}
+
+\libindex {hyphenate}
+
+The next call inserts hyphenation points (discretionary nodes) in a node list. If
+\type {tail} is given as argument, processing stops on that node. Currently,
+\type {success} is always true if \type {head} (and \type {tail}, if specified)
+are proper nodes, regardless of possible other errors.
\startfunctioncall
<boolean> success = lang.hyphenate(<node> head)
<boolean> success = lang.hyphenate(<node> head, <node> tail)
\stopfunctioncall
-Inserts hyphenation points (discretionary nodes) in a node list. If \type {tail}
-is given as argument, processing stops on that node. Currently, \type {success}
-is always true if \type {head} (and \type {tail}, if specified) are proper nodes,
-regardless of possible other errors.
-
Hyphenation works only on \quote {characters}, a special subtype of all the glyph
nodes with the node subtype having the value \type {1}. Glyph nodes with
-different subtypes are not processed. See \in {section~} [charsandglyphs] for
+different subtypes are not processed. See \in {section} [charsandglyphs] for
more details.
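+
+One place where this function is typically called is a callback that takes over
+the hyphenation pass. The sketch below assumes the callback is named \type
+{hyphenate} and receives the head and tail of the list; neither the name nor the
+signature is described in this section, so check the callback documentation
+before relying on it.
+
+\starttyping
+callback.register("hyphenate", function(head, tail)
+    -- do the work that the engine would otherwise do itself
+    local success = lang.hyphenate(head, tail)
+    if not success then
+        texio.write_nl("lang.hyphenate was given improper nodes")
+    end
+end)
+\stoptyping
+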
+\subsection {\type {[set|get]hjcode}}
+
+\libindex {sethjcode}
+\libindex {gethjcode}
+
The following two commands can be used to set or query hj codes:
\startfunctioncall
@@ -945,7 +1090,9 @@ lang.sethjcode(<language> l, <number> char, <number> usedchar)
\stopfunctioncall
When you set a hjcode the current sets get initialized unless the set was already
-initialized due to \type {\savinghyphcodes} being larger than zero.
+initialized due to \prm {savinghyphcodes} being larger than zero.
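+
+A short sketch, assuming that \type {gethjcode} takes a language object and a
+character number, and that mapping the right single quote onto the apostrophe is
+what you want (both are assumptions made for the sake of the example):
+
+\starttyping
+local l = lang.new()
+lang.sethjcode(l, 0x2019, 0x0027)   -- treat U+2019 like an apostrophe
+print(lang.gethjcode(l, 0x2019))    -- query the code that was just set
+\stoptyping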
+
+\stopsection
\stopchapter