% language=us runpath=texruns:manuals/evenmore

\environment evenmore-style

\startcomponent evenmore-hyphenation

\usebodyfont[pagella]

\startchapter[title=Hyphenation]

\startsection[title={Introduction}]

Hyphenation is driven by character codes. In a traditional \TEX\ engine such a
code accesses a glyph in a font, which is why the font encoding mattered, but in
\LUATEX\ we use \UNICODE\ when hyphenation is applied. \footnote {In
\CONTEXT\ \MKII\ we also use \UTF\ patterns, which made it possible to ship
patterns that didn't depend on a font encoding. Mojca and Arthur made \UTF\ the
default when the (upgraded) hyphenation pattern project started.} Later, the
character codes are adapted by the font handler where they become glyphs. There
are moments when you don't want to hyphenate and a cheap trick is to switch to a
language that has no hyphenation patterns. But in a system like \CONTEXT\ that
doesn't work well because we have lots of language bound properties. Therefore in
\MKIV\ we set the left and right hyphen minima to extreme values, something that
blocks hyphenation quite well. But this is not a pretty solution at all. Even
worse, normal (\type {\discretionary}), automatic (\type {-}) and explicit
(\type {\-}) discretionaries still kick in.

For that reason in \LMTX\ we have a mode variable that controls hyphenation. In
\LUATEX\ we have primitives like \type {\compoundhyphenmode}, \type
{\hyphenationbounds} and \type {\hyphenpenaltymode} that control how
hyphenation and discretionary injection are handled, but when the more generic
\type {\hyphenationmode} parameter was introduced in \LUAMETATEX\ these
precursors were all merged into it. One can argue that this is a form of
regression but there are good reasons, most notably the fact that we keep these
properties with glyph nodes so that we have better control over them in grouped
situations: because some operations happen only when the paragraph as a whole
gets treated, local overloads would otherwise be lost. \footnote {Of course it
also is a wink to those who complain that we add primitives to an otherwise
leaner variant of \LUATEX, but let us not elaborate on that misunderstanding.}
It anyway means that in \LMTX\ we have to set different parameters but that is
no big deal because users are supposed to use the more high level interfaces;
instead of setting parameters to values one flips bits in \type
{\hyphenationmode}, which in the end makes more sense and also permits
extensions later without adding much overhead.

Currently this mode parameter controls the following options:

\starttabulate[|Tr|||]
\NC \uchexnumber{\normalhyphenationcode}           \NC \type{\normalhyphenationcode}           \NC honour the (normal) \type{\discretionary} primitive \NC \NR
\NC \uchexnumber{\automatichyphenationcode}        \NC \type{\automatichyphenationcode}        \NC turn \type {-} into (automatic) discretionaries \NC \NR
\NC \uchexnumber{\explicithyphenationcode}         \NC \type{\explicithyphenationcode}         \NC turn \type {\-} into (explicit) discretionaries \NC \NR
\NC \uchexnumber{\syllablehyphenationcode}         \NC \type{\syllablehyphenationcode}         \NC hyphenate (syllable) according to language \NC \NR
\NC \uchexnumber{\uppercasehyphenationcode}        \NC \type{\uppercasehyphenationcode}        \NC hyphenate uppercase characters too \NC \NR
\NC \uchexnumber{\compoundhyphenationcode}         \NC \type{\compoundhyphenationcode}         \NC permit break at an explicit hyphen (border cases) \NC \NR
\NC \uchexnumber{\strictstarthyphenationcode}      \NC \type{\strictstarthyphenationcode}      \NC traditional \TEX\ compatibility wrt the start of a word \NC \NR
\NC \uchexnumber{\strictendhyphenationcode}        \NC \type{\strictendhyphenationcode}        \NC traditional \TEX\ compatibility wrt the end of a word \NC \NR
\NC \uchexnumber{\automaticpenaltyhyphenationcode} \NC \type{\automaticpenaltyhyphenationcode} \NC use \type {\automatichyphenpenalty} \NC \NR
\NC \uchexnumber{\explicitpenaltyhyphenationcode}  \NC \type{\explicitpenaltyhyphenationcode}  \NC use \type {\explicithyphenpenalty} \NC \NR
\NC \uchexnumber{\permitgluehyphenationcode}       \NC \type{\permitgluehyphenationcode}       \NC turn glue in discretionaries into kerns \NC \NR
\stoptabulate

The default \CONTEXT\ setup is:

\starttyping
\hyphenationmode \numexpr
    \normalhyphenationcode
  + \automatichyphenationcode
  + \explicithyphenationcode
  + \syllablehyphenationcode
  + \uppercasehyphenationcode
  + \compoundhyphenationcode
  % \strictstarthyphenationcode
  % \strictendhyphenationcode
  + \automaticpenaltyhyphenationcode
  + \explicitpenaltyhyphenationcode
  + \permitgluehyphenationcode
\relax
\stoptyping

When a discretionary node is created (triggered by \type {\discretionary}) the
current value is used. Injected glyph nodes on the other hand will store the
current value and use that when it is needed for hyphenating the list.
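
As an illustration of working with these bits, here is a minimal sketch that
builds a mode from the codes in the table above so that only \type
{\discretionary} and \type {\-} are honoured, while language driven (syllable)
hyphenation and the automatic hyphen are left out. The selection of bits, the
grouping and the sample words are our own choice for this example and not a
\CONTEXT\ default.

\starttyping
\begingroup
    % assumption: only these two options are wanted, all other bits are off
    \hyphenationmode \numexpr
        \normalhyphenationcode
      + \explicithyphenationcode
    \relax
    nederlands test\-test test-test
\endgroup
\stoptyping

Because the sum is built from scratch it does not depend on which bits were set
before, which makes it a bit more robust than adding or subtracting a single code.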

\stopsection

\startsection[title={Controlling hyphenation}]

We start with an example that has some Dutch words:

\startbuffer[sample]
NEDERLANDS\par Nederlands\par nederlands\par
\CONTEXT  \par test\-test\par test-test \par
\stopbuffer

\typebuffer[sample]

\startbuffer[result]
\startlinecorrection
\dontleavehmode \dorecurse{\boxlines\scratchboxone} {%
   \setbox\scratchbox\boxline\scratchboxone#1%
   \ruledhpack{\strut\unhbox\scratchbox}%
   \kern.25\emwidth
}
\stoplinecorrection
\stopbuffer

When we typeset this with a \type {\hsize} of 2mm we get:

\setbox\scratchboxone\vbox{\dontcomplain \nl \hsize 2mm \getbuffer[sample]}

\getbuffer[result]

But when we block hyphenation with \type {\nohyphens} we see:

\setbox\scratchboxone\vbox{\dontcomplain \nl \hsize 2mm \nohyphens \getbuffer[sample]}

\getbuffer[result]

The \MKIV\ behavior can be emulated by setting the mode as follows:

\startbuffer[demo]
\bitwiseflip \hyphenationmode \syllablehyphenationcode
\stopbuffer

\setbox\scratchboxone\vbox{\dontcomplain \nl \hsize 2mm \getbuffer[demo] \getbuffer[sample]}

\getbuffer[result]

This time the three non|-|syllable variants get hyphenated and that is not what
we want. In this case there is a \type {\discretionary} in the definition of the
macro that generates \CONTEXT\ and, apart from the fact that we might not even
want to hyphenate logos, we have to block it when we apply \type {\nohyphens}.

These mode settings are applied directly to the three non|-|syllable variants but
delayed for the syllable discretionaries: hyphenation happens later, so there the
state becomes a property of the glyph nodes. Doing the same for the other
discretionaries would demand an adaptation of various pieces of the engine code,
and plugged in user (\LUA) code would also have to consider it, which makes no
sense.

\startbuffer[sample]
\nohyphens nederlands {\dohyphens nederlands} nederlands\par
\stopbuffer

\typebuffer[sample]

\setbox\scratchboxone\vbox{\dontcomplain \nl \hsize 2mm \getbuffer[sample]}
\getbuffer[result]

Compare this with:

\startbuffer[sample]
nederlands {\nohyphens nederlands} nederlands\par
\stopbuffer

\typebuffer[sample]

\setbox\scratchboxone\vbox{\dontcomplain \nl \hsize 2mm \getbuffer[sample]}
\getbuffer[result]

\stopsection

\startsection[title={Compound hyphenation}]

Yet another discretionary related issue is with compound words, that is: cases
where \type {\discretionary} commands sit between words. There are of course
tricks to deal with it like adding a huge penalty combined with a zero skip. This
is okay in a traditional \TEX\ engine but in an opened up one you might not want
this. Just to mention one aspect: when processing \OPENTYPE\ fonts you actually
need to look into discretionaries in order to deal with glyphs that interact. And
you don't want to deal with penalties and skips unless they have an explicit
meaning. We show the four possibilities:

\startbuffer[sample]
nederlands\discretionary           {!}{!}{!}nederlands\blank
\stopbuffer

\typebuffer[sample]

\setbox\scratchboxone\vbox{\dontcomplain \nl \hsize 2mm \getbuffer[sample]}
\getbuffer[result]

\startbuffer[sample]
nederlands\discretionary options 1 {!}{!}{!}nederlands\blank
\stopbuffer

\typebuffer[sample]

\setbox\scratchboxone\vbox{\dontcomplain \nl \hsize 2mm \getbuffer[sample]}
\getbuffer[result]

\startbuffer[sample]
nederlands\discretionary options 2 {!}{!}{!}nederlands\blank
\stopbuffer

\typebuffer[sample]

\setbox\scratchboxone\vbox{\dontcomplain \nl \hsize 2mm \getbuffer[sample]}
\getbuffer[result]

\startbuffer[sample]
nederlands\discretionary options 3 {!}{!}{!}nederlands\blank
\stopbuffer

\typebuffer[sample]

\setbox\scratchboxone\vbox{\dontcomplain \nl \hsize 2mm \getbuffer[sample]}
\getbuffer[result]

Here is an example of such an interference. Of course in practice this happens
seldom and certainly not with ligatures. Some fonts have kerning between certain
glyphs and for instance dashes and there it could matter.

\startbuffer
ef%
\penalty \plustenthousand
\hskip   \zeropoint
\discretionary{-}{f}{f}%
\penalty \plustenthousand
\hskip   \zeropoint
e
ef\discretionary options 3 {-}{f}{f}e
\stopbuffer

\typebuffer

As you can see, we only get the ligature when we set the options. While
processing \OPENTYPE\ features it can happen that one actually loses a
discretionary, although we try to prevent this when possible.

\startlinecorrection
\scale[height=2cm]{\setupbodyfont[pagella]\showglyphs\getbuffer}
\stoplinecorrection

But, as said, the fact that we don't need the penalties and glue helps at the
\LUA\ end: the cleaner the node list, the better.

\stopsection

\startsection[title={Tracing}]

The already present tracker command has been extended to handle the options:

\startbuffer[sample0]
\enabletrackers[discretionaries]
\stopbuffer
\startbuffer[sample1]
test\discretionary {]} {[} {[]}test
\stopbuffer
\startbuffer[sample2]
testing\discretionary {]} {[} {[]}testing
\stopbuffer
\startbuffer[sample3]
testing\discretionary options 3 {]} {[} {[]}testing
\stopbuffer

\typebuffer[sample0,sample1,sample2,sample3]

\setbox\scratchboxone\vbox{\dontcomplain            \getbuffer[sample0,sample1]} \getbuffer[result]
\setbox\scratchboxone\vbox{\dontcomplain \hsize 2mm \getbuffer[sample0,sample2]} \getbuffer[result]
\setbox\scratchboxone\vbox{\dontcomplain \hsize 2mm \getbuffer[sample0,sample3]} \getbuffer[result]

\stopsection

\startsection[title={Glue in discretionaries}]

In case you cannot predict what goes into a discretionary you can run into an
error message with respect to unsupported glue in a disc node. The mode value
\number\permitgluehyphenationcode\space makes glue acceptable and turns it into
a kern, as demonstrated here:

\startbuffer
{\hsize 1mm \darkblue \discretionary{potential conspiracy}{prophets}{influencers}\par}
\stopbuffer

\typebuffer

The line break occurs but the space in the pre part is of course frozen:

{\getbuffer}

As usual \TEX\ users will come up with applications.

\stopsection

\startsection[title={Penalties}]

By default the par builder will use the value of \type {\hyphenpenalty} that gets
stored in the discretionary node. However, when the \type {\discretionary} is
followed by a \type {penalty} keyword and a number, that number will be used
instead.
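
A minimal sketch of the keyword syntax described above; the word and the value
5000 are arbitrary choices made for this illustration:

\starttyping
% the break uses the \hyphenpenalty that gets stored in the node
com\discretionary {-}{}{}pound

% the break uses the explicitly given penalty
com\discretionary penalty 5000 {-}{}{}pound
\stoptyping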

\stopsection

\startsection[title=Exceptions]

At some point a user on the \CONTEXT\ mailing list wondered how to deal with a case
like this:

\startbuffer[example]
\switchtobodyfont[pagella]\mainlanguage[de]auffasse
\stopbuffer

\typebuffer[example]

\startlinecorrection
\scale[height=2cm]{\inlinebuffer[example]}
\stoplinecorrection

\startbuffer
\startexceptions[de]
au{f-}{-f}{ff}(f\zwnj f)asse
\stopexceptions
\stopbuffer

In \LUAMETATEX\ you can block the unwanted ligature using this trick:

\typebuffer \getbuffer

\startlinecorrection
\scale[height=2cm]{\inlinebuffer[example]}
\stoplinecorrection

The exception mechanism in \LUATEX\ and therefore \LUAMETATEX\ works as follows.
When we have this exception:

\starttyping
au{f-}{-f}{ff}asse
\stoptyping

the engine will register that exception under \type {auffasse}, that is: the
replacement part determines the word. When it runs into that word, it will create
a so called discretionary node with a pre, post and replace part. It only uses
the \type {ff} for the lookup, though, and keeps the original two glyphs: these
become the replacement text. In \LUAMETATEX\ you can however add an alternative
replacement:

\startbuffer
\startexceptions[de]
au{f-}{-f}{ff}(st)asse
\stopexceptions
\stopbuffer

\typebuffer \getbuffer

This time the replacement text becomes \type {st}. So we get \type {austasse} and
it is that sequence that is seen by the font handler when it applies its tricks.
What comes out depends on the font, however:

\startbuffer[example]
\switchtobodyfont[pagella]\mainlanguage[de]auffasse
\stopbuffer

\startlinecorrection
\scale[height=2cm]{\showglyphs\showfontkerns\inlinebuffer[example]}
\stoplinecorrection

In the Pagella font that we use here, a kern is added between the \type {s} and
the \type {t}. If you don't want that you can say this:

\startbuffer
\startexceptions[de]
au{f-}{-f}{ff}(s\zwnj t)asse
\stopexceptions
\stopbuffer

\typebuffer \getbuffer

\startlinecorrection
\scale[height=2cm]{\showglyphs\showfontkerns\inlinebuffer[example]}
\stoplinecorrection

A \type {zwj} will block a ligature (some fonts have an \type {st} ligature) and a
\type {zwnj} blocks ligatures as well as kerns.

You can actually abuse this mechanism for trickery like this:

\startbuffer
\startexceptions[nl]
wis-kun-d{e-}{o}{eo}(e-o)n-der-wijs
\stopexceptions
\stopbuffer

\typebuffer \getbuffer

The Dutch word \type {wiskundeonderwijs} is found as exception and comes out like
this:

\startbuffer[example]
\switchtobodyfont[pagella]\mainlanguage[nl]wiskundeonderwijs
\stopbuffer

\startlinecorrection
\scale[height=1cm]{\showglyphs\showfontkerns\inlinebuffer[example]}
\stoplinecorrection

Watch the hyphen that makes the compound word more visible! The other hyphens in
the exception are proper hyphenation points and when a break happens there a
hyphen is automatically added. The \type {\nokerning} and \type {\noligaturing}
macros can be used grouped:

\startbuffer[example]
{every}\quad
{\nokerning every}\quad
{\noligaturing every}\quad
{e{\nokerning v}ery}\quad
{e{\glyphoptions\noleftkernglyphoptioncode  v}ery}\quad
{e{\glyphoptions\norightkernglyphoptioncode v}ery}\quad
\stopbuffer

\typebuffer[example]

There are several low level control options. In addition to those shown here we
have a pair for ligatures: \typ {\noleftligatureglyphoptioncode} and \typ
{\norightligatureglyphoptioncode}.

\startlinecorrection[blank]
\scale[width=\textwidth]{\showglyphs\showfontkerns\inlinebuffer[example]}
\stoplinecorrection
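
The ligature related pair can presumably be applied in the same grouped way as
the kerning options in the example above. The following is only a sketch: the
word \type {effe} is an arbitrary choice and what actually comes out depends on
the ligatures that the font provides.

\starttyping
{effe}\quad
{e{\glyphoptions\noleftligatureglyphoptioncode  f}fe}\quad
{e{\glyphoptions\norightligatureglyphoptioncode f}fe}\quad
\stoptyping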

There are alternative mechanisms, like a blocker that implements a font feature
and a replacement mechanism, but these are not discussed here.

\stopsection

\stopchapter

\stopcomponent