% language=uk

\startcomponent languages-basics

\environment languages-environment

\startchapter[title=Some basics][color=darkyellow]

\startsection[title={Introduction}]

In this chapter we will see how we can toggle between languages. A first
introduction to patterns will be given. Some details of how to control the
hyphenation with specific patterns will be given in a later chapter.

\stopsection

\startsection[title={Available languages}]

When you use the English version of \CONTEXT\ you will default to US English as
the main language. This means that hyphenation will be US specific, which by the
way is different from the rules in GB. All labels that are generated by the system
are also in English. Languages can often be accessed by names like \type
{english} or \type {dutch} although it is quite common to use the short tags like
\type {en} and \type {nl}. Because we want to be as compatible as possible with
\MKII, there are quite a few synonyms. The following table lists the languages
for which support is built|-|in.\footnote {More languages can be defined. It is
up to users to provide the information.}

\startbuffer
\usemodule[languages-system]

\loadinstalledlanguages
\showinstalledlanguages
\stopbuffer

\getbuffer

You can call up such a table with the following commands:

\typebuffer

Instead you can run \type {context --global languages-system.mkiv}.

As you can see, many languages have hyphenation patterns but for Japanese,
Korean, Chinese as well as Arabic languages they make no sense. The patterns are
loaded on demand. The number is the internal number that is used in the engine; a
user never has to use that number. Numbers $<1$ are used to disable hyphenation.
The file tag is used to locate and load a specification. Such files have names
like \type {lang-nl.lua}.

Some languages share the same hyphenation patterns but can have demands that
differ, like labels or quotes. The characters shown in the table are those found
in the pattern files. The number of patterns differs a lot between languages.
This relates to the system behind them. Some languages use word stems, others
base their hyphenation on syllables. Some languages have inflections, which adds to
the complexity while others can combine words in ways that demand special care
for word boundaries. Of course a low or high number can signal a low quality as
well, but most pattern collections are assembled over many years and updated when
for instance spelling rules change. I think that we can safely say that most patterns
are quite stable and of good quality.

\stopsection

\startsection[title=Switching]

The document language is set with

\starttyping
\mainlanguage[en]
\stoptyping

but when you want to apply the proper hyphenation rules to an embedded language
you can use:

\starttyping
\language[en]
\stoptyping

or just:

\starttyping
\en
\stoptyping

The main language determines what labels show up, how numbering happens, in what
way dates get formatted, etc. Normally the \typ {\mainlanguage} command comes
before the \typ {\starttext} command.
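
As a minimal sketch (the Dutch and German sentences are made up for illustration),
a document with Dutch as main language that embeds snippets in other languages
could look as follows. The grouping keeps each switch local:

\starttyping
\mainlanguage[nl]

\starttext

Dit is een Nederlandse zin met {\en an embedded English phrase} erin.

\start \language[de]
    Ein kurzer deutscher Absatz. \par
\stop

\stoptext
\stoptyping

Here labels and dates follow the main language, while the embedded snippets only
get their own hyphenation rules.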

\stopsection

\startsection[title=Hyphenation]

In \LUATEX\ each character that gets typeset not only carries a font id and character
code, but also a language number. You can switch language whenever you want and
the change will be carried with the characters. Switching within a word doesn't make
sense but it is permitted:

\starttabulate[|||T|]
\NC 1 \NC \type{\de incrediblykompliziert}      \NC \hyphenatedword{\de incrediblykompliziert}     \NC \NR
\NC 2 \NC \type{\en incrediblykompliziert}      \NC \hyphenatedword{\en incrediblykompliziert}     \NC \NR
\NC 3 \NC \type{\en incredibly\de kompliziert}  \NC \hyphenatedword{\en incredibly\de kompliziert} \NC \NR
\NC 4 \NC \type{\en incredibly\de\-kompliziert} \NC \hyphenatedword{\en incredibly\de\-kompliziert} \NC \NR
\NC 5 \NC \type{\en incredibly\de-kompliziert}  \NC \hyphenatedword{\en incredibly\de-kompliziert} \NC \NR
\stoptabulate

In line 4 we have a \type {\-} between the two words, and in the last
line just a \type {-}. If you look closely you will notice that the snippets
can be quite small. If we typeset a word with a 1mm text width we get this:

\blank \start \en \hsize 1mm incredibly \par \stop \blank

If you are familiar with the details of hyphenation, you know that the number of
characters at the end and beginning of a word is controlled by the two variables
\typ {\lefthyphenmin} and \typ {\righthyphenmin}. However, these only influence
the hyphenation process. What bits and pieces eventually end up on a line is
determined by the par builder and there the \type {\hsize} matters. In practice
you will not run into these situations, unless you have extremely long words and a
narrow column.

Hyphenation normally is limited to regular characters that make up the alphabet of
a language. It is insensitive to capitalization, as the following text shows:

\blank

\startnarrower
\hyphenatedword {This time the musical distraction while developing code came
from watching youtube performances of Cory Henry (also known from Snarky Puppy,
a conglomerate of excellent players). Just search the web for his name with \quote
{Stevie Wonder and Michael Jackson Tribute}. There is no keyboard he can't play.
Another interesting keyboard player is Sun Rai (a short name for Rai
Thistlethwayte), just google for \quote {The Beatles, Come Together, Live Piano
Acoustic with Loop Pedal}, or do a combined search with \quote {Matt
Chamberlain}. Okay, and talking of keyboards, let's not forget Vika Yermolyeva
(vkgoeswild) as she's one of a kind too on the web. And then there is Jacob
Collier, in one word: incredible (or hyphenated the Dutch way {\nl incredible},
let me repeat that in French {\fr incredible}).} \footnote {Don't get me wrong,
there are of course many more fantastic musicians.}
\stopnarrower

\blank

Of course, names are often short and don't need to be hyphenated
(or the left and right settings prohibit it). Another complication with names is
that they can come from another language so we either need to switch language
temporarily or we need to add an exception (more about that later).

\stopsection

\startsection[title=Primitives]

In traditional \TEX\ the language is not a property of a character but is
triggered by a signal in the (so called) list. Think of:

\starttyping
<language 1>this is <language 2>nederlands<language 1> mixed with english
\stoptyping

This number is set by the primitive \typ {\language}. Language triggers are
injected into the list depending on the value of this number. There is also a \typ
{\setlanguage} primitive that can inject triggers without setting the \typ
{\language} number. Because in \LUATEX\ the state is kept with the character
you don't need to worry about the subtle differences here.

In \CONTEXT\ the \typ {\language} and \typ {\setlanguage} commands are overloaded
by a more advanced switch macro. You cannot assume that they work as explained in
general manuals about \TEX. Currently you can still assign a number but that
might change. Just consider the language to be an abstraction and don't mess with
this number. Both commands not only change the current language but also do
specific initializations when needed.

Which characters get involved in hyphenation is historically determined by the
so|-|called \type {\lccode} values. Each character can have such a value which maps
an uppercase to a lowercase character. This concept has been extended in \ETEX\
where it binds to a pattern set (language). However, in \CONTEXT\ the user never
has to worry about such details.

% The \type {\patterns} primitive is
% The \type {\hyphenation} primitive is

In the traditional hyphenator a word will not be hyphenated if the sum of \typ
{\lefthyphenmin} and \typ {\righthyphenmin} exceeds 62. This limitation is not
present in the \LUA\ variant of this routine that will be presented later, as
there is no good reason for it other than implementation constraints.

\stopsection

\startsection[title=Control]

We already mentioned \typ {\lefthyphenmin} and \typ {\righthyphenmin}. These
two variables control the area in a word that is subjected to hyphenation.
Setting these values is a matter of taste but making them too small can result in
bad hyphenation when the patterns are made with the assumption that certain
minima are used. Using a \typ {\lefthyphenmin} of 2 while the patterns are made
with a value of 3 in mind is a bad idea.

\startlinecorrection[blank]
\startluacode
context.bTABLE { option = "stretch", align= "middle" }
    context.bTR()
        context.bTD { ny = 2, align = "middle,lohi", style = "monobold" }
            context.verbatim("\\lefthyphenmin")
        context.eTD()
        context.bTD { nx = 5, style = "monobold" }
            context.verbatim("\\righthyphenmin")
        context.eTD()
    context.eTR()
    context.bTR()
        for right=1,5 do
            context.bTD()
                context.mono(right)
            context.eTD()
        end
    context.eTR()
    for left=1,5 do
        context.bTR()
            context.bTD()
                context.mono(left)
            context.eTD()
            for right=1,5 do
                context.bTD()
                    context("\\lefthyphenmin %s \\righthyphenmin %s \\hyphenatedword{interesting}",left,right)
                context.eTD()
            end
        context.eTR()
    end
context.eTABLE()
\stopluacode
\stoplinecorrection

When \TEX\ breaks a paragraph into lines it will try to do so without hyphenation.
When that fails (read: when the badness becomes too high) a next effort will take
hyphenation into account. \footnote {Because in \LUATEX\ we always hyphenate
there is no real gain in trying not to hyphenate. Because in traditional \TEX\
hyphenation happens on the fly a pass without hyphenating makes more sense.} When
the badness is still too high, an optional emergency pass can be made but only
when the tolerances are set to permit this. In \CONTEXT\ you can try these
settings when you get too many over- or underfull boxes reported on the console.

\starttyping
\setupalign[tolerant]
\setupalign[verytolerant]
\setupalign[verytolerant,stretch]
\stoptyping

Personally I tend to use the last setting, especially in automated flows. After
all, \TEX\ will not apply stretch unless it's really needed.
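
If you want to limit such a setting to a problematic part of a document, you can
apply it inside a group. The following is only a sketch: the narrow width and the
sample file \type {tufte} are arbitrary choices made for this illustration.

\starttyping
\start
    \hsize 5cm
    \setupalign[verytolerant,stretch]
    \input tufte \par
\stop
\stoptyping

The explicit \type {\par} makes sure that the paragraph is broken into lines
before the group ends and the settings are restored.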

The two \typ {\*hyphenmin} parameters can be set any time and the current value
is stored with each character. They can also be set with the language which we
will see later.
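
A small sketch of a local change, using the same test word as the table above
(the values 3 and 3 are just an example):

\starttyping
\start
    \lefthyphenmin 3 \righthyphenmin 3
    \hyphenatedword{interesting}
\stop
\stoptyping

Because the assignments happen inside a group, the previous values are active
again after the \type {\stop}.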

When \TEX\ hyphenates words it has to decide where a word starts and ends. In
traditional \TEX\ a word normally starts at a character that falls within the
scope of the hyphenator. It ends when a box (hlist or vlist) is seen, but also
at a rule, discretionary, accent (forget about this in \CONTEXT) or math. An
example will be given in the chapter that discusses the \LUA\ alternative.

\stopsection

\startsection[title=Installing]

    todo

\stopsection

\startsection[title=Modes]

Languages are one of the mechanisms where you can access the current state. There are
for instance two (official) macros that contain the current (main) language:

\startbuffer
\starttabulate[||Tc|]
\HL
\NC \bf macro                    \NC \bf value            \NC \NR
\HL
\NC \type {\currentmainlanguage} \NC \currentmainlanguage \NC \NR
\NC \type {\currentlanguage}     \NC \currentlanguage     \NC \NR
\HL
\stoptabulate
\stopbuffer

\getbuffer

When we have set \type {\language[nl]} we get this:

\start \nl \getbuffer \stop

If you write a style that needs to adapt to a language you can use modes. There
are several ways to do this:

\startbuffer
\language[nl]

\startmode[**en]
    \color[darkred]{main english}
\stopmode

\startmode[*en]
    \color[darkred]{local english}
\stopmode

\startmode[**nl]
    \color[darkblue]{main dutch}
\stopmode

\startmode[*nl]
    \color[darkblue]{local dutch}
\stopmode

\startmodeset
    [*en] {\color[darkgreen]{english set}}
    [*nl] {\color[darkgreen]{dutch set}}
\stopmodeset
\stopbuffer

\typebuffer

This typesets:

\blank \startpacked \setupindenting[no] \getbuffer \stoppacked \blank
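
For a simple two|-|way choice a conditional can be more convenient than a mode
environment. The following sketch assumes that the same mode names can be tested
with \type {\doifmodeelse}; the two texts are only placeholders.

\starttyping
\language[nl]

\doifmodeelse {*nl}
  {\color[darkblue]{local dutch}}
  {\color[darkred]{not dutch}}
\stoptyping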

When you use setups you can use the following trick:

\startbuffer
\language[nl]

\startsetups language:en
    \color[darkorange]{something english}
\stopsetups

\startsetups language:nl
    \color[darkorange]{something dutch}
\stopsetups

\setups[language:\currentlanguage]
\stopbuffer

\typebuffer

As expected we get:

\blank \start \setupindenting[no] \getbuffer \stop \blank

\stopsection

\stopchapter

\stopcomponent