% language=uk

\usemodule[fnt-24]

\startcomponent mk-cjk

\environment mk-environment

\definefontfallback [FullTyping] [adobemyungjostd-medium] [0x3000-0xFFFF] [check=yes,force=no]
\definefontfallback [FullTyping] [adobesongstd-light]     [0x3000-0xFFFF] [check=yes,force=no]

\definefontsynonym  [MyTyping]  [lmmono10-regular] [fallbacks=FullTyping]
\definefont[MyTypingFont][MyTyping sa 1]

\nonknuthmode

\chapter{Chinese, Japanese and Korean, aka CJK}

\start \setuptyping[style=\MyTypingFont] % begin of typing hackery

{\em This aspect of \MKIV\ is under construction. We use non-realistic examples.
We need to reimplement Chinese numbering in \LUA, etc.\ etc.}

{\em todo: There is no need for checking the width if the halfwidth feature is turned on.}

\subject{introduction}

In \CONTEXT\ \MKII\ we support \CJK\ languages. Intercharacter spacing as
well as linebreaks are taken care of. Chinese numbering is dealt with and
labels and other language specific aspects are supported too. The implementation
uses active characters and some special encoding subsystem. Although it works
quite okay, in \MKIV\ we follow a different route.

The current implementation is an intermediate one and is used to explore the
possibilities and identify needs. One handicap in implementing \CJK\ support is
that the wishlist of features and behaviour is somewhat dependent on who you talk
to. This means that the implementation will have some default behaviour but can be
tuned to specific needs. The current implementation uses the script related
analyser and is triggered by fonts but at some point I may decide to provide
analysing independent of fonts.

As with all things \TEX, we need to find a proper font to get our document typeset
and because \CJK\ fonts are normally quite large they are not always available on
your system by default.
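
When a \CJK\ font is installed but should only supply the characters that the
main font lacks, a font fallback can be used. The following is a minimal sketch
along the lines of what this chapter does for its own verbatim font. The names
\type {CJKFallback}, \type {MySerif} and \type {MySerifFont} are made up for
illustration, and the sketch assumes that both fonts can be found on your system.

\starttyping
% hypothetical names: CJKFallback, MySerif, MySerifFont
% take the range 0x3000-0xFFFF from a CJK font, but only for characters
% that this font really has (check=yes) and without overloading what
% the main font already provides (force=no)
\definefontfallback [CJKFallback] [adobesongstd-light] [0x3000-0xFFFF] [check=yes,force=no]

\definefontsynonym [MySerif] [lmroman10-regular] [fallbacks=CJKFallback]
\definefont [MySerifFont] [MySerif sa 1]
\stoptyping

This only takes care of finding glyphs. The script specific spacing and
linebreak handling discussed in this chapter still has to be enabled
separately.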

\subject{scripts and languages}

I'm no expert on \CJK\ and will never be one so don't expect much insight into the
scripts and languages here. Here we only look at the way a sequence of characters
in the input turns into a typeset paragraph. For that it is important to keep in
mind that in a Korean or Japanese text we might find Chinese characters and that
the spacing rules become somewhat fuzzy because of that. For instance Korean has spaces
between words and words can be broken at any point, while Chinese has no spaces.

Officially Chinese runs from top to bottom but here we focus on the horizontal
variant. When turned into glyphs the characters normally are of equal width
and in principle we could expect them all to be vertically aligned. However, a
font can have characters that take half that space: so-called halfwidth
characters. And, of course, in practice a font might have shapes that fall into
this category but happen to have their own width which deviates from this.

This means that a mechanism that deals with \CJK\ has to take care of a few
things:

\startitemize[packed]
\item Spaces at the end of the line (or actually anywhere in the input stream)
      need to be removed but only for Chinese.
\item Opening and closing symbols as well as punctuation need special treatment
      especially when they are halfwidth.
\item Korean uses proportionally spaced punctuation and mixes with other Latin fonts,
      while Chinese often uses built-in Latin shapes.
\item We may break anywhere but not after an opening symbol like~( and not
      before a closing symbol like~).
\item We need to deal with mixed Chinese and Korean spacing rules.
\stopitemize

Let's start with showing some Korean. We use one of the fonts shipped
by Adobe as part of Acrobat but first we define a Korean featureset and
a font.

\startbuffer
\definefontfeature
  [korean]
  [script=hang,language=kor,mode=node,analyze=yes]

\definefont[KoreanSample][adobemyungjostd-medium*korean]
\stopbuffer

\typebuffer \getbuffer

Korean looks like this:

\startbuffer
\KoreanSample \setscript[hangul]

모든 인간은 태어날 때부터 자유로우며 그 존엄과 권리에 있어 동등하다.
인간은 천부적으로 이성과 양심을 부여받았으며 서로 형제애의 정신으로
행동하여야 한다.
\stopbuffer

\typebuffer \start \getbuffer \stop

The Korean script reflects syllables and is very structured.
Although modern fonts contain prebuilt syllables one can also use
the jamo alphabet to build them from components. The following
example is provided by Dohyun Kim:

\startbuffer
\definefontfeature [medievalkorean] [mode=node,script=hang,lang=kor,ccmp=yes,ljmo=yes,vjmo=yes,tjmo=yes]
\definefontfeature [modernkorean]   [mode=node,script=hang,lang=kor]

\enabletrackers[scripts.analyzing]
\setscript[hangul]
\definedfont [UnBatang*medievalkorean at 20pt] ᄒᆞᆫ글 \ruledhbox{ᄒᆞᆫ글} \ruledhbox{ᄒᆞᆫ} \ruledhbox{글}\blank
\definedfont [UnBatang*modernkorean   at 20pt] ᄒᆞᆫ글 \ruledhbox{ᄒᆞᆫ글} \ruledhbox{ᄒᆞᆫ} \ruledhbox{글}\blank
\disabletrackers[scripts.analyzing]
\stopbuffer

\typebuffer \start \getbuffer \stop
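
To make the composition from components more explicit, a syllable can also be
entered by code point. This is only a sketch: it assumes that the first syllable
in the example above consists of the three conjoining jamo U+1112, U+119E and
U+11AB, and that the UnBatang font and the \type {medievalkorean} featureset
from that example are available.

\starttyping
\setscript[hangul]
\definedfont [UnBatang*medievalkorean at 20pt]
% leading consonant, vowel and trailing consonant (assumed code points)
\ruledhbox{\utfchar{"1112}\utfchar{"119E}\utfchar{"11AB}}
\stoptyping

If the font features do their work, the three components end up as one
syllable block inside the box.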

There are subtle differences between the medieval and modern
shapes. It was this example that led to more advanced \type
{tounicode} support in \MKIV\ so that copy and paste works out
well now for such input.

For Chinese we define a couple of features:

\startbuffer
\definefontfeature
  [chinese-traditional]
  [mode=node,script=hang,lang=zht]
\definefontfeature
  [chinese-simple]
  [mode=node,script=hang,lang=zhs]
\definefontfeature
  [chinese-traditional-hw]
  [mode=node,script=hang,lang=zht,hwid=yes]
\definefontfeature
  [chinese-simple-hw]
  [mode=node,script=hang,lang=zhs,hwid=yes]
\stopbuffer

\typebuffer \getbuffer

\startbuffer
\definefont[ChineseSampleFW][adobesongstd-light*chinese-traditional]
\definefont[ChineseSampleHW][adobesongstd-light*chinese-traditional-hw]
\setscript[hanzi]

\ChineseSampleFW
兡也包因沘氓侷柵苗孫孫財崧淫設弼琶跑愍窟榜蒸奭稽
霄瓢館縲擻鼕〈孃魔釁〉佉沎岠狋垚柛胅娭涘罞偟惈牻荺
傒焱菏酡廅滘絺赩塴榗箂踃嬁澕蓴醊獧螗餟燱螬駸礑鎞
瀧鄿瀯騬醹躕鱕。

\ChineseSampleHW
兡也包因沘氓侷柵苗孫孫財崧淫設弼琶跑愍窟榜蒸奭稽
霄瓢館縲擻鼕〈孃魔釁〉佉沎岠狋垚柛胅娭涘罞偟惈牻荺
傒焱菏酡廅滘絺赩塴榗箂踃嬁澕蓴醊獧螗餟燱螬駸礑鎞
瀧鄿瀯騬醹躕鱕。
\stopbuffer

\typebuffer \start \getbuffer \stop
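
A quick way to see what the \type {hwid} feature does to a single symbol is to
put it in a ruled box in both fonts. The sketch below reuses the two fonts
defined above. Whether the later boxes really come out narrower depends on the
font providing halfwidth variants, so no particular outcome is promised here.

\starttyping
\start \setscript[hanzi] \dontleavehmode
    % same opening symbol, first fullwidth, then with hwid enabled
    \ruledhbox{\ChineseSampleFW 〈} \quad
    \ruledhbox{\ChineseSampleHW 〈} \quad
    % a full stop with hwid enabled
    \ruledhbox{\ChineseSampleHW 。}
\stop
\stoptyping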

A few more samples:

\startbuffer
\definefont[ChFntAT][name:adobesongstd-light*chinese-traditional-hw at 16pt]
\definefont[ChFntBT][name:songti*chinese-traditional                at 16pt]
\definefont[ChFntCT][name:fangsong*chinese-traditional              at 16pt]

\definefont[ChFntAS][name:adobesongstd-light*chinese-simple-hw      at 16pt]
\definefont[ChFntBS][name:songti*chinese-simple                     at 16pt]
\definefont[ChFntCS][name:fangsong*chinese-simple                   at 16pt]
\stopbuffer

\typebuffer \getbuffer

In these fonts traditional comes out as follows:

\start \setscript[hanzi]
\startlines
\ChFntAT 我〈能吞下玻璃而不傷身〉體。
\ChFntBT 我〈能吞下玻璃而不傷身〉體。
\ChFntCT 我〈能吞下玻璃而不傷身〉體。
\stoplines
\stop

And simple as:

\start \setscript[hanzi]
\startlines
\ChFntAS 我〈能吞下玻璃而不伤身〉体。
\ChFntBS 我〈能吞下玻璃而不伤身〉体。
\ChFntCS 我〈能吞下玻璃而不伤身〉体。
\stoplines
\stop

\subject {tracing}

As usual in \CONTEXT, we have some tracing built in. When you say

\startbuffer
\enabletrackers[scripts.analyzing]
\stopbuffer

\typebuffer

you will get the characters in the output colored according to the category
that the analyser assigned to them. When you say

\startbuffer
\enabletrackers[scripts.injections]
\stopbuffer

\typebuffer

some rudimentary information will be written to the log about what gets
inserted in the nodelist.

Analyzed input looks like:

\startbuffer
아아, 나는 이제야 도(道)를 알았도다. 마음이 어두운 자는 이목이
누(累)가 되지 않는다. 이목만을 믿는 자는 보고 듣는 것이
더욱 밝혀져서 병이 되는 것이다. 이제 내 마부가 발을 말굽에
밟혀서 뒷차에 실리었으므로, 나는 드디어 혼자 고삐를 늦추어
강에 띄우고, 무릎을 구부려 발을 모으고 안장 위에 앉았다.
한번 떨어지면 강이나 물로 땅을 삼고, 물로 옷을 삼으며,
물로 몸을 삼고, 물로 성정을 삼을 것이다. 이제야 내 마음은
한번 떨어질 것을 판단한 터이므로, 내 귓속에 강물 소리가 없어졌다.
무릇 아홉 번 건너는데도 걱정이 없어 의자 위에서 좌와(坐臥)하고
기거(起居)하는 것 같았다.
\stopbuffer

\typebuffer \start \enabletrackers[scripts.analyzing] \KoreanSample \setscript[hangul] \getbuffer \disabletrackers[scripts.analyzing] \stop
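
Trackers accept a comma separated list, so both can be switched on for just a
part of a document. The following is a minimal sketch that assumes the \type
{\KoreanSample} font defined earlier. Because trackers are not bound to
grouping, they are disabled explicitly, and only after the paragraph has ended.

\starttyping
\start
    \enabletrackers[scripts.analyzing,scripts.injections]
    \KoreanSample \setscript[hangul]
    아아, 나는 이제야 도(道)를 알았도다.
    \par % finish the paragraph while the trackers are still active
    \disabletrackers[scripts.analyzing,scripts.injections]
\stop
\stoptyping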

For developers (and those who provide them with input) we have another tracing option:

\startbuffer
\definedfont[arialuni*korean at 10pt] \setscript[hangul] \ShowCombinationsKorean
\stopbuffer

\typebuffer

We need to use a font that supports Chinese as well as Korean. This gives quite some output.

\start \getbuffer \stop

% 안녕하세요? (Hello)
% 감사합니다. (Thank you)

\page \stop % end of typing hackery

\stopcomponent

% \font\JapaneseFontA=name:kozminprovi-regular
%
% \startlines
% Hankaku          : {\JapaneseFontA ｱｲｳｴｵｶｷｸｹｺｻｼｽｾｿﾀﾁﾂﾃ}
% Romanj digits    : {\JapaneseFontA ０１２３４５６７８９}
% Romanj lowercase : {\JapaneseFontA ａｂｃｄｅｆｇｈｉ}
% Romanj uppercase : {\JapaneseFontA ＡＢＣＤＥＦＧＨＩ}
% \stoplines
%
% \enabletrackers[scripts.analyzing]
%
% \start \raggedright \dontleavehmode
%     \ruledhbox\bgroup \ChFntBS ，\egroup  \quad
%     \ruledhbox\bgroup \ChFntBS 〉\egroup \quad
%     \ruledhbox\bgroup \ChFntBS 〈\egroup \par
% \stop
%
% \def\DoChineseSample#1#2#3%
%   {\ruledvtop{#1\hsize#2\relax#3}}
%
% \def\ChineseSampleA#1#2{%
%     \blank
%     \subsubject{hsize #2, fullwidth}
%     \dontleavehmode
%         \DoChineseSample{#1}{#2}{吞吞吞，吞吞吞吞。}\quad
%         \DoChineseSample{#1}{#2}{吞吞吞，，吞吞吞吞。}\quad
%         \DoChineseSample{#1}{#2}{吞吞吞〉吞吞吞吞。}\quad
%         \DoChineseSample{#1}{#2}{吞吞吞〉，吞吞吞吞。}
%     \blank[small]
%     \dontleavehmode
%         \DoChineseSample{#1}{#2}{吞吞吞〉〉吞吞吞吞。}\quad
%         \DoChineseSample{#1}{#2}{吞吞吞〉〉吞吞吞吞。}\quad
%         \DoChineseSample{#1}{#2}{〈吞吞吞吞吞吞吞。}\quad
%         \DoChineseSample{#1}{#2}{〈〈吞吞吞吞吞吞吞。}
%     \blank[small]
%     \dontleavehmode
%         \DoChineseSample{#1}{#2}{吞吞吞…吞吞吞吞。}\quad
%         \DoChineseSample{#1}{#2}{吞吞吞……吞吞吞吞。}
%     \dontleavehmode
%     \blank
% }
%
% \ChineseSampleA\ChFntBS{4.25em}
% \ChineseSampleA\ChFntBS{4.00em}
% \ChineseSampleA\ChFntBS{3.75em}
% \ChineseSampleA\ChFntBS{3.50em}
% \ChineseSampleA\ChFntBS{3.25em}
% \ChineseSampleA\ChFntBS{3.00em}
%
% \def\ChineseSampleB#1#2{%
%     \blank
%     \subsubject{hsize #2, halfwidth}
%     \dontleavehmode
%         \DoChineseSample{#1}{#2}{吞吞吞,吞吞吞吞。}\quad
%         \DoChineseSample{#1}{#2}{吞吞吞‘吞吞吞吞。}\quad
%         \DoChineseSample{#1}{#2}{吞吞吞’吞吞吞吞。}\quad
%     \blank
% }
%
% \ChineseSampleB\ChFntBS{4.25em}
% \ChineseSampleB\ChFntBS{4.00em}
% \ChineseSampleB\ChFntBS{3.75em}
% \ChineseSampleB\ChFntBS{3.50em}
% \ChineseSampleB\ChFntBS{3.25em}
% \ChineseSampleB\ChFntBS{3.00em}
%
% \disabletrackers[scripts.analyzing]