% language=uk

\startcomponent mk-goingutf

\environment mk-environment

\chapter{Going \UTF}

\LUATEX\ only understands input codes in the Universal Character
Set Transformation Format, aka \UCS\ Transformation Format, better
known as \UTF. There is a good reason for this universal view
on characters: whatever support gets hard coded into the programs,
it's never enough, as 25 years of \TEX\ history have clearly
demonstrated. Macro packages often support more or less standard
input encodings, as well as local standards, user adapted ones,
etc.

There is enough information on the Internet and in books about what
exactly \UTF\ is. If you don't know the details yet: \UTF\ is a
multi||byte encoding. The characters with a byte code up to 127 map
onto their normal \ASCII\ representation. A larger number indicates
that the following bytes are part of the character code. Up to 4~bytes
make up a \UTF-8 code, while \UTF-16 uses pairs of bytes: one pair for
characters up to 0xf{}f{}f{}f and two pairs for those beyond.

\starttabulate[|c|c|c|c|c|]
\NC \bf byte 1 \NC \bf byte 2 \NC \bf byte 3 \NC \bf byte 4 \NC \bf unicode            \NC \NR
\NC 192--223   \NC 128--191   \NC            \NC            \NC 0x80--0x7f{}f          \NC \NR
\NC 224--239   \NC 128--191   \NC 128--191   \NC            \NC 0x800--0xf{}f{}f{}f    \NC \NR
\NC 240--247   \NC 128--191   \NC 128--191   \NC 128--191   \NC 0x10000--0x1f{}f{}f{}f{}f \NC \NR
\stoptabulate

In \UTF-8 the byte values in the range $128$--$191$ are illegal
as the first byte of a sequence. The byte values 254 and 255 are
completely illegal and should not appear at all since they are
related to \UTF-16.
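
To make the byte layout of the table concrete, here is a minimal
\LUA\ sketch that turns one such byte sequence into a code point.
The function name is made up for the occasion and, like \LUATEX\
itself, it simply assumes that the input is valid \UTF-8.

\starttyping
-- illustrative only: decode one utf-8 sequence starting at position i,
-- returning the code point and the number of bytes consumed
local function utfdecode(str, i)
    local b = string.byte(str, i)
    if b < 128 then
        return b, 1                           -- plain ascii
    elseif b < 224 then
        local b2 = string.byte(str, i+1)
        return (b - 192) * 64 + (b2 - 128), 2 -- two byte sequence
    elseif b < 240 then
        local b2, b3 = string.byte(str, i+1, i+2)
        return (b - 224) * 4096 + (b2 - 128) * 64 + (b3 - 128), 3
    else
        local b2, b3, b4 = string.byte(str, i+1, i+3)
        return (b - 240) * 262144 + (b2 - 128) * 4096
             + (b3 - 128) * 64 + (b4 - 128), 4
    end
end
\stoptyping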

Instead of providing a never|-|complete truckload of other input
formats, \LUATEX\ sticks to one input encoding but at the same
time provides hooks that permit users to write filters that
preprocess their input into \UTF.

While writing the \LUATEX\ code as well as the \CONTEXT\ input
handling, we experimented a lot. Right from the beginning we had
a pretty clear picture of what we wanted to achieve and how it
could be done, but in the end we arrived at solutions that permitted
fast and efficient \LUA\ scripting as well as a simple interface.

What is involved in handling any input encoding and especially
\UTF? First of all, we wanted to support \UTF-8 as well as
\UTF-16. \LUATEX\ implements \UTF-8 rather straightforwardly: it
just assumes that the input is usable \UTF. This means that
it does not combine characters. There is a good reason for this:
any automation needs to be configurable (on|/|off) and the more
is done in the core, the slower it gets.

In \UNICODE, when a character is followed by an \quote
{accent}, the standard may prescribe that these two characters are
replaced by one. Of course, when characters turn into glyphs, and
when no matching glyph is present, we may need to decompose any
character into components and paste them together from glyphs in
fonts. Therefore, as a first step, a collapser was written. In the
(pre|)|loaded \LUA\ tables we have stored information about
what combinations of characters need to be combined into another
character.

So, an \type {a} followed by an \type {`} becomes \type {à} and
an \type {e} followed by \type {"} becomes \type {ë}. This
process is repeated till no more sequences combine. After a few
alternatives we arrived at a solution that is acceptably fast:
mere milliseconds per average page. Experiments demonstrated that
we could not gain much by implementing this in pure~C, but we did
gain some speed by using a dedicated loop||over||utf||string
function.
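
The following is a minimal sketch of such a collapsing pass. The
\type {combinations} table shown here is a made up two||entry
fragment (using combining marks) and the function is not the one
that \CONTEXT\ actually uses; it only illustrates the idea of
repeatedly replacing pairs until nothing combines any more.

\starttyping
-- illustrative only: a tiny made up fragment of a combination table
local utfchar = unicode.utf8.char

local combinations = {
    ["a" .. utfchar(0x0300)] = utfchar(0x00E0), -- a + grave     -> à
    ["e" .. utfchar(0x0308)] = utfchar(0x00EB), -- e + diaeresis -> ë
}

local function collapse(line)
    local again = true
    while again do
        again = false
        local result, previous = { }, nil
        for c in string.utfcharacters(line) do
            local combined = previous and combinations[previous .. c]
            if combined then
                result[#result] = combined -- overwrite the previous char
                previous, again = combined, true
            else
                result[#result+1] = c
                previous = c
            end
        end
        line = table.concat(result)
    end
    return line
end
\stoptyping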

A second \UTF\ related issue is \UTF-16. This coding scheme comes
in two endian variants. We wanted to do the conversion in \LUA,
but decided to play a bit with a multi||byte file read function.
After some experiments we quickly learned that hard coding such
methods in \TEX\ was doomed to be complex, and the whole idea
behind \LUATEX\ is to make things less complex. The complexity has
to do with the fact that we need some control over the different
linebreak triggers, that is, (combinations of) character 10 and/or 13. In
the end, the multi||byte readers were removed from the code and we
ended up with a pure \LUA\ solution, which could be sped up by
using a multi||byte loop||over||string function.

Instead of hard coding solutions in \LUATEX, a couple of fast
loop||over||string functions were added to the \LUA\ string
function repertoire and the solutions were coded in \LUA. We did
extensive timing with huge \UTF-16 encoded files, and are
confident that fast solutions can be found. Keep in mind that
reading files is never the bottleneck anyway. The only drawback
of an efficient \UTF-16 reader is that the file is loaded into
memory, but this is hardly a problem.
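
As an illustration, the core of such a conversion for the little
endian case could look as follows. This is a simplified sketch with
a made up function name: it assumes that the byte order mark has
already been stripped and it does not split the result into lines.

\starttyping
-- illustrative only: convert little endian utf-16 data to utf-8
local utfchar = unicode.utf8.char

local function utf16le_to_utf8(data)
    local result, i, n = { }, 1, #data
    while i < n do
        local low, high = string.byte(data, i, i+1)
        local unit = high * 256 + low
        i = i + 2
        if unit >= 0xD800 and unit <= 0xDBFF and i < n then
            -- a surrogate pair encodes a character beyond 0xFFFF
            local low2, high2 = string.byte(data, i, i+1)
            unit = 0x10000 + (unit - 0xD800) * 0x400
                           + (high2 * 256 + low2 - 0xDC00)
            i = i + 2
        end
        result[#result+1] = utfchar(unit)
    end
    return table.concat(result)
end
\stoptyping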

Concerning arbitrary input encodings, we can be brief. It's rather
easy to loop over a string and replace characters in the $0$--$255$
range by their \UTF\ counterparts. All one needs to do is maintain
conversion tables, and \TEX\ macro packages have always done that.
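
A sketch of such a conversion is given below; the \type {regime}
table is assumed to map the byte values 128 upto 255 onto \UNICODE\
code points, and the function name is again made up.

\starttyping
-- illustrative only: remap the upper half of an 8 bit input line
local utfchar = unicode.utf8.char

local function toutf(line, regime)
    return (string.gsub(line, "[\128-\255]", function(chr)
        return utfchar(regime[string.byte(chr)])
    end))
end
\stoptyping

A latin~1 regime, for instance, maps byte 233 onto code point
0x00E9, the character \type {é}.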

Yet another (more obscure) kind of remapping concerns those special
\TEX\ characters. If we use a traditional \TEX\ auxiliary file, then
we must make sure that for instance percent signs, hashes, dollars
and other characters are handled right. If we set the catcode of
the percent sign to \quote {letter}, then we get into trouble when
such a percent sign ends up in the table of contents and is read in
under a different catcode regime (and becomes for instance a comment
symbol). One way to deal with such situations is to temporarily move
the problematic characters into a private \UNICODE\ area and deal
with them accordingly. In that case they can no longer interfere.
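
A sketch of the hiding step for the percent sign, with a made up
offset into a private use area and a made up helper name:

\starttyping
-- illustrative only: offset and helper name are made up
local utfchar = unicode.utf8.char

local private = 0xE000 -- a slot in a private use area

local function hide_percent(str)
    return (string.gsub(str, "%%", utfchar(private + string.byte("%"))))
end
\stoptyping

The reverse replacement is applied when the auxiliary data is read
in and typeset again.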

Where do we handle such conversions? There are two places where
we can hook converters into the input.

\startitemize[n,packed]
\item each time we read a line from a file, i.e.\ we can hook
      conversion code into the read callbacks
\item using the special \type {process_input_buffer} callback which
      is called whenever \TEX\ needs a new line of input (see the
      sketch after this list)
\stopitemize
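
For the second hook, registering a converter could look like this;
\type {convert_to_utf} is a placeholder for whatever filter is
needed.

\starttyping
-- illustrative only: convert each line that tex fetches from the input
callback.register("process_input_buffer", function(buffer)
    return convert_to_utf(buffer)
end)
\stoptyping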

Because we can overload the standard file open and read functions,
we can easily hook the \UTF\ collapse function into the readers.
The same is true for the \UTF-16\ handler. In \CONTEXT, for
performance reasons we load such files into memory, which means
that we also need to provide a special reader to \TEX. When
handling \UTF-16, we don't need to combine characters, so that stage
is skipped.

So, to summarize this, here is what we do in \CONTEXT. Keep in
mind that we overload the standard input methods and therefore
have complete control over how \LUATEX\ locates and opens files.

\startitemize[n]

\item When we have a \UTF\ file, we will read from that file line
      by line, and combine characters when collapsing is enabled.

\item When \LUATEX\ wants to open a file, we look into the first
      bytes to see if it is a \UTF-16\ file, in either big or
      little endian format (see the sketch after this list). When
      this is the case, we load the
      file into memory, convert the data to \UTF-8, identify
      lines, and provide a reader that will give back the file
      linewise.

\item When we have been told to recode the input (i.e.\ when we have
      enabled an input regime) we use the normal line||by||line
      reader and convert those lines on the fly into valid \UTF.
      No collapsing is needed.

\stopitemize
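
The detection mentioned in the second item boils down to checking
the byte order mark. A minimal sketch, with a made up function name:

\starttyping
-- illustrative only: peek at the first two bytes of a file
local function identify(filename)
    local f = io.open(filename, "rb")
    if not f then
        return nil
    end
    local first, second = string.byte(f:read(2) or "", 1, 2)
    f:close()
    if first == 0xFF and second == 0xFE then
        return "utf-16-le"
    elseif first == 0xFE and second == 0xFF then
        return "utf-16-be"
    else
        return "utf-8" -- assume utf-8, possibly with collapsing
    end
end
\stoptyping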

Because we conduct our experiments in \CONTEXT\ \MKIV, the code that
we provide may look a bit messy and more complex than the previous
description suggests. But keep in mind that a mature macro
package needs to adapt to what users are accustomed to. The fact
that \LUATEX\ moved on to \UTF\ input does not mean that all the
tools that users use and the files that they have produced over
decades automagically convert as well.

Because we are now living in a \UTF\ world, we need to keep that
in mind when we do tricky things with sequences of characters, for
instance in processing verbatim. When we implement verbatim in
pure \TEX\ we can do as before, but when we let \LUA\ kick in, we
need to use string methods that are \UTF-aware. In addition to
the linked-in \UNICODE\ library, there are dedicated iterator
functions added to the \type {string} namespace; think of:

\starttyping
for c in string.utfcharacters(str) do
    something_with(c)
end
\stoptyping

Occasionally we need to output raw 8-bit code, for instance
to \DVI\ or \PDF\ backends (specials and literals). Of course
we could have cooked up a truckload of conversion functions
for this, but during one of our travels to a \TEX\ conference,
we came up with the following trick.

We reserve the 256 values just above the official \UNICODE\ range,
starting at hexadecimal value 0x110000, for byte output. When writing to an
output stream, that offset will be subtracted. So, 0x1100A9 is written
out as hexadecimal byte value A9, which is the decimal value 169, which
in the Latin~1 encoding is the slot for the copyright sign.
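
In \LUA\ terms the trick amounts to no more than adding and
subtracting an offset; the helper names below are made up:

\starttyping
-- illustrative only: map a raw byte to and from the reserved range
local offset = 0x110000

local function byte_to_char(b) -- 0xA9 becomes 0x1100A9
    return offset + b
end

local function char_to_byte(c) -- applied when writing to an output stream
    if c >= offset and c <= offset + 0xFF then
        return c - offset -- 0x1100A9 becomes 0xA9, that is 169
    end
end
\stoptyping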

\stopcomponent