% language=uk

\startcomponent mk-goingutf

\environment mk-environment

\chapter{Going \UTF}

\LUATEX\ only understands input codes in the Universal Character
Set Transformation Format, aka \UCS\ Transformation Format, better
known as \UTF. There is a good reason for this universal view
on characters: whatever support gets hard coded into the programs,
it's never enough, as 25 years of \TEX\ history have clearly
demonstrated. Macro packages often support more or less standard
input encodings, as well as local standards, user adapted ones,
etc.

There is enough information on the Internet and in books about
what exactly \UTF\ is. If you don't know the details yet: \UTF\ is
a multi||byte encoding. The characters with a bytecode up to 127
map onto their normal \ASCII\ representation. A larger number
indicates that the following bytes are part of the character code.
Up to 4~bytes make an \UTF-8 code, while \UTF-16 uses one or two
pairs of bytes per character.

\starttabulate[|c|c|c|c|c|]
\NC \bf byte 1 \NC \bf byte 2 \NC \bf byte 3 \NC \bf byte 4 \NC \bf unicode \NC \NR
\NC 192--223 \NC 128--191 \NC          \NC          \NC 0x80--0x7f{}f          \NC \NR
\NC 224--239 \NC 128--191 \NC 128--191 \NC          \NC 0x800--0xf{}f{}f{}f    \NC \NR
\NC 240--247 \NC 128--191 \NC 128--191 \NC 128--191 \NC 0x10000--0x1f{}f{}f{}f \NC \NR
\stoptabulate

In \UTF-8 the bytes in the range $128$--$191$ are illegal as
first bytes of a character. The bytes 254 and 255 are completely
illegal and should not appear at all, since they are related to
\UTF-16.

Instead of providing a never|-|complete truckload of other input
formats, \LUATEX\ sticks to one input encoding, but at the same
time provides hooks that permit users to write filters that
preprocess their input into \UTF.

While writing the \LUATEX\ code as well as the \CONTEXT\ input
handling, we experimented a lot. Right from the beginning we had
a pretty clear picture of what we wanted to achieve and how it
could be done, but in the end we arrived at solutions that
permitted fast and efficient \LUA\ scripting as well as a simple
interface.

What is involved in handling any input encoding and especially
\UTF? First of all, we wanted to support \UTF-8 as well as
\UTF-16. \LUATEX\ implements \UTF-8 rather straightforwardly: it
simply assumes that the input is usable \UTF. This means that it
does not combine characters. There is a good reason for this: any
automation needs to be configurable (on|/|off), and the more that
is done in the core, the slower it gets.

In \UNICODE, when a character is followed by an \quote {accent},
the standard may prescribe that these two characters are replaced
by one. Of course, when characters turn into glyphs, and when no
matching glyph is present, we may need to decompose any character
into components and paste them together from glyphs in fonts.
Therefore, as a first step, a collapser was written. In the
(pre|)|loaded \LUA\ tables we have stored information about which
combinations of characters need to be combined into another
character.

So, an \type {a} followed by an \type {`} becomes \type {à} and
an \type {e} followed by \type {"} becomes \type {ë}. This
process is repeated till no more sequences combine. After a few
alternatives we arrived at a solution that is acceptably fast:
mere milliseconds per average page. Experiments demonstrated that
we cannot gain much by implementing this in pure~C, but we did
gain some speed by using a dedicated loop||over||utf||string
function.
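A much simplified version of such a collapse pass could look as
follows. It uses the \type {string.utfcharacters} iterator that is
discussed later in this chapter; the \type {combined} table shown
here is a hypothetical two||entry stand||in for the real tables,
which are derived from the \UNICODE\ composition data:

\starttyping
-- a much simplified collapse pass; the 'combined' table is a
-- hypothetical stand-in for the real composition tables
local combined = {
    ["a\204\128"] = "à", -- a + combining grave     (U+0300)
    ["e\204\136"] = "ë", -- e + combining diaeresis (U+0308)
}

local function collapse(str)
    local again = true
    while again do -- repeat till no more sequences combine
        again = false
        local result, prev = { }, nil
        for c in string.utfcharacters(str) do
            if prev and combined[prev .. c] then
                prev = combined[prev .. c] -- two characters become one
                again = true
            else
                if prev then result[#result+1] = prev end
                prev = c
            end
        end
        if prev then result[#result+1] = prev end
        str = table.concat(result)
    end
    return str
end
\stoptyping

In practice the tables are much larger and the loop is a bit more
clever, but the principle is the same.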
A second \UTF\ related issue is \UTF-16. This coding scheme comes
in two endian variants. We wanted to do the conversion in \LUA,
but decided to play a bit with a multi||byte file read function.
After some experiments we quickly learned that hard coding such
methods in \TEX\ was doomed to be complex, and the whole idea
behind \LUATEX\ is to make things less complex. The complexity has
to do with the fact that we need some control over the different
linebreak triggers, that is, (combinations of) characters 10
and/or 13. In the end, the multi||byte readers were removed from
the code and we ended up with a pure \LUA\ solution, which could
be sped up by using a multi||byte loop||over||string function.

Instead of hard coding solutions in \LUATEX, a couple of fast
loop||over||string functions were added to the \LUA\ string
function repertoire and the solutions were coded in \LUA. We did
extensive timing with huge \UTF-16 encoded files, and are
confident that fast solutions can be found. Keep in mind that
reading files is never the bottleneck anyway. The only drawback
of an efficient \UTF-16 reader is that the file is loaded into
memory, but this is hardly a problem.

Concerning arbitrary input encodings, we can be brief. It's rather
easy to loop over a string and replace characters in the
$0$--$255$ range by their \UTF\ counterparts. All one needs to do
is maintain conversion tables, and \TEX\ macro packages have
always done that.

Yet another (more obscure) kind of remapping concerns those
special \TEX\ characters. If we use a traditional \TEX\ auxiliary
file, then we must make sure that for instance percent signs,
hashes, dollars and other characters are handled right. If we set
the catcode of the percent sign to \quote {letter}, then we get
into trouble when such a percent sign ends up in the table of
contents and is read in under a different catcode regime (and
becomes for instance a comment symbol). One way to deal with such
situations is to temporarily move the problematic characters into
a private \UNICODE\ area and deal with them accordingly. In that
case they can no longer interfere.

Where do we handle such conversions? There are two places where
we can hook converters into the input.

\startitemize[n,packed]
\item each time we read a line from a file, i.e.\ we can hook
      conversion code into the read callbacks
\item using the special \type {process_input_buffer} callback
      which is called whenever \TEX\ needs a new line of input
\stopitemize

Because we can overload the standard file open and read functions,
we can easily hook the \UTF\ collapse function into the readers.
The same is true for the \UTF-16\ handler. In \CONTEXT, for
performance reasons we load such files into memory, which means
that we also need to provide a special reader to \TEX. When
handling \UTF-16, we don't need to combine characters, so that
stage is then skipped.
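In its simplest form, the second hook boils down to registering a
function in the callback mechanism. Here is a minimal sketch that
assumes the \type {collapse} function from the earlier sketch; in
\CONTEXT\ itself the registration is wrapped and more selective,
so take this as an illustration only:

\starttyping
-- a minimal sketch: each line that tex fetches from the input is
-- passed through the collapse function before tex sees it
callback.register("process_input_buffer", function(line)
    return collapse(line)
end)
\stoptyping

The string that the function returns is what \TEX\ ends up
processing instead of the original line.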
So, to summarize this, here is what we do in \CONTEXT. Keep in
mind that we overload the standard input methods and therefore
have complete control over how \LUATEX\ locates and opens files.

\startitemize[n]

\item When we have a \UTF\ file, we will read from that file line
      by line, and combine characters when collapsing is enabled.

\item When \LUATEX\ wants to open a file, we look into the first
      bytes to see if it is a \UTF-16\ file, in either big or
      little endian format. When this is the case, we load the
      file into memory, convert the data to \UTF-8, identify
      lines, and provide a reader that will give back the file
      linewise.

\item When we have been told to recode the input (i.e.\ when we
      have enabled an input regime) we use the normal
      line||by||line reader and convert those lines on the fly
      into valid \UTF. No collapsing is needed.

\stopitemize

Because we conduct our experiments in \CONTEXT\ \MKIV, the code
that we provide may look a bit messy and more complex than the
previous description may suggest. But keep in mind that a mature
macro package needs to adapt to what users are accustomed to. The
fact that \LUATEX\ moved on to \UTF\ input does not mean that all
the tools that users use and the files that they have produced
over decades automagically convert as well.

Because we are now living in a \UTF\ world, we need to keep that
in mind when we do tricky things with sequences of characters, for
instance in processing verbatim. When we implement verbatim in
pure \TEX\ we can do as before, but when we let \LUA\ kick in, we
need to use string methods that are \UTF-aware. In addition to
the linked-in \UNICODE\ library, there are dedicated iterator
functions added to the \type {string} namespace; think of:

\starttyping
for c in string.utfcharacters(str) do
    something_with(c)
end
\stoptyping

Occasionally we need to output raw 8-bit code, for instance
to \DVI\ or \PDF\ backends (specials and literals). Of course
we could have cooked up a truckload of conversion functions
for this, but during one of our travels to a \TEX\ conference,
we came up with the following trick.

We reserve the top 256 values of the \UNICODE\ range, starting at
hexadecimal value 0x110000, for byte output. When writing to an
output stream, that offset will be subtracted. So, 0x1100A9 is
written out as hexadecimal byte value A9, which is the decimal
value 169, which in the Latin~1 encoding is the slot for the
copyright sign.
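In \LUA\ terms the trick amounts to something like the following
sketch; the function name \type {tostream} is ours and not the
actual backend code, and we fall back on the linked-in \UNICODE\
library mentioned above for the normal case:

\starttyping
-- a sketch of the offset trick for raw byte output
local function tostream(u)
    if u >= 0x110000 then
        return string.char(u - 0x110000) -- raw byte: 0x1100A9 -> 0xA9
    else
        return unicode.utf8.char(u)      -- normal utf-8 output
    end
end
\stoptyping

\stopcomponent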