1 files changed, 630 insertions, 0 deletions
diff --git a/doc/context/sources/general/manuals/hybrid/hybrid-characters.tex b/doc/context/sources/general/manuals/hybrid/hybrid-characters.tex
new file mode 100644
index 000000000..4800e1500
--- /dev/null
+++ b/doc/context/sources/general/manuals/hybrid/hybrid-characters.tex
@@ -0,0 +1,630 @@
+% language=uk
+
+\startcomponent hybrid-characters
+
+\environment hybrid-environment
+
+\startchapter[title={Characters with special meanings}]
+
+\startsection[title={Introduction}]
+
+When \TEX\ was designed \UNICODE\ was not yet available and characters were
+encoded in a seven or eight bit encoding, like \ASCII\ or \EBCDIC. Also, the
+layout of keyboards was dependent of the vendor. A lot has happened since then:
+more and more \UNICODE\ has become the standard (with \UTF\ as widely used way of
+efficiently coding it).
+
+Also at that time, fonts on computers were limited to 256 characters at most.
+This resulted in \TEX\ macro packages dealing with some form of input encoding on
+the one hand and a font encoding on the other. As a side effect of character
+nodes storing a reference to a glyph in a font hyphenation was related to font
+encodings. All this was quite okay for documents written in English but when
+\TEX\ became pupular in more countries more input as well as font encodings were
+used.
+
+Of course, with \LUATEX\ being a \UNICODE\ engine this has changed, and even more
+because wide fonts (either \TYPEONE\ or \OPENTYPE) are supported. However, as
+\TEX\ is already widely used, we cannot simply change the way characters are
+treated, certainly not special ones. Let's go back in time and see how plain
+\TEX\ set some standards, see how \CONTEXT\ does it currently, and look ahead how
+future versions will deal with it.
+
+\stopsection
+
+\startsection[title={Catcodes}]
+
+Traditional \TEX\ is an eight bit engine while \LUATEX\ extends this to \UTF\
+input and internally works with large numbers.
+
+In addition to its natural number (at most 0xFF for traditional \TEX\ and upto
+0x10FFFF for \LUATEX), each character can have a so called category code, or
+catcode. This code determines how \TEX\ will treat the character when it is seen
+in the input. The category code is stored with the character so when we change
+such a code, already read characters retain theirs. Once typeset a character can
+have turned into a glyph and its catcode properties are lost.
+
+There are 16 possible catcodes that have the following meaning:
+
+\starttabulate[|l|l|p|]
+\NC 0 \NC escape \NC This starts an control sequence. The scanner
+reads the whole sequence and stores a reference to it in an
+efficient way. For instance the character sequence \type {\relax}
+starts with a backslash that has category code zero and \TEX\
+reads on till it meets non letters. In macro definitions a
+reference to the so called hash table is stored. \NC \NR
+\NC 1 \NC begin group \NC This marks the begin of a group. A group
+an be used to indicate a scope, the content of a token list, box
+or macro body, etc. \NC \NR
+\NC 2 \NC end group \NC This marks the end of a group. \NC \NR
+\NC 3 \NC math shift \NC Math starts and ends with characters
+tagged like this. Two in a row indicate display math. \NC \NR
+\NC 4 \NC alignment tab \NC Characters with this property indicate
+a next entry in an alignment. \NC \NR
+\NC 5 \NC end line \NC This one is somewhat special. As line
+endings are operating system dependent, they are normalized to
+character 13 and by default that one has this category code. \NC
+\NR
+\NC 6 \NC parameter \NC Macro parameters start with a character
+with this category code. Such characters are also used in
+alignment specifications. In nested definitions, multiple of them
+in a row are used. \NC \NR
+\NC 7 \NC superscript \NC Tagged like this, a character signals
+that the next token (or group) is to be superscripted. Two such
+characters in a row will make the parser treat the following
+character or lowercase hexadecimal number as specification for
+a replacement character. \NC \NR
+\NC 8 \NC subscript \NC Codes as such, a character signals that
+the next token (or group) is to be subscripted. \NC \NR
+\NC 9 \NC ignored \NC When a character has this category code it
+is simply ignored. \NC \NR
+\NC 10 \NC space \NC This one is also special. Any character tagged
+as such is converted to the \ASCII\ space character with code 32.
+\NC \NR
+\NC 11 \NC letter \NC Normally this are the characters that make op
+sequences with a meaning like words. Letters are special in the sense that
+macro names can only be made of letters. The hyphenation machinery will
+normally only deal with letters. \NC \NR
+\NC 12 \NC other \NC Examples of other characters are punctuation and
+special symbols. \NC \NR
+\NC 13 \NC active \NC This makes a character into a macro. Of course
+it needs to get a meaning in order not to trigger an error. \NC \NR
+\NC 14 \NC comment \NC All characters on the same line after comment
+characters are ignored. \NC \NR
+\NC 15 \NC invalid \NC An error message is issued when an invalid
+character is seen. This catcode is probably not assigned very
+often. \NC \NR
+\stoptabulate
+
+So, there is a lot to tell about these codes. We will not discuss the input
+parser here, but it is good to know that the following happens.
+
+\startitemize[packed]
+\startitem
+    The engine reads lines, and normalizes cariage return
+    and linefeed sequences.
+\stopitem
+\startitem
+    Each line gets a character with number \type {\endlinechar} appended.
+    Normally this is a character with code 13. In \LUATEX\ a value of $-1$ will
+    disable this automatism.
+\stopitem
+\startitem
+    Normally spaces (characters with the space property) at the end of a line are
+    discarded.
+\stopitem
+\startitem
+    Sequences like \type {^^A} are converted to characters with numbers depending
+    on the position in \ASCII\ vector: \type {^^@} is zero, \type {^^A} is one,
+    etc.
+\stopitem
+\startitem
+    Sequences like \type {^^1f} are converted to characters with a number similar
+    to the (lowercase) hexadecimal part.
+\stopitem
+\stopitemize
+
+Hopefully this is enough background information to get through the following
+sections so let's stick to a simple example:
+
+\starttyping
+\def\test#1{$x_{#1}$}
+\stoptyping
+
+Here there are two control sequences, starting with a backslash with category
+code zero. Then comes an category~6 character that indicates a parameter that is
+referenced later on. The outer curly braces encapsulate the definition and the
+inner two braces mark the argument to a subscript, which itself is indicated by
+an underscore with category code~8. The start and end of mathmode is indicated
+with a dollar sign that is tagged as math shift (category code~3). The character
+\type {x} is just a letter.
+
+Given the above description, how do we deal with catcodes and newlines at the
+\LUA\ end? Catcodes are easy: we can print back to \TEX\ using a specific catcode
+regime (later we will see a few of those regimes). As character~13 is used as
+default at the \TEX\ end, we should also use it at the \LUA\ end, i.e.\ we should
+use \type {\r} as line terminator (\type {\endlinechar}). On the other hand, we
+have to use \type {\n} (character 10, \type {\newlinechar}) for printing to the
+terminal, log file, of \TEX\ output handles, although in \CONTEXT\ all that
+happens via \LUA\ anyway, so we don't bother too much about it here.
+
+There is a pitfall. As \TEX\ reads lines, it depends on the file system to
+provide them: it fetches lines or whatever represents the same on block devices.
+In \LUATEX\ the implementation is similar: if you plug in a reader callback, it
+has to provide a function that returns a line. Passing two lines does not work
+out as expected as \TEX\ discards anything following the line separator (cr, lf
+or crlf) and then appends a normalized endline character (in our case
+character~13). At least, this is what \TEX\ does naturally. So, in callbacks you
+can best feed line by line without any of those characters.
+
+When you print something from \LUA\ to \TEX\ the situation is slightly different:
+
+\startbuffer
+\startluacode
+tex.print("line 1\r line 2")
+tex.print("line 3\n line 4")
+\stopluacode
+\stopbuffer
+
+\typebuffer
+
+This is what we get:
+
+\startpacked\getbuffer\stoppacked
+
+The explicit \type {\endlinechar} (\type {\r}) terminates the line and the rest
+gets discarded. However, a \type {\n} by default has category code~12 (other) and
+is turned into a space and successive spaces are (normally) ignored, which is why
+we get the third and fourth line separated by a space.
+
+Things get real hairy when we do the following:
+
+\startbuffer
+\startluacode
+tex.print("\\bgroup")
+tex.print("\\obeylines")
+tex.print("line 1\r line 2")
+tex.print("line 3\n line 4")
+tex.print("\\egroup")
+\stopluacode
+\stopbuffer
+
+\typebuffer
+
+Now we get this (the \type {tex.print} function appends an endline character
+itself):
+
+\startpacked\getbuffer\stoppacked
+
+By making the endline character active and equivalent to \type {\par} \TEX\
+nicely scans on and we get the second line as well. Now, if you're still with us,
+you're ready for the next section.
+
+\stopsection
+
+\startsection[title={Plain \TEX}]
+
+In the \TEX\ engine, some characters already have a special meaning. This is
+needed because otherwise we cannot use the macro language to set up the format.
+This is hard|-|coded so the next code is not really used.
+
+\starttyping
+\catcode `\^^@ =  9  % ascii null is ignored
+\catcode `\^^M =  5  % ascii return is end-line
+\catcode `\\   =  0  % backslash is TeX escape character
+\catcode `\%   = 14  % percent sign is comment character
+\catcode `\    = 10  % ascii space is blank space
+\catcode `\^^? = 15  % ascii delete is invalid
+\stoptyping
+
+There is no real reason for setting up the null and delete character but maybe in
+those days the input could contain them. The regular upper- and lowercase
+characters are initialized to be letters with catcode~11. All other characters
+get category code~12 (other).
+
+The plain \TEX\ format starts with setting up some characters that get a special
+meaning.
+
+\starttyping
+\catcode `\{   =  1 % left brace is begin-group character
+\catcode `\}   =  2 % right brace is end-group character
+\catcode `\$   =  3 % dollar sign is math shift
+\catcode `\&   =  4 % ampersand is alignment tab
+\catcode `\#   =  6 % hash mark is macro parameter character
+\catcode `\^   =  7 \catcode`\^^K=7 % circumflex and uparrow
+                                    % are for superscripts
+\catcode `\_   =  8 \catcode`\^^A=8 % underline and downarrow
+                                    % are for subscripts
+\catcode `\^^I = 10 % ascii tab is a blank space
+\catcode `\~   = 13 % tilde is active
+\stoptyping
+
+The fact that this happens in the format file indicates that it is not by design
+that for instance curly braces are used for grouping, or the hash for indicating
+arguments. Even math could have been set up differently. Nevertheless, all macro
+packages have adopted these conventions so they could as well have been
+hard|-|coded presets.
+
+Keep in mind that nothing prevents us to define more characters this way, so we
+could make square brackets into group characters as well. I wonder how many
+people have used the two additional special characters that can be used for
+super- and subscripts. The comment indicates that it is meant for a special
+keyboard.
+
+One way to make sure that a macro will not be overloaded is to use characters in
+it's name that are letters when defining the macro but make sure that they are
+others when the user inputs text.
+
+\starttyping
+\catcode `@ = 11
+\stoptyping
+
+Again, the fact that plain \TEX\ uses the commercial at sign has set a standard.
+After all, at that time this symbol was not as popular as it is nowadays.
+
+Further on in the format some more catcode magic happens. For instance this:
+
+\starttyping
+\catcode `\^^L = 13 \outer\def^^L{\par} % ascii form-feed is "\outer\par"
+\stoptyping
+
+So, in your input a formfeed is equivalent to an empty line which makes sense,
+although later we will see that in \CONTEXT\ we do it differently. As the tilde
+was already active it also gets defined:
+
+\starttyping \def~{\penalty10000\ } % tie \stoptyping
+
+Again, this convention is adopted and therefore a sort of standard. Nowadays we
+have special \UNICODE\ characters for this, but as they don't have a
+visualization editing is somewhat cumbersome.
+
+The change in catcode of the newline character \type {^^M} is done locally, for
+instance in \type {\obeylines}. Keep in mind that this is the character that
+\TEX\ appends to the end of an input line. The space is made active when spaces
+are to be obeyed.
+
+A few very special cases are the following.
+
+\starttyping
+\mathcode `\^^Z = "8000 % \ne
+\mathcode `\    = "8000 % \space
+\mathcode `\'   = "8000 % ^\prime
+\mathcode `\_   = "8000 % \_
+\stoptyping
+
+This flags those characters as being special in mathmode. Normally when you do
+something like this:
+
+\starttyping
+\def\test#1{$#1$} \test{x_2} \test{x''}
+\stoptyping
+
+The catcodes that are set when passing the argument to \type {\test} are frozen
+when they end up in the body of the macro. This means that when \type {'} is
+other it will be other when the math list is built. However, in math mode, plain
+\TEX\ wants to turn that character into a prime and even in a double one when
+there are two in a row. The special value \type {"8000} tells the math machinery
+that when it has an active meaning, that one will be triggered. And indeed, the
+plain format defined these active characters, but in a special way, sort of:
+
+\starttyping
+{ \catcode`\' = 13 \gdef'{....} }
+\stoptyping
+
+So, when active it has a meaning, and it happens to be only treated as active
+when in math mode.
+
+Quite some other math codes are set as well, like:
+
+\starttyping
+\mathcode`\^^@ = "2201 % \cdot
+\mathcode`\^^A = "3223 % \downarrow
+\mathcode`\^^B = "010B % \alpha
+\mathcode`\^^C = "010C % \beta
+\stoptyping
+
+In Appendix~C of The \TeX book Don Knuth explains the rationale behind this
+choice: he had a keyboard that has these shortcuts. As a consequence, one of the
+math font encodings also has that layout. It must have been a pretty classified
+keyboard as I could not find a picture on the internet. One can probably assemble
+such a keyboard from one of those keyboard that come with no imprint. Anyhow, Don
+explicitly says \quotation {Of course, designers of \TEX\ macro packages that are
+intended to be widely used should stick to the standard \ASCII\ characters.} so
+that is what we do in the next sections.
+
+\stopsection
+
+\startsection[title={How about \CONTEXT}]
+
+In \CONTEXT\ we've always used several catcode regimes and switching between them
+was a massive operation. Think of a different regime when defining macros,
+inputting text, typesetting verbatim, processing \XML, etc. When \LUATEX\
+introduced catcode tables, the existing mechanisms were rewritten to take
+advantage of this. This is the standard table for input as of December 2010.
+
+\starttyping
+\startcatcodetable \ctxcatcodes
+  \catcode \tabasciicode        \spacecatcode
+  \catcode \endoflineasciicode  \endoflinecatcode
+  \catcode \formfeedasciicode   \endoflinecatcode
+  \catcode \spaceasciicode      \spacecatcode
+  \catcode \endoffileasciicode  \ignorecatcode
+  \catcode \circumflexasciicode \superscriptcatcode
+  \catcode \underscoreasciicode \subscriptcatcode
+  \catcode \ampersandasciicode  \alignmentcatcode
+  \catcode \backslashasciicode  \escapecatcode
+  \catcode \leftbraceasciicode  \begingroupcatcode
+  \catcode \rightbraceasciicode \endgroupcatcode
+  \catcode \dollarasciicode     \mathshiftcatcode
+  \catcode \hashasciicode       \parametercatcode
+  \catcode \commentasciicode    \commentcatcode
+  \catcode \tildeasciicode      \activecatcode
+  \catcode \barasciicode        \activecatcode
+\stopcatcodetable
+\stoptyping
+
+Because the meaning of active characters can differ per table there is a related
+mechanism for switching those meanings. A careful reader might notice that the
+formfeed character is just a newline. If present at all, it often sits on its own
+line, so effectively it then behaves as in plain \TEX: triggering a new
+paragraph. Otherwise it becomes just a space in the running text.
+
+In addition to the active tilde we also have an active bar. This is actually one
+of the oldest features: we use bars for signaling special breakpoints, something
+that is really needed in Dutch (education), where we have many compound words.
+Just to show a few applications:
+
+\starttyping
+firstpart||secondpart  this|(|orthat)  one|+|two|+|three
+\stoptyping
+
+In \MKIV\ we have another way of dealing with this. There you can enable a
+special parser that deals with it at another level, the node list.
+
+\starttyping
+\setbreakpoints[compound]
+\stoptyping
+
+When \TEX ies discuss catcodes some can get quite upset, probably because they
+spend some time fighting their side effects. Personally I like the concept. They
+can be a pain to deal with but also can be fun. For instance, support of \XML\ in
+\CONTEXT\ \MKII\ was made possible by using active \type {<} and \type {&}.
+
+When dealing with all kind of inputs the fact that characters have special
+meanings can get in the way. One can argue that once a few have a special
+meaning, it does not matter that some others have. Most complaints from users
+concern \type {$}, \type {&} and \type {_}. When for symmetry we add \type {^} it
+is clear that these characters relate to math.
+
+Getting away from the \type {$} can only happen when users are willing to use for
+instance \type {\m{x}} instead of \type {$x$}. The \type {&} is an easy one
+because in \CONTEXT\ we have always discouraged its use in tables and math
+alignments. Using (short) commands is a bit more keying but also provides more
+control. That leaves the \type {_} and \type {^} and there is a nice solution for
+this: the special math tagging discussed in the previous section.
+
+For quite a while \CONTEXT\ provides two commands that makes it possible to use
+\type {&}, \type {_} and \type {^} as characters with only a special meaning
+inside math mode. The command
+
+\starttyping
+\nonknuthmode
+\stoptyping
+
+turns on this feature. The counterpart of this command is
+
+\starttyping
+\donknuthmode
+\stoptyping
+
+One step further goes the command:
+
+\starttyping
+\asciimode
+\stoptyping
+
+This only leave the backslash and curly braces a special meaning.
+
+\starttyping
+\startcatcodetable \txtcatcodes
+  \catcode \tabasciicode       \spacecatcode
+  \catcode \endoflineasciicode \endoflinecatcode
+  \catcode \formfeedasciicode  \endoflinecatcode
+  \catcode \spaceasciicode     \spacecatcode
+  \catcode \endoffileasciicode \ignorecatcode
+  \catcode \backslashasciicode \escapecatcode
+  \catcode \leftbraceasciicode \begingroupcatcode
+  \catcode \rightbraceasciicode\endgroupcatcode
+\stopcatcodetable
+\stoptyping
+
+So, even the percentage character being a comment starter is no longer there. At
+this time it's still being discussed where we draw the line. For instance, using
+the following setup renders puts \TEX\ out of action, and we happily use it deep
+down in \CONTEXT\ to deal with verbatim.
+
+\starttyping
+\startcatcodetable \vrbcatcodes
+  \catcode \tabasciicode       \othercatcode
+  \catcode \endoflineasciicode \othercatcode
+  \catcode \formfeedasciicode  \othercatcode
+  \catcode \spaceasciicode     \othercatcode
+  \catcode \endoffileasciicode \othercatcode
+\stopcatcodetable
+\stoptyping
+
+\stopsection
+
+\startsection[title={Where are we heading?}]
+
+When defining macros, in \CONTEXT\ we not only use the \type {@} to provide some
+protection against overloading, but also the \type {?} and \type {!}. There is of
+course some freedom in how to use them but there are a few rules, like:
+
+\starttyping
+\c!width         % interface neutral key
+\v!yes           % interface neutral value
+\s!default       % system constant
+\e!start         % interface specific command name snippet
+\!!depth         % width as keyword to primitive
+\!!stringa       % scratch macro
+\??ab            % namespace
+\@@abwidth       % namespace-key combination
+\stoptyping
+
+There are some more but this demonstrates the principle. When defining macros
+that use these, you need to push and pop the current catcode regime
+
+\starttyping
+\pushcatcodes
+\catcodetable \prtcatcodes
+....
+\popcatcodes
+\stoptyping
+
+or more convenient:
+
+\starttyping
+\unprotect
+....
+\protect
+\stoptyping
+
+Recently we introduced named parameters in \CONTEXT\ and files that are coded
+that way are tagged as \MKVI. Because we nowadays are less concerned about
+performance, some of the commands that define the user interface have been
+rewritten. At the cost of a bit more runtime we move towards a somewhat cleaner
+inheritance model that uses less memory. As a side effect module writers can
+define the interface to functionality with a few commands; think of defining
+instances with inheritance, setting up instances, accessing parameters etc. It
+sounds more impressive than it is in practice but the reason for mentioning it
+here is that this opportunity is also used to provide module writers an
+additional protected character: \type {_}.
+
+\starttyping
+\def\do_this_or_that#variable#index%
+  {$#variable_{#index}$}
+
+\def\thisorthat#variable#index%
+  {(\do_this_or_that{#variable}{#index})}
+\stoptyping
+
+Of course in the user macros we don't use the \type {_} if only because we want
+that character to show up as it is meant.
+
+\starttyping
+\startcatcodetable \prtcatcodes
+  \catcode \tabasciicode        \spacecatcode
+  \catcode \endoflineasciicode  \endoflinecatcode
+  \catcode \formfeedasciicode   \endoflinecatcode
+  \catcode \spaceasciicode      \spacecatcode
+  \catcode \endoffileasciicode  \ignorecatcode
+  \catcode \circumflexasciicode \superscriptcatcode
+  \catcode \underscoreasciicode \lettercatcode
+  \catcode \ampersandasciicode  \alignmentcatcode
+  \catcode \backslashasciicode  \escapecatcode
+  \catcode \leftbraceasciicode  \begingroupcatcode
+  \catcode \rightbraceasciicode \endgroupcatcode
+  \catcode \dollarasciicode     \mathshiftcatcode
+  \catcode \hashasciicode       \parametercatcode
+  \catcode \commentasciicode    \commentcatcode
+  \catcode `\@                  \lettercatcode
+  \catcode `\!                  \lettercatcode
+  \catcode `\?                  \lettercatcode
+  \catcode \tildeasciicode      \activecatcode
+  \catcode \barasciicode        \activecatcode
+\stopcatcodetable
+\stoptyping
+
+This table is currently used when defining core macros and modules. A rather
+special case is the circumflex. It still has a superscript related catcode, and
+this is only because the circumflex has an additional special meaning
+
+Instead of the symbolic names in the previous blob of code we could have
+indicated characters numbers as follows:
+
+\starttyping
+\catcode `\^^I \spacecatcode
+\stoptyping
+
+However, if at some point we decide to treat the circumflex similar as the
+underscore, i.e.\ give it a letter catcode, then we should not use this double
+circumflex method. In fact, the code base does not do that any longer, so we can
+decide on that any moment. If for some reason the double circumflex method is
+needed, for instance when defining macros like \type {\obeylines}, one can do
+this:
+
+\starttyping
+\bgroup
+  \permitcircumflexescape
+  \catcode \endoflineasciicode \activecatcode
+  \gdef\obeylines%
+    {\catcode\endoflineasciicode\activecatcode%
+     \def^^M{\par}}
+\egroup
+\stoptyping
+
+However, in the case of a newline one can also do this:
+
+\starttyping
+\bgroup
+  \catcode \endoflineasciicode \activecatcode
+  \gdef\obeylines%
+    {\catcode\endoflineasciicode\activecatcode%
+     \def
+       {\par}}
+\egroup
+\stoptyping
+
+Or just:
+
+\starttyping
+\def\obeylines{\defineactivecharacter 13 {\par}}
+\stoptyping
+
+In \CONTEXT\ we have the following variant, which is faster
+than the previous one.
+
+\starttyping
+\def\obeylines
+  {\catcode\endoflineasciicode\activecatcode
+   \expandafter\def\activeendoflinecode{\obeyedline}}
+\stoptyping
+
+So there are not circumflexes used at all. Also, we only need to change the
+meaning of \type {\obeyedline} to give this macro another effect.
+
+All this means that we are upgrading catcode tables, we also consider making
+\type {\nonknuthmode} the default, i.e.\ move the initialization to the catcode
+vectors. Interesting is that we could have done that long ago, as the mentioned
+\type {"8000} trickery has proven to be quite robust. In fact, in math mode we're
+still pretty much in knuth mode anyway.
+
+There is one pitfall. Take this:
+
+\starttyping
+\def\test{$\something_2$} % \something_
+\def\test{$\something_x$} % \something_x
+\stoptyping
+
+When we are in unprotected mode, the underscore is part of the macro name, and
+will not trigger a subscript. The solution is simple:
+
+\starttyping
+\def\test{$\something _2$}
+\def\test{$\something _x$}
+\stoptyping
+
+In the rather large \CONTEXT\ code base there were only a few spots where we had
+to add a space. When moving on to \MKIV\ we have the freedom to introduce such
+changes, although we don't want to break compatibility too much and only for the
+good. We expect this all to settle down in 2011. No matter what we decide upon,
+some characters will always have a special meaning. So in fact we always stay in
+some sort of donknuthmode, which is what \TEX\ is all about.
+
+\stopsection
+
+\stopchapter
+
+\stopcomponent
+
+% ligatures