% language=uk

\startcomponent hybrid-characters

\environment hybrid-environment

\startchapter[title={Characters with special meanings}]

\startsection[title={Introduction}]

When \TEX\ was designed, \UNICODE\ was not yet available and characters were
encoded in a seven or eight bit encoding, like \ASCII\ or \EBCDIC. Also, the
layout of keyboards depended on the vendor. A lot has happened since then:
more and more, \UNICODE\ has become the standard (with \UTF\ as a widely used way
of efficiently encoding it).

Also at that time, fonts on computers were limited to 256 characters at most.
This resulted in \TEX\ macro packages dealing with some form of input encoding on
the one hand and a font encoding on the other. As a side effect of character
nodes storing a reference to a glyph in a font, hyphenation was related to font
encodings. All this was quite okay for documents written in English but when
\TEX\ became popular in more countries, more input as well as font encodings were
used.

Of course, with \LUATEX\ being a \UNICODE\ engine this has changed, and even more
because wide fonts (either \TYPEONE\ or \OPENTYPE) are supported. However, as
\TEX\ is already widely used, we cannot simply change the way characters are
treated, certainly not special ones. Let's go back in time and see how plain
\TEX\ set some standards, see how \CONTEXT\ does it currently, and look ahead to
how future versions will deal with it.

\stopsection

\startsection[title={Catcodes}]

Traditional \TEX\ is an eight bit engine while \LUATEX\ extends this to \UTF\
input and internally works with large numbers.

In addition to its natural number (at most 0xFF for traditional \TEX\ and up to
0x10FFFF for \LUATEX), each character can have a so-called category code, or
catcode. This code determines how \TEX\ will treat the character when it is seen
in the input. The category code is stored with the character, so when we change
such a code, characters that were already read retain theirs. Once typeset, a
character has turned into a glyph and its catcode properties are lost.
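
A small sketch may make this more concrete. It assumes a plain \TEX\ like regime
in which the underscore starts out with category code~8; the macro names \type
{\first} and \type {\second} are made up for this illustration.

\starttyping
\def\first{$x_1$}    % this _ is stored with catcode 8 (subscript)
\catcode`\_=12       % from now on _ is read as an other character
\def\second{x_1}     % this _ is stored with catcode 12
\catcode`\_=8        % later input gets the subscript meaning again

\first\ and \second  % a subscript versus character 95 of the current font
\stoptyping

Restoring the catcode does not change \type {\second}: the token in its body
keeps the code that it got when it was read.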

There are 16 possible catcodes that have the following meanings:

\starttabulate[|l|l|p|]
\NC 0 \NC escape \NC This starts a control sequence. The scanner
reads the whole sequence and stores a reference to it in an
efficient way. For instance the character sequence \type {\relax}
starts with a backslash that has category code zero and \TEX\
reads on till it meets a non-letter. In macro definitions a
reference to the so-called hash table is stored. \NC \NR
\NC 1 \NC begin group \NC This marks the beginning of a group. A group
can be used to indicate a scope, the content of a token list, box
or macro body, etc. \NC \NR
\NC 2 \NC end group \NC This marks the end of a group. \NC \NR
\NC 3 \NC math shift \NC Math starts and ends with characters
tagged like this. Two in a row indicate display math. \NC \NR
\NC 4 \NC alignment tab \NC Characters with this property indicate
a next entry in an alignment. \NC \NR
\NC 5 \NC end line \NC This one is somewhat special. As line
endings are operating system dependent, they are normalized to
character 13 and by default that one has this category code. \NC
\NR
\NC 6 \NC parameter \NC Macro parameters start with a character
with this category code. Such characters are also used in
alignment specifications. In nested definitions, multiple of them
in a row are used. \NC \NR
\NC 7 \NC superscript \NC Tagged like this, a character signals
that the next token (or group) is to be superscripted. Two such
characters in a row will make the parser treat the following
character or lowercase hexadecimal number as specification for
a replacement character. \NC \NR
\NC 8 \NC subscript \NC Tagged as such, a character signals that
the next token (or group) is to be subscripted. \NC \NR
\NC 9 \NC ignored \NC When a character has this category code it
is simply ignored. \NC \NR
\NC 10 \NC space \NC This one is also special. Any character tagged
as such is converted to the \ASCII\ space character with code 32.
\NC \NR
\NC 11 \NC letter \NC Normally these are the characters that make up
sequences with a meaning, like words. Letters are special in the sense that
macro names of more than one character can only be made of letters. The
hyphenation machinery will normally only deal with letters. \NC \NR
\NC 12 \NC other \NC Examples of other characters are punctuation and
special symbols. \NC \NR
\NC 13 \NC active \NC This makes a character into a macro. Of course
it needs to get a meaning in order not to trigger an error. \NC \NR
\NC 14 \NC comment \NC All characters on the same line after comment
characters are ignored. \NC \NR
\NC 15 \NC invalid \NC An error message is issued when an invalid
character is seen. This catcode is probably not assigned very
often. \NC \NR
\stoptabulate

So, there is a lot to tell about these codes. We will not discuss the input
parser here, but it is good to know that the following happens.

\startitemize[packed]
\startitem
    The engine reads lines, and normalizes carriage return
    and linefeed sequences.
\stopitem
\startitem
    Each line gets a character with number \type {\endlinechar} appended.
    Normally this is a character with code 13. In \LUATEX\ a value of $-1$ will
    disable this automatism.
\stopitem
\startitem
    Normally spaces (characters with the space property) at the end of a line are
    discarded.
\stopitem
\startitem
    Sequences like \type {^^A} are converted to characters with numbers depending
    on the position in the \ASCII\ vector: \type {^^@} is zero, \type {^^A} is
    one, etc.
\stopitem
\startitem
    Sequences like \type {^^1f} are converted to characters with a number similar
    to the (lowercase) hexadecimal part.
\stopitem
\stopitemize
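
The last two conversions can be checked with a few lines like the following. This
is only a sketch: it assumes that the circumflex has its usual superscript
catcode, as in plain \TEX, so that the double circumflex notation is recognized.

\starttyping
\number`\^^@ \quad % character 0
\number`\^^A \quad % character 1
\number`\^^M \quad % character 13, the usual endline character
\number`\^^1f      % character 31 (hexadecimal 1F)
\stoptyping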

Hopefully this is enough background information to get through the following
sections so let's stick to a simple example:

\starttyping
\def\test#1{$x_{#1}$}
\stoptyping

Here there are two control sequences, each starting with a backslash with
category code zero. Then comes a category~6 character that indicates a parameter
that is referenced later on. The outer curly braces encapsulate the definition
and the inner two braces mark the argument to a subscript, which itself is
indicated by an underscore with category code~8. The start and end of math mode
are indicated with a dollar sign that is tagged as math shift (category code~3).
The character \type {x} is just a letter.
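
The codes mentioned here can be inspected with \type {\the\catcode}. The values
in the comments are the ones that hold under the plain \TEX\ settings; in another
catcode regime some of them can differ.

\starttyping
\the\catcode`\\ \quad % 0  (escape)
\the\catcode`\# \quad % 6  (parameter)
\the\catcode`\{ \quad % 1  (begin group)
\the\catcode`\$ \quad % 3  (math shift)
\the\catcode`\_ \quad % 8  (subscript)
\the\catcode`\x       % 11 (letter)
\stoptyping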

Given the above description, how do we deal with catcodes and newlines at the
\LUA\ end? Catcodes are easy: we can print back to \TEX\ using a specific catcode
regime (later we will see a few of those regimes). As character~13 is used as
default at the \TEX\ end, we should also use it at the \LUA\ end, i.e.\ we should
use \type {\r} as line terminator (\type {\endlinechar}). On the other hand, we
have to use \type {\n} (character 10, \type {\newlinechar}) for printing to the
terminal, log file, or \TEX\ output handles, although in \CONTEXT\ all that
happens via \LUA\ anyway, so we don't bother too much about it here.
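
As a minimal sketch of printing with a specific regime: the \type {tex.print}
function of \LUATEX\ accepts a catcode table number as optional first argument,
and the two special values used below are part of that interface. The strings
themselves are only an illustration.

\starttyping
\startluacode
-- -1: use the catcode regime that is active at the TeX end
tex.print(-1, "a formula: $x_1$")
-- -2: all characters are other, except the space
tex.print(-2, "verbatim: $x_1$")
\stopluacode
\stoptyping

The first call gives a typeset formula, assuming that the dollar and underscore
have their usual meaning in the active regime, while the second one shows the
dollars and underscore as they were keyed in.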

There is a pitfall. As \TEX\ reads lines, it depends on the file system to
provide them: it fetches lines or whatever represents the same on block devices.
In \LUATEX\ the implementation is similar: if you plug in a reader callback, it
has to provide a function that returns a line. Passing two lines does not work
out as expected as \TEX\ discards anything following the line separator (cr, lf
or crlf) and then appends a normalized endline character (in our case
character~13). At least, this is what \TEX\ does naturally. So, in callbacks you
can best feed line by line without any of those characters.

When you print something from \LUA\ to \TEX\ the situation is slightly different:

\startbuffer
\startluacode
tex.print("line 1\r line 2")
tex.print("line 3\n line 4")
\stopluacode
\stopbuffer

\typebuffer

This is what we get:

\startpacked\getbuffer\stoppacked

The explicit \type {\endlinechar} (\type {\r}) terminates the line and the rest
gets discarded. However, a \type {\n} by default has category code~12 (other) and
is turned into a space and successive spaces are (normally) ignored, which is why
we get the third and fourth line separated by a space.
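
When separate lines are wanted, one way out is to pass them as separate strings,
because \type {tex.print} treats each argument as an input line of its own. The
related \type {tex.sprint} does not append an endline character and is meant for
feeding partial lines. The following lines are just a sketch of both.

\starttyping
\startluacode
tex.print("line 1", "line 2")  -- each string is a separate input line
tex.sprint("no endline ")      -- sprint does not append an endline character
tex.sprint("added here")
\stopluacode
\stoptyping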

Things get really hairy when we do the following:

\startbuffer
\startluacode
tex.print("\\bgroup")
tex.print("\\obeylines")
tex.print("line 1\r line 2")
tex.print("line 3\n line 4")
tex.print("\\egroup")
\stopluacode
\stopbuffer

\typebuffer

Now we get this (the \type {tex.print} function appends an endline character
itself):

\startpacked\getbuffer\stoppacked

By making the endline character active and equivalent to \type {\par} \TEX\
nicely scans on and we get the second line as well. Now, if you're still with us,
you're ready for the next section.

\stopsection

\startsection[title={Plain \TEX}]

In the \TEX\ engine, some characters already have a special meaning. This is
needed because otherwise we cannot use the macro language to set up the format.
This is hard|-|coded so the next code is not really used.

\starttyping
\catcode `\^^@ =  9  % ascii null is ignored
\catcode `\^^M =  5  % ascii return is end-line
\catcode `\\   =  0  % backslash is TeX escape character
\catcode `\%   = 14  % percent sign is comment character
\catcode `\    = 10  % ascii space is blank space
\catcode `\^^? = 15  % ascii delete is invalid
\stoptyping

There is no real reason for setting up the null and delete character but maybe in
those days the input could contain them. The regular upper- and lowercase
characters are initialized to be letters with catcode~11. All other characters
get category code~12 (other).

The plain \TEX\ format starts with setting up some characters that get a special
meaning.

\starttyping
\catcode `\{   =  1 % left brace is begin-group character
\catcode `\}   =  2 % right brace is end-group character
\catcode `\$   =  3 % dollar sign is math shift
\catcode `\&   =  4 % ampersand is alignment tab
\catcode `\#   =  6 % hash mark is macro parameter character
\catcode `\^   =  7 \catcode`\^^K=7 % circumflex and uparrow
                                    % are for superscripts
\catcode `\_   =  8 \catcode`\^^A=8 % underline and downarrow
                                    % are for subscripts
\catcode `\^^I = 10 % ascii tab is a blank space
\catcode `\~   = 13 % tilde is active
\stoptyping

The fact that this happens in the format file indicates that it is not by design
that for instance curly braces are used for grouping, or the hash for indicating
arguments. Even math could have been set up differently. Nevertheless, all macro
packages have adopted these conventions so they could as well have been
hard|-|coded presets.

Keep in mind that nothing prevents us from defining more characters this way, so we
could make square brackets into group characters as well. I wonder how many
people have used the two additional special characters that can be used for
super- and subscripts. The comment indicates that it is meant for a special
keyboard.
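
The bracket variant could look as follows. This is only a sketch to show the
principle; the change is kept inside a group because brackets are normally used
for other purposes.

\starttyping
\begingroup
\catcode`\[=1 \catcode`\]=2  % brackets now also delimit groups
[\bf this is bold] and this is not
\endgroup
\stoptyping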

One way to make sure that a macro will not be overloaded is to use characters in
its name that are letters when defining the macro but make sure that they are
others when the user inputs text.

\starttyping
\catcode `@ = 11
\stoptyping

Again, the fact that plain \TEX\ uses the commercial at sign has set a standard.
After all, at that time this symbol was not as popular as it is nowadays.
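
In practice this protection is used along the following lines. The names \type
{\my@separator} and \type {\mypair} are hypothetical and only serve as an
example.

\starttyping
\catcode`@=11               % @ is a letter while we define macros
\def\my@separator{, }       % hypothetical private helper
\def\mypair#1#2{#1\my@separator#2}
\catcode`@=12               % @ is an other character again

\mypair{a}{b}               % typesets: a, b
\stoptyping

Once the at sign is an other character again, a user who keys in \type
{\my@separator} ends up with \type {\my} followed by the characters \type
{@separator}, so the helper cannot be redefined by accident.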

Further on in the format some more catcode magic happens. For instance this:

\starttyping
\catcode `\^^L = 13 \outer\def^^L{\par} % ascii form-feed is "\outer\par"
\stoptyping

So, in your input a formfeed is equivalent to an empty line which makes sense,
although later we will see that in \CONTEXT\ we do it differently. As the tilde
was already active it also gets defined:

\starttyping
\def~{\penalty10000\ } % tie
\stoptyping

Again, this convention is adopted and therefore a sort of standard. Nowadays we
have special \UNICODE\ characters for this, but as they don't have a
visualization, editing is somewhat cumbersome.

The change in catcode of the newline character \type {^^M} is done locally, for
instance in \type {\obeylines}. Keep in mind that this is the character that
\TEX\ appends to the end of an input line. The space is made active when spaces
are to be obeyed.
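
For reference, plain \TEX\ sets this up roughly as follows (quoted from the plain
format as far as I recall it, so take the details with a grain of salt):

\starttyping
{\catcode`\^^M=\active % these lines must end with %
  \gdef\obeylines{\catcode`\^^M\active \let^^M\par}%
  \global\let^^M\par} % this is in case ^^M appears in a \write
\def\obeyspaces{\catcode`\ \active}
{\obeyspaces\global\let =\space}
\stoptyping

The percent signs at the end of the first lines are needed because inside the
group the endline character is active.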

A few very special cases are the following.

\starttyping
\mathcode `\^^Z = "8000 % \ne
\mathcode `\    = "8000 % \space
\mathcode `\'   = "8000 % ^\prime
\mathcode `\_   = "8000 % \_
\stoptyping

This flags those characters as being special in math mode. Consider what happens
when you do something like this:

\starttyping
\def\test#1{$#1$} \test{x_2} \test{x''}
\stoptyping

The catcodes that are set when passing the argument to \type {\test} are frozen
when they end up in the body of the macro. This means that when \type {'} is
other it will be other when the math list is built. However, in math mode, plain
\TEX\ wants to turn that character into a prime and even into a double one when
there are two in a row. The special value \type {"8000} tells the math machinery
that when the character has an active meaning, that one will be triggered. And
indeed, the plain format defines these active characters, but in a special way,
sort of:

\starttyping
{ \catcode`\' = 13 \gdef'{....} }
\stoptyping

So, when active it has a meaning, and it happens to be only treated as active
when in math mode.
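
The same trick can be applied to other characters. The following sketch, which
is not taken from plain \TEX\ but only illustrates the mechanism under plain
\TEX\ like settings, makes the asterisk act as a multiplication sign in math
mode only.

\starttyping
\mathcode`\*="8000               % in math mode * acts as if it were active
{\catcode`\*=13 \gdef*{\times}}  % the meaning that is then used
$2*3$ versus 2*3                 % a times sign in math, an asterisk in text
\stoptyping

Outside the group the asterisk is an other character again, so in text nothing
changes.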

Quite some other math codes are set as well, like:

\starttyping
\mathcode`\^^@ = "2201 % \cdot
\mathcode`\^^A = "3223 % \downarrow
\mathcode`\^^B = "010B % \alpha
\mathcode`\^^C = "010C % \beta
\stoptyping

In Appendix~C of The \TeX book Don Knuth explains the rationale behind this
choice: he had a keyboard that has these shortcuts. As a consequence, one of the
math font encodings also has that layout. It must have been a pretty classified
keyboard as I could not find a picture on the internet. One can probably assemble
such a keyboard from one of those keyboards that come with no imprint. Anyhow, Don
explicitly says \quotation {Of course, designers of \TEX\ macro packages that are
intended to be widely used should stick to the standard \ASCII\ characters.} so
that is what we do in the next sections.

\stopsection

\startsection[title={How about \CONTEXT}]

In \CONTEXT\ we've always used several catcode regimes and switching between them
was a massive operation. Think of a different regime when defining macros,
inputting text, typesetting verbatim, processing \XML, etc. When \LUATEX\
introduced catcode tables, the existing mechanisms were rewritten to take
advantage of this. This is the standard table for input as of December 2010.

\starttyping
\startcatcodetable \ctxcatcodes
  \catcode \tabasciicode        \spacecatcode
  \catcode \endoflineasciicode  \endoflinecatcode
  \catcode \formfeedasciicode   \endoflinecatcode
  \catcode \spaceasciicode      \spacecatcode
  \catcode \endoffileasciicode  \ignorecatcode
  \catcode \circumflexasciicode \superscriptcatcode
  \catcode \underscoreasciicode \subscriptcatcode
  \catcode \ampersandasciicode  \alignmentcatcode
  \catcode \backslashasciicode  \escapecatcode
  \catcode \leftbraceasciicode  \begingroupcatcode
  \catcode \rightbraceasciicode \endgroupcatcode
  \catcode \dollarasciicode     \mathshiftcatcode
  \catcode \hashasciicode       \parametercatcode
  \catcode \commentasciicode    \commentcatcode
  \catcode \tildeasciicode      \activecatcode
  \catcode \barasciicode        \activecatcode
\stopcatcodetable
\stoptyping

Because the meaning of active characters can differ per table there is a related
mechanism for switching those meanings. A careful reader might notice that the
formfeed character is just a newline. If present at all, it often sits on its own
line, so effectively it then behaves as in plain \TEX: triggering a new
paragraph. Otherwise it becomes just a space in the running text.

In addition to the active tilde we also have an active bar. This is actually one
of the oldest features: we use bars for signaling special breakpoints, something
that is really needed in Dutch (education), where we have many compound words.
Just to show a few applications:

\starttyping
firstpart||secondpart  this|(|orthat)  one|+|two|+|three
\stoptyping

In \MKIV\ we have another way of dealing with this. There you can enable a
special parser that deals with it at another level, the node list.

\starttyping
\setbreakpoints[compound]
\stoptyping

When \TEX ies discuss catcodes some can get quite upset, probably because they
have spent some time fighting their side effects. Personally I like the concept. They
can be a pain to deal with but also can be fun. For instance, support of \XML\ in
\CONTEXT\ \MKII\ was made possible by using active \type {<} and \type {&}.

When dealing with all kinds of input, the fact that characters have special
meanings can get in the way. One can argue that once a few have a special
meaning, it does not matter that some others have one too. Most complaints from users
concern \type {$}, \type {&} and \type {_}. When for symmetry we add \type {^} it
is clear that these characters relate to math.

Getting away from the \type {$} can only happen when users are willing to use for
instance \type {\m{x}} instead of \type {$x$}. The \type {&} is an easy one
because in \CONTEXT\ we have always discouraged its use in tables and math
alignments. Using (short) commands is a bit more keying but also provides more
control. That leaves the \type {_} and \type {^} and there is a nice solution for
this: the special math tagging discussed in the previous section.

For quite a while \CONTEXT\ has provided two commands that make it possible to
use \type {&}, \type {_} and \type {^} as characters with only a special meaning
inside math mode. The command

\starttyping
\nonknuthmode
\stoptyping

turns on this feature. The counterpart of this command is

\starttyping
\donknuthmode
\stoptyping
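
A usage sketch, assuming that both commands behave as described here:

\starttyping
\nonknuthmode
R&D prefers names like first_name, and $x_1^2$ still works.
\donknuthmode
\stoptyping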

One step further goes the command:

\starttyping
\asciimode
\stoptyping

This leaves only the backslash and curly braces with a special meaning.

\starttyping
\startcatcodetable \txtcatcodes
  \catcode \tabasciicode       \spacecatcode
  \catcode \endoflineasciicode \endoflinecatcode
  \catcode \formfeedasciicode  \endoflinecatcode
  \catcode \spaceasciicode     \spacecatcode
  \catcode \endoffileasciicode \ignorecatcode
  \catcode \backslashasciicode \escapecatcode
  \catcode \leftbraceasciicode \begingroupcatcode
  \catcode \rightbraceasciicode\endgroupcatcode
\stopcatcodetable
\stoptyping

So, even the percent character as comment starter is no longer there. At
this time it's still being discussed where we draw the line. For instance, using
the following setup puts \TEX\ out of action, and we happily use it deep
down in \CONTEXT\ to deal with verbatim.

\starttyping
\startcatcodetable \vrbcatcodes
  \catcode \tabasciicode       \othercatcode
  \catcode \endoflineasciicode \othercatcode
  \catcode \formfeedasciicode  \othercatcode
  \catcode \spaceasciicode     \othercatcode
  \catcode \endoffileasciicode \othercatcode
\stopcatcodetable
\stoptyping
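
To see the difference between the last two regimes, here is a sketch of the text
regime in use. It assumes that \type {\txtcatcodes} can be selected directly with
the \type {\catcodetable} primitive. With \type {\vrbcatcodes} the same approach
would not work, as the closing \type {\endgroup} would no longer be recognized as
a command.

\starttyping
\begingroup
\catcodetable\txtcatcodes
Prices went up 5% & the key first_name costs $2 (see note #1).
\endgroup
\stoptyping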

\stopsection

\startsection[title={Where are we heading?}]

When defining macros, in \CONTEXT\ we not only use the \type {@} to provide some
protection against overloading, but also the \type {?} and \type {!}. There is of
course some freedom in how to use them but there are a few rules, like:

\starttyping
\c!width         % interface neutral key
\v!yes           % interface neutral value
\s!default       % system constant
\e!start         % interface specific command name snippet
\!!depth         % depth as keyword to primitive
\!!stringa       % scratch macro
\??ab            % namespace
\@@abwidth       % namespace-key combination
\stoptyping

There are some more but this demonstrates the principle. When defining macros
that use these, you need to push and pop the current catcode regime:

\starttyping
\pushcatcodes
\catcodetable \prtcatcodes
....
\popcatcodes
\stoptyping

or more convenient:

\starttyping
\unprotect
....
\protect
\stoptyping
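
A filled in variant could look like this, where \type {\my!!prefix} and \type
{\MyNote} are hypothetical names that only serve as an example:

\starttyping
\unprotect

\def\my!!prefix{Note: }            % hypothetical private helper
\def\MyNote#1{\my!!prefix #1\par}   % public command that uses it

\protect

\MyNote{catcodes matter}
\stoptyping

After \type {\protect} the exclamation mark is no longer a letter, so the helper
is out of reach in normal input.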

Recently we introduced named parameters in \CONTEXT\ and files that are coded
that way are tagged as \MKVI. Because we nowadays are less concerned about
performance, some of the commands that define the user interface have been
rewritten. At the cost of a bit more runtime we move towards a somewhat cleaner
inheritance model that uses less memory. As a side effect module writers can
define the interface to functionality with a few commands; think of defining
instances with inheritance, setting up instances, accessing parameters etc. It
sounds more impressive than it is in practice but the reason for mentioning it
here is that this opportunity is also used to provide module writers an
additional protected character: \type {_}.

\starttyping
\def\do_this_or_that#variable#index%
  {$#variable_{#index}$}

\def\thisorthat#variable#index%
  {(\do_this_or_that{#variable}{#index})}
\stoptyping

Of course in the user macros we don't use the \type {_} if only because we want
that character to show up as it is meant.

\starttyping
\startcatcodetable \prtcatcodes
  \catcode \tabasciicode        \spacecatcode
  \catcode \endoflineasciicode  \endoflinecatcode
  \catcode \formfeedasciicode   \endoflinecatcode
  \catcode \spaceasciicode      \spacecatcode
  \catcode \endoffileasciicode  \ignorecatcode
  \catcode \circumflexasciicode \superscriptcatcode
  \catcode \underscoreasciicode \lettercatcode
  \catcode \ampersandasciicode  \alignmentcatcode
  \catcode \backslashasciicode  \escapecatcode
  \catcode \leftbraceasciicode  \begingroupcatcode
  \catcode \rightbraceasciicode \endgroupcatcode
  \catcode \dollarasciicode     \mathshiftcatcode
  \catcode \hashasciicode       \parametercatcode
  \catcode \commentasciicode    \commentcatcode
  \catcode `\@                  \lettercatcode
  \catcode `\!                  \lettercatcode
  \catcode `\?                  \lettercatcode
  \catcode \tildeasciicode      \activecatcode
  \catcode \barasciicode        \activecatcode
\stopcatcodetable
\stoptyping

This table is currently used when defining core macros and modules. A rather
special case is the circumflex. It still has a superscript related catcode, and
this is only because the circumflex has an additional special meaning: two in a
row start the notation that specifies a character by its code.

Instead of the symbolic names in the previous blob of code we could have
indicated character numbers as follows:

\starttyping
\catcode `\^^I \spacecatcode
\stoptyping

However, if at some point we decide to treat the circumflex similar to the
underscore, i.e.\ give it a letter catcode, then we should not use this double
circumflex method. In fact, the code base does not do that any longer, so we can
decide on that at any moment. If for some reason the double circumflex method is
needed, for instance when defining macros like \type {\obeylines}, one can do
this:

\starttyping
\bgroup
  \permitcircumflexescape
  \catcode \endoflineasciicode \activecatcode
  \gdef\obeylines%
    {\catcode\endoflineasciicode\activecatcode%
     \def^^M{\par}}
\egroup
\stoptyping

However, in the case of a newline one can also do this:

\starttyping
\bgroup
  \catcode \endoflineasciicode \activecatcode
  \gdef\obeylines%
    {\catcode\endoflineasciicode\activecatcode%
     \def
       {\par}}
\egroup
\stoptyping

Or just:

\starttyping
\def\obeylines{\defineactivecharacter 13 {\par}}
\stoptyping

In \CONTEXT\ we have the following variant, which is faster
than the previous one.

\starttyping
\def\obeylines
  {\catcode\endoflineasciicode\activecatcode
   \expandafter\def\activeendoflinecode{\obeyedline}}
\stoptyping

So there are no circumflexes used at all. Also, we only need to change the
meaning of \type {\obeyedline} to give this macro another effect.

All this means that, while we are upgrading catcode tables, we also consider
making \type {\nonknuthmode} the default, i.e.\ move the initialization to the
catcode vectors. Interestingly, we could have done that long ago, as the
mentioned \type {"8000} trickery has proven to be quite robust. In fact, in math
mode we're still pretty much in knuth mode anyway.

There is one pitfall. Take this:

\starttyping
\def\test{$\something_2$} % \something_
\def\test{$\something_x$} % \something_x
\stoptyping

When we are in unprotected mode, the underscore is part of the macro name, and
will not trigger a subscript. The solution is simple:

\starttyping
\def\test{$\something _2$}
\def\test{$\something _x$}
\stoptyping

In the rather large \CONTEXT\ code base there were only a few spots where we had
to add a space. When moving on to \MKIV\ we have the freedom to introduce such
changes, although we don't want to break compatibility too much and only for the
good. We expect this all to settle down in 2011. No matter what we decide upon,
some characters will always have a special meaning. So in fact we always stay in
some sort of donknuthmode, which is what \TEX\ is all about.

\stopsection

\stopchapter

\stopcomponent

% ligatures