Diffstat (limited to 'doc/context/sources/general/manuals/lowlevel/lowlevel-tokens.tex')
-rw-r--r-- | doc/context/sources/general/manuals/lowlevel/lowlevel-tokens.tex | 540 |
1 files changed, 540 insertions, 0 deletions
diff --git a/doc/context/sources/general/manuals/lowlevel/lowlevel-tokens.tex b/doc/context/sources/general/manuals/lowlevel/lowlevel-tokens.tex new file mode 100644 index 000000000..483b4a8dc --- /dev/null +++ b/doc/context/sources/general/manuals/lowlevel/lowlevel-tokens.tex @@ -0,0 +1,540 @@ +% language=us runpath=texruns:manuals/lowlevel + +\environment lowlevel-style + +\usemodule[system-tokens] + +\startdocument + [title=tokens, + color=middleblue] + +\startsectionlevel[title=Introduction] + +Most users don't need to know anything about tokens but it happens that when \TEX +ies meet in person (users group meetings), or online (support platforms) there +always seem to pop up folks who love token speak. When you try to explain +something to a user it makes sense to talk in terms of characters but then those +token speakers can jump in and start correcting you. In the past I have been +puzzled by this because, when one can write a decent macro that does the job +well, it really doesn't matter if one knows about tokens. Of course one should +never make the assumption that token speakers really know \TEX\ that well or can +come up with better solutions than users but that is another matter. \footnote +{Talking about fashion: it would be more impressive to talk about \TEX\ and +friends as a software stack than calling it a distribution. Today, it's all about +marketing.} + +That said, because in documents about \TEX\ the word \quote {token} does pop up I +will try to give a little insight here. But for using \TEX\ it's mostly +irrelevant. The descriptions below for sure won't match the proper token speak +criteria which is why at a presentation for the 2020 user meeting I used the +title \quotation {Tokens as I see them.} + +\stopsectionlevel + +\startsectionlevel[title=What are tokens] + +Both the words \quote {node} and \quote {token} are quite common in programming +and also rather old which is proven by the fact that they also are used in the +\TEX\ source. 
A node is a storage container that is part of a linked list. When +you input the characters \type {tex} the three characters become part of the +current linked list. They become \quote {character} nodes (or in \LUATEX\ speak +\quote {glyph} nodes) with properties like the font and the character referred +to. But before that happens, the three characters in the input \type {t}, \type +{e} and \type {x}, are interpreted as in this case being just that: characters. +When you enter \type {\TeX} the input processor first sees a backslash and +because that has a special meaning in \TEX\ it will read the following characters and +when done does a lookup in its internal hash table to see what it actually is: a +macro that assembles the word \TEX\ in uppercase with special kerning and a +shifted (therefore boxed) \quote {E}. When you enter \type {$} \TEX\ will look +ahead for a second one in order to determine display math, push back the found +token when there is no match and then enter inline math mode. + +A token is internally just a 32 bit number that encodes what \TEX\ has seen. It +is the assembled token that travels through the system, gets stored, interpreted +and often discarded afterwards. So, the character \quote {e} in our example gets +tagged as such and encoded in this number in a way that the intention can be +derived later on. + +Now, the way \TEX\ looks at these tokens can differ. In some cases it will just +look at this (32 bit) number, for instance when checking for a specific token, +which is fast, but sometimes it needs to know some detail. The mentioned integer +actually encodes a command (opcode) and a so called char code (operand). The +second name is somewhat confusing because in many cases that code is not +representing a character but that is not that relevant here. When you look at the +source code of a \TEX\ engine it is enough to know that a char can also be a sub +command. 
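+
+As a worked illustration, and only as an assumption about the layout (the text
+above does not give the actual split), suppose that the char code occupies the
+lower 21 bits, which is enough for all \UNICODE\ values, and that the command
+code sits above that:
+
+\startformula
+    \text{token} = \text{cmd} \times 2^{21} + \text{chr}
+\stopformula
+
+Under that assumption a letter \quote {e}, with command code 11 (letter) and
+char code 101, would be stored as:
+
+\startformula
+    11 \times 2097152 + 101 = 23068773
+\stopformula
+
+Checking for a specific token is then a single integer comparison, while the
+command and char code can be recovered with a division (or shift) and a
+remainder (or mask).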
+ +\startlinecorrection[blank] + \setupTABLE[each][align=middle] + \setupTABLE[c][1][width=44mm] + \setupTABLE[c][2][width=4em] + \setupTABLE[c][3][width=11mm] + \setupTABLE[c][4][width=33mm] + \bTABLE + \bTR + \bTD token \eTD + \bTD[frame=off] = \eTD + \bTD cmd \eTD + \bTD chr \eTD + \eTR + \eTABLE +\stoplinecorrection + +Back to the three characters: these become tokens where the command code +indicates that it is a letter and the char code stores what letter we have at +hand and in the case of \LUATEX\ and \LUAMETATEX\ these are \UNICODE\ values. +Contrary to the traditional 8 bit \TEX\ engine, in the \UNICODE\ engines a \UTF\ +sequence is read, but these multiple bytes still become one number that will be +encoded in the token number. In order to determine that something is a letter the +engine has to be told (which is what a macro package does when it sets up the +engine). For instance, digits are so called other characters and the backslash is +called escape. Every \TEX\ user knows that curly braces are special and so are +dollar symbols and hashes. If this rings a bell, and you relate this to catcodes, +you can indeed assume that the command codes of these tokens have the same +numbers as the catcodes. Given that \UNICODE\ has plenty of character slots you +can imagine that combining 16 catcode commands with all the possible \UNICODE\ +values makes a large repertoire of tokens. + +There are more commands than the 16 basic character related ones, in +\LUAMETATEX\ we have just over 150 command codes (\LUATEX\ has a few more but +they are also organized differently). Each of these codes can have a sub +command. For instance the primitives \type {\vbox} and \type {\hbox} are both a +\type {make_box_cmd} (we use the symbolic name here) and in \LUAMETATEX\ the +first one has sub command code 9 (\type {vbox_code}) and the second one has code +10 (\type {hbox_code}). There are twelve primitives that are in the same +category. 
The many primitives that make up the core of the engine are grouped in +a way that permits processing similar ones with one function and also makes it +possible to distinguish between the way commands are handled, for instance with +respect to expansion. + +Now, before we move on it is important to know that all these codes are in fact +abstract numbers. Although it is quite likely that engines that are derived from +each other have similar numbers (just more) this is not the case for \LUAMETATEX. +Because the internals have been opened up (even more than in \LUATEX) the command +and char codes have been reorganized in such a way that exposure is consistent. +We could not use some of the reuse and remap tricks that the other engines use +because it would simply be too confusing (and demand real in depth knowledge of +the internals). This is also the reason why development took some time. You +probably won't notice it from the current source but it was a very stepwise +process. We not only had to make sure that all kept working (\CONTEXT\ \LMTX\ and +\LUAMETATEX\ were pretty useable during the process), but also had to +(re)consider intermediate choices. + +So, input is converted into tokens, in most cases one|-|by|-|one. When a token is +assembled, it either gets stored (deliberately or as part of some look ahead +scanning), or it immediately gets (what is called:) expanded. Depending on what +the command is, some action is triggered. For instance, a character gets appended +to the node list immediately. An \type {\hbox} command will start assembling a +box with its own node list that then gets some treatment: if this primitive was a +follow up on \type {\setbox} it will get stored, otherwise it might end up in the +current node list as a so called hlist node. Commands that relate to registers have +\type {0xFFFF} char codes because that is how many registers we have per category. 
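+
+A minimal sketch of the two outcomes for a box, assuming that the \CONTEXT\
+scratch register \type {\scratchbox} is free to use at this point:
+
+\starttyping[option=TEX]
+\setbox\scratchbox\hbox{stored}  % follows \setbox: stored, nothing is typeset
+\hbox{appended}                  % no \setbox: ends up in the current list
+\box\scratchbox                  % only now the stored box gets appended
+\stoptyping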
+ +When a token gets stored for later processing it becomes part of a larger data +structure, a so called memory word. These memory words are taken from a large +pool of words and they store a token and additional properties. The info field +contains the token value, the mentioned command and char. When there is no linked +list, the link can actually be used to store a value, something that in +\LUAMETATEX\ we actually do. + +\startlinecorrection[blank] + \setupTABLE[each][align=middle] + \setupTABLE[c][1][width=8mm] + \setupTABLE[c][2][width=64mm] + \setupTABLE[c][3][width=64mm] + \bTABLE + \bTR \bTD 1 \eTD \bTD info \eTD \bTD link \eTD \eTR + \bTR \bTD 2 \eTD \bTD info \eTD \bTD link \eTD \eTR + \bTR \bTD 3 \eTD \bTD info \eTD \bTD link \eTD \eTR + \bTR \bTD n \eTD \bTD info \eTD \bTD link \eTD \eTR + \eTABLE +\stoplinecorrection + +When for instance we say \typ {\toks 0 {tex}} the scanner sees an escape, +followed by 4 letters (\type {toks}) and the escape triggers a lookup of the +primitive (or macro or \unknown) with that name, in this case a primitive +assignment command. The found primitive (its property gets stored in the token) +triggers scanning for a number and when that is successful scanning of a brace +delimited token list starts. The three characters become three letter tokens and +these are a linked list of the mentioned memory words. This list then gets stored +in token register zero. The input sequence \typ {\the \toks 0} will push back a +copy of this list into the input. + +In addition to the token memory pool, there is also a table of equivalents. That +one is part of a larger table of memory words where \TEX\ stores all it needs to +store. The 16 groups of character commands are virtual, storing these makes no +sense, so the first real entries are all these registers (count, dimension, skip, +box, etc). The rest is taken up by possible hash entries. 
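+
+The token register example from above can be tried as follows. This is just a
+small sketch in which register zero is used as a scratch register:
+
+\starttyping[option=TEX]
+\toks0{tex}          % three letter tokens, stored as a linked list
+\the\toks0           % a copy of that list is pushed back into the input
+\the\toks0           % the register is unchanged, so this works again
+\stoptyping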
+ +\startlinecorrection[blank] + \bTABLE + \bTR \bTD[ny=4] main hash \eTD \bTD null control sequence \eTD \eTR + \bTR \bTD 128K hash entries \eTD \eTR + \bTR \bTD frozen control sequences \eTD \eTR + \bTR \bTD special sequences (undefined) \eTD \eTR + \bTR \bTD[ny=7] registers \eTD \bTD 17 internal & 64K user glues \eTD \eTR + \bTR \bTD 4 internal & 64K user mu glues \eTD \eTR + \bTR \bTD 12 internal & 64K user tokens \eTD \eTR + \bTR \bTD 2 internal & 64K user boxes \eTD \eTR + \bTR \bTD 116 internal & 64K user integers \eTD \eTR + \bTR \bTD 0 internal & 64K user attributes \eTD \eTR + \bTR \bTD 22 internal & 64K user dimensions \eTD \eTR + \bTR \bTD specifications \eTD \bTD 5 internal & 0 user \eTD \eTR + \bTR \bTD extra hash \eTD \bTD additional entries (grows dynamically) \eTD \eTR + \eTABLE +\stoplinecorrection + +So, a letter token \type {t} is just that, a token. A token referring to a register +is again just a number, but its char code points to a slot in the equivalents table. +A macro, which we haven't discussed yet, is actually just a token list. When a name +lookup happens the hash table is consulted and that one runs in parallel to part of the +table of equivalents. When there is a match, the corresponding entry in the equivalents +table points to a token list. 
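+
+A small sketch of a control sequence that refers to a register slot. The name
+\type {\MyCount} is made up for this example and register 255 is assumed to be
+available as scratch register; the group makes sure that both the definition
+and the assignment are undone afterwards:
+
+\starttyping[option=TEX]
+\begingroup
+\countdef\MyCount 255 % \MyCount now refers to integer register 255
+\MyCount 42           % an assignment via that name
+\the\count255         % so this typesets 42
+\endgroup
+\stoptyping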
+ +\startlinecorrection[blank] + \setupTABLE[each][align=middle] + \setupTABLE[c][1][width=16mm] + \setupTABLE[c][2][width=64mm] + \setupTABLE[c][3][width=64mm] + \bTABLE + \bTR \bTD 1 \eTD \bTD string index \eTD \bTD equivalents or (next > n) index \eTD \eTR + \bTR \bTD 2 \eTD \bTD string index \eTD \bTD equivalents or (next > n) index \eTD \eTR + \bTR \bTD n \eTD \bTD string index \eTD \bTD equivalents or (next > n) index \eTD \eTR + \bTR \bTD n + 1 \eTD \bTD string index \eTD \bTD equivalents or (next > n) index \eTD \eTR + \bTR \bTD n + 2 \eTD \bTD string index \eTD \bTD equivalents or (next > n) index \eTD \eTR + \bTR \bTD n + m \eTD \bTD string index \eTD \bTD equivalents or (next > n) index \eTD \eTR + \eTABLE +\stoplinecorrection + +It sounds complex and it actually also is somewhat complex. It is not made easier +by the fact that we also track information related to grouping (saving and +restoring), need reference counts for copies of macros and token lists, sometimes +store information directly instead of via links to token lists, etc. And again +one cannot compare \LUAMETATEX\ with the other engines. Because we did away with +some of the limitations of the traditional engine we not only could save some +memory but in the end also simplify matters (we're 32/64 bit after all). On the one +hand some traditional speedups were removed but these have been compensated by +improvements elsewhere, so overall processing is more efficient. 
+ +\startlinecorrection[blank] + \setupTABLE[each][align=middle] + \setupTABLE[c][1][width=8mm] + \setupTABLE[c][2][width=32mm] + \setupTABLE[c][3][width=16mm] + \setupTABLE[c][4][width=16mm] + \setupTABLE[c][5][width=64mm] + \bTABLE + \bTR \bTD 1 \eTD \bTD level \eTD \bTD type \eTD \bTD flag \eTD \bTD value \eTD \eTR + \bTR \bTD 2 \eTD \bTD level \eTD \bTD type \eTD \bTD flag \eTD \bTD value \eTD \eTR + \bTR \bTD 3 \eTD \bTD level \eTD \bTD type \eTD \bTD flag \eTD \bTD value \eTD \eTR + \bTR \bTD n \eTD \bTD level \eTD \bTD type \eTD \bTD flag \eTD \bTD value \eTD \eTR + \eTABLE +\stoplinecorrection + +So, here \LUAMETATEX\ differs from other engines because it combines two tables, +which is possible because we have at least 32 bits. There are at most \type +{0xFFFF} levels but we need at most \type {0xFF} types. In \LUAMETATEX\ macros +can have extra properties (flags) and these also need one byte. Contrary to the +other engines, \type {\protected} macros are native and have their own command +code, but \type {\tolerant} macros duplicate that (so we have four distinct macro +commands). All other properties, like the \type {\permanent} ones are stored in +the flags. + +Because a macro starts with a reference count we have some room in the info field +to store information about it having arguments or not. It is these details that +make \LUAMETATEX\ a bit more efficient in terms of memory usage and performance +than its ancestor \LUATEX. But as with the other changes, it was a very stepwise +process in order to keep the system compatible and working. + +\stopsectionlevel + +\startsectionlevel[title=Some implementation details] + +Sometimes there is a special head token at the start. This makes for easier +appending of extra tokens. 
In traditional \TEX\ node lists are forward linked, in +\LUATEX\ they are double linked \footnote {On the agenda of \LUAMETATEX\ is to +use this property in the underlying code, which doesn't yet profit from this and +therefore keeps previous pointers in store.}. Token lists are always forward +linked. Shared token lists use the head node for a reference count. + +For various reasons original \TEX\ uses global variables for temporary lists. This is +for instance needed when we expand (nested) and need to report issues. But in +\LUATEX\ we often just serialize lists and using local variables makes more +sense. One of the first things done in \LUAMETATEX\ was to group all global +variables in (still global) structures but well isolated. That also made it +possible to actually get rid of some globals. + +Because \TEX\ had to run on machines that we nowadays consider rather limited, it +had to be sparse and efficient. There are quite some optimizations to limit code +and memory consumption. The engine also does its own memory management. Freed +token memory words are collected in a cache and reused but they can get scattered +which is not that bad, apart from maybe cache hits. In \LUAMETATEX\ we stay as +close to original \TEX\ as possible but there have been some improvements. The +\LUA\ interfaces force us to occasionally deviate from the original, and that in +fact might lead to some retrofit but the original documentation still mostly +applies. However, keep in mind that in \LUATEX\ we store much more in nodes (each +has a prev pointer and an attribute list pointer and for instance glyph nodes +have some 20 extra fields compared to traditional \TEX\ character nodes). + +\stopsectionlevel + +\startsectionlevel[title=Other data management] + +There is plenty going on in \TEX\ when it processes your input, just to mention a +few: + +\startitemize[packed] +\startitem Grouping is handled by a nesting stack. 
\stopitem +\startitem Nested conditionals (\type {\if...}) have their own stack. \stopitem +\startitem The values before assignments are saved on the save stack. \stopitem +\startitem Also other local changes (housekeeping) end up on the save stack. \stopitem +\startitem Token lists and macro aliases have reference pointers (reuse). \stopitem +\startitem Attributes, being linked node lists, have their own management. \stopitem +\stopitemize + +In all these subsystems tokens or references to tokens can play a role. Reading a +single character from the input can trigger a lot of action. A curly brace tagged +as begin group command will push the grouping level and from then on registers +and some other quantities that are changed will be stored on the save stack +so that after the group ends they can be restored. When primitives take keywords, +and no match happens, tokens are pushed back into the input which introduces a +new input level (also some stack). When numbers are read a token that represents +no digit is pushed back too and macro packages use numbers and dimensions a lot. +It is a surprise that \TEX\ is so fast. + +\stopsectionlevel + +\startsectionlevel[title=Macros] + +There is a distinction between primitives, the built in commands, and macros, the +commands defined by users. A primitive relates to a command code and char code +but macros are, unless they are made an alias to something else, like a \type +{\countdef} or \type {\let} does, basically pointers to a token list. There is +some additional data stored that makes it possible to parse and grab arguments. + +When we have a control sequence (macro) \type {\controlsequence} the name is +looked up in the hash table. When found its value will point to the table of +equivalents. As mentioned, that table keeps track of the cmd and points to a +token list (the meaning). We saw that this table also stores the current level +of grouping and flags. 
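+
+The difference between a macro and an alias can be sketched as follows. The
+names are made up for this example:
+
+\starttyping[option=TEX]
+\def\MyFirst {old}      % \MyFirst points to a token list
+\let\MySecond\MyFirst   % an alias: it refers to that same token list
+\let\MyBox   \hbox      % an alias of a primitive: same command and char code
+\def\MyFirst {new}      % \MyFirst now points to a new token list
+\MySecond               % still expands to: old
+\MyBox{\MyFirst}        % behaves like \hbox and boxes: new
+\stoptyping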
+ +If we say, in the input, \typ {\hbox to 10pt {x\hss}}, the box is assembled as we +go and when it is appended to the current node list there are no tokens left. +When scanning this, the engine literally sees a backslash and the four letters +\type {hbox}. However when we have this: + +\starttyping[option=TEX] +\def\MyMacro{\hbox to 10pt {x\hss}} +\stoptyping + +the \type {\hbox} has become one memory word which has a token representing the +\type {\hbox} primitive plus a link to the next token. The space after a control +sequence is gobbled so the next two tokens, again stored in a linked memory word, +are letter tokens, followed by two other and two letter tokens for the +dimensions. Then we have a space, a brace, a letter, a primitive and a brace. The +about 20 characters in the input became a dozen memory words each two times four +bytes, so in terms of memory usage we end up with quite a bit more. However, when +\TEX\ runs over that list it only has to interpret the token values because the +scanning and conversion already happened. So, the space that a macro takes is +more than compensated by efficient reprocessing. + +\stopsectionlevel + +\startsectionlevel[title=Looking at tokens] + +When you say \type {\tracingall} you will see what the engine does: read input, +expand primitives and macros, typesetting etc.\ You might need to set \type +{\tracingonline} to get a bit more output on the console. One way to look at +macros is to use the \type {\meaning} command, so if we have: + +\startbuffer[definition] +\permanent\protected\def\MyMacro#1#2{Do #1 or #2!} +\stopbuffer + +\startbuffer[meaning] +\meaning \MyMacro +\meaningless\MyMacro +\meaningfull\MyMacro +\stopbuffer + +\typebuffer[definition][option=TEX] + +we can say this: + +\typebuffer[meaning][option=TEX] + +and get: + +{\getbuffer[definition]\startlines\tttf \getbuffer[meaning]\stoplines} + +You get less when you ask for the meaning of a primitive, just its name. 
The +\type {\meaningfull} primitive gives the most information. In \LUAMETATEX\ +protected macros are first class commands: they have their own command code. In +the other engines they are just regular macros with an initial token indicating +that they are protected. There are specific command codes for \type {\outer} and +\type {\long} macros but we dropped that in \LUAMETATEX . Instead we have \type +{\tolerant} macros but that's another story. The flags that were mentioned can +mark macros in a way that permits overload protection as well as some special +treatment in otherwise tricky cases (like alignments). The overload related flags +permit a rather granular way to prevent users from redefining macros and such. +They are set via prefixes, and add to that repertoire: we have 14 prefixes but +only some eight deal with flags (we can add more if really needed). Probably the +most well known prefix is \type {\global} and that one will not become a flag: it +has immediate effect. + +For the above definition, the \type {\showluatokens} command will show a meaning +on the console. + +\starttyping[option=TEX] +\showluatokens\MyMacro +\stoptyping + +% {\getbuffer[definition]\getbuffer} + +This gives the next list, where the first column is the address of the token, the +second one the command code, and the third one the char code. When there are +arguments involved, the list of what needs to get matched is shown. + +\starttyping +permanent protected control sequence: MyMacro +501263 19 49 match argument 1 +501087 19 50 match argument 2 +385528 20 0 end match +-------------- +501090 11 68 letter D (U+00044) + 30833 11 111 letter o (U+0006F) +500776 10 32 spacer +385540 21 1 parameter reference +112057 10 32 spacer +431886 11 111 letter o (U+0006F) + 30830 11 114 letter r (U+00072) + 30805 10 32 spacer +500787 21 2 parameter reference +213412 12 33 other char ! (U+00021) +\stoptyping + +In the next subsections I will give some examples. 
This time we use a +helper defined in a module: + +\starttyping[option=TEX] +\usemodule[system-tokens] +\stoptyping + +\startsectionlevel[title=Example 1: in the input] + +\startbuffer +\luatokentable{1 \bf{2} 3\what {!}} +\stopbuffer + +\typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer} + +\stopsectionlevel + +\startsectionlevel[title=Example 2: in the input] + +\startbuffer +\luatokentable{a \the\scratchcounter b \the\parindent \hbox to 10pt{x}} +\stopbuffer + +\typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer} + +\stopsectionlevel + +\startsectionlevel[title=Example 3: user registers] + +\startbuffer +\scratchtoks{foo \framed{\red 123}456} + +\luatokentable\scratchtoks +\stopbuffer + +\typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer} + +\stopsectionlevel + +\startsectionlevel[title=Example 4: internal variables] + +\startbuffer +\luatokentable\everypar +\stopbuffer + +\typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer} + +\stopsectionlevel + +\startsectionlevel[title=Example 5: macro definitions] + +\startbuffer +\protected\def\whatever#1[#2](#3)\relax + {oeps #1 and #2 & #3 done ## error} + +\luatokentable\whatever +\stopbuffer + +\typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer} + +\stopsectionlevel + +\startsectionlevel[title=Example 6: commands] + +\startbuffer +\luatokentable\startitemize +\luatokentable\stopitemize +\stopbuffer + +\typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer} + +\stopsectionlevel + +\startsectionlevel[title=Example 7: commands] + +\startbuffer +\luatokentable\doifelse +\stopbuffer + +\typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer } + +\stopsectionlevel + +\startsectionlevel[title=Example 8: nothing] + +\startbuffer +\luatokentable\relax +\stopbuffer + +\typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer } + +\stopsectionlevel + +\startsectionlevel[title=Example 
9: hashes] + +\startbuffer +\edef\foo#1#2{(#1)(\letterhash)(#2)} \luatokentable\foo +\stopbuffer + +\typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer } + +\stopsectionlevel + +\startsectionlevel[title=Example 10: nesting] + +\startbuffer +\def\foo#1{\def\foo##1{(#1)(##1)}} \luatokentable\foo +\stopbuffer + +\typebuffer[option=TEX] \blank[line] {\switchtobodyfont[8pt] \getbuffer } + +\stopsectionlevel + +\startsectionlevel[title=Remark] + +In all these examples the numbers are to be seen as abstractions. Some command +codes and sub command codes might change as the engine evolves. This is why the +\LUAMETATEX\ engine has lots of \LUA\ functions that provide information about +what number represents what command. + +\stopsectionlevel + +\stopsectionlevel + +\stopdocument |