summaryrefslogtreecommitdiff
path: root/doc/context/sources/general/manuals/evenmore/evenmore-tokens.tex
blob: d653703a9b6a228f10bd99186ecb5bffb3bc1a48 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
% language=us

% TODO: copy_node_list : return tail
% TODO: grabnested

\environment evenmore-style

\startcomponent evenmore-tokens

\startchapter[title=Tokens]

\usemodule[article-basic,abbreviations-logos]

\starttext

{\em This is mostly a wrapup of some developments, and definitely not a tutorial.}

Talking deep down \TEX\ is talking about tokens and nodes. Roughly spoken, from
the perspective of the user, tokens are what goes in and stays in (as macro,
token list of whatever) and nodes is what get produced and eventually results in
output. A character in the input becomes one token (before expansion) and a
control sequence like \type {\foo} also is turned into a token. Tokens can be
linked into lists. This actually means that in the engine we can talk of tokens
in two ways: the single item with properties that trigger actions, or as compound
item with that item and a pointer to the next token (called link). In \LUA\ speak
token memory can be seen as:

\starttyping
fixmem = {
    { info, link },
    { info, link },
    { info, link },
    { info, link },
    ....
}
\stoptyping

Both are 32 bit integers. The \type {info} is a combination of a command code (an
operator) and a so called chr code (operand) and these determine its behaviour.
For instance the command code can indicate an integer register and the chr code
then indicates the number of that register. So, like:

\starttyping
fixmem = {
    { { cmd, chr}, index_into_fixmem },
    { { cmd, chr}, index_into_fixmem },
    { { cmd, chr}, index_into_fixmem },
    { { cmd, chr}, index_into_fixmem },
    ....
}
\stoptyping


In the following line the characters that make three words are tokens (letters),
so are the space (spacer), the curly braces (begin- and endgroup token) and the
bold face switch (which becomes one token which resolves to a token list of
tokens that trigger actions (in this case switching to a bolder font).

\starttyping
foo {\bf bar} foo
\stoptyping

When \TEX\ reads a line of input tokens are expanded immediately but a sequence
can also become part fo a macro body or token list. Here we have $3_{\type{foo}}
+ 1 + 1_{\type+{+} + 1_{\type{\bf}} + 3_{\type{bar}} + 1_{\type+}+} + 1 +
3_{\type{foo}} = 14$ tokens.

A control sequence normally starts with a backslash. Some are built in, these are
called primitives, and others are defined by the macro package or the user. There
is a lookup table that relates the tokenized control sequence to some action. For
instance:

\starttyping
\def\foo{foo}
\stoptyping

creates an entry that leads (directly or following a hash chain) to the three
letter token list. Every time the input sees \type {\foo} it gets resolved to
that list via a hash lookup. However, once internalized and part of a token list,
it is a direct reference. On the other hand,

\starttyping
\the\count0
\stoptyping

triggers the \type {\the} action that relates to this control sequence, which
then reads a next token and operates on that. That next token itself expects a
number as follow up. In the end the value of \type {\count0} is found and that
one is also in the so called equivalent lookup table, in what \TEX\ calls
specific regions.

\starttyping
equivalents = {
    { level, type, value },
    { level, type, value },
    { level, type, value },
    ...
}
\stoptyping

The value is in most cases similar to the info (cmd & chr) field in fixmem, but
one difference is that counters, dimensions etc directly store their value, which
is why we sometimes need the type separately, for instance in order to reclaim
memory for glue or node specifications. It sound complicated and it is, but as
long as you get a rough idea we can continue. Just keep in mind that tokens
sometimes get expanded on the fly, and sometimes just get stored.

There are a lot of primitives and each has a unique info. The same is true for
characters (each category has its own command code, so regular letters can be
distinguished from other tokens, comment signs, math triggers etc). All important
basic bits are in table of equivalents: macros as well as registers although the
meaning of a macro and content of token lists lives in the fixmem table and
the content of boxes in so called node lists (nodes have their own memory).

In traditional \TEX\ the lookup table for primitives, registers and macros is as
compact as can be: it is an array of so called 32 bit memory words. These can be
divided into halfs and quarters, so in the source you find terms like \type
{halfword} and \type {quarterword}. The lookup table is a hybrid:

\starttyping
[level 8] [type 8] [value 16] | [equivalent 32]
[level 8] [type 8] [value 16] | [equivalent 32]
[level 8] [type 8] [value 16] | [equivalent 32]
...
\stoptyping

The mentioned counters and such are directly encoded in an equivalent and the
rest is a combination of level, type and value. The level is used for the
grouping, and in for instance \PDFTEX\ there can therefore be at most 255 levels.
In \LUATEX\ we use a wider model. There we have 64 bit memory words which means
that we have way more levels and don't need to have this dual nature:

\starttyping
[level 16] [type 16] [value 32]
[level 16] [type 16] [value 32]
[level 16] [type 16] [value 32]
...
\stoptyping

We already showed a \LUA\ representation. The type in this table is what a
command code is in an \quote {info} field. In such a token the integer encodes
the command as well as a value (called chr). In the lookup table the type is the
command code. When \TEX\ is dealing with a control sequences it looks at the
type, otherwise it filters the command from the token integer. This means that a
token cannot store an integer (or dimension), but the lookup table actually can
do that. However, commands can limit the range, for instance characters are bound
by what \UNICODE\ permits.

Internally, \LUATEX\ still uses these ranges of fast accessible registers, like
counters, dimensions and attributes. However, we saw that in \LUATEX\ they don't
overlap with the level and type. In \LUATEX, at least till version 1.13 we still
have the shadow array for levels but in \LUAMETATEX\ we just use those in the
equivalents lookup table. If you look in the \PASCAL\ source you will notice that
arrays run from \type {[somemin ... somemax]} which in the \CCODE\ source would
mean using offsets. Actually, the shadow array starts at zero so we waste the
part that doesn't need shadowing. It is good to remind ourselves that traditional
\TEX\ is 8 bit character based.

The equivalents lookup table has all kind of special ranges (combined into
regions of similar nature, in \TEX\ speak), like those for lowercase mapping,
specific catcode mappings, etc.\ but we're still talking of $n \times 256$
entries. In \LUATEX\ all these mappings are in dedicated sparse hash tables
because we need to support the full \UNICODE\ repertoire. This means that, while
on the one hand \LUATEX\ uses more memory for the lookup table the number of
slots can be less. But still there was the waste of the shadow level table: I
didn't calculate the exact saving of ditching that one, but I bet it came close
to what was available as total memory for programs and data on the first machines
that I used for running \TEX. But \unknown\ after more than a decade of \LUATEX\
we now reclaimed that space in \LUAMETATEX. \footnote {Don't expect a gain in
performance, although using less memory might pay back on a virtual machine or
when \TEX\ has to share the \CPU\ cache.}

Now, in case you're interested (and actually I just write it down because I don't
want to forget it myself) the lookup table in \LUAMETATEX\ is layout as follows

\starttabulate
\NC the hash table                    \NC \NC \NR
\NC some frozen primitives            \NC \NC \NR
\NC current and defined fonts         \NC one slot + many pointers \NC \NR
\NC undefined control sequence        \NC one slot \NC \NR
\NC internal and register glue        \NC pointer to node \NC \NR
\NC internal and register muglue      \NC pointer to node \NC \NR
\NC internal and register toks        \NC pointer to token list \NC \NR
\NC internal and register boxes       \NC pointer to node list \NC \NR
\NC internal and register counts      \NC value in token \NC \NR
\NC internal and register attributes  \NC value in token \NC \NR
\NC internal and register dimens      \NC value in token \NC \NR
\NC some special data structures      \NC pointer to node list \NC  \NC \NR
\NC the (runtime) extended hash table \NC  \NC \NR
\stoptabulate

Normally a user doesn't need to know anything about these specific properties of
the engine and it might comfort you to know that for a long time I could stay
away from these details. One difference with the other engines is that we have
internal variables and registers split more explicitly. The special data
structures have their own slots and are not just put somewhere (semi random). The
initialization is bit more granular in that we properly set the types (cmd codes)
for registers which in turn is possible because for instance we're able to
distinguish glue types. This is all part of coming up with a bit more consistent
interface to tokens from the \LUA\ end. It also permits diagnostics.

Anyway, we now are ready for some more details about tokens. You don't need to
understand all of it in order to define decent macros. But when you are using
\LUATEX\ and do want to mess around here is some insight. Assume we have defined
these macros:

\startluacode
    local alsoraw = false
    function documentdata.StartShowTokens(rawtoo)
        context.starttabulate { "|T|rT|lT|rT|rT|rT|" .. (rawtoo and "rT|" or "") }
        context.BC()
        context.BC() context("cmd")
        context.BC() context("name")
        context.BC() context("chr")
        context.BC() context("cs")
        if rawtoo then
            context.BC() context("rawchr")
        end
        context.BC() context.NR()
        context.SL()
        alsoraw = rawtoo
    end
    function documentdata.StopShowTokens()
        context.stoptabulate()
    end
    function documentdata.ShowToken(name)
        local cmd, chr, cs = token.get_cmdchrcs(name)
        local _,   raw, _  = token.get_cmdchrcs(name,true)
        context.NC() context("\\string\\"..name)
        context.NC() context(cmd)
        context.NC() context(tokens.commands[cmd])
        context.NC() context(chr)
        context.NC() context(cs)
        if alsoraw and chr ~= raw then
            context.NC() context(raw)
        end
        context.NC() context.NR()
    end
\stopluacode

\startbuffer
\def\MacroA{a} \def\MacroB{b}
\def\macroa{a} \def\macrob{b}
\def\MACROa{a} \def\MACROb{b}
\stopbuffer

\typebuffer \getbuffer

How does that end up internally?

\startluacode
    documentdata.StartShowTokens(true)
    documentdata.ShowToken("scratchcounterone")
    documentdata.ShowToken("scratchcountertwo")
    documentdata.ShowToken("scratchdimen")
    documentdata.ShowToken("scratchtoks")
    documentdata.ShowToken("scratchcounter")
    documentdata.ShowToken("letterpercent")
    documentdata.ShowToken("everypar")
    documentdata.ShowToken("%")
    documentdata.ShowToken("pagegoal")
    documentdata.ShowToken("pagetotal")
    documentdata.ShowToken("hangindent")
    documentdata.ShowToken("hangafter")
    documentdata.ShowToken("dimdim")
    documentdata.ShowToken("relax")
    documentdata.ShowToken("dimen")
    documentdata.ShowToken("stoptext")
    documentdata.ShowToken("MacroA")
    documentdata.ShowToken("MacroB")
    documentdata.ShowToken("MacroC")
    documentdata.ShowToken("macroa")
    documentdata.ShowToken("macrob")
    documentdata.ShowToken("macroc")
    documentdata.ShowToken("MACROa")
    documentdata.ShowToken("MACROb")
    documentdata.ShowToken("MACROc")
    documentdata.StopShowTokens()
\stopluacode

We show the raw chr value but in the \LUA\ interface these are normalized to for
instance proper register indices. This is because the raw numbers can for
instance be indices into memory or some \UNICODE\ reference with catcode specific
bits set. But, while these indices are real and stable, these offsets can
actually change when the implementation changes. For that reason, in \LUAMETATEX\
we can better talk of command codes as main indicator and:

\starttabulate
\NC subcommand       \NC for tokens that have variants, like \type {\ifnum} \NC \NR
\NC register indices \NC for the 64K register banks, like \type {\count0} \NC \NR
\NC internal indices \NC for internal variables like \type {\parindent} \NC \NR
\NC characters       \NC specific \UNICODE\ slots combined with catcode \NC \NR
\NC pointers         \NC to token lists, macros, \LUA\ functions, nodes \NC \NR
\stoptabulate

This so called \type {cs} number is a pointer into the table of equivalents. That
number results comes from the hash table. A macro name, when scanned the first
time, is still a sequence of bytes. This sequence is used to compute a hash
number, which is a pointer to a slot in the lower part of the hash (lookup)
table. That slot points to a string and a next hash entry in the higher end. A
lookup goes as follows:

\startitemize[n,packed]
    \startitem
        compute the index into the hash table from the string
    \stopitem
    \startitem
        goto the slot with that index and compare the \type {string} field
    \stopitem
    \startitem
        when there is no match goto the slot indicated by the \type {next} field
    \stopitem
    \startitem
        compare again and keep following \type {next} fields till there is no
        follow up
    \stopitem
    \startitem
        optionally create a new entry
    \stopitem
    \startitem
        use the index of that entry as index in the table of equivalents
    \stopitem
\stopitemize

So, in \LUA\ speak, we have:

\starttyping
hashtable = {
    -- lower part, accessed via the calculated hash number
    { stringpointer, nextindex },
    { stringpointer, nextindex },
    ...
    -- higher part, accessed by following nextindex
    { stringpointer, nextindex },
    { stringpointer, nextindex },
    ...
}
\stoptyping

Eventually, after following a lookup chain in the hash tabl;e, we end up at
pointer to the equivalents lookup table that we already discussed. From then on
we're talking tokens. When you're lucky, the list is small and you have a quick
match. The maximum initial hash index is not that large, around 64K (double that
in \LUAMETATEX), so in practice there will often be some indirect
(multi|-|compare) match but increasing the lower end of the hash table might
result in less string comparisons later on, but also increases the time to
calculate the initial hash needed for accessing the lower part. Here you can sort
of see that:

\startbuffer
\dostepwiserecurse{`a}{`z}{1}{
    \expandafter\def\csname whatever\Uchar#1\endcsname
      {}
}
\dostepwiserecurse{`a}{`z}{1}{
    \expandafter\let\csname somemore\Uchar#1\expandafter\endcsname
        \csname whatever\Uchar#1\endcsname
}
\stopbuffer

\typebuffer \getbuffer

\startluacode
    documentdata.StartShowTokens(true)
    for i=utf.byte("a"),utf.byte("z") do
        documentdata.ShowToken("whatever"..utf.char(i))
        documentdata.ShowToken("somemore"..utf.char(i))
    end
    documentdata.StopShowTokens()
\stopluacode

The command code indicates a macro and the action related to it is an expandable
call. We have no sub command \footnote {We cheat a little here because chr
actually is an index into token memory but we don't show them as such.} so that
column shows zeros. The fifth column is the hash entry which can bring us back to
the verbose name as needed in reporting while the last column is the index to
into token memory (watch the duplicates for \type {\let} macros: a ref count is
kept in order to be able to manage such shared references). When you look a the
cs column you will notice that some numbers are close which (I think) in this
case indicates some closeness in the calculated hash name and followed chain.

It will be clear that it is best to not make any assumptions with respect to the
numbers which is why, in \LUAMETATEX\ we sort of normalize them when accessing
properties.

\starttabulate
\NC field      \NC meaning \NC \NR
\FL
\NC command    \NC operator \NC \NR
\NC cmdname    \NC internal name of operator \NC \NR
\NC index      \NC sanitized operand \NC \NR
\NC mode       \NC original operand  \NC \NR
\NC csname     \NC associated name   \NC \NR
\NC id         \NC the index in token memory (a virtual address) \NC \NR
\NC tok        \NC the integer representation \NC \NR
\ML
\NC active     \NC true when an active character \NC \NR
\NC expandable \NC true when expandable command \NC \NR
\NC protected  \NC true when a protected command \NC \NR
\NC frozen     \NC true when a frozen command \NC \NR
\NC user       \NC true when a user defined command \NC \NR
\LL
\stoptabulate

When a control sequence is an alias to an existing primitive, for instance
made by \type {\let}, the operand (chr) picked up from its meaning. Take this:

\startbuffer
\newif\ifmyconditionone
\newif\ifmyconditiontwo

                    \meaning\ifmyconditionone    \crlf
                    \meaning\ifmyconditiontwo    \crlf
                    \meaning\myconditiononetrue  \crlf
                    \meaning\myconditiontwofalse \crlf
\myconditiononetrue \meaning\ifmyconditionone    \crlf
\myconditiontwofalse\meaning\ifmyconditiontwo    \crlf
\stopbuffer

\typebuffer \getbuffer

Internally this is:

\startluacode
    documentdata.StartShowTokens(false)
    documentdata.ShowToken("ifmyconditionone")
    documentdata.ShowToken("ifmyconditiontwo")
    documentdata.ShowToken("iftrue")
    documentdata.ShowToken("iffalse")
    documentdata.StopShowTokens()
\stopluacode

The whole list of available commands is given below. Once they are stable the \LUAMETATEX\ manual
will document the accessors. In this chapter we use:

\starttyping
kind, min, max, fixedvalue token.get_range("primitive")
cmd, chr, cs  = token.get_cmdchrcs("primitive")
\stoptyping

The kind of command is given in the first column, which can have the following values:

\starttabulate[|l|l|p|]
\NC 0 \NC no        \NC not accessible \NC \NR
\NC 1 \NC regular   \NC possibly with subcommand \NC \NR
\NC 2 \NC character \NC the \UNICODE\ slot is encodes in the the token \NC \NR
\NC 3 \NC register  \NC this is an indexed register (zero upto 64K) \NC \NR
\NC 4 \NC internal  \NC this is an internal register (range given) \NC \NR
\NC 5 \NC reference \NC this is a reference to a node, \LUA\ function, etc. \NC \NR
\NC 6 \NC data      \NC a general data entry (kind of private) \NC \NR
\NC 7 \NC token     \NC a token reference (that can have a followup) \NC \NR
\stoptabulate

\usemodule[system-tokens]

\start \switchtobodyfont[7pt] \showsystemtokens \stop

\stopchapter

\stopcomponent