% language=us

\environment evenmore-style

\startcomponent evenmore-parsing

\startchapter[title=Parsing]

The macro mechanism in \TEX\ is quite powerful and once you understand the
concept of mixing parameters and delimiters you can do a lot with it. I assume
that you know what we're talking about, otherwise quit reading. When grabbing
arguments, there are a few catches.

\startitemize
\startitem
    When they are used, delimiters are mandatory: \TEX\ will go on reading an
    argument till the (current) delimiter condition is met. This means that when
    you forget one you end up with way more in the argument than expected or even
    run out of input.
\stopitem
\startitem
    Because specified arguments and delimiters are mandatory, when you want to
    parse input, you often need multi|-|step macros that first pick up the to be
    parsed input, and then piecewise fetch snippets. Bogus delimiters have to be
    appended to the original in order to catch a runaway argument, and checking
    has to be done to get rid of them when all is ok.
\stopitem
\stopitemize

The first item can be illustrated as follows:

\starttyping[option=TEX]
\def\foo[#1]{...}
\stoptyping

When \type {\foo} gets expanded \TEX\ first looks for a \type{[} and then starts
collecting tokens for parameter \type {#1}. It stops doing that when a \type {]}
is seen. So,

\starttyping[option=TEX]
\starttext
    \foo[whatever
\stoptext
\stoptyping

will for sure give an error. When collecting tokens, \TEX\ doesn't expand them so
the \type {\stoptext} is just turned into a token that gets appended.

The second item is harder to explain (or grasp):

\starttyping[option=TEX]
\def\foo[#1=#2]{(#1/#2)}
\stoptyping

Here we expect a key and a value, so these will work:

\starttyping[option=TEX]
\foo[key=value]
\foo[key=]
\stoptyping

while these will fail:

\starttyping[option=TEX]
\foo[key]
\foo[]
\stoptyping

unless we have:

\starttyping[option=TEX]
\foo[key]=]
\foo[]=]
\stoptyping

But, when processing the result, we then need to analyze the found arguments and
correct for them being wrong. For instance, argument \type {#1} can become \type
{]} or here \type {key]}. When indeed a valid key|/|value combination is given we
need to get rid of the two \quote {fixup} tokens \type{=]}. Normally we will have
multiple key|/|value pairs separated by a comma, and in practice we only need to
catch the missing equal because we can ignore empty cases. There are plenty of
examples (rather old code but also more modern variants) in the \CONTEXT\
code base.
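
To give an impression of that traditional approach, here is a minimal sketch
with made|-|up names (so not the actual \CONTEXT\ code): a bogus \type {==}
tail is appended so that a missing equal sign cannot make the scanner run away,
and a helper snips the pieces apart; the third argument catches whatever is
left of the bogus tail and can be inspected to see if the input was well formed.

\starttyping[option=TEX]
\def\safefoo[#1]{\dosafefoo#1==\relax}

\def\dosafefoo#1=#2=#3\relax{(#1/#2)}

\safefoo[key=value] % gives (key/value)
\safefoo[key]       % gives (key/)
\safefoo[]          % gives (/)
\stoptyping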

I will now show some new magic that is available in \LUAMETATEX\ as experimental
code. It will be tested in \LMTX\ for a while and might evolve in the process.

\startbuffer
\def\foo#1=#2,{(#1/#2)}

\foo 1=2,\ignorearguments
\foo 1=2\ignorearguments
\foo 1\ignorearguments
\foo \ignorearguments
\stopbuffer

\typebuffer[option=TEX]

Here we pick up a key and value separated by an equal sign. We end the input with
a special signal command: \type {\ignorearguments}. This tells the parser to quit
scanning. So, we get this, without any warning with respect to a missing
delimiter or running away:

\getbuffer

The implementation is actually fairly simple and doesn't add much overhead.
Alternatives (and I pondered a few) are just too messy, would remind me too much
of those awful expression syntaxes, and definitely impact the performance of
macro expansion, therefore: a no|-|go.

Using this new feature, we can implement a key|/|value parser that processes a
sequence of assignments. The prototypes used to get here only made use of this
one new feature and therefore still had to do some testing of the results. But,
after looking at the code, I decided that a few more helpers could make for
better looking code. So this is what I ended up with:

\startbuffer
\def\grabparameter#1=#2,%
  {\ifarguments\or\or
     % (\whatever/#1/#2)\par%
     \expandafter\def\csname\namespace#1\endcsname{#2}%
     \expandafter\grabnextparameter
   \fi}

\def\grabnextparameter
  {\expandafterspaces\grabparameter}

\def\grabparameters[#1]#2[#3]%
  {\def\namespace{#1}%
   \expandafterspaces\grabparameter#3\ignorearguments\ignorearguments}
\stopbuffer

\typebuffer[option=TEX]

Now, this one actually does what the \CONTEXT\ \type {\getparameters} command
does: setting variables in a namespace. Being a parameter driven macro package,
this kind of macro has been part of \CONTEXT\ since the beginning. There are
some variants and we also need to deal with the multilingual interface. Actually,
\MKIV\ (and therefore \LMTX) does things a bit differently, but the same
principles apply.
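
For comparison, a classical call looks as follows; it defines macros with
composed names in the given namespace, so after this \type {\fooalpha} expands
to \type {one}. This is just a sketch of the user level interface, the real
code has more variants:

\starttyping[option=TEX]
\getparameters
  [foo]
  [alpha=one,
   beta=two]
\stoptyping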

The \type {\ignorearguments} quits the scanning. Here we need two because we
actually quit twice. The \type {\expandafterspaces} can be implemented in
traditional \TEX\ macros but I thought it is nice to have it this way; the fact
that I only now added it has more to do with cosmetics. One could use the already
somewhat older extension \type {\futureexpandis} (which expands the second or
third token depending on the first one seen, in this variant ignoring spaces) or
a bunch of good old primitives to do the same. The new conditional \type
{\ifarguments} can be used to act upon the number of arguments given. It reflects
the most recently expanded macro. There is also a \type {\lastarguments}
primitive that provides the number of arguments.
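
A minimal sketch of how this can be used (assuming, as the grabber above does,
that \type {\ifarguments} behaves like an \type {\ifcase} on the number of
arguments that actually received content; the macro name is made up):

\starttyping[option=TEX]
\def\showpair#1=#2,%
  {\ifarguments
     (nothing)%
   \or
     (only #1)%
   \or
     (#1 and #2)%
   \fi}

\showpair key=value,\ignorearguments % presumably: (key and value)
\showpair key\ignorearguments        % presumably: (only key)
\showpair \ignorearguments           % presumably: (nothing)
\stoptyping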

So, what are the benefits? You might think that it is about performance, but in
practice there are not that many parameter settings going on. When I process the
\LUAMETATEX\ manual, one or more parameters are set only some 5000 times. And
even in a way more complex document that I asked my colleague to run, I was a bit
disappointed that only some 30.000 cases were reported. I know of users who have
documents with hundreds of thousands of cases, but compared to the rest of the
processing this is not where the performance bottleneck is. \footnote {Think of
thousands of pages of tables with cell settings applied.} This means that a
change in implementation like the above does not pay off in significantly better
runtime: all these low level mechanisms in \CONTEXT\ have been very well
optimized over the years. And faster machines made old bottlenecks go away
anyway. Take this use case:

\starttyping[option=TEX]
\grabparameters
  [foo]
  [key0=value0,
   key1=value1,
   key2=value2,
   key3=value3]
\stoptyping

After this, parameters can be accessed with:

\starttyping[option=TEX]
\def\getvalue#1#2{\csname#1#2\endcsname}
\stoptyping

used as:

\starttyping[option=TEX]
\getvalue{foo}{key2}
\stoptyping

which takes care of characters normally not permitted in macro names, like the
digits in this example. Of course some namespace protection can be added, like
adding a colon between the namespace and the key, but let's take just this one.
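
As a sketch of that namespace protection (the names \type {\setnamedvalue} and
\type {\getnamedvalue} are made up for this example), one can put a colon
between the namespace and the key, which also keeps composed names from clashing
with regular macro names:

\starttyping[option=TEX]
\def\setnamedvalue#1#2{\expandafter\def\csname#1:#2\endcsname}
\def\getnamedvalue#1#2{\csname#1:#2\endcsname}

\setnamedvalue{foo}{key2}{value2}

\getnamedvalue{foo}{key2} % expands to: value2
\stoptyping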

Some 10.000 expansions of the grabber take 0.045 seconds on my machine while the
original \type {\getparameters} takes 0.090 seconds, so although for this case
we're twice as fast, the 0.045 second difference will not be noticed in a real
run. After all, when these parameters are set some action will take place. Also,
we don't actually use this macro for collecting settings with the \type
{\setupsomething} commands, so the additional overhead that is involved adds a
baseline to performance that can turn any gain into noise. But some users might
notice some gain. Of course this observation might change once we apply this
trickery in more places than parameter parsing, because I have to admit that
there might be other places in the support macros where we can benefit: less
code, double the performance. But these are all support macros that made sense
in \MKII\ and not that much in \MKIV\ or \LMTX\ and are kept just for convenience
and backward compatibility. Think of some list processing macros. So, as a kind
of nostalgic trip I decided to rewrite some low level macros anyway, if only to
see what is no longer used and|/|or to make the code base somewhat (c)leaner.
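
Timings like the ones mentioned above can be obtained with something along these
lines (a sketch, assuming the \CONTEXT\ helpers \type {\resettimer}, \type
{\elapsedtime} and \type {\dorecurse}):

\starttyping[option=TEX]
\resettimer
\dorecurse {10000} {
    \grabparameters
      [foo]
      [key0=value0,
       key1=value1,
       key2=value2,
       key3=value3]
}
timing: \elapsedtime
\stoptyping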

Elsewhere I introduce the \type {#0} argument indicator. That one just gobbles
the argument and does not store a token list on the stack. It saves some memory
access and token recycling when arguments are not used. Another special indicator
is \type {#+}. That one will flag an argument to be passed as|-|is. The \type
{#-} variant will simply discard an argument and move on. The following examples
demonstrate this:

\startbuffer
\def\foo    [#1]{\detokenize{#1}}
\def\ofo    [#0]{\detokenize{#1}}
\def\oof    [#+]{\detokenize{#1}}
\def\fof[#1#-#2]{\detokenize{#1#2}}
\def\fff[#1#0#3]{\detokenize{#1#3}}

\meaning\foo\ : <\foo[{123}]> \crlf
\meaning\ofo\ : <\ofo[{123}]> \crlf
\meaning\oof\ : <\oof[{123}]> \crlf
\meaning\fof\ : <\fof[123]>   \crlf
\meaning\fff\ : <\fff[123]>   \crlf
\stopbuffer

\typebuffer[option=TEX]

This gives:

{\tttf \getbuffer}

% \getcommalistsize[a,b,c]   \commalistsize\par
% \getcommalistsize[{a,b,c}] \commalistsize\par

When playing with new features like the one described here, it makes sense to use
them in existing macros so that they get well tested. Some of the low level
system files come in different versions: for \MKII, \MKIV\ and \LMTX. The \MKII\
files often also have the older implementations, so they are also good for
looking at the history. The \LMTX\ files can be leaner and meaner than the \MKIV\
files because they use the latest features. \footnote {Some 70 primitives present
in \LUATEX\ are not in \LUAMETATEX. On the other hand there are also about 70 new
primitives. Of those gone, most concerned the backend, fonts or no longer
relevant features from other engines. Of those new, some are really new
primitives (conditionals, expansion magic), some control previously hardwired
behaviour, some give access to properties of for instance boxes, and some are
just variants of existing ones but with options for control.}

When I was rewriting some of these low level \MKIV\ macros using the newer
features, at some point I wondered why I still had to jump through some hoops.
Why not just add some more primitives to deal with that? After all, \LUATEX\ and
\LUAMETATEX\ already have more primitives that are helpful in parsing, so a few
dozen more lines don't hurt, as long as these primitives are generic and not too
specific. In this particular case we talk about two new conditionals (in addition
to the already present comparison primitives):

\starttyping[option=TEX]
\ifhastok    <token>       {<token list>}
\ifhastoks  {<token list>} {<token list>}
\ifhasxtoks {<token list>} {<token list>}
\stoptyping

You can probably guess what they do from their names. The last one is the
expandable variant of the second one. The first one is the fast one. When playing
with these I decided to redo the set checker. In \MKII\ that one is done in good
old \TEX, in \MKIV\ we use \LUA. So, how about going back to \TEX ?

\starttyping[option=TEX]
\ifhasxtoks {cd} {abcdef}
\stoptyping

This check is true. But that doesn't work well with a comma separated list;
there is a way out though:

\starttyping[option=TEX]
\ifhasxtoks {,cd,} {,ab,cd,ef,}
\stoptyping

However, when I applied that, a user reported that it didn't handle optional
spaces before commas. So how do we deal with such optional character tokens?

\startbuffer
\def\setcontains#1#2{\ifhasxtoks{,#1,}{,#2,}}

\ifcondition\setcontains{cd}{ab,cd,ef}YES \else NO \fi
\ifcondition\setcontains{cd}{ab, cd, ef}YES \else NO \fi
\stopbuffer

\typebuffer[option=TEX]

We get:

\getbuffer

The \type {\ifcondition} is an old one. When nested in a condition it will be
seen as an \type {\if...} by the fast skipping scanner, but when expanded it will
go on and a following macro has to expand to a proper condition. That said, we
can take care of the optional space by entering some new territory. Look at this:

\startbuffer
\def\setcontains#1#2{\ifhasxtoks{,\expandtoken 9 "20 #1,}{,#2,}}

\ifcondition\setcontains{cd}{ab,cd,ef}YES \else NO \fi
\ifcondition\setcontains{cd}{ab, cd, ef}YES \else NO \fi
\stopbuffer

\typebuffer[option=TEX]

We get:

\getbuffer

So how does that work? The \type {\expandtoken} injects a space token with
catcode~9, which means that it is in the to be ignored category. When a to be
ignored token is seen, and the to be checked token is a character (letter, other,
space or ignored), then the character codes will be compared. When they match, we
move on, otherwise we just skip over the ignored token (here the space).

In the \CONTEXT\ code base there are already files that are specific for \MKIV\
and \LMTX. The most visible difference is that we use the \type {\orelse}
primitive to construct nicer test trees, and we also use some of the additional
\type {\future...} and \type {\expandafter...} features. The extensions discussed
here make for the most recent differences (we're talking end May 2020).
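
As an aside, a small sketch of what such a nicer test tree looks like with
\type {\orelse}: the next test continues at the same level, so there is no
nesting and only one \type {\fi} is needed (the macro name here is made up):

\starttyping[option=TEX]
\def\showsize#1%
  {\ifnum#1<10
     small%
   \orelse\ifnum#1<100
     medium%
   \else
     big%
   \fi}
\stoptyping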

After implementing this trick I decided to look at the macro definition mechanism
one more time and see if I could also use this there. Before I demonstrate the
next feature, I will again show the argument extensions, this time with a fourth
variant:

\startbuffer[definitions]
\def\TestA#1#2#3{{(#1)(#2)(#3)}}
\def\TestB#1#0#3{(#1)(#2)(#3)}
\def\TestC#1#+#3{(#1)(#2)(#3)}
\def\TestD#1#-#2{(#1)(#2)}
\stopbuffer

\typebuffer[definitions][option=TEX] \getbuffer[definitions]

The last one specifies a to be trashed argument: \type {#-}. It goes further
than the second one (\type {#0}) which still keeps a reference. This is why in
this last case the third argument gets number \type {2}. The meanings of these
four are:

\startlines \tttf
\meaning\TestA
\meaning\TestB
\meaning\TestC
\meaning\TestD
\stoplines

There are some subtle differences between these variants, as you can see from
the following examples:

\startbuffer[usage]
\TestA1{\red 2}3
\TestB1{\red 2}3
\TestC1{\red 2}3
\TestD1{\red 2}3
\stopbuffer

\typebuffer[usage][option=TEX]

Here you also see the side effect of keeping the braces. The zero argument (\type
{#0}) is ignored, and the trashed argument (\type {#-}) can't even be accessed.

\startlines \tttf \getbuffer[usage] \stoplines

In the next example we see two delimiters being used, a comma and a space, but
they have catcode~9 which flags them as ignored. This is a signal for the parser
that both the comma and the space can be skipped. The zero arguments are still on
the parameter stack, but the trashed ones result in a smaller stack, not that
the latter matters much on today's machines.

\startbuffer
\normalexpanded {
    \def\noexpand\foo
    \expandtoken 9 "2C % comma
    \expandtoken 9 "20 % space
    #1=#2]%
}{(#1)(#2)}
\stopbuffer

\typebuffer[option=TEX] \getbuffer

This means that the next three expansions won't bark:

\startbuffer
\foo,key=value]
\foo, key=value]
\foo  key=value]
\stopbuffer

\typebuffer[option=TEX]

or expanded:

\startlines \tttf \getbuffer \stoplines

Now, why didn't I add these primitives long ago already? After all, I already
added dozens of new primitives over the years. To quote Andrew Cuomo, what
follows now are opinions, not facts.

Decades ago, when \TEX\ showed up, there was no Internet. I remember that I got
my first copy on floppy disks. Computers were slow and memory was limited. The
\TEX book was the main resource and writing macros was a kind of art. One could
not look up solutions, so trial and error was a valid way to go. Figuring out
what was efficient in terms of memory consumption and runtime was often needed
too. I remember meetings where one was not taken seriously when not talking in
the right \quote {token}, \quote {node}, \quote {stomach} and \quote {mouth}
speak. Suggesting extensions could end up in being told that there was no need
because all could be done in macros, or even in arguments of the \quotation {who
needs that} kind. I must admit that nowadays I wonder to what extent that was
related to extensions taking away some of the craftsmanship and showing off. In
a way it is no surprise that (even trivial to implement) extensions never
surfaced. Of course then the question is: will extensions that once were
considered not of importance be used today? We'll see.

Let's end by saying that, as with other experiments, I might port some of the new
features in \LUAMETATEX\ to \LUATEX, but only after they have become stable and
have been tested in \LMTX\ for quite a while.

\stopchapter

\stopcomponent