summaryrefslogtreecommitdiff
path: root/doc/context/sources/general/manuals/lowlevel/lowlevel-buffers.tex
blob: 5c600cd6feba9cd4f5436e0a3329b50bfb68c7e2 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
% language=us runpath=texruns:manuals/lowlevel

\environment lowlevel-style

\startdocument
  [title=buffers,
   color=middlegreen]

\startsectionlevel[title=Preamble]

Buffers are not that low level but it makes sense to discuss them in this
perspective because it relates to tokenization, internal representation and
manipulating.

{\em In due time we can describe some more commands and details here. This
is a start. Feel free to tell me what needs to be explained.}

\stopsectionlevel

\startsectionlevel[title=Encoding]

Normally processing a document starts with reading from file. In the past we were
talking single bytes that were then maps onto a specific input encoding that
itself matches the encoding of a font. When you enter an \quote {a} its (normally
\ASCII) number 97 becomes the index into a font. That same number is also used in
the hyphenator which is why font encoding and hyphenation are strongly related.
If in an eight bit \TEX\ engine you need a precomposed \quote {ä} you have to use
an encoding that has that character in some slot with again matching fonts and
patterns. The actually used font can have the {\em shapes} in different slots and
remapping is then done in the backend code using encoding and mapping files. When
\OPENTYPE\ fonts are used the relationship between characters (input) and glyphs
(rendering) also depends on the application of font features.

In eight bit environments all this brings a bit of a resource management
nightmare along with complex installation of new fonts. It also puts strain on
the macro package, especially when you want to mix different input encodings onto
different font encodings and thereby pattern encodings in the same document. You
can compare this with code pages in operating system, but imagine them
potentially being mixed in one document, which can happen when you mix multiple
languages where the accumulated number of different characters exceeds 256. You
end up switching between encodings. One way to deal with it is making special
characters active and let their meaning differ per situation. That is for
instance how in \MKII\ we handled \UTF8\ and thereby got around distributing
multiple pattern files per language as we only needed to encoding them in \UTF\
and then remap them to the required encoding when loading patterns. A mental
exercise is wondering how to support \CJK\ scripts in an eight bit \MKII,
something that actually can be done with some effort.

The good news is that when we moved from \MKII\ to \MKIV\ we went exclusively
\UTF8\ because that is what the \LUATEX\ engine expects. Upto four bytes are read
in and translated into one \UNICODE\ character. The internal representation is a
32 bit integer (four bytes) instead of a single byte. That also means that in the
transition we got rid of quite some encoding related low level font and pattern
handling. We still support input encodings (called regimes in \CONTEXT) but I'm
pretty sure that nowadays no one uses input other than \UTF8. While \CONTEXT\ is
normally quite upward compatible this is one area where there were fundamental
changes.

There is still some interpretation going on when reading from file: for instance,
we need to normalize the \UNICODE\ input, and we feed the engine separate lines
on demand. Apart from that, some characters like the backslash, dollar sign and
curly braces have special meaning so for accessing them as characters we have to
use commands that inject those characters. That didn't change when we went from
\MKII\ to \MKIV. In practice it's never really a problem unless you find yourself
in one of the following situations:

\startitemize
\startitem
    {\em Example code has to be typeset as|-|is, so braces etc.\ are just that.}
    This means that we have to change the way characters are interpreted.
    Typesetting code is needed when you want to document \TEX\ and macros which
    is why mechanisms for that have to be present right from the start.
\stopitem
\startitem
    {\em Content is collected and used later.} A separation of content and usage
    later on often helps making a source look cleaner. Examples are \quotation
    {wrapping a table in a buffer} and \quotation {including that buffer when a
    table is placed} using the placement macros.
\stopitem
\startitem
    {\em Embedded \METAPOST\ and \LUA\ code.} These languages come with different
    interpretation of some characters and especially \METAPOST\ code is often
    stored first and used (processed) later.
\stopitem
\startitem
    {\em The content comes from a different source.} Examples are \XML\ files
    where angle brackets are special but for instance braces aren't. The data is
    interpreted as a stream or as a structured tree.
\stopitem
\startitem
    {\em The content is generated.} It can for instance come from \LUA, where
    bytes (representing \UTF) is just text and no special characters are to be
    intercepted. Or it can come from a database (using a library).
\stopitem
\stopitemize

For these reasons \CONTEXT\ always had ways to store data in ways that makes this
possible. The details on how that is done might have changed over versions, been
optimized, extended with additional interfaces and features but given where we
come from most has been there from the start.

\stopsectionlevel

\startsectionlevel[title=Performance]

When \TEX\ came around, the bottlenecks in running \TEX\ were the processor,
memory and disks and depending on the way one used it the speed of the console or
terminal; so, basically the whole system. One could sit there and wait for the
page counters (\typ {[1] [2] ..} to show up. It was possible to run \TEX\ on a
personal computer but it was somewhat resource hungry: one needed a decent disk
(a 10 MB hard disk was huge and with todays phone camera snapshots that sounds
crazy). One could use memory extenders to get around the 640K limitation (keep in
mind that the programs and operating systems also took space). This all meant
that one could not afford to store too many tokens in memory but even using files
for all kind of (multi|-|pass) trickery was demanding.

When processors became faster and memory plenty the disk became the bottleneck,
but that changed when \SSD's showed up. Combined with already present file
caching that had some impact. We are now in a situation that \CPU\ cores don't
get that much faster (at least not twice as fast per iteration) and with \TEX\
being a single core byte cruncher we're more or less in a situation where
performance has to come from efficient programming. That means that, given enough
memory, in some cases storing in tokens wins over storing in files, but it is no
rule. In practice there is not much difference so one can even more than
yesterday choose for the most convenient method. Just assume that the \CONTEXT\
code, combined with \LUAMETATEX\ will give you what you need with a reasonable
performance. When in doubt, test with simple test files and it that works out
well compared to the real code, try to figure out where \quote {mistakes} are
made. Inefficient \LUA\ and \TEX\ code has way more impact than storing a few
more tokens or using some files.

\stopsectionlevel

\startsectionlevel[title=Files]

Nearly always files are read once per run. The content (mixed with commands) is
scanned and macros are expanded and|/|or text is typeset as we go. Internally the
\LUAMETATEX\ engine is in \quotation {scanning from file}, \quotation {scanning
from token lists}, or \quotation {scanning from \LUA\ output} mode. The first
mode is (in principle) the slowest because \UTF\ sequences are converted to
tokens (numbers) but there is no way around it. The second method is fast because
we already have these numbers, but we need to take into account where the linked
list of tokens comes from. If it is converted runtime from for instance file
input or macro expansion we need to add the involved overhead. But scanning a
stored macro body is pretty efficient especially when the macro is part of the
loaded macro package (format file). The third method is comparable with reading
from file but here we need to add the overhead involved with storing the \LUA\
output into data structures suitable for \TEX's input mechanism, which can
involve memory allocation outside the reserved pool of tokens. On modern systems
that is not really a problem. It is good to keep in mind that when \TEX\ was
written much attention was paid to optimization and in \LUAMETATEX\ we even went
a bit further, also because we know what kind of input, processing and output
we're dealing with.

When reading from file or \LUA\ output we interpret bytes turned \UTF\ numbers
and that is when catcode regimes kick in: characters are interpreted according to
the catcode properties: escape character (backslash), curly braces (grouping and
arguments), dollars (math), etc.\ While with reading from token lists these
catcodes are already taken care of and we're basically interpreting meanings
instead of characters. By changing the catcode regime we can for instance typeset
content verbatim from files and \LUA\ strings but when reading from token lists
we're sort of frozen. There are tricks to reinterpret the token list but that
comes with overhead and limitations.

\stopsectionlevel

\startsectionlevel[title=Macros]

A macro can be seen as a named token with a meaning attached. In \LUAMETATEX\
macros can take up to 15 arguments (six more than regular \TEX) that can be
separated by so called delimiters. A token has a command property (operator) and
a value (operand). Because a \UNICODE\ character doesn't need all four bytes of
an integer and because in the engine numbers, dimensions and pointers are limited
in size we can store all of these efficiently with the command code. Here the
body of \type {\foo} is a list of three tokens:

\starttyping
\def\foo{abc} \foo \foo \foo
\stoptyping

When the engine fetches a token from a list it will interpret the command and
when it fetches from file it will create tokens on the fly and then interpret
those. When a file or list is exhausted the engine pops the stack and continues
at the previous level. Because macros are already tokenized they are more
efficient than file input. For more about macros you can consult the low level
document about them.

The more you use a macro, the more it pays off compared to a file. However don't
overestimate this, because in the end the typesetting and expanding all kind of
other involved macros might reduce the file overhead to noise.

\stopsectionlevel

\startsectionlevel[title=Token lists]

A token list is like a macro but is part of the variable (register) system. It
is just a list (so no arguments) and you can append and prepend to that list.

\starttyping
\toks123={abc}    \the\toks123
\scratchtoks{abc} \the\scratchtoks
\stoptyping

Here \type {\scratchtoks} is defined with \type {\newtoks} which creates an
efficient reference to a list so that, contrary to the first line, no register
number has to be scanned. There are low level manuals about tokens and registers
that you can read if you want to know more about this. As with macros the list in
this example is three tokens long. Contrary to macros there is no macro overhead
as there is no need to check for arguments. \footnote {In \LUAMETATEX\ a macro
without arguments is also quite efficient.}

Because they use more or less the same storage method macros and token list
registers perform the same. The power of registers comes from some additional
manipulators in \LUATEX\ (and \LUAMETATEX) and the fact that one can control
expansion with \type {\the}, although that later advantage is compensated with
extensions to the macro language (like \type {\protected} macro definitions).

\stopsectionlevel

\startsectionlevel[title=Buffers]

Buffers are something specific for \CONTEXT\ and they have always been part of
this system. A buffer is defined as follows:

\startbuffer
\startbuffer[one]
line 1
line 2
\stopbuffer
\stopbuffer

\typebuffer

Among the operations on buffers the next two are used most often:

\starttyping
\typebuffer[one]
\getbuffer[one]
\stoptyping

Scanning a buffer at the \TEX\ end takes a little effort because when we start
reading the catcodes are ignored and for instance backslashes and curly braces
are retained. Hardly any interpretation takes place. The same is true for
spacing, so multiple spaces are not collapsed and newlines stay. The tokenized
content of a buffer is converted back to a string and that content is then read
in as a pseudo file when we need it. So, basically buffers are files! In \MKII\
they actually were files (in the \type {\jobname} name space and suffix \type
{tmp}), but in \MKIV\ they are stored in and managed by \LUA. That also means
that you can set them very efficiently at the \LUA\ end:

\starttyping
\startluacode
buffers.assign("one",[[
line 1
line 2
]])
\stopluacode
\stoptyping

Always keep in mind that buffers eventually are read as files: character by
character, and at that time the content gets (as with other files) tokenized. A
buffer name is optional. You can nest buffers, with and without names.

Because \CONTEXT\ is very much about re-use of content and selective processing
we have an (already old) subsystem for defining named blocks of text (using \type
{\begin...} and \type {\end...} tagging. These blocks are stored just like
buffers but selective flushing is part of the concept. Think of coding an
educational document with explanations, questions, answers and then typesetting
only the explanations, or the explanation along width some questions. Other
components can be typeset later so one can make for instance a special book(let)
with answers that either of not repeats the questions. Here we need features like
synchronization of numbers so that's why we cannot really use buffers. An
alternative is to use \XML\ and filter from that.

The \typ {\definebuffer} command defines a new buffer environment. When you set
buffers in \LUA\ you don't need to define a buffer because likely you don't need
the \type {\start} and \type {\stop} commands. Instead of \typ {\getbuffer} you
can also use \typ {\getdefinedbuffer} with defined buffers. In that case the
\type {before} and \type {after} keys of that specific instance are used.

The \typ {\getinlinebuffer} command, which like the getters takes a list of
buffer names, ignores leading and trailing spaces. When multiple buffers are
flushed this way, spacing between buffers is retained.

The most important aspect of buffers is that the content is {\em not} interpreted
and tokenized: the bytes stay as they are.

\startbuffer
\definebuffer[MyBuffer]

\startMyBuffer
\bold{this is
a buffer}
\stopMyBuffer

\typeMyBuffer \getMyBuffer
\stopbuffer

\typebuffer

These commands result in:

\getbuffer

There are not that many parameters that can be set: \type {before}, \type {after}
and \type {strip} (when set to \type {no} leading and trailing spacing will be
kept. The \type {\stop...} command, in our example \typ {\stopMyBuffer}, can be
defined independent to so something after the buffer has be read and stored but
by default nothing is done.

You can test if a buffer exists with \typ {\doifelsebuffer} (expandable) and \typ
{\doifelsebufferempty} (unexpandable). A buffer is kept in memory unless it gets
wiped clean with \typ {resetbuffer}.

\starttyping
\savebuffer      [MyBuffer][temp]     % gets name: jobname-temp.tmp
\savebufferinfile[MyBuffer][temp.log] % gets name: temp.log
\stoptyping

You can also stepwise fill such a buffer:

\starttyping
\definesavebuffer[slide]

\startslide
    \starttext
\stopslide
\startslide
    slide 1
\stopslide
text 1 \par
\startslide
    slide 2
\stopslide
text 2 \par
\startslide
    \stoptext
\stopslide
\stoptyping

After this you will have a file \type {\jobname-slide.tex} that has the two lines
wrapped as text. You can set up a \quote {save buffer} to use a different
filename (with the \type {file} key), a different prefix using \type {prefix} and
you can set up a \type {directory}. A different name is set with the \type {list}
key.

You can assign content to a buffer with a somewhat clumsy interface where we use
the delimiter \type {\endbuffer}. The only restriction is that this delimiter
cannot be part of the content:

\starttyping
\setbuffer[name]here comes some text\endbuffer
\stoptyping

For more details and obscure commands that are used in other commands
you can peek into the source.

% These are somewhat obscure:
%
% \getbufferdata{...}
% \grabbufferdatadirect % name start stop
% \grabbufferdata % was: \dostartbuffer
% \thebuffernumber
% \thedefinedbuffer

Using buffers in the \CLD\  interface is tricky because of the catcode magick that is
involved but there are setters and getters:

\starttabulate[|T|T|]
\BC function               \BC arguments \NC \NR
\ML
\NC buffers.assign         \NC name, content [,catcodes] \NC \NR
%NC buffers.raw            \NC \NC \NR
\NC buffers.erase          \NC name \NC \NR
\NC buffers.prepend        \NC name, content \NC \NR
\NC buffers.append         \NC name, content \NC \NR
\NC buffers.exists         \NC name \NC \NR
\NC buffers.empty          \NC name \NC \NR
\NC buffers.getcontent     \NC name \NC \NR
\NC buffers.getlines       \NC name \NC \NR
%NC buffers.collectcontent \NC \NC \NR
%NC buffers.loadcontent    \NC \NC \NR
%NC buffers.get            \NC \NC \NR
%NC buffers.getmkiv        \NC \NC \NR
%NC buffers.gettexbuffer   \NC \NC \NR
%NC buffers.run            \NC \NC \NR
\stoptabulate

There are a few more helpers that are used in other (low level) commands. Their
functionality might adapt to their usage there. The \typ {context.startbuffer}
and \typ {context.stopbuffer} are somewhat differently defined than regular
\CLD\ commands.

\stopsectionlevel

\startsectionlevel[title=Setups]

A setup is basically a macro but is stored and accessed in a namespace separated
from ordinary macros. One important characteristic is that inside setups newlines
are ignored.

\startbuffer
\startsetups MySetupA
    This is line 1
    and this is line 2
\stopsetups

\setup{MySetupA}
\stopbuffer

\typebuffer {\bf \getbuffer}

A simple way out is to add a comment character preceded by a space. Instead you
can also use \type {\space}:

\startbuffer
\startsetups [MySetupB]
    This is line 1 %
    and this is line 2\space
    while here we have line 3
\stopsetups

\setup[MySetupB]
\stopbuffer

\typebuffer {\bf \getbuffer}

You can use square brackets instead of space delimited names in definitions and
also in calling up a (list of) setup(s). The \type {\directsetup} command takes a
single setup name and is therefore more efficient.

Setups are basically simple macros although there is some magic involved that
comes from their usage in for instance \XML\ where we pass an argument. That
means we can do the following:

\startbuffer
\startsetups MySetupC
    before#1after
\stopsetups

\setupwithargument{MySetupC}{ {\em and} }
\stopbuffer

\typebuffer {\bf \getbuffer}

Because a setup is a macro, the body is a linked list of tokens where each token
takes 8 bytes of memory, so \type {MySetupC} has 12 tokens that take 96 bytes of
memory (plus some overhead related to macro management).

\stopsectionlevel

\startsectionlevel[title=\XML]

Discussing \XML\ is outside the scope of this document but it is worth mentioning
that once an \XML\ tree is read is, the content is stored in strings and can be
filtered into \TEX, where it is interpreted as if coming from files (in this case
\LUA\ strings). If needed the content can be interpreted as \TEX\ input.

\stopsectionlevel

\startsectionlevel[title=\LUA]

As mentioned already, output from \LUA\ is stored and when a \LUA\ call finishes
it ends up on the so called input stack. Every time the engine needs a token it
will fetch from the input stack and the top of the stack can represent a file,
token list or \LUA\ output. Interpreting bytes from files or \LUA\ strings
results in tokens. As a side note: \LUA\ output can also be already tokenized,
because we can actually write tokens and nodes from \LUA, but that's more an
implementation detail that makes the \LUA\ input stack entries a bit more
complex. It is normally not something users will do when they use \LUA\ in their
documents.

\stopsectionlevel

\startsectionlevel[title=Protection]

When you define macros there is the danger of overloading some defined by the
system. Best use CamelCase so that you stay away from clashes. You can enable
some checking:

\starttyping
\enabledirectives[overloadmode=warning]
\stoptyping

or when you want to quit on a clash:

\starttyping
\enabledirectives[overloadmode=error]
\stoptyping

When these trackers are enabled you can get around the check with:

\starttyping
\pushoverloadmode
  ...
\popoverloadmode
\stoptyping

But delay that till you're sure that redefining is okay.

\stopsectionlevel

% efficiency

\stopdocument