summaryrefslogtreecommitdiff
path: root/doc/context/sources/general/manuals/mk/mk-mix.tex
blob: dd2c72d5bafcb648b5f69038fe444f5e9a081dd4 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
% language=uk

\startcomponent mk-mix

\environment mk-environment

\chapter{The \luaTeX\ Mix}

\subject{introduction}

The idea of embedding \LUA\ into \TEX\ originates in some
experiments with \LUA\ embedded in the \SCITE\ editor. You can add
functionality to this editor by loading \LUA\ scripts. This is
accomplished by a library that gives access to the internals of
the editing component.

The first integration of \LUA\ in \PDFTEX\ was relatively simple:
from \TEX\ one could call out to \LUA\ and from \LUA\ one could
print to \TEX. My first application was converting math encoded a
calculator syntax to \TEX. Following experiments dealt with
\METAPOST. At this point integration meant as little as: having some
scripting language as addition to the macro language. But, even in
this early stage further possibilities were explored, for instance
in manipulating the final output (i.e.\ the \PDF\ code). The first
versions of what by then was already called \LUATEX\ provided
access to some internals, like counter and dimension registers and
the dimensions of boxes.

Boosted by the oriental \TeX\ project, the team started exploring
more fundamental possibilities: hooks in the input|/|output,
tokenization, fonts and nodelists. This was followed by opening up
hyphenation, breaking lines into paragraphs and building
ligatures. At that point we not only had access to some internals
but also could influence the way \TEX\ operates.

After that, an excursion was made to \MPLIB, which fulfilled a
long standing wish for a more natural integration of \METAPOST\
into \TEX. At that point we ended up with mixtures of \TEX, \LUA\
and \METAPOST\ code.

Medio 2008 we still need to open up more of \TEX, like page
building, math, alignments and the backend. Eventually \LUATEX\
will be nicely split up in components, rewritten in \CCODE, and we may
even end up with \LUA\ glueing together the components that make
up the \TEX\ engine. At that point the interoperation between
\TEX\ and \LUA\ may be more rich that it is now.

In the next sections I will discuss some of the ideas behind
\LUATEX\ and the relationship between \LUA\ and \TEX\ and how it
presents itself to users. I will not discuss the interface itself,
which consists of quite some functions (organized in pseudo
libraries) and the mechanisms used to access and replace internals
(we call them callbacks).

\subject {tex vs. lua}

\TEX\ is a macro language. Everything boils down to either allowing
stepwise expansion or explicitly preventing it. There are no real
control features, like loops; tail recursion is a key concept.
There are few accessible data|-|structures like numbers, dimensions,
glue, token lists and boxes. What happens inside \TEX\ is
controlled by variables, mostly hidden from view, and optimized
within the constraints of 30 years ago.

The original idea behind \TEX\ was that an author would write a
specific collection of macros for each publication, but increasing
popularity among non-programmers quickly resulted in distributed
collections of macros, called macro packages. They started small
but grew and grew and by now have become pretty large. In these
packages there are macros dealing with fonts, structure, page
layout, graphic inclusion, etc. There is also code dealing with
user interfaces, process control, conversion and much of that code
looks out of place: the lack of control features and string
manipulation is solved by mimicking other languages, the
unavailability of a float datatype is compensated by misusing
dimension registers, and you can find provisions to force or
inhibit expansion all over the place.

\TEX\ is a powerful typographical programming language but
lacks some of the handy features of scripting languages. Handy in the
sense that you will need them when you want to go beyond the
original purpose of the system. \LUA\ is a powerful scripting
language, but knows nothing of typesetting. To some extent it
resembles the language that \TEX\ was written in: \PASCAL. And,
since \LUA\ is meant for embedding and extending existing systems,
it makes sense to bring \LUA\ into \TEX. How do they compare?
Let's give some examples.

About the simplest example of using \LUA\ in \TEX\ is the following:

\starttyping
\directlua { tex.print(math.sqrt(10)) }
\stoptyping

This kind of application is probably what most users will want and
use, if they use \LUA\ at all. However, we can go further than that.

In \TEX\ a loop can be implemented as in the plain format
(copied with comment):

\starttyping
\def\loop#1\repeat{\def\body{#1}\iterate}
\def\iterate{\body\let\next\iterate\else\let\next\relax\fi\next}
\let\repeat=\fi % this makes \loop...\if...\repeat skippable
\stoptyping

This is then used as:

\starttyping
\newcount \mycounter \mycounter=1
\loop
    ...
    \advance\mycounter 1
    \ifnum\mycounter < 11
\repeat
\stoptyping

The definition shows a bit how \TEX\ programming works. Of course
such definitions can be wrapped in macros, like:

\starttyping
\forloop{1}{10}{1}{some action}
\stoptyping

and this is what often happens in more complex macro packages. In
order to use such control loops without side effects, the macro
writer needs to take measures that permit for instance nested
usage and avoids clashes between local variables (counters or
macros) and user defined ones. Here we use a counter in the
condition, but in practice expressions will be more complex
and this is not that trivial to implement.

The original definition of the iterator can be written a bit
more efficient:

\starttyping
\def\iterate{\body \expandafter\iterate \fi}
\stoptyping

And indeed, in macro packages you will find many such expansion
control primitives being used, which does not make reading macros
easier.

Now, get me right, this does not make \TEX\ less powerful, it's
just that the language is focused on typesetting and not on
general purpose programming, and in principle users can do
without: documents can be preprocessed using another language, and
document specific styles can be used.

We have to keep in mind that \TEX\ was written in a time when
resources in terms of memory and \CPU\ cycles weres less abundant
than they are now. The 255 registers per class and the about 3000
hash slots in original \TEX\ were more than enough for typesetting
a book, but in huge collections of macros they are not all that much. For
that reason many macropackages use obscure names to hide their
private registers from users and instead of allocating new ones
with meaningful names, existing ones are shared. It is therefore
not completely fair to compare \TEX\ code with \LUA\ code: in \LUA\
we have plenty of memory and the only limitations are those
imposed by modern computers.

In \LUA, a loop looks like this:

\starttyping
for i=1,10 do
    ...
end
\stoptyping

But while in the \TEX\ example, the content directly ends up in
the input stream, in \LUA\ we need to do that explicitly, so in
fact we will have:

\starttyping
for i=1,10 do
    tex.print("...")
end
\stoptyping

And, in order to execute this code snippet, in \LUATEX\ we will do:

\starttyping
\directlua 0 {
    for i=1,10 do
        tex.print("...")
    end
}
\stoptyping

So, eventually we will end up with more code than just \LUA\ code,
but still the loop itself looks quite readable and more complex loops
are possible:

\starttyping
\directlua 0 {
    local t, n = { }, 0
    while true do
        local r = math.random(1,10)
        if not t[r] then
            t[r], n = true, n+1
            tex.print(r)
            if n == 10 then break end
        end
    end
}
\stoptyping

This will typeset the numbers 1 to 10 in randomized order.
Implementing a random number generator in pure \TEX\ takes some bit of
code and keeping track of already defined numbers in macros can be
done with macros, but both are not very efficient.

I already stressed that \TEX\ is a typographical programming
language and as such some things in \TEX\ are easier than in \LUA,
given some access to internals:

\starttyping
\setbox0=\hbox{x} \the\wd0
\stoptyping

In \LUA\ we can do this as follows:

\starttyping
\directlua 0 {
    local n = node.new('glyph')
    n.font = font.current()
    n.char = string.byte('x')
    tex.box[0] = node.hpack(n)
    tex.print(tex.box[0].width/65536 .. "pt")
}
\stoptyping

One pitfall here is that \TEX\ rounds the number differently than
\LUA. Both implementations can be wrapped in a macro cq. function:

\starttyping
\def\measured#1{\setbox0=\hbox{#1}\the\wd0\relax}
\stoptyping

Now we get:

\starttyping
\measured{x}
\stoptyping

The same macro using \LUA\ looks as follows:

\starttyping
\directlua 0 {
    function measure(chr)
        local n = node.new('glyph')
        n.font = font.current()
        n.char = string.byte(chr)
        tex.box[0] = node.hpack(n)
        tex.print(tex.box[0].width/65536 .. "pt")
    end
}
\def\measured#1{\directlua0{measure("#1")}}
\stoptyping

In both cases, special tricks are needed if you want to pass for
instance a \type {#} to \TEX's variant, or a \type {"} to \LUA. In
both cases we can use shortcuts like \type {\#} and in the second
case we can pass strings as long strings using double square
brackets to \LUA.

This example is somewhat misleading. Imagine that we want to
pass more than one character. The \TEX\ variant is already suited
for that, but the function will now look like:

\starttyping
\directlua 0 {
    function measure(str)
        if str == "" then
            tex.print("0pt")
        else
            local head, tail = nil, nil
            for chr in str:gmatch(".") do
                local n = node.new('glyph')
                n.font = font.current()
                n.char = string.byte(chr)
                if not head then
                    head = n
                else
                    tail.next = n
                end
                tail = n
            end
            tex.box[0] = node.hpack(head)
            tex.print(tex.box[0].width/65536 .. "pt")
        end
    end
}
\stoptyping

And still it's not okay, since \TEX\ inserts kerns between
characters (depending on the font) and glue between words, and
doing that all in \LUA\ takes more code. So, it will be clear that
although we will use \LUA\ to implement advanced features, \TEX\
itself still has quite some work to do.

In the following example we show code, but this is not of
production quality. It just demonstrates a new way of dealing
with text in \TEX.

Occasionally a design demands that at some place the first
character of each word should be uppercase, or that the first word
of a paragraph should be in small caps, or that each first line of a
paragraph has to be in dark blue. When using traditional \TEX\ the user
then has to fall back on parsing the data stream, and preferably
you should then start such a sentence with a command that can pick
up the text. For accentless languages like English this is quite
doable but as soon as commands (for instance dealing with accents)
enter the stream this process becomes quite hairy.

The next code shows how \CONTEXT\ \MKII\ defines the \type {\Word}
and \type {\Words} macros that capitalize the first characters of
word(s). The spaces are really important here because they signal
the end of a word.

\starttyping
\def\doWord#1%
  {\bgroup\the\everyuppercase\uppercase{#1}\egroup}

\def\Word#1%
  {\doWord#1}

\def\doprocesswords#1 #2\od
  {\doifsomething{#1}{\processword{#1} \doprocesswords#2 \od}}

\def\processwords#1%
  {\doprocesswords#1 \od\unskip}

\let\processword\relax

\def\Words
  {\let\processword\Word \processwords}
\stoptyping

Actually, the code is not that complex. We split of words and feed
them to a macro that picks up the first token (hopefully a character)
which is then fed into the \type {\uppercase} primitive. This assumes that
for each character a corresponding uppercase variant is defined using the
\type {\uccode} primitive. Exceptions can be dealt with by assigning relevant
code to the token register \type {\everyuppercase}.
However, such macros are far from robust. What happens if the text
is generated and not input as-is? What happens with commands in
the stream that do something with the following tokens?

A \LUA\ based solution can look as follows:

\starttyping
\def\Words#1{\directlua 0
    for s in unicode.utf8.gmatch("#1", "([^ ])") do
        tex.sprint(string.upper(s:sub(1,1)) .. s:sub(2))
    end
}
\stoptyping

But there is no real advantage here, apart from the fact that less code
is needed. We still operate on the input and therefore we need to look
to a different kind of solution: operating on the node list.

\starttyping
function CapitalizeWords(head)
    local done = false
    local glyph = node.id("glyph")
    for start in node.traverse_id(glyph,head) do
        local prev, next = start.prev, start.next
        if prev and prev.id == kern and prev.subtype == 0 then
            prev = prev.prev
        end
        if next and next.id == kern and next.subtype == 0 then
            next = next.next
        end
        if (not prev or prev.id ~= glyph) and
                    next and next.id == glyph then
            done = upper(start)
        end
    end
    return head, done
end
\stoptyping

A node list is a forward|-|linked list. With a helper
function in the \type {node} library we can loop over such lists. Instead
of traversing we can use a regular while loop, but it is probably less
efficient in this case. But how to apply this function to the relevant
part of the input? In \LUATEX\ there are several callbacks that operate
on the horizontal lists and we can use one of them to plug in this
function. However, in that case the function is applied to probably
more text than we want.

The solution for this is to assign attributes to the range of text
that such a function has to take care of. These attributes (there
can be many) travel with the nodes. This is also a reason why such
code normally is not written by end users, but by macropackage
writers: they need to provide the frameworks where you can plug in
code. In \CONTEXT\ we have several such mechanisms and therefore
in \MKIV\ this function looks (slightly stripped) as follows:

\starttyping
function cases.process(namespace,attribute,head)
    local done, actions = false, cases.actions
    for start in node.traverse_id(glyph,head) do
        local attr = has_attribute(start,attribute)
        if attr and attr > 0 then
            unset_attribute(start,attribute)
            local action = actions[attr]
            if action then
                local _, ok = action(start)
                done = done and ok
            end
        end
    end
    return head, done
end
\stoptyping

Here we check attributes (these are set at the \TEX\ end) and we have
all kind of actions that can be applied, depending on the value of the
attribute. Here the function that does the actual uppercasing
is defined somewhere else. The \type {cases} table provides us a
namespace; such namespaces needs to be coordinated by macro package
writers.

This approach means that the macro code looks completely different; in
pseudo code we get:

\starttyping
\def\Words#1{{<setattribute><cases><somevalue>#1}}
\stoptyping

Or alternatively:

\starttyping
\def\StartWords{\begingroup<setattribute><cases><somevalue>}
\def\StopWords {\endgroup}
\stoptyping

Because starting a paragraph with a group can have unwanted side
effects (like \type {\everypar} being expanded inside a group) a
variant is:

\starttyping
\def\StartWords{<setattribute><cases><somevalue>}
\def\StopWords {<resetattribute><cases>}
\stoptyping

So, what happens here is that the users sets an attribute using some high
level command, and at some point during the transformation of the input into
node lists, some action takes place. At that point commands, expansion and
whatever no longer can interfere.

In addition to some infrastructure, macro packages need to carry some
knowledge, just as with the \type {\uccode} used in \type {\uppercase}.
The \type {upper} function in the first example looks as follows:

\starttyping
local function upper(start)
    local data, char = characters.data, start.char
    if data[char] then
        local uc = data[char].uccode
        if uc and fonts.ids[start.font].characters[uc] then
            start.char = uc
            return true
        end
    end
    return false
end
\stoptyping

Such code is really macro package dependent: \LUATEX\ only
provides the means, not the solutions. In \CONTEXT\ we have
collected information about characters in a \type {data} table
in the \type {characters} namespace. There we have stored the
uppercase codes (\type {uccode}). The, again \CONTEXT\ specific,
\type {fonts} table keeps track of all defined fonts and before
we change the case, we make sure that this character is present
in the font. Here \type {id} is the number by which
\LUATEX\ keeps track of the used fonts. Each glyph node carries
such a reference.

In this example, eventually we end up with more code than in \TEX,
but the solution is much more robust. Just imagine what would happen
when in the \TEX\ solution we would have:

\starttyping
\Words{\framed[offset=3pt]{hello world}}
\stoptyping

It simply does not work. On the other hand, the \LUA\ code never
sees \TEX\ commands, it only sees the two words represented by
glyphs nodes and separated by glue.

Of course, there is a danger when we start opening \TEX's core
features. Currently macro packages know what to expect, they know
what \TEX\ can and cannot do. Of course macro writers have
exploited every corner of \TEX, even the dark ones. Where dirty
tricks in the \TEX book had an educational purpose, those of users
sometimes have obscene traits. If we just stick to the trickery
introduced for parsing input, converting this into that, doing
some calculations, and alike, it will be clear that \LUA\ is more
than welcome. It may hurt to throw away thousands of lines of
impressive code and replace it by a few lines of \LUA\ but that's
the price the user pays for abusing \TEX. Eventually \CONTEXT\ \MKIV\
will be a decent mix of \LUA\ and \TEX\ code, and hopefully the
solutions programmed in those languages are as clean as possible.

Of course we can discuss until eternity whether \LUA\ is the best
choice. Taco, Hartmut and I are pretty confident that it is, and
in the couple of years that we are working on \LUATEX\ nothing has proved
us wrong yet. We can fantasize about concepts, only to find out that
they are impossible to implement or hard to agree on; we just go
ahead using trial and error. We can talk over and over how opening up
should be done, which is what the team does in a nicely
closed and efficient loop, but at some points decisions have to be
made. Nothing is perfect, neither is \LUATEX, but most users won't
notice it as long as it extends \TEX's live and makes usage more
convenient.

Users of \TEX\ and \METAPOST\ will have noticed that both
languages have their own grouping (scope) model. In \TEX\ grouping
is focused on content: by grouping the macro writer (or author)
can limit the scope to a specific part of the text or keep certain
macros live within their own world.

\starttyping
.1. \bgroup .2. \egroup .1.
\stoptyping

Everything done at 2 is local unless explicitly told otherwise.
This means that users can write (and share) macros with a small
chance of clashes. In \METAPOST\ grouping is available too, but
variables explicitly need to be saved.

\starttyping
.1. begingroup ; save p ; path p ; .2. endgroup .1.
\stoptyping

After using \METAPOST\ for a while this feels quite natural
because an enforced local scope demands multiple return values
which is not part of the macro language. Actually, this is another
fundamental difference between the languages: \METAPOST\ has (a
kind of) functions, which \TEX\ lacks. In \METAPOST\ you can write

\starttyping
draw origin for i=1 upto 10 : .. (i,sin(i)) endfor ;
\stoptyping

but also:

\starttyping
draw some(0) for i=1 upto 10 : .. some(i) endfor ;
\stoptyping

with

\starttyping
vardef some (expr i) =
    if i > 4 : i = i - 4 fi ;
    (i,sin(i))
enddef ;
\stoptyping

The condition and assignment in no way interfere with the loop where
this function is called, as long as some value is returned (a pair in
this case).

In \TEX\ things work differently. Take this:

\starttyping
\count0=1
\message{\advance\count0 by 1 \the\count0}
\the\count0
\stoptyping

The terminal wil show:

\starttyping
\advance \count 0 by 1 1
\stoptyping

At the end the counter still has the value~1. There are quite some
situations like this, for instance when data like a table of
contents has to be written to a file. You cannot write macros where
such calculations are done and hidden and only the result is seen.

The nice thing about the way \LUA\ is presented to the user is that it
permits the following:

\starttyping
\count0=1
\message{\directlua0{tex.count[0] = tex.count[0] + 1}\the\count0}
\the\count0
\stoptyping

This will report~2 to the terminal and typeset a 2 in the
document. Of course this does not solve everything, but it is a
step forward. Also, compared to \TEX\ and \METAPOST, grouping is
done differently: there is a \type {local} prefix that makes
variables (and functions are variables too) local in modules,
functions, conditions, loops etc. The \LUA\ code in this story
contains such locals.

In practice most users will use a macro package and so, if a user
sees \TEX, he or she sees a user interface, not the code behind
it. As such, they will also not encounter the code written in
\LUA\ that deals with for instance fonts or node list
manipulations. If a user sees \LUA, it will most probably be in
processing actual data. Therefore, in the next section I will give an
example of two ways to deal with \XML: one more suitable for
traditional \TEX, and one inspired by \LUA. It demonstrates how
the availability of \LUA\ can result in different solutions for
the same problem.

\subject {an example: xml}

In \CONTEXT\ \MKII, the version that deals with \PDFTEX\ and \XETEX,
we use a stream based \XML\ parser, written in \TEX. Each \type {<}
and \type {&} triggers a macro that then parses the tag and/or entity.
This method is quite efficient in terms of memory but the associated
code is not simple because it has to deal with attributes, namespaces
and nesting.

The user interface is not that complex, but involves quite some
commands. Take for instance the following \XML\ snippet:

\starttyping
<document>
    <section>
        <title>Whatever</title>
        <p>some text</p>
        <p>some more</p>
    </section>
</document>
\stoptyping

When using \CONTEXT\ commands, we can imagine the following definitions:

\starttyping
\defineXMLenvironment[document]{\starttext} {\stoptext}
\defineXMLargument   [title]   {\section}
\defineXMLenvironment[p]       {\ignorespaces}{\par}
\stoptyping

When attributes have to be dealt with, for instance a reference to
this section, things quickly start looking more complex. Also,
users need to know what definitions to use in situations like this:

\starttyping
<table>
    <tr><td>first</td><td>...</td> <td>last</td></tr>
    <tr><td>left</td><td>...</td> <td>right</td></tr>
</table>
\stoptyping

Here we cannot be sure if a cell does not contain a nested table,
which is why we need to define the mapping as follows:

\starttyping
\defineXMLnested[table]{\bTABLE} {\eTABLE}
\defineXMLnested[tr]   {\bTR}    {\eTR}
\defineXMLnested[td]   {\bTD}    {\eTD}
\stoptyping

The \type {\defineXMLnested} macro is rather messy because it has
to collect snippets and keep track of the nesting level, but users
don't see that code, they just need to know when to use what
macro. Once it works, it keeps working.

Unfortunately mappings from source to style are never that simple
in real life. We usually need to collect, filter and relocate
data. Of course this can be done before feeding the source to
\TEX, but \MKII\ provides a few mechanisms for that too. If for
instance you want to reverse the order you can do this:

\starttyping
<article>
    <title>Whatever</title>
    <author>Someone</author>
    <p>some text</p>
</article>
\stoptyping

\starttyping
\defineXMLenvironment[article]
    {\defineXMLsave[author]}
    {\blank author: \XMLflush{author}}
\stoptyping

This will save the content of the \type {author} element and flush
it when the end tag \type {article} is seen. So, given previous
definitions, we will get the title, some text and then the author.
You may argue that instead we should use for instance \XSLT\ but
even then a mapping is needed from the \XML\ to \TEX, and it's a
matter of taste where the burden is put.

Because \CONTEXT\ also wants to support standards like
\MATHML, there are some more mechanisms but these are hidden from
the user. And although these do a good job in most cases, the code
associated with the solutions has never been satisfying.

Supporting \XML\ this way is doable, and \CONTEXT\ has used this method
for many years in fairly complex situations. However, now that we
have \LUA\ available, it is possible to see if some things can be done
simpler (or differently).

After some experimenting I decided to write a full blown \XML\
parser in \LUA, but contrary to the stream based approach, this
time the whole tree is loaded in memory. Although this uses more
memory than a streaming solution, in practice the difference is
not significant because often in \MKII\ we also needed to store
whole chunks.

Loading \XML\ files in memory is real fast and once it is done we
can have access to the elements in a way similar to \XPATH. We can
selectively pipe data to \TEX\ and manipulate content using \TEX\
or \LUA. In most cases this is faster than the stream|-|based
method. Interesting is that we can do this without linking to
existing \XML\ libraries, and as a result we are pretty
independent.

So how does this look from the perspective of the user? Say that
we have the simple article definition stored in \type {demo.xml}.

\starttyping
<?xml version ='1.0'?>
<article>
    <title>Whatever</title>
    <author>Someone</author>
    <p>some text</p>
</article>
\stoptyping

This time we associate so called setups with the elements. Each
element can have its own setup, and we can use expressions to
assign them. Here we have just one such setup:

\starttyping
\startxmlsetups xml:document
   \xmlsetsetup{main}{article}{xml:article}
\stopxmlsetups
\stoptyping

When loading the document it will automatically be associated with the tag \type
{main}. The previous rule associates setup \type {xml:article}
with the \type {article} element in tree \type {main}. We need to
register this setup so that it will be applied to the document
after loading:

\starttyping
\xmlregistersetup{xml:document}
\stoptyping

and the document itself is processed with:

\starttyping
\xmlprocessfile{main}{demo.xml}{} % optional setup
\stoptyping

The setup \type {xml:article} can look as follows:

\starttyping
\startxmlsetups xml:article
    \section{\xmltext{#1}{/title}}
    \xmlall{#1}{!(title|author)}
    \blank author: \xmltext{#1}{/author}
\stopxmlsetups
\stoptyping

Here \type {#1} refers to the current node in the \XML\ tree, in
this case the root element, \type {article}. The second argument
of \type {\xmltext} and \type {\xmlall} is a path expression,
comparable with \XPATH: \type {/title} means: the \type {title}
element anchored to the current root (\type{#1}), and \type
{!(title|author)} is the negation of (complement to) \type{title}
or \type {author}. Such expressions can be more complex that the
one above, like:

\starttyping
\xmlfirst{#1}{/one/(alpha|beta)/two/text()}
\stoptyping

which returns the content of the first element that satisfies one of
the paths (nested tree):

\starttyping
/one/alpha/two
/one/beta/two
\stoptyping

There is a whole bunch of commands like \type {\xmltext} that
filter content and pipe it into \TEX. These are calling \LUA\
functions. This is no manual, so we will not discuss them here.
However, it is important to realize that we have to associate
setups (consider them free formatted macros) to at least one
element in order to get started. Also, \XML\ inclusions have to be
dealt with before assigning the setups. These are simple
one|-|line commands. You can also assign defaults to elements,
which saves some work.

Because we can use \LUA\ to access the tree and manipulate
content, we can now implement parts of \XML\ handling in \LUA. An
example of this is dealing with so|-|called Cals tables. This is
done in approximately 150 lines of \LUA\ code, loaded at runtime in a
module. This time the association uses functions instead of setups and those
functions will pipe data back to \TEX. In the module you will find:

\starttyping
\startxmlsetups xml:cals:process
    \xmlsetfunction {\xmldocument} {cals:table} {lxml.cals.table}
\stopxmlsetups

\xmlregistersetup{xml:cals:process}

\xmlregisterns{cals}{cals}
\stoptyping

These commands tell \MKIV\ that elements with a namespace
specification that contains \type {cals} will be remapped to the
internal namespace \type {cals} and the setup associates a
function with this internal namespace.

By now it will be clear that from the perspective of the user
hardly any \LUA\ is visible. Sure, he or she can deduce that deep
down some magic takes place, especially when you run into more
complex expressions like this (the \type {@} denotes an
attribute):

\starttyping
\xmlsetsetup
    {main} {item[@type='mpctext' or @type='mrtext']}
    {questions:multiple:text}
\stoptyping

Such expressions resemble \XPATH, but can go much further than
that, just by adding more functions to the library.

\starttyping
b[position() > 2 and position() < 5 and text() == 'ok']
b[position() > 2 and position() < 5 and text() == upper('ok')]
b[@n=='03' or @n=='08']
b[number(@n)>2 and number(@n)<6]
b[find(text(),'ALSO')]
\stoptyping

Just to give you an idea \unknown\ in the module that implements
the parser you will find definitions that match the function calls
in the above expressions.

\starttyping
xml.functions.find   = string.find
xml.functions.upper  = string.upper
xml.functions.number = tonumber
\stoptyping

So much for the different approaches. It's up to the user what
method to use: stream based \MKII, tree based \MKIV, or a mixture.

The main reason for taking \XML\ as an example of mixing \TEX\ and
\LUA\ is in that it can be a bit mind boggling if you start
thinking of what happens behind the screens. Say that we have

\starttyping
<?xml version ='1.0'?>
<article>
    <title>Whatever</title>
    <author>Someone</author>
    <p>some <b>bold</b> text</p>
</article>
\stoptyping

and that we use the setup shown before with \type {article}.

At some point, we are done with defining setups and load the
document. The first thing that happens is that the list of
manipulations is applied: file inclusions are processed first,
setups and functions are assigned next, maybe some elements are
deleted or added, etc. When that is done we serialize the tree to
\TEX, starting with the root element. When piping data to \TEX\ we
use the current catcode regime; linebreaks and spaces are honored
as usual.

Each element can have a function (command) associated and when
this is the case, control is given to that function. In our case
the root element has such a command, one that will trigger a
setup. And so, instead of piping content to \TEX, a function is
called that lets \TEX\ expand the macro that deals with this
setup.

However, that setup itself calls \LUA\ code that filters the title
and feeds it into the \type {\section} command, next it flushes
everything except the title and author, which again involves
calling \LUA. Last it flushes the author. The nested sequence
of events is as follows:

\startitemize[2*broad]

    \sym{lua:} Load the document and apply setups and alike.

    \sym{lua:} Serialize the \type {article} element, but since
    there is an associated setup, tell \TEX\ do expand that one
    instead.

    \startitemize[2*broad]

        \sym{tex:} Execute the setup, first expand the \type {\section}
        macro, but its argument is a call to \LUA.

        \startitemize[2*broad]

            \sym{lua:} Filter \type {title} from the subtree under
            \type {article}, print the content to \TEX\ and return
            control to \TEX.

        \stopitemize

        \sym{tex:} Tell \LUA\ to filter the paragraphs i.e.\ skip \type
        {title} and \type {author}; since the \type {b} element has
        no associated setup (or whatever) it is just serialized.

        \startitemize[2*broad]

            \sym{lua:} Filter the requested elements and return control
            to \TEX.

        \stopitemize

        \sym{tex:} Ask \LUA\ to filter \type {author}.

        \startitemize[2*broad]
            \sym{lua:} Pipe \type {author}'s content to \TEX.
        \stopitemize

        \sym{tex:} We're done.

    \stopitemize

    \sym{lua:} We're done.

\stopitemize

This is a really simple case. In my daily work I am dealing
with rather extensive and complex educational documents where in
one source there is text, math, graphics, all kind of fancy stuff,
questions and answers in several categories and of different kinds,
either or not to be reshuffled, omitted or combined. So there
we are talking about many more levels of \TEX\ calling \LUA\ and \LUA\
piping to \TEX\ etc. To stay in \TEX\ speak: we're dealing with
one big ongoing nested expansion (because \LUA calls expand), and
you can imagine that this somewhat stresses \TEX's input stack, but
so far I have not encountered any problems.

\subject{some remarks}

Here I discussed several possible applications of \LUA\ in \TEX. I
didn't mention yet that because \LUATEX\ contains a scripting engine
plus some extra libraries, it can also be used purely for that.
This means that support programs can now be written in \LUA\ and
that there are no longer dependencies of other scripting engines
being present on the system. Consider this a bonus.

Usage in \TEX\ can be organized in four categories:

\startitemize[n]
\item  Users can use \LUA\ for generating data, do all kind of
       data manipulations, maybe read data from file, etc. The
       only link with \TEX\ is the print function.
\item  Users can use information provided by \TEX\ and use this
       when making decisions. An example is collecting data in
       boxes and use \LUA\ to do calculations with the dimensions.
       Another example is a converter from \METAPOST\ output to
       \PDF\ literals. No real knowledge of \TEX's internals is
       needed. The \MKIV\ \XML\ functionality discussed before
       demonstrates this: it's mostly data processing and piping
       to \TEX. Other examples are dealing with buffers, defining
       character mappings, and handling error messages, verbatim
       \unknown\ the list is long.
\item  Users can extend \TEX's core functionality. An example is
       support for \OPENTYPE\ fonts: \LUATEX\ itself does not
       support this format directly, but provides ways to feed
       \TEX\ with the relevant information. Support for \OPENTYPE\
       features demands manipulating node lists. Knowledge of
       internals is a requirement. Advanced spacing and language
       specific features are made possible by node list
       manipulations and attributes. The alternative \type {\Words}
       macro is an example of this.
\item  Users can replace existing \TEX\ functionality. In \MKIV\
       there are numerous example of this, for instance all file
       \IO\ is written in \LUA, including reading from \ZIP\ files
       and remote locations. Loading and defining fonts is also
       under \LUA\ control. At some point \MKIV\ will provide
       dedicated splitters for multicolumn typesetting and
       probably also better display spacing and display
       math splitting.
\stopitemize

The boundaries between these categories are not frozen. For
instance, support for image inclusion and \MPLIB\ in \CONTEXT\
\MKIV\ sits between category 3 and~4. Category 3 and~4, and
probably also~2 are normally the domain of macro package writers
and more advanced users who contribute to macro packages. Because
a macropackage has to provide some stability it is not a good idea
to let users mess around with all those internals, because of
potential interference. On the other hand, normally users operate
on top of a kernel using some kind of \API\ and history has
proved that macro packages are stable enough for this.

Sometime around 2010 the team expects \LUATEX\ to be feature
complete and stable. By that time I can probably provide a more
detailed categorization.

\stopcomponent