% language=us runpath=texruns:manuals/luametatex

\environment luametatex-style

\startcomponent luametatex-modifications

\startchapter[reference=modifications,title={The original engines}]

\startsection[title=The merged engines]

\startsubsection[title=The rationale]

\topicindex {engines}
\topicindex {history}

The first version of \LUATEX, made by Hartmut after we discussed the possibility
of an extension language, only had a few extra primitives and it was largely the
same as \PDFTEX. It was presented to the public in 2005. As part of the Oriental
\TEX\ project, Taco merged some parts of \ALEPH\ into the code and some more
primitives were added. Then we started more fundamental experiments. After many
years, when the engine had become more stable, the decision was made to clean up
the rather hybrid nature of the program. This means that some primitives were
promoted to core primitives, often with a different name, and that others were
removed. This also made it possible to start cleaning up the code base, which
showed decades of stepwise additions to original \TEX. In \in {chapter}
[enhancements] we discuss some new primitives, here we will cover most of the
adapted ones.

During more than a decade stepwise new functionality was added and after 10 years
the more of less stable version 1.0 was presented. But we continued and after
some 15 years the \LUAMETATEX\ follow up entered its first testing stage. But
before details about the engine are discussed in successive chapters, we first
summarize where we started from. Keep in mind that in \LUAMETATEX\ we have a bit
less than in \LUATEX, so this section differs from the one in the \LUATEX\
manual.

Besides the expected changes caused by new functionality, there are a number of
not|-|so|-|expected changes. These are sometimes a side|-|effect of a new
(conflicting) feature, or, more often than not, a change necessary to clean up
the internal interfaces. These will also be mentioned.

Again we stress that {\em this is not a \TEX\ manual, nor a tutorial}. If you are
unfamiliar with \TEX\ first play a little with a macro package, take a look at
the \TEX\ book, make yourself familiar with the concepts and macro language. That
will likely take days and not hours. Also, many of the new concepts introduced in
\LUATEX\ and \LUAMETATEX\ are explained in documents that come with the \CONTEXT\
distribution, articles and presentations. It doesn't pay of to repeat that here,
especially not in a time when users often search instead of read from cover to
cover.

\stopsubsection

\startsubsection[title=Changes from \TEX\ 3.1415926...]

\topicindex {\TEX}

Of course it all starts with traditional \TEX. Even if we started with \PDFTEX,
most still comes from original Knuthian \TEX. But we divert a bit.

\startitemize

\startitem
    The current code base is written in \CCODE, not \PASCAL. The original \WEB\
    documentation is kept when possible and not wrapped in tagged comments. As a
    consequence instead of one large file plus change files, we now have multiple
    files organized in categories like \type {tex}, \type {lua}, \type
    {languages}, \type {fonts}, \type {libraries}, etc. There are some artifacts
    of the conversion to \CCODE, but these got (and get) removed stepwise. The
    documentation, which actually comes from the mix of engines (via so called
    change files), is a mix of what authors of the engines wove into the source,
    and most is of course from Don Knuths original. In \LUAMETATEX\ we try to
    stay as close as possible to the original so that the documentation of the
    fundamentals behind \TEX\ by Don Knuth still applies. However, because we use
    \CCODE, some documentation is a bit off. Also, most global variables are now
    collected in structures, but the original names and level of abstraction were
    mostly kept. On the other hand, opening up had its impact on the code, so
    that makes some documentation a bit off too. Adapting that all will take time.
\stopitem

\startitem
    See \in {chapter} [languages] for many small changes related to paragraph
    building, language handling and hyphenation. The most important change is
    that adding a brace group in the middle of a word (like in \type {of{}fice})
    does not prevent ligature creation. Also, the hyphenation, ligature building
    and kerning has been split so that we can hook in alternative or extra code
    wherever we like. There are various options to control discretionary
    injection and related penalties are now integrated in these nodes. Language
    information is now bound to glyphs. The number of languages in \LUAMETATEX\
    is smaller than in \LUATEX. Control over discretionaries is more granular and
    now managed by less variables.
\stopitem

\startitem
    There is no pool file, all strings are embedded during compilation. This also
    removed some memory constraints. We kept token and node memory management
    because it is convenient and efficient but parts were reimplemented in order
    to remove some constraints. Token memory management is largely the same. All
    the other large memory structures, like those related to nesting, the save
    stack, input levels, the hash table and table of equivalents, etc. now all
    start out small and are enlarged when needed, where maxima are controlled in
    the usual way. In principle the initial memory footprint is smaller while at
    the same time we can go real large. Because we have wide memory words some
    data (arrays) used for housekeeping could be reorganized a bit.
\stopitem

\startitem
    The specifier \type {plus 1 fillll} does not generate an error. The extra
    \quote {l} is simply typeset.
\stopitem

\startitem
    The upper limit to \prm {endlinechar} and \prm {newlinechar} is 127.
\stopitem

\startitem
    Because the backend is not built|-|in, the magnification (\tex {mag})
    primitive is gone. A \tex {shipout} command just discards the content of the
    given box. The write related primitives have to be implemented in the used
    macro package using \LUA. None of the \PDFTEX\ derived primitives is present.
\stopitem

\startitem
    Because there is no font loader, a \LUA\ variant is free to either support or
    not the \OMEGA\ \type {ofm} file format. As there are hardly any such fonts
    it probably makes no sense. There is plenty of control over the way glyphs
    get treated and scaling of fonts and glyphs is also more dynamic.
\stopitem

\startitem
    There is more control over some (formerly hard|-|coded) math properties. In
    fact, there is a whole extra bit of math related code because we need to deal
    with \OPENTYPE\ fonts. The math processing has been adapted to the new
    (dynamic) font and glyph scaling features.
\stopitem

\startitem
    The \type {\outer} and \type {\long} prefixed are silently ignored. It is
    permitted to use \type {\par} in math.
\stopitem

\startitem
    The lack of a backend means that some primitives related to it are not
    implemented. This is no big deal because it is possible to use the scanner
    library to implement them as needed, which depends on the macro package and
    backend.
\stopitem

\startitem
    The math style related primitives can use numbers as well as symbolic names.
    There is some more (control over) math anyway, which is a side effect of
    supporting \OPENTYPE\ math.
\stopitem

\stopitemize

\stopsubsection

\startsubsection[title=Changes from \ETEX\ 2.2]

\topicindex {\ETEX}

Being the de|-|facto standard extension of course we provide the \ETEX\
features, but with a few small adaptations.

\startitemize

\startitem
    The \ETEX\ functionality is always present and enabled so the prepended
    asterisk or \type {-etex} switch for \INITEX\ is not needed.
\stopitem

\startitem
    The \TEXXET\ extension is not present, so the primitives \type
    {\TeXXeTstate}, \type {\beginR}, \type {\beginL}, \type {\endR} and \type
    {\endL} are missing. Instead we used the \OMEGA|/|\ALEPH\ approach to
    directionality as starting point, albeit it has been changed quite a bit,
    so that we're probably not that far from \TEXXET.
\stopitem

\startitem
    Some of the tracing information that is output by \ETEX's \prm
    {tracingassigns} and \prm {tracingrestores} is not there. Also keep in mind
    that tracing doesn't involve what \LUA\ does.
\stopitem

\startitem
    Register management in \LUAMETATEX\ uses the \OMEGA|/|\ALEPH\ model, so the
    maximum value is 65535 and the implementation uses a flat array instead of
    the mixed flat & sparse model from \ETEX.
\stopitem

\startitem
    Because we have more nodes, conditionals, etc.\ the \ETEX\ status related
    variables are adapted to \LUAMETATEX: we use different \quote {constants},
    but that should be no problem because any sane macro package uses
    abstraction.
\stopitem

\startitem
    The \type {\scantokens} primitive is now using the same mechanism as \LUA\
    print|-|to|-|\TEX\ uses, which simplifies the code. There is a little
    performance hit but it will not be noticed in \CONTEXT, because we never use
    this primitive.
\stopitem

\startitem
    Because we don't use change files on top of original \TEX, the integration of
    \ETEX\ functionality is bit more natural, code wise.
\stopitem

\startitem
    The \tex {readline} primitive has to be implemented in \LUA. This is a side
    effect of delegating all file \IO.
\stopitem

\startitem
    Most of the code is rewritten but the original primitives are still tagged as
    coming from \ETEX.
\stopitem

\stopitemize

\stopsubsection

\startsubsection[title=Changes from \PDFTEX\ 1.40]

\topicindex {\PDFTEX}

Because we want to produce \PDF\ the most natural starting point was the popular
\PDFTEX\ program. We inherit the stable features, dropped most of the
experimental code and promoted some functionality to core \LUATEX\ functionality
which in turn triggered renaming primitives. However, as the backend was dropped,
not that much from \PDFTEX\ is present any more. Basically all we now inherit
from \PDFTEX\ is expansion and protrusion but even that has been adapted. So
don't expect \LUAMETATEX\ to be compatible.

\startitemize

\startitem
    The experimental primitives \prm {ifabsnum} and \prm {ifabsdim} have been
    promoted to core primitives.
\stopitem

\startitem
    The primitives \prm {ifincsname}, \prm {expanded} and \prm {quitvmode}
    have become core primitives.
\stopitem

\startitem
    As the hz (expansion) and protrusion mechanism are part of the core the
    related primitives \prm {lpcode}, \prm {rpcode}, \prm {efcode}, \prm
    {leftmarginkern}, \prm {rightmarginkern} are promoted to core primitives. The
    two commands \prm {protrudechars} and \prm {adjustspacing} control these
    processes. The protrusion and kern related primitives are now dimensions
    while expansion is still one of these 1000 based scales.
\stopitem

\startitem
    In \LUAMETATEX\ three extra primitives can be used to overload the font
    specific settings: \prm {adjustspacingstep} (max: 100), \prm
    {adjustspacingstretch} (max: 1000) and \prm {adjustspacingshrink} (max: 500).
\stopitem

\startitem
    The hz optimization code has been partially redone so that we no longer need
    to create extra font instances. The front- and backend have been decoupled
    and the glyph and kern nodes carry the used values. In \LUATEX\ that made a
    more efficient generation of \PDF\ code possible. It also resulted in much
    cleaner code. The backend code is gone, but of course the information is
    still carried around.
\stopitem

\startitem
    When \prm {adjustspacing} has value~2, hz optimization will be applied to
    glyphs and kerns. When the value is~3, only glyphs will be treated. A value
    smaller than~2 disables this feature.
\stopitem

\startitem
    When \prm {protrudechars} has a value larger than zero characters at the edge
    of a line can be made to hang out. A value of~2 will take the protrusion into
    account when breaking a paragraph into lines. A value of~3 will try to deal
    with right|-|to|-|left rendering; this is a still experimental feature.
\stopitem

\startitem
    The pixel multiplier dimension \prm {pxdimen} has be inherited as core
    primitive.
\stopitem

\startitem
    The primitive \prm {tracingfonts} is now a core primitive but doesn't relate
    to the backend.
\stopitem

\stopitemize

\stopsubsection

\startsubsection[title=Changes from \ALEPH\ RC4]

\topicindex {\ALEPH}

In \LUATEX\ we took the 32 bit aspects and much of the directional mechanisms and
merged it into the \PDFTEX\ code base as starting point for further development.
Then we simplified directionality, fixed it and opened it up. In \LUAMETATEX\ not
that much of the later is left. We only have two horizontal directions. Instead
of vertical directions we introduce an orientation model bound to boxes.

The already reduced|-|to|-|four set of directions now only has two members:
left|-|to|-|right and right|-|to|-|left. They don't do much as it is the backend
that has to deal with them. When paragraphs are constructed a change in
horizontal direction is irrelevant for calculating the dimensions. So, basically
most that we do is registering state and passing that on till the backend can do
something with it.

Here is a summary of inherited functionality:

\startitemize

\startitem
    The \type {^^} notation has been extended: after \type {^^^^} four
    hexadecimal characters are expected and after \type {^^^^^^} six hexadecimal
    characters have to be given. The original \TEX\ interpretation is still valid
    for the \type {^^} case but the four and six variants do no backtracking,
    i.e.\ when they are not followed by the right number of hexadecimal digits
    they issue an error message. Because \type {^^^} is a normal \TEX\ case, we
    don't support the odd number of \type {^^^^^} either.
\stopitem

\startitem
    Glues {\it immediately after} direction change commands are not legal
    breakpoints. There is a bit more sanity testing for the direction state. This
    can be configured.
\stopitem

\startitem
    The placement of math formula numbers is direction aware and adapts
    accordingly. Boxes carry directional information but rules don't.
\stopitem

\startitem
    There are no direction related primitives for page and body directions. The
    paragraph, text and math directions are specified using primitives that
    take a number. The three letter codes are dropped.
\stopitem

\stopitemize

\stopsubsection

\startsubsection[title=Changes from standard \WEBC]

\topicindex {\WEBC}

The \LUAMETATEX\ codebase is not dependent on the \WEBC\ framework. The
interaction with the file system and \TDS\ is up to \LUA. There still might be
traces but eventually the code base should be lean and mean. The \METAPOST\
library is coded in \CWEB\ and in order to be independent from related tools,
conversion to \CCODE\ is done with a \LUA\ script ran by, surprise, \LUAMETATEX.

\stopsubsection

\stopsection

\startsection[title=Implementation notes]

\startsubsection[title=Memory allocation]

\topicindex {memory}

The single internal memory heap that traditional \TEX\ used for tokens and nodes
is split into two separate arrays. Each of these will grow dynamically when
needed. Internally a token or node is an index into these arrays. This permits
for an efficient implementation and is also responsible for the performance of
the core. All other data structures are mostly the same but managed dynamically
too. Because we operate in a 64 bit world, the parallel table of equivalents
needed for managing levels, is gone. Anyhow, the original documentation in \TEX\
The Program mostly applies!

\stopsubsection

\startsubsection[title=Sparse arrays]

The \prm {mathcode}, \prm {delcode}, \prm {catcode}, \prm {sfcode}, \prm {lccode}
and \prm {uccode} (and the new \prm {hjcode}) tables are now sparse arrays that
are implemented in~\CCODE. They are no longer part of the \TEX\ \quote
{equivalence table} and because each had 1.1 million entries with a few memory
words each, this makes a major difference in memory usage. Performance is not
really hurt by this.

The \prm {catcode}, \prm {sfcode}, \prm {lccode}, \prm {uccode} and \prm {hjcode}
assignments don't show up when using the \ETEX\ tracing routines \prm
{tracingassigns} and \prm {tracingrestores} but we don't see that as a real
limitation. It also saves a lot of clutter.

The glyph ids within a font are also managed by means of a sparse array as glyph
ids can go up to index $2^{21}-1$ but these are never accessed directly so again
users will not notice this.

\stopsubsection

\startsubsection[title=Simple single|-|character csnames]

\topicindex {csnames}

Single|-|character commands are no longer treated specially in the internals,
they are stored in the hash just like the multiletter control sequences. This is
a side effect of going \UNICODE\ and \UTF. Where using 256 slots in an array add
no burden supporting the whole \UNICODE\ range is a waste of space. Therefore,
also active characters are internally implemented as a special type of
multi|-|letter control sequences that uses a prefix that is otherwise impossible
to obtain.

The code that displays control sequences explicitly checks if the length is one
when it has to decide whether or not to add a trailing space.

\stopsubsection

\startsubsection[title=Binary file reading]

\topicindex {files+binary}

All input now goes via \LUA: files loaded with \type {\input} as well as files
that are opened with \type {\openin}. Actually the later has to be implemented
in terms of macros and \LUA\ calls. This also means that compared to \LUATEX\
the internal handling of input has been changed but users won't notice that.

Setting a callback is expected now. Although reading input natively using \type
{getc} calls is more efficient, we now fetch lines from \LUA, put them in a
buffer and then pick successive bytes (keep in mind that we read \UTF) from that.
The performance is quite ok, also because \LUA\ is fast, todays operating systems
cache, and storage media have become very fast. Also, \TEX\ is spending more time
messing around with what it has input than actually reading input.

\stopsubsection

\startsubsection[title=Tabs and spaces]

\topicindex {space}
\topicindex {newline}

We conform to the way other \TEX\ engines handle trailing tabs and spaces. For
decades trailing tabs and spaces (before a newline) were removed from the input
but this behaviour was changed in September 2017 to only handle spaces. We are
aware that this can introduce compatibility issues in existing workflows but
because we don't want too many differences with upstream \TEXLIVE\ we just follow
up on that patch (which is a functional one and not really a fix). It is up to
macro packages maintainers to deal with possible compatibility issues and in
\LUAMETATEX\ they can do so via the callbacks that deal with reading from files.

The previous behaviour was a known side effect and (as that kind of input
normally comes from generated sources) it was normally dealt with by adding a
comment token to the line in case the spaces and|/|or tabs were intentional and
to be kept. We are aware of the fact that this contradicts some of our other
choices but consistency with other engines. We still stick to our view that at
the log level we can (and might be) more incompatible. We already expose some
more details anyway.

\stopsubsection

\startsubsection[title=Logging]

When detailed logging is enabled more detail is output with respect to what nodes
are involved. This is a side effect of the core nodes having more detailed
subtype information. The benefit of more detail wins from any wish to be byte
compatible in the logging. One can always write additional logging in \LUA.

The information that goes into the log file can be different from \LUATEX, and
might even differ a bit more in the future. The main reason is that inside the
engine we have more granularity, which for instance means that we output subtype
and attribute related information when nodes are printed. Of course we could have
offered a compatibility mode but it serves no purpose. Over time there have been
many subtle changes to control logs in the \TEX\ ecosystems so another one is
bearable.

In a similar fashion, there is a bit different behaviour when \TEX\ expects
input, which in turn is a side effect of removing the interception of \type {*}
and \type {&} which made for cleaner code (quite a bit had accumulated as side
effect of continuous adaptations in the \TEX\ ecosystems). There was already code
that was never executed, simply as side effect of the way \LUATEX\ initializes
itself (one needs to enable classes of primitives for instance). Keep in mind
that over time system dependencies have been handles with \TEX\ change files, the
\WEBC\ infrastructure, \KPSE\ features, compilation variables and flags, etc. In
\LUAMETATEX\ we try to minimize all that.

When it became unavoidable that we output more detail, it also became clear that
it made no sense to stay log and trace compatible. Some is controlled by
parameters in order to stay close the original, but \CONTEXT\ is configured such
that we benefit from the new possibilities. Examples are that in addition to
\type {\meaning} we have \type {\meaningfull} that also exposes macro properties,
and \type {\meaningless} that only exposes the body. The \type {\untraced} prefix
will suppress some in the log, and we set \type {\tracinglevels} to 3 in order to
get details about the input and grouping level. When there's less shown than
expected keep in mind that \LUAMETATEX\ has a somewhat optimized saving and
restoring of meanings so less can happen which is reflected in tracing. When node
lists are serialized (as with \type {\showbox}) some nodes, like discretionaries
report more detail. The compact serializer, used for instance to signal overfull
boxes, also shows a bit more detail with respect to non|-|content nodes. I math
more is shown if only because we have more control and additional mechanisms.

\stopsubsection

\startsubsection[title=Parsing]

Token parsers have been upgraded for the sake of \LUA, \type {\csname} handling
has been extended, macro definitions can be more flexible so there code was
adapted, more conditionals also brought some changes. But we build upon the
(reorganized) \TEX\ foundation so the basics can definitely be recognized.

Because of interfacing in \LUA\ the internal token and node organization has
been normalized (read: we cannot cheat because all is kind of visible). On
the one hand this can come with a performance penalty but that is more than
compensated by extensions, optimized parsers and such. Still the fact that we
are \UTF\ based (32 bit) makes the machinery slower than the 8~bit original.
The reworked \LUAMETATEX\ engine is substantially faster than the \LUATEX\
predecessor.

The handling of conditionals has been adapted so that we can have flatter
branches (\type {\orelse} cum suis). This again has some consequences for
parsing. Because parsing alignments is rather interwoven in general parsing and
expansion the handling of related primitives has been slightly adapted (also for
the sake of \LUA\ interfacing) and dealing with \type {\noalign} situations is a
bit more convenient.

This are just a few of the adaptations and most of this happened stepwise with
testing in the \CONTEXT\ code base. It will be clear that \LUAMETATEX\ is a quite
different extension to the original. You're warned.

\stopsubsection

\stopsection

\stopchapter

\stopcomponent