% language=us

\environment evenmore-style

\startcomponent evenmore-keywords

\startchapter[title=Keywords]

Some primitives in \TEX\ can take one or more optional keywords and|/|or keywords
followed by one or more values. In traditional \TEX\ this concerns a handful of
primitives, in \PDFTEX\ there are plenty of backend related primitives, \LUATEX\
introduced optional keywords to some math constructs and attributes to boxes,
while \LUAMETATEX\ adds some more too. The keyword scanner in \TEX\ is kind of
special. Keywords are used in cases like:

\starttyping
\hbox spread 10cm {...}
\advance\scratchcounter by 10
\vrule width 3cm height 1ex
\stoptyping

Sometimes there are multiple keywords, as with rules, in which case you can
imagine use cases like:

\starttyping
\vrule width 3cm depth 1ex width 10cm depth 0ex height 1ex\relax
\stoptyping

Here we add a \type {\relax} to end the scanning. If we don't do that and the
rule specification is followed by arbitrary (read: unpredictable) text, the next
word can just as well be a valid keyword. When that word is followed by a
dimension (unlikely) the engine will happily take it as a directive, and when it
is not followed by a dimension an error message will show up. Sometimes the
scanning is more restricted, as with glue, where the optional \type {plus} and
\type {minus} have to come in that order. But when they are missing, again a word
from the text can be picked up if one doesn't explicitly end with a \type
{\relax} or some other irrelevant token.
\starttyping
\scratchskip = 10pt plus 10pt minus 10pt % okay
\scratchskip = 10pt plus 10pt            % okay
\scratchskip = 10pt minus 10pt           % okay
\scratchskip = 10pt minus 10pt plus 10pt % typesets "plus 10pt"
\scratchskip = 10pt plus whatever        % an error
\stoptyping

The scanner is case insensitive, so the following specifications are all valid:

\starttyping
\hbox To 10cm {To}
\hbox TO 10cm {TO}
\hbox tO 10cm {tO}
\hbox to 10cm {to}
\stoptyping

It happens that keywords are always simple English words, so the engine uses a
cheap check deep down, just offsetting to uppercase. Of course that will not work
for arbitrary \UTF\ (as used in \LUATEX) and it's also unrelated to the upper-
and lowercase codes as \TEX\ knows them.

The above lines scan for the keyword \type {to} and after that for a dimension.
Where keyword scanning is case tolerant, dimension scanning is period tolerant:

\starttyping
\hbox to 10cm   {10cm}
\hbox to 10.0cm {10.0cm}
\hbox to .0cm   {.0cm}
\hbox to .cm    {.cm}
\hbox to 10.cm  {10.cm}
\stoptyping

These are all valid and according to the specification; even the single period
one is okay, although it looks funny. It would not be hard to intercept that, but
I guess that when \TEX\ was written anything that could harm performance was
taken into account, and the above is quite okay. One can even argue for cases
like:

\starttyping
\hbox to \first.\second cm {.cm}
\stoptyping

Here \type {\first} and|/|or \type {\second} can be empty. Most users won't
notice these side effects of scanning numbers anyway.

The reason for even spending words on keywords is the following. Optional keyword
scanning is kind of costly, not so much now, but more so decades ago. For
instance, in the first line below, there is no keyword. The scanner sees a \type
{1} and, because that is not the start of a keyword, pushes that character back
into the input.
\starttyping
\advance\scratchcounter 10
\advance\scratchcounter by 10
\stoptyping

In the case of:

\starttyping
\scratchskip 10pt plux
\stoptyping

it has to push back the four scanned tokens \type {plux}. Now, in the engine
there are lots of cases where lookahead happens and when a condition is not
satisfied, the token that was just read is pushed back. Incidentally, when
picking up the next token triggered some expansion, it's not the original next
token that gets pushed back, but the first token seen after the expansion.
Pushing back tokens is not that inefficient, although it involves allocating a
token and pushing and popping input stacks (we're talking of a mix of reading
from file, token memory, \LUA\ prints, etc.) but it always takes a little time
and memory. In \LUATEX\ there are more keywords for boxes, and there we have
loops too: in a box specification one or more optional attributes are scanned
before the optional \type {to} or \type {spread}, so again there can be push back
when no more \type {attr} are seen.

\starttyping
\hbox attr 1 98 attr 2 99 to 1cm{...}
\stoptyping

In \LUAMETATEX\ there is even more optional keyword scanning, but we leave that
for now and just show one example:

\starttyping
\hbox spread 10em {\hss
    \hbox orientation 0 yoffset  1mm to 2em {up}\hss
    \hbox                            to 2em {here}\hss
    \hbox orientation 0 xoffset -1mm to 2em {down}\hss
}
\stoptyping

Although one cannot mess too much with these low level scanners, there was room
for some optimization, so the penalty we pay for more keyword scanning in
\LUAMETATEX\ is not that high. In fact, I often manage to compensate for features
that have a possible performance hit with some gain elsewhere.

Anyway, it will be no surprise that there can be interesting side effects to
keyword scanning. For instance, using the two character keyword \type {by} in an
advance can be more efficient because nothing needs to be pushed back.
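
The remark above about expansion during lookahead can be made visible with a
small sketch. The macro names \type {\MyStretch} and \type {\MyWords} are made up
for this illustration; the behaviour shown is that of the traditional scanner,
which expands macros while it looks for \type {plus}:

\starttyping
\def\MyStretch{plus 2pt}    % hypothetical helper that supplies a keyword
\def\MyWords  {plain words} % hypothetical helper that only looks like one

% \MyStretch is expanded while scanning for "plus", so the keyword is found:
\scratchskip 10pt \MyStretch\relax
\the\scratchskip % 10.0pt plus 2.0pt

% \MyWords is expanded too: "pl" matches, "a" doesn't, and what gets pushed
% back are the characters "pla", not the macro \MyWords itself:
\scratchskip 10pt \MyWords % typesets "plain words"
\stoptyping

In the second assignment the glue ends up as just \type {10pt} and the pushed
back characters are typeset as regular text.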
Like \type {by}, the sometimes optional equal sign can save a push back:

\starttyping
\scratchskip = 10pt
\stoptyping

Similar impacts on efficiency can be found in the way the end of a number is
seen, basically anything not resolving to a number (or digit).

\starttyping
\scratchcounter 10%          space not seen, ends \cs
\scratchcounter =10%         no push back of optional =
\scratchcounter = 10%        extra optional space gobble
\scratchcounter = 10 %       efficient ending of number scanning
\scratchcounter = 10\relax % depending on engine less efficient
\stoptyping

In the above examples scanning the number involves: skipping over spaces,
checking for an optional equal, skipping over spaces, scanning for a sign,
checking for an optional octal or hexadecimal trigger (single or double quote),
and scanning the number till a non digit is seen. In the case of dimensions there
is fraction scanning as well as unit scanning too.

In any case, the equal is optional and kind of a keyword. Having an \type {=} can
be more efficient than not having one, again due to the push back when no equal
is seen. In the process spaces have been skipped, so add the scanning for
optional spaces to the overhead. In \LUAMETATEX\ all that has been optimized a
bit. By the way, in dimension scanning \type {pt} is actually a keyword and as
there are several units possible quite some push back can happen there, but we
scan for the most likely candidates first.

All that said, we're now ready for a surprise. The keyword scanner gets a string
that it will test for, say \type {to} in case of a box specification. It then
will fetch tokens from whatever provides the input. A token encodes a so called
command and a character and can be related to a control sequence. For instance,
the character \type {t} becomes a letter command with related value \number`t.
So, we have three properties: the command code, the character code and the
control sequence code.
Now, instead of checking if the command code is a letter or other character (two
checks) a fast check happens for the control sequence code being zero. If that is
the case, the character code is compared. In practice that works out well because
the characters that make up a keyword are in the range \number"41\ up to
\number"5A\ and \number"61\ up to \number"7A, and all other character codes are
either below that (the ones that relate to primitives where the character code is
actually a sub command of a limited range) or much larger numbers that for
instance indicate an entry in some array, where the first useful index is above
the mentioned ranges.

The surprise is in the fact that there is no checking for letters or other
characters, so this is why the next code will work too: \footnote {No longer in
\LUAMETATEX, where we do a somewhat more robust check.}

\starttyping
\catcode `O= 1 \hbox tO 10cm {...} % { begingroup
\catcode `O= 2 \hbox tO 10cm {...} % } endgroup
\catcode `O= 3 \hbox tO 10cm {...} % $ mathshift
\catcode `O= 4 \hbox tO 10cm {...} % & alignment
\catcode `O= 6 \hbox tO 10cm {...} % # parameter
\catcode `O= 7 \hbox tO 10cm {...} % ^ superscript
\catcode `O= 8 \hbox tO 10cm {...} % _ subscript
\catcode `O=11 \hbox tO 10cm {...} %   letter
\catcode `O=12 \hbox tO 10cm {...} %   other
\stoptyping

In the first line, if we changed the catcode of \type {T} and used that character
instead, it would fail because then \TEX\ sees a begin group character and starts
a group. As the second character in a keyword it's okay because there \TEX\ will
not look at the category code.

Of course only the cases \type {11} and \type {12} make sense because one can
imagine that messing with the category codes of regular letters this way will
definitely give problems with processing the text.
In a case like:

\starttyping
{\catcode `o=3 \hbox to 10cm {oeps}} % $ mathshift {oeps}
{\catcode `O=3 \hbox to 10cm {Oeps}} % $ mathshift {$eps}
\stoptyping

we have several issues: the primitive control sequence \type {\hbox} has an \type
{o}, so \TEX\ will stop after \type {\hb}, which can be undefined or a valid
macro, and what happens next is hard to predict. Going uppercase will work, but
then the content of the box is bad because there the \type {O} enters math.

\starttyping
{\catcode `O=3 \hbox tO 10cm {Oeps Oeps}} % {$eps $eps}
\stoptyping

This will work because there are now two \type {O} in the box, so we have
balanced inline math triggers. But how does one explain that to a user, who
probably doesn't understand where an error message comes from in the first place?

Anyway, this kind of tolerance is still not pretty, so in \LUAMETATEX\ we now
check for the command code and stick to letters and other characters. On today's
machines (and even on my by now ancient workhorse) the performance hit can be
neglected. Actually, by intercepting the weird cases we also avoid an unnecessary
case check when we fall through the zero cs test. Of course that also means that
the above mentioned category code trickery doesn't work any more: only letters
and other characters are now valid in keyword scanning. Now, it can be that some
macro programmer actually used those side effects, but apart from some macro
hacker being hurt because mastery of those details can no longer be shown off, it
is users that we care more for, don't we?

Now, don't get me wrong: the performance of keyword and equal scanning mentioned
above is not that relevant in practice. But for the record, here are some timings
on a laptop with an i7-3849QM processor using \MINGW\ binaries on a 64 bit
\MSWINDOWS\ 10.
The times (in seconds) are the averages of five runs of a million such
assignments and advancements:

\starttabulate[|l|c|c|c|]
\FL
\NC one million times                    \NC terminal      \NC \LUAMETATEX\ \NC \LUATEX \NC \NR
\ML
\NC \type {\advance\scratchcounter 1}    \NC space         \NC 0.068 \NC 0.085 \NC \NR
\NC \type {\advance\scratchcounter 1}    \NC \type {\relax} \NC 0.135 \NC 0.149 \NC \NR
\NC \type {\advance\scratchcounter by 1} \NC space         \NC 0.087 \NC 0.099 \NC \NR
\NC \type {\advance\scratchcounter by 1} \NC \type {\relax} \NC 0.155 \NC 0.161 \NC \NR
\NC \type {\scratchcounter 1}            \NC space         \NC 0.057 \NC 0.096 \NC \NR
\NC \type {\scratchcounter 1}            \NC \type {\relax} \NC 0.125 \NC 0.151 \NC \NR
\NC \type {\scratchcounter=1}            \NC space         \NC 0.063 \NC 0.080 \NC \NR
\NC \type {\scratchcounter=1}            \NC \type {\relax} \NC 0.131 \NC 0.138 \NC \NR
\LL
\stoptabulate

We differentiate between using a space as terminal or a \type {\relax}. The
latter is a bit less efficient because more code is involved in resolving the
meaning of that control sequence (which eventually boils down to nothing) but
nevertheless, these are not timings that one can lose sleep over, especially when
the rest of a decent \TEX\ run is taken into account. And yes, \LUAMETATEX\ is a
bit faster here than \LUATEX, but I would be disappointed if that weren't the
case.
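
For those who want to try something similar, here is a minimal sketch of a test
file. It assumes that the \CONTEXT\ helpers \type {\testfeatureonce} and \type
{\elapsedtime} are available in the version at hand, and it is not necessarily
the setup that produced the numbers in the table above:

\starttyping
\starttext

% each line runs the second argument a million times and then typesets
% the elapsed time; the trailing space or \relax is the terminal

\testfeatureonce {1000000} {\advance\scratchcounter 1 }         \elapsedtime \par
\testfeatureonce {1000000} {\advance\scratchcounter 1\relax}    \elapsedtime \par
\testfeatureonce {1000000} {\advance\scratchcounter by 1 }      \elapsedtime \par
\testfeatureonce {1000000} {\advance\scratchcounter by 1\relax} \elapsedtime \par
\testfeatureonce {1000000} {\scratchcounter 1 }                 \elapsedtime \par
\testfeatureonce {1000000} {\scratchcounter=1 }                 \elapsedtime \par

\stoptext
\stoptyping

Running such a file a few times and averaging the results evens out the
fluctuations between runs.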
% luametatex:

% \luaexpr{(0.068+0.070+0.069+0.067+0.068)/5} 0.068\crlf
% \luaexpr{(0.137+0.132+0.136+0.137+0.134)/5} 0.135\crlf
% \luaexpr{(0.085+0.088+0.084+0.089+0.087)/5} 0.087\crlf
% \luaexpr{(0.145+0.160+0.158+0.156+0.154)/5} 0.155\crlf
% \luaexpr{(0.060+0.055+0.059+0.055+0.056)/5} 0.057\crlf
% \luaexpr{(0.118+0.127+0.128+0.122+0.130)/5} 0.125\crlf
% \luaexpr{(0.063+0.062+0.067+0.061+0.063)/5} 0.063\crlf
% \luaexpr{(0.127+0.128+0.133+0.128+0.140)/5} 0.131\crlf

% luatex:

% \luaexpr{(0.087+0.090+0.083+0.081+0.086)/5} 0.085\crlf
% \luaexpr{(0.150+0.151+0.146+0.154+0.145)/5} 0.149\crlf
% \luaexpr{(0.100+0.092+0.113+0.094+0.098)/5} 0.099\crlf
% \luaexpr{(0.162+0.165+0.161+0.160+0.157)/5} 0.161\crlf
% \luaexpr{(0.093+0.101+0.086+0.100+0.098)/5} 0.096\crlf
% \luaexpr{(0.147+0.151+0.160+0.144+0.151)/5} 0.151\crlf
% \luaexpr{(0.076+0.085+0.088+0.073+0.078)/5} 0.080\crlf
% \luaexpr{(0.136+0.138+0.142+0.135+0.140)/5} 0.138\crlf

\stopchapter

\stopcomponent