From 250c5684b9ee44ac972db51f87289ef935182c53 Mon Sep 17 00:00:00 2001
From: Hans Hagen
Date: Fri, 10 Mar 2023 12:42:42 +0100
Subject: 2023-03-10 12:17:00

---
 doc/context/documents/general/manuals/musings.pdf  | Bin 1469208 -> 2144254 bytes
 .../general/manuals/lowlevel/lowlevel-buffers.tex  | 506 +++++++++++++++++++++
 .../general/manuals/musings/musings-toocomplex.tex |   2 +
 .../sources/general/manuals/musings/musings.tex    |   1 +
 4 files changed, 509 insertions(+)
 create mode 100644 doc/context/sources/general/manuals/lowlevel/lowlevel-buffers.tex
(limited to 'doc/context')

diff --git a/doc/context/documents/general/manuals/musings.pdf b/doc/context/documents/general/manuals/musings.pdf
index 69197240f..a02e1b6a1 100644
Binary files a/doc/context/documents/general/manuals/musings.pdf and b/doc/context/documents/general/manuals/musings.pdf differ
diff --git a/doc/context/sources/general/manuals/lowlevel/lowlevel-buffers.tex b/doc/context/sources/general/manuals/lowlevel/lowlevel-buffers.tex
new file mode 100644
index 000000000..5c600cd6f
--- /dev/null
+++ b/doc/context/sources/general/manuals/lowlevel/lowlevel-buffers.tex
@@ -0,0 +1,506 @@
+% language=us runpath=texruns:manuals/lowlevel
+
+\environment lowlevel-style
+
+\startdocument
+  [title=buffers,
+   color=middlegreen]
+
+\startsectionlevel[title=Preamble]
+
+Buffers are not that low level, but it makes sense to discuss them from this
+perspective because they relate to tokenization, internal representation and
+manipulation.
+
+{\em In due time we can describe some more commands and details here. This
+is a start. Feel free to tell me what needs to be explained.}
+
+\stopsectionlevel
+
+\startsectionlevel[title=Encoding]
+
+Normally processing a document starts with reading from file. In the past we
+were talking about single bytes that were then mapped onto a specific input
+encoding that itself matches the encoding of a font. When you enter an \quote
+{a}, its (normally \ASCII) number 97 becomes the index into a font.
That same number is also used in
+the hyphenator, which is why font encoding and hyphenation are strongly
+related. If in an eight bit \TEX\ engine you need a precomposed \quote {ä} you
+have to use an encoding that has that character in some slot, with again
+matching fonts and patterns. The actually used font can have the {\em shapes}
+in different slots and remapping is then done in the backend code using
+encoding and mapping files. When \OPENTYPE\ fonts are used the relationship
+between characters (input) and glyphs (rendering) also depends on the
+application of font features.
+
+In eight bit environments all this brings a bit of a resource management
+nightmare along with complex installation of new fonts. It also puts strain on
+the macro package, especially when you want to mix different input encodings
+onto different font encodings, and thereby pattern encodings, in the same
+document. You can compare this with code pages in an operating system, but
+imagine them potentially being mixed in one document, which can happen when
+you mix multiple languages where the accumulated number of different
+characters exceeds 256. You end up switching between encodings. One way to
+deal with it is making special characters active and letting their meaning
+differ per situation. That is for instance how in \MKII\ we handled \UTF8\ and
+thereby got around distributing multiple pattern files per language, as we
+only needed to encode them in \UTF\ and then remap them to the required
+encoding when loading patterns. A mental exercise is wondering how to support
+\CJK\ scripts in an eight bit \MKII, something that actually can be done with
+some effort.
+
+The good news is that when we moved from \MKII\ to \MKIV\ we went exclusively
+\UTF8\ because that is what the \LUATEX\ engine expects. Up to four bytes are
+read in and translated into one \UNICODE\ character. The internal
+representation is a 32 bit integer (four bytes) instead of a single byte.
That also means that in the
+transition we got rid of quite some encoding related low level font and
+pattern handling. We still support input encodings (called regimes in
+\CONTEXT) but I'm pretty sure that nowadays no one uses input other than
+\UTF8. While \CONTEXT\ is normally quite upward compatible, this is one area
+where there were fundamental changes.
+
+There is still some interpretation going on when reading from file: for
+instance, we need to normalize the \UNICODE\ input, and we feed the engine
+separate lines on demand. Apart from that, some characters like the backslash,
+dollar sign and curly braces have special meaning, so for accessing them as
+characters we have to use commands that inject those characters. That didn't
+change when we went from \MKII\ to \MKIV. In practice it's never really a
+problem unless you find yourself in one of the following situations:
+
+\startitemize
+\startitem
+    {\em Example code has to be typeset as|-|is, so braces etc.\ are just that.}
+    This means that we have to change the way characters are interpreted.
+    Typesetting code is needed when you want to document \TEX\ and macros,
+    which is why mechanisms for that have to be present right from the start.
+\stopitem
+\startitem
+    {\em Content is collected and used later.} A separation of content and
+    usage later on often helps make a source look cleaner. Examples are
+    \quotation {wrapping a table in a buffer} and \quotation {including that
+    buffer when a table is placed} using the placement macros.
+\stopitem
+\startitem
+    {\em Embedded \METAPOST\ and \LUA\ code.} These languages come with a
+    different interpretation of some characters, and especially \METAPOST\
+    code is often stored first and used (processed) later.
+\stopitem
+\startitem
+    {\em The content comes from a different source.} Examples are \XML\ files,
+    where angle brackets are special but for instance braces aren't. The data
+    is interpreted as a stream or as a structured tree.
+\stopitem
+\startitem
+    {\em The content is generated.} It can for instance come from \LUA, where
+    bytes (representing \UTF) are just text and no special characters are to
+    be intercepted. Or it can come from a database (using a library).
+\stopitem
+\stopitemize
+
+For these reasons \CONTEXT\ has always had ways to store data such that all
+this is possible. The details of how that is done might have changed over
+versions, been optimized, and been extended with additional interfaces and
+features, but given where we come from, most has been there from the start.
+
+\stopsectionlevel
+
+\startsectionlevel[title=Performance]
+
+When \TEX\ came around, the bottlenecks in running \TEX\ were the processor,
+memory and disks, and, depending on the way one used it, the speed of the
+console or terminal; so, basically the whole system. One could sit there and
+wait for the page counters (\typ {[1] [2] ..}) to show up. It was possible to
+run \TEX\ on a personal computer but it was somewhat resource hungry: one
+needed a decent disk (a 10 MB hard disk was huge, which compared with today's
+phone camera snapshots sounds crazy). One could use memory extenders to get
+around the 640K limitation (keep in mind that the programs and operating
+systems also took space). This all meant that one could not afford to store
+too many tokens in memory, but even using files for all kinds of
+(multi|-|pass) trickery was demanding.
+
+When processors became faster and memory plentiful, the disk became the
+bottleneck, but that changed when \SSD's showed up. Combined with the already
+present file caching that had some impact. We are now in a situation where
+\CPU\ cores don't get that much faster (at least not twice as fast per
+iteration) and, with \TEX\ being a single core byte cruncher, performance more
+or less has to come from efficient programming. That means that, given enough
+memory, in some cases storing in tokens wins over storing in files, but that
+is not a rule.
In practice there is not much
+difference, so one can, even more than before, choose the most convenient
+method. Just assume that the \CONTEXT\ code, combined with \LUAMETATEX, will
+give you what you need with reasonable performance. When in doubt, test with
+simple test files, and if that works out well compared to the real code, try
+to figure out where \quote {mistakes} are made. Inefficient \LUA\ and \TEX\
+code has way more impact than storing a few more tokens or using some files.
+
+\stopsectionlevel
+
+\startsectionlevel[title=Files]
+
+Files are nearly always read once per run. The content (mixed with commands)
+is scanned and macros are expanded and|/|or text is typeset as we go.
+Internally the \LUAMETATEX\ engine is in \quotation {scanning from file},
+\quotation {scanning from token lists}, or \quotation {scanning from \LUA\
+output} mode. The first mode is (in principle) the slowest because \UTF\
+sequences are converted to tokens (numbers), but there is no way around it.
+The second method is fast because we already have these numbers, but we need
+to take into account where the linked list of tokens comes from. If it is
+converted at runtime from, for instance, file input or macro expansion, we
+need to add the involved overhead. But scanning a stored macro body is pretty
+efficient, especially when the macro is part of the loaded macro package
+(format file). The third method is comparable with reading from file, but here
+we need to add the overhead involved with storing the \LUA\ output into data
+structures suitable for \TEX's input mechanism, which can involve memory
+allocation outside the reserved pool of tokens. On modern systems that is not
+really a problem. It is good to keep in mind that when \TEX\ was written much
+attention was paid to optimization, and in \LUAMETATEX\ we even went a bit
+further, also because we know what kind of input, processing and output we're
+dealing with.
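+
+As an aside, the third mode can be triggered explicitly from the \TEX\ end. A
+minimal sketch, using the standard \type {context} helper, which turns its
+string argument into input for the engine:
+
+\starttyping
+\startluacode
+    -- this string ends up on the input stack as \LUA\ output
+    context("This sentence comes from \\LUA\\ output.")
+\stopluacode
+\stoptyping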
+
+When reading from file or \LUA\ output we interpret bytes turned into \UTF\
+numbers, and that is when catcode regimes kick in: characters are interpreted
+according to their catcode properties: escape character (backslash), curly
+braces (grouping and arguments), dollars (math), etc. When reading from token
+lists these catcodes are already taken care of and we're basically
+interpreting meanings instead of characters. By changing the catcode regime we
+can for instance typeset content verbatim from files and \LUA\ strings, but
+when reading from token lists we're sort of frozen. There are tricks to
+reinterpret the token list but that comes with overhead and limitations.
+
+\stopsectionlevel
+
+\startsectionlevel[title=Macros]
+
+A macro can be seen as a named token with a meaning attached. In \LUAMETATEX\
+macros can take up to 15 arguments (six more than regular \TEX) that can be
+separated by so|-|called delimiters. A token has a command property (operator)
+and a value (operand). Because a \UNICODE\ character doesn't need all four
+bytes of an integer, and because in the engine numbers, dimensions and
+pointers are limited in size, we can store all of these efficiently with the
+command code. Here the body of \type {\foo} is a list of three tokens:
+
+\starttyping
+\def\foo{abc} \foo \foo \foo
+\stoptyping
+
+When the engine fetches a token from a list it will interpret the command, and
+when it fetches from file it will create tokens on the fly and then interpret
+those. When a file or list is exhausted the engine pops the stack and
+continues at the previous level. Because macros are already tokenized they are
+more efficient than file input. For more about macros you can consult the low
+level document about them.
+
+The more you use a macro, the more it pays off compared to a file. However,
+don't overestimate this, because in the end the typesetting and the expansion
+of all kinds of other involved macros might reduce the file overhead to noise.
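+
+As a small illustration of such delimiters (the macro name here is made up
+for the example), the following definition picks up two arguments that are
+separated and terminated by delimiter tokens:
+
+\starttyping
+\def\MyPair[#1=#2]{(#1: #2)}
+
+\MyPair[width=10pt] % yields (width: 10pt)
+\stoptyping
+
+Here \type {[}, \type {=} and \type {]} act as delimiters: the engine scans up
+to each delimiter to determine where an argument ends.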
+
+\stopsectionlevel
+
+\startsectionlevel[title=Token lists]
+
+A token list is like a macro but it is part of the variable (register) system.
+It is just a list (so no arguments) and you can append and prepend to that
+list.
+
+\starttyping
+\toks123={abc} \the\toks123
+\scratchtoks{abc} \the\scratchtoks
+\stoptyping
+
+Here \type {\scratchtoks} is defined with \type {\newtoks}, which creates an
+efficient reference to a list so that, contrary to the first line, no register
+number has to be scanned. There are low level manuals about tokens and
+registers that you can read if you want to know more about this. As with
+macros, the list in this example is three tokens long. Contrary to macros
+there is no macro overhead, as there is no need to check for arguments.
+\footnote {In \LUAMETATEX\ a macro without arguments is also quite efficient.}
+
+Because they use more or less the same storage method, macros and token list
+registers perform the same. The power of registers comes from some additional
+manipulators in \LUATEX\ (and \LUAMETATEX) and the fact that one can control
+expansion with \type {\the}, although that latter advantage is compensated by
+extensions to the macro language (like \type {\protected} macro definitions).
+
+\stopsectionlevel
+
+\startsectionlevel[title=Buffers]
+
+Buffers are something specific to \CONTEXT\ and they have always been part of
+this system. A buffer is defined as follows:
+
+\startbuffer
+\startbuffer[one]
+line 1
+line 2
+\stopbuffer
+\stopbuffer
+
+\typebuffer
+
+Among the operations on buffers the next two are used most often:
+
+\starttyping
+\typebuffer[one]
+\getbuffer[one]
+\stoptyping
+
+Scanning a buffer at the \TEX\ end takes a little effort, because when we
+start reading, the catcodes are ignored and for instance backslashes and curly
+braces are retained. Hardly any interpretation takes place. The same is true
+for spacing, so multiple spaces are not collapsed and newlines stay.
The tokenized
+content of a buffer is converted back to a string and that content is then
+read in as a pseudo file when we need it. So, basically buffers are files! In
+\MKII\ they actually were files (in the \type {\jobname} name space and with
+the suffix \type {tmp}), but in \MKIV\ they are stored in and managed by \LUA.
+That also means that you can set them very efficiently at the \LUA\ end:
+
+\starttyping
+\startluacode
+buffers.assign("one",[[
+line 1
+line 2
+]])
+\stopluacode
+\stoptyping
+
+Always keep in mind that buffers eventually are read as files: character by
+character, and at that time the content gets (as with other files) tokenized.
+A buffer name is optional. You can nest buffers, with and without names.
+
+Because \CONTEXT\ is very much about re-use of content and selective
+processing, we have an (already old) subsystem for defining named blocks of
+text (using \type {\begin...} and \type {\end...} tagging). These blocks are
+stored just like buffers, but selective flushing is part of the concept. Think
+of coding an educational document with explanations, questions and answers,
+and then typesetting only the explanations, or the explanations along with
+some questions. Other components can be typeset later, so one can for instance
+make a special book(let) with answers that may or may not repeat the
+questions. Here we need features like synchronization of numbers, which is why
+we cannot really use buffers. An alternative is to use \XML\ and filter from
+that.
+
+The \typ {\definebuffer} command defines a new buffer environment. When you
+set buffers in \LUA\ you don't need to define a buffer, because likely you
+don't need the \type {\start} and \type {\stop} commands. Instead of \typ
+{\getbuffer} you can also use \typ {\getdefinedbuffer} with defined buffers.
+In that case the \type {before} and \type {after} keys of that specific
+instance are used.
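+
+A sketch of such a defined buffer (the instance name is made up here, and I
+assume the usual instance setup with \type {\setupbuffer} applies to the
+\type {before} and \type {after} keys):
+
+\starttyping
+\definebuffer[MyQuote]
+
+\setupbuffer[MyQuote][before=\blank,after=\blank]
+
+\startMyQuote
+some content collected for later
+\stopMyQuote
+
+\getdefinedbuffer[MyQuote]
+\stoptyping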
+
+The \typ {\getinlinebuffer} command, which, like the getters, takes a list of
+buffer names, ignores leading and trailing spaces. When multiple buffers are
+flushed this way, spacing between buffers is retained.
+
+The most important aspect of buffers is that the content is {\em not}
+interpreted and tokenized: the bytes stay as they are.
+
+\startbuffer
+\definebuffer[MyBuffer]
+
+\startMyBuffer
+\bold{this is
+a buffer}
+\stopMyBuffer
+
+\typeMyBuffer \getMyBuffer
+\stopbuffer
+
+\typebuffer
+
+These commands result in:
+
+\getbuffer
+
+There are not that many parameters that can be set: \type {before}, \type
+{after} and \type {strip} (when set to \type {no}, leading and trailing
+spacing will be kept). The \type {\stop...} command, in our example \typ
+{\stopMyBuffer}, can be redefined independently to do something after the
+buffer has been read and stored, but by default nothing is done.
+
+You can test if a buffer exists with \typ {\doifelsebuffer} (expandable) and
+\typ {\doifelsebufferempty} (unexpandable). A buffer is kept in memory unless
+it gets wiped clean with \typ {\resetbuffer}.
+
+\starttyping
+\savebuffer      [MyBuffer][temp]     % gets name: jobname-temp.tmp
+\savebufferinfile[MyBuffer][temp.log] % gets name: temp.log
+\stoptyping
+
+You can also stepwise fill such a buffer:
+
+\starttyping
+\definesavebuffer[slide]
+
+\startslide
+    \starttext
+\stopslide
+\startslide
+    slide 1
+\stopslide
+text 1 \par
+\startslide
+    slide 2
+\stopslide
+text 2 \par
+\startslide
+    \stoptext
+\stopslide
+\stoptyping
+
+After this you will have a file \type {\jobname-slide.tex} that has the two
+lines wrapped as text. You can set up a \quote {save buffer} to use a
+different filename (with the \type {file} key), a different prefix (using
+\type {prefix}), and you can set up a \type {directory}. A different name is
+set with the \type {list} key.
+
+You can assign content to a buffer with a somewhat clumsy interface where we
+use the delimiter \type {\endbuffer}.
The only restriction is that this
+delimiter cannot be part of the content:
+
+\starttyping
+\setbuffer[name]here comes some text\endbuffer
+\stoptyping
+
+For more details and obscure commands that are used in other commands
+you can peek into the source.
+
+% These are somewhat obscure:
+%
+% \getbufferdata{...}
+% \grabbufferdatadirect % name start stop
+% \grabbufferdata       % was: \dostartbuffer
+% \thebuffernumber
+% \thedefinedbuffer
+
+Using buffers in the \CLD\ interface is tricky because of the catcode magic
+that is involved, but there are setters and getters:
+
+\starttabulate[|T|T|]
+\BC function            \BC arguments                 \NC \NR
+\ML
+\NC buffers.assign      \NC name, content [,catcodes] \NC \NR
+%NC buffers.raw         \NC                           \NC \NR
+\NC buffers.erase       \NC name                      \NC \NR
+\NC buffers.prepend     \NC name, content             \NC \NR
+\NC buffers.append      \NC name, content             \NC \NR
+\NC buffers.exists      \NC name                      \NC \NR
+\NC buffers.empty       \NC name                      \NC \NR
+\NC buffers.getcontent  \NC name                      \NC \NR
+\NC buffers.getlines    \NC name                      \NC \NR
+%NC buffers.collectcontent \NC                        \NC \NR
+%NC buffers.loadcontent \NC                           \NC \NR
+%NC buffers.get         \NC                           \NC \NR
+%NC buffers.getmkiv     \NC                           \NC \NR
+%NC buffers.gettexbuffer \NC                          \NC \NR
+%NC buffers.run         \NC                           \NC \NR
+\stoptabulate
+
+There are a few more helpers that are used in other (low level) commands.
+Their functionality might adapt to their usage there. The \typ
+{context.startbuffer} and \typ {context.stopbuffer} commands are defined
+somewhat differently from regular \CLD\ commands.
+
+\stopsectionlevel
+
+\startsectionlevel[title=Setups]
+
+A setup is basically a macro but it is stored and accessed in a namespace
+separate from ordinary macros. One important characteristic is that inside
+setups newlines are ignored.
+
+\startbuffer
+\startsetups MySetupA
+    This is line 1
+    and this is line 2
+\stopsetups
+
+\setup{MySetupA}
+\stopbuffer
+
+\typebuffer {\bf \getbuffer}
+
+A simple way out is to add a comment character preceded by a space.
Instead you
+can also use \type {\space}:
+
+\startbuffer
+\startsetups [MySetupB]
+    This is line 1 %
+    and this is line 2\space
+    while here we have line 3
+\stopsetups
+
+\setup[MySetupB]
+\stopbuffer
+
+\typebuffer {\bf \getbuffer}
+
+You can use square brackets instead of space delimited names in definitions
+and also when calling up a (list of) setup(s). The \type {\directsetup}
+command takes a single setup name and is therefore more efficient.
+
+Setups are basically simple macros, although there is some magic involved
+that comes from their usage in for instance \XML, where we pass an argument.
+That means we can do the following:
+
+\startbuffer
+\startsetups MySetupC
+    before#1after
+\stopsetups
+
+\setupwithargument{MySetupC}{ {\em and} }
+\stopbuffer
+
+\typebuffer {\bf \getbuffer}
+
+Because a setup is a macro, the body is a linked list of tokens where each
+token takes 8 bytes of memory, so \type {MySetupC} has 12 tokens that take 96
+bytes of memory (plus some overhead related to macro management).
+
+\stopsectionlevel
+
+\startsectionlevel[title=\XML]
+
+Discussing \XML\ is outside the scope of this document, but it is worth
+mentioning that once an \XML\ tree is read in, the content is stored in
+strings and can be filtered into \TEX, where it is interpreted as if coming
+from files (in this case \LUA\ strings). If needed the content can be
+interpreted as \TEX\ input.
+
+\stopsectionlevel
+
+\startsectionlevel[title=\LUA]
+
+As mentioned already, output from \LUA\ is stored, and when a \LUA\ call
+finishes it ends up on the so|-|called input stack. Every time the engine
+needs a token it will fetch from the input stack, and the top of the stack
+can represent a file, token list or \LUA\ output. Interpreting bytes from
+files or \LUA\ strings results in tokens.
As a side note: \LUA\ output can
+also be already tokenized, because we can actually write tokens and nodes
+from \LUA, but that's more an implementation detail that makes the \LUA\
+input stack entries a bit more complex. It is normally not something users
+will do when they use \LUA\ in their documents.
+
+\stopsectionlevel
+
+\startsectionlevel[title=Protection]
+
+When you define macros there is the danger of overloading ones defined by the
+system. It is best to use CamelCase names so that you stay away from clashes.
+You can enable some checking:
+
+\starttyping
+\enabledirectives[overloadmode=warning]
+\stoptyping
+
+or, when you want to quit on a clash:
+
+\starttyping
+\enabledirectives[overloadmode=error]
+\stoptyping
+
+When these directives are enabled you can get around the check with:
+
+\starttyping
+\pushoverloadmode
+    ...
+\popoverloadmode
+\stoptyping
+
+But delay that till you're sure that redefining is okay.
+
+\stopsectionlevel
+
+% efficiency
+
+\stopdocument
+
diff --git a/doc/context/sources/general/manuals/musings/musings-toocomplex.tex b/doc/context/sources/general/manuals/musings/musings-toocomplex.tex
index 103dd1906..f12f15c1e 100644
--- a/doc/context/sources/general/manuals/musings/musings-toocomplex.tex
+++ b/doc/context/sources/general/manuals/musings/musings-toocomplex.tex
@@ -387,3 +387,5 @@ is not needed any more by then.
 \stopsection
 
 \stopchapter
+
+\stopcomponent
diff --git a/doc/context/sources/general/manuals/musings/musings.tex b/doc/context/sources/general/manuals/musings/musings.tex
index 13bf4f4ef..bccab890a 100644
--- a/doc/context/sources/general/manuals/musings/musings.tex
+++ b/doc/context/sources/general/manuals/musings/musings.tex
@@ -28,6 +28,7 @@
 % \component musings-whytex-again
 \component musings-dontusetex
 \component musings-speed
+ \component musings-texlive
 \stopbodymatter
 \stopproduct
-- 
cgit v1.2.3