From 06f5d61e0db05d0803ac5b6b4953937c3b88f1ea Mon Sep 17 00:00:00 2001
From: Hans Hagen
Date: Sat, 7 Aug 2021 23:36:31 +0200
Subject: 2021-08-07 22:51:00
---
 .../documents/general/manuals/workflows-mkiv.pdf | Bin 82860 -> 90319 bytes
 .../manuals/workflows/workflows-graphics.tex | 30 ++--
 .../general/manuals/workflows/workflows-hashed.tex | 160 +++++++++++++++++++++
 .../manuals/workflows/workflows-injectors.tex | 2 +-
 .../general/manuals/workflows/workflows-mkiv.tex | 1 +
 .../manuals/workflows/workflows-parallel.tex | 5 +-
 .../manuals/workflows/workflows-resources.tex | 22 +--
 .../manuals/workflows/workflows-synctex.tex | 37 +++--
 .../general/manuals/workflows/workflows-xml.tex | 12 +-
 9 files changed, 225 insertions(+), 44 deletions(-)
 create mode 100644 doc/context/sources/general/manuals/workflows/workflows-hashed.tex
(limited to 'doc')
diff --git a/doc/context/documents/general/manuals/workflows-mkiv.pdf b/doc/context/documents/general/manuals/workflows-mkiv.pdf
index b63ecd054..1e5f3f287 100644
Binary files a/doc/context/documents/general/manuals/workflows-mkiv.pdf and b/doc/context/documents/general/manuals/workflows-mkiv.pdf differ
diff --git a/doc/context/sources/general/manuals/workflows/workflows-graphics.tex b/doc/context/sources/general/manuals/workflows/workflows-graphics.tex
index a24d293df..2246c1c88 100644
--- a/doc/context/sources/general/manuals/workflows/workflows-graphics.tex
+++ b/doc/context/sources/general/manuals/workflows/workflows-graphics.tex
@@ -8,12 +8,13 @@
 \startsection[title=Bad names]
-After many years of using \CONTEXT\ in workflows where large amounts of source files
-as well as graphics were involved we can safely say that it's hard for publishers to
-control the way these are named. This is probably due to the fact that in a
-click|-|and|-|point based desktop publishing workflow names don't matter as one stays on
-one machine, and names are only entered once (after that these names become abstractions and
-get cut and pasted).
Proper consistent resource managament is simply not part of the flow.
+After many years of using \CONTEXT\ in workflows where large amounts of source
+files as well as graphics were involved we can safely say that it's hard for
+publishers to control the way these are named. This is probably due to the fact
+that in a click|-|and|-|point based desktop publishing workflow names don't
+matter as one stays on one machine, and names are only entered once (after that
+these names become abstractions and get cut and pasted). Proper consistent
+resource managament is simply not part of the flow.
 This means that you get names like:
@@ -29,19 +30,20 @@ like one. In fancy screen fonts upper and lowercase usage might get obscured. It really makes one wonder if copy|-|editing or adding labels to graphics isn't suffering from the same problem.
-Anyhow, as in an automated rendering workflow the rendering is often the last step you
-can imagine that when names get messed up it's that last step that gets blamed. It's not
-that hard to sanitize names of files on disk as well as in the files that refer to them,
-and we normally do that we have complete control. This is no option when all the resources
-are synchronzied from elsewhere. In that case the only way out is signaling potential
-issues. Say that in the source file there is a reference:
+Anyhow, as in an automated rendering workflow the rendering is often the last
+step you can imagine that when names get messed up it's that last step that gets
+blamed. It's not that hard to sanitize names of files on disk as well as in the
+files that refer to them, and we normally do that we have complete control. This
+is no option when all the resources are synchronzied from elsewhere. In that case
+the only way out is signaling potential issues.
Say that in the source file there
+is a reference:
 \starttyping
 foo_Bar_01_03-a.EPS
 \stoptyping
-and that the graphic on disk has the same name, but for some reason after an update
-has become:
+and that the graphic on disk has the same name, but for some reason after an
+update has become:
 \starttyping
 foo-Bar_01_03-a.EPS
diff --git a/doc/context/sources/general/manuals/workflows/workflows-hashed.tex b/doc/context/sources/general/manuals/workflows/workflows-hashed.tex
new file mode 100644
index 000000000..85aa5d5f1
--- /dev/null
+++ b/doc/context/sources/general/manuals/workflows/workflows-hashed.tex
@@ -0,0 +1,160 @@
+% language=us runpath=texruns:manuals/workflows
+
+% Musical timestamp: Welcome 2 America by Prince, with a pretty good lineup,
+% August 2021.
+
+\environment workflows-style
+
+\startcomponent workflows-hashed
+
+\startchapter[title=Hashed files]
+
+In a (basically free content) project we had to deal with tens of thousands of
+files. Most are in \XML\ format, but there are also thousands of \PNG, \JPG\ and
+\SVG images. In large project like this, which covers a large part of Dutch
+school math, images can be shared. All the content is available for schools as
+\HTML\ but can also be turned into printable form and because schools want to
+have stable content over specified periods one has to make a regular snapshot of
+this corpus. Also, distributing a few gigabytes if data is not much fun.
+
+So, in order to bring the amount down a dedicated mechanism for handling files
+has bene introduced. After playing with a \SQLITE\ database we finally settled on
+just \LUA, simply because it was faster and it also makes the solution
+independent.
+
+The process comes down to creating a file database once in a while, loading a
+relatively small hash mapping at runtime and accessing files from a large
+data file on demand. Optionally files can be compressed, which makes sense for
+the textual files.
+
+A database is created with one of the \CONTEXT\ extras, for instance:
+
+\starttyping
+context --extra=hashed --database=m4 --pattern=m4all/**.xml --compress
+context --extra=hashed --database=m4 --pattern=m4all/**.svg --compress
+context --extra=hashed --database=m4 --pattern=m4all/**.jpg
+context --extra=hashed --database=m4 --pattern=m4all/**.png
+\stoptyping
+
+The database uses two files: a small \type {m4.lua} file (some 11 megabytes) and
+a large \type {m4.dat} (about 820 megabytes, coming from 1850 megabytes
+originals). Alternatively you can use a specification, say \type {m4all.lua}:
+
+\starttyping
+return {
+ { pattern = "m4all/**.xml$", compress = true },
+ { pattern = "m4all/**.svg$", compress = true },
+ { pattern = "m4all/**.jpg$", compress = false },
+ { pattern = "m4all/**.png$", compress = false },
+}
+\stoptyping
+
+\starttyping
+context --extra=hashed --database=m4 --patterns=m4all.lua
+\stoptyping
+
+You should see something like on the console:
+
+\starttyping
+hashed > database 'hasheddata', 1627 paths, 46141 names,
+ 36935 unique blobs, 29674 compressed blobs
+\stoptyping
+
+So here we share some ten thousand files (all images). In case you wonder why we
+keep the duplicates: they have unique names (copies) so that when a section is
+updated there is no interference with other sections. The tree structure is
+mostly six deep (sometimes there is an additional level).
+
+% \startluacode
+% if not resolvers.finders.helpers.validhashed("hasheddata") then
+% resolvers.finders.helpers.createhashed {
+% database = "hasheddata",
+% pattern = "m4all/**.jpg$",
+% compress = false,
+% }
+% resolvers.finders.helpers.createhashed {
+% database = "hasheddata",
+% pattern = "m4all/**.png$",
+% compress = false,
+% }
+% resolvers.finders.helpers.createhashed {
+% database = "hasheddata",
+% pattern = "m4all/**.xml$",
+% compress = true,
+% }
+% end
+% \stopluacode
+
+% \startluacode
+% if not resolvers.finders.helpers.validhashed("hasheddata") then
+% resolvers.finders.helpers.createhashed {
+% database = "hasheddata",
+% patterns = {
+% { pattern = "m4all/**.jpg$", compress = false },
+% { pattern = "m4all/**.png$", compress = false },
+% { pattern = "m4all/**.svg$", compress = true },
+% { pattern = "m4all/**.xml$", compress = true },
+% },
+% }
+% end
+% \stopluacode
+
+Accessing files is the same as with files on the system, but one has to register
+a database first:
+
+\starttyping
+\registerhashedfiles[m4]
+\stoptyping
+
+A fully qualified specifier looks like this (not to different from other
+specifiers):
+
+\starttyping
+\externalfigure
+ [hashed:///m4all/books/chapters/h3/h3-if1/images/casino.jpg]
+\externalfigure
+ [hashed:///m4all/books/chapters/ha/ha-c4/images/ha-c44-ex2-s1.png]
+\stoptyping
+
+but nicer would be :
+
+\starttyping
+\externalfigure
+ [m4all/books/chapters/h3/h3-if1/images/casino.jpg]
+\externalfigure
+ [m4all/books/chapters/ha/ha-c4/images/ha-c44-ex2-s1.png]
+\stoptyping
+
+This is possible when we also specify:
+
+\starttyping
+\registerfilescheme[hashed]
+\stoptyping
+
+This makes the given scheme based resolver kick in first, while the normal
+file lookup is used as last resort.
+
+This mechanism is written on top of the infrastructure that has been part of
+\CONTEXT\ \MKIV\ right from the start but this particular feature is only
+available in \LMTX\ (backporting is likely a waste of time).
+
+Just for the record: this mechanism is kept simple, so the database has no update
+and replace features. One can just generate a new one. You can test for a valid database
+and act upon the outcome:
+
+\starttyping
+\doifelsevalidhashedfiles {m4} {
+ \writestatus{hashed}{using hashed data}
+ \registerhashedfiles[m4]
+ \registerfilescheme[hashed]
+} {
+ \writestatus{hashed}{no hashed data}
+}
+\stoptyping
+
+Future version might introduce filename normalization (lowercase, cleanup) so
+consider this is first step. First we need test it for a while.
+
+\stopchapter
+
+\stopcomponent
diff --git a/doc/context/sources/general/manuals/workflows/workflows-injectors.tex b/doc/context/sources/general/manuals/workflows/workflows-injectors.tex
index 4b784f7cf..e2d46f060 100644
--- a/doc/context/sources/general/manuals/workflows/workflows-injectors.tex
+++ b/doc/context/sources/general/manuals/workflows/workflows-injectors.tex
@@ -108,7 +108,7 @@ automated \XML\ workflows where last minute control is needed.
 \stopcomponent
-% some day to be described:
+% Some day to be described (check Willis tests):
 %
 % \showinjector
 %
diff --git a/doc/context/sources/general/manuals/workflows/workflows-mkiv.tex b/doc/context/sources/general/manuals/workflows/workflows-mkiv.tex
index 6b8b172b0..1165b621e 100644
--- a/doc/context/sources/general/manuals/workflows/workflows-mkiv.tex
+++ b/doc/context/sources/general/manuals/workflows/workflows-mkiv.tex
@@ -49,6 +49,7 @@
 \component workflows-setups
 \component workflows-synctex
 \component workflows-parallel
+ \component workflows-hashed
 \stopbodymatter
 \stopdocument
diff --git a/doc/context/sources/general/manuals/workflows/workflows-parallel.tex b/doc/context/sources/general/manuals/workflows/workflows-parallel.tex
index a82028de6..006088c2c 100644
--- a/doc/context/sources/general/manuals/workflows/workflows-parallel.tex
+++ b/doc/context/sources/general/manuals/workflows/workflows-parallel.tex
@@ -88,7 +88,7 @@ provides.
 Again, caching the input file as above saves a little bit: 10 seconds, so we get 250 seconds. When you run these tests on the machine that you normally work on, waiting for that many jobs to finish is no fun, so what if we (as I then normally
-do) watch some music video? With a fullscreen high resolution video shown in the
+do) watch some music video? With a full screen high resolution video shown in the
 foreground the runtime didn't change: still 250 seconds for 1000 jobs with eight parallel runs. On the other hand, a test with Firefox, which is quite demanding, running a video in the background, made the runtime going up by 30 seconds to
@@ -118,6 +118,3 @@ is still there, the management script can decide when a next run can be started.
 \stopchapter
 \stopcomponent
-
-% downloaded video : Jojo Mayer's 2019 TED talk: https://www.youtube.com/watch?v=Npq-bhz1ll0}
-% realtime video : Andrew Cuomo's daily press conference on dealing with Covid 19
diff --git a/doc/context/sources/general/manuals/workflows/workflows-resources.tex b/doc/context/sources/general/manuals/workflows/workflows-resources.tex
index 41de6dc35..323bf8209 100644
--- a/doc/context/sources/general/manuals/workflows/workflows-resources.tex
+++ b/doc/context/sources/general/manuals/workflows/workflows-resources.tex
@@ -6,10 +6,10 @@
 \startchapter[title=Accessing resources]
-One of the benefits of \TEX\ is that you can use it in automated workflows
-where large quantities of data is involved. A document can consist of
-several files and normally also includes images. Of course there are styles
-involved too. At \PRAGMA\ normally put styles and fonts in:
+One of the benefits of \TEX\ is that you can use it in automated workflows where
+large quantities of data is involved. A document can consist of several files and
+normally also includes images. Of course there are styles involved too. At
+\PRAGMA\ normally put styles and fonts in:
 \starttyping
 /data/site/context/tex/texmf-project/tex/context/user//...
@@ -37,11 +37,12 @@ The processing happens in:
 Putting styles (and resources like logos and common images) and fonts (if the project has specific ones not present in the distribution) in the \TEX\ tree makes sense because that is where such files are normally searched. Of course you
-need to keep the distributions file database upto|-|date after adding files there.
+need to keep the distributions file database upto|-|date after adding files
+there.
 Processing has to happen isolated from other runs so there we use unique
-locations. The services responsible for running also deal with regular cleanup
-of these temporary files.
+locations. The services responsible for running also deal with regular cleanup of
+these temporary files.
 Resources are somewhat special. They can be stable, i.e.\ change seldom, but more often they are updated or extended periodically (or even daily). We're not
@@ -55,7 +56,7 @@ resource tree. In the 100K case there is a deeper structure which is in itself predictable but because many authors are involved the references to these files are somewhat instable (and undefined). It is surprising to notice that publishers don't care about filenames (read: cannot control all the parties involved) which
-means that we have inconsist use of mixed case in filenames, and spaces,
+means that we have inconsistent use of mixed case in filenames, and spaces,
 underscores and dashes creeping in. Because typesetting for paper is always at the end of the pipeline (which nowadays is mostly driven by (limitations) of web products) we need to have a robust and flexible lookup mechanism. It's a side
@@ -67,7 +68,7 @@ get it fixed. \footnote {From what we normally receive we often conclude that copy|-|editing and image production companies don't impose any discipline or probably simply lack the tools and methods to control this.
 Some of our workflows had checkers and fixers, so that when we got 5000 new resources while only a few
-needed to be replaced we could filter the right ones. It was not uncommon to find
+needed to be replaced we could filter the right ones. It was not uncommon to find
 duplicates for thousands of pictures: similar or older variants.}
 \starttyping
@@ -151,6 +152,9 @@ When you add, remove or move files the tree, you need to remove the \type {dirlist.*} files in the root because these are used for locating files. A new file will be generated automatically. Don't forget this!
+When content doesn't change an alternative discussed in in a later chapter can be
+considered: hashed databases of files.
+
 \stopchapter
 \stopcomponent
diff --git a/doc/context/sources/general/manuals/workflows/workflows-synctex.tex b/doc/context/sources/general/manuals/workflows/workflows-synctex.tex
index bb2128da4..4349461f0 100644
--- a/doc/context/sources/general/manuals/workflows/workflows-synctex.tex
+++ b/doc/context/sources/general/manuals/workflows/workflows-synctex.tex
@@ -82,10 +82,10 @@ editor} are the following:
 \stopitemize
-It is unavoidable that we get more run time but I assume that for the average user
-that is no big deal. It pays off when you have a workflow when a book (or even a
-chapter in a book) is generated from hundreds of small \XML\ files. There is no
-overhead when \SYNCTEX\ is not used.
+It is unavoidable that we get more run time but I assume that for the average
+user that is no big deal. It pays off when you have a workflow when a book (or
+even a chapter in a book) is generated from hundreds of small \XML\ files. There
+is no overhead when \SYNCTEX\ is not used.
 In \CONTEXT\ we don't use the built|-|in \SYNCTEX\ features, that is: we let filename and line numbers be set but often these are overloaded explicitly.
 The
@@ -120,10 +120,10 @@ A third method is to put this at the top of your file:
 Often an \XML\ files is very structured and although probably the main body of text is flushed as a stream, specific elements can be flushed out of order. In
-educational documents flushing for instance answers to exercises can happen out of
-order. In that case we still need to make sure that we go to the right spot in
-the file. It will never be 100\% perfect but it's better than nothing. The
-above command will also enable \XML\ support.
+educational documents flushing for instance answers to exercises can happen out
+of order. In that case we still need to make sure that we go to the right spot in
+the file. It will never be 100\% perfect but it's better than nothing. The above
+command will also enable \XML\ support.
 If you don't want a file to be accessed, you can block it:
@@ -131,8 +131,8 @@ If you don't want a file to be accessed, you can block it:
 \blocksynctexfile[foo.tex]
 \stoptyping
-Of course you need to configure the viewer to respond to the request for
-editing. In Sumatra combined with SciTE the magic command is:
+Of course you need to configure the viewer to respond to the request for editing.
+In Sumatra combined with \SCITE\ the magic command is:
 \starttyping
 c:\data\system\scite\wscite\scite.exe "%f" "-goto:%l"
@@ -202,6 +202,23 @@ as described here.
 \stopsection
+\startsection[title=Two-way]
+
+In for instance the \TEX shop editor, there is a two way connection. The nice
+thing about this editor is, is that it is also the first one to use the \type
+{mtx-synctex} script to resolve these links, instead of relying on a library. You
+can also use this script to inspect a \SYNCTEX\ file yourself, The help into
+shows the possible directives.
+
+\starttyping
+mtxrun --script synctex
+\stoptyping
+
+You can resolve positions in the \PDF\ as well as in the sources and list all the
+known areas in the log.
+
+\stopsection
+
 \stopchapter
 \stopcomponent
diff --git a/doc/context/sources/general/manuals/workflows/workflows-xml.tex b/doc/context/sources/general/manuals/workflows/workflows-xml.tex
index 93b03bb7b..95a6f71b6 100644
--- a/doc/context/sources/general/manuals/workflows/workflows-xml.tex
+++ b/doc/context/sources/general/manuals/workflows/workflows-xml.tex
@@ -15,9 +15,9 @@
 \startchapter[title={XML}]
 When you have an \XML\ project with many files involved, finding the right spot
-of something that went wrong can be a pain. In one of our project the production of
-some 50 books involves 60.000 \XML\ files and 20.000 images. Say that we have the
-following file:
+of something that went wrong can be a pain. In one of our project the production
+of some 50 books involves 60.000 \XML\ files and 20.000 images. \footnote {In the
+meantime we could trim this down a lot.} Say that we have the following file:
 \startbuffer[demo]
@@ -32,9 +32,9 @@ following file:
 \typebuffer[demo]
-Before we process this file we will merge the content of the files defined
-as includes into it. When this happens the filename is automatically
-registered so it can be accessed later.
+Before we process this file we will merge the content of the files defined as
+includes into it. When this happens the filename is automatically registered so
+it can be accessed later.
 \startbuffer
 \startxmlsetups xml:initialize
--
cgit v1.2.3