% language=us

\startcomponent workflows-parallel

\environment workflows-style

\startchapter[title={Parallel processing}]

% \startsection[title={Introduction}]

% \stopsection

This is just a small intermezzo. Mid April 2020 Mojca asked on the mailing list how
to best compile 5000 files, based on a template. The answer depends on the workflow
and circumstances but one can easily come up with some factors that play a role.

\startitemize
    \startitem
        How complex is the document? How many pages are generated, how many fonts
        get used? Do we need multiple runs per document? Are images involved and
        if so, what format are they in? When processing relatively small files we
        normally need seconds, not minutes.
    \stopitem
    \startitem
        What machine is used? How powerful is the \CPU, how many cores are
        available and how much memory do we have? Is the filesystem on a local
        \SSD\ or on a remote file system? How well does file caching work? Again,
        we're talking seconds here.
    \stopitem
    \startitem
        What engine is used? Assuming that \MKIV\ is used, we can choose between
        \LUATEX\ and \LUAMETATEX. The former has faster backend code, the latter a
        faster frontend. Which one is more efficient depends on the document. The
        latter has some advantages that we will not mention here.
    \stopitem
\stopitemize

The tests mentioned below are run with a simple \LUA\ script that manages the
parallel runs. More about that later. As sample document we use this:

\starttyping
\setupbodyfont[dejavu]

\starttext
    \dorecurse{\getdocumentargument{noffiles}}{\input tufte\par}
\stoptext
\stoptyping

We start with 100 runs of 10 inclusions. We permit 8 runs in parallel. A \LUATEX\
run of 100 takes 32 seconds, a \LUAJITTEX\ run uses 26 seconds, and \LUAMETATEX\
does it in 25 seconds. \footnote {I used a mingw cross compiled 64 bit binary;
the GCC9 version seems somewhat slower than the previous compiler version.} An
interesting observation is memory consumption: \LUAJITTEX, which has a different
virtual machine and a limited memory model, peaks at 0.8GB for the eight parallel
runs. The \LUAMETATEX\ engine has the same demands. However, \LUATEX\ needs
1.2GB. Bumping to 20 inclusions increased the runtime by a few seconds for each
engine.

The differences can be explained by a faster startup time of \LUAMETATEX; for
instance we don't use a compressed format (dump), but there are some other
optimizations too, and even when they're close to unmeasurable, they might add
up. The \LUAJITTEX\ engine speeds up \LUA\ interpretation which is reflected in
runtime because \CONTEXT\ spends half its time in \LUA.

As a next test I decided to run the test file 5000 times: Mojca's scenario.
Including 10 sample files (per run) for those 5000 files took 1320 seconds. When
we cache the included file we gain some 5~percent.

Does it matter how many jobs we run in parallel? The 2013 laptop I used for
testing has four real cores that hyperthread to eight cores. \footnote {The
machine has an Intel i7-3840QM \CPU, 16GB of memory and a 512 GB Samsung Pro
\SSD.} For 1000 jobs (10 inclusions each) we need 320 seconds when we use four
cores. With six cores we need 270 seconds, which is much better. With eight
cores we go down to 260 seconds and with ten cores, which is two more than there
are, we get about the same runtime. \footnote {On a more modern system, let alone
a desktop computer, I expect these numbers to be much lower.} A \TEX\ program is
a single core process, so it makes no sense to run more jobs in parallel than
the \CPU\ has cores.
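
As an illustration of limiting the number of parallel runs, here is a minimal
sketch in Python; it is not the \LUA\ script that was used for the tests. The
function name \type {run_parallel} and the file names in the usage comment are
made up for this example, and it assumes that every job is an independent
command that can run as its own process.

\starttyping
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_parallel(commands, max_parallel=None):
    """Run each command as its own process, at most max_parallel at a time.

    Returns the exit codes in the order of the given commands.
    """
    cores = os.cpu_count() or 1
    # never start more simultaneous runs than there are logical cores
    workers = min(max_parallel or cores, cores)

    def run(command):
        # each worker thread just waits for one single core process
        done = subprocess.run(command,
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL)
        return done.returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, commands))

# hypothetical usage, the file names are assumptions:
#
# commands = [["context", "--batchmode", f"file-{i:04d}.tex"]
#             for i in range(1000)]
# failed = [c for c in run_parallel(commands, 8) if c != 0]
\stoptyping

The pool never starts more runs than the machine reports logical cores, which
is in line with the observation that ten parallel runs on eight cores gain
nothing.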

\starttyping
\setupbodyfont[dejavu]

\starttext
    \dorecurse{\getdocumentargument{noffiles}}{\samplefile{tufte}\par}
\stoptext
\stoptyping

Again, caching the input file as above saves a little bit: 10 seconds, so we get
250 seconds. When you run these tests on the machine that you normally work on,
waiting for that many jobs to finish is no fun, so what if we (as I then normally
do) watch some music video? With a fullscreen high resolution video shown in the
foreground the runtime didn't change: still 250 seconds for 1000 jobs with eight
parallel runs. On the other hand, a test with Firefox, which is quite demanding,
running a video in the background, made the runtime go up by 30 seconds to
280. So, when a browser is busy with networking, decompression, all kinds of
unknown tracking using \JAVASCRIPT, etc., and therefore puts its own demands on
cores and memory, you might want to limit the number of parallel runs. These
tests are probably not that meaningful but they are a good distraction when in
lockdown.

I'm still not sure if I should come up with a script for managing these parallel
runs. But one thing I have added to the \type {context} runner is the (for now
undocumented) option

\starttyping
--wipebusy
\stoptyping

which, after a run, removes the file

\starttyping
context-is-busy.tmp
\stoptyping

This permits a management script to check if a run is done. Before starting a run
(in a separate process) the script can write that file and by just checking if it
is still there, the management script can decide when a next run can be started.
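
The busy file check can be sketched as follows, in Python and only as an
illustration. It assumes that every run has its own working directory and that
\type {context-is-busy.tmp} lives in that directory; where the runner looks for
the file is not stated here, so treat that, the function name \type
{run_with_busy_files}, and the command line in the usage comment as assumptions.

\starttyping
import subprocess
import time
from pathlib import Path

BUSY = "context-is-busy.tmp"  # name of the busy file as given in the text

def run_with_busy_files(jobs, max_parallel, poll=0.5):
    """jobs is a list of (workdir, command) pairs.

    Returns the number of runs that were started.
    """
    pending = list(jobs)
    active = []  # (busy file, process) pairs of runs not yet seen as done
    started = 0
    while pending or active:
        still_busy = []
        for busy, process in active:
            exited = process.poll() is not None
            if not busy.exists():
                process.wait()   # file wiped: the run is done
            elif exited:
                busy.unlink()    # run ended without wiping its file
            else:
                still_busy.append((busy, process))
        active = still_busy
        while pending and len(active) < max_parallel:
            workdir, command = pending.pop(0)
            busy = Path(workdir) / BUSY
            busy.write_text("busy\n")  # written before the run starts
            process = subprocess.Popen(command, cwd=workdir)
            active.append((busy, process))
            started += 1
        if active:
            time.sleep(poll)
    return started

# hypothetical usage, one directory per job (an assumption):
#
# jobs = [(f"job-{i:04d}", ["context", "--wipebusy", "file.tex"])
#         for i in range(5000)]
# run_with_busy_files(jobs, max_parallel=8)
\stoptyping

In Python the process handle already tells us when a run has ended, so the
handle is only used here to clean up after a run that stopped without wiping
its file. The file check itself is what matters for a script that starts runs
in a detached way and has nothing else to look at.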

\stopchapter

\stopcomponent

% downloaded video : Jojo Mayer's 2019 TED talk: https://www.youtube.com/watch?v=Npq-bhz1ll0
% realtime video   : Andrew Cuomo's daily press conference on dealing with Covid 19