% language=us runpath=texruns:manuals/workflows

% Musical timestamp: Welcome 2 America by Prince, with a pretty good lineup,
% August 2021.

\environment workflows-style

\startcomponent workflows-hashed

\startchapter[title=Hashed files]

In a (basically free content) project we had to deal with tens of thousands of
files. Most are in \XML\ format, but there are also thousands of \PNG, \JPG\ and
\SVG\ images. In a large project like this, which covers a large part of Dutch
school math, images can be shared. All the content is available for schools as
\HTML\ but can also be turned into printable form, and because schools want to
have stable content over specified periods one has to make a regular snapshot of
this corpus. Also, distributing a few gigabytes of data is not much fun.

So, in order to bring the amount down, a dedicated mechanism for handling files
has been introduced. After playing with a \SQLITE\ database we finally settled on
just \LUA, simply because it was faster and it also makes the solution
independent.

The process comes down to creating a file database once in a while, loading a
relatively small hash mapping at runtime and accessing files from a large
data file on demand. Optionally files can be compressed, which makes sense for
the textual files.
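
To make the idea more concrete, here is a minimal sketch in plain \LUA\ of how
names, shared blobs and one data file could relate. It is not the actual
\CONTEXT\ code: the functions \type {newstore}, \type {add} and \type {get} are
hypothetical, the content itself is used as key where a real setup would use a
checksum, and compression is left out.

\starttyping
-- Conceptual sketch only: this is not the ConTeXt implementation.
-- All names used here (newstore, add, get) are hypothetical.

local function newstore()
    return {
        names  = { }, -- file name -> blob key
        blobs  = { }, -- blob key  -> { offset = ..., length = ... }
        chunks = { }, -- pieces that together form the data file
        size   = 0,   -- current length of the data file
    }
end

local function add(store, name, content)
    -- Assumption: the content itself acts as key; a real setup would
    -- use a checksum of the content instead.
    local key = content
    if not store.blobs[key] then
        store.blobs[key] = { offset = store.size, length = #content }
        store.chunks[#store.chunks + 1] = content
        store.size = store.size + #content
    end
    store.names[name] = key
end

local function get(store, name)
    local key = store.names[name]
    if not key then
        return nil
    end
    local blob = store.blobs[key]
    -- A real data file would be read from disk at the given offset.
    local data = table.concat(store.chunks)
    return string.sub(data, blob.offset + 1, blob.offset + blob.length)
end

local store = newstore()
add(store, "a/images/logo.png", "PNGDATA")
add(store, "b/images/logo.png", "PNGDATA") -- a copy: stored only once
add(store, "a/text.xml", "<doc/>")

print(store.size)                      -- 13
print(get(store, "b/images/logo.png")) -- PNGDATA
\stoptyping

Both logo names point to the same blob, so the data grows only once, while each
name remains usable on its own.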

A database is created with one of the \CONTEXT\ extras, for instance:

\starttyping
context --extra=hashed --database=m4 --pattern=m4all/**.xml --compress
context --extra=hashed --database=m4 --pattern=m4all/**.svg --compress
context --extra=hashed --database=m4 --pattern=m4all/**.jpg
context --extra=hashed --database=m4 --pattern=m4all/**.png
\stoptyping

The database uses two files: a small \type {m4.lua} file (some 11 megabytes) and
a large \type {m4.dat} (about 820 megabytes, coming from 1850 megabytes of
originals). Alternatively you can use a specification, say \type {m4all.lua}:

\starttyping
return {
    { pattern  = "m4all/**.xml$", compress = true  },
    { pattern  = "m4all/**.svg$", compress = true  },
    { pattern  = "m4all/**.jpg$", compress = false },
    { pattern  = "m4all/**.png$", compress = false },
}
\stoptyping

\starttyping
context --extra=hashed --database=m4 --patterns=m4all.lua
\stoptyping

You should see something like this on the console:

\starttyping
hashed > database 'hasheddata', 1627 paths, 46141 names,
    36935 unique blobs, 29674 compressed blobs
\stoptyping

So here we share some ten thousand files (all images). In case you wonder why we
keep the duplicates: they have unique names (copies) so that when a section is
updated there is no interference with other sections. The tree structure is
mostly six deep (sometimes there is an additional level).

% \startluacode
%     if not resolvers.finders.helpers.validhashed("hasheddata") then
%         resolvers.finders.helpers.createhashed {
%             database = "hasheddata",
%             pattern  = "m4all/**.jpg$",
%             compress = false,
%         }
%         resolvers.finders.helpers.createhashed {
%             database = "hasheddata",
%             pattern  = "m4all/**.png$",
%             compress = false,
%         }
%         resolvers.finders.helpers.createhashed {
%             database = "hasheddata",
%             pattern  = "m4all/**.xml$",
%             compress = true,
%         }
%     end
% \stopluacode

% \startluacode
%     if not resolvers.finders.helpers.validhashed("hasheddata") then
%         resolvers.finders.helpers.createhashed {
%             database = "hasheddata",
%             patterns = {
%                 { pattern  = "m4all/**.jpg$", compress = false },
%                 { pattern  = "m4all/**.png$", compress = false },
%                 { pattern  = "m4all/**.svg$", compress = true  },
%                 { pattern  = "m4all/**.xml$", compress = true  },
%             },
%         }
%     end
% \stopluacode

Accessing files is the same as with files on the system, but one has to register
a database first:

\starttyping
\registerhashedfiles[m4]
\stoptyping

A fully qualified specifier looks like this (not too different from other
specifiers):

\starttyping
\externalfigure
  [hashed:///m4all/books/chapters/h3/h3-if1/images/casino.jpg]
\externalfigure
  [hashed:///m4all/books/chapters/ha/ha-c4/images/ha-c44-ex2-s1.png]
\stoptyping

but nicer would be:

\starttyping
\externalfigure
  [m4all/books/chapters/h3/h3-if1/images/casino.jpg]
\externalfigure
  [m4all/books/chapters/ha/ha-c4/images/ha-c44-ex2-s1.png]
\stoptyping

This is possible when we also specify:

\starttyping
\registerfilescheme[hashed]
\stoptyping

This makes the given scheme based resolver kick in first, while the normal
file lookup is used as last resort.
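
The lookup order can be pictured with a few lines of plain \LUA. This is only an
illustration of the order described above, not the resolver code itself: the
function \type {resolve}, the two tables and the file names in them are made up
for the example.

\starttyping
-- Conceptual sketch only; resolve, hashed and system are hypothetical.

local hashed = { -- names known to the registered database
    ["m4all/images/casino.jpg"] = true,
}

local system = { -- names present on the file system
    ["m4all/images/casino.jpg"] = true,
    ["local/draft.png"]         = true,
}

local function resolve(name)
    if hashed[name] then
        return "hashed:///" .. name -- the registered scheme is tried first
    elseif system[name] then
        return name                 -- normal file lookup as last resort
    end
    return nil                      -- not found at all
end

print(resolve("m4all/images/casino.jpg")) -- hashed:///m4all/images/casino.jpg
print(resolve("local/draft.png"))         -- local/draft.png
print(resolve("missing.png"))             -- nil
\stoptyping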

This mechanism is written on top of the infrastructure that has been part of
\CONTEXT\ \MKIV\ right from the start but this particular feature is only
available in \LMTX\ (backporting is likely a waste of time).

Just for the record: this mechanism is kept simple, so the database has no update
and replace features. One can just generate a new one. You can test for a valid database
and act upon the outcome:

\starttyping
\doifelsevalidhashedfiles {m4} {
    \writestatus{hashed}{using hashed data}
    \registerhashedfiles[m4]
    \registerfilescheme[hashed]
} {
    \writestatus{hashed}{no hashed data}
}
\stoptyping

Future versions might introduce filename normalization (lowercase, cleanup), so
consider this a first step. First we need to test it for a while.

\stopchapter

\stopcomponent