From f72c2cf29d36ae836c894bad29dfd965d1af0236 Mon Sep 17 00:00:00 2001
From: Hans Hagen
Date: Sun, 18 Aug 2019 22:51:53 +0200
Subject: 2019-08-18 22:26:00

---
 .../manuals/followingup/followingup-bitmaps.tex | 189 +++++++++++++++++++++
 1 file changed, 189 insertions(+)
 create mode 100644 doc/context/sources/general/manuals/followingup/followingup-bitmaps.tex

diff --git a/doc/context/sources/general/manuals/followingup/followingup-bitmaps.tex b/doc/context/sources/general/manuals/followingup/followingup-bitmaps.tex
new file mode 100644
index 000000000..cf74c0cad
--- /dev/null
+++ b/doc/context/sources/general/manuals/followingup/followingup-bitmaps.tex
@@ -0,0 +1,189 @@
% language=us

\startcomponent followingup-bitmaps

\environment followingup-style

\startchapter[title={Bitmap images}]

\startsection[title={Introduction}]

In \TEX\ image inclusion is traditionally handled by specials. Think of a signal
added someplace in the page stream that says:

\starttyping
\special{image: foo.png 2000 3000}
\stoptyping

Here the numbers, for instance, indicate a scale factor to be divided by 1000.
Because \TEX\ has no floating point numbers, one normally uses an integer, with
the magic multiplier 1000 representing 1.000. Such a special is called a \quote
{whatsit} and is one reason why \TEX\ is so flexible and adaptive.

In \PDFTEX, instead of a \type {\special}, the command \type {\pdfximage} and
its companions are used. In \LUATEX\ this concept has been generalized to \type
{\useimageresource}, which internally is not a so|-|called whatsit (an extension
node) but a special kind of rule. This makes for nicer code: we no longer need
to check whether a certain whatsit node is one with dimensions, and because
rules already take part in calculating box dimensions, no extra overhead for
checking whatsits is added. In retrospect this was one of the more interesting
conceptual changes in \LUATEX.

In \LUAMETATEX\ we don't have such primitives, but we do have these special rule
nodes; we're talking of subtypes, and the frontend doesn't look at those
details. Depending on what the backend needs, one can easily define a scanner
that implements a primitive. We already did that in \CONTEXT. More important is
that inclusion is not handled by the engine, simply because there is no backend.
This means that we need to do it ourselves. There are two steps involved, which
we discuss below.

\stopsection

\startsection[title={Identifying}]

There is only a handful of image formats that make sense in a typesetting
workflow. Because \PDF\ inclusion is supported (but not discussed here) one can
actually take any format, as long as it converts to \PDF, and tools like
GraphicsMagick do a decent job on that. \footnote {Although one really needs to
check a converted image. When we moved to pplib, I found out that lots of
converted images in a project had invalid \PDF\ objects, but apart from a
warning nothing bad resulted from this, because those objects were not used.}
The main bitmap formats that we care about are \JPEG, \JPEG2000, and \PNG. We
could deal with \JBIG\ files too, but I never encountered them, so let's forget
about them for now.
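Identifying such a file mostly means reading a few fields from its header. As an
illustration, the following \LUA\ snippet is a minimal sketch, not the actual
\CONTEXT\ code; it assumes a \LUA\ with \type {string.unpack} (5.3 or later) and
fetches the dimensions and some properties from the \type {IHDR} chunk that
every \PNG\ file starts with:

\starttyping
local function pngproperties(filename)
    local f = io.open(filename,"rb")
    if not f then
        return nil, "no file"
    end
    -- signature (8 bytes), chunk length and type (8), IHDR payload (13)
    local data = f:read(29)
    f:close()
    if not data or #data < 29 or string.sub(data,1,8) ~= "\137PNG\r\n\26\10" then
        return nil, "not a png file"
    end
    local length, kind = string.unpack(">I4c4", data, 9)
    if kind ~= "IHDR" or length < 13 then
        return nil, "invalid header"
    end
    local width, height, depth, colortype,
        compression, filter, interlace = string.unpack(">I4I4BBBBB", data, 17)
    return {
        width      = width,
        height     = height,
        depth      = depth,     -- bits per channel: 1, 2, 4, 8 or 16
        colortype  = colortype, -- 0 gray, 2 rgb, 3 indexed, 4 and 6 with alpha
        interlaced = interlace == 1,
    }
end
\stoptyping

The real code does more checking and also handles the other formats, but the
principle stays the same: a bit of binary parsing, well within reach of \LUA.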
One of the problems with a built|-|in analyzer (and embedder) is that it can
crash or just abort the engine. The main reason is that when the used libraries
run into some issue, the engine is not always able to recover: a converter just
aborts, which then cleans up (potentially messed up) memory. In \LUATEX\ we also
abort, simply because we have no clue to what extent the libraries are still
working as expected further on. We play safe. For the average user this is quite
ok, as it signals that an image has to be fixed.

In a workflow that runs unattended on a server, where users push images to a
resource tree, there is a good chance that a \TEX\ job fails because of some
problem with images. A crash is not really an option there. This is one reason
why converting bitmaps to \PDF\ makes much sense. Another reason is that some
color profiling might be involved. Runtime manipulations make no sense, unless
there is only one typesetting run.

Because in \LMTX\ we do the analyzing ourselves \footnote {Actually, in \MKIV\
this was also possible but not widely advertised; we now exclusively keep this
for \LMTX.} we can recover much more easily. The main reason is of course that,
because we use \LUA, memory management and garbage collection happen in a
pretty well controlled way. And crashing \LUA\ code can easily be intercepted
by a \type {pcall}.

Most (extensible) file formats are based on tables that get accessed from an
index of names and offsets into the file. This means that filtering for instance
metadata like dimensions and resolutions is no big deal (we always did that). I
can extend the analyzing when needed, without a substantial change in the engine
that can affect other macro packages. And \LUA\ is fast enough (and often
faster) for such tasks.

\stopsection

\startsection[title={Embedding}]

Once an image has been identified, the frontend can use that information for
scaling and (if needed) for reusing the same image. Embedding of the image
resource happens when a page is shipped out. For \JPEG\ images this is actually
quite simple: we only need to create a dictionary with the right information and
push the bitmap itself into the associated stream.

For \PNG\ images it's a bit different. Unfortunately \PDF\ only supports certain
formats; for instance, masks have to be separated out and transparency needs to
be resolved. This means that there are two routes: either pass the bitmap blob
to the stream as|-|is, or convert it to a suitable format supported by \PDF. In
\LUATEX\ that is normally done by the backend code, which uses a library for
this. It is a typical example of a dependency on something much larger than
actually needed. In \LUATEX\ the original poppler library, used for filtering
objects from a \PDF\ file, as well as the \PNG\ library, have tons of code on
board that relates to manipulating (writing) data. But we don't need those
features. As a side note: this is something rather general. You decide to use a
small library for a simple task, only to find out after a decade that it has
grown a lot, offering features and carrying extra dependencies that you really
don't want. Even worse: you end up with constant updates due to security (read:
bug) fixes.

Passing the \PNG\ blob itself unchanged to the \PDF\ file is trivial, but
massaging it into an acceptable form when it doesn't suit the \PDF\
specification takes a bit more code. In fact, \PDF\ does not really support
\PNG\ as a format, but it supports \PNG\ compression (aka filters).
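To make this a bit more concrete: the dictionary that describes an image in the
\PDF\ file is not that spectacular. The following sketch is made up for this
story (the \type {info} fields are hypothetical, and the real backend also deals
with object numbers, masks and transparency), but it shows the two routes:

\starttyping
-- a simplified sketch, not the actual backend code

local function imagedictionary(info) -- info: the result of identifying
    local d = {
        "<< /Type /XObject /Subtype /Image",
        string.format("/Width %i /Height %i", info.width, info.height),
        string.format("/ColorSpace /%s /BitsPerComponent %i",
            info.colorspace, info.depth),
    }
    if info.format == "jpeg" then
        -- a jpeg blob goes into the stream as-is
        d[#d+1] = "/Filter /DCTDecode"
    elseif info.format == "png" then
        -- a png data blob keeps its zip compression and row filters
        d[#d+1] = string.format(
            "/Filter /FlateDecode /DecodeParms << /Predictor 15 " ..
            "/Colors %i /BitsPerComponent %i /Columns %i >>",
            info.colors, info.depth, info.width
        )
    end
    d[#d+1] = string.format("/Length %i >>", #info.blob)
    return table.concat(d, " ")
end
\stoptyping

Here \type {/DCTDecode} tells a viewer that the stream contains \JPEG\ data,
while \type {/FlateDecode} with a predictor larger than one signals \PNG|-|style
row filtering; that combination is what makes the pass|-|through route possible.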
Trying to support more complex \PNG\ files is a nice way to test whether you can
transform a public specification into a program, as for instance also happens
with \PDF, \OPENTYPE, and font embedding in \CONTEXT. So this again was a nice
exercise in coding. After a while I was able to process the \PNG\ test suite
using \LUA. Optimizing the code came with understanding the specification.
However, for large images, especially interlaced ones, the runtime was
definitely not to be ignored. It all depends on the tasks at hand:

\startitemize

\startitem
    A \PNG\ blob is compressed with \ZIP\ compression, so first it needs to be
    decompressed. This takes a bit of time. (In the process we found out that
    the \type {zlib} library used in \LUATEX\ had a bug that surfaced when a
    mostly zero byte image was decompressed: we could then hit a filled|-|up
    buffer condition.)
\stopitem

\startitem
    The resulting uncompressed stream is itself compressed with a so|-|called
    filter. Each row starts with a filter byte that indicates how to convert
    bytes into other bytes. The most commonly used methods are deltas with
    preceding pixels and|/|or pixels on a previous row. When done, the filter
    bytes can go away.
\stopitem

\startitem
    Sometimes an image uses 1, 2 or 4 bits per pixel, in which case the rows
    need to be expanded. This can involve a multiplication factor per pixel (it
    can also be an index into a palette).
\stopitem

\startitem
    An image can be interlaced, which means that there are seven parts of the
    image that stepwise build up the whole. In professional workflows with high
    res images interlacing makes no sense, as transfer over the internet is not
    an issue, and the overhead due to reassembling the image and the potentially
    larger file size (due to independent compression of the seven parts) are not
    what we want either.
\stopitem

\startitem
    There can be an image mask that needs to be separated from the main blob. A
    single byte gray scale image then has two bytes per pixel, and a double byte
    one has four bytes of information. An \RGB\ image has three bytes per pixel
    plus an alpha byte, and in the case of double byte pixels we get eight bytes
    per pixel.
\stopitem

\startitem
    Finally the resulting blob has to be compressed again. The current amount of
    time involved in that suggests that there is room for improvement.
\stopitem

\stopitemize

The process is controlled by the number of rows and columns, the number of bytes
per pixel (one or two), and the color space, which effectively means one or
three bytes. These numbers get fed into the filter, deinterlacer, expander
and|/|or mask separator. In order to speed up the embedding, these basic
operations can be assisted by helpers written in \CCODE. Because \LUA\ is quite
good with strings, we pass strings and get back strings. So most of the logic
stays at the \LUA\ end.

\stopsection

\startsection[title=Conclusion]

Going for a library|-|less solution for bitmap inclusion is quite doable and in
most cases as efficient. Because we have a pure \LUA\ implementation for testing
and an optimized variant for production, we can experiment as we like. A
positive side effect is that we can more robustly intercept bad images and
inject a placeholder instead.

\stopsection

\stopchapter

\stopcomponent