% language=uk

\startluacode
    job.files.context(dir.glob("exported-*.tex"),"--directives=structures.export.lessstate")
\stopluacode

\startcomponent hybrid-export

\environment hybrid-environment

\startchapter[title={Exporting XML}]

\startsection [title={Introduction}]

Every now and then on the mailing list users ask if \CONTEXT\ can produce \HTML\ instead
of for instance \PDF, and the answer has always been unsatisfying. In this chapter I
will present the \MKIV\ way of doing this.

\stopsection

\startsection [title={The clumsy way}]

My favourite answer to the question about how to produce \HTML\ (or more generally \XML\ as
it can be transformed) has always been: \quotation {I'd just typeset it!}. Take:

\starttyping
\def\MyChapterCommand#1#2{

#2

}

\setuphead[chapter][command=\MyChapterCommand]
\stoptyping

Here \type {\chapter{Hello World}} will produce:

\starttyping

Hello World

\stoptyping

Now imagine that you hook such commands into all relevant environments and that you
use a style with no header and footer lines. You use a large page (A2) and a small
monospaced font (4pt) so that page breaks will not interfere too much. If you want
columns, fine, just hook in some code that typesets the final columns as tables. In
the end you will have an ugly looking \PDF\ file but by feeding it into
\type {pdftotext} you will get a nicely formatted \HTML\ file.

For some languages of course encoding issues would show up and there can be all kinds
of interference, so eventually the amount of code dealing with it would have
accumulated. This is why we don't follow this route.

An alternative is to use \type {tex4ht} which does an impressive job for \LATEX, and
supports \CONTEXT\ to some extent as well. As far as I know it overloads some code deep
down in the kernel which is something \quote {not done} in the \CONTEXT\ universe if
only because we cannot keep control over side effects. It also complicates
maintenance of both systems.

In \MKIV\ however, we do have the ability to export the document to a structured \XML\ file
so let's have a look at that.

\stopsection

\startsection [title={Structure}]

The ability to export to some more verbose format depends on the availability of
structural information. As we already tag elements for the sake of tagged \PDF, it was
tempting to see how well we could use those tags for exporting to \XML. In principle
it is possible to use Acrobat Professional to export the content using tags but you
can imagine that we get a better quality if we stay within the scope of the producing
machinery.

\starttyping
\setupbackend[export=yes]
\stoptyping

This is all you need unless you want to fine tune the resulting \XML\ file. If you are
familiar with tagged \PDF\ support in \CONTEXT, you will recognize the result.
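
To show where this setting goes, here is a minimal sketch of a complete test file.
The file name \type {hello.tex} and its content are assumptions made up for
illustration; it is not one of the example files used in this chapter.

\starttyping
% hello.tex (hypothetical file name)
\setupbackend[export=yes] % enable the export for this run

\starttext

\startchapter[title={Hello World}]
    Some text in the first chapter.
\stopchapter

\stoptext
\stoptyping

The file is processed with \type {context hello.tex} as usual; the export is written
during the same run that produces the \PDF\ file.
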
When you process the following file:

\typefile{exported-001.tex}

You will get a file with the suffix \type {export} that looks as follows:
\footnote{We will omit the topmost lines in the following examples.}

\typefile{exported-001.export}

It's no big deal to postprocess such a file. In that case one can for instance ignore
the chapter number or combine the number and the title. Of course rendering
information is lost here. However, sometimes it makes sense to export some more
details. Take the following table:

\typefile[range=2]{exported-002.tex}

Here we need to preserve the span related information as well as cell specific
alignments as for tables this is an essential part of the structure.

\typefile[range=7]{exported-002.export}

The tabulate mechanism is quite handy for regular text especially when the content of
cells has to be split over pages. As each line in a paragraph in a tabulate becomes a
cell, we need to reconstruct the paragraphs from the (split) alignment cells.

\typefile[range=2]{exported-003.tex}

This becomes:

\typefile[range=7]{exported-003.export}

The \type {} elements are injected automatically between paragraphs. We could tag each
paragraph individually but that does not work that well when we have for instance a
quotation that spans multiple paragraphs (and maybe starts in the middle of one). An
empty element is not sensitive to this and is still a signal that vertical spacing is
supposed to be applied.

\stopsection

\startsection[title=The implementation]

We implement tagging using attributes. The advantage of this is that it does not
interfere with typesetting, but a disadvantage is that not all parent elements are
visible. When we encounter some content, we're in the innermost element so if we want
to do something special, we need to deduce the structure from the current child. This
is no big deal as we have that information available at each child element in the
tree.

The first implementation just flushed the \XML\ on the fly (i.e.\ when traversing the
node list) but when I figured out that collapsing was needed for special cases like
tabulated paragraphs this approach was no longer valid. So, after some experiments I
decided to build a complete structure tree in memory \footnote {We will see if this
tree will be used for other purposes in the future.}. This permits us to handle
situations like the following:

\typefile[range=2]{exported-005.tex}

Here we get:

\typefile[range=7]{exported-005.export}

The \type {symbol} and \type {packed} attributes are first seen at the
\type {itemcontent} level (the innermost element) so when we flush the
\type {itemgroup} element's attributes we need to look at the child elements (content)
that actually carry the attribute.\footnote {Only glyph nodes are investigated for
structure.}

I already mentioned collapsing. As paragraphs in a tabulate get split into cells, we
encounter a mixture that cannot be flushed sequentially. However, as each cell is
tagged uniquely we can append the lines within a cell. Also, as each paragraph gets a
unique number, we can add breaks before a new paragraph starts. Collapsing and adding
breakpoints is done at the end, and not per page, as paragraphs can cross pages.
Again, thanks to the fact that we have a tree, we can investigate content and do this
kind of manipulation.

Moving data like footnotes is somewhat special. When notes are put on the page
(contrary to for instance end notes) the so called \quote {insert} mechanism is used
where their content is kept with the line where it is defined. As a result we see
them end up instream which is not that bad a coincidence. However, as in \MKIV\ notes
are built on top of (enumerated) descriptions, we need to distinguish them somehow so
that we can cross reference them in the export.

\typefile[range=2]{exported-006.tex}

Currently this will end up as follows:

\typefile[range=7]{exported-006.export}

Graphics are also tagged and the \type {image} element reflects the included image.

\typefile[range=2]{exported-007.tex}

If the image sits on another path then that path shows up in an attribute and when a
page other than~1 is taken from the (pdf) image, it gets mentioned as well.

\typefile[range=7]{exported-007.export}

Cross references are another relevant aspect of an export. In due time we will export
them all. It's not that complicated because all information is there but we need to
hook some code into the right spot and making examples for those cases takes a while
as well.

\typefile[range=2]{exported-009.tex}

We export references in the \CONTEXT\ specific way, so no interpretation takes place.

\typefile[range=7]{exported-009.export}

As \CONTEXT\ has an integrated referencing system that deals with internal as well as
external references, urls, special interactive actions like controlling widgets and
navigations, etc., we export the raw reference specification as well as additional
attributes that provide some detail.

\typefile[range=2]{exported-013.tex}

Of course, when postprocessing the exported data, you need to take these variants into
account.

\typefile[range=7]{exported-013.export}

\stopsection

\startsection[title=Math]

Of course there are limitations. For instance \TEX ies doing math might wonder if we
can export formulas. To some extent the export works quite well.

\typefile[range=2]{exported-008.tex}

This results in the usual rather verbose presentation \MATHML:

\typefile[range=7]{exported-008.export}

More complex math (like matrices) will be dealt with in due time, as for this Aditya
and I have to take tagging into account when we revisit the relevant code as part of
the \MKIV\ cleanup and extensions. It's not that complex but it makes no sense to come
up with intermediate solutions.

Display verbatim is also supported.
In this case we tag individual lines.

\typefile[range=2]{exported-010.tex}

The export is not that spectacular:

\typefile[range=7]{exported-010.export}

Marginal notes are a rather special case. We do tag them because they often contain
useful information.

\typefile[range=2]{exported-012.tex}

The output is currently as follows:

\typefile[range=7]{exported-012.export}

However, this might change in future versions.

\stopsection

\startsection[title=Formatting]

The output is formatted using indentation and newlines. The extra run time needed for
this (actually, quite some of the code is related to this) is compensated by the fact
that inspecting the result becomes more convenient. Each environment has one of the
properties \type {inline}, \type {mixed} and \type {display}. A display environment
gets newlines around it and an inline environment none at all. The mixed variant does
something in between. In the following example we tag some user elements, but you can
as well influence the built in ones.

\typefile[range=2]{exported-004.tex}

This results in:

\typefile[range=7]{exported-004.export}

Keep in mind that elements have no influence on the typeset result apart from
introducing spaces when used this way (this is not different from other \TEX\ commands).
In due time the formatting might improve a bit but at least we have less chance of
ending up with those megabyte long one||liners that some applications produce.

\stopsection

\startsection[title=A word of advice]

In (for instance) \HTML\ class attributes are used to control rendering driven by
stylesheets. In \CONTEXT\ you can often define derived environments and their names
will show up in the detail attribute.
So, if you want control at that level in the export, you'd better use the structure
related options built in \CONTEXT, for instance:

\typefile[range=2]{exported-011.tex}

This gives two different sections:

\typefile[range=7]{exported-011.export}

\stopsection

\startsection[title=Conclusion]

It is an open question if such an export is useful. Personally I never needed a
feature like this and there are several reasons for this. First of all, most of my
work involves going from (often complex) \XML\ to \PDF\ and if you have \XML\ as input,
you can also produce \HTML\ from it. For documents that relate to \CONTEXT\ I don't
need it either because manuals are somewhat special in the sense that they often
depend on showing something that ends up on paper (or its screen counterpart) anyway.
Losing the makeup also renders the content somewhat obsolete. But this feature is
still a nice proof of concept anyway.

\stopsection

\stopchapter

\stopcomponent