diff options
Diffstat (limited to 'doc/context/sources/general/manuals/languages/languages-hyphenation.tex')
-rw-r--r-- | doc/context/sources/general/manuals/languages/languages-hyphenation.tex | 810 |
1 files changed, 810 insertions, 0 deletions
diff --git a/doc/context/sources/general/manuals/languages/languages-hyphenation.tex b/doc/context/sources/general/manuals/languages/languages-hyphenation.tex new file mode 100644 index 000000000..48e6eb385 --- /dev/null +++ b/doc/context/sources/general/manuals/languages/languages-hyphenation.tex @@ -0,0 +1,810 @@ +% language=uk + +\environment languages-environment + +\startcomponent languages-hyphenation + +\startchapter[title=Hyphenation][color=darkmagenta] + +\startsection[title=How it works] + +Proper hyphenation is one of the strong points of \TEX. Hyphenation in \TEX\ is +done using so called hyphenation patterns. Making these patterns is an art +and most users (including me) happily use whatever is available. Patterns can be +created automatically using \type {patgen} but often manual tweaking is needed +too. A pattern looks as follows: + +\starttyping +pat1tern +\stoptyping + +This means as much as: you can split the word \type {pattern} in two pieces, with +a hyphen between the two \type {t}'s. Actually it will also split the word \type +{patterns} because the hyphenation mechanism looks at substrings. When no number +between characters in a pattern is given, a zero is assumed. This means as much +as {\em undefined}. An even number inhibits hyphenation, an odd number permits +it. The larger the number (weight), the more influence it has. A more restricted +pattern is: + +\starttyping +.pat1tern. +\stoptyping + +Here the periods set the word boundaries. The pattern dictionary for us +english has smaller patterns and the next trace shows how these are applied. + +\starthyphenation[traditional] +\showhyphenationtrace[en][pattern] +\stophyphenation + +The effective hyphenation of a word is determined by several factors: + +\startitemize[packed] +\startitem the current language, each language can have different patterns \stopitem +\startitem the characters, as some characters might block hyphenation \stopitem +\startitem the settings of \type {\lefthyphenmin} and \type {\righthyphenmin} \stopitem +\stopitemize + +A place where a word can be hyphenated is called a discretionary. When \TEX\ +analyzes a stream, it will inject discretionary nodes into that stream. + +\starttyping +pat\discretionary{-}{}{}tern. +\stoptyping + +In traditional \TEX\ hyphenation, ligature building and kerning are tightly +interwoven which is quite effective. However, there was also a strong +relationship between the current font and hyphenation. This is a side effect of +traditional \TEX\ having at most 256 characters in a font and the fact that the +used character is fact a reference to a slot in a font. There a character in the +input initially ends up as a character node and eventually becomes a glyph node. +For instance two characters \type {fi} can become a ligature glyph representing +this combination. + +In \LUATEX\ the hyphenation, ligature building and kerning stages are separated +and can be overloaded. In \CONTEXT\ all three can be replaced by code written in +\LUA. Because normally hyphenation happens before font logic is applied, there is +no relationship with font encoding. I wrote the first \LUA\ version of the +hyohenator on a rainy weekend and the result was not that bad so it was presented +at the 2014 \CONTEXT\ meeting. After some polishing I decided to add this routine +to the standard \MKIV\ repertoire which then involved some proper interfacing. + +You can enable the \LUA\ variant with the following command: + +\starttyping +\setuphyphenation[method=traditional] +\stoptyping + +We call this method \type {traditional} because in principle we can have +many more methods and this one is (supposed to be) mostly compatible to the +built-in method. This is a global setting. You can switch back with: + +\starttyping +\setuphyphenation[method=default] +\stoptyping + +In the next sections we will see how we can provide alternatives within the +traditional method. These alternatives can be set local and therefore can operate +over a limited range of characters. + +One complication in interfacing is that \TEX\ has grouping (which permits local +settings) and we want to limit some of the above functionality using groups. At +the same time hyphenation is a paragraph related action so we need to enable the +hyphenation related code at a global level (or at least make sure that it gets +exercised by forcing a \type {\par}). That means that the alternative +hyphenator has to be quite compatible so that we could just enable it for a whole +document. This can have an impact on performance but in practice that can be +neglected. In \LUATEX\ the \LUA\ variant is 4~times slower than the built-in one, +in \LUAJITTEX\ it's 3~times slower. But the good news is that the amount of time +spent in the hyphenator is relatively small compared to other manipulations and +macro expansion. The additional time needed for loading and preparing the +patterns into a more \LUA\ specific format can be neglected. + +You can check how words get hyphenated using the patterns management script: + +\starttyping +>mtxrun --script patterns --hyphenate language + +hyphenator | +hyphenator | . l a n g u a g e . . l a n g u a g e . +hyphenator | 0a2n0 0 0 2 0 0 0 0 0 0 +hyphenator | 2a0n0g0 0 2 2 0 0 0 0 0 0 +hyphenator | 0n1g0u0 0 2 2 1 0 0 0 0 0 +hyphenator | 0g0u4a0 0 2 2 1 0 4 0 0 0 +hyphenator | 2g0e0.0 0 2 2 1 0 4 2 0 0 +hyphenator | .0l2a2n1g0u4a2g0e0. . l a n-g u a g e . +hyphenator | +mtx-patterns | us 3 3 : language : lan-guage +\stoptyping + +\stopsection + +\startsection[title=The last words] + +Mid 2014 we had to upgrade a style for a \PDF\ assembly service: chapters from +(technical) school books are combined into arbitrary new books. There are some +nasty aspects with this flow: for instance, all section numbers in a chapter are +replaced by new numbers and this also involves figure and table prefixes. +It boils down to splitting up books, analyzing the typeset content and +preparing it for replacements. The structure is described in \XML\ files so that +we can generate tables of contents. The reason for not generating from \XML\ +sources is that the publisher doesn't have a \XML\ workflow and that books +already were available. Also, books from several series are combined and even +within a series structure (and rendering) differs. + +What has this to do with hyphenation? Writing a style for such a flow always +results in a more complex one that estimated and as usual it's in the details. +The original style was written in \MKII\ and used some box juggling to achieve +reasonable results but in \MKIV\ we can do better. + +Each chapter has a title and books get titles and subtitles as well. The titles +are typeset each time a new book is composed. This happens within some layout +constraints. Think of constraints like these: + +\startitemize[packed] +\startitem the title goes on top of a shape that doesn't permit much overflow \stopitem +\startitem there can be very long words (not uncommon in Dutch or German) \stopitem +\startitem a short word or hyphenated part should not end up on the last line \stopitem +\startitem the left and right hyphenation minima are at least four \stopitem +\stopitemize + +The last requirement is a compromise because in most cases publishers seem to +want ragged right not hyphenated rendering (at least in Dutch schoolbooks). The +arguments for this are quite weak and probably originate in fear of bad rendering +given past experiences. It's this kind of situations that drive the development +of the more obscure features that ship with \CONTEXT\ and a (partial) solution +for this specific case will be given later. + +If you look at thousands of titles and turn these into (small) paragraphs \TEX\ +does a pretty good job. It's the few exceptions that we need to catch. The next +examples demonstrate such an extreme case. + +\startbuffer[example] +\dorecurse{5} { % dejavu + \startlinecorrection[blank] + \bTABLE + \bTR + \bTD[align=middle,width=2em,foregroundstyle=bold] + #1 + \eTD + \bTD[align={verytolerant,flushleft},width=15em,offset=1ex] + \hsize \dimexpr11\emwidth-#1\dimexpr.5\emwidth\relax + \dontcomplain + \lefthyphenmin=4\righthyphenmin=4 + \blackrule[color=darkyellow,width=\hsize,height=-3pt,depth=5pt]\par + \begstrut\getbuffer[long]\endstrut\par + \eTD + \bTD[align={verytolerant,flushleft},width=15em,offset=1ex] + \sethyphenationfeatures[demo] + \hsize \dimexpr11\emwidth-#1\dimexpr.5\emwidth\relax + \dontcomplain + \blackrule[color=darkyellow,width=\hsize,height=-3pt,depth=5pt]\par + \begstrut\getbuffer[long]\endstrut\par + \eTD + \eTR + \eTABLE + \stoplinecorrection +} +\stopbuffer + +\definehyphenationfeatures + [demo] + [rightwords=1, + lefthyphenmin=4, + righthyphenmin=4] + +\startbuffer[long] +a verylongword and then anevenlongerword +\stopbuffer + +\starthyphenation[traditional] + \enabletrackers[hyphenator.visualize] + \getbuffer[example]\par + \disabletrackers[hyphenator.visualize] +\stophyphenation + +Of course in practice there need to be some reasonable width and when we pose +these limits the longest possible word should fit into the allocated space. In +these examples the rule shows the width. In the right columns we see a red +colored word and that one will not get hyphenated. + +\stopsection + +\startsection[title=Explicit hyphens] + +Another special case that we needed to handle were (compound) words with explicit +hyphens. Because often data comes from \XML\ files we can not really control the +typesetting as in a \TEX\ document where the author sees what gets done. So here +we need a way to turn these hyphens into proper hyphenation directives and at the +same time permit the words to be hyphenated. + +\definehyphenationfeatures + [demo] + [hyphens=yes, + lefthyphenmin=4, + righthyphenmin=4] + +\startbuffer[long] +a very-long-word and then an-even-longer-word +\stopbuffer + +\starthyphenation[traditional] + \enabletrackers[hyphenator.visualize] + \getbuffer[example]\par + \disabletrackers[hyphenator.visualize] +\stophyphenation + +\stopsection + +\startsection[title=Extended patterns] + +As with more opened up mechanisms, in \MKIV\ we can extend functionality. As an +example I have implemented the extensions discussed in the article by László +Németh in the Proceedings of Euro\TEX\ 2006: {\em Hyphenation in OpenOffice.org} +(TUGboat, Volume 27, 2006). The syntax for these extension is somewhat ugly and +involves optional offsets and ranges. \footnote {I'm not sure if there were ever +patterns released that used this syntax.} + +\startbuffer +\registerhyphenationpattern[nl][e1ë/e=e] +\registerhyphenationpattern[nl][a9atje./a=t,1,3] +\registerhyphenationpattern[en][eigh1tee/t=t,5,1] +\registerhyphenationpattern[de][c1k/k=k] +\registerhyphenationpattern[de][schif1f/ff=f,5,2] +\stopbuffer + +\typebuffer \getbuffer + +These patterns result in the following hyphenations: + +\starthyphenation[traditional] + \switchtobodyfont[big] + \starttabulate[|||] + \NC reëel \NC \language[nl]\hyphenatedcoloredword{reëel} \NC \NR + \NC omaatje \NC \language[nl]\hyphenatedcoloredword{omaatje} \NC \NR + \NC eighteen \NC \language[en]\hyphenatedcoloredword{eighteen} \NC \NR + \NC Zucker \NC \language[de]\hyphenatedcoloredword{Zucker} \NC \NR + \NC Schiffahrt \NC \language[de]\hyphenatedcoloredword{Schiffahrt} \NC \NR + \stoptabulate +\stophyphenation + +In a specification, the \type {.} indicates a word boundary and numbers indicate +the weight of a breakpoint. The optional extended specification comes after the +\type {/}. The values separated by a \type {=} are the pre and post sequences: +these end up at the end of the current line and beginning of the next one. The +optional numbers are the start position and length. These default to~1 and~2, so +in the first example they identify \type {eë} (the weights don't count). + +There is a pitfall here. When the language already has patterns that for +instance prohibit a hyphen between \type {e} and type {ë}, like \type{e2ë}, we +need to make sure that we give our new one a higher priority, which is why we +used a \type{e9ë}. + +This feature is somewhat experimental and can be improved. Here is a more \LUA-ish +way of setting such patterns: + +\starttyping +local registerpattern = + languages.hyphenators.traditional.registerpattern + +registerpattern("nl","e1ë", { + start = 1, + length = 2, + before = "e", + after = "e", +} ) + +registerpattern("nl","a9atje./a=t,1,3") +\stoptyping + +Just adding extra patterns to an existing set without much testing is not wise. For +instance we could add these to the dutch dictionary: + +\starttyping +\registerhyphenationpattern[nl][e3ë/e=e] +\registerhyphenationpattern[nl][o3ë/o=e] +\registerhyphenationpattern[nl][e3ï/e=i] +\registerhyphenationpattern[nl][i3ë/i=e] +\registerhyphenationpattern[nl][a5atje./a=t,1,3] +\registerhyphenationpattern[nl][toma8at5je] +\stoptyping + +That would work oke well for words like + +\starttyping +coëfficiënt +geïntroduceerd +copiëren +omaatje +tomaatje +\stoptyping + +However, the last word only goes right because we explicitly added a pattern +for it. One reason is that the existing patterns already contain rules to +prevent weird hyphenations. The same is true for the accented characters. So, +consider these examples and coordinate additional patterns with other users +so that errors can be identified. + +\stopsection + +\startsection[title=Exceptions] + +We have a variant on the \TEX\ primitive \type {\hyphenation}, the official way +to register a specific way to hyphenate a word. + +\startbuffer +\registerhyphenationexception[aaaaa-bbbbb] +aaaaabbbbb \par +\stopbuffer + +\typebuffer + +\noindentation This code is self explaining and results in: + +\blank + +\starthyphenation[traditional] +\setupindenting[no]\hsize 1mm \lefthyphenmin 1 \righthyphenmin 1 \getbuffer +\stophyphenation + +\noindentation There can be multiple hyphens and even multiple words in such a +specification: + +\startbuffer +\registerhyphenationexception[aaaaa-bbbbb cc-ccc-ddd-dd] +aaaaabbbbb \par +cccccddddd \par +\stopbuffer + +\typebuffer + +\noindentation We get: + +\blank + +\starthyphenation[traditional] +\setupindenting[no]\hsize 1mm \lefthyphenmin 1 \righthyphenmin 1 \getbuffer +\stophyphenation + + +\stopsection + +\startsection[title=Boundaries] + +A box, rule, math or discretionary will end a word and prohibit hyphenation +of that word. Take this example: + +\startbuffer[demo] +whatever \par +whatever\hbox{!} \par +\vl whatever\vl \par +whatever$x$ \par +whatever-whatever \par +\stopbuffer + +\typebuffer[demo] + +These lines will hyphenate differently and in traditional \TEX\ you need to +insert penalties and|/|or glue to get around it. In the \LUA\ variant we can +enable that limitation. + +\startbuffer +\definehyphenationfeatures + [strict] + [rightedge=tex] +\stopbuffer + +\typebuffer \getbuffer + +Here we show the three variants: traditional \TEX\ and \LUA\ with and without +strict settings. + +\starttabulate[|p|p|p|] +\HL +\NC \ttbf \hbox to 11em{default\hss} +\NC \ttbf \hbox to 11em{traditional\hss} +\NC \ttbf \hbox to 11em{traditional strict\hss} +\NC \NR +\HL +\NC \starthyphenation[default] \hsize1mm \getbuffer[demo] \stophyphenation +\NC \starthyphenation[traditional] \hsize1mm \getbuffer[demo] \stophyphenation +\NC \starthyphenation[traditional] \sethyphenationfeatures[strict] + \hsize1mm \getbuffer[demo] \stophyphenation +\NC \NR +\HL +\stoptabulate + +By default \CONTEXT\ is configured to hyphenate words that start with an +uppercase character. This behaviour is controlled in \TEX\ by the \typ {\uchyph} +variable. A positive value will enable this and a negative one disables it. + +\starttabulate[|p|p|p|p|] +\HL +\NC \ttbf \hbox to 8em{default 0\hss} +\NC \ttbf \hbox to 8em{default 1\hss} +\NC \ttbf \hbox to 8em{traditional 0\hss} +\NC \ttbf \hbox to 8em{traditional 1\hss} +\NC \NR +\HL +\NC \starthyphenation[default] \hsize1mm \uchyph\zerocount TEXified \dontcomplain \stophyphenation +\NC \starthyphenation[traditional] \hsize1mm \uchyph\zerocount TEXified \dontcomplain \stophyphenation +\NC \starthyphenation[default] \hsize1mm \uchyph\plusone TEXified \dontcomplain \stophyphenation +\NC \starthyphenation[traditional] \hsize1mm \uchyph\plusone TEXified \dontcomplain \stophyphenation +\NC \NR +\HL +\stoptabulate + +The \LUA\ variants behaves the same as the built-in implementation (that of course +remains the reference). + +\stopsection + +\startsection[title=Plug-ins] + +The default hyphenator is similar to the built-in one, with a couple of +extensions as mentioned. However, you can plug in your own code, given that it +does return a proper hyphenation result. One reason for providing this plug is +that there are users who want to play with hyphenators based on a different +logic. In \CONTEXT\ we already have some methods to deal with languages that +(for instance) have no spaces but split on words or syllabes. A more tight +integration with the hyphenator can have advantages so I will explore these +options when there is demand. + +A result table indicates where we can break a word. If we have a four character +word and can break after the second character, the result looks like this: + +\starttyping +result = { false, true, false, false } +\stoptyping + +Instead of \type {true} we can also have a table that has entries like the +extensions discussed in a previous section. Let's give an example of a +plug-in. + +\startbuffer +\startluacode + local subset = { + a = true, + e = true, + i = true, + o = true, + u = true, + y = true, + } + + languages.hyphenators.traditional.installmethod("test", + function(dictionary,word,n) + local t = { } + for i=1,#word do + local w = word[i] + if subset[w] then + t[i] = { + before = "<" .. w, + after = w .. ">", + left = false, + right = false, + } + else + t[i] = false + end + end + return t + end + ) +\stopluacode +\stopbuffer + +\typebuffer \getbuffer + +Here we hyphenate on vowels and surround them by angle brackets when +split over lines. This alternative is installed as follows: + +\startbuffer +\definehyphenationfeatures + [demo] + [alternative=test] +\stopbuffer + +\typebuffer \getbuffer + +We can now use it as follows: + +\starttyping +\setuphyphenation[method=traditional] +\sethyphenationfeatures[demo] +\stoptyping + +When applied to one the tufte example we get: + +\startbuffer[demo] +\starthyphenation[traditional] + \setuptolerance[tolerant] + \sethyphenationfeatures[demo] + \noindentation % \dontleavehmode + \input tufte\relax +\stophyphenation +\stopbuffer + +\blank \startnarrower \getbuffer[demo] \stopnarrower \blank + +A more realistic (but not perfect) example is the following: + +\startbuffer +\startluacode + local packslashes = false + + local specials = { + ["!"] = "before", ["?"] = "before", + ['"'] = "before", ["'"] = "before", + ["/"] = "before", ["\\"] = "before", + ["#"] = "before", + ["$"] = "before", + ["%"] = "before", + ["&"] = "before", + ["*"] = "before", + ["+"] = "before", ["-"] = "before", + [","] = "before", ["."] = "before", + [":"] = "before", [";"] = "before", + ["<"] = "before", [">"] = "before", + ["="] = "before", + ["@"] = "before", + ["("] = "before", + ["["] = "before", + ["{"] = "before", + ["^"] = "before", ["_"] = "before", + ["`"] = "before", + ["|"] = "before", + ["~"] = "before", + -- + [")"] = "after", + ["]"] = "after", + ["}"] = "after", + } + + languages.hyphenators.traditional.installmethod("url", + function(dictionary,word,n) + local t = { } + local p = nil + for i=1,#word do + local w = word[i] + local s = specials[w] + if s == "after" then + s = { + start = 1, + length = 1, + after = w, + left = false, + right = false, + } + specials[w] = s + elseif s == "before" then + s = { + start = 1, + length = 1, + before = w, + left = false, + right = false, + } + specials[w] = s + end + if not s then + s = false + elseif w == p and w == "/" then + t[i-1] = false + end + t[i] = s + if packslashes then + p = w + end + end + return t + end + ) +\stopluacode +\stopbuffer + +\typebuffer \getbuffer + +Again we define a plug: + +\startbuffer +\definehyphenationfeatures + [url] + [characters=all, + alternative=url] +\stopbuffer + +\typebuffer \getbuffer + +So, we only break a line after symbols. + +\startlinecorrection[blank] + \starthyphenation[traditional] + \tt + \sethyphenationfeatures[url] + \scale[width=\hsize]{\hyphenatedcoloredword{http://www.pragma-ade.nl}} + \stophyphenation +\stoplinecorrection + +\noindentation A quick test can look as follows: + +\startbuffer +\starthyphenation[traditional] + \sethyphenationfeatures[url] + \tt + \dontcomplain + \hsize 1mm + http://www.pragma-ade.nl +\stophyphenation +\stopbuffer + +\typebuffer + +Or: + +\getbuffer + +\stopsection + +\startsection[title=Blocking ligatures] + +Yet another predefined feature is the ability to block a ligature. In +traditional \TEX\ this can be done by putting a \type {{}} between +the characters, although that effect can get lost when the text is +manipulated. The natural way to do this in a \UNICODE\ environment +is to use the special characters \type {zwj} and \type {zwnj}. + +We use the following example lines: + +\startbuffer[sample] +supereffective \blank +superef\zwnj fective +\stopbuffer + +\typebuffer[sample] + +\noindentation and define two featuresets: + +\startbuffer +\definehyphenationfeatures + [demo-1] + [characters=\zwnj\zwj, + joiners=yes] + +\definehyphenationfeatures + [demo-2] + [joiners=no] +\stopbuffer + +\typebuffer \getbuffer + +\noindentation We limit the width to 1mm and get: + +\startlinecorrection[blank] +\bTABLE[option=stretch,offset=.5ex] + \bTR + \bTD \tx + \type{method=default} + \eTD + \bTD \tx + \type{method=traditional} + \eTD + \bTD \tx + \type{method=traditional}\par + \type{featureset=demo-1} + \eTD + \bTD \tx + \type{method=traditional}\par + \type{featureset=demo-2} + \eTD + \eTR + \bTR + \bTD + \hsize 1mm \dontcomplain + \starthyphenation[default] + \getbuffer[sample] + \stophyphenation + \eTD + \bTD + \hsize 1mm \dontcomplain + \starthyphenation[traditional] + \getbuffer[sample] + \stophyphenation + \eTD + \bTD + \hsize 1mm \dontcomplain + \starthyphenation[traditional] + \sethyphenationfeatures[demo-1] + \getbuffer[sample] + \stophyphenation + \eTD + \bTD + \hsize 1mm \dontcomplain + \starthyphenation[traditional] + \sethyphenationfeatures[demo-2] + \getbuffer[sample] + \stophyphenation + \eTD + \eTR +\eTABLE +\stoplinecorrection + +\stopsection + +\startsection[title=Special characters] + +The \type {characters} example can be used (to some extend) to do the +same as the breakpoints mechanism (compounds). + +\startbuffer +\definehyphenationfeatures + [demo-3] + [characters={()[]}] +\stopbuffer + +\typebuffer \blank \getbuffer \blank + +\startbuffer[demo] +\starthyphenation[traditional] + \sethyphenationfeatures[demo-3] + \dontcomplain + \hsize 1mm \noindentation + we use (super)special(ized) patterns +\stophyphenation +\stopbuffer + +\typebuffer[demo] \blank \getbuffer[demo] \blank + +We can make this more clever by adding patterns: + +\startbuffer +\registerhyphenationpattern[en][)9] +\registerhyphenationpattern[en][9(] +\stopbuffer + +\typebuffer \blank \getbuffer \blank + +\noindentation This gives: + +\blank \getbuffer[demo] \blank + +\noindentation A detailed trace shows that these patterns get applied: + +\starthyphenation[traditional] + \ttx + \showhyphenationtrace[en][(super)special(ized)] +\stophyphenation + +\unregisterhyphenationpattern[en][)9] +\unregisterhyphenationpattern[en][9(] + +\noindentation The somewhat weird hyphens at the edges will in practice not show +up because there is always one regular character there. + +\stopsection + +\startsection[title=Tracing] + +Among the tracing options (low level trackers) there is one for pattern developers: + +\startbuffer +\usemodule[s-languages-hyphenation] + +\startcomparepatterns[de,nl,en,fr] + \input zapf \quad (\showcomparepatternslegend) +\stopcomparepatterns +\stopbuffer + +\typebuffer + +The different hyphenation points are shown with colored bars. Some valid points +might not be shown because the font engine can collapse successive +discretionaries. + +\getbuffer + +\stopsection + +\stopchapter + +\stopcomponent |