From 53e1c8df1422d0e8b75dca725ec8a141698810ae Mon Sep 17 00:00:00 2001 From: Philipp Gesang Date: Wed, 10 Mar 2010 09:32:20 +0100 Subject: Now using LPeg, tables and functions in different files. --- .../third/transliterator/transliterator.tex | 78 +++++++++++++--------- 1 file changed, 48 insertions(+), 30 deletions(-) (limited to 'doc/context') diff --git a/doc/context/third/transliterator/transliterator.tex b/doc/context/third/transliterator/transliterator.tex index 92c6ee3..f4cc914 100644 --- a/doc/context/third/transliterator/transliterator.tex +++ b/doc/context/third/transliterator/transliterator.tex @@ -227,7 +227,7 @@ \tfxx\ss\setupinterlinespace[small] The {\em Transliterator} module and mini-manual,\par by Philipp Gesang, Dossenheim.\par -Mail any bugs or improvements to\par +Mail any patches or suggestions to\par pgesang -- AT -- ix -- DOT -- urz -- DOT -- uni-heidelberg -- DOT -- de\par } \stopstandardmakeup @@ -311,10 +311,22 @@ Thus, we would typeset one of Epicuros' sayings like this: \transliterate[mode=gr]{κακὸν ἀνάγκη, ἀλλ' οὐδεμία ἀνάγκη ζῆν μετὰ ἀνάγκης} \stoptyping -which yields \quotation{\transliterate[mode=gr]{κακὸν ἀνάγκη, ἀλλ' οὐδεμία ἀνάγκη ζῆν +\noindentation which yields \quotation{\transliterate[mode=gr]{κακὸν ἀνάγκη, ἀλλ' οὐδεμία ἀνάγκη ζῆν μετὰ ἀνάγκης}} in the pdf output. } - +Alternatively there is an environment, \type{\starttransliterate[#1]}, as well, +that takes the same arguments. + +For orientation purposes the Transliterator comes with two macros that allow +for closer inspection of the internal tables. +\type{\showOneTranslitTab{#1}} outputs, obviously, a single table; their +identifiers +can be found in the \type{trans_} +\type{tables_*.lua} files in the transliterator +directory. +The lazy alternative is \type{\showTranslitTabs} which prints all registered +tables in a row nicely formatted as indexable sections. +(Be warned, this may take some time.) 
\chapter{Introduction} @@ -379,7 +391,8 @@ the scholarly transliterations. To amend the situation the Transliterator provides an extension to ISO~9 for Old Slavonic containing the glyphs \startluacode -local cnt, len = 0, 0 -- Wishing for a len() function that works on dictionaries as in python… +dofile("trans_tables_scntfc.lua") +local cnt, len = 0, 0 for i,j in pairs(translit.ocs_add_low) do len = len + 1 end @@ -427,15 +440,15 @@ However, as there is no hyphenation pattern I know of that closely resembles the transliteration of Greek you might have to resort to putting \type{\discretionary} hyphens when line breaking does not satisfy. -To conclude this, let me have a word on the way the Transliterator works. -Basically, it is a bunch of dictionaries containing substitution rules for -elements that may occur in the text. -These elements may be single characters or strings of more than one character. +The Transliterator as a whole is nothing more than a bunch of dictionaries +containing substitution rules for tokens that may occur in the text. +These tokens may be single characters or strings of more than one character. As there is no simple way to impose order onto those dictionaries the rules for one transliteration method are, if needed, distributed over more than one table which will be applied successively to ensure that multi-character rules are processed first. + \setupfloats[spacebefore=small,spaceafter=small] \placetable[left][none]{% Processing time for corpus Evgenij Onegin according @@ -452,44 +465,49 @@ are processed first. 
\bTABLE[split=no,stretch=yes] \bTABLEhead \bTR - \bTH mode \eTH\bTH time(1) \eTH\bTH \CONTEXT \eTH + \bTH mode \eTH\bTH time(1) in $s$ \eTH\bTH \CONTEXT \eTH \eTR \eTABLEhead \bTABLEbody + \tfx \bTR - \bTC \eTC\bTC 8.59 \eTC\bTC 8.43 \eTC + \bTC \eTC\bTC 8.98 \eTC\bTC 8.82 \eTC \eTR\bTR - \bTC \type{ru} \eTC\bTC 8.84 \eTC\bTC 8.71 \eTC + \bTC \type{all} \eTC\bTC 8.37 \eTC\bTC 8.25 \eTC \eTR\bTR - \bTC \type{all} \eTC\bTC 10.14 \eTC\bTC 10.01 \eTC + \bTC \type{ru_cz} \eTC\bTC 8.61 \eTC\bTC 8.48 \eTC \eTR\bTR - \bTC \type{ru_cz} \eTC\bTC 9.05 \eTC\bTC 8.92 \eTC + \bTC \type{ru_transcript_en} \eTC\bTC 9.26 \eTC\bTC 9.10 \eTC \eTR\bTR - \bTC \type{ru_transcript_en} \eTC\bTC 11.36 \eTC\bTC 11.23 \eTC - \eTR\bTR - \bTC \type{ru_transcript_de} \eTC\bTC 36.19 \eTC\bTC 36.03 \eTC + \bTC \type{ru_transcript_de} \eTC\bTC 14.83 \eTC\bTC 14.71 \eTC \eTR \eTABLEbody \eTABLE } \setuptolerance[tolerant] -The transliteration itself is, admittedly, extremely inefficient as it uses -global substitution iteratively on the whole string for every rule in the -dictionary. -(Maybe this could be replaced by a faster implementation using look ahead that -goes through the string only once, but for now it'll stay as it is until I find -time to care for speed.) -In ordinary use when transliterating single words or short phrases only the +Following suggestions from the mailing list, the Transliterator uses {\em LPeg} +when substituting. +This means a huge speed improvement for most substitution modes when compared +to the older mechanism that used \type{string.gsub} iteratively. +In ordinary use when transliterating single words or short phrases the Transliterator should have little impact on document processing time at large, -with the exception of the German transcription mode, perhaps. 
-For sake of completeness, here are some numbers: +with the exception of the German transcription mode, perhaps.\footnote{ + The problem lies within the rule set for the German transcription which + dictates different instructions depending on the environment of a character; + these may conflict, i.~e. it is impossible to substitute a character stream + in a single run as some rules may apply only to the result of a previous rule. + Let me know if there's a way to tell LPeg to backtrack to the last character + of a match and not to continue on the next. +} Transliterating (and typesetting in MKIV) \transliterate{Александр Пушкин}'s verse novel -\transliterate{Евгений Онегин}, a corpus of about 27000 words, took only -9.7~seconds in \type{[mode=ru]}, compared to 8.9~seconds without -transliteration.\footnote{% - On an IBM T43: \tt 2.6.32-ARCH \#1 SMP PREEMPT Tue Feb 9 14:46:08 UTC 2010 i686 - Intel(R) Pentium(R) M processor 1.60GHz GenuineIntel GNU/Linux. +\transliterate{Евгений Онегин}, a corpus of about 27000 words, in +\type{[mode=all]} shows little to no delay at all. +In fact, typesetting Cyrillic letters with Russian hyphenation seems to slow +things down so much that transliteration may be faster and use slightly less +memory.\footnote{% + On an IBM T43: \tt 2.6.32-ARCH \#1 SMP PREEMPT Tue Feb 9 14:46:08 UTC 2010 + i686 Intel(R) Pentium(R) M processor 1.60GHz GenuineIntel GNU/Linux. }
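
Note on the two substitution strategies mentioned in the patched manual text: the older mechanism applied one global replacement per rule, table after table, whereas the new one substitutes while scanning the string once. The following is a rough Python analogue of that difference, not the module's Lua/LPeg code. The rule set, the function names `translit_iterative` and `translit_single_pass`, and the use of a regular-expression alternation in place of an LPeg grammar are all assumptions made for illustration.

```python
import re

# Hypothetical rules for illustration only; NOT taken from the module's tables.
MULTI_CHAR_RULES = {"ου": "u"}                      # rules spanning two characters
SINGLE_CHAR_RULES = {"ο": "o", "υ": "y", "κ": "k"}  # one character each


def translit_iterative(text, tables):
    """Older style: one global replacement per rule, table after table.

    The order of the tables matters: multi-character rules have to run
    before the single-character rules that would otherwise consume
    their parts.
    """
    for table in tables:
        for source, target in table.items():
            text = text.replace(source, target)
    return text


def translit_single_pass(text, rules):
    """Single scan: at each position the longest matching rule wins."""
    if not rules:
        return text
    # Regex alternation is ordered, so longer keys are listed first.
    keys = sorted(rules, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: rules[match.group(0)], text)


ALL_RULES = {**MULTI_CHAR_RULES, **SINGLE_CHAR_RULES}

# Tables applied in the intended order: multi-character rules first.
ordered = translit_iterative("ουκ", [MULTI_CHAR_RULES, SINGLE_CHAR_RULES])
# Tables applied in the wrong order: the digraph is split before it can match.
misordered = translit_iterative("ουκ", [SINGLE_CHAR_RULES, MULTI_CHAR_RULES])
# One pass over the string, no table ordering needed.
single = translit_single_pass("ουκ", ALL_RULES)
```

With these made-up rules, `ordered` and `single` both come out as `uk`, while `misordered` yields `oyk`. The sketch does not cover the case described in the footnote on the German transcription, where a rule has to apply to the output of an earlier rule; a single left-to-right scan as written here never revisits text it has already replaced.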