2018-09-13 17:49:00

author: Hans Hagen <pragma@wxs.nl> 2018-09-13 18:21:39 +0200
committer: Context Git Mirror Bot <phg@phi-gamma.net> 2018-09-13 18:21:39 +0200
commit: 56ca0139232f16679918613ef45a5dd643f0f9b3 (patch)
tree: f5afef4d57e2cdbf1a6cb777635ec871be34837c /tex/context/patterns/common/lang-bg.rme
parent: 5c433e6e8accaa4bc9ebe0a094b925fe11a8edf5 (diff)
1 files changed, 890 insertions, 0 deletions
diff --git a/tex/context/patterns/common/lang-bg.rme b/tex/context/patterns/common/lang-bg.rme
new file mode 100644
index 000000000..25a3e2ca5
--- /dev/null
+++ b/tex/context/patterns/common/lang-bg.rme
@@ -0,0 +1,890 @@
+% generated by mtxrun --script pattern --convert
+
+% copyright: Copyright (C) 2000, 2004, 2017 by Anton Zinoviev <anton@lml.bas.bg>
+% title: Bulgarian hyphenation patterns
+% version: 21 October 2017
+% language:
+%     name: Bulgarian
+%     tag: bg
+% notice: >
+%     This file is part of the hyph-utf8 package.
+%     See http://www.hyphenation.org for more information.
+% authors:
+%     -
+%         name: Anton Zinoviev
+%         contact: anton:lml.bas.bg
+% licence:
+%     text: >
+%         This software may be used, modified, copied, distributed, and sold,
+%         both in source and binary form provided that the above copyright
+%         notice and these terms are retained. The name of the author may not
+%         be used to endorse or promote products derived from this software
+%         without prior permission.  THIS SOFTWARE IS PROVIDES "AS IS" AND
+%         ANY EXPRESS OR IMPLIED WARRANTIES ARE DISCLAIMED.  IN NO EVENT
+%         SHALL THE AUTHOR BE LIABLE FOR ANY DAMAGES ARISING IN ANY WAY OUT
+%         OF THE USE OF THIS SOFTWARE.
+% hyphenmins:
+%     typesetting:
+%         left: 2
+%         right: 2
+% changes: See below
+% ==========================================
+% Copyright (C) 2000,2004,2017 by Anton Zinoviev <anton@lml.bas.bg>
+%
+% This software may be used, modified, copied, distributed, and sold,
+% both in source and binary form provided that the above copyright
+% notice and these terms are retained. The name of the author may not
+% be used to endorse or promote products derived from this software
+% without prior permission.  THIS SOFTWARE IS PROVIDES "AS IS" AND
+% ANY EXPRESS OR IMPLIED WARRANTIES ARE DISCLAIMED.  IN NO EVENT
+% SHALL THE AUTHOR BE LIABLE FOR ANY DAMAGES ARISING IN ANY WAY OUT
+% OF THE USE OF THIS SOFTWARE.
+%
+% Bulgarian hyphenation patterns
+%
+% Generated by ./hyph-bg.sh --safe-morphology --standalone-tex
+%
+% Both left and right hyphenmins should be set to 2.
+%
+% % Automated Bulgarian Hyphenation
+% % Anton Zinoviev
+% % 21 October 2017
+% 
+% Principles of the Bulgarian hyphenation
+% =======================================
+% 
+% One specificity of the Bulgarian language is that the average length
+% of the words is greater than in English.  When typesetting a Bulgarian
+% text, hyphenation is more important than when typesetting an English
+% text.  Knuth's algorithm for line-breaking is such that in most
+% English paragraphs no hyphenation will be used.  With a Bulgarian
+% text, however, even Knuth's algorithm will use hyphenation in most
+% paragraphs.  Hyphenation becomes an absolute necessity if we want to
+% obtain nice, justified paragraphs when using software with a naive
+% line-breaking algorithm, such as LibreOffice.
+% 
+% According to Decree 936 of the Council of Ministers promulgated on 27
+% November 1950, the Institute for Bulgarian Language at the Bulgarian
+% Academy of Sciences is authorised to publish the rules of the
+% orthography of the Bulgarian language (within certain limits).
+% 
+% Hyphenation rules between 1945 and 1983
+% ---------------------------------------
+% 
+% Between 1945 and 1983 Bulgarian used syllable hyphenation with two
+% morphological exceptions: hyphenation is preferred between a prefix
+% and a stem and at the boundary of compound words.  The following were
+% the rules governing the hyphenation:
+% 
+% 1. One letter does not stay alone.  Words of one syllable cannot be
+%    hyphenated.
+% 2. No hyphenation before or after ь.
+% 3. In a sequence of vowels at least one vowel stays before the
+%    hyphen.
+% 4. A single consonant between two vowels links with the second vowel.
+%    For example по-ле /po-le/, ра-бо-та /ra-bo-ta/.
+% 5. In a sequence of consonants between two vowels, at least one
+%    consonant stays with the second vowel.  For example те-сто /te-sto/
+%    or тес-то /tes-to/.[^b]
+% 6. In a sequence of consonants between two vowels, if the first
+%    consonant is sonorant (й /y/, л /l/, м /m/, н /n/, р /r/), then it
+%    stays with the first vowel.  For example гер-дан /ger-dan/, сен-ки
+%    /sen-ki/.
+% 7. The hyphenation separates two successive equal consonants. For
+%    example времен-но /vremen-no/, пролет-та /prolet-ta/.
+% 8. When the letters дж /dzh/ and дз /dz/ denote a single consonant,
+%    then they are not separated.  For example боя-джия /boya-dzhiya/
+%    but not бояд-жия /boyad-zhiya/.  When these letters denote two
+%    consonants, then the normal rules apply: над-живявам
+%    /nad-zhivyavam/.
+% 9. Word prefixes may not be broken.  Compound words are hyphenated
+%    either at the boundary of the components or the hyphenation rules
+%    are applied to each of the components separately.  For example:
+%    пред-упреждавам /pred-uprezhdavam/ (not пре-дупреждавам
+%    /pre-duprezhdavam/), пред-известие /pred-izvestie/ (not
+%    пре-дизвестие /pre-dizvestie/), за-движвам /za-dvizhvam/ (not
+%    зад-вижвам /zad-vizhvam/), авто-клуб /avto-klub/ (not авток-луб
+%    /avtok-lub/), вакуум-апарат /vakuum-aparat/ (not вакуу-мапарат
+%    /vakuu-maparat/).
+% 
+% In some rare cases the proper application of rule 9 depends on the
+% semantics of the word.  For example пре-дреша /pre-dresha/ 'change
+% clothes' but пред-реша /pred-resha/ 'predetermine' or прес-пите
+% /pres-pite/ 'the snow-drifts' but пре-спите /pre-spite/ 'sleep for a
+% while/overnight'.
+% 
+% [^b]: In several publications this rule is formulated with the
+%     additional restriction that the sequence of consonants begins with
+%     an obstruent.  I believe this restriction is unintentional.  It
+%     makes no sense to forbid a hyphenation of the form AB-A but to
+%     permit ABB-A (A denotes a vowel and B – a consonant).
+% 
+% Hyphenation rules between 1983 and 2012
+% ---------------------------------------
+% 
+% The Orthographic dictionary published by the Institute for Bulgarian
+% language in 1983 introduced new hyphenation rules.  The complexity of
+% the previous rules was the main reason for the change.  The new rules
+% aimed at two objectives: simplicity and unambiguity.
+% 
+% The new rules are:
+% 
+% 1. A consonant between two vowels links with the second vowel.  For
+%    example ви-со-чи-на /vi-so-chi-na/.
+% 2. In a sequence of two or more consonants between two vowels, at
+%    least one consonant stays with the first vowel and at least one
+%    with the second vowel.  For example сес-тра /ses-tra/ and сест-ра
+%    /sest-ra/.
+% 3. Two equal consonants are separated.  For example плен-ник
+%    /plen-nik/.
+% 4. In a sequence of two or more vowels, the first vowel stays before
+%    the hyphen.  For example пре-одолея /pre-odoleya/ and прео-долея
+%    /preo-doleya/.
+% 5. In a sequence of three or more vowels, the last vowel stays after
+%    the hyphen.  For example мао-изъм /mao-izam/ but not маои-зъм
+%    /maoi-zam/.
+% 6. The letter й /y/ between a vowel and a consonant stays with the
+%    vowel.  For example май-ка /may-ka/.
+% 7. When a sequence of two or more consonants follows й /y/ then at
+%    least one consonant links with й /y/.  For example айс-берг
+%    /ays-berg/ (not ай-сберг /ay-sberg/).
+% 8. The letter й /y/ between two vowels links with the second vowel.
+%    For example ма-йор /ma-yor/.
+% 9. No hyphenation before or after ь.
+% 10. When the letters дж /dzh/ denote a single consonant, then they are
+%     not separated.  For example су-джук /su-dzhuk/ (not суд-жук
+%     /sud-zhuk/) but над-живея /nad-zhiveya/.
+% 11. There must be at least one vowel before and after the hyphen.
+% 12. One letter does not stay alone.
+% 
+% The total disregard of the morphology by these rules leads to some
+% strange results.  For example пре-дизвестие /pre-dizvestie/ is
+% permitted and пред-известие /pred-izvestie/ is forbidden, зад-вижвам
+% /zad-vizhvam/ is permitted and за-движвам /za-dvizhvam/ is forbidden,
+% авток-луб /avtok-lub/ is permitted and авто-клуб /avto-klub/ is
+% forbidden, вакуу-мапарат /vakuu-maparat/ is permitted and
+% вакуум-апарат /vakuum-aparat/ is forbidden.  Because of this, the new
+% rules were not universally accepted.  The old rules are still
+% mentioned in various places on the Internet; they are even included
+% in some grammar books published by the publishing houses of the
+% Ministry of Education and of Sofia University.  Software developers,
+% however, soon fell in love with the new hyphenation rules.
+% 
+% Hyphenation rules after 2012
+% ----------------------------
+% 
+% In 2012 new rules came into force.  There are two differences with
+% respect to the previous rules:
+% 
+% 1. Rule 5 of the previous rules is revoked.  For example маои-зъм
+%    /maoi-zam/ becomes a valid hyphenation.
+% 2. The new rules permit morphologically based hyphenation (however it
+%    is not obligatory).  For example пред-известие /pred-izvestie/,
+%    за-движвам /za-dvizhvam/, авто-клуб /avto-klub/, вакуум-апарат
+%    /vakuum-aparat/ are valid hyphenations.
+% 
+% Good hyphenation is a complex matter and it seems the linguists at the
+% Institute for Bulgarian Language have recognised this.  They no longer
+% attempt to provide universal rules about everything.  Instead, they
+% provide some very permissive rules, while the good application of
+% these rules is left to the discretion and the experience of the
+% printers and the developers of hyphenation software.
+% 
+% It makes sense to use at least two different sets of hyphenation rules
+% for Bulgarian.  In most cases a more restrictive version should be
+% used, one which attempts to eliminate the controversial cases of
+% hyphenation.  When typesetting a Bulgarian text in a narrow newspaper
+% column, however, it will be appropriate to use more liberal
+% hyphenation rules.  It should be noted that one of the reasons for the
+% hyphenation reform in 1983 was the desire to fix the chaotic
+% hyphenation in the Bulgarian newspapers at that time.
+% 
+% Computer implementations
+% ========================
+% 
+% Mathematical analysis of the Bulgarian hyphenation
+% --------------------------------------------------
+% 
+% The earliest mathematical analysis of the Bulgarian hyphenation rules
+% belongs to Veska Noncheva.[^1] In 1988 she proposed a mathematical
+% formalisation of the hyphenation rules in a table with 22 rows.[^2]
+% 
+% [^1]: <http://www.researchgate.net/profile/Veska_Noncheva>
+% 
+% [^2]: Нончева В. Алгоритъм за автоматично пренасяне на думи в
+%     българския език. Математика и математическо
+%     образование. Сб. доклади на 17. ПК на СМБ. С., БАН, 1988, 479-482.
+% 
+% In the same year Eugene Belogay[^3] proposed an alternative
+% formalisation with only 9 rules.[^4] Belogay proved that his rules are
+% consistent and that they form a minimal set.  The rules of Belogay
+% have a negative character: every hyphenation which is not forbidden
+% by a rule is a permitted hyphenation.
+% 
+% [^3]: <http://www.linkedin.com/in/belogay>
+% 
+% [^4]: Белогай Е. Алгоритъм за автоматично пренасяне на думи. Компютър
+%     за вас (1988) 3, 12-14.
+% 
+% The following are the first 7 rules, as formulated by Belogay:
+% 
+% 1. Б-А
+% 2. А-ББ
+% 3. Б-ТТ, ТТ-Б
+% 4. ААА-Б
+% 5. й-ББ
+% 6. Б-ь
+% 7. д-ж
+% 
+% Here А denotes an arbitrary vowel letter, Б denotes an arbitrary
+% consonant letter (including ь and й), ТТ denotes a sequence of two
+% equal consonant letters and the letters й, ь, д and ж denote
+% themselves.  For example the rule "Б-А" says that we are not permitted
+% to separate a consonant letter from an immediately following vowel
+% letter.
+% 
+% The eighth rule of Belogay says that hyphenation is forbidden before
+% the first and after the last vowel letter.  The ninth rule of Belogay
+% says that hyphenation is forbidden immediately after the first or
+% immediately before the last letter of the word.
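+% 
+% The nine negative rules can be checked directly, position by
+% position.  The following Python sketch is an illustration only and is
+% not the Pascal program of Belogay: the names `forbidden` and
+% `hyphenate` are hypothetical, and the split of the alphabet into 8
+% vowel letters and 22 consonant letters (with ь and й counted as Б) is
+% an assumption based on the description above.
+% 
```python
VOWELS = set("аеиоуъюя")
CONSONANTS = set("бвгджзйклмнпрстфхцчшщь")  # assumption: ь and й count as Б


def is_vowel(c):
    return c in VOWELS


def is_cons(c):
    return c in CONSONANTS


def forbidden(word, i):
    """Return True if a break between word[i-1] and word[i] violates a rule."""
    left, right = word[:i], word[i:]
    a, b = left[-1], right[0]
    two_cons_follow = len(right) >= 2 and is_cons(right[0]) and is_cons(right[1])
    return (
        len(left) < 2 or len(right) < 2                 # rule 9
        or not any(map(is_vowel, left))                 # rule 8
        or not any(map(is_vowel, right))                # rule 8
        or (is_cons(a) and is_vowel(b))                 # rule 1: Б-А
        or (is_vowel(a) and two_cons_follow)            # rule 2: А-ББ
        or (is_cons(a) and two_cons_follow
            and right[0] == right[1])                   # rule 3: Б-ТТ
        or (len(left) >= 2 and is_cons(a)
            and left[-2] == a and is_cons(b))           # rule 3: ТТ-Б
        or (len(left) >= 3 and is_cons(b)
            and all(map(is_vowel, left[-3:])))          # rule 4: ААА-Б
        or (a == "й" and two_cons_follow)               # rule 5: й-ББ
        or (is_cons(a) and b == "ь")                    # rule 6: Б-ь
        or (a == "д" and b == "ж")                      # rule 7: д-ж
    )


def hyphenate(word):
    """Mark with '-' every break that no rule forbids."""
    lower = word.lower()
    pieces, start = [], 0
    for i in range(1, len(word)):
        if not forbidden(lower, i):
            pieces.append(word[start:i])
            start = i
    pieces.append(word[start:])
    return "-".join(pieces)
```
+% 
+% Under these assumptions `hyphenate("работа")` gives ра-бо-та and
+% `hyphenate("айсберг")` gives айс-берг, while суджук is left unbroken
+% because rules 2 and 7 together forbid every break around дж.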
+% 
+% Notice that it is very easy to translate the rules of Belogay into
+% the form required by the hyphenation algorithm of Knuth and Liang
+% used in TeX.[^a] Recall that this algorithm matches the word against
+% a set of string patterns in which odd numbers say that hyphenation is
+% permitted at a position and even numbers say that it is forbidden.
+% When two patterns give conflicting numbers for the same position,
+% the greater number wins.
+% 
+% First, since the rules of Belogay are negative (they say where
+% hyphenation is forbidden, not where it is permitted), we have to
+% permit the hyphenation everywhere:
+% 
+% 1. А1
+% 2. Б1
+% 
+% Then, the first seven rules of Belogay obtain the form:
+% 
+% 1. Б2А
+% 2. А2ББ
+% 3. Б2ТТ ТТ2Б
+% 4. ААА2Б
+% 5. й2ББ
+% 6. Б2ь
+% 7. д2ж
+% 
+% Since no Bulgarian word starts with more than four consonants and no
+% Bulgarian word ends with more than three consonants, the eighth rule
+% of Belogay can be translated in the following way:
+% 
+% 1. .Б2
+% 2. .ББ2
+% 3. .БББ2
+% 4. 2Б.
+% 5. 2ББ.
+% 
+% The ninth rule of Belogay means that left and right hyphen mins should
+% be set to 2.
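+% 
+% The translation above can be tried out with a small pattern matcher.
+% The Python sketch below is a simplified reading of the Knuth-Liang
+% procedure, not the code used by TeX: it expands the class templates
+% (А, Б and Т written as upper-case letters) into concrete patterns,
+% keeps the greatest number at every position and breaks where the
+% result is odd.  The function names and the letter classes are
+% assumptions made for this illustration.
+% 
```python
from itertools import product

VOWELS = "аеиоуъюя"
CONSONANTS = "бвгджзйклмнпрстфхцчшщь"  # assumption: ь and й count as Б

TEMPLATES = ["А1", "Б1",                          # permit everywhere
             "Б2А", "А2ББ", "Б2ТТ", "ТТ2Б",       # rules 1-3
             "ААА2Б", "й2ББ", "Б2ь", "д2ж",       # rules 4-7
             ".Б2", ".ББ2", ".БББ2", "2Б.", "2ББ."]  # rule 8


def expand(template):
    """Yield the concrete patterns for a template over classes А, Б, Т."""
    for t in (CONSONANTS if "Т" in template else "Т"):
        concrete = template.replace("Т", t)
        choices = [VOWELS if c == "А" else CONSONANTS if c == "Б" else c
                   for c in concrete]
        for combo in product(*choices):
            yield "".join(combo)


def parse(pattern):
    """Split 'а2бб' into letters 'абб' and position values [0, 2, 0, 0]."""
    letters, values = [], [0]
    for ch in pattern:
        if ch.isdigit():
            values[-1] = int(ch)
        else:
            letters.append(ch)
            values.append(0)
    return "".join(letters), values


def build_patterns():
    """Map pattern letters to values, merging duplicates by maximum."""
    table = {}
    for template in TEMPLATES:
        for pattern in expand(template):
            letters, values = parse(pattern)
            old = table.get(letters)
            table[letters] = values if old is None else [
                max(x, y) for x, y in zip(old, values)]
    return table


def hyphenate(word, patterns, leftmin=2, rightmin=2):
    """Apply all matching patterns; odd maxima mark permitted breaks."""
    w = "." + word.lower() + "."
    longest = max(len(key) for key in patterns)
    scores = [0] * (len(w) + 1)
    for i in range(len(w)):
        for j in range(i + 1, min(len(w), i + longest) + 1):
            for k, v in enumerate(patterns.get(w[i:j], ())):
                scores[i + k] = max(scores[i + k], v)
    pieces, start = [], 0
    # scores[q + 1] belongs to the position just before word[q]
    for q in range(leftmin, len(word) - rightmin + 1):
        if scores[q + 1] % 2 == 1:
            pieces.append(word[start:q])
            start = q
    pieces.append(word[start:])
    return "-".join(pieces)
```
+% 
+% With `P = build_patterns()`, `hyphenate("работа", P)` gives ра-бо-та
+% and `hyphenate("сестра", P)` gives сес-т-ра, that is, both сес-тра
+% and сест-ра are permitted.  The ninth rule is represented by the
+% `leftmin` and `rightmin` arguments.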
+% 
+% The work of Eugene Belogay was not limited to merely a mathematical
+% analysis of the Bulgarian hyphenation rules.  In his paper he
+% published a short algorithm in Pascal which implements these rules.
+% It didn't take long for this algorithm to be used in various text
+% processing software.  The algorithm of Belogay was famous for many
+% years.  Even as late as 1997 in one book about TeX, the author didn't
+% care to give any explanations but simply wrote about "the algorithm of
+% Belogay" as something well known to the reader.[^5]
+% 
+% [^a]: Liang, Franklin Mark. Word Hy-phen-a-tion by
+%     Com-put-er (Doctoral Dissertation). Stanford University, 1983
+% 
+% [^5]: Василев В. Ултимативният ТеХ.  Удоволствието да правим
+%     предпечатна подготовка сами. София, Интела, 1997, 36
+% 
+% Bulgarian hyphenation in TeX
+% ----------------------------
+% 
+% One unfortunate design decision of Knuth was that the hyphenation
+% algorithm of TeX applied the hyphenation patterns not to the input
+% character codes but to the internal codes of the glyphs in the font.
+% This created a problem for the Cyrillic languages because in TeX the
+% Cyrillic fonts did not have standardised encoding.  Perhaps this is
+% one of the reasons why the earliest implementations of the Bulgarian
+% hyphenation in TeX did not rely on the internal hyphenation algorithm
+% of TeX.  Instead, external tools were used to insert soft hyphens in
+% all Bulgarian words.  For example such a tool would replace the word
+% сричкопренасяне /srichkoprenasyane/ with
+% срич\\-коп\\-ре\\-на\\-ся\\-не /srich\\-kop\\-re\\-na\\-sya\\-ne/.
+% The saying "To every disadvantage there is a corresponding advantage"
+% is true – since Cyrillic and Latin letters use different character
+% codes, an external tool could easily insert soft hyphens in all
+% Bulgarian words while leaving the TeX commands intact.
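+% 
+% A preprocessing tool of this kind can be sketched in a few lines of
+% Python.  This is not a reconstruction of any historical tool: the
+% function `insert_soft_hyphens` and the toy dictionary hyphenator are
+% hypothetical, and any function that returns a word with `-` at the
+% break points could be passed instead.
+% 
```python
import re

# Runs of Cyrillic letters; TeX commands use Latin letters and are skipped.
CYRILLIC_WORD = re.compile(r"[А-Яа-я]+")


def insert_soft_hyphens(source, hyphenate):
    r"""Rewrite each Cyrillic word with \- at the breaks chosen by `hyphenate`."""
    def repl(match):
        return hyphenate(match.group(0)).replace("-", "\\-")
    return CYRILLIC_WORD.sub(repl, source)


# Hypothetical stand-in for a real hyphenator: a one-word dictionary.
TOY_BREAKS = {"сричкопренасяне": "срич-коп-ре-на-ся-не"}


def toy_hyphenate(word):
    return TOY_BREAKS.get(word.lower(), word)
```
+% 
+% Applied to the source `\textbf{сричкопренасяне}`, the sketch leaves
+% the command name untouched and rewrites only the Cyrillic word.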
+% 
+% The earliest known attempt to use the hyphenation algorithm of TeX for
+% Bulgarian was made by Ognyan Tonev in 1990.[^6] He described his work
+% as "a not very good translation of the rules.  I work in this
+% direction.  But I don't have a 100% working complect of patterns.  So,
+% the copy I send to you[^7] is only a beta-version."  The hyphenation
+% patterns of Tonev don't work correctly and it seems he never completed
+% his work.
+% 
+% [^6]: The author of this text was unable to find current information
+%     about Ognyan Tonev on the Internet.  Apparently in 1990 he worked
+%     at the Center of Informatics and Computer Technology of the
+%     Bulgarian Academy of Sciences.
+% 
+% [^7]: To Yannis Haralambous,
+%     <http://perso.telecom-bretagne.eu/yannisharalambous>
+% 
+% The first usable Bulgarian hyphenation patterns for TeX were developed
+% by Georgi Boshnakov[^8] in 1994.  In order to solve the encoding
+% problem, Boshnakov had developed TeX fonts supporting the MIK encoding
+% (the prevalent encoding at that time in Bulgaria).  This allowed him
+% to introduce a fully working implementation only a few months after
+% LaTeX2e became the official LaTeX version.  Later Boshnakov adapted
+% his work to the Babel system.  The hyphenation patterns of Boshnakov
+% did their job well enough that, for almost a quarter of a century
+% after their initial creation, they remained the only Bulgarian
+% hyphenation patterns in the standard distributions of TeX and CTAN.
+% 
+% [^8]: <http://www.maths.manchester.ac.uk/~gb/>
+% 
+% There are some similarities between the patterns of Boshnakov and the
+% patterns of Belogay.  The following are the main differences.
+% 
+% First, Boshnakov used an ingenious and more compact implementation of
+% the second and the third rule.  Instead of {А2ББ, Б2ТТ, ТТ2Б}, or
+% 8×22×22+22×22+22×22=4840 patterns in total, Boshnakov has patterns of
+% the form 2Б3Б2 and 4Т3Т4, or only 22×22=484 in total, with the same
+% effect.
+% 
+% The second main difference between the patterns of Boshnakov and the
+% patterns of Belogay concerns the letter combination дж /dzh/.  In
+% Bulgarian this letter combination can denote either a single
+% consonant or a sequence of two consonants, and the hyphenation rules
+% change accordingly.  Unfortunately, it is impossible to know the
+% meaning of дж /dzh/ without a vocabulary.  The solution of Belogay was
+% a cautious one – his rules do the hyphenation in a way which will be
+% correct regardless of whether дж /dzh/ is a single consonant or a
+% sequence of two consonants.  On the other hand, the approach of
+% Boshnakov is a bold one – since дж /dzh/ is more often a single
+% consonant, his rules assume that it is always a single consonant.  The
+% number of the cases when this decision leads to bad hyphenations is
+% insignificant in comparison with the cases in which we obtain improved
+% hyphenation.
+% 
+% The third main difference between the patterns of Boshnakov and the
+% patterns of Belogay concerns the eighth rule – its implementation in
+% the rules of Boshnakov is rather limited, which leads to wrong
+% hyphenations like бри-дж /bri-dzh/.  A full implementation of this
+% rule would require 11660 patterns in total and this would have been
+% too much for the computers of 1994.
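+% 
+% The pattern counts quoted in this section can be reproduced by
+% enumerating the patterns.  The Python sketch below assumes 8 vowel
+% letters and 22 consonant letters (including ь and й), and it assumes
+% that a full implementation of the eighth rule consists of the five
+% families .Б2, .ББ2, .БББ2, 2Б. and 2ББ. given earlier; the function
+% names are hypothetical.
+% 
```python
from itertools import product

VOWELS = "аеиоуъюя"                    # 8 vowel letters
CONSONANTS = "бвгджзйклмнпрстфхцчшщь"  # 22 letters in class Б


def belogay_rules_2_and_3():
    """Concrete patterns for А2ББ, Б2ТТ and ТТ2Б."""
    pats = {a + "2" + b + c
            for a, b, c in product(VOWELS, CONSONANTS, CONSONANTS)}
    pats |= {b + "2" + t + t for b, t in product(CONSONANTS, repeat=2)}
    pats |= {t + t + "2" + b for b, t in product(CONSONANTS, repeat=2)}
    return pats


def boshnakov_compact():
    """One pattern per ordered consonant pair: 4Т3Т4 if equal, else 2Б3Б2."""
    return {("4%s3%s4" if b == c else "2%s3%s2") % (b, c)
            for b, c in product(CONSONANTS, repeat=2)}


def rule_8_full():
    """Concrete patterns for .Б2, .ББ2, .БББ2, 2Б. and 2ББ."""
    pats = set()
    for n in (1, 2, 3):
        pats |= {"." + "".join(p) + "2" for p in product(CONSONANTS, repeat=n)}
    for n in (1, 2):
        pats |= {"2" + "".join(p) + "." for p in product(CONSONANTS, repeat=n)}
    return pats


print(len(belogay_rules_2_and_3()), len(boshnakov_compact()), len(rule_8_full()))
```
+% 
+% The three set sizes are 8×22×22+22×22+22×22 = 4840, 22×22 = 484 and
+% 22+22×22+22×22×22+22+22×22 = 11660.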
+% 
+% Later developments
+% ------------------
+% 
+% In 1995 Atanas Topalov defended a Master's thesis in the Faculty of
+% Mathematics and Informatics at Sofia University titled "Algorithms and
+% software about text processing".[^9] One of the main topics in his
+% thesis was the Bulgarian hyphenation.  Topalov criticised vehemently
+% the official hyphenation rules and their total disregard of the
+% morphology.  He wrote:
+% 
+% > If we look at the history of the problems of the hyphenation, we
+% > will discover something very strange.  Instead of the expected
+% > involvement with the depths and aspiration for more admissible and
+% > satisfactory style, we can find a growing tendency for
+% > simplification.  One unpleasant discovery is that the development of
+% > the hyphenation software stays firmly on the principle "let us do
+% > the easiest thing".  The earliest works which have been studied are
+% > from 1978.  It turned out that they present the best approach
+% > concerning the automated hyphenation.  The authors have chosen the
+% > most difficult but the most correct (from literary point of view)
+% > method for hyphenation, namely the morphological approach.
+% 
+% Topalov proposed his own hyphenation algorithm.  The hyphenation it
+% generated was smooth and easy to read.  One obvious defect of the
+% algorithm of Topalov was that it contradicted the official hyphenation
+% rules at that time.  One can argue, however, that his algorithm is
+% compatible with the current hyphenation rules.
+% 
+% [^9]: The thesis of Atanas Topalov can be accessed at the author's
+%     website <http://www.mind-print.com>
+% 
+% In 1999 Svetla Koeva[^10] wrote a paper about the automated Bulgarian
+% hyphenation.[^11] At that time she was a junior member of the
+% Department of Computational Linguistics at the Institute for Bulgarian
+% Language, but she is now the director of the whole institute.  The
+% paper of Koeva contains a list of hyphenation patterns which can be
+% used as a basis for automated hyphenation.  In 2004, with the help of
+% Stoyan Mihov,[^12] the rules of Koeva were formalised with regular
+% relations and rewriting rules.  They were implemented in a software
+% product named ItaEst, which provided Bulgarian hyphenation and grammar
+% checking for various software products of Microsoft and Apple.
+% 
+% [^10]: <http://dcl.bas.bg/svetla_koeva/>
+% 
+% [^11]: Коева, Светла. Правила за пренасяне на части от думите на нов
+%     ред. Български език. 1999/2000, 1, 84-86
+% 
+% [^12]: <http://lml.bas.bg/~stoyan/>
+% 
+% The main difference between the hyphenation of Koeva and the official
+% hyphenation rules effective after 2012 is that the separation of a
+% long sequence of consonants between two vowels is done according to
+% the rules valid before 1983.  For example се-стра /se-stra/ and
+% ай-сберг /ay-sberg/ are permitted.  The main difference between the
+% hyphenation of Koeva and the official hyphenation rules effective
+% before 1983 is that the rules of Koeva disregard the morphology of the
+% words.  The following rule of Koeva is specific: in a sequence of two
+% sonorant consonants between two vowels, we are permitted to separate
+% the first vowel from the first consonant, for example материа-лна
+% /materia-lna/.
+% 
+% In 2000 Anton Zinoviev[^13] created new hyphenation patterns for TeX.
+% He didn't know about the previous work of Boshnakov and he didn't
+% bother to make his work available in the various TeX distributions and
+% CTAN.  His work was used mostly by the local Linux enthusiasts and the
+% colleagues of Zinoviev.  In 2001 Radostin Radnev[^14] created a free
+% grammar dictionary of Bulgarian[^15] where he used the hyphenation
+% patterns of Zinoviev.  From there the work of Zinoviev propagated to
+% OpenOffice, LibreOffice and various online dictionaries, including
+% <http://bg.wiktionary.org> and <http://rechnik.chitanka.info>.
+% 
+% [^13]: The author of this text.
+% 
+% [^14]: <http://bg.linkedin.com/in/radostinradnev>
+% 
+% [^15]: <http://bgoffice.sourceforge.net/>
+% 
+% The following are the main differences between the hyphenation of
+% Zinoviev and the hyphenation of Boshnakov.
+% 
+% First, the eighth rule of Belogay is fully implemented.
+% 
+% Second, the rules of Zinoviev try to detect when the letters дж /dzh/
+% (and дз /dz/) denote a single consonant and when they denote a
+% sequence of two consonants.  By default, however, Zinoviev (like
+% Boshnakov) assumes that дж /dzh/ is a single consonant and hyphenates
+% accordingly.
+% 
+% Third, the rules of Zinoviev disable some cases of unpleasant
+% hyphenations:
+% 
+% 1. In a consonant sequence like тст /tst/, the two equal consonants т
+%    /t/ are separated.  For example братст-во /bratst-vo/ is forbidden
+%    while братс-тво /brats-tvo/ and брат-ство /brat-stvo/ are
+%    permitted.
+% 2. The hyphenation is forbidden after a sonorant consonant following
+%    an obstruent consonant.  For example отм-ра /otm-ra/ is forbidden
+%    and от-мра /ot-mra/ is permitted.
+% 3. The hyphenation separates two consecutive kindred voiced/voiceless
+%    consonants.  For example субп-родукт /subp-rodukt/ is forbidden and
+%    суб-продукт /sub-produkt/ is permitted.
+% 
+% At the start of his work on the Bulgarian hyphenation, Zinoviev had
+% the opportunity to discuss the hyphenation with Svetla Koeva.  He
+% remembers that some cases of unpleasant hyphenation were suggested to
+% him by Koeva.  Unfortunately, he did not take notes, so he no longer
+% knows which cases of unpleasant hyphenation were suggested to him by
+% Koeva and which are his own findings.
+% 
+% The present work
+% ================
+% 
+% Motivation
+% ----------
+% 
+% The present work was carried out on the initiative of the leader of
+% the Bulgarian localisation team of Mozilla,[^16] who contacted
+% Zinoviev, Boshnakov and the maintainers of the TeX hyphenation
+% patterns.[^17]  This work pursues the following main objectives:
+% 
+% 1. to update the hyphenation patterns in accordance with the current
+%    hyphenation rules;
+% 2. to generate the hyphenation patterns by a publicly available
+%    script;
+% 3. to make the hyphenation patterns customisable;
+% 4. to provide documentation for the future developers.
+% 
+% [^16]: <http://mozillians.org/en-US/u/stoyan/>
+% 
+% [^17]: <http://hyphenation.org>
+% 
+% The current official hyphenation rules for Bulgarian are rather
+% liberal.  Very often, in a long sequence of consonants we are
+% permitted to split the word at any position, for example аген-т-с-т-во
+% /agen-t-s-t-vo/.  This leads to many unusual and unexpected results
+% that interrupt the attention of the reader or deceive his expectations
+% during the movement of his eyes to the next line.  On the other hand,
+% in order to produce nice justified paragraphs there is no need for so
+% many hyphenation possibilities.  It would be sufficient even if only
+% one possible separation between any two syllables was permitted.
+% 
+% Therefore, it makes sense to use a more restrictive version of the
+% Bulgarian hyphenation, one which eliminates the controversial cases of
+% hyphenation.  Only when typesetting a Bulgarian text in a very narrow
+% newspaper column will it be appropriate to use a more liberal version.
+% It should be noted that some specialised English dictionaries also
+% separate the word-division positions into two categories – preferred
+% positions and less recommended positions.
+% 
+% There are two methods to determine the optimal division within a
+% sequence of consonants between two vowels:
+% 
+% * we can hyphenate according to the syllables in the word or
+% * we can hyphenate morphologically.
+% 
+% Hyphenation according to the syllables in the word
+% --------------------------------------------------
+% 
+% Let us look at the properties of the Bulgarian syllables.  All
+% syllables have the following structure:
+% 
+% > onset - nucleus - coda
+% 
+% The nucleus in Bulgarian is always a vowel.  Both the onset and the
+% coda are (possibly empty) sequences of consonants.
+% 
+% The Bulgarian syllables adhere to the Sonority Sequencing Principle.
+% According to this principle, the consonants within the onset have
+% rising sonority and the consonants within the coda have decreasing
+% sonority.
+% 
+% Several grammar books agree that the following sonority scale is valid
+% for Bulgarian:
+% 
+% > voiceless obstruent < voiced obstruent < sonorant consonant < vowel
+% 
+% According to the investigations of the author, the only exception to
+% this law is due to the letter в /v/, which is a voiced obstruent but
+% can also be used as a voiceless obstruent.  This exception is due to
+% a spelling particularity of the Bulgarian language.  Whenever the
+% letter в /v/ seemingly violates the Sonority Sequencing Principle, in
+% the spoken language this letter is read as ф /f/, that is as a
+% voiceless obstruent (for example the word отвсякъде /otvsyakade/ is
+% read as отфсякъде /otfsyakade/).[^18]
+% 
+% [^18]: No Primitive Slavonic word contains the phoneme ф /f/.
+%     Therefore, we can safely assume that in the Primitive Slavonic
+%     language the consonant ф /f/ was a positional variant of the
+%     consonant в /v/.
+% 
+% The author has found that the sonorant consonants in Bulgarian have
+% their own sonority scale:
+% 
+% > м /m/ < н /n/ < л /l/ < р /r/ < й /y/
+% 
+% Only a few words such as жанр /zhanr/ and химн /himn/ violate this
+% scale.  Such words are always loan-words and their pronunciation is
+% somewhat problematic for the native Bulgarian speakers.
+% 
+% In addition to the Sonority Sequencing Principle, the consonant
+% clusters within the Bulgarian syllable adhere to the following
+% additional principles:
+% 
+% 1. Both in the onset and in the coda, the labial and dorsal plosives
+%    precede the coronal plosives and affricates.
+% 2. If the onset or the coda contains two plosives or affricates, then
+%    there are no fricatives between them.  A few words with the Latin
+%    root 'text' are exceptions: контекст /kontekst/.
+% 3. If the onset or the coda contains two fricatives other than в /v/,
+%    then there are no plosives or affricates between them.
+% 4. If the onset or the coda contains two plosives or affricates, then
+%    they both have equal sonority (both are voiced, or both are
+%    voiceless).
+% 5. If the onset or the coda contains two fricatives other than в /v/,
+%    then they both have equal sonority (both are voiced, or both are
+%    voiceless).
+% 6. Neither the onset nor the coda may contain two labial plosives, or
+%    two coronal plosives or affricates, or two dorsal plosives.
+% 7. Neither the onset nor the coda may contain two equal consonants
+%    with the exception of в /v/ (for example втвърди /vtvardi/).[^19]
+% 
+% [^19]: Actually, the letter в /v/ is not a real exception because in
+%     all such cases this letter denotes two different consonants – в
+%     /v/ and ф /f/.  Only in the Russian loan-word взвод /vzvod/ do the
+%     two letters в /v/ denote a repeating consonant в /v/.
+% 
+% From all these properties of the Bulgarian syllable we can deduce the
+% following hyphenation rules:
+% 
+% 1. In a sequence МК where М is a consonant with higher sonority than
+%    K, we are not permitted to hyphenate before М.  Exception: when М
+%    is в /v/ and К is a voiceless consonant.
+% 2. In a sequence КМ where М is a consonant with higher sonority than
+%    K, we are not permitted to hyphenate after М.
+% 3. In a sequence KBT where K and T are plosives or affricates and B is
+%    fricative, we separate K from T.
+% 4. In a sequence CKB where K is a plosive or affricate and C and B are
+%    fricatives other than в /v/, we separate C from B.
+% 5. If in a consonant sequence a coronal plosive or affricate Т is
+%    followed by a labial or dorsal plosive К, then we separate Т from К.
+% 6. If a consonant sequence contains two plosives or affricates, one
+%    voiced and one voiceless, then we separate them.
+% 7. If a consonant sequence contains two fricatives other than в /v/,
+%    one voiced and one voiceless, then we separate them.
+% 8. If a consonant sequence contains two labial plosives or two coronal
+%    plosives or affricates or two dorsal plosives then they are
+%    separated.
+% 9. If a consonant sequence contains two equal consonants (not
+%    necessarily consecutive), then they are separated.
+% 
+% With so many prohibitive rules, a question arises: if we apply all
+% these rules, aren't we going to eliminate too many hyphenation
+% possibilities?  The answer is no.  It can be demonstrated that between
+% any two consecutive syllables at least one separation point will be
+% permitted.
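+% 
+% The first two rules depend only on the sonority scale and can be
+% sketched as follows.  This Python fragment is an illustration and not
+% the generator script of the patterns.  It assumes that "sequence"
+% means two adjacent consonants, that the sonorant scale м < н < л < р
+% < й refines the main scale, and that the last position (before the
+% vowel) is never a candidate; the name `allowed_splits` and the letter
+% classes are hypothetical, and rules 3 to 9 are not covered.
+% 
```python
VOICELESS = "пткфсшхцчщ"  # voiceless obstruents
VOICED = "бвгджз"         # voiced obstruents
SONORANTS = "мнлрй"       # listed by rising sonority

RANK = {c: 0 for c in VOICELESS}
RANK.update({c: 1 for c in VOICED})
RANK.update({c: 2 + i for i, c in enumerate(SONORANTS)})


def allowed_splits(cluster):
    """Return the positions i, 0 <= i < len(cluster), at which the cluster
    may be divided into cluster[:i] (coda) and cluster[i:] (onset)
    according to rules 1 and 2 only."""
    forbidden = set()
    for i in range(len(cluster) - 1):
        first, second = cluster[i], cluster[i + 1]
        if RANK[first] > RANK[second]:
            # Rule 1: falling pair МК, no break before М, unless М is в
            # and К is voiceless (в is then read as ф).
            if not (first == "в" and RANK[second] == 0):
                forbidden.add(i)
        elif RANK[first] < RANK[second]:
            # Rule 2: rising pair КМ, no break after М.
            forbidden.add(i + 2)
    return [i for i in range(len(cluster)) if i not in forbidden]
```
+% 
+% For the cluster тмр of отмра the sketch returns the positions 0 and
+% 1, so отм-ра is excluded while от-мра remains.  For the cluster рд of
+% гердан it returns only position 1, that is гер-дан.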
+% 
+% 
+% Hyphenation according to the morphology
+% ---------------------------------------
+% 
+% Between 1983 and 2012 the official orthographic rules of the
+% Bulgarian language forbade morphologically based hyphenation.  After
+% 2012 such hyphenation is permitted (but not obligatory).
+% 
+% Morphologically based hyphenation is most desirable in compound
+% words.
+% Divisions such as авток-луб /avtok-lub/ and вакуу-мапарат
+% /vakuu-maparat/ are extremely irritating even if they are formally
+% correct.  Unfortunately, we do not have a vocabulary of the compound
+% Bulgarian words that would permit us to produce rules for automated
+% hyphenation.  Therefore, the current Bulgarian hyphenation patterns do
+% not attempt to apply morphological hyphenation to such words.
+% 
+% Second in importance (but far more significant in terms of numbers) is
+% the case with the word prefixes.  While the eyes of the reader still
+% look at the start of the word, the word is still unknown to him.  At
+% this point, it is very important not to deceive his expectations.  For
+% example, when the reader sees над- /nad-/ at the end of the line, he
+% will expect that this is the prefix над- /nad-/ with semantics 'attain
+% more than'.  This expectation is frustrated if this is not really a
+% prefix but a misleading (though formally correct) hyphenation of the
+% word надремя /nadremya/ 'have dozed enough' where the real prefix is
+% not над- /nad-/ but на- /na-/ with semantics 'achieve a state after
+% accumulation'.  Such hyphenation distracts the reader and makes the
+% reading more difficult.
+% 
+% Third in importance is the case with the word suffixes.  With respect
+% to the hyphenation rules we can divide the suffixes into three
+% categories:
+% 
+% 1. Suffixes starting with a vowel, for example -ар /-ar/.  It is not
+%    appropriate to follow the morphology with such suffixes because
+%    this will contradict the whole hyphenation tradition of the
+%    Bulgarian language.  For example крав-ар /krav-ar/ is unwarranted.
+% 2. Suffixes starting with one consonant, for example -ка /-ka/.
+%    With such suffixes the syllable boundary in the word usually
+%    coincides with the morpheme boundary, so no special care is
+%    needed, for example кравар-ка /kravar-ka/.  The exceptions are
+%    rare, for example: обек-тната /obek-tnata/ instead of обект-ната
+%    /obekt-nata/.
+% 3. Suffixes starting with more than one consonant (-ски /-ski/, -ство
+%    /-stvo/).  It is possible to use morphological hyphenation rules
+%    with such suffixes.
+% 
+% Although morphological hyphenation is possible with suffixes of the
+% third category, it turns out to be less useful there than it is
+% for prefixes.  When the eyes of the reader have
+% reached this part of the word, the word is already more or less known
+% to the reader.  Therefore, at this point the morphological hyphenation
+% does not provide any significant advantages in comparison to the
+% simpler hyphenation based only on the syllables in the word.  Consider
+% for example the word геройс-тво /geroys-tvo/ with suffix -ство
+% /-stvo/.  When the reader sees геройс- /geroys-/ at the end of the
+% line this will give him an early clue that the suffix of the word is
+% -ство /-stvo/.  Such non-morphological hyphenation does not deceive
+% the expectations of the reader.  On the contrary, it makes the reading
+% easier because it gives clues to the reader about what follows on the
+% next line.
+% 
+% Because of these considerations, the current Bulgarian hyphenation
+% patterns do not attempt to use morphological hyphenation with respect
+% to the suffixes of the words.  It would nevertheless be useful to
+% implement rules for the suffixes of the second category.  Hopefully,
+% some future version will have such rules.
+% 
+% Occasionally,[^20] a fourth morphological requirement is stated: that
+% hyphenation should conform with the boundary between the word and the
+% definite articles -та /-ta/ and -те /-te/ (postfixed in Bulgarian).
+% There is no need to pay special attention to this rule because it
+% seems to be satisfied automatically.  The author searched a dictionary
+% of over 860000 Bulgarian words for cases where the hyphenation rules
+% would hyphenate badly with respect to the definite article.  He was
+% unable to find even one such case with the hyphenation rules valid
+% after 1983 and only about 10 cases with the rules valid before 1983
+% (one of them is живопи-ста /zhivopi-sta/ instead of живопис-та
+% /zhivopis-ta/).
+% 
+% One unavoidable characteristic of any morphologically based automated
+% hyphenation is that it can create wrong hyphenations.  Because of
+% this, one useful option is to use the morphology in a safe way – to
+% use it in order to forbid bad hyphenations but to create no new
+% hyphenation possibilities solely on the basis of the morphology.
+% 
+% Take for example the word дозрея /dozreya/ 'ripen fully'.  According
+% to the phonological rules, we should hyphenate it as доз-рея
+% /doz-reya/.  According to the morphology, however, we should hyphenate
+% as до-зрея /do-zreya/ because this word is formed with the prefix до-
+% /do-/ with semantics 'complete or supplement' and this semantics would
+% be lost if the reader sees доз- /doz-/ at the end of the line.
+% Therefore, there are three methods to hyphenate this word:
+% 
+% 1. доз-рея /doz-reya/ when morphology is not used;
+% 2. до-зрея /do-zreya/ when morphology is fully used;
+% 3. дозрея /dozreya/ (no hyphenation) when morphology is used in a safe
+%    way.
+% 
+% The option to use the morphology in a safe way is very attractive when
+% the software uses a smart line-breaking algorithm which can produce
+% good results even with fewer hyphenation possibilities.  TeX is one
+% such program.  It should be noted that this option does not eliminate
+% too many hyphenation possibilities because the morpheme boundaries
+% most of the time are also syllable boundaries.
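+% 
+% The three options for дозрея /dozreya/ can be sketched in Python by
+% treating each hyphenation as a set of break positions.  Modelling the
+% safe option as the intersection of the phonological and the
+% morphological positions is an assumption made for this example, and
+% the function names are hypothetical.
+% 
+% ```python
+% def hyphenate(word, breaks):
+%     """Insert '-' before each character offset listed in `breaks`."""
+%     parts, prev = [], 0
+%     for pos in sorted(breaks):
+%         parts.append(word[prev:pos])
+%         prev = pos
+%     parts.append(word[prev:])
+%     return "-".join(parts)
+% 
+% 
+% def safe_breaks(phonological, morphological):
+%     """Keep only breaks both analyses permit; never add new ones."""
+%     return set(phonological) & set(morphological)
+% 
+% 
+% word = "дозрея"
+% phonological = {3}    # доз-рея, morphology not used
+% morphological = {2}   # до-зрея, morphology fully used
+% 
+% print(hyphenate(word, phonological))
+% print(hyphenate(word, morphological))
+% # Safe use of morphology: no common position, so no hyphenation.
+% print(hyphenate(word, safe_breaks(phonological, morphological)))
+% ```
+% 
+% For words in which the two analyses agree, the intersection leaves
+% the break positions unchanged, which matches the observation that
+% few hyphenation possibilities are lost.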
+% 
+% [^20]: Правописен и правоговорен наръчник. Състав. Иван Хаджов,
+%     Цв. Минков; Ред. Ив. Хаджов и др. София, Бълг. кн., 1945
+% 
+% The following statistics describe the quality of the morphological
+% rules (the number after the ± sign is the estimated standard
+% deviation of each figure):
+% 
+% With the option `--morphology`:
+% 
+% * in 0.1% ±0.3% of the dictionary words the morphological patterns
+%   create very wrong hyphenation;
+% * in 89.8% ±0.1% of the dictionary words the morphological patterns
+%   hyphenate identically with the case when no morphology patterns are
+%   used;
+% * in 0.3% ±0.2% of the dictionary words the morphological patterns
+%   hyphenate differently in comparison to the case when no morphology
+%   patterns are used and the word is hyphenated in a way which
+%   contradicts the morphology;
+% * in 0.6% ±0.1% of the dictionary words the morphological patterns
+%   hyphenate differently in comparison to the case when no morphology
+%   patterns are used and there is a possible hyphenation which is
+%   compatible with the word morphology but which is nevertheless
+%   forbidden by the morphology patterns.
+%   
+% With the option `--safe-morphology`:
+% 
+% * in 0% of the dictionary words the morphological patterns create very
+%   wrong hyphenation;
+% * in 90.0% ±0.1% of the dictionary words the morphological patterns
+%   hyphenate identically with the case when no morphology patterns are
+%   used;
+% * in 0.3% ±0.2% of the dictionary words the morphological patterns
+%   hyphenate differently in comparison to the case when no morphology
+%   patterns are used and the word is hyphenated in a way which
+%   contradicts the morphology;
+% * in 0.6% ±0.1% of the dictionary words the morphological patterns
+%   hyphenate differently in comparison to the case when no morphology
+%   patterns are used and there is a possible hyphenation which is
+%   compatible both with the word morphology and with the syllable
+%   boundaries but which is nevertheless forbidden by the morphology
+%   patterns.
+%   
+% Notice that the morphological patterns create a different hyphenation
+% only in about 10% of the words.  The following explanation can be
+% given for this surprising fact.  First, the natural evolution of the
+% human languages tends to simplify the complex sequences of consonants.
+% Therefore, no morpheme contains a complex sequence of consonants.  And
+% second, Bulgarian orthography is morphological.  This means that
+% morphemes are written according to their actual pronunciation, but the
+% simplifications that take place in the spoken language at morpheme
+% boundaries are not reflected in the orthography.  Together, these two
+% factors mean that most of the time the morpheme boundaries coincide
+% with the conventional syllable boundaries.  The main exception is a
+% morpheme that starts with a vowel: its syllable will include one or
+% more consonants of the preceding morpheme.  The second exception is
+% when a morpheme ends with a vowel and the next morpheme starts with a
+% sequence of two or more consonants.
+% 
+% Usage of the script `hyph-bg.sh`
+% --------------------------------
+% 
+% `hyph-bg.sh` is an all-in-one script that can generate both the
+% documentation (this text) and the Bulgarian hyphenation patterns.
+% When given the option `--help`, the script prints short usage
+% instructions:
+% 
+% ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+% hyph-bg.sh --help
+%           Show this info
+% hyph-bg.sh [--doc-html | --doc-latex | --doc-txt]
+%           Print documentation in various formats
+% hyph-bg.sh [other options]
+%           Generate Bulgarian hyphenation patterns
+% 
+% Options when generating hyphenation patterns:
+% 
+%  --standalone-tex
+%           Produce hyphenation patterns for TeX with \patterns{ ... }.
+% 
+%  --no-hyphen-mins
+%           Hyphenation patterns which do not require hyphen mins.
+%           Otherwise: both left and right hyphen mins should be set to 2.
+% 
+%  --safe-dz
+%           Do not try to guess whether DZ is a single consonant or not.
+%           Only use hyphenation which will be correct in both cases.
+% 
+%  --permissible
+%           Permit any formally correct hyphenation, including unnatural
+%           divisions, such as studen-tstvo.  Useful for educational tools
+%           or when typesetting Bulgarian text in a very short column.
+% 
+%  --morphology
+%           Apply morphology when hyphenating, for example: za-dvizhvam.
+%           May hyphenate incorrectly in some cases.
+% 
+%  --safe-morphology
+%           Apply morphology when hyphenating.  Never hyphenates incorrectly
+%           but may prohibit some correct hyphenations.
+% 
+%  --no-morphology
+%           Disregard the morphology.  Default.
+% 
+%  --1945
+%           Hyphenate according to the rules effective between 1945 and 1982
+% 
+%  --1983
+%           Hyphenate according to the rules effective between 1983 and 2011
+% 
+%  --2012
+%           Hyphenate according to the rules effective after 2012.  Default.
+% ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+% 
+% The following are the recommended ways to generate hyphenation
+% patterns with this script:
+% 
+% `hyph-bg.sh --standalone-tex --safe-morphology`
+% :   For TeX.  Apply the morphology in a safe way when the software
+%     uses a smart line-breaking algorithm.
+% 
+% `hyph-bg.sh`
+% :   For most other software.
+% 
+% `hyph-bg.sh --no-hyphen-mins`
+% :   The current versions of Mozilla (as of 2017) seem to ignore the
+%     hyphen mins in words that contain a dash.
+% 
+% `hyph-bg.sh --morphology`
+% :   For professional typography with a human proof-reader.
+% 
+% `hyph-bg.sh --permissible`
+% :   For educational tools and online dictionaries which can show only one
+%     kind of hyphenation.
+% 
+% Notice that some specialised English dictionaries separate the
+% word-division positions into two categories – preferred positions and
+% less recommended positions.  It would be best if the Bulgarian online
+% dictionaries could do the same.  For example, a hyphen "-" can mark
+% the preferred positions and a dot "." the less recommended
+% positions.  If a word-division position is permitted only by the
+% patterns of `hyph-bg.sh --permissible`, then this position is less
+% recommended.
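+% 
+% A dictionary could produce such a display roughly as in the following
+% Python sketch.  The function name is hypothetical, and the break
+% offsets are illustrative input rather than output of the real
+% patterns: only the division studen-tstvo is taken from the help text
+% of `--permissible`, the other offsets are assumed.
+% 
+% ```python
+% def mark_divisions(word, default, permissible):
+%     """Mark preferred breaks with '-' and less recommended ones with '.'.
+% 
+%     `default` and `permissible` are sets of character offsets into
+%     `word`; a position found only in `permissible` is less recommended.
+%     """
+%     out = []
+%     for i, ch in enumerate(word):
+%         if i in default:
+%             out.append("-")
+%         elif i in permissible:
+%             out.append(".")
+%         out.append(ch)
+%     return "".join(out)
+% 
+% 
+% default = {3, 7}          # assumed positions from the default patterns
+% permissible = {3, 6, 7}   # offset 6 is studen-tstvo
+% print(mark_divisions("studentstvo", default, permissible))
+% ```
+% 
+% With this input the sketch prints `stu-den.t-stvo`.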
+% 
+
+\message{Bulgarian hyphenation patterns (options: --safe-morphology --standalone-tex, version 21 October 2017)}