path: root/tex/context/patterns/common/lang-bg.rme
author: Hans Hagen <pragma@wxs.nl>, 2018-09-13 18:21:39 +0200
committer: Context Git Mirror Bot <phg@phi-gamma.net>, 2018-09-13 18:21:39 +0200
commit: 56ca0139232f16679918613ef45a5dd643f0f9b3
tree: f5afef4d57e2cdbf1a6cb777635ec871be34837c /tex/context/patterns/common/lang-bg.rme
parent: 5c433e6e8accaa4bc9ebe0a094b925fe11a8edf5
2018-09-13 17:49:00
Diffstat (limited to 'tex/context/patterns/common/lang-bg.rme')
-rw-r--r-- tex/context/patterns/common/lang-bg.rme | 890
1 files changed, 890 insertions, 0 deletions
diff --git a/tex/context/patterns/common/lang-bg.rme b/tex/context/patterns/common/lang-bg.rme
new file mode 100644
index 000000000..25a3e2ca5
--- /dev/null
+++ b/tex/context/patterns/common/lang-bg.rme
@@ -0,0 +1,890 @@
+% generated by mtxrun --script pattern --convert
+
+% copyright: Copyright (C) 2000, 2004, 2017 by Anton Zinoviev <anton@lml.bas.bg>
+% title: Bulgarian hyphenation patterns
+% version: 21 October 2017
+% language:
+% name: Bulgarian
+% tag: bg
+% notice: >
+% This file is part of the hyph-utf8 package.
+% See http://www.hyphenation.org for more information.
+% authors:
+% -
+% name: Anton Zinoviev
+% contact: anton:lml.bas.bg
+% licence:
+% text: >
+% This software may be used, modified, copied, distributed, and sold,
+% both in source and binary form provided that the above copyright
+% notice and these terms are retained. The name of the author may not
+% be used to endorse or promote products derived from this software
+% without prior permission. THIS SOFTWARE IS PROVIDED "AS IS" AND
+% ANY EXPRESS OR IMPLIED WARRANTIES ARE DISCLAIMED. IN NO EVENT
+% SHALL THE AUTHOR BE LIABLE FOR ANY DAMAGES ARISING IN ANY WAY OUT
+% OF THE USE OF THIS SOFTWARE.
+% hyphenmins:
+% typesetting:
+% left: 2
+% right: 2
+% changes: See below
+% ==========================================
+% Copyright (C) 2000,2004,2017 by Anton Zinoviev <anton@lml.bas.bg>
+%
+% This software may be used, modified, copied, distributed, and sold,
+% both in source and binary form provided that the above copyright
+% notice and these terms are retained. The name of the author may not
+% be used to endorse or promote products derived from this software
+% without prior permission. THIS SOFTWARE IS PROVIDED "AS IS" AND
+% ANY EXPRESS OR IMPLIED WARRANTIES ARE DISCLAIMED. IN NO EVENT
+% SHALL THE AUTHOR BE LIABLE FOR ANY DAMAGES ARISING IN ANY WAY OUT
+% OF THE USE OF THIS SOFTWARE.
+%
+% Bulgarian hyphenation patterns
+%
+% Generated by ./hyph-bg.sh --safe-morphology --standalone-tex
+%
+% Both left and right hyphenmins should be set to 2.
+%
+% % Automated Bulgarian Hyphenation
+% % Anton Zinoviev
+% % 21 October 2017
+%
+% Principles of the Bulgarian hyphenation
+% =======================================
+%
+% One peculiarity of the Bulgarian language is that its words are, on
+% average, longer than English words. Hyphenation is therefore more
+% important when typesetting Bulgarian text than when typesetting
+% English text. Knuth's line-breaking algorithm is such that most
+% English paragraphs need no hyphenation at all. With Bulgarian text,
+% however, even Knuth's algorithm will use hyphenation in most
+% paragraphs. Hyphenation becomes an absolute necessity if we want to
+% obtain nice, justified paragraphs with software that uses a simple
+% line-breaking algorithm, such as LibreOffice.
+%
+% According to Decree 936 of the Council of Ministers promulgated on 27
+% November 1950, the Institute for Bulgarian Language at the Bulgarian
+% Academy of Sciences is authorised to publish the rules of the
+% orthography of the Bulgarian language (within certain limits).
+%
+% Hyphenation rules between 1945 and 1983
+% ---------------------------------------
+%
+% Between 1945 and 1983 Bulgarian used syllable hyphenation with two
+% morphological exceptions: hyphenation is preferred between a prefix
+% and a stem and at the boundary of compound words. The following were
+% the rules governing the hyphenation:
+%
+% 1. One letter does not stay alone. Words of one syllable cannot be
+% hyphenated.
+% 2. No hyphenation before or after ь.
+% 3. In a sequence of vowels at least one vowel stays before the
+% hyphen.
+% 4. A single consonant between two vowels links with the second vowel.
+% For example по-ле /po-le/, ра-бо-та /ra-bo-ta/.
+% 5. In a sequence of consonants between two vowels, at least one
+% consonant stays with the second vowel. For example те-сто /te-sto/
+% or тес-то /tes-to/.[^b]
+% 6. In a sequence of consonants between two vowels, if the first
+% consonant is sonorant (й /y/, л /l/, м /m/, н /n/, р /r/), then it
+% stays with the first vowel. For example гер-дан /ger-dan/, сен-ки
+% /sen-ki/.
+% 7. The hyphenation separates two successive equal consonants. For
+% example времен-но /vremen-no/, пролет-та /prolet-ta/.
+% 8. When the letters дж /dzh/ and дз /dz/ denote a single consonant,
+% then they are not separated. For example боя-джия /boya-dzhiya/
+% but not бояд-жия /boyad-zhiya/. When these letters denote two
+% consonants, then the normal rules apply: над-живявам
+% /nad-zhivyavam/.
+% 9. Word prefixes may not be broken. Compound words are hyphenated
+% either at the boundary of the components or the hyphenation rules
+% are applied to each of the components separately. For example:
+% пред-упреждавам /pred-uprezhdavam/ (not пре-дупреждавам
+% /pre-duprezhdavam/), пред-известие /pred-izvestie/ (not
+% пре-дизвестие /pre-dizvestie/), за-движвам /za-dvizhvam/ (not
+% зад-вижвам /zad-vizhvam/), авто-клуб /avto-klub/ (not авток-луб
+% /avtok-lub/), вакуум-апарат /vakuum-aparat/ (not вакуу-мапарат
+% /vakuu-maparat/).
+%
+% In some rare cases the proper application of rule 9 depends on the
+% semantics of the word. For example пре-дреша /pre-dresha/ 'change
+% clothes' but пред-реша /pred-resha/ 'predetermine' or прес-пите
+% /pres-pite/ 'the snow-drifts' but пре-спите /pre-spite/ 'sleep for a
+% while/overnight'.
+%
+% [^b]: In several publications this rule is formulated with the
+% additional restriction that the sequence of consonants begins with
+% an obstruent. I believe this restriction is unintentional. It
+% makes no sense to forbid a hyphenation of the form AB-A but to
+% permit ABB-A (A denotes a vowel and B – a consonant).
+%
+% Hyphenation rules between 1983 and 2012
+% ---------------------------------------
+%
+% The Orthographic dictionary published by the Institute for Bulgarian
+% Language in 1983 introduced new hyphenation rules. The complexity of
+% the previous rules was the main reason for the change. The new rules
+% aimed at two objectives: simplicity and unambiguity.
+%
+% The new rules are:
+%
+% 1. A consonant between two vowels links with the second vowel. For
+% example ви-со-чи-на /vi-so-chi-na/.
+% 2. In a sequence of two or more consonants between two vowels, at
+% least one consonant stays with the first vowel and at least one
+% with the second vowel. For example сес-тра /ses-tra/ and сест-ра
+% /sest-ra/.
+% 3. Two equal consonants are separated. For example плен-ник
+% /plen-nik/.
+% 4. In a sequence of two or more vowels, the first vowel stays before
+% the hyphen. For example пре-одолея /pre-odoleya/ and прео-долея
+% /preo-doleya/.
+% 5. In a sequence of three or more vowels, the last vowel stays after
+% the hyphen. For example мао-изъм /mao-izam/ but not маои-зъм
+% /maoi-zam/.
+% 6. The letter й /y/ between a vowel and a consonant stays with the
+% vowel. For example май-ка /may-ka/.
+% 7. When a sequence of two or more consonants follows й /y/ then at
+% least one consonant links with й /y/. For example айс-берг
+% /ays-berg/ (not ай-сберг /ay-sberg/).
+% 8. The letter й /y/ between two vowels links with the second vowel.
+% For example ма-йор /ma-yor/.
+% 9. No hyphenation before or after ь.
+% 10. When the letters дж /dzh/ denote a single consonant, then they are
+% not separated. For example су-джук /su-dzhuk/ (not суд-жук
+% /sud-zhuk/) but над-живея /nad-zhiveya/.
+% 11. There must be at least one vowel before and after the hyphen.
+% 12. One letter does not stay alone.
+%
+% The total disregard of the morphology by these rules leads to some
+% strange results. For example пре-дизвестие /pre-dizvestie/ is
+% permitted and пред-известие /pred-izvestie/ is forbidden, зад-вижвам
+% /zad-vizhvam/ is permitted and за-движвам /za-dvizhvam/ is forbidden,
+% авток-луб /avtok-lub/ is permitted and авто-клуб /avto-klub/ is
+% forbidden, вакуу-мапарат /vakuu-maparat/ is permitted and
+% вакуум-апарат /vakuum-aparat/ is forbidden. Because of this, the new
+% rules were not universally accepted. The old rules are still
+% mentioned in various places on the Internet, and they are even
+% included in some grammar books published by the publishing houses of
+% the Ministry of Education and of Sofia University. Software
+% developers, however, soon fell in love with the new hyphenation rules.
+%
+% Hyphenation rules after 2012
+% ----------------------------
+%
+% In 2012 new rules came into force. There are two differences with
+% respect to the previous rules:
+%
+% 1. Rule 5 of the previous rules is revoked. For example маои-зъм
+% /maoi-zam/ becomes a valid hyphenation.
+% 2. The new rules permit morphologically based hyphenation (however it
+% is not obligatory). For example пред-известие /pred-izvestie/,
+% за-движвам /za-dvizhvam/, авто-клуб /avto-klub/, вакуум-апарат
+% /vakuum-aparat/ are valid hyphenations.
+%
+% Good hyphenation is a complex matter and it seems the linguists at the
+% Institute for Bulgarian Language have recognised this. They no longer
+% attempt to provide universal rules about everything. Instead, they
+% provide some very permissive rules, while the good application of
+% these rules is left to the discretion and the experience of the
+% printers and the developers of hyphenation software.
+%
+% It makes sense to use at least two different sets of hyphenation rules
+% for Bulgarian. In most cases a more restrictive version should be
+% used, one which attempts to eliminate the controversial cases of
+% hyphenation. When typesetting a Bulgarian text in a narrow newspaper
+% column, however, it will be appropriate to use more liberal
+% hyphenation rules. It should be noted that one of the reasons for the
+% hyphenation reform in 1983 was the desire to fix the chaotic
+% hyphenation in the Bulgarian newspapers at that time.
+%
+% Computer implementations
+% ========================
+%
+% Mathematical analysis of the Bulgarian hyphenation
+% --------------------------------------------------
+%
+% The earliest mathematical analysis of the Bulgarian hyphenation rules
+% belongs to Veska Noncheva.[^1] In 1988 she proposed a mathematical
+% formalisation of the hyphenation rules in a table with 22 rows.[^2]
+%
+% [^1]: <http://www.researchgate.net/profile/Veska_Noncheva>
+%
+% [^2]: Нончева В. Алгоритъм за автоматично пренасяне на думи в
+% българския език. Математика и математическо
+% образование. Сб. доклади на 17. ПК на СМБ. С., БАН, 1988, 479-482.
+%
+% In the same year Eugene Belogay[^3] proposed an alternative
+% formalisation with only 9 rules.[^4] Belogay proved that his rules are
+% consistent and that they form a minimal set. The rules of Belogay
+% are negative in character – every hyphenation which is not forbidden
+% by a rule is a possible hyphenation.
+%
+% [^3]: <http://www.linkedin.com/in/belogay>
+%
+% [^4]: Белогай Е. Алгоритъм за автоматично пренасяне на думи. Компютър
+% за вас (1988) 3, 12-14.
+%
+% The following are the first 7 rules, as formulated by Belogay:
+%
+% 1. Б-А
+% 2. А-ББ
+% 3. Б-ТТ, ТТ-Б
+% 4. ААА-Б
+% 5. й-ББ
+% 6. Б-ь
+% 7. д-ж
+%
+% Here А denotes an arbitrary vowel letter, Б denotes an arbitrary
+% consonant letter (including ь and й), ТТ denotes a sequence of two
+% equal consonant letters and the letters й, ь, д and ж denote
+% themselves. For example the rule "Б-А" says that we are not permitted
+% to separate a consonant letter from an immediately following vowel
+% letter.
+%
+% The eighth rule of Belogay says that hyphenation is forbidden before
+% the first and after the last vowel letter. The ninth rule of Belogay
+% says that hyphenation is forbidden immediately after the first or
+% immediately before the last letter of the word.
+%
+% Notice that it is very easy to translate the rules of Belogay into the
+% form required by the hyphenation algorithm of Knuth and Liang used
+% in TeX.[^a] Recall that this algorithm matches the word against a
+% set of string patterns in which odd numbers say that hyphenation is
+% permitted at a position and even numbers say that it is
+% forbidden. When two patterns give conflicting numbers for the same
+% position, the greater number wins.
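+%
+% The matching step just described can be illustrated with a minimal
+% Python sketch. It is not the TeX implementation: the function names
+% parse_pattern and hyphenate, and the toy pattern list in the note
+% that follows, are assumptions made only for illustration.
+%
```python
import re


def parse_pattern(pattern):
    """Split a pattern such as 'б2а' into its letters ('ба') and the
    number at every inter-letter position (0 where none is written)."""
    letters = re.sub(r"\d", "", pattern)
    values = [0] * (len(letters) + 1)
    position = 0
    for char in pattern:
        if char.isdigit():
            values[position] = int(char)
        else:
            position += 1
    return letters, values


def hyphenate(word, patterns, left_min=2, right_min=2):
    """Return word with '-' at every position whose final number is odd."""
    padded = "." + word + "."          # '.' marks the word boundaries
    scores = [0] * (len(padded) + 1)   # scores[i]: gap before padded[i]
    for pattern in patterns:
        letters, values = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:
            for offset, value in enumerate(values):
                # Conflicting numbers: the greater one wins.
                scores[start + offset] = max(scores[start + offset], value)
            start = padded.find(letters, start + 1)
    pieces, last = [], 0
    # Only positions allowed by the hyphenmins are considered.
    for k in range(left_min, len(word) - right_min + 1):
        if scores[k + 1] % 2 == 1:     # gap before word[k]
            pieces.append(word[last:k])
            last = k
    pieces.append(word[last:])
    return "-".join(pieces)
```
+%
+% With the toy patterns о1 е1 п1 л1 п2о л2е the word поле comes out as
+% по-ле: the 2 in п2о and л2е overrides the 1 of п1 and л1, so only the
+% position after о keeps an odd number.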
+%
+% First, since the rules of Belogay are negative (they say where
+% hyphenation is forbidden, not where it is permitted), we have to
+% permit the hyphenation everywhere:
+%
+% 1. А1
+% 2. Б1
+%
+% Then, the first seven rules of Belogay obtain the form:
+%
+% 1. Б2А
+% 2. А2ББ
+% 3. Б2ТТ ТТ2Б
+% 4. ААА2Б
+% 5. й2ББ
+% 6. Б2ь
+% 7. д2ж
+%
+% Since no Bulgarian word starts with more than four consonants and no
+% Bulgarian word ends with more than three consonants, the eighth rule
+% of Belogay can be translated in the following way:
+%
+% 1. .Б2
+% 2. .ББ2
+% 3. .БББ2
+% 4. 2Б.
+% 5. 2ББ.
+%
+% The ninth rule of Belogay means that left and right hyphen mins should
+% be set to 2.
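+%
+% The class templates above stand for many concrete patterns. As a
+% hedged illustration (this is not the hyph-bg.sh script), the Python
+% sketch below expands them. The names expand, TEMPLATES and
+% build_patterns are hypothetical, and the letter inventory (8 vowel
+% letters, 22 consonant letters including й and ь) is an assumption
+% consistent with the class definitions given earlier.
+%
```python
from itertools import product

VOWELS = "аеиоуъюя"                     # 8 vowel letters
CONSONANTS = "бвгджзйклмнпрстфхцчшщь"   # 22 letters, including й and ь

# Class symbols used in the templates: Cyrillic capitals А, Б and Т.
V, C, T = "\u0410", "\u0411", "\u0422"


def expand(template):
    """Return the set of concrete patterns described by one template."""
    # Т stands for one consonant repeated, so fix it before expanding.
    repeated = CONSONANTS if T in template else "-"
    patterns = set()
    for same in repeated:
        choices = []
        for symbol in template:
            if symbol == V:
                choices.append(VOWELS)
            elif symbol == C:
                choices.append(CONSONANTS)
            elif symbol == T:
                choices.append(same)
            else:
                choices.append(symbol)   # digits, '.', literal letters
        patterns.update("".join(combo) for combo in product(*choices))
    return patterns


TEMPLATES = [
    f"{V}1", f"{C}1",                       # permit hyphenation everywhere
    f"{C}2{V}", f"{V}2{C}{C}",              # rules 1 and 2
    f"{C}2{T}{T}", f"{T}{T}2{C}",           # rule 3
    f"{V}{V}{V}2{C}", f"й2{C}{C}",          # rules 4 and 5
    f"{C}2ь", "д2ж",                        # rules 6 and 7
    f".{C}2", f".{C}{C}2", f".{C}{C}{C}2",  # rule 8 at the word start
    f"2{C}.", f"2{C}{C}.",                  # rule 8 at the word end
]


def build_patterns():
    """Union of the expansions of all templates."""
    patterns = set()
    for template in TEMPLATES:
        patterns |= expand(template)
    return patterns
```
+%
+% Under these assumptions the three templates for rules 2 and 3 alone
+% expand to 3872 + 484 + 484 = 4840 concrete patterns.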
+%
+% The work of Eugene Belogay was not limited to merely a mathematical
+% analysis of the Bulgarian hyphenation rules. In his paper he
+% published a short algorithm in Pascal which implements these rules.
+% It didn't take long for this algorithm to be used in various text
+% processing software. The algorithm of Belogay was famous for many
+% years. Even as late as 1997 in one book about TeX, the author didn't
+% care to give any explanations but simply wrote about "the algorithm of
+% Belogay" as something well known to the reader.[^5]
+%
+% [^a]: Liang, Franklin Mark. Word Hy-phen-a-tion by
+% Com-put-er (Doctoral Dissertation). Stanford University, 1983
+%
+% [^5]: Василев В. Ултимативният ТеХ. Удоволствието да правим
+% предпечатна подготовка сами. София, Интела, 1997, 36
+%
+% Bulgarian hyphenation in TeX
+% ----------------------------
+%
+% One unfortunate design decision of Knuth was that the hyphenation
+% algorithm of TeX applied the hyphenation patterns not to the input
+% character codes but to the internal codes of the glyphs in the font.
+% This created a problem for the Cyrillic languages because in TeX the
+% Cyrillic fonts did not have standardised encoding. Perhaps this is
+% one of the reasons why the earliest implementations of the Bulgarian
+% hyphenation in TeX did not rely on the internal hyphenation algorithm
+% of TeX. Instead, external tools were used to insert soft hyphens in
+% all Bulgarian words. For example such a tool would replace the word
+% сричкопренасяне /srichkoprenasyane/ with
+% срич\\-коп\\-ре\\-на\\-ся\\-не /srich\\-kop\\-re\\-na\\-sya\\-ne/.
+% The saying "To every disadvantage there is a corresponding advantage"
+% is true – since Cyrillic and Latin letters use different character
+% codes, an external tool could easily insert soft hyphens in all
+% Bulgarian words while leaving the TeX commands intact.
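+%
+% The idea behind such external tools can be sketched in a few lines of
+% Python. This is not a reconstruction of any historical tool: the
+% function insert_soft_hyphens and the lookup table standing in for a
+% real hyphenator are assumptions made for illustration, and the sketch
+% works on Unicode text rather than on the encodings of that period.
+%
```python
import re

# Runs of characters from the basic Cyrillic Unicode block.
CYRILLIC_WORD = re.compile(r"[\u0400-\u04FF]+")


def insert_soft_hyphens(tex_source, split_word):
    """Join the pieces returned by split_word with the soft hyphen \\-.

    Only runs of Cyrillic letters are touched, so TeX commands, which
    are written with Latin letters, pass through unchanged.
    """
    def replace(match):
        return "\\-".join(split_word(match.group(0)))
    return CYRILLIC_WORD.sub(replace, tex_source)


# Hypothetical stand-in for a real hyphenator: a small lookup table.
TOY_TABLE = {"поле": ["по", "ле"], "работа": ["ра", "бо", "та"]}


def toy_split(word):
    """Return the known syllables of word, or the word itself."""
    return TOY_TABLE.get(word, [word])
```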
+%
+% The earliest known attempt to use the hyphenation algorithm of TeX for
+% Bulgarian was made by Ognyan Tonev in 1990.[^6] He described his work
+% as "a not very good translation of the rules. I work in this
+% direction. But I don't have a 100% working complect of patterns. So,
+% the copy I send to you[^7] is only a beta-version." The hyphenation
+% patterns of Tonev don't work correctly and it seems he never completed
+% his work.
+%
+% [^6]: The author of this text was unable to find current information
+% about Ognyan Tonev on the Internet. Apparently in 1990 he worked in
+% the Center of Informatics and Computer Technology of the Bulgarian
+% Academy of Sciences.
+%
+% [^7]: To Yannis Haralambous,
+% <http://perso.telecom-bretagne.eu/yannisharalambous>
+%
+% The first usable Bulgarian hyphenation patterns for TeX were developed
+% by Georgi Boshnakov[^8] in 1994. In order to solve the encoding
+% problem, Boshnakov had developed TeX fonts supporting the MIK encoding
+% (the prevalent encoding at that time in Bulgaria). This allowed him
+% to introduce a fully working implementation only a few months after
+% LaTeX2e became the official LaTeX version. Later Boshnakov adapted
+% his work to the Babel system. The hyphenation patterns of Boshnakov
+% did their job well enough that, for almost a quarter of a century
+% after their initial creation, they remained the only Bulgarian
+% hyphenation patterns in the standard distributions of TeX and CTAN.
+%
+% [^8]: <http://www.maths.manchester.ac.uk/~gb/>
+%
+% There are some similarities between the patterns of Boshnakov and the
+% patterns of Belogay. The following are the main differences.
+%
+% First, Boshnakov used an ingenious and more compact implementation of
+% the second and the third rule. Instead of {А2ББ, Б2ТТ, ТТ2Б}, or
+% 8×22×22+22×22+22×22=4840 patterns in total, Boshnakov has patterns of
+% the form 2Б3Б2 and 4Т3Т4, or only 22×22=484 in total, with the same
+% effect.
+%
+% The second main difference between the patterns of Boshnakov and the
+% patterns of Belogay concerns the letter combination дж /dzh/. In
+% Bulgarian this letter combination can denote either a single
+% consonant or a sequence of two consonants, and the hyphenation rules
+% change accordingly. Unfortunately, it is impossible to know the
+% meaning of дж /dzh/ without a vocabulary. The solution of Belogay was
+% a cautious one – his rules do the hyphenation in a way which will be
+% correct regardless of whether дж /dzh/ is a single consonant or a
+% sequence of two consonants. On the other hand, the approach of
+% Boshnakov is a bold one – since дж /dzh/ is more often a single
+% consonant, his rules assume that it is always a single consonant. The
+% number of the cases when this decision leads to bad hyphenations is
+% insignificant in comparison with the cases in which we obtain improved
+% hyphenation.
+%
+% The third main difference between the patterns of Boshnakov and the
+% patterns of Belogay concerns the eighth rule – its implementation in
+% the rules of Boshnakov is rather limited which leads to wrong
+% hyphenations like бри-дж /bri-dzh/. A full implementation of this
+% rule would require 11660 patterns in total and this would be too much
+% for the computers in 1994.
+%
+% Later developments
+% ------------------
+%
+% In 1995 Atanas Topalov defended a Master's thesis in the Faculty of
+% Mathematics and Informatics at Sofia University titled "Algorithms and
+% software about text processing".[^9] One of the main topics in his
+% thesis was the Bulgarian hyphenation. Topalov criticised vehemently
+% the official hyphenation rules and their total disregard of the
+% morphology. He wrote:
+%
+% > If we look at the history of the problems of the hyphenation, we
+% > will discover something very strange. Instead of the expected
+% > involvement with the depths and aspiration for more admissible and
+% > satisfactory style, we can find a growing tendency for
+% > simplification. One unpleasant discovery is that the development of
+% > the hyphenation software stays firmly on the principle "let us do
+% > the easiest thing". The earliest works which have been studied are
+% > from 1978. It turned out that they present the best approach
+% > concerning the automated hyphenation. The authors have chosen the
+% > most difficult but the most correct (from literary point of view)
+% > method for hyphenation, namely the morphological approach.
+%
+% Topalov proposed his own hyphenation algorithm. The hyphenation it
+% generated was smooth and easy to read. One obvious defect of the
+% algorithm of Topalov was that it contradicted the official hyphenation
+% rules at that time. One can argue, however, that his algorithm is
+% compatible with the current hyphenation rules.
+%
+% [^9]: The thesis of Atanas Topalov can be accessed at the author's
+% website <http://www.mind-print.com>
+%
+% In 1999 Svetla Koeva[^10] wrote a paper about the automated Bulgarian
+% hyphenation.[^11] At that time she was a junior member of the
+% Department of Computational Linguistics at the Institute for Bulgarian
+% Language but now she is a director of the whole institute. The paper
+% of Koeva contains a list of hyphenation patterns which can be used as
+% a basis of automated hyphenation. In 2004 with the help of Stoyan
+% Mihov[^12] the rules of Koeva were formalised with regular relations
+% and rewriting rules. They were implemented in a software product
+% named ItaEst which provided Bulgarian hyphenation and grammar checking
+% for various software products of Microsoft and Apple.
+%
+% [^10]: <http://dcl.bas.bg/svetla_koeva/>
+%
+% [^11]: Коева, Светла. Правила за пренасяне на части от думите на нов
+% ред. Български език. 1999/2000, 1, 84-86
+%
+% [^12]: <http://lml.bas.bg/~stoyan/>
+%
+% The main difference between the hyphenation of Koeva and the official
+% hyphenation rules effective after 2012 is that the separation of a
+% long sequence of consonants between two vowels is done according to
+% the rules valid before 1983. For example се-стра /se-stra/ and
+% ай-сберг /ay-sberg/ are permitted. The main difference between the
+% hyphenation of Koeva and the official hyphenation rules effective
+% before 1983 is that the rules of Koeva disregard the morphology of the
+% words. The following rule of Koeva is specific: in a sequence of two
+% sonorant consonants between two vowels, we are permitted to separate
+% the first vowel from the first consonant, for example материа-лна
+% /materia-lna/.
+%
+% In 2000 Anton Zinoviev[^13] created new hyphenation patterns for TeX.
+% He didn't know about the previous work of Boshnakov and he didn't
+% bother to make his work available in the various TeX distributions and
+% CTAN. His work was used mostly by the local Linux enthusiasts and the
+% colleagues of Zinoviev. In 2001 Radostin Radnev[^14] created a free
+% grammar dictionary of Bulgarian[^15] where he used the hyphenation
+% patterns of Zinoviev. From there the work of Zinoviev propagated to
+% OpenOffice, LibreOffice and various online dictionaries, including
+% <http://bg.wiktionary.org> and <http://rechnik.chitanka.info>.
+%
+% [^13]: The author of this text.
+%
+% [^14]: <http://bg.linkedin.com/in/radostinradnev>
+%
+% [^15]: <http://bgoffice.sourceforge.net/>
+%
+% The following are the main differences between the hyphenation of
+% Zinoviev and the hyphenation of Boshnakov.
+%
+% First, the eighth rule of Belogay is fully implemented.
+%
+% Second, the rules of Zinoviev try to detect when the letters дж /dzh/
+% (and дз /dz/) denote a single consonant and when they denote a
+% sequence of two consonants. By default, however, Zinoviev (like
+% Boshnakov) assumes that дж /dzh/ is a single consonant and hyphenates
+% accordingly.
+%
+% Third, the rules of Zinoviev disable some cases of unpleasant
+% hyphenations:
+%
+% 1. In a consonant sequence like тст /tst/, the two equal consonants т
+% /t/ are separated. For example братст-во /bratst-vo/ is forbidden
+% while братс-тво /brats-tvo/ and брат-ство /brat-stvo/ are
+% permitted.
+% 2. The hyphenation is forbidden after a sonorant consonant following
+% an obstruent consonant. For example отм-ра /otm-ra/ is forbidden
+% and от-мра /ot-mra/ is permitted.
+% 3. The hyphenation separates two consecutive kindred voiced/voiceless
+% consonants. For example субп-родукт /subp-rodukt/ is forbidden and
+% суб-продукт /sub-produkt/ is permitted.
+%
+% At the start of his work on the Bulgarian hyphenation, Zinoviev had
+% the opportunity to discuss the hyphenation with Svetla Koeva. He
+% remembers that some cases of unpleasant hyphenation were suggested to
+% him by Koeva. Unfortunately, he hasn't taken notes so now he doesn't
+% know which cases of unpleasant hyphenation have been suggested to him
+% by Koeva and which are his own findings.
+%
+% The present work
+% ================
+%
+% Motivation
+% ----------
+%
+% The present work was carried out on the initiative of the leader of
+% the Bulgarian localisation team of Mozilla,[^16] who contacted Zinoviev,
+% Boshnakov and the maintainers of the TeX hyphenation patterns.[^17]
+% This work pursues the following main objectives:
+%
+% 1. to update the hyphenation patterns in accordance with the current
+% hyphenation rules;
+% 2. to generate the hyphenation patterns by a publicly available
+% script;
+% 3. to make the hyphenation patterns customisable;
+% 4. to provide documentation for the future developers.
+%
+% [^16]: <http://mozillians.org/en-US/u/stoyan/>
+%
+% [^17]: <http://hyphenation.org>
+%
+% The current official hyphenation rules for Bulgarian are rather
+% liberal. Very often, in a long sequence of consonants we are
+% permitted to split the word at any position, for example аген-т-с-т-во
+% /agen-t-s-t-vo/. This leads to many unusual and unexpected results
+% that interrupt the attention of the reader or deceive his expectations
+% as his eyes move to the next line. On the other hand,
+% in order to produce nice justified paragraphs there is no need for so
+% many hyphenation possibilities. It would be sufficient even if only
+% one possible separation between any two syllables was permitted.
+%
+% Therefore, it makes sense to use a more restrictive version of the
+% Bulgarian hyphenation, one which eliminates the controversial cases of
+% hyphenation. Only when typesetting a Bulgarian text in a very narrow
+% newspaper column will it be appropriate to use a more liberal version.
+% It should be noted that some specialised English dictionaries also
+% separate the word-division positions into two categories – preferred
+% positions and less recommended positions.
+%
+% There are two methods to determine the optimal division within a
+% sequence of consonants between two vowels:
+%
+% * we can hyphenate according to the syllables in the word or
+% * we can hyphenate morphologically.
+%
+% Hyphenation according to the syllables in the word
+% --------------------------------------------------
+%
+% Let us look at the properties of the Bulgarian syllables. All
+% syllables have the following structure:
+%
+% > onset - nucleus - coda
+%
+% The nucleus in Bulgarian is always a vowel. Both the onset and the
+% coda are (possibly empty) sequences of consonants.
+%
+% The Bulgarian syllables adhere to the Sonority Sequencing Principle.
+% According to this principle, the consonants within the onset have
+% rising sonority and the consonants within the coda have decreasing
+% sonority.
+%
+% Several grammar books agree that the following sonority scale is valid
+% for Bulgarian:
+%
+% > voiceless obstruent < voiced obstruent < sonorant consonant < vowel
+%
+% According to the investigations of the author, the only exception to
+% this law is due to the letter в /v/, which is a voiced obstruent but
+% can also be used as a voiceless obstruent. This exception is due to a
+% spelling peculiarity of the Bulgarian language. Whenever the letter
+% в /v/ seemingly violates the Sonority Sequencing Principle, in the
+% spoken language this letter is read as ф /f/, that is as a voiceless
+% obstruent (for example the word отвсякъде /otvsyakade/ is read as
+% отфсякъде /otfsyakade/).[^18]
+%
+% [^18]: No Primitive Slavonic word contains the phoneme ф /f/.
+% Therefore, we can safely assume that in the Primitive Slavonic
+% language the consonant ф /f/ was a positional variant of the consonant
+% в /v/.
+%
+% The author has found that the sonorant consonants in Bulgarian have
+% their own sonority scale:
+%
+% > м /m/ < н /n/ < л /l/ < р /r/ < й /y/
+%
+% Only a few words such as жанр /zhanr/ and химн /himn/ violate this
+% scale. Such words are always loan-words and their pronunciation is
+% somewhat problematic for the native Bulgarian speakers.
+%
+% In addition to the Sonority Sequencing Principle, the consonant
+% clusters within the Bulgarian syllable adhere to the following
+% additional principles:
+%
+% 1. Both in the onset and in the coda, the labial and dorsal plosives
+% precede the coronal plosives and affricates.
+% 2. If the onset or the coda contains two plosives or affricates, then
+% there are no fricatives between them. A few words with the Latin
+% root 'text' are exceptions: контекст /kontekst/.
+% 3. If the onset or the coda contains two fricatives other than в /v/,
+% then there are no plosives or affricates between them.
+% 4. If the onset or the coda contains two plosives or affricates, then
+% they both have equal sonority (both are voiced, or both are
+% voiceless).
+% 5. If the onset or the coda contains two fricatives other than в /v/,
+% then they both have equal sonority (both are voiced, or both are
+% voiceless).
+% 6. Neither the onset nor the coda may contain two labial plosives,
+% two coronal plosives or affricates, or two dorsal plosives.
+% 7. Neither the onset nor the coda may contain two equal consonants
+% with the exception of в /v/ (for example втвърди /vtvardi/).[^19]
+%
+% [^19]: Actually, the letter в /v/ is not a real exception because in
+% all such cases this letter denotes two different consonants – в /v/
+% and ф /f/. Only in the Russian loan-word взвод /vzvod/ the two
+% letters в /v/ denote a repeating consonant в /v/.
+%
+% From all these properties of the Bulgarian syllable we can deduce the
+% following hyphenation rules:
+%
+% 1. In a sequence МК where М is a consonant with higher sonority than
+% K, we are not permitted to hyphenate before М. Exception: when М
+% is в /v/ and К is a voiceless consonant.
+% 2. In a sequence КМ where М is a consonant with higher sonority than
+% K, we are not permitted to hyphenate after М.
+% 3. In a sequence KBT where K and T are plosives or affricates and B is
+% fricative, we separate K from T.
+% 4. In a sequence CKB where K is a plosive or affricate and C and B are
+% fricatives other than в /v/, we separate C from B.
+% 5. If in a consonant sequence a coronal plosive or affricate Т is
+% followed by a labial or dorsal plosive К, then we separate Т from К.
+% 6. If a consonant sequence contains two plosives or affricates, one
+% voiced and one voiceless, then we separate them.
+% 7. If a consonant sequence contains two fricatives other than в /v/,
+% one voiced and one voiceless, then we separate them.
+% 8. If a consonant sequence contains two labial plosives or two coronal
+% plosives or affricates or two dorsal plosives then they are
+% separated.
+% 9. If a consonant sequence contains two equal consonants (not
+% necessarily consecutive), then they are separated.
+%
+% With so many prohibitive rules, a question arises: if we apply all
+% these rules, aren't we going to eliminate too many hyphenation
+% possibilities? The answer is no. It can be demonstrated that between
+% any two consecutive syllables at least one separation point will be
+% permitted.
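+%
+% The first two rules of this list can be sketched in Python as
+% follows. This is only an illustration under stated assumptions, not
+% the generator of the patterns in this file: the function names are
+% hypothetical, the assignment of letters to the voiceless and voiced
+% classes is an assumption, and the rules are read as applying to
+% adjacent consonants only.
+%
```python
VOICELESS = set("пткфсшхцчщ")   # assumed voiceless obstruent letters
VOICED = set("бвгджз")          # assumed voiced obstruent letters
SONORANTS = "мнлрй"             # ordered by rising sonority, as in the scale above


def sonority(consonant):
    """Rank a consonant letter on the sonority scale (higher = more sonorous).

    Letters outside the three classes (for example ь) raise ValueError.
    """
    if consonant in VOICELESS:
        return 0
    if consonant in VOICED:
        return 1
    return 2 + SONORANTS.index(consonant)


def forbidden_breaks(cluster):
    """Break positions inside a consonant cluster excluded by rules 1 and 2.

    Position i means a break before cluster[i]; position len(cluster)
    means a break after the last consonant of the cluster.
    """
    forbidden = set()
    for i in range(len(cluster) - 1):
        first, second = cluster[i], cluster[i + 1]
        if sonority(first) > sonority(second):
            # Rule 1: no break before the more sonorous first consonant,
            # except when it is в followed by a voiceless consonant.
            if not (first == "в" and second in VOICELESS):
                forbidden.add(i)
        elif sonority(second) > sonority(first):
            # Rule 2: no break after the more sonorous second consonant.
            forbidden.add(i + 2)
    return forbidden
```
+%
+% For the cluster тмр in отмра this excludes the positions after м and
+% after р, so отм-ра is rejected while от-мра remains possible. For the
+% cluster рд in гердан it excludes the break before р.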
+%
+%
+% Hyphenation according to the morphology
+% ---------------------------------------
+%
+% Between 1983 and 2012 the official orthographic rules of the
+% Bulgarian language forbade morphologically based hyphenation. After
+% 2012 such hyphenation is permitted (but not obligatory).
+%
+% Morphologically based hyphenation is most desirable in compound
+% words. Divisions such as авток-луб /avtok-lub/ and вакуу-мапарат
+% /vakuu-maparat/ are extremely irritating even if they are formally
+% correct. Unfortunately, we do not have a word list of Bulgarian
+% compound words that would permit us to produce rules for automated
+% hyphenation. Therefore, the current Bulgarian hyphenation patterns do
+% not attempt to apply morphological hyphenation to such words.
+%
+% Second in importance (but far more significant in terms of numbers) is
+% the case of word prefixes. While the reader's eyes are still at the
+% start of the word, the word is not yet known to him. At this point,
+% it is very important not to mislead his expectations. For example,
+% when the reader sees над- /nad-/ at the end of the line, he will
+% expect the prefix над- /nad-/ with semantics 'attain more than'.
+% This expectation is frustrated if над- /nad-/ is not really a prefix
+% but a deceptive (though formally correct) hyphenation of the word
+% надремя /nadremya/ 'have dozed enough', where the real prefix is not
+% над- /nad-/ but на- /na-/ with semantics 'achieve a state after
+% accumulation'. Such hyphenation distracts the reader and makes
+% reading more difficult.
+%
+% Third in importance is the case of word suffixes. With respect to
+% the hyphenation rules we can divide the suffixes into three
+% categories:
+%
+% 1. Suffixes starting with a vowel, for example -ар /-ar/. It is not
+% appropriate to follow the morphology with such suffixes because
+% this will contradict the whole hyphenation tradition of the
+% Bulgarian language. For example крав-ар /krav-ar/ is unwarranted.
+% 2. Suffixes starting with one consonant, for example -ка /-ka/.
+% Usually with such suffixes the syllable boundary in the word
+% coincides with the morpheme boundary, so no special care is
+% needed, for example кравар-ка /kravar-ka/. The exceptions are
+% rare, for example: обек-тната /obek-tnata/ instead of обект-ната
+% /obekt-nata/.
+% 3. Suffixes starting with more than one consonant (-ски /-ski/, -ство
+% /-stvo/). It is possible to use morphological hyphenation rules
+% with such suffixes.
+%
+% Even if it is possible to use morphological hyphenation with the
+% suffixes of the third category, it turns out that this is not as
+% useful as in the case of prefixes. When the eyes of the reader have
+% reached this part of the word, the word is already more or less known
+% to the reader. Therefore, at this point the morphological hyphenation
+% does not provide any significant advantages in comparison to the
+% simpler hyphenation based only on the syllables in the word. Consider
+% for example the word геройс-тво /geroys-tvo/ with suffix -ство
+% /-stvo/. When the reader sees геройс- /geroys-/ at the end of the
+% line this will give him an early clue that the suffix of the word is
+% -ство /-stvo/. Such non-morphological hyphenation does not deceive
+% the expectations of the reader. On the contrary, it makes the reading
+% easier because it gives clues to the reader about what follows on the
+% next line.
+%
+% Because of these considerations, the current Bulgarian hyphenation
+% patterns do not attempt to use morphological hyphenation with respect
+% to the suffixes of the words. It would nevertheless be useful to
+% implement rules for the suffixes of the second category. Hopefully,
+% some future version will have such rules.
+%
+% Occasionally,[^20] a fourth morphological requirement is stated: that
+% hyphenation should conform with the boundary between the word and the
+% definite articles -та /-ta/ and -те /-te/ (postfixed in Bulgarian).
+% There is no need to pay attention to this rule because it seems to be
+% satisfied automatically. The author has searched a dictionary of
+% over 860000 Bulgarian words for cases where the hyphenation rules
+% would hyphenate badly with respect to the definite article. He was
+% unable to find even one such case with the hyphenation rules valid
+% after 1983 and only about 10 cases with the rules valid before 1983
+% (one of them is живопи-ста /zhivopi-sta/ instead of живопис-та
+% /zhivopis-ta/).
+%
+% One unavoidable characteristic of any morphologically based automated
+% hyphenation is that it can create wrong hyphenations. Because of
+% this, one useful option is to use the morphology in a safe way – to
+% use it in order to forbid bad hyphenations but to create no new
+% hyphenation possibilities solely on the basis of the morphology.
+%
+% Take for example the word дозрея /dozreya/ 'ripen fully'. According
+% to the phonological rules, we should hyphenate it as доз-рея
+% /doz-reya/. According to the morphology, however, we should hyphenate
+% as до-зрея /do-zreya/ because this word is formed with the prefix до-
+% /do-/ with semantics 'complete or supplement' and this semantics would
+% be lost if the reader sees доз- /doz-/ at the end of the line.
+% Therefore, there are three methods to hyphenate this word:
+%
+% 1. доз-рея /doz-reya/ when morphology is not used;
+% 2. до-зрея /do-zreya/ when morphology is fully used;
+% 3. дозрея /dozreya/ (no hyphenation) when morphology is used in a safe
+% way.
+%
+% The option to use the morphology in a safe way is very attractive when
+% the software uses a smart line-breaking algorithm which can produce
+% good results even with fewer hyphenation possibilities. TeX is one
+% such program. It should be noted that this option does not eliminate
+% too many hyphenation possibilities because the morpheme boundaries
+% most of the time are also syllable boundaries.
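+%
+% The safe option described above can be sketched in a few lines of
+% Python. This is only an illustration, not the way `hyph-bg.sh`
+% builds its patterns: it assumes that the safe hyphenation keeps
+% exactly those break points on which the phonological and the fully
+% morphological hyphenations agree. All function names are
+% hypothetical.
+%
+% ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ python
+% def break_points(hyphenated):
+%     """Return the letter offsets at which '-' occurs in the input."""
+%     points = set()
+%     offset = 0
+%     for ch in hyphenated:
+%         if ch == "-":
+%             points.add(offset)
+%         else:
+%             offset += 1
+%     return points
+%
+%
+% def insert_hyphens(word, points):
+%     """Render word with '-' inserted before each offset in points."""
+%     out = []
+%     for i, ch in enumerate(word):
+%         if i in points:
+%             out.append("-")
+%         out.append(ch)
+%     return "".join(out)
+%
+%
+% def safe_hyphenation(phonological, morphological):
+%     """Keep only the breaks shared by both hyphenations (assumption)."""
+%     word = phonological.replace("-", "")
+%     shared = break_points(phonological) & break_points(morphological)
+%     return insert_hyphens(word, shared)
+%
+%
+% print(safe_hyphenation("доз-рея", "до-зрея"))      # дозрея
+% print(safe_hyphenation("кравар-ка", "кравар-ка"))  # кравар-ка
+% ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+%
+% For дозрея /dozreya/ the two hyphenations disagree, so no break
+% point survives; for a word where they agree nothing is lost.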
+%
+% [^20]: Правописен и правоговорен наръчник. Състав. Иван Хаджов,
+% Цв. Минков; Ред. Ив. Хаджов и др. София, Бълг. кн., 1945
+%
+% The following statistics describe the quality of the morphological
+% rules (the number after the sign ± is the expected standard
+% deviation of our estimates):
+%
+% With the option `--morphology`:
+%
+% * in 0.1% ±0.3% of the dictionary words the morphological patterns
+% create very wrong hyphenation;
+% * in 89.8% ±0.1% of the dictionary words the morphological patterns
+% hyphenate identically with the case when no morphology patterns are
+% used;
+% * in 0.3% ±0.2% of the dictionary words the morphological patterns
+% hyphenate differently in comparison to the case when no morphology
+% patterns are used and the word is hyphenated in a way which
+% contradicts the morphology;
+% * in 0.6% ±0.1% of the dictionary words the morphological patterns
+% hyphenate differently in comparison to the case when no morphology
+% patterns are used and there is a possible hyphenation which is
+% compatible with the word morphology but which is nevertheless
+% forbidden by the morphology patterns.
+%
+% With the option `--safe-morphology`:
+%
+% * in 0% of the dictionary words the morphological patterns create very
+% wrong hyphenation;
+% * in 90.0% ±0.1% of the dictionary words the morphological patterns
+% hyphenate identically with the case when no morphology patterns are
+% used;
+% * in 0.3% ±0.2% of the dictionary words the morphological patterns
+% hyphenate differently in comparison to the case when no morphology
+% patterns are used and the word is hyphenated in a way which
+% contradicts the morphology;
+% * in 0.6% ±0.1% of the dictionary words the morphological patterns
+% hyphenate differently in comparison to the case when no morphology
+% patterns are used and there is a possible hyphenation which is
+% compatible both with the word morphology and with the syllable
+% boundaries but which is nevertheless forbidden by the morphology
+% patterns.
+%
+% Notice that the morphological patterns create a different hyphenation
+% only in about 10% of the words. The following explanation can be
+% given for this surprising fact. First, the natural evolution of the
+% human languages tends to simplify the complex sequences of consonants.
+% Therefore, no morpheme contains a complex sequence of consonants. And
+% second, the Bulgarian orthography is morphological. This means that
+% the morphemes are written according to their actual pronunciation;
+% however, the simplifications in the spoken language which take place
+% at the morpheme boundaries are not taken into account in the
+% orthography. The independent operation of these two factors leads to
+% the result that most of the time the morpheme boundaries coincide with
+% the conventional syllable boundaries. The main exception to this is
+% when a morpheme starts with a vowel; in this case its syllable will
+% include one or more consonants of the preceding morpheme. The second
+% exception is when a morpheme ends with a vowel and the next morpheme
+% starts with a sequence of two or more consonants.
+%
+% Usage of the script `hyph-bg.sh`
+% --------------------------------
+%
+% `hyph-bg.sh` is an all-in-one script which can generate both
+% documentation (this text) and Bulgarian hyphenation patterns. When
+% given the option `--help` the script gives short usage instructions:
+%
+% ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+% hyph-bg.sh --help
+% Show this info
+% hyph-bg.sh [--doc-html | --doc-latex | --doc-txt]
+% Print documentation in various formats
+% hyph-bg.sh [other options]
+% Generate Bulgarian hyphenation patterns
+%
+% Options when generating hyphenation patterns:
+%
+% --standalone-tex
+% Produce hyphenation patterns for TeX with \patterns{ ... }.
+%
+% --no-hyphen-mins
+% Hyphenation patterns which do not require hyphen mins.
+% Otherwise: both left and right hyphen mins should be set to 2.
+%
+% --safe-dz
+% Do not try to guess whether DZ is a single consonant or not.
+% Only use hyphenation which will be correct in both cases.
+%
+% --permissible
+% Permit any formally correct hyphenation, including unnatural
+% divisions, such as studen-tstvo. Useful for educational tools
+% or when typesetting Bulgarian text in a very short column.
+%
+% --morphology
+% Apply morphology when hyphenating, for example: za-dvizhvam.
+% May hyphenate incorrectly in some cases.
+%
+% --safe-morphology
+% Apply morphology when hyphenating. Never hyphenates incorrectly
+% but may prohibit some correct hyphenations.
+%
+% --no-morphology
+% Disregard the morphology. Default.
+%
+% --1945
+% Hyphenate according to the rules effective between 1945 and 1982
+%
+% --1983
+% Hyphenate according to the rules effective between 1983 and 2011
+%
+% --2012
+% Hyphenate according to the rules effective after 2012. Default.
+% ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+%
+% The following are the recommended ways to generate hyphenation
+% patterns by this script:
+%
+% `hyph-bg.sh --standalone-tex --safe-morphology`
+% : For TeX. Apply the morphology in a safe way when the software
+% uses a smart line-breaking algorithm.
+%
+% `hyph-bg.sh`
+% : For most other software.
+%
+% `hyph-bg.sh --no-hyphen-mins`
+% : The current versions of Mozilla (as of 2017) seem to ignore the
+% hyphen mins in words that contain a dash.
+%
+% `hyph-bg.sh --morphology`
+% : For professional typography with a human proofreader.
+%
+% `hyph-bg.sh --permissible`
+% : For educational tools and online dictionaries which can show only one
+% kind of hyphenation.
+%
+% Notice that some specialised English dictionaries separate the
+% word-division positions into two categories – preferred positions and
+% less recommended positions. It would be best if the Bulgarian online
+% dictionaries could do the same. For example, a hyphen "-" can be used
+% to display the preferred positions and a dot "." to display the less
+% recommended positions. If a word-division position is permitted only
+% by the patterns of `hyph-bg.sh --permissible`, then this position is
+% less recommended.
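+%
+% The display convention suggested above can be sketched in Python.
+% This is a minimal illustration under stated assumptions: the
+% function name is hypothetical, and the break offsets used for
+% studentstvo are made up for the example, except the division
+% studen-tstvo, which the `--permissible` help text mentions.
+%
+% ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ python
+% def render_divisions(word, preferred, permissible):
+%     """Mark preferred breaks with '-' and permissible-only ones with '.'.
+%
+%     preferred:   offsets allowed by the default patterns (assumed input)
+%     permissible: offsets allowed by the --permissible patterns
+%     """
+%     out = []
+%     for i, ch in enumerate(word):
+%         if i in preferred:
+%             out.append("-")
+%         elif i in permissible:
+%             out.append(".")
+%         out.append(ch)
+%     return "".join(out)
+%
+%
+% # Hypothetical offsets; only offset 6 (studen-tstvo) is taken from
+% # the help text of the script.
+% print(render_divisions("studentstvo", {3, 7}, {3, 6, 7}))
+% # stu-den.t-stvo
+% ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~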
+%
+
+\message{Bulgarian hyphenation patterns (options: --safe-morphology --standalone-tex, version 21 October 2017)}