For a large-vocabulary speech recognition system, it is crucial to have a good language model, which tells the recognizer how likely a given word sequence is to occur.
There has already been extensive research on how to build a reliable language model from a given training text. The two most important problems are data sparseness and unseen words: training texts can never cover all words, bigrams and trigrams that are likely to occur in normal speech.
In this contribution, a new approach is proposed to combat these problems: syllables instead of words are considered as text atoms. The advantages are clear: a text contains far fewer distinct syllables than distinct words, which reduces the sparseness problem. Moreover, most unseen words will be composed of known syllables, which reduces the unseen-word problem as well.
This approach does, however, introduce its own problems. This paper discusses and partially solves some of them:
- Automatic hyphenation of a text is less trivial than isolating words.
- Mapping syllables to phonemes is more subject to assimilation.
- Trigram models are no longer sufficient.
- The possibility of modeling unknown words also implies the possibility of generating nonsensical words.
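
As a rough illustration of the sparseness and unseen-word arguments above, the following Python sketch compares word-level and syllable-level coverage on a toy word list. The word list, the hand-made splits marked with `-`, and all helper names are assumptions made for illustration only. They are not taken from the system described here, and the manual splits merely stand in for an automatic hyphenation step.

```python
# Toy illustration only: words are hand-split into rough orthographic
# syllables with "-" (an assumption; no real hyphenator is used here).
train = [
    "re-port", "sup-port", "re-form", "in-form", "con-form",
    "con-tent", "in-tent", "con-sist", "in-sist", "re-sist",
    "per-sist", "per-form", "im-port", "ex-port", "trans-form",
]
# Words that do not occur in the training list.
test = ["trans-port", "ex-tent", "de-form"]


def syllables(word):
    """Return the syllables of a pre-hyphenated word."""
    return word.split("-")


# Vocabulary when whole words are the text atoms.
word_vocab = {w.replace("-", "") for w in train}
# Vocabulary when syllables are the text atoms.
syl_vocab = {s for w in train for s in syllables(w)}


def covered_by_words(word):
    """True if the whole word was seen in training."""
    return word.replace("-", "") in word_vocab


def covered_by_syllables(word):
    """True if every syllable of the word was seen in training."""
    return all(s in syl_vocab for s in syllables(word))


word_covered = [w for w in test if covered_by_words(w)]
syl_covered = [w for w in test if covered_by_syllables(w)]

print(len(word_vocab), len(syl_vocab))  # 15 12
print(word_covered)                     # []
print(syl_covered)                      # ['trans-port', 'ex-tent']

# The flip side: a nonsense string built from known syllables is also
# "covered", so a syllable-based model can generate it.
print(covered_by_syllables("sup-sist"))  # True
```

In this toy setting the syllable inventory is smaller than the word inventory, and two of the three unseen words can be built from known syllables, while "de-form" still contains an unseen syllable. The last line shows the corresponding drawback: "sup-sist" is not a word, yet it consists entirely of known syllables.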