CORPUS ADAPTATION WITH WEIGHTED COUNTING
Dong Hoon Van Uytsel
Departement Elektrotechniek
Katholieke Universiteit Leuven
e-mail: donghoon@esat.kuleuven.ac.be
Statistical language models, as used in statistical speech recognition,
assign likelihood scores to competing transcription hypotheses. They need
to be trained with large amounts of written text, which is becoming
increasingly available through CD-ROM collections and the WWW. For the
resulting language model to perform well, however, it is crucial that the
training material be similar in topic and style to the speech that will
eventually be recognized (the target domain). This matching process, the
adaptation of the training corpus to the target domain, is usually done by
selecting relevant documents, either manually or automatically using a
relevance measure.
We describe a method that instead assigns a weight to each observed
sentence. The weight reflects how well the sentence matches the target
domain, so we avoid a hard decision on whether or not to include the
sentence in the training data. We derive a formula for this weight in a
Bayesian re-estimation framework and study the accuracy of this method. We
compare the performance of the resulting models, in terms of perplexity and
word accuracy, with other recently published corpus adaptation methods.
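
As a rough illustration of the weighted-counting idea, the Python sketch
below accumulates fractional n-gram counts, where each sentence contributes
its weight instead of a count of one. The weight values, the function names
weighted_ngram_counts and conditional_prob, and the unsmoothed
relative-frequency estimate are all assumptions made for this example; the
Bayesian weight formula itself is not reproduced here.

```python
from collections import Counter


def weighted_ngram_counts(sentences, weights, n=2):
    """Accumulate n-gram counts where each sentence adds its weight, not 1."""
    counts = Counter()
    for tokens, w in zip(sentences, weights):
        # Pad with sentence-boundary markers (hypothetical symbols).
        padded = ["<s>"] * (n - 1) + list(tokens) + ["</s>"]
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += w
    return counts


def conditional_prob(counts, history, word):
    """Unsmoothed relative frequency P(word | history) from weighted counts."""
    total = sum(c for ngram, c in counts.items() if ngram[:-1] == history)
    return counts[history + (word,)] / total if total > 0 else 0.0


# Toy corpus; the weights are made-up stand-ins for a target-domain match score.
corpus = [["stocks", "fell"], ["stocks", "rose"], ["cats", "purr"]]
weights = [1.0, 0.5, 0.0]

counts = weighted_ngram_counts(corpus, weights)
p_fell = conditional_prob(counts, ("stocks",), "fell")
```

Hard selection is the special case in which every weight is either 0 or 1;
fractional weights let a partially relevant sentence contribute
proportionally. A practical model would apply smoothing on top of these
counts.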