Abstract
The first part of the lecture focuses on the main functions and characteristics of corpora and presents the standards and guidelines which are currently used for the construction and annotation of corpora. It also provides a survey of the most important corpora for English and Dutch.
The second part of the lecture is devoted to the most commonly applied type of annotation, namely part-of-speech tagging. It discusses the development of tagsets, the principles of probabilistic tagging and the implementation of part-of-speech taggers. It also includes a demonstration.
Slides
Slides for part 1 (.ppt)
Handout for part 2 (.pdf)
References for part 1
Roger Garside, Geoffrey Leech & Tony McEnery (eds.), Corpus Annotation. Linguistic information from computer text corpora. Longman, 1997.
Tony McEnery, Richard Xiao & Yukio Tono, Corpus-based Language Studies: An Advanced Resource Book. Routledge, 2005.
Nelleke Oostdijk, Building a Corpus of Spoken Dutch. In Paola Monachesi (ed.), Computational Linguistics in the Netherlands 1999. University of Utrecht, 2000.
References for part 2
Thorsten Brants, TnT - A statistical part-of-speech tagger. Proceedings of the Sixth Applied Natural Language Processing Conference. Seattle, 2000.
Daniel Jurafsky & James H. Martin, Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Chapter 5: Part-of-Speech Tagging. Second Edition. Prentice Hall, 2009.
Frank Van Eynde, Jakub Zavrel & Walter Daelemans, Lemmatisation and Morphosyntactic Annotation for the Spoken Dutch Corpus. In M. Gavrilidou et al. (eds.), Proceedings of the Second International Conference on Language Resources and Evaluation. European Language Resources Association, Paris, 2000, pp. 1427-1433.
Frank Van Eynde, Part of Speech Tagging en Lemmatisering van het Corpus Gesproken Nederlands. Leuven, 2003.
Links
A useful starting point for exploring the field of corpus linguistics is the website of the University Centre for Computer Corpus Research on Language (UCREL).
English corpora
The Brown Corpus (Brown)
The Lancaster-Oslo/Bergen Corpus (LOB)
The London-Lund Corpus of Spoken English (LoLu)
The British National Corpus (BNC)
Dutch corpora
The Spoken Dutch Corpus (CGN)
The Dutch Language Corpus Initiative (D-Coi)
The Dutch Reference Corpus (SoNaR)
The Dutch Parallel Corpus (DPC)
Distribution agencies
The Dutch Human Language Technology Agency (TST)
The European Language Resources Association (ELRA)
The Linguistic Data Consortium (LDC)
Standardization efforts
The Expert Advisory Group on Language Engineering Standards (EAGLES)