H02D0A Language Engineering Applications

Corpus construction and annotation

Frank Van Eynde
Centre for Computational Linguistics (KU Leuven)

Abstract

The first part of the lecture focuses on the main functions and characteristics of corpora and presents the standards and guidelines which are currently used for the construction and annotation of corpora. It also provides a survey of the most important corpora for English and Dutch.

The second part of the lecture is devoted to the most commonly applied type of annotation, i.e. part of speech tagging. It discusses the development of tagsets, the principles of probabilistic tagging and the implementation of part of speech taggers. It also includes a demonstration.

Slides

Slides for part 1 (.ppt)
Handout for part 2 (.pdf)

References for part 1

Roger Garside, Geoffrey Leech & Tony McEnery (eds.), Corpus Annotation. Linguistic information from computer text corpora. Longman, 1997.

Tony McEnery, Richard Xiao & Yukio Tono, Corpus-based Language Studies: An Advanced Resource Book. Routledge, 2005.

Nelleke Oostdijk, Building a Corpus of Spoken Dutch. In Paola Monachesi (ed.), Computational Linguistics in the Netherlands 1999. University of Utrecht, 2000.

References for part 2

Thorsten Brants, TnT - A statistical part-of-speech tagger. Proceedings of the Sixth Applied Natural Language Processing Conference. Seattle, 2000.

Daniel Jurafsky & James H. Martin, Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Chapter 5: Part-of-Speech Tagging. Second Edition. Prentice Hall, 2009.

Frank Van Eynde, Jakub Zavrel & Walter Daelemans, Lemmatisation and Morphosyntactic Annotation for the Spoken Dutch Corpus. In M. Gavrilidou et al. (eds.), Proceedings of the Second International Conference on Language Resources and Evaluation. European Language Resources Association, Paris, 2000, pp. 1427-1433.

Frank Van Eynde, Part of Speech Tagging en Lemmatisering van het Corpus Gesproken Nederlands. Leuven, 2003.

Links

A useful starting point for an excursion into the field of corpus linguistics is the web site of UCREL, the University Centre for Computer Corpus Research on Language.

English corpora

The Brown Corpus Brown

The Lancaster-Oslo/Bergen Corpus LOB

The London-Lund Corpus of Spoken English LoLu

The British National Corpus BNC

Dutch corpora

The Spoken Dutch Corpus CGN

The Dutch Language Corpus Initiative D-Coi

The Dutch Reference Corpus SoNaR

The Dutch Parallel Corpus DPC

Distribution agencies

The Dutch Human Language Technology Agency TST

The European Language Resources Association ELRA

The Linguistic Data Consortium LDC

Standardization efforts

The Expert Advisory Group on Language Engineering Standards EAGLES