Mixed content grammars for SGML document analysis
Pim van der Eijk (pvdeijk@inetgate.capgemini.nl)
Dennis Janssen
Cap Gemini Nederland
SGML documents are composed of sequences of character data and
markup codes, which encode document structure under control of
a Document Type Definition. NLP systems that manipulate
SGML-encoded data therefore need to account for both linguistic
structure and document structure. This raises a number of
descriptive linguistic, theoretical and engineering challenges
and opportunities for NLP system design: structure-based
disambiguation, maintaining well-formedness and validity,
NLP-based content tagging, and system test methods based on
annotated source data. In this presentation, we describe the
use of mixed content grammars to process SGML-encoded
documents, designed for an NLP system for Controlled Language
applications.
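
As a rough illustration of what "sequences of character data and
markup codes" means for a parser, the following minimal Python sketch
splits an SGML-like fragment into tag tokens and text tokens, the kind
of mixed input stream a mixed content grammar would have to accept.
It is not the system described in the presentation. The element names
(step, part), the tokenize function and the token format are all
hypothetical, and the sketch assumes fully spelled-out tags, ignoring
SGML tag minimization, entities and DTD validation.

```python
import re

# Hypothetical token pattern: a start tag, an end tag, or a run of
# character data. Assumes tags are never minimized or omitted.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)[^>]*>|([^<]+)")

def tokenize(sgml):
    """Return ('start', name), ('end', name) and ('text', data) tokens."""
    tokens = []
    for match in TOKEN.finditer(sgml):
        slash, name, text = match.groups()
        if text is not None:
            # Character data: keep it only if it is not pure whitespace.
            if text.strip():
                tokens.append(("text", text.strip()))
        elif slash:
            tokens.append(("end", name.lower()))
        else:
            tokens.append(("start", name.lower()))
    return tokens

# Illustrative fragment with markup embedded inside a sentence.
fragment = "<step>Remove the <part>fuel filter</part> carefully.</step>"
tokens = tokenize(fragment)
for token in tokens:
    print(token)
```

In this token stream the part element interrupts the sentence, so a
grammar over it has to describe the linguistic structure of the
sentence and the document structure of the tags at the same time.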