Mixed content grammars for SGML document analysis

Pim van der Eijk (pvdeijk@inetgate.capgemini.nl)
Dennis Janssen
Cap Gemini Nederland

SGML documents are composed of sequences of character data and 
markup codes, which encode document structure under the control of
a Document Type Definition (DTD). NLP systems that manipulate
SGML-encoded data therefore need to account for both linguistic
and document structure. This raises a number of descriptive
linguistic, theoretical, and engineering challenges and
opportunities for NLP system design: structure-based
disambiguation, maintaining well-formedness and validity,
NLP-based content tagging, and system test methods based on
annotated source data. In this presentation, we describe the
use of mixed content grammars to process SGML-encoded
documents, as designed for an NLP system for Controlled Language
applications.