A NUMERIC TRANSFORM FOR MEMORY-BASED LANGUAGE PROCESSING

Walter Daelemans (ILK, Tilburg & CNTS, Antwerpen)
Marc Gemis (Computer Science, Antwerpen)

Memory-Based Language Processing has recently become increasingly used
as a method within statistical natural language processing for
problems ranging from phonology to semantics (Argamon et al. 1998;
Cardie, 1993; Daelemans et al.  1996).  In this approach, all
available training material for a task is kept in memory, and new
instances of the task are solved by extrapolating from the most
similar items in memory.  Advantages of this approach (compared to
other statistical methods) include (i) automatic smoothing of
low-frequency events by means of the similarity metric, (ii) easy
integration of different information sources by means of a feature
relevance weighting metric, and (iii) availability of all training
data for extrapolation.  It has recently been shown (Daelemans et al.
1998) that for language processing tasks, abstracting from training
data, however exceptional these data items are, is generaly harmful
for language learning.
 

For the most obvious implementation of memory-based algorithms, the approach is computationally too expensive to be used in practice (e.g. for part of speech tagging, millions of training items with tens of features per training item are common). As each new task instance has to be compared with each memory item, straightforward implementations are too slow. For numeric features, efficient optimizations (e.g. kd trees) have been developed which make average nearest neighbour search O(log2 n), n the number of training items. Unfortunately, for language processing tasks representations are mostly symbolic (using values such as words, phonemes, graphemes, linguistic categories etc.), and the performance increase using k-d tree like methods for symbolic features is marginal.

In this paper we propose an algorithm to transform the symbolic representations used in our natural language processing tasks into a numeric representation. On the basis of a modified value difference metric (Cost and Salzberg, 1993), differential similarities between values of the same feature are computed, and these are then used to sort the values of a feature into a numeric order, with intervals between values based on their value difference. The resulting numeric representations are then used in kd-tree optimized memory-based learning. We compare this method empirically (measuring generalization accuracy, processing speed and memory requirements) with standard memory-based learning, and various alternative optimizations for symbolic memory-based learning such as indexing of the case base with trees, and IGTrees (Daelemans et al. 1997).

Shlomo Argamon, Ido dagan, Yuval Krymolowski. A Memory-Based Approach to Learning Shallow Natural Language Patterns. Proceedings of COLING 1998, Montreal, 1998, 67-73.
Scott Cost and Steven Salzberg. A Weighted Nearest Neighbour Algorithm for Learning with Symbolic Features. Machine Learning 10, 57-78, 1993.
Claire Cardie. A case-based approach to knowledge acquisition for domain-specific sentence analysis. Proceedings of AAAI 1993. Menlo Park, 1993, 798-803.
Walter Daelemans, Jakub Zavrel, Peter Berck, Steven Gillis. MBT: A memory-based part of speech tagger generator. Proceedings fourth workshop on very large corpora. Copenhagen, 1996, 14-27.
Walter Daelemans, Antal van den Bosch, Ton Weijters. IGTree: Using Trees for Compression and Classification in Lazy Learning Algorithms. Artificial Intelligence Review 11, 407--423, 1997.
Walter Daelemans, Antal van den Bosch, Jakub Zavrel. Forgetting exceptions is harmful in language learning. Machine Learning. (forthcoming)