For the most obvious implementation of memory-based algorithms, the approach is computationally too expensive to be used in practice (e.g. for part of speech tagging, millions of training items with tens of features per training item are common). As each new task instance has to be compared with each memory item, straightforward implementations are too slow. For numeric features, efficient optimizations (e.g. kd trees) have been developed which make average nearest neighbour search O(log2 n), n the number of training items. Unfortunately, for language processing tasks representations are mostly symbolic (using values such as words, phonemes, graphemes, linguistic categories etc.), and the performance increase using k-d tree like methods for symbolic features is marginal.
In this paper we propose an algorithm to transform the symbolic representations used in our natural language processing tasks into a numeric representation. On the basis of a modified value difference metric (Cost and Salzberg, 1993), differential similarities between values of the same feature are computed, and these are then used to sort the values of a feature into a numeric order, with intervals between values based on their value difference. The resulting numeric representations are then used in kd-tree optimized memory-based learning. We compare this method empirically (measuring generalization accuracy, processing speed and memory requirements) with standard memory-based learning, and various alternative optimizations for symbolic memory-based learning such as indexing of the case base with trees, and IGTrees (Daelemans et al. 1997).
Shlomo Argamon, Ido dagan, Yuval Krymolowski. A Memory-Based Approach
to Learning Shallow Natural Language Patterns. Proceedings of COLING
1998, Montreal, 1998, 67-73.
Scott Cost and Steven Salzberg. A Weighted Nearest Neighbour Algorithm
for Learning with Symbolic Features. Machine Learning 10, 57-78, 1993.
Claire Cardie. A case-based approach to knowledge acquisition for
domain-specific sentence analysis. Proceedings of AAAI 1993. Menlo
Park, 1993, 798-803.
Walter Daelemans, Jakub Zavrel, Peter Berck, Steven Gillis. MBT: A
memory-based part of speech tagger generator. Proceedings fourth
workshop on very large corpora. Copenhagen, 1996, 14-27.
Walter Daelemans, Antal van den Bosch, Ton Weijters. IGTree: Using
Trees for Compression and Classification in Lazy Learning Algorithms.
Artificial Intelligence Review 11, 407--423, 1997.
Walter Daelemans, Antal van den Bosch, Jakub Zavrel. Forgetting
exceptions is harmful in language learning. Machine
Learning. (forthcoming)