In this paper we explore the arbitrary-predictable dimension from a data-analysis and machine learning perspective. Our hypotheses are (i) that the degree to which a category is predictable can be established objectively by quantitative data analysis and by machine learning experiments, and (ii) that memory-based learning can learn arbitrary, predictable, and partly predictable categories without recourse to rule-based or dual-route architectures for the predictable categories. The memory-based learning (MBL) paradigm is based on the assumption that records of past experience, together with a suitably defined measure of similarity, suffice to produce intelligent behaviour. The distinguishing characteristic of MBL is that no intervening abstractions in the form of rules or decision trees are created; instead, the instances themselves are used directly to make generalizations.
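The core idea of MBL described above (store instances, then generalize by similarity) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the setup used in this paper: the class name `MemoryBasedClassifier`, the simple overlap metric, the majority vote over the `k` nearest instances, and the toy symbolic features and labels are all hypothetical.

```python
from collections import Counter


def overlap_distance(a, b):
    """Count the feature positions at which two instances differ.

    Assumption: a plain, unweighted overlap metric over symbolic features.
    """
    return sum(1 for x, y in zip(a, b) if x != y)


class MemoryBasedClassifier:
    """Hypothetical minimal memory-based learner: no rules, no trees."""

    def __init__(self, k=1):
        self.k = k
        self.memory = []  # stored (features, label) pairs, kept verbatim

    def store(self, features, label):
        # "Learning" is just adding the instance to memory.
        self.memory.append((tuple(features), label))

    def classify(self, features):
        # Rank all stored instances by distance to the new instance,
        # then take a majority vote among the k nearest ones.
        ranked = sorted(
            self.memory,
            key=lambda item: overlap_distance(item[0], features),
        )
        votes = Counter(label for _, label in ranked[: self.k])
        return votes.most_common(1)[0][0]


# Toy instances with made-up symbolic features and labels (not CELEX data).
clf = MemoryBasedClassifier(k=1)
clf.store(("p", "a", "n"), "A")
clf.store(("p", "a", "m"), "B")
clf.store(("t", "o", "n"), "A")

nearest_label = clf.classify(("k", "a", "n"))  # closest stored instance: ("p", "a", "n")
exact_label = clf.classify(("p", "a", "m"))    # exact match retrieved from memory
```

Note that an exceptional instance is handled in the same way as a regular one: it is stored verbatim and retrieved whenever it is the nearest neighbour, which is what allows a single mechanism to cover both regular and irregular cases.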
To evaluate hypotheses (i) and (ii), we selected three Dutch nominal lexical categories that span the full arbitrary-predictable dimension: gender (arbitrary), word stress (partly predictable), and diminutive suffix (predictable). From the CELEX lexical database we selected 3000 nouns, together with information about their phonological form and about these three lexical categories. In a first set of experiments we investigated the instance space formed by these three comparable datasets using different quantitative measures of predictability, showing that the position of a category on the arbitrary-predictable dimension can indeed be determined objectively. In a second set of experiments, we show that memory-based learning obtains high generalization accuracy regardless of the predictability of the data. Our results show that memory-based learning offers an architecture for lexical acquisition in which regular and irregular instances of predictable and arbitrary tasks are acquired and represented in a uniform way.