Extraction of Folksonomies from Noisy Texts
Wim De Smet, Marie-Francine Moens
K.U.Leuven - ICRI-LIIR
Folksonomies are a rising phenomenon on the Internet. These collections of
linked documents in different media formats (e.g. texts, photos, ...) are
gaining popularity because they are created by the members of a community
rather than by a system authority. The documents are often linked by the
users of the folksonomy themselves, for instance by adding metadata such as
tags.
We built a system for the automatic creation of a text-based
folksonomy, meant to be used in a geographically defined community
(e.g. the inhabitants of a city). This poses two main problems: first, the
automatic linking of texts by topic, and second, the co-occurrence of
standard language and a community-related dialect, written as a
transcription of the dialect spoken in the community. Dialect words in
particular lower the recall of a linking algorithm and should therefore be
corrected to standard words. We solve the linking problem by using the topic-biased
ranking algorithm, Finegrained PageRank, proposed by Xu and Ma in
2006. This method creates a hierarchical clustering of topics found in the
documents, and extracts a model of a "focused random surfer" who browses
documents that are topically related. The problem of correcting dialect
words (in our case a Flemish dialect) is addressed by a nearest-neighbor
search over a dynamic set of known words. The search uses a set of
transition rules from dialect to standard words, which are learned from a
parallel corpus of dialect and standard words.
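
The dialect-correction step can be illustrated with a minimal sketch. It
assumes the nearest neighbor search is an edit distance in which a learned
transition rule is cheaper than an ordinary character edit; the abstract does
not specify the actual distance measure. The names RULES, LEXICON,
rule_distance and normalize, the rule costs, and the cost threshold are all
hypothetical, and the two rules are invented toy examples rather than rules
learned from the parallel corpus.

```python
from typing import Dict, Set, Tuple

# Hypothetical transition rules: (dialect substring, standard substring) -> cost.
# Invented for illustration; real rules would be learned from a parallel corpus.
RULES: Dict[Tuple[str, str], float] = {
    ("u", "ui"): 0.2,
    ("oe", "oed"): 0.2,
}

# The "dynamic set of known words": a plain set that can grow over time.
LEXICON: Set[str] = {"huis", "muis", "goed"}


def rule_distance(dialect: str, standard: str,
                  rules: Dict[Tuple[str, str], float]) -> float:
    """Edit distance in which applying a transition rule costs less than
    a plain insertion, deletion or substitution (each of cost 1)."""
    n, m = len(dialect), len(standard)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = float("inf")
            if i > 0:                       # delete a dialect character
                best = min(best, d[i - 1][j] + 1.0)
            if j > 0:                       # insert a standard character
                best = min(best, d[i][j - 1] + 1.0)
            if i > 0 and j > 0:             # match or substitute
                sub = 0.0 if dialect[i - 1] == standard[j - 1] else 1.0
                best = min(best, d[i - 1][j - 1] + sub)
            for (src, dst), cost in rules.items():
                ls, ld = len(src), len(dst)
                # Rules with an empty source are skipped in this sketch.
                if ls == 0 or i < ls or j < ld:
                    continue
                if dialect[i - ls:i] == src and standard[j - ld:j] == dst:
                    best = min(best, d[i - ls][j - ld] + cost)
            d[i][j] = best
    return d[n][m]


def normalize(word: str, lexicon: Set[str],
              rules: Dict[Tuple[str, str], float],
              max_cost: float = 1.0) -> str:
    """Return the nearest known word, or the input unchanged when no
    known word lies within max_cost (an assumed threshold)."""
    if word in lexicon:
        return word
    # Sort the candidates so that ties are broken deterministically.
    best = min(sorted(lexicon), key=lambda w: rule_distance(word, w, rules))
    return best if rule_distance(word, best, rules) <= max_cost else word


for token in ["hus", "goe", "muis"]:
    print(token, "->", normalize(token, LEXICON, RULES))
```

A linear scan over the lexicon is used here only for brevity; for a large or
frequently updated word set, an index over candidate words would be the more
realistic choice.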