CCL Centre for Computational Linguistics K.U.Leuven

CLIN 17 - Program

Extraction of Folksonomies from Noisy Texts

Wim De Smet, Marie-Francine Moens

K.U.Leuven - ICRI-LIIR

Folksonomies are a rising phenomenon on the Internet. These collections of linked documents in different media formats (e.g. texts, photos) are gaining popularity because they are created by the members of a community rather than by a system authority. The documents are often linked by the users of the folksonomy themselves, for instance by adding metadata such as tags.

We built a system that automatically creates a text-based folksonomy, meant to be used in a geographically defined community (e.g. the inhabitants of a city). This poses two main problems: first, texts must be linked automatically by topic; second, the texts mix standard language with a community-related dialect, written as a transcription of the dialect spoken in the community's environment. Dialect words in particular lower the recall of a linking algorithm and should therefore be corrected to standard words. We solve the linking problem with the topic-biased ranking algorithm Finegrained PageRank, proposed by Xu and Ma in 2006. This method creates a hierarchical clustering of the topics found in the documents and extracts a model of a "focused random surfer" who browses topically related documents. We correct dialect words (in our case, a Flemish dialect) by performing a nearest neighbor search over a dynamic set of known words, guided by a set of transition rules from dialect to standard words that are learned from a parallel corpus of dialect and standard words.
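
The two components described above can be illustrated with a minimal sketch. It is not the system's implementation: the ranking function is a generic topic-biased (personalized) PageRank, not Xu and Ma's Finegrained PageRank, and it omits the hierarchical topic clustering. The dialect step assumes that transition rules are plain substring rewrites and that nearness is measured by Levenshtein distance. All function names, the toy lexicon and the single rule ("sk" to "sch") are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def topic_biased_pagerank(links: Dict[str, List[str]],
                          topic_weight: Dict[str, float],
                          damping: float = 0.85,
                          iterations: int = 50) -> Dict[str, float]:
    """Power iteration for a surfer who teleports only to on-topic pages.

    Assumes every link target is also a key of `links` and that at least
    one page has a positive topic weight.
    """
    pages = sorted(links)
    total = sum(topic_weight.get(p, 0.0) for p in pages)
    teleport = {p: topic_weight.get(p, 0.0) / total for p in pages}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) * teleport[p] for p in pages}
        for page in pages:
            targets = links[page]
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: hand its rank back via the topic distribution.
                for p in pages:
                    new_rank[p] += damping * rank[page] * teleport[p]
        rank = new_rank
    return rank


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def normalize_dialect(word: str,
                      lexicon: Iterable[str],
                      rules: List[Tuple[str, str]]) -> str:
    """Return the known word nearest to `word` or to any rule-rewritten variant."""
    variants = {word}
    for dialect_part, standard_part in rules:
        variants |= {v.replace(dialect_part, standard_part) for v in list(variants)}
    # sorted() makes tie-breaking deterministic.
    return min(sorted(lexicon),
               key=lambda known: min(edit_distance(v, known) for v in variants))


if __name__ == "__main__":
    toy_links = {"a": ["b"], "b": ["a"], "c": ["a"]}
    print(topic_biased_pagerank(toy_links, {"a": 1.0, "b": 1.0}))
    toy_lexicon = {"school", "schaal", "stoel"}
    print(normalize_dialect("skole", toy_lexicon, [("sk", "sch")]))
```

In this toy setup the rewrite rule changes which known word is nearest to the input, which is the intuition behind learning such rules from a parallel corpus. Because the lexicon is passed in on each call, it can grow over time, in the spirit of the "dynamic set of known words".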

  

   
K.U.Leuven - CWIS  Copyright © Katholieke Universiteit Leuven | Comments on the content: Vincent Vandeghinste
Produced by: Vincent Vandeghinste | Last modified: 20 November 2006 | Disclaimer
URL: http://www.ccl.kuleuven.be/CLIN17/ie4.php