Supervised Topical Key Phrase Extraction of News Stories using Crowdsourcing, Light Filtering and Co-reference Normalization
Luís Marujo,
INESC-ID Lisboa and IST –
Abstract:
Fast and effective automated indexing is a critical problem for personalized online news aggregation systems, such as News360, Google News, and Yahoo! News. Key phrases, which consist of one or more words and represent the main concepts of a document, are often used for indexing. The accuracy of current state-of-the-art automated key-phrase extraction (AKE) systems is in the 30-50% range, which makes improving AKE an urgent problem. In this work, we followed a fairly traditional approach: training a classifier to select an ordered list of the most likely key-phrase candidates in a given document. We augmented the process with new features, such as signal words and Freebase categories. We also experimented with two forms of document pre-processing, which we call light filtering and co-reference normalization. Light filtering removes sentences that are judged peripheral to the document's main content. Co-reference normalization unifies several written forms of the same named entity into a single form. Finally, we used Amazon's Mechanical Turk (MTurk) service to label documents for training and testing.
Date: 2012-Jan-06 Time: 15:00:00 Room: 336
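As a rough illustration of the co-reference normalization step mentioned in the abstract, the sketch below rewrites every written form of an entity to a single form. The abstract does not say how co-referring mentions are detected, so everything here is an assumption: the alias groups are supplied by hand (a real system would obtain them from a co-reference resolver), the longest alias is arbitrarily chosen as the canonical form, and the name `normalize_mentions` is hypothetical.

```python
import re


def normalize_mentions(text, alias_groups):
    """Rewrite every alias in each group to that group's canonical form.

    Assumption: alias groups are given by hand, and the longest alias in a
    group is treated as the canonical form.
    """
    canonical_of = {}
    for group in alias_groups:
        canonical = max(group, key=len)
        for alias in group:
            canonical_of[alias] = canonical
    if not canonical_of:
        return text

    # Try longer aliases first so that "Barack Obama" is matched as a whole
    # and never rewritten a second time through its substring "Obama".
    ordered = sorted(canonical_of, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(a) for a in ordered) + r")\b")

    # A single pass over the text: each match is replaced by its canonical form.
    return pattern.sub(lambda m: canonical_of[m.group(0)], text)


if __name__ == "__main__":
    groups = [["Barack Obama", "Mr. Obama", "Obama"]]
    story = "Barack Obama spoke on Monday. Mr. Obama said the plan was ready. Obama left."
    print(normalize_mentions(story, groups))
```

After normalization, all mentions of the entity share one surface form, so a candidate key phrase for that entity accumulates a single frequency count instead of being split across variants.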