Lexicon extraction from bilingual comparable corpora
Parallel corpora are an expensive resource to come by for Machine Translation systems. Since it was shown that patterns of word co-occurrence are preserved even across unrelated texts in different languages, non-parallel texts have been adopted in these systems as an alternative to parallel corpora. Comparable corpora are a specific type of non-parallel text with a high level of comparability: the texts cover the same subject, span a similar time window, and are of similar size. This type of corpus is preferred over parallel corpora not only because it is abundant, but also because it is easily accessible via the web. The objective of this work is to build a bilingual lexicon from a source language to a target language using comparable corpora. For that purpose, the system is composed of two modules. The first is responsible for detecting cognate words using different approaches: verbatim detection, rule-based detection, non-rule-based detection, and sound-based detection. The potential equivalents collected are then extracted using similarity measures. The second module exploits a characteristic of comparable texts, namely context preservation between words across the corpora: the context of a given word in the source language tends to be similar to the context of its translation in the target language. For each word, co-occurrences of context words are counted and stored in context vectors, and each source vector is then compared with all target vectors using similarity measures. Combined, these modules may form an efficient platform for automatically finding translation equivalents between two languages when creating a bilingual lexicon.
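The two modules described in the abstract can be illustrated with a minimal Python sketch. It is not the system presented in the talk. Several pieces are assumptions made for illustration only: the string-similarity measure (the ratio from Python's difflib) stands in for the unspecified cognate measures, cosine similarity stands in for the unspecified vector similarity measure, and the seed dictionary used to map source context words into the target vocabulary is hypothetical, since the abstract does not say how source and target vectors are made comparable. All function names and the toy English/Portuguese sentences are likewise invented.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher
from math import sqrt


def cognate_score(a, b):
    """String similarity in [0, 1]; an assumed stand-in for a cognate measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def context_vectors(sentences, window=2):
    """For each word, count the words that occur within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def project(vector, seed):
    """Map source context words to target words via a (hypothetical) seed dictionary."""
    projected = Counter()
    for word, count in vector.items():
        if word in seed:
            projected[seed[word]] += count
    return projected


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def rank_candidates(word, src_vecs, tgt_vecs, seed):
    """Compare one source context vector with all target vectors, best first."""
    projected = project(src_vecs[word], seed)
    scores = {t: cosine(projected, vec) for t, vec in tgt_vecs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy comparable "corpora" (invented data, already tokenized).
source = [["the", "doctor", "treats", "the", "patient"],
          ["the", "doctor", "visits", "the", "hospital"]]
target = [["o", "médico", "trata", "o", "paciente"],
          ["o", "médico", "visita", "o", "hospital"]]
seed = {"the": "o", "treats": "trata", "visits": "visita"}

ranked = rank_candidates("doctor", context_vectors(source),
                         context_vectors(target), seed)
print(ranked[0][0])                          # best context-based candidate
print(cognate_score("patient", "paciente"))  # string-based cognate evidence
```

In a setup like this, the cognate score and the context score could be combined into a single ranking, but how the actual system weighs the two modules is not stated in the abstract.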
Date: 2010-Feb-08 Time: 16:00:00 Room: 336