Lexicon extraction from bilingual comparable corpora
Parallel corpora are an expensive resource to come by in machine translation systems. Since it was shown that patterns of word co-occurrence are preserved even across unrelated texts in different languages, non-parallel texts have come to be used in these systems in place of parallel corpora. Comparable corpora are a specific type of non-parallel text with a high level of comparability: they cover the same subject, span a similar time window and are of similar size. This type of corpus is preferred over parallel corpora not only because it is abundant, but also because it is easily accessible via the web. The objective of this work is to build a bilingual lexicon from a source language to a target language using comparable corpora. For that purpose, the system is composed of two modules. The first is responsible for detecting cognate words using different approaches, such as verbatim detection, rule-based detection, non-rule-based detection and sound-based detection; the potential equivalents collected are extracted using similarity measures. The second module exploits a characteristic of comparable texts, namely context preservation between words across the corpora: the context of a given word in the source language tends to be similar to the context of its translation in the target language. For each word, co-occurrences of context words are counted and stored in a context vector, and each source vector is then compared with all target vectors using similarity measures. Combined, these modules may form an efficient platform for automatically finding translation equivalents between two languages in the creation of a bilingual lexicon.
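The two modules described in the abstract can be illustrated with a minimal Python sketch. It is not the system presented in the talk, and several choices here are assumptions. The abstract does not name its similarity measures, so `difflib.SequenceMatcher` stands in for the cognate score and cosine similarity for the vector comparison. The seed lexicon used to map source context words into the target vocabulary is a common device in this line of work but is not mentioned in the abstract. All function names and the toy Portuguese/English data are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt


def cognate_score(source_word, target_word):
    """Orthographic similarity in [0, 1]; 1.0 is a verbatim match.

    Assumption: a plain string-similarity ratio stands in for the
    rule-based and sound-based detectors named in the abstract.
    """
    return SequenceMatcher(None, source_word.lower(), target_word.lower()).ratio()


def context_vectors(sentences, window=2):
    """For each word, count the words seen within `window` positions of it."""
    vectors = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            neighbours = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(neighbours)
    return vectors


def project(vector, seed_lexicon):
    """Map a source context vector into the target vocabulary.

    Assumption: a small seed lexicon is available; context words that
    are missing from it are dropped.
    """
    projected = Counter()
    for word, count in vector.items():
        if word in seed_lexicon:
            projected[seed_lexicon[word]] += count
    return projected


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def rank_candidates(source_word, source_vectors, target_vectors, seed_lexicon):
    """Return target words ordered from most to least similar context."""
    projected = project(source_vectors[source_word], seed_lexicon)
    scored = [(cosine(projected, vec), word) for word, vec in target_vectors.items()]
    return [word for _, word in sorted(scored, reverse=True)]


# Toy comparable corpora (hypothetical data, already tokenised).
source_corpus = [["o", "gato", "bebe", "leite"], ["o", "menino", "bebe", "sumo"]]
target_corpus = [["the", "cat", "drinks", "milk"], ["the", "boy", "drinks", "juice"]]
seed = {"o": "the", "bebe": "drinks", "leite": "milk", "sumo": "juice"}

source_vecs = context_vectors(source_corpus)
target_vecs = context_vectors(target_corpus)
ranking = rank_candidates("gato", source_vecs, target_vecs, seed)
```

On this toy data the context of "gato" projects onto exactly the context of "cat", so "cat" is ranked first. In a combined system the cognate score could be used to add candidate pairs or to re-rank the context-based candidates; how the two modules are actually combined is not specified in the abstract.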
Date: 2010-Feb-08 Time: 16:00:00 Room: 336
Workshop “Metabolism and mathematical models: Two for a tango” – 2nd Edition
Title: Workshop Metabolism and mathematical models: Two for a tango – 2nd Edition
Dates: October 25-26, 2022
Location: This workshop will be held virtually
The topic of this workshop is metabolism in general, with a special focus, although not exclusive, on parasitology. Besides an exploration of the biological, biochemical and biomedical aspects, the workshop will also aim at presenting some of the mathematical modelling, algorithmic theory and software development that have become crucial to explore such aspects.
This workshop is being organised in the context of two projects, both with the Inria European Team Erable. One of the projects involves a partnership with the University of São Paulo (USP), in São Paulo, Brazil, more specifically the Institute of Mathematics and Statistics (IME) and the Institute of Biomedical Sciences – Inria Associated Team Capoeira – and the other involves the Inesc-ID/IST in Portugal, ETH in Zürich and EMBL in Heidelberg – H2020 Twinning Project Olissipo.
The workshop is open to all members of these two projects but also, importantly, to the community in general.
The program and more details are available here.