TermWise

TermWise: Resources for Specialized Language Use

Aims

There is a growing need for effective multilingual solutions that support business and professional communication. The TermWise Knowledge Platform is a multidisciplinary cooperation with the explicit objective of developing user-oriented research into multilingual language technologies. More specifically, the platform aims to develop software that helps language professionals, such as translators and copywriters, deal more effectively with specialized texts. These texts, e.g., medical or legal documents, are full of domain-specific jargon and terminology. TermWise aims to automatically extract multilingual dictionaries and translation memories from parallel and comparable corpora written in Dutch and French.

The focus of our research is on advancing the state of the art in statistical term alignment, i.e., the automatic alignment of phrases and words that are translational equivalents, given large multilingual document collections. The research covers the study of existing statistical alignment models (e.g., sequence-based and fertility-based models) and the design, development, and evaluation of novel alignment algorithms, including generative, discriminative, and graph models.
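To make the idea of statistical word alignment concrete, here is a minimal sketch of IBM Model 1, a classic generative alignment model trained with expectation-maximization. It is an illustration only and not the TermWise alignment algorithm; the function names and the three-sentence Dutch-French toy corpus are assumptions invented for this example.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=20):
    """Estimate t(target_word | source_word) with EM (IBM Model 1).

    `pairs` is a list of (source_tokens, target_tokens) sentence pairs.
    Returns a dict mapping (target_word, source_word) to a probability.
    """
    # Start with equal weight for every co-occurring word pair;
    # the E-step normalizes, so this acts as a uniform initialization.
    t = {(f, e): 1.0 for src, tgt in pairs for e in src for f in tgt}
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) alignments
        total = defaultdict(float)  # expected count of e aligned to anything
        # E-step: distribute each target word over the source words.
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize the expected counts per source word.
        t = {(f, e): c / total[e] for (f, e), c in count.items()}
    return t

def best_translation(t, source_word):
    """Return the target word with the highest t(target | source_word)."""
    candidates = {f: p for (f, e), p in t.items() if e == source_word}
    return max(candidates, key=candidates.get)

# Hypothetical toy parallel corpus (Dutch source, French target).
corpus = [
    ("het huis".split(), "la maison".split()),
    ("het boek".split(), "le livre".split()),
    ("een boek".split(), "un livre".split()),
]

if __name__ == "__main__":
    table = train_ibm_model1(corpus)
    for word in ("boek", "een"):
        print(word, "->", best_translation(table, word))
```

Model 1 ignores word order and fertility; the sequence-based and fertility-based models mentioned above add those factors on top of the same lexical translation table.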

Partners

The other TermWise partners are the Quantitative Lexicology and Variational Linguistics group of K.U.Leuven (Prof. Dirk Geeraerts and Prof. Dirk Speelman), the ICT De Nayer Campus of Lessius University College (Dr. Herman Crauwels), and the Language and Computer Research Group of the Department of Applied Language Studies at Lessius University College, Antwerp, Belgium (Prof. Frieda Steurs and Dr. Hendrik Kockaert).

Results

We have developed and evaluated a model for term translation based on probabilistic graphical models and trained on comparable corpora, and have applied it to cross-language information retrieval. In addition, we have built several models for detecting highly confident word translations from comparable and parallel corpora.

The technologies that we have developed are integrated in a platform for multilingual terminology extraction developed by all consortium partners.



Period From 2009-11-01 to 2013-12-31.
Financed by IOF- KP/09/001- Flemish government
Supervised by Marie-Francine Moens
Staff Wim De Smet, Ivan Vulic
Contact Ivan Vulic

Publications

  1. VULIC, Ivan, DE SMET, Wim & MOENS, Marie-Francine Identifying Word Translations from Comparable Corpora Using Latent Topic Models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (pp. 479-484). ACL. 2011
  2. VULIC, Ivan, DE SMET, Wim & MOENS, Marie-Francine Cross-Language Information Retrieval with Latent Topic Models Trained on a Comparable Corpus. In Proceedings of the Seventh Asia Information Retrieval Societies Conference (AIRS 2011) (Lecture Notes in Computer Science). 2011
  3. VULIC, Ivan & MOENS, Marie-Francine Detecting Highly Confident Word Translations from Comparable Corpora without Any Prior Knowledge. In Proceedings of the Thirteenth Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012). ACL. 2012
  4. VULIC, Ivan, DE SMET, Wim & MOENS, Marie-Francine Cross-Language Information Retrieval Models Based on Latent Topic Models Trained with Document-Aligned Comparable Corpora. Information Retrieval. DOI 10.1007/s10791-012-9200-5 2012
  5. VULIC, Ivan & MOENS, Marie-Francine Sub-corpora Sampling with an Application to Bilingual Lexicon Extraction. In Proceedings of the 24th International Conference on Computational Linguistics. ACL. 2012
  6. VULIC, Ivan & MOENS, Marie-Francine A Unified Framework for Monolingual and Cross-Lingual Relevance Modeling Based on Probabilistic Topic Models. In Proceedings of the 35th European Conference on Information Retrieval (ECIR 2013) (Lecture Notes in Computer Science 7814) (pp. 98-109). Berlin: Springer. 2013
  7. VULIC, Ivan & MOENS, Marie-Francine Cross-Lingual Semantic Similarity of Words as the Similarity of their Semantic Word Responses. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. ACL. 2013
  8. MOENS, Marie-Francine & VULIC, Ivan Monolingual and Cross-Lingual Probabilistic Topic Models and Their Applications in Information Retrieval. In Proceedings of the 35th European Conference on Information Retrieval (ECIR 2013) (Lecture Notes in Computer Science 7814) (pp. 775-778). Berlin: Springer. 2013
  9. VULIC, Ivan, DE SMET, Wim, TANG, Jie & MOENS, Marie-Francine Probabilistic Topic Modeling in Multilingual Settings: A Short Overview of Its Methodology and Applications. In Proceedings of xLiTe: Cross-Lingual Technologies - NIPS 2012 workshop. 2012
  10. VULIC, Ivan & MOENS, Marie-Francine A Study on Bootstrapping Bilingual Vector Spaces from Non-Parallel Data (and Nothing Else). In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2013). ACL. 2013
  11. HEYLEN, K., BOND, S., DE HERTOG, Dirk, VULIC, Ivan & KOCKAERT, Hendrik TermWise: A CAT-tool with Context-Sensitive Terminological Support. In Proceedings of the 9th Language Resources and Evaluation Conference. 2014
  12. MOENS, Marie-Francine & VULIC, Ivan Multilingual Probabilistic Topic Modeling and Its Applications in Web Mining and Search. In Proceedings of the Seventh ACM International Conference on Web Search and Data Mining (WSDM 2014) (pp. 681-682). New York: ACM. 2014
  13. VULIC, Ivan & MOENS, Marie-Francine Probabilistic Models of Cross-Lingual Semantic Similarity in Context Based on Latent Cross-Lingual Concepts Induced from Comparable Data. In Proceedings of EMNLP 2014: Conference on Empirical Methods in Natural Language Processing. 2014
  14. VULIC, Ivan, DE SMET, Wim, TANG, Jie & MOENS, Marie-Francine Probabilistic Topic Modeling in Multilingual Settings: An Overview of Its Methodology and Applications. Information Processing & Management, 51 (1), 111-147. 2015

