MUltimodal processing of Spatial and TEmporal expRessions (EU CHIST-ERA)


The MUSTER project is a fundamental pilot research project that introduces a new multimodal framework for the machine-readable representation of meaning. MUSTER focuses on exploiting visual and perceptual input, in the form of images and videos, coupled with the textual modality to build structured multimodal semantic representations for recognizing objects and actions and their spatial and temporal relations. The project will investigate whether such novel multimodal representations improve the performance of automated human language understanding. MUSTER starts from the current state-of-the-art paradigm for learned human language representations, known as text embeddings, but adds the visual modality to provide contextual world knowledge that text-only models lack, even though humans draw on such knowledge when understanding language. MUSTER will propose a new pilot framework for joint representation learning from text and vision data, tailored to spatial and temporal language processing. The constructed framework will be evaluated on a series of human language understanding (HLU) tasks (e.g., semantic textual similarity and disambiguation, spatial role labeling, zero-shot learning, temporal action ordering) that closely mimic the processes of human language acquisition and understanding.

MUSTER relies on recent advances across multiple research disciplines, spanning natural language processing, computer vision, machine learning, representation learning, and human language technologies, to build structured, machine-readable multimodal representations of spatial and temporal language phenomena.


The project is coordinated by Pierre and Marie Curie University, France (Patrick Gallinari). The other partners are ETH Zurich, Switzerland (Luc Van Gool) and the University of the Basque Country, Spain (Aitor Soroa, Eneko Agirre).


We have built novel multimodal representations of objects and their attributes.
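One way such multimodal representations can be built, loosely following the "learning to predict" idea described in the project's publications, is to learn a mapping from text embeddings into visual feature space and concatenate each word's text embedding with its predicted ("imagined") visual vector. The sketch below uses toy random data and hypothetical dimensions (300-d text, 128-d visual) purely for illustration; it is not the project's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for real data (hypothetical dimensions):
# 300-d text embeddings and 128-d visual feature vectors for
# words that have associated images.
n_words, d_text, d_vis = 50, 300, 128
text_emb = rng.normal(size=(n_words, d_text))
vis_feat = rng.normal(size=(n_words, d_vis))

# Learn a linear map W from text space to visual space by
# least squares: vis_feat ~= text_emb @ W.
W, *_ = np.linalg.lstsq(text_emb, vis_feat, rcond=None)

def multimodal_embedding(t):
    """Concatenate a text embedding with its 'imagined' visual vector."""
    imagined = t @ W                       # predicted visual features
    t_n = t / np.linalg.norm(t)            # L2-normalize each modality
    v_n = imagined / np.linalg.norm(imagined)
    return np.concatenate([t_n, v_n])

emb = multimodal_embedding(text_emb[0])
print(emb.shape)  # (428,)
```

Because the mapping is learned, a multimodal embedding can also be produced for words that have no images at all, which is what makes this kind of approach attractive for zero-shot settings.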

Period From 2016-05-01 to 2019-04-30.
Financed by EU CHIST-ERA
Supervised by Marie-Francine Moens
Staff Guillem Collell Talleda
Contact Guillem Collell Talleda

More information can be found on the project website.


  1. COLLELL TALLEDA, Guillem, DO, Quynh, ZHANG, Ted & MOENS, Marie-Francine. Transferring Visual Knowledge into Semantic Roles. In Proceedings of the CHIST-ERA Projects Seminar 2016. 2016.
  2. COLLELL TALLEDA, Guillem & MOENS, Marie-Francine. Is an Image Worth More than a Thousand Words? On the Fine-Grain Semantic Differences between Visual and Linguistic Representations. In Proceedings of the 26th International Conference on Computational Linguistics (COLING 2016). ACL. 2016.
  3. COLLELL TALLEDA, Guillem, ZHANG, Ted & MOENS, Marie-Francine. Learning to Predict: A Fast Re-constructive Method to Generate Multimodal Embeddings. In Proceedings of the NIPS Workshop on Representation Learning in Artificial and Biological Neural Networks (MLINI 2016). 2016.
  4. COLLELL TALLEDA, Guillem, ZHANG, Ted & MOENS, Marie-Francine. Imagined Visual Representations as Multimodal Embeddings. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17). AAAI. 2017.
