WebInsight: Towards Modeling Correlation and Evolution of Web Documents


The main aim of the WebInsight project is to explore topic and other content models for (cross-lingual) analysis, comparison and summarization of textual content. The research builds on and expands probabilistic models such as probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA) and can be applied in many text mining tasks such as spatio-temporal text mining, cross-collection comparative analysis and cross-lingual mining. The research insights and the developed technologies are evaluated on Chinese and English Web documents.


Tsinghua University, China (Prof. Juanzi Li, Prof. Jie Tang and Prof. Maosong Sun).


We have researched technology for cross-lingual information processing. More specifically, we have shown that the cross-lingual Latent Dirichlet Allocation model can serve as a transfer learning method in cross-lingual clustering and categorization. The partners organized a joint workshop, the 3rd Workshop on Social Web Search and Mining (SWSM2011): Analysis of User Generated Content Under Crisis, held in Beijing, China on July 28, 2011 during the 34th Annual International ACM SIGIR Conference.

An edited book on mining of user generated content is in preparation.

Period: 2009-01-01 to 2011-12-31
Financed by: Bilateral Scientific Collaboration between K.U.Leuven and Tsinghua University (BIL/08/08)
Supervised by: Marie-Francine Moens
Staff: Wim De Smet
Contact: Marie-Francine Moens


  1. DE BELDER, Jan, DE SMET, Wim, MOCHALES PALAU, Raquel & MOENS, Marie-Francine Google Owns YouTube? Entity Relationship Extraction with Minimal Supervision. In Proceedings of SIM 2009 Joint Conference SRL ILP MLG. 2009
  2. DE SMET, Wim & MOENS, Marie-Francine Cross-Language Linking of News Stories on the Web Using Interlingual Topic Models. In Proceedings of the CIKM Workshop on Social Web Search and Mining (SWSM 2009). New York: ACM. 2009
  3. DE SMET, Wim, TANG, Jie & MOENS, Marie-Francine Knowledge Transfer across Multilingual Corpora via Latent Topics. In Proceedings of PAKDD 2011: The 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining (Lecture Notes in Computer Science 6634) (pp. 549-560). Berlin: Springer. 2011
  4. DESCHACHT, Koen, DE BELDER, Jan & MOENS, Marie-Francine The Latent Words Language Model. Computer Speech and Language, 26 (5), 384-409. 2012
  5. XIA, Huan, LI, Juanzi, TANG, Jie & MOENS, Marie-Francine Plink-LDA: Using Link as Prior Information in Topic Modeling. In Proceedings of the 17th International Conference on Database Systems for Advanced Applications (DASFAA 2012) (Lecture Notes in Computer Science 7238) (pp. 213-227). Berlin: Springer. 2012
  6. ZHANG, Jing, TANG, Jie, MA, Cong, TONG, Hanghang, JING, Yu, LI, Juanzi, LUYTEN, W. & MOENS, Marie-Francine Fast and Flexible Top-k Similarity Search on Large Networks. ACM Transactions on Information Systems. 2017

