Generic Technology for Information Extraction from Texts

The project deals with the development of generic technologies for extracting information from texts. The technologies are generic because the algorithms can be used across different applications and types of texts, are to a considerable extent language- and domain-independent, and are portable to other domains or languages with a minimum of effort. We have applied machine learning and linguistics-based techniques to a number of information extraction tasks. The technologies that we have developed are widely applicable in information retrieval, text searching, text analysis and language understanding.

We have selected a limited number of challenges on the basis of their relevance for solving practical problems and exploring new methods:


Hierarchical topic segmentation:

  • The problem: How can we automatically detect the topics and subtopics of a text and structure them into a table of contents?
  • The technical solutions: A text usually contains one or a few main topics, a number of subtopics, a number of sub-subtopics, etc. We have developed a system that automatically builds a table of contents of the text. It hierarchically segments a text into its topics and subtopics; describes each segment by its key terms and by its start and end positions in the text; and uses the hierarchical and sequential relations between segments to build a topic tree. The system relies on universal linguistic notions about the topic and focus of sentences and about thematic progression in texts. This linguistic information can be modeled both deterministically (i.e., by manually building rules) and probabilistically (i.e., by training a classifier). The system has been tested on English and Dutch texts. The best results are obtained with a probabilistic classifier which uses the Expectation Maximization (EM) algorithm to estimate the interpolation weights of different probability distributions found in the training data.
  • Novelty and scientific achievement: Hierarchical topic segmentation is very novel. The technique is very effective for automatic text summarization, since it allows for zooming in and out on a text. It has been implemented in our summarization system, SUMMA.
  • Practical use: The algorithm can be integrated in information selection systems and text summarization systems.
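
The use of EM to estimate interpolation weights can be illustrated with a minimal sketch. It is not the project's implementation: the function name, the input layout (one row per held-out event, one column per component distribution) and the toy numbers in the usage note are assumptions made for illustration only.

```python
def em_interpolation_weights(component_probs, iterations=50):
    """Estimate mixture weights for linearly interpolated distributions.

    component_probs: list of rows; row[j] is the probability that
    component distribution j assigns to one held-out event.
    Returns one weight per component; the weights sum to 1.
    """
    k = len(component_probs[0])
    weights = [1.0 / k] * k  # start from uniform weights
    for _ in range(iterations):
        expected = [0.0] * k
        for row in component_probs:
            # Probability of this event under the current mixture.
            mix = sum(w * p for w, p in zip(weights, row))
            if mix == 0.0:
                continue  # no component explains this event
            # E-step: share of the event attributed to each component.
            for j in range(k):
                expected[j] += weights[j] * row[j] / mix
        total = sum(expected)
        if total == 0.0:
            break
        # M-step: renormalize the expected counts into new weights.
        weights = [e / total for e in expected]
    return weights
```

For example, `em_interpolation_weights([[0.9, 0.1], [0.8, 0.2], [0.7, 0.1]])` gives the first component the larger weight, because it assigns a higher probability to every held-out event.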


Entity scoring:

  • The problem: How can we measure how much a text is about a specific entity?
  • The technical solutions:
    • A first step to solve this problem is coreference resolution (i.e., detecting when noun phrases in a text refer to the same real-world entity). Our approach relies on clustering techniques, which use linguistic features to group noun phrases into clusters according to the entity to which they refer and which choose a representative member for each cluster. We have implemented a number of hard and fuzzy clustering algorithms.
    • The 'aboutness' of a text (i.e., to what extent it is about a certain input entity) depends on that entity's coreferents, but also on other entities and processes that are related to it. For instance, if a text is about a person, textual references to his belongings, family members or events in which he has participated add to the aboutness score of the person himself. We have developed a method for learning the entity-relatedness of terms with statistical association techniques from a corpus of biographies and one of non-biographies. Textual references can also be related to the input entity because they exhibit a specific syntactic relationship with it or because they occur in its context. All referencing information that is thus derived from a text can be represented as a graph. We are testing path traversal algorithms for computing the aboutness of a text vis-à-vis the input entity.
  • Novelty and scientific achievement:
    • We have developed a novel clustering method for coreference resolution, which is competitive with existing techniques that have been tested on the corpus of the Message Understanding Conference. Our technique is generic in that it does not rely on any corpus-dependent distance threshold for cluster membership.
    • The development and evaluation of the process of aboutness scoring is still in progress.
  • Practical use: The technologies have a high potential for information retrieval, question answering and text summarization.
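
One way to picture aboutness scoring over such a reference graph is the sketch below. It is only an illustration under stated assumptions, not the method under evaluation in the project: the breadth-first traversal, the `decay` factor, the function name and the toy data are all hypothetical. Each reference contributes its mention count, discounted by its distance from the input entity.

```python
from collections import deque


def aboutness_score(graph, mentions, entity, decay=0.5):
    """Score how much a text is about `entity`.

    graph: dict mapping a node to the set of nodes related to it
           (edges are followed in the direction in which they are stored).
    mentions: dict mapping a node to its number of mentions in the text.
    decay: assumed discount applied per step away from the entity.
    """
    distance = {entity: 0}
    queue = deque([entity])
    # Breadth-first traversal: shortest path length to each reachable node.
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    # Unreachable nodes (unrelated references) contribute nothing.
    return sum(mentions.get(node, 0) * decay ** d
               for node, d in distance.items())


graph = {"Ann": {"Ann's house", "her brother"}, "her brother": {"wedding"}}
mentions = {"Ann": 3, "Ann's house": 1, "her brother": 2,
            "wedding": 4, "weather": 5}
score = aboutness_score(graph, mentions, "Ann")
```

In this toy text, the five mentions of "weather" do not affect the score because nothing links them to the input entity.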


Case role detection:

  • The problem: How can we learn automatically how semantic concepts that express events in the real world are expressed in a clause? How is the linguistic surface structure of a language related to the functional-semantic deep structure? How can we arrive at generic forms of event analysis?
  • The technical solutions: We have developed a method for learning how semantic roles are expressed in surface linguistic features of a text. By using information about word forms, their syntactic classes, phrase boundaries, and their position in a text, we are able to extract from a clause information about events and the entities that participate in them. For instance, our algorithm can say which parts of a clause express a process of thought, somebody who is thinking, the thing that she thinks of, and the place where she is thinking. We learn these correspondences between linguistic surface features and semantic roles from annotated examples and we have tested a number of classifiers for the task (k-nearest neighbor, naïve Bayes, maximum entropy, decision tree and support vector machine).
  • Novelty and scientific achievement: Event analysis is very important for computers to understand what a text is really about. Our results are competitive with those of two major research groups in the U.S.A. (Berkeley and ISI), which are the only other research groups in the world that are working on this problem at the moment.
  • Practical use: Our method could be used to develop a semantic tagger for event analysis. Such a tool would have numerous application domains: it could be used for question answering, advanced search engines, machine translation, coreference resolution, and many other tasks that would benefit from semantic text analysis.
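
The idea of learning semantic roles from surface features can be sketched with a small naïve Bayes classifier, one of the classifier types listed above. This is a toy illustration, not the project's system: the feature names (`pos`, `position`), the role labels and the five training examples are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (feature_dict, role_label) pairs."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for features, label in examples:
        label_counts[label] += 1
        for item in features.items():  # item is a (name, value) pair
            feature_counts[label][item] += 1
            vocabulary.add(item)
    return label_counts, feature_counts, vocabulary


def predict_role(model, features):
    """Return the most probable role label, using add-one smoothing."""
    label_counts, feature_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior of the role
        denom = sum(feature_counts[label].values()) + len(vocabulary)
        for item in features.items():
            score += math.log((feature_counts[label][item] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical annotated phrases from clauses that express thinking.
training = [
    ({"pos": "PRON", "position": "before_verb"}, "thinker"),
    ({"pos": "NOUN", "position": "before_verb"}, "thinker"),
    ({"pos": "NOUN", "position": "after_verb"}, "thought_of"),
    ({"pos": "PRON", "position": "after_verb"}, "thought_of"),
    ({"pos": "PREP_PHRASE", "position": "after_verb"}, "place"),
]
model = train_naive_bayes(training)
role = predict_role(model, {"pos": "NOUN", "position": "before_verb"})
```

With this toy data, a noun phrase that precedes the verb is labeled `thinker`. A realistic feature set would also cover word forms and phrase boundaries, as described above.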


Single- and multi-document summarization:

  • The problem: How can we automatically summarize a document or a set of documents by using generic technologies?
  • The technical solutions: We have developed three modules:
    • Module for detection of salient sentences: The discourse structure of a text is an important indicator of where its main content is located. We have developed technologies to detect the main content of a text based on hierarchical topic segmentation. Our algorithm uses this information to determine which information has to be included in a summary.
    • Sentence compression module: This module parses individual sentences, detects main clauses and dependent clauses, gives an indication of the semantic relationships between content items, and uses this information to reduce the content of individual sentences. Sentence reduction is especially relevant for headline summarization.
    • Redundancy elimination module: For the summarization of multiple documents, it is important to detect redundant content. This module identifies redundant information by using statistical techniques that classify sentences into clusters based on lexical and syntactic information.
  • Novelty and scientific achievement: In the automatic summarization contest at the Document Understanding Conferences of 2002 and 2003, we were ranked consistently among the top teams.
  • Practical use: Summarization is very useful for many information presentation and selection tasks, especially when the user is confronted with large document collections and/or small displays (e.g., for mobile phones or PDAs).
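
The redundancy elimination step can be illustrated with a minimal sketch that clusters sentences by lexical overlap. It is not the project's module: it uses only bag-of-words cosine similarity (no syntactic information), a greedy single-pass clustering, and a similarity `threshold` of 0.6 that is an arbitrary assumption.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Lowercased word counts of a sentence."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster_sentences(sentences, threshold=0.6):
    """Greedily group sentence indices whose similarity to the first
    member of an existing cluster reaches the threshold."""
    vectors = [bag_of_words(s) for s in sentences]
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no similar cluster: start a new one
    return clusters


sentences = [
    "The minister resigned on Monday.",
    "On Monday the minister resigned.",
    "Heavy rain flooded the city centre.",
]
clusters = cluster_sentences(sentences)
```

Here the first two sentences fall into one cluster and the third forms its own. A summarizer could then keep one representative sentence per cluster.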

Period From 2000-10-01 to 2004-12-31.
Financed by IWT-STWW (Nr. 000135), Roularta Media Group, Language & Computing, ICMS Group, Wolters-Kluwer
Supervised by Marie-Francine Moens
Contact Marie-Francine Moens


  1. MOENS, M.-F. & DE BUSSER, R. Generic Topic Segmentation of Document Texts. In Proceedings of the 24th ACM SIGIR Annual International Conference on Research and Development in Information Retrieval (pp. 418-419). New York: ACM 2001
  2. ANGHELUTA, R., DE BUSSER, R. & MOENS, M.-F. The Use of Topic Segmentation for Automatic Summarization. In Proceedings of the ACL-2002 Post-Conference Workshop on Automatic Summarization. 2002
  3. DE BUSSER, R., ANGHELUTA, R. & MOENS, M.-F. Semantic Case Role Detection for Information Extraction. In COLING 2002 - Proceedings of the Main Conference. New Brunswick: ACL, pp. 1198-1202. 2002
  4. MOENS, M.-F. & DE BUSSER, R. Information Extraction: Current Technologies and Promising Research Directions. Internal report TR-IE-1, 68 p. 2001
  5. ANGHELUTA, R., MOENS, M.-F. & DE BUSSER, R. Multi-document Summarization. Technical Report, K.U.Leuven. 2002
  6. MOENS, M.-F., DE BUSSER, R., HIEMSTRA, D. & KRAAIJ, W. Proceedings of the Third Dutch-Belgian Information Retrieval Workshop. Leuven: ICRI. 2002
  7. ANGHELUTA, R. & MOENS, M.-F. A Study about Synonym Replacement in News Corpora. In Proceedings of the 3rd Dutch-Belgian Workshop in Information Retrieval. 2002
  8. MOENS, M.-F., ANGHELUTA, R. & DE BUSSER, R. Summarization of Texts Found on the World Wide Web. In W. ABRAMOWICZ (Ed.), Knowledge-Based Information Retrieval and Filtering from the Web (pp. 101-120) (The Kluwer International Series in Engineering and Computer Science). Boston: Kluwer Academic Publishers. 2003
  9. DE BUSSER, R., "Report on the 3rd Dutch-Belgian Information Retrieval Workshop (DIR-2002)." BNVKI Newsletter 20 (1), 19-21 and SIGIR Forum 37 (1), 4-6. 2003
  10. ANGHELUTA, R., MOENS, M.-F. & DE BUSSER, R. The K.U.Leuven Summarization System DUC-2003. In Proceedings of the Document Understanding Conference (DUC-2003). National Institute of Standards and Technology, USA. 2003
  11. DE BUSSER, R. & MOENS, M.-F. Learning Generic Semantic Roles. Technical Report, 15p. (submitted for publication) 2003
  12. ANGHELUTA, R., JEUNIAUX, P., MITRA, R. & MOENS, M.-F. Clustering Algorithms for Noun Phrase Coreference Resolution. In Proceedings of 7èmes Journées internationales d'Analyse statistique des Données Textuelles (pp. 60-70). March 10-12, 2004, Louvain-la-Neuve, Belgium. 2004
  13. MOENS, M.-F., ANGHELUTA, R. & DUMORTIER, J. Generic Technologies for Single- and Multi-document Summarization. Information Processing & Management, 2005 (forthcoming). 2000
  14. MOENS, M.-F., ANGHELUTA, R., DE BUSSER, R. & JEUNIAUX, P. Summarizing Text at Various Levels of Detail. In Proceedings of RIAO 2004 Coupling Approaches, Coupling Media and Coupling Languages for Information Retrieval (pp. 597-609). Le Centre de Hautes Études Internationales d'Informatique Documentaire. 2004
  15. ANGHELUTA, R., MITRA, R., JING, X. & MOENS, M.-F. K.U.Leuven Summarization System at DUC-2004. In DUC Workshop Papers and Agenda (pp. 53-60). Boston. 2004
  16. MOENS, M.-F., ANGHELUTA, R. & DUMORTIER, J. Generic Technologies for Single- and Multi-document Summarization. Information Processing & Management, 41(3), 569-586. 2005
  17. MOENS, M.-F. Using Patterns of Thematic Progression for Building a Table of Content of a Text. Journal of Natural Language Engineering 12 (3): 1-28. 2006
  18. MOENS, M.-F. Information Synthesis: A Glance at the Future. In Proceedings of the IJCAI 2005 Workshop on Knowledge and Reasoning for Answering Questions (invited lecture). 2005
  19. MOENS, M.-F., JEUNIAUX, P., ANGHELUTA, R. & MITRA, R. Measuring Aboutness of an Entity in a Text. In Proceedings of HLT-NAACL 06 TextGraphs: Graph-based Algorithms for Natural Language Processing. East Stroudsburg: ACL. 2006
  20. MITRA, R., ANGHELUTA, R., JEUNIAUX, P. & MOENS, M.-F. Progressive Fuzzy Clustering for Noun Phrase Coreference Resolution. In
  21. MEHAY, D., DE BUSSER, R. & MOENS, M.-F. Labeling Generic Semantic Roles. In H. Bunt, J. Geertzen & E. Thyse (Eds.), Proceedings of the Sixth International Workshop on Computational Semantics (IWCS-6) (pp. 175-187). Tilburg, The Netherlands: Tilburg University. 2005
  22. BIRYUKOV, M., ANGHELUTA, R. & MOENS, M.-F. Multidocument Question Answering Text Summarization Using Topic Signatures. In Proceedings of the DIR-2005 Dutch-Belgian Information Retrieval Workshop. 2005
  23. BIRYUKOV, M., ANGHELUTA, R. & MOENS, M.-F. Multidocument Question Answering Text Summarization Using Topic Signatures. Journal on Digital Information Management. 2005
  24. MOENS, M.-F. Automatic Indexing and Abstracting of Document Texts (The Kluwer International Series on Information Retrieval 6). Boston: Kluwer Academic Publishers. 2000
  25. MOENS, M.-F. & SZPAKOWICZ, S. (Eds.) Text Summarization Branches Out. New Brunswick: Association for Computational Linguistics. 2004
  26. MOENS, M.-F. Information Extraction: Algorithms and Prospects in a Retrieval Context (The Information Retrieval Series 21). New York: Springer. 2006
