Generic Technology for Information Extraction from Texts
The project deals with the development of generic technologies for extracting
information from texts. The technologies are generic because the algorithms
can be used across different applications and types of texts, are to a considerable
extent language- and domain-independent and are portable to other domains or
languages with a minimum of effort. We have applied machine learning and linguistics-based
techniques to a number of information extraction tasks. The technologies that
we have developed are widely applicable in information retrieval, text searching,
text analysis and language understanding.
We have selected a limited number of challenges based on
their relevance for solving practical problems and for exploring new methods:
Hierarchical topic segmentation:
The problem: How can you automatically detect
the topics and subtopics of a text and structure them into a table of contents?
The technical solutions: A text usually contains
one or a few main topics, a number of subtopics, a number of sub-subtopics,
etc. We have developed a system that automatically builds a table of contents
of the text. It hierarchically segments a text into its topics and subtopics;
describes each segment by its key terms and by its begin and end position
in the text; and uses the hierarchical and sequential relations between
segments to build a topic tree. The system relies on universal linguistic
notions about topic and focus of sentences and thematic progression in texts.
This linguistic information can be modeled both deterministically (i.e.,
by manually building rules) and probabilistically (i.e., by training a classifier).
The system has been tested on English and Dutch texts. The best results are obtained with a probabilistic classifier which uses the Expectation Maximization (EM) algorithm to estimate the interpolation weights of different probability distributions found in the training data.
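The weight estimation mentioned above can be illustrated with a minimal sketch. It assumes that each training event has already been scored by several component distributions and that the classifier combines them as a weighted sum; the function name, the input layout and the toy numbers are illustrative assumptions, not details of the project's system.

```python
def em_interpolation_weights(component_probs, iterations=50):
    """Estimate mixture weights for linearly interpolated distributions.

    component_probs: one row per training event; row[j] is the probability
    that component distribution j assigns to that event (hypothetical input
    layout chosen for this sketch).
    """
    k = len(component_probs[0])
    weights = [1.0 / k] * k  # start from uniform weights
    for _ in range(iterations):
        expected = [0.0] * k
        # E-step: share each event among the components in proportion
        # to how much each one contributes to the interpolated probability.
        for row in component_probs:
            mix = sum(w * p for w, p in zip(weights, row))
            if mix == 0.0:
                continue  # no component explains this event; skip it
            for j in range(k):
                expected[j] += weights[j] * row[j] / mix
        total = sum(expected)
        if total == 0.0:
            break  # nothing to re-estimate from
        # M-step: the new weights are the normalised expected counts.
        weights = [e / total for e in expected]
    return weights


# Toy data: the first distribution explains every event better than the second.
toy_rows = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.1]]
toy_weights = em_interpolation_weights(toy_rows)
print(toy_weights)
```

On this toy input the first weight grows towards 1, because the first distribution is the better predictor for every event. In practice the weights would be estimated on data held out from the data used to build the component distributions.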
Novelty and scientific achievement: Hierarchical
topic segmentation is very novel. The technique is very effective for automatic
summarization, since it allows for zooming in and out on a text.
It has been implemented in our summarization system, SUMMA.
Case role detection:
The problem: How can we learn automatically how
semantic concepts that express events in the real world are expressed in
a clause? How is the linguistic surface structure of a language related
to the functional-semantic deep structure? How can we come to generic forms
of event analysis?
The technical solutions: We have developed a method
for learning how semantic roles are expressed in superficial linguistic
features of a text. By using information about word forms, their syntactic
classes, phrase boundaries, and the position in a text, we are able to extract
from a clause information about events and entities that participate in
them. For instance, our algorithm can say which parts of a clause express
a process of thought, somebody that is thinking, the thing that she thinks
of, and the place where she is thinking. We learn these correspondences
between linguistic surface features and semantic roles from annotated examples
and we have tested a number of classifiers for the task (k-nearest neighbor,
naïve Bayes, maximum entropy, decision tree and support vector machine).
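As an illustration of learning such correspondences from annotated examples, the following minimal sketch trains a naïve Bayes classifier on surface features. The feature names (head, pos, phrase, position), the role labels and the toy training set are assumptions made for this example; they are not the project's annotation scheme or data.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesRoleClassifier:
    """Naive Bayes over categorical surface features, with add-one smoothing."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.feature_values = defaultdict(set)      # feature -> values seen in training

    def fit(self, examples):
        for features, label in examples:
            self.label_counts[label] += 1
            for name, value in features.items():
                self.feature_counts[(label, name)][value] += 1
                self.feature_values[name].add(value)
        return self

    def predict(self, features):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior of the role
            for name, value in features.items():
                seen = self.feature_counts[(label, name)]
                vocab = len(self.feature_values[name]) + 1  # +1 for unseen values
                score += math.log(
                    (seen[value] + self.smoothing) / (count + self.smoothing * vocab)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical annotated constituents from clauses such as
# "She thinks of home in the garden".
TRAIN = [
    ({"head": "she", "pos": "PRON", "phrase": "NP", "position": "before_verb"}, "Senser"),
    ({"head": "thinks", "pos": "VERB", "phrase": "VP", "position": "verb"}, "Process"),
    ({"head": "of", "pos": "ADP", "phrase": "PP", "position": "after_verb"}, "Phenomenon"),
    ({"head": "in", "pos": "ADP", "phrase": "PP", "position": "after_verb"}, "Location"),
    ({"head": "he", "pos": "PRON", "phrase": "NP", "position": "before_verb"}, "Senser"),
    ({"head": "believes", "pos": "VERB", "phrase": "VP", "position": "verb"}, "Process"),
    ({"head": "about", "pos": "ADP", "phrase": "PP", "position": "after_verb"}, "Phenomenon"),
    ({"head": "at", "pos": "ADP", "phrase": "PP", "position": "after_verb"}, "Location"),
]

classifier = NaiveBayesRoleClassifier().fit(TRAIN)
print(classifier.predict(
    {"head": "in", "pos": "ADP", "phrase": "PP", "position": "after_verb"}
))
```

The same feature dictionaries could be fed to any of the other classifier types listed above; naïve Bayes is used here only because it fits in a few lines without external libraries.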
Novelty and scientific achievement: Event analysis
is very important for computers to understand what a text is really about.
Our results are competitive with those of two major research groups in the U.S.A.
(Berkeley and ISI), which are the only other research groups in the world
that are working on this problem at the moment.
Practical use: Our method could be used to develop
a semantic tagger for event analysis. Such a tool would have numerous application
domains: it could be used for question answering, advanced search engines,
machine translation, coreference resolution, and
many other tasks that would benefit from semantic text analysis.
Single- and multi-document summarization:
The problem: How can we automatically summarize
a document or a set of documents by using generic technologies?
The technical solutions: We have developed three
modules:
Module for detection of salient sentences: The
discourse structure of a text is an important indicator of where its main
content is located. We have developed technologies to detect the main content
of a text based on hierarchical topic segmentation.
Our algorithm uses this information to determine which information has to
be included in a summary.
Sentence compression module: This module parses
individual sentences, detects main clauses and dependent clauses, gives
an indication of the semantic relationships between content items, and uses
this information to reduce the content of individual sentences. Sentence
reduction is especially relevant for headline summarization.
Redundancy elimination module: For the summarization
of multiple documents, it is important to detect redundant content. This
module identifies redundant information by using statistical techniques
that classify sentences into clusters based on lexical and syntactic information.
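A minimal sketch of redundancy elimination by clustering is given below. It uses lexical information only (bag-of-words cosine similarity) and a greedy single-pass clustering with a fixed threshold; the function names, the threshold value and the example sentences are assumptions for illustration, and the project's module also draws on syntactic information that is not modelled here.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Lower-cased word counts; a deliberately simple lexical representation."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster_sentences(sentences, threshold=0.6):
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed vector, member sentences)
    for sentence in sentences:
        vector = bag_of_words(sentence)
        for seed, members in clusters:
            if cosine(vector, seed) >= threshold:
                members.append(sentence)
                break
        else:
            clusters.append((vector, [sentence]))
    return [members for _, members in clusters]


def remove_redundant(sentences, threshold=0.6):
    """Keep the first sentence of each cluster as its representative."""
    return [members[0] for members in cluster_sentences(sentences, threshold)]


# Invented example sentences, as if taken from two documents on the same event.
SENTENCES = [
    "The minister resigned on Monday.",
    "On Monday the minister resigned.",
    "Stock markets fell sharply in Asia.",
]
print(remove_redundant(SENTENCES))
```

Here the first two sentences share all their words and fall into one cluster, so only one of them is kept. The threshold controls how aggressively near-duplicates are merged and would need tuning on real data.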
Novelty and scientific achievement: In the automatic
summarization contest at the Document Understanding Conferences of 2002
and 2003, we were ranked consistently among the top teams.
Practical use: Summarization is very useful for
many information presentation and selection tasks, especially when the user
is confronted with large document collections and/or small displays (e.g.,
for mobile phones or PDAs).
|Period || From 2000-10-01 to 2004-12-31.|
| Financed by ||IWT-STWW (Nr. 000135), Roularta Media Group, Language & Computing, ICMS Group, Wolters-Kluwer|
| Supervised by ||Marie-Francine Moens|
| Staff |
|Contact ||Marie-Francine Moens|
- MOENS, M.-F. & DE BUSSER, R. Generic Topic Segmentation of Document Texts. In Proceedings of the 24th ACM SIGIR Annual International Conference on Research and Development in Information Retrieval (pp. 418-419). New York: ACM 2001
- ANGHELUTA, R., DE BUSSER, R. & MOENS, M.-F. The Use of Topic Segmentation for Automatic Summarization. In Proceedings of the ACL-2002 Post-Conference Workshop on Automatic Summarization. 2002
- DE BUSSER, R., ANGHELUTA, R. & MOENS, M.-F. Semantic Case Role Detection for Information Extraction. In COLING 2002 - Proceedings of the Main Conference. New Brunswick: ACL, pp. 1198-1202. 2002
- MOENS, M.-F. & DE BUSSER, R. Information Extraction: Current Technologies and Promising Research Directions. Internal report TR-IE-1, 68 p. 2001
- ANGHELUTA, R., MOENS, M.-F. & DE BUSSER, R. Multi-document Summarization. Technical Report, K.U.Leuven. 2002
- MOENS, M.-F., DE BUSSER, R., HIEMSTRA, D. & KRAAIJ, W. Proceedings of the Third Dutch-Belgian Information Retrieval Workshop. Leuven: ICRI. 2002
- ANGHELUTA, R. & MOENS, M.-F. A Study about Synonym Replacement in News Corpora. In Proceedings of the 3rd Dutch-Belgian Workshop in Information Retrieval. 2002
- MOENS, M.-F., ANGHELUTA, R. & DE BUSSER, R. Summarization of Texts Found on the World Wide Web. In W. ABRAMOWICZ (Ed.), Knowledge-Based Information Retrieval and Filtering from the Web (pp. 101-120) (The Kluwer International Series in Engineering and Computer Science). Boston: Kluwer Academic Publishers. 2003
- DE BUSSER, R., "Report on the 3rd Dutch-Belgian Information Retrieval Workshop (DIR-2002)." BNVKI Newsletter 20 (1), 19-21 and SIGIR Forum 37 (1), 4-6. 2003
- ANGHELUTA, R., MOENS, M.-F. & DE BUSSER, R. The K.U.Leuven Summarization System DUC-2003. In Proceedings of the Document Understanding Conference (DUC-2003). National Institute of Standards and Technology, USA. 2003
- DE BUSSER, R. & MOENS, M.-F. Learning Generic Semantic Roles. Technical Report, 15p. (submitted for publication) 2003
- ANGHELUTA, R., JEUNIAUX, P., MITRA, R. & MOENS, M.-F. Clustering Algorithms for Noun Phrase Coreference Resolution. In Proceedings of 7èmes Journées internationales d'Analyse statistique des Données Textuelles (pp. 60-70). March 10-12, 2004, Louvain La Neuve, Belgium. 2004
- MOENS, M.-F., ANGHELUTA, R., DE BUSSER, R. & JEUNIAUX, P. Summarizing Text at Various Levels of Detail. In Proceedings of RIAO 2004 Coupling Approaches, Coupling Media and Coupling Languages for Information Retrieval (pp. 597-609). Le Centre de Hautes Études Internationales d'Informatique Documentaire. 2004
- ANGHELUTA, R., MITRA, R., JING, X. & MOENS, M.-F. K.U.Leuven Summarization System at DUC-2004. In DUC Workshop Papers and Agenda (pp. 53-60). Boston. 2004
- MOENS, M.-F., ANGHELUTA, R. & DUMORTIER, J. Generic Technologies for Single- and Multi-document Summarization. Information Processing & Management, 41(3), 569-586. 2005
- MOENS, M.-F. Using Patterns of Thematic Progression for Building a Table of Content of a Text. Journal of Natural Language Engineering 12 (3): 1-28. 2006
- MOENS, M.-F. Information Synthesis: A Glance at the Future. In Proceedings of the IJCAI 2005 Workshop on Knowledge and Reasoning for Answering Questions (invited lecture). 2005
- MOENS, M.-F., JEUNIAUX, P., ANGHELUTA, R. & MITRA, R. Measuring Aboutness of an Entity in a Text. In Proceedings of HLT-NAACL 06 TextGraphs: Graph-based Algorithms for Natural Language Processing. East Stroudsburg: ACL. 2006
- MITRA, R., ANGHELUTA, R., JEUNIAUX, P. & MOENS, M.-F. Progressive Fuzzy Clustering for Noun Phrase Coreference Resolution. In
- MEHAY, D., DE BUSSER, R. & MOENS, M.-F. Labeling Generic Semantic Roles. In H. Bunt, J. Geertzen & E. Thyse (Eds.), Proceedings of the Sixth International Workshop on Computational Semantics (IWCS-6) (pp. 175-187). Tilburg, The Netherlands: Tilburg University. 2005
- BIRYUKOV, M., ANGHELUTA, R. & MOENS, M.-F. Multidocument Question Answering Text Summarization Using Topic Signatures. In Proceedings of the DIR-2005 Dutch-Belgian Information Retrieval Workshop. 2005
- BIRYUKOV, M., ANGHELUTA, R. & MOENS, M.-F. Multidocument Question Answering Text Summarization Using Topic Signatures. Journal on Digital Information Management. 2005
- MOENS, M.-F. Automatic Indexing and Abstracting of Document Texts (The Kluwer International Series on Information Retrieval 6). Boston: Kluwer Academic Publishers.
- MOENS, M.-F. & SZPAKOWICZ, S. (Eds.) Text Summarization Branches Out. New Brunswick: Association for Computational Linguistics.
- MOENS, M.-F. Information Extraction: Algorithms and Prospects in a Retrieval Context (The Information Retrieval Series 21). New York: Springer.