International Workshop on Knowledge Discovery from (Big) Text: Challenges and Opportunities when Mining Biomedical Text

Date: Monday May 18, 2015

Venue: Huis van Chièvres, Groot Begijnhof, Leuven

Program Schedule: click here

Registration: Registration is now closed due to the overwhelming success of the workshop. Please contact the organizers if you have additional questions.

Download presentation slides here

Keynote speakers

Prof. Dr. Jie Tang,
Tsinghua University, China
Title: Incorporating Social Context and Domain Knowledge for Entity Recognition
Abstract: Recognizing entity instances in documents according to a knowledge base is a fundamental problem in many data mining applications. The problem is extremely challenging for short documents in complex domains such as social media and biomedical domains. Large concept spaces and instance ambiguity are key issues that need to be addressed. Most documents are created in a social context by authors who interact socially, for example through replies and citations. Such social contexts are largely ignored in the instance-recognition literature. How can users' interactions help entity instance recognition? How can the social context be modeled so as to resolve the ambiguity of different instances? In this talk, I will present a new probabilistic model named SOCINST that addresses this problem. Given a set of short documents (e.g., tweets or paper abstracts) posted by users who may connect with each other, SOCINST can automatically construct a context of subtopics for each instance, with each subtopic representing one possible meaning of the instance. The model is also able to incorporate social relationships between users to help build social context. We further incorporate domain knowledge into the model using a Dirichlet tree distribution. We evaluate the proposed model on three different genres of datasets: ICDM'12 Contest, Weibo, and I2B2. In ICDM'12 Contest, the proposed model clearly outperforms (+21.4%) all the top contestants. In Weibo and I2B2, our results also show that the recognition accuracy of SOCINST is 5.3-26.6% better than that of several alternative methods.

Bio: Jie Tang is an associate professor in the Department of Computer Science and Technology, Tsinghua University. His interests include social network analysis, data mining, and machine learning. He has published more than 100 journal/conference papers and holds 10 patents. He served as PC Co-Chair of WSDM'15, ASONAM'15, ADMA'11, SocInfo'12, KDD-CUP Co-Chair of KDD'15, Poster Co-Chair of KDD'14, Workshop Co-Chair of KDD'13, Local Chair of KDD'12, Publication Co-Chair of KDD'11, and as a PC member of more than 50 international conferences. He is the principal investigator of National High-tech R&D Program (863), NSFC project, Chinese Young Faculty Research Funding, National 985 funding, and international collaborative projects with Minnesota University, IBM, Google, Nokia, Sogou, etc. He leads the project for academic social network analysis and mining, which has attracted millions of independent IP accesses from 220 countries/regions in the world. He was honored with the Newton Advanced Scholarship Award, CCF Young Scientist Award, NSFC Excellent Young Scholar, and IBM Innovation Faculty Award.
Prof. Dr. Juanzi Li,
Tsinghua University, China
Title: Cross-lingual Knowledge Building
Abstract: As the Web evolves into a highly globalized information space, sharing knowledge across different languages is attracting increasing attention. Multilingual knowledge bases, in which cross-lingual equivalent concepts or relationships are linked together, play an important role in multilingual knowledge sharing. They are important sources for harvesting cross-lingual knowledge from the Web and have significant applications such as multilingual information retrieval, machine translation and deep question answering. This talk will introduce our research on key technologies for multilingual knowledge graphs, including cross-lingual knowledge linking, cross-lingual knowledge extraction and cross-lingual ontology building.

Bio: Prof. Dr. Juanzi Li is a full professor at Tsinghua University. She obtained her PhD degree from Tsinghua University in 2000. Her main research interest is semantic technologies that combine key techniques from Natural Language Processing, the Semantic Web and Data Mining. She is the vice director of the Chinese Information Processing Society of the Chinese Computer Federation in China. She is the principal investigator of many key projects supported by the Natural Science Foundation of China (NSFC), the national basic science research program and international cooperation projects. She has published over 90 papers in many international journals and conferences such as TKDE, SIGIR, SIGMOD, SIGKDD, IJCAI.
Prof. Dr. Lorraine Goeuriot,
Université Joseph Fourier, France
Title: Medical Information Retrieval and its Evaluation: an Overview of CLEF eHealth Evaluation Task
Abstract: Searching for health advice is a common and important task performed by individuals on the Web. Nearly 70% of search engine users in the US have conducted a Web search for information about a specific disease or health problem (Pew Research Center, Feb. 2011). While health IR is often considered a domain-specific task, it is performed by a large variety of users, including various healthcare workers, but also, and increasingly commonly, by laypeople. This variety of potential information seekers, each characterized by different health knowledge, implies a broad range of information needs, and consequently a requirement for retrieval systems able to satisfy the health information needs of different categories of users. In this talk, I will introduce the main challenges of Information Retrieval in the medical domain. Then, I will focus on patient-centered challenges and present the CLEF eHealth IR task, an evaluation challenge organized at the CLEF conference. The aim of this task is to help patients find relevant health information online. I will describe the evaluation challenge, the dataset, and the participation. An analysis of the participants' results from various perspectives will be given, to further guide research in this domain.

Bio: Lorraine Goeuriot is an assistant professor at Université Grenoble Alpes, France. She obtained her Master in computer science and PhD in computational linguistics on medical data at the University of Nantes, France. She also worked as a post-doctoral researcher at Nanyang Technological University, Singapore, on medical opinion mining, and at Dublin City University, Ireland, on medical Information Retrieval. She is co-chair of the CLEF eHealth 2014 and 2015 evaluation lab, and co-led the information retrieval task in 2013 and 2014. She co-organized a workshop on Medical IR at SIGIR 2014. She was publication co-chair for SIGIR 2013 and COLING 2014. She has been involved in two national French research projects and in the EU project Khresmoi, on medical information access. She reviews papers for several medical informatics conferences and journals.
Prof. Dr. Pierre Zweigenbaum,
Title: Objective Linked Knowledge
Abstract: We are witnessing a growing contribution of formal representations of data and knowledge, conveyed by published scientific papers, to the linked open data initiative. This creates a "linked open bibliome" from existing metadata and from Natural Language Processing of the contents of scientific papers. The twentieth-century epistemologist Karl Popper suggested the existence of three worlds, the third of which is the realm of objective knowledge. Objective knowledge encompasses human productions including scientific theories, and also the associated artifacts such as books, works of art, and computer records. The contents of the Web, which did not exist when Popper built his theory, constitute a new realm in Popper's World 3, as do linked data and knowledge, and thus the linked bibliome. This motivates further observations which I will present in this talk. First, Popper's main interest was in scientific theories: he stressed the role of falsification instead of verification in the evaluation of scientific knowledge. It is therefore important when building linked knowledge to equip it with devices which test contradiction with existing knowledge and data, and more generally which help position pieces of knowledge with respect to other pieces of knowledge. I will point at current research on this matter, such as the SWAN biomedical discourse ontology. Second, detecting contradiction among multiple statements, and more generally the gathering of relational information, relies on an ability to identify identical individuals across statements. This key point in linked data and knowledge reveals the importance of co-reference resolution, which precisely aims to do that within a text and across multiple texts. I will briefly review current trends in co-reference resolution in scientific texts.

Bio: Dr. Pierre Zweigenbaum is a CNRS Senior Researcher at LIMSI (Orsay, France), where he leads the Natural Language Processing group LIMSI/ILES. His graduate and post-graduate research focus is Natural Language Processing (NLP), with medicine as a main application domain. His main research interests are in Information Extraction in multilingual settings, and he is the author or co-author of methods and tools to detect various types of medical entities, expand abbreviations, resolve co-references, and detect relations. He has also designed methods to acquire linguistic knowledge automatically from corpora and thesauri, to help extend lexicons and terminologies, including bilingual ones, using parallel and comparable corpora. He founded the annual international workshop on Building and Using Comparable Corpora and has chaired or co-chaired its seven editions. He is Vice-President of the French association for artificial intelligence (AFIA), where he acts to create stronger links between NLP and Artificial Intelligence. He is the chair of the recently created Francophone Special Interest Group of the International Medical Informatics Association (IMIA), which fosters the development of resources and tools to process French clinical texts. He is the author of over 200 peer-reviewed publications.
Prof. Dr. Zhisheng Huang,
VU University Amsterdam, The Netherlands
Title: Semantic Processing of Medical Text with NLP tools
Abstract: Relation extraction from medical text using NLP tools is considered one of the important topics in medical knowledge processing. Enhancing those NLP tools with semantic processing based on domain knowledge, such as medical ontologies, would improve the efficiency of medical knowledge extraction. In this talk, we will present an approach that uses the XMedLan NLP tool to obtain a semantic representation of medical knowledge with well-known medical ontologies such as UMLS and SNOMED CT. We will report two use cases of the semantic processing of medical knowledge. The first use case is how to convert unstructured knowledge in medical guidelines into structured knowledge, and how it can be used in searching for new and relevant evidence for evidence-based medical guideline updates. The second use case is how to semi-automatically use a rule-based formalization of eligibility criteria for clinical trials when processing clinical text.
Prof. Dr. Jesse Davis,
KU Leuven, Belgium
Title: Extracting and Reasoning about Biomedical Texts
Abstract: In this talk, I will present an overview of work being done in the context of a Flemish project involving Marie-Francine Moens, Martine De Cock, and myself on extracting information from biomedical texts and then reasoning about the extracted knowledge. In the first part of the talk, I will discuss a semi-supervised pipeline for extracting different relationships about gene regulation. I will highlight two important challenges: dealing with limited supervised training data and a highly imbalanced class distribution. In the second part of the talk, I will present an unsupervised approach to learning an IS-A taxonomy from scratch by analyzing a given text corpus. In particular, I will focus on the challenge of trying to improve the accuracy of the learned taxonomy.

Bio: Prof. Dr. Jesse Davis is a professor of Computer Science at KU Leuven. He completed his PhD at the University of Wisconsin – Madison and a post-doc at the University of Washington. His research interests include machine learning and its application to problems in health care and sports. The innovative potential of his health-related work has been acknowledged by Flemish industry (e.g., KBF-Elia Award, Janssen Pharmaceutics open innovation award of KIR). He was program co-chair for ILP 2014.
Prof. Dr. Marie-Francine Moens,
KU Leuven, Belgium
Title: Text Mining of Biomedical Texts: Challenges and Opportunities
Abstract: In many domains (such as science and business) crucial knowledge is often embedded in natural language text and cannot be accessed or used without human reading and interpretation. The vast growth of potentially relevant text makes it impossible to manually access and process all relevant knowledge, hence the interest in machine understanding of natural language. An area where knowledge extraction from text is of large economic and societal value is health information analysis and analysis of biomedical publications. Although recent progress in statistical machine learning, computational linguistics, and knowledge management makes the goal of a deeper understanding of language by the machine attainable in the years to come, clinical reports and biomedical publications pose a number of pertinent scientific challenges. In this talk we will introduce the opportunities and difficulties when mining these reports and publications and give pointers to recent scientific and technological advances that will facilitate the task.

Bio: Marie-Francine Moens is a professor at the department of Computer Science of KU Leuven. She is head of the Language Intelligence and Information Retrieval team. She is the author of more than 280 international peer-reviewed publications and of several books. She is involved in the organization or program committee (as program chair, area chair or reviewer) of major conferences on computational linguistics, information retrieval and machine learning. She teaches the courses Text Based Information Retrieval and Natural Language Processing at KU Leuven in the Faculty of Engineering Science. She has given several invited tutorials in summer schools and international conferences and regularly gives keynotes at international conferences on the topic of information extraction from text. She participates or has participated as partner or coordinator in numerous European and international projects, which focus on text mining or the development of language technology. In 2011 and 2012 she was appointed as chair of the European Chapter of the Association for Computational Linguistics (EACL) and was a member of the executive board of the Association for Computational Linguistics (ACL). From 2010 until 2014 she was a member of the Research Council of KU Leuven and is currently a member of the Council of the Industrial Research Fund of KU Leuven. She is the scientific manager of the EU COST action iV&L (The European Network on Integrating Vision and Language). She is a member of the editorial board of the journal Foundations and Trends® in Information Retrieval. She was appointed as Scottish Informatics and Computer Science Alliance (SICSA) Distinguished Visiting Fellow in 2014.
Dr. Parisa Kordjamshidi,
University of Illinois at Urbana-Champaign, USA
KU Leuven, Belgium
Title: Structured Learning for Spatial Information Extraction from Biomedical Text
Abstract: The aim is to automatically extract species names of bacteria and their locations from webpages. This task is important for exploiting the vast amount of biological knowledge which is expressed in diverse natural language texts and putting this knowledge in databases for easy access by biologists. The task is challenging and the previous results are far below an acceptable level of performance, particularly for extraction of localization relationships. We design a new structured output prediction model for joint extraction of biomedical entities and the localization relationship. The proposed model is based on a spatial role labeling (SpRL) model designed for spatial understanding of unrestricted text. We extend SpRL to extract discourse-level spatial relations in the biomedical domain and apply it to the BioNLP-ST 2013 BB shared task. We highlight the main differences between general spatial language understanding and spatial information extraction from scientific text, which is the focus of this work. We exploit the text's structure and discourse-level global features. The experimental results indicate that a joint learning model over all entities and relationships in a document outperforms a model which extracts entities and relationships independently. This global learning model significantly improves the state-of-the-art results on this task and has a high potential to be adopted in other natural language processing (NLP) tasks in the biomedical domain.

Bio: Parisa Kordjamshidi is a postdoctoral researcher at the University of Illinois at Urbana-Champaign (UIUC) with the Cognitive Computation Group, and at KU Leuven with the Language Intelligence and Information Retrieval group. She obtained her PhD degree from KU Leuven in July 2013. During her PhD research she introduced the first Semantic Evaluation task and benchmark for "Spatial Role Labeling". She has worked on structured output prediction and relational learning models to map natural language onto formal spatial representations, appropriate for spatial reasoning as well as to extract knowledge from biomedical text. She is also involved in an NIH (National Institutes of Health, USA) project, extending her research experience on structured and relational learning to Learning Based Programming (LBP) for biological data analysis. The results of her research have been published in several international peer-reviewed conferences and journals including ACM-TSLP, JWS, BMC-Bioinformatics.
Prof. Dr. Tim Van den Bulcke,
University Hospital Antwerp
University of Antwerp, Belgium
Title: Mining Real-World Electronic Health Record Data for Clinical Trial Patient Recruitment
Abstract: Clinical trials are conducted extensively in clinical research. Patient eligibility screening typically consists of manually reviewing individual patient health records to identify potential candidates. Unfortunately, many clinical trials experience significant delays (between 1 and 6 months for most trials). In many cases this is related to poor recruitment. Computer-assisted eligibility screening could potentially improve the selection accuracy, increase the number of selected patients, and reduce the cost of the selection process. Through the large-scale adoption of electronic health record (EHR) systems in recent years, much information has become available electronically. However, the use of real-world EHR data comes with its own set of challenges. Crucial information is not always registered or is often only available as clinical free text in various languages. Mining such information also requires compliance with different rules regarding patient consent, ethics and privacy. Finally, new workflows need to be adopted at multiple institutions before such systems can be used in an operational context.

Bio: Tim Van den Bulcke coordinates biomedical informatics research at the Antwerp University Hospital (UZA) and is professor at Antwerp University (UAntwerpen). He co-founded biomina, a young interdisciplinary research center for biomedical informatics established by UAntwerpen and UZA, focusing on the application of computational techniques for integration, analysis and visualization of clinical and 'omics data with a particular focus on patient recruitment and medical coding applications. He gained experience with scientific workflow systems and interactive data visualization software during his work in the pharmaceutical industry and during his academic research. He leveraged state-of-the-art computational techniques with interactive visualization software within the domains of bioinformatics and medical informatics.
Dr. Toni Verbeiren,
KU Leuven, Belgium
Title: Analyzing and Visualizing Data, the Distributed Approach
Abstract: When data becomes big and we want to analyse that data, the number of options available to us is limited. In this talk, we will discuss the available options and see where they lead us. We take a closer look at distributing data as well as computations and how this relates to visualization and interactivity.

Bio: Toni is a strong believer in multi-disciplinary science. Where people from different fields meet, new ideas are born. Visual analytics is such a place. His current focus is on the visualization of large amounts of data (big data) and multiple dimensions. Prior to joining Prof Aerts' lab, Toni worked as an independent consultant in a broad set of areas all related to technology: virtualization, service management, project management and enterprise architecture. His background in the technical as well as the organizational aspects of technology comes in very handy in his current work as a researcher. Toni obtained his PhD from the department of Theoretical Physics at the KU Leuven where he studied models for artificial neural networks.

Organization: KU Leuven: HCI-LIIR research group of the Department of Computer Science and Pharmaceutical and Pharmacological Sciences. The event is sponsored by the Bilateral Collaborative Project of KU Leuven and Tsinghua University (BIL/012/008).

Local organization committee:
Geert Heyman
Walter Luyten
Marie-Francine Moens
Niraj Shrestha


Geert Heyman (geert DOT heyman AT cs DOT kuleuven DOT be)