[ biopathway.org ]

On-going Project (15th Dec. 2010 ~ 29th Feb. 2012)
Identifying the Presence and Certainty of Clinical Conditions from Clinical Reports


The specific details of clinical conditions, and a proper understanding of them, play a critical role in clinical decision making about related diagnoses and treatments. Medical experts usually describe such conditions in terse but plain English and store the descriptions as clinical reports for immediate decision making, periodic assessment, and long-term archival. Nevertheless, it is demanding to determine promptly and accurately the nature and extent of the clinical conditions in such reports and to decide on the next course of action, primarily because of the immense volume of information that must be taken into account, but also because of the complexity of the natural language expressions used in the reports. For this reason, automatically extracting clinical conditions from clinical reports is often considered the first step in various applications of medical language processing (MLP).

In this on-going project, we address two issues in extracting clinical conditions and coding them with ICD codes, which are used to maintain medical statistics in many countries, including the United States and the Republic of Korea. The first issue is how to evaluate such systems against a manually annotated corpus, for example the dataset provided by the Computational Medicine Center (CMC), which held a shared task on automatically assigning ICD-9-CM codes to medical records in spring 2007 (henceforth, CMC 2007). Unlike most annotated MLP corpora, which have a single gold-standard annotation, this dataset contains three annotations for each report, assigned independently by three annotators: one working for a hospital and the other two working for insurance companies. While CMC 2007 treated the codes assigned by two or more annotators as a tentative gold standard, we found evidence that most codes assigned by only one annotator are also reasonable, and that the disagreements among the three annotations may reflect the annotators' specific jobs. In this project, we will carry out experiments to confirm these hypotheses and propose new evaluation methods.
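
The majority-vote gold standard described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CMC 2007 scoring code or the evaluation method proposed in this project: the function names, the `min_votes` parameter, and the example report are all hypothetical, and the code values are used purely as sample data.

```python
from collections import Counter


def count_votes(annotations):
    """Count how many annotators assigned each code to one report."""
    # set(codes) guards against an annotator listing the same code twice.
    return Counter(code for codes in annotations for code in set(codes))


def majority_gold(annotations, min_votes=2):
    """Codes assigned by at least `min_votes` annotators (tentative gold standard)."""
    return {code for code, n in count_votes(annotations).items() if n >= min_votes}


def single_annotator_codes(annotations):
    """Codes assigned by exactly one annotator, i.e. excluded by majority voting."""
    return {code for code, n in count_votes(annotations).items() if n == 1}


def precision_recall_f1(predicted, gold):
    """Score a system's predicted code set against a gold code set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical report with three independent annotations.
report = [
    {"786.2", "486"},    # annotator A (assumed sample data)
    {"786.2"},           # annotator B
    {"786.2", "780.6"},  # annotator C
]

gold = majority_gold(report)
minority = single_annotator_codes(report)
scores = precision_recall_f1({"786.2", "486"}, gold)
```

In this example a system that predicts one of the single-annotator codes is penalized in precision even though one annotator considered that code correct, which is the kind of effect the experiments above are meant to examine.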

The second issue is the negation and uncertainty of natural language expressions. Clinical conditions are sometimes presented in negated or uncertain forms, and the importance of identifying these forms correctly is already well recognized, thanks in part to CMC 2007. Previous studies on detecting negation and uncertainty in clinical reports focus entirely on isolated mentions of clinical conditions. However, there is much evidence that recognizing an isolated mention alone may not determine the presence and certainty of the condition correctly. For example, some medical reports contain apparently contrasting mentions of the patients' current status (e.g., "Pneumonia" and "No evidence of pneumonia"). In this on-going project, we are developing methods to combine such contrasting mentions.
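
To make the problem of contrasting mentions concrete, the following Python sketch classifies each mention in isolation and then groups the results by condition. It is a toy illustration, not the method under development in this project: the cue lists, the status labels, and the choice to flag disagreement as "conflicting" instead of resolving it are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately small cue lists; real systems use far richer ones.
UNCERTAINTY = re.compile(r"\b(possible|probable|suspected|cannot rule out)\b")
NEGATION = re.compile(r"\b(no|without|negative for)\b")


def classify_mention(sentence):
    """Label one isolated mention as 'uncertain', 'negated', or 'affirmed'."""
    text = sentence.lower()
    # Uncertainty is checked first so "cannot rule out" is not read as negation.
    if UNCERTAINTY.search(text):
        return "uncertain"
    if NEGATION.search(text):
        return "negated"
    return "affirmed"


def summarize(mentions):
    """Group (condition, sentence) pairs and report one status per condition.

    If the isolated mentions of a condition disagree, the condition is
    flagged as 'conflicting' rather than resolved (an assumption of this sketch).
    """
    statuses = defaultdict(set)
    for condition, sentence in mentions:
        statuses[condition].add(classify_mention(sentence))
    return {
        condition: next(iter(found)) if len(found) == 1 else "conflicting"
        for condition, found in statuses.items()
    }


# The contrasting pair from the text, plus one unrelated mention (assumed data).
report = [
    ("pneumonia", "Pneumonia"),
    ("pneumonia", "No evidence of pneumonia"),
    ("cough", "Cough for three days"),
]

summary = summarize(report)
```

Each mention of pneumonia is classified plausibly on its own, yet the two labels disagree, so a per-mention classifier cannot decide whether the condition is present in the report as a whole.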

We believe that both issues must be addressed in order to develop and evaluate systems that automatically assign codes to medical records. Such systems can improve the quality of report review and help prevent clinical mistakes by reducing manual intervention.

This project is supported by funding from Microsoft Research Asia.


  1. Seung-Cheol Baek and Jong C. Park (2011). Analyzing Disagreements among ICD-9-CM Coders, The 4th International Symposium on Languages in Biology and Medicine (LBM 2011).

Page maintained by Seung-Cheol Baek
Last modified: Jan 18, 2012