Paper Summary
Written By: Ajay Jaiswal
Automated systems for labelling or extracting structured information from MIMIC-CXR and CheXpert reports fall into two categories:
1. Automated Radiology report labellers:
- CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. CoRR, abs/1901.07031, 2019
- NegBio: A high-performance tool for negation and uncertainty detection in radiology reports. AMIA, 2018
- CheXpert++: Approximating the CheXpert labeler for speed, differentiability, and probabilistic output. 2020
- CheXbert: Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. 2020
- VisualCheXbert: Addressing the discrepancy between radiology report labels and image labels. 2021
2. Extraction of more fine-grained information:
- Enhancing the expressiveness and usability of structured image reporting systems. 2000
- Information extraction from multi-institutional radiology reports. 2016
- Understanding spatial language in radiology: representation framework, annotation, and spatial relation extraction from chest x-ray reports using deep learning. 2020
- Toward complete structured information extraction from radiology reports using machine learning. 2019
- Extracting clinical terms from radiology reports with deep learning. 2021
The development of automated approaches for structuring large amounts of clinically relevant information in reports is primarily limited by two factors:
- First, the choice of information extraction schema, such as the 14 medical conditions proposed by Irvin et al., limits the amount of information extracted from reports.
- Second, there is a limited number of datasets with dense report annotations, which are expensive to obtain given the amount of time and expertise required by medical experts to procure such annotations.
Paper Contributions:
- We define a novel information extraction schema for radiology reports, intended to cover most clinically relevant information within the report while allowing for ease and consistency during annotation.
- We release development and test datasets annotated according to our schema by board-certified radiologists.
- Our development dataset contains annotations for 500 radiology reports from the MIMIC-CXR dataset, consisting of 14,579 entities and 10,889 relations.
- Our test dataset contains two sets of independent annotations for 100 radiology reports from the MIMIC-CXR and CheXpert datasets.
- We use our dataset to benchmark various modeling approaches. Our best approach, which we call RadGraph Benchmark, achieves a micro F1 of 0.94 / 0.91 (MIMIC-CXR / CheXpert) on named entity recognition and a micro F1 of 0.82 / 0.73 (MIMIC-CXR / CheXpert) on relation extraction.
- We release an inference dataset, which contains annotations automatically generated by RadGraph Benchmark for 220,763 MIMIC-CXR reports, consisting of over 6 million entities and 4 million relations, and 500 CheXpert reports, consisting of 13,783 entities and 9,908 relations.
Annotated reports in the inference dataset have mappings to associated chest radiographs, which can facilitate the development of multi-modal approaches in radiology.
Information Extraction Schema
We propose a novel information extraction schema for extracting entities and relations from radiology reports, adapting the schema initially proposed by Langlotz et al. to incorporate relations between entities and reduce the number of entities. Our schema is designed for high coverage of the clinically relevant information in a report corresponding to the radiology image being examined, generally included in the Findings and Impression sections of the radiology report.
Entities:
- An entity is a continuous span of text that can include one or more adjacent words.
- Entities in our schema center around two concepts: Anatomy and Observation. We specify three uncertainty levels for Observation, so our schema defines four entity types: Anatomy, Observation: Definitely Present, Observation: Uncertain, and Observation: Definitely Absent (see the sketch after this list).
- Anatomy refers to an anatomical body part that occurs in a radiology report, such as a “lung”.
- Observations refer to words associated with visual features, identifiable pathophysiologic processes, or diagnostic disease classifications. For example, an Observation could be “effusion” or more general phrases like “increased”.
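A minimal way to represent these entity types in code is sketched below; the class and label strings are my own illustration, not the format of the paper's released dataset.

```python
from dataclasses import dataclass

# The four entity types defined by the schema (illustrative label strings).
ENTITY_TYPES = {
    "Anatomy",
    "Observation::definitely_present",
    "Observation::uncertain",
    "Observation::definitely_absent",
}

@dataclass
class Entity:
    """A continuous span of one or more adjacent tokens in a report."""
    start_token: int  # index of the first token in the span (inclusive)
    end_token: int    # index of the last token in the span (inclusive)
    label: str        # one of ENTITY_TYPES
    text: str         # surface form, e.g. "effusion" or "lung"

# Example: "effusion" mentioned with hedging language ("may represent effusion").
uncertain_effusion = Entity(start_token=3, end_token=3,
                            label="Observation::uncertain", text="effusion")
```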
Relations:
- We define a relation as a directed edge between two entities. Our schema uses three relation types: Suggestive Of, Located At, and Modify (a worked example follows after this list).
- Suggestive Of (Observation, Observation) is a relation between two Observation entities indicating that the presence of the second Observation is inferred from that of the first Observation.
- Located At (Observation, Anatomy) is a relation between an Observation entity and an Anatomy entity indicating that the Observation is related to the Anatomy. While Located At often refers to location, it can also be used to describe other relations between an Observation and an Anatomy.
- Modify (Observation, Observation) or (Anatomy, Anatomy) is a relation between two Observation entities or two Anatomy entities indicating that the first entity modifies the scope of, or quantifies the degree of, the second entity.
- Identified Issue: The schema does not define separate modifier entity types; as a result, all Observation modifiers are annotated as Observation entities, and all Anatomy modifiers are annotated as Anatomy entities for simplicity.
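To make the relation types concrete, here is an illustrative (not paper-verbatim) annotation of the short finding "Increased left pleural effusion", using plain tuples:

```python
# Entities as (text, entity_type); the list index serves as the entity id.
entities = [
    ("Increased", "Observation::definitely_present"),  # 0
    ("left",      "Anatomy"),                          # 1
    ("pleural",   "Anatomy"),                          # 2
    ("effusion",  "Observation::definitely_present"),  # 3
]

# Relations as (source_id, target_id, relation_type).
relations = [
    (0, 3, "modify"),      # "Increased" quantifies the degree of "effusion"
    (1, 2, "modify"),      # "left" modifies the scope of "pleural"
    (3, 2, "located_at"),  # the "effusion" Observation is located at the "pleural" Anatomy
]
```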
Dataset Statistics and Details:
Development dataset: We sample 500 radiology reports from the MIMIC-CXR dataset [1] for our development dataset, which is divided into train and dev sets; the dev set comprises 15% of the reports, and patients associated with reports in the train and dev sets do not overlap.
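A patient-disjoint train/dev split like this can be reproduced with a grouped split. A minimal sketch, assuming a hypothetical reports table with a patient_id column (file and column names are assumptions, not from the paper):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical table with one row per report and a patient identifier.
reports = pd.read_csv("development_reports.csv")

# Hold out ~15% of patients (and hence roughly 15% of reports) for the dev set,
# so no patient appears in both the train and dev splits.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=0)
train_idx, dev_idx = next(splitter.split(reports, groups=reports["patient_id"]))

train_reports = reports.iloc[train_idx]
dev_reports = reports.iloc[dev_idx]
```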
Test dataset: We sample 50 radiology reports from the MIMIC-CXR dataset and 50 radiology reports from the CheXpert dataset for our test dataset in order to test generalization of approaches across institutions. We de-identify CheXpert reports using an automated, transformer-based de-identification algorithm followed by manual review of each report.
Benchmark Approaches and Results:
Approach: We propose an entity and relation extraction task for radiology reports, with models trained on our development dataset and evaluated on our test dataset. For each report, we provide annotations identifying the type and span of each entity as well as the relations between entities.
Baseline Model: Our baseline approach uses a BERT model with a linear classification head on top of the last layer for NER, and R-BERT for relation extraction. Since the same entity may span multiple tokens, the baseline NER approach uses the IOB tagging scheme and converts IOB tags to the entity types defined by our schema after inference. For each of our approaches, in addition to standard BERT weight initializations, we use weight initializations from four different biomedical pretrained models: BioBERT, Bio+ClinicalBERT, PubMedBERT, and BlueBERT.
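As a sketch of the IOB post-processing step, the helper below converts a sequence of token-level IOB tags back into typed spans (a simplified illustration, not the authors' exact implementation):

```python
def iob_to_spans(tags):
    """Convert token-level IOB tags, e.g. ["B-Anatomy", "I-Anatomy", "O", ...],
    into (start, end, entity_type) spans with inclusive token indices."""
    spans, start, current = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if current is not None:
                spans.append((start, i - 1, current))
            start, current = i, tag[2:]
        elif tag.startswith("I-") and current == tag[2:]:
            continue  # extend the currently open span
        else:  # "O", or an I- tag that does not continue the open span
            if current is not None:
                spans.append((start, i - 1, current))
            start, current = None, None
    if current is not None:
        spans.append((start, len(tags) - 1, current))
    return spans

# Example: "left pleural effusion" -> one Anatomy span and one Observation span.
print(iob_to_spans(["B-Anatomy", "I-Anatomy", "B-Observation"]))
# [(0, 1, 'Anatomy'), (2, 2, 'Observation')]
```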
Benchmark models: We develop additional benchmark approaches for our task, using two different entity and relation extraction model architectures. We use BERT in both our PURE and DYGIE++ approaches.
- Our first approach uses the DYGIE++ framework by Wadden et al., which achieved state-of-the-art at the time on NER and relation extraction by jointly extracting entities and relations.
- Our second approach uses the Princeton University Relation Extraction system (PURE) by Zhong et al., which achieved state-of-the-art at the time on relation extraction using a pipeline approach that decomposes NER and relation extraction into separate subtasks.
Evaluation Metrics: We report both micro and macro F1 for entity recognition and relation extraction.
- For entity recognition, a predicted entity is considered correct if the predicted span boundaries and predicted entity type are both correct.
- For relation extraction, a predicted relation is considered correct if the predicted entity pair is correct (both the span boundaries and entity type) and the relation type is correct.
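The entity criterion translates directly into a micro-averaged F1 over exact (span, type) matches; a minimal sketch (the relation metric additionally requires both endpoint entities to be correct):

```python
def entity_micro_f1(predicted, gold):
    """Micro F1 where an entity counts as correct only if its span boundaries
    and its type both match exactly.

    Each argument is a list (one item per report) of sets of
    (start_token, end_token, entity_type) tuples."""
    tp = sum(len(p & g) for p, g in zip(predicted, gold))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example with one report: one exact match out of two predictions and two gold entities.
pred = [{(0, 1, "Anatomy"), (2, 2, "Observation")}]
gold = [{(0, 1, "Anatomy"), (2, 3, "Observation")}]
print(entity_micro_f1(pred, gold))  # precision = recall = 0.5 -> F1 = 0.5
```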
Results:
Analysis:
Schema Coverage: Given that existing information extraction systems for radiology reports often suffer from a lack of report coverage, we measure the number of tokens and sentences in report sections covered by our schema. To calculate coverage, we extract the Findings and Impression sections of the reports, which our schema is designed to annotate. We then calculate the average percent of sentences and tokens annotated per report across the development and test datasets.
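Token-level coverage of this kind can be computed as the fraction of tokens in the extracted sections that fall inside at least one annotated entity span; a rough per-report sketch:

```python
def token_coverage(num_tokens, entity_spans):
    """Fraction of a report's tokens that fall inside at least one entity span.

    entity_spans: iterable of (start_token, end_token) pairs with inclusive indices.
    """
    covered = set()
    for start, end in entity_spans:
        covered.update(range(start, end + 1))
    return len(covered) / num_tokens if num_tokens else 0.0

# Example: 3 of 10 tokens are covered by annotations.
print(token_coverage(10, [(0, 1), (4, 4)]))  # 0.3

# Dataset-level coverage is then the average of per-report values.
```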
Annotation Disagreements: To measure agreement between radiologists using our schema, we calculate Cohen's Kappa [37] between the two annotators on each test set, separately for the named entity recognition task and the relation extraction task. For named entity recognition, we compute Kappa scores of 0.974 and 0.829 on the MIMIC-CXR and CheXpert test sets respectively. For relation extraction, we compute Kappa scores of 0.841 and 0.397 on the MIMIC-CXR and CheXpert test sets respectively. One reason for greater disagreement on the CheXpert test set compared to the MIMIC-CXR test set may be the different percentages of patients in the intensive care unit (ICU) in the MIMIC-CXR and CheXpert datasets, which can systematically affect the contents of radiology reports.
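Cohen's Kappa corrects observed agreement for chance agreement, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the expected agreement by chance. Given the two annotators' labels aligned to a common set of units (e.g., tokens or candidate entity pairs), it can be computed with scikit-learn; the labels below are hypothetical:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical aligned token-level labels from the two annotators.
annotator_1 = ["Anatomy", "O", "Observation", "Observation", "O", "Anatomy"]
annotator_2 = ["Anatomy", "O", "Observation", "O",           "O", "Anatomy"]

print(cohen_kappa_score(annotator_1, annotator_2))
```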