RadGraph: Extracting Clinical Entities and Relations from Radiology Reports

20 Jun 2021

Paper Summary

Written By: Ajay Jaiswal

Paper URL

Automated systems for labelling or extracting structured information from MIMIC-CXR and CheXpert reports fall into two broad categories:

1. Automated radiology report labellers:

2. Extraction of more fine-grained information:

The development of automated approaches for structuring large amounts of clinically relevant information in reports is primarily limited by two factors:

1. The lack of an information extraction schema with high coverage of the clinically relevant information in reports.

2. The absence of large report datasets with expert annotations for training and evaluating extraction models.

Paper Contribution:

Annotated reports in the inference dataset have mappings to associated chest radiographs, which can facilitate the development of multi-modal approaches in radiology.

[Figure]

Information Extraction Schema:

We propose a novel information extraction schema for extracting entities and relations from radiology reports, adapting the schema initially proposed by Langlotz et al. to incorporate relations between entities and reduce the number of entities. Our schema is designed for high coverage of the clinically relevant information in a report corresponding to the radiology image being examined, generally included in the Findings and Impression sections of the radiology report.

[Figure: information extraction schema]

Entities:

An entity is a continuous span of text assigned one of four types: Anatomy, which is always marked definitely present (ANAT-DP), and Observation, which carries one of three uncertainty levels: definitely present (OBS-DP), uncertain (OBS-U), or definitely absent (OBS-DA).

Relations:

A relation is a directed edge between two entities and is assigned one of three types: Suggestive Of (one observation suggests another observation), Located At (an observation is located at an anatomy), and Modify (an observation modifies another observation, or an anatomy modifies another anatomy).
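To make the schema concrete, here is a small illustrative sketch of how a single sentence could be annotated with these entity and relation types. The dictionary layout is an assumption made for this summary, not the paper's released annotation format.

```python
# Hypothetical annotation for the sentence "There is a small left pleural effusion."
# Entity and relation types follow the schema above; the dict layout is illustrative only.
annotation = {
    "text": "There is a small left pleural effusion.",
    "entities": {
        "1": {"tokens": "small",    "label": "OBS-DP",  "relations": [("modify", "4")]},
        "2": {"tokens": "left",     "label": "ANAT-DP", "relations": [("modify", "3")]},
        "3": {"tokens": "pleural",  "label": "ANAT-DP", "relations": []},
        "4": {"tokens": "effusion", "label": "OBS-DP",  "relations": [("located_at", "3")]},
    },
}
```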

Dataset Statistics and Details:

[Figure: dataset statistics]

Development dataset: We sample 500 radiology reports from the MIMIC-CXR dataset [1] for our development dataset, which is divided into train and dev sets; the dev set comprises 15% of the reports, and patients associated with reports in the train and dev sets do not overlap.
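A minimal sketch of a patient-disjoint split like the one described above; the `patient_id` field name and the exact splitting procedure are assumptions for illustration, not the paper's code.

```python
import random

def split_by_patient(reports, dev_fraction=0.15, seed=0):
    """Split reports into train/dev sets so that no patient appears in both.

    `reports` is assumed to be a list of dicts with a "patient_id" key.
    Holding out ~15% of patients only approximately holds out 15% of reports.
    """
    patients = sorted({r["patient_id"] for r in reports})
    random.Random(seed).shuffle(patients)
    n_dev = max(1, round(dev_fraction * len(patients)))
    dev_patients = set(patients[:n_dev])
    train = [r for r in reports if r["patient_id"] not in dev_patients]
    dev = [r for r in reports if r["patient_id"] in dev_patients]
    return train, dev
```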

Test dataset: We sample 50 radiology reports from the MIMIC-CXR dataset and 50 radiology reports from the CheXpert dataset for our test dataset in order to test generalization of approaches across institutions. We de-identify CheXpert reports using an automated, transformer-based de-identification algorithm followed by manual review of each report.
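The summary does not cover the de-identification implementation in detail; below is a hedged sketch of how a transformer-based de-identifier could be applied using the Hugging Face `pipeline` API. The model name is a placeholder, not the model used in the paper.

```python
from transformers import pipeline

# Placeholder: substitute a real PHI / de-identification NER model identifier.
DEID_MODEL_NAME = "path-or-hub-id-of-a-phi-ner-model"

deid = pipeline("token-classification", model=DEID_MODEL_NAME,
                aggregation_strategy="simple")

def deidentify(report_text):
    """Replace every detected PHI span with a redaction token."""
    # Process spans right-to-left so earlier character offsets stay valid.
    for span in sorted(deid(report_text), key=lambda s: s["start"], reverse=True):
        report_text = report_text[:span["start"]] + "[REDACTED]" + report_text[span["end"]:]
    return report_text
```

Each automatically de-identified report was then manually reviewed, as noted above.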

Benchmark Approaches and Results:

Approach: We propose an entity and relation extraction task for radiology reports; models for this task can be developed on our development dataset and evaluated on our test dataset. For each report, we provide annotations identifying the type and span of each entity as well as the relations between entities.

Baseline Model: Our baseline approach to entity and relation extraction uses a BERT model with a linear classification head on top of the last layer for NER, and R-BERT for relation extraction. Since a single entity may span multiple tokens, our baseline NER approach uses the IOB tagging scheme and converts IOB tags to the entity types defined by our schema after inference. For each of our approaches, in addition to BERT weight initializations, we use weight initializations from four different biomedical pretrained models: BioBERT, Bio+ClinicalBERT, PubMedBERT, and BlueBERT.
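As a concrete illustration of the IOB step described above, here is a minimal sketch of converting per-token IOB predictions back into typed entity spans; the function is written for this summary and is not taken from the paper.

```python
def iob_to_entities(tokens, iob_tags):
    """Convert per-token IOB tags (e.g. "B-OBS-DP", "I-OBS-DP", "O")
    into (entity_text, entity_type) pairs."""
    entities, start, current = [], None, None

    def close(end):
        # Emit the currently open entity, if any, ending just before `end`.
        if current is not None:
            entities.append((" ".join(tokens[start:end]), current))

    for i, tag in enumerate(iob_tags):
        if tag.startswith("B-"):            # a new entity begins here
            close(i)
            start, current = i, tag[2:]
        elif not (tag.startswith("I-") and current == tag[2:]):
            close(i)                        # "O" or an inconsistent "I-" tag
            start, current = None, None
    close(len(tokens))
    return entities

# Example: two anatomy tokens form one span, observations are separate spans.
print(iob_to_entities(
    ["mild", "left", "pleural", "effusion"],
    ["B-OBS-DP", "B-ANAT-DP", "I-ANAT-DP", "B-OBS-DP"],
))
# [('mild', 'OBS-DP'), ('left pleural', 'ANAT-DP'), ('effusion', 'OBS-DP')]
```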

Benchmark models: We develop additional benchmark approaches for our task using two different entity and relation extraction architectures, PURE and DYGIE++, both of which use BERT as the underlying encoder.

Evaluation Metrics: We report both micro and macro F1 for entity recognition and relation extraction.
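For reference, the micro/macro distinction can be sketched as below (scikit-learn is an assumed tool choice, and the sketch treats evaluation as label classification, whereas the paper's entity metric also requires the predicted span to match the gold span).

```python
from sklearn.metrics import f1_score

# Gold and predicted entity types for a handful of entities (illustrative labels only).
gold = ["OBS-DP", "OBS-DP", "ANAT-DP", "OBS-U", "OBS-DA"]
pred = ["OBS-DP", "OBS-U",  "ANAT-DP", "OBS-U", "OBS-DA"]

# Micro F1 pools all decisions together; macro F1 averages per-class F1 scores,
# so rare entity types count as much as frequent ones.
print("micro F1:", f1_score(gold, pred, average="micro"))
print("macro F1:", f1_score(gold, pred, average="macro"))
```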

Results:

[Figures: benchmark results]

Analysis:

Schema Coverage: Given that existing information extraction systems for radiology reports often suffer from a lack of report coverage, we measure the fraction of tokens and sentences in report sections covered by our schema. To calculate coverage, we extract the Findings and Impression sections of the reports, which our schema is designed to annotate, and then compute the average percentage of sentences and tokens annotated per report across the development and test datasets.
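A minimal sketch of the per-report coverage computation described above; the whitespace tokenization and input layout are simplifying assumptions.

```python
def report_coverage(sentences, annotated_tokens):
    """Return (token coverage, sentence coverage) for one report.

    `sentences` is a list of token lists from the Findings/Impression sections;
    `annotated_tokens` is a set of (sentence_index, token_index) pairs that fall
    inside some annotated entity. Both layouts are assumptions for illustration.
    """
    total_tokens = sum(len(s) for s in sentences)
    token_coverage = len(annotated_tokens) / total_tokens
    sentence_coverage = len({s for s, _ in annotated_tokens}) / len(sentences)
    return token_coverage, sentence_coverage
```

The reported numbers would then be these two fractions averaged over all reports in each dataset.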

[Figure: schema coverage statistics]

Annotation Disagreements: To measure agreement between radiologists using our schema, we calculate Cohen's Kappa [37] between the two annotators on each test set, separately for the named entity recognition task and the relation extraction task. For named entity recognition, we obtain Kappa scores of 0.974 and 0.829 on the MIMIC-CXR and CheXpert test sets respectively; for relation extraction, the scores are 0.841 and 0.397 respectively. One reason for greater disagreement on the CheXpert test set compared to the MIMIC-CXR test set may be the different percentages of intensive care unit (ICU) patients in the MIMIC-CXR and CheXpert datasets, which can systematically affect the contents of radiology reports.
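For reference, agreement scores like the Kappa values above can be computed directly from the two annotators' labels; scikit-learn is an assumed tool choice and the labels below are illustrative.

```python
from sklearn.metrics import cohen_kappa_score

# Entity types assigned to the same spans by two annotators (illustrative only).
annotator_1 = ["OBS-DP", "ANAT-DP", "OBS-U",  "OBS-DA", "OBS-DP"]
annotator_2 = ["OBS-DP", "ANAT-DP", "OBS-DP", "OBS-DA", "OBS-DP"]

# Cohen's Kappa corrects the raw agreement rate for agreement expected by chance.
print(cohen_kappa_score(annotator_1, annotator_2))
```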