Automated detection of moderate and large pneumothorax on frontal chest X-rays using deep convolutional neural networks: A retrospective study

Abstract

Background

Pneumothorax can precipitate a life-threatening emergency due to lung collapse and respiratory or circulatory distress. Pneumothorax is typically detected on chest X-ray; however, treatment is reliant on timely review of radiographs. Since current imaging volumes may result in long worklists of radiographs awaiting review, an automated method of prioritizing X-rays with pneumothorax may reduce time to treatment. Our objective was to create a large human-annotated dataset of chest X-rays containing pneumothorax and to train deep convolutional networks to screen for potentially emergent moderate or large pneumothorax at the time of image acquisition.

Methods and findings

In all, 13,292 frontal chest X-rays (3,107 with pneumothorax) were visually annotated by radiologists. This dataset was used to train and evaluate multiple network architectures. Images showing large- or moderate-sized pneumothorax were considered positive, and those with trace or no pneumothorax were considered negative. Images showing small pneumothorax were excluded from training. Using an internal validation set (n = 1,993), we selected the 2 top-performing models; these models were then evaluated on a held-out internal test set based on area under the receiver operating characteristic curve (AUC), sensitivity, specificity, and positive predictive value (PPV). The final internal test was performed initially on a subset with small pneumothorax excluded (as in training; n = 1,701), then on the full test set (n = 1,990), with small pneumothorax included as positive. External evaluation was performed using the National Institutes of Health (NIH) ChestX-ray14 set, a public dataset labeled for chest pathology based on text reports. All images labeled with pneumothorax were considered positive, because the NIH set does not classify pneumothorax by size. In internal testing, our “high sensitivity model” produced a sensitivity of 0.84 (95% CI 0.78–0.90), specificity of 0.90 (95% CI 0.89–0.92), and AUC of 0.94 for the test subset with small pneumothorax excluded. Our “high specificity model” showed sensitivity of 0.80 (95% CI 0.72–0.86), specificity of 0.97 (95% CI 0.96–0.98), and AUC of 0.96 for this set. PPVs were 0.45 (95% CI 0.39–0.51) and 0.71 (95% CI 0.63–0.77), respectively. Internal testing on the full set showed expected decreased performance (sensitivity 0.55, specificity 0.90, and AUC 0.82 for high sensitivity model and sensitivity 0.45, specificity 0.97, and AUC 0.86 for high specificity model). External testing using the NIH dataset showed some further performance decline (sensitivity 0.28–0.49, specificity 0.85–0.97, and AUC 0.75 for both). Due to labeling differences between internal and external datasets, these findings represent a preliminary step towards external validation.