COVID-19 Prediction using Explainable Machine Learning

Background and Motivation

COVID-19, the disease caused by SARS-CoV-2, originated in Wuhan, China and has since become a worldwide pandemic [1]. Research on COVID-19 has recently become a prominent topic in the Artificial Intelligence community. The current testing mechanism, RT-PCR kits [2], is in short supply and of limited efficiency: a test usually takes up to 4-6 hours to produce a result, which is not an optimal way forward while the number of registered COVID-19 patients grows exponentially. This problem motivated bringing Machine Learning methods in to help flatten the curve [3]. It led to the goal of building classifiers that can diagnose patients as COVID-19 negative or positive based on their X-Ray images [4,5]. Because this approach consumes less time and fewer resources, it is expected to be more streamlined than RT-PCR kits. In addition to a good prediction, we need reasons that justify which features are responsible for the diagnostic decision [6].

With this motivation, our work experiments with building classifiers that use CXR (Chest X-Rays) as Ground Truth to predict whether an X-Ray image is COVID-19 negative or positive. Alongside this, we try to identify the features that contribute to the classification of an image and to provide an explanation of why the observed behaviour occurs.

Project Objective

With the motivation to help analyse and fight COVID-19, we arrived at the following research question:

Can we use Machine Learning methods to diagnose COVID-19 and explain the prediction?

To answer this question, we aim to answer a few sub-questions:

How well can classifiers perform on Chest X-Rays?
Although [2] and [6] work extensively with Neural Networks (Black-Box Models) for classification, can simple and intrinsically explainable classifiers achieve a baseline Accuracy, F₁-Score and AUC of 85% using CXR?
How do the different features of CXR contribute to the model prediction, and can we identify a small number of features ranked by their importance?
Which flavour of algorithm performs best, and is classification in an Ensemble setting possible?
Can we come up with an explanation of our model’s decision and prediction?

Dataset

Our dataset consists of 313 positive COVID-19 CXR and 1000 negative CXR, collected from four different sources to build our own version of the dataset. The sources are the COVIDx dataset of [6]¹, the Kaggle CXR Pneumonia dataset by Paul Mooney², CXR images of adult subjects from the RSNA Pneumonia Detection Challenge³, and the original and augmented versions of COVID-19 examples⁴ from [7].

According to [2,6,8–10], CT-Scan data would be the gold standard for us, and it also yields good results in terms of Accuracy and F₁-Score. However, because very little CT-Scan data is publicly available, we use Chest X-rays as our dataset. Although CXR does not match CT-Scans in image quality, [11] suggests that CXR is sufficient and comparable to CT-Scans for diagnosing COVID-19 patients.

In particular, we have used the COVID-19 Dataset-Repo as our Ground Truth.

GitHub URL

The R scripts, process notebook and other resources are stored in the repository.

Design Overview (Algorithms and Methods)

We followed a typical Data Science pipeline: Pre-Processing of the dataset, Feature Extraction and Selection, feeding the Descriptors (Trainable Vectors) to different classifiers for training and testing, and finally evaluation based on the predictors’ results. The details are described in the following sections:

Pre-Processing
- Cropping
  - We are dealing with a highly skewed dataset of CXR images.
  - These images are initially cropped and normalised to 256×256 pixels.
- Masking
  - The normalised images are then masked for the lung segment using a CNN.
- Segmentation
  - The masked portion is then extracted from the original image, and the background is coloured black for ease of processing.
  - The resulting segmented images contain only the lung segment of the original CXR on a black background.
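
The segmentation step above can be sketched in base R as an element-wise product of an image and a binary mask. This is only an illustration under stated assumptions: the 4×4 image, the mask values and the helper name `apply_mask` are made up for the example, and the CNN that would produce a real lung mask is not shown.

```r
# Toy 4x4 grayscale "image" with intensities in [0, 1] (made-up values).
img <- matrix(c(0.2, 0.4, 0.6, 0.8,
                0.1, 0.5, 0.7, 0.9,
                0.3, 0.3, 0.6, 0.2,
                0.9, 0.8, 0.1, 0.4), nrow = 4, byrow = TRUE)

# Hypothetical binary lung mask: 1 inside the lung region, 0 elsewhere.
# In the pipeline this would come from the segmentation CNN.
mask <- matrix(c(0, 1, 1, 0,
                 0, 1, 1, 0,
                 0, 1, 1, 0,
                 0, 0, 0, 0), nrow = 4, byrow = TRUE)

# Keep pixels inside the mask and set everything else to 0 (black).
apply_mask <- function(image, mask) {
  stopifnot(identical(dim(image), dim(mask)))
  image * (mask > 0)
}

segmented <- apply_mask(img, mask)
print(segmented)
```

Pixels outside the mask become 0, which corresponds to the black background described above, while pixels inside the mask keep their original intensity.
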
Feature Extraction
- Local Binary Pattern
  - Several texture-based vision algorithms exist. We can combine features before training and train our model on the combined feature set; alternatively, we can train models on individual features and then combine their prediction results, so that multiple features rather than a single one can be selected [12].
  - As the literature survey suggests, Local Binary Patterns [13] are a good choice of texture-based descriptor.
  - Local Binary Patterns takes a pre-processed image as input and outputs the corresponding LBP vector.
  - Local Binary Patterns works only with grayscale images. The dataset contains RGB and RGBA images, which are internally converted to grayscale before being turned into a vector.
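
To make the descriptor concrete, the following is a minimal base R sketch of the basic 3×3 Local Binary Pattern operator: each interior pixel is compared with its 8 neighbours, the comparison bits are packed into a code between 0 and 255, and the normalised histogram of codes serves as the feature vector. The function names `lbp_codes` and `lbp_histogram` are hypothetical, and this is not necessarily the variant or the package routine used in the project.

```r
# Compute basic 3x3 LBP codes for every interior pixel of a grayscale matrix.
lbp_codes <- function(img) {
  stopifnot(is.matrix(img), nrow(img) >= 3, ncol(img) >= 3)
  nr <- nrow(img)
  nc <- ncol(img)
  # Row/column offsets of the 8 neighbours, clockwise from top-left.
  dr <- c(-1, -1, -1, 0, 1, 1, 1, 0)
  dc <- c(-1, 0, 1, 1, 1, 0, -1, -1)
  rows <- 2:(nr - 1)
  cols <- 2:(nc - 1)
  centre <- img[rows, cols, drop = FALSE]
  codes <- matrix(0, nrow = nr - 2, ncol = nc - 2)
  for (k in seq_along(dr)) {
    neighbour <- img[rows + dr[k], cols + dc[k], drop = FALSE]
    # A neighbour at least as bright as the centre contributes bit k.
    codes <- codes + (neighbour >= centre) * 2^(k - 1)
  }
  codes
}

# Normalised 256-bin histogram of LBP codes: the descriptor vector.
lbp_histogram <- function(img) {
  codes <- lbp_codes(img)
  counts <- tabulate(as.integer(codes) + 1L, nbins = 256)
  counts / sum(counts)
}

# A flat image: every neighbour ties with the centre, so all codes are 255.
flat <- matrix(0.5, nrow = 5, ncol = 5)
h <- lbp_histogram(flat)
which(h > 0)
```

Each image then yields a fixed-length vector of 256 values regardless of its size, which is what allows the descriptors to be fed to standard classifiers.
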
Re-sampling
- The LBP vectors are then re-sampled, since the target class distribution in the dataset is highly skewed.
- Re-sampling techniques tried:
  - Random Under-Sampling
  - Over-Sampling:
    - Random Over-Sampling
    - Density Based-SMOTE [14]
    - Borderline-SMOTE [15]
    - AdaSyn [16]
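
As an illustration of the simplest technique in the list, random over-sampling, here is a base R sketch that duplicates randomly chosen minority-class rows until every class matches the majority count. The helper name `random_oversample`, the toy data frame and the fixed seed are assumptions made for the example; the SMOTE variants and ADASYN synthesise new points instead of duplicating rows and are not shown.

```r
# Duplicate random rows of each smaller class until all classes have the
# same number of rows as the largest class.
random_oversample <- function(data, target, seed = 42) {
  set.seed(seed)  # fixed seed for reproducibility (an assumption)
  counts <- table(data[[target]])
  n_max <- max(counts)
  parts <- lapply(names(counts), function(cls) {
    idx <- which(data[[target]] == cls)
    extra <- n_max - length(idx)
    # Sample positions within idx so a single-row class is handled correctly.
    dup <- idx[sample.int(length(idx), extra, replace = TRUE)]
    data[c(idx, dup), , drop = FALSE]
  })
  out <- do.call(rbind, parts)
  rownames(out) <- NULL
  out
}

# Toy data: 3 positive and 10 negative rows (made-up values).
toy <- data.frame(
  f1 = seq_len(13),
  label = c(rep("positive", 3), rep("negative", 10))
)

balanced <- random_oversample(toy, "label")
table(balanced$label)
```

A common precaution is to re-sample only the training split, so that duplicated rows never appear in both the training and the test data.
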
Classification
- The problem is one of imbalanced learning for binary classification: images are COVID-19 positive or negative [17].
- Here, we would like to emphasise that the model does not predict the presence or absence of pneumonia, which can result not only from COVID-19 but also from other causes.
- We implemented the following algorithms [18]:
  - k-nearest neighbours
  - Logistic Regression
  - Support Vector Machine
  - Tree-based Classifiers
    - Decision Trees
    - Random Forest
  - Naive Bayes
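
As a sketch of how one of the listed classifiers is trained and applied, the example below fits a Logistic Regression with base R's `glm()` on synthetic two-feature data. The data, the 70/30 split and the 0.5 threshold are assumptions chosen for illustration; they stand in for the LBP descriptors and are not the project's data or results.

```r
set.seed(1)
n <- 200

# Synthetic stand-in for descriptors: two features with overlapping classes.
y <- rep(c(0, 1), each = n / 2)
x1 <- rnorm(n, mean = 1.5 * y)
x2 <- rnorm(n, mean = 1.5 * y)
d <- data.frame(x1 = x1, x2 = x2, y = y)

# Hold out 30% of the rows for testing.
train_idx <- sample(n, size = round(0.7 * n))
train_set <- d[train_idx, ]
test_set <- d[-train_idx, ]

# Fit the model on the training rows only.
fit <- glm(y ~ x1 + x2, data = train_set, family = binomial)

# Predicted probability of the positive class, thresholded at 0.5.
prob <- predict(fit, newdata = test_set, type = "response")
pred <- as.integer(prob >= 0.5)

accuracy <- mean(pred == test_set$y)
accuracy
```

The fitted coefficients (`coef(fit)`) can be read directly as the direction and strength of each feature's effect, which is one reason Logistic Regression counts as an intrinsically explainable model.
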

Evaluation
- We evaluate with the following metrics; for each, a higher value indicates better performance:
  - Accuracy
  - Precision
  - Recall
  - F₁-Score
  - AUC & ROC
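
The metrics listed above can be computed from true labels, predicted labels and predicted scores as in the following base R sketch. The helper names `binary_metrics` and `auc_rank` and the toy vectors are assumptions for illustration; AUC is computed with the rank-based (Mann-Whitney) formulation, which equals the area under the ROC curve.

```r
# Accuracy, precision, recall and F1 from 0/1 truth and 0/1 predictions.
binary_metrics <- function(truth, pred) {
  tp <- sum(pred == 1 & truth == 1)
  tn <- sum(pred == 0 & truth == 0)
  fp <- sum(pred == 1 & truth == 0)
  fn <- sum(pred == 0 & truth == 1)
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  c(accuracy = (tp + tn) / length(truth),
    precision = precision,
    recall = recall,
    f1 = 2 * precision * recall / (precision + recall))
}

# AUC as the probability that a random positive scores above a random negative.
auc_rank <- function(truth, score) {
  r <- rank(score)  # ties receive average ranks
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: 3 positives, 5 negatives (made-up scores).
truth <- c(1, 1, 1, 0, 0, 0, 0, 0)
score <- c(0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05)
pred <- as.integer(score >= 0.5)

m <- binary_metrics(truth, pred)
auc <- auc_rank(truth, score)
m
auc
```

On this toy input accuracy is 0.75 while precision, recall and F₁ are each 2/3, which illustrates why accuracy alone can be misleading when the classes are imbalanced.
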

Screencast

Team

Subhajit Mondal
Jalaj Vora
Subhankar Patra
Shivam Singh
Roshmitha Thummala

R packages

The following packages must be installed in RStudio:

## The script installs the necessary packages if not already installed, and then loads them

packages <- c(
  "caret",
  "imbalance",
  "e1071",
  "randomForest",
  "imager",
  "wvtool",
  "rpart",
  "ROSE",
  "dplyr",
  "tidyr",
  "ggplot2",
  "rmarkdown",
  "tidyverse",
  "kableExtra",
  "knitr",
  "jsonlite",
  "crul",
  "rpart.plot",
  "pROC",
  "plotROC",
  "magick",
  "funModeling",
  "DataExplorer",
  "repr",
  "factoextra",
  "pander",
  "klaR",
  "janitor",
  "mlbench",
  "MLmetrics"
)

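# Install any packages that are missing, then attach all of them.
verify.packages <- function(pkg) {
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg) > 0)
    install.packages(new.pkg, dependencies = TRUE)
  # library() needs character.only = TRUE to accept package names as strings.
  invisible(sapply(pkg, library, character.only = TRUE))
}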

verify.packages(packages)

References

[1] World Health Organisation, Novel Coronavirus – China 2020, (2020). https://www.who.int/csr/don/12-january-2020-novel-coronavirus-china/en/.

[2] J. Zhao, Y. Zhang, X. He, P. Xie, COVID-ct-dataset: A ct scan dataset about covid-19, ArXiv. abs/2003.13865 (2020).

[3] M. Wiberg, A. Taylor, D. Rosner, Responding to the covid-19 pandemic: An invitation, Interactions. 27 (2020) 5. https://doi.org/10.1145/3392538.

[4] L. Wang, A. Wong, COVID-net: A tailored deep convolutional neural network design for detection of covid-19 cases from chest x-ray images, arXiv Preprint arXiv:2003.09871. (2020).

[5] A. Narin, C. Kaya, Z. Pamuk, Automatic detection of coronavirus disease (covid-19) using x-ray images and deep convolutional neural networks, arXiv Preprint arXiv:2003.10849. (2020).

[6] M. Karim, T. Döhmen, D. Rebholz-Schuhmann, S. Decker, M. Cochez, O. Beyan, others, Deepcovidexplainer: Explainable covid-19 predictions based on chest x-ray images, arXiv Preprint arXiv:2004.04582. (2020).

[7] J.P. Cohen, P. Morrison, L. Dao, COVID-19 image data collection, arXiv 2003.11597. (2020). https://github.com/ieee8023/covid-chestxray-dataset.

[8] S. Wang, B. Kang, J. Ma, X. Zeng, M. Xiao, J. Guo, M. Cai, J. Yang, Y. Li, X. Meng, others, A deep learning algorithm using ct images to screen for corona virus disease (covid-19), MedRxiv. (2020).

[9] L. Li, L. Qin, Z. Xu, Y. Yin, X. Wang, B. Kong, J. Bai, Y. Lu, Z. Fang, Q. Song, others, Artificial intelligence distinguishes covid-19 from community acquired pneumonia on chest ct, Radiology. (2020) 200905.

[10] D. Singh, V. Kumar, M. Kaur, Classification of covid-19 patients from chest ct images using multi-objective differential evolution–based convolutional neural networks, European Journal of Clinical Microbiology & Infectious Diseases. (2020) 1–11.

[11] D.S. Kermany, M. Goldbaum, W. Cai, C.C.S. Valentim, H. Liang, S.L. Baxter, A. McKeown, G. Yang, X. Wu, F. Yan, J. Dong, M.K. Prasadha, J. Pei, M.Y.L. Ting, J. Zhu, C. Li, S. Hewett, J. Dong, I. Ziyar, A. Shi, R. Zhang, L. Zheng, R. Hou, W. Shi, X. Fu, Y. Duan, V.A.N. Huu, C. Wen, E.D. Zhang, C.L. Zhang, O. Li, X. Wang, M.A. Singer, X. Sun, J. Xu, A. Tafreshi, M.A. Lewis, H. Xia, K. Zhang, Identifying medical diagnoses and treatable diseases by image-based deep learning, Cell. 172 (2018) 1122–1131.e9. https://doi.org/10.1016/j.cell.2018.02.010.

[12] R.M. Pereira, D. Bertolini, L.O. Teixeira, C.N. Silla Jr, Y.M. Costa, COVID-19 identification in chest x-ray images on flat and hierarchical classification scenarios, Computer Methods and Programs in Biomedicine. (2020) 105532.

[13] L. Nanni, A. Lumini, S. Brahnam, Local binary patterns variants as texture descriptors for medical image analysis, Artificial Intelligence in Medicine. 49 (2010) 117–125.

[14] C. Bunkhumpornpat, K. Sinapiromsaran, C. Lursinsap, DBSMOTE: Density-based synthetic minority over-sampling technique, Applied Intelligence. 36 (2012) 664–684.

[15] H. Han, W.-Y. Wang, B.-H. Mao, Borderline-smote: A new over-sampling method in imbalanced data sets learning, in: International Conference on Intelligent Computing, Springer, 2005: pp. 878–887.

[16] H. He, Y. Bai, E.A. Garcia, S. Li, ADASYN: Adaptive synthetic sampling approach for imbalanced learning, in: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, 2008: pp. 1322–1328.

[17] A. Fernández, V. López, M. Galar, M.J. Del Jesus, F. Herrera, Analysing the classification of imbalanced data-sets with multiple classes: Binarization techniques and ad-hoc approaches, Knowledge-Based Systems. 42 (2013) 97–110.

[18] A. Albahri, R.A. Hamid, others, Role of biological data mining and machine learning techniques in detecting and diagnosing the novel coronavirus (covid-19): A systematic review, Journal of Medical Systems. 44 (2020).