COVID-19, the disease caused by the SARS-CoV-2 virus, originated in the city of Wuhan, China, and has become a worldwide pandemic [1]. It has recently become an active research topic in the Artificial Intelligence community. The current testing mechanism, RT-PCR kits [2], is in short supply and of limited efficiency: a test usually takes up to 4-6 hours to produce a result, which is not an optimal way forward while the number of registered COVID-19 patients grows exponentially. This problem motivated the use of Machine Learning methods to help flatten the curve [3], and led to the goal of building classifiers that can diagnose patients as COVID-19 negative or positive based on their X-ray images [4,5]. Because this approach consumes less time and fewer resources, it is expected to be more streamlined than RT-PCR kits. In addition to a good prediction, we also need reasons that justify which features are responsible for the diagnosis [6].
With this idea and motivation in hand, our work experiments with building classifiers that use chest X-ray (CXR) images as ground truth to predict whether an X-ray image is COVID-19 negative or positive. Alongside this, we try to identify the features that contribute to the classification of an image and to explain why such behaviour was observed.
With the motivation to help fight and analyse COVID-19, we arrived at the following research question:
Can we use Machine Learning methods to diagnose COVID-19 and explain the prediction?
To answer this question, we aim to answer a few sub-questions:
Our dataset consists of 313 positive COVID-19 CXR images and 1000 negative CXR images, collected from four different sources to build our own version of the dataset. The sources are the COVIDx dataset of [6], the Kaggle CXR Pneumonia dataset by Paul Mooney, CXR images of adult subjects from the RSNA Pneumonia Detection Challenge, and the original and augmented versions of COVID-19 examples from [7].
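The class counts above imply a noticeable imbalance between the two classes. The short sketch below, which uses only base R and the two counts stated in the text, shows how the class shares, the imbalance ratio, and a majority-class baseline could be computed. The variable names are illustrative and not taken from the original analysis.

```r
# Class counts as reported in the text.
class_counts <- c(positive = 313, negative = 1000)

# Share of each class in the full dataset.
class_share <- class_counts / sum(class_counts)

# Imbalance ratio: majority-class size divided by minority-class size.
imbalance_ratio <- max(class_counts) / min(class_counts)

# Accuracy of a trivial classifier that always predicts the majority class.
# Any useful model has to beat this baseline.
majority_baseline_accuracy <- max(class_share)

print(round(class_share, 3))
print(round(imbalance_ratio, 2))
print(round(majority_baseline_accuracy, 3))
```

Because a classifier that always answers "negative" already reaches the baseline accuracy printed above, accuracy alone can be misleading on this dataset, which is one reason to report the F1-Score alongside it.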
According to [2,6,8–10], CT-scan data would be the gold standard for our task and also yields good results in terms of Accuracy and F1-Score. However, because very little CT-scan data is publicly available, we use chest X-rays as our dataset. Although CXR images are not as competitive as CT scans in terms of quality, [11] suggests that CXR is sufficient and comparable to CT scans for diagnosing COVID-19 patients.
In particular, we have used the COVID-19 Dataset-Repo as our ground truth.
We followed a typical data science pipeline: pre-processing of the dataset, feature extraction and selection, feeding the resulting descriptors (trainable vectors) to different classifiers for training and testing, and finally evaluation based on the predictors' results. The details are described in the following sections.
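As a rough illustration of the pipeline stages listed above, the following base R sketch runs them end to end on synthetic data. Everything in it is an assumption made for illustration: the descriptors are random numbers rather than features extracted from CXR images, feature selection is skipped, logistic regression (`glm`) stands in for whichever classifiers are actually used, and the 80/20 split and 0.5 threshold are arbitrary choices.

```r
set.seed(42)

# Synthetic stand-in for image descriptors (NOT real CXR features):
# 313 "positive" and 1000 "negative" rows with 5 numeric features each.
n_pos <- 313
n_neg <- 1000
n_feat <- 5
features <- rbind(
  matrix(rnorm(n_pos * n_feat, mean = 1), ncol = n_feat),
  matrix(rnorm(n_neg * n_feat, mean = 0), ncol = n_feat)
)
colnames(features) <- paste0("f", seq_len(n_feat))
feat_cols <- colnames(features)
descriptors <- data.frame(features)
descriptors$label <- c(rep(1, n_pos), rep(0, n_neg))

# Stratified 80/20 split so both sets keep the class proportions.
rows_by_class <- split(seq_len(nrow(descriptors)), descriptors$label)
train_idx <- unlist(lapply(
  rows_by_class,
  function(idx) sample(idx, floor(0.8 * length(idx)))
))
train <- descriptors[train_idx, ]
test <- descriptors[-train_idx, ]

# Pre-processing: standardise features using training-set statistics only.
mu <- colMeans(train[, feat_cols])
sigma <- apply(train[, feat_cols], 2, sd)
train[, feat_cols] <- as.data.frame(
  scale(train[, feat_cols], center = mu, scale = sigma)
)
test[, feat_cols] <- as.data.frame(
  scale(test[, feat_cols], center = mu, scale = sigma)
)

# Classifier: logistic regression as a placeholder model.
model <- glm(label ~ ., data = train, family = binomial())
prob <- predict(model, newdata = test, type = "response")
pred <- as.integer(prob > 0.5)

# Evaluation: accuracy and F1-Score for the positive class.
tp <- sum(pred == 1 & test$label == 1)
fp <- sum(pred == 1 & test$label == 0)
fn <- sum(pred == 0 & test$label == 1)
accuracy <- mean(pred == test$label)
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
f1 <- 2 * precision * recall / (precision + recall)

print(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1))
```

Computing the scaling parameters on the training rows only, and then applying them to the test rows, keeps information from the test set out of the fitted model.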
The following packages must be installed in RStudio:
## The script installs the necessary packages if not already installed, and then loads them
packages <- c(
  "caret",
  "imbalance",
  "e1071",
  "randomForest",
  "imager",
  "wvtool",
  "rpart",
  "ROSE",
  "dplyr",
  "tidyr",
  "ggplot2",
  "rmarkdown",
  "tidyverse",
  "kableExtra",
  "knitr",
  "jsonlite",
  "crul",
  "rpart.plot",
  "pROC",
  "plotROC",
  "magick",
  "funModeling",
  "DataExplorer",
  "repr",
  "factoextra",
  "pander",
  "klaR",
  "janitor",
  "mlbench",
  "MLmetrics"
)
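verify.packages <- function(pkg) {
  # Install only the packages that are not already present
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg) > 0) {
    install.packages(new.pkg, dependencies = TRUE)
  }
  # Attach every package; invisible() suppresses the printed return value
  invisible(sapply(pkg, library, character.only = TRUE))
}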
verify.packages(packages)
[1] World Health Organization, Novel Coronavirus – China 2020, (2020). https://www.who.int/csr/don/12-january-2020-novel-coronavirus-china/en/.
[2] J. Zhao, Y. Zhang, X. He, P. Xie, COVID-CT-Dataset: A CT scan dataset about COVID-19, ArXiv. abs/2003.13865 (2020).
[3] M. Wiberg, A. Taylor, D. Rosner, Responding to the COVID-19 pandemic: An invitation, Interactions. 27 (2020) 5. https://doi.org/10.1145/3392538.
[4] L. Wang, A. Wong, COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images, arXiv Preprint arXiv:2003.09871. (2020).
[5] A. Narin, C. Kaya, Z. Pamuk, Automatic detection of coronavirus disease (COVID-19) using X-ray images and deep convolutional neural networks, arXiv Preprint arXiv:2003.10849. (2020).
[6] M. Karim, T. Döhmen, D. Rebholz-Schuhmann, S. Decker, M. Cochez, O. Beyan, others, DeepCOVIDExplainer: Explainable COVID-19 predictions based on chest X-ray images, arXiv Preprint arXiv:2004.04582. (2020).
[7] J.P. Cohen, P. Morrison, L. Dao, COVID-19 image data collection, arXiv 2003.11597. (2020). https://github.com/ieee8023/covid-chestxray-dataset.
[8] S. Wang, B. Kang, J. Ma, X. Zeng, M. Xiao, J. Guo, M. Cai, J. Yang, Y. Li, X. Meng, others, A deep learning algorithm using CT images to screen for corona virus disease (COVID-19), MedRxiv. (2020).
[9] L. Li, L. Qin, Z. Xu, Y. Yin, X. Wang, B. Kong, J. Bai, Y. Lu, Z. Fang, Q. Song, others, Artificial intelligence distinguishes COVID-19 from community acquired pneumonia on chest CT, Radiology. (2020) 200905.
[10] D. Singh, V. Kumar, M. Kaur, Classification of COVID-19 patients from chest CT images using multi-objective differential evolution–based convolutional neural networks, European Journal of Clinical Microbiology & Infectious Diseases. (2020) 1–11.
[11] D.S. Kermany, M. Goldbaum, W. Cai, C.C.S. Valentim, H. Liang, S.L. Baxter, A. McKeown, G. Yang, X. Wu, F. Yan, J. Dong, M.K. Prasadha, J. Pei, M.Y.L. Ting, J. Zhu, C. Li, S. Hewett, J. Dong, I. Ziyar, A. Shi, R. Zhang, L. Zheng, R. Hou, W. Shi, X. Fu, Y. Duan, V.A.N. Huu, C. Wen, E.D. Zhang, C.L. Zhang, O. Li, X. Wang, M.A. Singer, X. Sun, J. Xu, A. Tafreshi, M.A. Lewis, H. Xia, K. Zhang, Identifying medical diagnoses and treatable diseases by image-based deep learning, Cell. 172 (2018) 1122–1131.e9. https://doi.org/10.1016/j.cell.2018.02.010.
[12] R.M. Pereira, D. Bertolini, L.O. Teixeira, C.N. Silla Jr, Y.M. Costa, COVID-19 identification in chest X-ray images on flat and hierarchical classification scenarios, Computer Methods and Programs in Biomedicine. (2020) 105532.
[13] L. Nanni, A. Lumini, S. Brahnam, Local binary patterns variants as texture descriptors for medical image analysis, Artificial Intelligence in Medicine. 49 (2010) 117–125.
[14] C. Bunkhumpornpat, K. Sinapiromsaran, C. Lursinsap, DBSMOTE: Density-based synthetic minority over-sampling technique, Applied Intelligence. 36 (2012) 664–684.
[15] H. Han, W.-Y. Wang, B.-H. Mao, Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning, in: International Conference on Intelligent Computing, Springer, 2005: pp. 878–887.
[16] H. He, Y. Bai, E.A. Garcia, S. Li, ADASYN: Adaptive synthetic sampling approach for imbalanced learning, in: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, 2008: pp. 1322–1328.
[17] A. Fernández, V. López, M. Galar, M.J. Del Jesus, F. Herrera, Analysing the classification of imbalanced data-sets with multiple classes: Binarization techniques and ad-hoc approaches, Knowledge-Based Systems. 42 (2013) 97–110.
[18] A. Albahri, R.A. Hamid, others, Role of biological data mining and machine learning techniques in detecting and diagnosing the novel coronavirus (COVID-19): A systematic review, Journal of Medical Systems. 44 (2020).