Interpretable machine learning approach for predicting COVID-19 risk status of an individual
This study aimed to ascertain, using statistical feature selection methods and interpretable machine learning (ML) models, the features that best predict an individual's risk status ("Low", "Medium", "High") for COVID-19 infection. The study uses a publicly available dataset obtained from an online, web-based risk assessment calculator for COVID-19 infection. Of the 59 features, 57 were retained after filtering for multicollinearity with the Pearson correlation coefficient, and these were further reduced to 55 features by the LASSO GLM approach. The SMOTE resampling technique was used to address the problem of imbalanced class distribution. Interpretable ML algorithms were employed during the classification phase. The predictions of the best classifier were saved as a new target and approximated using a single decision tree classifier. To further build trust in, and explainability of, the best model, an XGBoost classifier was trained on the predictions of the best model as a global surrogate model. Explanations of individual XGBoost predictions were generated using the SHAP explainable-AI framework. The Random Forest classifier, trained on the 55 features retained by feature selection, emerged as the best classifier with a validation accuracy of 96.35 %. The decision tree classifier approximated the best classifier with a prediction accuracy of 92.23 % and a Matthews correlation coefficient of 0.8960. The XGBoost classifier approximated the best classifier with a prediction score of 99.7 %. This study identified COVID-19 positive, COVID-19 contacts, COVID-19 symptoms, Health workers, and Public transport count as the five most consistent features that predict an individual's risk of exposure to COVID-19.
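
As an illustration of how a surrogate's agreement with the best classifier can be scored, the following is a minimal sketch in plain Python that computes the prediction accuracy and the multiclass Matthews correlation coefficient between two label sequences. The function name `fidelity_scores` and the toy labels are assumptions made for this example; they are not taken from the study's code or data, and the study's own figures were presumably obtained with a standard ML library.

```python
from collections import Counter
from math import sqrt


def fidelity_scores(reference, surrogate):
    """Return (accuracy, multiclass MCC) of surrogate labels against reference labels.

    `reference` holds the labels predicted by the model being explained;
    `surrogate` holds the labels predicted by the surrogate model.
    """
    if not reference or len(reference) != len(surrogate):
        raise ValueError("label sequences must be non-empty and of equal length")

    s = len(reference)                                       # total samples
    c = sum(r == p for r, p in zip(reference, surrogate))    # agreements
    t = Counter(reference)                                   # per-class counts, reference
    p = Counter(surrogate)                                   # per-class counts, surrogate
    classes = set(t) | set(p)

    # Multiclass generalisation of the Matthews correlation coefficient.
    cov_tp = c * s - sum(t[k] * p[k] for k in classes)
    cov_tt = s * s - sum(t[k] ** 2 for k in classes)
    cov_pp = s * s - sum(p[k] ** 2 for k in classes)
    denom = sqrt(cov_tt * cov_pp)
    mcc = cov_tp / denom if denom else 0.0   # convention: 0 when undefined

    return c / s, mcc


# Hypothetical toy labels, for illustration only (not from the study's dataset).
best_model_labels = ["Low", "Low", "Medium", "Medium", "High", "High"]
surrogate_labels = ["Low", "Low", "Medium", "High", "High", "High"]

accuracy, mcc = fidelity_scores(best_model_labels, surrogate_labels)
print(f"accuracy={accuracy:.4f}, MCC={mcc:.4f}")
```

Accuracy alone can look favourable when one risk class dominates, which is one reason to report the Matthews correlation coefficient alongside it, as the abstract does for the decision tree surrogate.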