A machine learning approach to predict E. coli antibacterial resistance using whole-genome sequencing data
Abstract
Background: Antimicrobial resistance (AMR) is a significant global health threat, particularly in low- and middle-income countries (LMICs) such as Uganda, where reliable and rapid methods for detecting AMR in E. coli and other pathogens are scarce. This gap can lead to inappropriate treatment and the spread of drug-resistant infections. In this thesis, machine learning models were trained to predict E. coli resistance to ciprofloxacin (CIP), ampicillin (AMP), and cefotaxime (CTX) using whole-genome sequencing (WGS) data from England, where such data are more readily available. A separate Ugandan dataset was used for validation to assess how well the models generalize to an LMIC setting.
Methods: A total of 1496 (CIP), 1428 (CTX), and 1396 (AMP) sequences from England were divided into training and testing sets, and 42 sequences from Uganda were used for validation. Eight machine learning models were trained and tested: Logistic Regression (LR), Random Forest (RF), Gradient Boosting (GB), XGBoost (XGB), LightGBM (LGBM), CatBoost (CB), Feed-Forward Neural Network (FFNN), and Support Vector Machine (SVM). The models were evaluated on precision, recall, and Area Under the Receiver Operating Characteristic Curve (AUC-ROC). Upsampling was applied to address class imbalance in the data.
Results: Predictive performance varied significantly across antibiotics, underlining the critical role of model selection and dataset characteristics. During testing, the FFNN model performed best for CIP (accuracy 84%; F1 0.55; AUC 91%), LR for CTX (accuracy 91%; F1 0.37; AUC 83%), and GB for AMP (accuracy 57%; F1 0.62; AUC 53%), while the LGBM and RF models outperformed the others in some scenarios (p < 0.001). Upsampling did not significantly improve model performance, underscoring the complexity and high dimensionality of single-nucleotide polymorphism (SNP) data. Despite high accuracy scores on the Ugandan validation dataset (FFNN with CIP accuracy 95%, LR with AMP accuracy 98%, and GB with CTX accuracy 65%), the models struggled on recall because of severe class imbalance. Key mutations associated with resistance to these antibiotics were identified.
Conclusion: As the threat of AMR continues to rise, the successful application of these models, particularly on the Ugandan dataset, signals a promising avenue for improving AMR detection and treatment strategies in LMICs where genomic data are scarce. This work not only expands current understanding of the genetic underpinnings of AMR but also provides a robust methodological framework that can guide future research and applications in the fight against antimicrobial resistance.
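The gap between accuracy and recall reported for the validation set is a general property of imbalanced data, and a small sketch may make it concrete. The example below is illustrative only and is not the thesis pipeline: the label counts (50 isolates, 4 resistant) are hypothetical, and the helper names `classification_metrics` and `upsample_minority` are assumptions introduced here. It shows how a classifier that never predicts resistance can still score high accuracy with zero recall, and how simple random upsampling balances class counts in a training set.

```python
import random


def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for binary labels (1 = resistant)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def upsample_minority(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        # Nothing to duplicate if one class is missing entirely.
        return list(rows), list(labels)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(labels))) + extra
    return [rows[i] for i in idx], [labels[i] for i in idx]


# Hypothetical labels: 50 isolates, of which 4 are resistant.
y_true = [1] * 4 + [0] * 46
# A degenerate classifier that always predicts "susceptible".
y_pred = [0] * 50

metrics = classification_metrics(y_true, y_pred)
print(metrics)

# Upsampling is applied to training data only; row IDs stand in for SNP features.
rows = list(range(50))
rows_up, labels_up = upsample_minority(rows, y_true)
print(len(labels_up), sum(labels_up))
```

In this toy case accuracy is high while recall is zero, which is why accuracy alone is a weak summary when resistant isolates are rare. Upsampling would normally be applied only to the training split, so that the test and validation sets keep their original class proportions.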