Understanding genetic diversity and rapid drug resistance prediction in Mycobacterium tuberculosis from whole-genome sequence and other epidemiological data
Abstract
Tuberculosis (TB) remains a major global health problem, with an estimated 1.6 million deaths worldwide. The availability of whole-genome sequence (WGS) data offers an avenue for understanding genetic diversity and drug resistance (DR) mutations. We aimed to investigate the genetic diversity and relatedness of Mycobacterium tuberculosis (MTB) isolates among individuals with different CD4 cell counts, and to leverage machine learning (ML) algorithms to predict DR using WGS and epidemiological data from Uganda.
Methods: This was a cross-sectional study utilizing 226 WGS samples of MTB isolates from Uganda between 2013 and 2023. Associated patient demographic data and phenotypic drug susceptibility information were obtained. We used TB-Profiler for lineage and drug resistance prediction, and Snippy for variant calling and annotation. Phylogenetic analysis was performed on the core genome alignment file in MEGA. For ML model development, we split the data into training (80%) and testing (20%) datasets. The SMOTE technique was applied to address class imbalance. We evaluated various ML algorithms, including random forest (RF), logistic regression (LR), and boosting classifiers (AdaBoost, CatBoost, Gradient Boosting, and XGBoost, among others), for prediction of resistance to the antibiotics rifampicin, ethambutol, isoniazid, and streptomycin. Key metrics, including recall, precision, the receiver operating characteristic (ROC) curve, and the Matthews correlation coefficient (MCC), were used to assess model performance.
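To make two of the steps above concrete, the following is a minimal, standard-library sketch of the interpolation idea behind SMOTE and of how recall, precision, and MCC follow from a confusion matrix. It is not the study's pipeline: the function names (`smote_oversample`, `binary_metrics`), the toy feature vectors, and the toy labels are hypothetical and chosen only for illustration. Applying the oversampling to training data only is likewise an assumption of this sketch.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    minority sample and one of its k nearest minority neighbours
    (the core idea of SMOTE; a simplified illustration, not a library API)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def binary_metrics(y_true, y_pred):
    """Recall, precision, and Matthews correlation coefficient for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"recall": recall, "precision": precision, "mcc": mcc}


# Hypothetical training-set minority class (resistant isolates):
# two toy SNP indicators plus age, purely illustrative values.
resistant = [[1.0, 0.0, 34.0], [1.0, 1.0, 41.0], [0.0, 1.0, 29.0]]
new_points = smote_oversample(resistant, n_new=5, k=2)

# Hypothetical test-set labels (1 = resistant, 0 = susceptible).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(len(new_points), metrics)
```

Because the synthetic points are interpolated, binary SNP indicators can take fractional values; in practice a maintained implementation such as the one in the imbalanced-learn package would normally be used instead of this hand-written version. MCC is reported alongside recall and precision because it uses all four cells of the confusion matrix, which makes it informative when resistant isolates are rare.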
Results: Across the 203 MTB isolates, we observed 5 distinct phylogenetic lineages (L1-4, L3&L4), with L4 being the most prevalent at 149/203 (73.40%), followed by L3 at 46/203 (22.66%). The most common sub-lineage was L4.6.1.1/Uganda II. There was a statistical association between MTB lineage and CD4 cell count group (low or high). Overall, all ML algorithms were able to predict drug resistance; however, the boosting classifiers had the highest AUC values. Age, sex, and HIV status were significant features, in addition to the SNP positions, for ML model development.
Conclusion: Our findings on the circulating lineages, sub-lineages, and drug resistance profiles play a crucial role in understanding the genetic diversity of MTB. Additionally, our ML approach can robustly predict drug resistance and inform on the underlying gene mutations while utilizing both WGS (SNP) and epidemiological data.