dc.description.abstract | Introduction: As data patterns change, research has increasingly adopted self-learning approaches, commonly known as Machine Learning (ML). ML has previously been used to predict health outcomes such as early virological among others. These methodologies use a wide array of features to identify hidden patterns in the data and predict health outcomes. They also incorporate chance and variation arising from fluctuations in the environment, including factors not explicitly included in the model. In contrast, classical linear methods have several limitations: they assume linearity, fail to fully incorporate heterogeneity of effects, and cope poorly with the growing dimensionality of data because they cannot accommodate large numbers of predictors. This study aimed to compare linear and nonlinear advanced classification algorithms for predicting switching to a second-line ART regimen.
Objectives: The objective of this study was to compare logistic regression, a parametric linear method, with two non-parametric advanced ML algorithms, random forest (RF) and K-nearest neighbor (KNN), in correctly classifying patients who switch to second-line Antiretroviral Therapy (ART) regimens.
Methods: This study used secondary data on HIV patients from 15 HIV clinics under RHSP. R, Stata, and Python were used for data management and analysis. Logistic regression, random forest, and K-nearest neighbor models were fitted and compared on their discriminative ability. The models were also evaluated on average performance metrics: Area under Curve (AUC), sensitivity, F1 score, and overall accuracy.
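The evaluation metrics named above can be illustrated with a minimal sketch. It is not the study's own code: the function names, the 0.5 threshold, and the toy labels and scores are assumptions made purely for illustration, with label 1 standing for a patient who switched regimen.

```python
def classification_metrics(y_true, y_pred):
    """Sensitivity (recall), F1 score and accuracy for binary labels (1 = switched)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"sensitivity": sensitivity, "f1": f1, "accuracy": accuracy}


def auc(y_true, scores):
    """AUC as the probability that a random positive case receives a higher
    score than a random negative case (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): true labels and predicted switch probabilities.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cut-off

metrics = classification_metrics(y_true, y_pred)
auc_value = auc(y_true, scores)
```

Sensitivity, F1, and accuracy depend on the chosen cut-off, whereas AUC summarises ranking quality across all cut-offs, which is why it is often the headline measure for imbalanced outcomes.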
Results: The majority of the patients were female (62.4%), and most (52.1%) were aged 20-34 years at enrollment. Of the 7818 patients, 5% had switched to a second-line ART regimen. Comparison of the fitted models indicated that all the models performed better with balanced data than with imbalanced data. The Area under Curve (AUC) for the balanced data logistic classifier, 68.8% (95% CI 68.0 – 69.2), was significantly higher than that of the balanced data RF model, 56.9% (95% CI 53.4 – 58.6), and the balanced data KNN model, 65.1% (95% CI 64.3 – 65.6). There was no statistically significant difference in the F1 measure among the three models. However, the balanced data logistic classifier had the highest AUC and recall score of all the models.
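Balancing an imbalanced outcome is commonly done by oversampling the minority class. The sketch below shows the core idea behind SMOTE-style oversampling: each synthetic point is placed on the line segment between a minority-class point and one of its nearest minority-class neighbours. It is a simplified illustration under stated assumptions, not the study's implementation; the function name `smote_sketch`, its parameters, and the toy feature vectors are hypothetical.

```python
import random
from math import dist


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating between a
    randomly chosen minority point and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority class (hypothetical, two numeric features per patient).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote_sketch(minority, n_new=4, k=2)
```

Resampling of this kind is normally applied to the training data only, so that the evaluation data keep the original class proportions.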
Conclusion: This study indicated that parametric linear classifiers such as logistic regression are good predictors of switching to a second-line ART regimen when combined with appropriate resampling strategies, such as the Synthetic Minority Oversampling Technique (SMOTE), that balance the classes when the data are imbalanced. | en_US |