A stroke prediction model in imbalanced data: the case of an improved cost function for logistic regression

dc.contributor.author Luwandagga, Andrew
dc.date.accessioned 2025-11-10T15:58:39Z
dc.date.available 2025-11-10T15:58:39Z
dc.date.issued 2025
dc.description A dissertation submitted to the Directorate of Research and Graduate Training in partial fulfilment of the requirements for the award of the Degree of Master of Statistics of Makerere University.
dc.description.abstract The accurate prediction of rare events, such as strokes, is often hindered by class imbalance, where the observations are unevenly distributed across the categories of the response variable. In such cases, standard classifiers are biased toward the majority class and perform poorly in identifying minority events. Logistic regression, while widely used, is affected by this limitation, highlighting the need for methods that address class imbalance. This study compared the performance of three models: binary logistic regression, balanced logistic regression, and Grid Search Regularized Weighted Logistic Regression (GSRWLR), which integrates class weighting, L2 regularization, and grid search for hyperparameter tuning. The models were fitted to 5,110 records from Kaggle's Stroke Prediction Dataset (2020). The binary logistic regression model had poor sensitivity (1.33%), identifying only one stroke case; balanced logistic regression improved sensitivity to 77.33%, and GSRWLR raised it further to 93.33%. The area under the receiver operating characteristic curve (AUC-ROC) was similar across models (≈0.83), indicating comparable overall classification ability, so the advantage of GSRWLR lies in identifying rare events. Age, heart disease, average glucose level, and hypertension were identified as key predictors of stroke. The study concludes that optimizing logistic regression with regularization, class weighting, and grid search significantly improves stroke prediction in imbalanced datasets. Health systems could use GSRWLR to enhance the detection of stroke, particularly in under-resourced environments such as Ugandan hospitals. Future research should validate the GSRWLR model on other stroke datasets, including patient data from Uganda, to improve the generalizability and usefulness of the findings.
dc.identifier.citation Luwandagga, A. (2025). A stroke prediction model in imbalanced data: the case of an improved cost function for logistic regression (Unpublished master’s dissertation). Makerere University, Kampala, Uganda.
dc.identifier.uri https://makir.mak.ac.ug/handle/10570/14866
dc.language.iso en
dc.publisher Makerere University
dc.title A stroke prediction model in imbalanced data: the case of an improved cost function for logistic regression
dc.type Thesis
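
The abstract describes GSRWLR as a combination of class weighting, L2 regularization, and grid search. The sketch below illustrates that general idea in plain Python and is not the dissertation's implementation. Everything in it is an assumption made for illustration: the synthetic two-feature data with roughly 5% positives, the function names, the candidate grid, the single hold-out split, and the use of the mean of sensitivity and specificity as the selection score.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def weighted_l2_cost(w, b, X, y, pos_weight, lam):
    # Class-weighted log loss plus an L2 penalty on the coefficients (not the intercept).
    eps = 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(dot(w, xi) + b)
        weight = pos_weight if yi == 1 else 1.0
        total -= weight * (yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps))
    return total / len(X) + 0.5 * lam * sum(wj * wj for wj in w)

def fit(X, y, pos_weight, lam, lr=0.1, epochs=200):
    # Batch gradient descent on the weighted, regularized cost.
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = (pos_weight if yi == 1 else 1.0) * (sigmoid(dot(w, xi) + b) - yi)
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * (gj / n + lam * wj) for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def sens_spec(w, b, X, y, threshold=0.5):
    # Sensitivity (recall on positives) and specificity (recall on negatives).
    tp = fn = tn = fp = 0
    for xi, yi in zip(X, y):
        pred = 1 if sigmoid(dot(w, xi) + b) >= threshold else 0
        if yi == 1:
            tp += pred
            fn += 1 - pred
        else:
            fp += pred
            tn += 1 - pred
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec

def make_data(n, pos_rate=0.05):
    # Synthetic imbalanced data (assumed): positives are shifted along both features.
    X, y = [], []
    for _ in range(n):
        label = 1 if random.random() < pos_rate else 0
        shift = 1.5 if label else 0.0
        X.append([random.gauss(shift, 1.0), random.gauss(shift, 1.0)])
        y.append(label)
    return X, y

random.seed(0)
X_train, y_train = make_data(400)
X_val, y_val = make_data(200)

# Hypothetical grid: weight on the positive class and L2 penalty strength.
grid = [(pw, lam) for pw in (1.0, 5.0, 20.0) for lam in (0.0, 0.01, 0.1)]
best = None
for pw, lam in grid:
    w, b = fit(X_train, y_train, pw, lam)
    sens, spec = sens_spec(w, b, X_val, y_val)
    score = (sens + spec) / 2  # assumed selection criterion
    if best is None or score > best["score"]:
        best = {"pos_weight": pw, "lam": lam, "w": w, "b": b,
                "sensitivity": sens, "specificity": spec, "score": score}

print(best["pos_weight"], best["lam"], best["sensitivity"], best["specificity"])
```

Setting `pos_weight` to 1.0 and `lam` to 0.0 recovers ordinary unweighted logistic regression, so the same grid also covers the unweighted baseline. Selecting on sensitivity alone would always favour the largest class weight, which is why the sketch scores candidates on a criterion that also accounts for specificity.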