Reliability of predictions using hybrid algorithms and its application to dimensionality reduction
Abstract
The reliability of predictions from individual data mining techniques is a complex problem. It is limited by weaknesses that cut across individual techniques: collinearity arising from the high dimensionality of attributes in a dataset, bias due to underfitting and overfitting, noise accumulation due to outliers, and failure to account for class imbalance in imbalanced data. This study aimed to address these drawbacks by developing a hybrid data mining algorithm for reliable class prediction.
The decision tree and naïve Bayes classifiers were combined to build a hybrid prediction algorithm. A decision tree based on the C4.5 algorithm was used to extract the important attributes, and the gain ratio values of those attributes were used as input weights to construct a weighted naïve Bayes classifier.
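The weighting scheme described above can be illustrated with a minimal sketch. It assumes categorical attributes, Laplace smoothing, and a decision rule in which each attribute's log-likelihood is multiplied by its gain ratio; the study's exact formulation may differ. The names `gain_ratio` and `WeightedNaiveBayes` and the toy data are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    """C4.5-style gain ratio of the categorical attribute at column index attr."""
    n = len(rows)
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[row[attr]].append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy([row[attr] for row in rows])
    return gain / split_info if split_info > 0 else 0.0

class WeightedNaiveBayes:
    """Naive Bayes in which attribute i contributes weights[i] * log P(x_i | c)."""

    def fit(self, rows, labels, weights, alpha=1.0):
        self.weights = weights
        self.alpha = alpha                   # Laplace smoothing (an assumption)
        self.n = len(labels)
        self.class_counts = Counter(labels)
        self.counts = defaultdict(Counter)   # (attribute, class) -> value counts
        self.values = defaultdict(set)       # attribute -> distinct values seen
        for row, y in zip(rows, labels):
            for i, v in enumerate(row):
                self.counts[(i, y)][v] += 1
                self.values[i].add(v)
        return self

    def predict(self, row):
        best, best_score = None, -math.inf
        for c, nc in self.class_counts.items():
            score = math.log(nc / self.n)    # log prior
            for i, v in enumerate(row):
                p = (self.counts[(i, c)][v] + self.alpha) / (
                    nc + self.alpha * len(self.values[i]))
                score += self.weights[i] * math.log(p)
            if score > best_score:
                best, best_score = c, score
        return best

# Toy data (hypothetical): attribute 0 determines the class, attribute 1 is noise.
rows = [("sunny", "a"), ("sunny", "b"), ("rain", "a"), ("rain", "b")]
labels = ["yes", "yes", "no", "no"]
weights = [gain_ratio(rows, labels, a) for a in range(2)]
model = WeightedNaiveBayes().fit(rows, labels, weights)
print(weights, model.predict(("sunny", "b")))
```

In this sketch an attribute with a gain ratio of zero drops out of the decision rule entirely, so the weights also act as a soft form of attribute selection.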
The goodness of fit of all the data mining models was assessed using k-fold cross-validation, with confusion matrices computed on previously unseen imbalanced data. Accuracy, F-measure and the area under the receiver operating characteristic curve (AUC) were the key performance metrics used to evaluate the generalizability of the hybrid model in comparison with the independent models.
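The evaluation procedure can be sketched as follows. This is an illustrative outline, not the study's code: it assumes a binary problem with positive class 1, assigns sample i to fold i mod k, pools the out-of-fold scores before computing the metrics, and uses a hypothetical one-feature midpoint classifier (`fit_midpoint`, `score_midpoint`) as a stand-in for the actual models.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists; sample i is assigned to fold i % k."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test

def confusion(y_true, y_pred, positive=1):
    """Return the confusion-matrix cells as (tp, fp, fn, tn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def f_measure(tp, fp, fn, tn):
    """Harmonic mean of precision and recall (F1)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def auc(y_true, scores, positive=1):
    """Chance that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def cross_validate(xs, ys, k, fit, score):
    """Pool out-of-fold scores, then compute accuracy, F-measure and AUC."""
    y_true, y_score = [], []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        y_true += [ys[i] for i in test]
        y_score += [score(model, xs[i]) for i in test]
    y_pred = [1 if s > 0 else 0 for s in y_score]
    cm = confusion(y_true, y_pred)
    return {"accuracy": accuracy(*cm), "f_measure": f_measure(*cm),
            "auc": auc(y_true, y_score)}

# Hypothetical stand-in model: threshold halfway between the two class means.
def fit_midpoint(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y != 1]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def score_midpoint(threshold, x):
    return x - threshold

# Toy imbalanced data (hypothetical): 3 positives among 12 samples.
xs = [0.9, 0.1, 0.2, 0.3, 0.8, 0.15, 0.25, 0.35, 0.7, 0.4, 0.05, 0.45]
ys = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
result = cross_validate(xs, ys, k=3, fit=fit_midpoint, score=score_midpoint)
print(result)
```

On imbalanced data, accuracy alone can look high for a model that ignores the minority class, which is one reason to report F-measure and AUC alongside it. A stratified fold assignment would keep the class ratio similar across folds; the simple modulo split here is a simplification.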
The results revealed that the proposed hybrid model outperformed the independent decision tree and naïve Bayes classifiers on all demonstration datasets.
Hence, merging several independent homogeneous predictive data mining techniques may enhance the accuracy of the estimates, leading to more reliable predictions.