A Text Mining Approach for Automated Structuring of Patient-Doctor Interactions for Improved Decision Making
ABSTRACT Over 80% of information captured in healthcare environments is currently stored as text. Manually extracting meaningful information from textual healthcare data is labor-intensive, tedious, and time-consuming. Jargon, abbreviations, and grammatical complexity make this data extremely difficult to analyze. Analyzing it requires advanced techniques, known as text mining, to reveal hidden relationships and trends and to discover knowledge. In Uganda, little effort has been put into utilizing the large amounts of existing unstructured healthcare data. The Medical Concierge Group (TMCG) in Uganda provides medical services to patients through a Medical Call Center. Vast amounts of textual data are collected and stored daily through these patient-doctor interactions. Unfortunately, because the data is unstructured, little is known about this rich source of information. If properly utilized, however, this data can help improve decision making and the quality of service delivery at the Call Center. In this research, we design text mining models using both unsupervised and supervised machine learning techniques to extract meaning from textual medical call records. We evaluate the performance of the models and compare their results. We further use n-grams and visualization tools to improve model output and to present results in more meaningful formats. To the best of our knowledge, this study is the first in Uganda to explore the use of text mining models to reveal meaningful information embedded in unstructured healthcare data through the discovery of hidden semantic structures in the text. We use textual medical records captured at a Medical Call Center in Uganda. Based on our results, Non-Negative Matrix Factorization outperforms the more commonly used Latent Dirichlet Allocation in generating simple and coherent topics, while Latent Semantic Analysis produces the lowest-quality topics. 
Support Vector Machine outperforms Logistic Regression and Naïve Bayes models in text classification tasks. Our results also show that n-grams and visualization tools can be effective in improving model results while adding context to generated topics. According to our results, HIV is the most commonly mentioned term at the Call Center, and HIV/AIDS is the most common topic of discussion. However, concerns such as malaria and child health are also key topics and must be given priority. In conclusion, text mining models can be used to reveal hidden knowledge in unstructured medical records and to expose relationships that provide important information for improved decision making and for guiding national health interventions.
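
To make concrete how n-grams can add context to single terms, the following is a minimal sketch that counts the most frequent unigrams and bigrams in a handful of call notes. It is not the pipeline used in this study: the helper names (tokenize, ngrams, top_ngrams) and the sample records are hypothetical and invented for illustration, and only the Python standard library is assumed.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(records, n, k):
    """Count n-grams over all records and return the k most frequent."""
    counts = Counter()
    for record in records:
        counts.update(ngrams(tokenize(record), n))
    return counts.most_common(k)


# Hypothetical call notes, invented for illustration only.
records = [
    "Caller asked about HIV test results",
    "Caller asked where to get an HIV test",
    "Mother asked about malaria treatment for her child",
]

print(top_ngrams(records, 1, 3))  # most frequent single terms
print(top_ngrams(records, 2, 3))  # most frequent two-word phrases
```

In a sketch like this, a unigram count can only show that a term such as "hiv" is frequent, whereas the bigram count surfaces phrases such as "hiv test" that indicate what callers were asking about.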