An efficient approach for medical text categorization based on clustering and similarity measures

Abstract

The huge amount of medical information available in medical documents makes automated text categorization methods essential in clinical diagnosis and treatment. Automatic categorization of a text provides information about the classes to which the text belongs. This paper proposes a text categorization algorithm, based on similarity to cluster centers, that can serve as a medical diagnosis tool for categorizing the records of patients with eye diseases. We propose the VEMST algorithm as an update to the EMST algorithm that uses variance to find cluster centers. The text categorization algorithm uses two similarity measures (cosine and common words) to classify the categorical data. The results show that classification accuracy increases as the number and size of the medical documents used for training grow. Comparing medical terms in the preprocessing phase gives better accuracy than using the frequency of all terms in the medical document, and also the lowest execution time. Finally, the system performs better with the cosine similarity measure than with the common-words similarity measure.
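To make the categorization step concrete, the following minimal Python sketch assigns a record to the class whose cluster center is most similar, using either of the two similarity measures. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not define the common-words measure, so the ratio used here (shared distinct terms divided by the vocabulary size of the smaller text) is an assumption, and the function names, the whitespace tokenizer, and the toy cluster centers are all hypothetical. The VEMST clustering that would produce the centers is not shown.

```python
import math
from collections import Counter


def term_counts(text):
    """Lowercase the text, split on whitespace, and count each term."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def common_words_similarity(a, b):
    """Assumed definition: shared distinct terms over the smaller vocabulary."""
    if not a or not b:
        return 0.0
    shared = len(a.keys() & b.keys())
    return shared / min(len(a), len(b))


def categorize(text, centers, similarity):
    """Return the label of the cluster center most similar to the text."""
    doc = term_counts(text)
    return max(centers, key=lambda label: similarity(doc, centers[label]))


# Hypothetical cluster centers; in the paper these would come from clustering.
centers = {
    "glaucoma": term_counts("glaucoma optic nerve pressure intraocular pressure"),
    "cataract": term_counts("cataract lens cloudy blurred vision lens"),
}

record = "patient reports raised intraocular pressure and optic nerve damage"
label_cosine = categorize(record, centers, cosine_similarity)
label_common = categorize(record, centers, common_words_similarity)
```

Passing the similarity measure as a parameter keeps the categorization logic identical for both measures, so their accuracy and running time can be compared on the same cluster centers.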