An Improved Algorithm for Data Preprocessing in Mining Crime Data Set

Abstract

This paper presents an improved data preprocessing algorithm that handles missing values and smooths outliers in real-world data sets. Previous work in this field mainly replaces missing values with the average, the class average, the most common value, or similar estimates, and generally removes outliers from the data set. Crime and criminal data sets have their own special characteristics: missing values and outliers carry different meanings than in other fields, so they need to be processed differently. The algorithm first uses clustering to group objects according to their similarities and dissimilarities, then smooths the outliers and fills in the missing values according to the cluster each object belongs to. WEKA is used as the tool to find the different clusters of criminals.
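The cluster-based idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: it assumes the cluster labels have already been produced (for example, exported from WEKA), that each cluster has at least one observed value, and that "smoothing" means clamping a value to the cluster mean plus or minus z standard deviations. The function name `preprocess_by_cluster` and the threshold `z` are hypothetical choices made for illustration only.

```python
from statistics import mean, pstdev


def preprocess_by_cluster(values, labels, z=2.0):
    """Fill missing values (None) with the mean of their cluster and clamp
    outliers to mean +/- z * std of that cluster.

    Assumptions (not from the paper): labels come from a prior clustering
    step, every cluster has at least one observed value, and clamping is
    used as the smoothing rule.
    """
    # Collect the observed (non-missing) values of each cluster.
    observed = {}
    for v, c in zip(values, labels):
        if v is not None:
            observed.setdefault(c, []).append(v)

    # Per-cluster mean and population standard deviation.
    stats = {c: (mean(vs), pstdev(vs)) for c, vs in observed.items()}

    cleaned = []
    for v, c in zip(values, labels):
        m, s = stats[c]
        if v is None:
            # Missing value: impute with the cluster mean.
            cleaned.append(m)
        else:
            # Outlier smoothing: pull the value back into the cluster range.
            lo, hi = m - z * s, m + z * s
            cleaned.append(min(max(v, lo), hi))
    return cleaned


# Two clusters of ages; None marks a missing value.
ages = [20, 22, None, 21, 60, 62, None, 61]
clusters = [0, 0, 0, 0, 1, 1, 1, 1]
cleaned_ages = preprocess_by_cluster(ages, clusters)
```

In this sketch each missing age is replaced by the mean of its own cluster rather than by the global mean, and an extreme value is kept but moved to the edge of its cluster's range instead of being deleted from the data set.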