"Improving Document Processing and Indexing by Preprocessing and Tokenization"

Abstract

Information Retrieval is the science of searching for information within documents. The number of documents is huge and still growing, which makes it difficult to find information that matches a user's requirements; consequently, many algorithms grounded in long-running research in information retrieval and data mining have been proposed. In this paper we analyze the documents in the collection Sense and Sensibility, available on the web page under the subheading "data files". We download these files and build programs that index the collection of documents and compute text statistics across the corpus. Text processing (or document processing) includes tokenization; preprocessing (converting upper-case letters to lower case, Unicode conversion, and removing diacritics, punctuation, and numbers); stop-word removal; and stemming. These steps save indexing time and space, especially for a large data set. The experimental results at the end of this paper confirm the reliability and efficiency of the algorithms.
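The processing steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the stop-word list is a small assumed sample, and the `stem` function is a crude suffix stripper standing in for a real stemmer such as Porter's.

```python
import re
import unicodedata

# Assumed sample stop-word list for illustration; real systems use a larger one
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "it", "is", "was"}

def preprocess(text):
    # Lowercase, then strip diacritics via Unicode NFKD normalization
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(c for c in text if not unicodedata.combining(c))
    # Remove punctuation and numbers, keeping only letters and whitespace
    return re.sub(r"[^a-z\s]", " ", text)

def stem(token):
    # Crude suffix stripping as a stand-in for a real stemming algorithm
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def tokenize(text):
    # Full pipeline: preprocess, split on whitespace, drop stop words, stem
    tokens = preprocess(text).split()
    return [stem(t) for t in tokens if t not in STOP_WORDS]
```

For example, `tokenize("The cats were running in the garden")` drops the stop words "the" and "in", and stems the remaining tokens, shrinking both the vocabulary and the index that must be built over it.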