Software Engineering-Based Design for a Bayesian Spam Filter

Abstract

The rapid spread and easy availability of free e-mail services have made e-mail the medium of choice for sending unsolicited advertising and bulk messages in general. These messages, known as junk e-mail or spam, are a growing problem for both Internet users and Internet service providers (ISPs).
This research addresses one aspect of the spam problem by developing a filter for the e-mail client. The proposed filter combines three filters: a whitelist, a blacklist, and a Bayesian filter. The whitelist filter accepts e-mails only from known addresses. The blacklist filter blocks e-mails from addresses known to send spam. The Bayesian content-based filter estimates the probability that a message is spam from its text and filters messages against a pre-selected threshold.
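As an illustration only, the combination of the three filters might be sketched in Python as follows. The function name `classify_message`, the precedence order (whitelist first, then blacklist, then the Bayesian score), and the default threshold of 0.9 are assumptions made for this sketch and are not taken from the described system, which was developed in Visual Basic 6.

```python
from typing import Callable, Set


def classify_message(
    sender: str,
    body: str,
    whitelist: Set[str],
    blacklist: Set[str],
    spam_probability: Callable[[str], float],
    threshold: float = 0.9,  # assumed value; the source only says "pre-selected"
) -> str:
    """Return "legitimate" or "spam" for one message.

    Assumed precedence: whitelist, then blacklist, then the content score.
    """
    address = sender.strip().lower()
    if address in whitelist:
        # Known address: accept without inspecting the content.
        return "legitimate"
    if address in blacklist:
        # Address known to send spam: block without inspecting the content.
        return "spam"
    # Unknown sender: fall back to the content-based probability estimate.
    return "spam" if spam_probability(body) >= threshold else "legitimate"
```

Here `spam_probability` stands in for whatever content-based estimator is plugged in; any callable that maps message text to a value between 0 and 1 would fit this sketch.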
The Bayesian filter serves as the main filter. It is trained manually on a set of gathered e-mails, each labeled as spam or legitimate according to its contents. A classification phase then handles newly received e-mails. All required databases are built as tables stored on the Structured Query Language (SQL) server, which the client-side filter accesses transparently to carry out the filtering. The proposed system (the e-mail client interface and the filters) handles messages written in both Arabic and English, which is crucial for users in our region.
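The training and classification phases could look roughly like the Python sketch below. It is a generic naive Bayes classifier and not the system's actual implementation. The class name `NaiveBayesSketch`, the word-level tokenizer, the add-one smoothing, and the in-memory counters (used here in place of the SQL tables) are all assumptions. The Unicode-aware `\w+` pattern is one simple way to tokenize both Arabic and English text.

```python
import math
import re
from collections import Counter

# \w is Unicode-aware for str patterns in Python 3, so Arabic letters match too.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Split text into lowercase word tokens (assumed tokenization)."""
    return TOKEN_RE.findall(text.lower())


class NaiveBayesSketch:
    """Hypothetical in-memory stand-in for the database-backed filter."""

    LABELS = ("spam", "legitimate")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.message_counts = {label: 0 for label in self.LABELS}

    def train(self, text, label):
        """Training phase: record one manually labeled message."""
        self.message_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def spam_probability(self, text):
        """Classification phase: estimate P(spam | text)."""
        vocabulary = set()
        for counts in self.word_counts.values():
            vocabulary.update(counts)
        total_messages = sum(self.message_counts.values())
        tokens = tokenize(text)
        log_scores = {}
        for label in self.LABELS:
            # Smoothed class prior.
            score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
            total_words = sum(self.word_counts[label].values())
            # The extra +1 reserves probability mass for unseen tokens.
            denominator = total_words + len(vocabulary) + 1
            for token in tokens:
                score += math.log((self.word_counts[label][token] + 1) / denominator)
            log_scores[label] = score
        # Normalize in log space to avoid underflow on long messages.
        peak = max(log_scores.values())
        spam = math.exp(log_scores["spam"] - peak)
        legitimate = math.exp(log_scores["legitimate"] - peak)
        return spam / (spam + legitimate)
```

In such a sketch, the returned probability would then be compared against the pre-selected threshold to decide whether a message is filtered.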
Software engineering principles are applied throughout the design process to make the system less vulnerable to faults and easier to maintain. The design steps follow the Waterfall model using the ASCENT software. A user-friendly interface has been developed to access the features of the spam filter at the client side. The system is developed in Visual Basic version 6, and the SQL server is used to build and process the database.
A number of performance measurements have been carried out with a set of gathered e-mails. The results are used to evaluate the performance of the filter and to demonstrate its efficiency.