Email Mining Toolkit (EMT)

The Email Mining Toolkit (EMT) is a data mining system that computes behavior profiles or models of user email accounts. This toolkit is useful for report generation and summarization of email archives, as well as for detecting email security violations when incorporated with a real-time violation detection system, such as the MET system.

EMT, which comprises approximately 13,200 lines of code, is implemented in Java and provides a GUI on top of an underlying relational database application. It provides the means of loading, parsing, and analyzing email messages from a wide range of storage formats. Beyond reporting statistics on email account behavior, it computes the volume and velocity of email exchanged between parties, analyzes specific content and patterns, and explores the social relationships between groups of users and the relative importance rankings of individuals in an organization.

Moreover, EMT extends these kinds of analyses to model “user behavior” at a very fine granularity. It models the behavior of individual user email accounts or groups of accounts, and can be used to detect changes in behavior that may be of interest in forensic analyses. These features of EMT provide the means to detect fraudulent misuse and attacks such as viruses and spam (unwanted email).

EMT includes 15 different features and models. Its statistical models, which include stationary and non-stationary user profiles, are used to generate user behavior models. These models include

• Message Table where individual emails may be automatically classified by built in machine learning subsystems,

• Usage Histogram revealing a user’s typical daily email behavior,

• Similar Users which identifies groups of email users who behave in similar ways,

• Recipient Frequency providing a detailed analysis of the typical communicants with a user and

• Attachment Statistics detailing attached files serving as a personal file system of a user, as well as the statistical analyses including the birth rate, lifespan, incident rate, prevalence, threat, spread, and death rate useful in identifying interesting attachments and viral attachments.
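
As an illustration of the Usage Histogram idea above, the following is a minimal sketch that bins a user's sent messages by hour of day and normalizes the counts into a profile. It is not EMT's actual code: the class name `UsageHistogramSketch`, the method `hourlyProfile`, and the choice of 24 hourly bins are assumptions made for this example only.

```java
import java.time.LocalDateTime;
import java.util.List;

/** Hypothetical sketch of an hourly usage histogram; not taken from EMT. */
public class UsageHistogramSketch {

    /**
     * Returns a 24-bin profile where bin h holds the fraction of
     * messages whose send time falls in hour h of the day.
     */
    public static double[] hourlyProfile(List<LocalDateTime> sentTimes) {
        double[] bins = new double[24];
        if (sentTimes.isEmpty()) {
            return bins; // all zeros: no activity observed
        }
        for (LocalDateTime t : sentTimes) {
            bins[t.getHour()] += 1.0; // count messages per hour
        }
        for (int h = 0; h < bins.length; h++) {
            bins[h] /= sentTimes.size(); // normalize counts to fractions
        }
        return bins;
    }

    public static void main(String[] args) {
        // Made-up timestamps, used only to exercise the method.
        List<LocalDateTime> sent = List.of(
            LocalDateTime.of(2004, 5, 3, 9, 15),
            LocalDateTime.of(2004, 5, 3, 9, 40),
            LocalDateTime.of(2004, 5, 4, 14, 5),
            LocalDateTime.of(2004, 5, 5, 9, 2));
        double[] profile = hourlyProfile(sent);
        System.out.println("share of mail sent in hour 9:  " + profile[9]);
        System.out.println("share of mail sent in hour 14: " + profile[14]);
    }
}
```

Normalizing to fractions rather than raw counts lets profiles built over periods of different length be compared directly.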

The analyses built into EMT concerning groups of accounts and their communication are provided to detect violations of group behavior. These models include

• Enclave Clique, groups of users who frequently exchange messages pairwise,
• User Clique, the set of accounts a particular user typically emails as a group,
• Email Flow revealing how a single message produces a web of new communication throughout an organization and
• Average Communication Time, which measures a user’s typical response rates to individuals, indicating the relative importance of communicants.

These models apply algorithms such as Chi-Square, Hellinger Distance, Mahalanobis Distance, N-Gram analysis, a Naïve Bayes classifier, TF-IDF categorization, and graphical clique analysis. By combining these features, EMT may be applied to a variety of applications and detection tasks.
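
To make one of the named algorithms concrete, here is a minimal sketch of the Hellinger distance between two normalized histograms, for example a user's long-term hourly profile and a recent one. It uses the standard textbook definition, which ranges from 0 (identical) to 1 (no overlap). The exact variant and thresholds EMT uses are not given here, and the class and method names are hypothetical.

```java
/** Hypothetical sketch of the textbook Hellinger distance; not taken from EMT. */
public class HellingerSketch {

    /**
     * Hellinger distance between two discrete distributions p and q.
     * Assumes both arrays have the same length, hold non-negative
     * values, and each sums to 1 (normalized histograms).
     */
    public static double hellinger(double[] p, double[] q) {
        if (p.length != q.length) {
            throw new IllegalArgumentException("histograms must have the same number of bins");
        }
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            double d = Math.sqrt(p[i]) - Math.sqrt(q[i]);
            sum += d * d; // squared difference of square roots, per bin
        }
        return Math.sqrt(sum / 2.0); // the 1/2 factor scales the result into [0, 1]
    }

    public static void main(String[] args) {
        // Made-up two-bin profiles, used only to exercise the method.
        double[] usual = {0.5, 0.5};
        double[] recent = {1.0, 0.0};
        System.out.println("same profile:    " + hellinger(usual, usual));
        System.out.println("shifted profile: " + hellinger(usual, recent));
    }
}
```

A distance near 0 suggests the recent behavior matches the stored profile, while a larger value flags a shift worth inspecting. Any alert threshold would have to be chosen empirically.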

EMT’s graphical user interface makes these functions easy to execute and visualizes results in tabular form, with plots and histograms that are easy to understand.

Related publications:


Wei-Jen Li, Shlomo Hershkop, Salvatore J. Stolfo, Email Archive Analysis Through Graphical Visualization.


Salvatore J. Stolfo, Wei-Jen Li, Shlomo Hershkop, Ke Wang, Chia-Wei Hu, Olivier Nimeskern, Detecting Viral Propagations Using Email Behavior Profiles, ACM Transactions on Internet Technology (TOIT), May 2004.


Some EMT screenshots are shown below:

General email client window / Machine Learning analysis


Graphical clique analysis


Email flow analysis


Similar Users


Usage Histogram


Usage Frequency Analysis


Virus simulation and detection


Virus detection