Columbia SPARSE (Statistical Parser) project
Data-Driven Malware Detection by
Statistical Content Analysis
Salvatore J. Stolfo, Wei-Jen Li, Angelos Stavrou, Elli Androulaki
This report details the outcome of a research project on identifying Microsoft Word documents that harbor malicious code, using static inspection of the statistical byte sequences of their binary content. Documents embedded with malcode are a convenient means for attackers to penetrate systems and reach third-party applications whose exploitable vulnerabilities would otherwise not be reachable by network-level service attacks. Such malicious documents may be served by any website in a passive "drive-by" fashion, or introduced to a system on other media such as CD-ROMs and USB drives, bypassing all network firewalls and sensors. They pose a very serious threat to organizations that acquire and analyze large collections of (publicly) available documents.
There is nothing new about the presence of viruses in email streams, embedded in attached documents. Nevertheless, the possibility of malcode in seemingly innocuous documents, downloaded or otherwise obtained from reputable sources, introduces a significant threat. If the malcode can evade detection by state-of-the-art signature-based antivirus scanners, the threat can become both far-reaching and devastating: it can infect the organization's infrastructure and then use it as a stepping stone to reach other systems that are unreachable via the regular network. Hence, any machine inside an organization can become the spreading point of the malcode.
Our focus is exactly this class of attacks: documents that appear "normal" but harbor new "zero-day" malcode for which no signature may be available. The Challenge Problem is whether we can inspect the binary content of any document file to determine whether it is infected with malicious code, without a priori knowledge of the specific code in question. We limit our study to Microsoft Word document files. Word documents serve as a "container" for complex object embeddings that need to be parsed and executed to render the document for display. A myriad of object types may be embedded within Word documents and many other binary-format documents, including Adobe PDF files.
In our work we have developed a number of approaches to modeling the binary content of Word files using statistical n-gram analysis. We developed and shall deliver a software program, known as the Statistical Parser (SPARSE), with a convenient graphical user interface to manage the parsing of Word documents, the training of statistical content models and the testing of Word files to measure their likelihood of containing malicious code.
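To make the idea of statistical n-gram analysis of binary content concrete, the following is a minimal sketch in Python. It is not the SPARSE implementation and does not reproduce any of the project's models. The function names (`byte_ngrams`, `train_model`, `seen_fraction`), the choice of n = 2, and the toy byte strings are all assumptions made for illustration. The sketch counts the byte n-grams seen in a set of training samples, then reports what fraction of a test sample's n-grams also occurred in training.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2):
    """Yield every overlapping n-byte window of data."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


def train_model(samples, n: int = 2) -> Counter:
    """Count n-gram occurrences across all training samples."""
    model = Counter()
    for sample in samples:
        model.update(byte_ngrams(sample, n))
    return model


def seen_fraction(model: Counter, data: bytes, n: int = 2) -> float:
    """Fraction of data's n-grams that also occur in the training model."""
    grams = list(byte_ngrams(data, n))
    if not grams:
        # Input shorter than n bytes: nothing to compare.
        return 0.0
    return sum(1 for g in grams if g in model) / len(grams)


# Hypothetical toy data standing in for the bytes of benign documents.
benign = [b"hello world", b"hello there"]
model = train_model(benign, n=2)

print(seen_fraction(model, b"hello", n=2))             # every 2-gram was seen in training
print(seen_fraction(model, b"\x90\x90\x90\x90", n=2))  # no 2-gram was seen in training
```

In this toy setting, a low fraction means the content is unlike the training data. A practical detector would need much larger training sets, a considered choice of n, and a decision threshold, none of which are specified here.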
Last updated Sept 6, 2006 by Wei-Jen Li.