Statistical Parser (SPARSE) is part of the research of the Intrusion Detection System group at
Columbia University. The goal of this work is to develop proof-of-concept tools to identify
malware stealthily embedded in files or other objects to avoid detection by conventional AV
scanners. Currently, we are building an integrated hybrid system that employs various detection
methods, including static analysis, dynamic emulation, data randomization, malcode location, and
other strategies.
Background of this Project
The term "SPARSE" comes from our initial work on this problem, which focused primarily on Statistical
PARSing of the binary content of documents. A document may also be quite sparsely populated with malcode
compared to the rest of its benign content. The idea of SPARSE originally came from the
MEF and PAYL projects. Instead of checking network payloads as PAYL does, SPARSE reads the
binary content of files. The predecessor of SPARSE was named Fileprint,
which proposed and demonstrated a means of computing statistical content-based models of typical files
of a single type. The Fileprint project applied various approaches, such as truncation of files and single or multiple
centroids for modeling, along with many testing algorithms. It successfully demonstrated that these techniques
can be used efficiently to cluster many file types and, furthermore, to classify normal and
malicious executables without pre-built signatures.
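As a rough illustration of the Fileprint idea described above, the following Python sketch computes a normalized 1-gram (byte value) histogram for each file, optionally truncated to its first bytes, averages the histograms of one file type into a single centroid, and assigns an unknown file to the nearest centroid. The function names, the choice of Manhattan distance, and the sample data are assumptions made for this example; they are not taken from the Fileprint code.

```python
from collections import Counter
from typing import Dict, List, Optional


def byte_histogram(data: bytes, truncate: Optional[int] = None) -> List[float]:
    """Return the normalized 1-gram (byte value) distribution of data.

    If truncate is given, only the first `truncate` bytes are modeled.
    """
    if truncate is not None:
        data = data[:truncate]
    if not data:
        return [0.0] * 256
    counts = Counter(data)  # iterating over bytes yields ints in 0..255
    total = len(data)
    return [counts.get(b, 0) / total for b in range(256)]


def centroid(histograms: List[List[float]]) -> List[float]:
    """Average several histograms into a single-centroid model."""
    n = len(histograms)
    return [sum(h[i] for h in histograms) / n for i in range(256)]


def manhattan_distance(a: List[float], b: List[float]) -> float:
    """Distance between two histograms (assumed metric for this sketch)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def classify(data: bytes, centroids: Dict[str, List[float]],
             truncate: Optional[int] = None) -> str:
    """Return the label of the centroid closest to the file's histogram."""
    h = byte_histogram(data, truncate)
    return min(centroids, key=lambda label: manhattan_distance(h, centroids[label]))


if __name__ == "__main__":
    # Toy stand-ins for real training files of two types.
    text_samples = [b"hello world " * 10, b"the quick brown fox " * 5]
    binary_samples = [bytes(range(256)), bytes(range(0, 256, 2)) * 2]
    models = {
        "text": centroid([byte_histogram(s) for s in text_samples]),
        "binary": centroid([byte_histogram(s) for s in binary_samples]),
    }
    print(classify(b"hello fox world", models))
```

A multiple-centroid variant would keep several centroids per file type (for example, one per cluster of training files) and take the minimum distance over all of them.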
The general malware detection mechanism is signature-based, which means it cannot detect
unseen or unknown attacks. Unlike signature-based AV software, a statistical profile can be used to detect
unknown malware; therefore, zero-day detection becomes achievable. Our ultimate goal is to develop a
convenient, easy-to-use system to analyze and test any file and to reject or raise an alarm on any file that does not
fit the learned normal profile.
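The idea of raising an alarm when a file does not fit a learned normal profile can be sketched as follows. This is a minimal example under stated assumptions, not the SPARSE implementation: the profile stores the per-byte mean and standard deviation of 1-gram frequencies over benign training files, and scores a file with a simplified Mahalanobis-style distance of the kind used in the PAYL line of work. The class name NormalProfile, the smoothing term alpha, and the threshold rule (1.5 times the largest training distance) are all hypothetical choices made for illustration.

```python
import statistics
from typing import List


def byte_histogram(data: bytes) -> List[float]:
    """Normalized frequency of each byte value (1-gram distribution)."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    total = len(data) or 1  # avoid division by zero on empty input
    return [c / total for c in counts]


class NormalProfile:
    """Per-byte mean/stddev model learned from benign samples (illustrative)."""

    def __init__(self, samples: List[bytes], alpha: float = 0.001,
                 margin: float = 1.5) -> None:
        hists = [byte_histogram(s) for s in samples]
        self.mean = [statistics.fmean([h[i] for h in hists]) for i in range(256)]
        self.std = [statistics.pstdev([h[i] for h in hists]) for i in range(256)]
        self.alpha = alpha  # smoothing so bytes with zero variance do not divide by zero
        # Hypothetical threshold rule: a margin above the worst training distance.
        self.threshold = margin * max(self.distance(s) for s in samples)

    def distance(self, data: bytes) -> float:
        """Simplified Mahalanobis-style distance from the learned profile."""
        h = byte_histogram(data)
        return sum(abs(x - m) / (s + self.alpha)
                   for x, m, s in zip(h, self.mean, self.std))

    def is_anomalous(self, data: bytes) -> bool:
        """True if the file does not fit the learned normal profile."""
        return self.distance(data) > self.threshold


if __name__ == "__main__":
    benign = [b"hello world " * 20,
              b"hello there world " * 20,
              b"world hello hello " * 20]
    profile = NormalProfile(benign)
    # A benign-looking document with a run of 0x90 bytes appended.
    suspicious = b"hello world " * 10 + bytes([0x90]) * 100
    print(profile.is_anomalous(benign[0]), profile.is_anomalous(suspicious))
```

Because the model is built only from benign data, no signature of the embedded bytes is needed; any byte distribution far enough from the training data is flagged. In practice the threshold would be calibrated against a target false-positive rate rather than a fixed margin.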
In addition to statically analyzing the binary content of files, we also monitor the dynamic run-time system
behavior of documents. Specifically, we open documents using programs such as Word in diverse virtual machines.
Currently, we are working on integrating the static and dynamic detectors into a single system.
Recent Publications
Wei-Jen Li and Salvatore J. Stolfo
"SPARSE: A Hybrid System to Detect Malcode-Bearing Documents"
CU Tech. Report, Jan 2008.
PDF
Wei-Jen Li, Salvatore J. Stolfo, Angelos Stavrou, Elli Androulaki, Angelos Keromytis
"A Study of Malcode-Bearing Documents."
DIMVA, 2007.
PDF
Figures and Screenshots
Fig.1, SparseGUI screenshot: A GUI used to parse documents, analyze the binary content, and perform experiments (including the dynamic sandbox tests).
Fig.2, The concept of the dynamic sandbox test.
Fig.3, The Concept of the Hybrid System.
Fig.4, Entropy Analysis.
Fig.5, The histogram of the 1-gram binary content of the parsed documents.
Earlier work on Fileprint:
Fig.6, The histogram of the 1-gram binary content of some types of files.
Fig.7, Some old detection results.
Papers
- Wei-Jen Li and Salvatore J. Stolfo, "SPARSE: A Hybrid System to Detect Malcode-Bearing Documents." CU Tech. Report, Jan 2008. [PDF]
- Wei-Jen Li, Salvatore J. Stolfo, Angelos Stavrou, Elli Androulaki, Angelos Keromytis, "A Study of Malcode-Bearing Documents." DIMVA, 2007. [PDF]
- Wei-Jen Li, Ke Wang, Salvatore J. Stolfo, "Fileprints: Identifying File Types by n-gram Analysis." 2005 IEEE Information Assurance Workshop. [PDF]