Wenke Lee
Computer Science Department
North Carolina State University
Raleigh, NC 27695-7550
wenke@csc.ncsu.edu
- Salvatore J. Stolfo
Computer Science Department
Columbia University
500 West 120th Street, New York, NY 10027
sal@cs.columbia.edu
- Philip K. Chan
Computer Science Department
Florida Institute of Technology
150 W. University Blvd., Melbourne, FL 32901
pkc@cs.fit.edu
- Eleazar Eskin, Wei Fan, Matthew Miller, Shlomo Hershkop, and
Junxin Zhang
Computer Science Department
Columbia University
500 West 120th Street, New York, NY 10027
{eeskin,wfan,matt,sh553,jzhang}@cs.columbia.edu
Most IDSs are based on hand-crafted signatures that are developed by manual encoding of expert knowledge. These systems match activity on the system being monitored to known signatures of attacks. The major problem with this approach is that the system cannot generalize in order to detect new attacks. Recently, there has been an increased interest in data mining-based approaches to building detection models for IDSs. These methods can generalize their models of both known attacks and normal behavior in order to detect unknown attacks. They can also be generated in a quicker and more automated method than manually encoded models that require difficult analysis of audit data by domain experts. Several effective data mining techniques for detecting intrusions have been developed [22,10,33,5], many of which perform close to or better than systems engineered by domain experts.
However, successful data mining techniques are themselves not enough to create deployable IDSs. Despite the promise of better detection performance and generalization ability of data mining-based IDSs, there are some inherent difficulties in the implementation and deployment of these systems. We can group these difficulties into three general categories: accuracy (i.e., detection performance), efficiency, and usability. Typically, data mining-based IDSs (especially anomaly detection systems) have higher false positive rates than traditional hand-crafted signature based methods, making them unusable in real environments. Also, these systems tend to be inefficient (i.e., computationally expensive) during both training and evaluation. Finally, these systems require large amounts of training data and are significantly more complex than traditional systems. In order to be able to deploy real time data mining-based IDSs, these issues must be addressed.
In this paper, we discuss several problems inherent in developing and deploying a real-time data mining-based IDS and present an overview of our research, which addresses these problems. These problems are independent of the actual learning algorithms or models used by an IDS and must be overcome in order to implement data mining methods in a deployable system.
An effective data mining-based IDS must address each of these three groups of issues. Although there are tradeoffs between these groups, each can generally be handled separately. We present the key design elements and group them into which aspect of the performance measure they apply to.
The rest of this paper is organized as follows. In Section 2, we discuss the issues related to detection performance and provide several algorithm independent approaches to improve the accuracy of a system. In Section 3, we discuss efficiency issues and present techniques for efficient model computation. In Section 4, we discuss general usability issues and techniques to address these issues and in Section 5, we discuss a system architecture for implementing the techniques presented throughout the paper. Section 6 surveys related research, and Section 7 offers discussion of future work and conclusive remarks.
At the most basic level, accuracy measures how well an IDS detects attacks. There are several key components of an accuracy measurement. One important component is detection rate, which is the percentage of attacks that a system detects. Another component is the false positive rate, which is the percentage of normal data that the system falsely determines is intrusive. There is an inherent tradeoff between detection rate and false positive rate. One way to represent this tradeoff is by plotting the detection rate versus false positive rate on a curve under different parameter values creating a ROC curve1. A method to compare accuracy between two IDSs is to examine their ROC curves.
In practice, only the small portion of a ROC curve corresponding to acceptably low false positives is of interest, as in a deployable system, only a low false positive rate can be tolerated. Data mining-based systems have the advantage of potentially being able to detect new attacks that hand-crafted methods tend to miss. Since hand-crafted methods typically have a fixed detection threshold, they perform at a constant detection rate across different false positive rates. In a ROC curve, we can assume that their curve is a straight line at each detection level. Data mining-based IDSs are only useful if their detection rate is higher than a hand-crafted method's detection rate with an acceptably low false positive rate. Given this framework, our goal is to develop a data mining-based IDS that is capable of outperforming hand-crafted signature-based systems.
We have developed and applied a number of algorithm-independent techniques to improve the performance of data mining-based IDSs. In this section, we focus on a few particular techniques that have been proven to be empirically successful. We first present a generic framework for extracting features from audit data which help discriminate attacks from normal data. These features can then be used by any detection model building algorithm. We describe a method for generating artificial anomalies in order to decrease the false positive rate of anomaly detection algorithms. Our research has shown that by generating artificial anomalies, we can improve the accuracy of these ID models. Finally, we present a method for combining anomaly and misuse (or signature) detection models. Typically signature and anomaly detection models are trained and used in complete isolation from each other. Our research has shown that by combining the two types of models, we can improve the overall detection rate of the system without compromising the benefits of either detection method.
We have developed a set of data mining algorithms for selecting and constructing features from audit data [18]. First, raw (binary) audit data is processed and summarized into discrete records containing a number of basic features, e.g., timestamp, duration, source and destination IP addresses and ports, and error condition flags. Specialized data mining programs [23] are then applied to connection records to compute frequent patterns describing correlations among features and frequently co-occurring events across many connection records. The consistent patterns of normal activities and the ``unique'' patterns associated with an intrusion are then identified and analyzed to construct additional features for connection records. It can be shown that the constructed features can indeed clearly separate intrusion records from normal ones. Using this approach, the constructed features are more grounded on empirical data, and thus more objective than expert knowledge. Results from the 1998 DARPA Intrusion Detection Evaluation [24] showed that the ID model constructed using our algorithms was one of the best performing of all the participating systems.
As an example, let us consider the SYN flood attack. When launching
this attack, an attacker uses many spoofed source addresses to open
many connections which never become completely established (i.e., only
the first SYN packet is sent, and the connection remains in the ``S0''
state) to some port on a victim host (e.g., http). When
comparing the patterns from the 1998 DARPA dataset that contain SYN
flood attacks with the patterns from a ``baseline'' normal dataset (of
the same network), by first encoding the patterns into numbers and
then computing ``difference'' scores. The following pattern, a
frequent episode [25], has the highest
``intrusion-only'' (i.e., unique for the intrusion) score:
``(flag=S0,
service=http,
,
(flag=S0,
service=http,
(flag=S0,
service=http,
[0.93, 0.03,
2].'' This means that 93% of the time, after two http
connections with S0 flag are made to host victim, within 2
seconds from the first of these two, the third similar connection is
made; and this pattern occurs in 3% of the data. Accordingly, our
feature construction algorithm parses the pattern and uses the anatomy (or structural) information about an intrusion, e.g., ``the
same service (i.e., port) is targeted'', and the invariant information, e.g. flag=S0, to construct the following
features: ``a count of connections to the same
in the
past 2 seconds,'' and among these connections, ``the percentage of
those that have the same service, and the percentage of those that
have the S0 flag.'' For the two ``percentage'' features, the
normal connection records have values close to 0, but the connection
records belonging to syn_flood have values above
.
Once these discriminative features are constructed, it is
easy to generate the detection rules via either manual
(i.e. hand-coding) or automated (i.e., machine learning)
techniques. For example, we use RIPPER [3], an inductive
rule learner, to compute a detection rule for syn_flood: if
for the past 2 seconds, the count of connections to the same
dst_host is greater than 4; and the the percentage of
those that have the same service is greater than 75%; and the percentage of those that have the ``S0'' flag is greater than
75%, then there is a syn_flood attack.
Since we do not know where the exact decision boundary is between known and anomalous instances, we assume that the boundary may be very close to the existing data. To generate artificial anomalies close to the known data, a useful heuristic is to randomly change the value of one feature of an example while leaving the other features unaltered.
Some regions of known data in the instance space may be sparsely populated. We can think of the sparse regions as small islands and dense regions as large islands in an ocean. To avoid overfitting, learning algorithms are usually biased towards discovering more general hypotheses. Since we only have known data, we want to prevent hypotheses from being overly general when predicting these known classes. That is, we want to avoid the situation where sparse regions may be grouped into dense regions to produce singularly large regions covered by overly general hypotheses. It is possible to produce artificial anomalies around the edges of these sparse regions and coerce the learning algorithm to discover the specific boundaries that distinguish these regions from the rest of the instance space. In other words, we want to generate data that will amplify these sparse regions.
Sparse regions are characterized by infrequent values of individual features. To amplify sparse regions, we proportionally generate more artificial anomalies around sparse regions depending on their sparsity using the Distribution Based Artificial Anomaly algorithm (DBA2) presented in detail in [6]. It uses the frequency distribution of each feature's values to proportionally generate a sufficient amount of anomalies.
We have applied this approach for both pure anomaly detection and combined anomaly and misuse (or signature) detection.
For pure anomaly detection, the training set consists of only normal connections augmented by anomalies generated using DBA2 from these normal connections. One anomaly detection model trained using this method successfully detects 94.26% of all anomalous connections in the test data and has a false alarm rate of 2.02%. Furthermore, these anomaly detection models are capable of detecting most intrusion classes, even though there are no intrusions at all in the training set.
We learn a single ruleset for combined misuse and anomaly detection. The ruleset has rules to classify a connection to be normal, one of the known intrusion classes, or anomaly. In order to evaluate this combined approach, we group intrusions together into a number of small clusters. We create multiple datasets by incrementally adding each cluster into the normal dataset and re-generating artificial anomalies. This is to simulate the real world process of developing and discovering new intrusions and incorporating them into the training set. We learn models that contain misuse rules for the intrusions that are ``known'' in the training data, anomaly detection rules for unknown intrusions in left-out clusters, and rules that characterize normal behavior.
We have seen that the detection rates of known intrusions classified
by misuse rules in models learned with and without DBA2 categories of
intrusions are indistinguishable or completely identical. This
observation shows that the proposed DBA2 method does not diminish the
effectiveness of misuse rules detecting particular categories of known
intrusions. In analysis, we consider an anomaly detection model to be
significant for a particular detection class if the detection
rate is more than
.
Our results have shown that more than
75% of the time, the detection rate is significant. We have also
determined that the proposed approach can prove effective in detecting
unclassified known intrusions as anomalies. We consider an anomaly
detection method to significantly compensate for poor misuse
detection if the anomaly detection increases the total rate of
detection (either as the true class or anomaly) to nearly 100%. In
our experiments, 86% of misuse misclassifications are significantly
compensated for by being detected as anomalies. Our experiments with
pure anomaly and combined anomaly and misuse detection can be found
in [6].
In contrast to off-line IDSs, a key objective of real-time IDS is to detect intrusions as early as possible. Therefore, the efficiency of the detection model is a very important consideration. Because our data mining-based models are computed using off-line data, they implicitly assume that when an event is being inspected (i.e., classified using an ID model), all activities of the connection have completed so that all features have meaningful values available for model checking. As a consequence, if we use these models in real time without any modification, then an event is not inspected until complete information about that event has arrived and been summarized, and all temporal and statistical features (i.e., the various temporal statistics of the events in the past n seconds, see Section 2.1) are computed. This scheme can fail miserably under real-time constraints. When the volume of an event stream is high, the amount of time taken to process the event records within the past n seconds and calculate statistical features is also very high. Many subsequent connections may have terminated (and thus completed with attack actions) when the ``current'' connection is finally inspected by the model. That is, the detection of intrusions is severely delayed. Unfortunately, DoS attacks, which typically generate a large amount of traffic in a very short period time, are often used by intruders to first overload an IDS, and use the detection delay as a window of opportunity to quickly perform their malicious intent. For example, they can even seize control of the host on which the IDS lives, thus eliminating the effectiveness of intrusion detection altogether.
It is necessary to examine the time delay associated with computing each feature in order to speed up model evaluation. The time delay of a feature includes not only the time spent for its computation, but also the time spent waiting for its readiness (i.e., when it can be computed). For example, the total duration of a connection can only be computed after the last packet of the connection has arrived, whereas the destination host of a connection can be obtained by checking the header of the first packet.
From the perspective of cost analysis, the efficiency of an intrusion detection model is its computational cost, which is the sum of the time delay of the features used in the model. Based on the feature construction approaches discussed in Section 2.1, we can categorize features used for network intrusion detection into 4 cost levels:
In order to conveniently estimate the cost of a rule, we assign a cost of 1 to the level 1 features, 5 to the level 2 features, 10 to level 3, and 100 to level 4. These cost assignments are very close to the actual measurements we have obtained via extensive real-time experiments. However, we have found that the cost of computing level 4 features is linearly dependent on the amount of connections being monitored by the IDS within the time window used for computation, as they require iteration of the complete set of recent connections.
In this section, we discuss approaches to reduce computational cost and improve the efficiency of the real-time intrusion detection models. In Section 3.1, we describe our techniques for cost-sensitive modeling. In Section 3.2, we discuss a real-time system for implementing distributed feature computation for evaluating multiple cost-sensitive models.
In the domain of network intrusion detection, we use four different levels of costs to compute features, as discussed in Section 3. Features of costs 1, 5, and 10 are computed individually and features of costs 100 can be computed in a single lookup of all the connections in the past n seconds. With the above costs and goals in mind, we use the following multiple ruleset approach:
In real-time execution, the feature computation and rule evaluation proceed as follows:
In our experiments, we used the 1998 DARPA evaluation. The detailed experimental set-up and results can be found in [7]. In summary, our multiple model approach can reduce the computational cost by as much as 97% without compromising predictive accuracy, where the cost for inspecting a connection is the total computational cost of all unique features used before a prediction is made. If multiple features of cost 100 are used, the cost is counted only once since they can all be calculated in a single iteration through the table of recent connections.
Judge uses models that have been learned using the techniques described in Section 3.1. That is, there exists a sequence of models, each of which uses increasingly more costly features than the previous model. Models are evaluated and higher level features are computed at different points in a connection by Judge as more primitive features become available.
The sensor informs Judge of new feature values, or updates to feature values that are maintained throughout the connection's life, whenever there is a change in the connection's state (e.g., a connection has gone from SYN_WAIT to CONNECTED). Sensors also update certain feature values whenever there is an ``exception'' event. Exceptions are certain occurrences which should immediately update the value of a specific feature. For example, if two fragmented packets come in and the offsets for defragmentation are correct, the bad_frag_offset feature must be updated immediately.
Upon each state change and exception event, Judge computes the set of features that are available for the given connection. If the set of features is a proper subset of the set of light-weight features used by one of the ID models, then higher level features are computed and that model is evaluated. The logic for determining when a prediction is made is the same as is described in Section 3.1.
Once a prediction is made by Judge, a complete connection record, with the label, is inserted into a data warehouse, as described in our system architecture outline in Section 5.
We have currently implemented this system using NFR's Network Flight Recorder as the sensor, although the protocol for communication between the sensor and Judge would allow any sensor which extracts features from a data stream to be used.
First, management of both training and historical data sets is a difficult task, especially if the system handles many different kinds of data. Second, once new data has been analyzed, models need to be updated. It is impractical to update models by retraining over all available data, as retraining can take weeks, or even months, and updated models are required immediately to ensure the protection of our systems. Some mechanism is needed to adapt a model to incorporate new information. Third, many data mining-based IDSs are difficult to deploy because they need a large set of clean (i.e., not noisy) labeled training data. Typically the attacks within the data must either be manually labeled for training signature detection models, or removed for training anomaly detection models. Manually cleaning training data is expensive, especially in the context of large networks. In order to reduce the cost of deploying a system, we must be able to minimize the amount of clean data that is required by the data mining process .
We present an approach to each of these problems. We present adaptive learning, which is a generic mechanism for adding new information to a model without retraining. We then present unsupervised anomaly detection which is a new class of intrusion detection algorithms that do not rely on labeled data. In the next section, we present system architecture which automates model and data management.
In [6], different configurations for adaptive learning are discussed. In one such configuration, the additional classifier, H2, is trained from new_intrusion and normal data. The decision rules in Figure 1 are evaluated to compute the final outcome. We refer to H1 as an existing IDS model, while H2 refers to a new model trained specifically for a new attack recently discovered. Each classifier can be decision tree, rule, probability network, neural network and so on.
Since H1 cannot identify new_intrusion, most instances are classified as either anomaly or normal. However, H2 is computed from new_intrusion and normal data. Since new_intrusion is the minority class, H2 is usually good at distinguishing it from other classes.
We have experimented with different configurations to test the effectiveness of this approach. As shown in [6], the cost of training of the proposed method is almost 150 times less expensive than learning a monolithic classifier trained from all available data, and the accuracy of both are essentially equivalent.
Ideally, we would like to build detection models from collected data without needing to manually label it. In this case, the deployment cost would greatly be decreased because the data would not need to be labeled. In order to build these detection models, we need a new class of model building algorithms. These model building algorithms can take as input unlabeled data and create a detection model. We call these algorithms unsupervised anomaly detection algorithms.
In this section, we present the problem of unsupervised anomaly detection and relate it to the problem of outlier detection in statistics [1]. We present an overview of two unsupervised anomaly detection algorithms that have been applied to intrusion detection.
These algorithms can also be referred to as anomaly detection over noisy data. The reason the algorithm must be able to handle noise in the data is that we do not want to manually verify that the audit data collected is absolutely clean (i.e., contains no intrusions).
Unsupervised anomaly detection algorithms are motivated by two major assumptions about the data which are reasonable for intrusion detection. The first assumption is that anomalies are very rare. This corresponds to the fact that normal use of the system greatly outnumbers the occurrence of intrusions. This means that the attacks compose a relatively small proportion of the total data. The second assumption is that the anomalies are quantitatively different from the normal elements. In intrusion detection this corresponds to the fact that attacks are drastically different from normal usage.
Since anomalies are very rare and quantitatively different from the normal data, they stand out as outliers in the data set. Thus, we can cast the problem of detecting the attacks into an outlier detection problem. Outlier detection is the focus of much literature in the field of statistics [1].
In intrusion detection, intuitively, if the ratio of attacks to normal data is small enough, then because the attacks are different, the attacks stand out against the background of normal data. We can thus detect the attack within the dataset.
We have performed experiments with two types of unsupervised anomaly detection algorithms, each for a different type of data. We applied a probabilistic based unsupervised anomaly detection algorithm to building detection models over system calls and a clustering based unsupervised anomaly detection algorithm to network traffic.
The probabilistic algorithm approached detecting outliers by estimating the likelihood of each element in the data. We partition the data into two sets, normal elements and anomalous elements. Using a probability modeling algorithm over the data, we compute the most likely partition of the data. Details and experimental results of the algorithm applied to system call data are given in [5].
The clustering approach detects outliers by clustering the data. The intuition is that the normal data will cluster together because there is a lot of it. Because anomalous data and normal data are very different from each other, they do not cluster together. Since there is very little anomalous data relative to the normal data, after clustering, the anomalous data will be in the small clusters. The algorithm first clusters the data and then labels the smallest clusters as anomalies. Details and experimental results applied to network data are given in [26].
In terms of feature construction for detection models, DC-1 (Detector Constructor) [8], first invokes a sequence of operations for constructing features (indicators) before constructing a cellular phone fraud detector (a classifier). We are faced with a more difficult problem here because there is no standard record format for connection or session records (we had to invent our own). We also need to construct temporal and statistical features not just for individual records, but also over different connections and services. That is, we are modeling different logical entities that take on different roles and whose behavior is recorded in great detail. Extracting these from a vast and overwhelming stream of data adds considerable complexity to the problem.
The work most similar to unsupervised model generation is a technique developed at SRI in the Emerald system [14]. Emerald uses historical records to build normal detection models and compares distributions of new instances to historical distributions. Discrepancies between the distributions signify an intrusion. One problem with this approach is that intrusions present in the historical distributions may cause the system to not detect similar intrusions in unseen data.
Related to automatic model generation is adaptive intrusion detection. Teng et al. [32] perform adaptive real time anomaly detection by using inductively generated sequential patterns. Also relevant is Sobirey's work on adaptive intrusion detection using an expert system to collect data from audit sources [27].
Many different approaches to building anomaly detection models have been proposed. A survey and comparison of anomaly detection techniques is given in [33]. Stephanie Forrest presents an approach for modeling normal sequences using look ahead pairs [9] and contiguous sequences [12]. Helman and Bhangoo [11] present a statistical method to determine sequences which occur more frequently in intrusion data as opposed to normal data. Lee et al. [21,20] uses a prediction model trained by a decision tree applied over the normal data. Ghosh and Schwartzbard [10] use neural networks to model normal data. Lane and Brodley [15,16,17] examine unlabeled data for anomaly detection by looking at user profiles and comparing the activity during an intrusion to the activity under normal use.
Cost-sensitive modeling is an active research area in the data mining and machine learning communities because of the demand from application domains such as medical diagnosis and fraud and intrusion detection. Several techniques have been proposed for building models optimized for given cost metrics. In our research we study the principles behind these general techniques and develop new approaches according to the cost models specific to IDSs.
In intrusion data representation, related work is the IETF Intrusion Detection Exchange Format project [13] and the CIDF effort [29].