Multilayer Statistical Intrusion Detection in Wireless Networks

The rapid proliferation of mobile applications and services has introduced new vulnerabilities that do not exist in fixed wired networks. Traditional security mechanisms, such as access control and encryption, turn out to be inefficient in modern wireless networks. Given the shortcomings of these protection mechanisms, intrusion detection systems (IDSs) have become an important research focus. This paper proposes a multilayer statistical intrusion detection framework for wireless networks. The architecture is well suited to wireless networks because the underlying detection models rely on radio parameters and traffic models. Accurate correlation between radio and traffic anomalies enhances the efficiency of the IDS. A radio signal fingerprinting technique based on the maximal overlap discrete wavelet transform (MODWT) is developed. Moreover, a geometric clustering algorithm is presented. Depending on the characteristics of the fingerprinting technique, the clustering algorithm makes it possible to control the false positive and false negative rates. Finally, simulation experiments have been carried out to validate the proposed IDS.


Introduction
Mobile applications and services relying on wireless communication infrastructures have expanded dramatically in recent years. Ad hoc networks, wireless local area networks (WLANs), and WiMAX are just examples of a panoply of technologies that continue to proliferate. In addition, more sophisticated communication techniques are expected to appear in the near future. The intrinsic features of wireless mobile networks make them more vulnerable than wired fixed networks. For instance, the nature of wireless radio links renders the network vulnerable not only to passive eavesdropping but also to active interference. Moreover, in many contexts, the network consists of autonomous mobile nodes that are capable of acting independently. Hence, without appropriate physical protection, nodes can be compromised and used to carry out malicious activities.
The shortcomings of the security mechanisms used in wireless networks increase the need for new detection techniques that can defend against sophisticated mobile attacks. In the literature, many attempts have been made to fulfill this need. Most of the existing approaches rely on intrinsic signal characteristics to detect intrusion events.
In this paper, a novel multilayer intrusion detection process for wireless networks is introduced. We consider a set of detectors using heterogeneous features corresponding to different network layers and collected by specific preprocessors. Four major layers are used in our context: the physical layer, the link layer, the transport layer, and the application layer. A set of parameters from each layer is collected, preprocessed, and submitted to the corresponding detector in order to decide whether malicious events have occurred. A postprocessing module has also been designed in order to refine the available information about the attacker by accurately determining its position. The main contributions of our work can be briefly described through the following points.
(1) The physical layer preprocessor, which gathers intrinsic features of the wireless network interfaces, relies on the maximal overlap discrete wavelet transform (MODWT) and geometric unsupervised classification. It is shown to perform better than the approach in [1], essentially because of its shift-preserving property. To our knowledge, the MODWT has not been previously used in the intrusion detection context.
EURASIP Journal on Advances in Signal Processing
(2) The transport and application layer detection mechanisms measure the deviation of the real-time traffic from a preestablished model which is adaptively updated. This allows detecting traffic pattern distortion attacks. In fact, we introduce two novel traffic models corresponding to the TCP protocol (transport layer) and video transmission (application layer). We represent the traffic by a long memory process. If the attacker attempts to embed forged packets within a normal stream, our approach allows detecting this activity.
(3) Our intrusion detection process is multilayer, meaning that it can analyze a single-packet stream at different layers, beginning with the physical layer. Furthermore, all of the preprocessing, detection, and postprocessing techniques are statistical. The fact that the proposed architecture is purely statistical corroborates the idea stated in [2] that "statistical anomaly detection will be among the most efficient intrusion detection techniques for wireless networks."
The rest of the paper is structured as follows. Section 2 reviews the most important intrusion detection techniques for wireless networks. Section 3 briefly presents wavelet theory fundamentals and highlights the difference between the traditional DWT and the MODWT. The architecture of the proposed IDS is described in Section 4. Section 5 designs the physical layer preprocessing components and shows how network interfaces can be robustly authenticated in a wireless environment. An antispoofing filter based on geometric unsupervised classification of the data provided by the physical and link layer preprocessors is detailed in Section 6. The transport and application layer preprocessors are addressed in Section 7. A technique based on the estimation of the Hurst exponent is used for this purpose. Section 8 describes the simulation environment and discusses the results provided by the proposed techniques. Finally, Section 9 concludes the paper.

Intrusion Detection in Wireless Networks
This section examines the state of intrusion detection in wireless networks, with a particular emphasis on statistical approaches. The wireless intrusion detection system (WIDS) is a network component that protects the network by detecting wireless attacks, that is, attacks that exploit the specific features and characteristics of wireless networks. Wireless intrusions fall into two categories of attacks. The first category targets the fixed part of the wireless network and includes MAC spoofing, IP spoofing, and DoS. The second category targets the radio part of the wireless network and includes rogue access points (APs), noise flooding, and wireless network sniffing. The latter attacks are more complex because they are hard to detect and to trace back [3,4].
To detect such complex attacks, the WIDS deploys approaches and techniques provided by intrusion detection systems (IDSs) protecting wired networks [5]. Among these approaches, one can find the signature-based and anomaly-based approaches. The first approach consists in matching the user's patterns with attack signatures. The second approach aims at detecting any deviation from the "normal" behavior of the network entities. The deployment of these approaches in a wireless environment requires some modifications, because the features and characteristics of the wireless environment make traditional detection approaches very difficult to use. The major feature is mobility: information has to be gathered from different mobile sources, which may require real-time traffic analysis. Moreover, there are no clear differences between "normal" and "abnormal" behavior in a mobile environment. Because of mobility, a node can send false information, which can be classified as "abnormal" behavior.
Therefore, traditional detection approaches have to be revised. The signature-based approach in wireless networks may require the use of a knowledge base containing the wireless attack signatures, while an anomaly-based approach requires the definition of profiles specific to wireless entities (mobile users and APs). Wireless intrusion detection can be done by monitoring the active components of the wireless network, such as the APs [6]. Generally, the WIDS is designed to monitor and report on network activities between communicating devices. To do this, the WIDS has to capture and decode wireless network traffic [7,8]. Some WIDSs can only capture and store wireless traffic; for example, WITS [9] retains multiple log files that contain system statistics and sufficient network-related data to trace back the intruder. Other WIDSs are able to analyze signal fingerprints, which can be useful in detecting and tracking rogue AP attacks [10]. Moreover, due to their distributed nature, wireless networks, especially ad hoc networks, are vulnerable to attacks. In this case, wireless intrusion detection provides audit and monitoring capabilities by deploying clustering algorithms to collaboratively detect wireless intrusions [5,11].

Wavelet Theory Fundamentals
Let $\mathbf{X} = [X_0, \ldots, X_{N-1}]^T$ be a vector of observations from a stochastic process. The discrete wavelet transform (DWT) is an orthonormal transform that maps $\mathbf{X}$ into a vector $\mathbf{W} = [W_0, \ldots, W_{N-1}]^T$ at a resolution $J$, where $\{W_0, \ldots, W_{N-1}\}$ denotes a set of reals, called the DWT coefficients, and $N = 2^J$. More precisely, the DWT can be expressed as follows:
$$\mathbf{W} = \mathcal{W}\mathbf{X}, \qquad (1)$$
where $T$ denotes the transposition operator, $\mathcal{W}$ is an $N \times N$ matrix defining the DWT and satisfying $\mathcal{W}\mathcal{W}^T = I_N$, and $I_N$ is the identity matrix of dimension $N$.
Orthonormality implies that $\mathbf{X} = \mathcal{W}^T\mathbf{W}$ and $\|\mathbf{X}\|^2 = \|\mathbf{W}\|^2$. Moreover, the elements of $\mathbf{W}$ can be decomposed into $J + 1$ subvectors such that
(i) the first $J$ subvectors are denoted by $(\mathbf{W}_j)_{j=1,\ldots,J}$, and the $j$th subvector contains all of the DWT coefficients for scale $\tau_j = 2^j$. This means that $\mathbf{W}_j$ is a column vector with $N/\tau_j$ elements;
(ii) the final subvector is denoted as $\mathbf{V}_J$ and contains only the scaling coefficient $W_{N-1}$.
Consequently, we obtain the multiresolution representation of $\mathbf{W}$ given by
$$\mathbf{W} = [\mathbf{W}_1^T, \ldots, \mathbf{W}_J^T, \mathbf{V}_J^T]^T.$$
According to this reasoning, (1) can be rewritten as follows:
$$\begin{bmatrix} \mathbf{W}_1 \\ \vdots \\ \mathbf{W}_J \\ \mathbf{V}_J \end{bmatrix} = \begin{bmatrix} \mathcal{W}_1 \\ \vdots \\ \mathcal{W}_J \\ \mathcal{V}_J \end{bmatrix} \mathbf{X},$$
where $\mathcal{W}_j$ and $\mathcal{V}_J$ are matrices defined by partitioning the rows of $\mathcal{W}$ according to the partition of $\mathbf{W}$ into $\mathbf{W}_1, \ldots, \mathbf{W}_J$, and $\mathbf{V}_J$. Thus, $\mathcal{W}_j$ is an $(N/\tau_j) \times N$ matrix and $\mathcal{V}_J$ is a row vector of $N$ elements.
Several variants of the DWT have been developed for various contexts. In this paper, we use the maximal overlap discrete wavelet transform (MODWT), which was first proposed in [12]. In contrast to the traditional DWT, the application of the MODWT to a vector $\mathbf{X}$ at a given level $J$ yields the column vectors $\widetilde{\mathbf{W}}_1, \widetilde{\mathbf{W}}_2, \ldots, \widetilde{\mathbf{W}}_J$, each of dimension $N$. The vector $\widetilde{\mathbf{W}}_j$, for a specific $j$ in $\{1, \ldots, J\}$, contains the MODWT wavelet coefficients associated with changes in $\mathbf{X}$ on a scale $\tau_j = 2^{j-1}$. The vector $\widetilde{\mathbf{V}}_J$ contains the MODWT scaling coefficients associated with variations at scale $2^J$. More concretely, for a given level $j$, the components of the $N$-dimensional vectors $\widetilde{\mathbf{W}}_j$ and $\widetilde{\mathbf{V}}_j$ are expressed as follows:
$$\widetilde{W}_{j,t} = \sum_{l=0}^{L_j-1} \tilde{h}_{j,l}\, X_{(t-l) \bmod N}, \qquad \widetilde{V}_{j,t} = \sum_{l=0}^{L_j-1} \tilde{g}_{j,l}\, X_{(t-l) \bmod N},$$
for $t = 0, \ldots, N-1$, where $h$ is the wavelet filter, $g$ is the scaling filter, $L$ denotes the width of $h$ and $g$, $\tilde{h}_{j,l} = h_{j,l}/2^{j/2}$, $\tilde{g}_{j,l} = g_{j,l}/2^{j/2}$, and $L_j = (2^j - 1)(L - 1) + 1$.
The most important properties of the MODWT are given in the following.
(i) While the partial DWT of level $J$ restricts the sample size to an integer multiple of $2^J$, the MODWT of level $J$ is well defined for any sample size $N$. When $N$ is a multiple of $2^J$, the DWT can be computed with $O(N)$ multiplications using the pyramid algorithm, whereas the corresponding MODWT requires $O(N \log_2 N)$ multiplications.
(ii) As for the DWT, the MODWT can be used to build a multiresolution analysis. In contrast to the traditional DWT, the details and smooths of this multiresolution analysis are such that circularly shifting the input vector by any amount shifts each detail and smooth by a corresponding amount.
(iii) In contrast with the DWT, the MODWT details and smooths are associated with zero-phase filters, which makes it easy to line up features in a multiresolution analysis meaningfully with the original observation vector.
(iv) The MODWT can be used to carry out an analysis of variance based on the wavelet and scaling coefficients.
(v) Whereas a circular shift of the observation vector modifies the DWT-based power spectra, the corresponding MODWT-based spectra remain unchanged. In fact, we can obtain the MODWT of a circularly shifted time series by just applying a similar shift to each of the components $(\widetilde{\mathbf{W}}_j)_{j \in \{1,\ldots,J\}}$ and $\widetilde{\mathbf{V}}_J$ of the MODWT of the original observation vector.
The last property is crucial in the context of variance changes. In fact, the signal is often shifted due to the lack of time synchronization between the nodes of the wireless network. The MODWT is therefore more convenient than the traditional DWT in this case because it preserves the time shift.
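The shift-preservation property in (v) can be checked numerically. The following is a minimal sketch, not the implementation used in this paper: it assumes the Haar filter for brevity, implements the MODWT pyramid algorithm with circular filtering in NumPy, and uses an illustrative function name (`modwt_haar`) and a random test signal that are not taken from the source.

```python
import numpy as np

def modwt_haar(x, levels):
    """Haar MODWT via the pyramid algorithm with circular filtering.

    Returns ([W_1, ..., W_J], V_J), each vector having the same length as x.
    The Haar filter is an assumption of this sketch.
    """
    v = np.asarray(x, dtype=float)
    h = np.array([0.5, -0.5])  # Haar MODWT wavelet filter (DWT filter / sqrt(2))
    g = np.array([0.5, 0.5])   # Haar MODWT scaling filter
    details = []
    for j in range(1, levels + 1):
        step = 2 ** (j - 1)    # filter taps are spaced 2^(j-1) samples apart
        w_next = np.zeros_like(v)
        v_next = np.zeros_like(v)
        for l in range(h.size):
            # np.roll(v, k)[t] == v[(t - k) mod N], i.e. circular filtering
            shifted = np.roll(v, step * l)
            w_next += h[l] * shifted
            v_next += g[l] * shifted
        details.append(w_next)
        v = v_next
    return details, v

# Usage: N = 37 is deliberately not a power of two.
rng = np.random.default_rng(0)
x = rng.normal(size=37)
w, v = modwt_haar(x, levels=3)
w_s, v_s = modwt_haar(np.roll(x, 5), levels=3)
shift_preserved = all(np.allclose(np.roll(a, 5), b) for a, b in zip(w, w_s))
print("shift preserved:", shift_preserved and np.allclose(np.roll(v, 5), v_s))
```

The same check fails for a decimated DWT, which is the practical reason for preferring the MODWT when the received signal may be time shifted.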

A Multilayer Detection Process for Wireless Networks
In this section, we discuss the architecture of the proposed multilayer statistical intrusion detection approach. We consider three major modules: (a) the preprocessor; (b) the detector; and (c) the postprocessor. Each module can be decomposed at a finer granularity into a set of submodules. Figure 1 shows the basic architecture.
In the following, we discuss the functions implemented by the three modules mentioned above.
(1) The physical and link layer preprocessors: the main objective at this level is to extract several features from the radio signals in order to determine whether the originating transceiver actually has the MAC address included in the link-layer header of the corresponding data frames. This allows detecting and identifying attackers that use device impersonation or MAC address spoofing techniques in order to hide their identities or gain unauthorized privileges. To implement this module, we develop a Radio Frequency Fingerprinting (RFF) technique (see Section 5). RFF has been successfully applied in many fields including wireless device localization, forensics, and radio frequency identification (RFID). Roughly speaking, an RFF technique should perform two fundamental tasks: transient detection and feature extraction. One novelty of our preprocessor is that it relies on the MODWT to detect the beginning of the transient. We carried out simulations to highlight the enhancement introduced by this wavelet-based technique. The most important advantage of using the MODWT is its shift-invariance property. In fact, given that clock synchronization can hardly be achieved in wireless networks, especially those using ad hoc infrastructures, the signal emanating from an emitting node will necessarily be time shifted when reaching its destination. This can severely affect the transient detection functionality, which is an important phase of the fingerprinting process. The results of these simulations are discussed in Section 8.
(2) Geometric unsupervised classification: typically, an unsupervised classification approach takes as input a set of unlabeled data and attempts to find specific events buried within the data. In the antispoofing problem, we are given a set of data for which it is unknown which elements originate from authenticated transceivers and which originate from impersonated devices. The goal is to identify the anomalous elements. The main advantage of such approaches is that they do not require the injection of a purely normal training set; the algorithm can operate on unlabeled data. This suits the anomaly detection context because the antispoofing filter operating in a mobile wireless environment should cope with a varying set of MAC addresses (as nodes may join or leave the network). The key characteristic of our framework (proposed in Section 6) is a mapping of the data provided by the physical and link layer preprocessors to a feature space, which is basically a vector space. Inside this vector space, the elements that are in low-density regions of the probability distribution are labeled as anomalous.
(3) Traffic model-based detection: techniques for detecting previously unseen network intrusion attempts often depend on finding anomalous behavior in network traffic streams. It follows that there is a need to produce traffic models that accurately reflect the characteristics of the applications of interest. It has been noticed in [13,14] that a large number of superimposed heavy-tailed ON/OFF processes can yield self-similar traffic with a degree of self-similarity assessed by the Hurst parameter [15]. In Section 7, we propose two models for the TCP protocol and for video transmission. These models allow detecting abnormal behavior (e.g., traffic pattern distortion).
In the following sections, we develop the detection mechanisms associated with the three aforementioned modules. Section 5 shows how physical layer preprocessing is carried out. The clustering algorithm used to discard spoofed packets is introduced in Section 6. Section 7 proposes a technique for detecting traffic injection attacks based on the self-similarity of TCP and video traffic behavior.

Physical Layer Preprocessor Design
One problem associated with the application of the DWT to transient detection is that it suffers from a lack of translation invariance. This means that shifting a time series will not necessarily shift its DWT coefficients in a similar manner.
Let $\mathbf{X} = [X_0, \ldots, X_{N-1}]$ be a time series representing the amplitude of the signal generated by a wireless transceiver. $\mathbf{X}$ can be regarded as a sequence of $R$ random variables $X_0, \ldots, X_{R-1}$ with zero means and different variances. It is noteworthy that $C_k$ measures the accumulation of variance in the signal as a function of time.
According to the definitions given above, the variance change point can be defined as where the operator argmax returns the integer k 0 for which the k-dependent expression is maximal.
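A minimal sketch of variance change-point detection is given below. It rests on two assumptions that are made here for illustration only: $C_k$ is taken to be the cumulative sum of squares $\sum_{i \le k} X_i^2$, and the $k$-dependent expression is taken to be the centered, normalized statistic $|C_k / C_N - k/N|$. The function name `variance_change_point` is hypothetical.

```python
import numpy as np

def variance_change_point(x):
    """Return the index k0 after which the variance of x appears to change.

    Assumption of this sketch: the change point maximises
    |C_k / C_N - k / N|, where C_k is the cumulative sum of squares.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    c = np.cumsum(x ** 2)          # C_k: accumulated variance up to sample k
    k = np.arange(1, n + 1)
    d = c / c[-1] - k / n          # zero on average when the variance is constant
    return int(np.argmax(np.abs(d)))

# Usage: channel noise (low variance) followed by a transient (high variance).
rng = np.random.default_rng(0)
signal = np.concatenate([0.1 * rng.normal(size=300), rng.normal(size=200)])
print("estimated start of transient near sample", variance_change_point(signal))
```

With a constant variance the statistic stays close to zero, so a pronounced extremum marks the instant at which the transient begins.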

Geometric Unsupervised Classification
where $s$ is the sliding factor for the windowing process. Every time the window is slid by $s$, we compute the average amplitude and frequency. For a frame $\phi_i$ and a window $j$, $a_{ij}$ and $f_{ij}$ denote the average amplitude and frequency of the corresponding transient, respectively. The feature map used to represent a captured frame $\phi_i$ combines these averages with $m_i$, where $M$ is the set of MAC addresses and $m_i$ is the physical address included in the link-layer header of frame $\phi_i$. Moreover, we introduce an application $\delta$ on $(\mathbb{R}^{2N_t} \times M) \times (\mathbb{R}^{2N_t} \times M)$, whose definition uses the following notation:
(ii) $\oplus$ denotes the "exclusive OR" operator on binary strings;
(iii) $\overline{\,\cdot\,}$ denotes the complement operator on binary strings;
(iv) $(\cdot)_{10}$ denotes the conversion of a binary string to the decimal basis;
(v) $\|\cdot\|$ denotes the $l_2$-norm on $\mathbb{R}^{2N_t}$.
It can be easily proved that $\delta$ defines a distance on $(\mathbb{R}^{2N_t} \times M) \times (\mathbb{R}^{2N_t} \times M)$. In the following, this distance will be used to build the frame clusters. To this end, we extend $\delta$ to the set of frames $\Phi$ by defining a distance $\delta_\phi$ on $\Phi \times \Phi$, which is used in the following subsection to develop a clustering algorithm on the set of frames.

Distance-Based Clustering.
The goal of this algorithm is to compute the local density of the feature space. In other terms, it should compute how many points are "near" each point in the feature space. In our context, these points, also referred to as elements, correspond to the captured network frames. The principal parameter of the algorithm is a radius $r$, also referred to as the cluster width. For any pair of points $x_1$ and $x_2$ in the feature space, we consider the two points "near" each other if their distance is less than or equal to $r$, which represents the typical cluster radius (i.e., $\delta(x_1, x_2) \le r$).
For each point $x$, we define $N(x)$ to be the number of points that are within $r$ of $x$. More formally, $N(x)$ is expressed using the set cardinality function $|\cdot|$ as follows:
$$N(x) = \left|\{\, y \in \Phi : \delta_\phi(x, y) \le r \,\}\right|.$$
The straightforward computation of $N(x)$ for all points has a complexity of $O(|\Phi|^2)$, where $|\Phi|$ is the cardinality of $\Phi$, because the pairwise distances between all points have to be computed. The approach that we develop in Algorithm 1 defines $N_c$ clusters based on the distance $\delta_\phi$. The complexity of this algorithm is $O(N_c \cdot |\Phi|)$, mainly because the construction of one cluster requires one pass through the set $\Phi$.
The clustering process is as follows. The first point in Φ (i.e., φ 1 ) is the center of the first cluster. For every subsequent point, if it is within r of a cluster center, it is added to that cluster. Otherwise, it is a center of a new cluster. Two important remarks about this clustering algorithm should be highlighted.
(1) Several points may be added to multiple clusters at the same time. We will show that this fact does not affect the anomaly detection process because it relies essentially on the cardinality of every cluster and the local density of the elements within the feature space.
(2) The first point in every cluster is the center of the cluster meaning that an unclustered element is assessed with respect to this point to determine whether it should be appended to the cluster or not.
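The clustering process described above can be sketched as follows. This is an illustrative reading of the text, not a transcription of Algorithm 1: the Euclidean distance stands in for $\delta_\phi$ (an assumption), frames are represented by plain coordinate tuples, and the names `cluster_frames` and `euclidean` are hypothetical.

```python
import numpy as np

def euclidean(a, b):
    """Stand-in for the frame distance; the paper's distance is not reproduced."""
    return float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))

def cluster_frames(points, r, dist=euclidean):
    """Single-pass clustering with cluster width r.

    Returns (centers, clusters): centers[k] is the index of the first point
    of cluster k, and clusters[k] lists the indices of its members.
    """
    centers, clusters = [], []
    for i, p in enumerate(points):
        placed = False
        for c, members in zip(centers, clusters):
            # A point joins every cluster whose center lies within r,
            # so it may belong to several clusters (remark (1)).
            if dist(points[c], p) <= r:
                members.append(i)
                placed = True
        if not placed:
            # No center is near enough: the point opens a new cluster
            # and becomes its center (remark (2)).
            centers.append(i)
            clusters.append([i])
    return centers, clusters

# Usage with toy two-dimensional features.
frames = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.2)]
print(cluster_frames(frames, r=1.0))
```

Each point is compared only with the current cluster centers, which is where the $O(N_c \cdot |\Phi|)$ cost comes from.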

begin N c = 1;

Spoofed Frame Detection.
Having clustered the set of captured frames, the IDS should identify the anomalous samples. According to our approach, the anomalies corresponding to MAC address spoofing lie in low-density regions of the probability distribution in the feature space. This is because the clustering algorithm presented in the previous subsection intuitively clusters the set of frames according to their source MAC addresses. The details of the subsequent procedure are given in Algorithm 2. In addition to the distance $\delta_\phi$ defined in (11), the algorithm uses the Mahalanobis distance that has been introduced in [16]. We use this distance to measure the intercluster correlation. More formally, we define the distance $\delta_M$ on $\Phi \times \Phi$ as follows:
$$\delta_M(\phi_1, \phi_2) = \sqrt{(\phi_1 - \phi_2)^T R^{-1} (\phi_1 - \phi_2)},$$
where $R$ is the covariance matrix of $\phi_1$ and $\phi_2$. If the covariance matrix is diagonal, the Mahalanobis distance can be expressed as a function of the distance $\delta_\phi$ introduced in (11) and of $\sigma_{\phi_1}$ and $\sigma_{\phi_2}$, which stand for the standard deviations of $\phi_1$ and $\phi_2$, respectively. Hence, we develop an anomaly detection algorithm that characterizes an attack instance as a frame $\phi$ satisfying one of the following properties.
(1) φ belongs to a cluster C k which is "far," in terms of Mahalanobis distance, from the most populated cluster.
(2) φ is far from the centroid of the cluster to which it belongs.
In the following, we discuss informally the anomaly detection algorithm.
(1) Find the largest cluster, that is, the one with the highest number of elements. This cluster is by default labeled as normal. Its centroid is labeled as $c^{\pi(1)}_{1}$.
(2) Sort the remaining clusters in descending order of the Mahalanobis distance from each cluster to $C_{\pi(1)}$.
(3) Within every cluster, sort the elements in descending order according to their distance $\delta_\phi$ from $c^{\pi(1)}_{1}$.
(4) Select the first $\varepsilon_1 N_c$ clusters and label them as potentially normal.
(5) Within every cluster $C_k$, select the first $\varepsilon_2 |C_k|$ elements and label them as normal.
(6) All the elements that have not been labeled as normal are labeled as attacks.
Clearly, the efficiency of this anomaly detection approach mainly depends on the choice of the parameters ε 1 and ε 2 . The false positive rate increases when the values of ε 1 and ε 2 are excessively small because most of the captured frames would be labeled as abnormal. Conversely, if ε 1 and ε 2 are large (i.e., very close to 1), the false negative rate increases as most of the frames would be labeled as normal. Moreover, the fingerprinting approach has an obvious influence on the false negative rate. If the RFF approach does not allow distinguishing two transients generated by two distinct transceivers, the efficiency of the geometric classification algorithm is severely affected. A good choice of the parameters ε 1 and ε 2 can be found experimentally.
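The role of $\varepsilon_1$ and $\varepsilon_2$ can be illustrated with a small sketch of the labeling procedure. Several choices below are assumptions made for illustration and are not taken from Algorithm 2: the Euclidean distance stands in for $\delta_\phi$, the covariance matrix is estimated from all feature vectors, clusters and elements are ranked nearest first so that the retained fractions are the ones closest to the normal cluster, elements are ranked by distance to the center of their own cluster, and the name `detect_anomalies` is hypothetical.

```python
import numpy as np

def detect_anomalies(points, clusters, centers, eps1, eps2):
    """Return the sorted indices of points labeled as attacks.

    clusters[k] lists member indices and centers[k] is the index of the
    center of cluster k. Ranking conventions are assumptions of this sketch.
    """
    pts = np.asarray(points, dtype=float)
    cov_inv = np.linalg.pinv(np.atleast_2d(np.cov(pts, rowvar=False)))

    def mahalanobis(a, b):
        d = a - b
        return float(np.sqrt(max(d @ cov_inv @ d, 0.0)))

    # Step 1: the most populated cluster is labeled normal by default.
    by_size = sorted(range(len(clusters)), key=lambda k: len(clusters[k]), reverse=True)
    largest = by_size[0]
    ref = pts[centers[largest]]
    # Steps 2 and 4: rank the other clusters and keep a fraction eps1 of N_c.
    others = sorted(by_size[1:], key=lambda k: mahalanobis(pts[centers[k]], ref))
    kept = [largest] + others[: int(np.floor(eps1 * len(clusters)))]
    # Steps 3 and 5: inside each kept cluster, keep a fraction eps2 of elements.
    normal = set()
    for k in kept:
        ranked = sorted(clusters[k], key=lambda i: np.linalg.norm(pts[i] - pts[centers[k]]))
        normal.update(ranked[: int(np.ceil(eps2 * len(ranked)))])
    # Step 6: everything else is labeled as an attack.
    return sorted(set(range(len(pts))) - normal)

# Usage: five frames from one transceiver and two from a distant one.
feats = [(0, 0), (0.1, 0), (0, 0.1), (0.2, 0.1), (0.3, 0.3), (10, 10), (10.1, 10)]
print(detect_anomalies(feats, [[0, 1, 2, 3, 4], [5, 6]], [0, 5], eps1=0.0, eps2=1.0))
```

Lowering either parameter shrinks the set labeled normal, which is the mechanism behind the increase in the false positive rate discussed above.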

The set of anomalous events A is expressed by
Algorithm 2: $A$ = anomaly detection ($\Phi$).
[13,17], which can be accurately measured using the wavelet transform. This section investigates the use of the wavelet transform and change-point detection algorithms in order to detect the instants when fractality changes abruptly. We demonstrate that transport-layer and application-layer traffic data exhibit long-range dependence features. We particularly study the examples of the transmission control protocol (TCP) at the transport layer and real-time video transmission at the application layer. We show how the Hurst parameter, which expresses the intensity of the long-range dependence phenomenon, can be estimated through the use of wavelet transforms. Recent studies have pointed out that TCP flows as well as real-time traffic tend to have self-similar behavior because of the intrinsic mechanisms they implement, such as traffic generation, aggregation, and control. The interested reader is referred to [14,17] for more details about these results. A detection approach can be developed by identifying the instant at which the traffic deviates from its normal model. This approach can be particularly efficient for detecting traffic distortion attacks, which consist in changing the normal traffic behavior by dropping or injecting packets [18].

Modeling Transport and Application Layer Traffic as Long-Range Dependent Processes.
A stationary stochastic process $X$ is said to be long-range dependent if its autocorrelation function decays at a rate slower than a negative exponential. In the frequency domain, long-range dependence appears as a $1/f$ spectrum around the origin, meaning that
$$\hat{X}(f) \sim c_f\, |f|^{1-2H}, \quad f \to 0,$$
where $\hat{X}$ is the Fourier transform of $X$, $c_f$ is a constant having dimension of variance, and $H$ denotes the Hurst parameter. It is noteworthy that $c_f$ and $H$ can be interpreted as quantitative and qualitative measures of long-range dependence, respectively. In the following, we discuss the long-range dependence properties of TCP and video broadcasting traffic.
The transport layer mainly deals with end-to-end congestion control and ensures that arbitrarily large streams of data are reliably delivered and arrive at their destination in the order sent. With high-quality traffic measurements at hand, accurate accounting of this multilevel hierarchy of measured network traffic is possible because all the relevant information can be obtained by looking inside the collected packets. As a result of the hierarchy of protocol architectures between the transport and application layers, actual network traffic can be viewed as the result of intertwined mechanisms and modes that exist at the different network layers.
We consider a network with a number of users/sources or end hosts communicating with each other, in which an individual source is modeled according to an on-off alternating renewal process as follows. The source alternates between an active (on) state, where it sends packets into the network, and an inactive (off) state, where it is idle and does not send any packet. Let $\{P(t)\}$ be a stationary process, where
$$P(t) = \begin{cases} 1, & \text{if the source is on at time } t, \\ 0, & \text{otherwise.} \end{cases}$$
The lengths of the on intervals are identically distributed, and so are the lengths of the off intervals. Furthermore, the lengths of on and off intervals are independent. An off interval always follows an on interval, and it is the pair of on and off intervals that defines the interrenewal period. Let $F_{on}$ and $F_{off}$ denote the cumulative distribution functions of the on and off intervals, respectively. Let $\bar{F} = 1 - F$ denote a complementary cumulative distribution function. Let also $\sigma_{on}$ and $\sigma_{off}$ represent the respective variances. For $x \to \infty$,
$$\bar{F}_{on}(x) \sim l_{on}\, x^{-\alpha_{on}}, \qquad \bar{F}_{off}(x) \sim l_{off}\, x^{-\alpha_{off}},$$
where $\alpha_{on}$, $\alpha_{off}$, $l_{on}$, and $l_{off}$ are constants. When $1 < \alpha_{on} < 2$, the distribution of on times is said to be heavy-tailed with exponent $\alpha_{on}$. Since it has infinite variance, the on time can be very long with relatively high probability. At this level, we are interested in analyzing the behavior of the cumulative load, $L(t) = \int_0^t P(u)\,du$, at large times $t$. This load has variance
$$\mathrm{Var}(L(t)) = 2 \int_0^t \left( \int_0^v \gamma(u)\,du \right) dv,$$
where $\gamma(u) = E(P(u)P(0)) - (E(P(0)))^2$ denotes the covariance function of $P$. It has been shown in [13] that this implies that
$$\mathrm{Var}(L(t)) \sim \sigma^2\, t^{2H}, \quad t \to \infty,$$
where $\sigma$ is a constant and $H = (3 - \min(\alpha_{on}, \alpha_{off}))/2$.
Similarly, video traffic can have self-similar behavior. The Moving Picture Experts Group (MPEG) defines a set of standards for the compression of video, or sequences of images. There are several versions of the standards. MPEG-1 is older, while MPEG-4 is more advanced and achieves better compression performance than MPEG-1. The basic principles of operation of both standards are rather similar. Compression is achieved by reducing the spatial and temporal redundancy in the sequence of images (frames). Spatial redundancy (redundancy within an image) is reduced by applying algorithms for the compression of still images (e.g., JPEG).
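Returning to the on/off source model introduced earlier in this subsection, the relation $H = (3 - \min(\alpha_{on}, \alpha_{off}))/2$ and the source process $P(t)$ can be sketched in a few lines. This is an illustration under stated assumptions, not the traffic generator used in the simulations: on and off durations are drawn from a Pareto distribution with minimum duration 1, time is discretised in unit steps, and the function names are hypothetical.

```python
import numpy as np

def hurst_from_tail_indices(alpha_on, alpha_off):
    """Hurst parameter implied by the heavy-tail exponents of the on/off times."""
    return (3.0 - min(alpha_on, alpha_off)) / 2.0

def simulate_on_off(n_steps, alpha_on, alpha_off, rng):
    """Sample P(t) in {0, 1} for t = 0, ..., n_steps - 1, starting in the on state.

    Assumption: durations are 1 + Lomax(alpha), i.e. Pareto with minimum 1,
    rounded up to whole time steps.
    """
    p = np.zeros(n_steps)
    t, on = 0, True
    while t < n_steps:
        alpha = alpha_on if on else alpha_off
        length = int(np.ceil(1.0 + rng.pareto(alpha)))
        if on:
            p[t:t + length] = 1.0   # slicing past the end is harmless in NumPy
        t += length
        on = not on                 # an off interval always follows an on interval
    return p

# Usage: one source, and the cumulative load L(t) as a running sum of P.
rng = np.random.default_rng(1)
p = simulate_on_off(10_000, alpha_on=1.4, alpha_off=1.8, rng=rng)
load = np.cumsum(p)
print("implied Hurst parameter:", hurst_from_tail_indices(1.4, 1.8))
```

Superimposing many such independent sources gives aggregate traffic whose degree of self-similarity is governed by the heavier of the two tails.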
It was shown in [19,20] that variable bit rate (VBR) video traffic can belong to the class of long-range dependent processes, as follows.
(i) The correlation $r_k$ exhibits hyperbolic decay for large lags $k$: $r_k \to c_0 k^{-\beta}$ as $k \to \infty$.
(iii) The variance $\sigma_n^2$ of the sample mean decreases more slowly than the inverse of the sample size $n$: $\sigma_n^2 = \sigma^2(\bar{X}_n) \to c_2 n^{-\beta}$ as $n \to \infty$, where $\bar{X}_n = \sum_{i=1}^{n} X_i / n$ and $c_0$, $c_1$, $c_2$ are constants.
The constant $\beta \in [0, 2]$ reflects the function type: $0 \le \beta < 1$ indicates long-range dependence, and $1 < \beta \le 2$ indicates short-range dependence. (The degree of persistence is often expressed by means of the Hurst exponent $H = 1 - \beta/2$.) Long-range dependence is defined within the limits of the weak stationarity structure [19,21], that is, stationarity in the wide sense.
The stationarity and the ergodicity allow statistical estimates such as the mean value and the variance or other model parameters to be found from each separate data sample, or in this case from the separate time series. If the assumptions of stationarity and ergodicity do not hold, certain measures such as the mean value and the variance may be without meaning. In reality, the mean value of the VBR video time series converges very slowly, which can be caused by nonstationarity and not necessarily by long-range dependence. More details about this aspect are given in the appendix.

TCP and Video Broadcasting Wavelet Analysis.
Many methods have been used to estimate the Hurst self-similarity exponent, such as R/S analysis, variance-time plots, periodogram analysis, and Whittle analysis. However, the long-range dependence property leads to seriously biased estimates and to difficulties in obtaining convergent estimates. Consequently, we investigate the use of the wavelet transform in order to cope with these shortcomings.
The advantages of the wavelet analysis result from the fact that the wavelet functions themselves demonstrate the scaling property and, therefore, form the optimal "coordinates system," from which the scaling phenomena can be traced. This analysis provides steady detection of the scaling behavior, its type and an accurate measurement of the parameters in order to describe this scaling behavior.
According to Section 3, the time series $X(t)$ can be represented in the form
$$X(t) = X_J(t) + \sum_{j=1}^{J} D_j(t),$$
where $X_J(t) = \sum_{k=0}^{n_0/2^J - 1} s_{J,k}\, \varphi_{J,k}(t)$ is the initial approximation function corresponding to the scale $J$ ($J \le J_{max}$); $s_{J,k} = \langle X(t), \varphi_{J,k} \rangle$ is the scaling coefficient, equal to the scalar product of the initial series $X(t)$ and the scaling function of the "roughest" scale $J$, displaced by $k$ scale units to the right from the origin of coordinates; $D_j(t) = \sum_{k=0}^{n_0/2^j - 1} d_{j,k}\, \psi_{j,k}(t)$ is the refining function of the $j$th scale; and $d_{j,k} = \langle X(t), \psi_{j,k} \rangle$ is the wavelet coefficient for scale $j$, equal to the scalar product of the initial series $X(t)$ and the wavelet with scale $j$, displaced by $k$ scale units to the right from the origin of coordinates.
The normalized wavelet and scaling functions of the Haar system give good results for discrete time series analysis. If

$$\varphi(t) = \begin{cases} 1, & 0 \le t < 1, \\ 0, & \text{otherwise}, \end{cases}$$

then

$$\psi(t) = \begin{cases} 1, & 0 \le t < 1/2, \\ -1, & 1/2 \le t < 1, \\ 0, & \text{otherwise}, \end{cases}$$

where $\psi$ is an orthonormal wavelet in $L^2(\mathbb{R})$. It is called the Haar wavelet, and $\{\psi_{j,k} : j, k \in \mathbb{Z}\}$ is an orthonormal system in $L^2(\mathbb{R})$. The wavelet coefficients of the time series expansion over the wavelet basis and the Hurst exponent $H$ satisfy the following relation:

$$\log_2\left(\frac{1}{K_j}\sum_{k=0}^{K_j - 1} d_{j,k}^2\right) \sim (2H - 1)\,j + \log_2 C_W, \tag{21}$$

where $K_j = n_0/2^j$ is the number of wavelet coefficients at scale $j$; $C_W = c_f\,C(\alpha, \psi)$ is a parameter that does not depend on the scale $j$; and $\alpha = 2H - 1$.
The number of wavelet coefficients decreases as the scale increases. Formula (21) is used to estimate the Hurst exponent of the LRD video sequences. It means that if $X$ is an LRD process with Hurst exponent $H$, the plot of $\log_2\big((1/K_j)\sum_k d_{j,k}^2\big)$ against $j$, referred to as the logarithmic diagram (LD), should be linear with slope $2H - 1$, so the scaling exponent $2H - 1$ can be obtained by estimating the slope of this plot. Therefore, the Hurst exponent estimate can be found by fitting a straight line to the LD using the weighted least squares (WLS) method.
The logarithm of the sample mean $(1/K_j)\sum_k d_{j,k}^2$ is an estimate of $\log_2 \mu_j$, but a biased one, since the nonlinearity of the logarithm implies that $M \log_2(d_j^2) \neq \log_2(M d_j^2) = j\alpha + \log_2 C_W$, where $M$ denotes the mathematical expectation. As shown in [22][23][24], the regression analysis problem reduces to the equation $M y_j = j\alpha + \log_2 C_W$. The slope $\alpha$ can be estimated by weighted linear regression, in which $x_j = j$ and $\sigma_j^2 = \mathrm{Var}(y_j)$. Defining the quantities $S = \sum_{j=j_1}^{j_2} 1/\sigma_j^2$, $S_1 = \sum_{j=j_1}^{j_2} j/\sigma_j^2$, and $S_2 = \sum_{j=j_1}^{j_2} j^2/\sigma_j^2$, the weighted estimate $\hat{\alpha}$ of $\alpha$ is

$$\hat{\alpha} = \frac{\sum_{j=j_1}^{j_2} y_j\,(S j - S_1)/\sigma_j^2}{S S_2 - S_1^2},$$

which is unbiased over the interval $[j_1; j_2]$. In addition, assuming a weak correlation between the wavelet coefficients and Gaussian $d_{j,k}$, the variance $\sigma_j^2$ can be estimated by the expression

$$\sigma_j^2 = \frac{\zeta(2, K_j/2)}{\ln^2 2},$$

where $\zeta(z, \nu) = \sum_{n=0}^{\infty} (\nu + n)^{-z}$ is the generalized Riemann zeta function.
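As a rough illustration of the estimator described above, the following Python sketch computes Haar detail coefficients, forms the logarithmic diagram, and fits its slope by weighted least squares. The function names (`haar_details`, `hurst_wavelet`), the default scale range, and the large-$K_j$ approximation $\sigma_j^2 \approx 2/(K_j \ln^2 2)$ are assumptions made for this example and not the implementation used in the experiments; the bias correction of the logarithm is omitted for brevity.

```python
import math
import random


def haar_details(x, levels):
    """Return [d_1, ..., d_levels]: orthonormal Haar detail coefficients per scale."""
    approx = list(x)
    details = []
    for _ in range(levels):
        half = len(approx) // 2
        # Scaled differences are the wavelet coefficients; scaled sums feed the next scale.
        d = [(approx[2 * k] - approx[2 * k + 1]) / math.sqrt(2) for k in range(half)]
        approx = [(approx[2 * k] + approx[2 * k + 1]) / math.sqrt(2) for k in range(half)]
        details.append(d)
    return details


def hurst_wavelet(x, j1=1, j2=8):
    """Estimate H from the slope alpha = 2H - 1 of the logarithmic diagram."""
    details = haar_details(x, j2)
    scales, y, inv_var = [], [], []
    for j in range(j1, j2 + 1):
        d = details[j - 1]
        k_j = len(d)                                  # number of coefficients at scale j
        mean_sq = sum(c * c for c in d) / k_j         # sample mean of d_{j,k}^2
        scales.append(j)
        y.append(math.log2(mean_sq))                  # bias correction omitted (assumption)
        inv_var.append(k_j * math.log(2) ** 2 / 2)    # 1/sigma_j^2, large-K_j approximation
    s = sum(inv_var)
    s1 = sum(w * j for w, j in zip(inv_var, scales))
    s2 = sum(w * j * j for w, j in zip(inv_var, scales))
    num = sum(w * yj * (s * j - s1) for w, yj, j in zip(inv_var, y, scales))
    alpha = num / (s * s2 - s1 ** 2)                  # weighted least squares slope
    return (alpha + 1) / 2


if __name__ == "__main__":
    random.seed(0)
    white_noise = [random.gauss(0.0, 1.0) for _ in range(2 ** 14)]
    # Uncorrelated data has no long-range dependence, so H should be near 0.5.
    print(round(hurst_wavelet(white_noise), 2))
```

The input length must be at least $2^{j_2}$ so that every scale in the regression has coefficients; in practice the range $[j_1; j_2]$ would be chosen by inspecting where the logarithmic diagram is linear.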

Simulation of the Anomaly Detection Module.
In order to assess the geometric clustering methodology proposed in this paper, we simulated a network composed of 20 nodes. The global flow consists of about $10^6$ packets and the attack rate is 0.1 (10% of the packets are spoofed). It is assumed that the attack packets follow a Gaussian distribution within the total traffic. The uncertainty related to the MODWT-based fingerprinting mechanism has been set to $10^{-3}$.
Based on these assumptions, we evaluated our anomaly-based detection approach against three well-known methods: modified cluster TV [25], K nearest neighbors (KNN) [26], and support vector machine (SVM) [27]. This evaluation is based on receiver operating characteristic (ROC) curves. The reader may wonder about the choice of these methods, since they are fundamentally supervised while our geometric technique is unsupervised. Our aim is to demonstrate that, even though geometric clustering does not require a training set to optimize its intrinsic parameters, its performance is comparable to that of supervised algorithms that have been extensively used in the intrusion detection context. From our experiments, we found that not all the attacks could be detected. This may be due to two essential factors.
(1) Using our feature map $\mu_{w,s}$, some of the spoofed frames can lie in the same region of the feature space as the normal frames. In fact, the signal fingerprinting technique can produce falsely correlated fingerprints for distinct physical addresses.
(2) The parameters $\varepsilon_1$ and $\varepsilon_2$ do not fit the actual probability distribution of the data traffic across the network.
For $\varepsilon_1 = \varepsilon_2 = 0.8$, we found that the geometric clustering approach yields fewer false positives than the other methods while keeping the same rate of false negatives (Figure 5). Figure 6 plots the ROC curve for different values of $\varepsilon_1$ and $\varepsilon_2$. These results confirm our remark in Section 6.3 that, in contrast to the false negative rate, the false positive rate decreases as $\varepsilon_1$ and $\varepsilon_2$ increase.
One possible way to adapt $\varepsilon_1$ and $\varepsilon_2$ to the performance of the classifier is to fix a priori a value for the area under the ROC curve (AUC), and then estimate the values of $\varepsilon_1$ and $\varepsilon_2$ for which the ROC curve has the required AUC. The AUC is the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. It can be easily computed using the formula $\mathrm{AUC} = (G + 1)/2$, where $G$ is the Gini coefficient [28].
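The relation between the AUC and the Gini coefficient can be checked numerically. The sketch below is only an illustration: it integrates a ROC curve with the trapezoidal rule and converts the result to a Gini coefficient. The function names and the sample ROC points are hypothetical and are not taken from the experiments.

```python
def auc_trapezoid(roc_points):
    """Area under a ROC curve given (false positive rate, true positive rate) pairs."""
    pts = sorted(roc_points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0   # trapezoid between consecutive points
    return area


def gini_from_auc(auc):
    """Gini coefficient G such that AUC = (G + 1) / 2."""
    return 2.0 * auc - 1.0


if __name__ == "__main__":
    # Hypothetical ROC points, for illustration only.
    roc = [(0.0, 0.0), (0.1, 0.6), (0.4, 0.9), (1.0, 1.0)]
    auc = auc_trapezoid(roc)
    print(auc, gini_from_auc(auc))
```

A classifier that guesses at random follows the diagonal (AUC of 0.5, Gini coefficient of 0), whereas a perfect classifier reaches an AUC of 1.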
To reduce the computational cost of estimating $\varepsilon_1$ and $\varepsilon_2$, we can draw the ROC curves for two pairs $(\varepsilon_1^1, \varepsilon_2^1)$ and $(\varepsilon_1^2, \varepsilon_2^2)$. Then, we compute the corresponding AUCs, say $A_1$ and $A_2$. Supposing that $A_r$ is the required AUC, interpolating functions (e.g., polynomials or splines) can be used to estimate the corresponding values $\varepsilon_1^r$ and $\varepsilon_2^r$. Obviously, more than two pairs can be used for a more accurate estimation of $\varepsilon_1^r$ and $\varepsilon_2^r$. However, this would result in a computational overhead.
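With only two evaluated pairs, the simplest interpolating function is a straight line. The following sketch assumes linear interpolation in the AUC, which is one possible choice among the polynomial and spline interpolants mentioned above; the function name and the numerical values are hypothetical.

```python
def interpolate_thresholds(pair_1, auc_1, pair_2, auc_2, auc_required):
    """Linearly interpolate (eps_1, eps_2) so that the AUC matches auc_required."""
    if auc_1 == auc_2:
        raise ValueError("the two reference AUCs must differ")
    t = (auc_required - auc_1) / (auc_2 - auc_1)   # position between the two pairs
    return tuple(a + t * (b - a) for a, b in zip(pair_1, pair_2))


if __name__ == "__main__":
    # Hypothetical reference pairs and AUCs, for illustration only.
    eps_r = interpolate_thresholds((0.6, 0.6), 0.80, (0.8, 0.8), 0.90, 0.85)
    print(eps_r)
```

If the required AUC lies outside the interval spanned by the two reference AUCs, the same formula extrapolates, and the result should be treated with more caution.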

Traffic Pattern Distortion Detection.
To test the efficiency of the traffic pattern distortion detector, we generated TCP traffic following the statistical model presented in Section 7 and injected eight denial-of-service attack instances. We used the wavelet-based Hurst parameter estimator described in Section 7 in conjunction with three change-point detection algorithms: moving window iterated cumulative sums of squares (MWICSS), moving window Schwarz information criterion (MWSIC), and moving window Wang's jump (MWWJ) [29]. The simulation scenario can be described through the following steps.
Step 1. We apply the DWT and MODWT. The maximum level of the transforms depends on the length of the window. Whitcher et al. [29] recommend using at least 128 data points to implement the variance change test. Moreover, we want to apply the Ljung-Box test for autocorrelation with maximum lag 10 to the coefficients (see Step 2). For the sake of clarity and computational efficiency, we choose to compute wavelet transforms up to level 4.
Step 2. Applying the MWICSS and MWSIC algorithms to test for variance changes requires uncorrelated data. We therefore choose the DWT packet with the highest P-value among those packets of the tree for which the null hypothesis of the Ljung-Box test for autocorrelation is not rejected.
Step 3. We test for variance changes (with either the ICSS or the SIC algorithm) using the coefficients of the DWT packet selected from Step 2. If the null hypothesis that no variance change occurs is rejected then we identify the location of the change point using now the nondecimated wavelet packet coefficients of the packet selected in Step 2.
Step 4. Using the binary segmentation procedure, we repeat Steps 1-3 with subsequent subseries until no further variance change point is found. In the case of the ICSS procedure, we also perform the additional confirmatory step on all identified potential change points by using subseries of data between adjacent points, as suggested by Inclán and Tiao [30].
Step 5. We record information of the type $(t_j; f_j)$, where $t_j$ is a time location and $f_j$ is its frequency of detection, that is, how many times a change at that point has been detected by the method up to the window under consideration. We declare a certain time point to be a variance change if its frequency of detection is greater than or equal to a predetermined threshold $T$. A smaller $T$ implies not only a faster detection but also a larger number of false alarms.
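The windowed voting scheme of Step 5 can be sketched as follows. This is a deliberately simplified illustration and not the procedure used in the experiments: it applies the centered cumulative sum of squares statistic of Inclán and Tiao directly to the raw samples, and it omits the wavelet packet decomposition, the Ljung-Box selection, and the binary segmentation of Steps 1-4. The function names, the step size of 16, and the toy series are assumptions; 1.358 is the asymptotic 95% critical value of the statistic.

```python
import math

CRITICAL_95 = 1.358  # asymptotic 95% critical value of the ICSS statistic


def icss_statistic(x):
    """Return (k, stat): candidate change location and ICSS test statistic.

    k is the number of observations before the candidate variance change.
    """
    n = len(x)
    total = sum(v * v for v in x)
    if total == 0.0:
        return 0, 0.0
    best_k, best_abs, running = 0, 0.0, 0.0
    for k, v in enumerate(x, start=1):
        running += v * v
        d_k = running / total - k / n          # centered cumulative sum of squares
        if abs(d_k) > best_abs:
            best_k, best_abs = k, abs(d_k)
    return best_k, math.sqrt(n / 2.0) * best_abs


def moving_window_votes(series, window=128, step=16, threshold=2):
    """Slide a window over the series and keep locations detected >= threshold times."""
    votes = {}
    for start in range(0, len(series) - window + 1, step):
        k, stat = icss_statistic(series[start:start + window])
        if stat > CRITICAL_95:
            location = start + k               # position in the full series
            votes[location] = votes.get(location, 0) + 1
    return sorted(t for t, f in votes.items() if f >= threshold)


if __name__ == "__main__":
    # Deterministic toy series: amplitude 1 for 300 samples, then amplitude 4.
    series = [(-1) ** i * 1.0 for i in range(300)] + [(-1) ** i * 4.0 for i in range(300)]
    print(moving_window_votes(series))
```

Lowering `threshold` makes a change point appear after fewer overlapping windows, at the price of accepting locations that only a single window supports, which mirrors the trade-off between detection delay and false alarms stated in Step 5.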
The plots of Figure 7 give a graphical representation of the performance of the three detection methods. Each of the two subplots contains a different portion of the signal, displaying the 1st, 2nd, and 3rd attacks and the 4th, 6th, and 8th attacks, respectively, as representatives of the two different kinds of change points, in mean and in variance. Results for MWICSS and MWSIC are for a threshold level of 2 and a window size of 128; those for MWWJ are for a window size of 128. In these plots, the solid circles indicate the real change points, the squares the points detected by the MWICSS, the diamonds those detected by the MWSIC, and the triangles those detected by the MWWJ. Notice how the MWICSS and MWSIC algorithms do a better job at detecting attacks of the first type, which show variance changes. However, there appears to be an asymmetry in the detection behavior of these two methods: both the MWICSS and the MWSIC detect the start of the attacks but show a relatively large delay in detecting the ending points. In other words, these algorithms seem to be sensitive to the location of the change points and to the variance ratio.

Conclusion
In this paper, we presented a multilayer intrusion detection approach for wireless networks. Our approach combines a physical layer antispoofing filter with advanced statistical traffic anomaly detectors. The antispoofing technique consists of a radio signal fingerprinting mechanism and a geometrical clustering algorithm, while traffic anomaly detection is based on the estimation of the Hurst parameter of the real traffic. Thorough simulations show that our IDS provides better performance than the best-known existing approaches. Furthermore, a postprocessing module is currently under development. Cooperative tracking using large groups of mobile detector nodes is being investigated for this purpose. A Kalman filter-like estimator is being implemented and tested in order to examine the effect of the detector node density in the monitored area on the accuracy of the tracking results. More precisely, we assess the improvement in tracking efficiency per additional detector node as the coverage of the monitored region increases.

Appendix
For the first accuracy order, where $\omega \neq 0, \pm\pi, \ldots$. Thus, the estimate $\log(I_N)$ is closer to normality than the nontransformed estimate. To confirm (or refute) the assumption of weak nonstationarity, the process $X$ is divided into $I$ segments, each of which is centered at time $t_i$ and has length $N$. For each $i$th segment, the power spectral density $I_{N,i}(\omega)$ is calculated in accordance with (A.2). The smoothed periodogram (A.2) is discretized at the frequencies $\omega_j = \pi j/N$ ($j = j_0 + k\Delta j$, $k = 0, 1, \ldots, J$), and taking the logarithm gives the two-dimensional random variable $Y_{ij} = \log[I_{N,i}(\omega_j)]$. If the frequencies $\omega_j$, like the times $t_i$, are spaced widely enough, the random variable $Y_{ij}$ is approximately normally distributed and uncorrelated [33]. The approximate normality of $Y_{ij}$ and the lack of correlation in both dimensions imply the approximate independence of the $Y_{ij}$. Therefore, the method of variance analysis can be used to determine the structure of the underlying random process [32,33]:

$$Y_{ij} = \mu + a(t_i) + b(\omega_j) + c(t_i, \omega_j) + \eta_{ij}, \tag{A.5}$$

where $\eta_{ij}$ is an independent and identically distributed normal random variable with zero mean and variance $\sigma^2$, defined by the relation (A.4). The presence of $c(t_i, \omega_j)$ and $a(t_i)$ can be checked using the variables

$$S_T = J \sum_{i=1}^{I} \left(Y_{i\cdot} - Y_{\cdot\cdot}\right)^2, \qquad S_{I+R} = \sum_{i=1}^{I} \sum_{j=1}^{J} \left(Y_{ij} - Y_{i\cdot} - Y_{\cdot j} + Y_{\cdot\cdot}\right)^2,$$

where the dot denotes the mean value over the index it replaces: for example, $Y_{\cdot j} = \sum_{i=1}^{I} Y_{ij}/I$. For a stationary process, the terms $c(t_i, \omega_j)$ and $a(t_i)$ are expected to vanish. In this case, the variables $S_{I+R}/\sigma^2$ and $S_T/\sigma^2$ are $\chi^2$-distributed with $(I - 1)(J - 1)$ and $(I - 1)$ degrees of freedom, respectively. The stationarity hypothesis is rejected if one of the test statistics exceeds the critical value of the appropriate $\chi^2$ distribution at the 1% significance level. This test cannot be used in the case of long-range dependence, because the noise is then not normally distributed and is correlated.
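The two test variables can be computed directly from the matrix of log-periodogram values. The sketch below assumes the standard two-way analysis of variance sums of squares (between-time sum and interaction-plus-residual sum), which is an interpretation of $S_T$ and $S_{I+R}$ consistent with the degrees of freedom quoted above; the function name and the toy matrix are hypothetical. The comparison against a $\chi^2$ quantile is left out, since it requires a statistical library.

```python
def stationarity_sums(y):
    """Return (S_T, S_IR) for y[i][j], with i indexing time segments and j frequencies."""
    n_i, n_j = len(y), len(y[0])
    row_mean = [sum(row) / n_j for row in y]                                   # mean over j
    col_mean = [sum(y[i][j] for i in range(n_i)) / n_i for j in range(n_j)]    # mean over i
    grand = sum(row_mean) / n_i
    # Between-time sum of squares: sensitive to the term a(t_i).
    s_t = n_j * sum((m - grand) ** 2 for m in row_mean)
    # Interaction plus residual sum of squares: sensitive to the term c(t_i, w_j).
    s_ir = sum(
        (y[i][j] - row_mean[i] - col_mean[j] + grand) ** 2
        for i in range(n_i)
        for j in range(n_j)
    )
    return s_t, s_ir


if __name__ == "__main__":
    # Purely additive toy matrix: a time effect and a frequency effect, no interaction.
    toy = [[0.0, 10.0], [1.0, 11.0], [2.0, 12.0]]
    print(stationarity_sums(toy))
```

For a purely additive matrix such as the toy example, the interaction sum vanishes while the between-time sum reflects the drift across segments; dividing both sums by $\sigma^2$ gives the quantities compared with the $\chi^2$ distribution.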