The attack and defense problem is modelled as a multi-stage proactive jamming/anti-jamming stochastic game. A stochastic game [22] is played in a sequence of steps, where at the end of each step, every player receives a payoff for the current step and chooses an action for the next step that is expected to maximize his payoff. A player’s payoff in each step is determined not only by his own action but also by the actions of all the other players in the game. The collection of all actions that a player can take comprises his (finite) action set. The distribution over a player’s choices of actions constitutes his strategy. The strategy may be fixed or may be updated according to the deployed learning algorithm.

The proposed game is an extension of a Markov decision process (MDP), whose state transition probabilities may be depicted as finite Markov chains.

The modelled game consists of two players: transmitter *T* and jammer *J*. At the end of each step, every player observes his payoff for the given step and decides either to continue transmitting with the same power and at the same frequency or to change one of them, or both. The payoff is the sum of the reward for successful transmission (jamming), the penalty for unsuccessful transmission (jamming), and negative terms for the cost of transmission (jamming) and the cost of frequency hopping. The transmission (jamming) cost is related to the power spent by the user for transmitting (jamming) in a given step. The hopping cost reflects the fact that, after the transceiver pair (jammer) changes channel, a certain time elapses before communication can be resumed (interference created), owing to the settling time of the radios or to other hardware constraints.

A generalized payoff at the end of step *s* for transmitter *T* is expressed as (1). Here, *R*^{T} denotes the reward for successful transmission, *X*^{T} is the fixed penalty sustained for unsuccessful transmission, *H* is the hopping cost, *g*(*C*^{T}) is a function that expresses the transmitter’s cost of transmission when power *C*^{T} is used, *f*^{T} is the channel currently used by the transmitter-receiver pair, *α*=1 if transmission is successful and *α*=0 if not, and *β*=1 if the transmitter decides to hop and *β*=0 otherwise. In this notation, subscripts denote steps and superscripts denote players.

{P}_{s}^{T}\left({C}_{s}^{T},{f}_{s}^{T},{C}_{s}^{J},{f}_{s}^{J}\right)={R}^{T}\alpha -{X}^{T}\left(1-\alpha \right)-H\cdot \beta -g\left({C}_{s}^{T}\right)

(1)

Similarly, jammer *J*’s generalized payoff for the step *s* is given as (2). Here, *R*^{J} is the jammer’s reward for successful jamming, *X*^{J} is the sustained fixed penalty for the unsuccessful jamming, *g*(*C*^{J}) is the jammer’s cost of transmission when power *C*^{J} is used. Finally, *γ*=1 if the jammer decides to hop and 0 if it does not.

{P}_{s}^{J}\left({C}_{s}^{T},{f}_{s}^{T},{C}_{s}^{J},{f}_{s}^{J}\right)={R}^{J}\left(1-\alpha \right)-{X}^{J}\alpha -H\cdot \gamma -g\left({C}_{s}^{J}\right)

(2)
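As a minimal sketch of (1) and (2), the following Python functions evaluate the payoff of one step for each player. The function and parameter names are illustrative assumptions rather than part of the model. The success indicator *α* is passed in directly, since its dependence on the chosen powers and frequencies is not specified at this point, and the cost function *g* is supplied by the caller (the usage lines assume a linear cost purely for illustration).

```python
from typing import Callable


def transmitter_payoff(alpha: int, beta: int, power: float,
                       reward: float, penalty: float, hop_cost: float,
                       g: Callable[[float], float]) -> float:
    """Payoff (1): R^T*alpha - X^T*(1 - alpha) - H*beta - g(C^T)."""
    return reward * alpha - penalty * (1 - alpha) - hop_cost * beta - g(power)


def jammer_payoff(alpha: int, gamma: int, power: float,
                  reward: float, penalty: float, hop_cost: float,
                  g: Callable[[float], float]) -> float:
    """Payoff (2): R^J*(1 - alpha) - X^J*alpha - H*gamma - g(C^J)."""
    return reward * (1 - alpha) - penalty * alpha - hop_cost * gamma - g(power)


# Hypothetical numbers: linear cost g(C) = 0.5 * C, successful transmission
# (alpha = 1), the transmitter hops (beta = 1), the jammer stays (gamma = 0).
linear_cost = lambda c: 0.5 * c
p_t = transmitter_payoff(alpha=1, beta=1, power=2.0,
                         reward=10.0, penalty=5.0, hop_cost=1.0, g=linear_cost)
p_j = jammer_payoff(alpha=1, gamma=0, power=4.0,
                    reward=10.0, penalty=5.0, hop_cost=1.0, g=linear_cost)
```

With these assumed values the transmitter collects its reward minus the hopping and power costs, while the jammer pays the penalty for unsuccessful jamming plus its own power cost.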

### 3.1 Equilibrium analysis of the game

Nash equilibrium is inarguably the central concept in game theory, representing the most common notion of rationality between the players involved in the game. It is defined as the set of distributions of players’ strategies designed in a way that no player has an incentive to unilaterally deviate from its strategy distribution.

Let *n*_{f} be the number of discrete channels available to both players for channel hopping, and let {n}_{{C}^{T}} and {n}_{{C}^{J}} be the number of discrete transmission powers available to the transmitter and the jammer, respectively. For the game with {n}_{f}\cdot {n}_{{C}^{T}} pure strategies available to each player (with {n}_{{C}^{T}}={n}_{{C}^{J}}), we define *S*^{T} as the set of pure strategies of the transmitter and *S*^{J} as the set of pure strategies of the jammer. Then, x\in {\mathbb{R}}^{{S}^{T}} and y\in {\mathbb{R}}^{{S}^{J}}, treated as row vectors, represent the mixed strategies of the transmitter and jammer, respectively. By denoting the payoff matrices of the transmitter and jammer as *A* and *B*, respectively, a best response to the mixed strategy *y* of the jammer is a mixed strategy *x*^{∗} of the transmitter that maximizes its expected payoff {x}^{\ast}A{y}^{\top}. Similarly, the jammer’s best response *y*^{∗} to the transmitter’s mixed strategy *x* is one that maximizes xB{y}^{\ast \top}. A pair (*x*^{∗},*y*^{∗}) of strategies that are best responses to each other is a Nash equilibrium of the bimatrix game, i.e., for any other mixed strategies *x* and *y* the following inequalities hold:

\begin{array}{l}xA{y}^{\ast \top}\le {x}^{\ast}A{y}^{\ast \top},\end{array}

(3)

\begin{array}{l}{x}^{\ast}B{y}^{\top}\le {x}^{\ast}B{y}^{\ast \top}.\end{array}

(4)

In 1951, Nash proved that every finite non-cooperative game has at least one mixed Nash equilibrium [23]. A specialization of this proof to bimatrix games may be given as follows [24]:

Let (*x*,*y*) be an arbitrary pair of mixed strategies for the bimatrix game (*A*,*B*), and let *A*_{i·} and *B*_{·j} denote the *i*th row of *A* and the *j*th column of *B*, respectively. Then,

\begin{array}{l}{c}_{i}=\max \left\{{A}_{i\cdot }{y}^{\top}-xA{y}^{\top},0\right\},\end{array}

(5)

\begin{array}{l}{d}_{j}=\max \left\{x{B}_{\cdot j}-xB{y}^{\top},0\right\},\end{array}

(6)

\begin{array}{l}{x}_{i}^{\prime}=\frac{{x}_{i}+{c}_{i}}{1+{\sum}_{k}{c}_{k}},\end{array}

(7)

\begin{array}{l}{y}_{j}^{\prime}=\frac{{y}_{j}+{d}_{j}}{1+{\sum}_{k}{d}_{k}}.\end{array}

(8)

The map *T*(*x*,*y*) = (*x*^{′},*y*^{′}) is continuous, and *x*^{′} and *y*^{′} are again mixed strategies. We now show that (*x*^{′},*y*^{′}) = (*x*,*y*) if and only if (*x*,*y*) is an equilibrium pair. First, if (*x*,*y*) is an equilibrium pair, then for all *i*:

\begin{array}{l}{A}_{i\cdot }{y}^{\top}\le xA{y}^{\top},\end{array}

(9)

hence *c*_{i}=0 (and similarly *d*_{j}=0 for all *j*), meaning that *x*^{′}=*x* and *y*^{′}=*y*. Assume now that (*x*,*y*) is not an equilibrium pair, i.e., either there exists \overline{x} such that \overline{x}A{y}^{\top}>xA{y}^{\top}, or there exists \overline{y} such that xB{\overline{y}}^{\top}>xB{y}^{\top}. Assuming the first case, as \overline{x}A{y}^{\top} is a weighted average of the {A}_{i\cdot }{y}^{\top}, there must exist *i* for which {A}_{i\cdot }{y}^{\top}>xA{y}^{\top}, and hence some *c*_{i}>0, so that \sum _{k}{c}_{k}>0. As xA{y}^{\top} is itself a weighted average of the {A}_{i\cdot }{y}^{\top} with weights *x*_{i}, there must exist some *i* with *x*_{i}>0 such that {A}_{i\cdot }{y}^{\top}\le xA{y}^{\top}. For this *i*, *c*_{i}=0, hence:

\begin{array}{l}{x}_{i}^{\prime}=\frac{{x}_{i}+{c}_{i}}{1+{\sum}_{k}{c}_{k}}<{x}_{i},\end{array}

(10)

and so *x*^{′}≠*x*. In the second case, the same argument shows that *y*^{′}≠*y*. We conclude that (*x*^{′},*y*^{′})=(*x*,*y*) if and only if (*x*,*y*) is an equilibrium. As *T* is a continuous map of the compact, convex set of mixed-strategy pairs into itself, Brouwer’s fixed point theorem [25] guarantees that it has a fixed point, and by the above this fixed point is an equilibrium point. This concludes the proof of the existence of mixed-strategy equilibrium points in a bimatrix game.
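The map defined by (5) to (8) can be illustrated with a short Python sketch. The function name `nash_map` and the toy game (matching pennies) are assumptions made for illustration only and are not one of the subgames studied in this work. The sketch applies the map once and shows that an equilibrium pair is left unchanged while a non-equilibrium pair is moved.

```python
def nash_map(A, B, x, y):
    """One application of the map T(x, y) = (x', y') from (5)-(8).

    A, B are payoff matrices given as lists of rows; x, y are mixed strategies.
    """
    m, n = len(x), len(y)
    # Expected payoff of each pure row strategy against y: A_{i.} y^T
    row_pay = [sum(A[i][j] * y[j] for j in range(n)) for i in range(m)]
    # Expected payoff of each pure column strategy against x: x B_{.j}
    col_pay = [sum(x[i] * B[i][j] for i in range(m)) for j in range(n)]
    # Current expected payoffs x A y^T and x B y^T
    val_a = sum(x[i] * row_pay[i] for i in range(m))
    val_b = sum(col_pay[j] * y[j] for j in range(n))
    c = [max(r - val_a, 0.0) for r in row_pay]   # (5)
    d = [max(q - val_b, 0.0) for q in col_pay]   # (6)
    x_new = [(x[i] + c[i]) / (1.0 + sum(c)) for i in range(m)]  # (7)
    y_new = [(y[j] + d[j]) / (1.0 + sum(d)) for j in range(n)]  # (8)
    return x_new, y_new


# Matching pennies: an assumed toy game used only to exercise the map.
A = [[1.0, -1.0], [-1.0, 1.0]]
B = [[-1.0, 1.0], [1.0, -1.0]]

fixed = nash_map(A, B, [0.5, 0.5], [0.5, 0.5])   # equilibrium pair
moved = nash_map(A, B, [1.0, 0.0], [0.0, 1.0])   # row player can improve
```

In the second call the row player's first pure strategy has *c*_{1}=0 and positive probability, so its probability strictly decreases, as in (10). Note that the proof only uses the fixed-point property; iterating the map is not claimed to converge to an equilibrium.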

However, efficient computation of equilibrium points, as well as proving uniqueness of an equilibrium, remains an open question for many classes of games. Lemke-Howson (LH) [26] is the best-known algorithm for computing Nash equilibria of bimatrix games and is our algorithm of choice for finding the Nash equilibrium strategies. A bimatrix representation requires the game to be fully defined by two payoff matrices (one for each player). Since in our case the immediate payoff of every player in each step depends not only on his own action and the action of the opponent but also on the previous state of the player (through the hopping cost), our game as a whole cannot be represented by two deterministic payoff matrices. For this reason, we divide the game into {n}_{f}\cdot {n}_{{C}^{T}} subgames, where each subgame corresponds to a unique combination of possible states of the transmitter and the jammer. Since each subgame can be treated as a separate game in bimatrix form, we apply the LH method to find one mixed-strategy Nash equilibrium per subgame. Hence, in each step, every player plays the equilibrium strategy of the subgame corresponding to that step. The union of the equilibrium strategies of all the {n}_{f}\cdot {n}_{{C}^{T}} combinations of states within the game may be considered as the Nash equilibrium of the game.

Gambit [27], an open-source collection of tools for solving computational problems in game theory, was used for finding equilibrium points using the LH method. For details on the implementation of the LH algorithm, the interested reader is referred to [28].

Each of the subgames (*A*_{ij},*B*_{ij}), where *i*=1…*n*_{f} and j=1\dots {n}_{{C}^{T}}, is a nondegenerate bimatrix game. Then, following Shapley’s proof from [26], we may conclude that there exists an odd number of equilibria for each subgame. In [29], the upper bound on the number of equilibria in *d*×*d* bimatrix games was shown to be \frac{{2.41}^{d}}{{d}^{1/2}}; however, uniqueness of the Nash equilibrium can be proven only for several special classes of bimatrix games. Here, we provide conditions that a bimatrix game has to satisfy in order to have a unique completely mixed Nash equilibrium. A completely mixed Nash equilibrium is an equilibrium in which the support of each mixed equilibrium strategy contains all available pure strategies (i.e., every pure strategy is played with non-zero probability). As shown in [30], whose proof we restate, a bimatrix game (*A*,*B*) with square matrices *A* and *B* has a unique completely mixed Nash equilibrium if det(*A*,**e**)≠0 and det(*B*,**e**)≠0, i.e.:

\begin{array}{l}\text{det}(A,\mathbf{e})\cdot \text{det}(B,\mathbf{e})\ne 0,\end{array}

(11)

where **e** is a column vector with all entries 1.

The saddle point matrix (*A*,**e**) is given by:

\begin{array}{l}(A,\mathbf{e})=\left[\begin{array}{cc}A& \mathbf{e}\\ {\mathbf{e}}^{\top}& 0\end{array}\right].\end{array}

(12)

Then, the equilibrium strategies of the players are given as:

\begin{array}{l}{x}^{\ast}{}_{i}=-\frac{\text{det}{B}^{i}}{\text{det}(B,\mathbf{e})},\end{array}

(13)

\begin{array}{l}{y}^{\ast}{}_{i}=-\frac{\text{det}{A}_{i}}{\text{det}(A,\mathbf{e})},\end{array}

(14)

where *B*^{i} is the matrix *B* with all entries of the *i*th row replaced by 1, and *A*_{i} is the matrix *A* with all entries of the *i*th column replaced by 1.

Let us now suppose that (*x*^{∗},*y*^{∗}) is an equilibrium point of the bimatrix game (*A*,*B*), where *x*^{∗} is completely mixed. Then, every pure strategy of the transmitter yields the same payoff *P* against the jammer’s strategy *y*^{∗} (written here as a column vector), i.e.:

\begin{array}{l}A{y}^{\ast}=P\mathbf{e}.\end{array}

(15)

Since *y*^{∗} is a vector of probabilities,

\begin{array}{l}{\mathbf{e}}^{\top}{y}^{\ast}=1.\end{array}

(16)

Or, in matrix form:

\begin{array}{l}\left[\begin{array}{cc}A& \mathbf{e}\\ {\mathbf{e}}^{\top}& 0\end{array}\right]\left[\begin{array}{c}{y}^{\ast}\\ -P\end{array}\right]=\left[\begin{array}{c}0\\ 1\end{array}\right].\end{array}

(17)

Following the assumption det(*A*,**e**)≠0 and by applying Cramer’s rule, it follows from (17) that (14) is true for *i*=1,2,…,*n* (in our case, n={n}_{{C}^{T}}\cdot {n}_{f}). Similarly, the same holds for *x*^{∗}_{i}. As shown in [30]:

\begin{array}{l}\text{det}({A}_{i},\mathbf{e})=\text{det}({A}_{i}-\mathbf{e}{\mathbf{e}}^{\top})-\text{det}\left({A}_{i}\right)=-\text{det}{A}_{i},\end{array}

(18)

hence (13) and (14) are shown to be true. This concludes the proof of the uniqueness of the completely mixed equilibrium.
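The closed-form expressions (13) and (14) can be checked numerically on a small example. The sketch below is an illustration under stated assumptions: the helper names and the 2×2 payoff matrices are hypothetical, *A*_{i} is taken as *A* with its *i*th column replaced by ones and *B*^{i} as *B* with its *i*th row replaced by ones (the convention that follows from applying Cramer's rule to (17) with payoffs xA{y}^{\top} and xB{y}^{\top}), and exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction


def det(M):
    """Determinant by Laplace expansion along the first row (small matrices only)."""
    if len(M) == 1:
        return M[0][0]
    total = 0
    for j, a in enumerate(M[0]):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * a * det(minor)
    return total


def saddle_det(M):
    """det(M, e): M bordered by a column and a row of ones, as in (12)."""
    n = len(M)
    bordered = [list(row) + [1] for row in M] + [[1] * n + [0]]
    return det(bordered)


def completely_mixed_equilibrium(A, B):
    """Evaluate (13) and (14) for square A and B; requires condition (11)."""
    n = len(A)
    da, db = saddle_det(A), saddle_det(B)
    if da == 0 or db == 0:
        raise ValueError("condition (11) is violated")
    x, y = [], []
    for i in range(n):
        # B^i: B with its i-th row replaced by ones.
        B_i = [[1] * n if r == i else list(B[r]) for r in range(n)]
        # A_i: A with its i-th column replaced by ones.
        A_i = [[1 if c == i else A[r][c] for c in range(n)] for r in range(n)]
        x.append(Fraction(-det(B_i), db))
        y.append(Fraction(-det(A_i), da))
    return x, y


A = [[3, 0], [1, 2]]   # hypothetical transmitter payoffs
B = [[0, 3], [2, 1]]   # hypothetical jammer payoffs
x_star, y_star = completely_mixed_equilibrium(A, B)
```

For these assumed matrices, every row of *A* earns the same payoff against `y_star` and every column of *B* earns the same payoff against `x_star`, which is the indifference property (15). The sketch does not check that all entries are positive; the formulas describe an equilibrium only when they are.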

It can be verified numerically that all of the {n}_{f}\cdot {n}_{{C}^{T}} subgames constructed within the considered game satisfy (11). Furthermore, by observing the Markov state chains corresponding to the equilibrium points found by the LH method, it may indeed be observed that |\text{supp}\left({x}^{\ast}\right)|=|\text{supp}\left({y}^{\ast}\right)|={n}_{f}\cdot {n}_{{C}^{T}}, i.e., the equilibria are completely mixed. Attempts to find multiple equilibria for each subgame using the other computational methods available within [27] also returned a single (completely mixed) equilibrium for each subgame. This empirical evaluation, based on algorithms that search for all equilibrium points of a bimatrix game, further points to the existence of a unique Nash equilibrium for each subgame.

One of the common criticisms of using computational algorithms such as LH for finding Nash equilibria is that they fail to realistically capture the way that the players involved in the game may reach the equilibrium point. For this reason, it is useful to discuss the payoff performance and the convergence properties to Nash equilibrium of the algorithms realistically used for learning in games. This discussion is done for two multi-agent learning algorithms considered within this work: fictitious play (Section 3.2.1) and payoff-based adaptive play (Section 3.2.2).

### 3.2 Learning algorithms

Learning algorithms for MDPs have been extensively studied in the past [31, 32]. An illustrative example of the learning algorithms suited to the considered game, arranged by the spectrum occupancy inference capabilities of the CRs and the dimensionality of the action space, is given in Figure 1.

For CRs not equipped with spectrum sensing capabilities (geolocation/database-driven CRs and CRs utilizing beacon rays), payoff-based reinforcement algorithms are the natural viable choice of learning algorithm. In these cases, each player is able to evaluate the payoff received in every step and modify its strategy accordingly.

CRs able to perform energy detection spectrum sensing, in addition, also have the possibility of observing their opponents’ actions in each step (influenced possibly by the accuracy of the deployed spectrum sensing mechanism). By incorporating these observations into their future decision-making process, the players may build and update a belief regarding the opponents’ strategy distribution. This learning mechanism is called fictitious play.

Finally, CRs able to perform feature detection spectrum sensing may recognize important parameters of the opponent’s signal and use these observations to their advantage. Since various waveforms exhibit different jamming and anti-jamming properties, depending mainly on their modulation and employed coding (see, for example, [33]), increased action space could consist of switching between multiple modulation types or coding techniques.

In this paper, we focus our analysis on the first two cases. Algorithm 1 illustrates the general formulation of the game. In every step, each player takes a decision *d*_{s} for his next action based on his expected utility \overline{{P}_{s}}=E\left[{P}_{s}\mid {P}_{1:s-1}\right] under payoff-based adaptive play (PBAP) or \overline{{P}_{s}}=E\left[{P}_{s}\mid {P}_{1:s-1},{\mathit{ss}}_{1:s-1}\right] under fictitious play, where \mathit{ss} denotes the spectrum sensing observations. Received payoffs *P*_{s} are calculated for each player using (1) and (2). Thereafter, spectrum sensing is performed and the expected payoff is updated with the new information available. To simplify the explanation of the learning strategies and Algorithm 1, it is assumed that both players perform the spectrum sensing step; however, the result of this step is used only under the fictitious play framework. For players with perfect spectrum sensing capabilities, {\mathit{ss}}_{s}^{T}={d}_{s}^{T} and {\mathit{ss}}_{s}^{J}={d}_{s}^{J}.

Note from the pseudocode that the game consists of two main parts: the learning algorithm, in charge of updating the expected payoffs, and the decisioning policy, which uses the available observations to decide upon the future actions.

Let us assume that in step *s* the transmitter was transmitting with power {C}_{s}^{T} on the frequency {f}_{s}^{T}. Using one of the decisioning policies described in Section 3.3, its action in the next step consists of transmitting with power {C}_{s+1}^{T} on frequency {f}_{s+1}^{T}. We denote this action as a list of four elements {d}_{s}^{T}=\left[{C}_{s}^{T},{f}_{s}^{T},{C}_{s+1}^{T},{f}_{s+1}^{T}\right] for the transmitter and the equivalent values {d}_{s}^{J}=\left[{C}_{s}^{J},{f}_{s}^{J},{C}_{s+1}^{J},{f}_{s+1}^{J}\right] for the jammer.

#### 3.2.1 Fictitious play

Fictitious play [34] is an iterative learning algorithm where, at every step, each player updates his belief about the probability distributions of the strategies of the other players in the game. Applying a learning mechanism based on fictitious play to the modelled game requires that the player be endowed with spectrum sensing capabilities, allowing him to infer the actions of the other player. The payoff of a particular action given the player’s current state and the opponent’s action is deterministic and may be calculated using (1) and (2) for the transmitter and the jammer, respectively. If the player knows the opponent’s action in each step, he can calculate the expected utility more precisely by accessing the history of the opponent’s actions. This is particularly true for the jammer because of the higher number of non-jammed states compared to the states of successful jamming. Hence, learning the transmitter’s pattern as early and as precisely as possible makes a significant difference to the overall payoff. This updating process is described in Algorithm 2.
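The belief update at the core of fictitious play can be sketched as follows. This is not a transcription of Algorithm 2: the class name, its methods, and the two-channel toy payoff are assumptions made for illustration, and the dependence of the payoff on the player's own previous state (the hopping cost) is left out for brevity. The belief is the empirical frequency of the opponent's sensed actions, and the expected utility of each own action is the payoff averaged over that belief.

```python
from collections import Counter


class FictitiousPlayBelief:
    """Empirical belief about the opponent's action distribution."""

    def __init__(self, opponent_actions):
        self.opponent_actions = list(opponent_actions)
        self.counts = Counter()

    def observe(self, opponent_action):
        """Record the opponent action inferred by spectrum sensing."""
        self.counts[opponent_action] += 1

    def belief(self):
        """Relative frequencies; uniform before any observation (assumption)."""
        total = sum(self.counts.values())
        k = len(self.opponent_actions)
        if total == 0:
            return {a: 1.0 / k for a in self.opponent_actions}
        return {a: self.counts[a] / total for a in self.opponent_actions}

    def expected_payoffs(self, own_actions, payoff):
        """Expected payoff of each own action under the current belief.

        `payoff(own, opp)` is the deterministic step payoff, e.g. (1) or (2).
        """
        b = self.belief()
        return {own: sum(b[opp] * payoff(own, opp) for opp in self.opponent_actions)
                for own in own_actions}


# Toy jammer with two channels: +1 if it hits the transmitter's channel, else -1.
channels = [0, 1]
jam_payoff = lambda own, opp: 1.0 if own == opp else -1.0
fp = FictitiousPlayBelief(channels)
for sensed in [0, 0, 1, 0]:       # hypothetical sensing results
    fp.observe(sensed)
belief = fp.belief()
expected = fp.expected_payoffs(channels, jam_payoff)
best = max(expected, key=expected.get)
```

After four hypothetical observations the jammer believes the transmitter uses channel 0 three times out of four, so jamming channel 0 has the higher expected payoff.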

It is known that the convergence of fictitious play to a Nash equilibrium is guaranteed only for several special cases, such as zero-sum games, non-degenerate 2×*n* games with generic payoffs, games solvable by iterated strict dominance, and weighted potential games. For other types of games, including the game considered within this work, convergence to a Nash equilibrium is not guaranteed, and even when the algorithm converges, the running time may be very long because computing a Nash equilibrium is complete for the class PPAD (polynomial parity arguments on directed graphs) [35]. This has led to the introduction of the concept of approximate Nash equilibrium (*ε*-equilibrium). Here, *ε* is a small positive quantity representing the maximum increase in payoff that a player could gain by choosing to follow a different strategy.
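To make the notion of *ε*-equilibrium concrete, the following sketch computes, for a given pair of mixed strategies in a bimatrix game, the largest payoff gain that either player could obtain by deviating unilaterally. The function name and the toy game are illustrative assumptions. The pair is an *ε*-equilibrium for any *ε* at least as large as the returned value.

```python
def epsilon_of(A, B, x, y):
    """Largest payoff gain available to either player by a unilateral deviation.

    A returned value of 0 means (x, y) is an exact Nash equilibrium.
    """
    m, n = len(x), len(y)
    row_pay = [sum(A[i][j] * y[j] for j in range(n)) for i in range(m)]
    col_pay = [sum(x[i] * B[i][j] for i in range(m)) for j in range(n)]
    val_a = sum(x[i] * row_pay[i] for i in range(m))
    val_b = sum(col_pay[j] * y[j] for j in range(n))
    # A best deviation can always be taken to be a pure strategy.
    return max(max(row_pay) - val_a, max(col_pay) - val_b)


# Matching pennies, used here only as an assumed toy example.
A = [[1.0, -1.0], [-1.0, 1.0]]
B = [[-1.0, 1.0], [1.0, -1.0]]
eps_uniform = epsilon_of(A, B, [0.5, 0.5], [0.5, 0.5])
eps_pure = epsilon_of(A, B, [1.0, 0.0], [0.0, 1.0])
```

In the second call the row player could switch to its other pure strategy and raise its payoff from -1 to 1, so that pair is only a 2-equilibrium.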

The author of [36] has shown that fictitious play (FP) achieves a worst-case guarantee of *ε*=(*r*+1)/(2*r*) (where *r* is the number of FP iterations) and in practice provides even better approximations. Furthermore, as recently shown in [37], fictitious play may in some cases outperform any actual Nash equilibrium. For this reason, it is useful to study the performance of the FP algorithm in terms of the average and final payoff compared to the Nash equilibrium.

#### 3.2.2 Payoff-based adaptive play

Payoff-based adaptive play [38] is a form of reinforcement learning algorithm, where it is assumed that the player does not have access to the information about the state of the other player and relies on the history of his own previous payoffs. The expected utility of *d*_{s} given previous payoffs is given by (19).

\begin{array}{l}\overline{{P}_{s+1}\left({d}_{s}\right)}=E\left[{P}_{s}\left({d}_{s}\right)\mid {P}_{1:s-1}\left({d}_{s}\right)\right]=\frac{\overline{{P}_{s}\left({d}_{s}\right)}\cdot s+{P}_{s}\left({d}_{s}\right)}{s+1}\end{array}

(19)
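The update in (19) is an incremental average of the payoffs observed for one action. A minimal Python sketch is given below; the function name, the payoff values, and the choice to start the step counter at zero (so that the first observation fully replaces the initial estimate) are assumptions made for illustration.

```python
def update_expected_payoff(prev_estimate: float, s: int, payoff: float) -> float:
    """Update rule (19): weight the previous estimate by s and add the new payoff."""
    return (prev_estimate * s + payoff) / (s + 1)


# Hypothetical payoff stream for one fixed action d.
estimate = 0.0
for s, payoff in enumerate([4.0, 8.0, 6.0]):
    estimate = update_expected_payoff(estimate, s, payoff)
```

With this indexing the estimate after three observations equals their arithmetic mean, without storing the full payoff history.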

PBAP has been shown to converge to Nash equilibrium for zero-sum games [39]. For general finite two-player games, it was shown to converge to close-to-optimal solutions in polynomial time [40].

In addition to comparing the performance of the PBAP to the computed Nash equilibrium strategy from Section 3.1, of particular interest to this work is the comparison to the performance of the FP. This comparison should reflect the benefit that each player gains by being equipped with the spectrum sensing algorithm (FP) over not being equipped with it (PBAP).

### 3.3 Decisioning policies

A decisioning policy of the learning algorithm corresponds to the set of rules that the player uses to select his future actions.

#### 3.3.1 Greedy decisioning policy

The most intuitive decisioning policy consists of always choosing the action that is expected to yield the highest possible value based on the current estimates - the so-called greedy decisioning policy [41]. However, a greedy method is overly biased and may easily lead the learning algorithm to ‘get stuck’ in locally optimal solutions. An example of this is given in Figure 2, where both players are employing the greedy decisioning policy. Here, each player fairly quickly learns the ‘best response’ to an opponent’s action and starts relying on using it. Then, a significant amount of time has to pass before his expected payoff for the given action drops enough for another action to be considered the ‘best response’; in the meantime, significant payoff losses are sustained. This could partially be mitigated by introducing temporal forgiveness into the learning algorithm.

#### 3.3.2 Stochastically sampled decisioning policy

Another common approach to this issue is choosing a stochastically sampled policy (also known as *ε*-greedy policy [42]) where, at each step, a randomly sampled action is taken with a probability *p*. We propose a variation of the stochastically sampled policy where sampling is performed by offsetting the expected payoff of each action by the minimum possible payoff of the game and normalizing. For a minimum payoff *PMIN* and *n* actions with expected payoffs \overline{P\left(1\right)}\dots \overline{P\left(n\right)}, the probability of choosing an action *d* is given by (20):

p\left(d\right)=\frac{\overline{P\left(d\right)}-\mathit{PMIN}}{{\sum}_{k=1}^{n}\left(\overline{P\left(k\right)}-\mathit{PMIN}\right)}

(20)
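A minimal Python sketch of the sampling rule (20) follows. It reads the denominator as the sum of the shifted payoffs over all actions, so that the probabilities add up to one. The function names, the payoff values, and the uniform fallback used when every estimate equals the minimum payoff are assumptions made for illustration.

```python
import random


def sampling_probabilities(expected, p_min):
    """Probabilities (20): shift each expected payoff by the minimum payoff, then normalize."""
    shifted = [p - p_min for p in expected]
    total = sum(shifted)
    if total <= 0:
        # All estimates sit at the minimum payoff: fall back to uniform (assumption).
        return [1.0 / len(expected)] * len(expected)
    return [v / total for v in shifted]


def sample_action(expected, p_min, rng=random):
    """Draw an action index according to (20)."""
    probs = sampling_probabilities(expected, p_min)
    return rng.choices(range(len(expected)), weights=probs, k=1)[0]


# Hypothetical expected payoffs of three actions, with minimum payoff -4.
probs = sampling_probabilities([2.0, -1.0, 5.0], p_min=-4.0)
action = sample_action([2.0, -1.0, 5.0], p_min=-4.0)
```

Unlike the greedy policy, every action whose expected payoff exceeds the minimum keeps a non-zero chance of being selected, with better-performing actions sampled more often.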