Model-based optimal action selection for Dyna-Q reverberation suppression cognitive sonar
EURASIP Journal on Advances in Signal Processing volume 2023, Article number: 116 (2023)
Abstract
The Doppler shift of low-speed targets is frequently masked by reverberation Doppler spread clutter in shallow seas. The clutter is generated by underwater scatterers and makes Doppler estimation more difficult. To solve this problem, this paper proposes a reverberation target resolution function based on a statistical model of the Doppler spread clutter. Using the width of the reverberation Doppler clutter, the function determines whether the target is discriminable and adjusts the waveform parameters accordingly. In addition, the reverberation Doppler spread clutter varies in time and space and is affected by grazing angle, waves, wind speed, fish and other factors, so the sonar waveform parameters need to be adjusted constantly. Therefore, this paper combines cognitive sonar based on reinforcement learning with the reverberation target resolution function to evaluate different waveforms in different environments. Consequently, the sonar can adjust the waveform parameters in real time and obtain the optimal waveform in different environments. Meanwhile, the action selection strategy of Dyna-Q reinforcement learning is optimized, and the model-based maximum action selection Dyna-Q algorithm (Dyna-Q-MaxAction) is proposed. Compared with the traditional Dyna-Q and Q-learning algorithms, the proposed algorithm needs fewer episodes to converge. Finally, numerical simulation verifies the effectiveness of the proposed algorithm.
1 Introduction
For active sonar, detecting small low-speed targets is a challenging engineering problem. In coastal and harbor areas, a large number of reefs, artificial facilities, fish and other strong scatterers, coupled with the movement of the sea surface and platforms and with multipath effects, make the reverberation complex and variable. Consequently, severe Doppler spread may occur, and the low-speed target is masked by high-energy clutter. Because the clutter is time-varying, suppressing the Doppler spread clutter in real time is the key to reverberation suppression.
Traditional methods of reverberation suppression include array design, waveform design and post-processing algorithm design. Typically, the essence of array design is to design a narrow beamwidth [1], which reduces the spatial size of the resolution cell and thereby increases the signal-to-reverberation ratio (SRR). The essence of the waveform design method is to increase the product of pulse width and bandwidth and improve the matched filter gain [2]. Post-processing algorithms include pre-whitening [3], spatial processing [4], the principal component inverse (PCI) subspace method [5], and graphic methods for suppressing reverberation [6]. Pre-whitening converts reverberation from colored noise into white noise, which improves the output signal-to-noise ratio (SNR) of the matched filter and increases the detection capability of active sonar. The PCI subspace method decomposes the echo into a reverberation subspace and a target-echo-plus-white-noise subspace. Since the energy of the reverberation subspace is stronger than that of the target-plus-white-noise components, a threshold is set and the stronger components are subtracted to suppress reverberation. The graphic approach improves the contrast between the reverberation and the target in the active sonar image, thus improving the detection capability. However, the above methods are unable to adjust the detection waveform according to the time-varying reverberation environment. Furthermore, they only adapt the receiver side, without jointly involving the transmitter side in reverberation suppression, so active sonar computational resources are wasted and the parameter estimation efficiency is reduced.
In 2006, Haykin [7] proposed the concept of cognitive radar based on the echolocation system of bats. Cognitive radar has three elements: (1) the radar learns from the environment; (2) the radar transmitter interacts with the receiver; (3) the radar preserves echo information. The above-mentioned adaptive radar methods make adaptive adjustments only at the receiver. In contrast to adaptive radar, cognitive radar can adjust the transmit waveform parameters jointly across the transmitter and receiver over a long period of time.
The underwater acoustic channel is a severe time-Doppler doubly spread channel, which increases the demand for freedom in active sonar waveform design. In 2011, L. Xiaohua combined environmental information with prior knowledge of the target and, with reference to cognitive radar, proposed cognitive sonar, which adjusts the transmit waveform parameters according to the echo [8]. In 2015, Tim Claussen [9] used Doppler processing and real-time interpolation to adjust the cognitive sonar transmit beamformer. In 2016, X. Qing combined bionics and dolphin research [10] to increase the freedom of cognitive sonar waveform design and proposed the cognitive sonar waveform (CSW). CSW combines the ambiguity function (AF) and the Q function to constrain waveform parameters such as pulse width, frequency and the number of pulse trains in order to suppress reverberation. However, such cognitive sonar cannot obtain the optimal waveform rapidly: it takes a long time and does not adapt to a rapidly time-varying environment.
Recently, with increased computing power and easier acquisition of training data, machine learning has shown impressive performance in many areas, such as target recognition, acoustic countermeasures and interference suppression. Moreover, finding the optimal solution to the above problems is a non-deterministic polynomial (NP) hard problem, and suboptimal methods such as reinforcement learning (RL) can provide solutions to such problems. RL learns by interacting with the environment through rewards and punishments, continuously adjusting its actions towards a higher reward. RL is widely used. In the game of Go [11], it acquires decision-making capabilities by learning from previous game actions and applying them to subsequent actions. In video games [12], RL can optimize Super Mario's actions based on the environment to find the optimal path to the goal quickly. In 2018, Jason E. [13] proposed combining RL with cognitive sonar, which can reduce the time needed to acquire the optimal waveform. In 2022, Jeff Tucker [14] combined RL with cognitive sonar for multi-target detection and tracking, adjusting the sonar waveform by learning from the environment, which greatly improved the efficiency of target detection and tracking. Cognitive sonar based on RL therefore has important implications and ample room for research.
RL is formulated as a Markov decision process (MDP) consisting of the action, reward signal, transfer probability, model of the environment, etc. In cognitive sonar based on RL, these correspond to the waveform, reward signal, transfer probability, echo information, etc.
The reward signal is defined by the goal of RL and represents the value of a waveform selection. The underwater environment is complex and a fixed waveform cannot adapt to the time-varying environment, so the waveforms and environment information are trained in RL to obtain the optimal waveform. The choice of the reward function is important, and an unreasonable choice may lead to algorithm failure. In this paper, a target resolution function is proposed, which can directly detect whether the reverberation Doppler spread clutter contains a target. Since a Doppler-sensitive waveform can suppress the reverberation [15], if the target cannot be detected, the cognitive sonar changes the waveform parameters to suppress the reverberation and distinguish the target, as shown in Fig. 1.
The number of episodes RL needs to converge directly influences the real-time update efficiency of the active sonar. Traditional RL algorithms include the Q-learning algorithm and the state-action-reward-state-action (SARSA) algorithm. Both are temporal-difference (TD) single-step update algorithms, which converge slowly. To make full use of existing knowledge, Sutton introduced the planning process into RL and proposed the model-based Dyna RL algorithm [16]. This type of algorithm stores experience by building a model of the environment and generates simulated experience to train the learner offline. The action selection of the model samples directly determines the efficiency of the algorithm. The introduction of the planning process gives RL a ‘cognitive’ ability, moving beyond simple trial-and-error learning and greatly improving the convergence efficiency of the algorithm.
In the Dyna framework, the action selection in planning directly affects the number of episodes needed for convergence. Sutton initially adopted the random-sample method [16], in which many iterations of the value function during planning are ineffective. Therefore, this paper proposes a maximum reward action selection strategy for Dyna-Q (Dyna-Q-MaxAction). The number of convergence episodes is reduced by lowering the probability of invalid random-sample action selection, so the algorithm needs fewer episodes to converge than the Dyna-Q algorithm. Cognitive sonar based on RL combines the ability of RL to solve complex problems with the ability of cognitive sonar to interact with the environment in real time. In theory, it can solve the problem that active sonar has difficulty obtaining the optimal waveform in complex environments.
This paper is structured as follows. Section 2 describes the statistical model of reverberation Doppler spread clutter and validates it with real data from Qiandao Lake. It then analyzes how the waveform parameters affect the reverberation Doppler spread clutter and proposes the target resolution function in reverberation Doppler spread clutter. The Dyna-Q-MaxAction algorithm is obtained by optimizing the action selection strategy of the Dyna-Q algorithm, and its principle and algorithm flow are described in detail. Section 3 discusses numerical simulation results for the Dyna-Q-MaxAction cognitive sonar combined with the reverberation target resolution function. It also analyzes the influence of the number of model training episodes and of the action selection probability on the Dyna-Q-MaxAction algorithm, and identifies a reasonable action selection probability. Conclusions are given in Sect. 4.
2 Methods and experiments
2.1 Reverberation modeling
Establishing a statistical model of reverberation is important for reverberation analysis and simulation. Etter and other researchers identified two difficulties in statistical reverberation modeling: the lack of analytical tools for solving complex boundary problems, and the difficulty of measuring and separating the many factors that influence reverberation [17,18,19]. Depending on the relationship between the size of the scatterer and the wavelength of the acoustic wave, reverberation modeling can be divided into the point scattering model and the unitary scattering model. In this paper, the more realistic point scattering model is used.
The point scattering model assumes that scatterers are randomly distributed in the ocean and that the reverberation is the superposition of the backscattered echoes of all scatterers. The point scattering model has a clear physical meaning and can directly assume the statistical properties of the scattered echo amplitude, phase, and Doppler shift. Because multiply scattered contributions have small amplitude, multiple scattering is ignored. Under the assumption of a narrowband waveform, the reverberation can be described as
$$\begin{aligned} Rev(t)=\sum _{i=1}^{N(t)} A_i s\left( t-\tau _i\right) \textrm{e}^{\textrm{j} 2 \pi \phi _i t} \end{aligned}$$(1)
where N(t) is the number of scatterers contributing to the reverberation at time t; \(A_i\), \(\tau _i\) and \(\phi _i\) denote the amplitude, time delay, and Doppler shift of the ith scatterer echo, respectively; and s(t) denotes the waveform. The statistical distribution of the echo amplitude \(A_i\) determines the statistical distribution of the reverberation envelope, and the Doppler shift \(\phi _i\) of each echo determines the Doppler shift of the reverberation, representing the severity of the Doppler spread. The amplitude \(A_i\) of each scatterer obeys a Gaussian distribution, and \(\tau _i\) and \(\phi _i\) are independent of each other. Reverberation satisfying these conditions corresponds to a wide-sense stationary uncorrelated scattering (WSSUS) channel.
According to reference [20], a reverberation Doppler spread clutter model is established. The probability density function (PDF) of the reverberation Doppler spread conforms to the two-sided exponential distribution, which is the model that best matches real sea test data. The two-sided exponential distribution is symmetric about \(v=0\), and its parameter can be used to characterize the severity of the Doppler spread.
where \(\mu\) represents the slope ss of the Doppler spread. A large number of sea test data show that ss ranges between 6 and 20 dB/kn. The above equation is also known as the Doppler spreading function. \(\mu\) can be solved from the definition of ss
then
C. Zhang proposed a pseudo-random number generation method for the numerical simulation of reverberation based on the two-sided exponential distribution model [21], and verified the model with the Qiandao Lake experiment data. Let y be a uniformly distributed random number in the interval (0, 1); this leads to
According to Eq. (5), a random number v obeying the statistical model with the specified Doppler spread can be generated. Using the point scattering model, a scattering model with 10,000 points is established. v is converted into a Doppler shift \(\phi _i\) using the formula \(\phi _i=f_0 \frac{2 v}{\textrm{c}}\) and substituted into the reverberation formula, Eq. (1), where the amplitude obeys a Gaussian distribution with mean 0 and variance 1 and the time delay obeys a uniform distribution. The slope is ss = 20 dB/kn (1 kn = 0.514 m/s), so the Doppler shift obeys the two-sided exponential distribution with \(\mu =2.30\). The reverberation length is 3 s, the continuous wave (CW) pulse width is 0.58 s, and the center frequency is \(f_0\) = 4000 Hz. The reverberation is computed 100 times using the Monte Carlo method, and the time-domain results of the reverberation correlation function are superimposed. The reverberation Doppler distribution curve is thus obtained, and the modeling results are as follows:
Figure 2a is modeled with Eq. (2). It shows that the Doppler spread distribution becomes more concentrated as ss increases, and the spreading becomes less apparent. The theoretical curve in Fig. 2b is modeled with the two-sided exponential distribution using Eq. (5). The theoretical curve is basically consistent with the experimental Qiandao Lake data, indicating that the two-sided exponential distribution model can be used for Doppler spread clutter modeling.
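The sampling procedure of this subsection can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' code. The relation \(\mu = ss \cdot \ln (10)/20\) is inferred from the (ss, \(\mu\)) pairs quoted in the text (6 → 0.69, 10 → 1.15, 20 → 2.30); the density is assumed to be \(p(v)=\frac{\mu }{2}\textrm{e}^{-\mu |v|}\) with v in knots; c = 1500 m/s is an assumed sound speed; and all function names are hypothetical.

```python
import math
import random


def mu_from_ss(ss_db_per_kn):
    """Assumed relation ss = 20*log10(e**mu), i.e. mu = ss*ln(10)/20."""
    return ss_db_per_kn * math.log(10.0) / 20.0


def sample_speed(mu, rng):
    """Inverse-CDF draw from the assumed density p(v) = (mu/2)*exp(-mu*|v|)."""
    y = rng.random()
    while y == 0.0:          # avoid log(0); random() lies in [0, 1)
        y = rng.random()
    if y < 0.5:
        return math.log(2.0 * y) / mu            # negative branch
    return -math.log(2.0 * (1.0 - y)) / mu       # non-negative branch


def doppler_shift_hz(v_kn, f0=4000.0, c=1500.0, kn_to_ms=0.514):
    """phi = f0 * 2v / c, with v converted from knots to m/s (c is assumed)."""
    return f0 * 2.0 * (v_kn * kn_to_ms) / c


rng = random.Random(0)
mu = mu_from_ss(20.0)                              # about 2.30
speeds = [sample_speed(mu, rng) for _ in range(10000)]
shifts = [doppler_shift_hz(v) for v in speeds]     # one shift per scatterer
```

The 10,000 draws mirror the number of scatterers used in the text; each shift would then enter the point scattering sum as the Doppler term of one scatterer.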
2.2 Waveform and reverberation Doppler distribution
Different waveform parameters change the reverberation fed back by the environment. Correlation is widely used in reverberation analysis, and the Doppler distribution of the reverberation is obtained by summing the correlation over the time delay. The following subsections focus on the relationship between the waveform and the reverberation Doppler distribution.
2.2.1 AF of waveform
The most important tool for measuring the detection capability of a waveform is the AF. The AF is obtained by matched filtering the transmit waveform in the time-Doppler domain [22] as
The above equation is the AF of the transmit waveform s(t), where \(\tau\) is the time delay and \(\phi\) is the Doppler shift. From the AF of the waveform, the measurement accuracy, error ellipse, and intrinsic resolution constant of the waveform can be found. Within the 3 dB range of \(\chi (0,0)\), there is a time-Doppler resolvability range for the waveform. The Doppler resolution of the waveform can be calculated
In this paper, unless otherwise stated, the square-envelope CW is \(s(t)=T^{-1/2}\textrm{rect}(t/T)\textrm{e}^{\textrm{j}2\pi f_{0} t}\). Using the identity \(\int _{-T / 2}^{T / 2} \textrm{e}^{\textrm{j} 2 \pi f t} \text {d} t=T {\text {sinc}}(\pi f T)\), the AF of the CW is
The Doppler resolution of the CW can be calculated as \(\phi _{\textrm{Hz}}= \pm 0.44 / T\) (Hz). Using \(\phi _{\textrm{wave}}=\frac{\phi _{\textrm{Hz}} \textrm{c}}{2 f_0}\), the Doppler resolution in terms of speed is \(\phi _{\textrm{wave}}=\frac{0.44 \textrm{c}}{2 T f_0}\) (m/s), where c is the speed of sound. The expression for \(\phi _{\text{ wave } }\) shows that the CW AF is mainly affected jointly by the pulse width T and the frequency \(f_0\). In this paper, the CW center frequency is fixed at \(f_0=\)4000 Hz, and only the effect of pulse width on Doppler resolution is considered.
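As a quick numerical check of the constant 0.44, the following sketch solves \({\text {sinc}}(\pi f T)=0.707\) for the product fT and then evaluates \(\phi _{\textrm{wave}}\) for two pulse widths. It assumes c = 1500 m/s, which the text does not specify, and the helper names are hypothetical.

```python
import math


def sinc(x):
    """sin(x)/x, the convention used in T*sinc(pi*f*T)."""
    return 1.0 if x == 0.0 else math.sin(x) / x


def half_power_product(level=1.0 / math.sqrt(2.0)):
    """Solve sinc(pi * f * T) = level for the product f*T by bisection."""
    lo, hi = 0.0, 1.0  # sinc(pi*x) falls monotonically from 1 to 0 on [0, 1]
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if sinc(math.pi * mid) > level:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def cw_doppler_resolution(T, f0=4000.0, c=1500.0):
    """phi_wave = 0.44*c / (2*T*f0) in m/s; c = 1500 m/s is an assumed value."""
    return 0.44 * c / (2.0 * T * f0)


product = half_power_product()            # f*T at the 0.707 level
res_short = cw_doppler_resolution(0.1)    # T = 0.1 s
res_long = cw_doppler_resolution(0.3)     # T = 0.3 s
```

Under these assumptions, tripling the pulse width from 0.1 s to 0.3 s reduces \(\phi _{\textrm{wave}}\) by a factor of three, which is the inverse dependence on T that the paragraph describes.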
2.2.2 Reverberation Doppler distribution
In general, the echo \(X_0(t)\) can be composed of target Tar(t), reverberation Rev(t), and noise Noi(t)
$$\begin{aligned} X_0(t)=Tar(t)+Rev(t)+Noi(t) \end{aligned}$$(9)
Noi(t) is the environment noise, and the target echo is \(Tar(t)=s\left( t-\tau _t\right) \textrm{e}^{\textrm{j} 2 \pi \phi _t t}\), where \(\tau _t\) and \(\phi _t\) are the target time delay and target Doppler shift, respectively. Each scatterer echo is a copy of the waveform with its own time delay, Doppler shift and intensity, so the above echo \(X_0(t)\) can be reduced to
\(A_t\) and \(Rev_j\) represent the target and reverberation amplitudes, \(\tau _t\) and \(\tau _j\) the target and reverberation time delays, and \(\phi _t\) and \(\phi _j\) the target and reverberation Doppler shifts. N represents the number of reverberation scatterers. Correlating the echo with the delayed and Doppler-shifted copy of the waveform gives
Reverberation is the environmental interference of active sonar. In acoustic engineering, the normalized reverberation channel is a typical wide-sense stationary (WSS) channel [23]. Under the assumption that the scatterers are independent of each other, the reverberation channel satisfies both the uncorrelated scattering (US) and WSS conditions. Thus, the reverberation characteristics can be expressed by the scattering function of the reverberation channel.
Let the time-varying impulse response of the reverberation channel be \(\textrm{g}\left( \tau ^{\prime }, t\right)\), where \(\tau ^{\prime }\) is the time delay at time t, and let the received reverberation be Rev(t). The reverberation can then be expressed as the convolution of the transmitted waveform and the impulse response
The Fourier transform \(P_s\left( \tau ^{\prime }, \phi ^{\prime }\right)\) of the time-varying impulse response \(\textrm{g}\left( \tau ^{\prime }, t\right)\) is called the spreading function.
Ignoring the constant term, substituting Eq. (13) into Eq. (12) yields
The above equation shows that the reverberation is related to the channel spreading function and the waveform in the time-Doppler domain. Thus, the reverberation can be expressed as delayed and Doppler-shifted copies of the waveform weighted by the channel spreading function \(P_s\left( \tau ^{\prime }, \phi ^{\prime }\right)\).
Correlating Rev(t) with the waveform s(t), we obtain
Taking the autocorrelation of Eq. (15) gives
where \(E\left[ P_s\left( \tau _1, \phi _1\right) P_s^*\left( \tau ^{\prime }, \phi ^{\prime }\right) \right] =R_{ps}(\tau ^{\prime }, \phi ^{\prime }) \delta (\tau ^{\prime }-\tau _1) \delta (\phi ^{\prime }-\phi _1)\), \(\delta\) represents the Dirac delta function, and \(R_{ps}(\tau , \phi )\) is the scattering function of the channel, which describes the time-Doppler distribution. Evaluating the autocorrelation at \(\tau , \phi\), Eq. (16) reduces to
The above equation shows that the autocorrelation of the matched filter output of Rev(t) is the two-dimensional convolution of the waveform AF and the reverberation channel scattering function. Under the WSSUS assumption, the autocorrelation function of the reverberation echo \(\sum _{j=1}^N {Rev}_j \chi _s\left( \tau _j, \phi -\phi _j\right)\) can therefore be reduced to the two-dimensional convolution of the channel scattering function and the waveform AF. Taking the autocorrelation of Eq. (11) gives
\(A_{\textrm{c}}\) represents the amplitude of the reverberation. Summing \({\text {Rr}}_{s X}(\tau , \phi )\) over the time delay gives the Doppler distribution curve \(\varphi (\phi )\) of the reverberation
The scattering effect of the channel is a fuzzy effect. According to the approximate derivation of the reverberation point scattering model [23], the reverberation channel scattering function can be reduced to
where K is a constant and \(\rho (\tau , \phi )\) is the joint distribution of the scatterers over time delay and Doppler shift. The equation indicates that the reverberation scattering function is determined by the joint PDF of \(\tau\) and \(\phi\) of the scatterers. If, in addition, the spatial distribution and the Doppler distribution of the scatterers are independent of each other, then we have
If the scatterers are uniformly distributed by distance, this equation can be further reduced to
where T is the pulse width of the waveform. \(K^{\prime }\) is a constant. The result of the reverberation correlation function can be reduced to
The reverberation Doppler distribution curve can be reduced to
In summary, the reverberation Doppler distribution is influenced by the waveform pulse width T, the Doppler PDF \(\rho (\phi )\) of the scatterers and the waveform Doppler resolution. Moreover, the target Doppler resolution interval and the 3 dB Doppler width of the reverberation correlation function determine the Doppler distribution. When the target Doppler resolution interval overlaps heavily with the 3 dB width of the reverberation correlation function, the target is not easy to distinguish. When the waveform Doppler resolution interval overlaps little with the 3 dB width of the reverberation correlation function, the reverberation Doppler clutter is better suppressed and the target is easy to distinguish.
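The dependence on pulse width can be illustrated with a simplified one-dimensional sketch. It is not the paper's Eq. (24), which is not reproduced here. It assumes that the reverberation Doppler curve is the convolution of a two-sided exponential scatterer PDF with the squared zero-delay Doppler cut of the CW AF, that the width is read at half of the peak power (0.707 in amplitude), and that c = 1500 m/s. The function names are hypothetical.

```python
import numpy as np


def reverb_doppler_curve(T, mu_per_kn, f0=4000.0, c=1500.0, kn_to_ms=0.514):
    """Assumed model: scatterer Doppler PDF convolved with |AF Doppler cut|^2."""
    v = np.linspace(-6.0, 6.0, 4801)                  # speed grid in m/s
    rho = 0.5 * mu_per_kn * np.exp(-mu_per_kn * np.abs(v) / kn_to_ms)
    doppler_hz = 2.0 * f0 * v / c                     # speed -> Doppler shift
    af_cut = np.sinc(doppler_hz * T) ** 2             # np.sinc(x) = sin(pi x)/(pi x)
    curve = np.convolve(rho, af_cut, mode="same")
    return v, curve / curve.max()                     # normalise the peak to 1


def width_at_level(v, curve, level=0.5):
    """Full width of the region where the normalised curve is >= level."""
    above = v[curve >= level]
    return float(above.max() - above.min())


mu = 1.15                                             # ss = 10 dB/kn in the text
width_short = width_at_level(*reverb_doppler_curve(0.1, mu))
width_long = width_at_level(*reverb_doppler_curve(0.3, mu))
```

Under these assumptions the longer pulse gives the narrower curve, consistent with the statement that the pulse width T and the scatterer PDF jointly shape the reverberation Doppler distribution.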
2.2.3 Effect of scatterer Doppler PDF \(\rho (\phi )\) on reverberation Doppler distribution
According to Eq. (24), the reverberation Doppler distribution is jointly influenced by the waveform AF and the Doppler PDF of the scatterers. In this section, the waveform parameters are held fixed so that the influence of the waveform AF is eliminated, and the relationship between different scatterer Doppler distributions \(\rho (\phi )\) and the reverberation Doppler distribution is discussed. The scatterer Doppler distribution \(\rho (\phi )\) is modeled by the two-sided exponential distribution, and two extreme cases are considered: ss = 6 dB/kn with \(\mu\) = 0.69, and ss = 20 dB/kn with \(\mu\) = 2.30. The CW pulse width is T = 0.23 s, the target speed is 1.29 m/s, and the signal-to-reverberation ratio is SRR = 4 dB.
Moreover, it is assumed that the target time information is known, and the target time region is extracted with twice the pulse width before subsequent signal processing.
Figure 3a is modeled with Eq. (5) and shows the relationship between different scatterer Doppler distributions and the reverberation Doppler distribution. The target is easier to distinguish on the 6 dB/kn curve than on the 20 dB/kn curve, which indicates that the wider the scatterer Doppler distribution is, the easier it is to distinguish the target. Figure 3b is modeled with Eq. (2) and shows the relationship between the target Doppler resolution and the scatterer Doppler distribution. The shaded area is the overlap between the waveform Doppler resolution and the scatterer Doppler distribution, which can be regarded as the region where high-energy clutter is well matched. Outside the shaded area, the reverberation Doppler clutter is suppressed by the Doppler filtering effect of the waveform. For the same waveform, the wider the scatterer Doppler distribution is, the better the clutter suppression effect is [15].
2.2.4 The effect of AF on the reverberation Doppler distribution
In this section, the scatterer Doppler distribution \(\rho (\phi )\) is assumed to be fixed, with the two-sided exponential distribution model set to ss = 10 dB/kn, \(\mu =1.15\). The effect of different waveform parameters on the reverberation Doppler distribution is discussed.
In Fig. 4a, the target Doppler is set to 1.29 m/s and SRR = 4 dB, and 100 Monte Carlo runs are used to compare the reverberation Doppler distribution curves for T = 0.1 s and T = 0.3 s. When the CW pulse width is T = 0.1 s, the target is not easy to distinguish. When the CW pulse width is T = 0.3 s, the reverberation Doppler distribution becomes narrower, the waveform Doppler resolution improves, and the target is clearly distinguishable at the 0.707 (3 dB) level. Therefore, without considering noise interference, the 0.707 (3 dB) Doppler distribution width of the reverberation can be used to determine whether the target is distinguishable. Thus, the reverberation target resolution function is proposed:
\(\phi _{\text{ data } }\) is the 0.707 (3 dB) width of the reverberation Doppler distribution, and \(\phi _{\textrm{wave}}(T)\) is the waveform Doppler resolution, i.e. half of the waveform's 0.707 (3 dB) width. When the 0.707 (3 dB) reverberation width equals twice the waveform Doppler resolution, the target is distinguishable.
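The decision rule stated above can be sketched as a small check. Eq. (25) itself is not reproduced here, so the absolute-difference form below is an assumption based on the sentence that the target is distinguishable when \(\phi _{\text{data}}\) equals \(2\phi _{\textrm{wave}}(T)\). The 1% tolerance mirrors the interval \([0, 0.01 \cdot 2\phi _{\textrm{wave}}(T)]\) used later for the reward signal, c = 1500 m/s is assumed, and the function names are hypothetical.

```python
def phi_wave(T, f0=4000.0, c=1500.0):
    """Waveform Doppler resolution 0.44*c/(2*T*f0) in m/s (c is assumed)."""
    return 0.44 * c / (2.0 * T * f0)


def resolution_gap(phi_data, T):
    """Assumed form of the resolution function: |phi_data - 2*phi_wave(T)|."""
    return abs(phi_data - 2.0 * phi_wave(T))


def is_distinguishable(phi_data, T, tol=0.01):
    """True when the measured width matches 2*phi_wave(T) within tol (1%)."""
    return resolution_gap(phi_data, T) <= tol * 2.0 * phi_wave(T)


# A measured 0.707 width of 0.55 m/s matches a 0.3 s pulse but not a 0.1 s one.
match_long = is_distinguishable(0.55, 0.3)
match_short = is_distinguishable(0.55, 0.1)
```

With these assumed values, \(2\phi _{\textrm{wave}}\) is 0.55 m/s for T = 0.3 s and 1.65 m/s for T = 0.1 s, so only the longer pulse satisfies the rule.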
As the waveform pulse width changes, the reverberation Doppler distribution also changes. For a single-peak target, the 3 dB position can be judged directly, as in Fig. 4a. If the 3 dB position contains a double peak, as in Fig. 4b, with one peak near 0 m/s, the 3 dB width of the other peak is evaluated; if this secondary peak width satisfies Eq. 25, the target can be distinguished.
The above analysis shows that increasing the waveform pulse width reduces the width of the reverberation Doppler spread clutter so that the target can be distinguished. For high-speed moving targets, the Doppler clutter is essentially absent. As the ocean environment changes, the Doppler distribution of the reverberation scatterers changes in real time, and different waveforms need to be used in different environments to help target discrimination. The more waveforms the active sonar retains, the more flexible its waveform selection.
Although the target resolution function has been proposed, it does not by itself yield the optimal waveform quickly. By combining RL with cognitive sonar, the active sonar can adjust the waveform parameters according to the reverberation Doppler width and quickly distinguish the target from the Doppler spread clutter.
2.3 The Dyna-Q-MaxAction reverberation suppression cognitive sonar
The underwater environment is complex, and the Doppler spread is severe and time-varying. Active sonar therefore requires freedom in waveform design, and combining active sonar with the Dyna-Q algorithm allows waveforms with different parameters to be used for different reverberation Doppler spread clutter. However, the random action selection strategy of the Dyna-Q algorithm affects the number of episodes needed for convergence. Therefore, in this section, the action selection strategy of the Dyna-Q algorithm is improved and the Dyna-Q-MaxAction algorithm is proposed.
2.3.1 The Dyna-Q-MaxAction five-tuple
The five-tuple is the basic component of RL. For better integration with cognitive sonar, Dyna-Q-MaxAction is expressed as the five-tuple of an MDP, which consists of a model of the environment \(\textbf{S}_{\text{ revb } }\), action \(\textbf{A}_{\text{ wave } }\), transfer probability P, reward signal \(\textrm{R}_{\text{ reward } }\), and discount factor \(\gamma\).

1. Model of environment \(\textbf{S}_{\text{ revb } }\): the model of environment \(\textbf{S}_{\text{ revb } }\) is the set of reverberation Doppler clutter widths.
$$\begin{aligned} \textbf{S}_{\text{ revb } }=\left[ S_{r 1}, S_{r 2}, \ldots , S_{r m}, \ldots \right] ^{\textrm{T}} \end{aligned}$$(26)
\(S_{r m}\) represents the Doppler clutter width of the reverberation in the mth state.

2. Action \(\textbf{A}_{\text{ wave } }\): \(\textbf{A}_{\text{ wave } }\) is a collection of waveforms with different parameters.
$$\begin{aligned} \textbf{A}_{\text{ wave } }=\left[ A_{w 1}, A_{w 2}, \ldots , A_{w m}, \ldots \right] ^{\textrm{T}} \end{aligned}$$(27)
When the active sonar selects waveform \(A_{w m}\), it automatically jumps to the corresponding state \(S_{r m}\). In practice, to improve the efficiency of target detection and without considering the blind area, all waveforms can be sent out at once, and the optimal waveform is obtained by the RL calculation.

3. Transfer probability P: the transfer probability P is the probability that choosing waveform \(A_{w m}\) causes the state to transfer from \(S_{r m}\) to \(S_{r n}\). In the absence of prior knowledge, the transfer probability is unknown and the initial transfer probabilities are set equal.

4. Reward signal \(\textrm{R}_{\text{ reward } }\): the reward signal is the reward value of each waveform \(A_{w m}\). Setting the reward signal is important: an unreasonable setting causes the waveform transmit strategy to fall into a local optimum. The reward signal is defined according to the Doppler spread clutter width and the target resolution function of Eq. 25
When \(\textrm{R}_{\textrm{reward}}(T)\in \left[ 0,0.01 *2 \phi _{\textrm{wave}}(T)\right]\), the target is distinguishable from the reverberation Doppler clutter and the reward signal is set to 10. When \(\textrm{R}_{\textrm{reward}}(T)\notin \left[ 0,0.01 *2 \phi _{\textrm{wave}}(T)\right]\), the reward factor is C=1 and the waveform reward signal is obtained through Eq. 28. q is the magnification factor, which is used to enlarge the difference in reward signals between different waveforms. \(\textrm{R}_{\text{ reward } }\) can be used to evaluate the difference between waveforms: the closer a waveform is to the optimal waveform, the higher its \(\textrm{R}_{\text{ reward } }\), which guides the RL to converge to the optimal waveform.

5. Discount factor \(\gamma\): the discount factor \(\gamma \in [0,1]\) determines the decay of future reward signals. When \(\gamma\) tends to 0, the active sonar tends to seek immediate rewards; when \(\gamma\) tends to 1, the active sonar tends to seek long-term gains, meaning that almost all future reward signals influence the Q-value.
2.3.2 The Dyna-Q-MaxAction algorithm flow
The traditional Dyna-Q algorithm treats planning as a way to improve the action, but the action selection within the model of the Dyna-Q algorithm is random [24]. In this paper, the action selection strategy of the Dyna-Q algorithm is improved and the Dyna-Q-MaxAction algorithm is proposed. The algorithm flow is as follows:
\(\alpha\) is the step size, \(\alpha \in (0,1)\), which determines the effect of the estimation error on Q(\(S_{r m}\),\(A_{w m}\)). \(S_{r m}\) and \(A_{w m}\) are the state and the action at the current step, and \(S_{r n}\) and \(A_{w n}\) are the state and the action at the next step. \(\textrm{R}_{\textrm{reward}}\) is the reward signal. \(\varepsilon\) is the greediness of action selection, \(\varepsilon \in [0,1]\), representing the probability of selecting the action with the maximum reward signal. Greedy action selection uses current knowledge to maximize immediate reward and does not sample worse actions. The state-action value function is abbreviated as Q-value. The Q table stores the Q-value at the position corresponding to each pair (\(S_{r m}\), \(A_{w m}\)).
Steps 1–7 of the Dyna-Q-MaxAction algorithm are identical to the traditional Dyna-Q algorithm; the two differ only in steps 8–12. Steps 8–12 can be summarized as n updates of the Q-value using the model already learned. Inspired by the action selection strategy of the Q-learning algorithm, the \(\varepsilon\) action selection strategy is introduced into the Dyna-Q-MaxAction algorithm. The Dyna-Q-MaxAction algorithm sums the reward signals of each action over all environments and compares the sums for different actions. With probability \(\varepsilon\) it selects as the next action the one with the maximum sum of reward signals, and with probability \(1-\varepsilon\) it selects the next action randomly
The DynaQMaxAction algorithm can select an optimal waveform by integrating the waveform into all environments through Eq. 29, which reduces the number of episodes the Qvalue needs to converge. The DynaQMaxAction algorithm architecture is shown in Fig. 5.
In Fig. 5, the path from ‘Real Experience’ to ‘Value Functions (Qvalue)’ represents direct learning, in which real experience improves the Qvalues and actions. The path from ‘Real Experience’ through ‘Model’ and ‘Simulated Experience’ to ‘Value Functions (Qvalue)’ is the modelbased learning process: the model learns from real experience and generates simulated experience, which is then used to update the Qvalue.
The core idea of the DynaQMaxAction algorithm is that the Qvalue is learned from real experience and planned from simulated experience. Learning and planning are deeply integrated in the sense that they share almost all the same machinery, differing only in the source of their experience. The \(\varepsilon\) model action selection strategy can reduce the number of episodes the active sonar needs to converge.
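The interplay of direct learning, model learning and planning can be sketched on a toy problem. Everything below is a hedged illustration: the environment (`step`, with reward +1 for one best action and −1 otherwise, random next state, episode ending when the best action is taken), the constants, and the way a planning state is drawn for the chosen action are assumptions made for this example. It does not reproduce the paper's simulation or its figures.

```python
import random

N_STATES, N_ACTIONS, BEST = 5, 5, 3   # hypothetical toy sizes and best action


def step(action, rng):
    """Toy environment (assumed): +1 for the best action, -1 otherwise."""
    reward = 1.0 if action == BEST else -1.0
    return reward, rng.randrange(N_STATES), action == BEST


def update(Q, s, a, r, s2, done, alpha, gamma):
    """Tabular Q-learning update shared by learning and planning."""
    target = r if done else r + gamma * max(Q[s2])
    Q[s][a] += alpha * (target - Q[s][a])


def plan_action(model, eps_model, rng):
    """Max-sum model action with probability eps_model, else a random one."""
    totals = {}
    for (_, a), (r, _, _) in model.items():
        totals[a] = totals.get(a, 0.0) + r      # sum rewards over all states
    actions = sorted(totals)
    if rng.random() < eps_model:
        return max(actions, key=totals.__getitem__)
    return rng.choice(actions)


def train(episodes=200, n_planning=10, eps=0.6, eps_model=0.6,
          alpha=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    model = {}                                   # (state, action) -> outcome
    for _ in range(episodes):
        s, done = rng.randrange(N_STATES), False
        while not done:
            # Acting: greedy with probability eps, random otherwise.
            if rng.random() < eps:
                a = max(range(N_ACTIONS), key=Q[s].__getitem__)
            else:
                a = rng.randrange(N_ACTIONS)
            r, s2, done = step(a, rng)
            update(Q, s, a, r, s2, done, alpha, gamma)   # direct learning
            model[(s, a)] = (r, s2, done)                # model learning
            for _ in range(n_planning):                  # planning
                a_p = plan_action(model, eps_model, rng)
                s_p = rng.choice([st for (st, ac) in model if ac == a_p])
                r_p, s2_p, done_p = model[(s_p, a_p)]
                update(Q, s_p, a_p, r_p, s2_p, done_p, alpha, gamma)
            s = s2
    return Q


Q = train()
print([max(range(N_ACTIONS), key=row.__getitem__) for row in Q])
```

In this sketch `n_planning` plays the role of the number of model training times and `eps_model` the role of the model greediness; setting `n_planning=0` reduces it to plain Qlearning.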
3 Results and discussion
In this section, the relationship between the reverberation Doppler clutter width and the CW waveform pulse width is simulated with the DynaQMaxAction algorithm, and the optimal waveform is obtained by iterating the algorithm. The SRR is set to 4 dB. The following table lists the reverberation Doppler clutter widths obtained by numerical simulation in different environments, together with the reward signals calculated for the corresponding waveform pulse widths according to Eq. 28.
The reward signals show that only the 300 ms CW signal is the optimal waveform in this environment. The remaining waveforms are given different \(\textrm{R}_{\textrm{reward}}\) values, guiding the active sonar to converge to optimal waveform 4.
This paper considers two kinds of convergence: step convergence and Qvalue convergence. The DynaQMaxAction algorithm is said to have reached step convergence when both the sum of rewards per episode and the number of steps per episode have converged. It is said to have reached Qvalue convergence when the loss function has converged.
The loss function is the error between predicted Qvalue and real Qvalue.
\(\mathrm {Q_{predict}}\) is the predicted value of Qvalue, Q is the real value of Qvalue, and Steps is the number of steps to reach the optimal in this episode.
Define the average reward signal per episode \(R_{\textrm{average}}\)
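The two per-episode quantities described above can be sketched in code. The exact formulas are not reproduced in this text, so the forms below are assumptions: the loss is taken as the squared difference between predicted and real Qvalues averaged over the steps of the episode, and \(R_{\textrm{average}}\) as the sum of reward signals divided by the number of steps. The function names and sample values are hypothetical.

```python
def episode_loss(q_predict, q_real):
    """Assumed loss: mean squared error between predicted and real Q-values.

    q_predict, q_real : per-step sequences of equal length (one entry per step).
    """
    steps = len(q_predict)
    return sum((p - q) ** 2 for p, q in zip(q_predict, q_real)) / steps


def average_reward(rewards):
    """Assumed R_average: total reward of the episode divided by its steps."""
    return sum(rewards) / len(rewards)


# Hypothetical episode with two steps for the loss and three for the reward.
print(episode_loss([0.0, 0.5], [1.0, 0.5]))
print(average_reward([-1.0, -1.0, 1.0]))
```

Under these assumptions, a loss curve that flattens near zero indicates Qvalue convergence, while a stable average reward together with a stable step count indicates step convergence.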
The following subsections discuss the DynaQMaxAction algorithm for different numbers of model training times and different values of the greediness \(\varepsilon\).
3.1 Different training times of the DynaQMaxAction algorithm with \(\varepsilon\)=1
Setting \(\varepsilon\)=1 for the DynaQMaxAction algorithm means that each model action is the one with the maximum reward signal. The number of episodes the algorithm needs to converge is then compared for different numbers of model training times.
In Fig. 6a–d, Env=0 refers to the Qlearning algorithm, Env=5 represents the DynaQMaxAction algorithm with 5 times of model training, and the xaxis represents the number of training episodes. The number of convergence steps, the sum of rewards and the average reward all show that step convergence becomes more efficient as the number of model training times increases.
In Fig. 6a, b, the Qlearning algorithm needs about 30 episodes to reach step convergence, while with 5 times of model training the DynaQMaxAction algorithm needs only about 20 episodes. With 90 times of model training, the DynaQMaxAction algorithm needs only several episodes.
According to the average loss curve in Fig. 6d, the Qvalue of the Qlearning algorithm needs about 400 episodes to converge. With 5 times of model training, the Qvalue of the DynaQMaxAction algorithm needs about 150 episodes to converge, and with 90 times of model training only about 30 episodes.
Therefore, as the number of model training times increases, the DynaQMaxAction algorithm with \(\varepsilon\)=1 needs fewer training episodes and becomes more efficient.
The histograms in Fig. 6e, f are plotted from the Q table and the table of model reward signals after 90 times of model training. Figure 6e shows the Qvalue of different actions in different environments, and Fig. 6f shows the model reward of different actions in different model environments. S1, S2, S3, S4, and S5 represent five different reverberation Doppler clutter widths, and t01, t015, t02, t03, and t035 represent the five waveforms in Table 1, corresponding to pulse widths of 100 ms, 150 ms, 200 ms, 300 ms, and 350 ms, respectively.
The Q table histogram shows that the optimal waveform t03 has the largest Qvalue in every environment, so the Q table provides an unambiguous maximum action selection in each environment. The model reward histogram shows that the reward is largest only for the optimal action in its corresponding optimal environment. As a result, each model action selection directly picks the optimal waveform t03, which speeds up step convergence and reduces the number of episodes needed for Qvalue convergence.
3.2 Different training times of the DynaQMaxAction algorithm with \(\varepsilon\)=0.6
Here \(\varepsilon\) is set to 0.6, and the results for different numbers of model training times are compared.
Figure 7a–d shows that the DynaQMaxAction algorithm converges in fewer episodes as the number of model training times increases, and that 90 times of model training performs best among the tested settings.
In Fig. 7a, b, the Qlearning algorithm needs about 30 episodes to reach step convergence, and the Env=5 DynaQMaxAction algorithm needs about 20 episodes. This is almost the same as in Fig. 6a, b.
In addition, Figs. 6d and 7d show that after 90 times of model training, the \(\varepsilon\)=0.6 curve is flatter than the \(\varepsilon\)=1 curve after convergence.
The histogram in Fig. 7e shows that the Q table of the DynaQMaxAction algorithm with \(\varepsilon\)=0.6 grows stepwise across the different environments, and that the Qvalues tend toward the optimal waveform t03. Compared with Fig. 6e, the \(\varepsilon\)=0.6 Q table is smoother than the \(\varepsilon\)=1 Q table. In addition, the maximum model reward in Fig. 7f occurs at the optimal environment corresponding to the optimal waveform, which is consistent with the \(\varepsilon\)=1 result in Fig. 6f.
3.3 The DynaQMaxAction algorithm with different greediness \(\varepsilon\)
The number of model training times is set to 10, and the results are compared for the action selection probabilities \(\varepsilon\) in [0, 0.3, 0.6, 1], where \(\varepsilon\)=0 represents the DynaQ algorithm.
In Fig. 8, comparing the \(\varepsilon\)=0.3, \(\varepsilon\)=0.6 and \(\varepsilon\)=1 curves with the DynaQ algorithm curve shows that the Qvalue of the DynaQMaxAction algorithm converges in fewer episodes than that of the DynaQ algorithm, and that \(\varepsilon\)=1 converges in the fewest episodes. However, \(\varepsilon\)=1 is too large and leads to overfitting, which makes the average loss curve fluctuate. Therefore, choosing a suitable action selection probability \(\varepsilon\) can reduce the fluctuation while still shortening convergence. According to the results in Fig. 8, \(\varepsilon\)=0.6 is more appropriate.
4 Conclusion
The discrimination of lowspeed weak targets in reverberation Doppler spread clutter is a difficult problem in active sonar signal processing. This paper combines the point scattering model and the twoside exponential distribution to model the reverberation Doppler spread clutter, and verifies the effectiveness of the model with real data from Qiandao Lake. A target resolution function is proposed for the target resolution problem in reverberation Doppler clutter; it can quickly identify the target in the Doppler spread clutter. The target resolution function is combined with the DynaQMaxAction algorithm, which enables the active sonar to adjust the waveform parameters according to different reverberation. The dependence of the DynaQMaxAction algorithm on the greediness of action selection and on the number of model training times is also discussed. The numerical simulation results show that, with the action selection greediness, the DynaQMaxAction algorithm reaches step convergence more rapidly than the Qlearning algorithm, and that its Qvalue converges in fewer episodes than that of the DynaQ algorithm. These results provide a theoretical basis for future engineering applications of RLbased reverberation suppression cognitive sonar.
Availability of data and materials
The data are not publicly available for private reasons.
Abbreviations
DynaQ: Integrating planning, acting, and learning, Qlearning
DynaQMaxAction: DynaQ maximum reward action selection strategy
AF: Ambiguity function
NP: Nondeterministic polynomial
RL: Reinforcement learning
MDP: Markov decision process
SARSA: State action reward state action
TD: Temporal difference
WSSUS: Widesense stationary uncorrelated scattering
PDF: Probability density function
CW: Continuous wave
SRR: Signaltoreverberation ratio
References
Z. Hao ke, M. Qu li, L. Hai lin, Study on robust spacetime adaptive reverberation suppressing, in IEEE 10th International Conference on Signal Processing Proceedings, pp. 2407–2410 (2010)
X. Cui, C. Chi, S. Li, Y. Li, H. Huang, Coprime pulse trains of frequencymodulated for suppressing reverberation, in OCEANS 2021: San Diego Porto, pp. 1–4 (2021)
B.W. Choi, E.H. Bae, J.S. Kim, K.K. Lee, Improved prewhitening method for linear frequency modulation reverberation using dechirping transformation. J. Acoust. Soc. Am. 123(3), 21–25 (2008)
J.N. Maksym, M. SandysWunsch, Adaptive beamforming against reverberation for a threesensor array. J. Acoust. Soc. Am. 102(6), 3433–3438 (1997)
Y. Li, H. Huang, C. Zhang, S. Li, New schurtypebased pci algorithms for reverberation suppression in active sonar, in Proceedings. (ICASSP ’05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005, vol. 4, pp. 641–6444 (2005)
Z.Q. Wang, L. An, J.R. Lu, Signal detection based on mathematical morphology in oceanic reverberation, in 2007 14th International Conference on Mechatronics and Machine Vision in Practice, pp. 8–12 (2007)
S. Haykin, Cognitive radar: a way of the future. IEEE Signal Process. Mag. 23(1), 30–40 (2006)
L. Xiaohua, L. Yaan, L. Guancheng, Y. Jing, Research of the principle of cognitive sonar and beamforming simulation analysis, in 2011 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), pp. 1–5 (2011)
T. Claussen, V.D. Nguyen, Realtime cognitive sonar system with targetoptimized adaptive signal processing through multilayer data fusion, in 2015 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), pp. 357–361 (2015)
X. Qing, D. Nie, G. Qiao, J. Tang, Dolphin bioinspired transmitting waveform design for cognitive sonar and its performance analysis, in 2016 IEEE/OES China Ocean Acoustics (COA), pp. 1–7 (2016)
D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, D. Hassabis, A general reinforcement learning algorithm that masters chess, shogi, and go through selfplay. Science 362(6419), 1140–1144 (2018)
M. Taylor, Teaching reinforcement learning with mario: An argument and case study, in Proceedings of the National Conference on Artificial Intelligence 2 (2011)
J.E. Summers, J.M. Trader, C.F. Gaumond, J.L. Chen, Deep reinforcement learning for cognitive sonar. J. Acoust. Soc. Am. 143(3Supplement), 1716–1716 (2018)
J. Tucker, V. Chavali, K.E. Wage, J.K. Nelson, Multiple objective optimization for fully adaptive active sonar, in OCEANS 2022, Hampton Roads, pp. 1–9 (2022)
T.C. Yang, J. Schindall, C.F. Huang, J.Y. Liu, Clutter reduction using Doppler sonar in a harbor environment. J. Acoust. Soc. Am. 132(5), 3053–3067 (2012)
R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA, 2016)
X. He, Y. Xu, M. Liu, C. Hao, C. Hou, Adaptive estimation of kdistribution shape parameter based on fuzzy statistical normalization processing. IEEE Trans. Aerosp. Electron. Syst. 58(5), 4566–4577 (2022)
P.C. Etter, C.H. Haas, D.V. Ramani, Evolving trends and challenges in applied underwater acoustic modeling, in OCEANS 2015  MTS/IEEE Washington, pp. 1–10 (2015)
F. Cao, X. Zhang, J. Han, S. Lv, Experimental analysis of statistical property of low frequency reverberation envelope in shallow water, in 2021 OES China Ocean Acoustics (COA), pp. 534–538 (2021)
J.J. Murray, A theoretical model of linearly filtered reverberation for pulsed active sonar in shallow water. J. Acoust. Soc. Am. 136(5), 2523–2531 (2014)
C. Zhang, X. Ma, X. Li, F. Zhan, S. Zhang, Modified asymmetric statistical model for the reverberation doppler spread spectrum. Shengxue Xuebao/Acta Acustica 43, 943–950 (2018)
J. Zhang, X. Qiu, C. Shi, Y. Wu, Cognitive radar ambiguity function optimization for unimodular sequence. EURASIP J. Adv. Signal Process. 2016, 1–13 (2016)
N.U.R. Junejo, M. Sattar, S. Adnan, H. Sun, A.B.M. Adam, A. Hassan, H. Esmaiel, A survey on physical layer techniques and challenges in underwater communication systems. J. Marine Sci. Eng. 11(4) (2023)
X. Li, C. Yang, J. Song, S. Feng, W. Li, H. He, A motion control method for agent based on dynaq algorithm, in 2023 4th International Conference on Computer Engineering and Application (ICCEA), pp. 274–278 (2023)
Acknowledgements
The authors would like to express their gratitude to the Deep Sea Observation Project, 2019JCJQZD02400.
Funding
The research was supported by the Deep Sea Observation Project, 2019JCJQZD02400.
Author information
Contributions
Yubin Fu put forward the original idea of the paper and completed the manuscript. Xiaochuan Ma contributed to the validation, review, editing and supervision. Xingyuan Pei and Pengzhuo Li finished the revisions.
Ethics declarations
Ethics approval and consent to participate
All procedures performed in this paper were in accordance with the ethical standards of the research community.
Consent for publication
Not applicable
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Fu, Y., Ma, X., Feng, C. et al. Modelbased optimal action selection for DynaQ reverberation suppression cognitive sonar. EURASIP J. Adv. Signal Process. 2023, 116 (2023). https://doi.org/10.1186/s13634023010547
DOI: https://doi.org/10.1186/s13634023010547