Speech improvement in noisy reverberant environments using virtual microphones along with proposed array geometry

This paper proposes a novel approach for improving the speech of a single speaker in noisy reverberant environments. The approach uses a beamformer with a large number of virtual microphones arranged in a proposed geometry on an open sphere. The virtual microphone signals are synthesized using non-parametric sound field reproduction in the spherical harmonics domain together with the widely used weighted prediction error method. The beam is steered accurately towards a known source location with higher directivity. The approach is shown to be effective not only in boosting the directivity factor but also in improving speech quality as measured by objective metrics such as the PESQ. Compared with current research on beamformer-based speech enhancement, our experiments show stronger noise and reverberation suppression and better quality in the enhanced speech samples, owing to the virtual beam rotation in the fixed beamformer.


Introduction
Distant speech signals recorded by a microphone inside a room contain reverberation caused by reflections from surfaces such as walls, windows, floors, doors and ceilings. Like additive noise, echoes and interference, reverberation has a destructive effect on speech intelligibility [1]. Moreover, high reverberation dramatically decreases the recorded speech quality, causing severe degradation in audio applications such as automatic speech recognition and source localization [2]. As the reverberation time (RT60) increases, the detrimental effects on the speech signal grow. In the literature, reverberation is divided into early reflections and late reverberation. The early reflections increase speech intelligibility, whereas the late reverberation distorts the speech signal [3].
Beamforming is considered a rational approach to overcome noise and reverberation, which has attracted much attention due to its advantages in audio signal capturing in many applications such as sound reproduction and speech separation [1]. The beamformer performance depends on the number of microphones [1]. Although increasing the number of microphones increases the signal-to-noise ratio (SNR) in the output of the beamformer [4,5], in terms of hardware and computational complexity, it is not feasible to increase the number of microphones by a significant amount. Therefore, a serious challenge is the limitation of the number of microphones in beamformers.
Although an adaptive beamformer can adapt the beam pattern toward the source, its performance in a highly reverberant room is lower than that of a fixed beamformer. Therefore, fixed beamformers are preferable to adaptive beamformers in high-reverberation conditions [6].
In this research, virtual microphones (VMs) are considered an attractive way to overcome these problems. With proper techniques, VMs can synthesize sound signals at any spatial position, independent of the locations of the physical microphones. Virtual microphones (sensors) have been used in various applications, mainly in array processing [5]. For example, in [7], the phase shift is estimated by using a VM in the microphone array; also, in [8], a wideband beamformer is designed by deploying an optimized array consisting of virtual sensors.
Although there are several techniques to synthesize a VM signal, the procedure of virtual miking is still a significant challenge [5]. For instance, in [9], the image theory is applied to VM signal estimation, in [10], the interpolation of physical microphone signals generates a new VM signal, and in [11], geometrical information is used to produce the VM signal.
Sound field recording consists of reconstructing the sound signal at arbitrary positions in space [12], and it can therefore be employed for synthesizing the VM signal. The approaches are generally classified into parametric [11][12][13][14][15] and non-parametric [16][17][18][19][20][21][22]. In parametric methods, a general model characterizes the acquired sound field, whereas non-parametric methods decompose the recorded sound field into spatial basis functions. Since the recorded signal contains both the direct sound and reverberation, the parametric method requires two distinct models to estimate the direct sound and the reverberation [12]. In contrast, the non-parametric approach needs no model estimation: the sound field is represented by determining the coefficients of spherical harmonics. We also present a new technique to calculate the coefficients of spherical harmonics.
Because reverberation depends on the room impulse response (RIR) between the microphone and the sound source, obtaining an accurate model for reverberation is practically infeasible. Therefore, despite its analysis complexity, the non-parametric method is used in this research together with a dereverberation approach. It is worth noting that in previous research the signal in the spherical harmonics domain has not been employed as a virtual microphone signal, whereas in this research we reconstruct the signals of virtual microphones in a noisy reverberant room from the recorded signals of real microphones.
The accuracy of the non-parametric method increases when the amount of reverberation decreases. Dereverberation removes the reverberant component of the recorded signal and improves the signal-to-reverberant ratio (SRR). The criterion for evaluating a dereverberation technique is the amount of late reverberation suppression [3,23]. Among the classes of dereverberation methods, direct inverse filtering has fewer performance limitations and less sensitivity to the RIR estimation than spectral enhancement and channel equalization [23][24][25][26]. Therefore, the weighted prediction error (WPE) algorithm, which belongs to this class, is utilized in this research. This paper proposes a solution for synthesizing VM signals and using them in a fixed beamformer. The proposed approach allows the number of VMs in the beamformer to be increased without additional hardware. Figure 1 shows an overview of this research.
Our contributions are as follows: (1) We propose a new technique to synthesize the virtual microphone signal using spherical harmonics analysis. (2) We offer a new array geometry that utilizes a large number of VMs without increasing the computational complexity of the beamformer. (3) We propose a method to rotate the beam pattern of a fixed beamformer towards the known sound source location.
The paper is organized as follows. In Sect. 2, problem formulation of the sound field in spherical coordinates for virtual miking and sampling, beamforming and dereverberation is described. In Sect. 3, array geometry is considered, and at the beginning, the array performance evaluation is defined; next, the proposed uniform phase shift array geometry is detailed. In Sect. 4, experimental results, which include the implementation setup and simulation results, are presented.

Problem formulation
In this paper, as shown in Fig. 2, the position of a point in spherical coordinates is specified as $\mathbf{r} = (r, \theta, \phi)$, in which $r$ is the radial distance from the origin (radius), and $\theta$ and $\phi$ are the inclination (polar angle) and azimuth, respectively. An acoustic source located at $\mathbf{r}_s$ is considered in the far-field region. The room in which the microphone array and the sound source are located has moderate diffuse noise and high reverberation. $S(t, \omega, \mathbf{r})$ is the signal of a single speech source recorded by a physical microphone, which can be written as [12]

$$S(t, \omega, \mathbf{r}) = S_d(t, \omega, \mathbf{r}) + S_r(t, \omega, \mathbf{r}) + N(t, \omega, \mathbf{r}), \tag{1}$$

where $t$ is time, $\omega = 2\pi f$ is the angular frequency, $f > 0$ is the temporal frequency, $S_d(t, \omega, \mathbf{r})$ is the sum of the direct-path speech and early reflections, $S_r(t, \omega, \mathbf{r})$ is the late reverberation signal, which is spatially isotropic and homogeneous, and $N(t, \omega, \mathbf{r})$ is the noise. Suppose $X(t, \omega, \mathbf{r})$ is the virtual microphone signal, which is defined as

$$X(t, \omega, \mathbf{r}) = X_d(t, \omega, \mathbf{r}) + X_r(t, \omega, \mathbf{r}) + X_n(t, \omega, \mathbf{r}), \tag{2}$$

where $X_d(t, \omega, \mathbf{r})$ is the reconstructed direct sound, $X_r(t, \omega, \mathbf{r})$ is the reverberation sound field component, and $X_n(t, \omega, \mathbf{r})$ is the estimated noise.

Fig. 1 The research block-diagram

Making the VM signal
This section explains the method of creating a virtual microphone signal in the spherical harmonics domain by employing the spherical Fourier transform. By calculating the coefficients of spherical harmonics, the received speech signal at a specific point on the sphere surface can be estimated. $Y_n^m(\theta, \phi)$ is the spherical harmonic of order $n$ ($n \in \mathbb{N}$) and degree $m$ ($m \in \mathbb{Z}$ and $-n \le m \le n$), which is defined as [27]

$$Y_n^m(\theta, \phi) = \sqrt{\frac{2n+1}{4\pi}\,\frac{(n-m)!}{(n+m)!}}\; P_n^m(\cos\theta)\, e^{jm\phi}, \tag{3}$$

where $(\cdot)!$ is the factorial function, and $P_n^m(\cos\theta)$ is the associated Legendre polynomial.
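As an illustration of the spherical harmonic definition, the following is a minimal standard-library sketch. It assumes the common normalization with the Condon-Shortley phase included in the associated Legendre function; the text does not state which phase convention is used, and the function names `assoc_legendre` and `sph_harm` are illustrative only.

```python
import cmath
import math


def assoc_legendre(n: int, m: int, x: float) -> float:
    """Associated Legendre function P_n^m(x) for 0 <= m <= n.

    Assumption: the Condon-Shortley phase (-1)^m is included.
    """
    # Closed form for the diagonal term: P_m^m(x) = (-1)^m (2m-1)!! (1-x^2)^(m/2).
    pmm = 1.0
    root = math.sqrt(max(0.0, 1.0 - x * x))
    for i in range(1, m + 1):
        pmm *= -(2 * i - 1) * root
    if n == m:
        return pmm
    # First off-diagonal term: P_{m+1}^m(x) = x (2m+1) P_m^m(x).
    pm1 = x * (2 * m + 1) * pmm
    # Upward recurrence: (l-m) P_l^m = x (2l-1) P_{l-1}^m - (l+m-1) P_{l-2}^m.
    for l in range(m + 2, n + 1):
        pmm, pm1 = pm1, (x * (2 * l - 1) * pm1 - (l + m - 1) * pmm) / (l - m)
    return pm1


def sph_harm(n: int, m: int, theta: float, phi: float) -> complex:
    """Y_n^m(theta, phi), with theta the polar angle and phi the azimuth."""
    if m < 0:
        # Symmetry relation: Y_n^{-m} = (-1)^m conj(Y_n^m).
        return (-1) ** (-m) * sph_harm(n, -m, theta, phi).conjugate()
    norm = math.sqrt((2 * n + 1) / (4 * math.pi)
                     * math.factorial(n - m) / math.factorial(n + m))
    return norm * assoc_legendre(n, m, math.cos(theta)) * cmath.exp(1j * m * phi)
```

For example, `sph_harm(0, 0, theta, phi)` is the constant $1/\sqrt{4\pi}$ for any direction, which is a quick sanity check of the normalization.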

Fig. 2 Geometric model of a spherical array
Provided that $p(k, \mathbf{r})$ is a square-integrable function on the surface of an open sphere and that $kr$ is smaller than $N$, it can be represented by a weighted sum of the spherical harmonics as [27]

$$p(k, \mathbf{r}) = \sum_{n=0}^{N} \sum_{m=-n}^{n} p_{nm}(k, r)\, Y_n^m(\theta, \phi), \tag{4}$$

where $N$ is the truncation order, $p(k, \mathbf{r})$ is the time-dependent amplitude of the sound pressure in free three-dimensional space, $p_{nm}(k, r)$ are the weights, known as the coefficients of the spherical Fourier transform, $k = 2\pi f / c$ is the wave number, and $c$ is the speed of sound in air. The coefficients of the spherical Fourier transform are defined as [27]

$$p_{nm}(k, r) = \int_{0}^{2\pi} \int_{0}^{\pi} p(k, \mathbf{r}) \left[ Y_n^m(\theta, \phi) \right]^{*} \sin\theta \, d\theta \, d\phi, \tag{5}$$

where $(\cdot)^{*}$ denotes complex conjugation. It is worth noting that, to satisfy the far-field condition, the distance between the sound source and the centre of the microphone array has to be more than $8 r^2 f / c$ [28].
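The wave number and the far-field distance bound quoted above are simple to evaluate. The sketch below assumes a speed of sound of 343 m/s, a value not given in the text, and the function names are illustrative.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at roughly 20 degrees C


def wavenumber(f_hz: float, c: float = SPEED_OF_SOUND) -> float:
    """Wave number k = 2*pi*f / c in rad/m."""
    return 2.0 * math.pi * f_hz / c


def far_field_distance(radius_m: float, f_hz: float, c: float = SPEED_OF_SOUND) -> float:
    """Minimum source distance 8*r^2*f / c for the far-field condition."""
    return 8.0 * radius_m ** 2 * f_hz / c
```

With the 10 cm array radius used later in the paper and a frequency of 4 kHz, `far_field_distance(0.1, 4000.0)` evaluates 320/343, which is a little under one metre.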
Because the physical microphones are uniformly distributed on the spherical array (e.g. positioned at the vertices of a Platonic solid), for $n \le N$, $p_{nm}(k, r)$ can be obtained as [27]

$$p_{nm}(k, r) = \frac{4\pi}{Q} \sum_{q=1}^{Q} p(k, \mathbf{r}_q) \left[ Y_n^m(\theta_q, \phi_q) \right]^{*}, \tag{6}$$

where $\mathbf{r}_q = (r, \theta_q, \phi_q)$ is the location of the $q$th physical microphone and $Q$ is the number of physical microphones. To avoid spatial aliasing, $Q$ is set to be greater than or equal to $(N + 1)^2$ [27].
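The sampling condition $Q \ge (N+1)^2$ fixes the highest usable truncation order for a given number of microphones. A small sketch of that bookkeeping follows; the helper names are hypothetical.

```python
import math


def max_sh_order(num_mics: int) -> int:
    """Largest truncation order N that satisfies (N + 1)^2 <= Q."""
    return math.isqrt(num_mics) - 1


def sh_index_pairs(order: int) -> list:
    """All (n, m) pairs with 0 <= n <= N and -n <= m <= n."""
    return [(n, m) for n in range(order + 1) for m in range(-n, n + 1)]
```

For the Q = 32 array used in the implementation setup this gives N = 4 and 25 index pairs, matching the 25 spherical harmonic signals mentioned there.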
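By combining (4) and (6), the amplitude of the sound pressure on the sphere surface in the direction $(\theta, \phi)$ is

$$p(k, \mathbf{r}) = \frac{4\pi}{Q} \sum_{n=0}^{N} \sum_{m=-n}^{n} \sum_{q=1}^{Q} p(k, \mathbf{r}_q) \left[ Y_n^m(\theta_q, \phi_q) \right]^{*} Y_n^m(\theta, \phi). \tag{7}$$

A physical microphone located at $\mathbf{r}_q$ converts $p(k, \mathbf{r}_q)$ to $S(t, \omega, \mathbf{r}_q)$, and a virtual microphone positioned at $\mathbf{r}$ converts $p(k, \mathbf{r})$ to $X(t, \omega, \mathbf{r})$. Finally, based on (7), the VM signal can be synthesized as

$$X(t, \omega, \mathbf{r}) = \frac{4\pi}{Q} \sum_{n=0}^{N} \sum_{m=-n}^{n} \sum_{q=1}^{Q} S(t, \omega, \mathbf{r}_q) \left[ Y_n^m(\theta_q, \phi_q) \right]^{*} Y_n^m(\theta, \phi). \tag{8}$$

Equation (8) is evaluated once for each virtual microphone, so the computational complexity increases linearly with the number of virtual microphones.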
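The synthesis step can be sketched for a single time-frequency bin as follows. This is an illustration under stated assumptions rather than the authors' implementation: it uses the 12 vertices of an icosahedron as a small, nearly uniform layout (the paper uses 32 microphones), equal quadrature weights $4\pi/Q$, and the Condon-Shortley phase convention. All function names are illustrative.

```python
import cmath
import math


def assoc_legendre(n, m, x):
    """P_n^m(x) for 0 <= m <= n (Condon-Shortley phase assumed)."""
    pmm = 1.0
    root = math.sqrt(max(0.0, 1.0 - x * x))
    for i in range(1, m + 1):
        pmm *= -(2 * i - 1) * root
    if n == m:
        return pmm
    pm1 = x * (2 * m + 1) * pmm
    for l in range(m + 2, n + 1):
        pmm, pm1 = pm1, (x * (2 * l - 1) * pm1 - (l + m - 1) * pmm) / (l - m)
    return pm1


def sph_harm(n, m, theta, phi):
    """Y_n^m(theta, phi), theta = polar angle, phi = azimuth."""
    if m < 0:
        return (-1) ** (-m) * sph_harm(n, -m, theta, phi).conjugate()
    norm = math.sqrt((2 * n + 1) / (4 * math.pi)
                     * math.factorial(n - m) / math.factorial(n + m))
    return norm * assoc_legendre(n, m, math.cos(theta)) * cmath.exp(1j * m * phi)


def icosahedron_directions():
    """(theta, phi) of the 12 icosahedron vertices (illustrative layout)."""
    g = (1.0 + math.sqrt(5.0)) / 2.0  # golden ratio
    verts = []
    for a in (-1.0, 1.0):
        for b in (-g, g):
            # Cyclic permutations of (0, +-1, +-g).
            verts += [(0.0, a, b), (a, b, 0.0), (b, 0.0, a)]
    dirs = []
    for x, y, z in verts:
        norm = math.sqrt(x * x + y * y + z * z)
        dirs.append((math.acos(z / norm), math.atan2(y, x)))
    return dirs


def synthesize_vm(mic_values, mic_dirs, theta, phi, order):
    """Estimate one time-frequency bin at direction (theta, phi).

    mic_values[q] is the complex bin recorded by microphone q.
    """
    total = 0j
    for n in range(order + 1):
        for m in range(-n, n + 1):
            # Unscaled coefficient: sum over microphones of S_q * conj(Y_n^m(q)).
            p_nm = sum(s * sph_harm(n, m, th, ph).conjugate()
                       for s, (th, ph) in zip(mic_values, mic_dirs))
            total += p_nm * sph_harm(n, m, theta, phi)
    # Equal quadrature weights 4*pi/Q for a uniform layout.
    return 4.0 * math.pi / len(mic_values) * total
```

If the sampled field contains only harmonics up to order 2, the 12-point layout reproduces it at any other direction on the sphere up to rounding error; higher-order content would alias, which is why the microphone count limits the truncation order.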

Dereverberation
Based on [23], by filtering the multi-channel recorded signal, the late reverberation signal at the $q$th physical microphone can be estimated as in (9), where $c_l^{(q,q')}(\omega)$ are the coefficients of the linear prediction (dereverberation) filter, the superscript $(\cdot)^H$ denotes the Hermitian transpose, $D$ is the time delay that separates the early reflections from the late reverberation part, and $L_c$ is the dereverberation filter length.
Based on (1) and using (9), the direct sound signal at the $q$th physical microphone can be estimated as

$$\hat{S}_d(t, \omega, \mathbf{r}_q) = S(t, \omega, \mathbf{r}_q) - \hat{S}_r(t, \omega, \mathbf{r}_q). \tag{10}$$

To estimate the direct sound signal, the filter coefficients $c_l^{(q,q')}(\omega)$ are predicted using the WPE method. The conventional WPE method assumes a circularly symmetric complex Gaussian distribution for the desired speech coefficients at the first physical microphone, $S_d(t, \omega, \mathbf{r}_1)$, with zero mean and unknown time-varying variance [23,29]. Using the iterative procedure described in Algorithm 1, $c_l^{(q,q')}(\omega)$ can be estimated [23], where $J$ is the number of iterations and $\varepsilon$ is a small positive constant.
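Algorithm 1 is not reproduced in the text, so the following is only a sketch of the conventional WPE iteration as it is commonly formulated, reduced here to one channel and one frequency bin (the paper filters all Q channels jointly). Each pass estimates the time-varying variance from the current output, solves the variance-weighted normal equations for the prediction filter, and subtracts the predicted late reverberation. The names `solve` and `wpe_single_bin` are illustrative.

```python
def solve(a, b):
    """Solve a x = b for a small complex system (Gaussian elimination)."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0j] * n
    for r in range(n - 1, -1, -1):
        acc = aug[r][n] - sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = acc / aug[r][r]
    return x


def wpe_single_bin(x, delay, taps, iterations, eps):
    """Single-channel, single-bin WPE sketch.

    x: complex STFT frames of one frequency bin; delay = D, taps = L_c,
    iterations = J, eps = lower bound on the variance estimate.
    Returns (dereverberated frames, prediction filter c).
    """
    def past(t):
        # Delayed observation vector [x(t-D), ..., x(t-D-L_c+1)], zero-padded.
        return [x[t - delay - l] if t - delay - l >= 0 else 0j
                for l in range(taps)]

    d = list(x)  # initial estimate of the desired signal
    c = [0j] * taps
    for _ in range(iterations):
        # Time-varying variance of the desired signal, floored at eps.
        lam = [max(abs(v) ** 2, eps) for v in d]
        r_mat = [[0j] * taps for _ in range(taps)]
        p_vec = [0j] * taps
        for t in range(len(x)):
            xt = past(t)
            for i in range(taps):
                p_vec[i] += xt[i] * x[t].conjugate() / lam[t]
                for k in range(taps):
                    r_mat[i][k] += xt[i] * xt[k].conjugate() / lam[t]
        c = solve(r_mat, p_vec)
        # Subtract the predicted late reverberation c^H x~(t).
        d = [x[t] - sum(ci.conjugate() * xi for ci, xi in zip(c, past(t)))
             for t in range(len(x))]
    return d, c
```

On a toy input such as a single impulse followed by an exponentially decaying tail, the tail is exactly predictable from the past frames, so the sketch should recover the impulse and remove the tail. Real speech needs the longer filters and delays quoted in the implementation setup.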
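Once $C^{[j]}(\omega)$ is specified, the coefficients of the dereverberation filter $c_l^{(q,q')}(\omega)$ are defined by (11). Using (10), the estimated direct sound signal $\hat{S}_d(t, \omega, \mathbf{r}_q)$ can then be obtained for all physical microphones ($q = 1, 2, \cdots, Q$). By replacing $S(t, \omega, \mathbf{r}_q)$ with $\hat{S}_d(t, \omega, \mathbf{r}_q)$ in (8), the estimated direct sound of the VM signal can be acquired as

$$X_d(t, \omega, \mathbf{r}) = \frac{4\pi}{Q} \sum_{n=0}^{N} \sum_{m=-n}^{n} \sum_{q=1}^{Q} \hat{S}_d(t, \omega, \mathbf{r}_q) \left[ Y_n^m(\theta_q, \phi_q) \right]^{*} Y_n^m(\theta, \phi). \tag{12}$$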

Beamforming
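As shown in Fig. 1, a beamformer whose inputs are the synthesized VM signals is used. A complex-valued weight $W_v(\omega)$ is applied to the $v$th VM signal, and the weighted signals are then added together. The beamformer output is obtained as [1]

$$Z(t, \omega) = \sum_{v=1}^{V} W_v^{*}(\omega)\, X_d(t, \omega, \mathbf{r}_v), \tag{13}$$

where $X_d(t, \omega, \mathbf{r}_v)$ is an estimate of the direct sound of the $v$th virtual microphone located at $\mathbf{r}_v$, and $V$ is the number of virtual microphones. By combining (12) and (13), the beamformer output is given as

$$Z(t, \omega) = \frac{4\pi}{Q} \sum_{v=1}^{V} W_v^{*}(\omega) \sum_{n=0}^{N} \sum_{m=-n}^{n} \sum_{q=1}^{Q} \hat{S}_d(t, \omega, \mathbf{r}_q) \left[ Y_n^m(\theta_q, \phi_q) \right]^{*} Y_n^m(\theta_v, \phi_v). \tag{14}$$

It is assumed that all physical and virtual microphones are omnidirectional and, without loss of generality, that the source is located in the direction ($\theta = 90^\circ$, $\phi = 0^\circ$) in the far-field. The phase vector of the VMs is then given as

$$\mathbf{d}(\omega) = \left[ e^{-j\omega\tau_1}, e^{-j\omega\tau_2}, \ldots, e^{-j\omega\tau_V} \right]^{T}, \tag{15}$$

where $\tau_v$ and $e^{-j\omega\tau_v}$ are the time delay of receiving the source signal and the phase shift of the $v$th VM signal, respectively.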
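Assuming spherically diffuse white noise with zero mean, the $V \times V$ pseudo-coherence matrix $\boldsymbol{\Gamma}(\omega)$ can be specified. The $(v, v')$th element of $\boldsymbol{\Gamma}(\omega)$ is given as [1]

$$\left[ \boldsymbol{\Gamma}(\omega) \right]_{v,v'} = \frac{\sin\left( \omega \lVert \mathbf{r}_v - \mathbf{r}_{v'} \rVert / c \right)}{\omega \lVert \mathbf{r}_v - \mathbf{r}_{v'} \rVert / c}. \tag{16}$$

The weights of a regularized superdirective beamformer, $\mathbf{w}(\omega) = [W_1(\omega), \ldots, W_V(\omega)]^{T}$, are given as [1]

$$\mathbf{w}(\omega) = \frac{\left[ \boldsymbol{\Gamma}(\omega) + \epsilon \mathbf{I}_V \right]^{-1} \mathbf{d}(\omega)}{\mathbf{d}^{H}(\omega) \left[ \boldsymbol{\Gamma}(\omega) + \epsilon \mathbf{I}_V \right]^{-1} \mathbf{d}(\omega)}, \tag{17}$$

where $\epsilon \ge 0$ is the regularization parameter and $\mathbf{I}_V$ is the $V \times V$ identity matrix.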
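A minimal sketch of the regularized superdirective weights is given below. It assumes the standard sinc-shaped coherence of a spherically diffuse field, far-field delays measured relative to the array centre, and a speed of sound of 343 m/s. The function names and the small linear solver are illustrative, not part of the paper.

```python
import cmath
import math

C_SOUND = 343.0  # m/s; assumed speed of sound


def solve(a, b):
    """Solve a x = b for a small complex system (Gaussian elimination)."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0j] * n
    for r in range(n - 1, -1, -1):
        acc = aug[r][n] - sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = acc / aug[r][r]
    return x


def diffuse_coherence(positions, omega, c=C_SOUND):
    """Gamma[v][v'] = sin(x)/x with x = omega * distance / c."""
    gamma = []
    for pv in positions:
        row = []
        for pw in positions:
            x = omega * math.dist(pv, pw) / c
            row.append(1.0 if x == 0.0 else math.sin(x) / x)
        gamma.append(row)
    return gamma


def steering_vector(positions, source_dir, omega, c=C_SOUND):
    """d_v = exp(-j*omega*tau_v), tau_v = -(r_v . u) / c for unit vector u."""
    return [cmath.exp(1j * omega * sum(p * u for p, u in zip(pos, source_dir)) / c)
            for pos in positions]


def superdirective_weights(positions, source_dir, omega, eps):
    """Regularized superdirective weights; returns (w, d)."""
    d = steering_vector(positions, source_dir, omega)
    gamma = diffuse_coherence(positions, omega)
    # Diagonal loading: Gamma + eps * I.
    loaded = [[complex(g) + (eps if i == k else 0.0) for k, g in enumerate(row)]
              for i, row in enumerate(gamma)]
    z = solve(loaded, d)  # z = (Gamma + eps I)^{-1} d
    denom = sum(di.conjugate() * zi for di, zi in zip(d, z))  # d^H z
    return [zi / denom for zi in z], d
```

The normalization in the denominator makes the beamformer distortionless in the look direction, so $\mathbf{w}^H \mathbf{d} = 1$ for any non-negative regularization value.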

Proposed array geometry
The geometry of the microphone array has an important effect on the sound capturing performance. Beam-pattern, directivity factor (DF), white noise gain (WNG), frequency range, robustness, and sidelobe suppression are major parameters related to geometry [30]. In this study, the two main evaluation parameters of spatial sound capturing are the DF and the WNG. Using (15), (16) and (17), the DF is expressed as [1]

$$\mathrm{DF}(\omega) = \frac{\left| \mathbf{w}^{H}(\omega)\, \mathbf{d}(\omega) \right|^{2}}{\mathbf{w}^{H}(\omega)\, \boldsymbol{\Gamma}(\omega)\, \mathbf{w}(\omega)}, \tag{18}$$

and the WNG is given as [1]

$$\mathrm{WNG}(\omega) = \frac{\left| \mathbf{w}^{H}(\omega)\, \mathbf{d}(\omega) \right|^{2}}{\mathbf{w}^{H}(\omega)\, \mathbf{w}(\omega)}. \tag{19}$$

Our proposed geometry is a combination of parallel rings at equal distances from each other (see Fig. 3). The plane of each ring is perpendicular to the line between the sphere centre and the source location. As a result, all points on a ring are at the same distance from the source. Therefore, the direct signals received at the points on a ring are in phase with each other and can be added together easily, so the best WNG is obtained with the proposed geometry. In order not to increase the computational load of the beamformer, the number of rings is set equal to the number of real microphones. The radius of the $l$th ring is given by (20), where $L$ is the number of rings and $l = 1, 2, \ldots, L$. It is assumed that $Q_l$ virtual microphones are distributed uniformly on the $l$th ring. Based on [31], to avoid spatial aliasing, the ranges of $Q_l$ and $L$ are bounded as in (21), where $f_{max}$ is the maximum frequency of the speech and $r$ is the radius of the circle (ring). Finally, there are $L$ distinct rings in the proposed array geometry, leading to a total of $V = \sum_{l=1}^{L} Q_l$ virtual microphones.
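The in-phase property of the ring layout, and the DF and WNG definitions, can be checked numerically with a short sketch. Several choices here are assumptions, since the ring radius and the ring populations are not reproduced in the text: the ring planes are spaced equally along the source axis, the number of points per ring is passed in by hand, the source lies on the +x axis, and plain delay-and-sum weights stand in for the superdirective weights. The function names are illustrative.

```python
import cmath
import math

C_SOUND = 343.0  # m/s; assumed speed of sound


def ring_positions(radius, points_per_ring):
    """Rings in planes x = const on a sphere; source assumed on the +x axis."""
    n_rings = len(points_per_ring)
    rings = []
    for l, count in enumerate(points_per_ring, start=1):
        # Assumed equal spacing of the ring planes along the source axis.
        x_l = radius * (1.0 - 2.0 * l / (n_rings + 1))
        rho = math.sqrt(radius ** 2 - x_l ** 2)  # ring radius on the sphere
        rings.append([(x_l,
                       rho * math.cos(2.0 * math.pi * i / count),
                       rho * math.sin(2.0 * math.pi * i / count))
                      for i in range(count)])
    return rings


def steering_vector(positions, omega, c=C_SOUND):
    """d_v = exp(-j*omega*tau_v) with far-field delay tau_v = -x_v / c."""
    return [cmath.exp(1j * omega * x / c) for x, _, _ in positions]


def diffuse_coherence(positions, omega, c=C_SOUND):
    """Sinc-shaped coherence of a spherically diffuse noise field."""
    gamma = []
    for pv in positions:
        row = []
        for pw in positions:
            x = omega * math.dist(pv, pw) / c
            row.append(1.0 if x == 0.0 else math.sin(x) / x)
        gamma.append(row)
    return gamma


def directivity_factor(w, d, gamma):
    """DF = |w^H d|^2 / (w^H Gamma w)."""
    num = abs(sum(wi.conjugate() * di for wi, di in zip(w, d))) ** 2
    den = sum(w[i].conjugate() * gamma[i][k] * w[k]
              for i in range(len(w)) for k in range(len(w)))
    return num / den.real


def white_noise_gain(w, d):
    """WNG = |w^H d|^2 / (w^H w)."""
    num = abs(sum(wi.conjugate() * di for wi, di in zip(w, d))) ** 2
    return num / sum(abs(wi) ** 2 for wi in w)
```

Every point on a ring shares the same x coordinate and therefore the same delay, which is the property that lets the signals of one ring be summed before beamforming. With delay-and-sum weights `w = d / V`, the WNG equals the number of virtual microphones V, the maximum attainable value.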

Implementation setup
This section describes the experimental setup of the proposed speech improvement system as detailed in previous sections. A general block diagram of the implementation system is shown in Fig. 4.
First, we choose the uniform spherical microphone array geometry to capture the 3-D audio. We employ Q = 32 physical microphones placed on the vertices of a truncated icosahedron (similar to the microphone arrangement of Eigenmike [32]) on the surface of an open sphere with radius r = 10 cm.
To simulate the microphone signals, the clean speech is filtered through the RIR model of the desired room, as if it were recorded by the 32 microphones. The RIR generator provided by Habets [33] is used to simulate the RIR of a room with dimensions 6 m × 5 m × 4 m [22] for various SNR and RT60 values. The SNR is in the range of 0-30 dB, and the RT60 is in the range of 0.2-1 s.
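The text does not say how the noise is scaled to reach a given SNR. One common way, shown here purely as an assumed illustration with hypothetical function names, is to scale the noise so that the ratio of average powers matches the target.

```python
import math


def mean_power(signal):
    """Average power of a real-valued sequence."""
    return sum(v * v for v in signal) / len(signal)


def scale_noise_to_snr(clean, noise, snr_db):
    """Return the noise scaled so that clean/noise power equals snr_db."""
    gain = math.sqrt(mean_power(clean) / (mean_power(noise) * 10.0 ** (snr_db / 10.0)))
    return [gain * v for v in noise]


def snr_db(clean, noise):
    """SNR in dB between a clean signal and a noise sequence."""
    return 10.0 * math.log10(mean_power(clean) / mean_power(noise))
```

The noisy reverberant microphone signal would then be the sum of the reverberant speech and the scaled noise, sample by sample.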
To reduce the reverberation, the WPE dereverberation algorithm of Sect. 2.2 is employed. The four parameters of Algorithm 1 are set to $D = 3$, $L_c = 15$, $\varepsilon = 10^{-3}$, and $J = 5$, which are the optimum values [23]. Thus, $\hat{S}_d(t, \omega, \mathbf{r}_q)$ is obtained with the WPE algorithm operating at its optimum performance.
In the next step, using (3) and $N = 4$, the 25 spherical harmonic functions $Y_n^m(\theta, \phi)$ are specified. Then the complex value of each $Y_n^m(\theta_q, \phi_q)$ is computed for the $q$th microphone. By employing (6), a set of $p_{nm}(\cdot)$ is calculated, which consists of 25 signals in the spherical harmonics domain.
Depending on the source direction and using the proposed array geometry described in Sect. 3, the locations of the $V$ virtual microphones on the surface of the open sphere, $(r, \theta_v, \phi_v)$, are determined. By selecting $L = 32$, the number of VMs is $V = 392$. Then $X_d(t, \omega, \mathbf{r}_v)$ is synthesized by using (12) and $(r, \theta_v, \phi_v)$ for $v = 1, 2, \ldots, V$. Finally, a beamformer with the proposed array geometry and the regularized superdirective algorithm in (17) with $\epsilon = 0.1$ is applied to the VM signals.
The improved speech signal at the beamformer output is compared with the original clean speech to evaluate the results. In this research, four well-known metrics are employed: (1) the perceptual evaluation of speech quality (PESQ) [34], (2) the frequency-weighted segmental signal-to-noise ratio (FWSegSNR) [35], (3) the cepstral distance (CD) [36], and (4) the speech-to-reverberation modulation energy ratio (SRMR) [37]. It should be emphasized that the SRMR metric becomes less precise at smaller RT60 values [37].

Simulation results
In this section, the performance of the proposed system is evaluated. To this end, the system depicted in Fig. 4 and the setup detailed in Sect. 4 are used. Twenty clean speech utterances from the TIMIT database [38] are corrupted at SNRs of 5, 10, and 20 dB and at RT60 values in the range of 0.2-1 s (540 utterances in total). All sub-blocks are simulated in the MATLAB software package.
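The total of 540 utterances is consistent with 20 utterances, three SNR levels, and nine RT60 values. The RT60 grid itself is not listed in the text; the sketch below assumes a 0.1 s step between 0.2 s and 1 s, which is the grid that reproduces the stated total.

```python
from itertools import product

NUM_UTTERANCES = 20
SNR_LEVELS_DB = [5, 10, 20]
# Assumed grid: 0.2 s to 1.0 s in 0.1 s steps (not stated explicitly in the text).
RT60_VALUES_S = [round(0.2 + 0.1 * i, 1) for i in range(9)]

# One test condition per (utterance index, SNR, RT60) combination.
conditions = list(product(range(NUM_UTTERANCES), SNR_LEVELS_DB, RT60_VALUES_S))
total_utterances = len(conditions)
```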

Array measurement
To evaluate the proposed geometry described in Sect. 3, the proposed microphone array (PMA), the uniform circular microphone array (UCMA), and the uniform spherical microphone array (USMA) geometries are compared in terms of the DF and the WNG, under the same conditions and with the same beamforming method (see Fig. 5). The USMA consists of 32 microphones on the vertices of a truncated icosahedron on the surface of a sphere with a radius of 10 cm. The UCMA includes 32 microphones on a ring with the same radius as the USMA. As detailed in Sect. 3, the PMA geometry consists of L = 32 rings and, based on (21), there are V = 392 virtual microphones on these rings. In this comparison, the sound source is located on the UCMA plane in the far-field. Figure 5a presents the DF values for the UCMA, USMA, and PMA geometries. As depicted, the PMA geometry is superior in all frequency bands, especially at higher frequencies (e.g., by more than 5 dB around 4 kHz). Figure 5b shows the WNG values for the three geometries. As shown, the WNG of the PMA is higher than that of the other two geometries, even at low frequencies. At frequencies below 700 Hz, the WNG of the PMA is, on average, 3 dB higher than that of the UCMA and the USMA geometries. As a result, the performance of the PMA geometry is superior.

Fig. 6 (a) The DF and (b) the WNG of the PMA, USMA and UCMA geometries with Q = L = 32 for two sources located on the X-axis and at θ_s = φ_s = 45°
To evaluate how the three geometries under study respond to changes in the source location, the sound source is rotated by 45 degrees. As shown in Fig. 6a, the DF curve of the PMA does not change as the source location changes. Under the same condition, the DF of the UCMA decreases by an average of 3 dB. The DF of the USMA does not change at frequencies below 1.2 kHz but changes slightly at frequencies above 1.2 kHz. As depicted in Fig. 6b, when the source location changes, the WNG curve of the PMA remains fixed and is always better than that of the other two geometries.
Next, we explore the performance of the PMA geometry in comparison with the USMA and the UCMA geometries for two other setups, with Q = 20 and Q = 12 microphones, when the sound source is located on the UCMA plane in the far-field. In this examination, V = 250 virtual microphones are used for L = 20 rings and V = 152 virtual microphones for L = 12 rings in the PMA geometry. As shown in Fig. 7a, when the number of microphones is reduced, the DF of the PMA changes only slightly, while the DF of the UCMA decreases slightly below 2.5 kHz and decreases further above 2.5 kHz as the frequency increases. The DF of the USMA also decreases, by different amounts at different frequencies. Figure 7b shows that reducing the number of rings from L = 20 to L = 12 lowers the WNG of the PMA by 1 dB on average. The WNG of the UCMA decreases below 2.7 kHz, and the WNG of the USMA decreases except between 0.8 kHz and 1.4 kHz. As can be seen, the DF and the WNG of the PMA are superior.

Speech quality measurement
Following Fig. 4 and the explanations given in Sect. 4, the performance of the PMA is evaluated in the frequency range of 100-4000 Hz in terms of four metrics: PESQ, CD, FWSegSNR, and SRMR. The USMA and the UCMA geometries with Q = 32 microphones are employed for the arrangement of the physical microphones. Also, L = 32 rings are considered in the PMA geometry for distributing V = 392 virtual microphones on the surface of the sphere (see Sect. 3).
In addition, the performance of the PMA geometry in speech improvement is compared with that of the UCMA and the USMA geometries. We compare the proposed system with WPE dereverberation (WPE), the regularized superdirective beamformer (BF), and their combination (WPE+BF), each applied with the UCMA and the USMA geometries. The diffuse noise level and the reverberation time are controlled and confined to 5-20 dB and 200-1000 ms, respectively. Our primary goal is audio capture in highly reverberant environments, so in the test scenarios we consider three diffuse noise levels: very high (SNR = 5 dB), high (SNR = 10 dB), and medium (SNR = 20 dB). As depicted in Fig. 8, the PESQ metric versus RT60 is used to evaluate the proposed system against the other methods and geometries at the three SNR levels. As shown, for the UCMA and the USMA geometries, the WPE method alone yields little PESQ improvement, whereas the effect of the beamformer is quite apparent. The combination of dereverberation and beamforming is almost always better than either method alone, although its results are close to the beamforming results. The USMA geometry improves the speech quality further than the UCMA geometry, but its effectiveness is limited. The advantage of the proposed system is evident at all three noise levels in Fig. 8, owing to the proposed array geometry and the large number of virtual microphones relative to the number of physical microphones. In Table 1, the average PESQ over RT60 values between 200 and 1000 ms is given for each method. As can be seen, when the diffuse noise power increases, the proposed system still performs better than the other methods in speech improvement, and its superiority in terms of the PESQ metric is quite evident.
Since the PESQ criterion to some extent reflects the opinion of a human listener, the improvement indicated by this chart is also clearly audible when listening to the output of the proposed system. Figure 9 illustrates the cepstral distance (CD) metric of the methods under assessment for RT60 values between 200 and 1000 ms. The WPE performance depends on the reverberation time, and its best performance occurs around 600 ms at the various noise levels. At the same time, the performance of the beamformer is almost the same at different noise levels.
Experiments revealed that the combination of dereverberation and beamforming with the spherical geometry effectively reduces the cepstral distance. Nevertheless, in all situations, the proposed system improves the CD metric more than the other methods. As shown in Fig. 9 and Table 2, the proposed system, owing to the use of multiple virtual microphones, suppresses the noise and reverberation in the recorded speech more effectively than the other methods at all SNR and RT60 values. Figure 10 compares the proposed system with the other methods in terms of the FWSegSNR metric. A careful examination of the WPE performance in the three charts of Fig. 10 shows that the WPE performs almost independently of the SNR of the recorded signal. Also, the WPE improves the FWSegSNR most at a moderate reverberation level (RT60 of about 500 ms). As can be seen in Table 3, as the SNR of the recorded speech signal decreases, the FWSegSNR gain due to the beamformer increases. By utilizing 392 VMs, the proposed system improves the FWSegSNR by at least one decibel more than the other methods. Figure 11 contains three charts that show the SRMR versus RT60 values from 200 to 1000 ms at the three SNR levels. Comparing the different methods shows that the WPE significantly improves the SRMR. In contrast, the beamformer improves the SRMR only slightly at several SNR values.

Table 5 Comparison of the proposed geometry with a random geometry and with a 5% microphone placement error in terms of the average PESQ over the RT60 interval between 200 and 1000 ms
The mean SRMR over the RT60 range between 200 and 1000 ms at the three SNR levels is presented in Table 4. In contrast to the beamformer, the WPE increases the SRMR metric more successfully in all methods and at all SNR levels. The proposed system performs better than the other methods because it utilizes the WPE to synthesize the VM signals and uses many VMs in the PMA geometry.
Finally, the destructive effects of a microphone placement error in the spherical microphone array and of using a random geometry instead of the proposed geometry are presented in Table 5 in terms of the PESQ. As can be seen, a 5% microphone placement error has less than a 2% effect on the PESQ, while using a random geometry reduces the PESQ by about 21% on average (over the three SNRs).

Conclusion
A novel method to synthesize the virtual microphone signal in the SH domain has been presented. Also, a new microphone array geometry for arranging a large number of virtual microphones has been proposed. Because the locations of the virtual microphones depend on the source position, the proposed microphone array always keeps a constant orientation relative to the source location. As a result, with this technique, the direction of the array beam-pattern can be adjusted to the sound source without the need for adaptive beamformers. Test results on 540 corrupted utterances have shown that the suggested system significantly improves noisy reverberant speech because of its ability to increase the number of virtual microphones and its use of the proposed geometry.