Advances in Multimicrophone Speech Processing

1 School of Engineering, Bar-Ilan University, Ramat-Gan 52900, Israel
2 INRS-EMT, University of Quebec, 800 de la Gauchetière Ouest, Montreal, QC, Canada H5A 1K6
3 Institute of Audiology and Hearing Science, University of Applied Sciences Oldenburg/Ostfriesland/Wilhelmshaven, Ofener Street 16, 26121 Oldenburg, Germany
4 Department of Electrical Engineering, Technion-Israel Institute of Technology, Technion City, Haifa 32000, Israel
5 Department of Electrical Engineering (ESAT-SCD), Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium
6 Institute of Communication Acoustics, Ruhr-Universität Bochum, 44780 Bochum, Germany
7 Western Australian Telecommunications Research Institute, The University of Western Australia, 35 Stirling Hwy, Crawley 6009, Australia

Speech quality may significantly deteriorate in the presence of interference, especially when the speech signal is also subject to reverberation. Consequently, modern communication systems, such as cellular phones, employ some speech enhancement procedure at the preprocessing stage, prior to further processing (e.g., speech coding).
Generally, the performance of single-microphone techniques is limited, since these techniques can utilize only spectral information. Especially for the dereverberation problem, no adequate single-microphone enhancement techniques are presently available. Hence, in many applications, such as hands-free mobile telephony, voice-controlled systems, teleconferencing, and hearing instruments, a growing tendency exists to move from single-microphone systems to multimicrophone systems. Although multimicrophone systems come at an increased cost, they exhibit the advantage of incorporating both spatial and spectral information.
The use of multimicrophone systems raises many practical considerations, such as tracking the desired speech source and robustness to unknown microphone positions. Furthermore, due to the increased computational load, real-time implementations are more difficult to obtain, and hence algorithmic efficiency becomes a major issue.
The main focus of this special issue is on emerging methods for speech processing using multimicrophone arrays. In the following, the specific contributions are summarized and grouped according to their topic. It is interesting to note that none of the papers deal with the important and difficult problem of dereverberation.

Speaker separation
In the paper "Speaker separation and tracking system," Anliker et al. propose a two-stage integrated speaker separation and tracking system. This is an important problem with several potential applications. The authors also propose quantitative criteria to measure the performance of such a system and present an experimental evaluation of their method.

In the paper "Speech source separation in convolutive environments using space-time-frequency analysis," Dubnov et al. present a new method for blind separation of convolutive mixtures based on the assumption that the signals in the time-frequency (TF) domain are partially disjoint. The method involves detection of single-source TF cells using eigenvalue decomposition of the TF-cell correlation matrices, clustering of the detected cells with an expectation-maximization (EM) algorithm based on a Gaussian mixture model (GMM), and estimation of smoothed transfer functions between microphones and sources via extended Kalman filtering (EKF).

Serviere and Pham propose in their paper "Permutation correction in the frequency domain in blind separation of speech mixtures" a two-step method for blind separation of convolutive mixtures of speech signals, based on the joint diagonalization of the time-varying spectral matrices of the observation records. First, the frequency continuity of the unmixing filters is used in the initialization of the diagonalization algorithm. Then, the continuity of the time variation of the source energy is exploited on a sliding frequency bandwidth to detect the remaining frequency permutation jumps.

In their paper "Geometrical interpretation of the PCA subspace approach for overdetermined blind source separation," Winter et al. discuss approaches for blind source separation in the overdetermined case, where the number of microphones exceeds the number of sources. Two methods are compared: the first is based on principal component analysis (PCA), and the second on geometric considerations.
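The single-source TF-cell detection idea underlying such TF-disjointness methods can be illustrated with a minimal sketch: in each local TF neighborhood, the spatial correlation matrix across microphones is close to rank one when only one source is active, so the ratio of the largest eigenvalue to the trace serves as a detection statistic. The function name, neighborhood size, and threshold below are illustrative choices, not the specific algorithm of any of the papers.

```python
import numpy as np

def single_source_cells(stft_mics, ratio_thresh=0.9, hood=2):
    """Flag TF cells whose local spatial correlation matrix is near rank-1.

    stft_mics: complex array of shape (M, F, T) holding the STFTs of the
    M microphone signals. Returns a boolean (F, T) mask; border cells,
    where no full neighborhood exists, stay False.
    """
    M, F, T = stft_mics.shape
    mask = np.zeros((F, T), dtype=bool)
    for f in range(hood, F - hood):
        for t in range(hood, T - hood):
            # Stack the (2*hood+1)^2 neighboring TF cells as snapshots.
            X = stft_mics[:, f - hood:f + hood + 1,
                             t - hood:t + hood + 1].reshape(M, -1)
            R = X @ X.conj().T / X.shape[1]   # local spatial correlation matrix
            w = np.linalg.eigvalsh(R)          # real eigenvalues, ascending
            # Dominant-eigenvalue fraction ~1 indicates a single active source.
            mask[f, t] = w[-1] / (w.sum() + 1e-12) > ratio_thresh
    return mask
```

In a full separation system the detected cells would then be clustered (e.g., by mixing-vector direction) and used to estimate the transfer functions between sources and microphones.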

Echo cancellation
In their paper "Efficient fast stereo acoustic echo cancellation based on pairwise optimal weight realization technique," Yukawa et al. propose a class of efficient fast acoustic echo cancellation algorithms with linear computational complexity. These algorithms are based on the pairwise optimal weight realization technique. Numerical examples demonstrate that the proposed schemes significantly improve convergence behavior compared with conventional methods, in terms of both system mismatch and echo return loss enhancement (ERLE).
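For readers less familiar with the setting, the baseline against which such schemes are measured is the classical adaptive echo canceller. The sketch below implements standard NLMS (not the paper's algorithm) for a single channel, and shows how ERLE, the metric mentioned above, is computed; filter length and step size are illustrative.

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, taps=128, mu=0.5, eps=1e-8):
    """Single-channel NLMS adaptive filter, the textbook echo-cancellation
    baseline. Estimates the echo path from the far-end signal and subtracts
    the predicted echo from the microphone signal."""
    w = np.zeros(taps)            # adaptive filter weights
    x_buf = np.zeros(taps)        # most recent far-end samples, newest first
    err = np.zeros_like(mic)
    for n in range(len(mic)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = far_end[n]
        y = w @ x_buf             # echo estimate
        err[n] = mic[n] - y       # residual returned to the far end
        # Normalized LMS update: step scaled by input power.
        w += mu * err[n] * x_buf / (x_buf @ x_buf + eps)
    return err, w

def erle_db(mic, err):
    """Echo return loss enhancement over the given segment, in dB."""
    return 10 * np.log10(np.mean(mic**2) / (np.mean(err**2) + 1e-12))
```

Stereo echo cancellation is considerably harder than this mono case because the two far-end channels are strongly correlated, which is precisely the difficulty the fast algorithms above address.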

Acoustic source localization
Time-delay estimation is a first stage that feeds into subsequent processing blocks for identifying, localizing, and tracking radiating sources. The paper "Time-delay estimation in room acoustic environments: an overview" by Chen et al. presents a systematic overview of the state of the art of time-delay-estimation algorithms, ranging from the simple cross-correlation method to advanced blind channel identification based techniques.

In their work "Kalman filters for time-delay of arrival-based source localization," Klee et al. propose an algorithm for acoustic source localization based on time-delay-of-arrival (TDOA) estimation. In their approach, they use a Kalman filter to directly update the speaker position estimate based on the observed TDOAs.

In their contribution "Microphone array speaker localizers using spatial-temporal information," Gannot and Dvorkind propose to exploit the speaker's smooth trajectory for improving the position estimate. Based on TDOA readings, three localization schemes that use the temporal information are presented. The first is a recursive form of the Gauss method. The other two are extensions of the Kalman filter to the nonlinear problem at hand, namely, the extended Kalman filter and the unscented Kalman filter.

In their paper "Particle filter design using importance sampling for acoustic source localization and tracking in reverberant environments," Lehmann and Williamson develop a new particle filter for acoustic source localization using importance sampling, and compare its tracking ability with that of a bootstrap algorithm proposed previously in the literature. A real-time implementation also shows that the proposed particle filter can reliably track a person talking in real reverberant rooms.
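As a concrete reference point for the TDOA front end these localizers share, the following sketch implements the classical generalized cross-correlation with phase transform (GCC-PHAT), the most widely used of the cross-correlation estimators surveyed above; the function signature is an illustrative choice.

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """Estimate the delay of x2 relative to x1 (seconds) via GCC-PHAT.

    PHAT weighting whitens the cross-spectrum, keeping only phase, which
    sharpens the correlation peak and adds robustness to reverberation."""
    n = len(x1) + len(x2)                 # zero-pad to avoid circular wrap
    X1 = np.fft.rfft(x1, n)
    X2 = np.fft.rfft(x2, n)
    cross = np.conj(X1) * X2
    cross /= np.abs(cross) + 1e-12        # PHAT: discard magnitude
    cc = np.fft.irfft(cross, n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    # Reorder so lags run from -max_shift to +max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs
```

A set of such pairwise delay estimates is exactly the TDOA "reading" that the Kalman, extended/unscented Kalman, and particle filters above then fuse over time into a position track.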

Speech enhancement and speech detection
The paper "Dual channel speech enhancement by superdirective beamforming" by Lotter and Vary presents a dual-channel input-output speech enhancement system. The proposed algorithm is an adaptation of the well-known superdirective beamformer, including postfiltering, to the binaural application. In contrast to conventional beamformer processing, the proposed system outputs enhanced stereo signals while preserving the important interaural amplitude and phase differences of the original signal.

In their paper "Sector-based detection for hands-free speech enhancement in cars," Lathoud et al. investigate adaptation control of beamforming interference cancellation techniques for in-car speech acquisition. Two efficient adaptation control methods that avoid target cancellation are proposed. Experiments on real in-car data validate both methods, including a case with 100 km/h background road noise.

In their paper "Using intermicrophone correlation to detect speech in spatially separated noise," Koul and Greenberg provide a theoretical analysis of a system for determining intervals of high and low signal-to-noise ratio when the desired signal and interfering noise arise from distinct spatial regions. The system uses the correlation coefficient between two microphone signals configured in a broadside array as the decision variable in a hypothesis test, and can, for example, be used as an adaptation control method for an adaptive beamformer.
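The correlation-based decision variable in the last contribution admits a very compact illustration: for a broadside target, the two microphone signals are nearly identical when the target is active, while spatially separated noise decorrelates them, so a frame-wise normalized correlation thresholded per frame acts as a speech detector. The sketch below is a minimal illustration of that statistic, with hypothetical frame length and threshold, not the authors' analyzed system.

```python
import numpy as np

def correlation_detector(m1, m2, frame=256, hop=128, thresh=0.5):
    """Frame-wise hypothesis test on the intermicrophone correlation
    coefficient: True where the two signals are coherent enough to
    suggest the broadside target is active."""
    decisions = []
    for start in range(0, len(m1) - frame + 1, hop):
        a = m1[start:start + frame]
        b = m2[start:start + frame]
        # Normalized correlation coefficient of the current frame.
        rho = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        decisions.append(rho > thresh)
    return np.array(decisions)
```

The frame decisions could then gate the adaptation of a beamformer, freezing its noise-canceling path whenever the target is detected, which is the adaptation-control use suggested above.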