An Overcomplete Signal Basis Approach to Nonlinear Time-Tone Analysis with Application to Audio and Speech Processing

Although a beating tone and the two pure tones which give rise to it are linearly dependent, the ear considers them to be independent as tone sensations. A linear time-frequency representation of acoustic data is unable to model these phenomena. A time-tone sensation approach is proposed for inclusion within audio analysis systems. The proposed approach extends linear time-frequency analysis of acoustic data by accommodating the nonlinear phenomenon of beats. The method replaces the one-dimensional tonotopic axis of linear time-frequency analysis with a two-dimensional tonotopic plane, in which one direction corresponds to tone, and the other to its frequency of modulation. Some applications to audio prostheses are discussed. The proposed method relies on an intuitive criterion of optimal representation which can be applied to any overcomplete signal basis, allowing for many signal processing applications.


INTRODUCTION
Speech recognition is a hierarchical process consisting of four main phases: audio analysis, speech feature extraction, pattern classification, and language processing [1,2]. The audio analysis phase relies on a mathematical model of the human cochlea as a frequency analyzer. To a first approximation, the cochlea is an array of overlapping linear bandpass filters. This model dates back to Von Helmholtz in 1863 [3] and still plays an important role in audio signal processing systems. However, there is clear evidence that the cochlea is inherently nonlinear, and that this nonlinearity is not merely a result of overloading at high signal levels. As a result, the overlapping linear filter model fails to account for essentially nonlinear phenomena of human audition, such as masking, beats, and the sensation of Tartini or combination tones [4,5].
Masking is an effect whereby the threshold of hearing of a test tone is increased in the presence of a masking tone. This threshold depends on the frequency separation of the test and masking tones, with tones outside the critical band having little influence on the threshold. A test tone close to the masking tone interferes with it in the form of beats [5]. An example of the sensation of combination or Tartini tones occurs when a listener, on being presented with two pure tones, hears a third tone which is not actually present. A linear frequency-analysis model of audition cannot account for these phenomena.
The basilar membrane within the cochlea detects the component frequencies, or tones, of incoming sound. Due to the flexible nature of the membrane, incoming vibrations set up a travelling wave along the membrane, giving it a different maximum displacement for different frequencies. The bulging of the membrane creates a shearing motion for the hair cells of the organ of Corti, generating electrical potentials that stimulate the auditory nerve. One end of the basilar membrane vibrates most at low-frequency tones, and the other end vibrates most at high-frequency tones. These mechanical properties give the basilar membrane a tonotopic organization, or organization by tone: tones are arranged from low frequency at one end to high frequency at the other. Tonotopy, from tono and the Greek word topos meaning place, is a fundamental organizing principle of the auditory system. Tonotopy is apparent as a linear arrangement of neurons along a tonotopic axis according to "best frequency," the acoustic frequency to which a neuron is most sensitive. Knowing which group of neurons is active in the basilar membrane, and where those neurons reside in the membrane, one can estimate which tone was being heard.
It has been reported that the basilar membrane within the mammalian cochlea has a response function that is highly nonlinear and compressive [6][7][8][9][10]. Recent results of psychophysical masking experiments suggest that the basilar membrane is the primary source of the nonlinearities [11][12][13]. Furthermore, cochlear nonlinearities have a significant influence on a wide range of basic auditory processes, such as frequency selectivity [14,15], temporal integration [16], and loudness growth [17][18][19]. Rather, it is held that they arise as artefacts of the frequency analysis process itself. Therefore, future models of human audition, necessary for incorporation into audio analysis systems, will need to incorporate a simulation of the characteristics of the basilar membrane if they are to provide an accurate account of our audio perceptions.
As described, the inherent nonlinearity produces an apparent enhancement of sound, which may be a biological method for noise suppression. Therefore, it is important that signal processing adopts similar approaches to provide similar enhancement. Nonlinear models of the early stages of human audition draw heavily on the understanding of psychoacoustics and the neurophysiology of audition. These include the nonlinear mechanics of the cochlea [20], use of wave digital filtering for the analysis of Tartini tones [21], adaptive Q circuits [22], and formant tracking based on temporal analysis of a nonlinear cochlear model [23].
However, these models do not complement the nonlinear phenomena mentioned above. The generation of a model of human audition which builds on the tonotopic nature of the auditory system and which can accommodate nonlinear phenomena such as the sensation of beats is worthy of investigation. It is hypothesized that such models may be useful for efficient audio analysis in the fields of psychoacoustics, speech analysis, and audio prostheses.

AIM
The aim of this short paper is to propose a simple and general approach to the construction of a nonlinear filter model of human audition based on tonotopy, one which would be suitable for incorporation within audio analysis systems. The proposed analysis method has a psychoacoustic foundation and is based on redundant time-tone sensation analysis which extends linear time-frequency analysis by accommodating the phenomena of beat sensation.

METHOD OF ANALYSIS
The method proposed is based on signal analysis using overcomplete bases. It employs the principle that elements of the basis which are linearly dependent as vectors may still correspond to independent tone sensations. In the case of beat sensation, if two pure tones are represented by real arrays of sampled data $e_1$ and $e_2$, the corresponding modulated tone is a third vector $e_3$. The basis set comprising $e_1$, $e_2$, and $e_3$ is redundant since $e_3 = e_1 + e_2$. Nevertheless, $e_3$ is not excluded from the signal basis; it is included because it corresponds to an independent tone sensation.
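The redundancy of a beating tone can be checked numerically. The following is a minimal sketch, assuming illustrative values for the sampling rate and the two tone frequencies (none of which are given in the text). It verifies the standard identity that the sum of two cosines equals a carrier at the mean frequency multiplied by a slow envelope, and that the three sampled vectors span only a two-dimensional subspace.

```python
import numpy as np

# Illustrative values only; the text does not specify them.
fs = 8000.0                      # sampling rate in Hz
t = np.arange(800) / fs          # 0.1 s of samples
f1, f2 = 440.0, 450.0            # two nearby pure tones in Hz

e1 = np.cos(2 * np.pi * f1 * t)  # first pure tone
e2 = np.cos(2 * np.pi * f2 * t)  # second pure tone
e3 = e1 + e2                     # the beating (modulated) tone

# Product form: carrier at the mean frequency, envelope at half the difference.
carrier = np.cos(2 * np.pi * 0.5 * (f1 + f2) * t)
envelope = 2.0 * np.cos(2 * np.pi * 0.5 * (f1 - f2) * t)
beat_matches_product = np.allclose(e3, envelope * carrier)

# e3 adds no new direction: the three vectors have rank 2.
rank = np.linalg.matrix_rank(np.column_stack([e1, e2, e3]))
```

Although `e3` is linearly dependent on `e1` and `e2`, it is heard as a single tone of intermediate pitch with fluctuating loudness, which is the motivation for keeping it in the basis.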
Consider the analysis of a finite array of real sampled data $s \in \mathbb{R}^n$, using the set of basis vectors $E_N = \{e_1, \ldots, e_N\}$, where $e_\alpha \in \mathbb{R}^n$. It is assumed that each $e_\alpha$ has unit length, that no two are the same, and that the set of basis vectors $E_N$ spans $\mathbb{R}^n$. Consider signal decompositions of the form

$$ s = \sum_{\alpha=1}^{N} C_\alpha e_\alpha, \tag{1} $$

with the condition $N = n$ not required. Indeed, the case $N > n$ is particularly interesting, since it can be postulated that although the $e_\alpha$ in $E_N$ may be linearly dependent as physical vibrations, they are nevertheless independent as tone sensations. When $N > n$, (1) does not have a unique solution for the coefficient vector $C = (C_1, \ldots, C_N)$, and one must be chosen from the set of all possibilities. The optimum choice is denoted here by $C(s)$ and its components by $C_\alpha(s)$. The only requirement on this choice is that it satisfies an intuitive criterion of optimality: signals which correspond to independent tone sensations are transformed into linearly independent signal descriptors. If the array of sampled data $s$ is of the form $s = \kappa e_\alpha$ for some $\alpha$ and some $\kappa$ satisfying $0 \le \kappa \in \mathbb{R}$, then the optimum solution to (1) should satisfy

$$ C_\beta(\kappa e_\alpha) = \kappa\,\delta_{\alpha\beta}, \tag{2} $$

where $\beta \in \{1, \ldots, N\}$ and $\delta$ is the Kronecker delta, with $\delta_{\alpha\beta} = 1$ if $\alpha = \beta$ and $0$ otherwise.
If $N > n$, a linear map $s \to C(s)$ cannot satisfy this criterion; however, infinitely many nonlinear maps do so. The minimization of an objective function $\Phi(C, s)$ is chosen for obtaining this transformation: As a function of the components $C_\alpha$, the objective function $\Phi(C, s)$ is a positive definite quadratic of rank $N$ or $N - 1$. To determine $C(s)$, $\Phi(C, s)$ must be minimized, subject to the constraints of (1). Since the basis set $E_N$ is overcomplete, the solution exists for all $s \in \mathbb{R}^n$ and is unique. Calculating the solution is straightforward and involves linear matrix operations. Furthermore, it satisfies the intuitive criterion of optimality presented in (2).
When $s = \kappa e_\alpha$, for some strictly positive $\kappa \in \mathbb{R}$ and some $\alpha$, the solution can be written as follows: where $E$ is the $n \times N$ matrix whose columns consist of the $e_\alpha \in E_N$, $\Pi^{-1}(s)$ is the $N \times N$ diagonal matrix with $\pi_\alpha(s)^{-1}$ along the diagonal, and $\tilde{E}(s)$ is the matrix whose components are $\tilde{E}_{i\alpha} = (\pi_\alpha(s))^{-1/2} E_{i\alpha}$. It can be shown that when $N = n$, $C(s) = E^{-1} s$ and is linear.
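As a hedged illustration of the constrained minimization described above, the sketch below assumes that the objective is the weighted quadratic $\Phi(C, s) = \sum_\alpha \pi_\alpha(s)\, C_\alpha^2$ and assumes the particular weight $\pi_\alpha(s) = 1 - (e_\alpha \cdot s)^2 / \|s\|^2$. Neither choice is confirmed by the text. They are used here only because they are consistent with a diagonal weight matrix and with a column-scaled basis matrix, and because they reproduce criterion (2) in the limit. The function name `optimal_coefficients` and the small regulariser `eps` (which keeps the weights strictly positive) are hypothetical.

```python
import numpy as np

def optimal_coefficients(E, s, eps=1e-10):
    """Minimise sum_a pi_a(s) * C_a**2 subject to E @ C = s.

    ASSUMPTION (not from the text): pi_a(s) = 1 - cos^2(angle(e_a, s)) + eps,
    so a basis vector parallel to s is almost free to absorb the signal.
    Columns of E are assumed to have unit length and to span R^n.
    """
    E = np.asarray(E, dtype=float)
    s = np.asarray(s, dtype=float)
    norm_s = np.linalg.norm(s)
    if norm_s == 0.0:
        return np.zeros(E.shape[1])
    cos_sq = (E.T @ s / norm_s) ** 2               # squared cosine per column
    pi = np.clip(1.0 - cos_sq, 0.0, None) + eps    # assumed weights pi_a(s)
    E_tilde = E / np.sqrt(pi)                      # E_tilde[i, a] = pi_a**-0.5 * E[i, a]
    # Substituting y = sqrt(pi) * C turns the problem into the ordinary
    # minimum-norm solution of E_tilde @ y = s, which lstsq returns.
    y, *_ = np.linalg.lstsq(E_tilde, s, rcond=None)
    return y / np.sqrt(pi)

# Toy overcomplete basis in R^2: two orthogonal vectors and their normalised sum.
E = np.array([[1.0, 0.0, 1.0 / np.sqrt(2.0)],
              [0.0, 1.0, 1.0 / np.sqrt(2.0)]])
C_beat = optimal_coefficients(E, 2.0 * E[:, 2])            # signal is 2 * e_3
C_generic = optimal_coefficients(E, np.array([0.3, -1.2]))
```

Under these assumptions, `C_beat` should be close to (0, 0, 2), so the signal is attributed to the third basis vector alone rather than shared between the first two, while `C_generic` still reconstructs its input exactly through `E @ C_generic`. When the basis is square and invertible, the weights cancel and the result reduces to the linear solution $E^{-1} s$.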
The choice of the basis functions should be physically well motivated. Basis signals are generated from windowed versions of pure tones as well as modulated tones. Only modulated tones appropriate to human audition (25 Hz-5000 Hz) need to be included. This basis will be redundant and the signal analysis consists of the mapping C : s → C(s) as defined in (4).
Implementation of the scheme would involve analysis of short epochs of audio, the length of which will be application dependent. For pure tones, the basis functions consist of multiple tones on a logarithmic scale of frequency from 20 Hz to 20 000 Hz. Depending on the number of tones included, the approach will ascertain whether the nonlinear phenomenon of beats or Tartini-tone distortion is occurring, thus allowing the "imagined" third frequency to be removed from further processing. Because voiced speech sounds are harmonically rich, their harmonic structure would also need to be incorporated within the bases used for analysis.
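The construction of such a basis can be sketched as follows. This is a minimal illustration only: the epoch length, sampling rate, number of tones, Hann window, cosine phase, and the three modulation frequencies are all assumptions, as is the amplitude-modulated form used for the modulated tones. The function name `build_basis` is hypothetical. Harmonic structures for voiced speech are not included.

```python
import numpy as np

def build_basis(n=512, fs=44100.0, n_tones=160,
                mod_freqs=(25.0, 100.0, 400.0)):
    """Return (E, labels): unit-norm windowed tones and modulated tones.

    All default values are illustrative assumptions. Columns are ordered
    tone by tone; within each tone the unmodulated version comes first
    (labelled with modulation frequency 0), then one column per entry
    of mod_freqs.
    """
    t = np.arange(n) / fs
    window = np.hanning(n)
    # Pure-tone frequencies on a logarithmic scale from 20 Hz to 20 000 Hz.
    tone_freqs = np.geomspace(20.0, 20000.0, n_tones)
    columns, labels = [], []
    for f in tone_freqs:
        carrier = window * np.cos(2 * np.pi * f * t)
        columns.append(carrier)
        labels.append((f, 0.0))
        for fm in mod_freqs:
            # Assumed form of a modulated tone: amplitude modulation at fm.
            columns.append((1.0 + np.cos(2 * np.pi * fm * t)) * carrier)
            labels.append((f, fm))
    E = np.column_stack(columns)
    E /= np.linalg.norm(E, axis=0)   # each basis vector gets unit length
    return E, np.array(labels)

E, labels = build_basis()
```

With these defaults the basis has more columns than samples per epoch, so it can be overcomplete; whether it actually spans the signal space depends on the chosen frequencies and should be checked (for example with `np.linalg.matrix_rank`) before use.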

DISCUSSION
A tonotopic representation of acoustic data is proposed. Tonotopy is present in both the basilar membrane and auditory cortex and refers to the spatial arrangement of where sound is perceived. Tonotopy arises from mechanical properties of the basilar membrane. In the cerebral cortex, however, tonotopy is seen as a progressive change in neuronal best frequency with position along the cortical surface, that is, a tonotopic map. Multiple tonotopic maps have been observed in the auditory cortex of many mammalian species. As many as seven tonotopic mappings have been observed in the cat cortex [24,25]. Four or more tonotopic mappings have been observed in the monkey brain [26].
However, the tonotopic organization of human auditory cortex, located on the superior surface of the temporal lobe, remains poorly understood. A degree of tonotopic organization and the presence of some major organizational elements seen in animals have been demonstrated by human studies [27]. These studies, based on magnetoencephalography [28,29], positron emission tomography (PET) [30,31], and functional magnetic resonance imaging (fMRI) [32][33][34], have localized cortical responses to multiple discrete narrowband stimuli having different centre frequencies.
Following neurophysiological evidence for tonotopic organization, the descriptor coefficients in the proposed analysis can be arranged in a two-dimensional tonotopic plane. In this way, the one-dimensional tonotopic axis of linear time-frequency analysis is replaced with a two-dimensional tonotopic plane, in which one direction corresponds to tone, and the other to its frequency of modulation.
The representation of acoustic data in terms of pure tones and modulated tones is highly redundant. From a signal processing viewpoint, redundant representation of signal data leads to robustness, with little influence from the noise level [35]. The proposed method can be employed to extend linear time-frequency analysis of audio to a nonlinear time-tone sensation analysis consistent with the phenomenon of beat sensation in human audition. This method maps a segment of audio into a three-dimensional volume instead of a two-dimensional plane. Each time slice of audio is mapped onto a two-dimensional tonotopic plane. The representation of signal data using overcomplete bases is motivated by the observation that linearly dependent signals may still be independent as tone sensations.
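The arrangement of descriptor coefficients into a plane per time slice, and into a volume per audio segment, can be sketched as a simple reshaping step. The sketch assumes that the coefficients of each epoch are ordered tone by tone, with a fixed number of modulation slots per tone; this ordering and the function name `tonotopic_volume` are hypothetical conventions, not part of the proposed method.

```python
import numpy as np

def tonotopic_volume(coeffs_per_epoch, n_tones, n_mods):
    """Stack per-epoch coefficient vectors into (time, tone, modulation).

    ASSUMPTION: each row holds n_tones * n_mods coefficients ordered tone
    by tone, with the modulation index varying fastest.
    """
    C = np.asarray(coeffs_per_epoch, dtype=float)
    if C.ndim != 2 or C.shape[1] != n_tones * n_mods:
        raise ValueError("expected shape (n_epochs, n_tones * n_mods)")
    return C.reshape(C.shape[0], n_tones, n_mods)

# Toy input: 5 epochs, 4 tones, 3 modulation slots per tone.
coeffs = np.arange(5 * 4 * 3, dtype=float).reshape(5, 12)
volume = tonotopic_volume(coeffs, n_tones=4, n_mods=3)
plane_at_epoch_2 = volume[2]     # one two-dimensional tonotopic plane
```

Each `volume[k]` is the tonotopic plane for epoch `k`, with rows indexed by tone and columns by modulation frequency, so a conventional spectrogram corresponds to keeping only a single column of every plane.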
A full experimental validation of the approach may be carried out in an animal study. The mechanical responses would be recorded for two pure-tone stimuli with a frequency difference small enough to cause the resonance regions of the basilar membrane to overlap. In this way, one tone of intermediate pitch would be heard with modulated or beating loudness. The mechanical responses would be recorded from the basilar membrane in the basal turn of the cochlea using displacement sensing equipment such as a displacement-sensitive laser interferometer. The harmonic content of the responses would then be evaluated using Fourier analysis. The pure tones and their modulations would form the bases for analysis, and the criterion would be tuned from observation of deformation of the basilar membrane, generating the two-dimensional tonotopic plane.
A practical application of this proposed analysis may be in the design of cochlear amplifiers for use in cochlear implants. Many patients suffering from sensorineural hearing loss have been implanted with a cochlear prosthesis, consisting of a microphone, a processor to convert acoustical information for digital processing, and an electrode array that provides electrical stimuli to the auditory nerve. The electrode is inserted along the tonotopic axis of the cochlea. Using various temporal patterns of stimulation along this spatial map, many patients have achieved successful speech comprehension. However, nonlinearities associated with the operation of the cochlear amplifier (e.g., level-dependent gain) have several consequences. Most notably, the nonlinearities will distort the incoming sounds to some extent. One well-known consequence of this is the existence of Tartini distortion products in the cochlea [5]. Another possible consequence is harmonic distortion. This type of distortion has been observed in the receptor potentials of the inner and outer hair cells of the cochlea [36]. Harmonic distortion can also be perceived as "overtones" when pure tones are presented in psychophysical experiments [37]. The proposed analysis may be able to correct or compensate for these electronic distortions without introducing additional distortion to the signal processing cascade.

EURASIP Journal on Applied Signal Processing
Another potential application of the analysis in the area of audio prosthesis would be for incorporation within a new category of implants using the central nucleus of the inferior colliculus (ICC) [38]. The ICC has been recently shown to have a well-defined tonotopic structure [39]. Most of the auditory information that is transmitted to the auditory cortex via the thalamus is processed in the ICC. Unlike cochlear prostheses, the ICC implant will be placed in direct contact with neurons, which narrows the fields of stimulation. The feasibility of characterizing the effects of ICC stimulation on auditory perception and the possible development of an ICC-based auditory implant have been reported [38]. The use of a tonotopic plane proposed here may aid in the design of efficient stimuli for enhancement of sound for the auditory nerve.
The method presented provides a general solution and is applicable to any field where linearly dependent elements of the signal basis can be construed as independent artefacts of the signal. The approach presented here complements other analysis methods using overcomplete bases, in particular the method of matching pursuit [40], which also satisfies the constraint on optimality given by (2).

CONCLUSION
The representation of acoustic data in terms of pure tones and modulated tones is highly redundant and is of benefit in acoustic analysis. A general method is proposed which resolves the problem of nonuniqueness of such representations. This is achieved by adopting a solution which fulfils an intuitive criterion of optimality. The method, being general, is applicable to any field where linearly dependent elements of the signal basis can be construed as independent artefacts of the signal.