EURASIP Journal on Applied Signal Processing 2005:12, 1794–1806 © 2005 P. Hanna and M. Desainte-Catherine
A Statistical and Spectral Model for Representing Noisy Sounds with Short-Time Sinusoids

We propose an original model for noise analysis, transformation, and synthesis: the CNSS model. Noisy sounds are represented with short-time sinusoids whose frequencies and phases are random variables. This spectral and statistical model represents information about the spectral density of frequencies. This perceptually relevant property is modeled by three mathematical parameters that define the distribution of the frequencies. This model also represents the spectral envelope. The mathematical parameters are defined and the analysis algorithms to extract these parameters from sounds are introduced. Then algorithms for generating sounds from the parameters of the model are presented. Applications of this model include tools for composers, psychoacoustic experiments, and pedagogy.


INTRODUCTION
Computers offer new possibilities for sound processing. Applications are numerous in the musical field. Digital sound models are developed to represent signals with mathematical parameters in order to allow composers to transform the original sound in a musical way.
Noises are used more and more frequently in contemporary music, especially in electroacoustic music. A new vocabulary describing noisy sound properties has been proposed during the twentieth century [1]. We consider as noisy sounds the natural sounds such as rubbing or scratching, but also some parts of instrumental sounds like the breath of a saxophone, and speech phonemes such as consonants or whispered voices.
Existing models only represent sounds composed of low noise levels. They consider natural sounds as mixes of sinusoids (deterministic part) and noise (stochastic part). They first extract sinusoids and model the residual using a low noise model such as LPC [2] or piecewise-linear spectral envelopes [3,4]. In such approaches, the noisy part is assumed to be the parts of the signal that cannot be represented with sinusoids whose amplitude and frequency vary slowly with time, and it is implicitly defined as whatever is left after the sinusoidal analysis/synthesis. These approximations lead to audible artifacts and explain why such models are of limited use for the analysis and the synthesis of purely noisy signals. Our research concerns improvements to the modeling of this noisy part. The goals are to extract the structure of the pseudoperiodic components and to propose a reasonable approximation of the residual relying on psychoacoustics. We focus on robust stand-alone noise modeling.
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
In this article, we present an original noise model to analyze, transform, and synthesize such noisy signals in real time. This spectral model represents noisy signals with short-time sinusoids whose frequency values are randomly chosen according to statistical parameters. Modifying these mathematical parameters extracted from sounds leads to original transformations which cannot be performed using the previously described models. Some of these transformations are related to the modification of the distribution of the frequency values of the sinusoids in the spectra. Psychoacoustic experiments show that this distribution is perceptually relevant and mainly depends on the number of sinusoidal components. For this reason, we focus on the spectral density and the mathematical parameters related to it.
After presenting existing models and their limitations in Section 2, we present the theory behind the representation of noise with short-time sinusoids in Section 3. The new model and its mathematical parameters are defined in Section 4. Then, in Section 5, we propose an original method to extract these parameters from analyzed sounds. In Section 6, the synthesis algorithms are detailed before presenting the limitations of this model and two applications in Section 7.

BACKGROUND
Many model types have been considered for music synthesis. In this section, we summarize previous approaches to analyzing, transforming, and synthesizing noise-like signals and indicate their limitations.

Temporal models
The existing models for analyzing, transforming, and synthesizing noisy sounds are temporal or spectral models. Temporal models generate noises by randomly drawing samples using a standard distribution (uniform, normal, etc.). Then, they may be filtered (subtractive synthesis). The main temporal models use linear predictive coding (LPC) to color a white noise source. These approaches are common in speech research but are less closely linked to perception and are less flexible [5,6].

Spectral models
We are particularly interested in spectral models because they are useful for (mostly) harmonic sounds [7]. These sinusoidal models are very accurate for sounds with low noise levels and are intuitively controlled by users [8]. Therefore, it seems interesting to adapt them to the representation of more complex sounds.
Several advances have been proposed in the area of sinusoid-plus-noise models [9]. Macon has extended the ABS/OLA model [10] to enable time-scale and pitch modifications to unvoiced and noise-like signals by randomizing phases [11]. Another extension proposes to modulate the frequencies and/or amplitudes with a lowpass-filtered noise [12].
In 1989, research led to hybrid models, which decompose the original sounds into two independent parts: the sinusoidal part and the stochastic part [13]. Extensions have been proposed to consider transients separately [14,15]. The stochastic part corresponds to the noisy part of the original sound. It is entirely defined by the time-varying spectral envelopes [6]. Other methods use piecewise-linear spectral envelopes [3], LPC [2], or DCT modeling of the spectrum [16]. Another residual model related to the properties of the auditory system is proposed in [4]. The noisy part of any sound is represented by the time-varying energy in each equivalent rectangular band (ERB). However, because of such approximations, artifacts may result if this model is applied to sounds with high-level stochastic components.
Hybrid models considerably improve the quality of synthesized sounds, but it is desirable to present more parameters for the user to control musical sounds. The only musical parameters presented to the composers are amplitude (related to the volume) and spectral envelope (related to the color). We propose to develop a robust noise model that allows the largest possible number of high-fidelity transformations on the analysis data before synthesis.
Experiments have demonstrated the importance of the spectral density of sinusoidal components [17,18]. A spectral model for noisy sounds is adequate to control these parameters. Furthermore, the color of the noise, related to the spectral envelope, is intuitively represented on the frequency scale. For these reasons, the modeling of noisy sounds in the spectral domain is justified.
However, the mathematical justification of the representation of any random signal by a sum of sinusoids with time-constant amplitude, frequency, and phase is not straightforward. Similar models have been developed with theory in physics [19].

JUSTIFICATION OF A SPECTRAL AND STATISTICAL MODEL FOR NOISES
In this section, we present the justifications from the fields of statistics, physics, perception, and music, for the proposed spectral model of noisy sounds.

Thermal noise model
Thermal noises can be described in terms of a Fourier series [19]:
$$X(t) = \sum_{n=1}^{N} C_n \cos(\omega_n t + \Phi_n), \tag{1}$$
where $N$ is the number of frequencies, $n$ is an index, $\omega_n$ are equally spaced component frequencies, $C_n$ are random variables distributed according to a Rayleigh distribution, and $\Phi_n$ are random variables uniformly distributed between 0 and $2\pi$. The samples $X$ defining the signal are distributed according to a normal law. This definition, which represents a noise by a finite sum of sinusoids, is the starting point of our work. It justifies a spectral model for noisy sounds and is the central point of the model presented in this article. Nevertheless, the thermal noise model does not specify the number of sinusoids and the difference between frequencies.
It is obvious that choosing N = 2 sinusoids in a frequency band whose width is 20 kHz is not sufficient to synthesize a white noise that is perceptually equivalent to a white noise synthesized by randomly distributing samples according to a Gaussian law.
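The thermal noise definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `thermal_noise`, the Rayleigh scale `sigma`, and the choice of spreading the N frequencies evenly over (0, f_max] are ours and are not taken from [19].

```python
import math
import random

def thermal_noise(n_sinusoids, n_samples, f_max=20000.0, fs=44100.0,
                  sigma=1.0, seed=0):
    """Sum of N sinusoids with Rayleigh amplitudes and uniform phases.

    Hypothetical helper: frequencies are spread evenly over (0, f_max].
    """
    rng = random.Random(seed)
    step = f_max / n_sinusoids  # spacing between successive frequencies
    freqs = [(n + 1) * step for n in range(n_sinusoids)]
    # A Rayleigh amplitude is the modulus of a 2-D Gaussian vector.
    amps = [math.hypot(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
            for _ in range(n_sinusoids)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_sinusoids)]
    return [sum(a * math.cos(2.0 * math.pi * f * k / fs + p)
                for a, f, p in zip(amps, freqs, phases))
            for k in range(n_samples)]

sparse = thermal_noise(2, 512)    # N = 2: two isolated partials
dense = thermal_noise(256, 512)   # larger N: many closely spaced partials
```

With N = 2 the result is simply two partials, which is why the number of sinusoids has to be discussed separately from the distribution of the samples.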
So long as the number of sinusoids is not small, the synthesized samples are normally distributed because of the central limit theorem [19]. Nevertheless, this normal distribution is not sufficient to define colored noise. Perception is sensitive to the number $N$ of sinusoids (for a given bandwidth), as detailed in the next section.
Figure 1: Illustration of the spectral density: spectral differences between (a) Gaussian white noise and (b) white noise whose spectral density is low (synthesized with the CNSS model). The black parts indicate gaps of energy. These gaps are perceived and allow human ears to differentiate these two sounds.
It is important to note that the representation of the stochastic part for the hybrid models (e.g., the SMS model) is implicitly based on the thermal noise model. The only difference comes from the deterministic definition of the amplitudes of the sinusoids. Indeed, the resynthesis part [3] generates short-time spectra from the spectral envelopes by randomly distributing phase values according to a uniform law. Then an inverse Fourier transform is computed. This mathematical operation consists of summing a fixed number of sinusoids whose frequencies are equally spaced, and whose amplitudes are fixed. We denote by $F_s$ the sample rate and by $W_s$ the size of the synthesis window. The difference between successive frequencies is $F_s/W_s$, and the number of sinusoids is $W_s/2$ (number of sinusoids in the audible frequency range). This synthesis model can be described by the equation
$$X(t) = \sum_{n=1}^{W_s/2} a_n \cos\left(2\pi n \frac{F_s}{W_s} t + \Phi_n\right), \tag{2}$$
where the amplitudes $a_n$ are fixed values given by the spectral envelope and the phases $\Phi_n$ are uniformly distributed.
The necessary number of sinusoids needs to be discussed. Even if experiments confirm that the number implicitly used when computing an inverse Fourier transform appears to be sufficient [6], the question is whether it is necessary. Another question is whether the amplitudes of the sinusoids need to be defined in a random way. Here again, experiments seem to indicate that fixed amplitudes do not introduce audible artifacts [20].

Psychoacoustic experiments
For psychoacoustic experiments, noise can be synthesized in two different ways. The first one filters white noise computed by random distribution according to a normal or a uniform law [5]. The second one requires the desired noise spectrum and synthesizes sound by summing sinusoids whose amplitudes depend on this desired noise spectrum [19]. The spectral method is based on the thermal noise model. It is generally preferred because it directly controls the spectrum [21].
This approach raises the question of the necessary number of sinusoids to generate a noise which cannot be discriminated from noises synthesized by random distribution of samples. Gerzso did the first experiments in [17]. These experiments have been improved by Hartmann et al. [18] to study the human ability to discriminate bands of noise composed of different numbers of sinusoids.
Results of these experiments are numerous. In the case of a narrow band of noise, the mechanism of discrimination is related to the intensity fluctuations [21]. In the case of a broad band of noise, it is related to the spectral resolution [18]. This result suggests that humans perceive the energy variations in the short-time spectrum corresponding to wide intervals between two neighboring frequencies. Figure 1 shows two synthetic sounds: the second one is characterized by a low spectral density that is indicated by black points corresponding to spectral gaps.
These experiments show that the human auditory system is sensitive to the number of sinusoids used to synthesize bands of noise. In the following, this number is thus assumed to be a perceptual characteristic of sounds. It is related to spectral gaps or intensity fluctuations. The control of the number of sinusoids is thus related to the perception of sounds.
Moreover, these experiments confirm that a spectral approach to noise synthesis is possible. Indeed, it is now possible to compute an adequate number of oscillators to synthesize white noise whose spectral density and bandwidth are at their maximum. This case corresponds to the highest computational cost.

Definition of the spectral density
Psychoacoustic experiments show that the spectral density is used by the auditory system to discriminate bands of noise. Nevertheless, giving an exact and complete definition of the spectral density is difficult. Gerzso related the spectral density to the ratio of the number of sinusoids to the width of the frequency band [17]. For a band of noise of width $\Delta F$ with $N$ sinusoids, the spectral density $\rho$ is defined as
$$\rho = \frac{N}{\Delta F}. \tag{3}$$
We believe that this definition is not strictly correct, because it does not take into account the distribution of the sinusoidal component frequencies [20] and the duration of the band of noise. Perception of pitch depends on the duration of sounds [22]. The experiments we have done confirm that the use of successive short-time windows may cancel the sensation related to a low spectral density. Indeed, the difference between a thermal noise and a harmonic sound comes from the value of the difference between successive frequencies. This difference corresponds to the fundamental frequency. Periodic sound waves can have a pitch only if they have a sufficient duration. This duration depends on the periodicity. Psychoacoustic experiments indicate that the number of cycles necessary lies in the range of tens of cycles [23].
This observation is also confirmed by the usual method based on the inverse Fourier transform. This method considers the number of sinusoids as a function of the number of samples, and thus as a function of the duration of the synthesized sound.
In the following, we consider two independent parameters: the number of sinusoids and the duration of the sound.

Statistical model
The spectral model we propose is based on the thermal noise model. This model defines the frequencies of the sinusoids as equally spaced. The study of the perception of the spectral density shows that humans can perceive spectral gaps or intensity fluctuations. These phenomena can be due to one or more missing sinusoids. We thus propose to define frequencies as random variables which are controlled by mathematical parameters. The random property of the frequencies is justified because the ear is not sensitive to the precise information about the intensity fluctuations or the spectral gaps, but to their statistical properties. It is useful to represent the probabilities, but it seems useless to retain exact information about these properties.
Moreover, the study of the intensity fluctuations shows that they are dependent on the distribution of the phases of sinusoids [20]. This dependency is illustrated by the two limits: phases with same values and uniformly distributed values. In the first case, intensity fluctuations grow as the number of sinusoids increases [19], because the corresponding waveforms are composed of one or more intensity peaks that are audible. Conversely, uniformly distributed phases correspond to the thermal noise model and lead to weak intensity fluctuations. Therefore, it appears useful to control this phase distribution in order to modify the audible properties related to the intensity variations.
The thermal noise model considers the amplitudes of the sinusoids as random variables distributed according to a Rayleigh law. In practice, fixed amplitudes lead to bands of noise that cannot be discriminated from bands of noise synthesized with sinusoids whose amplitudes are randomly determined [19]. Moreover, we have not managed to relate the distribution of the amplitudes to a perceptual property. For these reasons, we restrict our spectral model to fixed amplitudes determined from spectral envelopes.

Mathematical justification
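The distributions of the frequencies and the phases of the sinusoids that compose bands of noise are perceptually relevant. The synthesized signal can thus be described from (1) by
$$X(t) = \sum_{n=1}^{N} a_n \cos(2\pi F_n t + \Phi_n), \tag{4}$$
where $F_n$ and $\Phi_n$ are random variables, and $a_n$ are fixed values.
We can mathematically show that this spectral and statistical approach, based on the thermal noise model, defines a white noise in the case of constant amplitudes (for all $n$, $a_n = a_0$). Denoting by $X(p)$ the sample of the signal at the integer index $p$ (time $t = p/F_s$), white noise satisfies the following equation:
$$E[X(p)\,X(p+q)] = 0 \quad \text{for all } q \neq 0, \tag{5}$$
where $E$ denotes the expectation [24]. By writing, for all $(p, q)$, the expectation of the product of $X(p)$ and $X(p+q)$, we have
$$E[X(p)\,X(p+q)] = a_0^2 \sum_{n=1}^{N} \sum_{l=1}^{N} E\left[\cos\left(2\pi \frac{F_n}{F_s} p + \Phi_n\right) \cos\left(2\pi \frac{F_l}{F_s} (p+q) + \Phi_l\right)\right]. \tag{6}$$
Since this expectation is defined by integrating over the phases, which are assumed to be independent and uniformly distributed in the interval $[0; 2\pi)$, each term with $l \neq n$ reduces to
$$E\left[\cos\left(2\pi \frac{F_n}{F_s} p + \Phi_n\right) \cos\left(2\pi \frac{F_l}{F_s} (p+q) + \Phi_l\right)\right] = 0 \tag{7}$$
and, for $l = n$,
$$E\left[\cos\left(2\pi \frac{F_n}{F_s} p + \Phi_n\right) \cos\left(2\pi \frac{F_n}{F_s} (p+q) + \Phi_n\right)\right] = \frac{1}{2}\, E\left[\cos\left(2\pi \frac{F_n}{F_s} q\right)\right]. \tag{8}$$
When the frequencies $F_n$ are uniformly distributed over $[0; F_s/2]$, the right-hand side of (8) equals $\sin(\pi q)/(2\pi q)$, which is zero for every nonzero integer $q$. We conclude that for all integers $(p, q)$,
$$E[X(p)\,X(p+q)] = \begin{cases} N a_0^2/2 & \text{if } q = 0, \\ 0 & \text{otherwise.} \end{cases} \tag{9}$$
These equations correspond to the definition of white noise. Two assumptions are thus imposed by this definition. The first one concerns the phases, which have to be uniformly distributed between 0 and $2\pi$. The second one concerns the frequencies, which also have to be uniformly distributed over the whole audible frequency range.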
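The white-noise property can also be checked numerically. The following Monte Carlo sketch is an illustration only: the function name, the number of trials, and the use of normalized frequencies $F_n/F_s$ drawn uniformly in [0, 0.5] are our assumptions. For uniformly distributed phases and frequencies, the expectation of $X(p)X(p+q)$ should be close to $N a_0^2/2$ for $q = 0$ and close to zero for any other integer lag.

```python
import math
import random

def autocorr_estimate(q, n_sinusoids=16, a0=1.0, trials=4000, seed=1):
    """Monte Carlo estimate of E[X(p) X(p+q)] for a fixed sample index p."""
    rng = random.Random(seed)
    p = 5  # arbitrary sample index; the expectation should not depend on it
    total = 0.0
    for _ in range(trials):
        x_p = x_pq = 0.0
        for _ in range(n_sinusoids):
            f = rng.uniform(0.0, 0.5)              # normalized frequency
            phi = rng.uniform(0.0, 2.0 * math.pi)  # uniform phase
            x_p += a0 * math.cos(2.0 * math.pi * f * p + phi)
            x_pq += a0 * math.cos(2.0 * math.pi * f * (p + q) + phi)
        total += x_p * x_pq
    return total / trials

r0 = autocorr_estimate(0)   # should be near N * a0**2 / 2 = 8
r3 = autocorr_estimate(3)   # should be near 0
```

The estimates fluctuate with the seed and the number of trials, so only approximate agreement should be expected.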

THE CNSS MODEL
In the previous section, we have shown that a band of noise can be represented by a sum of sinusoids. This representation is the starting point of the statistical and spectral model we present in this paper: the CNSS model.

Definition
The CNSS model (colored noise by sum of sinusoids) defines sounds as random processes $X_k$. They are represented by a fixed sum of sinusoids whose amplitudes $a_n$ are fixed and whose phases $\Phi_n$ and frequencies $F_n$ are random variables. Phases $\Phi_n$ are distributed according to a uniform law in the interval $[0; 2\pi)$ and frequencies $F_n$ are distributed in a band whose width is denoted $\Delta F$. Therefore, (4) defines the CNSS model.

Short-time frames
Practically, the sound is analyzed and synthesized by overlapping and adding two (or more) temporal frames. Each frame is defined by different sets of parameters. This approach does not appear in the definition of the thermal noise model. However, it can be justified. Indeed, as previously seen, the number of sinusoids depends on the size of the synthesis window. This number can be reduced by considering successive short-time signals: the duration is too short for ears to perceive the low spectral density. Furthermore, real-time synthesis requires successive short-time windows in order to enable modifications of the parameters from frame to frame.

Parameters
The CNSS model represents sounds by analyzing successive temporal frames. Each frame is modeled by several mathematical parameters. The duration of these frames is denoted by $W_s$. It is a positive integer and is expressed in samples. Concerning the distribution of frequencies, $M$ equally spaced bins ($M \geq N$) divide the frequency bandwidth. In each successive frame, $N$ frequency values are drawn into these bins from a uniform distribution. These parameters are illustrated by Figure 2 and are detailed below.

Bandwidth
The signal represented is supposed to be a band of noise. One of the parameters of the CNSS model is the width of this band. It defines the interval of the probability density function of the frequencies. It is denoted by $\Delta F$ and is defined by a maximum frequency $F_{\max}$ and a minimum frequency $F_{\min}$:
$$\Delta F = F_{\max} - F_{\min}.$$
Since we assume $\Delta F > 0$, we also assume $F_{\min} < F_{\max}$. This parameter is obviously constrained by the range of the auditory system (20–20 000 Hz). However, due to the Nyquist criterion (sample rate $F_s = 44\,100$ Hz), the interval is $[0; F_s/2 = 22\,050]$ Hz. It also corresponds to the interval implicitly considered when an inverse Fourier transform is computed.

Number of bins
In order to describe the probability density function of frequencies, we propose to define bins, whose sizes are constant, covering the entire bandwidth ∆F. The number of bins is a parameter and is denoted by M.
Each bin, denoted by $B_i$ ($i \in [0; M-1]$), defines an interval of the bandwidth $\Delta F$. The width of every bin, denoted by $\Delta B$, is constant:
$$\Delta B = \frac{\Delta F}{M}.$$
The interval $I_{B_i}$ defined by each bin $B_i$ is
$$I_{B_i} = [F_{\min} + i\,\Delta B;\ F_{\min} + (i+1)\,\Delta B].$$
The number of bins is positive and is not bounded:
$$M \in \mathbb{N}, \quad M \geq 1.$$
Inside each bin, at most one frequency is randomly chosen according to a distribution law defined by the other parameters $N$ and $L$.

Number of sinusoids
The number of sinusoids, denoted by $N$, appears in (4) of the CNSS model. This number is linked to the number of bins $M$. It is not possible to define more than one frequency in the same bin. At the opposite extreme, at least one sinusoid composes the signal:
$$1 \leq N \leq M.$$
As previously seen in Section 3.2, the influence of the number of sinusoids is not perfectly understood yet. It is linked to the duration of the synthesis frames [20]. Nevertheless, we present approximations about the linear variations of this number as a function of the bandwidth and the duration.
If $N$ equally spaced frequencies are defined in a band whose width is $F_s/2$ Hz, the difference between successive frequencies is $F_s/(2N)$ Hz. In order to be perceived, the minimum duration is $2N/F_s$ seconds, which corresponds to $2N$ samples. Therefore, $W_s/2$ sinusoids have to be used to define a noise with maximum spectral density. It is important to note that this value is the number of sinusoids used when computing an inverse Fourier transform of size $W_s$. The usual technique for the synthesis of the stochastic part of hybrid models [3] requires a number of sinusoids corresponding to the maximum spectral density. Therefore, filtered white noises that can be synthesized by applying this technique are always characterized by a maximum spectral density.
In the case of white noise (bandwidth $F_s/2$ Hz), the maximum number is half the synthesis frame duration $W_s$:
$$N_{\max} = \frac{W_s}{2}.$$
In conclusion, a band of noise defined by a width $\Delta F$, a duration $W_s$, and a maximum spectral density is represented by
$$N_{\max} = \frac{\Delta F \cdot W_s}{F_s}$$
sinusoids. The number $N$ of sinusoids is thus defined in the interval $[0; \Delta F \cdot W_s/F_s]$.
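As a small worked example of this bound, the sketch below evaluates $\Delta F \cdot W_s / F_s$ for two bandwidths. The function name and the decision to round down to an integer are our assumptions; the text only gives the upper limit of the interval.

```python
def max_sinusoids(bandwidth_hz, frame_size, sample_rate=44100.0):
    """Upper bound delta_F * W_s / F_s, rounded down (rounding is assumed)."""
    return int(bandwidth_hz * frame_size / sample_rate)

n_white = max_sinusoids(22050.0, 1024)   # full band: W_s / 2 = 512
n_band = max_sinusoids(5000.0, 1024)     # 5 kHz band: 116
```

For a frame of 1024 samples at 44 100 Hz, the full band therefore needs 512 sinusoids for maximum spectral density, while a 5 kHz band needs about 116.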

Width of the PDF of frequencies
Inside each selected bin, one frequency is randomly determined according to a uniform law. One parameter, denoted by $L$, defines the relative width of this law. Its value is in the interval $[0; 1]$. When it is null, the probability density function is a delta function, and the frequency is the upper boundary of the bin. In the bin $B_i$, the frequency would be $F_{\min} + (i+1)(F_{\max} - F_{\min})/M$. At the opposite extreme, if the parameter $L$ is 1, the probability density function is a rectangular function: all the frequencies have the same probability to be chosen.
In the case when the number of sinusoids equals the number of bins ($N = M$), we write
$$F_i \in [F_{\min} + (i + 1 - L)\,\Delta B;\ F_{\min} + (i+1)\,\Delta B], \quad i \in [0; M-1].$$
The probability density function, denoted by $\rho$ and associated to the bin $B_i$, is, for $L > 0$,
$$\rho_i(f) = \begin{cases} \dfrac{1}{L\,\Delta B} & \text{if } F_{\min} + (i + 1 - L)\,\Delta B \leq f \leq F_{\min} + (i+1)\,\Delta B, \\ 0 & \text{otherwise.} \end{cases}$$
This parameter defines the regularity of the differences between the frequencies composing modeled sounds. For example, when $L = 0$ and $N = M$, all frequencies are equally spaced:
$$F_i = F_{\min} + (i+1)\,\Delta B.$$
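The drawing of frequencies described by the parameters $\Delta F$, $M$, $N$, and $L$ can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, the $N$ occupied bins are assumed to be picked uniformly without replacement, and the uniform law of relative width $L$ is assumed to end at the upper boundary of the bin, which matches the limit case $L = 0$.

```python
import random

def draw_frequencies(f_min, f_max, n_bins, n_sinusoids, width, rng):
    """Draw at most one frequency per bin.

    Assumptions (not stated in this form in the text): the N occupied bins
    are picked uniformly without replacement, and the uniform law of
    relative width `width` (the parameter L) ends at the bin's upper edge.
    """
    if not 1 <= n_sinusoids <= n_bins:
        raise ValueError("need 1 <= N <= M")
    if not 0.0 <= width <= 1.0:
        raise ValueError("need 0 <= L <= 1")
    bin_width = (f_max - f_min) / n_bins
    bins = sorted(rng.sample(range(n_bins), n_sinusoids))
    freqs = []
    for i in bins:
        upper = f_min + (i + 1) * bin_width
        freqs.append(upper - width * bin_width * rng.random())
    return freqs

rng = random.Random(0)
regular = draw_frequencies(0.0, 1000.0, 10, 10, 0.0, rng)   # L = 0
jittered = draw_frequencies(0.0, 1000.0, 10, 10, 1.0, rng)  # L = 1
sparse = draw_frequencies(0.0, 1000.0, 10, 4, 1.0, rng)     # N < M
```

With $L = 0$ and $N = M$ the frequencies fall exactly on the upper bin boundaries, while $L = 1$ lets each frequency move anywhere inside its bin.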

Width of phase PDF
The thermal noise model defines the phases of the sinusoids composing sounds as random variables distributed in the interval $[0; 2\pi)$ according to a uniform law. The CNSS model allows the modification of this law by limiting the interval $[0; 2\pi)$. The relative width of the probability density function is a real number in the interval $[0; 1]$ and is denoted by $P$. When it is null, the probability density function is a delta function and all the phases are the same. This results in an intensity peak occurring at periodic times and depending on the duration of the frames. When this parameter $P$ is 1, the phases are uniformly distributed according to the thermal noise model. By considering the phases $\Phi_i^{t_0}$ of the sinusoids at the time $t_0$, the relation between the parameter $P$ and these phases is
$$0 \leq \Phi_i^{t_0} \leq 2\pi P.$$
We thus write the phases of the sinusoids at the time $t = 0$ as a function of their frequency $F_i$:
$$\Phi_i^{0} = \Phi_i^{t_0} - 2\pi F_i t_0 \pmod{2\pi}.$$
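The effect of the phase parameter $P$ on intensity peaks can be illustrated with a short sketch. The helper names, the frame length of 256 samples, and the choice of unit amplitudes at the first 32 DFT-bin frequencies are assumptions made only for this example.

```python
import math
import random

def draw_phases(n_sinusoids, rel_width, rng):
    """Phases uniform in [0, 2*pi*P); P = 0 gives identical (zero) phases."""
    return [2.0 * math.pi * rel_width * rng.random()
            for _ in range(n_sinusoids)]

def peak_amplitude(phases, n_samples=256):
    """Peak |x| of unit-amplitude sinusoids at DFT-bin frequencies 1..N."""
    peak = 0.0
    for k in range(n_samples):
        x = sum(math.cos(2.0 * math.pi * (n + 1) * k / n_samples + phi)
                for n, phi in enumerate(phases))
        peak = max(peak, abs(x))
    return peak

rng = random.Random(0)
aligned = peak_amplitude(draw_phases(32, 0.0, rng))  # all phases equal
spread = peak_amplitude(draw_phases(32, 1.0, rng))   # thermal-noise case
```

With identical phases all sinusoids add up coherently at one instant, so the peak reaches the number of sinusoids; with uniformly distributed phases the peak is markedly lower.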

Color
The color is a parameter already used in other spectral representations of sounds (SMS [6], STN [14], etc.) and refers to the spectral envelope. The SAS model [7] also introduces this parameter. Its name is due to the analogy between audible and visible spectra [8]. In the CNSS model, the color is defined by smoothed spectral envelopes. It is denoted by C and represents the variations of the amplitude as a function of the frequency. It is theoretically a continuous function, but it is modeled as a finite number of points. This representation allows manipulations that are more intuitive than the manipulations of filters [3].
Here, the main point is the independence between the spectral envelope and the spectral density. Existing models only consider spectral envelope, and the information related to the spectral density is contained in the spectral envelope or is not taken into account. The CNSS model allows independent manipulations of the spectral envelope and the spectral density.

Generalization of the filtered white noise models
The CNSS model is essentially a generalization of the filtered white noise models. The mathematical parameters of the model enable control of the frequency distribution. Nevertheless, it is possible to define the frequencies of the sinusoids as fixed values, according to the existing models. By choosing a band of frequency whose width is half of the sample rate ($\Delta F = F_s/2 = 22\,050$ Hz), with the number of frequencies $N$ equal to the number of bins $M$ and with the relative width $L$ null, the frequencies are no longer random variables. They are also equally spaced:
$$F_i = (i+1)\,\frac{F_s}{2M}, \quad i \in [0; M-1].$$
When the number of frequencies is half of the length of the frame ($N = W_s/2$), the spacing is $F_s/W_s$ and the synthesis is equivalent to an inverse Fourier transform.
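The equivalence with an inverse Fourier transform can be verified numerically on a small frame. The sketch below is an illustration under our own conventions: function names are hypothetical, the frame length is 16 samples, and a naive inverse DFT is used instead of a fast transform. It compares a direct sum of $W_s/2$ sinusoids at the frequencies $(i+1)F_s/W_s$ with the inverse DFT of the corresponding Hermitian spectrum.

```python
import cmath
import math
import random

def sum_of_sinusoids(amps, phases, frame_size):
    """Direct synthesis at frequencies (i + 1) * F_s / W_s, i = 0..N-1."""
    return [sum(a * math.cos(2.0 * math.pi * (i + 1) * k / frame_size + p)
                for i, (a, p) in enumerate(zip(amps, phases)))
            for k in range(frame_size)]

def inverse_dft_synthesis(amps, phases, frame_size):
    """Same frame from a Hermitian spectrum and a naive inverse DFT."""
    half = frame_size // 2
    spectrum = [0j] * frame_size
    for i, (a, p) in enumerate(zip(amps, phases)):
        b = i + 1                                  # DFT bin index
        if b == half:                              # Nyquist bin is real
            spectrum[b] = frame_size * a * math.cos(p)
        else:
            spectrum[b] = 0.5 * frame_size * a * cmath.exp(1j * p)
            spectrum[frame_size - b] = spectrum[b].conjugate()
    return [sum(s * cmath.exp(2j * math.pi * b * k / frame_size)
                for b, s in enumerate(spectrum)).real / frame_size
            for k in range(frame_size)]

rng = random.Random(0)
W = 16
amps = [rng.random() for _ in range(W // 2)]
phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(W // 2)]
direct = sum_of_sinusoids(amps, phases, W)
via_idft = inverse_dft_synthesis(amps, phases, W)
max_error = max(abs(x - y) for x, y in zip(direct, via_idft))
```

The two frames agree up to floating-point rounding, which is the sense in which the fixed-frequency case of the model reduces to the usual inverse-transform synthesis.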

ANALYSIS
The CNSS model represents noisy sounds with two perceptual parameters: the spectral density and the spectral envelope. The analysis stage consists of approximating these two parameters and estimating the related mathematical parameters that are described in the previous section.

Approximation of the spectral density
As previously seen, psychoacoustic experiments show that energy gaps in the spectrum of noisy sounds are perceptually relevant. These energy gaps are related to intensity fluctuations [21]. We have proposed an original method [25] to analyze these properties. It is based on the statistical study of these fluctuations.

Limitations of the usual techniques
The study of the energy distribution is based on the use of the short-time Fourier transform (STFT) [26]. Two main limitations lead us to choose another way. The first limitation concerns the resolution of this discrete transform and the usual problem of the tradeoff of time versus frequency. The second one is related to the analysis algorithm. One basic idea would be to detect gaps in the amplitude spectrum. But the determination of thresholds is needed, and this determination must rely on psychoacoustic research. Furthermore, approximations of the short-time Fourier transform lead to amplitude gaps that are due to the analysis windows applied to the sound [27]. For these reasons, we applied another method based on the statistical analysis of the intensity fluctuations.

Statistical analysis of the intensity fluctuations
The intensity fluctuations have been studied and modeled in order to explain the ability of humans to discriminate noises with different spectral densities [18]. Another theoretical study of these intensity fluctuations leads to comparable results [28,29]. We relate the variance of the envelope power of any signal to the number of sinusoidal components composing this signal. We define $V_{NEP}$ as the ratio of this variance to the squared average envelope power:
$$V_{NEP} = \frac{E[P^2] - E[P]^2}{E[P]^2}, \tag{25}$$
where $P$ denotes the envelope power. We consider a narrow frequency band. This condition allows us to assume that the amplitudes of the sinusoidal components that compose the signal within this band are equal. We show that $V_{NEP}$ is directly linked to the variations of sinusoidal components of the analyzed signal. In this case, the theoretical relation between the measure $V_{NEP}$ and the number $N$ of sinusoidal components is
$$V_{NEP} = 1 - \frac{1}{N}.$$
The method we propose consists of producing several values obtained by successively computing the measure $V_{NEP}$ on the filtered signal. The consecutive calculations of $V_{NEP}$ lead to an approximation of the intensity fluctuations and thus reveal the presence of energy gaps. Indeed, if the analyzed band is composed of noise that is modeled by several sinusoids ($N \gg 1$), the measure $V_{NEP}$ is high. At the opposite end, if the band is composed of very few sinusoidal components ($N \approx 1$), the measure $V_{NEP}$ becomes low.
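The link between the normalized variance of the envelope power and the number of components can be checked with a short simulation. This sketch is an illustration only and is not the analysis algorithm itself: it assumes equal unit amplitudes, takes the envelope power as $|\sum_n e^{j\varphi_n}|^2$, and averages over independent uniform phases. Under these assumptions the mean envelope power is $N$ and its variance is $N^2 - N$, so the ratio of the variance to the squared mean is $1 - 1/N$.

```python
import cmath
import math
import random

def vnep_equal_amplitudes(n_components, trials=20000, seed=0):
    """Monte Carlo estimate of var(P) / mean(P)**2 over random phases.

    P is the envelope power |sum_n exp(j * phi_n)|**2 of N unit-amplitude
    components whose phases are independent and uniform in [0, 2*pi).
    """
    rng = random.Random(seed)
    powers = []
    for _ in range(trials):
        envelope = sum(cmath.exp(1j * rng.uniform(0.0, 2.0 * math.pi))
                       for _ in range(n_components))
        powers.append(abs(envelope) ** 2)
    mean = sum(powers) / trials
    variance = sum((p - mean) ** 2 for p in powers) / trials
    return variance / mean ** 2

estimates = {n: vnep_equal_amplitudes(n) for n in (1, 2, 4, 16)}
# Reference values from 1 - 1/N: 0, 0.5, 0.75, 0.9375
```

A single component gives a constant envelope and a ratio of zero, while the ratio approaches 1 as the number of components grows, which is what makes the measure usable as an indicator of spectral gaps.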
The analysis method consists of the following operations.
(1) Bandpass filtering: this first stage is basic and consists of bandpass filtering the original sound in order to generate signals for the estimation of the intensity fluctuations.
(2) Calculation of the measure $V_{NEP}$: the measure of $V_{NEP}$ is done using the envelope power of the signal, as given in (25).
(3) Thresholding: once the values of $V_{NEP}$ have been computed, the next stage consists of counting the number $N$ of values of $V_{NEP}$ inside a frequency band which are below the selected threshold $t_h$. This threshold $t_h$ is one parameter of this analysis method. After this stage, a number $N$ is associated to each frequency value $F$, center of each studied frequency band.
(4) Maximization: an iteration on the width of the frequency band used to compute $V_{NEP}$ leads to the maximum value of $N$. This maximum is assumed to be the difference between two sinusoids inside the analyzed band or, to say it differently, the size of the spectral gap in the analyzed band [20].
This method leads to an approximation of the differences between sinusoids as a function of the frequency. Several experimental examples are shown in [20].
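The thresholding stage of the method can be sketched as a simple count. The function name, the dictionary layout, and the numerical values below are hypothetical and serve only to illustrate the step; they are not measurements from the experiments reported in [20].

```python
def count_below_threshold(vnep_by_center, band_lo, band_hi, threshold):
    """Count V_NEP values below `threshold` for centers in [band_lo, band_hi].

    `vnep_by_center` maps a sub-band center frequency (Hz) to its V_NEP.
    """
    return sum(1 for center, value in vnep_by_center.items()
               if band_lo <= center <= band_hi and value < threshold)

# Made-up illustrative values: low V_NEP around 1.2-1.4 kHz suggests a gap.
measurements = {1000.0: 0.91, 1100.0: 0.88, 1200.0: 0.12,
                1300.0: 0.08, 1400.0: 0.15, 1500.0: 0.90}
n_low = count_below_threshold(measurements, 1000.0, 1500.0, 0.5)  # 3
```

The choice of the threshold directly changes the count, which is why it is a parameter of the analysis method.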

Assumptions
The method relies on two main assumptions. The first one concerns the variations of the spectral envelope. The analysis stage of this spectral envelope consists of smoothing it using a lowpass filter and compressing it. For this reason, we assume that, in a narrow frequency band, the spectral envelope does not vary enough to introduce variations in the values of $V_{NEP}$ inside this band. The spectral envelope can thus be assumed constant. Nevertheless, this assumption obviously depends on the width of the frequency band used. Choosing too large a width would result in variations of the spectral envelope and thus in errors in the estimation of $V_{NEP}$. The second assumption concerns the approximation of the spectral density of the frequency bands studied. We assume that the spectral density is constant in this frequency band. Of course, the validity of this assumption depends on the width of the frequency bands studied: the larger this band, the weaker the probability for the spectral density to be constant.

Extraction of the parameters of the model
We represent the distribution of the frequencies with the parameters N, M, and L, where N is the number of sinusoids, M is the number of bins, and L is the width of the PDF of frequencies. We estimate these parameters by analyzing the approximation of the spectral density using the previously discussed method. Therefore, this stage consists of linking the approximation of the spectral density to these mathematical parameters. The different steps of this part of the analysis algorithm are illustrated in Figure 3. The first part consists of estimating the number of bins M because it is directly related to the maximum of the probability density function of the difference between frequencies. The second part tests if the number of frequencies N is different from the number of bins M. If it is different, this number of sinusoids is estimated. Then the width of the probability density function inside each bin is approximated.

Estimation of the probability density function of the frequency difference
We denote by q the probability density function of the difference between two successive frequencies (or the width of a spectral gap). This function is obtained from the results of the approximation of the variations of the spectral density as a function of frequency. These variations have the same characteristics as the probability density function q. The properties of this function give the estimation of the parameters of the CNSS model. Figure 4 shows an experimental probability density function of the difference between two successive frequencies. It has been computed from a large number (10000) of outcomes of frequency drawings, for different values of M. This figure shows that the most probable value corresponds to the value ∆F/M. The number of bins is thus directly linked to the most probable difference between two successive frequencies: M is ∆F divided by this most probable difference.
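As an illustration of this relation, the following Python sketch simulates frequency drawings in the spirit of Figure 4 and recovers M from the mode of the histogram of successive differences. It is only a sketch under stated assumptions: the drawings are simulated rather than obtained from the spectral density analysis, the case N = M is used, and the function names, the value L = 0.5, the band limits, and the 25-bin histogram are illustrative choices that do not come from the paper.

```python
import random

def draw_frequencies(n_bins, width, f_min, f_max, rng):
    """Draw one frequency per bin (the N = M case).

    Assumption: each frequency is drawn uniformly in a band of length
    width * bin_len that ends at the upper bound of its bin.
    """
    bin_len = (f_max - f_min) / n_bins
    return [f_min + (i + 1) * bin_len - rng.random() * width * bin_len
            for i in range(n_bins)]

def estimate_bins(diffs, delta_f, n_hist=25):
    """Estimate M as delta_f divided by the most probable difference."""
    lo, hi = min(diffs), max(diffs)
    step = (hi - lo) / n_hist
    counts = [0] * n_hist
    for d in diffs:
        k = min(int((d - lo) / step), n_hist - 1)
        counts[k] += 1
    # Center of the most populated histogram cell, used as the mode of q.
    mode = lo + (counts.index(max(counts)) + 0.5) * step
    return delta_f / mode

rng = random.Random(0)
F_MIN, F_MAX, M, L = 0.0, 10000.0, 20, 0.5   # illustrative values
diffs = []
for _ in range(5000):                        # 5000 simulated frames
    freqs = draw_frequencies(M, L, F_MIN, F_MAX, rng)
    diffs.extend(b - a for a, b in zip(freqs, freqs[1:]))
m_est = estimate_bins(diffs, F_MAX - F_MIN)
```

With these settings the estimate `m_est` should fall close to the true value M = 20. Its accuracy is bounded by the histogram resolution, so a coarser or finer histogram trades robustness against precision.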

Equality between N and M
Once the number of bins has been determined, the analysis method uses the symmetry of the extracted function q. Indeed, theory shows that the probability density function is symmetric around the value ∆F/M in the case of N = M, as opposed to the case M > N. These two cases are represented in Figure 4. The algorithm proposed to determine whether the number of frequencies N and the number of bins M are the same tests the largest frequency difference for which q is not zero. We denote this value by q m . If N = M, this value is slightly less than twice the bin width ∆F/M. Otherwise, the number of frequencies N is less than the number of bins M.
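The test on q m can be illustrated with a short simulation. The Python sketch below is not the analysis algorithm of the paper: it draws frequencies directly, and it assumes that the threshold is twice the bin width ∆F/M, which follows from the fact that two frequencies lying in adjacent bins can never be more than two bin widths apart. The function names and numerical values are hypothetical.

```python
import random

def draw_frame(n_freqs, n_bins, width, f_min, f_max, rng):
    """Draw n_freqs sorted frequencies, at most one per bin."""
    bin_len = (f_max - f_min) / n_bins
    chosen = sorted(rng.sample(range(n_bins), n_freqs))
    # Assumption: uniform draw in a band of length width * bin_len
    # ending at the upper bound of each chosen bin.
    return [f_min + (j + 1) * bin_len - rng.random() * width * bin_len
            for j in chosen]

def largest_difference(n_freqs, n_bins, width, f_min, f_max, rng, n_frames=200):
    """Largest observed difference between successive frequencies (q_m)."""
    q_m = 0.0
    for _ in range(n_frames):
        f = draw_frame(n_freqs, n_bins, width, f_min, f_max, rng)
        q_m = max(q_m, max(b - a for a, b in zip(f, f[1:])))
    return q_m

def same_count(q_m, delta_f, n_bins):
    """Decide N = M when q_m stays below twice the bin width (assumed rule)."""
    return q_m < 2.0 * delta_f / n_bins

rng = random.Random(0)
F_MIN, F_MAX, M = 0.0, 10000.0, 20           # illustrative values
DELTA_F = F_MAX - F_MIN
q_equal = largest_difference(20, M, 1.0, F_MIN, F_MAX, rng)   # N = M
q_fewer = largest_difference(10, M, 1.0, F_MIN, F_MAX, rng)   # N < M
```

In the N = M case the rule holds by construction. In the N < M case, empty bins create gaps wider than two bin widths, so `same_count` returns `False` as soon as such a gap is observed.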

Estimation of the number of frequencies N
In the case when N is different from M, an algorithm is proposed to estimate the number of frequencies needed. This algorithm counts the differences between successive frequencies that are greater than a fixed frequency difference. Indeed, if M is greater than N, the number of wide intervals between neighboring frequencies becomes higher. For a fixed number of bins M, the higher the number of frequencies, the higher this number of wide intervals. However, this method is slightly more complex than the ones used for the determination of M and the equality between M and N, because it requires a calibration stage [20]. Indeed, this calibration stage is necessary because the number of wide intervals detected between neighboring frequencies depends on the analysis parameters.

Estimation of the harmonicity coefficient L
The parameter L is correlated to the harmonicity of the sound and thus to its periodicity. A low value of L (near 0) imposes a fixed difference between successive frequencies, whereas a high value (near 1) implies a distribution of the differences between 0 and ∆F/M. For this reason, we compute the autocorrelation function to extract the value of L from the analyzed sound. We measure the ratio of the second maximum of the autocorrelation function to its first point (the zero-lag peak, which equals the total energy of the signal and is the global maximum). This ratio is also used to discriminate voiced from unvoiced sounds in speech [30].
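The autocorrelation ratio described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the "second maximum" is taken here to be the largest local maximum at a nonzero lag, which is an assumption, and the mapping from this ratio to a value of L is not given in the text and is therefore not attempted. The function names and test signals are hypothetical.

```python
import math
import random

def autocorrelation(x):
    """Biased (unnormalized) autocorrelation for lags 0 .. len(x) - 1."""
    n = len(x)
    return [sum(x[t] * x[t + k] for t in range(n - k)) for k in range(n)]

def peak_ratio(x):
    """Ratio of the largest nonzero-lag local maximum to the zero-lag value."""
    r = autocorrelation(x)
    peaks = [r[k] for k in range(1, len(r) - 1)
             if r[k - 1] < r[k] >= r[k + 1]]
    return max(peaks) / r[0] if peaks else 0.0

rng = random.Random(0)
n = 512
# A strictly periodic signal (period 32 samples) and a white Gaussian noise.
periodic = [math.sin(2.0 * math.pi * t / 32) for t in range(n)]
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
ratio_periodic = peak_ratio(periodic)
ratio_noise = peak_ratio(noise)
```

The periodic signal gives a ratio close to 1, while the white noise gives a much smaller one, which is the behavior the discrimination relies on.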

Extraction of the spectral envelope
The spectral envelope (also named the color [7]) is estimated using the same process as in the other spectral models [3]. A short-time Fourier transform is performed and the classical methods usually applied to the residual (spline interpolation, line-segment approximation, etc.) can be used to find a function that matches the amplitude spectrum.
The CNSS model needs an adapted analysis method to be able to estimate the spectral density of natural noisy sounds. The limitations of the short-time Fourier transform necessitate the use of new approaches. The proposed method has been successfully tested on synthetic and natural sounds [20,25]. It is difficult to assess its accuracy by comparison because, to our knowledge, there are no comparable alternatives. The proposed technique is still experimental and will certainly be improved in the future. But, for now, it is the only method that allows the estimation of the spectral density of the sinusoidal components and enables the extraction of the CNSS parameters.

SYNTHESIS
In this part, we present the algorithms used to synthesize sounds from the statistical and the mathematical parameters. The first part consists of generating the oscillators for each successive frame. The frequency, amplitude, and phase of each sinusoid are computed using the model parameters.
Then the temporal samples are generated in each successive frame to produce the synthesized sound. Figure 5 shows the general diagram for the synthesis.

Frequencies
The frequency values of the sinusoidal components of each frame have to be computed from the statistical parameters. The first step consists in defining M bins (denoted by B i ) by dividing the bandwidth [F min , F max ] into M bands of equal length (F max − F min )/M.
Then N frequencies are determined from these M bins. N bins have to be drawn from the M possibilities. A statistically correct algorithm to choose these bins is the classic algorithm for generating a random permutation. One bin i is randomly drawn, then bins M − 1 and i are interchanged. Another bin is randomly chosen from the M − 1 remaining bins. The process repeats until all N bins are chosen. This algorithm is costly if M is large compared to N, because one large array has to be initialized and manipulated in each frame. But experiments show that choosing M more than 100 times N amounts to a uniform draw of frequencies over the whole band.
We could consider the special case where there are the same number of frequency bins as sinusoids (N = M). The calculation would be more efficient in that case, because we could directly associate sinusoid i with bin B i (i ∈ {0, . . . , N − 1}). But most of the synthesis time is spent in the synthesis of the partials, so optimizing the algorithm for that special case may not pay off enough.
Once the bins have been chosen, frequency values have to be determined from the parameter L. Another uniform draw is made in a band which is defined by the upper bound of the bin B j ( j ∈ {0, . . . , M − 1}) and whose length is L multiplied by the bin length (F max − F min )/M. Therefore, the following operations have to be done in sequence for each frame of the temporal signal: (1) define the M bins, (2) draw N bins among the M, (3) in each chosen bin, draw a frequency uniformly in the band of length L(F max − F min )/M defined by the upper bound of the bin.
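The frequency draw for one frame can be sketched in Python as follows. This is a minimal sketch and not the libcnss code: the function names are hypothetical, the band used for the uniform draw is assumed to end at the upper bound of the bin, and the shortcut for M greater than 100 N simply follows the experimental remark made above.

```python
import random

def choose_bins(n_freqs, n_bins, rng):
    """Pick n_freqs distinct bin indices with a partial Fisher-Yates shuffle."""
    bins = list(range(n_bins))
    last = n_bins - 1
    for _ in range(n_freqs):
        i = rng.randint(0, last)                    # draw among unchosen bins
        bins[i], bins[last] = bins[last], bins[i]   # move chosen bin to the end
        last -= 1
    return bins[last + 1:]

def draw_frequencies(n_freqs, n_bins, width, f_min, f_max, rng):
    """Draw the N frequencies of one frame from the parameters N, M, and L."""
    if n_bins > 100 * n_freqs:
        # Shortcut: equivalent in practice to a uniform draw over the band.
        return sorted(rng.uniform(f_min, f_max) for _ in range(n_freqs))
    bin_len = (f_max - f_min) / n_bins
    freqs = []
    for j in choose_bins(n_freqs, n_bins, rng):
        upper = f_min + (j + 1) * bin_len
        # Assumption: the band of length width * bin_len ends at `upper`.
        freqs.append(upper - rng.random() * width * bin_len)
    return sorted(freqs)

rng = random.Random(1)
picked = choose_bins(5, 12, rng)
frame = draw_frequencies(8, 16, 0.5, 100.0, 1700.0, rng)
```

Only the last N positions of the array are touched by the swaps, but the array of M indices is still rebuilt for every frame, which is the cost mentioned in the text when M is large compared to N.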

Determination of phases
The model of thermal noise described in [19] requires the phase of each component to be uniformly distributed over an interval of length 2π. However, noise synthesized from sinusoids with equal phases exhibits intensity peaks. These peaks can be periodic depending on the length of the synthesis window. Such noises are described as impulsive noises. By changing the width of the probability density function of phases, users can control the amplitude of these peaks. The proposed synthesis model introduces a new parameter that controls the relative width P (P ∈ R, 0 ≤ P ≤ 1) of the probability density function of the phase.
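The effect of the phase width on intensity peaks can be observed with the following Python sketch. It assumes that the phases are drawn uniformly in an interval of length 2πP centered on 0, which is one possible reading of the relative width P and is not confirmed by the text. The crest factor (peak value divided by RMS value) is used here only as a convenient indicator of impulsiveness, and all names and numerical values are illustrative.

```python
import math
import random

def draw_phases(n, p, rng):
    """Phases uniform over an interval of length 2*pi*p (assumed centered on 0).

    p = 0 gives equal phases, p = 1 gives fully random phases.
    """
    return [(rng.random() - 0.5) * 2.0 * math.pi * p for _ in range(n)]

def crest_factor(freqs, phases, n_samples, sample_rate):
    """Peak-to-RMS ratio of one frame made of unit-amplitude sinusoids."""
    frame = [sum(math.cos(2.0 * math.pi * f * t / sample_rate + ph)
                 for f, ph in zip(freqs, phases))
             for t in range(n_samples)]
    rms = math.sqrt(sum(v * v for v in frame) / n_samples)
    return max(abs(v) for v in frame) / rms

rng = random.Random(0)
freqs = [rng.uniform(200.0, 8000.0) for _ in range(50)]
crest_equal = crest_factor(freqs, draw_phases(50, 0.0, rng), 1024, 44100.0)
crest_random = crest_factor(freqs, draw_phases(50, 1.0, rng), 1024, 44100.0)
```

With P = 0 all sinusoids add coherently at the start of the frame, which produces a pronounced peak and a larger crest factor than with P = 1.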

Determination of amplitudes
The amplitudes are simply computed from the frequency values by linearly interpolating the smooth spectral envelope of the model. However, other types of interpolation (splines, LPC, etc.) are possible.
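A minimal Python sketch of this linear interpolation is given below. The representation of the envelope as two lists of breakpoints (frequencies in increasing order and the corresponding amplitudes) and the function name are assumptions made for the illustration. Outside the breakpoint range, the nearest amplitude is held.

```python
from bisect import bisect_right

def envelope_amplitude(freq, env_freqs, env_amps):
    """Linearly interpolate the envelope at `freq`.

    env_freqs must be strictly increasing; env_amps has the same length.
    """
    if freq <= env_freqs[0]:
        return env_amps[0]
    if freq >= env_freqs[-1]:
        return env_amps[-1]
    k = bisect_right(env_freqs, freq)      # first breakpoint above freq
    f0, f1 = env_freqs[k - 1], env_freqs[k]
    a0, a1 = env_amps[k - 1], env_amps[k]
    return a0 + (a1 - a0) * (freq - f0) / (f1 - f0)

# Illustrative envelope: three breakpoints with decreasing amplitude.
env_f = [0.0, 1000.0, 2000.0]
env_a = [1.0, 0.5, 0.0]
amps = [envelope_amplitude(f, env_f, env_a) for f in (500.0, 1500.0, 3000.0)]
```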

Additive synthesis of frames
Once the frequency, amplitude, and phase values are calculated, temporal samples are generated with additive synthesis. An efficient algorithm is presented in [31]. This algorithm can generate approximately 2 partials per sample for each MHz of CPU clock speed. The algorithms we present have been implemented to create a real-time sound synthesizer. CPU consumption is highest in the case of white noise (or filtered white noise). Synthesizing sounds with more sinusoidal components is useless, because increasing N further makes no audible difference. For this reason, we define a maximum value for N depending on the synthesis window size. N cannot be greater than half of the synthesis window size (W s in samples). This limit corresponds to the inverse Fourier transform technique [32,33,34,35]: N ≤ W s /2.

OLA of frames
Spectral synthesis techniques often use the overlap-add (OLA) method. The resulting temporal signal does not taper to 0 at the boundaries of each frame because of the random values of the phase spectrum. This may also be the case when analyzed sounds are transformed. This is the reason why the synthesis method uses the OLA technique. But in the case of noise synthesis, both experiments and theory show that this method introduces intensity fluctuations which result in audible artifacts [36]. Indeed, the statistical moments are not preserved. We have proposed new methods to avoid these variations. We next describe a method which involves time shifting the sinusoids.
This method is applied to N sinusoids (denoted by s n ) with random phase. It consists in shifting the start of each sinusoid in each frame in order to distribute the intensity variations introduced by the weighting windows. Thus the starting time of each component (d n with n ∈ {0, . . . , N − 1}) is set to a different value before the component is multiplied by the weighting window. The resulting signal x is the sum of the N sinusoids s n , each delayed by its offset d n and multiplied by the weighting window. By choosing d n equally spaced over the half window, one can show that this sum of sinusoids leads to noise with constant statistical properties. There are many ways to determine the offset values. For example, they can be randomly drawn according to a uniform distribution. But this method may lead to artifacts because many offsets may have the same value, which introduces variance fluctuations [36]. To avoid such coincidences, we prefer choosing these values by dividing the half window into bins.
In summary, the following operations have to be done in sequence for each partial of each frame of the temporal signal: (1) draw an offset off, (2) synthesize the current partial, (3) multiply it by the weighting window, (4) offset the output buffer by off, (5) add the partial buffer to the output buffer.
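The per-partial operations listed above can be sketched in Python as follows. This is a simplified illustration under several assumptions that do not come from the paper: a Hann weighting window with a hop of half the window, offsets equally spaced over the half window (one per bin, without any random draw inside the bin), unit amplitudes, and frequencies drawn uniformly over the band instead of through the parameters M and L. The function names are hypothetical.

```python
import math
import random

def hann(n):
    """Periodic Hann window of n samples."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * t / n) for t in range(n)]

def synthesize(n_frames, n_partials, win_size, f_min, f_max, sample_rate, rng):
    """Overlap-add synthesis where each partial starts at its own offset."""
    hop = win_size // 2
    window = hann(win_size)
    out = [0.0] * (n_frames * hop + win_size)
    for frame in range(n_frames):
        start = frame * hop
        for n in range(n_partials):
            # (1) offset of this partial, equally spaced over the half window
            off = (n * hop) // n_partials
            freq = rng.uniform(f_min, f_max)
            phase = rng.uniform(0.0, 2.0 * math.pi)
            w = 2.0 * math.pi * freq / sample_rate
            for t in range(win_size):
                # (2) synthesize, (3) apply the window,
                # (4)-(5) add into the output buffer shifted by `off`
                out[start + off + t] += window[t] * math.cos(w * t + phase)
    return out

rng = random.Random(0)
signal = synthesize(8, 20, 256, 200.0, 8000.0, 44100.0, rng)
```

Because every partial has a different offset, the window minima of the different partials no longer coincide in time, which is what spreads the intensity variations across the half window.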

APPLICATION AND CONCLUSION
This noise model is still being tested, especially with respect to the analysis method. We now present applications and details about the implementation.

Implementation
In order to test the real-time capabilities of the model, we developed the synthesis part of the model on one of the existing free software tools for real-time audio. The objective was to control all the synthesis parameters as fast as possible, while the sound is rendered. The first target was jMax (see http://www.ircam.fr/jmax) because we have used it with the SAS sound model [7] successfully for a long time.
The libcnss library and its jMax extension are free software developed on the GNU/Linux platform. They are available at http://scrime.labri.fr.

Applications
We can consider two uses for the CNSS model. The first one is synthesis from composer-specified control parameters. The synthesis-based applications do not involve the analysis process. They consist in changing parameters in real time in order to modify the synthesized sounds. This approach will be very useful to understand the perception of noisy sounds. Applications for this noise model have been developed. One of the major applications is the pedagogic tool Dolabip [37]. The application uses the two sound models SAS and CNSS to help children to understand sound phenomena.
The second use for the CNSS model is the analysis followed by synthesis, with or without modification. Parameters of the CNSS model are extracted from natural noisy sounds. The main interest is to be able to perform original transformations on analyzed sounds. Users can then modify the analyzed parameters before resynthesizing the transformed sounds. Figure 6 shows an example of whispered voice that is analyzed and then resynthesized. The transformations allowed by the CNSS model are perceptually and musically relevant.
(i) Time scaling. In [38] we presented a method to perform time transformations without changing the statistical properties of noises. The first experiments we have done show the limitations of the analysis methods. We hope to considerably improve these results by developing better analysis methods.
(ii) Spectral density transformations. A key original aspect of the model we have developed is the ability to control the spectral density by modifying parameters such as the number of sinusoids and the distribution of these sinusoids.
(iii) Harmonicity. For sounds with low spectral density, users can control the difference between successive frequencies and thus the periodicity of the temporal envelope. This characteristic is perceptually relevant.
(iv) Color. As in existing spectral models, the spectral envelope can be modified.
Another application is a musical tool for electroacoustic composers. Composers use the software developed under jMax based on the CNSS model to synthesize original sounds which can be incorporated in musical pieces.

Future work
The approach we described here is original because we analyze a new parameter, the spectral density, which has been experimentally determined to be perceptually essential for noises. The analysis method is very complex and the approach we present can certainly be improved in the future. But it can already permit the development of psychoacoustic experiments on the perception of the spectral density, which is not completely understood. The results of these experiments and, in particular, the resolution of the human auditory system will give important data to improve the model.
The analysis method proposed here is limited to sounds which do not contain any transient or sinusoid whose amplitude and frequency vary slowly with time. We are developing methods in order to detect fast energy variations (transients) and stable sinusoids. Several methods for transient detection have been proposed (e.g., [39] or [40]). These methods will soon be incorporated in the analysis stage to avoid extracting information that is related to transients or sinusoids but is currently attributed to the noise part of the analyzed signal.
Furthermore, we limit the model to the analysis and synthesis of one band of noise. However, a polyphonic signal can be composed of several bands. Each band can be analyzed, transformed, and synthesized independently. One planned improvement of the analysis method is the ability to discriminate noisy bands which are independent: their perceptual properties (spectral density, harmonicity, etc.) may be totally different.

Conclusion
In this paper, we propose the study of the representation of noisy sounds with short-time sinusoids. No complete justification has been proposed for this representation, whereas many models apply it implicitly. This study leads to the CNSS model, a spectral and statistical model for the analysis, the musical transformation, and the synthesis of noisy sounds. It is appropriate for representing the noisy part of natural sounds, and it allows new high-fidelity transformations (e.g., modifications of the spectral density). The quality of the classical transformations is also at least as good as that of transformations performed with the existing models (e.g., the time-scale operations [38]). For now, the CNSS model assumes that the modeled sound does not contain any stable sinusoid or transient. This may lead to audible artifacts in the case of transformations of complex sounds, unlike models such as [12], for example. The CNSS model we have developed is still experimental: the values of the parameters of the model have to be refined using psychoacoustic tests. But the model already shows considerable promise for musical creation, psychoacoustic experimentation, and pedagogy. Several sound examples can be found at http://www.labri.fr/Perso/hanna/sounds.html.