EURASIP Journal on Applied Signal Processing 2003:8, 791–805
© 2003 Hindawi Publishing Corporation

Parameter Estimation of a Plucked String Synthesis Model Using a Genetic Algorithm with Perceptual Fitness Calculation

We describe a technique for estimating control parameters for a plucked string synthesis model using a genetic algorithm. The model has been intensively used for sound synthesis of various string instruments but the fine tuning of the parameters has been carried out with a semiautomatic method that requires some hand adjustment with human listening. An automated method for extracting the parameters from recorded tones is described in this paper. The calculation of the fitness function utilizes knowledge of the properties of human hearing.


INTRODUCTION
Model-based sound synthesis is a powerful tool for creating natural sounding tones by simulating the sound production mechanisms and physical behavior of real musical instruments. These mechanisms are often too complex to simulate in every detail, so simplified models are used for synthesis. The aim is to generate a perceptually indistinguishable model for real instruments.
One workable method for physical modelling synthesis is based on digital waveguide theory proposed by Smith [1]. In the case of the plucked string instruments, the method can be extended to model also the plucking style and instrument body [2,3]. A synthesis model of this kind can be applied to synthesize various plucked string instruments by changing the control parameters and using different body and plucking models [4,5]. A characteristic feature in string instrument tones is the double decay and beating effect [6], which can be implemented by using two slightly mistuned string models in parallel to simulate the two polarizations of the transversal vibratory motion of a real string [7].
Parameter estimation is an important and difficult challenge in sound synthesis. Usually, the natural parameter settings are in great demand at the initial state of the synthesis. When using these parameters with a model, we are able to produce real-sounding instrument tones. Various methods for adjusting the parameters to produce the desired sounds have been proposed in the literature [4,8,9,10,11,12]. An automated parameter calibration method for a plucked string synthesis model has been proposed in [4,8], and then improved in [9]. It gives the estimates for the fundamental frequency, the decay parameters, and the excitation signal which is used in commuted synthesis.
Our interest in this paper is the parameter estimation of the model proposed by Karjalainen et al. [7]. The parameters of the model have earlier been calibrated automatically, but the fine-tuning has required some hand adjustment. In this work, we use recorded tones as a target sound with which the synthesized tones are compared. All synthesized sounds are then ranked according to their similarity with the recorded tone. An accurate way to measure sound quality from the viewpoint of auditory perception would be to carry out listening tests with trained participants and rank the candidate solutions according to the data obtained from the tests [13]. This method is extremely time consuming and, therefore, we are forced to use analytical methods to calculate the quality of the solutions. Various techniques to simulate human hearing and calculate perceptual quality exist. Perceptual linear predictive (PLP) technique is widely used with speech signals [14], and frequency-warped digital signal processing is used to implement perceptually relevant audio applications [15].
In this work, we use an error function that simulates human hearing and calculates the perceptual error between the tones. Frequency masking behavior, frequency dependence, and other limitations of human hearing are taken into account. From the optimization point of view, the task is to find the global minimum of the error function. The variables of the function, that is, the parameters of the synthesis model, span the parameter space where each point corresponds to a set of parameters and thus to a synthesized sound. When dealing with discrete parameter values, the number of parameter sets is finite and given by the product of the number of possible values of each parameter. Using nine control parameters with 100 possible values each, a total of $10^{18}$ combinations exist in the space and, therefore, an exhaustive search is obviously impossible.
Evolutionary algorithms have shown a good performance in optimizing problems relating to the parameter estimation of synthesis models. Vuori and Välimäki [16] tried a simulated evolution algorithm for the flute model, and Horner et al. [17] proposed an automated system for parameter estimation of FM synthesizer using a genetic algorithm (GA). GAs have been used for automatically designing sound synthesis algorithms in [18,19]. In this study, a GA is used to optimize the perceptual error function. This paper is sectioned as follows. The plucked string synthesis model and the control parameters to be estimated are described in Section 2. Parameter estimation problem and methods for solving it are discussed in Section 3. Section 4 concentrates on the calculation of the perceptual error. In Section 5, we discretize the parameter space in a perceptually reasonable manner. Implementation of the GA and different schemes for selection, mutation, and crossover used in our work are surveyed in Section 6. Experiments and results are analyzed in Section 7 and conclusions are finally drawn in Section 8.

PLUCKED STRING SYNTHESIS MODEL
The model proposed by Karjalainen et al. [7] is used for plucked string synthesis in this study. The block diagram of the model is presented in Figure 1. It is based on digital waveguide synthesis theory [1] that is extended in accordance with commuted waveguide synthesis approach [2,3] to include also the body modes of the instrument in the string synthesis model.
Figure 1: The plucked string synthesis model.
Figure 2: The basic string model.

Different plucking styles and body responses are stored as wavetables in the memory and used to excite the two string models $S_h(z)$ and $S_v(z)$ that simulate the effect of the two polarizations of the transversal vibratory motion. A single string model $S(z)$ in Figure 2 consists of a lowpass filter $H(z)$ that controls the decay rate of the harmonics, a delay line $z^{-L_I}$, and a fractional delay filter $F(z)$. The delay time around the loop for a given fundamental frequency $f_0$ is

$$L_d = \frac{f_s}{f_0}, \tag{1}$$

where $f_s$ is the sampling rate (in Hz). The loop delay $L_d$ is implemented by the delay line $z^{-L_I}$ and the fractional delay filter $F(z)$. The delay line is used to control the integer part $L_I$ of the string length while the coefficients of the filter $F(z)$ are adjusted to produce the fractional part $L_f$ [20]. The fractional delay filter $F(z)$ is implemented as a first-order allpass filter. The two string models are typically slightly mistuned to produce a natural sounding beating effect. A one-pole filter with transfer function

$$H(z) = g \frac{1 + a}{1 + a z^{-1}} \tag{2}$$

is used as a loop filter in the model. Parameter $0 < g < 1$ in (2) determines the overall decay rate of the sound while parameter $-1 < a < 0$ controls the frequency-dependent decay. The excitation signal is scaled by the mixing coefficients $m_p$ and $(1 - m_p)$ before it is sent to the two string models. Coefficient $g_c$ enables coupling between the two polarizations.
Mixing coefficient $m_o$ defines the proportion of the two polarizations in the output sound. All parameters $m_p$, $g_c$, and $m_o$ are chosen to have values between 0 and 1. The transfer function of the entire model is written as

$$M(z) = m_p m_o S_h(z) + (1 - m_p)(1 - m_o) S_v(z) + m_p (1 - m_o) g_c S_h(z) S_v(z),$$

where the string models $S_h(z)$ and $S_v(z)$ for the two polarizations can each be written as an individual string model

$$S(z) = \frac{1}{1 - z^{-L_I} F(z) H(z)}.$$

A synthesis model of this kind has been intensively used for sound synthesis of various plucked string instruments [5,21,22]. Different methods for estimating the parameters have been used, but because the parameters interact, systematic methods are at best troublesome and probably impossible. The nine parameters that are used to control the synthesis model are listed in Table 1.
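The single string loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes the commonly used one-pole loop filter $H(z) = g(1 + a)/(1 + a z^{-1})$, a first-order allpass $F(z) = (c + z^{-1})/(1 + c z^{-1})$ with $c = (1 - L_f)/(1 + L_f)$, and it ignores the small phase delay of the loop filter when splitting the loop delay. The function name `string_model` and its default values are hypothetical.

```python
import numpy as np

def string_model(excitation, f0, fs=44100.0, g=0.99, a=-0.05):
    """Sketch of S(z) = 1 / (1 - z^{-L_I} F(z) H(z)) driven by an excitation."""
    x = np.asarray(excitation, dtype=float)
    loop_delay = fs / f0                       # L_d, delay around the loop in samples
    L_I = int(np.floor(loop_delay - 0.5))      # integer part (assumption: keep L_f in [0.5, 1.5))
    L_f = loop_delay - L_I                     # fractional part handled by the allpass
    c = (1.0 - L_f) / (1.0 + L_f)              # first-order allpass coefficient
    y = np.zeros(len(x))
    ap_in_prev = ap_out_prev = h_prev = 0.0    # filter states
    for n in range(len(x)):
        delayed = y[n - L_I] if n >= L_I else 0.0
        # Allpass F(z) = (c + z^-1) / (1 + c z^-1)
        ap_out = c * delayed + ap_in_prev - c * ap_out_prev
        ap_in_prev, ap_out_prev = delayed, ap_out
        # One-pole loop filter H(z) = g (1 + a) / (1 + a z^-1)
        h_out = g * (1.0 + a) * ap_out - a * h_prev
        h_prev = h_out
        y[n] = x[n] + h_out                    # feedback loop closes here
    return y

# Usage: an impulse excitation gives a decaying tone at roughly f0.
impulse = np.zeros(2000)
impulse[0] = 1.0
tone = string_model(impulse, f0=441.0)
```

Two such loops with slightly different `f0` values, mixed by $m_p$, $m_o$, and $g_c$, would give the two-polarization structure of Figure 1.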

ESTIMATION OF THE MODEL PARAMETERS
Determination of the proper parameter values for sound synthesis systems is an important problem and also depends on the purpose of the synthesis. When the goal is to imitate the sounds of real instruments, the aim of the estimation is unambiguous: we wish to find a parameter set which gives the sound output that is sufficiently similar to the natural one in terms of human perception. These parameters are also feasible for virtual instruments at the initial stage after which the limits of real instruments can be exceeded by adjusting the parameters in more creative ways.
Parameters of a synthesis model correspond normally to the physical characteristics of an instrument [7]. The estimation procedure can then be seen as sound analysis where the parameters are extracted from the sound or from the measurements of physical behavior of an instrument [23]. Usually, the model parameters have to be fine-tuned by laborious trial and error experiments, in collaboration with accomplished players [23]. Parameters for the synthesis model in Figure 1 have earlier been estimated this way and recently in a semiautomatic fashion, where some parameter values can be obtained with an estimation algorithm while others must be guessed. Another approach is to consider the parameter estimation problem as a nonlinear optimization process and take advantage of the general searching methods. All possible parameter sets can then be ranked according to their similarity with the desired sound.

Calibrator
A brief overview of the calibration scheme, used earlier with the model, is given here. The fundamental frequency $f_0$ is first estimated using the autocorrelation method. The frequency estimate in samples from (1) is used to adjust the delay line length $L_I$ and the coefficients of the fractional delay filter $F(z)$. The amplitude, frequency, and phase trajectories for partials are analyzed using the short-time Fourier transform (STFT), as in [4]. The estimates for loop filter parameters $g$ and $a$ are then analyzed from the envelopes of individual partials. The excitation signal for the model is extracted from the recorded tone by a method described in [24]. The amplitude, frequency, and phase trajectories are first used to synthesize the deterministic part of the original signal and the residual is obtained by a time-domain subtraction. This produces a signal which lacks the energy to excite the harmonics when used with the synthesis model. This is avoided by inverse filtering the deterministic signal and the residual separately. The output signal of the model is finally fed to the optimization routine which automatically fine-tunes the model parameters by analyzing the time-domain envelope of the signal.
The difference in the length of the delay lines can be estimated based on the beating of a recorded tone. In [25], the beating frequency is extracted from the first harmonic of a recorded string instrument tone by fitting a sine wave using the least squares method. Another procedure for extracting beating and two-stage decay from the string tones is described by Bank in [26]. In practice, the automatic calibrator algorithm is first used to find decent values for the control parameters of one string model. These values are also used for the other string model. The mistuning between the two string models has then been found by ear [5] and the differences in the decay parameters are set by trial and error. Our method automatically extracts the nine control parameter values from recorded tones.

Optimization
Instead of extracting the parameters from audio measurements, our approach here is to find the parameter set that produces a tone that is perceptually indistinguishable from the target one. Each parameter set can be assigned a quality value which denotes how good the candidate solution is. This performance metric is usually called a fitness function, or inversely, an error function. A parameter set is fed into the fitness function which calculates the error between the corresponding synthesized tone and the desired sound. The smaller the error, the better the parameter set and the higher the fitness value. These functions give a numerical grade to each solution, by means of which we are able to rank all possible parameter sets.

FITNESS CALCULATION
Human hearing analyzes sound both in the frequency and time domain. Since spectra of all musical sounds vary with time, it is appropriate to calculate the spectral similarity in short time segments. A common method is to measure the least squared error of the short-time spectra of the two sounds [17,18]. The STFT of signal $y(n)$ is a sequence of discrete Fourier transforms (DFT)

$$Y(m, k) = \sum_{n=0}^{N-1} w(n)\, y(n + mH)\, e^{-j \omega_k n}, \quad m = 0, 1, 2, \ldots,$$

with

$$\omega_k = \frac{2 \pi k}{N}, \quad k = 0, 1, \ldots, N - 1,$$

where $N$ is the length of the DFT, $w(n)$ is a window function, and $H$ is the hop size or time advance (in samples) per frame.
Integers $m$ and $k$ refer to the frame index and frequency bin, respectively. When $N$ is a power of two, for example, 1024, each DFT can be computed efficiently with the FFT algorithm. If $o(n)$ is the output sound of the synthesis model and $t(n)$ is the target sound, then the error (inverse of the fitness) of the candidate solution is calculated as follows:

$$E = \frac{1}{L} \sum_{m=0}^{L-1} \sum_{k=0}^{N-1} \bigl( |O(m, k)| - |T(m, k)| \bigr)^2, \tag{7}$$

where $O(m, k)$ and $T(m, k)$ are the STFT sequences of $o(n)$ and $t(n)$ and $L$ is the length of the sequences.
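As an illustration of the short-time spectral error described above, the following sketch computes the mean over frames of the squared difference between STFT magnitudes. It assumes a Hanning window, a full two-sided FFT, and signals at least one window long; the function names `stft` and `spectral_error` and the default sizes are hypothetical choices, not taken from the paper.

```python
import numpy as np

def stft(y, n_fft=1024, hop=256):
    """Return an (L, N) array of DFT frames of a windowed signal."""
    y = np.asarray(y, dtype=float)
    w = np.hanning(n_fft)                          # window function w(n)
    n_frames = 1 + (len(y) - n_fft) // hop         # number of complete frames L
    frames = np.stack([y[m * hop:m * hop + n_fft] * w for m in range(n_frames)])
    return np.fft.fft(frames, n=n_fft, axis=1)

def spectral_error(o, t, n_fft=1024, hop=256):
    """Squared magnitude difference summed over bins, averaged over frames."""
    O = stft(o, n_fft, hop)
    T = stft(t, n_fft, hop)
    L = min(len(O), len(T))                        # compare the common frames only
    return np.sum((np.abs(O[:L]) - np.abs(T[:L])) ** 2) / L

# Usage: identical signals give zero error, a scaled copy does not.
target = np.sin(2 * np.pi * 440.0 * np.arange(4096) / 44100.0)
err_same = spectral_error(target, target)
err_scaled = spectral_error(0.5 * target, target)
```

Because only magnitudes are compared, the measure ignores phase differences between the two tones.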

Perceptual quality
The analytical error calculated from (7) is a raw simplification from the viewpoint of auditory perception. Therefore, an auditory model is required. One possibility would be to include the frequency masking properties of human hearing by applying a narrow band masking curve [27] for each partial. This method has been used to speed up additive synthesis [28] and perceptual wavetable matching for synthesis of musical instrument tones [29]. One disadvantage of the method is that it requires peak tracking of partials, which is a time-consuming procedure. We use here a technique which determines the threshold of masking from the STFT sequences. The frequency components below that threshold are inaudible, therefore, they are unnecessary when calculating the perceptual similarity. This technique proposed in [30] has been successfully applied in audio coding and perceptual error calculation [18].

Calculating the threshold of masking
The threshold of masking is calculated in several steps:
(1) windowing the signal and calculating STFT,
(2) calculating the power spectrum for each DFT,
(3) mapping the frequency scale into the Bark domain and calculating the energy per critical band,
(4) applying the spreading function to the critical band energy spectrum,
(5) calculating the spread masking threshold,
(6) calculating the tonality-dependent masking threshold,
(7) normalizing the raw masking threshold and calculating the absolute threshold of masking.
The frequency power spectrum is translated into the Bark scale by using the approximation [27]

$$\nu = 13 \arctan\!\left( \frac{0.76 f}{1000} \right) + 3.5 \arctan\!\left( \left( \frac{f}{7500} \right)^2 \right),$$

where $f$ is the frequency in Hertz and $\nu$ is the mapped frequency in Bark units. The energy in each critical band is calculated by summing the frequency components in the critical band. The number of critical bands depends on the sampling rate and is 25 for the sample rate of 44.1 kHz. The discrete representation of fixed critical bands is a close approximation and, in reality, each band builds up around a narrow band excitation. A power spectrum $P(k)$ and energy per critical band $Z(\nu)$ for a 12-millisecond excerpt from a guitar tone are shown in Figure 3a.
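The Bark mapping and the per-band summation can be sketched as follows. The mapping used here is the widely cited Zwicker approximation, $\nu = 13 \arctan(0.76 f / 1000) + 3.5 \arctan((f / 7500)^2)$ with $f$ in Hz, and the band of each bin is taken as the integer part of $\nu$; both are assumptions about what the text intends, and the function names are hypothetical.

```python
import numpy as np

def hz_to_bark(f_hz):
    """Approximate Bark value for a frequency in Hz (Zwicker-style formula)."""
    f_khz = np.asarray(f_hz, dtype=float) / 1000.0
    return 13.0 * np.arctan(0.76 * f_khz) + 3.5 * np.arctan((f_khz / 7.5) ** 2)

def critical_band_energy(power_spectrum, fs=44100.0):
    """Sum a one-sided power spectrum (bins 0..N/2) into critical bands.

    Returns the energy per band and the number of bins per band.
    """
    p = np.asarray(power_spectrum, dtype=float)
    freqs = np.linspace(0.0, fs / 2.0, len(p))      # bin center frequencies
    band = np.floor(hz_to_bark(freqs)).astype(int)  # fixed band index per bin
    n_bands = band.max() + 1
    Z = np.bincount(band, weights=p, minlength=n_bands)
    n_points = np.bincount(band, minlength=n_bands)
    return Z, n_points

# Usage with a flat one-sided spectrum of a 2048-point DFT.
Z, n_points = critical_band_energy(np.ones(1025))
```

With a sampling rate of 44.1 kHz the highest bin falls in band index 24, which agrees with the 25 critical bands mentioned in the text.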
The effect of masking of each narrow band excitation spreads across all critical bands. This is described by a spreading function given in [31]

$$10 \log_{10} B(\nu) = 15.91 + 7.5(\nu + 0.474) - 17.5 \sqrt{1 + (\nu + 0.474)^2} \ \text{dB}.$$

The spreading function is presented in Figure 3b. The spreading effect is applied by convolving the critical band energy function $Z(\nu)$ with the spreading function $B(\nu)$ [30]. The spread energy per critical band $S_P(\nu)$ is shown in Figure 3c.

The masking threshold depends on the characteristics of the masker and the masked tone. Two different thresholds are detailed and used in [30]. For a tone masking noise, the threshold is estimated as $14.5 + \nu$ dB below $S_P$. For noise masking a tone, it is estimated as 5.5 dB below $S_P$. A spectral flatness measure is used to determine the noiselike or tonelike characteristics of the masker. The spectral flatness measure $V$ is defined in [30] as the ratio of the geometric to the arithmetic mean of the power spectrum. The tonality factor $\alpha$ is defined as follows:

$$\alpha = \min\!\left( \frac{V}{V_{\max}}, 1 \right),$$

where $V_{\max} = -60$ dB. That is to say that if the masker signal is entirely tonelike, then $\alpha = 1$, and if the signal is pure noise, then $\alpha = 0$. The tonality factor is used to geometrically weight the two thresholds mentioned above to form the masking energy offset $U(\nu)$ for a critical band

$$U(\nu) = \alpha (14.5 + \nu) + (1 - \alpha)\, 5.5.$$

The offset is then subtracted from the spread spectrum to estimate the raw masking threshold

$$R(\nu) = 10^{\log_{10}(S_P(\nu)) - U(\nu)/10}.$$
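The spreading, tonality, and offset steps can be sketched as below. The code assumes the full Schroeder-type spreading function $15.91 + 7.5(\nu + 0.474) - 17.5\sqrt{1 + (\nu + 0.474)^2}$ dB, a tonality factor $\alpha = \min(V / V_{\max}, 1)$ with $V$ in dB, and an offset $U(\nu) = \alpha(14.5 + \nu) + (1 - \alpha)5.5$, following the Johnston-style procedure the text refers to. Using the integer band index as $\nu$ and the small floor inside the logarithm are simplifications of this sketch.

```python
import numpy as np

def spreading_db(dnu):
    """Spreading function in dB for a Bark distance dnu (maskee minus masker)."""
    x = dnu + 0.474
    return 15.91 + 7.5 * x - 17.5 * np.sqrt(1.0 + x * x)

def spread_energy(Z):
    """Convolve critical band energies Z with the spreading function."""
    idx = np.arange(len(Z))
    # B[i, j]: contribution of a masker in band j to the spread energy in band i
    B = 10.0 ** (spreading_db(idx[:, None] - idx[None, :]) / 10.0)
    return B @ np.asarray(Z, dtype=float)

def tonality(power_spectrum, v_max_db=-60.0):
    """Tonality factor from the spectral flatness measure (0 noise, 1 tone)."""
    p = np.maximum(np.asarray(power_spectrum, dtype=float), 1e-20)  # avoid log(0)
    geometric_mean = np.exp(np.mean(np.log(p)))
    v_db = 10.0 * np.log10(geometric_mean / np.mean(p))
    return min(v_db / v_max_db, 1.0)

def raw_threshold(Z, alpha):
    """Raw masking threshold: spread energy lowered by the offset U in dB."""
    S_P = spread_energy(Z)
    nu = np.arange(len(Z))                      # band index used as Bark value
    U = alpha * (14.5 + nu) + (1.0 - alpha) * 5.5
    return 10.0 ** (np.log10(S_P) - U / 10.0)

# Usage: a single excited band spreads further upward than downward.
Z = np.zeros(25)
Z[10] = 1.0
S_P = spread_energy(Z)
```

The asymmetry of `S_P` around band 10 reflects the shallower upper slope of the spreading function.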
Convolution of the spreading function and the critical band energy function increases the energy level in each band. The normalization procedure used in [30] takes this into account and divides each component of $R(\nu)$ by the number of points in the corresponding band

$$Q(\nu) = \frac{R(\nu)}{N_p},$$

where $N_p$ is the number of points in the particular critical band. The final threshold of masking for a frequency spectrum $W(k)$ is calculated by comparing the normalized threshold to the absolute threshold of hearing and mapping from Bark to the frequency scale. The most sensitive area in human hearing is around 4 kHz. If the normalized energy $Q(\nu)$ in any critical band is lower than the energy in a 4 kHz sinusoidal tone with one bit of dynamic range, it is changed to the absolute threshold of hearing. This is a simplified method to set the absolute levels since in reality the absolute threshold of hearing varies with the frequency. An example of the final threshold of masking is shown in Figure 3d. It is seen that many of the high partials and the background noise at the high frequencies are below the threshold and thus inaudible.

Calculating the perceptual error
Perceptual error is calculated in [18] by weighting the error from (7) with two matrices, $G(m, k)$ and $H(m, k)$, where $m$ and $k$ refer to the frame index and frequency bin, as defined previously. The matrices are defined such that the full error is calculated for spectral components which are audible in a recorded tone $t(n)$ (that is, above the threshold of masking). The matrix $G(m, k)$ is used to account for these components. For the components which are inaudible in a recorded tone but audible in the sound output of the model $o(n)$, the error between the sound output and the threshold of masking is calculated. The matrix $H(m, k)$ is used to weight these components.
Perceptual error $E_p$ is a sum of these two cases. No error is calculated for the components which are below the threshold of masking in both sounds. Finally, the perceptual error function (15) is evaluated by weighting the error with $W_s(k)$, an inverted equal loudness curve at a sound pressure level of 60 dB (shown in Figure 4), which imitates the frequency-dependent sensitivity of human hearing.
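One way to read the two-case weighting described above is sketched below. The exact definitions of the matrices and of the final error function are not reproduced in this section, so the choices here are assumptions: `G` marks bins whose target power is at or above the masking threshold, `H` marks bins that are inaudible in the target but audible in the model output, the threshold `W` is given in power units per frame and bin, and the result is averaged over frames. The function name `perceptual_error` is hypothetical.

```python
import numpy as np

def perceptual_error(O_mag, T_mag, W, w_s):
    """Perceptually weighted spectral error (illustrative sketch).

    O_mag, T_mag: (L, K) STFT magnitudes of model output and target.
    W:            (L, K) threshold of masking in power units.
    w_s:          (K,) frequency weighting, e.g. an inverted equal loudness curve.
    """
    O_pow, T_pow = O_mag ** 2, T_mag ** 2
    G = T_pow >= W                         # audible in the target: full error
    H = (~G) & (O_pow >= W)                # only the output is audible
    err_audible = (O_mag - T_mag) ** 2
    err_excess = (O_mag - np.sqrt(W)) ** 2  # distance from output to the threshold
    n_frames = O_mag.shape[0]
    return np.sum(w_s[None, :] * (G * err_audible + H * err_excess)) / n_frames

# Usage with a single frame and a single bin.
W = np.array([[1.0]])
w_s = np.array([1.0])
e_masked = perceptual_error(np.array([[0.5]]), np.array([[0.5]]), W, w_s)
e_excess = perceptual_error(np.array([[2.0]]), np.array([[0.5]]), W, w_s)
```

Bins that lie below the threshold in both sounds contribute nothing, as the text requires.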

DISCRETIZING THE PARAMETER SPACE
The number of data points in the parameter space can be reduced by discretizing the individual parameters in a perceptually reasonable manner. The range of parameters can be reduced to cover only all the possible musical tones and deviation steps can be kept just below the discrimination threshold.

Decay parameters
The audibility of variations in decay of the single string model in Figure 2 has been studied in [32]. Time constant $\tau$ of the overall decay was used to describe the loop gain parameter $g$ while the frequency-dependent decay was controlled directly by parameter $a$. Values of $\tau$ and $a$ were varied and relatively large deviations in parameters were claimed to be inaudible. Järveläinen and Tolonen [32] proposed that a variation of the time constant between 75% and 140% of the reference value can be allowed in most cases. An inaudible variation for the parameter $a$ was between 83% and 116% of the reference value.
The discrimination thresholds were determined with two different tone durations 0.6 second and 2.0 seconds. In our study, the judgement of similarity between two tones is done by comparing the entire signals and, therefore, the results from [32] cannot be directly used for the parametrization of a and g. The tolerances are slightly smaller because the judgement is made based on not only the decay but also the duration of a tone. Based on our informal listening test and including a margin of certainty, we have defined the variation to be 10% for the τ and 7% for the parameter a. The parameters are bounded so that all the playable musical sounds from tightly damped picks to very slowly decaying notes are possible to produce with the model. This results in 62 discrete nonuniformly distributed values for g and 75 values for a, as shown in Figures 5a and 5b. The corresponding amplitude envelopes of tones with different g parameter are shown in Figure 5c. Loop filter magnitude responses for varying parameter a with g = 1 are shown in Figure 5d.

Fundamental frequency and beating parameters
The fundamental frequency estimate $f_0$ from the calibrator is used as an initial value for both polarizations. When the fundamental frequencies of the two polarizations differ, the frequency estimate settles in the middle of the frequencies, as shown in Figure 6. Frequency discrimination thresholds as a function of frequency have been proposed in [33]. The audibility of beating and amplitude modulation has also been studied in [27]. These results do not directly give us the discrimination thresholds for the difference in the fundamental frequencies of the two-polarization string model, because the fluctuation strength in an output sound depends on the fundamental frequencies and the decay parameters $g$ and $a$.
The sensitivity of the parameters can be examined when a synthesized tone with known parameter values is used as a target tone with which another synthesized tone is compared. Varying one parameter at a time and freezing the others, we obtain the error as a function of the parameters. In Figure 7, the target values of $f_{0,v}$ and $f_{0,h}$ are 331 and 330 Hz. The solid line shows the error when $f_{0,v}$ is linearly swept from 327 to 344 Hz. The global minimum is obviously found when $f_{0,v} = 331$ Hz. Interestingly, another nonzero local minimum is found when $f_{0,v} = 329$ Hz, that is, when the beating is similar. The dashed line shows the error when both $f_{0,v}$ and $f_{0,h}$ are varied but the difference in the fundamental frequencies is kept constant. It can be seen that the difference is more dominant than the absolute frequency value and therefore has to be discretized with a higher resolution. Instead of operating on the fundamental frequency parameters directly, we optimize the difference $d_f = |f_{0,v} - f_{0,h}|$ and the mean frequency $f_0 = |f_{0,v} + f_{0,h}|/2$ individually. Combining previous results from [27,33] with our informal listening test, we have discretized $d_f$ with 100 discrete values and $f_0$ with 20. The range of variation is shown in Figure 8.
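The change of variables from the two polarization frequencies to their mean and absolute difference can be written as a small worked example. The inverse mapping below assumes that the vertical polarization takes the higher frequency; this convention is only an illustrative choice, since the absolute value discards the sign of the difference. The function names are hypothetical.

```python
def to_mean_and_difference(f0_v, f0_h):
    """Map the two polarization frequencies to (mean f_0, difference d_f)."""
    return abs(f0_v + f0_h) / 2.0, abs(f0_v - f0_h)

def to_polarization_frequencies(f0_mean, d_f):
    """Inverse mapping, assuming the vertical polarization is the higher one."""
    return f0_mean + d_f / 2.0, f0_mean - d_f / 2.0

# The target of Figure 7 and the local minimum share the same difference d_f.
target = to_mean_and_difference(331.0, 330.0)       # (330.5, 1.0)
local_min = to_mean_and_difference(329.0, 330.0)    # (329.5, 1.0)
```

Both cases give $d_f = 1$ Hz, which is consistent with the observation that the beating is similar at the local minimum.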

Other parameters
The tolerances for the mixing coefficients m p , m o , and g c have not been studied and the parameters have been earlier adjusted by trial and error [5]. Therefore, no initial guesses are made for these parameters. The sensitivities of the mixing coefficients are examined in an example case in Figure 9 Figure 10. This method is applied to the parameter g c , the range of which is limited to 0-0.5.
Discretizing the nine parameters this way results in $2.77 \times 10^{15}$ combinations in total for a single tone. For an acoustic guitar, about 120 tones with different dynamic levels and playing styles have to be analyzed. It is obvious that an exhaustive search is out of the question.
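The size of the search space is simply the product of the grid sizes. The check below uses the counts stated in the text for the decay parameters (62 and 75 values, assumed here to apply to each polarization separately) and for the frequency parameters (100 and 20 values). The count of 40 values for each of $m_p$, $m_o$, and $g_c$ is an assumption that is not stated in this section; it is chosen because it reproduces the quoted total.

```python
from math import prod

# Grid sizes per parameter. The last three entries are assumed, not quoted.
grid_sizes = {
    "g_h": 62, "g_v": 62,      # loop gain, one per polarization
    "a_h": 75, "a_v": 75,      # frequency-dependent decay, one per polarization
    "d_f": 100, "f_0": 20,     # frequency difference and mean frequency
    "m_p": 40, "m_o": 40, "g_c": 40,
}

total = prod(grid_sizes.values())

# Rough exhaustive-search time at 1.7 s per evaluated parameter set.
years = total * 1.7 / (3600 * 24 * 365)
```

Under these assumptions `total` is about $2.77 \times 10^{15}$, and an exhaustive search would take on the order of $10^8$ years for a single tone.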

GENETIC ALGORITHM
GAs mimic the evolution of nature and take advantage of the principle of survival of the fittest [34]. These algorithms operate on a population of potential solutions, improving the characteristics of the individuals from generation to generation. Each individual, called a chromosome, is made up of an array of genes that contain, in our case, the actual parameters to be estimated.
In the original algorithm design, the chromosomes were represented with binary numbers [35]. Michalewicz [36] showed that representing the chromosomes with floating-point numbers results in a faster, more consistent, higher precision, and more intuitive solution of the algorithm. We use a GA with the floating-point representation, although the parameter space is discrete, as discussed in Section 5. We have also experimented with the binary-number representation, but the execution time of the iteration becomes slow. The nonuniformly graduated parameter space is transformed into uniform scales on which the GA operates. The floating-point numbers are rounded to the nearest discrete parameter value. The original floating-point operators are discussed in [36], where the characteristics of the operators are also described. A few modifications to the original mutation operators (step 6 below) have been made to improve the operation of the algorithm with the discrete grid.
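The mapping between the uniform scale seen by the GA and a nonuniform parameter grid can be sketched as an index lookup. Here a gene is assumed to be a floating-point number on the index scale $[0, n - 1]$ of a grid with $n$ values; the function name `decode_gene` and the example grid are hypothetical.

```python
import numpy as np

def decode_gene(gene, grid):
    """Round a floating-point gene to the nearest grid index and look it up."""
    idx = int(np.clip(np.rint(gene), 0, len(grid) - 1))  # stay inside the grid
    return grid[idx]

# Usage: a nonuniform grid, denser near the upper boundary.
g_grid = np.array([0.90, 0.95, 0.98, 0.99])
value = decode_gene(1.4, g_grid)
```

The genetic operators can then work on plain floating-point numbers while every evaluated candidate stays on the perceptually motivated grid.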
The algorithm we use is implemented as follows.
(1) Analyze the recorded tone to be resynthesized using the analysis methods discussed in Section 3. The range of the parameter $f_0$ is chosen and the excitation signal is produced according to these results. Calculate the threshold of masking (Section 4) and the discrete scales for the parameters (Section 5).
(2) Initialization: create a population of $S_p$ individuals (chromosomes). Each chromosome is represented as a vector array $x$ with nine components (genes), which contains the actual parameters. The initial parameter values are randomly assigned.
(3) Fitness calculation: calculate the perceptual fitness of each individual in the current population according to (15).
(4) Selection of individuals: select individuals from the current population to produce the next generation based upon the individual's fitness. We use the normalized geometric selection scheme [37], where the individuals are first ranked according to their fitness values. The probability of selecting the $i$th individual to the next generation is then calculated by

$$P_i = q' (1 - q)^{r - 1}, \quad q' = \frac{q}{1 - (1 - q)^{S_p}},$$

where $q$ is the user-defined parameter which denotes the probability of selecting the best individual, and $r$ is the rank of the individual, where 1 is the best and $S_p$ is the worst. Decreasing the value of $q$ slows the convergence.
(5) Crossover: randomly pick a specified number of parents from the selected individuals. Offspring are produced by crossing the parents with simple, arithmetical, and heuristic crossover schemes. Simple crossover creates two new individuals by splitting the parents at a random point and swapping the parts. Arithmetical crossover produces two linear combinations of the parents with a random weighting. Heuristic crossover produces a single offspring $x_o$ which is a linear extrapolation of the two parents $x_{p,1}$ and $x_{p,2}$ as follows:

$$x_o = h (x_{p,2} - x_{p,1}) + x_{p,2},$$

where $0 \le h \le 1$ is a random number and the parent $x_{p,2}$ is not worse than $x_{p,1}$. Nonfeasible solutions are possible and if no solution is found after $w$ attempts, the operator gives no offspring. Heuristic crossover contributes to the precision of the final solution.
(6) Mutation: randomly pick a specified number of individuals for mutation. Uniform, nonuniform, multi-nonuniform, and boundary mutation schemes are used. Mutation works with a single individual at a time. Uniform mutation sets a randomly selected parameter (gene) to a uniform random number between the boundaries. Nonuniform mutation operates uniformly at an early stage and more locally as the current generation approaches the maximum generation. We have defined the scheme to operate in such a way that the change is always at least one discrete step. The degree of nonuniformity is controlled with the parameter $b$. Nonuniformity is important for fine-tuning. Multi-nonuniform mutation changes all of the parameters in the current individual. Boundary mutation sets a parameter to one of its boundaries and is useful if the optimal solution is supposed to lie near the boundaries of the parameter space. The boundary mutation is used in special cases, such as staccato tones.

Our algorithm is terminated when a specified number of generations is produced. The number of generations defines the maximum duration of the algorithm. In our case, the time spent with the GA operations is negligible compared to the synthesis and fitness calculation. Synthesis of a tone with candidate parameter values takes approximately 0.5 seconds, while the duration of the error calculation is 1.2 seconds. This makes 1.7 seconds in total for a single parameter set.
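Two of the operators in the steps above can be sketched compactly. The selection probabilities follow the normalized geometric ranking form $P_i = q'(1 - q)^{r - 1}$ with $q' = q / (1 - (1 - q)^{S_p})$, and the heuristic crossover uses the extrapolation $x_o = h(x_{p,2} - x_{p,1}) + x_{p,2}$ with a limited number of retries. Both formulas are the standard forms of these operators and are assumed here to match the ones used in the paper; the function names and the scalar bounds are hypothetical.

```python
import numpy as np

def geometric_selection_probs(pop_size, q):
    """Selection probability for ranks 1 (best) to pop_size (worst)."""
    r = np.arange(1, pop_size + 1)
    q_norm = q / (1.0 - (1.0 - q) ** pop_size)   # normalization so that the sum is 1
    return q_norm * (1.0 - q) ** (r - 1)

def heuristic_crossover(x_p1, x_p2, lower, upper, rng, retries=3):
    """Extrapolate from the worse parent x_p1 past the better parent x_p2.

    Returns None if no feasible offspring is found within the retries.
    """
    for _ in range(retries):
        h = rng.uniform(0.0, 1.0)
        x_o = h * (x_p2 - x_p1) + x_p2
        if np.all(x_o >= lower) and np.all(x_o <= upper):
            return x_o
    return None

# Usage with the population size and q reported for the experiments.
probs = geometric_selection_probs(60, 0.08)
rng = np.random.default_rng(0)
child = heuristic_crossover(np.array([0.2, 0.2]), np.array([0.4, 0.4]), 0.0, 1.0, rng)
```

In a full implementation, ranked individuals would be drawn with `rng.choice(pop_size, p=probs)`, and the offspring would be rounded to the discrete parameter grid before evaluation.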

EXPERIMENTATION AND RESULTS
To study the efficiency of the proposed method, we first tried to estimate the parameters for the sound produced by the synthesis model itself. First, the same excitation signal extracted from a recorded tone by the method described in [24] was used for target and output sounds. A more realistic case is simulated when the excitation for resynthesis is extracted from the target sound. The system was implemented with Matlab software and all runs were performed on an Intel Pentium III computer. We used the following parameters for all experiments: population size $S_p = 60$, number of generations = 400, probability of selecting the best individual $q = 0.08$, degree of nonuniformity $b = 3$, retries $w = 3$, number of crossovers = 18, and number of mutations = 18.
The pitch synchronous Fourier transform scheme, where the window length $L_w$ is synchronized with the period length of the signal such that $L_w = 4 f_s / f_0$, is utilized in this work. The Hanning windows overlap by 50%, implying that the hop size is $H = L_w / 2$. The sampling rate is $f_s = 44100$ Hz and the length of the FFT is $N = 2048$.
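As a worked example of the pitch synchronous framing, the sketch below computes the window length $L_w = 4 f_s / f_0$ and the hop size for 50% overlap. Rounding the window length to the nearest integer and using integer division for the hop are assumptions of this sketch, and the function name `analysis_frame` is hypothetical.

```python
def analysis_frame(f0, fs=44100.0):
    """Window length of four signal periods and hop size for 50% overlap."""
    win_len = int(round(4.0 * fs / f0))   # L_w in samples
    hop = win_len // 2                    # H = L_w / 2
    return win_len, hop

# Usage: a 330 Hz tone at the 44.1 kHz sampling rate.
win_len, hop = analysis_frame(330.0)
```

For a 330 Hz tone this gives a 535-sample window, which would be zero-padded to the FFT length of 2048.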
The original and the estimated parameters for three experiments are shown in Table 2. In experiment 1, the original excitation is used for the resynthesis. The exact parameters are estimated for the difference $d_f$ and for the decay parameters $g_h$, $g_v$, and $a_v$. The adjacent point in the discrete grid is estimated for the decay parameter $a_h$. As can be seen in Figure 7, the sensitivity of the mean frequency is negligible compared to the difference $d_f$, which might be the cause of deviations in mean frequency. Differences in the mixing parameters $m_o$, $m_p$, and the coupling coefficient $g_c$ can be noticed. When running the algorithm multiple times, no explicit optima for mixing and coupling parameters were found. However, synthesized tones produced by corresponding parameter values are indistinguishable. That is to say that the parameters $m_p$, $m_o$, and $g_c$ are not orthogonal, which is clearly a problem with the model and also impairs the efficiency of our parameter estimation algorithm.
To overcome the nonorthogonality problem, we have run the algorithm with constant values of $m_p = m_o = 0.5$ in experiment 2. If the target parameters are set according to the discrete grid, the exact parameters with zero error are estimated. The convergence of the parameters and the error in such a case is shown in Figure 11. Not only are the parameter values estimated precisely, but the convergence of the algorithm is also very fast. Zero error is already found in generation 87.
A similar behavior is noticed in experiment 3, where an extracted excitation is used for resynthesis. The difference and the decay parameters $g_h$ and $g_v$ are again estimated precisely. Parameters $m_p$, $m_o$, and $g_c$ drift as in the previous experiment. Interestingly, $m_p = 1$, which means that the straight path to the vertical polarization is totally closed. The model is, in a manner of speaking, rearranged in such a way that the individual string models are in series, as opposed to the original construction where the polarizations are arranged in parallel.
Unlike in experiments 1 and 2, the exact parameter values are not so relevant here, since different excitation signals are used for the target and estimated tones. Rather than looking at the parameter values, it is better to analyze the tones produced with the parameters. In Figure 12, the overall temporal envelopes and the envelopes of the first eight partials are presented for the target and for the estimated tone. As can be seen, the overall temporal envelopes are almost identical and the partial envelopes match well. Only the beating amplitude differs slightly, but the difference is inaudible. This indicates that the parametrization of the model itself is not the best possible, since similar tones can be synthesized with various parameter sets.
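A comparison of partial envelopes of this kind can be sketched as follows. This is only an illustration under stated assumptions: `partial_envelope_db` and `toy_tone` are hypothetical helpers, the single-bin DFT with a Hann window is one of several possible ways to track a partial, and the toy tone (two slightly mistuned, exponentially decaying sinusoids) is a stand-in for the string model, not the model itself.

```python
import cmath
import math

def partial_envelope_db(signal, fs, freq, frame=1024, hop=512):
    """Frame-wise level (dB) of one partial via a Hann-windowed single-bin DFT."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame) for n in range(frame)]
    kernel = [w * cmath.exp(-2j * math.pi * freq * n / fs) for n, w in enumerate(window)]
    norm = sum(window) / 2  # a unit-amplitude sinusoid at `freq` then reads about 0 dB
    env = []
    for start in range(0, len(signal) - frame + 1, hop):
        acc = sum(signal[start + n] * kernel[n] for n in range(frame))
        env.append(20 * math.log10(max(abs(acc) / norm, 1e-12)))
    return env

def toy_tone(fs, f0, df, decay, n_samples, mix=0.5):
    """Two mistuned decaying sinusoids; `mix` sets their relative amplitudes (assumption)."""
    return [
        math.exp(-decay * n / fs) * (
            mix * math.cos(2 * math.pi * (f0 + df / 2) * n / fs)
            + (1 - mix) * math.cos(2 * math.pi * (f0 - df / 2) * n / fs))
        for n in range(n_samples)
    ]

fs = 8000
target = toy_tone(fs, 220.0, 0.8, 3.0, fs)
estimate = toy_tone(fs, 220.0, 0.8, 3.0, fs, mix=0.55)
env_target = partial_envelope_db(target, fs, 220.0)
env_estimate = partial_envelope_db(estimate, fs, 220.0)
# Largest frame-wise level difference between the two envelopes, in dB.
max_deviation_db = max(abs(a - b) for a, b in zip(env_target, env_estimate))
```

In this toy setting the two tones share the same frequency difference and decay, so their envelopes differ only in the depth of the beating, which loosely mirrors the observation that only the beating amplitude differs between the target and the estimated tone.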
Our estimation method is designed to be used with real recorded tones. Time and frequency analysis for such a case is shown in Figure 13. As can be seen, the overall temporal envelope and the partial envelopes of the recorded tone are very similar to those analyzed from the tone synthesized with the estimated parameter values. Appraisal of the perceptual quality of synthesized tones is left as a future project, but our informal listening indicates that the quality is comparable with or better than that of our previous methods, and no hand tuning is required after the estimation procedure. Sound clips demonstrating these experiments are available at http://www.acoustics.hut.fi/publications/papers/jasp-ga.

Figure 11 (caption fragment): ... Table 2. Mixing coefficients are frozen as m p = m o = 0.5 to overcome the nonorthogonality problem. One hundred and fifty generations are shown and the original excitation is used for the resynthesis.

Figure 12 (caption fragment): ... Table 2. The synthesized target tone is produced with known parameter values and the synthesized tone uses estimated parameter values. Extracted excitation is used for the resynthesis.

CONCLUSIONS AND FUTURE WORK
A parameter estimation scheme based on a GA with a perceptual fitness function was designed and tested for a plucked string synthesis algorithm. The synthesis algorithm is used for natural-sounding synthesis of various string instruments. For this purpose, automatic parameter estimation is needed. Previously, the parameter values have been extracted from recordings using more traditional signal processing techniques, such as short-term Fourier transform, linear regression, and linear digital filter design. Some of the parameters could not be reliably estimated from the recorded sound signal and had to be fine-tuned manually by an expert user.
In this work, we presented a fully automatic parameter extraction method for string synthesis. The fitness function we use employs knowledge of properties of the human auditory system, such as frequency-dependent sensitivity and frequency masking. In addition, a discrete parameter space has been designed for the synthesizer parameters. The range, the nonuniformity of the sampling grid, and the number of allowed values for each parameter were chosen based on former research results, experiments on parameter sensitivity, and informal listening.
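A nonuniform discrete grid for a single parameter, together with the mapping of a continuous value to its nearest grid point, can be sketched as below. The spacing rule is an assumption chosen purely for illustration (uniform in log(1 - g), so that a gain-like parameter is sampled more densely near 1); the actual ranges and grids in this work were chosen from earlier results, sensitivity experiments, and informal listening. The names `loss_grid` and `nearest_index` are hypothetical.

```python
import bisect
import math

def loss_grid(lo, hi, n_points):
    """Ascending grid in (0, 1), uniform in log(1 - g), hence denser near 1."""
    a, b = math.log(1 - lo), math.log(1 - hi)
    return [1 - math.exp(a + (b - a) * k / (n_points - 1)) for k in range(n_points)]

def nearest_index(grid, value):
    """Index of the grid point closest to `value` (grid must be ascending)."""
    i = bisect.bisect_left(grid, value)
    if i == 0:
        return 0
    if i == len(grid):
        return len(grid) - 1
    return i if grid[i] - value < value - grid[i - 1] else i - 1

# Hypothetical range and resolution for a loss-like parameter.
grid = loss_grid(0.9, 0.999, 16)
index = nearest_index(grid, 0.9875)
quantized = grid[index]
```

With such a representation each individual in the GA only needs to store one integer index per parameter, and a target value that does not lie on the grid can at best be matched by an adjacent grid point.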
The system was tested with both synthetic and real tones. The signals produced with the synthesis model itself are considered a particularly useful class of test signals because there will always be a parameter set that exactly reproduces the analyzed signal (although discretization of the parameter space may limit the accuracy in practice). Synthetic signals offered an excellent tool to evaluate the parameter estimation procedure, which was found to be accurate with two choices of excitation signal to the synthesis model. The quality of resynthesis of real recordings is more difficult to measure as there are no known correct parameter values. As high-quality synthesis of several plucked string instrument sounds has been possible in the past with the same synthesis algorithm, we expected to hear good results using the GA-based method, which was also the case.
Appraisal of synthetic tones that use parameter values from the proposed GA-based method is left as a future project. Listening tests similar to those used for evaluating high-quality audio coding algorithms may be useful for this task.