EURASIP Journal on Applied Signal Processing 2005:12, 1807–1820
© 2005 Hindawi Publishing Corporation

Reduced-Complexity Deterministic Annealing for Vector Quantizer Design

This paper presents a reduced-complexity deterministic annealing (DA) approach for vector quantizer (VQ) design by using soft information processing with simplified assignment measures. Low-complexity distributions are designed to mimic the Gibbs distribution, which is the optimal distribution used in the standard DA method. These low-complexity distributions are simple enough to facilitate fast computation, but at the same time they can closely approximate the Gibbs distribution to result in near-optimal performance. We have also derived the theoretical performance loss at a given system entropy due to using the simple soft measures instead of the optimal Gibbs measure. We use the derived result to obtain optimal annealing schedules for the simple soft measures that approximate the annealing schedule for the optimal Gibbs distribution. The proposed reduced-complexity DA algorithms significantly improve the quality of the final codebooks compared to the generalized Lloyd algorithm and standard stochastic relaxation techniques, both with and without the pairwise nearest neighbor (PNN) codebook initialization. The proposed algorithms are able to evade local minima, and the results show that they are not sensitive to the choice of the initial codebook. Compared to the standard DA approach, the reduced-complexity DA algorithms can operate over 100 times faster with negligible performance difference. For example, for the design of a 16-dimensional vector quantizer having a rate of 0.4375 bit/sample for a Gaussian source, the standard DA algorithm achieved 3.60 dB performance in 16 483 CPU seconds, whereas the reduced-complexity DA algorithm achieved the same performance in 136 CPU seconds. Other than VQ design, the DA techniques are applicable to problems such as classification, clustering, and resource allocation.


INTRODUCTION
Vector quantization is a source coding technique that approximates blocks (or vectors) of input data by one of a finite number of prestored vectors in a codebook. The challenge is to find the set of vectors (or quantization levels) such that a given criterion for the total distortion between the actual source and the quantized source is as small as possible under a constraint on the overall rate [1]. Since distortion depends on the codebook design, vector quantizer design is a key optimization problem that determines the performance of a VQ-based system [2,3,4,5].
The traditionally used VQ design approach is the generalized Lloyd algorithm (GLA), also referred to as the LBG algorithm [6]. The GLA is an iterative descent algorithm that converges to a final codebook relatively quickly, but the resulting codebook is only locally optimal. This is because the algorithm gets trapped in the local minimum of the distortion (energy) surface to which the initial codebook is closest. Consequently, the performance of GLA can be poor compared to that of a globally optimal quantizer.
A powerful approach to reduce the sensitivity of the algorithm to the initial codebook is the introduction of randomness. Several randomized optimization techniques have been investigated in the past. In [7] such "random search" techniques are discussed, where the idea is to randomly perturb the system at each iteration and determine the resulting change in performance. In some of its variations, a perturbation is only accepted if the performance increases, otherwise it is rejected; in other variations, perturbations that decrease performance are also accepted under certain conditions. In general, if a random search technique allows temporary decreases in an objective function with nonzero probability, then the algorithm is in the class of stochastic relaxation (SR) [8,9], or stochastic local search techniques.
An important SR technique is simulated annealing (SA) [9], where in each iteration a new codebook is generated in the neighborhood of the old one, and the new codebook is accepted or rejected according to the Metropolis algorithm [10]. If sufficient computational resources are devoted, the SA algorithm is guaranteed to yield globally optimal solutions [11]. A reduced-complexity quantizer design based on SR is proposed in [9], achieving similar or slightly better results than SA in much less time under similar iteration schedules. Basically, the reduced-complexity SR algorithm is the generalized Lloyd algorithm together with a stochastic perturbation step that could be applied either to the encoder (SR-C) or to the decoder (SR-D). Another method that uses a similar randomized search technique is suggested in [12] and has an average performance comparable to SR-D but with higher complexity. In the above approaches, random search moves are allowed on the energy surface in order to give the system the ability to avoid local minima.
Unlike these SR techniques, a deterministic annealing (DA) approach for optimal vector quantizer design puts the problem in a probabilistic framework, and deterministically optimizes the probabilistic objective function in each iteration [13]. In DA there are no random moves on the energy (cost) surface. At high temperatures, the energy surface is smoothed, so that the algorithm starts at the global minimum on the smoothed energy surface. Through a careful annealing schedule, the algorithm traces the global minimum as the energy surface assumes its nonconvex "rugged" form with decreasing temperature. The Gibbs distribution is used to associate sample vectors in the training set with codevectors since it maximizes the entropy under the constraint of a given average distortion. Note that the sample vector-codevector associations are not one-to-one, but rather one-to-many. In other words, each sample vector in the training set is assigned to all codevectors with a given probability: a sample vector gets assigned to each codevector with a probability that increases as the distance to that codevector decreases.
The DA method can construct high-performance vector quantizers by avoiding local minima. The main obstacle to its widespread utilization is the very significant computational complexity involved. Note that while codebook design is a one-time cost in many applications, there are others where codebook adaptation is required. Our focus then is on reducing codebook design complexity for these situations. The computational complexity of the DA approach is due to three factors: first, computation of the Gibbs distribution to obtain each association probability involves the computation of an exponential term; second, this function has to be computed for a very large number of pairings (as many pairings as codevectors, for each of the training samples); third, this process has to be repeated for each iteration of the annealing process. Our proposed technique addresses all of these factors by (i) defining simple functions to compute the association probability, (ii) reducing the number of association probabilities to compute per training vector, and (iii) speeding up the iterations in the annealing process.
Our proposed reduced-complexity deterministic annealing approach is based on using soft information processing with simplified assignment measures. In digital communications engineering, soft information is "a reliability measure over the sample space of the investigated random variable." More specifically, when considering multiple decoding choices for a received noisy signal, soft information provides a measurement of how much confidence there is in making each possible decoded signal choice [29,30,31]. By analogy, in our VQ design, we will call soft information the relative "strength" of the association of each codevector to each sample vector. We will refer to this formulation as a soft vector quantizer (SVQ) design, where the encoding is "soft" in that each input vector in the training set gets assigned to multiple codewords. We develop reduced-complexity DA techniques through the design of simple soft measures that can mimic the effect of the Gibbs distribution used in the standard DA. Hence, while the designed soft measures are simple enough to facilitate fast computation, they also keep the performance penalty to a minimum by mimicking the Gibbs distribution's functionality. We also derive a theoretical analysis of the performance loss when using a simplified measure instead of the optimal one, and further use the result to derive optimal annealing schedules for the proposed simple soft measures. In contrast to the standard DA, which starts with essentially a single codevector and increases the size of the codebook through iterations, in SVQ the design starts with the required number of codevectors and optimizes their locations through iterations. It is also observed, and empirically shown, that the importance of a codevector for a given sample vector (in terms of the amount of probability mass associated with it) decreases exponentially fast with the distance from the sample vector, even at relatively high temperatures. Hence, major computational gains can be obtained with negligible performance degradation by considering only the nearest few codevectors from each sample vector. We present experimental evidence indicating that through these techniques significant performance gains are achieved by the SVQ algorithms over the traditionally used GLA and over SR-D, where the latter is widely thought to provide near-optimal performance. Compared to the standard DA, the results show drastic reductions in computational complexity with very small sacrifice in performance. It is also shown that using the SR technique [9] in combination with the SVQ algorithms results in further improvement in performance, although the added benefits of the SR techniques are smaller when the performance of the SVQ approaches the optimal.
This paper is organized as follows. Section 2 briefly introduces the standard DA method for VQ design and points out its computational complexities. In Section 3, the reduced-complexity Gibbs distribution and low-complexity soft information measures for VQ design are proposed and analyzed. Optimal annealing schedules for the low-complexity soft information measures are also derived. Section 4 presents experimental results comparing the performances of the proposed algorithms with that of GLA, SR, and the standard DA on various Gauss-Markov, speech, and image sources. The effect of codebook initialization is also investigated. Finally, Section 5 concludes the paper.

VECTOR QUANTIZER DESIGN BY DETERMINISTIC ANNEALING
In the deterministic annealing algorithm proposed by Rose et al. [13], the main principle is the application of a probabilistic hierarchical clustering process, where each sample vector in the training set is associated with a cluster with a certain degree of membership. Each cluster $R_j$ is represented by a codevector $c_j$. Thus, the distortion (energy) function to be minimized is an expected distortion function,

$$D = \sum_{x} \sum_{j} P(x \in R_j)\, d(x, c_j), \tag{1}$$

where $d(x, c_j)$ is the distortion measure incurred in representing sample vector $x$ by codevector $c_j$, and $P(x \in R_j)$ is the probability that $x$ belongs to the cluster represented by $c_j$. The probability distribution used to define the associations is the Gibbs distribution, which maximizes the entropy under the constraint (1) [13]:

$$P(x \in R_j) = \frac{e^{-\beta d(x, c_j)}}{\sum_{l=0}^{|\mathcal{R}|-1} e^{-\beta d(x, c_l)}}, \tag{2}$$

where $|\mathcal{R}|$ is the cardinality of the cluster set $\mathcal{R} = \{R_0, R_1, \ldots, R_{|\mathcal{R}|-1}\}$. Notice that the distribution in (2) is a form of soft information. In other words, it gives a reliability value for assigning the sample vector $x$ to cluster $R_j$ over the sample space of the cluster set. The parameter β is inversely proportional to the temperature in the annealing process. Hence, at infinite temperature, which corresponds to β = 0, the probability associations are uniform:

$$P(x \in R_j) = \frac{1}{|\mathcal{R}|}$$

for all $x$, $j$. This means that each sample vector $x$ is equally assigned to all the clusters. As β gets large, that is, as the temperature is lowered, the probability assignments for a sample vector $x$ start to favor clusters closer to $x$; the closer a cluster representative $c_j$ is to $x$, the higher its probability assignment. In the limit as β → ∞, each sample vector gets assigned to exactly one cluster, namely, the cluster corresponding to the nearest codevector. We refer to this as the hard assignment, as opposed to the soft assignment where a sample vector gets assigned to more than one representative.
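The role of β as a softness control can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes a squared Euclidean distortion, and the function name `gibbs_assignments` and the toy codebook are hypothetical.

```python
import math

def gibbs_assignments(x, codebook, beta):
    """Gibbs soft assignment of sample x to every codevector."""
    # Assumption: squared Euclidean distortion d(x, c_j).
    d = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in codebook]
    d_min = min(d)  # shifting by the minimum leaves the ratios unchanged
    w = [math.exp(-beta * (dj - d_min)) for dj in d]
    total = sum(w)
    return [wj / total for wj in w]

codebook = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]  # hypothetical codevectors
x = (0.2, 0.0)
for beta in (0.0, 1.0, 50.0):
    print(beta, [round(p, 3) for p in gibbs_assignments(x, codebook, beta)])
```

At β = 0 the three probabilities are equal, and at large β nearly all of the mass moves to the nearest codevector, which is the hard-assignment limit described above.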
The codevector locations are defined as the weighted average of the sample vectors, where the weights are the probability associations of the sample vectors to the specific codevector being considered:

$$c_j = \frac{\sum_{x} x\, P(x \in R_j)}{\sum_{x} P(x \in R_j)}. \tag{3}$$

Thus, at β = 0 all cluster representatives are at the center of mass of the training set,

$$c_j = \frac{1}{K} \sum_{x} x, \tag{4}$$

where $K$ is the number of sample vectors in the training set. Essentially, at β = 0, there is only one cluster (or Voronoi region), which is the whole set, and a single representative codevector at its center of mass. The hierarchical design algorithm in [13] starts the annealing process with the whole training set as one cluster at β = 0, gradually increases β, and reoptimizes by solving (3) at each β. As β is increased, the probability associations start to get "harder", that is, more biased towards the closest codevector, and the system goes through a sequence of cluster splits at phase transitions until the required number of clusters (or codevectors) is reached. The main focus in [13] is the derivation of the critical values of β, denoted by $\beta_c$, at which these phase transitions occur. The authors show that in order to be able to attain the global minimum, the splitting of the clusters should be at these critical moments, which correspond to optimal cluster splitting temperatures. Note that β does not control the size of the codebook; the system goes through a sequence of phase transitions until the required number of representatives (codevectors) is reached. During the annealing process, whenever β reaches $\beta_c$ for an existing cluster, that cluster splits into smaller clusters. In the limit β → ∞, the associations become hard and each sample vector is associated with one representative as in the GLA.
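The weighted-average update can be sketched as follows. This is an illustrative sketch only; the function name `update_codevectors` and the layout `probs[k][j]` (association of sample k with cluster j) are assumptions made for the example.

```python
def update_codevectors(samples, probs):
    """Return probability-weighted centroids; probs[k][j] = P(x_k in R_j)."""
    n_codes = len(probs[0])
    dim = len(samples[0])
    codebook = []
    for j in range(n_codes):
        mass = sum(p[j] for p in probs)  # total association mass of cluster j
        centroid = [sum(p[j] * x[l] for x, p in zip(samples, probs)) / mass
                    for l in range(dim)]
        codebook.append(centroid)
    return codebook

samples = [(0.0, 0.0), (2.0, 0.0), (4.0, 6.0)]
uniform = [[0.5, 0.5] for _ in samples]          # beta = 0, two codevectors
hard = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]      # beta -> infinity
print(update_codevectors(samples, uniform))
print(update_codevectors(samples, hard))
```

With uniform associations both codevectors coincide with the center of mass of the training set, while hard associations reduce the update to the ordinary centroid rule of the GLA.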
The work by Rose et al. [13] provides the theoretical framework explaining how the DA approach avoids local minima, and shows that through the proposed annealing process a global minimum can be achieved. However, for practical applications this algorithm has some drawbacks: in particular, the annealing of the temperature has to be very slow, especially in the vicinity of $\beta_c$, and the association probabilities for each sample vector have to be calculated for all codevectors. Such complexity may be excessive for many applications. In the next section, we present and analyze reduced-complexity techniques for VQ design that result in very significant computational gains with negligible performance degradation. In the sequel, we will refer to the method explained in this section as the standard DA.

Introduction
In our proposed soft vector quantizer (SVQ) algorithms, we formulate the vector quantizer design problem in a probabilistic framework as in the standard DA. However, unlike standard DA where each training vector is associated in probability with all of the codevectors, in SVQ algorithms each training vector is allowed to be associated with only a subset of the codevectors. These probability associations provide the relative reliability of each of the codevectors that the training vector can be mapped to and are a function of the relative distances of the codevectors to the training vector. The cost of the computation of the Gibbs soft assignment in (2), which involves exponentials, is high: if we count each of the basic operations (addition, subtraction, and multiplication) as one floating point operation (flop), then an exponential computation takes 8 flops. Since soft assignments for all codevectors have to be updated for all sample vectors in every iteration in the standard DA, this results in a system of very high computational complexity. In order to reduce the computational complexity of the system, we would like to define and use a simpler distribution, preferably one that does not involve exponential terms. We define a general "simple" distribution as

$$p_0(c_i | x_k) = \frac{\mu(x_k, c_i)}{\sum_{c_j \in \mathcal{N}(x_k, N)} \mu(x_k, c_j)}, \tag{5}$$

where $\mu(x_k, c_i)$ is a computationally simple measure of the goodness of match of codevector $c_i$ to sample vector $x_k$. The denominator is the sum of the goodness of matches with respect to the subset $\mathcal{N}(x_k, N)$ of $N$ codevectors that are most relevant to $x_k$, where $N \le |C|$ and $|C|$ is the codebook size. Note that, when $N = |C|$, all of the codevectors are regarded as relevant. Therefore, in (5) the softness of the assignment can be controlled by adjusting $N$ (as $N$ is reduced the assignment becomes harder). Recall that in (2) the term β acts as a softness control factor (i.e., as β increases assignments get harder), but for any given β assignment to all the codevectors was required. Thus, using a simple function $\mu(x_k, c_i)$ coupled with $N \ll |C|$ can result in major computational gains at the expense of some reduction in performance, as only a fraction of the codevector assignments is required.
For a given set of soft assignments, $p_0(c_i | x_k)$, for all $i$, for all $k$, the codevector locations can be computed as the weighted average of the sample vectors as in (3),

$$c_i = \frac{\sum_{k} p_0(c_i | x_k)\, x_k}{\sum_{k} p_0(c_i | x_k)}, \tag{6}$$

where sample vector probabilities are assumed to be uniform, $p(x_k) = 1/K$, and where $K$ is the size of the training set. The general iterative framework for updating the soft assignments and codevector locations is shown in Figure 1. Note that this framework is independent of the type of soft assignment used and can be applied for any choice of $\mu(x_k, c_i)$.
Ideally, in any annealing algorithm, the annealing temperature should start at a very high level (theoretically infinite) and gradually cool down to zero. However, as we have seen in the standard DA, this results in very slow convergence. In our proposed SVQ algorithm, the temperature is not infinite at the start; we demonstrate that starting with a low temperature and with a fixed (required) number of codevectors, it is possible to achieve near-optimal performance, even though starting with a low temperature means starting the algorithm with a nonconvex energy surface. We show that the introduction of controlled randomness into the iterations, as in standard SR, helps in reducing the impact of this nonconvexity on the final design.

Reduced-complexity Gibbs distribution for VQ design
We know that as a result of soft association every sample vector $x_k$ has a certain degree of belonging to all of the codevectors in the codebook. However, when we take all the soft associations into account, the effect of very small soft associations on (6) and on the converged codebook is negligible. Thus, in order to reduce the complexity, we compute the soft associations only for the $N$ closest codevectors and set the other $|C| - N$ associations to zero. Note that the size of the codebook is not changed. This approach requires computing the distance from one input to all codevectors (which is required for any VQ design) and identifying the $N$ nearest codevectors. Note that after the first few iterations the $N$ nearest codevectors for the training vectors need not be computed in every iteration, because the displacement of the codevectors from one iteration to the next tends to be small. Periodic updates of the list of $N$ closest codevectors can be used to prevent changes in this list from leading to inaccurate assignment probabilities. Denoting by $\mathcal{N}(x_k, N)$ the nearest $N$ codevectors from a given $x_k$, the soft information is computed by

$$p_G(c_i | x_k) = \begin{cases} \dfrac{e^{-\beta d(x_k, c_i)}}{\sum_{c_j \in \mathcal{N}(x_k, N)} e^{-\beta d(x_k, c_j)}}, & c_i \in \mathcal{N}(x_k, N), \\ 0, & \text{otherwise}, \end{cases} \tag{7}$$

where $d(x_k, c_i)$ is the distortion between $x_k$ and $c_i$. Note that with this kind of exponential function the assignment probabilities decay very rapidly, even when β is low.
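The effect of keeping only the nearest N codevectors can be sketched in a few lines. This is an illustrative sketch, assuming a squared Euclidean distortion; the function name `reduced_gibbs` and the one-dimensional toy codebook are hypothetical.

```python
import math

def reduced_gibbs(x, codebook, beta, n_nearest):
    """Gibbs soft assignment restricted to the n_nearest codevectors of x."""
    # Assumption: squared Euclidean distortion.
    d = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in codebook]
    nearest = sorted(range(len(codebook)), key=lambda j: d[j])[:n_nearest]
    d_min = d[nearest[0]]
    w = {j: math.exp(-beta * (d[j] - d_min)) for j in nearest}
    total = sum(w.values())
    probs = [0.0] * len(codebook)  # associations outside the subset stay zero
    for j in nearest:
        probs[j] = w[j] / total
    return probs

codebook = [(float(i),) for i in range(6)]  # hypothetical 1-D codebook
x = (0.3,)
full = reduced_gibbs(x, codebook, beta=1.0, n_nearest=len(codebook))
top3 = reduced_gibbs(x, codebook, beta=1.0, n_nearest=3)
print(max(abs(a - b) for a, b in zip(full, top3)))
```

In this toy setting the largest difference between the full and the truncated assignments is small, because the discarded codevectors carry very little probability mass.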

Fixed number of associations
One can first consider an approach with a fixed $N$. Experimentally, we have found $N = 4$ to be a good tradeoff between performance and complexity. In other words, setting $N = 4$ instead of $N = |C|$ (i.e., using all the codevectors in the codebook) results in a negligible performance difference, while the computational savings are significant, especially for large codebooks (e.g., $|C| > 128$). A comparison of $N = 4$ and $N = |C| = 128$ using the same annealing schedule is given in Table 1. The loss in performance incurred by considering only the nearest 4 codevectors for each sample vector instead of the whole codebook is negligible for all practical purposes, while a very significant complexity reduction (about a factor of 120 speedup) is achieved.
The proposed algorithm is shown in Figure 2, where the iterations start with $N = 4$, $\beta = 1/(4\sigma_X^2)$, where $\sigma_X^2$ is the source variance, and an initial codebook $C_0$. The initial value $\beta = 1/(4\sigma_X^2)$ was found empirically to be a good starting value. At each iteration, we increase β by a factor κ > 1.0, update the soft information according to (7), and reoptimize the codevector locations using (6). We can then apply the codevector perturbation method of [9]. As β increases, the softness of the codevector associations decreases. In the limit, when all the probability mass is assigned to the nearest codevector for all sample vectors, we reach the nearest neighbor condition.
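The iteration described above can be sketched as a short loop. This is a simplified sketch under stated assumptions, not the algorithm of Figure 2 itself: it uses a squared Euclidean distortion, a fixed iteration count as the stopping rule, recomputes the nearest-codevector list at every iteration, and omits the perturbation step of [9]. All function names and the toy data are hypothetical.

```python
import math

def sq_dist(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))

def soft_assign(x, codebook, beta, n_nearest):
    """Gibbs weights over the n_nearest codevectors, as {index: probability}."""
    d = [sq_dist(x, c) for c in codebook]
    nearest = sorted(range(len(codebook)), key=lambda j: d[j])[:n_nearest]
    d_min = d[nearest[0]]
    w = {j: math.exp(-beta * (d[j] - d_min)) for j in nearest}
    total = sum(w.values())
    return {j: wj / total for j, wj in w.items()}

def svq_gibbs(samples, codebook, kappa=1.1, n_nearest=4, n_iter=60):
    dim, k = len(samples[0]), len(samples)
    mean = [sum(x[l] for x in samples) / k for l in range(dim)]
    variance = sum(sq_dist(x, mean) for x in samples) / (k * dim)
    beta = 1.0 / (4.0 * variance)            # empirical starting value
    codebook = [list(c) for c in codebook]
    for _ in range(n_iter):                  # assumption: fixed iteration count
        mass = [0.0] * len(codebook)
        weighted = [[0.0] * dim for _ in codebook]
        for x in samples:
            for j, p in soft_assign(x, codebook, beta, n_nearest).items():
                mass[j] += p
                for l in range(dim):
                    weighted[j][l] += p * x[l]
        for j, m in enumerate(mass):
            if m > 0.0:                      # keep a codevector that got no mass
                codebook[j] = [v / m for v in weighted[j]]
        beta *= kappa                        # geometric beta schedule
    return codebook

samples = [(v,) for v in (-1.0, -0.5, 0.0, 0.5, 1.0, 9.0, 9.5, 10.0, 10.5, 11.0)]
initial = [(-1.0,), (-0.5,)]                 # both start inside one cluster
final = sorted(c[0] for c in svq_gibbs(samples, initial, n_nearest=2))
print(final)
```

In this toy example both initial codevectors lie in the same cluster, yet the annealed updates end with one codevector near each cluster center, which illustrates the reduced sensitivity to initialization discussed in the text.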

Variable number of associations
When we need to design quantizers for very large codebook sizes (e.g., $|C| = 512, 1024, \ldots$), it is useful to use $N$ larger than 4 (e.g., 10, 12, 15, ...). However, we know that while the $N$th nearest codevector (the farthest of those retained) for a given sample vector plays an important role (has large probability mass) in the early iterations, its importance decreases in each iteration. As the temperature decreases, the probability mass is gradually transferred from the distant to the closer codevectors. Hence, after a while, the $N$th codevector will contain negligible mass and it can be discarded without any significant effect on the final performance. Thus, to simplify the computation without affecting the performance, we can modify our algorithm as follows: whenever the average probability mass of the nearest $N - 1$ codevectors,

$$PM(N-1) = \frac{1}{K} \sum_{k} \sum_{c_i \in \mathcal{N}(x_k, N-1)} p_G(c_i | x_k), \tag{8}$$

exceeds a certain mass π (e.g., π = 0.99), $N$ is reduced by one, where $K$ is the size of the training set. Gradually decreasing $N$ as the temperature changes results in considerable complexity reduction, except when $N$ becomes small. This is because reducing $N$ requires computing $PM(N-1)$ at each iteration, which adds to the overall complexity. Therefore, in our algorithm, when $N$ reaches a small value, for example, $N = 4$, the process of gradual reduction of $N$ stops.
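The reduction rule for N can be sketched as below. This is an illustrative sketch; the function names, the parameter `n_min`, and the example probabilities are assumptions, and each row of `probs_sorted` is taken to list one sample's soft assignments ordered from the nearest to the farthest retained codevector.

```python
def average_mass_nearest(probs_sorted, n_keep):
    """Average probability mass carried by the n_keep nearest codevectors."""
    return sum(sum(p[:n_keep]) for p in probs_sorted) / len(probs_sorted)

def maybe_reduce_n(probs_sorted, n, pi=0.99, n_min=4):
    """Drop the farthest retained codevector once it carries negligible mass."""
    if n > n_min and average_mass_nearest(probs_sorted, n - 1) > pi:
        return n - 1
    return n

# Hypothetical soft assignments for two samples with N = 5.
probs = [[0.6, 0.3, 0.08, 0.015, 0.005],
         [0.7, 0.2, 0.07, 0.028, 0.002]]
print(maybe_reduce_n(probs, 5))
```

Here the nearest four codevectors hold more than 99% of the mass on average, so N drops from 5 to 4; after such a reduction the remaining assignments would be renormalized at the next soft-information update.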

Low-complexity soft information measures for VQ design
As previously stated, in order to further reduce the computational complexity of the system, we can use in (5) a less complex distribution than the optimal Gibbs distribution. One of the simplest distributions that readily comes to mind is the "inverse Euclidean distance" distribution, in which, for a given sample vector $x_k$, the "importance" of the codevectors decreases with increasing distance from $x_k$. The "inverse Euclidean distance" soft information measure can be defined as

$$p_0(c_i | x_k) = \frac{1 / d(x_k, c_i)}{\sum_{c_j \in \mathcal{N}(x_k, N)} 1 / d(x_k, c_j)}. \tag{9}$$

The distances in (9) are the Euclidean norms between $x_k$ and the codevectors ($n$ is the vector dimension),

$$d(x_k, c_i) = \|x_k - c_i\|_2 = \Big( \sum_{l=1}^{n} (x_{k,l} - c_{i,l})^2 \Big)^{1/2}.$$

The number of codevectors to be taken into consideration for each $x_k$ can be determined by a circle centered on $x_k$ with radius $R$. $R$ is chosen so that it includes the $N$ nearest codevectors to $x_k$. The radius $R$ decreases from one iteration to the next, $R^{(m)} = \rho R^{(m-1)}$, where $0 < \rho < 1.0$. Note that as $R$ decreases, at some point the circle will include fewer than $N$ codevectors; when that happens, we take only those codevectors within the circle into consideration.
Another soft information measure can be defined using a triangle function centered on the sample vector $x_k$, as shown in Figure 3. The function, with height $h = 1$ and spread $R_x$, is chosen such that it will contain the $N$ codevectors within a Euclidean distance $R_x$ from $x_k$. Using fuzzy systems terminology, we can define this triangle function as the membership function of $x_k$ and denote it by $m_x$. The soft associations are computed by using the heights of the membership function corresponding to the Euclidean distances of the codevectors from $x_k$,

$$p_0(c_i | x_k) = \frac{m_x\big(d(x_k, c_i)\big)}{\sum_{c_j \in \mathcal{N}(x_k, N)} m_x\big(d(x_k, c_j)\big)}. \tag{10}$$

The spread $R_x$ decreases gradually in each iteration, giving more and more importance to the nearer codevectors as the iterations proceed. Note that we start with a fixed $N$, but as $R_x$ decreases the spread will include fewer than $N$ codevectors; when that happens, we take only those codevectors within the spread into consideration. In the limit, when only one codevector stays within the nearest neighbor set (spread), that is, $N = 1$, the soft information measure becomes hard and all the probability mass gets assigned to the nearest codevector. Note that as the spread is decreased, $N = 1$ will be reached earlier for some sample vectors than for others, since the nearest codevector distance is not the same for every sample vector.
As the spread continues to decrease, at some point for some sample vectors, $R_x < d(x_k, c_i)$ for all $i$. In these cases, the algorithm assigns all the probability mass to the nearest codevector. The spread at the $m$th iteration is controlled by a geometric schedule as in the Gibbs case:

$$R_x^{(m)} = \rho\, R_x^{(m-1)}, \tag{11}$$

where ρ is the reduction factor, $0 < \rho < 1.0$. The soft information measure in (10) can be defined in terms of the spread $R_x$ and the distances $d_i = d(x_k, c_i)$ using triangular similarities, where the height of the triangle is $h = 1$:

$$m_x(d_i) = \frac{R_x - d_i}{R_x}. \tag{12}$$

Therefore, (10) becomes

$$p_0(c_i | x_k) = \frac{R_x - d_i}{N R_x - \sum_{c_j \in \mathcal{N}(x_k, N)} d_j}. \tag{13}$$

This is a better soft information measure than the inverse Euclidean distance measure in (9), because it can better mimic the effect of the temperature reduction in the Gibbs distribution. With (9), the only time the soft assignments change as the radius is decreased is when a codevector is left out of the circle of radius $R$. With (10), however, the heights are affected by the reduction in the spread $R_x$, as seen in (13). This is desired in order to approximate the effect of the temperature reduction in the Gibbs distribution, because as the spread decreases the codevectors closer to $x_k$ should increase their relative share of the soft assignment in conformity with their distances from $x_k$.
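The behavior of the triangular measure as the spread shrinks can be checked with a small sketch. This is illustrative only; the function name `triangular_assignments` and the example distances are hypothetical, and the input is assumed to be the list of Euclidean distances from one sample vector to its candidate codevectors.

```python
def triangular_assignments(distances, spread):
    """Soft assignments from a unit-height triangle of half-width `spread`."""
    inside = [d for d in distances if d < spread]
    if not inside:
        # Spread smaller than every distance: all mass to the nearest codevector.
        nearest = min(range(len(distances)), key=lambda i: distances[i])
        return [1.0 if i == nearest else 0.0 for i in range(len(distances))]
    # Sum of the triangle heights times the spread: N * R_x - sum of distances.
    denom = len(inside) * spread - sum(inside)
    return [(spread - d) / denom if d < spread else 0.0 for d in distances]

distances = [1.0, 2.0, 3.0, 5.0]  # hypothetical distances from one sample
for spread in (4.0, 2.5, 0.5):
    print(spread, [round(p, 3) for p in triangular_assignments(distances, spread)])
```

As the spread goes from 4.0 to 2.5, the share of the nearest codevector rises from 0.5 to 0.75 even before it becomes the only codevector inside the spread, which is the gradual hardening that the inverse-distance measure lacks.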
The experimental results will also demonstrate that (13) is in fact a better measure than (9). Note also that computing one soft assignment using (9) requires $5N + 4$ flops, whereas using (13) it requires $N + 7$ flops, counting addition, subtraction, and multiplication as one flop each and division as four flops ($N$ is the number of codevectors taken into the computation). Hence, for $N \ge 1$, $N + 7 < 5N + 4$, implying that (13) is also less costly than (9). Recalling that an exponential computation is 8 times more costly than a basic operation (8 flops compared to one flop for a basic operation), (7) takes $N(8 + 1 + 1) + 4 = 10N + 4$ flops, which is much larger than $N + 7$. Therefore, the height-defined triangular soft information measure is a computationally less complex distribution than the Gibbs distribution.
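As a worked instance of these counts, take the value $N = 4$ used elsewhere in the paper (the choice of $N$ here is only for illustration):

```latex
\begin{align*}
\text{inverse distance, (9):} \quad & 5N + 4 = 5 \cdot 4 + 4 = 24 \text{ flops}, \\
\text{triangular, (13):}      \quad & N + 7 = 4 + 7 = 11 \text{ flops}, \\
\text{Gibbs, (7):}            \quad & 10N + 4 = 10 \cdot 4 + 4 = 44 \text{ flops}.
\end{align*}
```

Under this flop model, one triangular soft assignment at $N = 4$ costs a quarter of the corresponding Gibbs assignment.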
The algorithm for the low-complexity soft information is similar to the one shown in Figure 2. In this case, the temperature control is done by the spread of the triangle function. The initial spread $R_x^{(0)} = 4\sigma_X^2$ was empirically found to give good performance when $N$ is initialized to 4, where $\sigma_X^2$ is the variance of the training set components.

Optimal temperature schedule
In the previous section, we have proposed a low-complexity soft assignment measure, namely, the triangular soft information measure, as a simplified way of computing the soft assignments. Although this measure significantly reduces the computational cost of the soft assignments compared to the Gibbs soft measure, the reduction in computational cost comes at the expense of some loss in performance, since Gibbs is the optimal soft measure. In this section, we show how this performance loss can be reduced if, in the low-complexity approach, we can find temperature reduction schedules that approximate the Gibbs β schedule. To achieve this, we want to minimize the $L_1$ distance between the two distributions [32],

$$\|p_0 - p_G\|_1 = \sum_{c} \big| p_0(c|x) - p_G(c|x) \big|. \tag{14}$$

More specifically, we would like to find the spread reduction schedule $R_x$ for a given Gibbs β schedule such that (14) is minimized. Note that minimizing (14) is related to minimizing the relative entropy between $p_0(c|x)$ and $p_G(c|x)$, $D(p_0(c|x) \,\|\, p_G(c|x))$, since we know from [32] that

$$D(p_0 \,\|\, p_G) \ge \frac{1}{2 \ln 2} \|p_0 - p_G\|_1^2, \tag{15}$$

with equality when $p_0 = p_G$. Although it is intuitive that the relative entropy between a simplified soft measure and the optimal soft measure should be minimized in order to minimize the performance difference between them, we refer to the appendix for a more formal justification. The error analysis in the appendix shows that, at a given system entropy (softness), the performance loss in terms of distortion between two distributions (soft measures) is a function of the relative entropy between the distributions. Minimizing the relative entropy therefore minimizes the distortion penalty paid for using a simplified soft measure.
We can show that the relative entropy is approximately minimized when the variances of the two distributions, $p_0(c|x)$ and $p_G(c|x)$, are equal. The variances of $p_G(c|x)$ and $p_0(c|x)$ are given by (16) and (17), respectively (the lower limits of the integrals start from zero because we use absolute distances between the sample vector and each of the codevectors). Equating (16) and (17) and solving for $R_x$ yields the relation (18) between $R_x$ and β. Hence, using (18) we can obtain a schedule for $R_x$ given a schedule for β.
We have used the setup in Figure 4 to show that for a given β for the Gibbs distribution, the spread $R_x$ obtained by (18) for the triangle distribution minimizes the relative entropy. In the figure, there is a set of $L$ codevectors at increasing distances from a sample vector $x$. For each β in a set $\{\beta_1, \beta_2, \ldots, \beta_m\}$, the soft Gibbs assignments of the codevectors are computed using (7) with $N = L$. Then, through an exhaustive search, the spread $R_x$ whose soft assignments under (13) minimize the relative entropy $D(p_0(c|x) \,\|\, p_G(c|x))$ is obtained.
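The exhaustive search in this setup can be sketched as a grid search. This is a hedged sketch, not the authors' experiment: the distances, the β values, the search grid, and the use of squared distance as the Gibbs distortion are all assumptions, and the function names are hypothetical.

```python
import math

def gibbs(dists, beta):
    # Assumption: the distortion in the Gibbs measure is the squared distance.
    w = [math.exp(-beta * d * d) for d in dists]
    total = sum(w)
    return [wi / total for wi in w]

def triangular(dists, spread):
    inside = [d for d in dists if d < spread]
    if not inside:  # nothing inside the spread: hard assignment to the nearest
        nearest = min(range(len(dists)), key=lambda i: dists[i])
        return [1.0 if i == nearest else 0.0 for i in range(len(dists))]
    denom = len(inside) * spread - sum(inside)
    return [(spread - d) / denom if d < spread else 0.0 for d in dists]

def relative_entropy(p, q):
    """D(p || q) in bits; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def best_spread(dists, beta, grid):
    p_g = gibbs(dists, beta)
    return min(grid, key=lambda r: relative_entropy(triangular(dists, r), p_g))

dists = [0.5 * i for i in range(1, 11)]     # L = 10 codevectors at 0.5 ... 5.0
grid = [0.1 * i for i in range(1, 201)]     # candidate spreads 0.1 ... 20.0
for beta in (0.05, 0.2, 1.0):
    print(beta, round(best_spread(dists, beta, grid), 1))
```

The spread returned by the search shrinks as β grows, which is the qualitative relationship between the two schedules that the derived relation is meant to capture.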
The resulting minimum relative entropy curve is shown in Figure 5 by the solid line. This is compared with the result obtained using (18), shown by the dashed curve. We can see that the derived relation in (18) approximates the minimum relative entropy curve well, and hence the best $R_x$ schedule for a given β schedule. The error is due to the fact that we approximate the relative entropy using the variances of the two functions.
The reduced-complexity Gibbs algorithm and the low-complexity soft measure algorithm for the triangular membership function using two different spread reduction schedules are used to design codebooks of sizes 128 and 256 (for details on the experiments, see Section 4). The results are shown in Table 2. Of the two schedules for the triangular soft information measure, the first one is the geometric spread reduction given in (11), and the second one is obtained using (18) and the Gibbs schedule (referred to as Gibbs guided spread reduction in Table 2). Observe from the results in Table 2 that the triangular soft information measure using the Gibbs guided spread reduction schedule outperformed the geometric spread reduction when both are operated with the same number of iterations. Therefore, for a given β-schedule the relation in (18) provides a better $R_x$-schedule than the geometric reduction. Note that the β-schedule is itself geometric, $\beta^{(m)} = \beta^{(m-1)} \cdot \kappa$. But since the Gibbs soft information measure is the optimal measure, following the β-schedule for the simple soft information measure results in an increase in performance, as demonstrated in Table 2. Note also that the Gibbs algorithm need not be run to obtain the β-schedule; it can be obtained using $\beta^{(m)} = \beta^{(m-1)} \cdot \kappa$, κ > 1.0.

Figure 5: $R_x$ and β pairs. The solid curve is obtained by an exhaustive search for $R_x$ that gives the minimum relative entropy for a given value of β. The dashed curve is obtained using the derived relationship between $R_x$ and β to give the minimum relative entropy.

EXPERIMENTAL RESULTS
We now present the results obtained when our algorithms were used to design codebooks of various sizes and sources. The results are compared with other algorithms of interest, namely, GLA, SR-D, and standard DA. Our quoted execution times (CPU times) are based on those obtained with an Intel PIII-550 MHz machine.
The first set of training sources we considered was two cases of first-order Gauss-Markov sources, one with correlation coefficient $\alpha_0 = 0.0$ (uncorrelated source) and the other with $\alpha_0 = 0.9$ (correlated source). We divided 16 384 samples into 1024 16-dimensional training vectors, and designed codebooks of sizes 32, 64, 128, and 256 for both training sets, where the initial codebooks were obtained randomly from the training sets. Since both GLA and SR-D are sensitive to the choice of the initial codebooks, in order to investigate the effect of initialization, we have also designed codebooks of sizes 32, 64, 128, and 256, where the pairwise nearest neighbor (PNN) algorithm [33] is used to obtain the initial codebooks. For this we have used the uncorrelated Gaussian source with 4096 16-dimensional training vectors.
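A training set of this kind can be sketched as follows. The recursion x[n] = α₀ x[n-1] + w[n] is the usual first-order Gauss-Markov model; the unit-variance innovation, the seed, and the function name `gauss_markov_vectors` are assumptions, since the text does not specify them.

```python
import numpy as np

def gauss_markov_vectors(alpha, num_vectors=1024, dim=16, seed=0):
    """Generate a first-order Gauss-Markov sequence and cut it into vectors.

    Assumes zero-mean, unit-variance Gaussian innovations (not stated
    in the text) and consecutive, non-overlapping blocks of `dim` samples.
    """
    rng = np.random.default_rng(seed)
    n = num_vectors * dim
    w = rng.standard_normal(n)
    x = np.empty(n)
    x[0] = w[0]
    for i in range(1, n):
        x[i] = alpha * x[i - 1] + w[i]
    return x.reshape(num_vectors, dim)

uncorrelated = gauss_markov_vectors(alpha=0.0)  # 16 384 samples, 1024 vectors
correlated = gauss_markov_vectors(alpha=0.9)
```

With `alpha=0.0` the recursion reduces to an i.i.d. Gaussian source, which corresponds to the uncorrelated case.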
The second source examined was a segment of human speech sampled at 8 kHz and partitioned into 2048 16-dimensional vectors, for which we designed five codebooks of sizes 16, 32, 64, 128, and 256. The final source considered was obtained by extracting 8192 16-dimensional vectors (corresponding to 4 × 4 blocks) from two 512 × 512 monochrome training images from the USC image database, with each pixel amplitude quantized to 8 bits. Four codebooks of sizes 32, 64, 128, and 256 were designed using this training set, and the performance of these codebooks was tested in coding the image "Lena", which was outside of the training set. The effect of the PNN initialization on the speech and the image sources is also demonstrated.
We designed codebooks for the following algorithms, where in the plots an "a" appended to an algorithm name means that stochastic perturbation was not used (e.g., SVQ-Ga means the same as SVQ-G but without perturbation).
(1) SVQ-G: soft vector quantizer design using the reduced-complexity Gibbs distribution as the soft measure, and with stochastic perturbation.
(2) SVQ-E: soft vector quantizer design using the inverse Euclidean distance distribution as the soft measure, and with stochastic perturbation.
(3) SVQ-T: soft vector quantizer design using the height-defined distribution with triangular membership function as the soft measure, and with stochastic perturbation.
(4) VQ-DA: vector quantizer design using the standard deterministic annealing [13].
(5) SR-D: vector quantizer design using the reduced-complexity decoder perturbation algorithm [9].
(6) GLA: vector quantizer design using the generalized Lloyd algorithm [6].
In the cases where PNN initialization is not used, for each algorithm, except VQ-DA, the average performances for 20 different initial codebooks are computed. To allow us to compare the average performances of the different algorithms, the same set of initial codebooks is used. Recall that VQ-DA uses the center of mass of the training set as the initial codebook, so its performance with this initial condition is recorded. In the cases where PNN initialization is used, a unique initial codebook is obtained from the training set.
The performance measure used for the image source is peak signal-to-noise ratio (PSNR) and for the others is signal-to-noise ratio (SNR), defined as $\mathrm{PSNR} = 10 \log_{10}(255^2/D)$ and $\mathrm{SNR} = 10 \log_{10}(P_s/D)$, where $P_s$ is the signal power and $D$ is the distortion per sample. The SR-D algorithm was run for 200 iterations as given in [9], and the GLA was run until convergence.
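The two measures can be computed directly from their definitions. The sketch below assumes that the distortion per sample is the mean squared error and that the signal power is the mean square of the source samples; the function names are hypothetical.

```python
import numpy as np

def distortion_per_sample(original, reconstructed):
    """Mean squared error per sample (assumed distortion measure)."""
    diff = np.asarray(original, dtype=float) - np.asarray(reconstructed, dtype=float)
    return float(np.mean(diff ** 2))

def snr_db(original, reconstructed):
    """SNR = 10 log10(P_s / D), with P_s the mean square of the source."""
    p_s = float(np.mean(np.asarray(original, dtype=float) ** 2))
    return 10.0 * np.log10(p_s / distortion_per_sample(original, reconstructed))

def psnr_db(original, reconstructed, peak=255.0):
    """PSNR = 10 log10(peak^2 / D), with peak = 255 for 8-bit pixels."""
    return 10.0 * np.log10(peak ** 2 / distortion_per_sample(original, reconstructed))
```

Both functions are undefined for a perfect reconstruction (D = 0), which does not arise for the codebook sizes considered here.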

Gauss-Markov sources
The performances of the first 5 algorithms (listed above, both with and without perturbation) with initial codebooks obtained randomly from the training set are compared with the GLA performances in Figures 6 and 7. In all cases, the reduced-complexity DA algorithms (SVQ) achieved significant improvements over the traditionally used GLA and over SR-D, which is said to give near-optimal results [9]. From the figures we observe that the SVQ-G algorithm (reduced-complexity Gibbs distribution) performed better than the other SVQ algorithms; however, the performance of SVQ-T is competitive. Note the progression of performances of the low-complexity soft information measures: the performance improves from the inverse Euclidean distance soft measure (SVQ-E and SVQ-Ea) to the triangular soft measure (SVQ-T and SVQ-Ta). This was an expected result since the triangular soft measure was designed to better approximate the optimal Gibbs distribution. Note also the gain achieved by the stochastic relaxation (SR) in the SVQ algorithms compared to the nonstochastic cases. The gain ranges from a high of 0.2 dB for SVQ-E to a low of 0.02 dB for SVQ-G. It should be noted that the better an algorithm performs without the SR, the smaller the additional gain achieved by the SR in the SVQ algorithms. In other words, as an algorithm comes closer to the global optimum using the principles of soft information processing, it requires less help from the SR to attain an improved performance. In the limit, granting enough computational resources for the full power of the soft information processing to be utilized, the global optimum can be reached without requiring any help from SR. But as the results demonstrate, for reduced-complexity DA approaches, SR improves the performance at negligible additional computational complexity.
The results for VQ-DA (standard DA) were obtained starting with all the sample vectors being equally associated with all the codevectors, which dictates an initial codebook where all the codevectors are at the center of mass of the training set. The simulations were conducted with a conservative annealing schedule, where it took over 120000 CPU seconds (about 24 hours) for the codebook of size |C| = 256 to converge. Recall that in VQ-DA the probability associations are computed to all codevectors for each sample vector; thus the algorithm executes very slowly, especially for large codebooks. The figures show that the performance of VQ-DA compared to the reduced-complexity DA algorithms is inferior in all cases. Moreover, the SVQ algorithms run much faster than VQ-DA, requiring 350 CPU seconds for |C| = 256 and 16-dimensional vectors. While VQ-DA is expected to be very close to optimal if enough computational resources are allocated, as shown in [13], the performance of the reduced-complexity DA algorithms shows that for most practical applications the expected performance of VQ-DA does not justify its computational burden.
Both GLA and SR-D algorithms are sensitive to the initial codebooks. Hence, in order to investigate the effect of initialization on these algorithms and on our proposed algorithms, we have used the PNN initialization for the codebooks [33]. In Figure 8, we show the performances of the 4 codebooks on the (uncorrelated) Gaussian source as improvement over the PNN initialized GLA. For clarity of presentation, we have only included the SVQ-Ga performance from our proposed algorithms; the other SVQ algorithms behave comparably relative to SVQ-Ga, as in Figure 6. Note from the figure that the PNN initialization improves the GLA and SR-D algorithms; however, the SVQ-Ga algorithm is not affected. This is a positive result for the SVQ algorithms, for it shows that they can evade the local minimum dictated by the initial codebook, and hence are insensitive to the choice of the initial codebook. The PNN and its fast but suboptimal version, fast-PNN, require $O(K^3)$ and $O(K \log K)$ time, respectively, where K is the size of the training set [33]. The results presented in Figure 8 are obtained using the full search PNN algorithm (with complexity $O(K^3)$) in order to get the best possible results with the GLA and the SR-D algorithms. The fast-PNN initialization would result in reduced performance; it is shown in [33] that the fast-PNN algorithm increased the coding error by 0.4-0.6 dB for image sources compared to full search PNN. The SVQ algorithms outperform both GLA and SR-D algorithms without the complexity of the initialization process, which becomes computationally more impractical as the size of the training set increases. The running time for the generation of the PNN codebooks from a training set of 4096 16-dimensional vectors was 2374 CPU seconds, and the design of the size 256 codebooks for the GLA, SR-D, and SVQ-Ga algorithms took on average 44, 366, and 1552 CPU seconds, respectively, on the same machine. Therefore, with the PNN initialization, the total running times for the GLA and the SR-D algorithms are higher than that of the SVQ-Ga algorithm. Since SVQ-Ga performs the same with and without the initialization, the SVQ-Ga algorithm outperforms GLA and SR-D in less running time.
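The running-time comparison is a matter of simple addition. As a quick check using only the CPU times quoted in the text (the variable names are hypothetical):

```python
# CPU seconds quoted in the text for the size-256 codebook design.
pnn_init = 2374                       # full search PNN initialization
design = {"GLA": 44, "SR-D": 366, "SVQ-Ga": 1552}

total_pnn_gla = pnn_init + design["GLA"]    # 2418 CPU seconds
total_pnn_srd = pnn_init + design["SR-D"]   # 2740 CPU seconds
total_svq_ga = design["SVQ-Ga"]             # 1552 CPU seconds, no PNN needed
```

Both PNN-initialized totals exceed the SVQ-Ga design time, with the initialization itself accounting for most of the cost.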

Speech source
The performance on the speech source using the three algorithms, GLA, SR-D, and SVQ-Ga, with and without the codebook initialization, is shown in Figures 9 and 10. Figure 9 shows the performance improvement over GLA, and Figure 10 shows the improvement over PNN initialized GLA. Note that while the performance improvement of SVQ-Ga over GLA is large (0.95 dB at 0.5 bit/sample), the improvement compared with the PNN initialized GLA is rather modest. But note again that the effect of the initialization on the SVQ-Ga performance is very small, whereas improvements of 0.85 dB and 0.2 dB are obtained at 0.5 bit/sample for GLA and SR-D, respectively, after initialization. Therefore, as with the Gaussian source, the SVQ-Ga renders the initialization unnecessary.
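The rates quoted in bit/sample follow from the codebook size and the vector dimension. A minimal sketch, assuming fixed-rate coding in which each codevector index costs log2 |C| bits (the function name is hypothetical):

```python
import math

def rate_bits_per_sample(codebook_size, dim=16):
    """Fixed-rate VQ: log2(|C|) bits per vector spread over `dim` samples."""
    return math.log2(codebook_size) / dim

# Codebook sizes used for the speech source, with 16-dimensional vectors.
rates = {size: rate_bits_per_sample(size) for size in (16, 32, 64, 128, 256)}
```

Under this assumption, the size 256 codebook corresponds to 0.5 bit/sample and the size 128 codebook to 0.4375 bit/sample.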

Image source
The last source considered was the image source, where the results are shown in Figure 11 for the coding of the image "Lena." As in the previous two source cases, the SVQ-Ga performance is practically insensitive to the codebook initialization, and it outperformed the GLA and the SR-D algorithms by 0.3-0.4 dB and 0.2-0.3 dB, respectively, both being initialized with PNN. Therefore, as with the Gaussian and the speech sources, the SVQ-Ga outperformed the PNN + GLA and PNN + SR-D without the need for initialization.

CONCLUSION
In this paper, we have designed reduced/low-complexity methods for deterministic annealing (DA) for the vector quantizer design problem, which we called soft vector quantizer (SVQ) design algorithms. The proposed low-complexity soft measures are used as the soft association probabilities in the probabilistic framework of the DA to reduce the computational cost compared to the optimal Gibbs soft measure used in the standard DA. Although the simple soft measures significantly reduce the computational complexity of the system, this improvement comes at a price since these soft measures are not the optimal distributions. Hence, we have also derived the theoretical performance loss for using a simplified measure instead of the optimal measure, and used the result to derive optimal annealing schedules for the proposed simple soft measures. We have demonstrated that using the derived optimal schedule for the low-complexity soft measures increases the quality of the final codebook compared to using a geometric reduction schedule, which is usually suggested in the annealing algorithms. We have also shown that the low-complexity DA methods benefit from the stochastic relaxation techniques, with decreasing benefits as the performance approaches the optimum.
We have demonstrated the effectiveness of our low/reduced-complexity DA (SVQ) algorithms by designing codebooks for a variety of sources, namely, Gauss-Markov, speech, and image, at different rates. In each case, the proposed SVQ algorithms significantly improved the quality of the final codebooks compared to the traditionally used GLA and compared to the SR-D algorithm, which some researchers accept as a benchmark VQ design technique that performs near-optimally. We have also investigated the effect of codebook initialization on the GLA, SR-D, and SVQ algorithms and showed that, while GLA and SR-D receive major benefit from this initialization at the expense of increased computational complexity, the SVQ algorithms are able to attain the same performance without the need for initialization. Hence, the SVQ algorithms are not sensitive to the choice of the initial codebook and outperform the codebook initialized GLA and SR-D algorithms. Compared to the standard DA, the computational complexity of the SVQ algorithms is shown to be drastically reduced. Using the same annealing temperature, the SVQ algorithms run more than a factor of 100 faster than the standard DA algorithm with negligible performance difference. We believe that the proposed algorithms, with their significantly higher performance over the widely used GLA and SR-D, and with their low computational complexity and negligible performance difference compared to the standard DA, have proved themselves to be important VQ design techniques.

APPENDIX
Our proposed triangular soft information measure significantly reduces the computational cost of the soft assignments compared to the optimal Gibbs soft measure, at the cost of some loss in performance. In this appendix, we derive the penalty paid in distortion for using the simplified soft measure instead of the optimal one at a given system entropy (softness), and show that minimizing the relative entropy between the two measures minimizes the distortion penalty.
For a given soft assignment measure (conditional probability), p(c|x), we have the expected distortion and the

Figure 1: The iterative procedure showing the updating of the soft assignments and the codevectors.

Figure 2: Flowchart for the reduced-complexity Gibbs soft assignment measure algorithm.

Figure 3: Triangular membership function used as a soft information measure. Codevectors within the spread of the function comprise the nearest N codevectors for the considered sample vector.

Figure 4: An instance of the Gibbs function with parameter β and an instance of the triangle function with parameter $R_x$ are shown. There are L codevectors at increasing distance from sample vector x.

Figure 5: Plot showing the minimum relative entropy between the triangle soft measure and the Gibbs soft measure at various spread $R_x$ and β pairs. The solid curve is obtained by an exhaustive search for $R_x$ that gives the minimum relative entropy for a given value of β. The dashed curve is obtained using the derived relationship between $R_x$ and β to give the minimum relative entropy.

Figure 8: The effect of PNN initialization as improvement over GLA. Source is Gaussian, vector dimension = 16 samples/vector.

Figure 9: Improvements over GLA for human speech source sampled at 8 kHz; vector dimension = 16 samples/vector.

Table 1: Average performance and running time comparison for N = |C| = 128 and N = 4. The source is uncorrelated Gaussian, the vector dimension is 16, and the soft information measure is the reduced-complexity Gibbs distribution. The results are averaged over 20 experiments (details on experimental setup are in Section 4).

Table 2: Comparing the geometric and the Gibbs guided spread (temperature) reduction for the triangular membership function for the design of 128 and 256 sized codebooks for uncorrelated Gaussian source with vector dimension 16.