Incorporation of perceptually adaptive QIM with singular value decomposition for blind audio watermarking


This paper presents a novel approach for blind audio watermarking. The proposed scheme utilizes the flexibility of discrete wavelet packet transformation (DWPT) to approximate the critical bands and adaptively determines suitable embedding strengths for carrying out quantization index modulation (QIM). The singular value decomposition (SVD) is employed to analyze the matrix formed by the DWPT coefficients and embed watermark bits by manipulating singular values subject to perceptual criteria. To achieve even better performance, two auxiliary enhancement measures are attached to the developed scheme. Performance evaluation and comparison are demonstrated with the presence of common digital signal processing attacks. Experimental results confirm that the combination of the DWPT, SVD, and adaptive QIM achieves imperceptible data hiding with satisfying robustness and payload capacity. Moreover, the inclusion of self-synchronization capability allows the developed watermarking system to withstand time-shifting and cropping attacks.

1 Introduction

In recent years, copyright protection of multimedia data has been of great concern to content owners and service providers. Digital watermarking technology received much attention for resolving such a concern because this technology could hide information into the multimedia object (e.g., images, audio, and video) for applications like intellectual property protection, content authentication, and fingerprinting.

An audio watermarking scheme generally takes into consideration four aspects, namely, imperceptibility, security, robustness, and capacity. The developed schemes shall ensure the security and inaudibility of the embedded information, but still possess the ability of withstanding malicious attacks. The payload capacity must be large enough to accommodate necessary information. Different methods were attempted on various domains, such as time [1-5], Fourier transform [6-8], cepstral transform [9-13], discrete cosine transform (DCT) [14-17], and discrete wavelet transform (DWT) [14, 16, 18-23].

Compared with transform domain methods, the time-domain approach is rather easier to implement and requires less computation. The watermark is usually a pseudo noise added to the host signal. Alternatively, binary information can be converted to a noise-like signal through the spread spectrum technique. The existence of the watermark can be verified by measuring the correlation function between the pseudo noise and watermarked signal. The time-domain methods are usually less robust to digital signal processing attacks unless a long segment along with adequate embedding strength is adopted. In contrast, quantization index modulation (QIM) has been proven to be a promising technique [24]. The time-domain data embedding is achieved by quantizing the parameters derived from the time series. Though the QIM generally outperforms the spread spectrum in the time domain, it still needs a long segment for reliable detection. As a consequence, the time-domain QIM was mainly used for frame synchronization in many watermarking systems [14, 20, 21, 24]. Being aware of the limitation of the time-domain approach, many researchers thus turned to the transform domains where signal characteristics could be better explored. The embedding intensity as well as position of the watermark can be selected based upon the features extracted in the transform domains [1, 14, 21].

Singular value decomposition (SVD) is a powerful tool for image processing applications [25, 26]. Because the SVD can adapt to various transform domains, it has been extensively applied in audio watermarking [5, 8, 17, 22, 27]. For instance, Abd El-Samie [5] utilized a twofold strategy to embed the watermark. After applying the first SVD to a 2-D matrix formed by the audio signal, he blended the intended watermark with the diagonal matrix holding singular values and then performed the second SVD on the modified matrix. In his design, the matrices containing left- and right-singular vectors must be conserved in order to extract the watermark. Al-Nuaimy et al. [27] further extended the twofold strategy and applied it to the audio signals transmitted over network systems on a segment-by-segment basis.

Bhat et al. [22] presented an SVD-based blind watermarking scheme operating in the DWT domain. The watermark bits were embedded into the audio signals using QIM, with quantization steps adaptively determined according to the statistical properties of the involved DWT coefficients. The authors claimed that their scheme was the first adaptive audio watermarking scheme exploring both DWT and SVD and had a high payload and superior performance against MP3 compression. Lei et al. [17] also attempted to embed a binary watermark into the high-frequency band of the SVD-DCT block. They attained a performance generally better than the previous SVD-based methods. Most recently, Lei et al. [28] integrated lifting wavelet transform (LWT), SVD, and QIM to achieve a very good tradeoff among the robustness, imperceptibility, and payload. Apart from the abovementioned methods, there are other audio watermarking schemes applicable to different domains in the literature [29, 30].

Audio watermarks are supposed to be transparent to human ears, which means that the modification due to watermarking is virtually inaudible. One way to enhance the embedding efficiency is to exploit the auditory characteristics so that the embedding strength is sufficiently high to withstand attacks without introducing audible distortion. The methods presented in [16, 17, 22] demonstrated the benefit of exploiting the signal characteristics, but they relied on heuristic rules to decide the embedding strength. In these methods, even though some attention was paid to adjusting relevant parameters to reach optimal performance, the connection between multiple transform domains and human auditory properties has not been thoroughly addressed.

Because the DWPT possesses multi-resolution capacity and is more computationally efficient than the Fourier transform, it may cooperate with the psychoacoustic model to render an estimate of auditory masking thresholds [31, 32]. Hence, our aim in this study is to explore all useful properties of the DWPT, SVD, and QIM for audio watermarking such that the issues of robustness, imperceptibility, and payload capacity can be resolved altogether. In particular, the primary interest is placed on the blind watermarking, which does not require the original audio signal to extract the watermark.

2 Derivation of auditory masking threshold in the DWPT domain

Auditory masking is the effect whereby a sound becomes inaudible due to the presence of a louder sound. There are two types of auditory masking. One is spectral masking (sometimes referred to as simultaneous masking), which is the characteristic of the human auditory system whereby a sound signal is masked by a masker with a different frequency. The other is temporal masking (or non-simultaneous masking), which is the masking effect occurring before and after a sudden stimulus sound.

While studying spectral masking, critical bands are of great importance because they can be employed to elucidate the properties of frequency selectivity [32, 33]. Based upon the theory of perceptual entropy [31-35], this study derives the auditory masking threshold in terms of signal power for each critical band. The derivation begins with the utilization of the DWPT to approximate the critical bands. The procedures for deriving spectral masking thresholds are briefly summarized as follows:

  1. Segment the host audio signal into frames, each of 4,096 samples in length.

  2. Decompose the audio signal using the DWPT according to the specification given in Table 1, in which each packet node approximately corresponds to a critical band. The decomposition is carried out using the Daubechies-8 wavelet. Let $c_i^{(n)}$ denote the $i$th DWPT coefficient in the $n$th band with a length of $N^{(n)}$.

  3. Compute the short-term spectrum $X_i^{(n)}$ in each band by applying the fast Fourier transform (FFT) to $c_i^{(n)}$, i.e., $X_i^{(n)} = \mathrm{FFT}\{c_i^{(n)}\}$.

  4. Estimate the tonality factor $\tau$ to see whether the band is noise-like or tone-like.

     $$\tau = \min\left( \frac{10 \log_{10}\left[ P_{Mg}\left(|X_i^{(n)}|^2\right) / P_{Ma}\left(|X_i^{(n)}|^2\right) \right]}{-25},\ 1 \right), \tag{1}$$

     where $P_{Mg}(|X_i^{(n)}|^2)$ and $P_{Ma}(|X_i^{(n)}|^2)$ stand for the geometric and arithmetic means of $|X_i^{(n)}|^2$, respectively.

  5. Adjust the masking level according to the tonality factor.

     $$D_z^{(n)} = \frac{1}{N^{(n)}} \sum_{i=0}^{N^{(n)}-1} \left(c_i^{(n)}\right)^2 \cdot 10^{a^{(n)}/10}, \tag{2}$$

     where $a^{(n)}$ signifies the permissible noise floor relative to the signal in the $n$th band, and it is formulated as

     $$a^{(n)} = \tau \left(-0.275\,n - 15.025\right) + (1 - \tau) \times (-9.0) \quad \text{expressed in dB}. \tag{3}$$

  6. Extend the masking effect to the adjacent bands by convolving the adjusted masking level with a spreading function $SF(n)$, namely $C_z^{(n)} = D_z^{(n)} \ast 10^{SF(n)/10}$, with $SF(n)$ defined as

     $$SF(n) = p + \frac{u + v}{2}\,(n + y) - \frac{v - u}{2}\sqrt{h + (n + y)^2} \quad \text{expressed in dB}, \tag{4}$$

     where $p = 15.242$, $y = 0.15$, $h = 0.3$, $u = -25$, and $v = 30$.

  7. Compare the masking threshold $C_z^{(n)}$ with the absolute threshold of hearing in quiet state, termed $T^{(n)}$ in decibel. The maximum of the two is selected as the masking threshold, i.e.,

     $$\eta^{(n)} = \max\left( C_z^{(n)},\ 10^{T^{(n)}/10} \right). \tag{5}$$

Table 1 The arrangement of DWPT decomposition

The masking threshold obtained through the above procedure is designated as $\eta^{(n)}$, which represents the noise power level not detectable by human ears in the $n$th band.
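
The seven steps above can be sketched in Python with NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the small spectral floor `eps`, the 1-based band numbering, and the absolute thresholds passed in as `t_db` are all hypothetical, and the DWPT itself is taken as given (the input is a list of per-band coefficient arrays).

```python
import numpy as np

def tonality(c, eps=1e-20):
    """Tonality factor tau; eps is an assumed floor that avoids log(0)."""
    power = np.abs(np.fft.fft(c)) ** 2 + eps
    geo_mean = np.exp(np.mean(np.log(power)))
    arith_mean = np.mean(power)
    return min(10.0 * np.log10(geo_mean / arith_mean) / -25.0, 1.0)

def spreading_db(n, p=15.242, y=0.15, h=0.3, u=-25.0, v=30.0):
    """Spreading function SF(n) in dB for a band offset n."""
    return p + (u + v) / 2.0 * (n + y) - (v - u) / 2.0 * np.sqrt(h + (n + y) ** 2)

def masking_thresholds(bands, t_db):
    """Return eta for each band (bands are numbered from 1, an assumption)."""
    d = []
    for n, c in enumerate(bands, start=1):
        c = np.asarray(c, dtype=float)
        tau = tonality(c)
        a_db = tau * (-0.275 * n - 15.025) + (1.0 - tau) * (-9.0)
        d.append(np.mean(c ** 2) * 10.0 ** (a_db / 10.0))
    d = np.array(d)
    k = len(d)
    # Spreading weights for every band offset from -(k-1) to +(k-1).
    sf = 10.0 ** (spreading_db(np.arange(-(k - 1), k)) / 10.0)
    # Keep the part of the full convolution aligned with the band index.
    c_z = np.convolve(d, sf)[k - 1:2 * k - 1]
    return np.maximum(c_z, 10.0 ** (np.asarray(t_db, dtype=float) / 10.0))
```

With the listed constants the spreading function is very close to 0 dB at zero offset, so a band's own masking level passes through the convolution almost unchanged.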

3 Frame synchronization

One of the weaknesses of the existing watermarking methods lies in their vulnerability to time shifting and cropping [14]. Frame synchronization is perhaps the most prevalent counterstrategy for dealing with this issue. Many watermarking systems divide the audio signal into two sorts of segments, namely, one for synchronization and the other for watermarking. This study instead resorts to the idea of frequency division, which uses non-overlapping frequency bands to hide the synchronous codes and information bits separately. Figure 1 illustrates the idea of frequency division, where the synchronous code is placed in the frequencies below 172 Hz and the information bits are hidden in the critical bands above 172 Hz.

Figure 1

The idea of frequency division.

To synchronize the frames, this study utilizes a time-domain QIM that was developed in [36] but is modified to suit the requirements here. The audio signal is deliberately partitioned into frames of length $L_f = 8192$ (twice the length used for deriving the masking threshold), and each frame is further divided into $N_s = 32$ subsections. A 32-bit Barker code ‘1111101110100111-0100101001001000’ [37] is employed for the synchronization task because this code has low correlation with a time-shifted version of itself. Each binary bit is first converted into bipolar form, termed $S_b(k) \in \{-1, 1\}$, and then embedded into a subsection spanning $L_s = L_f/N_s = 256$ samples by

$$\hat{m} = \begin{cases} \lfloor m/D \rfloor D + D/4, & \text{if } S_b(k) = -1 \\ \lfloor m/D \rfloor D + 3D/4, & \text{if } S_b(k) = 1 \end{cases} \quad \text{for } k = 0, 1, \ldots, N_s - 1, \tag{6}$$

where $m$ and $\hat{m}$ denote, respectively, the original and modified mean values of the subsection. $D$ is the quantization step, which is chosen so that it yields no perceptible distortion.

To achieve the goal of imperceptibility, the quantization step at sample $i$, designated as $D_i$, is obtained by referring to the root-mean-square of $N_p$ past lowpass-filtered samples:

$$D_i = \left( \frac{1}{N_p} \sum_{n=1}^{N_p} x_{lp}^2(i - n) \right)^{1/2} \times 10^{-10/20}, \tag{7}$$

where $x_{lp}(i)$ is the output of feeding the audio signal through a fourth-order Butterworth lowpass filter with the cutoff frequency set at 172 Hz. $N_p$ is chosen as 1,536. The scaling factor $10^{-10/20}$ aims at attenuating the signal power by 10 dB. The purpose of using $x_{lp}(i)$ is twofold. First, it provides an estimate of the signal power for frequency components below 172 Hz. Second, it excludes the disturbance from high-frequency bands where the information bits are located.

Following the derivation of a new mean, the proposed time-domain QIM modifies the audio samples in each subsection using

$$\hat{x}_k = x_k + (\hat{m} - m)\, M(k) \quad \text{for } k = 0, 1, \ldots, L_s - 1, \tag{8}$$

where $M(k)$ is a function designed to have a flat top in the middle but descend to zero on both ends, i.e.,

$$M(k) = \upsilon \times \begin{cases} 0.5 - 0.5\cos(2\pi k/63), & k = 0, 1, \ldots, 31; \\ 1, & k = 32, \ldots, L_s - 33; \\ 0.5 - 0.5\cos\left(2\pi (k - 192)/63\right), & k = L_s - 32, \ldots, L_s - 1. \end{cases} \tag{9}$$

The variable $\upsilon$ in Equation (9) is a scaling factor used to attain a mean of unity for $M(k)$, i.e., $\frac{1}{L_s}\sum_{k=0}^{L_s-1} M(k) = 1$.

Based on the analysis given in [21], the QIM via Equation (8) introduces a noise with a power level of $7D_i^2/48$, which is 8.36 dB lower than $D_i^2$. The window $M(k)$ contributes about −0.46 dB to the signal-to-noise ratio (SNR). Combined with the 10 dB given in Equation (7), the overall SNR resulting from the watermarking is around 8.36 − 0.46 + 10 = 17.9 dB. According to the theory of perceptual entropy [31, 34], the masking threshold for the frequency components below 172 Hz is approximately 16 dB below the signal power regardless of signal tonality. Consequently, the purposely reserved 17.9-dB SNR is sufficient to ensure the imperceptibility of the embedded synchronous code.
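
The mean quantization and the windowed sample update for a single subsection can be sketched as follows. This is a simplified illustration: it assumes one fixed step `D` for the whole subsection rather than the adaptive per-sample step, and the function names are hypothetical.

```python
import numpy as np

LS = 256  # subsection length L_s = L_f / N_s

def window(ls=LS):
    """Flat-top window with 32-sample raised-cosine ramps, scaled to unit mean."""
    k = np.arange(ls)
    w = np.ones(ls)
    w[:32] = 0.5 - 0.5 * np.cos(2 * np.pi * k[:32] / 63)
    # For ls = 256 the offset ls - 64 equals the 192 used in the window definition.
    w[-32:] = 0.5 - 0.5 * np.cos(2 * np.pi * (k[-32:] - (ls - 64)) / 63)
    return w * ls / w.sum()

def embed_sync_bit(x, bit, D):
    """Move the subsection mean to floor(m/D)*D + D/4 (bit -1) or + 3D/4 (bit +1)."""
    x = np.asarray(x, dtype=float)
    m = x.mean()
    m_hat = np.floor(m / D) * D + (0.75 if bit > 0 else 0.25) * D
    return x + (m_hat - m) * window(len(x))

def detect_sync_bit(x, D):
    """Return +1 if the mean falls in the upper half of its quantization cell, else -1."""
    m = np.mean(x)
    return 1 if (m - np.floor(m / D) * D) > 0.5 * D else -1
```

Because the window has unit mean, the mean of the modified subsection equals the quantized target exactly, which is what the detector relies on.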

The detection of the synchronization code requires the preparation of a bit sequence $\tilde{b}_i$, which is of the same length as the watermarked audio signal and can be derived as

$$\tilde{b}_i = 2\left[ \left( \tilde{m}_i - \lfloor \tilde{m}_i / \tilde{D}_i \rfloor \tilde{D}_i \right) \overset{?}{>} 0.5\,\tilde{D}_i \right] - 1, \tag{10}$$

where the bracketed comparison evaluates to 1 when it holds and to 0 otherwise, and $\tilde{m}_i$ denotes the mean computed over a subsection starting from the $i$th sample. $\tilde{D}_i$ corresponds to the −10-dB RMS of the previous $N_p$ lowpass-filtered samples. After acquiring $\tilde{b}_i$, the existence of a synchronous code can be identified by examining the cross-correlation between the Barker code $S_b(k)$ and a decimated version of $\tilde{b}_i$:

$$r(i) = \sum_{k=0}^{N_s - 1} S_b(N_s - 1 - k)\, \tilde{b}(i - k L_s). \tag{11}$$

As Equation (11) places the synchronous code in a backward direction, the largest $r(i)$ over an interval of 8,192 samples indicates a salient demarcation between the frames. This synchronization marker can be made more prominent by adding the two cross-correlation values located 8,192 samples before and after the current one:

$$\hat{r}_3(i) = \sum_{j=-1}^{1} r(i + 8192\,j). \tag{12}$$

The position of the marker, termed $I$, is identified simply by picking the largest peak of $\hat{r}_3(i)$ in each interval:

$$I = \arg\max_i \left\{ \hat{r}_3(i) \mid i_{start} \le i < i_{start} + 8192 \right\}, \tag{13}$$

where $i_{start}$ denotes the starting index.
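
The marker search can be illustrated with a short Python sketch. It assumes the bipolar sequence has already been computed and is passed in as the array `b`; the function names are hypothetical, and treating correlation values outside the signal as zero is an assumption made here for the edges.

```python
import numpy as np

SYNC_BITS = "11111011101001110100101001001000"
CODE = np.array([1 if ch == "1" else -1 for ch in SYNC_BITS])

def sync_correlation(b, code, ls):
    """r(i) = sum_k code[Ns-1-k] * b[i - k*ls]; left at zero where i < (Ns-1)*ls."""
    b = np.asarray(b, dtype=float)
    ns = len(code)
    start = (ns - 1) * ls
    r = np.zeros(len(b))
    for k in range(ns):
        r[start:] += code[ns - 1 - k] * b[start - k * ls:len(b) - k * ls]
    return r

def three_frame_sum(r, lf):
    """Add the correlation values one frame before and one frame after each index."""
    r3 = r.copy()
    r3[lf:] += r[:-lf]   # r(i - lf)
    r3[:-lf] += r[lf:]   # r(i + lf)
    return r3

def find_marker(r3, i_start, lf):
    """Index of the largest peak within [i_start, i_start + lf)."""
    return i_start + int(np.argmax(r3[i_start:i_start + lf]))
```

In the paper's setting `ls` would be 256 and `lf` 8,192; the marker sits at the last subsection of the embedded code because the correlation runs backward from index `i`.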

4 Watermarking via SVD

An advantage of the SVD-based watermarking is that large singular values change very little for most types of attacks. The proposed watermarking scheme takes advantage of this property by applying the QIM to the gap between two principal singular values. For each packet node of the DWPT, the $N$ coefficients $c_i$ in a frame are organized as a $2 \times N/2$ matrix $M$ in the following manner:

$$M = \begin{bmatrix} c_1 & c_3 & \cdots & c_{N-1} \\ c_2 & c_4 & \cdots & c_N \end{bmatrix}_{2 \times N/2}. \tag{14}$$

Without loss of generality, the superscript $(n)$ previously used to signify a specific band has been removed in the expression. Taking SVD of $M$ results in $M = USV^T$, where $U$ is a $2 \times 2$ real unitary matrix, $S$ is a $2 \times N/2$ diagonal matrix with non-negative real diagonal values $\lambda_i$ in decreasing order, and $V^T$ (the transpose of $V$) is an $N/2 \times N/2$ real unitary matrix. Alternatively, the matrix $M$ can be written as

$$M = \begin{bmatrix} u_1 & u_2 \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 & 0 & \cdots & 0 \\ 0 & \lambda_2 & 0 & \cdots & 0 \end{bmatrix} \begin{bmatrix} v_1 & v_2 & \cdots & v_{N/2} \end{bmatrix}^T = \lambda_1 u_1 v_1^T + \lambda_2 u_2 v_2^T, \tag{15}$$

where $u_i$ and $v_i$ are the $i$th columns of the matrices $U$ and $V$. The total energy of the $N$ DWPT coefficients is the squared sum of all the elements in $M$, i.e.,

$$E_c = \sum_{i=1}^{N} c_i^2. \tag{16}$$

The same result can be obtained using

$$E_c = \mathrm{trace}\left(M M^T\right) = \lambda_1^2 + \lambda_2^2. \tag{17}$$

It is recalled that the procedure described in Section 2 provides a masking threshold $\eta$, which is the maximum tolerable power variation that remains imperceptible to human ears. The derived threshold can guide the design of a robust and transparent watermarking scheme. This study proposes embedding a watermark bit $w_b$ into the matrix $M$ by manipulating $\lambda_1$ and $\lambda_2$ subject to three criteria. First, the overall energy shall remain unchanged. That is

Criterion 1

$$\lambda_1^2 + \lambda_2^2 = \lambda_1'^2 + \lambda_2'^2, \tag{18}$$

where $\lambda'_1$ and $\lambda'_2$ denote the adjusted results of $\lambda_1$ and $\lambda_2$, respectively. Second, the gap between $\lambda'_1$ and $\lambda'_2$, termed $g' = \lambda'_1 - \lambda'_2$, must comply with the QIM rule according to $w_b$:

Criterion 2

$$g' = \lambda'_1 - \lambda'_2 = \begin{cases} \left\lfloor \dfrac{\lambda_1 - \lambda_2}{\Delta} \right\rfloor \Delta + \dfrac{\Delta}{4}, & \text{if } w_b = 0; \\[2ex] \left\lfloor \dfrac{\lambda_1 - \lambda_2}{\Delta} \right\rfloor \Delta + \dfrac{3\Delta}{4}, & \text{if } w_b = 1, \end{cases} \tag{19}$$

where $\lfloor \cdot \rfloor$ represents the floor function. As for the third criterion, the signal power variation shall not exceed the auditory masking threshold $\eta$.

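Let $M'$ denote the matrix restored by substituting the modified singular values into $S$ such that

$$M' = \begin{bmatrix} c'_1 & c'_3 & \cdots & c'_{N-1} \\ c'_2 & c'_4 & \cdots & c'_N \end{bmatrix} = U \begin{bmatrix} \lambda'_1 & 0 & 0 & \cdots & 0 \\ 0 & \lambda'_2 & 0 & \cdots & 0 \end{bmatrix} V^T. \tag{20}$$

Because of the constraint imposed by Equation (19), the adjustment of these two singular values satisfies the inequality

$$\left| (\lambda'_1 - \lambda'_2) - (\lambda_1 - \lambda_2) \right| \le \frac{\Delta}{2}; \tag{21}$$

and the resulting error energy $E_{error}$ becomes

$$E_{error} = \sum_{i=1}^{N} \left(c_i - c'_i\right)^2 = \mathrm{trace}\left( (M - M')(M - M')^T \right) = (\lambda_1 - \lambda'_1)^2 + (\lambda_2 - \lambda'_2)^2. \tag{22}$$

It is readily seen from Equation (21) that

$$(\lambda_1 - \lambda'_1)^2 + (\lambda_2 - \lambda'_2)^2 \le \frac{\Delta^2}{4}. \tag{23}$$

Ideally, if the error power, i.e., $E_{error}/N$, falls beneath the masking threshold $\eta$, the signal alteration due to watermarking will be inaudible. Such a condition can be expressed as

Criterion 3

$$\frac{E_{error}}{N} \le \frac{\Delta^2}{4N} \le \eta. \tag{24}$$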

Let $\Delta_{\max} = 2\sqrt{N\eta}$ denote the maximum step size used to quantize the gap between the two singular values without causing perceivable distortion. The modifications with respect to $\lambda_1$ and $\lambda_2$ are denoted as $\lambda'_1 = \lambda_1 + \delta_1$ and $\lambda'_2 = \lambda_2 - \delta_2$. Then, the derivation of $\lambda'_1$ and $\lambda'_2$ based on the three criteria becomes straightforward. After substituting $\Delta_{\max}$ for $\Delta$ in Equation (19), an equation with variables $\delta_1$ and $\delta_2$ is formed:

$$\delta_1 + \delta_2 = g' - \lambda_1 + \lambda_2 = \rho. \tag{25}$$

In combination with Equation (18), $\delta_1$ can be solved from the quadratic equation

$$2\delta_1^2 + \left(2\lambda_1 + 2\lambda_2 - 2\rho\right)\delta_1 + \left(-2\lambda_2 \rho + \rho^2\right) = 0. \tag{26}$$

The relationship among all involved variables is illustrated in Figure 2. After obtaining $\delta_1$, $\delta_2$ follows from Equation (25). As Equation (26) usually yields two solutions for $\delta_1$, this study chooses the one with the smaller magnitude. Nevertheless, Equation (26) may also render complex roots when $(g')^2 > E_c$. Hence, a preventive measure is taken to ensure that real roots are obtained. It is noted from Equation (19) that the minimum possible value of $g'$ is $3\Delta_{\max}/4$ for $w_b = 1$. In an extreme case where $\lambda'_1 = g' = 3\Delta_{\max}/4$ and $\lambda'_2 = 0$, $\Delta_{\max}$ must satisfy

$$E_c = \lambda_1'^2 + \lambda_2'^2 = (g')^2 = \left(3\Delta_{\max}/4\right)^2. \tag{27}$$

Figure 2

Relationship among all the variables involved in the SVD-based watermarking.

Consequently, the preventive measure checks whether $\Delta_{\max} < \frac{4}{3}\sqrt{E_c}$ and substitutes $\frac{4}{3}\sqrt{E_c}$ for $\Delta_{\max}$ if the inequality does not hold. This substitution, in turn, guarantees an outcome of non-negative $\lambda'_1$ and $\lambda'_2$.

With the fulfillment of the three criteria, namely Equations (18), (19), and (24), the audio signal can maintain its segmental power while executing the QIM. The key factor of the entire process turns out to be $\eta$, which subsequently determines $\Delta_{\max}$, $\lambda'_1$, and $\lambda'_2$. Putting the derived $\lambda'_1$ and $\lambda'_2$ into Equation (20) renders a modified matrix $M'$ with new DWPT coefficients. Once the processes in all the involved critical bands are completed, the watermarked signal is attained by taking the inverse DWPT of the modified DWPT coefficients.

The watermark extraction from the watermarked signal is rather simple. Analogous to the procedures adopted for watermark embedding, the extraction process starts with taking the DWPT of the watermarked audio and then deriving the masking threshold $\tilde{\eta}$ for each packet node. Following the derivation of $\tilde{\Delta}_{\max}$ from $\tilde{\eta}$, the watermark bit $\tilde{w}_b$ can be verified by first calculating

$$\gamma = \frac{\tilde{\lambda}_1 - \tilde{\lambda}_2}{\tilde{\Delta}_{\max}} - \left\lfloor \frac{\tilde{\lambda}_1 - \tilde{\lambda}_2}{\tilde{\Delta}_{\max}} \right\rfloor. \tag{28}$$

$\tilde{w}_b$ is ‘1’ if $\gamma \ge 0.5$, and is ‘0’ otherwise.
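
The embedding and extraction of a single bit in one packet node can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the step `delta_max` is supplied directly instead of being derived from the masking threshold, consecutive coefficient pairs are taken as the columns of the matrix, and all function names are hypothetical.

```python
import numpy as np

def _to_matrix(coeffs):
    # 2 x N/2 layout whose columns are (c1, c2), (c3, c4), ... (assumed arrangement).
    return np.asarray(coeffs, dtype=float).reshape(-1, 2).T

def _effective_step(delta_max, energy):
    # Preventive measure: cap the step at (4/3) * sqrt(E_c).
    return min(delta_max, 4.0 / 3.0 * np.sqrt(energy))

def embed_bit(coeffs, bit, delta_max):
    """Quantize the gap between the two singular values while keeping the energy."""
    U, s, Vt = np.linalg.svd(_to_matrix(coeffs), full_matrices=False)
    lam1, lam2 = s
    delta = _effective_step(delta_max, lam1 ** 2 + lam2 ** 2)
    g_new = np.floor((lam1 - lam2) / delta) * delta + (0.75 if bit else 0.25) * delta
    rho = g_new - lam1 + lam2
    # Quadratic in delta_1: 2 d^2 + (2 lam1 + 2 lam2 - 2 rho) d + (rho^2 - 2 lam2 rho) = 0
    a, b, c = 2.0, 2.0 * (lam1 + lam2 - rho), rho ** 2 - 2.0 * lam2 * rho
    disc = b * b - 4.0 * a * c
    if disc < 0:
        raise ValueError("no real solution for this step size")
    roots = np.array([(-b + np.sqrt(disc)) / (2 * a), (-b - np.sqrt(disc)) / (2 * a)])
    d1 = roots[np.argmin(np.abs(roots))]   # root with the smaller magnitude
    d2 = rho - d1
    s_new = np.array([lam1 + d1, lam2 - d2])
    return ((U * s_new) @ Vt).T.reshape(-1)

def extract_bit(coeffs, delta_max):
    """Recover the bit from the fractional part of gap / step."""
    s = np.linalg.svd(_to_matrix(coeffs), compute_uv=False)
    delta = _effective_step(delta_max, np.sum(s ** 2))
    gamma = (s[0] - s[1]) / delta
    gamma -= np.floor(gamma)
    return int(gamma >= 0.5)
```

A round trip such as `extract_bit(embed_bit(c, 1, 0.4), 0.4)` should return the embedded bit as long as the same step is available on both sides, which is exactly the requirement the adaptive scheme has to meet.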

5 Further enhancement

The main challenge of the adaptive QIM lies in the presupposition that the quantization steps must be accurately recovered from the watermarked signal. As seen in Section 4, the quantization step is correlated to the masking threshold, of which the formulation involves the tonality and power deduced from the signal. During the watermark embedding, the process of QIM inevitably varies the tonality and therefore causes difficulties in retrieving the quantization steps for watermark extraction. A simple way to overcome this problem is to take advantage of SVD.

It is recalled from Equation (15) that the SVD decomposes the signal into two parts, namely, $\lambda_1 u_1 v_1^T$ and $\lambda_2 u_2 v_2^T$. These two parts become $\lambda'_1 u_1 v_1^T$ and $\lambda'_2 u_2 v_2^T$, respectively, after applying QIM. As $\lambda'_1$ is always larger than $\lambda'_2$, $\lambda'_1 u_1 v_1^T$ can be regarded as the predominant part of the watermarked signal. If the tonality is merely derived from the predominant part, i.e., $\lambda_1 u_1 v_1^T$ in the original signal and $\lambda'_1 u_1 v_1^T$ in the watermarked signal, the results remain identical because the two scalars, $\lambda_1$ and $\lambda'_1$, do not affect the tonality. Hence, our first enhancement to the proposed DWPT-SVD scheme is to use $u_1 v_1^T$ to compute the tonality.

Another important factor in the derivation of the masking threshold is the signal power. Although the signal power has been deliberately maintained during watermark embedding, attacks such as MP3 compression and noise contamination may alter the segmental power. To alleviate the problem of power alteration, our second enhancement adopts a lowpass 2-D filter to smooth the quantization steps distributed over a plane formed by critical band numbers and frame indices. Figure 3 illustrates the idea of filter smoothing. The filter coefficients are obtained from a rotationally symmetric Gaussian function with a variance of 0.5. The filter size is tentatively chosen as 3 × 3 since it offers satisfactory results. Note that the quantization steps computed at the embedding stage must also be processed by the filter when the second enhancement takes effect. The reason for this arrangement is to ensure an exact restoration of the quantization steps from the watermarked signal.

Figure 3

Illustration of the lowpass filtering over the derived quantization steps.
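
The smoothing of the quantization steps can be sketched as below. The sketch reads "variance 0.5" literally, i.e., as a squared standard deviation of 0.5; replicating the border values at the edges of the plane is an assumption, since the boundary handling is not specified here, and the function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(size=3, variance=0.5):
    """Rotationally symmetric Gaussian kernel normalized to unit sum."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * variance))
    return kernel / kernel.sum()

def smooth_steps(steps, kernel):
    """Filter a (bands x frames) plane of steps; edges are replicated (assumed)."""
    steps = np.asarray(steps, dtype=float)
    pad = kernel.shape[0] // 2
    padded = np.pad(steps, pad, mode="edge")
    out = np.zeros_like(steps)
    rows, cols = steps.shape
    for di in range(kernel.shape[0]):
        for dj in range(kernel.shape[1]):
            out += kernel[di, dj] * padded[di:di + rows, dj:dj + cols]
    return out
```

Because the kernel is symmetric and sums to one, a plane of constant steps passes through unchanged, while isolated deviations are pulled toward their neighbors.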

6 Integration of the entire watermarking system

Figure 4 presents the configuration of the developed watermarking system. The watermark can be an arbitrary binary bit sequence. Just for the purpose of illustration, we adopt a binary image W(i, j) of size 32 × 32, which contains an equal amount of 0's and 1's. The procedures for embedding the watermark are as follows:

  1. Maintain security by scrambling the image watermark using the Arnold transform [38].

  2. Convert the scrambled image into a bit stream.

  3. Partition the audio signal into frames of size 4,096 samples.

  4. Insert the synchronization codes into the audio signal using the time-domain adaptive QIM presented in Section 3.

  5. For the third to the fifteenth critical bands in each frame:

     a. Compute the DWPT coefficients.

     b. Apply SVD to the matrix formed by the DWPT coefficients.

     c. Derive the quantization step.

     d. Embed one binary bit by quantizing the gap between two principal singular values of SVD.

     e. Recompose the DWPT coefficients.

  6. Perform inverse DWPT to obtain the watermarked audio signal.

Figure 4

The proposed watermarking system.

The watermark extraction is a reverse process. The procedural steps are the following:

  1. Align the frame by tracing the synchronous markers.

  2. For the third to the fifteenth critical bands in each frame:

     a. Compute the DWPT coefficients.

     b. Apply SVD to the matrix formed by the DWPT coefficients.

     c. Derive the quantization step.

     d. Quantize the gap between two singular values.

     e. Translate the quantized value into a binary bit.

  3. Gather the bits from all frames.

  4. Convert the bit sequence into an image matrix.

  5. Take the inverse Arnold transform to restore the watermark image, termed $\tilde{W}(i, j)$.
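
The scrambling step and its inverse can be illustrated with the commonly used form of the Arnold (cat) map on a square image, $(x, y) \mapsto (x + y,\ x + 2y) \bmod N$. The map variant and the iteration count are assumptions here, since they are not specified above; in practice the iteration count could serve as part of a secret key.

```python
import numpy as np

def arnold(img, iterations=1):
    """Scramble a square image with the Arnold cat map."""
    n = img.shape[0]
    x, y = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    out = img
    for _ in range(iterations):
        nxt = np.empty_like(out)
        nxt[(x + y) % n, (x + 2 * y) % n] = out[x, y]
        out = nxt
    return out

def inverse_arnold(img, iterations=1):
    """Undo `arnold` by reading each pixel back from its mapped position."""
    n = img.shape[0]
    x, y = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    out = img
    for _ in range(iterations):
        nxt = np.empty_like(out)
        nxt[x, y] = out[(x + y) % n, (x + 2 * y) % n]
        out = nxt
    return out
```

The map is a bijection on the pixel grid, so scrambling only permutes the bits and the equal count of 0's and 1's in the watermark is preserved.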

7 Performance evaluation

The test subjects comprised ten pieces of 30-s music recordings clipped from randomly chosen CD albums, including vocal arrangements and ensembles of musical instruments. All audio signals were sampled at 44.1 kHz with 16-bit resolution. The performance evaluation comprises three aspects: payload capacity, quality assessment, and robustness test.

To understand the influences of the two enhancements mentioned in the previous section, the test of the proposed DWPT-SVD-adaptive QIM consists of three phases, namely, the proposed one solely, the one with enhancement 1, and the one with enhancements 1 and 2.

Three recently developed SVD-based methods, denominated as ‘adaptive DWT-SVD’ [22], ‘SVD-DCT’ [17], and ‘LWT-SVD’ [28], are employed for performance comparison as they represent other ways to exploit the SVD for audio watermarking in transform domains. The minimum and maximum quantization steps in the adaptive DWT-SVD are 0.6 and 0.9 respectively, which are the typically suggested values. The parameters α and β for controlling the embedding strength in the SVD-DCT are assigned as 0.125 and 0.1, respectively. For the LWT-SVD method, the decomposition level of the lifting wavelet transform is chosen as 4 and the quantization step size is 0.6. The other parameters used in these three methods follow original specifications [17, 22, 28].

7.1 Payload

The theoretical payload capacities for the methods under investigation are presented in Table 2. The LWT-SVD holds the highest number in comparison to others. The capacity of the proposed scheme is 13 × 44,100/4,096 = 139.97 bps (one bit in each of 13 critical bands per 4,096-sample frame), which is lower than that of the LWT-SVD. However, this quantity is already three times more than that achieved by the adaptive DWT-SVD and SVD-DCT. It is worth pointing out that the payload capacities listed in Table 2 are computed without considering the demand of synchronous codes. In general, these numbers will drop if the watermarking methods need to allocate extra segments for frame synchronization. One advantage of the proposed synchronization technique is that it only affects the spectrum concentrated in the first two critical bands, thus leaving the remaining critical bands available for information hiding.

Table 2 Statistics of the measured SNRs and ODGs, along with the payload capacities

7.2 Quality assessment

The quality disturbance resulting from watermark embedding is assessed using the SNR and perceptual evaluation of audio quality (PEAQ) [39, 40]. The SNR is defined as

$$\mathrm{SNR} = 10 \log_{10} \left( \frac{\sum_n s^2(n)}{\sum_n \left( s(n) - \tilde{s}(n) \right)^2} \right), \tag{29}$$

where $s(n)$ and $\tilde{s}(n)$ are the original and watermarked audio signals, respectively. Since the auditory quality is a fundamentally subjective concept that does not necessarily correspond to the measured SNR, this study also resorts to the PEAQ to measure the perceived quality. The PEAQ algorithm aims at simulating human perceptual properties and integrates multiple model output variables into a single metric. It renders an objective difference grade (ODG) between −4 and 0, signifying a perceptual impression ranging from ‘very annoying’ to ‘imperceptible’.
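
For reference, the SNR defined above can be computed with a few lines of Python; the function name is illustrative only. The PEAQ measurement is not sketched here because it requires a full perceptual model.

```python
import numpy as np

def snr_db(original, watermarked):
    """SNR in dB between the original and the watermarked signal."""
    s = np.asarray(original, dtype=float)
    sw = np.asarray(watermarked, dtype=float)
    return 10.0 * np.log10(np.sum(s ** 2) / np.sum((s - sw) ** 2))
```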

Table 2 also provides the measured SNRs and ODGs for all kinds of watermarked audio signals. The SVD-DCT generally renders the largest SNR value, while the proposed scheme produces the lowest. Although the SNRs do not favor the proposed scheme, the resulting ODGs suggest that our scheme indeed achieves the best perceived quality. In fact, the average ODG is around 0 for our scheme, implying that the watermarked signal is nearly indistinguishable from the original one. The average ODGs for the adaptive DWT-SVD and SVD-DCT are slightly above 1, indicating that the distortion caused by watermarking may still be perceivable. On the other hand, the quality degradation by the LWT-SVD seems to be minor, as the corresponding average ODG is just −0.4. Nevertheless, none of these three methods attains ODGs as good as ours.

7.3 Robustness test

The robustness test consists of two categories: one is focused on frame synchronization, and the other is concerned with watermark recovery. The attack types considered in this study include the following:

  A. Resampling: conducting down-sampling to 11,025 Hz and then upsampling back to 44,100 Hz.

  B. Requantization: quantizing the watermarked signal to 8 bits/sample and then back to 16 bits/sample.

  C. Amplitude scaling: scaling the amplitude of the watermarked audio signal by 0.85.

  D. Noise corruption: adding zero-mean white Gaussian noise to the watermarked audio signal with SNR = 30 dB.

  E. Noise corruption: adding zero-mean white Gaussian noise to the watermarked audio signal with SNR = 20 dB.

  F. Lowpass filtering: applying a lowpass filter with a cutoff frequency of 8 kHz.

  G. Echo addition: adding an echo signal with a delay of 50 ms and a decay of 5% to the watermarked audio signal.

  H. Jittering: randomly deleting or adding one sample for every 100 samples within each frame.

  I. 128-kbps MPEG compression: compressing and decompressing the watermarked audio signal with an MPEG layer III coder at a bit rate of 128 kbps.

  J. 64-kbps MPEG compression: compressing and decompressing the watermarked audio signal with an MPEG layer III coder at a bit rate of 64 kbps.

  K. Time shifting: shifting the watermarked audio signal by an amount of 50% relative to the frame length.

The efficiency of the proposed synchronization scheme is demonstrated via the statistical means and standard deviations of the $\hat{r}_3(i)$ values discussed in Section 3, along with the misdetection counts of the synchronization markers. As revealed from the results in Table 3, the detectability of the synchronous marks is always reliable, indicating that common attacks do not impose any threat to the watermarking system equipped with such a synchronization technique.

Table 3 Statistical results of the estimated correlation functions for the time-domain synchronization scheme

The robustness of the proposed watermarking technique in the presence of various attacks is evaluated using the bit error rate (BER), which is defined as

$$\mathrm{BER}(W, \tilde{W}) = \frac{\sum_{i=1}^{M} \sum_{j=1}^{M} W(i, j) \oplus \tilde{W}(i, j)}{M \times M}, \tag{30}$$

where $\oplus$ stands for the exclusive-or operator. Table 4 gives the BERs obtained from the watermarked audio signals under the attacks.

Table 4 Averaged bit error rates of the watermarking schemes under various attacks
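
The bit error rate between the original and the recovered binary watermark images can be computed as in the following Python sketch; the function name is illustrative only.

```python
import numpy as np

def bit_error_rate(w, w_rec):
    """Fraction of positions where the two binary images differ (XOR count / size)."""
    w = np.asarray(w).astype(bool)
    w_rec = np.asarray(w_rec).astype(bool)
    return float(np.count_nonzero(np.logical_xor(w, w_rec))) / w.size
```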

Generally speaking, all the SVD-based methods manifest certain robustness against most attacks. However, the adaptive DWT-SVD and LWT-SVD appear vulnerable to amplitude scaling. The reason can be ascribed to the fact that some of the controlling parameters in both methods are fixed. A minor change in amplitude may therefore result in a disastrous consequence. In contrast, the SVD-DCT and the proposed scheme do not exhibit such deficiency, as both of them are designed to adapt to amplitude variation. Besides amplitude scaling, the adaptive DWT-SVD also suffers from the resampling attack. This is because resampling alters the statistical distribution of the DWT coefficients, which eventually leads to inaccurate watermark extraction.

As shown in Table 4, the proposed scheme generally retains very high accuracy under all sorts of attacks, but it seldom reaches 100% correctness. This is because the masking threshold derived from the watermarked signal may somewhat differ from the original one. To remedy this drawback, two enhancements have been proposed in Section 5. The first enhancement rectifies the inconsistency in the derivation of tonality. As a consequence, the proposed scheme achieves perfect accuracy if no attack is present. Excellent robustness is also observed for attacks like resampling, amplitude scaling, and lowpass filtering. The second enhancement tends to mitigate the power alterations caused by the attacks. After being equipped with the second enhancement, the proposed scheme gains noticeable improvements for all kinds of attacks. More importantly, the changes in SNR and ODG are slight, meaning that the improvement is not obtained at the cost of perceived quality.

7.4 Security

There are several possible ways to enhance watermark security. In [17, 28], the synchronous code was chaotically permuted and the watermark data were scrambled. A similar strategy is certainly applicable to our system. Here, the Arnold transform is chosen to shuffle the watermark image since this technique has been widely used in digital image encryption. Aside from data scrambling, the controlling parameters (e.g., the frame length, the arrangement of the matrix in Equation (14), and/or the selected critical bands) can serve as secret keys. It would be difficult, if not impossible, to detect the watermark without knowing the exact parameters.
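The scrambling step can be illustrated with a short Python sketch. The text does not state which form of the Arnold transform or how many iterations are used, so the snippet assumes the common cat-map form (x, y) → (x + y, x + 2y) mod N on a square N × N image and treats the iteration count as a secret key. The function names are hypothetical.

```python
def arnold_scramble(image, iterations=1):
    """Apply the Arnold cat map (x, y) -> (x + y, x + 2y) mod N to a square image."""
    n = len(image)
    current = [row[:] for row in image]
    for _ in range(iterations):
        scrambled = [[0] * n for _ in range(n)]
        for x in range(n):
            for y in range(n):
                scrambled[(x + y) % n][(x + 2 * y) % n] = current[x][y]
        current = scrambled
    return current


def arnold_unscramble(image, iterations=1):
    """Invert arnold_scramble using the inverse map (x', y') -> (2x' - y', y' - x') mod N."""
    n = len(image)
    current = [row[:] for row in image]
    for _ in range(iterations):
        restored = [[0] * n for _ in range(n)]
        for xp in range(n):
            for yp in range(n):
                restored[(2 * xp - yp) % n][(yp - xp) % n] = current[xp][yp]
        current = restored
    return current


# Hypothetical 5 x 5 "image" with distinct entries so the permutation is visible.
demo = [[r * 5 + c for c in range(5)] for r in range(5)]
shuffled = arnold_scramble(demo, iterations=3)
recovered = arnold_unscramble(shuffled, iterations=3)
print(recovered == demo)
```

Because the map is periodic for any given N, the iteration count should be chosen so that it is not a multiple of that period; otherwise the image returns to its original arrangement.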

8 Error analysis

Two types of errors can occur during the search for watermarks. The false-positive error (FPE) is the probability of declaring an unwatermarked audio signal as a watermarked one, whereas the false-negative error (FNE) is the probability of the opposite condition, i.e., classifying a watermarked audio signal as an unwatermarked one.

Following the basic assumptions and derivation rules given in [22], the FPE $P_{fp}$ can be computed as

$$P_{fp} = P\left\{ H\left(W, \tilde{W}\right) \ge T \mid \text{without watermark} \right\} = \sum_{k=T}^{N_w} \binom{N_w}{k} P_e^{k} \left(1 - P_e\right)^{N_w - k}, \tag{31}$$

where $H(W, \tilde{W})$ denotes the number of matched bits in a total of $N_w$ bits, and $T$ is the threshold for claiming the existence of the watermark. $\binom{N_w}{k}$ stands for the binomial coefficient. $P_e$ is the probability that the extracted bits match with the original watermark bits. Since the unwatermarked bits are either 0 or 1 with pure randomness, $P_e$ is therefore assumed to be 0.5. As a result, Equation (31) can be further simplified as

$$P_{fp} = \frac{1}{2^{N_w}} \sum_{k=T}^{N_w} \binom{N_w}{k}. \tag{32}$$

If $N_w = 1024$ and $T = \lceil 0.8 \times N_w \rceil = 820$, then $P_{fp} = 2.62 \times 10^{-88}$, which means that a false-positive error is extremely unlikely.
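The simplified binomial tail can be checked numerically. The sketch below is one way to do so in Python using exact integer arithmetic, which avoids floating-point overflow in the binomial coefficients. The function name `false_positive_probability` is an illustrative assumption.

```python
from math import comb


def false_positive_probability(n_bits, threshold):
    """P(at least `threshold` of `n_bits` purely random bits match), i.e. P_e = 0.5."""
    favourable = sum(comb(n_bits, k) for k in range(threshold, n_bits + 1))
    # Both operands are exact big integers; the division rounds once to a float.
    return favourable / 2 ** n_bits


p_fp = false_positive_probability(1024, 820)
print(f"P_fp = {p_fp:.3e}")
```

With $N_w = 1024$ and $T = 820$, the printed value should be on the order of $10^{-88}$, in line with the figure quoted above.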

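Analogous to the derivation of the FPE, the FNE $P_{fn}$ can be computed as

$$P_{fn} = P\left\{ H\left(W, \tilde{W}\right) < T \mid \text{with watermark} \right\} = \sum_{k=0}^{T-1} \binom{N_w}{k} \left(1 - \mathrm{BER}\right)^{k} \times \mathrm{BER}^{N_w - k} = \sum_{k=N_w-T+1}^{N_w} \binom{N_w}{k} \mathrm{BER}^{k} \times \left(1 - \mathrm{BER}\right)^{N_w - k}. \tag{33}$$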

Taking the worst case (where BER = 0.012) in our experiments as an example, the FNE of the proposed scheme is virtually zero.
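To make "virtually zero" concrete, the following Python sketch evaluates the first sum in the FNE expression with exact integer arithmetic. It assumes, as the formula does, that bit errors are independent and each occurs with probability BER. The BER is passed as a rational number (12/1000 for the worst case) to avoid floating-point underflow; the function name and this parameterization are illustrative assumptions.

```python
from math import comb


def false_negative_probability(n_bits, threshold, ber_num, ber_den):
    """P(fewer than `threshold` of `n_bits` bits match) for BER = ber_num / ber_den."""
    ok_num = ber_den - ber_num  # numerator of (1 - BER) over the same denominator
    numerator = sum(
        comb(n_bits, k) * ok_num ** k * ber_num ** (n_bits - k)
        for k in range(threshold)  # k = 0, ..., T - 1 matched bits
    )
    # Every term shares the denominator ber_den ** n_bits; divide once at the end.
    return numerator / ber_den ** n_bits


p_fn = false_negative_probability(1024, 820, 12, 1000)
print(f"P_fn = {p_fn:.3e}")
```

The result is positive but many orders of magnitude smaller than the FPE computed earlier, which is why it can be treated as zero in practice.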

9 Conclusion

This paper presents an efficient audio watermarking technique, which integrates the DWPT, SVD, and adaptive QIM subject to the auditory masking effect. While the DWPT decomposes the audio signal into critical bands, the exploration of perceptual entropy leads to the derivation of auditory masking thresholds. The thresholds, in turn, determine the quantization steps required by the QIM. By virtue of the robustness of the SVD technique, the proposed watermarking scheme first assembles the DWPT coefficients into a matrix and then manipulates the singular values to satisfy three criteria. As a result, the embedded watermark is guaranteed to remain below the perceptible level. To further improve the overall performance, this study introduces two auxiliary enhancement measures to ensure the recovery of quantization steps.

Apart from the scheme for data embedding, the developed watermarking system is equipped with a competent frame synchronization technique to withstand time-shifting attacks. The experimental results reveal that the proposed DWPT-SVD-adaptive QIM scheme performs very well against many attacks such as resampling, requantization, amplitude scaling, lowpass filtering, jittering, echo addition, white noise contamination, and MP3 compression. The comparison with the other SVD-related watermarking methods indicates that our scheme is comparable to, if not better than, the selected methods. Most importantly, the resulting average ODGs of the proposed scheme are around 0, implying that the embedded watermarks and synchronous codes are virtually inaudible to human ears. All these merits can be attributed to the incorporation of the perceptually adaptive QIM with SVD in the DWPT domain.


References

1. Swanson MD, Zhu B, Tewfik AH, Boney L: Robust audio watermarking using perceptual masking. Signal Process. 1998, 66(3):337-355. 10.1016/S0165-1684(98)00014-0

2. Bassia P, Pitas I, Nikolaidis N: Robust audio watermarking in the time domain. IEEE Trans. Multimedia 2001, 3(2):232-241. 10.1109/6046.923822

3. Lie W-N, Chang L-C: Robust and high-quality time-domain audio watermarking based on low-frequency amplitude modification. IEEE Trans. Multimedia 2006, 8(1):46-59.

4. Lemma AN, Aprea J, Oomen W, van de Kerkhof L: A temporal domain audio watermarking technique. IEEE Trans. Signal Processing 2003, 51(4):1088-1097. 10.1109/TSP.2003.809372

5. Abd El-Samie FE: An efficient singular value decomposition algorithm for digital audio watermarking. Int. J. Speech. Technol. 2009, 12(1):27-45.

6. Li W, Xue X, Lu P: Localized audio watermarking technique robust against time-scale modification. IEEE Trans. Multimedia 2006, 8(1):60-69.

7. Tachibana R, Shimizu S, Kobayashi S, Nakamura T: An audio watermarking method using a two-dimensional pseudo-random array. Signal Process. 2002, 82(10):1455-1469. 10.1016/S0165-1684(02)00284-0

8. Megías D, Serra-Ruiz J, Fallahpour M: Efficient self-synchronised blind audio watermarking system based on time domain and FFT amplitude modification. Signal Process. 2010, 90(12):3078-3092. 10.1016/j.sigpro.2010.05.012

9. Li X, Yu HH: Transparent and robust audio data hiding in cepstrum domain. ICME 2000, 1: 397-400.

10. Li S, Cui L, Choi J, Cui X: An audio copyright protection schemes based on SMM in cepstrum domain. Lect. Notes Comput. Sc. 2006, 4109: 923-927. 10.1007/11815921_102

11. Liu SC, Lin SD: BCH code-based robust audio watermarking in cepstrum domain. J. Inf. Sci. Eng. 2006, 22(3):535-543.

12. Lee SK, Ho Y-S: Digital audio watermarking in the cepstrum domain. IEEE T. Consum. Electr. 2000, 46(3):744-750. 10.1109/30.883441

13. Hu H-T, Chen W-H: A dual cepstrum-based watermarking scheme with self-synchronization. Signal Process. 2012, 92(4):1109-1116. 10.1016/j.sigpro.2011.11.001

14. Wang X-Y, Zhao H: A novel synchronization invariant audio watermarking scheme based on DWT and DCT. IEEE Trans. Signal Processing 2006, 54(12):4835-4840.

15. Yeo I-K, Kim HJ: Modified patchwork algorithm: a novel audio watermarking scheme. IEEE Trans. Speech and Audio Processing 2003, 11(4):381-386. 10.1109/TSA.2003.812145

16. Wang X, Qi W, Niu P: A new, adaptive digital audio watermarking based on support vector regression. IEEE T. Audio Speech 2007, 15(8):2270-2277.

17. Lei BY, Soon IY, Li Z: Blind and robust audio watermarking scheme based on SVD–DCT. Signal Process. 2011, 91(8):1973-1984. 10.1016/j.sigpro.2011.03.001

18. He X, Scordilis MS: Efficiently synchronized spread-spectrum audio watermarking with improved psychoacoustic model. Research Letter in Signal Process 2008. 10.1155/2008/251868

19. Xiang S, Kim HJ, Huang J: Audio watermarking robust against time-scale modification and MP3 compression. Signal Process. 2008, 88(10):2372-2387. 10.1016/j.sigpro.2008.03.019

20. Wang X-Y, Niu P-P, Yang H-Y: A robust digital audio watermarking based on statistics characteristics. Pattern Recognition 2009, 42(11):3057-3064. 10.1016/j.patcog.2009.01.015

21. Wu S, Huang J, Huang D, Shi YQ: Efficiently self-synchronized audio watermarking for assured audio data transmission. IEEE Trans. Broadcast. 2005, 51(1):69-76. 10.1109/TBC.2004.838265

22. Bhat KV, Sengupta I, Das A: An adaptive audio watermarking based on the singular value decomposition in the wavelet domain. Digit. Signal Process. 2010, 20(6):1547-1558.

23. Chen S-T, Wu G-D, Huang H-N: Wavelet-domain audio watermarking scheme using optimisation-based quantisation. IET Signal Process. 2010, 4(6):720-727. 10.1049/iet-spr.2009.0187

24. Chen B, Wornell GW: Quantization index modulation: a class of provably good methods for digital watermarking and information embedding. IEEE Trans. Inform. Theory 2001, 47(4):1423-1443. 10.1109/18.923725

25. Ruizhen L, Tieniu T: An SVD-based watermarking scheme for protecting rightful ownership. IEEE Trans. Multimedia 2002, 4(1):121-128. 10.1109/6046.985560

26. Bao P, Xiaohu M: Image adaptive watermarking using wavelet domain singular value decomposition. IEEE Trans. Circuits Syst. Video Technol. 2005, 15(1):96-102.

27. Al-Nuaimy W, El-Bendary MAM, Shafik A, Shawki F, Abou-El-azm AE, El-Fishawy NA, Elhalafawy SM, Diab SM, Sallam BM, Abd El-Samie FE, Kazemian HB: An SVD audio watermarking approach using chaotic encrypted images. Digit. Signal Process. 2011, 21(6):764-779.

28. Lei B, Soon IY, Zhou F, Li Z, Lei H: A robust audio watermarking scheme based on lifting wavelet transform and singular value decomposition. Signal Process. 2012, 92(9):1985-2001.

29. Zezula R, Misurec J: Audio digital watermarking algorithm based on SVD in MCLT domain. ICONS 2008, 140-143.

30. Dhawan A, Mitra SK: Hybrid audio watermarking with spread spectrum and singular value decomposition. INDICON 2008, 11-16.

31. Carnero B, Drygajlo A: Perceptual speech coding and enhancement using frame-synchronized fast wavelet packet transform algorithms. IEEE Trans. Signal Processing 1999, 47(6):1622-1635. 10.1109/78.765133

32. He X, Scordilis MS: An enhanced psychoacoustic model based on the discrete wavelet packet transform. J. Franklin Inst. 2006, 343(7):738-755. 10.1016/j.jfranklin.2006.07.005

33. Painter T, Spanias A: Perceptual coding of digital audio. Proc. IEEE 2000, 88(4):451-515.

34. Johnston JD: Estimation of perceptual entropy using noise masking criteria. Proc. IEEE Int. Conf. Acoust. Speech Signal Process. 1988, 2525: 2524-2527.

35. Johnston JD: Transform coding of audio signals using perceptual noise criteria. IEEE J. Select. Areas Commun. 1988, 6(2):314-323. 10.1109/49.608

36. Hu H-T, Yu C: A perceptually adaptive QIM scheme for efficient watermark synchronization. IEICE T. Inf. Syst. 2012, E95-D(12):3097-3100. 10.1587/transinf.E95.D.3097

37. Gentry SM: Detection Optimization Using Linear Systems Analysis of a Coded Aperture Laser Sensor System: Sandia Report. Albuquerque: Sandia National Laboratories; 1994.

38. Arnold VI, Avez A: Ergodic Problems of Classical Mechanics. New York: Benjamin; 1968.

39. ITU Radiocommunication Sector (ITU-R): Recommendation BS.1387: Method for Objective Measurements of Perceived Audio Quality. Geneva: International Telecommunication Union; 1998.

40. Kabal P: An Examination and Interpretation of ITU-R BS.1387: Perceptual Evaluation of Audio Quality, TSP Lab Technical Report. Montréal: Department of Electrical and Computer Engineering, McGill University; 2002.


This work was supported by the National Science Council, Taiwan, ROC, under grants NSC101-2221-E-197-033 and NSC102-2221-E-197-020.

Author information



Corresponding author

Correspondence to Hwai-Tsu Hu.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Hu, HT., Chou, HH., Yu, C. et al. Incorporation of perceptually adaptive QIM with singular value decomposition for blind audio watermarking. EURASIP J. Adv. Signal Process. 2014, 12 (2014).


  • Singular value decomposition
  • Discrete wavelet packet transform
  • Adaptive quantization index modulation
  • Auditory masking threshold
  • Frame synchronization