Incorporation of perceptually adaptive QIM with singular value decomposition for blind audio watermarking

This paper presents a novel approach for blind audio watermarking. The proposed scheme utilizes the flexibility of the discrete wavelet packet transform (DWPT) to approximate the critical bands and adaptively determines suitable embedding strengths for carrying out quantization index modulation (QIM). Singular value decomposition (SVD) is employed to analyze the matrix formed by the DWPT coefficients and to embed watermark bits by manipulating singular values subject to perceptual criteria. To achieve even better performance, two auxiliary enhancement measures are attached to the developed scheme. Performance evaluation and comparison are demonstrated in the presence of common digital signal processing attacks. Experimental results confirm that the combination of the DWPT, SVD, and adaptive QIM achieves imperceptible data hiding with satisfactory robustness and payload capacity. Moreover, the inclusion of self-synchronization capability allows the developed watermarking system to withstand time-shifting and cropping attacks.


Introduction
In recent years, copyright protection of multimedia data has been of great concern to content owners and service providers. Digital watermarking technology has received much attention as a means of addressing this concern because it can hide information in multimedia objects (e.g., images, audio, and video) for applications such as intellectual property protection, content authentication, and fingerprinting.
Compared with transform-domain methods, the time-domain approach is easier to implement and requires less computation. The watermark is usually a pseudo noise added to the host signal. Alternatively, binary information can be converted into a noise-like signal through the spread spectrum technique. The existence of the watermark can then be verified by measuring the correlation between the pseudo noise and the watermarked signal. However, time-domain methods are usually less robust to digital signal processing attacks unless a long segment along with adequate embedding strength is adopted. In contrast, quantization index modulation (QIM) has been proven to be a promising technique [24]. Time-domain data embedding is achieved by quantizing parameters derived from the time series. Though QIM generally outperforms spread spectrum in the time domain, it still needs a long segment for reliable detection. As a consequence, time-domain QIM was mainly used for frame synchronization in many watermarking systems [14,20,21,24]. Aware of the limitations of the time-domain approach, many researchers thus turned to transform domains where signal characteristics can be better explored. The embedding intensity as well as the position of the watermark can be selected based upon features extracted in the transform domains [1,14,21].
Singular value decomposition (SVD) is a powerful tool for image processing applications [25,26]. Because the SVD can adapt to various transform domains, it has been extensively applied in audio watermarking [5,8,17,22,27]. For instance, Abd El-Samie [5] utilized a twofold strategy to embed the watermark. After applying the first SVD to a 2-D matrix formed by the audio signal, he blended the intended watermark with the diagonal matrix holding singular values and then performed the second SVD on the modified matrix. In his design, the matrices containing left- and right-singular vectors must be conserved in order to extract the watermark. Al-Nuaimy et al. [27] further extended the twofold strategy and applied it to audio signals transmitted over network systems on a segment-by-segment basis.
Bhat et al. [22] presented an SVD-based blind watermarking scheme operating in the DWT domain. The watermark bits were embedded into the audio signals using QIM, in which the quantization steps were adaptively determined according to the statistical properties of the involved DWT coefficients. The authors claimed that their scheme was the first adaptive audio watermarking scheme exploring both DWT and SVD and that it had a high payload and superior performance against MP3 compression. Lei et al. [17] also attempted to embed a binary watermark into the high-frequency band of the SVD-DCT block. They attained a performance generally better than previous SVD-based methods. Most recently, Lei et al. [28] integrated lifting wavelet transform (LWT), SVD, and QIM to achieve a very good tradeoff among robustness, imperceptibility, and payload. Apart from the abovementioned methods, there are other audio watermarking schemes applicable to different domains in the literature [29,30].
Audio watermarks are supposed to be transparent to human ears, meaning that the modification due to watermarking is virtually inaudible. One way to enhance the embedding efficiency is to exploit the auditory characteristics so that the embedding strength is sufficiently high to withstand attacks without introducing audible distortion. The methods presented in [16,17,22] demonstrated the benefit of exploiting the signal characteristics, but they relied on heuristic rules to decide the embedding strength. In these methods, even though some attention was paid to adjusting relevant parameters to reach optimal performance, the connection between multiple transform domains and human auditory properties has not been thoroughly addressed.
Because the DWPT possesses multi-resolution capability and is more computationally efficient than the Fourier transform, it may cooperate with the psychoacoustic model to render an estimate of auditory masking thresholds [31,32]. Hence, our aim in this study is to explore all useful properties of the DWPT, SVD, and QIM for audio watermarking such that the issues of robustness, imperceptibility, and payload capacity can be resolved altogether. In particular, the primary interest is placed on blind watermarking, which does not require the original audio signal to extract the watermark.

Derivation of auditory masking threshold in the DWPT domain
Auditory masking is the effect whereby a sound is rendered inaudible by the presence of a louder sound. There are two types of auditory masking. One is spectral masking (sometimes referred to as simultaneous masking), which is the characteristic of the human auditory system whereby a sound signal is masked by a masker at a different frequency. The other is temporal masking (or non-simultaneous masking), which is the masking effect occurring before and after a sudden stimulus sound.
While studying spectral masking, critical bands are of great importance because they can be employed to elucidate the properties of frequency selectivity [32,33]. Based upon the theory of perceptual entropy [31-35], this study derives the auditory masking threshold in terms of signal power for each critical band. The derivation begins with the utilization of the DWPT to approximate the critical bands. The procedures for deriving spectral masking thresholds are briefly summarized as follows:

1. Segment the host audio signal into frames, each 4,096 samples in length.

2. Decompose the audio signal using the DWPT according to the specification given in Table 1, in which each packet node approximately corresponds to a critical band. The decomposition is carried out using the Daubechies-8 wavelet. Let c_i^(n) denote the ith DWPT coefficient in the nth band, whose length is N^(n).

3. Compute the short-term spectrum X_i^(n) in each band by applying the fast Fourier transform (FFT) to c_i^(n), i.e., X_i^(n) = FFT{c_i^(n)}.

4. Estimate the tonality factor τ to determine whether the band is noise-like or tone-like:

τ = min( SFM_dB / SFM_dB,max , 1 ),  SFM_dB = 10 log10( PM_g(|X_i^(n)|^2) / PM_a(|X_i^(n)|^2) ),

where PM_g(|X_i^(n)|^2) and PM_a(|X_i^(n)|^2) stand for the geometric and arithmetic means of |X_i^(n)|^2, respectively, and SFM_dB,max = −60 dB is the spectral flatness measure of a pure tone.

5. Adjust the masking level according to the tonality factor:

D_z(n) = P(n) · 10^(−a(n)/10),

where P(n) is the signal power in the nth band and a(n) signifies the permissible noise floor relative to the signal in the nth band, formulated as

a(n) = τ(14.5 + n) + (1 − τ) · 5.5.

6. Extend the masking effect to the adjacent bands by convolving the adjusted masking level with a spreading function SF(n), namely C_z(n) = D_z(n) ⊗ 10^(SF(n)/10), with SF(n) given by the spreading function of the perceptual entropy model [31].

7. Compare the masking threshold C_z(n) with the absolute threshold of hearing in quiet, termed T(n) in decibels, and select the maximum of the two as the final masking threshold.

The masking threshold obtained through the above procedure is designated as η(n), which represents the noise power level not detectable by human ears in the nth band.
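As a rough illustration, the per-band tonality estimate and masking threshold (steps 3 to 5 and 7 above; the inter-band spreading of step 6 is omitted for brevity) can be sketched in Python. The function below is a hypothetical helper, not the authors' implementation; the −60-dB spectral-flatness bound and the offset formula follow the standard perceptual-entropy model cited in the text.

```python
import numpy as np

def band_masking_threshold(c, band_index, abs_thresh_db=-20.0):
    """Sketch of steps 3-5 and 7 for one DWPT packet node (hypothetical helper).

    c: DWPT coefficients of one band in one frame.
    band_index: critical band number n.
    abs_thresh_db: stand-in constant for the threshold in quiet T(n).
    """
    X = np.fft.fft(c)
    P = np.abs(X) ** 2 + 1e-12            # power spectrum of the band
    # Step 4: tonality via the spectral flatness measure
    sfm_db = 10.0 * np.log10(np.exp(np.mean(np.log(P))) / np.mean(P))
    tau = min(sfm_db / -60.0, 1.0)        # 1 -> tone-like, 0 -> noise-like
    # Step 5: permissible noise floor a(n) in dB below the band power
    a_db = tau * (14.5 + band_index) + (1.0 - tau) * 5.5
    masking = np.mean(P) * 10.0 ** (-a_db / 10.0)
    # Step 7: lower-bound by the absolute threshold of hearing
    return max(masking, 10.0 ** (abs_thresh_db / 10.0))
```

A tone-like band yields τ near 1 and hence a deeper noise floor (larger offset) than a noise-like band, as the model intends.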

Frame synchronization
One of the weaknesses of existing watermarking methods lies in their vulnerability to time shifting and cropping [14]. Frame synchronization is perhaps the most prevalent counterstrategy for dealing with this issue. Many watermarking systems divide the audio signal into two sorts of segments, namely, one for synchronization and the other for watermarking. This study instead resorts to the idea of frequency division, which uses non-overlapping frequency bands to hide the synchronous codes and the information bits separately. Figure 1 illustrates the idea of frequency division, where the synchronous code is placed in the frequencies below 172 Hz and the information bits are hidden in the critical bands above 172 Hz.
To synchronize the frames, this study utilizes a time-domain QIM that was developed in [36] but is modified to suit the requirements here. The audio signal is deliberately partitioned into frames of length L_f = 8,192 (twice the length used for masking threshold derivation), and each frame is further divided into N_s = 32 subsections. A 32-bit Barker code '11111011101001110100101001001000' [37] is employed for the synchronization task because this code has low correlation with time-shifted versions of itself. Each binary bit is first converted into bipolar form, termed S_b(k) ∊ {−1, 1}, and then embedded into a subsection spanning L_s (≜ L_f/N_s = 256) samples by quantizing the subsection mean, where m and m̂ denote, respectively, the original and modified mean values of the subsection. D is the quantization step, chosen so as to yield no perceptible distortion.
To achieve the goal of imperceptibility, the quantization step at sample i, designated as D_i, is obtained from the root-mean-square of N_p past lowpass-filtered samples:

D_i = 10^(−10/20) √( (1/N_p) Σ_{k=1}^{N_p} x_lp^2(i − k) ),

where x_lp(i) is the output of feeding the audio signal through a fourth-order Butterworth lowpass filter with the cutoff frequency set at 172 Hz, and N_p is chosen as 1,536. The scaling factor 10^(−10/20) attenuates the signal power by 10 dB. The purpose of using x_lp(i) is twofold. First, it provides an estimate of the signal power for frequency components below 172 Hz. Second, it excludes the disturbance from the high-frequency bands where the information bits are located. Following the derivation of the new mean m̂, the proposed time-domain QIM modifies the audio samples in each subsection using

x̂(i + k) = x(i + k) + (m̂ − m) M(k),  k = 0, …, L_s − 1,

where M(k) is a window designed to have a flat top in the middle but descend to zero at both ends. The variable υ in Equation (9) is a scaling factor used to attain a unit mean for M(k). Based on the analysis given in [21], the QIM via Equation (8) introduces a noise with a power level of 7D_i^2/48, which is 8.36 dB below D_i^2. The window M(k) contributes about −0.46 dB to the signal-to-noise ratio (SNR). Combined with the 10 dB reserved in Equation (7), the overall SNR resulting from the watermarking is around 17.9 dB. According to the theory of perceptual entropy [31,34], the masking threshold for the frequency components below 172 Hz is approximately 16 dB below the signal power regardless of signal tonality. Consequently, the purposely reserved 17.9-dB SNR is sufficient to ensure the imperceptibility of the embedded synchronous code. The detection of the synchronization code requires the preparation of a bit sequence b̂(i), which is of the same length as the watermarked audio signal and is derived from the quantization residues of the subsection means, where m̂_i denotes the mean computed over a subsection starting from the ith sample and D̂_i corresponds to the −10-dB RMS of the previous N_p lowpass-filtered samples.
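Because the exact quantizer mapping and window shape (Equations (8) to (10)) are only implicit in the text above, the following Python sketch uses one plausible mean-QIM rule (residues D/4 and 3D/4 for the two bit polarities) and a simple cosine-tapered window normalized to unit mean; these specific choices are assumptions made for illustration, not the paper's exact formulas.

```python
import numpy as np

def taper(L, flat=0.5):
    """Hypothetical window M(k): flat top, cosine tapers to zero at both ends,
    scaled to unit mean so that adding (m_hat - m) * M shifts the mean exactly."""
    w = np.ones(L)
    t = int(L * (1 - flat) / 2)
    ramp = 0.5 * (1 - np.cos(np.pi * np.arange(t) / t))
    w[:t], w[-t:] = ramp, ramp[::-1]
    return w / w.mean()

def embed_sync_bit(x_sub, s, D):
    """Quantize the subsection mean to encode s in {-1, +1} (assumed QIM map)."""
    m = x_sub.mean()
    m_hat = D * (np.floor(m / D) + 0.5 + s / 4.0)   # residue D/4 or 3D/4
    return x_sub + (m_hat - m) * taper(len(x_sub))

def extract_sync_bit(x_sub, D):
    """Decide the bit from the residue of the subsection mean modulo D."""
    return 1 if (x_sub.mean() % D) > D / 2 else -1
```

With a unit-mean window, the new subsection mean lands exactly on the quantizer point, so the bit survives as long as any disturbance of the mean stays below D/4.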
After acquiring b̂(i), the existence of a synchronous code can be identified by examining the cross-correlation between the Barker code S_b(k) and a decimated version of b̂(i):

r(i) = Σ_{k=0}^{31} S_b(k) b̂(i + k L_s).

As Equation (11) places the synchronous code in a backward direction, the largest r(i) over an interval of 8,192 samples indicates a salient demarcation between frames. The synchronization marker can be made more prominent by summing two other cross-correlation functions located 8,192 samples away from the current one:

r̃_3(i) = r(i) + r(i + 8,192) + r(i + 2 × 8,192).

The position of the marker, termed I, is identified simply by picking the largest peak of r̃_3(i) in each interval:

I = argmax_{i_start ≤ i < i_start + 8,192} r̃_3(i),

where i_start denotes the starting index.
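A minimal sketch of this detector, covering the decimated cross-correlation and the peak picking, is given below; the function names and the forward direction of the summation are our assumptions.

```python
import numpy as np

# The 32-bit synchronization code from the text, in bipolar form
SYNC = np.array([1 if b == '1' else -1 for b in
                 '11111011101001110100101001001000'])
L_s = 256  # subsection length

def sync_metric(b_hat, i):
    """r(i): correlate the code with a decimated slice of the bit stream."""
    idx = i + L_s * np.arange(len(SYNC))
    return float(np.dot(SYNC, b_hat[idx]))

def find_marker(b_hat, i_start, L_f=8192):
    """Pick the offset with the largest metric within one frame interval."""
    scores = [sync_metric(b_hat, i) for i in range(i_start, i_start + L_f)]
    return i_start + int(np.argmax(scores))
```

At the true offset every code bit matches, so the metric reaches its maximum value of 32, well above the noise floor of a random ±1 stream.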

Watermarking via SVD
An advantage of SVD-based watermarking is that large singular values change very little under most types of attacks. The proposed watermarking scheme takes this advantage by applying the QIM to the gap between the two principal singular values. For each packet node of the DWPT, the N coefficients c_i in a frame are organized as a 2 × N/2 matrix M. Without loss of generality, the superscript (n) previously used to signify a specific band is dropped in the following expressions. Taking the SVD of M results in M = USV^T, where U is a 2 × 2 real unitary matrix, S is a 2 × N/2 diagonal matrix holding the non-negative real singular values λ_i in decreasing order, and V^T (the transpose of V) is an N/2 × N/2 real unitary matrix. Alternatively, the matrix M can be written as

M = λ_1 u_1 v_1^T + λ_2 u_2 v_2^T,

where u_i and v_i are the ith columns of the matrices U and V. The total energy of the N DWPT coefficients is the squared sum of all the elements in M, and the same result can be obtained from the singular values:

E_c = Σ_i Σ_j |M_ij|^2 = λ_1^2 + λ_2^2.

Recall that the procedure described in Section 2 provides a masking threshold η, which is the maximum power variation unperceivable by human ears. The derived threshold can guide us in devising a robust and transparent watermarking scheme. This study proposes embedding a watermark bit w_b into the matrix M by manipulating λ_1 and λ_2 subject to three criteria. First, the overall energy shall remain unchanged:

Criterion 1: (λ'_1)^2 + (λ'_2)^2 = λ_1^2 + λ_2^2,

where λ'_1 and λ'_2 denote the adjusted results of λ_1 and λ_2, respectively. Second, the gap between λ'_1 and λ'_2, termed g' = λ'_1 − λ'_2, must comply with the QIM rule according to w_b:

Criterion 2: g' = (⌊g/Δ⌋ + 3/4)Δ if w_b = 1, and g' = (⌊g/Δ⌋ + 1/4)Δ if w_b = 0,

where g = λ_1 − λ_2, Δ is the quantization step, and ⌊·⌋ represents the floor function. As for the third criterion, the signal power variation shall not exceed the auditory masking threshold η.
Let M′ denote the matrix restored by substituting the modified singular values into S such that

M′ = λ'_1 u_1 v_1^T + λ'_2 u_2 v_2^T.

The resulting error energy becomes

E_error = Σ_i Σ_j |M′_ij − M_ij|^2 = (λ'_1 − λ_1)^2 + (λ'_2 − λ_2)^2.

Ideally, if the error power E_error/N falls beneath the masking threshold η, the signal alteration due to watermarking will be inaudible. This condition can be expressed as

Criterion 3: E_error/N ≤ η.

Let Δ_max = 2√(Nη) denote the maximum step size used to quantize the gap between the two singular values without causing perceptible distortion, and write the modifications as λ'_1 = λ_1 + δ_1 and λ'_2 = λ_2 + δ_2. The derivation of λ'_1 and λ'_2 based on the three criteria then becomes straightforward. Substituting Δ_max for Δ in Equation (19) relates δ_1 and δ_2 through the quantized gap, i.e., δ_2 = δ_1 − (g′ − g). In combination with Equation (18), δ_1 can be solved from a quadratic equation. The relationship among all involved variables is illustrated in Figure 2. After obtaining δ_1, δ_2 is acquired from Equation (25). As Equation (26) usually yields two solutions for δ_1, this study chooses the one with the smaller magnitude. Nevertheless, Equation (26) may also render complex roots when (g′)^2 > E_c. Hence, a preventive measure is taken to ensure real, non-negative solutions. It is noted from Equation (19) that the minimum possible value of g′ is 3Δ_max/4 for w_b = 1, while in the extreme case where λ'_2 = 0, g′ reaches its largest admissible value √E_c. Consequently, the preventive measure examines whether the inequality Δ_max < (4/3)√E_c holds and substitutes (4/3)√E_c for Δ_max if it does not. This substitution, in turn, guarantees an outcome of non-negative λ'_1 and λ'_2. With the fulfillment of the three criteria, namely Equations (18), (19), and (24), the audio signal maintains its segmental power while executing the QIM.
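The embedding procedure above can be condensed into a short Python sketch. It enforces Criteria 1 and 2 in closed form (equivalent to solving the quadratic in δ_1 and taking the admissible root) and applies the preventive cap on Δ_max; the function name and the closed-form route are our own choices, not the paper's exact implementation.

```python
import numpy as np

def embed_bit_svd(M, w_b, eta):
    """Embed one bit into a 2 x (N/2) DWPT coefficient matrix (sketch)."""
    N = M.size
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    l1, l2 = s
    E_c = l1 ** 2 + l2 ** 2                          # total energy (Criterion 1)
    d_max = 2.0 * np.sqrt(N * eta)                   # step from masking threshold
    d_max = min(d_max, (4.0 / 3.0) * np.sqrt(E_c))   # preventive cap
    g = l1 - l2
    offset = 0.75 if w_b else 0.25                   # QIM residues (Criterion 2)
    g_new = (np.floor(g / d_max) + offset) * d_max
    # Solve l1'^2 + l2'^2 = E_c with l1' - l2' = g_new (closed form);
    # the max() guards against numerical round-off
    root = np.sqrt(max(2.0 * E_c - g_new ** 2, 0.0))
    l1n, l2n = 0.5 * (g_new + root), 0.5 * (root - g_new)
    return U @ np.diag([l1n, l2n]) @ Vt
```

Since the sum of squared singular values equals the Frobenius energy, the watermarked matrix preserves the segmental power exactly while the gap is moved onto the quantizer lattice.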
The key factor of the entire process turns out to be η, which subsequently determines Δ_max, λ'_1, and λ'_2. Putting the derived λ'_1 and λ'_2 into Equation (20) renders a modified matrix M′ with new DWPT coefficients. Once the processes in all the involved critical bands are completed, the watermarked signal is attained by taking the inverse DWPT of the modified DWPT coefficients.
The watermark extraction from the watermarked signal is rather simple. Analogous to the procedures adopted for watermark embedding, the extraction process starts by taking the DWPT of the watermarked audio and then deriving the masking threshold η̂ for each packet node. Following the derivation of Δ̂_max from η̂, the watermark bit ŵ_b can be recovered by first calculating the gap ĝ between the two singular values of the received coefficient matrix and then deciding which QIM residue, 3Δ̂_max/4 (ŵ_b = 1) or Δ̂_max/4 (ŵ_b = 0), lies closer to ĝ modulo Δ̂_max.
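The extraction step can be sketched as follows; it mirrors the embedding by re-deriving the step size and making a minimum-distance decision on the singular-value gap. The name and η̂ handling are assumptions for illustration.

```python
import numpy as np

def extract_bit_svd(M_recv, eta_hat):
    """Blind extraction sketch: recover the bit from the singular-value gap.

    eta_hat is the masking threshold re-derived from the received signal.
    """
    N = M_recv.size
    s = np.linalg.svd(M_recv, compute_uv=False)
    E_c = float((s ** 2).sum())
    d_max = min(2.0 * np.sqrt(N * eta_hat), (4.0 / 3.0) * np.sqrt(E_c))
    frac = ((s[0] - s[1]) / d_max) % 1.0
    # Minimum-distance decision against the two QIM residues 3/4 and 1/4
    return 1 if abs(frac - 0.75) < abs(frac - 0.25) else 0
```

No original signal is needed: everything (energy, step size, gap) is recomputed from the received matrix, which is what makes the scheme blind.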

Further enhancement
The main challenge of the adaptive QIM lies in the presupposition that the quantization steps must be accurately recovered from the watermarked signal. As seen in Section 4, the quantization step is correlated with the masking threshold, whose formulation involves the tonality and power deduced from the signal. During watermark embedding, the process of QIM inevitably alters the tonality and therefore causes difficulties in retrieving the quantization steps for watermark extraction. A simple way to overcome this problem is to take advantage of the SVD. Recall from Equation (15) that the SVD decomposes the signal into two parts, namely λ_1 u_1 v_1^T and λ_2 u_2 v_2^T. These two parts become λ'_1 u_1 v_1^T and λ'_2 u_2 v_2^T, respectively, after applying the QIM. As λ'_1 is always larger than λ'_2, λ'_1 u_1 v_1^T can be regarded as the predominant part of the watermarked signal. If the tonality is derived merely from the predominant part, i.e., λ_1 u_1 v_1^T in the original signal and λ'_1 u_1 v_1^T in the watermarked signal, the results remain identical because the two scalars, λ_1 and λ'_1, do not affect the tonality. Hence, our first enhancement to the proposed DWPT-SVD scheme is to use u_1 v_1^T to compute the tonality.
Another important factor in the derivation of the masking threshold is the signal power. Although the signal power is deliberately maintained during watermark embedding, attacks such as MP3 compression and noise contamination may alter the segmental power. To alleviate the problem of power alteration, our second enhancement adopts a lowpass 2-D filter to smooth the quantization steps distributed over a plane formed by critical band numbers and frame indices. Figure 3 illustrates the idea of filter smoothing. The filter coefficients are obtained from a rotationally symmetric Gaussian function with a variance of 0.5. The filter size is chosen as 3 × 3 since it offers satisfactory results. It is particularly noted that the quantization steps computed at the embedding stage shall also be processed by the filter when the second enhancement takes effect. This arrangement ensures an exact restoration of the quantization steps from the watermarked signal.
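The smoothing step can be sketched as a 3 × 3 Gaussian filter (variance 0.5, normalized to unit sum) applied over the band-frame plane; the replicated edge handling is our assumption.

```python
import numpy as np

def gaussian_kernel_3x3(sigma2=0.5):
    """3x3 rotationally symmetric Gaussian, variance 0.5, unit sum."""
    g = np.array([[np.exp(-(x * x + y * y) / (2.0 * sigma2))
                   for x in (-1, 0, 1)] for y in (-1, 0, 1)])
    return g / g.sum()

def smooth_steps(steps):
    """Smooth the quantization-step plane (bands x frames), replicated edges."""
    k = gaussian_kernel_3x3()
    padded = np.pad(steps, 1, mode='edge')
    out = np.zeros_like(steps, dtype=float)
    for r in range(steps.shape[0]):
        for c in range(steps.shape[1]):
            out[r, c] = (k * padded[r:r + 3, c:c + 3]).sum()
    return out
```

Because the kernel sums to one, a constant step plane passes through unchanged, while frame-to-frame power fluctuations caused by attacks are averaged out.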
Integration of the entire watermarking system

Figure 4 presents the configuration of the developed watermarking system. The watermark can be an arbitrary binary bit sequence. For the purpose of illustration, we adopt a binary image W(i, j) of size 32 × 32, which contains an equal number of 0's and 1's. The watermark is embedded following the procedures described in the preceding sections, and the watermark extraction is the reverse process.

Performance evaluation
The test subjects comprised ten pieces of 30-s music recordings clipped from randomly chosen CD albums, including vocal arrangements and ensembles of musical instruments. All audio signals were sampled at 44.1 kHz with 16-bit resolution. The performance evaluation comprises three aspects: payload capacity, quality assessment, and robustness test.
To understand the influences of the two enhancements mentioned in the previous section, the test of the proposed DWPT-SVD-adaptive QIM consists of three phases, namely, the proposed scheme alone, the scheme with enhancement 1, and the scheme with enhancements 1 and 2. Three recently developed SVD-based methods, denoted 'adaptive DWT-SVD' [22], 'SVD-DCT' [17], and 'LWT-SVD' [28], are employed for performance comparison as they represent other ways to exploit the SVD for audio watermarking in transform domains. The minimum and maximum quantization steps in the adaptive DWT-SVD are 0.6 and 0.9, respectively, which are the typically suggested values. The parameters α and β for controlling the embedding strength in the SVD-DCT are assigned as 0.125 and 0.1, respectively. For the LWT-SVD method, the decomposition level of the lifting wavelet transform is chosen as 4 and the quantization step size is 0.6. The other parameters used in these three methods follow the original specifications [17,22,28].

Payload
The theoretical payload capacities for the methods under investigation are presented in Table 2. The LWT-SVD holds the highest capacity among the methods compared. The capacity of the proposed scheme is 13 × 44,100/4,096 ≈ 139.97 bps, which is lower than that of the LWT-SVD. However, this quantity is already roughly three times that achieved by the adaptive DWT-SVD and SVD-DCT. It is worth pointing out that the payload capacities listed in Table 2 are computed without considering the demand for synchronous codes. In general, these numbers will drop if the watermarking methods need to allocate extra segments for frame synchronization. One advantage of the proposed synchronization technique is that it only affects the spectrum concentrated in the first two critical bands, thus leaving the remaining critical bands available for information hiding.

Quality assessment
The quality disturbance resulting from watermark embedding is assessed using the SNR and the perceptual evaluation of audio quality (PEAQ) [39,40]. The SNR is defined as

SNR = 10 log10( Σ_n s^2(n) / Σ_n (s(n) − ŝ(n))^2 ),

where s(n) and ŝ(n) are the original and watermarked audio signals, respectively. Since auditory quality is a fundamentally subjective concept that does not necessarily correspond to the measured SNR, this study also resorts to the PEAQ to measure the perceived quality. The PEAQ algorithm simulates human perceptual properties and integrates multiple model output variables into a single metric. It renders an objective difference grade (ODG) between −4 and 0, signifying a perceptual impression ranging from 'very annoying' to 'imperceptible'. Table 2 also provides the measured SNRs and ODGs for all kinds of watermarked audio signals. The SVD-DCT generally renders the largest SNR value, while the proposed scheme produces the lowest. Although the SNRs do not favor the proposed scheme, the resulting ODGs suggest that our scheme indeed achieves the best perceived quality. In fact, the average ODG is around 0 for our scheme, implying that the watermarked signal is nearly indistinguishable from the original one. The average ODGs for the adaptive DWT-SVD and SVD-DCT are slightly above −1, indicating that the distortion caused by watermarking may still be perceivable. On the other hand, the quality degradation by the LWT-SVD seems minor, as the corresponding average ODG is just −0.4. Nevertheless, the ODGs resulting from these three methods are not comparable with ours.
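For reference, the SNR definition above translates directly into code:

```python
import numpy as np

def snr_db(s, s_hat):
    """SNR (in dB) between the original signal s and watermarked signal s_hat."""
    return 10.0 * np.log10(np.sum(s ** 2) / np.sum((s - s_hat) ** 2))
```

For example, uniformly scaling a signal by 0.9 leaves a residual of one tenth the amplitude, i.e., one hundredth the power, which corresponds to exactly 20 dB.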

Robustness test
The robustness test consists of two categories: one focused on frame synchronization and the other concerned with watermark recovery. The attack types considered in this study include the following:

A. Resampling: down-sampling to 11,025 Hz and then upsampling back to 44,100 Hz.
B. Requantization: quantizing the watermarked signal to 8 bits/sample and then back to 16 bits/sample.
C. Amplitude scaling: scaling the amplitude of the watermarked audio signal by 0.85.

The efficiency of the proposed synchronization scheme is demonstrated via the statistical means and standard deviations of the r̃_3(i) values discussed in Section 3, along with the misdetection counts of the synchronization markers. As revealed by the results in Table 3, the detection of the synchronization markers is always reliable, indicating that common attacks do not pose any threat to the watermarking system equipped with such a synchronization technique.
The robustness of the proposed watermarking technique in the presence of various attacks is evaluated using the bit error rate (BER), defined as

BER = (1/N_w) Σ_{i,j} W(i, j) ⊕ W̃(i, j),

where ⊕ stands for the exclusive-or operator and N_w is the total number of watermark bits. Table 4 gives the BERs obtained from the watermarked audio signals under the attacks. Generally speaking, all the SVD-based methods manifest a certain robustness against most attacks. However, the adaptive DWT-SVD and LWT-SVD appear vulnerable to amplitude scaling. The reason can be ascribed to the fact that some of the controlling parameters in both methods are fixed; a minor change in amplitude may therefore have a disastrous consequence. In contrast, the SVD-DCT and the proposed scheme do not exhibit such a deficiency, as both are designed to adapt to amplitude variation. Besides amplitude scaling, the adaptive DWT-SVD also suffers from the resampling attack, owing to the altered statistical distribution of the DWT coefficients, which eventually leads to inaccurate watermark extraction.
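The BER definition above, in code form for a 32 × 32 binary watermark:

```python
import numpy as np

def ber(W, W_tilde):
    """Bit error rate between embedded and extracted watermark bit arrays."""
    W, W_tilde = np.asarray(W), np.asarray(W_tilde)
    return np.count_nonzero(W ^ W_tilde) / W.size
```

For a 32 × 32 watermark (1,024 bits), each flipped bit contributes 1/1024 ≈ 0.098% to the BER.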
As shown in Table 4, the proposed scheme generally retains very high accuracy under all sorts of attacks, but it seldom reaches 100% correctness. This is because the masking threshold derived from the watermarked signal may differ somewhat from the original one. To ameliorate this drawback, two enhancements have been proposed in Section 5. The first enhancement rectifies the inconsistency in the derivation of tonality. As a consequence, the proposed scheme achieves perfect accuracy if no attack is present. Excellent robustness is also observed for attacks like resampling, amplitude scaling, and lowpass filtering. The second enhancement mitigates the power alterations caused by the attacks. After being equipped with the second enhancement, the proposed scheme gains noticeable improvements for all kinds of attacks. More importantly, the changes in SNR and ODG are slight, meaning that the improvement is not obtained at the cost of perceived quality.

Security
There are several possible ways to strengthen the watermark security. In [17,28], the synchronous code was chaotically permuted and the watermark data were scrambled. A similar strategy is certainly applicable to our system. Here, the Arnold transform is chosen to shuffle the watermark image since this technique has been widely utilized in digital image encryption. Aside from data scrambling, the controlling parameters (e.g., the frame length, the arrangement of the matrix in Equation (14), and/or the selected
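As an illustration of the scrambling idea, one common form of the Arnold cat map on an N × N image is sketched below; the specific map matrix [[1, 1], [1, 2]] is an assumption, since the paper does not spell out its parameterization.

```python
import numpy as np

def arnold(img):
    """One iteration of the Arnold cat map on a square N x N image."""
    N = img.shape[0]
    out = np.empty_like(img)
    for x in range(N):
        for y in range(N):
            out[(x + y) % N, (x + 2 * y) % N] = img[x, y]
    return out

def arnold_inverse(img):
    """Inverse iteration: read each pixel back from its forward-mapped location."""
    N = img.shape[0]
    out = np.empty_like(img)
    for x in range(N):
        for y in range(N):
            out[x, y] = img[(x + y) % N, (x + 2 * y) % N]
    return out
```

The map is a bijection on the pixel grid, so iterating it scrambles the watermark image and the same number of inverse iterations (or completing the map's period) restores it exactly.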

Error analysis
There are two types of errors during the search for watermarks. The false-positive error (FPE) is the probability of declaring an unwatermarked audio signal to be watermarked, whereas the probability of the opposite condition (classifying a watermarked audio signal as unwatermarked) is known as the false-negative error (FNE). Following the basic assumptions and derivations given in [22], the FPE P_fp can be computed as

P_fp = Σ_{k=T}^{N_w} C(N_w, k) P_e^k (1 − P_e)^(N_w − k),

where the matching count H(W, W̃) must reach the threshold T out of a total of N_w bits to claim the existence of the watermark, C(N_w, k) stands for the binomial coefficient, and P_e is the probability that an extracted bit matches the corresponding original watermark bit. Since the unwatermarked bits are either 0 or 1 with pure randomness, P_e is assumed to be 0.5. As a result, Equation (31) can be simplified as

P_fp = 2^(−N_w) Σ_{k=T}^{N_w} C(N_w, k).

If N_w = 1,024 and T = ⌈0.8 × N_w⌉ = 820, then P_fp = 2.62 × 10^(−88), which means that an FPE can rarely happen.
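The simplified expression can be evaluated exactly with integer arithmetic, avoiding floating-point underflow in the huge binomial sums:

```python
import math

def false_positive_error(N_w=1024, T=820):
    """P_fp = 2^(-N_w) * sum_{k=T}^{N_w} C(N_w, k), evaluated exactly."""
    tail = sum(math.comb(N_w, k) for k in range(T, N_w + 1))
    return tail / 2 ** N_w
```

For N_w = 1,024 and T = 820 this evaluates to about 2.6 × 10^(−88), consistent with the figure quoted above.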
Analogous to the derivation of the FPE, the FNE P_fn can be computed as

P_fn = Σ_{k=0}^{T−1} C(N_w, k) (1 − BER)^k BER^(N_w − k),

where 1 − BER is the probability that an extracted bit matches the embedded one. Taking the worst case in our experiments (BER = 0.012) as an example, the FNE of the proposed scheme is virtually zero.

Conclusion
This paper presents an efficient audio watermarking technique, which integrates the DWPT, SVD, and adaptive QIM subject to the auditory masking effect. While the DWPT decomposes the audio signal into critical bands, the exploration of perceptual entropy leads to the derivation of auditory masking thresholds. The thresholds, in turn, determine the quantization steps required by the QIM. By virtue of the robustness of the SVD, the proposed watermarking scheme first assembles the DWPT coefficients into a matrix and then manipulates the singular values to satisfy three criteria. As a result, the embedded watermark is guaranteed to remain beneath the perceptible level. To further improve the overall performance, this study introduces two auxiliary enhancement measures to ensure the recovery of the quantization steps.
Apart from the scheme for data embedding, the developed watermarking system is equipped with a competent frame synchronization technique to withstand time-shifting attacks. The experimental results reveal that the proposed DWPT-SVD-adaptive QIM scheme performs favorably in comparison with other recently developed SVD-based methods.