Digital watermarking is used to protect multimedia data. Demand in this area grows with the number of internet applications, which bring a need for copyright protection of digital audio, images, and video. Cryptographic methods could be used for this purpose; however, once the data are decrypted they can be copied without limit. This has been one of the primary reasons for developing watermarking techniques. Watermarking, in general, consists of embedding secret information that can be reliably detected within the host signal. This information should be imperceptible within the host data. Depending on the application, the watermark can be robust, fragile, or semifragile. A robust watermark should resist various nonmalicious and malicious attacks. Nonmalicious attacks are common signal processing operations such as compression and filtering, while malicious attacks are signal processing operations applied intentionally to remove the watermark. A fragile watermark is used to prove data authenticity: if the content of a signal has been changed, the watermark should no longer be detectable. A semifragile watermark should be robust to slight modifications, for example a certain degree of compression.
Various watermarking approaches have been developed, depending on the type of host signal (speech/audio, image, video, etc.). Different domains have also been used: the time (or space) domain, spectral domains such as the DFT, DWT, and DCT domains, and the joint time/space-frequency domain. Existing watermarking techniques are mainly based on either the time or the frequency domain. In both cases, however, the time-frequency characteristics of the watermark do not correspond to those of the host signal. As a result, the watermark may become perceptible, because it is present in time-frequency regions where no signal components exist.
3.1. An Overview of Some Time-Frequency-Based Watermarking Techniques
The time-frequency domain can be very efficient with regard to watermark imperceptibility and robustness. This section presents some key time-frequency-based watermarking procedures, with the aim of inspiring more contributions on this topic.
Here, we classify the time-frequency-based watermarking techniques into two categories. The first comprises approaches based on watermarks with specific time-frequency characteristics, where detection is performed within the time-frequency domain. The second uses the time-frequency domain to embed or to shape the watermark.
(A) Image Watermark with Specific Time-Frequency Characteristics
(A.1) Among the first time-frequency-based image watermarking procedures is the approach introduced in [38]. Although the watermark is embedded in the space domain, it is chosen to have a specific space/spatial-frequency characteristic. Namely, a two-dimensional chirp signal is used as the watermark:
Observe that the Wigner distribution provides an ideal representation for this signal.
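The concentration property mentioned above can be illustrated with a one-dimensional analogue. The following is only a sketch under stated assumptions: the chirp rate `A`, the number of bins `K`, and the helper names are hypothetical and are not taken from [38], and a complex linear-FM signal in one dimension stands in for the two-dimensional chirp. For such a signal the local autocorrelation is a pure complex exponential in the lag, so the pseudo-Wigner distribution at each instant has a single peak located at the instantaneous frequency.

```python
import cmath
import math

K = 64                  # number of lags and frequency bins (assumed value)
A = math.pi / (4 * K)   # chirp rate chosen so the ridge falls on integer bins

def chirp(n):
    """Complex linear-FM signal exp(j*A*n^2); its instantaneous frequency is 2*A*n."""
    return cmath.exp(1j * A * n * n)

def pseudo_wigner_slice(n):
    """Magnitudes of the pseudo-Wigner distribution at time n for all K bins."""
    lags = range(-K // 2, K // 2)
    # Local autocorrelation x(n+m) * conj(x(n-m)) as a function of the lag m.
    r = [chirp(n + m) * chirp(n - m).conjugate() for m in lags]
    row = []
    for k in range(K):
        s = sum(rm * cmath.exp(-2j * math.pi * m * k / K) for rm, m in zip(r, lags))
        row.append(abs(s))
    return row

def ridge_bin(n):
    """Index of the frequency bin with the largest magnitude at time n."""
    row = pseudo_wigner_slice(n)
    return max(range(K), key=row.__getitem__)

# With this chirp rate the ridge sits at bin n/2 for even n.
ridges = {n: ridge_bin(n) for n in (8, 20, 40)}
```

With the chosen rate, the ridge moves linearly with time and all remaining bins are numerically zero, which is the sense in which the representation is ideal for a linear chirp.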
The watermark is embedded within the entire image:
.
The watermark detection is performed by using
where
The variable parameters
,
, and
are used. Different values of those parameters (
, and
) produce a set of projections. The additional term
can also be used to detect some geometrical transformations. Note that the detector has the form of the Radon-Wigner distribution, which ensures that the energy of the watermark is distributed over the hyperplane defined by
(
is the phase function of the watermark). To decide whether the watermark exists in the image, the maxima of the Radon-Wigner distribution
are compared with an assumed reference threshold. Also, multiple chirp watermarks with small, randomly chosen amplitudes are used to increase the flexibility of the procedure. The parameters of the chirp signal, as well as the random sequence that defines the amplitudes of the chirp signals, serve as the watermark key. Since the watermark is embedded within the entire image in the space domain, proper masking that provides imperceptibility should be applied. A performance analysis giving an estimate of the detectable watermark amplitude level is provided in [38]. The robustness is tested against various attacks, including a median filter, geometrical transformations (translation, rotation, and cropping applied simultaneously), a high-pass filter, a local notch filter, and Gaussian noise.
(A.2) Mobasseri et al. [44] have proposed a scheme for robust watermarking based on the polynomial phase. The algorithm combines the approach in [45] (where
bits are embedded in the image) with the 2D chirp-based methods. Here, the image of size
is partitioned into
blocks. A 2D chirp of the form
is used, where
are taken. The watermark is embedded in the block located at the pixel (
) according to
The constant that controls the watermark strength is
, and the integer part is denoted by "
", while
are watermark bits taken from
. The knowledge of the pair
is required in order to recover
. It can be obtained by using the chirp transform
where:
Finally, the total detection over all blocks can be obtained by
This makes it possible to use a watermark that cannot be detected by considering a single block only; in such a case, the responses of all blocks must be integrated over the entire image. It is also possible to generate different chirps for different blocks instead of using the same chirp for all blocks, which would make detection even more difficult for unauthorized users.
The embedded bits are recovered by
The proposed method is adapted to be robust to the JPEG compression algorithm. The watermark is embedded within the
blocks by using the quantization matrix
. Namely, the DCT coefficients and the 2D chirp are quantized by this matrix. However, choosing an appropriate pair
is necessary to ensure that the watermark survives this quantization. The watermark survival degree can be quantified by
where
is the unmarked compressed block. The watermark is completely removed by compression if
is obtained. The proposed technique is tested on the image Lena, and it is shown that, for this case, it outperforms the standard spread spectrum technique.
(A.3) Watermarking in the fractional Fourier domain also belongs to the time-frequency-based algorithms. This approach is defined in [39], and it uses a combination of the space and spatial-frequency domains. Namely, the image is transformed into the fractional Fourier domain for the angles
:
where FRFT denotes the one-dimensional fractional Fourier transform. The FRFT can be treated as a rotation in the time-frequency plane for an angle
, while the inverse transform can be considered as a rotation for the angle
. Thus, the FRFT domain is a combination of the time and frequency domain (the Fourier transform is a special case for
). Depending on the angle
, the FRFT assures that the time or the frequency domain is dominant. For
close to
the frequency domain is dominant, while for small
the FRFT is dominantly in the time domain. The watermark is embedded in the FRFT coefficients reordered into a non-increasing sequence
. By analogy with the watermarking in the DCT domain, the first
coefficients are omitted, while the next
coefficients are used. The watermark is embedded as
A real valued watermark key composed of
and
is used. The detection is performed by
A performance analysis is carried out to obtain the detection threshold, which is chosen as
where the watermark is a Gaussian white noise with the variance
. The watermark key consists of the watermark sequence and the angles (
1,
2). Thus, the algorithm provides two more degrees of freedom and offers more possibilities for generating watermarks. The watermarking procedure is tested on various images and attacks.
(A.4) Barkat and Sattar have proposed a fragile watermarking procedure for image authentication [43]. The watermark with a particular time-frequency signature is inserted in the image pixels. Although, in general,
pixels (according to the image size) exist, only a much smaller number of them is used. The pixel locations can be chosen arbitrarily. The authors used diagonal pixels, modulated by a pseudonoise sequence serving as a secret key. Various frequency-modulated nonstationary signals can serve as the watermark, but signals with easily identifiable features should be chosen. Consequently, different time-frequency distributions should be used for watermark detection, depending on the chosen signal. Barkat and Sattar used a quadratic frequency-modulated signal, detected by means of the Wigner distribution. The scheme was tested against cropping, translation, JPEG compression, and scaling. Very weak and imperceptible attacks were applied (e.g., JPEG with 99% quality). It is shown that the watermark cannot be identified after these attacks, as required of a fragile watermark.
(B) Watermark Created in the Time-Frequency Domain
(B.1) An image watermarking approach is proposed by Al-khassaweneh and Aviyente in [49]. The image rows are used to create a set of one-dimensional signals. Then, the Wigner distribution is calculated for each of them. Also, the watermark sequence is transformed to the time-frequency domain by using the Wigner distribution. Finally, the Wigner distribution of the watermark sequence is embedded in the Wigner distribution of each image row as follows:
where
is a set of the time-frequency dependent weighting coefficients. The watermarked image is obtained by using the inverse transform. Having in mind that we deal with a real and positive signal, it is defined as
However, the previous equation holds only if the two-dimensional function (45) is a valid Wigner distribution. It is well known that not every two-dimensional function is a valid Wigner distribution. This imposes a very restrictive condition on the function
. In the proposed method it is determined by using the time-frequency representation of the corresponding row and taking the middle frequency region.
Al-khassaweneh and Aviyente have suggested a nonblind detection procedure. Namely, the second part of the function in (46) that depends on the watermark is selected. The detection is performed by using the standard correlation detection. A threshold that provides a minimal probability of error is derived. The proposed method is tested on different images and under various attacks. The average probability of error was found to be 0.03.
(B.2) Foo et al. [48] have defined a method for digital audio watermarking based on the time-frequency domain. Here, bits are encoded by changing the lengths of audio frames: frames that are lengthened or shortened carry the logical value 1, whereas unaltered "normal frames" carry the logical value 0. The watermark is a binary sequence obtained from the ASCII codes of alphabet letters (the example uses the binary code 010001100101001101010111 for the letters FSW). The crucial part of this method is the selection of the frames that will be lengthened or shortened (a frame size of 1024 samples is used). Frames with signal energy above the masking threshold are selected (a psychoacoustic model is used to determine the masking threshold in each subband). The frame length is changed by adding or removing samples whose amplitudes do not exceed the masking threshold. Four samples are added or removed within a frame of 1024 samples, which ensures that no perceptual distortion appears. To preserve the total length of the watermarked audio signal, the same number of lengthened and shortened frames is used. A pair of such frames, called Diamond frames, represents the binary 1, while the logical value 0 is assigned to unaltered frames.
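The bit mapping described above can be sketched in a few lines. This is an illustration only: the function names are hypothetical, the order of the lengthened and shortened frame inside a pair is an assumption, and the psychoacoustic frame selection is not modeled. The sketch reproduces the binary code quoted for the letters FSW and shows why pairing frames preserves the total signal length.

```python
FRAME_LEN = 1024   # frame size quoted in the text
DELTA = 4          # samples added or removed per altered frame

def text_to_bits(text):
    """Concatenate the 8-bit ASCII codes of the characters."""
    return "".join(format(ord(ch), "08b") for ch in text)

def frame_lengths(bits):
    """Each 1 becomes a lengthened + shortened pair, each 0 a single normal frame."""
    lengths = []
    for b in bits:
        if b == "1":
            # Assumed order inside the pair: lengthened frame first.
            lengths += [FRAME_LEN + DELTA, FRAME_LEN - DELTA]
        else:
            lengths.append(FRAME_LEN)
    return lengths

bits = text_to_bits("FSW")
plan = frame_lengths(bits)
```

Because every lengthened frame is matched by a shortened one, `sum(plan)` equals `FRAME_LEN * len(plan)`, so the watermarked signal keeps the duration of the original.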
The detection procedure is nonblind, that is, the original signal is required. A significant difference between the watermarked and the original signals will appear only if a pair of changed frames exists. Thus, it is used for logical values detection. The proposed watermarking scheme has been tested on various musical signals, as well as on a speech signal, and a set of different attacks has been applied (filtering, resampling, noise, cropping, and MP3 compression). Although the results vary for different signals and attacks, in general they are good. The worst results are obtained for the rock and pop music signals with MP3 compression. However, in all cases the owner can be identified.
(B.3) Esmaili et al. have proposed a spread spectrum based watermarking in the time-frequency domain [46]. This technique is used for watermarking of music signals. The watermark is created as
where
is the watermark before spreading,
is the spreading code or the pseudonoise sequence, while
is the time-varying carrier frequency. The parameter
controls the watermark strength. The masking properties of the human auditory system are used to shape an imperceptible watermark. The pseudonoise sequence is low-pass filtered according to the signal characteristics (a Butterworth filter is used). Two different masking scenarios have been considered. The tone-like or noise-like characteristic is determined by using the entropy
The probability of energy for each frequency (within a window used for the spectrogram calculation) is denoted by
, while
is the maximum frequency. A half of the maximum entropy
is taken as a threshold between noise-like and tone-like characteristics. If the entropy is lower than
the characteristic is considered tone-like; otherwise, it is considered noise-like.
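The entropy rule can be sketched as follows. The sketch assumes the Shannon entropy of the normalized energy distribution over the frequency bins of one analysis window, and it takes the maximum entropy to be that of a flat spectrum; the function names and the toy spectra are hypothetical and do not come from [46].

```python
import math

def spectral_entropy(energies):
    """Shannon entropy (in bits) of the normalized energy distribution over bins."""
    total = sum(energies)
    probs = [e / total for e in energies if e > 0]
    return -sum(p * math.log2(p) for p in probs)

def classify(energies):
    """Tone-like if the entropy is below half of the maximum, otherwise noise-like."""
    h_max = math.log2(len(energies))   # entropy of a flat spectrum (assumed maximum)
    return "tone-like" if spectral_entropy(energies) < h_max / 2 else "noise-like"

tone = [0.01] * 63 + [100.0]   # one dominant spectral peak
noise = [1.0] * 64             # flat spectrum

labels = (classify(tone), classify(noise))
```

A spectrum dominated by one peak has entropy close to zero, while a flat spectrum reaches the maximum, so the two toy cases fall on opposite sides of the threshold.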
The time-varying carrier frequency is obtained as the instantaneous mean frequency of the host signal, calculated by
Finally, after the watermark is modulated and shaped, it is embedded in the time domain as
.
A simple watermark detection procedure is applied. First, demodulation is performed by using the time-varying carrier, and then the watermark is detected by using the standard correlation procedure with the pseudonoise sequence.
The proposed method has been tested on several music files. It has been shown that, under various attacks, the bit error rates are mostly between 0.02 and 0.08.
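The embedding and detection chain of this scheme can be sketched under simplifying assumptions. In the sketch below the carrier follows a synthetic frequency track rather than the instantaneous mean frequency of a real host, perceptual shaping and low-pass filtering of the pseudonoise sequence are omitted, and the strength `alpha` is exaggerated for clarity. All names and parameter values are hypothetical.

```python
import math
import random

def make_carrier(freqs):
    """Cosine carrier whose phase accumulates a per-sample frequency (cycles/sample)."""
    phase, out = 0.0, []
    for f in freqs:
        out.append(math.cos(phase))
        phase += 2 * math.pi * f
    return out

def embed(host, bits, pn, carrier, alpha):
    """Add alpha * (+1 or -1 per bit) * pn * carrier to the host, one PN period per bit."""
    chips = len(pn)
    marked = list(host)
    for i, bit in enumerate(bits):
        sign = 1.0 if bit else -1.0
        for j in range(chips):
            n = i * chips + j
            marked[n] += alpha * sign * pn[j] * carrier[n]
    return marked

def detect(signal, n_bits, pn, carrier):
    """Demodulate with the carrier, correlate with the PN sequence, keep the sign."""
    chips = len(pn)
    bits = []
    for i in range(n_bits):
        corr = sum(signal[i * chips + j] * carrier[i * chips + j] * pn[j]
                   for j in range(chips))
        bits.append(1 if corr > 0 else 0)
    return bits

rng = random.Random(0)
bits = [1, 0, 1, 1, 0, 0, 1, 0]
chips = 4000
n_samples = chips * len(bits)
pn = [rng.choice((-1.0, 1.0)) for _ in range(chips)]
host = [0.5 * math.sin(2 * math.pi * 0.01 * n) for n in range(n_samples)]
# Synthetic, slowly varying carrier frequency standing in for the host-derived one.
freqs = [0.1 + 0.02 * math.sin(2 * math.pi * n / n_samples) for n in range(n_samples)]
carrier = make_carrier(freqs)
marked = embed(host, bits, pn, carrier, alpha=0.2)
recovered = detect(marked, len(bits), pn, carrier)
```

The correlation over one PN period accumulates the watermark term coherently, whereas the host contribution is randomized by the PN signs, which is why the bit signs survive the much stronger host.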
(B.4) An interesting audio watermarking approach based on linear chirps has been proposed in [47]. The watermark is created as a chirp signal, which is perceptually shaped according to the host signal samples. Different chirp rates, each representing a unique watermark message, produce different slopes in the time-frequency domain. The efficient time-frequency representation is obtained by using the Wigner distribution. The extracted chirps are postprocessed in the time-frequency plane by an optimal line detection method based on the Hough-Radon transform. It can correctly estimate the slope of the watermark signal despite the broken lines caused by attacks. The simulation results show that the Hough-Radon transform applied to a time-frequency distribution can detect the watermark message correctly at bit error rates up to 20%.
3.2. Watermarking Approach Based on the Time-Frequency-Shaped Watermark
The approach that will be presented can be used either for audio signals or images [41, 42]. Thus, the embedding and detection procedures for both kinds of signals will be defined and discussed simultaneously, by using the multidimensional notation.
In order to ensure imperceptibility constraints, the watermark should be modeled according to the time-frequency characteristics of the signal components. The concept of nonstationary multidimensional filtering [52] is adapted and used to create a watermark with time-frequency characteristics that correspond to the characteristics of the host signal. The corresponding algorithm consists of the following steps:
(i) selection of the nonstationary parts of the signal suitable for watermark embedding;
(ii) watermark modeling according to the multidimensional time-frequency characteristics of the host signal;
(iii) watermark embedding and watermark detection procedure within the multidimensional time-frequency domain.
Multidimensional time-frequency distributions are employed to determine the nonstationary regions. As will be shown later, the S-method can be used efficiently to analyze the dynamics of regions of speech signals and images. Although cross-terms are usually undesirable in time-frequency analysis, they have been found to be useful in watermarking: they may improve the performance of a speech watermark detector and increase the efficiency of dynamic region selection within an image.
The watermark is obtained at the output of a nonstationary filter as follows:
where
is the short-time Fourier transform of a multidimensional random sequence
. The function
contains the information about the components within the region
. It is used to create the watermark that will be adjusted to these components. Thus, we may start with an arbitrary random multidimensional sequence
and, by using
, its multidimensional time-frequency characteristic is modeled.
The region
will be used for watermarking if a time-frequency distribution
contains a sufficient number of components whose energy is above a floor value:
The function
returns a number of components that satisfy the condition within the parenthesis, while
is the reference number of points used to decide whether the region is nonstationary. The parameter
is an energy floor that can be determined as a portion of the TFD maximum:
A value of
between 0 and 1 is taken.
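The selection rule can be sketched as a simple count of time-frequency points above the floor. The names `lam` and `n_ref`, the use of "at least `n_ref` points" as the decision rule, and the toy TFD values are assumptions made for illustration only.

```python
def energy_floor(tfd, lam):
    """Floor taken as a fraction lam (0 < lam < 1) of the largest TFD magnitude."""
    return lam * max(abs(v) for row in tfd for v in row)

def count_above_floor(tfd, lam):
    """Number of time-frequency points whose magnitude exceeds the floor."""
    floor = energy_floor(tfd, lam)
    return sum(1 for row in tfd for v in row if abs(v) > floor)

def is_nonstationary(tfd, lam, n_ref):
    """Region is selected when at least n_ref points exceed the floor (assumed rule)."""
    return count_above_floor(tfd, lam) >= n_ref

# Toy TFD magnitudes: a region rich in components and a nearly empty one.
busy = [[5, 7, 9, 6], [8, 2, 7, 9], [6, 9, 1, 8]]
quiet = [[0, 0, 9, 0], [0, 1, 0, 0], [0, 0, 0, 1]]

selected = (is_nonstationary(busy, 0.5, 6), is_nonstationary(quiet, 0.5, 6))
```

In the toy case the first region has ten points above the floor and is selected, while the second has a single point and is rejected.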
The components' positions within
are identified by using the support function:
An additional function is defined in order to consider the significant components only:
The energy threshold is denoted by
. Thus, the resulting support function is defined as
The watermark embedding is done according to
where 
and
are the short-time Fourier transforms of the multidimensional watermarked data, the host data, and the watermark, respectively.
Note that when compared to the signal domain, in the multidimensional time-frequency domain the number of coefficients that contain the information about the watermark is significantly increased. Consequently, the detector response will be enhanced. The standard correlation detector in the multidimensional time-frequency domain is defined as
The multidimensional time-frequency domain-based detector provides a low probability of error, even when the number of watermarked samples in the signal domain is small.
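The shaping, embedding, and detection steps can be sketched for a one-dimensional signal. Several simplifications are assumed: a non-overlapping rectangular-window STFT replaces the distributions used in [41, 42], the support function is obtained from the host spectrogram with a single threshold `lam`, and embedding is taken to be additive in the STFT domain. Function names and parameter values are hypothetical.

```python
import cmath
import math
import random

FRAME = 16  # frame length, equal to the number of frequency bins (assumed)

def stft(x):
    """Non-overlapping rectangular-window STFT: one DFT per frame."""
    frames = [x[i:i + FRAME] for i in range(0, len(x) - FRAME + 1, FRAME)]
    return [[sum(f[n] * cmath.exp(-2j * math.pi * k * n / FRAME) for n in range(FRAME))
             for k in range(FRAME)] for f in frames]

def support(tf_host, lam):
    """1 where the host energy exceeds lam times the maximum energy, else 0."""
    peak = max(abs(v) ** 2 for row in tf_host for v in row)
    return [[1 if abs(v) ** 2 > lam * peak else 0 for v in row] for row in tf_host]

def shape(tf_noise, mask):
    """Keep the random sequence only where the host has significant components."""
    return [[m * v for m, v in zip(mr, vr)] for mr, vr in zip(mask, tf_noise)]

def embed(tf_host, tf_wat):
    """Additive embedding in the time-frequency domain (assumed form)."""
    return [[h + w for h, w in zip(hr, wr)] for hr, wr in zip(tf_host, tf_wat)]

def detect(tf_signal, tf_wat):
    """Correlation detector: real part of the sum of signal times conj(watermark)."""
    return sum((s * w.conjugate()).real
               for sr, wr in zip(tf_signal, tf_wat) for s, w in zip(sr, wr))

rng = random.Random(1)
n_samples = 256
half = n_samples // 2
# Nonstationary host: the tone jumps from bin 2 to bin 5 halfway through.
host = [math.sin(2 * math.pi * 2 * n / FRAME) if n < half
        else math.sin(2 * math.pi * 5 * n / FRAME) for n in range(n_samples)]
noise = [rng.gauss(0.0, 0.05) for _ in range(n_samples)]

tf_host = stft(host)
mask = support(tf_host, lam=0.1)
tf_wat = shape(stft(noise), mask)
tf_marked = embed(tf_host, tf_wat)

wat_energy = sum(abs(w) ** 2 for row in tf_wat for w in row)
gain = detect(tf_marked, tf_wat) - detect(tf_host, tf_wat)
```

The shaped watermark is nonzero only in the bins occupied by the host tone, so it follows the frequency jump of the host. The detector response on the watermarked data exceeds the response on the unmarked host by exactly the watermark energy, which is the quantity that grows with the number of watermarked time-frequency coefficients.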
3.2.1. Digital Audio Signal
Let us consider the voiced part of a speech signal. The region
is determined by the start and the end instances
and
of the voiced speech, as well as by the interval
that contains the strongest formants. The S-method is used to define the support function [41, 53]:
The region appropriate for watermarking is shown in Figure 12(a). The corresponding support function (Figure 12(b)) is created by using the value
=
with
.
The watermark is embedded in the time domain:
. The time-frequency representation of the watermark is shown in Figure 12(c).
As expected, the time-frequency characteristics of the watermark follow those of the speech signal components. Consequently, the watermark is inaudible within the speech signal.
Next, a music signal of the flute is considered, Figure 13(a).
In this case, the fourth-order complex-lag distribution is more appropriate for the region selection than the S-method, because it better follows the frequency variations in the signal (Figures 13(a) and 13(b)). Thus, by using this distribution an inaudible watermark is created.
Note that an important improvement in the watermark detection is obtained if the cross-terms are included [41]. Namely, the watermark is present within the cross-terms, as well. A standard correlation detector in the time-frequency domain that includes the cross-terms can be written in the form
The second term in (60) is the result of cross-terms.
Note that this form of detector can be used in other existing detector structures.
The following measure of the detection quality
is used. The mean value and standard deviation of the detector response are denoted by
and
. Indices
and
indicate the right and the wrong keys, respectively.
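A measure of this kind can be computed from sample detector responses as sketched below. The exact expression used here, the difference of the mean responses divided by the square root of the sum of the variances, is an assumption based on a commonly used form and may differ from the measure intended in the text. The error probability additionally assumes independent Gaussian detector responses. The response values are made up for illustration.

```python
import math
import statistics

def detection_quality(right, wrong):
    """(mean_right - mean_wrong) / sqrt(var_right + var_wrong), an assumed form."""
    gap = statistics.mean(right) - statistics.mean(wrong)
    spread = math.sqrt(statistics.pvariance(right) + statistics.pvariance(wrong))
    return gap / spread

def error_probability(r):
    """Probability that a wrong-key response exceeds a right-key response,
    assuming independent Gaussian responses."""
    return 0.5 * math.erfc(r / math.sqrt(2))

# Illustrative detector responses for right and wrong keys (hypothetical values).
right_key = [10, 12, 11, 9, 10]
wrong_key = [0, 1, -1, 0.5, -0.5]

r = detection_quality(right_key, wrong_key)
p_err = error_probability(r)
```

A larger value of the measure means better separated response distributions and therefore a smaller probability of error.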
The efficiency of the proposed procedure is demonstrated on various examples. The results for speech signals with maximum frequencies of 4 kHz and 11.025 kHz are presented in [41]. This approach provides reliable detection at a high SNR (SNR = 32 dB has been used) and under various attacks. The watermark sequence was created from a pseudorandom Gaussian sequence of 1000 samples.
The probability of error was of order $10^{-7}$ for: MP3 (constant bit rate 8 kbps and variable bit rate 75–120 kbps are considered), delay monolight echo (180 ms, mixing 20%), echo 200 ms, deep flutter (deep 10, sweeping rate 5 kHz), amplitude (normalize 100%), and additive Gaussian noise (SNR = −35 dB). The worst case is obtained for pitch scaling
and it is of order $10^{-5}$. The results for other attacks (time stretch
wow delay 20%, wow delay 10% and bright flutter, MP3 variable bit rate 40–50 kbps) are of order $10^{-6}$.
3.2.2. Digital Image
The space-spatial-frequency analysis (two-dimensional time-frequency analysis) is used to select pixels that belong to the image nonstationary regions [42]. The two-dimensional S-method is used as a space-spatial frequency distribution [54]:
By increasing the size
of a two-dimensional window, the cross-terms start to appear. Thus, when compared to the spectrogram, the number of frequency components increases, hence making the region characterization easier. A pixel that belongs to the dynamic region can be selected by using the following procedure.
(i) The S-method is calculated for a
window (windows of size
up to
are used). The middle frequency range
is used.
(ii) The energy floor
is obtained by using the experimentally determined
.
(iii) The region is considered as nonstationary if:
where
, while
is the total number of the points within the region
.
The examples where the pixels belong to the dynamic and stationary regions, respectively, are shown in Figures 14(a) and 14(b).
The procedure for watermark embedding is just a two-dimensional case of the presented multidimensional approach. Namely, a two-dimensional support function is used:
The S-method as a two-dimensional time-frequency distribution is applied, while the energy threshold is
. The watermark is shaped by the space-spatial frequency characteristic of the image components:
A two-dimensional pseudorandom sequence
is used.
The watermark embedding and detection are performed in the space-spatial frequency domain:
This procedure is tested on several images (Lena, Peppers, Boat, F16, and Barbara), under various attacks (JPEG80-JPEG40, Median
, Median
, Average
, Impulse noise, Gaussian noise, Lightening, and Darkening). The PSNR was around 50 dB. The number of the selected pixels varied from 3304 for F16 to 7833 for Barbara. The probability of error was compared with the standard DCT-based procedure (with different detector forms), where 22050 coefficients are used. It was shown that the proposed procedure significantly outperforms the standard DCT procedures.
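For reference, the PSNR quoted above is conventionally computed from the mean squared error between the original and the watermarked image. The sketch below assumes 8-bit images (peak value 255) stored as lists of rows; the function name and the toy images are hypothetical.

```python
import math

def psnr(original, marked, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    n_pixels = len(original) * len(original[0])
    mse = sum((a - b) ** 2
              for ra, rb in zip(original, marked)
              for a, b in zip(ra, rb)) / n_pixels
    return float("inf") if mse == 0 else 10 * math.log10(peak ** 2 / mse)

# Toy example: every pixel changed by one gray level gives an MSE of 1.
original = [[100] * 4 for _ in range(4)]
marked = [[101] * 4 for _ in range(4)]
value = psnr(original, marked)
```

An MSE of one gray level squared corresponds to roughly 48 dB, so a PSNR around 50 dB indicates an average distortion below one gray level per pixel.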
3.2.3. Digital Video
Observe that the proposed approach can also be used for video signal watermarking. In this case, the two-dimensional and one-dimensional time-frequency distributions are combined. Namely, the stationary pixels and the stationary regions around them are selected by using the two-dimensional analysis, as described in the previous subsection. Then, the time-dependent sequence
is produced by taking the stationary pixels at the position
, along
consecutive frames. Based on
the frequency modulated signal is created as
where
, while
is a constant. The stationarity of the selected pixels along the time axis is examined by using the one-dimensional S-method. The experiments show that the minimal number of pixels for reliable watermark detection is about 600. This is easily achieved, even for a very short video sequence (more than 2500 stationary pixels are obtained for a signal lasting 2 s in the example provided in [42]). The approach was tested under MPEG4 compression. The obtained error probabilities were found to be within the range
.