- Open Access
Restoration of recto–verso colour documents using correlated component analysis
© Tonazzini and Bedini; licensee Springer. 2013
- Received: 25 September 2012
- Accepted: 25 February 2013
- Published: 24 March 2013
In this article, we consider the problem of removing see-through interferences from pairs of recto–verso documents acquired either in grayscale or RGB modality. The see-through effect is a typical degradation of historical and archival documents or manuscripts, and is caused by transparency or seeping of ink from the reverse side of the page. We formulate the problem as one of separating two individual texts, overlapped in the recto and verso maps of the colour channels through a linear convolutional mixing operator, where the mixing coefficients are unknown, while the blur kernels are assumed known a priori or estimated off-line. We exploit statistical techniques of blind source separation to estimate both the unknown model parameters and the ideal, uncorrupted images of the two document sides. We show that recently proposed correlated component analysis techniques outperform the already satisfactory independent component analysis and colour decorrelation techniques, even when the two texts are appreciably correlated.
- Document restoration and analysis
- See-through interference
- Blind source separation
- Correlated component analysis
One of the most common degradations affecting historical and/or archival documents that are written or printed on both sides of the page is see-through, that is, an undesired pattern in the background caused by the text on the reverse side of the page. Such distortion can significantly degrade the readability of the document or hinder the automatic analysis of its content. See-through is usually classified as either bleed-through or show-through. Bleed-through is intrinsic to the analogue document, especially in ancient ones, and results from the thinness of the paper or from chemical reactions of the ink, e.g. due to humidity. In these situations, the ink on the reverse side may penetrate through the paper fibres, thus emerging on the front side. The digital acquisition of document images through scanners can introduce show-through even in well-preserved documents, owing to light transmission through the paper, or can worsen degradation that is already present.
Several approaches for see-through reduction have been investigated, mainly for grayscale documents, and exploiting the availability of pre-registered scans of both sides (recto and verso).
In , a wavelet technique is applied for iteratively enhancing the foreground strokes and smearing the interfering strokes. In , a variational approach, based on nonlinear diffusion and wavelet transforms, has been proposed to model and then remove see-through from either single-sided or double-sided grayscale documents. In [3, 4], steps of segmentation to identify the see-through areas are followed by inpainting of estimated pure background areas. Segmentation-classification is the basis also for the methods derived in [5, 6].
In , the physical model of the show-through in modern scanners is first simplified for deriving a tractable mathematical nonlinear convolutional mixing model. This model is further approximated for decoupling the two recto and verso equations, in order to design an adaptive linear filter that is very effective in correcting a mild show-through.
Recently, the interest in applying blind source separation (BSS) algorithms for solving this problem has increased noticeably. The appearance of the degraded recto and verso scans is first modelled as a parametric superimposition of the uncorrupted recto and verso images, and then a separation algorithm is used to estimate both the mixing parameters and the ideal front and back side images (sources). The assumption of a linear instantaneous mixing model has led to BSS algorithms such as independent component analysis (ICA) [8, 9] or non-negative matrix factorization (NMF) . Some works have also addressed more realistic nonlinear and/or convolutional mixing models, and separation algorithms based on image regularization [11–14].
Specifically, within the linear instantaneous mixing model, in , by exploiting reasonable constraints of symmetry for the data model, ICA has been shown to be equivalent to symmetric whitening, and a fast separation algorithm has been proposed for grayscale recto–verso pairs affected by show-through.
NMF has been proposed for minimizing an energy function composed of a data term plus a regularization term compensating for the apparent nonlinearity of the show-through phenomenon, in correspondence of the occlusions between the two texts . Ophir and Malah  propose a convolutional BSS formulation, which accounts also for a nonlinearity of the show-through effect, assumed known and derived from . The solution is based on the total variation stabilizer for the ideal images. In , a maximum likelihood approach is proposed for two nonlinear mixtures of two texts, where the show-through nonlinearity is approximated as quadratic, and a blur kernel on the interfering pattern is accounted for and estimated.
In spite of the assumption of a linear instantaneous mixing model, ICA has proven to be very cost-effective and versatile for application to different typologies of data and several instances of document restoration and analysis. For example, it can easily be extended to the analysis of multispectral scans of a single-sided document containing multiple information layers , or when the data are the RGB recto and verso scans of a colour document . This latter case is particularly interesting if the aim is to produce a restored visible document that, while cleansed of the unwanted interferences, maintains its useful features, e.g. the original colour, as much as possible.
Nevertheless, ICA assumes independence or at least uncorrelation of the individual sources, that is, it forces uncorrelation between the recto and verso ideal images. While uncorrelation can be expected at the high frequencies, it is not realistic at the low frequencies, where we experimentally verified a significant cross-correlation between recto and verso, probably due to the large background areas.
In addition, efficient ICA algorithms have mainly been designed for instantaneous mixtures, whereas convolutional mixtures would be more appropriate to model the blur affecting the text emerging from the back side, which appears smeared owing to light diffusion through the paper or ink absorption by the paper fibres. Although ICA solutions to the convolutional mixture case have been proposed (see ), the problem has not yet been fully solved.
In this article, we consider the problem of removing see-through interferences from pairs of recto–verso documents, acquired either as grayscale images or RGB images. We still adopt a linear mixing data model, but remove the instantaneous assumption to account for blur on the source images. Hence, our model is linear convolutional. We show that the use of recently proposed correlated component analysis (CCA) techniques , based on second-order statistics and working in the Fourier domain, makes it possible both to remove the uncorrelation assumption and to easily manage the convolutional nature of the data model. Our method is based on the joint estimation of the mixing parameters and the source spectra and cross-spectra, performed by alternating minimization of a suitable cost function with respect to the two sets of variables. Once the estimates are available, the individual sources can be recovered either by a simple inverse filter or, when noise is present, by Wiener filtering, since the estimated spectra can effectively be exploited. At present, we assume that the blur kernels are either known a priori, for example when the degradation is mainly caused by the scanning process and the technical characteristics of the equipment used are available, or estimated off-line. In the latter case, the simple selection of a pure see-through area in one of the two sides can efficiently serve the purpose.
The method is very fast, and the experimental results show that it significantly outperforms instantaneous ICA, achieving separation even when the individual sources are largely correlated. This is especially true when the patterns that interfere from one side of the page to the other are appreciably blurred.
The remainder of the article is organized as follows. In Section 2, we describe the mathematical model that we assume herein for the see-through phenomenon in a pair of recto–verso colour images, and discuss our previously proposed solution [8, 9] that, neglecting the blur on the sources and assuming their statistical independence, makes use of ICA techniques. In Section 3, we describe the mathematical details of the method proposed in this article that, based on a technique of CCA, is able to account for both blur on the source and noise on the data, and allows for relaxing the assumption of full uncorrelation between the sources. Experiments on both synthetic and real documents are discussed in Section 4 and, finally, Section 5 summarizes the main achievements of this research, discusses advantages and limitations, and presents possible future developments.
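The definitions that follow refer to the data model of Equation (1), which is not reproduced in this text. A form consistent with those definitions, offered as a reconstruction under that assumption rather than as the authors' exact notation, is:

```latex
\begin{aligned}
x_r^C(t) &= a_{rr}^C \,\bigl(h_{rr}^C \oplus s_r^C\bigr)(t) + a_{rv}^C \,\bigl(h_{rv}^C \oplus s_v^C\bigr)(t) + \mu_r^C(t),\\
x_v^C(t) &= a_{vr}^C \,\bigl(h_{vr}^C \oplus s_r^C\bigr)(t) + a_{vv}^C \,\bigl(h_{vv}^C \oplus s_v^C\bigr)(t) + \mu_v^C(t),
\qquad C = R, G, B, \qquad (1)
\end{aligned}
```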
where $x_r^C(t)$ and $x_v^C(t)$ are the recto and (flipped) verso images at pixel $t$ and channel $C$, with $C = R, G, B$. Analogously, $s_r^C(t)$ and $s_v^C(t)$ are the clean recto and verso text patterns at pixel $t$ and channel $C$. The positive elements $a_{nm}^C$, with $n = r, v$ and $m = r, v$, of the so-called mixing matrix $A$ represent the unknown percentage of ink intensity attenuation of the two texts in the two sides, due to ageing factors, transparency of the paper, or ink seeping from the back to the front of the page. Analogously, the functions $h_{nm}^C$, with $n = r, v$ and $m = r, v$, which sum to one, are blur kernels accounting for the smearing of the ink due to the same factors, and the symbol ⊕ indicates the convolution operator. Finally, $\mu_n^C$, $n = r, v$, is the channel noise.
Restoring the degraded recto–verso images at hand entails solving the system in Equation (1) for ink attenuation indices, blur kernels and sources. Once estimated, the set of sources $\{s_n^C,\ n = r, v;\ C = R, G, B\}$ can be arranged as $(s_r^R, s_r^G, s_r^B)$ and $(s_v^R, s_v^G, s_v^B)$ to give the restored RGB recto and verso images.
Some physical properties of the see-through phenomenon can be exploited to simplify the problem in Equation (1). In fact, we can observe that the text interfering from the back side is always attenuated, whereas the same text is not attenuated, or much less so, in the side where it was originally written. Hence, the ink attenuation indices and the blur kernels are expected to be different, for each source, in the two observations, and for the two sources in the same observation. Let us consider that, at least in an idealized setting, the two sides have been written with the same ink, with the same pressure and at two close moments. Then, it is reasonable to assume that, at each channel, the attenuation of the see-through text in the two sides is the same, i.e. $a_{rv}^C = a_{vr}^C$, as well as the ink smearing, i.e. $h_{rv}^C = h_{vr}^C$. For similar considerations, it is also expected that $a_{rr}^C = a_{vv}^C$ and $h_{rr}^C = h_{vv}^C$. Furthermore, as already mentioned, the ink intensity of the front text in the recto side should be higher than that of the see-through text, i.e. $a_{rr}^C > a_{rv}^C$, with the same relationship holding, reversed, in the verso side. Under these assumptions, the system of Equation (1) turns out to be symmetric.
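To make the symmetric model concrete, the following is a minimal sketch of the noiseless forward mixing for a single channel. It assumes circular (periodic) convolution implemented through the FFT, and the function name `mix_symmetric` and its arguments are hypothetical, introduced only for illustration. The blur kernels are passed as their real, even transfer functions.

```python
import numpy as np

def mix_symmetric(s_r, s_v, a_main, a_cross, H_main, H_cross):
    """Noiseless symmetric recto-verso mixing for one channel (illustrative).

    s_r, s_v        : clean recto and (flipped) verso images, same 2-D shape
    a_main, a_cross : attenuation of the main text and of the see-through text
    H_main, H_cross : real, even DFT-domain responses of the two blur kernels
    Convolution is circular because it is carried out as a product of DFTs.
    """
    S_r, S_v = np.fft.fft2(s_r), np.fft.fft2(s_v)
    # Same coefficients on both sides: this is the symmetry assumption.
    X_r = a_main * H_main * S_r + a_cross * H_cross * S_v
    X_v = a_cross * H_cross * S_r + a_main * H_main * S_v
    return np.fft.ifft2(X_r).real, np.fft.ifft2(X_v).real
```

With both transfer functions set to one everywhere, the sketch reduces to the linear instantaneous model, in which each observation is a weighted sum of the two sources.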
For the case of grayscale recto–verso pairs, and neglecting both blur and noise, in  the restoration problem has been successfully solved by assuming statistical independence of the sources and employing a very fast and fully blind symmetric whitening of the data, which is equivalent to ICA for symmetric mixing matrices. However, the direct application of ICA to the full RGB recto–verso case is not appropriate. Indeed, here the mutual independence of the overall set of sources cannot be assumed, since the different colours of the same text pattern are certainly highly correlated.
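Symmetric whitening as described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a blur-free, noise-free grayscale pair, and the name `symmetric_whitening` is hypothetical. The data are multiplied by the inverse symmetric square root of their 2 × 2 sample covariance.

```python
import numpy as np

def symmetric_whitening(x_r, x_v):
    """Decorrelate a recto-verso pair with the inverse symmetric square root
    of the 2 x 2 sample covariance of the data (illustrative sketch)."""
    shape = x_r.shape
    X = np.stack([x_r.ravel(), x_v.ravel()]).astype(float)
    X = X - X.mean(axis=1, keepdims=True)        # remove the mean of each side
    C = X @ X.T / X.shape[1]                     # 2 x 2 sample covariance
    evals, evecs = np.linalg.eigh(C)             # C is symmetric
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T # symmetric C^(-1/2)
    Y = W @ X
    return Y[0].reshape(shape), Y[1].reshape(shape), W
```

When the sources are uncorrelated with unit variance and the mixing matrix is symmetric and positive definite, the data covariance is the square of the mixing matrix, so `W` coincides with its inverse. The outputs are zero-mean with unit variance, so the original offset and contrast have to be restored separately.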
A useful observation is that the 6 × 6 system of equations is separable into three independent 2 × 2 symmetric problems, which can then be solved separately. In this case, employing ICA to solve each subsystem only entails assuming independence of the recto and verso text at each individual channel, as done for the grayscale case, and no unrealistic uncorrelation is assumed among the various colour maps of the recto (verso) text. This allows reducing the interferences while preserving the original colour of the RGB recto and verso images . However, some residual see-through interference usually remains in the reconstructions. Indeed, if no blur is accounted for, two homologous patterns in the two different sides cannot match exactly; on the other hand, this could also indicate that the recto and verso text are actually correlated at each channel.
In the following section, we will show that CCA techniques, recently applied with success to solve different image processing problems , make it possible to account for blur and to fully relax the uncorrelation assumption between the recto and verso texts.
The essence of the CCA technique is to admit non-zero, although unknown, auto- and cross-correlations for the sources, which must be estimated jointly with the mixing parameters. In our case, this complex joint estimation can take advantage of the explicit exploitation of the symmetries of both the mixing operator and the source covariance matrix. In order to easily enforce suitable constraints that are available on the source spectra, the problem is also transformed from the space domain to the Fourier domain.
With this model, our scope reduces to the removal of the see-through interferences only, while the restoration of other degradations undergone by the text in the side where it was originally written is left to possible subsequent processing. However, it is to be noted that archivists and scholars often prefer a restoration intervention limited to artefact removal, which does not alter the original, aged appearance of the document itself. The proposed method could account for additive noise , restoring sources with a better SNR. However, since removing additive noise could alter the appearance of the document, as a first approach we decided to neglect noise.
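The channel-wise decomposition can be sketched as a thin wrapper that applies any 2 × 2 separation routine to each colour channel in turn. The names `restore_rgb` and `separate_pair` are hypothetical. The wrapper assumes images stored as arrays of shape (rows, columns, 3) and a routine that returns the estimated recto and verso for one channel.

```python
import numpy as np

def restore_rgb(recto_rgb, verso_rgb, separate_pair):
    """Solve the three independent 2 x 2 problems, one per colour channel.

    separate_pair(x_r, x_v) must return the pair (s_r, s_v) estimated from a
    single-channel recto and (flipped) verso image.
    """
    out_r = np.empty(recto_rgb.shape, dtype=float)
    out_v = np.empty(verso_rgb.shape, dtype=float)
    for c in range(3):  # R, G, B are processed independently of each other
        out_r[..., c], out_v[..., c] = separate_pair(recto_rgb[..., c],
                                                     verso_rgb[..., c])
    return out_r, out_v
```

Because no statistics are shared across channels, no assumption is made about the correlation between the colour maps of the same text.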
and $S^C(\omega)^T = [S_r^C(\omega), S_v^C(\omega)]$.
Without loss of generality, in the following we assume the blur kernels $h^C$ to be circularly symmetric. In a first approach, we assume them Gaussian with known variance. This assumption greatly simplifies the estimation process. It is also to be noted that, in a large part of documents, the off-line estimation of the blur kernel does not present significant difficulty.
From the circular symmetry, it follows that $H^{C*}(\omega) = H^{C\,T}(\omega)$.
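The consequence of circular symmetry for a Gaussian kernel can be checked numerically. The sketch below is illustrative only and the name `gaussian_kernel_fft` is hypothetical. It builds a unit-sum Gaussian kernel centred on pixel (0, 0) with wrap-around and returns it together with its 2-D DFT, which is then real up to rounding error.

```python
import numpy as np

def gaussian_kernel_fft(shape, sigma):
    """Unit-sum circular Gaussian kernel and its 2-D DFT (illustrative)."""
    rows, cols = shape
    # Signed pixel offsets with wrap-around: 0, 1, ..., -2, -1.
    dy = np.fft.fftfreq(rows, d=1.0 / rows)[:, None]
    dx = np.fft.fftfreq(cols, d=1.0 / cols)[None, :]
    h = np.exp(-(dx ** 2 + dy ** 2) / (2.0 * sigma ** 2))
    h /= h.sum()                      # kernels are assumed to sum to one
    return h, np.fft.fft2(h)
```

The unit-sum normalization makes the response at zero frequency equal to one, so the blur leaves the mean intensity of the pattern unchanged.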
From several simulations we found that, for the specific problem at hand, Equations (8) and (9) yield almost identical spectra. Thus, Equation (9) establishes the relationship, ∀l, between the circular cross-spectra of the data and those of the sources. Note that, ∀l, the cross-spectral matrices of the data and of the sources are 2 × 2 real and symmetric. Hence, the number of independent equations is 3.
where ⊗ indicates the Kronecker product, and $d(l)$ and $g(l)$ are the lexicographic forms of the data and source cross-spectral matrices, respectively.
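The role of the Kronecker product can be illustrated with the standard identity that links a congruence transformation of a matrix to a linear map on its lexicographic (row-by-row) form. The sketch below assumes a real 2 × 2 mixing matrix `M` at a given frequency. The name `data_spectrum_vector` is hypothetical and the code is not taken from the article.

```python
import numpy as np

def data_spectrum_vector(M, C_s):
    """Lexicographic form of M @ C_s @ M.T computed as (M kron M) @ vec(C_s).

    M   : real 2 x 2 mixing matrix at one frequency (assumed)
    C_s : 2 x 2 source cross-spectral matrix at the same frequency
    """
    g = C_s.reshape(-1)              # row-by-row (lexicographic) ordering
    return np.kron(M, M) @ g
```

Since the matrices involved are symmetric, two of the four entries coincide, which is why only three independent equations remain at each frequency.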
and then performing alternate minimizations with respect to the mixing parameters and the spectra. By choosing suitable regularization functions, the minimizer with respect to $g(l)$, $l = 1,\dots,l_{\max}$, can be computed analytically at a very low computational cost, while the minimizer with respect to $a^C$ can be computed iteratively or by stochastic algorithms. The functions Φ are chosen so as to enforce a global regularization constraint on the cross-spectra. Global smoothness and minimum energy are the constraints most frequently used for this purpose. We implemented both and found substantially equivalent performance, with the minimum-energy constraint being slightly simpler. Thus, the results presented in this article have been obtained by enforcing a minimum-energy constraint on the cross-spectra. The minimization with respect to the model parameters requires, in general, stochastic algorithms such as simulated annealing. In our case, since we need to estimate a single parameter, we employed a faster iterative technique.
By analysing several documents affected by see-through, we found that at the high frequencies the cross-spectra go quickly to zero, which means that recto and verso are uncorrelated beyond a certain frequency $l_1$ that can be determined experimentally. Specifically, we found that, in most cases, taking Δω = 1.5 gives $l_1 = 10$. Hence, the solution to the problem of Equation (12) can be found by enforcing the further constraint of null cross-spectra for $l > 10$.
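The analytical minimizer with respect to the spectra can be illustrated under explicit assumptions, since the cost function itself is not reproduced here. The sketch assumes a per-frequency cost of the form ||d − K g||² + λ||g||², with the minimum-energy penalty applied to the whole vector g = [source recto spectrum, cross-spectrum, source verso spectrum], and a mixing matrix [[1, b], [b, 1]] in which b is the attenuation times the blur response at that frequency. All names are hypothetical.

```python
import numpy as np

def system_matrix(b):
    """Map g = [p, c, q] (source spectra) to d = [C_rr, C_rv, C_vv] (data
    spectra) for the assumed 2 x 2 mixing matrix [[1, b], [b, 1]]."""
    return np.array([[1.0,   2.0 * b,     b * b],
                     [b,     1.0 + b * b, b],
                     [b * b, 2.0 * b,     1.0]])

def min_energy_spectra(K, d, lam):
    """Closed-form minimizer of ||d - K g||^2 + lam * ||g||^2."""
    n = K.shape[1]
    return np.linalg.solve(K.T @ K + lam * np.eye(n), K.T @ d)
```

With `lam = 0` and an invertible `K`, the data equations are satisfied exactly. A positive `lam` trades some data fidelity for a solution of smaller energy.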
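The null cross-spectrum constraint is what makes the single attenuation parameter identifiable: at those frequencies there are three data equations but only two unknown source spectra. The sketch below is a hedged illustration that replaces the iterative technique mentioned in the text with a plain grid search. It assumes the mixing matrix [[1, a h], [a h, 1]] with a known real blur response h at each mode, and all names are hypothetical.

```python
import numpy as np

def residual_for(a, h, d):
    """Least-squares misfit over modes where the source cross-spectrum is
    taken as zero. h: (L,) blur responses; d: (L, 3) rows [C_rr, C_rv, C_vv]."""
    total = 0.0
    for h_l, d_l in zip(h, d):
        b = a * h_l
        # Columns act on the two remaining unknowns: the source auto-spectra.
        K = np.array([[1.0, b * b], [b, b], [b * b, 1.0]])
        g, *_ = np.linalg.lstsq(K, d_l, rcond=None)
        total += float(np.sum((K @ g - d_l) ** 2))
    return total

def estimate_attenuation(h, d, grid):
    """Return the grid value of a with the smallest total misfit."""
    costs = [residual_for(a, h, d) for a in grid]
    return float(grid[int(np.argmin(costs))])
```

The search is restricted to values of a below one, in line with the assumption that the see-through text is weaker than the main text. A finer grid or a one-dimensional iterative minimizer could refine the estimate.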
Once the estimates are available, the individual sources can be recovered, at each Fourier mode, by inverse filtering. When noise is present, Wiener filtering is employed instead, since the estimated spectra can effectively be exploited. In both cases, the method is very fast. In particular, its complexity is comparable to that of the FastICA algorithm , and much lower than that of methods based on nonlinear data models, such as the one proposed in .
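Inverse filtering at each Fourier mode amounts to inverting a 2 × 2 matrix per frequency. The following sketch covers the noiseless case only and assumes the mixing matrix [[1, a H], [a H, 1]], with a real, even blur response H and circular boundary conditions. The name `inverse_filter` is hypothetical.

```python
import numpy as np

def inverse_filter(x_r, x_v, a, H):
    """Recover the two sources mode by mode from a symmetric convolutional
    mixture (noiseless illustrative sketch).

    x_r, x_v : observed recto and (flipped) verso images, same 2-D shape
    a        : estimated attenuation of the see-through text, 0 < a < 1
    H        : real, even DFT-domain response of the blur kernel, |H| <= 1
    """
    X_r, X_v = np.fft.fft2(x_r), np.fft.fft2(x_v)
    B = a * H
    det = 1.0 - B ** 2               # determinant of [[1, B], [B, 1]]
    S_r = (X_r - B * X_v) / det      # closed-form 2 x 2 inverse, per mode
    S_v = (X_v - B * X_r) / det
    return np.fft.ifft2(S_r).real, np.fft.ifft2(S_v).real
```

Because a is below one and the blur response does not exceed one in magnitude, the determinant stays away from zero and the inversion is well conditioned at every mode.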
In this section, we will show the performance of the CCA technique described above compared with that of the ICA technique, this latter implemented either through symmetric whitening or through the FastICA algorithm .
In a first set of experiments, described in Section 4.1, we processed a variety of both grayscale and RGB recto–verso pairs, either affected by show-through or bleed-through, including some among the ones that are mostly tested in the literature on the subject.
During the revision process of this article, we became aware of the existence of a recently published online database of high-resolution grayscale images of ancient documents affected by bleed-through . This database has been created by the project Irish Script on Screen (ISOS) of the School of Celtic Studies, Dublin Institute for Advanced Studies, in conjunction with the SIGMEDIA group of the Department of Electrical and Electronic Engineering at Trinity College Dublin.
Hence, in Section 4.2, we analyse the results of applying our techniques to those images.
4.1. Test images: miscellaneous
In the first example of Figure 3, we chose to neglect the blur effect, and compared the performance of CCA with that of FastICA. In particular, Figure 3a,b shows the original recto and verso scans, Figure 3c,d shows the results obtained by FastICA and Figure 3e,f shows the results of CCA. Again, CCA outperforms ICA by producing almost perfectly cleansed reconstructions.
In the second example of Figure 4, we included a blur in the form of a Gaussian of standard deviation 2, to account for the appreciable smearing of the see-through pattern, and obtained the results of Figure 4c,d (an enlarged portion is shown for a better qualitative evaluation).
4.2. Test images: the Irish bleed-through database
The bleed-through database at the website  comprises 25 registered recto–verso sample grayscale image pairs, taken from larger high-resolution manuscript images, with varied degrees of bleed-through. In addition, for each image a binary ground-truth mask of the foreground text is provided. Although these ground truth images are synthetic, i.e. manually created, they can be useful for a quantitative analysis of the results. Furthermore, in our case, they can also be used for estimating the cross-correlation of the clean, ideal recto and verso foreground texts.
We have tested both ICA and CCA on a large subset of the whole database, and have found that the images can roughly be subdivided into two categories: those images where CCA and ICA perform similarly, and those images where CCA is definitely superior to ICA. As one might expect, the images whose corresponding ground-truths exhibit a low correlation fall in the first category, whereas when the cross-correlation of the ground-truths is significant, CCA outperforms ICA. In the following, we report two examples that are representative of this general behaviour of the two algorithms.
We have shown that a technique of CCA significantly outperforms ICA when applied to the restoration of RGB recto–verso pairs of historical documents. This can be achieved without affecting the typical computational efficiency and the unsupervised nature of BSS techniques, which make them suitable also for routine application to large datasets of archival documents.
Unlike ICA, CCA can achieve separation even when the individual sources are largely correlated. This is especially true when the patterns that interfere from one side of the page to the other are appreciably blurred, owing to light or ink spreading through the support. Although the method can easily account for these blur kernels on the sources, at present we consider them known or estimate them off-line. We are currently studying a strategy to jointly estimate the blur kernels along with all the other parameters.
We should point out that our method is based on a linear, although convolutional, mixing model. This is undoubtedly a limitation, in that the see-through effect is likely to be nonlinear. In fact, some recent works (see, e.g. ) have shown that, using a nonlinear convolutional model, excellent results can be obtained, although at the price of a higher computational cost.
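The cross-correlation of two ground-truth masks can be measured, for instance, with the Pearson correlation coefficient of their pixel values. This is one plausible choice rather than the measure necessarily used in the experiments. The sketch assumes binary masks that are already registered, with the verso mask flipped to match the recto, and the name `mask_correlation` is hypothetical.

```python
import numpy as np

def mask_correlation(mask_r, mask_v_flipped):
    """Pearson correlation coefficient between two registered binary masks.

    The result is undefined (NaN) when either mask is constant.
    """
    r = np.asarray(mask_r, dtype=float).ravel()
    v = np.asarray(mask_v_flipped, dtype=float).ravel()
    return float(np.corrcoef(r, v)[0, 1])
```

A value close to zero would place an image pair in the first category described above, whereas a clearly non-zero value would suggest that the uncorrelation assumption of ICA is violated.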
However, many issues still remain open, along the difficult way to find a comprehensive model that is able to describe all multiple and varied causes behind the see-through phenomenon in ancient documents. In our opinion, the two most critical open issues are correlation of the sources and non-stationarity of the degradation. Neither models nor methods are presently available to simultaneously address both problems. This article aims to give a contribution, supported by promising results, towards the solution of the source correlation problem in the linear convolutional case. A next step could be to include the treatment of source correlation within a nonlinear convolutional data model.
This study was supported by the European funds, through the program POR Calabria FESR 2007–2013 - PIA Regione Calabria, project ITACA (Innovative Tools for cultural heritage ArChiving and restorAtion).
- Tan CL, Cao R, Shen P: Restoration of archival documents using a wavelet technique. IEEE Trans. Pattern Anal. 2002, 24: 1399-1404. 10.1109/TPAMI.2002.1039211
- Moghaddam RF, Cheriet M: Low quality document image modeling and enhancement. Int. J. Doc. Anal. Recognit. 2009, 11: 183-201. 10.1007/s10032-008-0076-2
- Dubois E, Pathak A: Reduction of bleed-through in scanned manuscript documents. In Proceedings of the Image Processing, Image Quality, Image Capture Systems Conference (PICS), vol. 4. Montreal, Canada; 2001:177-180.
- Dano P: Joint restoration and compression of document images with bleed-through distortion. Dissertation. Ottawa-Carleton Institute for Electrical and Computer Engineering, School of Information Technology and Engineering, University of Ottawa; 2003.
- Knox K: Show-through correction for two-sided documents. U.S. Patent 5,832,137. 1998.
- Wang Q, Tan CL: Matching of double-sided document images to remove interference. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1. Hawaii, USA; 2001:1084-1089.
- Sharma G: Show-through cancellation in scans of duplex printed documents. IEEE Trans. Image Process. 2001, 10: 736-754. 10.1109/83.918567
- Tonazzini A, Salerno E, Bedini L: Fast correction of bleed-through distortion in greyscale documents by a blind source separation technique. Int. J. Doc. Anal. Recognit. 2007, 10: 17-25. 10.1007/s10032-006-0015-z
- Tonazzini A, Bianco G, Salerno E: Registration and enhancement of double-sided degraded manuscripts acquired in multispectral modality. In Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR). Barcelona, Spain; 2009:546-550.
- Merrikh-Bayat F, Babaie-Zadeh M, Jutten C: Using non-negative matrix factorization for removing show-through. In Proceedings of the International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA), vol. LNCS 6365. St. Malo, France; 2010:482-489.
- Ophir B, Malah D: Show-through cancellation in scanned images using blind source separation techniques. In Proceedings of the IEEE International Conference on Image Processing (ICIP), vol. 3. San Antonio, Texas, USA; 2007:233-236.
- Tonazzini A, Gerace I, Martinelli F: Multichannel blind separation and deconvolution of images for document analysis. IEEE Trans. Image Process. 2010, 19: 912-925.
- Merrikh-Bayat F, Babaie-Zadeh M, Jutten C: Linear-quadratic blind source separating structure for removing show-through in scanned documents. Int. J. Doc. Anal. Recognit. 2011, 14: 319-333. 10.1007/s10032-010-0131-7
- Salerno E, Martinelli F, Tonazzini A: Nonlinear model identification and seethrough cancellation from recto-verso data. Int. J. Doc. Anal. Recognit. Published online 17 March 2012.
- Tonazzini A, Bedini L, Salerno E: Independent component analysis for document restoration. Int. J. Doc. Anal. Recognit. 2004, 7: 17-27.
- Comon P, Jutten C: Handbook of Blind Source Separation. 1st edition. New York: Academic Press; 2010.
- Ricciardi S, Bonaldi A, Natoli P, Polenta G, Baccigalupi C, Salerno E, Kayabol K, Bedini L, De Zotti G: Correlated component analysis for diffuse component separation with error estimation on simulated Planck polarization data. Mon. Not. R. Astron. Soc. 2010, 406: 1644-1658.
- Bedini L, Salerno E: Fourier-domain implementation of correlated component analysis, with error estimation. Internal Report ISTI-CNR; 2008.
- Gävert H, Hurri J, Särelä J, Hyvärinen A: The FastICA package for MATLAB. 2005. http://research.ics.aalto.fi/ica/fastica/
- Irish Script On Screen: Sigmedia, Bleed-Through Database. 2012. http://www.isos.dias.ie/master.html?http://www.isos.dias.ie/libraries/Sigmedia/english/index.html?ref=
- Otsu N: A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man. Cybern. 1979, 9(1):62-66.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.