Improvement in twins handwriting identification with invariants discretization
EURASIP Journal on Advances in Signal Processing volume 2012, Article number: 48 (2012)
Abstract
Writer Identification is one of the most active areas of study in pattern recognition. A more recent development in the area, Twins Handwriting Identification, has become important in forensic research and biometric identification. A pair of twins may share many genetically determined traits, and handwriting samples are easily obtained as forensic evidence. Reliable and accurate identification based on handwriting therefore requires that the similarities in the writing traits of a pair of twins be differentiated. Handwriting style can be analyzed to give an implicit representation of the unique hidden features of an individual's handwriting. These features help to identify the writer of a text, which is essential when the candidates are a pair of twins. Previous studies in authorship identification concentrated on the classification task and on feature extraction. The similarities between the handwriting traits of a pair of twins were not taken into account, which is likely to degrade the performance of the classification process. To provide better input for the classification task, this article discusses an additional process that better represents an individual's personal features by transforming the similar features through a discretization protocol. The additional process helps to improve the identification of the Individuality of Handwriting for a pair of twins.
1 Introduction
Despite the technological advances of the current age, documents are still printed on paper and widely exchanged, hence the need for Writer Identification (WI). WI identifies the writer of a handwritten document. The shapes and styles of an individual's handwriting contain hidden personal traits of the writer, which contribute to handwriting identification in dynamic biometric studies. These biometric features are used to establish the identity of the writer [1–4], and this also applies to identifying the writer between a pair of twins. WI is commonly used to authenticate legal documents by studying the signature. It can also be used to identify the authorship of documents without a signature, such as threatening letters, historical or ancient manuscripts, and other documents that contain only handwritten text. Current technology allows WI to be performed even with limited samples of handwriting. In forensic handwriting analysis, WI holds great importance and is widely applied to evidence used in the courtroom [5–8]. The many issues and challenges in Twins Handwriting Identification therefore need further investigation.
Twins Handwriting Identification is a popular area of research in pattern recognition and computer vision because, in some situations, it provides the only means of discovering the real writer of a written text among a group of people [9, 10].
Previous studies of twins' biometric identification, including studies of the discriminability between the fingerprints of a pair of twins, DNA analysis, computational discriminability analysis of twins' fingerprints, and coefficient values in individual sets that form a unique code for an individual's face, have shown that natural physiological traits remain unchanged throughout an individual's life. Handwriting, however, is associated with an individual's behavioral nature. Unlike physiological traits, which remain constant, handwriting can change as the individual's behavior changes, and this provides a strong motivation for the study of handwriting.
Distinguishing the handwriting of a pair of twins is a challenge in biometric study. Over the years, it has been observed that the unique features of an individual are embedded in the individual's handwriting. Various techniques have therefore been developed and improved to study handwriting samples.
Studying the handwriting of a pair of twins makes it possible to distinguish between the two writers. This task has proven more complex than studying the handwriting of non-twins, because the resemblance between a pair of twins extends to their writing manner, which produces similar features in their handwriting. The identification phase has two stages: the analysis of the individual features, and the identification and capture of the features that are similar. The outputs of these stages are then processed by computer with a classical method of identification.
In identifying the author of a handwritten text, previous studies on WI have concentrated on the tasks of feature extraction and classification. Most work has not considered the additional step that is the focus of this article, which aims to provide a better representation of the input to the classification process. A better representation allows the classification task to be done more quickly and accurately, so that the real writer is identified more reliably, especially when the candidates are a pair of twins. The features obtained from feature extraction show that the handwriting of a pair of twins has very similar representations. This causes a problem when they are used as input to the classifier, because the similarities lower the accuracy of the classification task. This article discusses an additional transformation step in which the closely similar feature representations are transformed into clearer representations that distinguish each twin.
2 Individuality of Twins Handwriting
An individual's nature can be seen through his or her handwriting. The hypothesis in [16–18] states that a person's individuality in writing shows in the fact that the person has a consistent form of handwriting. Figures 1 and 2 show samples of handwritten text from three pairs of twins: Figure 1 shows samples of the same characters and Figure 2 shows samples of different characters. The shapes of the writing differ only slightly when the samples come from the same writer in a pair of twins, and differ more clearly between the two writers of a pair, although the height of the writing is similar. This difference is the 'Individuality of Handwriting': the difference in handwriting remains evident even between a pair of twins. Individuality can be measured through variance, where the variance of the features for one writer (intra-class) must be lower than that between different writers (inter-class) [19–21]. Individual features are considered good and acceptable if they give the lowest similarity error for one author in a pair of twins (intra-class) and the highest similarity error between the two authors in a pair of twins (inter-class). Individual features therefore need to be acquired from the handwriting samples in order to identify the author of a text when the identification involves a pair of twins. This concept of handwriting individuality has been defined and discussed as authorship invarianceness.
3 Unique representation
Features are the input that the classifier uses in the identification process, so good and reliable features are needed for accurate and clear identification. The features produced by feature extraction are commonly passed directly to the classifier. For handwriting identification involving a pair of twins, however, direct use is not suitable: the representations of the twins' individual features are usually very similar, which causes the intra-class variance to be large and the inter-class variance to be small. To improve authorship invarianceness, another process can be added before the features are used in the classification process. This study applies the Invariant Discretization Technique to samples of twins' handwriting. The technique is meant to reduce the intra-class variance of the features while increasing the inter-class variance. Figure 3 shows an overview of the study, including the additional procedure performed to improve the identification of a pair of twins' handwriting.
4 Feature extraction
Macro-features, which represent the global characteristics of an individual's writing habit and style, can be captured and extracted from an entire document [10, 15]. These macro-features are used in this study to identify the writer between a pair of twins. Thirteen macro-features, including the 11 initial features stated in [10, 17], are considered. The 11 features are the entropy of grey values, the binarization threshold, the number of black pixels, the number of interior contours, the number of exterior contours, the four contour slope components (horizontal, positive, vertical, and negative), the average height, and the average slant. Only eight features are used in the experiments of this study: the entropy of grey values, the binarization threshold, the number of black pixels, the number of interior contours, the number of exterior contours, the average height, the average slant, and the average stroke width. Macro-features were chosen for the experiments because the global characteristics they capture can present the writer's individuality in terms of writing style and habit. Detailed descriptions of the macro-feature algorithm are provided in [15–17].
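As a rough illustration of two of the macro-features listed above, the following minimal sketch computes the entropy of grey values and the number of black pixels for a small image. It is not the macro-feature algorithm of [15–17]: the function names, the toy image, and the fixed threshold of 128 are assumptions made only for this example, and the image is represented as a plain list of rows of grey levels.

```python
import math
from collections import Counter


def grey_entropy(image):
    """Shannon entropy (in bits) of the grey-level histogram of the image."""
    pixels = [p for row in image for p in row]
    total = len(pixels)
    counts = Counter(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def black_pixel_count(image, threshold):
    """Number of pixels darker than the given binarization threshold."""
    return sum(1 for row in image for p in row if p < threshold)


# Hypothetical 3x4 grey-level patch (0 = black ink, 255 = white paper).
patch = [
    [255, 255, 0, 0],
    [255, 0, 0, 255],
    [255, 255, 255, 255],
]

# The threshold is passed in by hand here; how it is chosen is not shown.
features = [grey_entropy(patch), black_pixel_count(patch, threshold=128)]
```

In a full pipeline, each document image would yield one such feature vector, extended with the remaining macro-features.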
5 Discretization
In classification, the problem is usually framed in terms of training instances: a set of instances with distinct, descriptive features that are categorized into classes. Discretization transforms the continuous features into discrete partitions with a certain number of intervals, each bounded by a lower and an upper boundary. Because continuous features can be represented in many ways, two decisions are needed. First, the number of intervals for each discrete partition must be determined; this number is usually selected at random. Second, the boundaries of the intervals must be decided. Known discretization methods include Equal Information Gain, Maximum Entropy, and Equal Interval Width. Another method, Invariants Discretization, has been reported to give higher accuracy and success rates of identification. Invariants Discretization is a supervised method. It starts by searching for the appropriate intervals to represent the writer's information, and the upper and lower boundaries are then set for each interval. The number of intervals for an image must equal the number of feature vector columns.
The individual's uniqueness can be computed for each writer, and preserving this information eases the task of classification. Discretization has proved beneficial for nonlinear representation, and the resulting set of intervals is easy for humans to interpret. Reducing the amount of data also speeds up computation [25, 26]. A previous study found that post-discretized data gave a higher level of identification than pre-discretized data: applying the discretization method to the proposed integrated Moment Invariant produced higher accuracy.
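For comparison with the supervised method, the simplest of the methods named above, Equal Interval Width, can be sketched as follows. This is a generic unsupervised illustration, not code from the study; the function name and the sample values are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each continuous value to the index of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so a single interval is enough.
        return [0] * len(values)
    # The maximum value is clamped into the last interval.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


print(equal_width_bins([0.0, 2.5, 5.0, 7.5, 10.0], n_bins=4))
```

Unlike Invariants Discretization, this version ignores the writer class, so the interval boundaries are the same for every writer.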
5.1 Discretization protocol
The discretization process calculates an appropriate number of intervals, together with a representation value for the extracted features that fall into each interval. The representation value, called the discretized feature vector, provides the 'generalized unique feature' for each individual feature; it illustrates the hidden features of an individual's writing style. The range between the minimum and maximum of the data for each writer is divided into intervals of equal size, delimited by 'cuts'. The number of feature vector columns of the extracted features defines the number of intervals.
The example uses the eight feature vector columns obtained from the macro-feature technique. Each interval has a lower and an upper approximation and is represented by one representation value. In supervised discretization, this value is calculated per writer class. An invariant feature vector that falls into an interval takes that interval's representation value, so writers with closely similar invariant feature vectors will have similar intervals for the two classes. The discretization algorithm does not alter the information and characteristics of a writer; it only presents the invariant feature vectors obtained from feature extraction in a standard representation with generalized features. Figure 4 illustrates the discretization algorithm.
Invariant discretization requires the writer class information for the discretization process. The range of the intervals on the invariant discretization line is calculated from the minimum (νmin) and the maximum (νmax) invariant feature vector (ifν) of the writer: the line for a writer starts at νmin and ends at νmax. The interval width is the length of the invariant discretization line divided by the number of invariant feature vector columns. The width (wd) is calculated as follows:

$$w_d = \frac{\nu_{\max} - \nu_{\min}}{f}$$

where νmin is the minimum value of the invariant feature vector for a writer; νmax is the maximum value of the invariant feature vector for a writer; and f is the number of invariant feature vector columns.
The cut points of the intervals on an invariant discretization line are defined by the width. An invariant feature vector that falls in an interval takes the interval's representation value. The representation value (rν) of each interval is the average of the interval's boundaries, rν = (iνmin + iνmax)/2, where iνmin and iνmax are the lower and upper boundaries of the interval. An invariant feature vector is assigned to one of intervals 1 to 7 if ifν ≥ iνmin and ifν < iνmax, and to the last interval if ifν ≥ iνmin and ifν ≤ iνmax. The representation value, known as the discretized feature vector, represents the unique features in an individual's writing. Figures 5 and 6 show the feature vectors for pre- and post-discretized data, respectively, illustrating the transformation of the invariant feature vector into the discretized feature vector. The discretization algorithm produces discretized feature vectors that clearly illustrate an individual's unique features, even between a pair of twins.
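The protocol described in this section can be sketched as follows, under stated assumptions. The function name is hypothetical. The sketch takes the minimum and maximum over all feature values of one writer and uses the midpoint of each interval as its representation value, which is one reading of the text rather than the published implementation.

```python
def invariant_discretize(samples):
    """Discretize the feature vectors of ONE writer.

    samples: list of equal-length feature vectors from the same writer.
    Returns vectors of the same shape in which each value is replaced by
    the representation value (midpoint) of the interval it falls into.
    """
    f = len(samples[0])  # number of feature columns = number of intervals
    values = [v for row in samples for v in row]
    v_min, v_max = min(values), max(values)
    width = (v_max - v_min) / f
    if width == 0:
        # Degenerate case: every value is identical, nothing to discretize.
        return [list(row) for row in samples]

    def represent(v):
        # Intervals are half-open except the last, which includes v_max.
        k = min(int((v - v_min) / width), f - 1)
        lower = v_min + k * width
        return lower + width / 2  # midpoint of [lower, lower + width]

    return [[represent(v) for v in row] for row in samples]


# Hypothetical writer with two samples of two features each.
writer_samples = [[0.0, 4.0], [1.0, 3.0]]
discretized = invariant_discretize(writer_samples)
```

Because the function is called once per writer, each writer obtains its own discretization line, which is what makes the method supervised.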
6 Simulation result
Two experiments were conducted in this study: a test of authorship invarianceness on the handwriting samples of pairs of twins, and an evaluation of the accuracy of identification between the twins of a pair. The first experiment was conducted to show that the discretization technique improves the variance of the intra-class (same writer in a pair of twins) and inter-class (both writers in a pair of twins) features. The second experiment evaluated how far discretization improves the identification of the writer between a pair of twins, using the Rosetta Toolkit and an artificial neural network (ANN). The data were a collection of 390 samples obtained from 13 pairs of identical twins from Sulaimania University, Iraq.
6.1. Authorship invarianceness between twins
Authorship invarianceness can be measured with the Mean Absolute Error (MAE) function. Figure 7 presents an example of the MAE calculation. For each twin, there are 15 images of handwriting samples. Features 1 to 8 are the features extracted to represent a character. The MAE value gives the invarianceness of a character relative to the reference image (the first image). A small error indicates that the image is close or similar to the reference image. The average MAE is calculated from the overall result. The MAE is calculated as

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{f}\left|x_i - r_i\right|$$

where n is the number of images; x_i is the current image; r_i is the reference image or location measure; f is the number of features; and i is the feature column of the image.
Authorship invarianceness for post- and pre-discretized feature vectors is assessed by analyzing the intra-class and inter-class MAE values. The analysis shows that the post-discretized feature vector gives better authorship invarianceness than the pre-discretized feature vector: its intra-class MAE value is smaller and its inter-class MAE value is higher. A low intra-class MAE indicates that the features of a single writer in a pair of twins are similar, while a high inter-class MAE indicates that the handwriting features of each twin differ from those of the other. The results therefore support the hypothesis that the discretization process improves authorship invarianceness, with the standard representation clearly presenting the individual's unique features to help identify the writer between a pair of twins. Figures 8 and 9 compare the authorship invarianceness of the macro-feature technique with post- and pre-discretized data, respectively.
Figures 8 and 9 show that an individual's unique features are evident even between a pair of twins. Because the MAE value for intra-class (same writer in a pair of twins) is lower than the MAE value for inter-class (both writers in a pair of twins), the results satisfy the concept that the handwriting of a pair of twins carries traits of individuality. The post-discretized feature vector illustrates the individual features better than the pre-discretized feature vector. For intra-class (same writer in the same twins), the post-discretized data should give a lower MAE value than the pre-discretized data; for inter-class (both writers in the same twins), it should give a higher one. The results in Figure 10 also show that the use of post-discretized data improved the MAE value for inter-class.
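The intra-class and inter-class comparison can be sketched as follows. The sketch assumes, from the symbol list given for the MAE, that the error of one image is the sum of absolute feature differences to the reference image divided by n, the number of images, and that the average MAE is the mean over the images. It also assumes that both classes are compared against the first image of one twin. The function names and feature values are hypothetical.

```python
def mae(image, reference, n):
    """Sum of absolute feature differences to the reference, divided by n."""
    return sum(abs(x - r) for x, r in zip(image, reference)) / n


def average_mae(images, reference):
    """Mean of the per-image MAE values over a set of images."""
    n = len(images)
    return sum(mae(img, reference, n) for img in images) / n


# Hypothetical 2-feature vectors; the first image of twin A is the reference.
twin_a = [[1.0, 2.0], [1.1, 2.1], [0.9, 2.0]]
twin_b = [[1.6, 2.5], [1.5, 2.7], [1.7, 2.4]]
reference = twin_a[0]

intra = average_mae(twin_a, reference)  # same writer in the pair
inter = average_mae(twin_b, reference)  # the other writer in the pair
print(intra < inter)
```

Only the ordering of the two values matters for the individuality criterion, so a different normalisation constant would not change the conclusion drawn from such a comparison.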
6.2. Identification performance evaluation with rough set classifier
In the classification task, any method may be chosen for its efficiency and its ability to complete the task as required, whether the aim is to reduce computation time or to minimize classification errors. In this article, rough set theory was chosen for its ability to handle the upper and lower approximations of a set, which provides a way of classifying objects under noisy or incomplete conditions.
The boundary region of a set is the set difference between its upper and lower approximations. Figure 11 illustrates the concept of rough set theory, and Figure 12 shows the approximation role of the rough set concept.
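The lower approximation, upper approximation, and boundary region can be illustrated with a minimal sketch. The equivalence classes (groups of samples that are indistinguishable on their attribute values) and the sample names are hypothetical, and the code is independent of the Rosetta toolkit.

```python
def approximations(equivalence_classes, target):
    """Return (lower, upper, boundary) approximations of a target set."""
    target = set(target)
    lower, upper = set(), set()
    for block in equivalence_classes:
        block = set(block)
        if block <= target:   # block lies entirely inside the target
            lower |= block
        if block & target:    # block overlaps the target
            upper |= block
    return lower, upper, upper - lower


# Samples grouped by identical (discretized) feature values.
classes = [{"s1", "s2"}, {"s3"}, {"s4", "s5"}]
# Samples actually written by one twin.
twin_a = {"s1", "s2", "s4"}

lower, upper, boundary = approximations(classes, twin_a)
```

Here s4 and s5 share the same attribute values but only s4 belongs to the target set, so both fall in the boundary region, where membership cannot be decided with certainty.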
An experiment was conducted with Rosetta (the Rough Set Toolkit) to evaluate the performance of writer identification between a pair of twins with both post- and pre-discretized data. The experiment takes into account the additional discretization step used in this study for Twins' Handwriting Identification, which helped to increase the accuracy of the identification. A total of 390 data samples, divided into training and testing datasets, were used for the classification task.
To obtain a more reliable and accurate measure of performance with the discretization method, cross-validation (CV) was applied to the post- and the pre-discretized data. The number of folds was specified by the number of CV iterations. The experiments in this study used 10-, 7-, and 5-fold CV. Discretization was performed with the Invariant Discretization method by Azah Kamilah.
Two experiments were completed, one with 70% training data and 30% testing data and the other with 60% training data and 40% testing data, each with 10-, 7-, and 5-fold cross-validation. The experimental results from the CV process are shown in Tables 1 and 2 and visualized in Figures 13 and 14.
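The fold structure for 390 samples can be sketched as below. This is a generic k-fold split, not the Rosetta procedure: the function name is hypothetical, and the sketch neither shuffles nor stratifies the samples, both of which a real experiment would normally do.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k folds of nearly equal size."""
    indices = list(range(n_samples))
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for k in (10, 7, 5):
    folds = list(k_fold_indices(390, k))
    print(k, [len(test) for _, test in folds])
```

Each sample appears in exactly one test fold, so every sample is used for testing once and for training k - 1 times.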
Through the results shown in both Tables 1 and 2, it can be concluded that the use of post-discretized data can result in higher accuracy when compared to the use of pre-discretized data. Thus, the use of post-discretized data can significantly improve the performance of Twins' Handwriting Identification.
6.3. Identification performance evaluation with artificial neural network classifier
The ANN classifier is used on both types of the Twins datasets in order to achieve the main goal of the research. In this article, ANN is used to classify the between- and within-writer distances while minimizing misclassification errors. ANNs have several desirable properties: sound statistical procedure, practical software implementation of the Bayesian (optimal) procedure, no presumptions about the nature of the data (unlike other classifiers), and they let us tap into the full multivariate nature of the data and enable us to use a nonlinear discrimination criterion. In this research, we used a 3-layered network: an input layer with eight units and a hidden layer with five units. Figure 15 shows the ANN architecture of this research.
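A forward pass through a network of this shape can be sketched as follows. The eight input units and five hidden units come from the text. The two output units (one per twin), the sigmoid activation, the random initial weights, and all function names are assumptions made for illustration, and no training is shown.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Create weights (n_out x n_in) with small random values and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(layer, inputs):
    """Apply one fully connected layer followed by a sigmoid activation."""
    weights, biases = layer
    return [1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, inputs)) + b)))
            for row, b in zip(weights, biases)]


rng = random.Random(0)
hidden = make_layer(8, 5, rng)   # 8 input features -> 5 hidden units
output = make_layer(5, 2, rng)   # 2 output units: an assumption, one per twin

feature_vector = [0.5] * 8       # hypothetical discretized feature vector
scores = forward(output, forward(hidden, feature_vector))
predicted_twin = scores.index(max(scores))
```

With untrained weights the prediction is arbitrary; the sketch only shows how an eight-value feature vector flows through the layers.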
Three experiments were conducted with a varied number of training data and testing data where the first experiment used 70% training data and 30% testing data from a combination of pre-discretized and post-discretized datasets. The second experiment was conducted with the use of 60% training data and 40% testing data. ANN was used for the training process. With the use of the classification matrix, the overall accuracy of identification was calculated from each training and testing dataset.
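The overall accuracy derived from a classification matrix can be sketched as follows. The counts are hypothetical and are not results from this study.

```python
def overall_accuracy(matrix):
    """Percentage of samples on the diagonal of a square classification matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


# Rows are the true writer, columns the predicted writer (hypothetical counts).
matrix = [
    [8, 1],
    [2, 7],
]
accuracy = overall_accuracy(matrix)
```

The diagonal entries count correctly identified samples, so the accuracy is the diagonal sum divided by the total number of test samples.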
The results of both experiments, using 70% and 60% training data, are summarized in Table 3. With post-discretized data, the overall identification rate, measured as Average Accuracy (%), is above 90.0%. With pre-discretized data, the identification rate is lower, below 60.0%. This shows that better identification and a higher level of accuracy can be achieved with post-discretized datasets.
7 Conclusion
The MAE values in Section 6.1 suggest that the authorship invarianceness between a pair of twins was improved by the post-discretized feature vector for both intra-class (same writer in a pair of twins) and inter-class (each writer in a pair of twins) when compared with the pre-discretized feature vector. This satisfies the concept of Individuality of Handwriting even in Twins' Handwriting Identification, where the concept requires the intra-class MAE value to be smaller than the inter-class MAE value regardless of the character used in the experiment. The discretization process produced post-discretized feature vectors that properly represent and illustrate the individuality of each writer. This supports the concept of Individuality of Handwriting in WI, in which each writer has his or her own style of writing, which differs even between a pair of twins. Compared with the invariant feature vectors originally obtained from feature extraction, the standard representation of each individual's features has a smaller intra-class variance and a larger inter-class variance. As shown in Sections 6.2 and 6.3, this contributes to higher accuracy of identification for each individual's handwriting. It can therefore be concluded from the analysis of authorship invarianceness that the discretization technique should be further explored in the domain of Twins' Handwriting Identification.
References
1. Srihari SN, Huang C, Srinivasan H, Shah VA: Biometric and Forensic Aspects of Digital Document Processing. Edited by: Chaudhuri BB. Digital Document Processing Springer; 2006:379-405.
2. Tapiador M, Siguenza JA: Writer identification method based on forensic knowledge. First International Conference on Biometric Authentication (ICBA) 2004, 555-561.
3. Kun Y, Yunhong W, Tieniu T: Writer identification using dynamic features. Biometric Authentication: First International Conference, ICBA 2004, Hong Kong, China; 2004:512-518.
4. Yong Z, Tieniu T, Yunhong W: Biometric personal identification based on handwriting. Proc 15th International Conference on Pattern recognition, Barcelona, Spain 2000, 2: 797-800.
5. Somaya M, Eman M, Dori K, Fatma M: Writer identification using edge-based directional probability distribution features for arabic words. IEEE/ACS International Conference on Computer Systems and Applications, AICCSA Doha 2008, 582-590.
6. Niels R, Vuurpijl L, Schomaker L: Automatic allograph matching in forensic writer identification. Int J Pattern Recogn Artif Intell IJPRAI 2007, 2(1):61-81.
7. Pervouchine V, Leedham G, Melikhov K: Handwritten character skeletonisation for forensic document analysis. Proceedings of the 2005 ACM Symposium on Applied Computing, Santa Fe, New Mexico, USA; 2005:754-758.
8. Franke K, Koppen M: A computer-based system to support forensic studies on handwritten documents. Int J Doc Anal Recogn 2001, 3(4):218-231. 10.1007/PL00013565
9. Plamondon R, Lorette G: Automatic signature verification and writer identification state of art. Pattern Recogn 1989, 22: 107-131. 10.1016/0031-3203(89)90059-9
10. Srihari SN: Computational methods for handwritten questioned document examination. Ph.D U.S Department of Justice 2010.
11. Jain AK, Prabhakar S, Pankanti S: On the similarity of identical twin finger prints. Pattern Recogn 2002, 35(1):2653-2663. 10.1016/S0031-3203(01)00218-7
12. Rubucki RJ, McCue BJ, Duffy KJ, Shepard KL, Shepherd SJ, Wisecarver JL: Natural DNA mixtures generated in fraternal twins in Utero. J Forensic Sci 2001, 46(1):120-125.
13. Liu Y, Sargur NS: A Computational Discriminability Analysis on Twins Fingerprints. Springer, Heidelberg 2009, 43-54.
14. Rycchilk M, Stankiewicz W, Moezynski M: Method of numerical analysis of similarity and differences of face shape of twins. Proceeding of ICBME 2009, 23: 1854-1857.
15. Sargur S, Chen H, Harish S, Vivek S: On the discriminability of the handwriting of twins. J Forensic Sci 2008, 53(2):430-446. 10.1111/j.1556-4029.2008.00682.x
16. Srihari SN, Cha S-H, Arora H, Lee S: Individuality of handwriting: a validation study. Sixth IAPR International Conference on Document Analysis and Recognition, Seattle 2001, 106-109.
17. Srihari SN, Cha S-H, Arora H, Lee S: Individuality of handwriting. J Forensic Sci 2002, 47(4):1-17.
18. Srihari SN, Huang C, Srinivasan H, Shah VA: Biometric and Forensic Aspects of Digital Document Processing. Digital Document Processing Springer, London 2006, 379-405.
19. Muda AK, Shamsuddin SM, Ajith A: Improvement of Authorship invarianceness for individuality representation in writer identification. Neural Netw World 2010, 3(10):371-387.
20. Leedham G, Chachra S: Writer identification using innovative binarised features of handwritten numerals. Proceeding of Seventh International Conference of Document Analysis and Recognition 2003, 1: 413-416.
21. Zois EN, Anastassopoulos V: Morphological waveform coding for writer identification. Pattern Recogn 2000, 33(3):385-398. 10.1016/S0031-3203(99)00063-1
22. Muda AK, Shamsuddin SM, Darus M: Invariants discretization for individuality representation in handwritten authorship. International Workshop on Computational Forensic (IWCF 2008), LNCS 5158, Springer 2008, 218-228.
23. Agre G, Peev S: On supervised and unsupervised discretization. CIT: Cybern Inf Technol 2002, 2(2):43-57.
24. Liu H, Hussain F, Tan CL, Dash M: Discretization: an enabling technique. Data Min Knowl Disc 2002, 6: 393-423. 10.1023/A:1016304305535
25. Prachya P, Thanawin R, Kitsana W: DCR: discretization using class information to reduce number of intervals. QIMIE/PAKDD 2009, 17-28.
26. Hwang GJ, Li F: A Dynamic Method for Discretization of Continuous Attributes, IDEAL, LNCS 2412 Springer, Berlin. 2002, 506-511.
27. Øhrn A, Komorowski J: ROSETTA: a rough set toolkit for analysis of data. Third International Joint Conference on Information Sciences, Durham, NC 1997, 3: 403-407.
28. Pawlak Z: Rough sets. Int J Comput Inf Sci 1982, 11: 341-356. 10.1007/BF01001956
The authors declare that they have no competing interests.
Cite this article
Mohammed, B., Shamsuddin, S. Improvement in twins handwriting identification with invariants discretization. EURASIP J. Adv. Signal Process. 2012, 48 (2012). https://doi.org/10.1186/1687-6180-2012-48
- writer identification
- unique representation
- authorship invarianceness
- twins handwriting