EURASIP Journal on Applied Signal Processing 2002:7, 736–745 c ○ 2002 Hindawi Publishing Corporation Automated Quality Assurance Applied to Mammographic Imaging

Quality control in mammography is based upon subjective interpretation of the image quality of a test phantom. In order to suppress subjectivity due to the human observer, automated computer analysis of the Leeds TOR(MAM) test phantom is investigated. Texture analysis via grey-level co-occurrence matrices is used to detect structures in the test object. Scoring of the substructures in the phantom is based on grey-level differences between regions and information from grey-level co-occurrence matrices. The results from scoring groups of particles within the phantom are presented.


INTRODUCTION
Meticulous quality control of mammography was advocated [1] as a mandatory requirement of the National Health Service Breast Screening Programme prior to the establishment of the NHSBSP. The quality control is with respect to evaluating system performance and for long term monitoring. Standards of acceptability have been established based upon subjective interpretation of image quality test phantoms, for example, Leeds test objects TOR(MAM), TOR(MAX), or TOR(MAS). Leeds TOR(MAM) is widely used for this purpose, and has been reported as being among the most consistent and sensitive [2] of those available commercially.
This study is aimed at applying image analysis methods that will be employed for the analysis of phantom films (which were digitised) for quantitative objective assessment of different imaging systems, monitoring temporal performance, and evaluating the impact of introducing new imaging systems, for example, direct digital mammography.
Analysis of phantom images relies upon scoring by experienced observers according to the visibility of details. Scoring schemes have been proposed whereby the observer allocates a value according to visibility of the test details (i.e., 0 = detail not seen, 1 = barely visible/threshold, 2 = less visible/faint, 3 = detail easily seen). Experienced observers have shown good self-consistency but have yielded substantially different scores when scoring images from the same phantom and when using the same protocol [3]. These interobserver variations clearly have the potential to call into question the validity of conclusions drawn from such image quality measurements. Automated computer analysis has been applied to the evaluation of image quality test phantom images and has yielded results comparable to, but more consistent than, the human observer [4,5,6]. Computer-assisted analysis also has some role in providing decision support to those analysing analogue phantom images, in the training of inexperienced observers and in the optimisation process.
This preliminary study is focused on the detection and scoring of groups of microparticles covering sizes from 63 to 354 microns (see also Table 1). The microparticle groups are one of the three principal features of the TOR(MAM) test object (see Figure 2). An expert viewed the films in a darkened room and made a visual assessment of the films using the 0-1-2-3 scoring scheme. The computerized automated method involves three principal stages, each of which is described in detail. Figure 1 depicts the various stages of the method. Briefly, they are the following: (i) alignment of the phantom with a reference image and extraction of six regions of interest (RoI) around each microparticle group (see Section 2.1); (ii) extraction of the features from each RoI (see Section 2.6). This is based on grey-level co-occurrence matrices (GLCMs) and a distance measure to compare GLCM's is introduced and described in Section 2.3; (iii) scoring the detected features using a statistical classification method (k-nearest neighbour), which is described in Section 2.7.  The results obtained by the expert are compared with those obtained by the application of the automated detection method, as is the self-consistency of the expert. This is discussed in Section 3.

MATERIALS AND METHODS
This preliminary study is based on a test set of images taken from 16 films of a Leeds TOR(MAM) test object (for more detail see [7]). These films have been scanned with a Lumisys Lumiscan 85, at a colour depth of 12 bits per pixel and a pixel size of 50 µm.
A diagram of the phantom is shown in Figure 2. Three different types of structures are present in the phantom, these are groups of microparticles, radiating patterns of linear structures and groups of three disks of diminishing contrast. The groups of microparticles simulate mammographic microcalcification clusters. For each of the six groups, there is an upper and lower bound to the size of the microparticles and it should be noted that there is some overlap in the size range for subsequent groups (see Table 1); this is a feature of the phantom indicating a continuous transition between the groups. The linear structures simulate fibrous filament details. Again, there are six of these groups (of filaments) and the variation is in the diameter of the filaments. The groups containing three disks simulate lesion-like structures. The variation of contrast is used as an indication for their detectability. In this study, we concentrate on the automatic detection and classification of the six groups of particles (labeled A-F in Figure 2). From the phantom image, six RoIs centred around each group of particles are extracted.
Ninety-six regions are obtained and constitute our dataset. For the set of 16 phantom images, Table 2 shows the distribution of scores (i.e., 0 = detail not seen, 1 = barely visible/threshold, 2 = less visible/faint, 3 = detail easily seen) for   the various group of particles. The scoring of each group of particles refers to a scoring process of one expert (referred to as observation 1), which will be considered as ground truth in this study. A previous score was completed by the same expert one year earlier and will be referred to as observation 2.
This provided information about the intravariability of the observer (see Section 3). This indicates that, for our dataset, the details for the groups A and B are clearly visible, and that for the groups F no details can be seen.

Automated extraction of the region of interest
Automated evaluation of the phantom requires that the computer correctly identifies the individual details. The physical alignment of the test object within the imaged area and the location of the digitised data within digitised data files varies from image to image. We restrict the transformation to be linear because the phantom used is a rigid plastic structure, which does not allow any local deformation (i.e., no bending nor compression). To extract the RoI automatically for the whole dataset, one phantom image was manually annotated with the positions of the RoI. Rigid registration [8] was used to globally align all phantom images in the dataset. Subsequently, all the RoI are extracted based on their position in the reference image. To remove local mis-alignment, the individual RoI are registered with the relevant RoI from the reference image.

Fundamentals
In our method, we use GLCMs to extract textural information from the image and segment particles. In order to reduce the size of the GLCMs, a linear transformation of the RoI's histogram is applied. The resulting images have a colour depth of sixty-four grey levels.
In an image, let N g denote the set of grey levels, |N g | the number of grey-levels, and t a translation. The grey-level cooccurrence matrix [9] of the image based on the translation t is an |N g | × |N g | matrix where an entry m(i, j) denotes the number of pixel pairs, separated by a translation t, which have grey-level values i and j. The GLCM is normalised by dividing each entry by the number of pairs of pixels separated by the translation t.
In principle we want to compare GLCMs and to be able to do so, we need to introduce a distance in the GLCM-space. Our approach is based on two assumptions: (1) there is a statistical difference between the GLCMs for image regions which contain only background texture and regions which also include image structures (e.g., particles); (2) regions with image structures are mostly surrounded by regions without any structures. This condition is essential for the approximation of the background. This assumption is correct for the phantom images.
An example of our approach can be found in Figure 3, where Figures 3a and 3b show the same image, except that in Figure 3b a cluster of particles has been added. For both images, a GLCM can be obtained and in this ideal case, the difference between the two GLCMs (see Figure 3c) is caused by the blob-like structures in Figure 3b. This difference GLCM can be seen as being caused by perturbations of the normal background texture and hence, can be used to reconstruct a perturbation image of the image structures, the results of which are shown in Figure 3d. As can be seen, most of the blob-like structures have been enhanced.

Distance metric in the GLCM-space
An m × n matrix can be considered as a vector of dimension mn. As GLCMs have entries values restricted to [0, 1] an n×n GLCM can be seen as a vector of [0, 1] n 2 .
From now on, and ind is bijective and defined as ind : To compare the similarity of two GLCMs, we introduce the distance δ defined as where It should be noted that δ is not a distance on R N (see Appendix A for a formal proof that δ is a distance for the normalised GLCM space).
This distance has been used instead of the usual Euclidean distance because it takes into account the relative value of an entry. If we consider a small perturbation ε and three vectors X, Y , and Z where δ(X, Z). Using the distance δ we emphasise that a small variation on a high value in a GLCM is negligible compared to the same variation on a smaller GLCM value.

Background texture approximation
To approximate the textural properties for a RoI we need to compare the GLCM of the RoI with the GLCM of the local neighbourhood. Nine subregions of the image are used (see Figure 4), 8 regions constitute the local neighbourhood (white squares) and are centred around the region of interest (grey square). There is no overlap between the subregions. The approximation X of the GLCM for the neighbourhood is given by where X k = [x k i ] i∈Λ is the GLCM from the kth square region of the neighbourhood.

Pixel classification
A GLCM conforms to the definition of compositional data [10] when considering the following properties of a GLCM X = [x i ] i∈Λ : This means that a variation of one entry in the cooccurrence matrix influences others entries to conserve the second property. When we increase (resp., decrease) an entry by ε, we must decrease (resp., increase), the other entries to keep the sum equal to one. All this the individual entries must obey the first property.
We want to investigate the perturbation caused by a positive variation ε on one component of a GLCM. To do so, consider two GLCMs, X = [x i ] i∈Λ and X k ε = [x k i ] i∈Λ , with X k ε given by where Θ = {j ∈ Λ | x j > 0 }, |Θ| is the number of elements in the set Θ and 0 < ε < |Θ| min(x j ) j∈Θ , which ensures that negative values do not occur in the resulting vector X k ε . This definition of X k ε preserves the compositional data properties. We notice that a positive variation in one entry automatically induces a negative variation in some of the others entries. The distance δ between these two GLCMs is For our particular problem we substitute the GLCM X, the approximation of the neighbourhood, for X in (7) and we can measure the perturbation due to a variation ε on one component x i . A negative variation is caused by a structure replacing the background whilst a positive variation indicates an addition of a structure. For one single structure there is a high correlation between the negative and the positive variation (but this correlation is reduced when multiple structures are added). As we are interested in the added structures, only positive variations are considered. We define a perturbation function pert() as where |ε| 1, Θ = {k ∈ Λ|x k > 0} and Θ i = Θ − {i}. This perturbation function is computationally expensive (as we calculate the perturbation for each entry of a GLCM). Using problem specific facts (see Appendix B), we can approximate the function pert() as where K = j∈Θ 1/|Θ| |2 x j | and X = [x i ] i∈Λ is the GLCM from the RoI. From this perturbation function we are able to construct a probability image representing the probability that a pixel is a part of a particle. If we consider two pixels p 1 and p 2 with intensity I p1 and I p2 , respectively, and separated by a translation t, pixel p 1 adds one occurrence to the element x t ind(Ip 1 ,Ip 2 ) of the GLCM using translation t denoted by X t . The pixel p 1 can be seen as a part of the component x t ind(Ip 1 ,Ip 2 ) of X t . From that we can define the probability P t (p 1 ) of pixel p 1 to be part of a structure by To improve the detection of particles and to remove noise, we use a set of translations Γ. Due to the small size of the particles, a scale of 1 pixel has been used with four orientations (0 • , 45 • , 90 • , and 135 • ). The four translations cover the 8-connected neighbour pixels and the usage of additional orientations did not add any information. The inclusion of additional scales did not increase the sensitivity. The probability for a pixel p i is given by When we have the probability for a pixel to be part of a particle, the segmented image is obtained by applying a threshold to the probability image. All pixels having a probability higher than the threshold are classified as particles.

Features
The features used for the automated scoring in this study are based on GLCM information and particle detection and described by (1) the distance between the GLCM from the neighbourhood and the GLCM from the RoI is determined by δ + (X, X). The definition of the distance is derived from (3), where only positive differences in the RoI are taken into account (see (12)). This feature gives a measure of the difference in the texture of the two regions. δ + (X, X) is defined as (2) the mean grey-level difference (MGLDp) between the detected particles and the rest of the RoI. An increase in the difference means an increased visibility of the particles and hence a higher score where P is the set of pixels segmented as particles, B is the set of pixels classified as background, I(p) is the intensity of pixel p, and |P| is the number of elements in P; MGLDp is weighted by log(|P|) to discriminate between a small number of pixels with high contrast and a larger number of pixels with the same high contrast. A larger number of high contrast pixels means that the group is more significant.

Classification
A k-nearest-neighbour classifier [11], based on the two features described previously, is used to classify groups of particles into four classes. The choice of four classes is made to emulate the human observer process where a 0-1-2-3 scoring system is used. The features have been normalised to unit variance and zero mean. Furthermore, an equal a priori likelihood for each score is imposed by applying a correction to the number of neighbours in each class to take into account the different sample sizes of the four classes.
A leave-one-out approach is used for all scoring results. This means that each sample is scored by a classifier trained on all the other samples. Results are evaluated using confusion matrices, which allow a more detailed inspection of the scoring errors.

Comparing features and expert classification
Each RoI was characterised by the feature δ + (see (12)). Figure 5a illustrates the relationship between the scores assessed by an expert and δ + . Figure 5a shows that there is an overlap in the range of δ + between adjacent scores. However, we note that a good separation exists between a score of 0 or 1 and a score of 3. In addition, there is a general trend of an increased δ + measure when the expert's score increases.
Each RoI was also characterised through measurement of the MGLDp feature (see (13)). Figure 5b compares the results of the MGLDp measure with the expert scores in observation 1. The same trend as for the feature δ + exists.
The combination of the two features enhances the separation of the four scores as illustrated in the scatter plot shown in Figure 6.

Classification results
Classification results obtained using two features (δ + (X, X), MGLDp) and a 4-nearest-neighbour classifier. The first experiment uses observation 1 as ground truth. The confusion matrix in Table 3a shows the details of the scoring. The second experiment uses observation 2 as ground truth and the details of the results are presented in Table 3b. Table 4 is the confusion matrix for the observer when the film scoring was repeated (i.e., observation 1 versus observation 2) to assess intra-observer consistency.
An intrascore agreement between the expert and the automatic approach of 83% has been achieved. As the scores are not well defined and are subjective, exact agreement is hard to obtain. Table 4 shows that the same observer cannot obtain an exact agreement (a consistency of 76% has been achieved).
Using the conventional scoring system, the phantom score for the groups of particles is obtained by adding the score of the six groups A-B-C-D-E-F. The intra-observer standard deviation for these scores is 1.36 compared to 0.88 for the computer (between experiment (a) and (b) described in Table 3). The mean variation is 0.56 for the observer against 0.12 for the computer. In addition, the intrascore agreement in both experiments is similar (82% and 83%). These observations show a good consistency of the results. However, for the human observer no deviation larger than one in the score has been recorded, but in 4% of cases the computer delivered a score difference greater than one (i.e., 2).

DISCUSSION
Seventy percent of the computer's score deviation came from the last three groups of particles (D, E, and F). This is mainly due to the digitisation of films. Particles, especially small particles, are blurred when data are binned into individual pixels during digitisation. Figure 7 illustrates the problem for particles with a size of around 100 µm (e.g., 2 pixels at 50 µm/pixel). The expert observer has noticed a loss of readability of the digitised image compared with the initial film. Whilst visible in the film image, particles from groups E and F become obscured as a result of the combined effects of the pixel size limitation and the noise added into the digitised image by the digitiser.  Figure 6: Scatter plot of δ + (X, X) versus MGLDp where * is score 0, ♦ is score 1, is score 2, and is score 3.  Figure 8 compares a digitised film image with an image of the same microparticle group acquired directly, using a Small Field Digital (SFD) mammography system (Siemens Opdima, Siemens-Elema, Sweden). Whilst in this case the pixel sizes are comparable (50 µm for film and 44 µm for the SFD), the effect of the higher noise content of the digitised film all but obscures the microparticles. The addition of degrading factors into the computer image (pixel size and digitisation noise) clearly imposes a limitation when using this methodology as a tool in monitoring the quality of the mammography imaging system. This would be less important if the magnitude of these effects is well below the limits  for the key parameters of spatial resolution and noise of the primary imaging transducer, in this case the film. This could be achieved with higher quality film digitisation. Small Field Digital mammography systems are already entering into routine clinical use for the imaging of symptomatic women or those initially identified with high suspicion from the screening film. Full Field Digital mammography systems, having a pixel size of typically 100 µm, are being clinically evaluated against traditional film-screen systems and are likely to become increasingly used for breast imaging. Systems using computed radiography (CR) are also being evaluated for their effectiveness compared with conventional film-screen. The benefits of computerised detection with all these modalities is self-apparent as data is acquired directly and can be analysed without the need to undergo a secondary conversion process. Differences between the detection results and that of the observer can be used to enhance the performance of the observer and identify weaknesses in the image display to the observer. The optimisation process must hence take into account both ways in which the image is acquired, but also displayed. Work is in progress using image data from the Siemens Opdima SFD mammography system.
For a fully automated evaluation of the test object the next step is to analyse and score the low contrast linear (fibre) structures, and the low contrast circular (lesion) structures. Comparison with the ground truth based upon the known content of the test object must also replace the subjective opinion of the expert and be exploited in a "scoring methodology" specific to computerised detection. This has the potential for greater objectivity and sensitivity concerning overall image quality.

CONCLUSION
We have investigated the automated scoring of the microparticle features of the Leeds TOR(MAM) test object used in quality control for mammographic imaging. The results are encouraging as they show good agreement with the expert observer (better than 82%), and a higher consistency. Variability due to subjectivity in the assessment has been removed. This is important if true variations in mammography system performance are to be identified and monitored. There is also the potential to use this methodology in optimising system performance. The consistent evaluation of the image is essential for the many images that would need to be acquired under different acquisition conditions as part of the optimisation process. It would be difficult for the observer to maintain consistency and concentration during the timeconsuming evaluation of many images. In order for this to be viable, higher quality film digitisation or direct digital acquisition will be required so as to avoid degrading the image information.

A. THE DISTANCE MEASURE δ
We prove that δ is a distance on the space [0, 1] N . To be a distance, δ must have the following properties: Clearly (P1), (P2), and (P3) are correct. The property (P4) is proved below. Proof.
We study the sign of (1) if x i = z i or x i = y i or y i = z i , the result is trivial; (2) now consider the case where x i , y i , and z i are distinct, then is the same as A i defined by Now, we study the sign of A i . To remove the absolute values, we need to split the problem in six cases: (i) 0 ≤ x i < y i < z i ≤ 1, (vi) 0 ≤ z i < y i < x i ≤ 1, (A. 10) In all cases A i ≥ 0 so ∆ i (X, Y ) + ∆ i (Y, Z) − ∆ i (X, Z) ≥ 0 and finally the property (P4) is verified. Thus δ is a distance on the space [0, 1] N .

B. APPROXIMATION OF THE pert() FUNCTION
The definition of the function pert() given by (8) (see below and also Section 2.5) is computationally too expensive to be used. An approximation of this function is given (7), where |ε| 1, Θ = {k ∈ Λ|x k > 0}, Θ i = Θ\{i} and |Θ i | is the cardinality of the set Θ i .
In the investigated problem, ε i = x i − x i , where X = [x i ] i∈Λ is the GLCM of the RoI and X = [x i ] i∈Λ is the GLCM which approximates the background texture. In addition, |Θ| − 1 ≤ |Θ i | ≤ |Θ| and |Θ| 1 so we obtain a first approximation given by Moreover, for all k ∈ Θ, for all i ∈ Λ, x k ε i /|Θ| so |2 x j − ε i /|Θ|| ≈ |2x j | and Finally, for practical reasons the following approximation is done