Decision Aggregation in Distributed Classification by a Transductive Extension of Maximum Entropy/Improved Iterative Scaling

In many ensemble classification paradigms, the function which combines local/base classifier decisions is learned in a supervised fashion. Such methods require common labeled training examples across the classifier ensemble. However, in some scenarios where an ensemble solution is necessitated, common labeled data may not exist: (i) legacy/proprietary classifiers, and (ii) spatially distributed and/or multiple modality sensors. In such cases, it is standard to apply fixed (untrained) decision aggregation such as voting, averaging, or naive Bayes rules. In recent work, an alternative transductive learning strategy was proposed. There, decisions on test samples were chosen aiming to satisfy constraints measured by each local classifier. This approach was shown to reliably correct for class prior mismatch and to robustly account for classifier dependencies. Significant gains in accuracy over fixed aggregation rules were demonstrated. There are two main limitations of that work. First, feasibility of the constraints was not guaranteed. Second, heuristic learning was applied. Here, we overcome these problems via a transductive extension of maximum entropy/improved iterative scaling for aggregation in distributed classification. This method is shown to achieve improved decision accuracy over the earlier transductive approach and fixed rules on a number of UC Irvine datasets.


INTRODUCTION
There has been a great deal of research on techniques for building ensemble classification systems (e.g., [1-10]). Ensemble systems form ultimate decisions by aggregating (hard or soft) decisions made by individual classifiers. These systems are usually motivated by biases associated with various choices in classifier design [11]: the features, statistical feature models, the classifier's (parametric) structure, the training set, the training objective function, parameter initialization, and the learning algorithm for minimizing this objective. Poor choices for any subset of these design elements can degrade classification accuracy. Ensemble techniques introduce diversity in these choices and thus mitigate biases in the design. Ensemble systems have been theoretically justified from several standpoints, including analysis under the assumption of statistical independence [12], variance and bias reduction [9, 10], and margin maximization [8]. In most prior research, an ensemble solution has been chosen at the designer's discretion so as to improve performance.
In paradigms such as boosting [5], all the classifiers are generated using the same training set. This training set could have simply been used to build a single (high complexity) classifier. However, boosted ensembles have been shown in some prior works to yield better generalization accuracy than single (standalone) classifiers [13].
In this work, we alternatively consider scenarios where, rather than being discretionary, a multiple classifier architecture is necessitated by the "distributed" nature of the feature measurements (and associated training data) for building the recognition system [1, 14, 15]. Such applications include: (1) classification over sensor networks, where multiple sensors separately obtain measurements from the same object or phenomenon to be classified; (2) legacy or proprietary systems, where multiple proprietary systems are leveraged to build an ensemble classifier; and (3) classification based on multiple sensing modalities, for example, vowel recognition using acoustic signals and video of the mouth [16] with separate classifiers for each modality, or disease classification based on separate microarray and clinical classifiers. In each of these scenarios, it is necessary to build an ensemble solution. However, unlike the standard ensemble setting, in the scenarios above, each classifier may only have its own separate training resources, that is, there may be no common labeled training examples across all (or even any subset of) the classifiers. Each classifier/sensor may in fact not have any training resources at all: each sensor could simply use an a priori known class-conditional density model for its feature measurements, with a "plug-in" Bayes classification rule applied. We will refer to this case, of central interest in this paper, as the distributed classification problem.
This problem has been addressed before, both in its general form (e.g., [1]) and for classification over sensor networks (e.g., [14]). Both [1, 14] developed fixed combining rule techniques. In [1], Bayes rule decision aggregation was derived accounting for redundancies in the features used by the different classifiers. This approach requires communication between local classifiers to identify the features they hold in common. In [14], fixed combining was derived under the assumption that feature vectors of the local classifiers are jointly Gaussian, with known correlation structure over the joint feature space (i.e., across the local classifiers). Neither these methods nor other past methods for distributed classification have considered learning the aggregation function. The novel contribution of [15] was the application and development of suitable transductive learning techniques [17-19], with learning based on the unlabeled test data, for optimized decision aggregation in distributed classification. In this work, we extend and improve upon the transductive learning framework from [15].
Common labeled training examples across local classifiers are needed if one is to jointly train the local classifiers in a supervised fashion, as done, for example, in boosting [5] and mixture of experts [20]. Common labeled training data is also needed if one is to learn, in a supervised fashion, the function which aggregates classifier decisions [7, 21-23]. These approaches treat local classifier hard/soft decisions as the input features to a second-stage classifier (the ensemble's aggregation function). Learning this second stage in a supervised fashion can only be achieved if there is a pool of common labeled training examples where, for each labeled instance, there is a realization of each local classifier's input feature vector (based upon which each local classifier can produce a hard/soft decision).
Consider legacy/proprietary systems. Multiple organizations may build separate recognition systems using "in-house" data and proprietary designs. The government or some other entity would like to leverage all the resulting systems (i.e., fuse decisions) to achieve best accuracy. Thus an ensemble solution is needed, but unless organizations are willing to share data, there will be no common labeled data for learning how to best aggregate decisions. Alternatively, if organization A shares its design method (features used, classifier structure, and learning method) with organization B, then B can build a version of A's classifier using B's data and then further use this data as a common labeled resource for supervised learning of an aggregation function.
As a second example, consider diagnosis for a much-studied disease. Different institutions may publish studies, each evaluating their own test biomarkers for predicting disease presence. Each study will have its own (labeled) patient pool, from which a classifier could be built (working on the study's biomarker features). If each study measured different features, for different patient populations, it is not possible to pool the datasets to create a common pool of labeled examples. Now, suppose there is a clinic with a population of new patients to classify. The clinic would like to leverage the biomarkers (and associated classifiers) from each of the studies in making decisions for its patients. This again amounts to distributed classification without common labeled training examples.
In all of these cases, without common labeled training data, the conventional wisdom is that one must apply a fixed (untrained) mathematical rule in fusing individual classifier decisions, such as voting [12], voting with abstention mechanisms [24], fixed arithmetic averaging [25], geometric averaging, Bayes rule [26], a Bayesian sum rule [27], or other fixed rules [3]. Fixed (untrained) decision aggregation also includes methods that weight the local classifier decisions [28] or even select a single classifier to rely on [29] in an input-dependent fashion, based on each classifier's local error rate estimate or local confidence. Such approaches do give input-dependent weights on classifier combination. However, the weights are heuristically chosen, separately by each local classifier. They are not jointly trained/learned to minimize a common mathematical objective function. In this sense, we still consider [28, 29] as fixed (untrained) forms of decision aggregation. Alternatively, in [15], it was shown that one can still beneficially learn a decision aggregation function, that is, one can jointly optimize test sample-dependent weights of classifier combination to minimize a well-chosen cost function and significantly outperform fixed aggregation rules. A type of transductive learning strategy [17-19] was proposed in [15], wherein optimization of a well-chosen objective function measured over test samples directly yields the decisions on these samples. This work built on [18], which applied transductive learning to adapt class priors while making decisions in the case of a single classifier. While there is substantial separate literature on transductive/semisupervised learning and on ensemble/distributed classification, the novel contribution in [15] was the bridging of these areas via the application of transductive learning to decision aggregation in distributed classification.
There are two fundamental deficiencies of fixed combining which motivated the approach in [15]. First, local classifiers might assume incorrect class prior probabilities [15], relative to the priors reflected in the test data [18]. There are a number of reasons for this prior mismatch. For example, it may be difficult or expensive to obtain training examples from certain classes (e.g., rare classes); also, classes that are highly confusable are not easily labeled and, thus, may not be adequately represented in a local training set. Prior mismatch can greatly affect fused decision accuracy. Second, there may be statistical dependencies between the decisions produced by individual classifiers. Fixed voting and averaging both give biased decisions in this case [30] and may yield very poor accuracy. This was demonstrated in [15] considering the case where some classifiers are perfectly redundant, that is, identical copies of each other. Suppose that in the ensemble there are a large number of identical copies of an inaccurate classifier and only a single highly accurate classifier. Clearly, the weak classifiers will dominate a single accurate classifier in a voting or averaging scheme, yielding biased, inaccurate ensemble decisions. Standard distributed detection techniques, which make the naive Bayes assumption that measurements at different sensors are independent given the class [31], will also fare poorly when there is sensor dependency/redundancy. More localized schemes (e.g., [29]) can mitigate "dominance of the majority" in an ensemble, giving the most relevant classifiers (even if a small minority) primary influence on the ensemble decision making in a local region of the feature space. However, these methods are still vulnerable to the first-mentioned problem of class prior mismatch. In [2, 4], ensemble construction methods were also proposed that reduce correlation within the ensemble while still achieving good accuracy for the individual local classifiers. However, these methods require availability of a common labeled training set and/or common features for building the local classifiers.
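The "dominance of the majority" effect described above can be illustrated with a small sketch. The labels and classifier outputs below are made-up toy values (not data from [15]), and `majority_vote` is a hypothetical helper implementing plain plurality voting.

```python
from collections import Counter

def majority_vote(decisions):
    """Return the most common label among the local hard decisions."""
    return Counter(decisions).most_common(1)[0][0]

# Toy (assumed) data: one accurate classifier and one weak classifier.
true_labels   = [0, 1, 1, 0, 1, 0]
weak_output   = [0, 0, 1, 1, 0, 0]   # wrong on three of the six samples
strong_output = [0, 1, 1, 0, 1, 0]   # correct on every sample

# Ensemble: nine identical copies of the weak classifier plus the accurate one.
n_copies = 9
fused = [majority_vote([w] * n_copies + [s])
         for w, s in zip(weak_output, strong_output)]

# The fused decisions simply reproduce the weak classifier's decisions.
accuracy = sum(f == t for f, t in zip(fused, true_labels)) / len(true_labels)
```

With these toy values the vote is decided entirely by the redundant copies, so the accurate classifier has no influence on the fused decision.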
Alternatively, [15] proposed a transductive, constraint-based (CB) method that optimizes decision aggregation without common labeled training data. CB resolves both aforementioned difficulties with fixed combining: in making fused decisions, it effectively corrects for inaccurate local class priors; moreover, it accounts for dependencies between classifiers and does so without any communication between local classifiers. In CB, each local classifier contributes statistical constraints that the aggregation function must satisfy through the decisions it makes on test samples. The constraints amount to local classifier "confusion matrix" information: the probability that a local classifier chooses class k given that the true class is c. The aggregation function is learned so that the confusion statistic between the aggregation function's predicted class c and a local classifier's predicted class k matches the confusion statistic between the true class c and the local classifier's predicted class k. Constraint-based learning is quite robust in the presence of classifier dependency/redundancy: if local classifiers A and B are perfectly redundant (i.e., if B yields an identical classification rule as A), then so are their constraints. Thus, if the aggregation function is learned to satisfy A's constraints, B's are automatically met as well. B's constraints will not alter the aggregation function solution, and the method is thus invariant to (perfectly) redundant classifiers in the ensemble. More generally, CB handles statistical dependencies between classifiers well, giving greater decision accuracy than fixed rule (and several alternative) methods [15]. Some of the key properties of CB are as follows [15]: (1) it is effective whether classifiers produce soft or hard decisions: the method (implicitly) compensates local classifier posteriors for inaccurate priors even when the local classifiers only produce hard decisions (to explicitly correct a local classifier for incorrect class priors, one must have access to the local class posteriors, not just to the hard decision output by the local classifier; e.g., the method in [18] performs explicit prior correction and thus requires access to soft classifier decisions); (2) CB works when local classifiers are weak (simple sensors) or strong (sophisticated classifiers, such as support vector machines); (3) CB gives superior results to fixed combining methods in the presence of classifier dependencies; (4) CB robustly and accurately handles the case where some classes are missing in the test data, whereas fixed combining methods perform poorly in this case; (5) CB is easily extended to encode auxiliary sensor/feature information, nonredundant with local classifier decisions, to improve the accuracy of the aggregation [32]. The original method required making decisions jointly on a batch of test samples. In some applications, sample-by-sample decisions are needed. This is the case, in particular, when decisions are time-critical (e.g., target detection) and in applications where decisions require a simple explanation (e.g., credit card approval). Recently, a CB extension was developed that makes (sequential) decisions, sample by sample [33].
There are, however, limitations of the heuristic learning applied in [15]. First, in [15], there was no assurance of feasibility of the constraints because the local classifier training set support (on which constraints are measured) and the test set support (on which constraints are met by the aggregation function) are different. In the experiments in [15], constraints were found to be closely approximated. However, infeasibility of constraints could still be a problem in practice. Second, constraint satisfaction in [15] was practically effected by minimizing a particular nonnegative cost function (a sum of cross entropies). When, and only when, the cost is zeroed, the constraints are met. However, the cost function in [15] is nonconvex in the variables being optimized, so the optimization may find positive (nonzero) local minima, for which the constraints are necessarily not met. Moreover, even in the feasible case, there is no unique feasible (zero cost) solution; feasible solutions found by [15] are not guaranteed to possess any special properties or good test set accuracy. In this paper, we address these problems by proposing a transductive extension of maximum entropy/improved iterative scaling (ME/IIS) [34-36] for aggregation in distributed classification. This approach ensures both feasibility of constraints and uniqueness of the solution. Moreover, the maximum entropy (ME) solution has been justified from a number of theoretical standpoints: in a well-defined statistical sense [37], ME is the "least-biased" solution, given measured constraints. We have found that this approach achieves greater accuracy than both the previous CB method [15] and fixed aggregation rules.
The rest of the paper is organized as follows. In Section 2, we give a concise description of the distributed classification problem. In Section 3, we review the previous work in [15]. In Section 4, we develop our transductive extension of ME/IIS for decision fusion in distributed classification. In Section 5, we present experimental results. The paper concludes with a discussion and pointers to future work.

DISTRIBUTED CLASSIFICATION PROBLEM
A system diagram for the distributed classification problem is shown in Figure 1. Each classifier produces either hard decisions or a posteriori class probabilities $P_j[\hat{C}_j = c \mid x^{(j)}] \in [0, 1]$, $c = 1, \ldots, N_c$, $j = 1, \ldots, M_e$, where $N_c$ is the number of classes, $M_e$ the number of classifiers, and $x^{(j)} \in \mathbb{R}^{k(j)}$ the feature vector for the $j$th classifier. Each local classifier is designed based on its own (separate) training set $\tilde{\mathcal{X}}_j = \{(\tilde{x}_i^{(j)}, \tilde{c}_i^{(j)})\}$, where $\tilde{c}_i^{(j)}$ is the class label. We also denote the training set excluding the class labels by $\tilde{X}_j = \{\tilde{x}_i^{(j)}\}$. The local class priors, as reflected in each local training set, may differ from each other. More importantly, they will in general differ from the true (test set) priors. While there is no common labeled training data, during the operational (use) phase of the system, common data is observed across the ensemble, that is, for each new object to classify, a feature vector is measured by each classifier. If this were not the case, decision fusion across the ensemble, in any form, would not be possible. We do not consider the problem of missing features in this work, wherein some local feature vectors and associated classifier decisions are unavailable for certain test instances. However, we believe our framework can be readily extended to address the missing features case. Thus, during use/testing, the input to the ensemble system is effectively the concatenated vector $x = (x^{(1)}, x^{(2)}, \ldots, x^{(M_e)})$, but with classifier $j$ only observing $x^{(j)}$.
A key aspect is that we learn on a batch of test samples, $X_{\mathrm{test}} = \{x_1, x_2, \ldots, x_{N_{\mathrm{test}}}\}$. Since we are learning solely from unlabeled data, we need at least a reasonably sizeable batch of such data if we are to learn more accurate decisions than a fixed combining strategy. The transductive learning in [15] required joint decision making on all samples in the batch. In some applications, sequential decision making is instead required. To accommodate this, [33] developed a sequential extension wherein, at time $t$, a batch of size $N$ is defined by a causal sliding window containing the samples $\{x_{t-N+1}, x_{t-N+2}, \ldots, x_{t-1}, x_t\}$. While the transductive learning produces decisions on all samples in the current batch, only the decision on $x_t$ is actually used, since decisions on the past samples have already been made [33].
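The causal sliding-window batching described above can be sketched as follows. This is only an illustration of the windowing logic: `decide_batch` is a hypothetical stand-in for whatever transductive learner is run on each batch, `toy_decide_batch` is a made-up rule used purely so the sketch runs, and the handling of the first N-1 samples (skipped here) is an assumption.

```python
from collections import deque
from typing import Callable, Iterable, List, Sequence

def sequential_decisions(stream: Iterable[float],
                         window_size: int,
                         decide_batch: Callable[[Sequence[float]], List[int]]) -> List[int]:
    """Run a batch decision rule on a causal sliding window, keeping only
    the decision made on the newest sample x_t of each window."""
    window = deque(maxlen=window_size)   # holds x_{t-N+1}, ..., x_t
    decisions = []
    for x in stream:
        window.append(x)
        if len(window) < window_size:
            continue                     # assumption: wait until a full batch exists
        batch_decisions = decide_batch(list(window))
        decisions.append(batch_decisions[-1])  # decisions on past samples were already made
    return decisions

def toy_decide_batch(batch: Sequence[float]) -> List[int]:
    """Made-up batch rule: label 1 if a sample exceeds the batch mean."""
    mean = sum(batch) / len(batch)
    return [int(x > mean) for x in batch]

decisions = sequential_decisions([1.0, 2.0, 3.0, 0.0, 5.0], 3, toy_decide_batch)
```

Using a `deque` with `maxlen` keeps the window update at constant cost per sample; the batch rule itself is re-run on every window, which is where the computation of a real transductive learner would be spent.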
Before performing transductive learning, the aggregation function collects batches of soft (or hard) decisions conveyed by each classifier, for example, in the batch decision making case, $\{P_j[\hat{C}_j = c \mid x_i^{(j)}]\ \forall c,\ j = 1, \ldots, M_e,\ i = 1, \ldots, N_{\mathrm{test}}\}$. We ignore communication bandwidth considerations, assuming each classifier directly conveys posteriors (if these, rather than hard decisions, are produced), without quantization.

Transductive maximum likelihood methods
In [15], methods were first proposed that explicitly correct for mismatched class priors in several well-known ensemble combining rules. These methods extended [18], which addressed prior correction for a single classifier. They are transductive maximum likelihood estimation (MLE) algorithms that learn on $X_{\mathrm{test}}$ and treat the class priors as the sole model parameters to be estimated. There are three tasks that need to be performed in explicitly correcting for mismatched class priors: (1) estimating new (test batch) class priors; (2) correcting the local posteriors $P_j[\hat{C}_j = c \mid x_i^{(j)}]\ \forall c$, $j = 1, \ldots, M_e$, $i = 1, \ldots, N_{\mathrm{test}}$, to reflect the new class priors; and (3) aggregating the corrected posteriors to yield ensemble posteriors $P_e[C = c \mid x_i]$. In [15], expectation maximization (EM) algorithms [38] were developed that naturally accomplish these tasks for several well-known aggregation rules when particular statistical assumptions are made. The M-step re-estimates class priors. Interestingly, the E-step directly accomplishes local classifier aggregation, yielding the ensemble posteriors and, internal to this step, correcting local posteriors. As shown in [15], these algorithms are globally convergent to the unique MLE solution. At convergence, the ensemble posteriors produced in the E-step are used for maximum a posteriori (MAP) decision making.
For the naive Bayes (NB) case, where local classifiers' feature vectors are assumed to be independent conditioned on the class, an EM algorithm was derived in [15], consisting of the E-step (NB) in (1) and the M-step in (2), where $p$ abbreviates a term aggregated over the local classifiers $j = 1, \ldots, M_e$. The form of the ensemble posterior in (1) is the standard naive Bayes form, albeit with built-in prior correction.
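As a rough illustration of how an E-step and M-step of this kind interact, the sketch below assumes the textbook form of prior-corrected naive Bayes: each local posterior is divided by the prior its classifier assumed, the ratios are multiplied across classifiers, and the product is reweighted by the current test-prior estimate. This form, the function name `em_naive_bayes`, the data layout, and the toy numbers are all assumptions for illustration and may differ in detail from (1) and (2) in [15].

```python
def em_naive_bayes(local_post, local_priors, n_iter=50):
    """
    local_post[i][j][c]: posterior of classifier j for class c on test sample i.
    local_priors[j][c]:  class prior assumed by classifier j (must be positive).
    Returns (estimated test priors, ensemble posteriors per sample).
    """
    n_samples = len(local_post)
    n_cls = len(local_priors[0])
    priors = [1.0 / n_cls] * n_cls           # start from uniform test priors
    ens = []
    for _ in range(n_iter):
        # E-step (assumed form): prior-corrected naive Bayes ensemble posterior.
        ens = []
        for sample in local_post:
            scores = []
            for c in range(n_cls):
                s = priors[c]
                for j, post in enumerate(sample):
                    s *= post[c] / local_priors[j][c]
                scores.append(s)
            z = sum(scores)
            ens.append([s / z for s in scores])
        # M-step: new test prior = average ensemble posterior over the batch.
        priors = [sum(e[c] for e in ens) / n_samples for c in range(n_cls)]
    return priors, ens

# Toy batch: one classifier trained with equal priors, test batch dominated by class 0.
local_priors = [[0.5, 0.5]]
local_post = [[[0.9, 0.1]], [[0.9, 0.1]], [[0.9, 0.1]], [[0.2, 0.8]]]
test_priors, ens_post = em_naive_bayes(local_post, local_priors)
```

In this toy batch the estimated prior for class 0 moves well above the 0.5 assumed locally, which is the kind of prior adaptation the text describes. Note that the sketch needs soft local posteriors as input, consistent with the limitation of the ML methods discussed in the text.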
In [15], it was also shown that aggregation based on arithmetic averaging (AA), again with built-in prior correction, is achieved via transductive MLE under different statistical assumptions. For this model, the M-step is the same as in (2), but the E-step now takes the arithmetic averaging form (3), where $q$ abbreviates a term aggregated over the local classifiers $j = 1, \ldots, M_e$. These two algorithms do adapt the decision rule to new class priors (as reflected in a test data batch). However, unlike CB, they cannot be applied if classifiers solely produce hard decisions. Correction of local posteriors for mismatched priors can only be achieved if there is access to local posteriors; if each classifier is a "black box" solely producing hard decisions, the transductive MLE methods cannot be used for prior correction. More importantly, the ML methods are limited by their statistical assumptions, for example, conditional independence. When there are statistical dependencies between local classifiers, failing to account for them will lead to suboptimal aggregation. In [15], the following extreme example was given: suppose there are $M_e - 1$ identical copies of a weak (inaccurate) classifier, with the $M_e$th classifier an accurate one. Clearly, if $M_e$ is large, the weak classifiers will dominate (1) and (3) and yield poor accuracy [15]. Thus, CB was proposed to properly account for classifier redundancies, both for this extreme example and more generally.

Transductive constraint-based learning
CB differs in important aspects from the ML methods. First, CB effectively corrects mismatched class priors even if each local classifier only produces hard decisions. Second, unlike the transductive ML methods, CB is sparing in its underlying statistical assumptions: the sole premise is that certain statistics measured on each local classifier's training set should be preserved (via the aggregation function's decisions) on the test set. As noted earlier, learning via constraint encoding is inherently robust to classifier redundancy. In the case of the degenerate example from the last section, the $M_e - 1$ identical weak classifiers all have the same constraints. Thus, as far as CB is concerned, the ensemble will effectively consist of only two classifiers, one strong and one weak. The $M_e - 2$ redundant copies of the weak classifier do not bias CB's decision aggregation [15].

Choice of constraints
In principle, we would like to encode as constraints joint statistics that reduce uncertainty about the class variable as much as possible. For example, a joint probability involving $C$ and the decisions of several classifiers can be quite informative about the true class variable ($C$). However, in our distributed setting, with no common labeled training data, it is not possible to measure joint statistics involving two or more classifiers and $C$. Thus, we are limited to encoding pairwise statistics involving $C$ and individual decisions ($\hat{C}_j$). Each classifier $j$, using its local training data, can measure the pairwise pmf $P_g^{(j)}[C, \hat{C}_j]$, with "g" indicating "ground truth". This (naively) suggests choosing these probabilities as constraints.
However, each pairwise pmf determines the marginal pmfs $P_g^{(j)}[C]$ and $P_g^{(j)}[\hat{C}_j]$. Via the superscript $(j)$, we emphasize that these marginal pmfs are based on $\tilde{\mathcal{X}}_j$ and are thus specific to local classifier $j$. Thus, choosing test set decisions to agree with $P_g^{(j)}[C, \hat{C}_j]$ forces agreement with the local class and class decision priors. Recall that these may differ from the true (test) priors. The local class priors $P_g^{(j)}[C]$, $j = 1, \ldots, M_e$, also may be inconsistent with each other. Thus, encoding the pairwise pmfs $\{P_g^{(j)}[C, \hat{C}_j]\}$ as constraints is inappropriate. Instead, it was suggested in [15] to encode the conditional pmfs (confusion matrices) $\{P_g^{(j)}[\hat{C}_j = \hat{c} \mid C = c]\}$. Confusion matrix information has been applied previously, for example, in [39], where it was used to define class ranks within a decision aggregation scheme, and in [18], where it was used to help transductively estimate class prior probabilities for the case of a single classifier. In [15], alternatively, confusion matrices were used to specify the constraints in the CB framework. These pmfs specify the pairwise pmfs $\{P_g^{(j)}[C, \hat{C}_j]\}$ except for the class priors. The constraint probabilities are (locally) measured on each training set as in (4), and the aggregation function's transductive (ensemble) estimates, $P_e[\hat{C}_j = \hat{c} \mid C = c]$, are given by (5). In principle, then, the objective should be to choose the ensemble posteriors $\{\{P_e[C = c \mid x_i]\}\ \forall i\}$ so that the transductive estimates match the constraints, that is, $P_e[\hat{C}_j = \hat{c} \mid C = c] = P_g^{(j)}[\hat{C}_j = \hat{c} \mid C = c]\ \forall j, c, \hat{c}$ (6). However, there is one additional complication. Suppose there is a class $c$ that does not occur in the test batch. Both the particular class and the fact that a class is missing from the test batch are of course unknown. It is inappropriate to impose the constraints $P_g^{(j)}[\hat{C}_j = \hat{c} \mid C = c]\ \forall j, \hat{c}$ for such a class; in doing so, one will assign test samples to class $c$, which will lead to gross inaccuracy in the solution [15]. What is thus desired is a simple way to avoid encoding these constraints, even though it is actually unknown that $c$ is void in the test set. A solution to this problem was practically effected by multiplying both sides of (6) by $P_e[C = c]$, that is, by expressing the relaxed constraints $P_e[\hat{C}_j = \hat{c} \mid C = c]\, P_e[C = c] = P_g^{(j)}[\hat{C}_j = \hat{c} \mid C = c]\, P_e[C = c]\ \forall j, c, \hat{c}$ (7). Note that (7) is equivalent to (6) for $c$ such that $P_e[C = c] > 0$, but with no constraint imposed when $P_e[C = c] = 0$. Thus, if the learning successfully estimates that $c$ is missing from the test batch, encoding the pmf $\{P_g^{(j)}[\hat{C}_j = \hat{c} \mid C = c]\ \forall \hat{c}\}$ will be avoided. In [15], it was found that this approach worked quite well in handling missing classes.
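The confusion-matrix constraints and their relaxed form can be sketched numerically. The estimators below are assumptions chosen for illustration: the measured confusion matrix is taken as the per-class average of a local classifier's (hard or soft) decisions on its training set, the ensemble class prior as the batch average of the ensemble posteriors, and the joint estimate as the batch average of the product of ensemble and local posteriors. The exact expressions (4) and (5) may differ; function names and data are hypothetical.

```python
def measured_confusion(train_post, train_labels, n_cls):
    """Assumed estimator of P_g[C_hat = k | C = c]: average the local
    classifier's decision vectors over training samples of true class c."""
    sums = [[0.0] * n_cls for _ in range(n_cls)]
    counts = [0] * n_cls
    for post, c in zip(train_post, train_labels):
        counts[c] += 1
        for k in range(n_cls):
            sums[c][k] += post[k]
    return [[sums[c][k] / counts[c] if counts[c] else 0.0
             for k in range(n_cls)] for c in range(n_cls)]

def relaxed_constraint_gap(ens_post, test_post, conf, n_cls):
    """Largest violation of the relaxed constraints over (c, k):
    | P_e[C_hat = k, C = c] - P_g[C_hat = k | C = c] * P_e[C = c] |."""
    n = len(ens_post)
    prior = [sum(e[c] for e in ens_post) / n for c in range(n_cls)]
    gap = 0.0
    for c in range(n_cls):
        for k in range(n_cls):
            joint = sum(e[c] * p[k] for e, p in zip(ens_post, test_post)) / n
            gap = max(gap, abs(joint - conf[c][k] * prior[c]))
    return gap

def one_hot(k, n_cls):
    return [1.0 if i == k else 0.0 for i in range(n_cls)]

# Toy training data for one classifier (three classes, hard decisions).
train_labels = [0, 0, 1, 1, 2, 2]
train_post = [one_hot(k, 3) for k in [0, 1, 1, 1, 2, 0]]
conf = measured_confusion(train_post, train_labels, 3)

# Toy test batch in which class 2 never occurs.
test_post = [one_hot(k, 3) for k in [0, 1, 1, 1]]
ens_post = [one_hot(c, 3) for c in [0, 0, 1, 1]]   # candidate ensemble posteriors
gap = relaxed_constraint_gap(ens_post, test_post, conf, 3)
```

In this toy batch the candidate posteriors put zero mass on class 2, so the rows of the confusion matrix for class 2 impose nothing, while the unrelaxed conditional estimate for that class would be an undefined 0/0. This mirrors the motivation given for the relaxed constraints.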

CB learning approach
In [15], the constraints (7) were practically met by choosing the ensemble posterior pmfs on the test batch to minimize a nonnegative cost $R$, a sum of relative entropies (cross entropies) between the pmfs based on the left- and right-hand sides of (7). Here, the relative entropy between pmfs $p$ and $q$ is defined in the standard way, $D(p \,\|\, q) = \sum_k p_k \log (p_k / q_k)$. Note that if $R$ is driven to zero such that $P_e[C = c] > 0\ \forall c$, the constraints are all met. Thus, minimizing (driving to zero) $R$ can be used to effect satisfaction of the constraints. To ensure that $P_e[C = c \mid x_i]$ is preserved as a pmf throughout the optimization, the posterior was parameterized using a softmax function, $P_e[C = c \mid x_i] = e^{\gamma_{c,i}} / \sum_{c'} e^{\gamma_{c',i}}$, with $\{\gamma_{c,i},\ \forall c,\ i = 1, \ldots, N_{\mathrm{test}}\}$ the scalar parameters to be optimized. Minimization of $R$ with respect to these parameters was performed by gradient descent.
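The role of the softmax parameterization can be seen in a few lines. The sketch below is not the learning procedure of [15]; it only shows that an arbitrary unconstrained update of the parameters for one test sample still yields a valid pmf. The update vector `step` is a made-up stand-in for a scaled negative gradient of the cost.

```python
import math

def softmax(gammas):
    """Map unconstrained scalars to a pmf; subtracting the maximum
    avoids overflow without changing the result."""
    m = max(gammas)
    exps = [math.exp(g - m) for g in gammas]
    z = sum(exps)
    return [e / z for e in exps]

gamma_i = [0.0, 0.0, 0.0]                 # parameters for one test sample, three classes
step = [2.5, -7.0, 0.3]                   # hypothetical unconstrained update
gamma_i = [g + s for g, s in zip(gamma_i, step)]
posterior_i = softmax(gamma_i)            # still nonnegative and sums to one
```

Because the constraint set is built into the parameterization, plain gradient descent can be used without any projection step, at the price of a cost that is nonconvex in the parameters.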
This CB learning was found to give greater decision accuracy than fixed naive Bayes, arithmetic averaging, and their transductive ML extensions (1) and (3). However, there are three important limitations. First, the given constraints (6) may be infeasible. Second, even when these constraints are feasible, there is no assurance that the gradient descent learning will find a feasible solution (there may be local minima of the nonconvex cost $R$). Finally, when the problem is feasible, there is a set of feasible solutions. Minimizing $R$ assures neither a unique solution nor one possessing good properties (accuracy). We next address these shortcomings.

TRANSDUCTIVE CB BY MAXIMUM ENTROPY
The standard approach to finding a unique distribution satisfying given constraints is to invoke the principle of maximum entropy [37]. In our distributed classification setting, given the constraints (6) and the goal of transductively satisfying them in choosing test set posteriors, application of this principle leads to the learning objective of Problem 1: maximize the entropy (9) of the joint pmf over the test set subject to the constraints (6), where the measured constraints and their transductive estimates are given by (4) and (5), respectively. In (9), we have assumed uniform support on the test set, that is, $P_e[x_i] = 1/N_{\mathrm{test}}$, $i = 1, \ldots, N_{\mathrm{test}}$, which constrains the joint pmf to the form $P_e[c, x_i] = (1/N_{\mathrm{test}})\, P_e[C = c \mid x_i]$. A serious difficulty with Problem 1 is that the constraints may be infeasible. An example was given in [15]. The difficulty arises because the constraints are measured using each local classifier's training support $\tilde{X}_j$, but we are attempting to satisfy them using different (test) support. To overcome this, we propose to augment the test support to ensure feasibility. We next introduce three different support augmentations. In Section 4.1, we augment the test support using the local training set supports. In Section 4.2, we construct a more compact support augmentation derived from the constraints measured on the training supports. Both of these augmentations ensure constraint feasibility. In Section 4.3, we discuss maximizing entropy on the full (discrete) support.
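The maximum entropy principle invoked above can be illustrated on a generic toy problem that is unrelated to the paper's constraints: among all pmfs on the faces of a die with a prescribed mean, the entropy maximizer has an exponential (Gibbs) form, and its single multiplier can be found by bisection. The function names and the target mean of 4.5 are assumptions for illustration only.

```python
import math

def gibbs(values, lam):
    """pmf proportional to exp(lam * value), the form taken by a
    maximum entropy solution under a single mean constraint."""
    weights = [math.exp(lam * v) for v in values]
    z = sum(weights)
    return [w / z for w in weights]

def max_entropy_with_mean(values, target_mean, lo=-50.0, hi=50.0, iters=200):
    """Bisection on the multiplier; the mean of the Gibbs pmf increases with lam."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        p = gibbs(values, mid)
        mean = sum(v * q for v, q in zip(values, p))
        if mean < target_mean:
            lo = mid
        else:
            hi = mid
    return gibbs(values, 0.5 * (lo + hi))

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0.0)

faces = [1, 2, 3, 4, 5, 6]
p_me = max_entropy_with_mean(faces, 4.5)
p_other = [0.0, 0.0, 0.0, 0.5, 0.5, 0.0]   # another pmf with the same mean of 4.5
```

Both pmfs satisfy the mean constraint, but the Gibbs solution spreads mass over all faces and has the larger entropy, which is the sense in which the maximum entropy solution is singled out among feasible ones.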

Augmentation with local classifier supports
The most natural augmentation is to add points from the training set supports ($\tilde{X}_j\ \forall j$) to the test set support. Since the constraints were measured on each local classifier's support, augmenting the test set support to include local training points should allow constraint feasibility. Note that this will require local classifiers to communicate both their constraints and the support points used in measuring them to the aggregation function. Consider separately the cases of (i) continuous-valued features $x^{(j)} \in \mathbb{R}^{k(j)}\ \forall j$ and (ii) discrete-valued features $x^{(j)} \in A_j$, with $A_j$ a finite set. In the former case, there is zero probability that a training point $\tilde{x}^{(j)}$ occurs as a component vector $x^{(j)}$ of a test point $x$. Thus, in this case, we will augment the test support with each local classifier's full training support set $\tilde{X}_j$, and in doing so we are exclusively adding unique support points to the existing (test) support, that is, we assign nonzero probability to the joint events $\{C = c, X = x\}\ \forall x \in X_{\mathrm{test}}$ and $\{C = c, \{x : x^{(j)} = \tilde{x}^{(j)}\}\}\ \forall \tilde{x}^{(j)} \in \tilde{X}_j,\ \forall j$. Note that each test point is a distinct joint event, with the other joint events consisting of collections of the joint feature vectors sharing a common component vector that belongs to a local training set. Even if different local classifiers observe the same set of features, unless these classifiers measure precisely the same values for these features for some training examples (which should occur with probability zero in the continuous-valued case, assuming training sets are randomly generated, independently, for each local classifier), these classifiers will supply mutually exclusive additional support points. Now consider the latter (discrete) case. Here, it is quite possible that a training point $\tilde{x}^{(j)}$ will appear as a component vector of a test point. In this case, it is redundant to add such points to the test support. Let $X_{\mathrm{test}}^{(j)}$ denote the set of component vectors for classifier $j$ that occurred in the test set and $\bar{X}_{\mathrm{test}}^{(j)}$ the complement set. Then, in the discrete-valued case, we will add $\bigcup_j (\tilde{X}_j \cap \bar{X}_{\mathrm{test}}^{(j)})$ to the test support. In the following, our discussion is based on the continuous case.
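For the discrete-valued case, the set of points added for one classifier amounts to a set difference: training component vectors that did not already occur among the test component vectors. A minimal sketch follows, with feature vectors represented as tuples; the function name and data are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[int, ...]

def augmentation_points(train_vectors: Sequence[Vector],
                        test_component_vectors: Sequence[Vector]) -> List[Vector]:
    """Training component vectors of one classifier that do not occur in the
    test set, with duplicates removed and training order preserved."""
    seen_in_test = set(test_component_vectors)
    added: List[Vector] = []
    added_set = set()
    for v in train_vectors:
        if v not in seen_in_test and v not in added_set:
            added.append(v)
            added_set.add(v)
    return added

# Toy data for classifier j: its training vectors and the j-th components of the test points.
train_j = [(0, 1), (1, 1), (2, 0), (0, 1)]
test_j = [(1, 1), (3, 3)]
extra_support_j = augmentation_points(train_j, test_j)
```

Repeating this for every classifier and taking the union of the results gives the augmentation described in the text.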
We further note that some care must be taken to ensure that sufficient probability mass is allocated to the training supports to ensure constraint feasibility; for example, a uniform (equal) mass assignment to all support points, both test and training, will not in general ensure feasibility. Thus, we allow flexible allocation of probability mass to the training supports (both the total mass allocated to the training supports and how it is distributed across the different training support points are flexibly chosen), choosing the joint pmf to have the form (13). Here, the total mass allocated to $X_{\mathrm{test}}$ is $P_u = \sum_{x \in X_{\mathrm{test}}} \sum_c P_e[c, x]$, and each test sample is assigned equal mass $P_u / N_{\mathrm{test}}$. (We allow flexible allocation of mass to the training support points in order to ensure feasibility of the constraints: some points are pivotal to this purpose and will need to be assigned relatively large masses, while other points are extraneous and hence may be assigned small mass. For the test set support, on the other hand, unless there are outliers, the points should contribute "equally" to constraint satisfaction, just as each sample was given equal mass in measuring the constraints $P_g^{(j)}[\cdot]$; accordingly, we give equal mass to each test support point.) For $\tilde{x}^{(j)} \in \tilde{X}_j$, the posterior is $P_e[c \mid \tilde{x}^{(j)}] = 1$ if $c = \tilde{c}^{(j)}$ and $0$ otherwise, that is, we exploit knowledge of the training labels in making exclusive posterior assignments on the training support.
Here, P u , {P e [c | x]}, and {P[ x ( j) , c ( j) ]} are all parameters whose values will be learned.
For $P_e[c, \{x\}]$ defined by (13), we would like to satisfy the constraints (6). Accordingly, we need to compute the transductive estimate $P_e[\hat{C}_j = \hat{c} \mid C = c]$, using the joint pmf (13). However, a difficulty here is that (13) is defined on the support set $X_{\text{test}} \cup \bigcup_k \tilde{X}_k$, but the posterior $P_j[\hat{C}_j = \hat{c} \mid x^{(j)}]$ can only be evaluated on the support subset where classifier j's feature vector is observed, that is, over $X_{\text{test}} \cup \tilde{X}_j$. For $\tilde{X}_k$, $k \neq j$, we only have instances of classifier k's feature vector, not j's. This means we cannot use the full support to measure $P_e[\hat{C}_j = \hat{c}, C = c]$. Formally, we resolve this issue by conditioning. Let $X_r^{(j)} = \{x \in X_{\text{test}}\} \cup \{x : x^{(j)} \in \tilde{X}_j\}$. Then, we measure the constraint probabilities conditioned on $x \in X_r^{(j)}$, as in (15), using the two pmfs in (16) and the normalization constant $K_0$ in (17). Note that from (17), we see that the same normalization constant $K_0$ appears in both pmfs given in (16), ensuring that these pmfs both sum to 1. Letting $N_e[\cdot]$ denote the corresponding expected counts, we obtain (18). The notation $N_e[\cdot]$ reflects the fact that this quantity represents the expected number of occurrences of the event given $x \in X_r^{(j)}$. We can thus now define the following problem.
Problem 2 maximizes the joint entropy H(C, X) subject to these constraints. We emphasize that in this problem the constraints are guaranteed to be feasible. In particular, they can always be satisfied via an exclusive assignment of all probability mass to the labeled supports (i.e., via the choice $P_u = 0$). A proof is provided in Appendix A.1.

Augmentation with support derived from constraints
The previous augmentation seems to imply that the local training set supports $\{\tilde{X}_j\}$ need to be made available to the decision aggregation function. Actually, only the posteriors (soft decisions) made on $\tilde{X}_j\ \forall j$, and the associated class labels, are needed by the aggregation function. However, even this may not be realistic in distributed contexts involving proprietary classifiers or distributed multisensor classification. Suppose instead that the only local classifier information communicated to the aggregation function is the set of constraints $\{P_g^{(j)}[\hat{C}_j = \hat{c} \mid C = c]\}\ \forall c, \forall \hat{c}$. We would still like to augment the test support to ensure a feasible solution. This can be achieved as follows.
First, note that $\tilde{x}^{(j)}$ determines the local posterior $\{P_j[\hat{C}_j = \hat{c} \mid \tilde{x}^{(j)}], \forall \hat{c}\}$ and that the joint probability $P[\tilde{x}^{(j)}, \tilde{c}^{(j)}]$ can thus be equivalently written as a joint probability of this posterior pmf and the class label, $P[\{P_j[\hat{C}_j = \hat{c} \mid \tilde{x}^{(j)}], \forall \hat{c}\}, \tilde{c}^{(j)}]$. In other words, the method in the last subsection assigns nonzero joint probability only to the posterior pmfs and conjoined class labels $\{(\{P_j[\hat{C}_j = \hat{c} \mid \tilde{x}^{(j)}], \forall \hat{c}\}, \tilde{c}^{(j)}) : (\tilde{x}^{(j)}, \tilde{c}^{(j)}) \in \tilde{X}_j\}$ that are induced by the local training set $\tilde{X}_j$. An alternative support augmentation ensuring feasibility is thus specified as follows. Consider all pairs $(\hat{c}, c)$ such that $P_g^{(j)}[\hat{C}_j = \hat{c} \mid C = c] > 0$. For each such pair, introduce a new support point ([0, . . ., 0, 1, 0, . . ., 0], c) with "1" in the $\hat{c}$-th entry, that is, the joint event that C = c and the local posterior assigns probability one to $\hat{C}_j = \hat{c}$. When every such constraint probability is positive, there are $N_c^2$ such support points added for each local classifier (and fewer otherwise). We assert that adding these support points, as an alternative to the training set supports, ensures feasibility of the ME constrained problem. A proof sketch is given in Appendix A.2. In Section 5, we will demonstrate experimentally that there are only small performance differences in practice between the use of these two support augmentations.
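The construction of the derived support points can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: the function names `derived_support` and `induced_constraints`, the nested-list layout `constraints[c][c_hat]` for $P_g^{(j)}[\hat{C}_j = \hat{c} \mid C = c]$, zero-based class indices, and the numerical values are all assumptions made for the example. The second helper checks, for a single classifier and an arbitrarily chosen mass per class, that placing mass proportional to the constraint value on each derived point reproduces the constraints. This is offered only as intuition for feasibility, not as the proof sketch of Appendix A.2.

```python
def derived_support(constraints):
    """Build derived support points for one local classifier.

    constraints[c][c_hat] is assumed to hold P_g[C_hat = c_hat | C = c].
    Each returned point is (one_hot_posterior, class_label).
    """
    n_classes = len(constraints)
    points = []
    for c in range(n_classes):
        for c_hat in range(n_classes):
            if constraints[c][c_hat] > 0:  # only pairs with positive probability
                one_hot = [1.0 if k == c_hat else 0.0 for k in range(n_classes)]
                points.append((one_hot, c))
    return points


def induced_constraints(points, masses, n_classes):
    """Conditional pmf of the local decision given the class, induced by the masses."""
    joint = [[0.0] * n_classes for _ in range(n_classes)]
    for (posterior, c), mass in zip(points, masses):
        for c_hat in range(n_classes):
            joint[c][c_hat] += mass * posterior[c_hat]
    # Normalize each row; assumes every class received some mass.
    return [[value / sum(row) for value in row] for row in joint]


# Hypothetical two-class constraints: class 0 is misclassified 10% of the time.
constraints = [[0.9, 0.1], [0.0, 1.0]]
points = derived_support(constraints)

# Illustrative mass assignment: (arbitrary class mass) x (constraint value).
class_mass = [0.3, 0.7]
masses = [class_mass[c] * constraints[c][one_hot.index(1.0)] for one_hot, c in points]
recovered = induced_constraints(points, masses, 2)
```

In this toy case only three of the four possible pairs receive a support point, because one constraint probability is zero, and `recovered` matches `constraints` up to floating-point rounding.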

Full support in the hard decision case
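Suppose each local classifier makes a hard decision, that is, x determines a discrete-valued joint decision vector $\hat{c}(x) = (\hat{c}_1, \ldots, \hat{c}_{M_e})$. In this case, we wish to transductively learn the joint pmf $P_e[C = c, \hat{C} = \hat{c}]$. Suppose we allow nonzero probability on the full discrete support of pairs $(c, \hat{c})$, rather than restricting nonzero probability to the decision vectors that occur in the test set. We have the following proposition.

Proposition. The ME joint pmf satisfying the constraints (6) on the full support of pairs $(c, \hat{c})$ has the naive Bayes joint pmf form

$$P_e[C = c, \hat{C} = \hat{c}] = \frac{1}{N_c} \prod_{j=1}^{M_e} P_g^{(j)}[\hat{C}_j = \hat{c}_j \mid C = c],$$

with associated posterior

$$P_e[C = c \mid \hat{C} = \hat{c}] = \frac{\prod_{j=1}^{M_e} P_g^{(j)}[\hat{C}_j = \hat{c}_j \mid C = c]}{\sum_{c'} \prod_{j=1}^{M_e} P_g^{(j)}[\hat{C}_j = \hat{c}_j \mid C = c']}. \quad (24)$$

The proof follows the proof given in [40] that the ME solution given conditional probability constraints has the naive Bayes joint pmf form and, further, uses the fact that, given only conditional probability constraints, a uniform class prior pmf $P_e[C = c] = 1/N_c$ maximizes entropy. Thus, satisfying the constraints on the full discrete support leads to the naive Bayes solution and to (assumed) conditional independence of local classifier decisions. This is clearly undesirable, as the local classifier decisions may in fact be strongly dependent.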
As a simple example, just consider the case where the classifiers in the ensemble are perfectly dependent, that is, identical copies. In this case, there is only nonzero support $P_e[\hat{c}]$ on $\hat{c} \in C_{\text{ident}} = \{(1, 1, \ldots, 1), (2, 2, \ldots, 2), \ldots, (N_c, N_c, \ldots, N_c)\}$. It can be shown that the ME posterior satisfying the constraints $P_e[\hat{C}_j \mid C = c] = P_g[\hat{C}_j \mid C = c]\ \forall j$ using only the nonzero support set $C_{\text{ident}}$ is the posterior (25), which can be written in terms of any single classifier j. This is in fact the true posterior in the perfectly dependent case and correctly captures the fact that there is effectively only a single classifier in the ensemble. This solution is wholly different from that obtained by plugging $\hat{c} \in C_{\text{ident}}$ into (24), which yields (26). Note that (26) is highly biased, treating classifier decisions as conditionally independent when they are in fact perfectly dependent. A related point of view on the solution (24) is that it will have higher entropy H(C, X) than a solution that maximizes entropy on a reduced support set. Lower entropy is, in fact, desirable because although we choose distributions to maximize entropy while satisfying constraints, we should choose our constraints to make this maximum entropy as small as possible, that is, a min-max entropy principle [41]. Restricting support to the test set imposes additional constraints on the solution, which reduces the (maximum) entropy.
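The bias described above is easy to see numerically. The following Python sketch compares the naive Bayes posterior computed from a single classifier's decision with the one computed from five identical copies of that classifier. The function name `naive_bayes_posterior`, the confusion values, and the uniform class prior are assumptions chosen for illustration only; they are not taken from the paper's experiments.

```python
def naive_bayes_posterior(decisions, confusion, n_classes):
    """Naive Bayes posterior over classes under a uniform prior.

    confusion[c][d] is assumed to hold P[decision = d | class = c], shared by
    all classifiers here because the copies are identical.
    """
    scores = []
    for c in range(n_classes):
        score = 1.0 / n_classes          # uniform class prior (assumption)
        for d in decisions:
            score *= confusion[c][d]     # conditional independence assumption
        scores.append(score)
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical two-class confusion probabilities for one local classifier.
confusion = [[0.8, 0.2], [0.3, 0.7]]

# One classifier deciding "class 0" versus five identical copies doing so.
single = naive_bayes_posterior([0], confusion, 2)
five_copies = naive_bayes_posterior([0] * 5, confusion, 2)
```

With identical copies, the extra decisions carry no new information, so the single-classifier posterior (about 0.73 for class 0 in this toy setting) is the appropriate one. Treating the five copies as conditionally independent instead pushes the posterior above 0.99, which illustrates how the naive Bayes form overstates confidence when decisions are dependent.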
The previous discussion instructs that the test set support contains vital information, and the only information, that we possess about statistical dependencies between local classifiers. Satisfying constraints on the full support set discards this information and will increase entropy. Even augmenting the test set support less dramatically, for example, by adding the training set supports, could affect accuracy of the posterior: (over)use of the training set support augmentation (high $1 - P_u$) may allow satisfying the constraints essentially only using the training set supports. Since the objective is to maximize H(C, X), the optimization would, in this case, choose posteriors on the test set to be as uniform as possible (while still satisfying the constraints). These test set posteriors could be quite inaccurate. In other words, too much reliance on training supports makes it less imperative to "get things right" on the test set. To make test set posteriors as accurate as possible, we believe they should contribute as much as possible to constraint satisfaction, that is, we have the following loosely stated learning principle: seek the minimal use of the extra support necessary to achieve the constraints. To capture this learning principle mathematically, we propose the following extension of Problem 2.
Problem 3 maximizes H(C, X) subject to the local classifier constraints (28) and subject to the mass constraint (31). In this problem, we modify Problem 2 to also constrain the total probability allocated to the labeled training supports to some specified value $P_o$. In Section 4.7, we will develop an algorithm seeking to find the minimum value $P_o = P_o^*$ such that the constraints are still feasible. When the test set support is sufficient by itself to meet the constraints, $P_o^* = 0$; otherwise, $P_o^* > 0$. In the sequel, we will invoke the method of Lagrange multipliers and introduce a Lagrange multiplier β associated with (31) to set the level $1 - P_u$. Thus, for the algorithm in Section 4.7, the search for $P_o^*$ will be realized by varying β. In our experimental results, we will demonstrate that as $1 - P_u$ is reduced, the entropy H(C, X) decreases, and moreover, the test set classification accuracy tends to increase.

Constraint relaxation
In the sequel, we develop a transductive extension of iterative scaling (IS) techniques [35,36] for solving the ME constrained problem for fixed $1 - P_u$, that is, for fixed β. To apply IS, the ME problem must be convex in all parameters and the constraints must be linear in the probabilities $P_e[c, x]$ [34,36]. The function H(C, X) is concave in these probabilities, so maximizing it is a convex problem; however, the constraints (28) are nonlinear in the parameters since $P_e[C = c \mid x_i]$ appears in both the numerator and denominator in (15). Nevertheless, it is possible to relax the constraints (28) to linear ones. In particular, assuming the denominator on the right-hand side of (18) is positive, if we plug the right-hand side of (18) into (28) and multiply through by this denominator, we obtain the equivalent linear constraints (32). Further, note the relation (33). Thus, comparing (32) and (33), we see that whenever this denominator is positive, the relaxed constraints (32) are equivalent to the original constraints (28), as desired. This is reminiscent of the constraint relaxation built into [15]. However, in some cases, if it is not possible to satisfy the original constraints at the given value $1 - P_u$, the constraints (32) can in principle still be satisfied by choosing the denominator to be zero for some c and j. This would amount to removing the associated pmf constraints: a zero denominator would preserve feasibility of the linearized constraints (32) while satisfying only a subset of the original constraints (28), those being jointly feasible. It is quite conceivable that this type of constraint relaxation would be undesirable, since it amounts to encoding less constraint information, which could have a deleterious effect on decision fusion accuracy. However, we emphasize that while in principle this "constraint relaxation" appears possible, this phenomenon has never occurred in any of our experiments, even those in which some classes were missing completely from the test batch, where we might have expected to observe it. One explanation for why constraint relaxation never occurred in our experiments is that a solution with a zero denominator is intrinsically a low entropy solution, which will always be rejected in favor of solutions with higher entropy, so long as such solutions exist. Since, for example, in the missing class case, there will always be augmented support points from class C = c (the missing test class), it is plausible that constraints involving C = c can still be met through almost exclusive use of these augmented support points (i.e., even if very little ensemble posterior probability in class C = c is assigned to any of the unlabeled test set points). In fact, this is what we have experimentally observed in the missing class case: on the test set, very little probability is assigned to the missing class, but positive mass is allocated to the augmented support points that are labeled by the missing class, and constraints involving the missing class are always met. Thus, at least in all of our experiments, the "linearized" constraints (32) were always found to be equivalent to the original constraints (28), as is desired.

Transductive iterative scaling (TIS) algorithm
The Lagrangian cost function corresponding to Problem 3 with the "relaxed constraints" is given in (34). Here, $\{\gamma(\hat{C}_j = \hat{c}, C = c), \forall j, \hat{c}, c\}$ are the Lagrange multipliers associated with the local classifier constraints, which need to be learned. The Lagrange multipliers α and $\lambda(x)\ \forall x \in X_{\text{test}}$ will be automatically chosen to satisfy the pmf sum constraints. The Lagrange multiplier β is treated as an "external" parameter and chosen to set the value of $1 - P_u$ as previously discussed. The transductive iterative scaling (TIS) algorithm for minimizing L(β) consists of alternating (i) optimization of $P_u$, $\{P_e[c \mid x]\}$, and $\{P[\tilde{x}^{(j)}, \tilde{c}^{(j)}]\}$ with $\{\gamma(\hat{C}_j = \hat{c}, C = c)\}$ held fixed, followed by (ii) an update of $\{\gamma(\hat{C}_j = \hat{c}, C = c)\}$ with the other parameters held fixed. In step (ii), we update a single Lagrange multiplier (with the Lagrange multipliers selected in a fixed, cyclical order), that is, we have implemented a "sequential" TIS algorithm, akin to [42], rather than a batch algorithm where all Lagrange multipliers are updated in parallel. We believe there are no difficulties inherent to a batch version; we simply chose a sequential algorithm. For fixed $\{\gamma(\hat{C}_j = \hat{c}, C = c)\}$, the globally optimal values for the remaining parameters, determined by taking partial derivatives of L(β) and setting them to zero, are given by (35)-(37). Given these parameters held fixed, we optimize a single Lagrange multiplier, for example, $\gamma(\hat{C}_j = \hat{c}, C = c)$, via the TIS update (39). In Appendix A.3, we give a simple proof that the TIS updates (39) are descent steps in the Lagrangian cost. Thus, both the alternating steps (35)-(37) (which are globally optimal) and (39) are descent steps in the Lagrangian. These alternating steps are performed until the constraints are well met (a convergence criterion is introduced in Section 4.8). Note that while descent in the Lagrangian is assured, we do not provide a proof in this paper that the TIS algorithm described above converges to the ME solution satisfying the constraints. This requires a more detailed technical argument. However, we do believe that such a proof can be developed, following along the lines of the technical approaches taken in [22,36,42].
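To give a feel for what a "sequential" multiplier update looks like, the following Python sketch runs a generic cyclic iterative scaling loop on a toy maximum entropy problem with indicator-function constraints. It is not the TIS update (39), whose exact form is not reproduced here; the function name `sequential_scaling`, the four-point support, and the target values are all assumptions made for illustration. Each step rescales the pmf so that one constraint is met exactly, using the closed-form update available for a single indicator constraint, and assumes every target and every current expectation lies strictly between 0 and 1.

```python
import math


def sequential_scaling(features, targets, n_points, sweeps=50):
    """Cyclic (one-multiplier-at-a-time) scaling for indicator constraints.

    features[k] is the set of support-point indices on which indicator k is 1;
    targets[k] is the required probability of that set. The pmf has the
    exponential form p(i) proportional to exp(sum of multipliers active at i).
    """
    lam = [0.0] * len(features)

    def pmf():
        weights = [
            math.exp(sum(lam[k] for k, f in enumerate(features) if i in f))
            for i in range(n_points)
        ]
        z = sum(weights)
        return [w / z for w in weights]

    for _ in range(sweeps):
        for k, f in enumerate(features):      # fixed, cyclical order
            p = pmf()
            e = sum(p[i] for i in f)          # current expectation of indicator k
            t = targets[k]
            # Exact step for one indicator: afterwards its expectation equals t.
            lam[k] += math.log(t * (1.0 - e) / (e * (1.0 - t)))
    return pmf(), lam


# Toy problem: 4 support points, two overlapping indicator constraints.
pmf, multipliers = sequential_scaling([{0, 1}, {0, 2}], [0.7, 0.4], 4)
```

In this toy case the two constraints fix the marginals of a 2 x 2 table, so the maximum entropy solution is the product of the marginals, and the loop reaches it after a single sweep.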

Comments on TIS algorithm
(1) The algorithm solves a convex optimization problem with linear constraints, with, therefore, a unique solution at each β for which the constraints are feasible [43].
(2) The TIS algorithm descends in the Lagrangian cost function for a given β.
(3) If the problem is infeasible at a given β, convergence of the algorithm is not guaranteed.
(4) When β → − ∞, P u = 1, and we are seeking a solution that only relies on the test set support.
(5) When β →∞, P u = 0, and we are not using the test support at all.
(6) When β = 0, there is no constraint on P u -this solution thus achieves highest entropy H(C, X), compared to solutions at other β values.

Extended TIS (ETIS) algorithm
Problem 3, but with the constraints (28) replaced by the relaxed constraints (32) and seeking $1 - P_u = P_o^*$, is addressed by Algorithm 1. This method solves the ME problem for fixed β using the TIS algorithm and embeds TIS within a bisection search over β for $P_o^*$.

Comments on ETIS algorithm
At fixed β, the TIS algorithm is run until practical constraint satisfaction is achieved. We measure the squared deviation D between the transductive estimates and the constraint probabilities, and stop when the change in deviation (ΔD) is less than a threshold value. The overall algorithm terminates when one of the following two conditions is satisfied. (1) At the current β, constraint satisfaction is not achieved. This indicates that $P_o^*$ is greater than the value $1 - P_u$ associated with the current β. (2) At the current β, at termination of TIS, $1 - P_u < \delta$, where δ is a small number. In this case, we have essentially found that there is an ME solution satisfying the constraints using only the test support (i.e., $P_o^* \approx 0$).
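The outer search over β can be sketched as a plain bisection. The Python code below is a simplified illustration and not a transcription of Algorithm 1: `run_tis` is a hypothetical callback standing in for a full TIS run at fixed β, assumed to return the final deviation and $P_u$, and the sketch assumes that constraint satisfaction is monotone in β (satisfiable for all β above some threshold). The default search range and deviation tolerance follow the values reported in the experiments; the stopping width `beta_tol` and the toy stand-in `toy_tis` are assumptions.

```python
import math


def etis_bisection(run_tis, beta_infeasible=-100.0, beta_feasible=0.0,
                   dev_tol=1e-7, beta_tol=1e-3):
    """Bisection for the most negative beta at which the constraints are met.

    run_tis(beta) -> (deviation, p_u) is a hypothetical interface.
    """
    deviation, p_u = run_tis(beta_feasible)
    if deviation >= dev_tol:
        raise ValueError("constraints not met at the upper end of the range")
    best_beta, best_p_u = beta_feasible, p_u
    lo, hi = beta_infeasible, beta_feasible
    while hi - lo > beta_tol:
        mid = 0.5 * (lo + hi)
        deviation, p_u = run_tis(mid)
        if deviation < dev_tol:
            hi, best_beta, best_p_u = mid, mid, p_u   # still satisfiable: go lower
        else:
            lo = mid                                  # not satisfiable: back off
    return best_beta, best_p_u


def toy_tis(beta):
    """Stand-in for TIS: pretends the constraints are satisfiable iff beta >= -6."""
    deviation = 0.0 if beta >= -6.0 else 1e-3
    p_u = 1.0 / (1.0 + math.exp(beta))   # more negative beta -> more test-set mass
    return deviation, p_u


beta_star, p_u_star = etis_bisection(toy_tis)
```

With the toy stand-in, the search settles just above the assumed threshold of −6, the point at which the most mass is placed on the test support while the constraints are still met.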

RESULTS
We next present both illustrative results and experiments comparing the TIS algorithm (Section 4.5, with the choice β = 0) and ETIS algorithm with a variety of alternative transductive and fixed combining schemes.

An infeasible constraint example
There are 2 classes ("+" and " ") in two dimensions, with local classifiers 1 and 2, shown in Figure 2. Suppose each local classifier makes hard decisions, that is, its posteriors take only the values 0 and 1. Let local classifier 1 achieve perfect training accuracy, that is, $P_g^{(1)}[\hat{C}_1 = c \mid C = c] = 1$ for both classes. Let local classifier 2 have the constraint probabilities measured from its decisions on the training set in Figure 2. As Figure 2 shows, we let the test set be the same as the training set except that it contains four additional points from class 2 that are all correctly classified by both classifiers. In this case, it is not possible to choose $\{\{P_e[C = c \mid x]\}, x \in X_{\text{test}}\}$ to satisfy all the conditional constraints (6). Thus, Problem 1 is infeasible. We first evaluated the TIS algorithm specialized to the case where no training support augmentation is used, that is, seeking to solve Problem 1. We measured the Lagrangian cost function (34) and the deviation. In Figure 3, we can see that the Lagrangian cost function decreases monotonically without convergence and the deviation approaches a fixed value above zero.
We next applied the TIS method (β = 0) with local training set support augmentation.As Figure 4 demonstrates, a feasible solution is achieved in this case.

Real data experimental results
We evaluated on datasets from the UC Irvine machine learning repository in Table 1 and we followed the experimental protocol from [15].
To simulate a distributed classification environment, we used five local classifiers, each a naive Bayes classifier working on a randomly selected subset of features. Thus, for the conditional independence model of Section 3.
These are also shown in Table 2. An extreme case of prior mismatch is when there are no examples from one of the classes in the test batch, so that this class has a zero test set prior probability. We refer to this case as the missing class case, also indicated in Table 2. In "missing class" experiments, we randomly selected a single class to be missing from the test batch. The missing class case is interesting for several reasons. First, it represents an extreme case of prior mismatch and thus a severe test of the robustness of a method to prior mismatch. Second, in an application scenario, it may be important to recognize that one of the classes is wholly not present in the testing data. This may provide valuable actionable information beyond mere class decisions. We also note that a class missing during testing is wholly different from the anomaly detection problem, wherein a class is present in test data but missing in the training data. The anomaly detection problem is not within the scope of this work.
Finally, we note that some datasets used in our experiments are relatively small, which makes it difficult to obtain accurate estimates of generalization accuracy. In particular, for liver disorder, sonar, and ionosphere, the test set size $N_{\text{test}}$ does not meet the requirement from [44] that $N_{\text{test}} \geq 100/P_e$, with $P_e$ the optimal error rate for the problem. Even when accurate error rate estimates are difficult to obtain, we would like to have statistical confidence in the relative performance of the combining schemes. Accordingly, to test the statistical significance of head-to-head comparison of two schemes, we performed the 5 × 2 cv F test, proposed in [45]. According to [45], we can reject the hypothesis that two algorithms have the same error rate with 95% confidence if the F statistic is greater than 4.74. As seen in the sequel, this testing indicates that there are statistically significant gains in accuracy of our transductive CB method over fixed combining and over the previous CB methods [15] in many of the mismatch experiments.
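The F statistic can be computed directly from the per-fold error-rate differences. The Python sketch below follows the combined 5 × 2 cv F statistic as it is usually defined: the sum of the ten squared differences divided by twice the sum of the five per-replication variance estimates, which is approximately F-distributed with 10 and 5 degrees of freedom. That [45] uses exactly this form is an assumption, and the function name `five_by_two_cv_f` and the example differences are made up for illustration.

```python
def five_by_two_cv_f(diffs):
    """Combined 5 x 2 cv F statistic.

    diffs is a list of five (p1, p2) pairs, where p1 and p2 are the differences
    in error rate between the two methods on the two folds of one replication.
    """
    numerator = sum(p1 * p1 + p2 * p2 for p1, p2 in diffs)
    variance_sum = 0.0
    for p1, p2 in diffs:
        mean = 0.5 * (p1 + p2)
        variance_sum += (p1 - mean) ** 2 + (p2 - mean) ** 2
    # Undefined if all fold pairs are identical (zero estimated variance).
    return numerator / (2.0 * variance_sum)


# Hypothetical differences: method A beats method B by 5% and 3% on each replication.
diffs = [(0.05, 0.03)] * 5
f_stat = five_by_two_cv_f(diffs)
significant = f_stat > 4.74   # threshold quoted in the text for 95% confidence
```

For these made-up differences the statistic exceeds the 4.74 threshold, so the hypothesis of equal error rates would be rejected at the 95% level.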

Augmentation with support derived from constraints
We first compared the results obtained for TIS using local training support and training support derived from the constraints.We measured classification error rate on the test set and the joint entropy on the test and training supports.
The TIS algorithm allocated probability mass between the training sets and test set without any constraint, that is, by letting β = 0.
Note also that on some datasets, for example, diabetes and liver disorder, the error rate in high mismatch and missing class cases is significantly lower than for low mismatch values. This is explained by recognizing that altering the class priors changes the Bayes-optimal error rate of the classification problem. In some cases, the Bayes error rate is made much lower. In the missing class case, it can be further recognized that if the transductive method is successful in identifying that a class is missing (or, at any rate, has very few test instances), the classification problem is effectively made "easier", with fewer classes to discriminate between in classifying the test samples.

Full support in the hard decision case
We next considered the case where local classifiers produce hard decisions.We compared ETIS and the naive Bayes solution (see Section 4.3) with respect to test error rate and joint entropy.For the naive Bayes method, since the test set class priors are unknown, a uniform class prior was assumed.
From Table 4, we can see the classification performance of ETIS is better than naive Bayes and with much lower joint entropy, as expected based on the discussion in Section 4.3.
Note also that we again see the phenomenon of lower error rates in the high mismatch cases on some datasets, especially for the ETIS method.

Influence of β
Figure 5 shows the influence of β on the TIS algorithm for diabetes. We measured four different values: deviation, joint entropy on the test and training set support, classification error rate on the test dataset, and probability mass allocation ($P_u$ on the test set support and $P_l = 1 - P_u$ on the training set support). We plotted curves for four different mismatch cases. Each curve is based on an average of five replications of two-fold cross-validation. We changed β from −20 to 10 with Δβ = 1 and, for each β, we ran the TIS algorithm until the stop criterion $\Delta D < 10^{-9}$ was met. We can see that when β < −6, the deviation does not approach 0 when TIS terminates. This demonstrates constraint infeasibility at finite β. As β becomes more negative, the test set support gets more probability mass allocation $P_u$. However, the rate of $P_u$ increase decreases as β becomes more negative; this occurs because it is difficult to satisfy the local constraints if the TIS algorithm assigns too much probability mass to the test set. The TIS algorithm achieves peak entropy at β = 0, as must be the case, since there is no constraint on $P_u$ restricting the entropy maximization at this value. When β goes either negative or positive, the joint entropy decreases. Note also that for all mismatch values, the best test set error occurs for negative β, and the smallest β satisfying the constraints achieves accuracy close to this best error rate. As noted in Section 4.7, we used an ETIS bisection search to find the smallest β for which the constraints are satisfied. For this β, the least mass is allocated to the extra support points. The initial search range is from $\beta_b = -100$ (a nonsatisfiable β) to $\beta_a = 0$ (a satisfiable β). For all the datasets we experimented on, Problem 1 was in fact infeasible. Moreover, for each dataset, β = −100 was negative enough that the TIS algorithm could not satisfy the constraints. We used the deviation as the criterion to determine whether the constraints are satisfied or not; here we chose $D/N_c^2 < 10^{-7}$.
Table 5 gives a comprehensive performance comparison between TIS (β = 0) and the ETIS algorithm (which stopped at the most negative β at which the constraints were met). ETIS guarantees more mass allocation on the test set support and smaller joint entropy. It has competitive classification performance with TIS and performs especially well in the missing class case. (We again emphasize that, in the missing class case, constraint relaxation never occurred in any of our experiments.) However, neither method dominates the other.

Comparisons between ETIS, TGD, TML, and fixed rules
In Table 6, we compare transductive ME (ETIS) with the transductive gradient descent (TGD) CB method and the transductive maximum likelihood (TML) algorithms proposed in [15] and also with the fixed rules Sum, Product, Max, and Min [3]. Note that Tables 5 and 6 have some common columns because ETIS results appear in both tables. We can see that ETIS achieves overall better error rates than the previous transductive methods and is uniformly better than the fixed rules. In Table 7, we evaluate statistical significance of the gains of ETIS compared head-to-head against the fixed combining methods and the previous CB method. Note that the F statistic is much larger than 4.74 in many of these comparisons, indicating that gains of ETIS are statistically significant in most mismatch cases compared with fixed methods, and in high mismatch cases compared with the TGD and TML methods, on most of the datasets. Instead of measuring joint entropy (27), we measured conditional entropy (9) to make a fair entropy comparison between ETIS and TGD. Because TGD only uses the test support set to satisfy constraints, we only measured entropy on the test set as in (9). Note that ETIS has higher conditional entropy than TGD, as we would expect, since TGD does not perform entropy maximization while satisfying constraints.

CONCLUSIONS
In this work, we have proposed a new ME framework for transductive learning of decision aggregation rules when there is no common labeled data across local classifiers.The new approach overcomes constraint infeasibility and nonuniqueness of the solution and achieves better results than [15].While our approach can be directly applied also to the single classifier case, which amounts to a semisupervised learning problem, it remains to be seen whether our transductive ME approach offers benefits in classification accuracy over standard supervised learning in this case.Future work should also develop a proof that TIS converges to the ME solution satisfying the constraints, as well as investigate application contexts for the distributed classification problem studied here.

A.1. Proof of feasibility of Problem 2
First, assign all probability mass to the training set supports, that is, choose P u = 0. Further, assign this mass to the
minimize a nonnegative cost consisting of the sum of relative entropies:

$$R = \sum_{j=1}^{M_e} D\big(P_e[\hat{C}_j \mid C]\, P_e[C] \,\big\|\, P_g^{(j)}[\hat{C}_j \mid C]\, P_e[C]\big) = \sum_{j=1}^{M_e} \sum_{c} P_e[C = c]\, D\big(P_e[\hat{C}_j \mid C = c] \,\big\|\, P_g^{(j)}[\hat{C}_j \mid C = c]\big), \quad (8)$$

$P_e[\hat{c}]$ on $\hat{c} \in C_{\text{ident}} = \{(1, 1, \ldots, 1), (2, 2, \ldots, 2), \ldots, (N_c, N_c, \ldots, N_c)\}$. It can be shown that the ME posterior satisfying the constraints $P_e[\hat{C}_j \mid C = c] = P_g[\hat{C}_j \mid C = c]\ \forall j$ using only the nonzero support set $C_{\text{ident}}$ is the posterior

Figure 3: Infeasibility of TIS for the Figure 2 example when no support augmentation is used.

14 EURASIP
1, both the local classifiers and the combining rule are based on naive Bayes. For example, Segment has 7 classes and 19 continuous features. The first local classifier used 15 randomly selected features while the second classifier used 16. Since all the classifiers use a significant fraction of all the features, this

Figure 4: Feasible TIS solution for the Figure 2 example when support augmentation is used.

(d) Probability mass allocation: $P_u$ and $P_l$ curves for M = 0.74, M = 1.43, M = 2.57, and the missing class case.

Table 1: UC Irvine machine learning datasets that were evaluated.

Table 2: Average test set class priors and mismatch values for different datasets.

introduces statistical dependencies between local classifier feature vectors and, thus, between the local classifiers. All features in the datasets are continuous-valued and were modeled by (class-conditional) Gaussian densities. We performed five replications of two-fold cross-validation for all datasets. For each replication, the dataset was randomly divided into two equal-sized sets. In the first fold, one set was used as training and a subset of the other was used for testing, with these roles reversed in the second fold. We averaged test error rates over all replications and folds. For a given fold, only a subset was used for testing in order to introduce a controlled level of prior mismatch between the training and testing sets. This was accomplished as follows. The original test set priors and training set priors are very similar since they are based on a random, equal-sized split of the whole dataset. To introduce more prior mismatch and, thus, to test the robustness of the various combining schemes to incorrect local class priors, we performed resampling on the test data. For example, suppose in the original test set that there are two classes with priors [0.5, 0.5]. To increase the prior mismatch, we set the new test set priors as [α, 1 − α]. The new test set is then obtained by sampling with replacement from the original test set (to create a dataset of size equal to the original test set size), based on these new priors. Note that sampling with replacement does introduce significant sample duplication in some cases. For example, on diabetes, for test class prior distribution [0.2, 0.8] (used in Table 2) with 92 unique samples from class 2 and 266 test samples to be drawn, all samples from class 2 are likely to have a duplicate in the resampled test data.
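The resampling step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' code: the function name `resample_with_priors` and the dictionary layout are hypothetical, and the sketch draws the class of each new test sample at random from the target priors, whereas the paper does not say whether per-class counts were random or fixed. The sample counts in the usage lines mirror the diabetes example above, with integers standing in for feature vectors.

```python
import random


def resample_with_priors(samples_by_class, priors, n_total, seed=0):
    """Draw a new test set with replacement so that class frequencies follow priors.

    samples_by_class maps a class label to its list of original test samples;
    priors maps the same labels to the desired test set prior probabilities.
    """
    rng = random.Random(seed)
    classes = sorted(samples_by_class)
    weights = [priors[c] for c in classes]
    resampled = []
    for _ in range(n_total):
        c = rng.choices(classes, weights=weights)[0]    # pick a class by prior
        x = rng.choice(samples_by_class[c])             # pick a sample with replacement
        resampled.append((x, c))
    return resampled


# Placeholder data: 100 class-1 samples and 92 class-2 samples (as on diabetes).
original = {1: list(range(100)), 2: list(range(100, 192))}

# Prior mismatch [0.2, 0.8] with 266 draws, and a missing class case (class 1 absent).
new_test = resample_with_priors(original, {1: 0.2, 2: 0.8}, 266)
missing_class_test = resample_with_priors(original, {1: 0.0, 2: 1.0}, 266)
```

Setting a class prior to zero gives the missing class case directly, since that class is then never drawn.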

From the results of Table 3, we see that the classification performance of TIS with derived training support is very similar to that of TIS with local training support, but with lower joint entropy. The mass allocation on the test set for TIS with local training support is much less than that with derived training support. The reason, we believe, is that since the derived training support is much smaller than the local training support and test support, entropy maximization in the derived support case requires heavy use of the test support, whereas in the local training support case, entropy is maximized making much more use of the augmented (local) training supports. In spite of this disparity in $P_u$ for these two cases, both approaches give very similar classification accuracy.

Table 4: Average classification performance (test error rate) and joint entropy for ETIS and naive Bayes in the hard decision case.

Table 5: Average classification performance (test error rate), joint entropy, and mass allocation comparison between TIS and ETIS algorithms.

Table 6: Classification performance (test error rate) and conditional entropy comparison between ETIS, TGD, TML, and Sum, Prod, Max, and Min fixed rules.