<TeXmacs|1.99.7>

<style|<tuple|article|std-latex>>

<\body>
  <\hide-preamble>
    <new-theorem|theorem|Theorem>

    <new-theorem|lemma|Lemma>

    <new-theorem|proposition|Proposition>
  </hide-preamble>

  <doc-data|<doc-title|Classification under Data Contamination with
  Application to Remote Sensing Image Mis-registration>|<doc-author|<author-data|<author-name|Donghui
  Yan<rsup|<math|1,3>>, Peng Gong<rsup|<math|2,3>>, Aiyou
  Chen<rsup|<math|4>>, Liheng Zhong<rsup|<math|2,3>>
  <vspace|2fn><next-line><rsup|<math|1>>Department of
  Statistics<next-line><rsup|<math|2>>Department of Environmental Science,
  Policy and Management<next-line><rsup|<math|3>>University of California,
  Berkeley, CA 94720<next-line><rsup|<math|4>>Google, Mountain View, CA
  94043>>>|<doc-date|>>

  <abstract-data|<\abstract>
    This work is motivated by the problem of image mis-registration in remote
    sensing and we are interested in determining the resulting loss in the
    accuracy of pattern classification. A statistical formulation is given
    where we propose to use data contamination to model and understand the
    phenomenon of image mis-registration. This model also applies to many
    other types of errors, for example, measurement errors and gross
    errors. The impact of data contamination on classification is
    studied under a statistical learning theoretical framework. A closed-form
    asymptotic bound is established for the resulting loss in classification
    accuracy, which is less than <math|\<epsilon\>/<around|(|1-\<epsilon\>|)>>
    for data contamination of an amount of <math|\<epsilon\>>. Extensive
    simulations have been conducted on both synthetic and real datasets under
    various types of data contamination, including label flipping, feature
    swapping and the replacement of feature values with data generated from a
    random source such as a Gaussian or Cauchy distribution. Our simulation
    results show that the bound we derive is fairly tight.
  </abstract>>


  <section|Introduction><label|section:Introduction>

  A motivating example of this work is the problem of image mis-registration
  which occurs almost ubiquitously in remote sensing. Image mis-registration
  refers to the phenomenon where the image of interest is mapped or aligned
  to a wrong position. This is usually caused by errors in the image or data
  acquisition device or the inaccuracy of the underlying mapping algorithms
  which try to map data collected at different scales, at different times, or
  taken from different angles. Figure<nbsp><reference|figure:RSimage> below
  illustrates an instance of image mis-registration where the image is tilted
  and then shifted by a small amount.

  <\big-figure>
    <\with|par-mode|center>
      <space|0cm><image|RSoriginal.eps||||>
      <space|0.1in><image|RSmisregistered.eps||||>

      \;
    </with>
  </big-figure|<label|figure:RSimage> <with|font-shape|italic|The original
  (left) and the mis-registered (right) remote sensing images for a cropland.
  Each color corresponds to one land class. >>

  The problem of image registration is of primary importance in remote
  sensing land monitoring applications which typically require the use of a
  number of images acquired at different times or time sequence data that can
  characterize seasonal changes or multi-annual similarities (Defries and
  Townshend, 1999 <cite|DefriesTownshend1999>; Liu et al., 2006
  <cite|LiuKellyGong2006>). Such applications require image registration,
  and registration quality affects image classification, change detection,
  and ecological (or climatological, hydrological) modeling (Justice et
  al., 1998 <cite|JusticeVermoteTownshend1998>; Gong and Xu, 2003
  <cite|GongXu2003>). Because image registration can never be perfect, some
  mis-registration error is inevitable. It has been suggested that
  mis-registration errors of less than 0.5 pixels are acceptable in
  subsequent analysis (Gong et al., 1992 <cite|GongLeDrewMiller1992>;
  Townshend et al., 1992 <cite|TownshendJusticeGurney1992>; Jensen, 2004
  <cite|Jensen2004>). However, this is rarely achievable and it is thus
  important to assess the impact of image mis-registration.

  Of a similar nature are errors due to rounding or the inaccuracy of the
  measuring instruments. Besides, interference from electromagnetic waves,
  clouds or other unfavorable weather conditions can all cause errors in
  remote sensing images. Additionally, various types of human errors often
  factor in, where a small amount of arbitrary error may be introduced
  anywhere in the data or any part of the data can be missing. Errors of this
  type are often called gross errors, and are estimated to occur in about
  <math|0.1%> to <math|10%> of the data <cite|Hampel1974>. This estimate
  forms the basis for our choice of the amount of data contamination in our
  simulation.

  We refer to the errors discussed above broadly as data contamination. Data
  contamination can have a disastrous effect on data quality and may
  fundamentally impact subsequent analysis and inference. It is thus of
  significant practical importance to answer the following questions:
  <with|font-shape|italic|What is the nature of data contamination? How much
  does data contamination impact our analysis (classification)? Do current
  algorithms (classifiers) continue to work or how much do we lose in
  accuracy if a remote sensing image is mis-registered or the underlying data
  are contaminated?> The present work aims to shed light on these
  questions.

  The study of data analysis and statistical inference under data
  contamination has been a long-standing research topic in statistics and
  machine learning. The earliest work can be traced back at least half a
  century; see, for example, Tukey <cite|Tukey1960> for a survey on
  sampling from contaminated distributions. Extensive investigations have
  been carried out since under the names of robust estimation
  <cite|Huber1972|Hampel1974> and measurement error models
  <cite|Fuller1987|carrollDelaige2009|DelaigeFan2009>. However, work
  along this line primarily concerns problems of regression or estimation.
  Work in the machine learning literature mostly deals with data
  contamination in the form of label flipping and empirically studies its
  impact on the performance of various specific classifiers. This includes
  Dietterich <cite|Dietterich1998> and Breiman <cite|RF> which evaluate the
  robustness of learning algorithms such as bagging, AdaBoost and Random
  Forests against label flipping. Other work includes
  <cite|Quinlan1986|Sloan1988|ZhuWu2004> and references therein.

  Relevant literature in remote sensing, however, has been sparse. Swain et
  al. <cite|SwainVanderbilt1982> investigated the impact of image
  mis-registration on classification. However, this work is purely empirical
  and the results depend highly on the underlying scenes in the image; for
  example, even under the same amount of mis-registration, the impact would
  be considerably different on images formed primarily of large forest lands
  and those formed by many small patches of different land types such as
  corn and other plants. Additionally, <cite|TownshendJusticeGurney1992|DaiKhorram1998>
  considered the impact of image mis-registration on change detection. Xu et
  al. <cite|XuDickson2009> study parameter estimation for a simple linear
  model under measurement errors due to a mismatch of location and scales.

  To gain insights into the nature of data contamination, in particular the
  phenomenon of image mis-registration, it is highly desirable to approach the
  problem with a formal model and to give some theoretical characterization.
  This forms the primary motivation of the present work. Our focus will be on
  classification.

  Assume the data of interest are drawn i.i.d. from some probability
  distribution <math|G> defined on <math|\<bbb-R\><rsup|p>>. By treating
  errors as contaminations to the probability distribution <math|G>, we
  arrive at the following statistical model for data contamination

  <\equation>
    <label|eq:DCModel><wide|G|~>=<around|(|1-\<epsilon\>|)>*G+\<epsilon\>*H
  </equation>

  where <math|<wide|G|~>> is the distribution of the data after contamination
  and <math|H> is an arbitrary distribution. Model<nbsp><eqref|eq:DCModel> is
  quite general: it captures the various types of data contamination we
  have discussed (though not additive noise). Note that, in the setting
  of classification, <math|G> is the joint distribution of the attributes and
  the label, thus a contamination under model<nbsp><eqref|eq:DCModel> can
  apply to the attributes, the label, or both. The <math|\<epsilon\>>
  in <eqref|eq:DCModel> can be thought of as the proportion of data (e.g.,
  image pixels) that are \Pcontaminated", e.g., flipped in label or
  replaced by data generated from a different distribution <math|H>.
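
  Model<nbsp><eqref|eq:DCModel> can be read as a two-stage sampling scheme:
  with probability <math|1-\<epsilon\>> a data point is drawn from <math|G>,
  and otherwise it is drawn from <math|H>. The following Python sketch
  illustrates this reading under assumptions that are ours and not part of
  the model: the helper names <verbatim|sample_contaminated>,
  <verbatim|draw_g> and <verbatim|draw_h> are hypothetical, <math|G> is taken
  to be a standard Gaussian and <math|H> a standard Cauchy on the real line,
  and <math|\<epsilon\>=0.1>.

```python
import math
import random


def sample_contaminated(draw_g, draw_h, eps, n, rng):
    """Draw n points from the mixture (1 - eps) * G + eps * H.

    draw_g and draw_h are hypothetical callables returning one draw from
    G and H respectively; rng is a random.Random instance.
    """
    draws = []
    for _ in range(n):
        # With probability eps the point comes from the contaminating source H.
        from_h = rng.random() < eps
        value = draw_h(rng) if from_h else draw_g(rng)
        draws.append((value, from_h))
    return draws


def draw_g(rng):
    # Illustrative choice of G (an assumption): standard Gaussian.
    return rng.gauss(0.0, 1.0)


def draw_h(rng):
    # Illustrative choice of H (an assumption): standard Cauchy, inverse CDF.
    return math.tan(math.pi * (rng.random() - 0.5))


rng = random.Random(0)
eps = 0.1
draws = sample_contaminated(draw_g, draw_h, eps, 20000, rng)
# The observed fraction of points drawn from H should be close to eps.
contaminated_fraction = sum(1 for _, from_h in draws if from_h) / len(draws)
print(contaminated_fraction)
```

  In practice the flag recording which points came from <math|H> is not
  observed; it is kept here only to make the role of <math|\<epsilon\>>
  visible.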

  It is known that the effect of image mis-registration is determined by
  resolution, scene structure and the amount of registration error (e.g., 0.5
  pixels, 1 pixel, or 1.5 pixels in RMS error). In
  model<nbsp><eqref|eq:DCModel>, we choose to use the proportion of pixels
  that are \Pcontaminated" as a measure of the extent of image
  mis-registration. This is to capture the essence of image mis-registration
  and to uncover the relationship between the amount of mis-registration and
  the resulting loss in classification accuracy. This is different from the
  usual practice in the remote sensing community where image
  mis-registration is quantified in terms of a shift of a certain number of
  pixels. Given the same amount of shift, the impact on classification
  is highly scene-dependent; e.g., the impact would be drastically different
  for a large land area consisting mainly of forests and a small land parcel
  formed by corn fields and rice fields. Under a shift-based measure, it
  would thus hardly be possible to establish a generic relationship between
  the amount of mis-registration and the resulting loss in classification
  accuracy.

  Our contributions are as follows. We propose a statistical model for the
  phenomenon of image mis-registration. This data contamination model
  captures a wide range of errors such as label flipping, measurement errors,
  rounding errors and accidental human errors which occur almost ubiquitously
  in real applications. We study classification under data contamination in
  the statistical learning framework. A bound, which we term the data
  contamination bound, is obtained on the loss of classification accuracy due
  to contamination of the training data, in terms of the amount of
  contamination. This bound allows one to give a conservative assessment of
  whether a class of classification algorithms, namely those which are
  universally consistent, continues to work under data contamination.

  The rest of the paper is organized as follows. In Section
  <reference|section:classificationDC>, we formulate the problem of
  classification under data contamination and obtain a bound on the loss in
  classification accuracy in terms of the amount of data contamination. In
  Section<nbsp><reference|section:simulation>, we conduct extensive
  simulations on the impact of data contamination on the classification
  performance of SVM for a number of synthetic and real datasets under
  various types of data contamination. In
  Section<nbsp><reference|section:amtDC>, we briefly discuss heuristics to
  estimate the amount of data contamination for the case of image
  mis-registration. Finally we conclude in
  Section<nbsp><reference|section:conclusion>. In that section, we also
  collect results from the literature on the impact of label flipping on the
  classification performance of AdaBoost; additionally, we give
  insight on using data contamination as a model to understand co-training,
  which is particularly useful in situations where training data are scarce.

  <section|Classification under data contamination><label|section:classificationDC>

  Classification is an important problem in pattern recognition. However, as
  discussed in Section<nbsp><reference|section:Introduction>, especially in
  the context of land-cover, land-use mapping, crop yield estimation and many
  other important applications in remote sensing, the classification result
  may be affected by data contamination. In this section, we will study
  classification under data contamination with model<nbsp><eqref|eq:DCModel>
  and derive a bound on the resulting loss in classification accuracy. We
  start with an introduction to the statistical learning framework for
  classification <cite|DevroyeGyorfiLugosi1996>.

  <subsection|Classification in the statistical learning framework>

  In statistical learning, a classification rule (or classifier) is defined
  by a map: <math|\<cal-X\>\<rightarrow\>\<cal-Y\>> where <math|\<cal-X\>> is
  the sample space for observations and <math|\<cal-Y\>> is a finite set of
  labels. For simplicity, we consider throughout a two-class problem where
  <math|\<cal-Y\>=<around|{|0,1|}>>.

  Associated with each classifier, there is a performance measure called the
  loss function, denoted by <math|l<around|(|f,X,Y|)>>. The loss function that
  is of special interest is the 0-1 loss, defined as

  <\equation>
    <label|eq:01loss>l<around|(|f,X,Y|)>=<around*|{|<tabular*|<tformat|<cwith|1|-1|1|1|cell-halign|l>|<cwith|1|-1|1|1|cell-lborder|0ln>|<cwith|1|-1|2|2|cell-halign|l>|<cwith|1|-1|2|2|cell-rborder|0ln>|<table|<row|<cell|0>|<cell|<text|if><nbsp>I<rsub|<around|{|f<around|(|X|)>\<gtr\>0|}>>=Y>>|<row|<cell|1>|<cell|<text|otherwise>>>>>>|\<nobracket\>>
  </equation>

  where <math|f> is a decision function and <math|I<rsub|<around|{|.|}>>> is
  the indicator function. Here we call a function <math|f> a decision
  function if a decision rule can be written as
  <math|I<rsub|<around|{|f\<gtr\>0|}>>>.

  <with|font-series|bold|Definition.> Let <math|\<bbb-P\>> be the joint
  probability distribution of <math|X> and <math|Y>. Then the risk associated
  with a decision function <math|f> is defined as

  <eqnarray|<tformat|<table|<row|<cell|R<rsub|\<bbb-P\>><around|(|f|)>>|<cell|=>|<cell|\<bbb-E\><rsub|\<bbb-P\>>*l<around|(|f,X,Y|)>=\<bbb-P\>*<around|(|Y\<neq\>I<rsub|<around|{|f<around|(|X|)>\<gtr\>0|}>>|)>.<eq-number>>>>>>

  Similarly, the empirical risk for a decision function <math|f>, on a
  training sample <math|<around|(|X<rsub|1>,Y<rsub|1>|)>,...,<around|(|X<rsub|n>,Y<rsub|n>|)>>,
  can be obtained by replacing <math|\<bbb-P\>> in the above with its
  empirical distribution <math|<wide|\<bbb-P\>|^><rsub|n>>.

  Given a probability distribution <math|\<bbb-P\>> and a function class
  <math|\<cal-G\>>, the goal of classification is to find a decision rule
  <math|f<rsup|\<ast\>><rsub|\<cal-G\>>\<in\>\<cal-G\>> that minimizes
  <math|R<rsub|\<bbb-P\>><around|(|f|)>>, i.e.,

  <\equation>
    <label|eq:defLearning>f<rsup|\<ast\>><rsub|\<cal-G\>>=arg
    min<rsub|f\<in\>\<cal-G\>> R<rsub|\<bbb-P\>><around|(|f|)>.
  </equation>

  The rule learned from the training sample
  <math|<around|(|X<rsub|1>,Y<rsub|1>|)>,...,<around|(|X<rsub|n>,Y<rsub|n>|)>>,
  denoted by <math|f<rsub|n>>, can be defined similarly by substitution of
  <math|\<bbb-P\>> with <math|<wide|\<bbb-P\>|^><rsub|n>> in
  <eqref|eq:defLearning>.

  <with|font-series|bold|Definition.> Fix a probability distribution
  <math|\<bbb-P\>>. The function that achieves the minimum risk, among all
  possible decision rules, is called the Bayes rule. The corresponding risk
  is called the Bayes risk and is denoted by
  <math|R<rsub|\<bbb-P\>><rsup|\<ast\>>>.

  For the 0-1 loss as defined in <eqref|eq:01loss> and a fixed probability
  distribution, the Bayes rule is given by

  <\equation*>
    \<beta\><around|(|x|)>=I<rsub|<around|{|\<eta\><around|(|x|)>\<gtr\>0|}>>
  </equation*>

  where

  <\equation*>
    \<eta\><around|(|x|)>=\<bbb-P\>*<around|(|Y=1<nbsp>\|<nbsp>X=x|)>-0.5
  </equation*>

  is called the Bayes decision function.
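
  As a small illustration of these definitions, the following Python sketch
  computes the Bayes decision function, the Bayes rule and the Bayes risk for
  a toy distribution on a four-point feature space, and confirms by brute
  force that no other rule has smaller risk. The distribution and the names
  <verbatim|P_X>, <verbatim|POSTERIOR> and <verbatim|risk> are illustrative
  assumptions, not quantities from the paper.

```python
from itertools import product

# Toy joint distribution on X in {0, 1, 2, 3}; all numbers are illustrative.
P_X = [0.25, 0.25, 0.25, 0.25]        # marginal P(X = x)
POSTERIOR = [0.1, 0.4, 0.7, 0.9]      # posterior P(Y = 1 | X = x)


def risk(rule, p_x, posterior):
    """0-1 risk of a rule given as a list of predicted labels, one per x."""
    total = 0.0
    for label, px, p1 in zip(rule, p_x, posterior):
        # Error probability at x is P(Y != label | X = x).
        total += px * ((1.0 - p1) if label == 1 else p1)
    return total


# Bayes decision function eta(x) = P(Y = 1 | X = x) - 0.5.
eta = [p1 - 0.5 for p1 in POSTERIOR]
# Bayes rule I{eta(x) > 0} and its risk.
bayes_rule = [1 if e > 0 else 0 for e in eta]
bayes_risk = risk(bayes_rule, P_X, POSTERIOR)

# Brute force over all 2^4 deterministic rules on the four feature values.
best_risk = min(risk(rule, P_X, POSTERIOR)
                for rule in product([0, 1], repeat=4))

print(bayes_rule, bayes_risk, best_risk)
```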

  <with|font-series|bold|Definition.> A classification algorithm is
  universally consistent if, for all distributions <math|\<bbb-P\>>,

  <\equation*>
    R<rsub|\<bbb-P\>><around|(|f<rsub|n>|)>\<rightarrow\><rsub|a.*s.>R<rsub|\<bbb-P\>><rsup|\<ast\>>
  </equation*>

  as <math|n\<rightarrow\>\<infty\>> where a.s. stands for almost surely.

  <with|font-series|bold|Notation.> To simplify notation, we adopt the
  following convention. Denote <math|R\<triangleq\>R<rsub|G>> and
  <math|<wide|R|~>\<triangleq\>R<rsub|<wide|G|~>>>. Also we use
  <math|<wide||~>> to indicate a quantity associated with the contaminated
  distribution <math|<wide|G|~>>. In particular, <math|f<rsub|n>> and
  <math|<wide|f|~><rsub|n>> are the classifiers learned from a training
  sample of size <math|n> from <math|G> and <math|<wide|G|~>>, respectively;
  and <math|\<eta\>>, <math|<wide|\<eta\>|~>> and <math|\<eta\><rsup|H>> are
  the Bayes decision function under <math|G>, <math|<wide|G|~>> and <math|H>,
  respectively.

  <subsection|A bound on the loss of classification accuracy>

  In the standard setting of statistical learning theory, one is interested
  in the consistency of a classifier, <math|f<rsub|n>>, obtained via
  empirical risk minimization, that is,

  <\equation*>
    R<around|(|f<rsub|n>|)>\<longrightarrow\>R<rsup|\<ast\>>
  </equation*>

  as <math|n\<rightarrow\>\<infty\>>. In such a case, the classifiers
  <math|f<rsub|n>> are trained and tested with data generated from the same
  probability distribution <math|G>.

  In the present work, we consider a different setting where the probability
  distribution, <math|<wide|G|~>>, of the training sample differs from that
  of the test sample, <math|G>. Of course if <math|G> and <math|<wide|G|~>>
  are \Ptotally" different, then there is no hope of learning. We thus make
  the assumption that <math|G> and <math|<wide|G|~>> differ by a small amount
  in the sense of a \Psmall" <math|\<epsilon\>> under
  model<nbsp><eqref|eq:DCModel>. Clearly the rule learned from a training
  sample under <math|<wide|G|~>> will be different from that under <math|G>.
  Since the test sample is from <math|G>, a classifier trained under
  <math|<wide|G|~>> would typically have a larger classification error. One
  important question is how much additional classification error is
  introduced if the classifier is trained on a sample from <math|<wide|G|~>>
  (instead of <math|G>) and tested on a sample generated from <math|G>.

  What we really wish to know is how much
  <math|R<around|(|<wide|f|~><rsub|n>|)>> differs from
  <math|R<around|(|f<rsub|n>|)>> as
  <math|n\<rightarrow\>\<infty\>> for <math|\<epsilon\>> small. As we do not
  have access to data from <math|G>, a natural proxy for
  <math|R<around|(|f<rsub|n>|)>> is <math|R<rsup|\<ast\>>> since
  <math|R<around|(|f<rsub|n>|)>\<rightarrow\>R<rsup|\<ast\>>> as
  <math|n\<rightarrow\>\<infty\>> for consistent classifiers
  <math|f<rsub|n>>. We start with the following risk decomposition

  <eqnarray|<tformat|<table|<row|<cell|R<around|(|<wide|f|~><rsub|n>|)>-R<rsup|\<ast\>>>|<cell|=>|<cell|R<around|(|<wide|f|~><rsub|n>|)>-R<around|(|<wide|\<eta\>|~>|)>+R<around|(|<wide|\<eta\>|~>|)>-R<rsup|\<ast\>>.<eq-number><label|eq:riskDecomp2>>>>>>

  The <math|R<around|(|<wide|\<eta\>|~>|)>-R<rsup|\<ast\>>> term in
  <eqref|eq:riskDecomp2> can be bounded by a term that depends only on the
  amount of contamination, <math|\<epsilon\>>, under some weak assumptions.
  This is stated as Theorem<nbsp><reference|theorem:DCbound2>. The term
  <math|R<around|(|<wide|f|~><rsub|n>|)>-R<around|(|<wide|\<eta\>|~>|)>> can
  be shown to vanish as the training sample size increases if the underlying
  classifier is universally consistent. This is stated as
  Theorem<nbsp><reference|theorem:consistencyDiffMeasure>. Note that here the
  convergence rate may be different for different types of classifiers.

  <\theorem>
    <label|theorem:DCbound2>If <math|g<around|(|x|)>>, the probability
    density function of <math|G>, exists, then for data contamination with
    any distribution <math|H>,

    <eqnarray*|<tformat|<table|<row|<cell|R<around|(|<wide|\<eta\>|~>|)>-R<rsup|\<ast\>>>|<cell|\<leq\>>|<cell|<frac|\<epsilon\>|1-\<epsilon\>>,>>>>>

    where equality holds if and only if the following conditions hold

    <\itemize>
      <item*|a)>

      <\equation*>
        \<epsilon\>=<frac|0.5-R<rsup|\<ast\>>|1-R<rsup|\<ast\>>>,<nbsp>h<around|(|x|)>=<frac|2*<around|\||\<eta\><around|(|x|)>|\|>*g<around|(|x|)>|1-2*R<rsup|\<ast\>>>,<nbsp><text|and>
      </equation*>

      <item*|b)><math|P<rsub|H>*<around|(|Y=1\|X=x|)>=1> when
      <math|\<eta\><around|(|x|)>\<less\>0>, and 0 otherwise.
    </itemize>
  </theorem>

  <with|font-series|bold|Remark.>

  <\enumerate>
    <item>The bound as stated in Theorem<nbsp><reference|theorem:DCbound2> is
    sharp as it is achievable under a special case as noted in the statement
    of the theorem.

    <item>A related data contamination model is as follows.

    <\equation>
      <label|eq:DCModelg>d*<wide|G|~><around|(|x|)>=<around|[|1-\<epsilon\><around|(|x|)>|]>*d*G<around|(|x|)>+\<epsilon\><around|(|x|)>*d*H<around|(|x|)>
    </equation>

    such that <math|0\<leq\>\<epsilon\><around|(|x|)>\<leq\>\<epsilon\>\<less\>1>
    for some positive constant <math|\<epsilon\>> where <math|G,H,<wide|G|~>>
    are probability distribution functions. Model <eqref|eq:DCModelg> allows
    the amount of data contamination to be data dependent as long as the
    amount is uniformly smaller than a constant. A result similar to
    Theorem<nbsp><reference|theorem:DCbound2> can be obtained.
  </enumerate>

  To prepare for the proof of Theorem<nbsp><reference|theorem:DCbound2>, we
  have the following lemma.

  <\lemma>
    <label|lemma:riskinSign>Let <math|f> be a decision function. Further
    assume <math|\<bbb-P\>*<around|(|f<around|(|X|)>=0|)>=0>. Then

    <eqnarray*|<tformat|<table|<row|<cell|R<around|(|f|)>>|<cell|=>|<cell|0.5-\<bbb-E\><around*|[|\<eta\><around|(|X|)>.<text|sign><around*|(|f<around|(|X|)>|)>|]>>>>>>

    where <math|<text|sign><around|(|x|)>=1> if <math|x\<gtr\>0> and <math|-1> otherwise.
  </lemma>

  <\proof>
    Note that we can write

    <\equation*>
      R<around|(|f|)>=\<bbb-E\><rsub|G>*<around*|\||Y-I<rsub|<around|{|f<around|(|X|)>\<gtr\>0|}>>|\|>.
    </equation*>

    Thus

    <eqnarray*|<tformat|<table|<row|<cell|R<around|(|f|)>>|<cell|=>|<cell|\<bbb-E\>*<around*|[|Y.*I<rsub|<around|{|f<around|(|X|)>\<less\>0|}>>|]>+\<bbb-E\>*<around*|[|<around|(|1-Y|)>.*I<rsub|<around|{|f<around|(|X|)>\<gtr\>0|}>>|]>>>|<row|<cell|>|<cell|=>|<cell|0.5+\<bbb-E\>*<around*|[|<around|(|Y-0.5|)>.*I<rsub|<around|{|f<around|(|X|)>\<less\>0|}>>|]>>>|<row|<cell|>|<cell|>|<cell|+\<bbb-E\>*<around*|[|<around|(|0.5-Y|)>.*I<rsub|<around|{|f<around|(|X|)>\<gtr\>0|}>>|]>>>|<row|<cell|>|<cell|=>|<cell|0.5-\<bbb-E\><around*|[|\<eta\><around|(|X|)>.<text|sign><around*|(|f<around|(|X|)>|)>|]>.>>>>>
  </proof>
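
  The identity in Lemma<nbsp><reference|lemma:riskinSign> is easy to check
  numerically on a finite feature space. The Python sketch below draws a
  random toy distribution and an arbitrary decision function, and compares
  the risk computed directly from the definition with the expression
  <math|0.5-\<bbb-E\><around*|[|\<eta\><around|(|X|)>.<text|sign><around*|(|f<around|(|X|)>|)>|]>>.
  The distribution, the seed and all variable names are illustrative
  assumptions.

```python
import random


def sign(value):
    """sign(v) = 1 if v > 0 and -1 otherwise, as in the lemma."""
    return 1 if value > 0 else -1


rng = random.Random(1)
k = 6                                            # size of the toy feature space
weights = [rng.random() for _ in range(k)]
p_x = [w / sum(weights) for w in weights]        # marginal P(X = x)
posterior = [rng.random() for _ in range(k)]     # P(Y = 1 | X = x)
eta = [p1 - 0.5 for p1 in posterior]             # Bayes decision function
# Arbitrary decision function; it is nonzero with probability one.
f = [rng.uniform(-1.0, 1.0) for _ in range(k)]

# Risk from the definition: P(Y != I{f(X) > 0}).
direct_risk = sum(px * ((1.0 - p1) if fx > 0 else p1)
                  for px, p1, fx in zip(p_x, posterior, f))

# Risk from the lemma: 0.5 - E[eta(X) * sign(f(X))].
lemma_risk = 0.5 - sum(px * e * sign(fx)
                       for px, e, fx in zip(p_x, eta, f))

print(direct_risk, lemma_risk)
```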

  <no-indent>The posterior probability <math|<wide|\<eta\>|~><around|(|x|)>+0.5>
  under the contaminated distribution <math|<wide|G|~>> can be written as

  <eqnarray*|<tformat|<table|<row|<cell|>|<cell|>|<cell|<wide|\<eta\>|~><around|(|x|)>+0.5>>|<row|<cell|>|<cell|=>|<cell|<around|[|1-\<alpha\><rsub|\<epsilon\>><around|(|x|)>|]>*<around|(|\<eta\><around|(|x|)>+0.5|)>+\<alpha\><rsub|\<epsilon\>><around|(|x|)>*<around|(|\<eta\><rsup|H><around|(|x|)>+0.5|)>>>>>>

  where

  <eqnarray*|<tformat|<table|<row|<cell|\<alpha\><rsub|\<epsilon\>><around|(|x|)>>|<cell|=>|<cell|\<epsilon\>*h<around|(|x|)>*<around|[|<around|(|1-\<epsilon\>|)>*g<around|(|x|)>+\<epsilon\>*h<around|(|x|)>|]><rsup|-1>.>>>>>

  Here <math|g> and <math|h> are the continuous density or discrete
  probability functions corresponding to <math|G> and <math|H>, respectively.
  Then

  <eqnarray*|<tformat|<table|<row|<cell|<wide|\<eta\>|~>>|<cell|=>|<cell|<around|(|1-\<alpha\><rsub|\<epsilon\>>|)>*\<eta\>+\<alpha\><rsub|\<epsilon\>>*\<eta\><rsup|H>.>>>>>

  <\proof>
    <dueto|Proof of Theorem<nbsp><reference|theorem:DCbound2>>By
    Lemma<nbsp><reference|lemma:riskinSign>, we have

    <eqnarray*|<tformat|<table|<row|<cell|R<around|(|<wide|\<eta\>|~>|)>-R<rsup|\<ast\>>>|<cell|=>|<cell|\<bbb-E\>*<around*|[|\<eta\>*<around|(|<text|sign><around|(|\<eta\>|)>-<text|sign><around|(|<wide|\<eta\>|~>|)>|)>|]>>>|<row|<cell|>|<cell|=>|<cell|2*\<bbb-E\>*<around*|[|<around|\||\<eta\>|\|>.*I<rsub|<around|{|\<eta\>*<wide|\<eta\>|~>\<less\>0|}>>|]>.>>>>>

    Next notice that, on the event <math|<around|{|\<eta\>*<wide|\<eta\>|~>\<less\>0|}>>,

    <eqnarray*|<tformat|<table|<row|<cell|\<eta\>*<wide|\<eta\>|~>>|<cell|=>|<cell|\<alpha\><rsub|\<epsilon\>>*\<eta\><rsup|2>*<around*|[|<frac|<around|(|1-\<alpha\><rsub|\<epsilon\>>|)>|\<alpha\><rsub|\<epsilon\>>>+<frac|2*\<eta\><rsup|H>|2*\<eta\>>|]>\<less\>0,>>>>>

    which, since <math|<around|\||2*\<eta\><rsup|H>|\|>\<leq\>1>, implies

    <eqnarray*|<tformat|<table|<row|<cell|2<around|\||\<eta\>|\|>>|<cell|\<leq\>>|<cell|<frac|\<alpha\><rsub|\<epsilon\>>|<around|(|1-\<alpha\><rsub|\<epsilon\>>|)>>=<frac|\<epsilon\>|1-\<epsilon\>>*<frac|h<around|(|x|)>|g<around|(|x|)>>.>>>>>

    Hence,

    <eqnarray|<tformat|<table|<row|<cell|R<around|(|<wide|\<eta\>|~>|)>-R<rsup|\<ast\>>>|<cell|\<leq\>>|<cell|2*\<bbb-E\>*<around*|[|<around|\||\<eta\>|\|>.*I<rsub|<around*|{|2<around|\||\<eta\>|\|>\<leq\><frac|\<epsilon\>|1-\<epsilon\>>*<frac|h<around|(|X|)>|g<around|(|X|)>>|}>>|]><eq-number><label|eq:L1>>>|<row|<cell|>|<cell|\<leq\>>|<cell|<frac|\<epsilon\>|1-\<epsilon\>>*\<bbb-E\>*<frac|h<around|(|X|)>|g<around|(|X|)>><eq-number><label|eq:L2>>>|<row|<cell|>|<cell|=>|<cell|<frac|\<epsilon\>|1-\<epsilon\>>.>>>>>

    The equality in <eqref|eq:L1> holds if and only if
    <math|2*\<eta\><rsup|H>=-<text|sign><around|(|\<eta\>|)>>, that is,
    <math|P<rsub|H>*<around|(|Y=1\|X=x|)>=1>
    when <math|\<eta\><around|(|x|)>\<less\>0>, and 0 otherwise; i.e., for
    the same observation <math|X=x>, the optimal rule under <math|H> assigns
    the completely opposite class membership to that under <math|G>.
    Further, the equality in <eqref|eq:L2> holds if and only if

    <\equation*>
      2<around|\||\<eta\><around|(|x|)>|\|>=<frac|\<epsilon\>|1-\<epsilon\>>*<frac|h<around|(|x|)>|g<around|(|x|)>>,
    </equation*>

    which implies

    <eqnarray*|<tformat|<table|<row|<cell|2*\<bbb-E\><around|\||\<eta\>|\|>>|<cell|=>|<cell|<frac|\<epsilon\>|1-\<epsilon\>>>>>>>

    since <math|<big|int>h<around|(|x|)>*d*x=1>. Thus,

    <\equation*>
      \<epsilon\>=<frac|2*\<bbb-E\><around|\||\<eta\>|\|>|1+2*\<bbb-E\><around|\||\<eta\>|\|>>=<frac|0.5-R<rsup|\<ast\>>|1-R<rsup|\<ast\>>>
    </equation*>

    by Lemma<nbsp><reference|lemma:riskinSign>. This concludes the proof.
  </proof>
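
  As a numerical sanity check of Theorem<nbsp><reference|theorem:DCbound2>,
  and not a substitute for the proof, the Python sketch below evaluates
  <math|R<around|(|<wide|\<eta\>|~>|)>-R<rsup|\<ast\>>> exactly on randomly
  generated discrete distributions and compares it with
  <math|\<epsilon\>/<around|(|1-\<epsilon\>|)>>. The discrete setting, the
  way <math|G>, <math|H> and <math|\<epsilon\>> are randomized, and all
  function names are assumptions made for illustration; the labels under
  <math|H> are taken to be deterministic so that the contamination is
  relatively unfavorable.

```python
import random


def sign(value):
    return 1 if value > 0 else -1


def excess_risk(g, eta, h, eta_h, eps):
    """R(eta_tilde) - R* for discrete G and H on a common finite support."""
    excess = 0.0
    for gx, e, hx, eh in zip(g, eta, h, eta_h):
        # alpha_eps(x) = eps * h(x) / [(1 - eps) * g(x) + eps * h(x)].
        alpha = eps * hx / ((1.0 - eps) * gx + eps * hx)
        eta_tilde = (1.0 - alpha) * e + alpha * eh
        # Loss is incurred only where the contaminated Bayes rule disagrees
        # with the clean Bayes rule; the cost there is 2 * |eta(x)| * g(x).
        if sign(eta_tilde) != sign(e):
            excess += 2.0 * abs(e) * gx
    return excess


def random_pmf(k, rng):
    """A random probability mass function with strictly positive entries."""
    weights = [rng.random() + 1e-9 for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(2)
worst_ratio = 0.0
for _ in range(2000):
    k = rng.randint(2, 8)
    eps = rng.uniform(0.01, 0.45)
    g, h = random_pmf(k, rng), random_pmf(k, rng)
    eta = [rng.uniform(-0.5, 0.5) for _ in range(k)]
    eta_h = [rng.choice([-0.5, 0.5]) for _ in range(k)]
    ratio = excess_risk(g, eta, h, eta_h, eps) / (eps / (1.0 - eps))
    worst_ratio = max(worst_ratio, ratio)

# A ratio of at most 1 means the bound held in every trial.
print(worst_ratio)
```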

  <\theorem>
    <label|theorem:consistencyDiffMeasure>Suppose a classification algorithm
    is universally consistent. Then, under data contamination
    model<nbsp><eqref|eq:DCModel>, we have

    <\equation*>
      R<around|(|<wide|f|~><rsub|n>|)>\<rightarrow\>R<around|(|<wide|\<eta\>|~>|)>
    </equation*>

    as <math|n\<rightarrow\>\<infty\>>.
  </theorem>

  The proof of Theorem<nbsp><reference|theorem:consistencyDiffMeasure> relies
  on the following lemma.

  <\lemma>
    <label|lemma:ruleConvergence>Assume <math|\<bbb-P\>*<around*|(|\<eta\><around|(|X|)>=0|)>=0>.
    If <math|R<around|(|f<rsub|n>|)>\<rightarrow\>R<rsup|\<ast\>>>, then the
    decision rule induced by <math|f<rsub|n>> converges to the Bayes rule in
    probability as <math|n\<rightarrow\>\<infty\>>.
  </lemma>

  <with|font-series|bold|Remark.> Theorem 2 of Bartlett and Tewari
  <cite|BartlettAmbuj2007> implies that the decision rule given by SVM
  converges to the Bayes rule. Lemma<nbsp><reference|lemma:ruleConvergence>
  is more general in that it applies to all consistent rules.

  <\proof>
    Without loss of generality, assume the decision function <math|f<rsub|n>>
    is already centered, i.e., the corresponding decision rule can be written
    as <math|I<rsub|<around|{|f<rsub|n>\<gtr\>0|}>>>. From
    Lemma<nbsp><reference|lemma:riskinSign>, we have

    <\equation*>
      R<around|(|f<rsub|n>|)>=0.5-\<bbb-E\><around*|[|\<eta\><around|(|X|)>.<text|sign><around*|(|f<rsub|n><around|(|X|)>|)>|]>.
    </equation*>

    Let

    <\equation*>
      \<xi\><rsub|n><around|(|x|)>=<around|\||<text|sign><around|(|\<eta\><around|(|x|)>|)>-<text|sign><around|(|f<rsub|n><around|(|x|)>|)>|\|>,
    </equation*>

    then <math|\<xi\><rsub|n><around|(|x|)>> takes values in
    <math|<around|{|0,2|}>>. We have

    <eqnarray*|<tformat|<table|<row|<cell|>|<cell|>|<cell|R<around|(|f<rsub|n>|)>-R<rsup|\<ast\>>>>|<row|<cell|>|<cell|=>|<cell|\<bbb-E\>*<around*|{|\<eta\><around|(|X|)>*<around*|[|<text|sign><around|(|\<eta\><around|(|X|)>|)>-<text|sign><around|(|f<rsub|n><around|(|X|)>|)>|]>|}>>>|<row|<cell|>|<cell|=>|<cell|\<bbb-E\>*<around*|[|<around|\||\<eta\><around|(|X|)>|\|>.*\<xi\><rsub|n><around|(|X|)>|]>.>>>>>

    Since <math|<around|\||\<eta\><around|(|X|)>|\|>\<gtr\>0> almost surely
    by assumption, <math|R<around|(|f<rsub|n>|)>\<rightarrow\>R<rsup|\<ast\>>>
    implies <math|\<bbb-P\>*<around|(|\<xi\><rsub|n><around|(|X|)>=2|)>\<rightarrow\>0>
    as <math|n\<rightarrow\>\<infty\>>. That is,
    <math|I<rsub|<around|{|f<rsub|n><around|(|X|)>\<gtr\>0|}>>> converges to
    <math|I<rsub|<around|{|\<eta\><around|(|X|)>\<gtr\>0|}>>> in probability
    as <math|n\<rightarrow\>\<infty\>>.
  </proof>

  <no-indent>

  <\proof>
    <dueto|Proof of Theorem<nbsp><reference|theorem:consistencyDiffMeasure>>By
    universal consistency and Lemma<nbsp><reference|lemma:ruleConvergence>,
    we have

    <\equation*>
      <big|int>I<rsub|<around*|{|<text|sign><around|(|<wide|f|~><rsub|n><around|(|x|)>|)>\<neq\><text|sign><around|(|<wide|\<eta\>|~><around|(|x|)>|)>|}>>*d*<wide|G|~><around|(|x|)>\<rightarrow\>0.
    </equation*>

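    Since <math|<around|(|1-\<epsilon\>|)>*G\<leq\><wide|G|~>> by
    model<nbsp><eqref|eq:DCModel>, it follows that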

    <\equation*>
      <big|int>I<rsub|<around*|{|<text|sign><around|(|<wide|f|~><rsub|n>|)>\<neq\><text|sign><around|(|<wide|\<eta\>|~>|)>|}>>*d*G\<rightarrow\>0,
    </equation*>

    implying that, as <math|n\<rightarrow\>\<infty\>>,

    <\equation*>
      R<around|(|<wide|f|~><rsub|n>|)>\<rightarrow\>R<around|(|<wide|\<eta\>|~>|)>.
    </equation*>
  </proof>

  By risk decomposition <eqref|eq:riskDecomp2> as well as
  Theorem<nbsp><reference|theorem:DCbound2> and
  Theorem<nbsp><reference|theorem:consistencyDiffMeasure>, we arrive at the
  following sharp asymptotic data contamination bound on
  <math|R<around|(|<wide|f|~><rsub|n>|)>-R<rsup|\<ast\>>>:

  <\equation>
    <label|eq:dataContBound2><frac|\<epsilon\>|1-\<epsilon\>>+O<around*|(|<frac|c<around|(|n|)>|<sqrt|n>>|)>,
  </equation>

  where <math|c<around|(|n|)>> is related to the complexity of the function
  class used by the classification algorithm which typically grows
  sufficiently slowly with <math|n> as compared to <math|<sqrt|n>>.

  Bound<nbsp><eqref|eq:dataContBound2> implies that, when the amount of data
  contamination is \Psmall", i.e., <math|\<epsilon\>\<rightarrow\>0>, we can
  make

  <\equation*>
    <around|\||R<around|(|<wide|f|~><rsub|n>|)>-R<rsup|\<ast\>>|\|>\<rightarrow\>0.
  </equation*>

  That is, as long as a classifier is consistent in the standard setting and
  the amount of contamination is small in the sense of a small
  <math|\<epsilon\>>, this classifier suffers very little from data
  contamination. This explains why, empirically, classifiers such as SVM or
  others work well even when a small fraction of labels are randomly flipped.

  Theorem<nbsp><reference|theorem:consistencyDiffMeasure> relies on the
  universal consistency of a classifier. Fortunately, several of the
  currently most popular classifiers are universally consistent, for example,
  SVM <cite|Steinwart2002> and AdaBoost with early stopping
  <cite|BartlettTraskin2007>.
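
  The behavior described above can be illustrated with a small simulation.
  The Python sketch below is only a toy: it uses a plug-in (majority vote per
  feature value) classifier on a five-point feature space as a simple
  stand-in for a universally consistent classifier such as SVM, trains it on
  a sample whose labels are flipped with probability <math|\<epsilon\>=0.1>,
  and evaluates its risk exactly under the clean distribution <math|G>. The
  distribution, the sample size and all names are assumptions made for
  illustration.

```python
import random

# Clean distribution G on X in {0, ..., 4}; the numbers are illustrative.
P_X = [0.2, 0.2, 0.2, 0.2, 0.2]            # marginal P(X = x)
POSTERIOR = [0.1, 0.3, 0.6, 0.8, 0.9]      # P(Y = 1 | X = x) under G


def draw_clean(rng):
    x = rng.randrange(5)
    y = 1 if rng.random() < POSTERIOR[x] else 0
    return x, y


def draw_contaminated(rng, eps):
    """Draw from G, then flip the label with probability eps."""
    x, y = draw_clean(rng)
    if rng.random() < eps:
        y = 1 - y
    return x, y


def fit_plugin(sample):
    """Predict 1 at x when more than half of the training labels at x are 1."""
    ones, counts = [0] * 5, [0] * 5
    for x, y in sample:
        counts[x] += 1
        ones[x] += y
    return [1 if 2 * o > c else 0 for o, c in zip(ones, counts)]


def clean_risk(rule):
    """Exact 0-1 risk of a rule under the clean distribution G."""
    return sum(px * ((1.0 - p1) if label == 1 else p1)
               for px, p1, label in zip(P_X, POSTERIOR, rule))


rng = random.Random(3)
eps = 0.1
training_sample = [draw_contaminated(rng, eps) for _ in range(20000)]
learned_rule = fit_plugin(training_sample)

bayes_risk = clean_risk([1 if p1 > 0.5 else 0 for p1 in POSTERIOR])
excess = clean_risk(learned_rule) - bayes_risk
bound = eps / (1.0 - eps)
print(excess, bound)
```

  Uniform label flipping shrinks the Bayes decision function by the factor
  <math|1-2*\<epsilon\>> without changing its sign, so in this toy setting
  the excess risk is expected to fall well below the bound.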

  <section|Experiments><label|section:simulation>

  Extensive simulations are performed on three different types of datasets:
  <math|3> synthetic datasets, <math|10> UC Irvine datasets <cite|UCI> and a
  simulated remote sensing image. For each dataset, five different types of
  data contamination are applied to the training set and classification
  accuracy is evaluated on the uncontaminated test set. SVM is used as the
  underlying classifier due to its universal consistency <cite|Steinwart2002>
  and the availability of a widely used software implementation (libsvm
  <cite|LIBSVM>).

  The five different types of data contamination are as follows.

  <\itemize>
    <item*|<math|C<rsub|0>>>Randomly flip the labels of a randomly selected
    subset of observations from a fixed class.

    <item*|<math|C<rsub|1>>>Randomly flip the labels of a randomly selected
    subset of observations.

    <item*|<math|C<rsub|2>>>Randomly select a subset of observations and
    replace the feature values of each with those of a randomly chosen
    observation (the labels are kept). Call this feature swapping.

    <item*|<math|C<rsub|c>>>Replace a randomly selected subset of
    observations with Cauchy data with the labels kept.

    <item*|<math|C<rsub|g>>>Replace a randomly selected subset of
    observations with Gaussian data with the labels kept.
  </itemize>
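  The label-flipping and feature-swapping schemes above can be sketched as
  follows. This is a minimal illustration only, not the code used in the
  experiments: the function names, the toy data and the rule of rounding
  the number of contaminated observations are assumptions made for the
  example.

```python
import random

def flip_labels(labels, eps, classes, rng):
    """C_1: flip the labels of a random eps-fraction of observations."""
    out = list(labels)
    n_bad = round(eps * len(out))  # assumed rounding rule
    for i in rng.sample(range(len(out)), n_bad):
        # Move the label uniformly to one of the other classes.
        out[i] = rng.choice([c for c in classes if c != out[i]])
    return out

def swap_features(features, eps, rng):
    """C_2: overwrite the features of a random subset with those of random rows."""
    out = [list(x) for x in features]
    n_bad = round(eps * len(out))
    for i in rng.sample(range(len(out)), n_bad):
        j = rng.randrange(len(features))  # donor row; labels stay untouched
        out[i] = list(features[j])
    return out

# Hypothetical toy data: 100 observations, two classes, 10% contamination.
rng = random.Random(0)
labels = [1] * 50 + [2] * 50
features = [[float(i), float(-i)] for i in range(100)]
noisy_labels = flip_labels(labels, 0.10, [1, 2], rng)
noisy_features = swap_features(features, 0.10, rng)
```

  Restricting the sampled indices to a single class before flipping would
  give the corresponding sketch for <math|C<rsub|0>>.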

  <\big-figure>
    <\with|par-mode|center>
      <space|0cm><image|plotFlipN.eps||||>

      \;
    </with>
  </big-figure|<label|figure:plotGaussian2D> <with|font-shape|italic|Scatter
  plot of <math|1000> observations generated i.i.d. from Gaussian
  <math|\<cal-N\><around|(|\<mu\>,\<Sigma\>|)>> with
  <math|\<mu\>=<around|(|1,0|)>> and <math|\<Sigma\>=A<rsup|t>*A> with
  entries of <math|A> generated i.i.d. from <math|\<cal-U\><around|[|0,1|]>>.
  Data from the two classes are represented as diamonds and solid circles,
  respectively. >>

  <\with|par-columns|1>
    <\big-figure>
      <\with|par-mode|center>
        <space|0cm><image|plotFlip05.eps||||> <image|plotFlip10.eps||||>

        <space|0cm><image|plotRep05.eps||||> <image|plotRep10.eps||||>

        <space|0cm><image|plotFlipg05.eps||||> <image|plotFlipg10.eps||||>

        <space|0cm><image|plotRepC05.eps||||> <image|plotRepC10.eps||||>

        <label|figure:plotGaussianDC>
      </with>
    </big-figure|<with|font-shape|italic|Illustration of the effect of
    different types of data contamination. The original Gaussian data are
    displayed in Figure<nbsp><reference|figure:plotGaussian2D>. The <math|4>
    rows of plots correspond to <math|C<rsub|1>,C<rsub|2>,C<rsub|g>,C<rsub|c>>,
    respectively and figures in the left and right columns are for data
    contamination at <math|5%> and <math|10%>, respectively. Data from the
    two classes are represented as diamonds and solid circles, respectively.
    >>
  </with>

  <no-indent><math|C<rsub|0>,C<rsub|1>,...,C<rsub|g>> are used to simulate
  data contaminations of different natures.

  <\itemize>
    <item><math|C<rsub|1>> and <math|C<rsub|2>> are expressly designed to
    simulate image mis-registration; we believe they capture important
    aspects of it.

    <item><math|C<rsub|c>> and <math|C<rsub|g>> are used to simulate gross
    errors. <math|C<rsub|g>> is for errors with a Gaussian nature while
    <math|C<rsub|c>> is for errors with a heavy tail, that is, the error
    could be very large and this is to simulate accidental human error, for
    example, a shift in decimal place of a number.

    <item>Additionally, we attempt to simulate extremely large errors by
    scaling the centers of the Gaussian and Cauchy by a factor of <math|100>,
    that is, the centers are multiplied by <math|100> coordinate-wise. These
    are denoted by <math|C<rsub|g*100>> and <math|C<rsub|c*100>>,
    respectively.

    <item><math|C<rsub|0>> is used to simulate a class of unfavorable
    situations where data contaminations occur in part of the data space.
    Such cases typically make classification more challenging. In contrast,
    other simulations are more or less average cases as the data
    contaminations occur uniformly across the whole data space.
  </itemize>

  For <math|C<rsub|g>>, the replacement Gaussian data is generated i.i.d.
  from <math|\<cal-N\><around|(|\<mu\>,\<Sigma\>|)>> with <math|\<mu\>> and
  <math|\<Sigma\>> calculated empirically on the non-contaminated training
  set. For <math|C<rsub|c>>, the Cauchy data is generated i.i.d. according to

  <\equation*>
    Z/W,<space|1em><text|for><nbsp>Z\<sim\>\<cal-N\><around|(|\<mu\>,\<Sigma\>|)>,<nbsp>W\<sim\><around|[|\<Gamma\><around|(|0.5,2|)>|]><rsup|1/2>
  </equation*>

  with <math|Z> and <math|W> independent. For each run, <math|\<mu\>> is
  generated uniformly from the interval <math|<around|[|min
  <around|(|X|)>,max <around|(|X|)>|]>> and <math|\<Sigma\>> estimated
  empirically from the training set.
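  A minimal sketch of drawing the Cauchy-type replacement data is given
  below. It assumes that <math|\<Gamma\><around|(|0.5,2|)>> denotes shape
  <math|0.5> and scale <math|2>, and it uses a hand-written Cholesky
  factorization so that only the Python standard library is needed; the
  function names and the numerical values of <math|\<mu\>> and
  <math|\<Sigma\>> are hypothetical.

```python
import math
import random

def cholesky(sigma):
    """Lower-triangular L with L L^T = sigma (sigma symmetric positive definite)."""
    d = len(sigma)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sigma[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def cauchy_type_draw(mu, sigma, rng):
    """One draw of Z / W with Z ~ N(mu, sigma) and W = sqrt(Gamma(0.5, 2))."""
    d = len(mu)
    low = cholesky(sigma)
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    # mu + L g has mean mu and covariance L L^T = sigma.
    z = [mu[i] + sum(low[i][k] * g[k] for k in range(i + 1)) for i in range(d)]
    # Assumption: Gamma(0.5, 2) is read as shape 0.5 and scale 2.
    w = math.sqrt(rng.gammavariate(0.5, 2.0))
    return [zi / w for zi in z]

# Hypothetical mean and covariance, for illustration only.
rng = random.Random(1)
sample = [cauchy_type_draw([1.0, 0.0], [[1.0, 0.3], [0.3, 2.0]], rng)
          for _ in range(5)]
```

  Under this reading, <math|W> has the distribution of the absolute value
  of a standard Gaussian variable, which is what gives the ratio its heavy
  tails.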

  For an illustration of the effect of these different types of data
  contamination, see Figure <reference|figure:plotGaussian2D> for the
  original data and Figure<nbsp><reference|figure:plotGaussianDC> for the
  data after contamination.

  <subsection|Synthetic data>

  The three synthetic datasets used in our experiment are the Gaussian
  mixture data, the four-class and the nested-square data. The Gaussian
  mixture data is used to simulate cases with a linear decision boundary
  while the four-class and the nested-square datasets are for cases where the
  decision boundary is highly nonlinear and non-convex. For each of the
  <math|3> datasets, we take <math|80%> for training and the rest for
  testing. Then <math|100> instances of data contamination are applied and
  the results (i.e., loss in classification accuracy) are averaged. This
  procedure is repeated and the results are averaged. The Gaussian kernel is
  used with SVM for all three synthetic datasets.

  The Gaussian mixture data are generated according to the following

  <\equation*>
    \<Delta\>*\<cal-N\><around|(|\<mu\>,\<Sigma\><rsub|10\<times\>10>|)>+<around|(|1-\<Delta\>|)>*\<cal-N\>(-\<mu\>,\<Sigma\><rsub|10\<times\>10>)
  </equation*>

  with <math|\<bbb-P\>*<around|(|\<Delta\>=1|)>=<frac|1|2>> and
  <math|\<Sigma\><rsub|10\<times\>10>=A<rsup|T>*A> for entries of <math|A>
  generated i.i.d. uniform from <math|<around|[|0,1|]>>, with
  <math|\<mu\>=<around|(|0.5,...,0.5|)><rsup|T>>. Data points with
  <math|\<Delta\>=1> are assigned label <math|1> and those with
  <math|\<Delta\>=0> are assigned label <math|2>. The sample sizes of the
  training set and test set are <math|1000> and <math|2000>, respectively.
  Loss in classification accuracy under data contaminations of different
  types and at different amounts is shown in
  Figure<nbsp><reference|figure:diffGaussian>. Note that here we use
  only the first term in <eqref|eq:dataContBound2> as an estimate of the
  overall loss in classification accuracy and ignore the second term;
  thus, when the training sample size is not large enough, some adjustment
  (of the order of <math|O*<around|(|c<around|(|n|)>/<sqrt|n>|)>>) might be
  required.
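  The Gaussian mixture described above can be simulated along the following
  lines. This is a sketch under stated assumptions rather than the original
  simulation code: the function name and the seed are hypothetical, and the
  covariance <math|A<rsup|T>*A> is obtained by multiplying standard Gaussian
  vectors by <math|A<rsup|T>>.

```python
import random

def gaussian_mixture(n, dim, rng):
    """Draw n labeled points from the two-component mixture with shared covariance."""
    a = [[rng.random() for _ in range(dim)] for _ in range(dim)]  # entries U[0, 1]
    mu = [0.5] * dim
    data = []
    for _ in range(n):
        delta = rng.random() < 0.5  # P(Delta = 1) = 1/2
        sign = 1.0 if delta else -1.0
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # A^T g has covariance A^T A, so x ~ N(sign * mu, A^T A).
        x = [sign * mu[j] + sum(a[k][j] * g[k] for k in range(dim))
             for j in range(dim)]
        data.append((x, 1 if delta else 2))
    return data

rng = random.Random(2)
train = gaussian_mixture(1000, 10, rng)
```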

  <\big-figure>
    <\with|par-mode|center>
      <space|0cm><image|diffGmix.eps||||>

      \;
    </with>
  </big-figure|<label|figure:diffGaussian> <with|font-shape|italic|Empirical
  and theoretical data contamination bound for data generated from a Gaussian
  mixture with <math|\<epsilon\>\<in\><around|{|0.01,0.02,0.03,0.04,0.05,0.10|}>>.>>

  <\big-figure>
    <\with|par-mode|center>
      <image|4class2.eps||||> <space|1em><image|nestsq.eps||||>

      \;
    </with>
  </big-figure|<label|figure:THdata> <with|font-shape|italic|The four-class
  and nested-square data. Different colors correspond to points from
  different classes.>>

  The four-class and nested-square datasets were originally used to
  demonstrate the superior performance of a class of projectable classifiers
  for data with a highly complex decision boundary <cite|HoKleinberg1996>.
  Figure<nbsp><reference|figure:THdata> is a plot of these two datasets and
  the data contamination bounds are shown in
  Figure<nbsp><reference|figure:diffBayesS1>. Note that the bound as
  established in <eqref|eq:dataContBound2> is for 2-class classification.
  When there are multiple classes, we can get a bound by repeated application
  of the 2-class bound. Let the class distribution be denoted by
  <math|<around|{|w<rsub|1>,...,w<rsub|J>|}>> such that
  <math|w<rsub|1>\<geq\>...\<geq\>w<rsub|J>>. Then we get the following
  multi-class bound

  <\equation*>
    <label|eq:mclassBound><frac|\<epsilon\>|1-\<epsilon\>>*<around*|{|1+<around|(|w<rsub|2>+...+w<rsub|J>|)>*\<alpha\>+...+<around|(|w<rsub|J-1>+w<rsub|J>|)>*\<alpha\><rsup|J-2>|}>
  </equation*>

  where <math|\<alpha\>=1-<frac|\<epsilon\>|1-\<epsilon\>>> and
  <math|\<epsilon\>> is the amount of contamination. This is used as the
  theoretical bound in our simulations when there are more than two classes.
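  For concreteness, the multi-class bound can be evaluated as in the sketch
  below. The function name is hypothetical, and the code assumes that the
  class proportions sum to one, so that the leading term <math|1> equals
  <math|w<rsub|1>+...+w<rsub|J>>; with two classes it reduces to
  <math|\<epsilon\>/<around|(|1-\<epsilon\>|)>>.

```python
def multiclass_bound(eps, weights):
    """First-order multi-class bound; weights are class proportions summing to 1."""
    w = sorted(weights, reverse=True)  # w_1 >= ... >= w_J
    base = eps / (1.0 - eps)
    alpha = 1.0 - base
    # Term k (k = 0, ..., J-2) is (w_{k+1} + ... + w_J) * alpha**k.
    total = sum(sum(w[k:]) * alpha ** k for k in range(len(w) - 1))
    return base * total

two_class = multiclass_bound(0.05, [0.5, 0.5])
three_class = multiclass_bound(0.05, [1 / 3, 1 / 3, 1 / 3])
```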

  <\big-figure>
    <\with|par-mode|center>
      <space|0cm><image|diff4class.eps||||> <image|diffNestsq.eps||||>

      \;
    </with>
  </big-figure|<label|figure:diffBayesS1> <with|font-shape|italic|Empirical
  and theoretical data contamination bound for the 4-class and the
  nested-square datasets with <math|\<epsilon\>\<in\><around|{|0.01,0.02,0.03,0.04,0.05,0.10|}>>.>>

  <subsection|UC Irvine datasets>

  A total of <math|10> datasets from the UC Irvine Machine Learning
  Repository <cite|UCI> are used in our experiment. A summary of these
  datasets is provided in Table<nbsp><reference|table:UCIdatasets> and more
  details can be found in <cite|UCI>.

  <\big-table>
    \;

    <tabular*|<tformat|<cwith|1|-1|1|1|cell-lborder|1ln>|<cwith|1|-1|1|1|cell-halign|c>|<cwith|1|-1|1|1|cell-rborder|2ln>|<cwith|1|-1|2|2|cell-halign|c>|<cwith|1|-1|2|2|cell-rborder|1ln>|<cwith|1|-1|3|3|cell-halign|c>|<cwith|1|-1|3|3|cell-rborder|1ln>|<cwith|1|-1|4|4|cell-halign|c>|<cwith|1|-1|4|4|cell-rborder|1ln>|<cwith|1|-1|5|5|cell-halign|c>|<cwith|1|-1|5|5|cell-rborder|1ln>|<cwith|1|-1|1|-1|cell-valign|c>|<cwith|1|1|1|-1|cell-tborder|1ln>|<cwith|1|1|1|-1|cell-bborder|1ln>|<cwith|2|2|1|-1|cell-bborder|1ln>|<cwith|3|3|1|-1|cell-bborder|1ln>|<cwith|4|4|1|-1|cell-bborder|1ln>|<cwith|5|5|1|-1|cell-bborder|1ln>|<cwith|6|6|1|-1|cell-bborder|1ln>|<cwith|7|7|1|-1|cell-bborder|1ln>|<cwith|8|8|1|-1|cell-bborder|1ln>|<cwith|9|9|1|-1|cell-bborder|1ln>|<cwith|10|10|1|-1|cell-bborder|1ln>|<cwith|11|11|1|-1|cell-bborder|1ln>|<table|<row|<cell|>|<cell|Training>|<cell|Testing>|<cell|Features>|<cell|Classes>>|<row|<cell|<math|i*m*a*g*e*S*e*g>>|<cell|210>|<cell|2100>|<cell|19>|<cell|7>>|<row|<cell|<math|V*o*w*e*l>>|<cell|528>|<cell|462>|<cell|10>|<cell|11>>|<row|<cell|<math|S*a*t*e*l*l*i*t*e<nbsp>i*m*a*g*e*s>>|<cell|4435>|<cell|2000>|<cell|36>|<cell|6>>|<row|<cell|<math|G*l*a*s*s>>|<cell|214>|<cell|-->|<cell|10>|<cell|6>>|<row|<cell|<math|V*e*h*i*c*l*e>>|<cell|946>|<cell|-->|<cell|18>|<cell|4>>|<row|<cell|<math|G*e*r*m*a*n<nbsp>c*r*e*d*i*t>>|<cell|1000>|<cell|-->|<cell|24>|<cell|2>>|<row|<cell|<math|Y*e*a*s*t>>|<cell|1484>|<cell|-->|<cell|8>|<cell|10>>|<row|<cell|<math|W*i*n*e<nbsp>q*u*a*l*i*t*y>>|<cell|1599>|<cell|-->|<cell|11>|<cell|6>>|<row|<cell|<math|M*u*s*k>>|<cell|6598>|<cell|-->|<cell|168>|<cell|2>>|<row|<cell|<math|M*a*g*i*c<nbsp>g*a*m*m*a>>|<cell|19020>|<cell|-->|<cell|10>|<cell|2>>>>>

    <label|table:UCIdatasets>
  </big-table|<with|font-shape|italic|Summary of the UC Irvine datasets used
  in our experiment. >>

  Some datasets come with predetermined training and test sets; these
  include the image segmentation, vowel and satellite image datasets.
  Otherwise we split the data into a training and a test set. For small to
  medium sized datasets, i.e., Glass, Vehicle, German Credit, Yeast and Wine
  Quality (red wine), we take <math|80%> of the data for training and the
  rest for testing. For large datasets, i.e., the Musk and Magic Gamma
  Telescope, <math|20%> and <math|10%>, respectively, of the data are set
  aside for training and the rest for testing. For each dataset, <math|100>
  instances of data contamination are applied to the training set and the
  resulting data contamination bounds are averaged. This procedure is
  repeated and the results are averaged.

  The Gaussian kernel is used for all except the image segmentation dataset
  where a polynomial kernel with degree <math|3> is used. Tuning parameters
  for SVM are chosen so that the classification performance matches that
  reported in the literature (see, for example, references cited in the
  description of each dataset in <cite|UCI>). Some datasets are linearly
  scaled to <math|<around|[|0,1|]>> to speed up the otherwise slow
  optimization of the SVM package; these are the Musk, Magic Gamma,
  Satellite image, Vehicle, and Wine quality datasets. The data
  contamination bounds by SVM on the UC Irvine datasets are plotted in
  Figure<nbsp><reference|figure:diffBayesS2>.

  <\with|par-columns|1>
    <\big-figure>
      <\with|par-mode|center>
        <space|0cm><image|diffMusk.eps||||> <image|diffGlass.eps||||>
        <image|diffVehicle.eps||||> <image|diffGcredit.eps||||>
        <image|diffWineq.eps||||> <image|diffImgSeg.eps||||>

        \;
      </with>
    </big-figure|<label|figure:diffBayesS2> <with|font-shape|italic|Empirical
    and theoretical data contamination bound for UC Irvine datasets (only
    <math|6> of them are shown here so that they can be placed in the same
    page, the rest are similar) with <math|\<epsilon\>\<in\><around|{|0.01,0.02,0.03,0.04,0.05,0.10|}>>.>>
  </with>

  <subsection|Remote sensing image>

  The remote sensing image used in the experiment is of a cropland with
  <math|5> different land-use classes. The image size is <math|596> by
  <math|529> pixels. The features of interest are taken from the annual
  vegetation index time series (see Figure<nbsp><reference|figure:Vindex>) at
  an interval of <math|30> days, among which <math|10> are used, each
  corresponding to one scene of the image at a different time of the year.
  The vegetation index is an optical measure of vegetation canopy greenness
  and is closely related to the photosynthetic potential of plants. For each
  pixel, random noise, generated from Gaussian
  <math|\<cal-N\><around|(|0,0.01|)>>, is applied.

  <\big-figure>
    <\with|par-mode|center>
      <space|0cm><image|Vindex.eps||||>

      \;
    </with>
  </big-figure|<label|figure:Vindex> <with|font-shape|italic|The annual
  vegetation index. The x-axis is the day of a year and different colors
  indicate different land classes.>>

  To simulate the acquisition of remote sensing images, the following
  procedure is performed on each of the <math|10> scenes of image.

  <\enumerate>
    <item>Rotate all images clockwise by <math|10> degrees.

    <item>Re-sample each scene of image using a randomly generated offset
    from <math|\<cal-N\><around|(|0,0.01|)>>.

    <item>Remove the blank edges in all images that are caused by rotation
    and re-sampling.
  </enumerate>

  In Step 2 above, offsets are generated from the Gaussian distribution
  stated there and bilinear interpolation <cite|GomesDCV1998> is applied
  during re-sampling. As a result, <math|247> by <math|233> pixel
  multi-temporal vegetation index images for the cropland of interest are
  generated.

  To assess the impact of image mis-registration on the task of
  classification, two mis-registered images (corresponding to Case I and II
  in Table<nbsp><reference|table:RSimage>, respectively) are generated under
  different levels of mis-registration (roughly corresponding to <math|3%>
  and <math|4%> data contamination, respectively). The SVM classifier is
  trained on a sample from the original image and the mis-registered image,
  respectively, and then tested on a sample taken from the original image. We
  use the data in a fashion similar to 5-fold cross-validation, i.e.,
  select <math|4> folds for training and the rest for testing. Table
  <reference|table:RSimage> reports the classification accuracy. We can see
  that, in both cases, the loss in classification accuracy is small and is
  well bounded by our theoretical prediction.

  <\big-table>
    \;

    <tabular*|<tformat|<cwith|1|-1|1|1|cell-lborder|1ln>|<cwith|1|-1|1|1|cell-halign|c>|<cwith|1|-1|1|1|cell-rborder|2ln>|<cwith|1|-1|2|2|cell-halign|c>|<cwith|1|-1|2|2|cell-rborder|1ln>|<cwith|1|-1|3|3|cell-halign|c>|<cwith|1|-1|3|3|cell-rborder|1ln>|<cwith|1|-1|4|4|cell-halign|c>|<cwith|1|-1|4|4|cell-rborder|1ln>|<cwith|1|-1|5|5|cell-halign|c>|<cwith|1|-1|5|5|cell-rborder|1ln>|<cwith|1|-1|6|6|cell-halign|c>|<cwith|1|-1|6|6|cell-rborder|1ln>|<cwith|1|-1|7|7|cell-halign|c>|<cwith|1|-1|7|7|cell-rborder|1ln>|<cwith|1|-1|8|8|cell-halign|c>|<cwith|1|-1|8|8|cell-rborder|1ln>|<cwith|1|-1|1|-1|cell-valign|c>|<cwith|1|1|1|-1|cell-tborder|1ln>|<cwith|1|1|1|-1|cell-bborder|1ln>|<cwith|2|2|1|-1|cell-bborder|1ln>|<cwith|3|3|1|-1|cell-bborder|1ln>|<cwith|4|4|1|-1|cell-bborder|1ln>|<table|<row|<cell|Fold>|<cell|1>|<cell|2>|<cell|3>|<cell|4>|<cell|5>|<cell|Average>>|<row|<cell|<math|O*r*i*g*i*n*a*l>>|<cell|98.13>|<cell|98.14>|<cell|97.90>|<cell|97.94>|<cell|97.98>|<cell|98.02>>|<row|<cell|Case
    I>|<cell|98.08>|<cell|98.11>|<cell|97.92>|<cell|97.94>|<cell|97.98>|<cell|98.01>>|<row|<cell|Case
    II>|<cell|98.09>|<cell|98.10>|<cell|97.92>|<cell|97.92>|<cell|97.96>|<cell|97.99>>>>>

    <label|table:RSimage>
  </big-table|<with|font-shape|italic|Accuracy of SVM for the cropland remote
  sensing image under different amounts of image mis-registration. Each of the
  first <math|5> columns corresponds to one of the <math|5> folds. \ >>

  It is known that the effect of mis-registration on image classification
  varies with the relative size of the ground area corresponding to an image
  pixel (call this the pixel size) and the actual homogeneity (larger numbers
  correspond to more homogeneity) of an area. If the ratio of these two
  numbers is small, then the damage of mis-registration is small; otherwise
  it is large. Since we are using a crop field here and the corresponding
  pixel size is much smaller than the crop field, the effect of data
  contamination is small. If the pixel size is close to the actual object
  size, then a mis-registration of half a pixel may cause more damage.

  <subsection|Some empirical results on Adaboost>

  So far SVM has been used as the underlying classifier in our experiment;
  other universally consistent classifiers such as Adaboost are applicable as
  well. Instead of repeating the experiment for Adaboost, we collect results
  found in the literature <cite|FreundSchapire1996|Dietterich1998|RF> and
  summarize them in Table<nbsp><reference|table:adaboost>. Note that here we
  simply adopt the existing results, which corresponds to taking
  <math|\<epsilon\>=0.05> only.

  <\big-table>
    \;

    <tabular*|<tformat|<cwith|1|-1|1|1|cell-lborder|1ln>|<cwith|1|-1|1|1|cell-halign|c>|<cwith|1|-1|1|1|cell-rborder|2ln>|<cwith|1|-1|2|2|cell-halign|c>|<cwith|1|-1|2|2|cell-rborder|1ln>|<cwith|1|-1|3|3|cell-halign|c>|<cwith|1|-1|3|3|cell-rborder|1ln>|<cwith|1|-1|4|4|cell-halign|c>|<cwith|1|-1|4|4|cell-rborder|1ln>|<cwith|1|-1|1|-1|cell-valign|c>|<cwith|1|1|1|-1|cell-tborder|1ln>|<cwith|1|1|1|-1|cell-bborder|1ln>|<cwith|2|2|1|-1|cell-bborder|1ln>|<cwith|3|3|1|-1|cell-bborder|1ln>|<cwith|4|4|1|-1|cell-bborder|1ln>|<cwith|5|5|1|-1|cell-bborder|1ln>|<cwith|6|6|1|-1|cell-bborder|1ln>|<cwith|7|7|1|-1|cell-bborder|1ln>|<cwith|8|8|1|-1|cell-bborder|1ln>|<cwith|9|9|1|-1|cell-bborder|1ln>|<cwith|10|10|1|-1|cell-bborder|1ln>|<table|<row|<cell|>|<cell|Original
    data>|<cell|5% labels flipped>|<cell|Difference>>|<row|<cell|<math|G*l*a*s*s>>|<cell|22.00%>|<cell|22.35%>|<cell|0.35%>>|<row|<cell|<math|B*r*e*a*s*t<nbsp>c*a*n*c*e*r>>|<cell|<nbsp>3.20%>|<cell|<nbsp>4.58%>|<cell|1.38%>>|<row|<cell|<math|D*i*a*b*e*t*e*s>>|<cell|26.60%>|<cell|28.41%>|<cell|1.81%>>|<row|<cell|<math|S*o*n*a*r>>|<cell|15.60%>|<cell|17.96%>|<cell|2.36%>>|<row|<cell|<math|I*o*n*s*p*h*e*r*e>>|<cell|<nbsp>6.40%>|<cell|<nbsp>8.17%>|<cell|1.77%>>|<row|<cell|<math|S*o*y*b*e*a*n>>|<cell|<nbsp>7.57%>|<cell|<nbsp>9.61%>|<cell|2.04%>>|<row|<cell|<math|E*c*o*l*i>>|<cell|14.80%>|<cell|15.91%>|<cell|1.11%>>|<row|<cell|<math|V*o*t*e*s>>|<cell|<nbsp>4.80%>|<cell|<nbsp>7.14%>|<cell|2.34%>>|<row|<cell|<math|L*i*v*e*r>>|<cell|30.70%>|<cell|33.86%>|<cell|3.16%>>>>>

    <label|table:adaboost>
  </big-table|<with|font-shape|italic|Error rates of Adaboost on some UC
  Irvine datasets where 90% of the data are used as the training set. Results
  are shown for the original data and when 5% of the class labels in the
  training set are randomly flipped (uniformly into an alternate class).
  Results are adopted from <cite|RF|FreundSchapire1996|Dietterich1998> and
  then converted. >>
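  The differences reported in Table<nbsp><reference|table:adaboost> can be
  set against the first-order term of the bound at
  <math|\<epsilon\>=0.05>. The short check below copies the error rates from
  the table; the variable names are hypothetical and the
  <math|O*<around|(|c<around|(|n|)>/<sqrt|n>|)>> term is ignored.

```python
# Error rates (%) copied from the table: (original data, 5% labels flipped).
adaboost = {
    "Glass": (22.00, 22.35),
    "Breast cancer": (3.20, 4.58),
    "Diabetes": (26.60, 28.41),
    "Sonar": (15.60, 17.96),
    "Ionosphere": (6.40, 8.17),
    "Soybean": (7.57, 9.61),
    "Ecoli": (14.80, 15.91),
    "Votes": (4.80, 7.14),
    "Liver": (30.70, 33.86),
}

eps = 0.05
bound = 100.0 * eps / (1.0 - eps)  # first-order term, in percentage points
losses = {name: flipped - clean for name, (clean, flipped) in adaboost.items()}
worst = max(losses, key=losses.get)
print(f"bound = {bound:.2f}, largest loss = {losses[worst]:.2f} ({worst})")
```

  Every difference in the table falls below this two-class value; for the
  datasets with more than two classes the multi-class bound is larger still.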

  <subsection|Estimating the amount of data
  contamination><label|section:amtDC>

  Using data contamination bound <eqref|eq:dataContBound2>, we can estimate
  the loss in accuracy for classifiers trained with contaminated data. The
  remaining question is to give a (rough) estimate of the amount of data
  contamination. This is a question we would like to leave to future work.

  In the special case of image mis-registration, we propose two simple
  heuristics for estimating the amount of data contamination. Both are based
  on the heuristic that the image pixels affected by mis-registration are
  roughly those near the boundary between different land classes. Thus the
  proportion of boundary pixels serves as a good indication of the amount of
  data contamination. Here the underlying assumption is that the proportion
  of boundary pixels is roughly the same in the true and the mis-registered
  images.

  One approach is based on sampling. A number, say <math|100> to <math|200>,
  of pixels are randomly sampled from the image; we then determine, by
  visual inspection, the proportion of sampled pixels that fall on the
  boundary. Another estimate is based on the classification results by a
  classifier trained on the contaminated data. For each pixel in the image,
  we determine whether it is on the boundary by the following heuristic: take
  a <math|3\<times\>3> patch centered on it. If at least two pixels
  within the patch have class labels different from the rest, then
  declare the pixel at the center of the patch to be on the boundary.
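  The patch-based heuristic can be sketched as follows. The sketch makes two
  assumptions that the description leaves open: \Pdifferent from the rest\Q
  is read as different from the label of the center pixel, and patches are
  clipped at the image border. The function name and the toy label map are
  hypothetical.

```python
def boundary_fraction(label_map, min_diff=2):
    """Fraction of pixels flagged as boundary pixels by the 3x3 patch rule."""
    rows, cols = len(label_map), len(label_map[0])
    flagged = 0
    for r in range(rows):
        for c in range(cols):
            center = label_map[r][c]
            # Count patch pixels whose label differs from the center pixel;
            # the patch is clipped at the image border (an assumption).
            diff = sum(
                label_map[i][j] != center
                for i in range(max(r - 1, 0), min(r + 2, rows))
                for j in range(max(c - 1, 0), min(c + 2, cols))
            )
            flagged += diff >= min_diff
    return flagged / (rows * cols)

# Toy 4x6 label map with a vertical boundary between classes 1 and 2.
toy = [[1, 1, 1, 2, 2, 2] for _ in range(4)]
estimate = boundary_fraction(toy)
```

  The returned fraction plays the role of a rough estimate of
  <math|\<epsilon\>>, which can then be substituted into
  <eqref|eq:dataContBound2>.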

  <section|Conclusion and discussion><label|section:conclusion>

  We formulate the problem of image mis-registration as data contamination
  and equip it with a statistical model. This model captures a very general
  class of errors, for instance, measurement errors and gross errors that can
  be formulated as label-flipping, feature-swapping, or feature replacement
  by any proper distributions. Under a statistical learning theoretical
  framework, we derive an asymptotic bound for the loss in classification
  accuracy due to data contamination. One nice feature of this bound is
  that it is essentially distribution-free and thus applies to all different
  types of data. Extensive simulations on both synthetic and real datasets
  under various types of data contaminations show that the data contamination
  bound we derive is fairly tight.

  As we have already discussed, our data contamination model can capture
  various types of errors such as image mis-registration, label noise and
  accidental human errors. Beyond that, we can also use data contamination as
  a useful device. We give here an example in the setting of co-training
  (<cite|Yarowsky1995|BlumMitchell1998|CollinsSinger1999>). Empirically, it
  has been shown that co-training can significantly boost the classification
  accuracy when the training sample size is extremely small, e.g., <math|12>
  in <cite|BlumMitchell1998> for web page classification and <math|6> in
  <cite|Nigam2000> for newsgroup classification. Theoretical work has been
  carried out to understand the success of co-training (see, for instance,
  <cite|BlumMitchell1998|DasguptaLM2001>). We provide here a different
  perspective. In co-training, starting from a small amount of labeled
  examples, the algorithm progressively enlarges the labeled set by
  transferring those examples which are originally unlabeled but are
  classified with high confidence by the classifier built from the labeled
  data available so far. This amounts to enlarging the labeled set with a
  small amount of label noise; the label noise here is small because those
  examples which are being transferred are classified with high confidence.
  Assume that at a certain point we have <math|n> examples in the labeled set
  and that <math|n> is large; then, by our analysis (cf.
  <eqref|eq:dataContBound2>), the additional classification error w.r.t. that
  resulting from a clean labeled set (of size <math|n>) is no more than
  <math|\<epsilon\>/<around|(|1-\<epsilon\>|)>+O*<around|(|c<around|(|n|)>/<sqrt|n>|)>>
  for <math|c<around|(|n|)>/<sqrt|n>\<rightarrow\>0> as <math|n> grows. Thus,
  loosely speaking,

  <eqnarray*|<tformat|<table|<row|<cell|>|<cell|>|<cell|E*r*r<around|(|<text|Bayes
  classifier on >G|)>>>|<row|<cell|>|<cell|\<leq\>>|<cell|E*r*r<around|(|<text|Classifier
  learned on <math|n> observations from ><wide|G|~>|)>>>|<row|<cell|>|<cell|\<leq\>>|<cell|E*r*r<around|(|<text|Bayes
  classifier on >G|)>+<frac|\<epsilon\>|1-\<epsilon\>>+O<around*|(|<frac|c<around|(|n|)>|<sqrt|n>>|)>>>>>>

  where <math|E*r*r> denotes the error rate. Here, we use <math|G> and
  <math|<wide|G|~>> to denote the data with clean labels and that containing
  labels assigned by the co-training algorithm, respectively. It is clear
  that the error rate achieved by co-training equals that by a classifier
  learned on <math|n> observations from <math|<wide|G|~>>. However, the error
  rate by a classifier learned on <math|l> labeled examples from <math|G> is
  typically much larger, i.e.,

  <eqnarray|<tformat|<table|<row|<cell|>|<cell|>|<cell|E*r*r<around|(|<text|Classifier
  learned on <math|l> examples from >G|)>>>|<row|<cell|>|<cell|\<gg\>>|<cell|E*r*r<around|(|<text|Bayes
  classifier on >G|)>+<frac|\<epsilon\>|1-\<epsilon\>>+O<around*|(|<frac|c<around|(|n|)>|<sqrt|n>>|)><eq-number><label|eq:gapCotrain>>>>>>

  if <math|l> is small, <math|\<epsilon\>> is small and <math|n> is large.
  The gap between the two quantities in <eqref|eq:gapCotrain> is the
  \Pbenefit\Q of co-training. This explains why co-training may be feasible
  with a small amount of initial labeled examples. Since the gap in
  <eqref|eq:gapCotrain> shrinks as <math|l> increases, this also
  explains why co-training may not help much when the initial labeled
  set is large.

  A limitation of our data contamination model<nbsp><eqref|eq:DCModel> is
  that, in modeling the phenomenon of image mis-registration with a data
  contamination model, i.i.d. contaminations are assumed. However, in
  practice the mis-registered image pixels may be correlated in some way. It
  is thus desirable to take this into account in the model, which we shall
  leave to future work. Note also that we derive the data contamination bound
  under a general class of data distributions; it would be desirable to take
  advantage of knowledge of the underlying distribution to obtain a sharper
  bound.

  <section*|Acknowledgement>

  The authors would like to thank Tin Kam Ho at Bell Labs for kindly
  providing the four-class and nested-square datasets.

  <\bibliography|bib|plain|myBib>
    <bib-list|[99]|>
  </bibliography>
</body>