<TeXmacs|1.99.7>

<style|<tuple|article|std-latex>>

<\body>
  <\hide-preamble>
    <assign|baselinestretch|<macro|1.1>>

    <new-theorem|theorem|Theorem>

    <new-theorem|proposition|Proposition>

    <new-theorem|lemma|Lemma>

    <new-theorem|corollary|Corollary>
  </hide-preamble>

  <doc-data|<doc-title|Prior Ordering and Monotonicity in Dirichlet
  Bandits>|<doc-author|<author-data|<author-name|Yaming
  Yu<next-line><with|font-size|0.84|Department of
  Statistics<next-line><vspace*|-0.8ex> University of
  California<next-line><vspace*|-0.8ex> Irvine, CA 92697,
  USA<next-line><vspace*|-0.8ex> <with|font-size|0.84|font-family|tt|yamingy@uci.edu>>>>>|<doc-date|>>

  <abstract-data|<\abstract>
    One of two independent stochastic processes (arms) is to be selected at
    each of <math|n> stages. The selection is sequential and depends on past
    observations as well as the prior information. Observations from arm
    <math|i> are independent given a distribution <math|P<rsub|i>>, and,
    following Clayton and Berry (1985), <math|P<rsub|i>>'s have independent
    Dirichlet process priors. The objective is to maximize the expected
    future-discounted sum of the <math|n> observations. We study structural
    properties of the bandit, in particular how the maximum expected payoff
    and the optimal strategy vary with the Dirichlet process priors. The main
    results are (i) for a particular arm and a fixed prior weight, the
    maximum expected payoff increases as the mean of the Dirichlet process
    prior becomes larger in the increasing convex order; (ii) for a fixed
    prior mean, the maximum expected payoff decreases as the prior weight
    increases. Specializing to the one-armed bandit, the second result
    captures the intuition that, given the same immediate payoff, the more is
    known about an arm, the less desirable it becomes because there is less
    to learn when selecting that arm. This extends some results of Gittins
    and Wang (1992) on Bernoulli bandits and settles a conjecture of Clayton
    and Berry (1985).

    <with|font-series|bold|Keywords:> convex order; Dirichlet bandits;
    sequential decision; two-armed bandits.

    <with|font-series|bold|MSC 2010:> Primary 62L05, 62C10; Secondary 62L15,
    60E15.
  </abstract>>

  <section|Introduction>

  Bandit problems are classical problems in statistical decision theory and
  have received considerable attention; see Berry and Fristedt (1985) for an
  overview. We consider discrete-time, finite-horizon, two-armed bandits from
  a Bayesian perspective. At each of <math|n> stages, an observation is taken
  from one of two stochastic processes (arms). A
  <with|font-shape|italic|strategy> specifies which process to select based
  on past observations. The objective is to maximize the expected payoff,
  <math|<big|sum><rsub|i=1><rsup|n>a<rsub|i>*Z<rsub|i>>, where
  <math|Z<rsub|i>> is the observation at stage <math|i> and
  <math|A<rsub|n>\<equiv\><around|(|a<rsub|1>,a<rsub|2>,\<ldots\>,a<rsub|n>|)>>
  is a discount sequence satisfying <math|a<rsub|i>\<geq\>0> and
  <math|<big|sum><rsub|i=1><rsup|n>a<rsub|i>\<gtr\>0>. A strategy is optimal
  if it achieves the maximum expected payoff. An arm is optimal initially if
  there exists an optimal strategy that selects that arm at the first stage.

  The most widely studied bandit problem is the Bernoulli bandit, where each
  arm generates a sequence of exchangeable Bernoulli random variables.
  Bernoulli bandits are important as a model for clinical trials. Others such
  as normal bandits have also been extensively studied (Chernoff 1968).
  Extending the Bernoulli bandit, Clayton and Berry (1985) have introduced a
  one-armed Bayesian nonparametric bandit using Dirichlet process priors
  (Ferguson 1973). Chattopadhyay (1994) extends this and studies the
  two-armed Dirichlet bandit, which is also the setting of this work. Associated
  with arms <math|1> and <math|2> are probability measures
  <math|P<rsub|i>,i=1,2>, respectively. Observations from arm <math|i> are
  independent samples given <math|P<rsub|i>>; observations from different
  arms are independent. The <math|P<rsub|i>>'s themselves are treated as
  random, with independent Dirichlet process priors. Specifically,
  <math|P<rsub|i>\<sim\><math-up|DP><around|(|\<alpha\><rsub|i>|)>>,
  where <math|\<alpha\><rsub|i>> is a finite nonnull measure with a finite
  first moment. It is often helpful to write
  <math|\<alpha\><rsub|i>=M<rsub|i>*F<rsub|i>> where
  <math|M<rsub|i>=\<alpha\><rsub|i><around|(|<math-bf|R>|)>> so that
  <math|F<rsub|i>> is a probability distribution. We refer to
  <math|F<rsub|i>> and <math|M<rsub|i>> as the prior mean distribution and
  prior weight of the Dirichlet process, respectively. We use
  <math|<around|(|\<alpha\><rsub|1>,\<alpha\><rsub|2>;A<rsub|n>|)>> to denote
  such a Dirichlet bandit with discount sequence <math|A<rsub|n>>.

  For such problems one must balance the desire to maximize the immediate
  payoff and the need to explore a less known arm in the hope of higher
  payoff later on (the exploitation versus exploration dilemma). Optimal
  strategies are usually specified through backward induction and are
  nontrivial to compute. Nevertheless, certain structural properties such as
  the stay-on-a-winner rule (Bradt, Johnson and Karlin 1956; Berry 1972)
  often hold under suitable conditions. For Dirichlet bandits with known arm
  2, Clayton and Berry (1985) obtain several structural results. In
  particular, the maximum expected payoff increases as <math|F<rsub|1>>, the
  mean of the Dirichlet process prior for arm 1, increases in the usual
  stochastic order. Also, a version of the stay-on-a-winner rule holds: if
  arm 1 is optimal initially then it is optimal at the next stage provided
  that the initial observation from arm 1 is sufficiently large. Such results
  have been extended to the general two-armed Dirichlet bandits
  (Chattopadhyay 1994).

  This paper studies further structural properties of Dirichlet bandits, in
  particular how the value of the bandit (i.e., the maximum expected payoff)
  varies with the Dirichlet process priors. The main results are (i) the
  value increases as the mean of the Dirichlet process for any arm becomes
  larger in the increasing convex order (defined below); (ii) the value
  decreases as the prior weight of the Dirichlet process of an arm increases.
  The second result agrees with the intuition that, given the same immediate
  payoff, an arm is less appealing when more is known about it, because there
  remains less to be explored. Though easy to state and intuitively
  appealing, such results are often difficult to prove. We mention a
  long-standing conjecture of Berry (1972), which states that for a
  finite-horizon Bernoulli two-armed bandit with uniform discounting and
  independent <with|font-family|rm|Beta><math|<around|(|u<rsub|i>,v<rsub|i>|)>>
  priors, <math|i=1,2>, for arms 1 and 2 respectively, if
  <math|u<rsub|1>/v<rsub|1>=u<rsub|2>/v<rsub|2>> and
  <math|u<rsub|1>+v<rsub|1>\<less\>u<rsub|2>+v<rsub|2>>, then arm 1 is
  preferred to arm 2 at the initial pull. If, instead of finite-horizon
  uniform discounting, we assume infinite-horizon geometric discounting, then
  the corresponding conjecture is true, as shown by Gittins and Wang (1992),
  who also prove analogous results for some other parametric bandits.
  Geometric discounting is special in that the optimal strategy for a
  multi-armed bandit is characterized by a \Pdynamic allocation index,\Q or
  Gittins index (Gittins and Jones 1974; Gittins 1979; Whittle 1980), which
  reduces the problem to several one-armed bandits.

  As the Bernoulli bandit is a special case of the Dirichlet bandit, our
  results may be regarded as a generalization of Gittins and Wang (1992),
  although our method of proof, based on convexity and stochastic orders, is
  different. Our main result (Corollary<nbsp><reference|coro2>) confirms a
  conjecture of Clayton and Berry (1985) concerning the break-even value in
  the one-armed Dirichlet bandit. We also prove another conjecture of Clayton
  and Berry (1985) concerning the break-even observation when both arms are
  optimal initially (Proposition<nbsp><reference|prop1>). These results will
  hopefully shed some light on the conjecture of Berry (1972). See Herschkorn
  (1997) for related results and conjectures on the Bernoulli bandit.

  We find the usual stochastic order, the convex order and the increasing
  convex order particularly helpful in formulating and deriving the main
  results. For random variables <math|Z<rsub|1>> and <math|Z<rsub|2>> taking
  values on <math|<math-bf|R>>, we write <math|Z<rsub|1>\<leq\><rsub|<math-up|st>>Z<rsub|2>>
  (respectively, <math|Z<rsub|1>\<leq\><rsub|<math-up|cx>>Z<rsub|2>>), if

  <\equation>
    <label|orders>E*\<phi\><around|(|Z<rsub|1>|)>\<leq\>E*\<phi\><around|(|Z<rsub|2>|)>
  </equation>

  for every increasing (respectively, convex) function <math|\<phi\>> such
  that the expectations exist. If <math|Z<rsub|1>\<leq\><rsub|<math-up|st>>Z<rsub|2>>
  then we also say <math|Z<rsub|2>> is to the right of <math|Z<rsub|1>>. We
  say <math|Z<rsub|1>> is smaller than <math|Z<rsub|2>> in the increasing
  convex order, written as <math|Z<rsub|1>\<leq\><rsub|<math-up|icx>>Z<rsub|2>>,
  if (<reference|orders>) holds for every increasing and convex function
  <math|\<phi\>> such that the expectations exist. Hence
  <math|\<leq\><rsub|<math-up|icx>>> is implied by either
  <math|\<leq\><rsub|<math-up|st>>> or <math|\<leq\><rsub|<math-up|cx>>>. The
  convex order is concerned with variability. For example, if
  <math|Z<rsub|1>\<leq\><rsub|<math-up|cx>>Z<rsub|2>>, both with finite
  second moments, then <math|E*Z<rsub|1>=E*Z<rsub|2>> and
  <math|<math-up|Var><around|(|Z<rsub|1>|)>\<leq\><math-up|Var><around|(|Z<rsub|2>|)>>.
  Another basic property is closure under mixtures: if distributions
  <math|F<rsub|i>,G<rsub|i>,i=1,2>, satisfy
  <math|F<rsub|1>\<leq\><rsub|<math-up|cx>>F<rsub|2>> and
  <math|G<rsub|1>\<leq\><rsub|<math-up|cx>>G<rsub|2>> then
  <math|\<rho\>*F<rsub|1>+<around|(|1-\<rho\>|)>*G<rsub|1>\<leq\><rsub|<math-up|cx>>\<rho\>*F<rsub|2>+<around|(|1-\<rho\>|)>*G<rsub|2>,\<rho\>\<in\><around|[|0,1|]>>;
  closure under mixtures also holds for <math|\<leq\><rsub|<math-up|icx>>>
  and <math|\<leq\><rsub|<math-up|st>>>. (We use the notation
  <math|\<leq\><rsub|<math-up|st>>,\<leq\><rsub|<math-up|cx>>,\<leq\><rsub|<math-up|icx>>>
  with distribution functions as well as random variables.) For further
  properties and applications of various stochastic orders, see Müller and
  Stoyan (2002) and Shaked and Shanthikumar (2007).
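
  The two moment consequences of the convex order mentioned above follow by
  choosing specific test functions in (<reference|orders>); the short
  derivation is recorded here as a sketch, assuming finite second moments.
  Both <math|\<phi\><around|(|x|)>=x> and <math|\<phi\><around|(|x|)>=-x> are
  convex, so <math|Z<rsub|1>\<leq\><rsub|<math-up|cx>>Z<rsub|2>> gives
  <math|E*Z<rsub|1>\<leq\>E*Z<rsub|2>> and
  <math|-E*Z<rsub|1>\<leq\>-E*Z<rsub|2>>, hence equal means. Taking
  <math|\<phi\><around|(|x|)>=x<rsup|2>> and subtracting the common squared
  mean then yields

  <\equation*>
    <math-up|Var><around|(|Z<rsub|1>|)>=E*Z<rsub|1><rsup|2>-<around|(|E*Z<rsub|1>|)><rsup|2>\<leq\>E*Z<rsub|2><rsup|2>-<around|(|E*Z<rsub|2>|)><rsup|2>=<math-up|Var><around|(|Z<rsub|2>|)>.
  </equation*>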

  <section|Prior mean monotonicity>

  Let us denote the maximum expected payoff of a two-armed Dirichlet bandit
  <math|<around|(|\<alpha\><rsub|1>,\<alpha\><rsub|2>;A<rsub|n>|)>> by
  <math|W<around|(|\<alpha\><rsub|1>,\<alpha\><rsub|2>;A<rsub|n>|)>>. Let
  <math|W<rsup|i><around|(|\<alpha\><rsub|1>,\<alpha\><rsub|2>;A<rsub|n>|)>>
  be the expected payoff when selecting arm <math|i> initially and using an
  optimal strategy thereafter. Then

  <\equation>
    <label|Wdef1>W<around|(|\<alpha\><rsub|1>,\<alpha\><rsub|2>;A<rsub|n>|)>=max
    <around*|{|W<rsup|1><around|(|\<alpha\><rsub|1>,\<alpha\><rsub|2>;A<rsub|n>|)>,W<rsup|2><around|(|\<alpha\><rsub|1>,\<alpha\><rsub|2>;A<rsub|n>|)>|}>.
  </equation>

  Suppose arm 1 is selected initially, resulting in an observation <math|X>.
  Because the prior on <math|P<rsub|1>> is a Dirichlet process, the posterior
  is again a Dirichlet process <with|font-family|rm|DP><math|<around|(|\<alpha\><rsub|1>+\<delta\><rsub|X>|)>>,
  where <math|\<delta\><rsub|x>> denotes a point mass at <math|x>. Thus we
  have

  <align|<tformat|<table|<row|<cell|<label|Wdef2>W<rsup|1><around|(|\<alpha\><rsub|1>,\<alpha\><rsub|2>;A<rsub|n>|)>>|<cell|=a<rsub|1>*\<mu\><rsub|1>+<around*|\<nobracket\>|E*<around*|[|W*<around|(|\<alpha\><rsub|1>+\<delta\><rsub|X>,\<alpha\><rsub|2>;A<rsup|1><rsub|n>|)>|\|>*\<alpha\><rsub|1>|]>,>>|<row|<cell|<label|Wdef3>W<rsup|2><around|(|\<alpha\><rsub|1>,\<alpha\><rsub|2>;A<rsub|n>|)>>|<cell|=a<rsub|1>*\<mu\><rsub|2>+<around*|\<nobracket\>|E*<around*|[|W*<around|(|\<alpha\><rsub|1>,\<alpha\><rsub|2>+\<delta\><rsub|Y>;A<rsup|1><rsub|n>|)>|\|>*\<alpha\><rsub|2>|]>,>>>>>

  where <math|A<rsub|n><rsup|1>=<around|(|a<rsub|2>,a<rsub|3>,\<ldots\>,a<rsub|n>|)>>
  and <math|\<mu\><rsub|i>> denotes the first moment of the normalized measure
  <math|\<alpha\><rsub|i>/M<rsub|i>>, which is also the expected value of an
  observation from arm <math|i>. In <math|E<around|[|g<around|(|X|)>\|\<alpha\>|]>>,
  the distribution of <math|X> is <math|\<alpha\>/M> with
  <math|M=\<alpha\><around|(|<math-bf|R>|)>>. The quantities
  <math|W,W<rsup|1>> and <math|W<rsup|2>> are well defined and finite as long
  as <math|\<alpha\><rsub|i>,i=1,2>, have finite first moments, which we
  assume throughout.
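
  For orientation, the recursion can be unrolled for the two shortest
  horizons. This is only an illustrative sketch, in which
  <math|\<mu\><rsub|i>> is read as the mean of
  <math|\<alpha\><rsub|i>/M<rsub|i>>, the reading under which it equals the
  expected observation from arm <math|i>. For <math|n=1> there is no
  continuation term, so <math|W<around|(|\<alpha\><rsub|1>,\<alpha\><rsub|2>;A<rsub|1>|)>=a<rsub|1>*max
  <around*|{|\<mu\><rsub|1>,\<mu\><rsub|2>|}>>. For <math|n=2>, the posterior
  <math|\<alpha\><rsub|1>+\<delta\><rsub|X>> has total mass
  <math|M<rsub|1>+1> and mean <math|<around|(|M<rsub|1>*\<mu\><rsub|1>+X|)>/<around|(|M<rsub|1>+1|)>>,
  so (<reference|Wdef2>) combined with the <math|n=1> formula gives

  <\equation*>
    W<rsup|1><around|(|\<alpha\><rsub|1>,\<alpha\><rsub|2>;A<rsub|2>|)>=a<rsub|1>*\<mu\><rsub|1>+a<rsub|2>*E<around*|[|max
    <around*|{|<frac|M<rsub|1>*\<mu\><rsub|1>+X|M<rsub|1>+1>,\<mu\><rsub|2>|}>\|\<alpha\><rsub|1>|]>.
  </equation*>

  The integrand is an increasing convex function of <math|X>, which suggests
  why the increasing convex order is a natural way to compare prior means.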

  Lemma <reference|lem1> reveals a convexity property of <math|W> which we
  shall use repeatedly.

  <\lemma>
    <label|lem1>Let <math|\<alpha\>> be a finite measure on
    <math|<math-bf|R>> with a finite mean. Then, for
    <math|u,v\<in\><math-bf|R>> and <math|r\<gtr\>0>, the function
    <math|W*<around|(|\<alpha\>+\<rho\>*\<delta\><rsub|u>+<around|(|r-\<rho\>|)>*\<delta\><rsub|v>,\<alpha\><rsub|2>;A<rsub|n>|)>>
    is convex in <math|\<rho\>\<in\><around|[|0,r|]>>.
  </lemma>

  <\proof>
    Let us use induction on <math|n>. It is easy to check that the claim
    holds for <math|n=1>. For <math|n\<geq\>2>, we note that by
    (<reference|Wdef1>) it suffices to show that each of
    <math|W<rsup|i>*<around|(|\<alpha\>+\<rho\>*\<delta\><rsub|u>+<around|(|r-\<rho\>|)>*\<delta\><rsub|v>,\<alpha\><rsub|2>;A<rsub|n>|)>,i=1,2>,
    is convex in <math|\<rho\>\<in\><around|[|0,r|]>>. Since the mean of
    <math|\<alpha\>+\<rho\>*\<delta\><rsub|u>+<around|(|r-\<rho\>|)>*\<delta\><rsub|v>>
    is linear in <math|\<rho\>>, by (<reference|Wdef2>) and
    (<reference|Wdef3>), we only need to show that both

    <\equation>
      <label|convex1><around*|\<nobracket\>|E*<around*|[|W*<around|(|\<alpha\>+\<rho\>*\<delta\><rsub|u>+<around|(|r-\<rho\>|)>*\<delta\><rsub|v>+\<delta\><rsub|X>,\<alpha\><rsub|2>;A<rsup|1><rsub|n>|)>|\|>*\<alpha\>+\<rho\>*\<delta\><rsub|u>+<around|(|r-\<rho\>|)>*\<delta\><rsub|v>|]><space|1em><math-up|and>
    </equation>

    <\equation>
      <label|convex2><around*|\<nobracket\>|E*<around*|[|W*<around|(|\<alpha\>+\<rho\>*\<delta\><rsub|u>+<around|(|r-\<rho\>|)>*\<delta\><rsub|v>,\<alpha\><rsub|2>+\<delta\><rsub|Y>;A<rsup|1><rsub|n>|)>|\|>*\<alpha\><rsub|2>|]>
    </equation>

    are convex in <math|\<rho\>>. Convexity of (<reference|convex2>) follows
    from the induction hypothesis. To deal with (<reference|convex1>), we
    directly compute

    <align|<tformat|<table|<row|<cell|<no-number>*E[W(\<alpha\>>|<cell|+\<rho\>*\<delta\><rsub|u>+<around|(|r-\<rho\>|)>*\<delta\><rsub|v>+\<delta\><rsub|X>,\<alpha\><rsub|2>;A<rsup|1><rsub|n>)\|\<alpha\>+\<rho\>*\<delta\><rsub|u>+<around|(|r-\<rho\>|)>*\<delta\><rsub|v>]>>|<row|<cell|<label|convex3>=>|<cell|<frac|M|M+r>*E*<around|[|W*<around|(|\<alpha\>+\<rho\>*\<delta\><rsub|u>+<around|(|r-\<rho\>|)>*\<delta\><rsub|v>+\<delta\><rsub|X>,\<alpha\><rsub|2>;A<rsup|1><rsub|n>|)>\|\<alpha\>|]>>>|<row|<cell|<label|convex4>>|<cell|<space|1em>+<frac|\<rho\>*\<phi\>*<around|(|\<rho\>+1|)>+<around|(|r-\<rho\>|)>*\<phi\><around|(|\<rho\>|)>|M+r>,>>>>>

    where <math|M=\<alpha\><around|(|<math-bf|R>|)>> and

    <\equation*>
      \<phi\><around|(|\<rho\>|)>=W*<around|(|\<alpha\>+\<rho\>*\<delta\><rsub|u>+<around|(|r+1-\<rho\>|)>*\<delta\><rsub|v>,\<alpha\><rsub|2>;A<rsup|1><rsub|n>|)>.
    </equation*>

    By the induction hypothesis, <math|\<phi\><around|(|\<rho\>|)>> is convex
    in <math|\<rho\>\<in\><around|[|0,r+1|]>>. We claim that this implies
    that <math|\<psi\><around|(|\<rho\>|)>\<equiv\>\<rho\>*\<phi\>*<around|(|\<rho\>+1|)>+<around|(|r-\<rho\>|)>*\<phi\><around|(|\<rho\>|)>>
    is convex in <math|\<rho\>\<in\><around|[|0,r|]>>. In fact, if
    <math|\<phi\><around|(|\<rho\>|)>> is twice differentiable, then we have

    <\equation*>
      \<psi\><rprime|''><around|(|\<rho\>|)>=2*<around|(|\<phi\><rprime|'>*<around|(|\<rho\>+1|)>-\<phi\><rprime|'><around|(|\<rho\>|)>|)>+\<rho\>*\<phi\><rprime|''>*<around|(|\<rho\>+1|)>+<around|(|r-\<rho\>|)>*\<phi\><rprime|''><around|(|\<rho\>|)>\<geq\>0,<space|1em>\<rho\>\<in\><around|[|0,r|]>,
    </equation*>

    by the convexity of <math|\<phi\>>. A standard limiting argument shows
    that <math|\<psi\><around|(|\<rho\>|)>> is convex in
    <math|\<rho\>\<in\><around|[|0,r|]>> as long as
    <math|\<phi\><around|(|\<rho\>|)>> is convex in
    <math|\<rho\>\<in\><around|[|0,r+1|]>> without assuming
    differentiability. Hence the second term (<reference|convex4>) is convex.
    The first term (<reference|convex3>) is convex in
    <math|\<rho\>\<in\><around|[|0,r|]>> by the induction hypothesis, since
    in this expectation <math|X> is distributed according to
    <math|\<alpha\>/M> independently of <math|\<rho\>>. Thus the convexity of
    (<reference|convex1>) is established.
  </proof>
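
  The base case <math|n=1> of Lemma <reference|lem1> can be written out as
  follows. This is a sketch; the symbol <math|m=<big|int>x*\<alpha\><around|(|\<mathd\>*x|)>>
  for the unnormalized first moment of <math|\<alpha\>> is introduced here
  only for this illustration, and <math|\<mu\><rsub|2>> denotes the expected
  observation from arm 2. With <math|M=\<alpha\><around|(|<math-bf|R>|)>>,

  <\equation*>
    W*<around|(|\<alpha\>+\<rho\>*\<delta\><rsub|u>+<around|(|r-\<rho\>|)>*\<delta\><rsub|v>,\<alpha\><rsub|2>;A<rsub|1>|)>=a<rsub|1>*max
    <around*|{|<frac|m+\<rho\>*u+<around|(|r-\<rho\>|)>*v|M+r>,\<mu\><rsub|2>|}>.
  </equation*>

  The first argument of the maximum is affine in <math|\<rho\>> and the
  second is constant, so the right-hand side is convex in
  <math|\<rho\>\<in\><around|[|0,r|]>>.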

  Theorem <reference|thm1> says that the value of the bandit increases as the
  mean of the Dirichlet process prior for any arm becomes stochastically
  larger and more dispersed. This strengthens Proposition 2.2 of Clayton and
  Berry (1985), who consider the usual stochastic order rather than the
  increasing convex order.

  <\theorem>
    <label|thm1>If <math|M\<gtr\>0> and <math|F\<leq\><rsub|<math-up|icx>><wide|F|~>>,
    both with finite means, then

    <\equation*>
      W*<around|(|M*F,\<alpha\><rsub|2>;A<rsub|n>|)>\<leq\>W*<around|(|M*<wide|F|~>,\<alpha\><rsub|2>;A<rsub|n>|)>.
    </equation*>
  </theorem>

  <\proof>
    Let us use induction on <math|n>. The claim obviously holds for <math|n=1>. For
    <math|n\<geq\>2> we have <math|W<rsup|2>*<around|(|M*F,\<alpha\><rsub|2>;A<rsub|n>|)>\<leq\>W<rsup|2>*<around|(|M*<wide|F|~>,\<alpha\><rsub|2>;A<rsub|n>|)>>
    by (<reference|Wdef3>) and the induction hypothesis. Moreover,

    <align*|<tformat|<table|<row|<cell|W<rsup|1>*<around|(|M*F,\<alpha\><rsub|2>;A<rsub|n>|)>>|<cell|=a<rsub|1>*E<around|(|X\|F|)>+E*<around|[|W*<around|(|M*F+\<delta\><rsub|X>,\<alpha\><rsub|2>;A<rsup|1><rsub|n>|)>\|F|]>>>|<row|<cell|>|<cell|\<leq\>a<rsub|1>*E<around|(|X\|<wide|F|~>|)>+E*<around|[|W*<around|(|M*<wide|F|~>+\<delta\><rsub|X>,\<alpha\><rsub|2>;A<rsup|1><rsub|n>|)>\|F|]>>>|<row|<cell|>|<cell|\<leq\>a<rsub|1>*E<around|(|X\|<wide|F|~>|)>+E*<around|[|W*<around|(|M*<wide|F|~>+\<delta\><rsub|X>,\<alpha\><rsub|2>;A<rsup|1><rsub|n>|)>\|<wide|F|~>|]>>>|<row|<cell|>|<cell|=W<rsup|1>*<around|(|M*<wide|F|~>,\<alpha\><rsub|2>;A<rsub|n>|)>,>>>>>

    where the first inequality follows from
    <math|F\<leq\><rsub|<math-up|icx>><wide|F|~>> and the induction
    hypothesis, noting that <math|<around|(|M*F+\<delta\><rsub|x>|)>/<around|(|M+1|)>\<leq\><rsub|<math-up|icx>><around|(|M*<wide|F|~>+\<delta\><rsub|x>|)>/<around|(|M+1|)>>
    for any <math|x>; the second inequality holds by the definition of
    <math|\<leq\><rsub|<math-up|icx>>>, because
    <math|W*<around|(|M*<wide|F|~>+\<delta\><rsub|x>,\<alpha\><rsub|2>;A<rsup|1><rsub|n>|)>>
    is an increasing, convex function of <math|x>. To show this, fix
    <math|-\<infty\>\<less\>u\<less\>v\<less\>\<infty\>>. It is easy to show
    <math|<around|(|M*<wide|F|~>+\<delta\><rsub|u>|)>/<around|(|M+1|)>\<leq\><rsub|<math-up|icx>><around|(|M*<wide|F|~>+\<delta\><rsub|v>|)>/<around|(|M+1|)>>,
    which, by the induction hypothesis, implies
    <math|W*<around|(|M*<wide|F|~>+\<delta\><rsub|u>,\<alpha\><rsub|2>;A<rsup|1><rsub|n>|)>\<leq\>W*<around|(|M*<wide|F|~>+\<delta\><rsub|v>,\<alpha\><rsub|2>;A<rsup|1><rsub|n>|)>>.
    Moreover,

    <align*|<tformat|<table|<row|<cell|W*<around|(|M*<wide|F|~>+\<delta\><rsub|u>,\<alpha\><rsub|2>;A<rsup|1><rsub|n>|)>>|<cell|+W*<around|(|M*<wide|F|~>+\<delta\><rsub|v>,\<alpha\><rsub|2>;A<rsup|1><rsub|n>|)>>>|<row|<cell|>|<cell|\<geq\>2*W*<around|(|M*<wide|F|~>+<around|(|\<delta\><rsub|u>+\<delta\><rsub|v>|)>/2,\<alpha\><rsub|2>;A<rsup|1><rsub|n>|)>>>|<row|<cell|>|<cell|\<geq\>2*W*<around|(|M*<wide|F|~>+\<delta\><rsub|<around|(|u+v|)>/2>,\<alpha\><rsub|2>;A<rsup|1><rsub|n>|)>,>>>>>

    where the first inequality follows from Lemma <reference|lem1>, and the
    second inequality holds by the induction hypothesis, noting that

    <\equation*>
      <frac|M*<wide|F|~>+\<delta\><rsub|<around|(|u+v|)>/2>|M+1>\<leq\><rsub|<math-up|icx>><frac|M*<wide|F|~>+<around|(|\<delta\><rsub|u>+\<delta\><rsub|v>|)>/2|M+1>.
    </equation*>

    Hence <math|W*<around|(|M*<wide|F|~>+\<delta\><rsub|x>,\<alpha\><rsub|2>;A<rsup|1><rsub|n>|)>>
    is convex in <math|x> as needed.
  </proof>
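
  The two comparisons in the increasing convex order invoked in the proof of
  Theorem <reference|thm1> can be spelled out; the following is a sketch that
  uses only closure under mixtures. For <math|u\<less\>v> we have
  <math|\<delta\><rsub|u>\<leq\><rsub|<math-up|st>>\<delta\><rsub|v>>, hence
  <math|\<delta\><rsub|u>\<leq\><rsub|<math-up|icx>>\<delta\><rsub|v>>, and
  mixing both sides with <math|<wide|F|~>> gives

  <\equation*>
    <frac|M|M+1>*<wide|F|~>+<frac|1|M+1>*\<delta\><rsub|u>\<leq\><rsub|<math-up|icx>><frac|M|M+1>*<wide|F|~>+<frac|1|M+1>*\<delta\><rsub|v>.
  </equation*>

  Similarly, Jensen's inequality gives <math|\<phi\><around|(|<around|(|u+v|)>/2|)>\<leq\><around|(|\<phi\><around|(|u|)>+\<phi\><around|(|v|)>|)>/2>
  for every convex <math|\<phi\>>, that is,
  <math|\<delta\><rsub|<around|(|u+v|)>/2>\<leq\><rsub|<math-up|cx>><around|(|\<delta\><rsub|u>+\<delta\><rsub|v>|)>/2>,
  and the same mixing step yields the last displayed comparison in the proof.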

  <with|font-series|bold|Remark 1.> Theorem <reference|thm1> extends to
  bandits with more than two arms. That is, the maximum expected payoff
  increases when the mean of the Dirichlet process prior for any arm becomes
  larger in the increasing convex order. We present the two-armed version for
  notational convenience. The discount sequence in
  Theorem<nbsp><reference|thm1> is very general, i.e., we only assume
  <math|A<rsub|n>> is nonnegative. By approximation, this can be further
  extended to the infinite-horizon case assuming
  <math|<big|sum><rsub|i=1><rsup|\<infty\>>a<rsub|i>\<less\>\<infty\>>.
  Similar comments apply to Theorem<nbsp><reference|thm2> in Section<nbsp>3.

  When arm 2 has a known distribution <math|P<rsub|2>> with mean
  <math|\<lambda\>>, the problem reduces to a one-armed bandit. Without loss
  of generality we may assume the known arm yields a constant payoff
  <math|\<lambda\>> at each stage, i.e., we consider the
  <math|<around|(|\<alpha\>,\<delta\><rsub|\<lambda\>>;A<rsub|n>|)>> bandit
  (the subscript on <math|\<alpha\><rsub|1>> is dropped for convenience). It
  is well known that, assuming the discount sequence is regular in the sense
  that <math|<around|(|<big|sum><rsub|i\<geq\>j+1>a<rsub|i>|)><rsup|2>\<geq\><around|(|<big|sum><rsub|i\<geq\>j>a<rsub|i>|)><around|(|<big|sum><rsub|i\<geq\>j+2>a<rsub|i>|)>>
  for all <math|j\<geq\>1>, this one-armed bandit is an optimal stopping
  problem, i.e., if at any stage it is optimal to pull arm 2 then arm 2
  should be used in all subsequent stages; see Berry and Fristedt (1979). If
  <math|A<rsub|n>> is regular, then there exists a break-even value
  <math|\<Lambda\><around|(|\<alpha\>;A<rsub|n>|)>> for the
  <math|<around|(|\<alpha\>,\<delta\><rsub|\<lambda\>>;A<rsub|n>|)>> bandit,
  such that arm 1 is optimal initially if and only if
  <math|\<lambda\>\<leq\>\<Lambda\><around|(|\<alpha\>;A<rsub|n>|)>> and arm
  2 is optimal initially if and only if <math|\<lambda\>\<geq\>\<Lambda\><around|(|\<alpha\>;A<rsub|n>|)>>.
  For infinite-horizon geometric discounting, this break-even value is also
  known as the dynamic allocation index or Gittins index (Gittins and Jones
  1974). The following result holds by the optimal stopping characterization
  and is stated for uniform discounting as Lemma 2.1 in Clayton and Berry
  (1985).

  <\lemma>
    <label|lemlam>If <math|A<rsub|n>> is regular, then
    <math|\<Lambda\><around|(|\<alpha\>;A<rsub|n>|)>> is the smallest
    <math|\<lambda\>> such that <math|W<around|(|\<alpha\>,\<delta\><rsub|\<lambda\>>;A<rsub|n>|)>\<leq\>\<lambda\>*<big|sum><rsub|i=1><rsup|n>a<rsub|i>>.
  </lemma>
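
  Two quick checks may help fix ideas; both are illustrative sketches, and the
  notation <math|\<gamma\><rsub|j>=<big|sum><rsub|i\<geq\>j>a<rsub|i>> is
  introduced here only for this purpose. First, uniform discounting
  <math|a<rsub|1>=\<cdots\>=a<rsub|n>=1> is regular: here
  <math|\<gamma\><rsub|j>=n-j+1> for <math|j\<leq\>n>, so for
  <math|j\<leq\>n-1> the regularity condition reads

  <\equation*>
    <around|(|n-j|)><rsup|2>\<geq\><around|(|n-j+1|)>*<around|(|n-j-1|)>=<around|(|n-j|)><rsup|2>-1,
  </equation*>

  while both sides vanish for <math|j\<geq\>n>. Second, for <math|n=1> with
  <math|a<rsub|1>\<gtr\>0>, and with <math|\<mu\>> the expected observation
  from arm 1, we have <math|W<around|(|\<alpha\>,\<delta\><rsub|\<lambda\>>;A<rsub|1>|)>=a<rsub|1>*max
  <around*|{|\<mu\>,\<lambda\>|}>>, which is at most
  <math|\<lambda\>*a<rsub|1>> exactly when <math|\<lambda\>\<geq\>\<mu\>>.
  Lemma <reference|lemlam> then gives <math|\<Lambda\><around|(|\<alpha\>;A<rsub|1>|)>=\<mu\>>:
  with a single stage there is nothing to learn, and the break-even value is
  the prior mean payoff.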

  Lemma <reference|lemlam> and Theorem <reference|thm1> yield the following
  result comparing <math|\<Lambda\><around|(|\<alpha\>;A<rsub|n>|)>>.

  <\corollary>
    <label|coro1>For <math|M\<gtr\>0> and
    <math|F\<leq\><rsub|<math-up|icx>><wide|F|~>>, both with finite means, we
    have <math|\<Lambda\>*<around|(|M*F;A<rsub|n>|)>\<leq\>\<Lambda\>*<around|(|M*<wide|F|~>;A<rsub|n>|)>>,
    assuming <math|A<rsub|n>> is a regular discount sequence.
  </corollary>

  Suppose <math|A<rsub|n>> is regular. Monotonicity and continuity
  considerations (see Clayton and Berry 1985) show that, for the
  <math|<around|(|\<alpha\>,\<delta\><rsub|\<lambda\>>;A<rsub|n>|)>> bandit
  there exists a break-even observation <math|b<around|(|\<alpha\>;A<rsub|n>|)>>
  such that if both arms are optimal initially, and an observation <math|x>
  is taken from arm 1, then arm 1 remains optimal if
  <math|x\<geq\>b<around|(|\<alpha\>;A<rsub|n>|)>> and arm 2 becomes optimal
  if <math|x\<leq\>b<around|(|\<alpha\>;A<rsub|n>|)>>. That is,

  <align*|<tformat|<table|<row|<cell|\<Lambda\><around|(|\<alpha\>;A<rsub|n>|)>>|<cell|\<geq\>\<Lambda\>*<around|(|\<alpha\>+\<delta\><rsub|x>;A<rsub|n><rsup|1>|)>,<space|1em><math-up|if>x\<leq\>b<around|(|\<alpha\>;A<rsub|n>|)>;>>|<row|<cell|\<Lambda\><around|(|\<alpha\>;A<rsub|n>|)>>|<cell|\<leq\>\<Lambda\>*<around|(|\<alpha\>+\<delta\><rsub|x>;A<rsub|n><rsup|1>|)>,<space|1em><math-up|if>x\<geq\>b<around|(|\<alpha\>;A<rsub|n>|)>.>>>>>

  Calculating this break-even observation is nontrivial. In the case of
  uniform discounting, Clayton and Berry (1985) prove an upper bound for
  <math|b<around|(|\<alpha\>;A<rsub|n>|)>> and conjecture that
  <math|b<around|(|\<alpha\>;A<rsub|n>|)>\<geq\>\<Lambda\><around|(|\<alpha\>;A<rsub|n>|)>>
  based on numerical evidence. We confirm this in Proposition
  <reference|prop1>.

  <\proposition>
    <label|prop1>Suppose <math|n\<geq\>2> and <math|A<rsub|n>> is regular and
    all positive. Then <math|b<around|(|\<alpha\>;A<rsub|n>|)>\<geq\>\<Lambda\><around|(|\<alpha\>;A<rsub|n>|)>>.
  </proposition>

  As noted by Berry and Fristedt (1985; p. 131),
  Proposition<nbsp><reference|prop1> has an intuitive interpretation. Suppose
  both arms are optimal initially, and arm 1 is selected. If the initial pull
  on arm 1 yields no more than <math|\<Lambda\><around|(|\<alpha\>;A<rsub|n>|)>>,
  which is the yield of arm 2 per pull, the hope of getting higher payoff
  fades. Not surprisingly, arm 2 becomes optimal afterwards. This suggests
  that the break-even observation is at least
  <math|\<Lambda\><around|(|\<alpha\>;A<rsub|n>|)>>.

  To prove Proposition<nbsp><reference|prop1> we need a lemma.

  <\lemma>
    <label|lem3>For <math|c\<gtr\>0,\<lambda\>\<in\><math-bf|R>> and an
    arbitrary discount sequence <math|A<rsub|n>>, we have

    <\equation*>
      W*<around|(|\<alpha\>+c*\<delta\><rsub|\<lambda\>>,\<delta\><rsub|\<lambda\>>;A<rsub|n>|)>\<leq\>W<around|(|\<alpha\>,\<delta\><rsub|\<lambda\>>;A<rsub|n>|)>.
    </equation*>
  </lemma>

  <\proof>
    We use induction on <math|n>. The <math|n=1> case is easy. Suppose
    <math|n\<geq\>2>. Let us write <math|M=\<alpha\><around|(|<math-bf|R>|)>>
    and let <math|\<mu\>> be the first moment of <math|\<alpha\>>. Direct
    calculation using (<reference|Wdef1>)\U(<reference|Wdef3>) yields

    <align|<tformat|<table|<row|<cell|<label|lam>W*<around|(|\<alpha\>+c*\<delta\><rsub|\<lambda\>>,\<delta\><rsub|\<lambda\>>;A<rsub|n>|)>=max
    <around*|{|<frac|M*\<phi\><rsub|0>+c*\<phi\><rsub|1>|M+c>,\<phi\><rsub|2>|}>,>>>>>

    where

    <align*|<tformat|<table|<row|<cell|\<phi\><rsub|0>>|<cell|=a<rsub|1>*\<mu\>+E*<around*|[|W*<around|(|\<alpha\>+c*\<delta\><rsub|\<lambda\>>+\<delta\><rsub|X>,\<delta\><rsub|\<lambda\>>;A<rsub|n><rsup|1>|)>\|\<alpha\>|]>;>>|<row|<cell|\<phi\><rsub|1>>|<cell|=a<rsub|1>*\<lambda\>+W*<around|(|\<alpha\>+<around|(|c+1|)>*\<delta\><rsub|\<lambda\>>,\<delta\><rsub|\<lambda\>>;A<rsub|n><rsup|1>|)>;>>|<row|<cell|\<phi\><rsub|2>>|<cell|=a<rsub|1>*\<lambda\>+W*<around|(|\<alpha\>+c*\<delta\><rsub|\<lambda\>>,\<delta\><rsub|\<lambda\>>;A<rsub|n><rsup|1>|)>.>>>>>

    Applying the induction hypothesis, and then (<reference|Wdef1>) and
    (<reference|Wdef2>), we get

    <align*|<tformat|<table|<row|<cell|\<phi\><rsub|0>>|<cell|\<leq\>a<rsub|1>*\<mu\>+E*<around*|[|W*<around|(|\<alpha\>+\<delta\><rsub|X>,\<delta\><rsub|\<lambda\>>;A<rsub|n><rsup|1>|)>\|\<alpha\>|]>>>|<row|<cell|>|<cell|\<leq\>W<around|(|\<alpha\>,\<delta\><rsub|\<lambda\>>;A<rsub|n>|)>.>>>>>

    Applying the induction hypothesis, and then (<reference|Wdef1>) and
    (<reference|Wdef3>), we get

    <align*|<tformat|<table|<row|<cell|\<phi\><rsub|1>>|<cell|\<leq\>\<phi\><rsub|2>\<leq\>a<rsub|1>*\<lambda\>+W<around|(|\<alpha\>,\<delta\><rsub|\<lambda\>>;A<rsub|n><rsup|1>|)>\<leq\>W<around|(|\<alpha\>,\<delta\><rsub|\<lambda\>>;A<rsub|n>|)>.>>>>>

    That is, <math|\<phi\><rsub|i>\<leq\>W<around|(|\<alpha\>,\<delta\><rsub|\<lambda\>>;A<rsub|n>|)>>
    for <math|i=0,1,2>. Hence the claim holds by (<reference|lam>).
  </proof>
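
  The base case <math|n=1> in the proof of Lemma <reference|lem3> can be
  checked directly. This is a sketch, reading <math|\<mu\>> as the mean of
  <math|\<alpha\>/M> and assuming <math|M\<gtr\>0>:

  <\equation*>
    W*<around|(|\<alpha\>+c*\<delta\><rsub|\<lambda\>>,\<delta\><rsub|\<lambda\>>;A<rsub|1>|)>=a<rsub|1>*max
    <around*|{|<frac|M*\<mu\>+c*\<lambda\>|M+c>,\<lambda\>|}>\<leq\>a<rsub|1>*max
    <around*|{|\<mu\>,\<lambda\>|}>=W<around|(|\<alpha\>,\<delta\><rsub|\<lambda\>>;A<rsub|1>|)>,
  </equation*>

  because <math|<around|(|M*\<mu\>+c*\<lambda\>|)>/<around|(|M+c|)>> is a
  weighted average of <math|\<mu\>> and <math|\<lambda\>> and therefore cannot
  exceed their maximum.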

  <\proof>
    <dueto|Proof of Proposition<nbsp><reference|prop1>>Suppose
    <math|\<lambda\>=\<Lambda\><around|(|\<alpha\>;A<rsub|n>|)>>. By the
    optimal stopping characterization, we have
    <math|W<around|(|\<alpha\>,\<delta\><rsub|\<lambda\>>;A<rsub|n><rsup|1>|)>=\<lambda\>*<big|sum><rsub|i=2><rsup|n>a<rsub|i>>.
    Lemma<nbsp><reference|lem3> yields <math|W*<around|(|\<alpha\>+\<delta\><rsub|\<lambda\>>,\<delta\><rsub|\<lambda\>>;A<rsub|n><rsup|1>|)>\<leq\>\<lambda\>*<big|sum><rsub|i=2><rsup|n>a<rsub|i>>.
    It follows from Lemma<nbsp><reference|lemlam> that
    <math|\<lambda\>\<geq\>\<Lambda\>*<around|(|\<alpha\>+\<delta\><rsub|\<lambda\>>;A<rsub|n><rsup|1>|)>>.
    That is, <math|\<Lambda\><around|(|\<alpha\>;A<rsub|n>|)>\<geq\>\<Lambda\>*<around|(|\<alpha\>+\<delta\><rsub|\<lambda\>>;A<rsub|n><rsup|1>|)>>,
    which implies <math|\<lambda\>\<leq\>b<around|(|\<alpha\>;A<rsub|n>|)>>
    (under the stated assumptions, <math|b<around|(|\<alpha\>;A<rsub|n>|)>> is
    unique).
  </proof>

  <section|Prior weight monotonicity>

  The main result of this section (Theorem <reference|thm2>) shows that the
  maximum expected payoff of a bandit decreases as the prior weight for the
  Dirichlet process prior of an arm increases. When arm 2 is known and the
  discount sequence is regular, this shows that the break-even value
  <math|\<Lambda\>*<around|(|M<rsub|1>*F<rsub|1>;A<rsub|n>|)>> decreases as
  <math|M<rsub|1>> (the prior weight associated with arm 1) increases. That
  is, given the same immediate payoff, arm 1 becomes less desirable as the
  amount of information about it increases.

  <\theorem>
    <label|thm2>Let <math|F> be a probability distribution on
    <math|<math-bf|R>> with a finite mean. If
    <math|0\<less\>M\<less\><wide|M|~>> then

    <\equation>
      <label|mono2>W*<around|(|M*F,\<alpha\><rsub|2>;A<rsub|n>|)>\<geq\>W*<around|(|<wide|M|~>*F,\<alpha\><rsub|2>;A<rsub|n>|)>.
    </equation>
  </theorem>

  Lemma <reference|lemlam> and Theorem <reference|thm2> yield the following
  result concerning the break-even value <math|\<Lambda\><around|(|\<alpha\>;A<rsub|n>|)>>
  for the one-armed bandit <math|<around|(|\<alpha\>,\<delta\><rsub|\<lambda\>>;A<rsub|n>|)>>,
  as conjectured by Clayton and Berry (1985) in the case of uniform
  discounting.

  <\corollary>
    <label|coro2>For <math|0\<less\>M\<less\><wide|M|~>> we have
    <math|\<Lambda\>*<around|(|M*F;A<rsub|n>|)>\<geq\>\<Lambda\>*<around|(|<wide|M|~>*F;A<rsub|n>|)>>,
    assuming <math|A<rsub|n>> is a regular discount sequence.
  </corollary>

  When <math|F> has only two support points, Corollary<nbsp><reference|coro2> says that for a
  Bernoulli one-armed bandit with a <with|font-family|rm|Beta><math|<around|(|M*u,M*v|)>>
  prior, <math|u,v\<gtr\>0>, for the unknown arm, the break-even value
  decreases in <math|M>. This Bernoulli case was proved by Gittins and Wang
  (1992) for infinite-horizon geometric discounting.
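
  The Bernoulli specialization can be made explicit; the following is a
  sketch in which the normalization <math|u=p>, <math|v=1-p> and the symbol
  <math|\<theta\>> are assumptions made for illustration. Take
  <math|F=p*\<delta\><rsub|1>+<around|(|1-p|)>*\<delta\><rsub|0>> and let
  <math|\<theta\>> denote the mass that the random distribution drawn from the
  Dirichlet process with parameter <math|M*F> assigns to the point 1. Then
  <math|\<theta\>> has a <with|font-family|rm|Beta><math|<around|(|M*p,M*<around|(|1-p|)>|)>>
  distribution, with

  <\equation*>
    E*\<theta\>=p,<space|1em><math-up|Var><around|(|\<theta\>|)>=<frac|p*<around|(|1-p|)>|M+1>.
  </equation*>

  The prior mean does not depend on <math|M>, while the prior variance
  decreases in <math|M>, which matches the reading of <math|M> as the amount
  of prior information. An observation <math|X\<in\><around|{|0,1|}>> updates
  the prior to <with|font-family|rm|Beta><math|<around|(|M*p+X,M*<around|(|1-p|)>+1-X|)>>,
  in agreement with the posterior parameter <math|M*F+\<delta\><rsub|X>>.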

  The rest of this section gives a proof of Theorem <reference|thm2>. We
  first assume that <math|F> has finite support, then bounded support, and
  finally arbitrary support. The key step is summarized as Lemma <reference|lem4>.

  <\lemma>
    <label|lem4>Assume <math|n\<geq\>2,L\<gtr\>0>. Assume <math|\<alpha\>> is
    a finite measure on <math|<math-bf|R>> with a finite mean and <math|F> is
    a probability distribution on <math|<math-bf|R>> with
    <math|s\<less\>\<infty\>> support points. Then
    <math|E*<around|[|W*<around|(|\<alpha\>+\<theta\>*F+<around|(|L-\<theta\>|)>*\<delta\><rsub|X>,\<alpha\><rsub|2>;A<rsup|1><rsub|n>|)>\|F|]>>
    decreases in <math|\<theta\>\<in\><around|[|0,L|]>>.
  </lemma>

  <\proof>
    We use induction on <math|s>. Although the induction may start at the
    trivial case <math|s=1>, we present the <math|s=2> case to illustrate the
    convexity arguments. Without loss of generality, let the support points
    be <math|<around|{|0,1|}>> and write <math|F=p*\<delta\><rsub|1>+<around|(|1-p|)>*\<delta\><rsub|0>>
    with <math|p\<in\><around|(|0,1|)>>. For fixed
    <math|0\<leq\>\<theta\><rsub|1>\<less\>\<theta\><rsub|2>\<leq\>L>, let
    <math|Z\<sim\><math-up|Bernoulli><around|(|p|)>> and define

    <\equation*>
      Z<rsub|i>=\<theta\><rsub|i>*p+<around|(|L-\<theta\><rsub|i>|)>*Z,<space|1em>i=1,2.
    </equation*>

    Then <math|E*Z<rsub|1>=E*Z<rsub|2>=p*L>, and it is easy to verify
    <math|Z<rsub|2>\<leq\><rsub|<math-up|cx>>Z<rsub|1>> as
    <math|\<theta\><rsub|1>\<less\>\<theta\><rsub|2>> (see, e.g., Shaked and
    Shanthikumar 2007, Theorem 3.A.18). Let us define

    <\equation*>
      \<phi\><around|(|u|)>=W*<around|(|\<alpha\>+u*\<delta\><rsub|1>+<around|(|L-u|)>*\<delta\><rsub|0>,\<alpha\><rsub|2>;A<rsup|1><rsub|n>|)>.
    </equation*>

    By direct calculation

    <align*|<tformat|<table|<row|<cell|E*<around*|[|W*<around*|(|\<alpha\>+\<theta\><rsub|1>*F+<around|(|L-\<theta\><rsub|1>|)>*\<delta\><rsub|X>,\<alpha\><rsub|2>;A<rsup|1><rsub|n>|)>\|F|]>=>|<cell|p*\<phi\>*<around|(|\<theta\><rsub|1>*p+L-\<theta\><rsub|1>|)>+<around|(|1-p|)>*\<phi\>*<around|(|\<theta\><rsub|1>*p|)>>>|<row|<cell|=>|<cell|E*\<phi\><around|(|Z<rsub|1>|)>>>|<row|<cell|\<geq\>>|<cell|E*\<phi\><around|(|Z<rsub|2>|)>>>|<row|<cell|=>|<cell|E*<around*|[|W*<around*|(|\<alpha\>+\<theta\><rsub|2>*F+<around|(|L-\<theta\><rsub|2>|)>*\<delta\><rsub|X>,\<alpha\><rsub|2>;A<rsup|1><rsub|n>|)>\|F|]>>>>>>

    where the inequality holds because <math|Z<rsub|2>\<leq\><rsub|<math-up|cx>>Z<rsub|1>>
    and, by Lemma <reference|lem1>, <math|\<phi\><around|(|u|)>> is convex in
    <math|u\<in\><around|[|0,L|]>>.
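
    For readers who want the convex ordering spelled out, here is a short
    sketch (the constant <math|c> and the generic convex function <math|g>
    are notation introduced only for this illustration). Since
    <math|Z<rsub|i>-p*L=<around|(|L-\<theta\><rsub|i>|)>*<around|(|Z-p|)>>
    for <math|i=1,2>, setting <math|c=<around|(|L-\<theta\><rsub|2>|)>/<around|(|L-\<theta\><rsub|1>|)>\<in\><around|[|0,1|)>>
    gives <math|Z<rsub|2>=c*Z<rsub|1>+<around|(|1-c|)>*E*Z<rsub|1>>. Hence,
    for any convex <math|g>, convexity followed by Jensen's inequality yields

    <\equation*>
      E*g<around|(|Z<rsub|2>|)>\<leq\>c*E*g<around|(|Z<rsub|1>|)>+<around|(|1-c|)>*g<around|(|E*Z<rsub|1>|)>\<leq\>E*g<around|(|Z<rsub|1>|)>.
    </equation*>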

    For <math|s\<geq\>3>, write <math|F=<big|sum><rsub|j=1><rsup|s>p<rsub|j>*\<delta\><rsub|x<rsub|j>>>,
    where <math|<around|{|x<rsub|j>,j=1,\<ldots\>,s|}>> are the support
    points, <math|p<rsub|j>\<gtr\>0> and <math|<big|sum><rsub|j=1><rsup|s>p<rsub|j>=1>.
    Consider the leave-one-out distributions

    <align*|<tformat|<table|<row|<cell|F<rsup|k>>|<cell|=<big|sum><rsub|j\<neq\>k><frac|p<rsub|j>|1-p<rsub|k>>*\<delta\><rsub|x<rsub|j>>,<space|1em>k=1,\<ldots\>,s.>>>>>

    Denote <math|W<around|(|\<gamma\>|)>=W<around|(|\<gamma\>,\<alpha\><rsub|2>;A<rsup|1><rsub|n>|)>>
    for convenience. For fixed <math|0\<leq\>\<theta\><rsub|1>\<less\>\<theta\><rsub|2>\<leq\>L>,
    we have

    <align|<tformat|<table|<row|<cell|<no-number>*<around|(|s-1|)>>|<cell|<around*|\<nobracket\>|E*<around*|[|W*<around*|(|\<alpha\>+\<theta\><rsub|1>*F+<around|(|L-\<theta\><rsub|1>|)>*\<delta\><rsub|X>|)>|\|>*F|]>>>|<row|<cell|<no-number>>|<cell|=<big|sum><rsub|k=1><rsup|s><around|(|1-p<rsub|k>|)>*E*<around*|[|W*<around*|(|\<alpha\>+\<theta\><rsub|1>*F+<around|(|L-\<theta\><rsub|1>|)>*\<delta\><rsub|X>|)><around*|\||F<rsup|k>|]>|\<nobracket\>>>>|<row|<cell|<no-number>>|<cell|=<big|sum><rsub|k=1><rsup|s><around|(|1-p<rsub|k>|)>*<around*|\<nobracket\>|E*<around*|[|W*<around*|(|\<alpha\>+\<theta\><rsub|1>*p<rsub|k>*\<delta\><rsub|x<rsub|k>>+\<theta\><rsub|1>*<around|(|1-p<rsub|k>|)>*F<rsup|k>+<around|(|L-\<theta\><rsub|1>|)>*\<delta\><rsub|X>|)>|\|>*F<rsup|k>|]>>>|<row|<cell|<label|induct1>>|<cell|\<geq\><big|sum><rsub|k=1><rsup|s><around|(|1-p<rsub|k>|)>*<around*|\<nobracket\>|E*<around*|[|W*<around*|(|\<alpha\>+\<theta\><rsub|1>*p<rsub|k>*\<delta\><rsub|x<rsub|k>>+\<theta\><rsub|2>*<around|(|1-p<rsub|k>|)>*F<rsup|k>+<around|(|L-\<theta\><rsub|2>*<around|(|1-p<rsub|k>|)>-\<theta\><rsub|1>*p<rsub|k>|)>*\<delta\><rsub|X>|)>|\|>*F<rsup|k>|]>>>|<row|<cell|<no-number>>|<cell|=<big|sum><rsub|k=1><rsup|s><big|sum><rsub|j\<neq\>k>p<rsub|j>*V<rsub|j*k>,>>>>>

    where

    <align*|<tformat|<table|<row|<cell|V<rsub|j*k>>|<cell|=W*<around*|(|\<alpha\>+\<theta\><rsub|2>*\<gamma\><rsup|j*k>+\<theta\><rsub|1>*p<rsub|k>*\<delta\><rsub|x<rsub|k>>+<around|(|L-\<theta\><rsub|2>*<around|(|1-p<rsub|k>-p<rsub|j>|)>-\<theta\><rsub|1>*p<rsub|k>|)>*\<delta\><rsub|x<rsub|j>>|)>,>>|<row|<cell|\<gamma\><rsup|j*k>>|<cell|=<big|sum><rsub|l\<neq\>j,k>p<rsub|l>*\<delta\><rsub|x<rsub|l>>,<space|1em>j\<neq\>k.>>>>>

    The inequality (<reference|induct1>) follows from the induction
    hypothesis; other steps are algebraic manipulations.

    For fixed <math|j\<neq\>k>, let <math|Z\<sim\><math-up|Bernoulli><around|(|p<rsub|k>/<around|(|p<rsub|j>+p<rsub|k>|)>|)>>
    and define

    <align*|<tformat|<table|<row|<cell|Z<rsub|1>>|<cell|=\<theta\><rsub|1>*p<rsub|k>+Z*<around|(|L-\<theta\><rsub|2>+<around|(|\<theta\><rsub|2>-\<theta\><rsub|1>|)>*<around|(|p<rsub|j>+p<rsub|k>|)>|)>;>>|<row|<cell|Z<rsub|2>>|<cell|=\<theta\><rsub|2>*p<rsub|k>+Z*<around|(|L-\<theta\><rsub|2>|)>.>>>>>

    It is easy to verify that

    <\equation*>
      E*Z<rsub|1>=E*Z<rsub|2>;<space|1em>Z<rsub|2>\<leq\><rsub|<math-up|cx>>Z<rsub|1>.
    </equation*>
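
    A sketch of this verification, writing <math|q=p<rsub|k>/<around|(|p<rsub|j>+p<rsub|k>|)>>,
    <math|c<rsub|1>=L-\<theta\><rsub|2>+<around|(|\<theta\><rsub|2>-\<theta\><rsub|1>|)>*<around|(|p<rsub|j>+p<rsub|k>|)>>
    and <math|c<rsub|2>=L-\<theta\><rsub|2>> (shorthand introduced here only
    for illustration): since <math|q*<around|(|p<rsub|j>+p<rsub|k>|)>=p<rsub|k>>,

    <\equation*>
      E*Z<rsub|1>=\<theta\><rsub|1>*p<rsub|k>+q*<around|(|L-\<theta\><rsub|2>|)>+<around|(|\<theta\><rsub|2>-\<theta\><rsub|1>|)>*p<rsub|k>=\<theta\><rsub|2>*p<rsub|k>+q*<around|(|L-\<theta\><rsub|2>|)>=E*Z<rsub|2>.
    </equation*>

    Moreover, <math|Z<rsub|i>-E*Z<rsub|i>=c<rsub|i>*<around|(|Z-q|)>> with
    <math|0\<leq\>c<rsub|2>\<less\>c<rsub|1>>, so <math|Z<rsub|2>> is a
    contraction of <math|Z<rsub|1>> toward their common mean, which gives the
    convex ordering.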

    We have

    <align*|<tformat|<table|<row|<cell|p<rsub|j>*V<rsub|j*k>+p<rsub|k>*V<rsub|k*j>>|<cell|=<around|(|p<rsub|j>+p<rsub|k>|)>*E*W*<around*|(|\<alpha\>+\<theta\><rsub|2>*\<gamma\><rsup|j*k>+Z<rsub|1>*\<delta\><rsub|x<rsub|k>>+<around|(|L-\<theta\><rsub|2>*<around|(|1-p<rsub|k>-p<rsub|j>|)>-Z<rsub|1>|)>*\<delta\><rsub|x<rsub|j>>|)>>>|<row|<cell|>|<cell|\<geq\><around|(|p<rsub|j>+p<rsub|k>|)>*E*W*<around*|(|\<alpha\>+\<theta\><rsub|2>*\<gamma\><rsup|j*k>+Z<rsub|2>*\<delta\><rsub|x<rsub|k>>+<around|(|L-\<theta\><rsub|2>*<around|(|1-p<rsub|k>-p<rsub|j>|)>-Z<rsub|2>|)>*\<delta\><rsub|x<rsub|j>>|)>>>|<row|<cell|>|<cell|=p<rsub|j>*W*<around*|(|\<alpha\>+\<theta\><rsub|2>*F+<around|(|L-\<theta\><rsub|2>|)>*\<delta\><rsub|x<rsub|j>>|)>+p<rsub|k>*W*<around*|(|\<alpha\>+\<theta\><rsub|2>*F+<around|(|L-\<theta\><rsub|2>|)>*\<delta\><rsub|x<rsub|k>>|)>,>>>>>

    where the inequality holds by Lemma <reference|lem1> as
    <math|Z<rsub|2>\<leq\><rsub|<math-up|cx>>Z<rsub|1>>. Hence,

    <align*|<tformat|<table|<row|<cell|<big|sum><rsub|k=1><rsup|s><big|sum><rsub|j\<neq\>k>p<rsub|j>*V<rsub|j*k>>|<cell|=<big|sum><rsub|1\<leq\>j\<less\>k\<leq\>s><around|(|p<rsub|j>*V<rsub|j*k>+p<rsub|k>*V<rsub|k*j>|)>>>|<row|<cell|>|<cell|\<geq\><big|sum><rsub|1\<leq\>j\<less\>k\<leq\>s><around*|[|p<rsub|j>*W*<around*|(|\<alpha\>+\<theta\><rsub|2>*F+<around|(|L-\<theta\><rsub|2>|)>*\<delta\><rsub|x<rsub|j>>|)>+p<rsub|k>*W*<around*|(|\<alpha\>+\<theta\><rsub|2>*F+<around|(|L-\<theta\><rsub|2>|)>*\<delta\><rsub|x<rsub|k>>|)>|]>>>|<row|<cell|>|<cell|=<around|(|s-1|)>*<big|sum><rsub|j=1><rsup|s>p<rsub|j>*W*<around*|(|\<alpha\>+\<theta\><rsub|2>*F+<around|(|L-\<theta\><rsub|2>|)>*\<delta\><rsub|x<rsub|j>>|)>>>|<row|<cell|>|<cell|=<around|(|s-1|)>*E*<around|[|W*<around|(|\<alpha\>+\<theta\><rsub|2>*F+<around|(|L-\<theta\><rsub|2>|)>*\<delta\><rsub|X>|)>\|F|]>.>>>>>

    Thus we have shown that <math|E*<around|[|W*<around|(|\<alpha\>+\<theta\>*F+<around|(|L-\<theta\>|)>*\<delta\><rsub|X>|)>\|F|]>>
    decreases in <math|\<theta\>\<in\><around|[|0,L|]>>.
  </proof>

  <\proof>
    <dueto|Proof of Theorem <reference|thm2>>(i) Assume <math|F> has finite
    support. The claim obviously holds for <math|n=1>. For <math|n\<geq\>2>
    we use induction on <math|n>. In view of (<reference|Wdef1>)\U(<reference|Wdef3>), we
    only need to show

    <align|<tformat|<table|<row|<cell|<label|wt1>E*<around*|[|W*<around|(|M*F+\<delta\><rsub|X>,\<alpha\><rsub|2>;A<rsup|1><rsub|n>|)>\|F|]>>|<cell|\<geq\>E*<around*|[|W*<around|(|<wide|M|~>*F+\<delta\><rsub|X>,\<alpha\><rsub|2>;A<rsup|1><rsub|n>|)>\|F|]><space|1em><math-up|and>>>|<row|<cell|<label|wt2>E*<around*|[|W*<around|(|M*F,\<alpha\><rsub|2>+\<delta\><rsub|Y>;A<rsup|1><rsub|n>|)>\|\<alpha\><rsub|2>|]>>|<cell|\<geq\>E*<around*|[|W*<around|(|<wide|M|~>*F,\<alpha\><rsub|2>+\<delta\><rsub|Y>;A<rsup|1><rsub|n>|)>\|\<alpha\><rsub|2>|]>.>>>>>

    By the induction hypothesis, (<reference|wt2>) holds. Define
    <math|\<eta\>=<around|(|<wide|M|~>+1|)>/<around|(|M+1|)>> and
    <math|\<theta\>=<wide|M|~>/\<eta\>>. Noting
    <math|M\<less\>\<theta\>\<less\>M+1>, we may apply Lemma <reference|lem4>
    and get

    <align|<tformat|<table|<row|<cell|<no-number>*E*<around*|[|W*<around|(|M*F+\<delta\><rsub|X>,\<alpha\><rsub|2>;A<rsup|1><rsub|n>|)>\|F|]>\<geq\>>|<cell|E*<around*|[|W*<around|(|\<theta\>*F+<around|(|M+1-\<theta\>|)>*\<delta\><rsub|X>,\<alpha\><rsub|2>;A<rsup|1><rsub|n>|)>\|F|]>>>|<row|<cell|<label|strict>\<geq\>>|<cell|E*<around*|[|W*<around|(|\<eta\>*<around|(|\<theta\>*F+<around|(|M+1-\<theta\>|)>*\<delta\><rsub|X>|)>,\<alpha\><rsub|2>;A<rsup|1><rsub|n>|)>\|F|]>>>|<row|<cell|<no-number>=>|<cell|E*<around*|[|W*<around|(|<wide|M|~>*F+\<delta\><rsub|X>,\<alpha\><rsub|2>;A<rsup|1><rsub|n>|)>\|F|]>,>>>>>

    where (<reference|strict>) holds by the induction hypothesis, as
    <math|\<eta\>\<gtr\>1>. Thus (<reference|wt1>) holds as required.
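
    The last equality in the display and the bounds on <math|\<theta\>> can
    be checked directly; a brief sketch: since
    <math|\<eta\>*\<theta\>=<wide|M|~>> and <math|\<eta\>*<around|(|M+1|)>=<wide|M|~>+1>,

    <\equation*>
      \<eta\>*<around|(|\<theta\>*F+<around|(|M+1-\<theta\>|)>*\<delta\><rsub|X>|)>=<wide|M|~>*F+<around|(|<wide|M|~>+1-<wide|M|~>|)>*\<delta\><rsub|X>=<wide|M|~>*F+\<delta\><rsub|X>,
    </equation*>

    while <math|\<theta\>=<wide|M|~>*<around|(|M+1|)>/<around|(|<wide|M|~>+1|)>>
    satisfies <math|\<theta\>\<less\>M+1> trivially and
    <math|\<theta\>\<gtr\>M> because <math|<wide|M|~>*<around|(|M+1|)>-M*<around|(|<wide|M|~>+1|)>=<wide|M|~>-M\<gtr\>0>.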

    (ii) Assume <math|F> has bounded support. Then for arbitrary
    <math|\<epsilon\>\<gtr\>0> we can construct two distributions
    <math|F<rsup|\<ast\>>> and <math|F<rsub|\<ast\>>> supported on
    <math|<around|{|x<rsub|1>,\<ldots\>,x<rsub|s>|}>> and
    <math|<around|{|x<rsub|0>,\<ldots\>,x<rsub|s-1>|}>> respectively, where
    <math|x<rsub|j>=x<rsub|0>+j*\<epsilon\>>, such that
    <math|F<around|(|x<rsub|0>|)>=0,F<around|(|x<rsub|s>|)>=1> and
    <math|F<rsub|\<ast\>><around|(|x<rsub|j-1>|)>=F<rsup|\<ast\>><around|(|x<rsub|j>|)>=F<around|(|x<rsub|j>|)>,j=1,\<ldots\>,s>.
    By construction, <math|F<rsub|\<ast\>>\<leq\><rsub|<math-up|st>>F\<leq\><rsub|<math-up|st>>F<rsup|\<ast\>>>.
    Theorem <reference|thm1> yields

    <\equation*>
      W*<around|(|M*F<rsub|\<ast\>>,\<alpha\><rsub|2>;A<rsub|n>|)>\<leq\>W*<around|(|M*F,\<alpha\><rsub|2>;A<rsub|n>|)>\<leq\>W*<around|(|M*F<rsup|\<ast\>>,\<alpha\><rsub|2>;A<rsub|n>|)>.
    </equation*>

    Note that if <math|X\<sim\>F<rsup|\<ast\>>> then
    <math|X-\<epsilon\>\<sim\>F<rsub|\<ast\>>>. Therefore the bandits
    <math|<around|(|M*F<rsup|\<ast\>>,\<alpha\><rsub|2>;A<rsub|n>|)>> and
    <math|<around|(|M*F<rsub|\<ast\>>,\<alpha\><rsub|2>;A<rsub|n>|)>> can be
    coupled in an obvious way such that, for every strategy of
    <math|<around|(|M*F<rsup|\<ast\>>,\<alpha\><rsub|2>;A<rsub|n>|)>>, there
    exists a strategy of <math|<around|(|M*F<rsub|\<ast\>>,\<alpha\><rsub|2>;A<rsub|n>|)>>
    under which the payoff at each stage is either the same (when arm 2 is
    selected), or exactly <math|\<epsilon\>> less (when arm 1 is selected).
    Thus we have shown

    <\equation*>
      W*<around|(|M*F<rsup|\<ast\>>,\<alpha\><rsub|2>;A<rsub|n>|)>-W*<around|(|M*F<rsub|\<ast\>>,\<alpha\><rsub|2>;A<rsub|n>|)>\<leq\>\<epsilon\>*<big|sum><rsub|i=1><rsup|n>a<rsub|i>.
    </equation*>

    Hence <math|W*<around|(|M*F<rsup|\<ast\>>,\<alpha\><rsub|2>;A<rsub|n>|)>\<to\>W*<around|(|M*F,\<alpha\><rsub|2>;A<rsub|n>|)>>
    as <math|\<epsilon\>\<to\>0>, and the monotonicity of
    <math|W*<around|(|M*F<rsup|\<ast\>>,\<alpha\><rsub|2>;A<rsub|n>|)>> with
    respect to <math|M> implies the corresponding monotonicity of
    <math|W*<around|(|M*F,\<alpha\><rsub|2>;A<rsub|n>|)>>.
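
    To make this last step explicit, here is a brief sketch, assuming (as we
    take to be implicit in the setup) that the discount factors
    <math|a<rsub|i>> are nonnegative. For <math|0\<less\>M\<less\><wide|M|~>>,
    Theorem <reference|thm1>, step (i) applied to the finitely supported
    <math|F<rsup|\<ast\>>>, and the coupling bound give

    <\equation*>
      W*<around|(|<wide|M|~>*F,\<alpha\><rsub|2>;A<rsub|n>|)>\<leq\>W*<around|(|<wide|M|~>*F<rsup|\<ast\>>,\<alpha\><rsub|2>;A<rsub|n>|)>\<leq\>W*<around|(|M*F<rsup|\<ast\>>,\<alpha\><rsub|2>;A<rsub|n>|)>\<leq\>W*<around|(|M*F,\<alpha\><rsub|2>;A<rsub|n>|)>+\<epsilon\>*<big|sum><rsub|i=1><rsup|n>a<rsub|i>,
    </equation*>

    and letting <math|\<epsilon\>\<to\>0> recovers (<reference|mono2>).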

    (iii) Finally, assume <math|F> is an arbitrary distribution with a finite
    mean. Suppose <math|X\<sim\>F>. For <math|L\<gtr\>0> let
    <math|F<rsup|\<ast\>>> be the distribution of <math|X<rsup|\<ast\>>>,
    defined as <math|X> if <math|<around|\||X|\|>\<leq\>L> and <math|0>
    otherwise. We construct a coupling between
    <math|<around|(|M*F,\<alpha\><rsub|2>;A<rsub|n>|)>> and
    <math|<around|(|M*F<rsup|\<ast\>>,\<alpha\><rsub|2>;A<rsub|n>|)>>. Let
    <math|X<rsub|k>> be the resulting observation when arm 1 of
    <math|<around|(|M*F,\<alpha\><rsub|2>;A<rsub|n>|)>> is pulled for the
    <math|k>th time. If <math|<around|\||X<rsub|1>|\|>\<leq\>L> then let
    <math|X<rsub|1><rsup|\<ast\>>=X<rsub|1>>, otherwise
    <math|X<rsub|1><rsup|\<ast\>>=0>, yielding
    <math|X<rsub|1><rsup|\<ast\>>\<sim\>F<rsup|\<ast\>>>. For general
    <math|k\<geq\>1>, if <math|<around|\||X<rsub|i>|\|>\<leq\>L,i=1,\<ldots\>,k>,
    then let <math|X<rsub|k+1><rsup|\<ast\>>=X<rsub|k+1>> if
    <math|<around|\||X<rsub|k+1>|\|>\<leq\>L> and
    <math|X<rsub|k+1><rsup|\<ast\>>=0> otherwise. In this case the
    conditional distribution of <math|X<rsub|k+1>> given
    <math|X<rsub|i>,i=1,\<ldots\>,k>, is <math|<around|(|M*F+<big|sum><rsub|i=1><rsup|k>\<delta\><rsub|X<rsub|i>>|)>/<around|(|M+k|)>>.
    Since <math|<around|\||X<rsub|i>|\|>\<leq\>L,i=1,\<ldots\>,k>, we have
    <math|X<rsub|i><rsup|\<ast\>>=X<rsub|i>,i=1,\<ldots\>,k>, and the
    conditional distribution of <math|X<rsub|k+1><rsup|\<ast\>>> given
    <math|X<rsub|i><rsup|\<ast\>>,i=1,\<ldots\>,k>, is precisely
    <math|<around|(|M*F<rsup|\<ast\>>+<big|sum><rsub|i=1><rsup|k>\<delta\><rsub|X<rsub|i><rsup|\<ast\>>>|)>/<around|(|M+k|)>>.
    That is, <math|X<rsub|i><rsup|\<ast\>>,i=1,\<ldots\>,k+1>, can be
    regarded as successive pulls from arm 1 of
    <math|<around|(|M*F<rsup|\<ast\>>,\<alpha\><rsub|2>;A<rsub|n>|)>> as long
    as <math|<around|\||X<rsub|i>|\|>\<leq\>L,i=1,\<ldots\>,k>. Let the
    <math|k>th pull from arm 2 be <math|Y<rsub|k>> for both bandits. In the
    event that all <math|<around|\||X<rsub|i>|\|>\<leq\>L,i=1,\<ldots\>,n>,
    the optimal strategy for <math|<around|(|M*F,\<alpha\><rsub|2>;A<rsub|n>|)>>
    can be adopted for <math|<around|(|M*F<rsup|\<ast\>>,\<alpha\><rsub|2>;A<rsub|n>|)>>
    throughout, yielding identical pulls (not all
    <math|X<rsub|i>,i=1,\<ldots\>,n>, need be realized). By considering a trivial
    upper (respectively, lower) bound for the payoff of
    <math|<around|(|M*F,\<alpha\><rsub|2>;A<rsub|n>|)>> (respectively,
    <math|<around|(|M*F<rsup|\<ast\>>,\<alpha\><rsub|2>;A<rsub|n>|)>>) when
    at least one <math|<around|\||X<rsub|i>|\|>\<gtr\>L>, we have

    <align*|<tformat|<table|<row|<cell|W*<around|(|M*F,\<alpha\><rsub|2>;A<rsub|n>|)>-W*<around|(|M*F<rsup|\<ast\>>,\<alpha\><rsub|2>;A<rsub|n>|)>>|<cell|\<leq\>E*<around*|[|1<rsub|\<cup\><rsub|i=1><rsup|n><around|{|<around|\||X<rsub|i>|\|>\<gtr\>L|}>>*<big|sum><rsub|i=1><rsup|n><around*|(|a<rsub|i>*<around|(|<around|\||Y<rsub|i>|\|>+<around|\||X<rsub|i>|\|>|)>-a<rsub|i>*<around|(|-<around|\||Y<rsub|i>|\|>-L|)>|)>|]>>>|<row|<cell|>|<cell|\<leq\>E*<around*|[|1<rsub|\<cup\><rsub|i=1><rsup|n><around|{|<around|\||X<rsub|i>|\|>\<gtr\>L|}>>*<big|sum><rsub|i=1><rsup|n>a<rsup|\<ast\>>*<around|(|2<around|\||Y<rsub|i>|\|>+<around|\||X<rsub|i>|\|>+L|)>|]>>>|<row|<cell|>|<cell|\<leq\>E*<around*|[|<around*|(|<big|sum><rsub|i=1><rsup|n>1<rsub|<around|{|<around|\||X<rsub|i>|\|>\<gtr\>L|}>>|)>*<big|sum><rsub|i=1><rsup|n>a<rsup|\<ast\>>*<around|(|2<around|\||Y<rsub|i>|\|>+<around|\||X<rsub|i>|\|>+L|)>|]>>>|<row|<cell|>|<cell|\<equiv\>a<rsup|\<ast\>>*h<around|(|L|)>,>>>>>

    where <math|a<rsup|\<ast\>>\<equiv\>max<rsub|i=1><rsup|n> a<rsub|i>>.
    Direct calculation using exchangeability yields

    <align*|<tformat|<table|<row|<cell|h<around|(|L|)>=n<rsup|2>*Pr
    <around|(|<around|\||X<rsub|1>|\|>\<gtr\>L|)>*<around|(|2*E<around|\||Y<rsub|1>|\|>+L|)>+n*E<around*|[|1<rsub|<around|\||X<rsub|1>|\|>\<gtr\>L><around|\||X<rsub|1>|\|>|]>+n*<around|(|n-1|)>*E<around*|[|1<rsub|<around|\||X<rsub|1>|\|>\<gtr\>L><around|\||X<rsub|2>|\|>|]>>>>>>

    The first two terms tend to zero as <math|L\<to\>\<infty\>> by dominated
    convergence since <math|E<around|\||X<rsub|1>|\|>\<less\>\<infty\>>. For
    the last term, by conditioning on <math|X<rsub|1>> we have

    <\equation*>
      E<around*|[|1<rsub|<around|\||X<rsub|1>|\|>\<gtr\>L><around|\||X<rsub|2>|\|>|]>=E*<around*|[|1<rsub|<around|\||X<rsub|1>|\|>\<gtr\>L>*<around*|(|<frac|M|M+1>*E<around|\||X|\|>+<frac|1|M+1><around|\||X<rsub|1>|\|>|)>|]>,
    </equation*>

    which also vanishes as <math|L\<to\>\<infty\>>. Thus

    <\equation*>
      limsup<rsub|L\<to\>\<infty\>><around*|[|W*<around|(|M*F,\<alpha\><rsub|2>;A<rsub|n>|)>-W*<around|(|M*F<rsup|\<ast\>>,\<alpha\><rsub|2>;A<rsub|n>|)>|]>\<leq\>0.
    </equation*>
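
    The vanishing of <math|h<around|(|L|)>> used here can be sketched as
    follows, assuming <math|E<around|\||Y<rsub|1>|\|>\<less\>\<infty\>>
    (that is, <math|\<alpha\><rsub|2>> has a finite mean, which we take from
    the setup). The only term in which <math|L> appears as a factor is
    controlled by

    <\equation*>
      L*Pr <around|(|<around|\||X<rsub|1>|\|>\<gtr\>L|)>\<leq\>E<around*|[|1<rsub|<around|\||X<rsub|1>|\|>\<gtr\>L><around|\||X<rsub|1>|\|>|]>,
    </equation*>

    and the right-hand side tends to zero by dominated convergence because
    <math|E<around|\||X<rsub|1>|\|>\<less\>\<infty\>>.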

    By a parallel argument, we get <math|liminf<rsub|L\<to\>\<infty\>><around*|[|W*<around|(|M*F,\<alpha\><rsub|2>;A<rsub|n>|)>-W*<around|(|M*F<rsup|\<ast\>>,\<alpha\><rsub|2>;A<rsub|n>|)>|]>\<geq\>0>.
    Thus <math|W*<around|(|M*F<rsup|\<ast\>>,\<alpha\><rsub|2>;A<rsub|n>|)>>
    tends to <math|W*<around|(|M*F,\<alpha\><rsub|2>;A<rsub|n>|)>> as
    <math|L\<to\>\<infty\>>, and the monotonicity of
    <math|W*<around|(|M*F,\<alpha\><rsub|2>;A<rsub|n>|)>> with respect to
    <math|M> is proved as before.
  </proof>

  <with|font-series|bold|Remark 2.> Clayton and Berry (1985) also conjecture
  that the monotonicity in Corollary <reference|coro2> is strict if
  <math|n\<geq\>2,A<rsub|n>=<around|(|1,1,\<ldots\>,1|)>>, and <math|F> is
  nondegenerate. This can be confirmed by a careful analysis of the above
  results. Some modifications are needed. Using arguments similar to steps
  (ii) and (iii) in the proof of Theorem <reference|thm2>, we can first
  establish that Lemma <reference|lem4> holds without the finite support
  restriction. Directly applying this strengthened Lemma <reference|lem4>
  shows that (<reference|mono2>) holds with strict inequality assuming
  <math|n\<geq\>2,A<rsub|n>=<around|(|1,1,\<ldots\>,1|)>,F> is nondegenerate,
  and arm 1 is optimal initially in <math|<around|(|<wide|M|~>*F,\<alpha\><rsub|2>;A<rsub|n>|)>>.
  Under such conditions, the strictness of the inequality holds by induction
  as one key step (<reference|strict>) holds with strict inequality. It
  follows that Corollary <reference|coro2> can be strengthened to strict
  monotonicity assuming uniform discounting, <math|n\<geq\>2>, and a
  nondegenerate <math|F>.

  <\thebibliography|10>
    <bibitem|B72>D. A. Berry, A Bernoulli two-armed bandit,
    <with|font-shape|italic|Ann. Math. Statist.> <with|font-series|bold|43>
    (1972) 871\U897.

    <bibitem|BF79>D. A. Berry and B. Fristedt, Bernoulli one-armed
    bandits\Varbitrary discount sequences, <with|font-shape|italic|Ann.
    Statist.> <with|font-series|bold|7> (1979) 1086\U1105.

    <bibitem|BF86>D. A. Berry and B. Fristedt, <with|font-shape|italic|Bandit
    Problems: Sequential Allocation of Experiments> (1985) Chapman and Hall,
    New York.

    <bibitem|BJK56>R. N. Bradt, S. M. Johnson and S. Karlin, On sequential
    designs for maximizing the sum of <math|n> observations,
    <with|font-shape|italic|Ann. Math. Statist.> <with|font-series|bold|27>
    (1956) 1060\U1074.

    <bibitem|Ch94>M. K. Chattopadhyay, Two-armed Dirichlet bandits with
    discounting, <with|font-shape|italic|Ann. Statist.>
    <with|font-series|bold|22> (1994) 1212\U1221.

    <bibitem|C68>H. Chernoff, Optimal stochastic control,
    <with|font-shape|italic|Sankhya A> <with|font-series|bold|30> (1968)
    221\U252.

    <bibitem|CB85>M. K. Clayton and D. A. Berry, Bayesian nonparametric
    bandits, <with|font-shape|italic|Ann. Statist.>
    <with|font-series|bold|13> (1985) 1523\U1534.

    <bibitem|F73>T. S. Ferguson, A Bayesian analysis of some nonparametric
    problems, <with|font-shape|italic|Ann. Statist.>
    <with|font-series|bold|1> (1973) 209\U230.

    <bibitem|G79>J. C. Gittins, Bandit processes and dynamic allocation
    indices (with discussion), <with|font-shape|italic|J. Roy. Statist. Soc.
    B> <with|font-series|bold|41> (1979) 148\U177.

    <bibitem|GJ74>J. C. Gittins and D. M. Jones, A dynamic allocation index
    for the sequential design of experiments. In: J. Gani, Editor,
    <with|font-shape|italic|Progress in Statistics,> North-Holland, Amsterdam
    (1974) 241-266.

    <bibitem|GW92>J. C. Gittins and Y.-G. Wang, The learning component of
    dynamic allocation indices, <with|font-shape|italic|Ann. Statist.>
    <with|font-series|bold|20> (1992) 1625\U1636.

    <bibitem|H92>S. J. Herschkorn, Bandit bounds from stochastic variability
    extrema, <with|font-shape|italic|Stat. Prob. Lett.>
    <with|font-series|bold|35> (1997) 283\U288.

    <bibitem|MS02>A. Müller and D. Stoyan, <with|font-shape|italic|Comparison
    Methods for Stochastic Models and Risks>, Wiley & Sons, Chichester
    (2002).

    <bibitem|SS07>M. Shaked and J. G. Shanthikumar,
    <with|font-shape|italic|Stochastic Orders>, Springer, New York (2007).

    <bibitem|W80>P. Whittle, Multi-armed bandits and the Gittins index,
    <with|font-shape|italic|J. Roy. Statist. Soc. B>
    <with|font-series|bold|42> (1980) 143-149.
  </thebibliography>
</body>