# NashCodingYet Another Artificial Intelligence Blog

12Mar/133

## Walkthrough of the UCB1 Optimality Proof

The Finite-time Analysis of the Multiarmed Bandit Problem paper is a seminal work in bandit research. It was the first paper to present and prove an algorithm for selecting a bandit with finite-time optimality guarantees. If you're not familiar with bandits, Wikipedia has a decent entry.

In this post, I'm going to assume you have made an attempt at reading the UCB paper and tried to understand the proof for UCB-1. My goal is to provide an in-depth explanation for how they arrive at their final finite-time optimality guarantee for the algorithm.

## Intuition for the Chernoff-Hoeffding bound and Bernstein inequality

We'll start with the two "Fact" portions of the Finite-time Analysis paper.

###### Chernoff-Hoeffding bound

You have a set of independent random variables $X_1, \dots, X_n$. They don't specify the independence assumption, but I believe it's required, unless they are referring to a more general proof than what Wikipedia has.

If you sample each of them once and add the results together, you get $S_n = X_1 + \dots + X_n$. The expected value of the last sample is $\mu$, and the expected value of the sum is thus $n \mu$. It's kind of strange notation and I think Wikipedia's page makes it more clear by specifying it in terms of the expected value of the sum. I think the goal here is to use a unified notation throughout the paper.

Now, let's say you want to determine what the probability is that the resulting sampled sum $S_n$ deviates from the expected value of the sum $n \mu$ by at least $a$. This is the left hand side of the inequality: $\mathbb{P}\{S_n \geq n \mu + a \}$.

The bound that is derived is $e^{-2a^2/n}$, so it's a function of $n$, the number of random variables (or samples) that you're summing, and and $a$, the amount by which the sum must deviate.

###### How $n$ affects the bound

Let's first think about $n$ and how the bound changes as $n$ increases. $e^{-2a^2/n} = \frac{1}{e^{2a^2/n}} = \frac{1}{e^{k/n}}$

Where $k = 2a^2$ is a constant. Since $n$ is in the exponent's denominator, as $n$ increases, $e^{k/n}$ decreases, which means the bound increases as $n$ increases.

Why does this happen? If you remember back to your intro to probability class, you will remember that the variance of a sum of random variables is the sum of their variances. So as you add more and more samples together, meaning as $n$ grows, you get a larger and larger variance. That means the odds of hitting a value way off from the mean increase. So hitting at least $a$ away from the mean will become easier and easier, meaning the probability increases. Thus, the bounds on the probability has to increase.

###### How $a$ affects the bound

Conversely, we see from the chain of inequalities that as $a$ increases, the bound should decrease. This is because you're increasing the required distance from the expected value. The farther out from the mean you are, the less likely it is to happen, so the probability drops.

###### Bernstein inequality

The intuition here should follow naturally from the explanation of the Chernoff-Hoeffding bound. The only thing we're changing here is that instead of $n$, the number of random variables we're summing, we have $\sigma^2$, the variance of the sum. Again, as $a$ increases, the bound will decrease; as $\sigma^2$ increases, the bound will also increase.

## Proof of Theorem 1 (UCB-1)

Now I'm going to try walking through the proof and hopefully save someone the agony of walking through it from scratch. I'll assume you understand Facts 1 and 2.

Preliminaries
They define the variable $c_{t,s}$ as follows: $c_{t,s} = \sqrt{ (2 \ln t) / s}$

This is also the value used in the UCB1 algorithm that gets added to the average reward for a given bandit. Compare this to the right hand side of the Chernoff-Hoeffding bound: $e^{-2a^2/n}$

You'll likely notice that $c_{t,s}$ seems to be some sort of inverse of the bound. That's important to keep in mind as it appears to be the crux of their whole proof and the main contribution of their algorithm.

Equation 6
I'll go line-by-line and justify each transformation.

line 1 $T_i(n) = 1 + \displaystyle \sum_{t=K+1}^{n} \{ I_t = i \}$

Remember here that $T_i(n)$ is the number of times that the policy will choose bandit $i$ after $n$ pulls.

The UCB1 algorithm requires that you start by playing each of the $K$ bandits once. So for the first $K$ pulls, we are guaranteed to pull bandit $i$ exactly once, hence the first constant of $1$ and why we only need to sum starting at $K+1$. From here, it's clear that the number of remaining pulls for the bandit is just the sum of all the pulls in the remaining steps.

line 2 $T_i(n) = \ell + \displaystyle \sum_{t=K+1}^{n} \{ I_t = i, T_i(t-1) \geq \ell \}$

Here they introduce an arbitrary positive integer $\ell \geq 1$. If we set $\ell = 1$, it's equivalent to the first line. If we set $\ell > 1$, however, we arrive at something that is at least as large. Why? Well, consider the case of $\ell = 2$. This means that we add a constant $2$ instead of $1$, and in the summation the first iteration will be to add $0$ because $T_i(t-1) \geq \ell$ means that after $K$ pulls we would have had to pull bandit $i$ at least $2$ times. However, we know that UCB1 will pull each bandit only once in the first $K$ pulls, so that will never happen.

Now, if the pull on $t = K+1$ was for bandit $i$, we will be back to the original algorithm. If it wasn't, then we keep skipping (adding 0). So effectively we're just shifting the majority of the total to the constant instead of the summation term. The exception is when $\ell > T_i(n)$, in which case of course we'll get a larger value than in line 1.

line 3 $T_i(n) \leq \ell + \displaystyle \sum_{t=K+1}^{n} \{ \bar{X}^*_{T^*(t-1)} + c_{t-1,T^*(t-1)} \leq \bar{X}_{i,T_i(t-1)} + c_{t-1,T_i(t-1)}, T_i(t-1) \geq \ell \}$

Here they've replaced the original $I_t = i$ with the inequality $\bar{X}^*_{T^*(t-1)} + c_{t-1,T^*(t-1)} \leq \bar{X}_{i,T_i(t-1)} + c_{t-1,T_i(t-1)}$. This inequality is effectively saying that at every step the bandit chosen will be the one which maximizes the equation used in the UCB1 algorithm.

line 4 $T_i(n) \leq \ell + \displaystyle \sum_{t=K+1}^{n} \{ \displaystyle \min_{0 < s < t} (\bar{X}^*_{s} + c_{t-1,s}) \leq \displaystyle \max_{\ell \leq s_i < t} (\bar{X}_{i,s_i} + c_{t-1,s_i}) \} [/latex] Rather than looking at each time-step comparison, they now expand that to say that at every time step, you look at the minimum value the optimal bandit has had until now and compare it to the maximum value that bandit $i$ has had from $\ell$ until now. This is actually going to produce a number of cases where the line 4 value is larger than line 3 because it means that the summation can go on streaks where it finds a big value and the optimal bandit had a low value. It's useful though as a stepping stone into the next line. line 5 $T_i(n) \leq \ell + \displaystyle{\sum_{t=1}^{\infty} \sum_{s=1}^{t-1} \sum_{s_i = \ell}^{t-1}} \{ \bar{X}^*_{s} + c_{t-1,s} \leq \bar{X}_{i,s_i} + c_{t-1,s_i} \}.$ The streakiness property in line 4 is now formally replaced by the two inner summations, which is similar but will produce a number of cases where the value of line 6 is greater than line 5. The important part here is that we have compacted things into the inequality of the observed average value of the optimal policy and the observed average value of the UCB1 policy. Equations 7-8 At least one of the cases in equations 7-9 must hold if the inequality in our formula is true. Let's look at equation 7: $\bar{X}^*_s \leq \mu^* - c_{t,s}$ In the Chernoff-Hoeffding bound in Fact 1, we were looking at the sum of our random variables and comparing them to $n$ times the average. However, we're looking at the average, $\bar{X}^*_s$ of our samples, so we can plug in $a = c_{t,s}$ and $n = 1$.1 $e^{-2a^2/n} = e^{\frac{-2\sqrt{\frac{2\ln t}{s}}^2}{1}} = e^{-2\sqrt{\frac{2\ln t}{s}}^2} = e^{\frac{-4\ln t}{s}} \leq e^{-4 \ln t} = t^{-4}$ Similarly, we have equation 8: $\bar{X}_{i,s_i} \geq \mu_i - c_{t,s_i}$ Which we can plug in $a = c_{t,s_i}$ and $n = 1$: $e^{-2a^2/n} = e^{\frac{-2\sqrt{\frac{2\ln t}{s_i}}^2}{1}} = e^{-2\sqrt{\frac{2\ln t}{s_i}}^2} = e^{\frac{-4\ln t}{s_i}} \leq e^{-4 \ln t} = t^{-4}$ Equation 9 $\mu^* < \mu_i + 2c_{t,s_i}$ This equation represents the case where the suboptimal bandit is chosen because the offset is too large, not because of random noise. They show that by choosing $\ell = \lceil (8 \ln n) / \Delta^2_i \rceil$, this is impossible.2 $c_{t, s} = \sqrt{\frac{2 \ln t}{s}}$ is the width of the confidence interval of bandit $i$ at time $t$, assuming $i$ has been pulled $s$ times. If $i$ has been pulled lots of times ( $\ell$ or more times), the interval will shrink to be no more than half of the gap between $\mu_i$ and $\mu^{*}$. $n$ is the time corresponding to which we are bounding the regret; $t \leq n$. In other words, if $s \geq \frac{8 \ln n}{\Delta_{i}^{2}}$, we get: $c_{t, s} = \sqrt{\frac{2 \ln t}{s}}\\ \\ <= \sqrt{\frac{2 \ln t}{\frac{8 \ln n}{\Delta_{i}^{2}}}} \text{ (Substituting in our value for s)}\\ \\ = \sqrt{\frac{2 \Delta_i^2 \ln t}{8 \ln n}} \text{ (Simplifying the fraction by bringing }\Delta_i^2\text{ to the numerator)}\\ \\ = \sqrt{\frac{\Delta_i^2 \ln t}{4 \ln n}} \text{ (Simplifying the fraction by}\frac{2}{8}=\frac{1}{4}\text{ )}\\ \\ = \Delta_i \sqrt{\frac{\ln t}{4 \ln n}} \text{ (Noting }\sqrt{\Delta_i^2} = \Delta_i\text{ )}\\ \\ = \frac{\Delta_i}{2} \sqrt{\frac{\ln t}{\ln n}} \text{ (Noting }\sqrt{\frac{1}{4}} = \frac{1}{2}\text{)}\\ \\ <= \frac{\Delta_{i}}{2} \text{ (Since }t \leq n, \sqrt{\frac{\ln t}{\ln n}} \leq 1\text{ )}\\$ Since $\Delta_{i} = \mu^* - \mu_i$, equation (9) cannot be true. Infinite series inequality [latex] \mathbb{E}[T_i(n)] \leq \Bigg{\lceil} \frac{8 \ln n}{\Delta^2_i} \Bigg{\rceil} + \displaystyle{\sum_{t=1}^{\infty} \sum_{s=1}^{t-1} \sum_{s_i = \lceil (8 \ln n) / \Delta^2_i \rceil}^{t-1}} (\mathbb{P}\{\bar{X}^*_s \leq \mu^* - c_{t,s}\} + \mathbb{P}\{\bar{X}_{i,s_i} \geq \mu_i - c_{t,s_i}\})$

Here we are subbing in our new value for $\ell$. We are also replacing the original inequality from the last line of equation 6 with what equations 7-9 showed can be the only possible reasons that the original inequality would be true. Since we know the bounds on each of the two component inequalities, we can replace them with their $t^{-4}$ bounds and add them together. We can also start the last sum at 1 instead $\ell$ because we know $\ell$ is a positive integer, so starting it at 1 will produce a result that is at least as large; the same goes for iterating to $t$ instead of $t-1$. Making those substitutions yields the next line: $\mathbb{E}[T_i(n)] \leq \Bigg{\lceil} \frac{8 \ln n}{\Delta^2_i} \Bigg{\rceil} + \displaystyle{\sum_{t=1}^{\infty} \sum_{s=1}^{t} \sum_{s_i = 1}^{t}} 2t^{-4}$

Since we are adding $2t^{-4}$ in the inner two loops $t$ times, we can simplify the equation: $\mathbb{E}[T_i(n)] \leq \Bigg{\lceil} \frac{8 \ln n}{\Delta^2_i} \Bigg{\rceil} + \displaystyle{\sum_{t=1}^{\infty} \sum_{s=1}^{t} \sum_{s_i = 1}^{t}} 2t^{-4}\\ = \mathbb{E}[T_i(n)] \leq \Bigg{\lceil} \frac{8 \ln n}{\Delta^2_i} \Bigg{\rceil} + \displaystyle{\sum_{t=1}^{\infty} \sum_{s=1}^{t}} 2t^{-3}\\ = \mathbb{E}[T_i(n)] \leq \Bigg{\lceil} \frac{8 \ln n}{\Delta^2_i} \Bigg{\rceil} + \displaystyle{\sum_{t=1}^{\infty}} 2t^{-2}$

Now, I'm sure you have your infinite series memorized so this is clear, but let's pretend we rely on Google to remember them for us. In that case, you will have reminded yourself of Euler's solution to the Basel Problem: $\sum_{n=1}^{\infty} \frac{1}{n^2} = 1 + \frac{1}{4} + ... + \frac{1}{n^2} = \frac{\pi^2}{6}$

Applying the constant multiplier of 2 to the infinite series and replacing the ceiling calculation with adding 1 to the real-valued calculation gives us the final inequality: $\mathbb{E}[T_i(n)] \leq \frac{8 \ln n}{\Delta^2_i} + 1 + \frac{\pi^2}{3}$

And we're done! Hopefully that helps someone!

1. I'm not sure that this is the best bound we can give. The average will not grow like the sum will-- in fact, it will actually shrink in its variance as n increases. Am I missing something here or is there some reason they chose such a conservative bound?
2. Thanks to Shivaram Kalyanakrishnan for the explanation of this step.
1. Is there an easy way to write the $latex c_{t,s}$ as part of a confidence bound? e.g., to say it gives a one-sided confidence bound of $latex (1-\alpha)$ for some $latex \alpha = o(1)$ in terms of $latex s,t$.