The Finite-time Analysis of the Multiarmed Bandit Problem paper is a seminal work in bandit research. It was the first paper to present and prove an algorithm for selecting a bandit with finite-time optimality guarantees. If you're not familiar with bandits, Wikipedia has a decent entry.
In this post, I'm going to assume you have made an attempt at reading the UCB paper and tried to understand the proof for UCB-1. My goal is to provide an in-depth explanation for how they arrive at their final finite-time optimality guarantee for the algorithm.
Intuition for the Chernoff-Hoeffding bound and Bernstein inequality
We'll start with the two "Fact" portions of the Finite-time Analysis paper.
You have a set of independent random variables . They don't specify the independence assumption, but I believe it's required, unless they are referring to a more general proof than what Wikipedia has.
If you sample each of them once and add the results together, you get . The expected value of the last sample is , and the expected value of the sum is thus . It's kind of strange notation and I think Wikipedia's page makes it more clear by specifying it in terms of the expected value of the sum. I think the goal here is to use a unified notation throughout the paper.
Now, let's say you want to determine what the probability is that the resulting sampled sum deviates from the expected value of the sum by at least . This is the left hand side of the inequality: .
The bound that is derived is , so it's a function of , the number of random variables (or samples) that you're summing, and and , the amount by which the sum must deviate.
How affects the bound
Let's first think about and how the bound changes as increases.
Where is a constant. Since is in the exponent's denominator, as increases, decreases, which means the bound increases as increases.
Why does this happen? If you remember back to your intro to probability class, you will remember that the variance of a sum of random variables is the sum of their variances. So as you add more and more samples together, meaning as grows, you get a larger and larger variance. That means the odds of hitting a value way off from the mean increase. So hitting at least $a$ away from the mean will become easier and easier, meaning the probability increases. Thus, the bounds on the probability has to increase.
How affects the bound
Conversely, we see from the chain of inequalities that as increases, the bound should decrease. This is because you're increasing the required distance from the expected value. The farther out from the mean you are, the less likely it is to happen, so the probability drops.
The intuition here should follow naturally from the explanation of the Chernoff-Hoeffding bound. The only thing we're changing here is that instead of , the number of random variables we're summing, we have , the variance of the sum. Again, as increases, the bound will decrease; as increases, the bound will also increase.
Proof of Theorem 1 (UCB-1)
Now I'm going to try walking through the proof and hopefully save someone the agony of walking through it from scratch. I'll assume you understand Facts 1 and 2.
They define the variable as follows:
This is also the value used in the UCB1 algorithm that gets added to the average reward for a given bandit. Compare this to the right hand side of the Chernoff-Hoeffding bound:
You'll likely notice that seems to be some sort of inverse of the bound. That's important to keep in mind as it appears to be the crux of their whole proof and the main contribution of their algorithm.
I'll go line-by-line and justify each transformation.
Remember here that is the number of times that the policy will choose bandit after pulls.
The UCB1 algorithm requires that you start by playing each of the bandits once. So for the first pulls, we are guaranteed to pull bandit exactly once, hence the first constant of and why we only need to sum starting at . From here, it's clear that the number of remaining pulls for the bandit is just the sum of all the pulls in the remaining steps.
Here they introduce an arbitrary positive integer . If we set , it's equivalent to the first line. If we set , however, we arrive at something that is at least as large. Why? Well, consider the case of . This means that we add a constant instead of , and in the summation the first iteration will be to add because means that after pulls we would have had to pull bandit at least times. However, we know that UCB1 will pull each bandit only once in the first pulls, so that will never happen.
Now, if the pull on was for bandit , we will be back to the original algorithm. If it wasn't, then we keep skipping (adding 0). So effectively we're just shifting the majority of the total to the constant instead of the summation term. The exception is when , in which case of course we'll get a larger value than in line 1.
Here they've replaced the original with the inequality . This inequality is effectively saying that at every step the bandit chosen will be the one which maximizes the equation used in the UCB1 algorithm.
The streakiness property in line 4 is now formally replaced by the two inner summations, which is similar but will produce a number of cases where the value of line 6 is greater than line 5.
The important part here is that we have compacted things into the inequality of the observed average value of the optimal policy and the observed average value of the UCB1 policy.
At least one of the cases in equations 7-9 must hold if the inequality in our formula is true. Let's look at equation 7:
In the Chernoff-Hoeffding bound in Fact 1, we were looking at the sum of our random variables and comparing them to times the average. However, we're looking at the average, of our samples, so we can plug in and .1
Similarly, we have equation 8:
Which we can plug in and :
is the width of the confidence interval of bandit at time , assuming has been pulled times. If has been pulled lots of times ( or more times), the interval will shrink to be no more than half of the gap between and . is the time corresponding to which we are bounding the regret; .
In other words, if , we get:
Here we are subbing in our new value for . We are also replacing the original inequality from the last line of equation 6 with what equations 7-9 showed can be the only possible reasons that the original inequality would be true. Since we know the bounds on each of the two component inequalities, we can replace them with their bounds and add them together. We can also start the last sum at 1 instead because we know is a positive integer, so starting it at 1 will produce a result that is at least as large; the same goes for iterating to instead of . Making those substitutions yields the next line:
Since we are adding in the inner two loops times, we can simplify the equation:
Now, I'm sure you have your infinite series memorized so this is clear, but let's pretend we rely on Google to remember them for us. In that case, you will have reminded yourself of Euler's solution to the Basel Problem:
Applying the constant multiplier of 2 to the infinite series and replacing the ceiling calculation with adding 1 to the real-valued calculation gives us the final inequality:
And we're done! Hopefully that helps someone!
- I'm not sure that this is the best bound we can give. The average will not grow like the sum will-- in fact, it will actually shrink in its variance as n increases. Am I missing something here or is there some reason they chose such a conservative bound? ↩
- Thanks to Shivaram Kalyanakrishnan for the explanation of this step. ↩