The Finite-time Analysis of the Multiarmed Bandit Problem paper is a seminal work in bandit research. It was the first paper to present an arm-selection algorithm with proven finite-time optimality guarantees. If you're not familiar with bandits, Wikipedia has a decent entry.
In this post, I'm going to assume you have made an attempt at reading the UCB paper and tried to understand the proof for UCB-1. My goal is to provide an in-depth explanation of how the authors arrive at their final finite-time optimality guarantee for the algorithm.
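As a quick refresher before diving into the proof, the algorithm itself is tiny: play each arm once, then always pull the arm with the highest upper confidence index, i.e. its empirical mean plus sqrt(2 ln n / n_j). Here is a minimal sketch in Python; the `pull` callback and the incremental-mean bookkeeping are my own framing, not from the paper:

```python
import math

def ucb1(pull, n_arms, horizon):
    """Sketch of UCB-1: play each arm once, then always pull the arm
    maximizing its empirical mean plus an exploration bonus."""
    counts = [0] * n_arms          # times each arm has been pulled
    means = [0.0] * n_arms         # empirical mean reward per arm
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1            # initialization: try every arm once
        else:
            # index = mean + sqrt(2 ln t / n_j); the bonus shrinks as an
            # arm is sampled more, which is what bounds the regret
            arm = max(range(n_arms),
                      key=lambda j: means[j] + math.sqrt(2 * math.log(t) / counts[j]))
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means
```

Run against two arms with a clear gap, the suboptimal arm's pull count grows only logarithmically in the horizon, which is the behavior the paper's regret bound formalizes.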
As I noted in my first post, I was a little skeptical about whether a Nash Equilibrium (NE) could be evolved by taking the squared loss of each hand. My conclusion was that, given an expected-value evaluation function, it was possible using a best-opponent fitness but not using a squared-loss fitness. It turned out there was a small bug in the squared-loss code that caused a big change in the results. Below are the results with the corrected fitness function implementation.
In part 1 of the tutorial, we set up a basic experiment to evolve a neural network to play Tic-Tac-Toe against a couple of hand-coded opponents. In part 2, we're going to create a competitive coevolution experiment where the networks evolve by playing against one another.
The NeuroEvolution of Augmenting Topologies (NEAT) algorithm enables users to evolve neural networks without having to worry about esoteric details like hidden layers. Instead, NEAT is clever enough to incorporate all of that into the evolution process itself. You only have to worry about the inputs, outputs, and fitness evaluation.
I'm the kind of person who finds himself reading about a new technology or a cool algorithm, and tries to implement it based on the high-level description. Unfortunately, I don't always guess everything correctly, and sometimes the implementation turns out to not work; or it kind of works, but not as well as expected, which can be even worse.
A key example of this for me was when I read about Evolutionary Algorithms. At its core, it sounds so ingeniously simple:
1. Create a population of individuals
2. Score the individuals based on some performance metric
3. Kill off the weakest performers
4. Create children from the surviving parents
5. If not finished, go to step 2
That's really it, right? I always thought so.
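The whole loop fits in a few dozen lines. Here's a minimal sketch; the bitstring representation, one-point crossover, and mutation rate are my own choices for illustration, and the loop structure is exactly the steps above:

```python
import random

def evolve(fitness, genome_len, pop_size=50, generations=100,
           survivors=10, mutation_rate=0.05):
    """Minimal sketch of the loop above: score, cull, refill from parents."""
    # 1. Create a population of random bitstring individuals
    pop = [[random.randint(0, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # 2. Score the individuals
        pop.sort(key=fitness, reverse=True)
        # 3. Kill off the weakest performers
        parents = pop[:survivors]
        # 4. Create children from the surviving parents
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, genome_len)        # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (random.random() < mutation_rate)  # flip some bits
                     for bit in child]
            children.append(child)
        pop = parents + children
        # 5. If not finished, go to step 2 (the for loop continues)
    return max(pop, key=fitness)

# Example: maximize the number of 1s (the classic "OneMax" toy problem)
best = evolve(sum, genome_len=20)
```

On a toy problem like OneMax this converges quickly; as the rest of this post gets at, the devil is in the details once the fitness landscape is less friendly.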
I just read an article on HackerNews about a little-known algorithm called Sukhotin's Algorithm. The algorithm takes a dictionary of words and tries to figure out what the vowels are, based on the assumption that vowels are typically next to consonants. This sounded really cool, so I downloaded a big list of English words and implemented it.
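For reference, here is roughly what that implementation looks like. The exact bookkeeping varies between presentations of the algorithm, so treat this as a sketch: it counts adjacencies between distinct letters, then repeatedly extracts the letter with the largest remaining adjacency sum as a vowel, discounting its neighbors each time:

```python
from collections import defaultdict

def sukhotin_vowels(words):
    """Sketch of Sukhotin's algorithm: the letters that most often sit
    next to *other* letters are assumed to be the vowels."""
    # Count adjacencies between distinct letters, in both directions
    adj = defaultdict(lambda: defaultdict(int))
    for word in words:
        for a, b in zip(word, word[1:]):
            if a != b:                      # same-letter pairs are ignored
                adj[a][b] += 1
                adj[b][a] += 1
    sums = {c: sum(adj[c].values()) for c in adj}
    vowels = set()
    while sums:
        # The letter with the largest remaining sum is the next vowel
        best = max(sums, key=sums.get)
        if sums[best] <= 0:
            break
        vowels.add(best)
        del sums[best]
        # Discount its neighbors: vowel-vowel adjacency shouldn't count
        for c in sums:
            sums[c] -= 2 * adj[best][c]
    return vowels
```

On a tiny hand-checkable list like `["banana", "cab", "tab"]` it returns `{'a'}`; a real dictionary is where the interesting successes and failures show up.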
I recently read this paper where the authors claim that they are evolving pseudo-optimal strategies for poker. Given that we know evolutionary processes do not necessarily minimize losses, I was skeptical.