## A Comment on “Evolving Nash-Optimal Poker Strategies Using Evolutionary Computation”

I recently read this paper where the authors claim that they are evolving psuedo-optimal strategies for poker. Given that we know evolutionary processes do not necessarily minimize losses, I was skeptical.

The claim is that by using a fitness function that only penalizes for losses, the evolution is effectively searching for a Nash Equilibrium. They used the AKQ game as a toy example. To recap, the game's rules are as follows:

- Three cards in the deck (A, K, and Q). Each player dealt 1 card face down.
- Each player antes $1 into the pot.
- Player 1 acts first and may check or bet.
- Player 2 acts second and may fold or call if player 1 bets. Player 2 cannot bet if Player 1 checks. Thus, all you have to worry about is how often Player 2 should call Player 1's bet.

You can work out the game tree and solve the linear equation yourself pretty easily, but basically the Nash-equilibrium (NE) optimal strategy is Player1={1/3, 0, 1} and Player2={0,1/3,1} for { Q, K, A }, respectively. Note that all the parameters except Player1 bluffing with a Queen and Player2 calling with a King are strictly dominated if you deviate from the optimal, so really the only thing that someone should have doubts about are those two probabilities.

If you work out the game tree, you get a really ugly (but straight-forward) fitness function that you're trying to optimize such that no other individual can beat you. The paper presented two fitness functions: total winnings against the population and sum of all squared losses against the population. They claimed that the latter "sets up" the evolution to find a NE by avoiding "intransitive" strategies -- that is, strategies which draw with the NE but are exploited by other strategies.

If that were true for random samplings, then certainly calculating the EV directly using their fitness functions would approach an NE as well, right?

Below are the results for a simple evolution using the total winnings fitness function with the evaluation function being the direct game tree EV calculation rather than 100 randomly sampled hands.

The wildly oscillating lines are the probability of Player1 bluffing with a Queen and Player2 calling with a King. The green line is a 100-generation moving average of the P2-King probability. The black line is a 100-generation moving average of the P1-Queen probability. The other parameters converge to the optimal 0 or 1 levels.

Now, given what the authors claim about intransitive strategies being handled by their squared-losses fitness, we should see a very smooth convergence using their fitness.

Interestingly, the parameters converge to the wrong values. The red line at the top is the probability Player2 calls with an Ace. The purple line in the middle is the probability Player2 calls with a king. The yellow line is the probability that Player1 bets with an Ace (stabilized around 5%).

Since there is no reason for the evolution to actually bet when it can't lose, it never does, and thus the Player1-Ace probability drifted to nearly 0%. Clearly this fitness function is not going to find a NE.

A more intuitive fitness function would be one that reflects the definition of a NE: the fitness of an individual is the worst expected value against any opponent. If you play every individual against each other in a tournament, set fitness of each individual to their worst EV result (remember, we're using EV not random sampling).

Eureka! The purple and blue lines (P1-Queen and P2-King) converge to the optimal 0.33; all other parameters also converge to the NE solution.

It's important to note that I changed the problem here by making the fitness function deterministic. I did try using all three fitness functions on the random 100-hands that the paper authors used, and I was unable to get any kind of convergence-- even cranking it up to 10,000 hands wasn't enough. However, I feel confident that if your fitness can't work on a static EV evaluator then it certainly won't work on a stochastic evaluator that is effectively sampling from that static EV evaluator.

Code for the above results available here.

Wang ChungMarch 6th, 2010 - 09:25

For games or game abstractions that are too large to calculate EV efficiently, what can be done if not random sampling?

WesleyMarch 6th, 2010 - 15:31

I think the key is to remove the luck factor as much as possible. There are a number of ways you can do this:

– Balanced distribution of hands

– Duplicate games

– Removing the luck trend via an approach like DIVAT

Those are a few, but there are certainly more. I’ve been working on evolutionary poker algorithms for a while, and I’ve been able to reduce the luck/variance decently, but still not all the way there yet.

Wang ChungMarch 6th, 2010 - 20:17

I’m surprised 10,000 hands wasn’t enough of a sample to estimate EV in the AQK game. Or, were you referring to HE?

And, nice blog. Please do keep writing and posting code. I’m going try this fitness test with an analog HUNL HE player.

WesleyMarch 6th, 2010 - 20:48

Thanks!

I was referring to the AKQ game for 10,000 hands. I was surprised by that as well. The code that I attached at the bottom of the article has both the EV and random sampling versions in the simulator. I double-checked myself on the random sampling code to make sure it had no bugs when I saw the results.

I think there is something inherently unstable about doing these tournaments with random sampling. In my code, I’m doing 99 evaluations (pop size = 100) and taking the worst result as fitness. So we’re basically talking about a 3 stdev event. Even if you’re just flipping coins 100 times, that can still seem like it’s pretty random.

There’s the classic Always-Call vs. Always-Raise experiment which helps support this. When I run this for 100,000 hands I get a stdev of around 15,000 (though I haven’t generated enough samples to confirm statistically). Of course, duplicate hands solves that problem, but when you introduce player strategies into the mix, the luck factor creeps back in.

Wang ChungMarch 7th, 2010 - 00:39

I’m using NeuroEvolution of Augmenting Topologies (SharpNEAT) as my evolution engine; an NN to represent my HU NL strategy; duplicate poker matches to estimate EV; and your fitness test. Unfortunately, it’s slow as hell. But, it will be pretty cool if I can get an NN to converge near to EQ.

I also incorporated a “default” strategy which each NN in the populus is scored against. It’s CBR based, built on the hands played by the University of Alberta’s 2009 HUNL EQ bot. The idea is that this will give the population a “direction” to evolve towards.

WesleyMarch 7th, 2010 - 00:58

I think we’re working on virtually the same problem.

I’ve done a lot of work with NEAT applied to poker. Mostly I’ve focused on 3-player LHE, though. The CBR benchmark strategy is an interesting idea. I tried using a hand-written one and was able to beat that pretty well… mostly because I wrote the benchmark in about 5 minutes. Have you had any promising results?

I went through a LOT of optimization for my framework. I even wrote a grid-based version of NEAT that operated over web services, so you just load a browser to the server’s website and you’re instantly a slave in the poker evaluation process.

How many hands/second (or minute, hour, whatever) are you doing? I think I was averaging around 30,000,000 hands per hour.

Oh, and if you’re on the NEAT listserv, then it’s important to note the recent comments on fitness scaling. You have to make sure to use a fitness function that will give enough separation between individuals, otherwise you get a random search basically.

Wang ChungMarch 7th, 2010 - 01:39

Interesting. I actually just got it up and running today. Up until yesterday I was using NEAT versus the default strategy, but no evolution towards EQ. Even against just the default, it was very slow; I’ve done no optimization other than using a cache pregenerated hands. Probably close to ~ 400k/hour on a single core 2ghz Celeron. I’m simulating tournaments, so the hand count varies.

That’s a pretty cool idea about the grid based NEAT. I was thinking of doing something distributed as well, though something more generic. Was that 30m on one CPU?

Any interesting results with 3LHE?

WesleyMarch 7th, 2010 - 01:54

That was 1 CPU, but it was a quad Q8200, and my evaluator is multi-threaded.

I don’t have any really interesting results at LHE yet because I’ve been having issues with the fitness function. The way I had been doing it leads to maniac always-raise players after a few thousand generations. I’ll be interested to hear if you have the same issue.

The blog article was the result of me exploring the fundamental issue.

Wang ChungMarch 9th, 2010 - 05:53

I was able to improve the # of hands/hour quite a bit. I was using my CBR strat for the preflop. Instead, I just the strats call preflop and focused on postflop play. I’m probably pulling close to 10 or 20 million hands/hour on a single core now.

I’m having trouble getting the population to move towards a maximum (EQ), though. I don’t think it’s because of fitness scaling, but rather variance. The “best” genome just seems to fluctuate around a couple standard deviations below break-even because every time a new genome is added, it’s fitness is updated.

WesleyMarch 9th, 2010 - 06:13

Not sure I understand what you mean by “its fitness is updated.” You’re trying to evolve to beat a benchmark, right?

Wang ChungMarch 9th, 2010 - 06:25

What do you mean by “beat a benchmark?”

WesleyMarch 9th, 2010 - 06:29

You have a CBR strategy that you’re playing the NEAT agents against, right? Or are you doing co-evolution and just using the CBR as the preflop strategy for all agents?

Wang ChungMarch 9th, 2010 - 06:39

I’m not using the CBR agent at all anymore. I’m using the group to evolve towards EQ, using your fitness function.

WesleyMarch 9th, 2010 - 06:45

So you have one monolithic controller for all rounds of play? And that works? I would think you need different controllers.

What did you mean by trying to get them to move towards a maximum? How do you know when you’ve hit an NE?

Wang ChungMarch 9th, 2010 - 08:30

I have one NN that represents post-flop play, pre-flop it’s just calling..

They hit NE when the min of the populous == 0, yes?

WesleyMarch 9th, 2010 - 17:00

No, I don’t think that is true. The simulations I did looked at the champion (best), not the bottom of the population. Also, there is not much chance that you will be able to get there with random sampling unless you can somehow remove the luck factor. While things like duplicate hands do help reduce luck, it doesn’t eliminate it.

For example, if my strategy always folds AA and always goes all-in with 72o, that’s clearly not a NE strategy. However, if you play a duplicate hand of AA vs 72o, where the board comes 7 2 x x x, then my strategy works out great for both hands.

The real challenge is eliminating this kind of systemic luck. How one does that is not clear to me. Maybe DIVAT or similar tools could be useful here like I mentioned before. In general, what we need is a tool that can generate a “fair” session, or somehow adjust results to make them fair.

Wang ChungMarch 10th, 2010 - 00:29

Ok, I misspoke. What I meant to say was that when the best hit’s 0 it’s reached an NE.

How are you removing variance? Like I said, I’m playing duplicate poker. And, right now, I’m just doing the simulation using a cache of 1000 hands.

WesleyMarch 10th, 2010 - 00:54

The last thing I did was to play using 10,000 duplicate hands, but I do not think this is enough to remove variance. I don’t have any other ideas at the moment as to how it can be done.

Are you seeing the champions drift closer to 0? What do you get if you graph their fitness?

Wang ChungMarch 15th, 2010 - 16:37

msi — but what I don’t understand is: do you think it’s good?

Wesley — It did eventually drift towards 0, quite above it actually. Though, I imagine it will come back down. 900 on the below graph is break-even, as I’m using a fitness^2 with a baseline of 30 (because you can’t have fitness levels below 0). I’m actually using a population of 30, playing 100 hands each. However, when updating the genome fitness, I’m only allowing a 5% adjustment after the initial fitness assignment. I think this allows for a much smoother “update” over time. It took awhile, but it did eventually “catch.”

http://i43.tinypic.com/4qn2og.jpg

Wang ChungMarch 15th, 2010 - 16:40

Whoops, I meant to say that 6400 is break even on that graph! Not 900. So, out of a 100 games with a baseline of 30, break even would be 80^2.

Wang ChungMarch 15th, 2010 - 16:41

Arg, another mistake… It’s not playing 100 hands, but rather 100 tournaments! That equals around ~ 4000 hands.

Wang ChungMarch 15th, 2010 - 16:47

And, in case you were wondering what I meant by a 5% update, here’s my code:

weight = 0.05;

score = CompareNetworkAgainstPopulation(network);

if (EvaluationCount == 0)

fitness = score;

else

fitness = ((1.0 – weight) * fitness) + (weight * score);

EvaluationCount++;

Wang ChungMarch 15th, 2010 - 19:43

Also, regarding your question above: “So you have one monolithic controller for all rounds of play? And that works? I would think you need different controllers.”

Marv Anderson, for postflop play, used one simple feedforward NN with a single layer perceptron, trained using data from the previous year’s winner, in the 2009 World Computer Poker Championship Limit Heads Up matches. He came in first in the money, and third in the roll-out. He also used a custom NN for preflop play.

I, of course, am simulating a much more complicated abstraction, but having seen the above in action, I believe a single recurrent network is capable of representing a HUNL EQ postflop.

WesleyMarch 17th, 2010 - 14:31

I see. So you’re doing something similar to an exponential moving average for fitness. The graphs look interesting!

One issue though is that I’ve found tournaments produce very different results than infinite bankroll games. In tournaments, presumably the idea is to not die (in heads up it’s also to kill off the other guy). This typically works well with evolution, and can produce strong tournament players. However, strong tournament players may not be strong NE. Calling an all-in may be the right play in a tournament even if it’s negative EV in this hand, because you have sufficiently more chips and a chance to win the tournament now; such a hand would not occur in an infinite bankroll game.

Unless by tournaments you mean you’re playing a fixed number of hands and taking the winnings after that as the score… in which case, forget the above sentence.

I don’t understand what your definition of fitness is. Could you expand on that a little?

Wang ChungMarch 17th, 2010 - 17:18

With infinite bankrolls in an NL game, what is an all-in action? Or, by ‘infinite bankroll’ is there a finite bankroll that’s reset after each hand?

Isn’t NE a contextual definition? In the context of the game I’m playing, tournaments, this process is still moving towards NE, yes?

My definition of fitness is the same as yours; the worst score when evaluated against the entire populus. However, for reevaluation, which often happens to the elite portion of the population, the fitness is updated with only a percentage (5%) of the worst score.

WesleyMarch 18th, 2010 - 00:15

Right, by infinite bankroll I mean players have finite stacks/chips that are reset every hand. You’re right that NL is a game of context, which means you may not want to reset the chips to equal values every hand. Maybe you should create a stack size distribution to sample from every hand.

I did some work on that a few months ago… I took ~25,000 NL cash game hands that I had and just did a histogram of stack sizes. There was an interesting double-peak, with the main peak at 100bb and a secondary peak at 50bb, with basically Gaussian distributions around those two means. I’ll see if I can dig up the histogram image.

I think it depends on what the game is for which you’re trying to find the NE. If you’re trying to evolve a network to play optimally at heads up tournaments then it’s different from playing heads up cash games where people can come or go at any point.

In a cash game, it’s much more like what I described above, where you’re playing against stacks sampled from a distribution. Each player may or may not be there the next hand. Your goal is thus to maximize your play this hand. A tournament assumes that the other players are staying here until they go bust, which means you’re maximizing your play over a session (given that assumption).

Also, if you’re playing tournaments, then aren’t all players playing until they reach 0? Or are you letting the models play each other until one goes broke or a fixed max hands are played, then taking the chips left (squared) as fitness?

Wang ChungMarch 18th, 2010 - 00:54

I’d love to see that histogram. I believe the University of Alberta used a 200 SB resetting stack for last year’s HUNL competition. They gave some explanation as to that arbitration, which I cannot recant at the moment.

I’m scoring each genome on the number of tournaments it’s won minus opponent wins. Both players go until one is broke. Blinds are doubled every 10 hands. So, in the graph I posted above, for a single population evaluation each genome was playing each other genome in 100 tournaments. And, the score is the lowest of those 100 tournaments plus the baseline (30) squared.

for i = 0 to population.count

. for j = 0 to population.count

… play N tournaments i vs j

… i_score = argmin(((i_hero_won – j_villian_won) + baseline)^2)

. i_fitness = ((1.0 – weight) * i_fitness) + (weight * i_score)

Note: argmin() is the lowest of all i vs population.

I should also mention that when either player’s stack gets below 10 big blinds, the strategy automatically switches to a pre-computed push/fold NE. The idea is to have the game/play setup as close as possible to real HUNL SNGs.

If I can manage to get this algorithm working well, I’ll probably do something like you mentioned above (infinite stack) and enter it into this year’s U of A championship.

WesleyMarch 18th, 2010 - 01:29

Found it! http://www.nashcoding.com/wp-content/uploads/2010/03/stacks_0_200.png

The data is taken from mostly low stakes NL on full tilt and party poker. I think it’s all between NL 25 and NL 200.

I cut the histogram off at 200 big blinds, for formatting purposes. The most common is clearly 100 big blinds, but there is a strange second peak around 35 big blinds, indicating a lot of players play short stacked.

I plan on re-assessing this using the pokerai.org hand history set, but just haven’t had time lately.

What you’re doing makes a lot more sense to me now, thanks for the explanation.

I’m currently working on a related project that I may try to publish, depending on the results, at CIG this year. The deadline is a little too close given that I don’t have anything written yet. Are you planning on publishing these results?

Wang ChungMarch 18th, 2010 - 11:27

I was thinking, as I often do late Wednesday nights, that what we’re looking at here is akin to chaos theory, attractors, and the like. If you look at the graph I posted, the population is just bouncing around a few std devs below average, a random oscilation around a “default” centroid. Then, all of sudden, it “catches,” there’s a paradigm shift in the system dynamics, and the population begins to circle around a new center. Though probably an insignificant observation, and just an emergent property of variance, nonetheless it’s cool to think about in these terms.

http://en.wikipedia.org/wiki/Attractor

I’m not affiliated with an educational body, so I hadn’t planned on publishing anything. I just do this for fun and profit. 😉 If you want to collaborate on something, I’d be happy to share any data I come up with.

WesleyApril 17th, 2010 - 14:13

Just got back from a long vacation. I found a bug in the article’s squared loss fitness function, which actually leads me to radically different conclusions. I’ll post an update as a new post soon.

LordenancecedJanuary 12th, 2011 - 08:50

learned a lot

Pro Poker AdvocateFebruary 18th, 2011 - 14:26

I’ve taken some C++ classes and though I’m not an exceptional programmer, I’m considering trying to work out something like this for the full street AKQ game.