Easy21 - A Reinforcement Learning Exercise
Solutions to a Simple but Non-Trivial Problem
The video recordings of David Silver’s Reinforcement Learning course are a great place to start if you want to understand Reinforcement Learning (RL), especially if you already have some coding experience. There is one assignment, given out about halfway through the course, to which all of the RL approaches are applied. This is my write-up of solutions to the assignment. Please refer to the course webpage for links to the videos, slides and assignment.
One thing that makes this a good exercise is that it can be solved without using Reinforcement Learning. The assignment suggests using Monte Carlo Control to determine the optimal value function and policy for Easy21. When I tried this the results did not seem right, and even with a large number of iterations they remained rather noisy. I wanted more confidence in the optimal value function and policy, so I computed it without using RL (see below). The method I used was to sample the outcome of playing out hands under different policies. It may also be possible to calculate this analytically, but that wasn’t necessary to get a reliable answer.
The Goals of This Exercise
The goal of this assignment is to apply reinforcement learning methods to a simple card game that we call Easy21. This exercise is similar to the Blackjack example in Sutton and Barto 5.3 – please note, however, that the rules of the card game are different and non-standard.
- The game is played with an infinite deck of cards (i.e. cards are sampled with replacement)
- Each draw from the deck results in a value between 1 and 10 (uniformly distributed) with a color of red (probability 1/3) or black (probability 2/3).
- There are no aces or picture (face) cards in this game
- At the start of the game both the player and the dealer draw one black card (fully observed)
- Each turn the player may either stick or hit
- If the player hits then she draws another card from the deck
- If the player sticks she receives no further cards
- The values of the player’s cards are added (black cards) or subtracted (red cards)
- If the player’s sum exceeds 21, or becomes less than 1, then she “goes bust” and loses the game (reward -1)
- If the player sticks then the dealer starts taking turns. The dealer always sticks on any sum of 17 or greater, and hits otherwise. If the dealer goes bust, then the player wins; otherwise the player with the larger sum wins, with reward +1 (win), -1 (lose), or 0 (draw).
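To make the rules concrete, here is a minimal environment sketch in Python. The state representation (dealer’s first card, player’s sum) and the function names are my own choices for illustration, not part of the assignment.

```python
import random

def draw_card():
    """Draw a card: value 1-10 uniform, black (+) with prob 2/3, red (-) with prob 1/3."""
    value = random.randint(1, 10)
    return value if random.random() < 2 / 3 else -value

def initial_state():
    """Both player and dealer start with one black (positive) card."""
    return (random.randint(1, 10), random.randint(1, 10))

def step(state, action):
    """Apply 'hit' or 'stick'. Returns (next_state, reward, done)."""
    dealer, player = state
    if action == "hit":
        player += draw_card()
        if player < 1 or player > 21:          # player goes bust
            return (dealer, player), -1, True
        return (dealer, player), 0, False
    # action == "stick": dealer plays out, hitting until a sum of 17 or more
    while 1 <= dealer < 17:
        dealer += draw_card()
    if dealer < 1 or dealer > 21:              # dealer goes bust
        return (dealer, player), +1, True
    reward = (player > dealer) - (player < dealer)   # +1 win, 0 draw, -1 lose
    return (dealer, player), reward, True
```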
SARSA(lambda) Algorithm for On-Policy Learning
- Initialize Q(s, a), for all s in S, a in A(s), arbitrarily, and Q(terminal-state, .) = 0
- Repeat (for each episode):
- E(s, a) = 0, for all s in S, a in A(s)
- Initialize S, A
- Repeat (for each step of episode):
- Take action A, observe R, S’
- Choose A’ from S’ using policy derived from Q (e.g. e-greedy)
- delta ← R + gamma*Q(S’, A’) - Q(S, A)
- E(S, A) ← E(S, A) + 1
- For all s in S, a in A(s):
- Q(s, a) ← Q(s, a) + alpha*delta*E(s, a)
- E(s, a) ← gamma*lambda*E(s, a)
- S ← S’, A ← A’
- Until S is terminal
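Here is a tabular Python sketch of the pseudocode above. It reuses the step() and initial_state() helpers from the environment sketch; the fixed step size alpha and exploration rate epsilon are illustrative simplifications rather than anything the assignment prescribes.

```python
import random
from collections import defaultdict

ACTIONS = ("hit", "stick")

def epsilon_greedy(Q, state, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy action."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def sarsa_lambda(episodes, lam, gamma=1.0, alpha=0.01, epsilon=0.1):
    Q = defaultdict(float)                      # Q(terminal, .) is never updated, stays 0
    for _ in range(episodes):
        E = defaultdict(float)                  # eligibility traces reset each episode
        S = initial_state()
        A = epsilon_greedy(Q, S, epsilon)
        done = False
        while not done:
            S2, R, done = step(S, A)
            if done:
                target = R                      # value of the terminal state is 0
                A2 = None
            else:
                A2 = epsilon_greedy(Q, S2, epsilon)
                target = R + gamma * Q[(S2, A2)]
            delta = target - Q[(S, A)]
            E[(S, A)] += 1
            for key in list(E):                 # update every (s, a) pair visited so far
                Q[key] += alpha * delta * E[key]
                E[key] *= gamma * lam
            S, A = S2, A2
    return Q
```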
Let’s think about this
- If the objective is to get the highest possible average score playing against the dealer, what is the optimal policy?
- What do we expect the optimal policy to be?
- This is a simple enough system that we can find the optimal policy without using Reinforcement Learning.
- To start, sample the results under a policy for a range of starting states.
- For a given policy, playing out a starting state will yield each end state with some probability.
- Results from the dealer’s policy can be combined with results from trial policies to give expected rewards for each starting state under that policy.
- It turns out that for each policy you can find a stochastic matrix that maps each starting state to a probability distribution over ending states reached by playing the game out.
- We can take the stochastic matrix resulting from the dealer’s policy as D, and the one resulting from the player’s policy as P.
- Then, if we define the Rewards matrix, R, with rows enumerating the player end value and columns enumerating the dealer end value, we can express the expected-returns matrix E(R) as P^T * R * D, where T is the transpose and * is a matrix multiplication (a numerical sketch of this computation follows at the end of this list).
- Of interest here is that R is essentially antisymmetric: the upper-right entries are all -1 (dealer wins), the lower-left entries are all +1 (player wins), and the diagonal is zero (a draw), except that when both end states are a bust the dealer wins, since the player busts first and the game ends there.
- Since the starting state is a single black card, any policy that sticks on all sums below 11 ends the player’s turn immediately, so all such policies are equivalent regardless of what they specify for sums 11-21. The RL algorithm, with e-greedy exploration, might still draw a meaningless distinction between them, since exploration visits states that could never be reached during actual game play under those policies.
- How should we evaluate the performance of a policy?
- There are 100 equally likely starting states: one black card of value between 1 and 10 for the dealer and one for the player.
- Over time the average reward will be the mean of the expected rewards from these 100 starting states. The expected reward from higher player sums matters only insofar as it affects the expected reward of those starting states.
- Assumptions and constraints to help us find the optimal policy:
- For any given game, visited states will be confined to one column, since the dealer’s card does not change.
- For a given dealer card we will assume that the policy takes the form of hitting for some range of player sums and sticking outside of that range.
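Putting the points above together, here is a sketch of the non-RL evaluation. The conventions are my own: end states are indexed with 0 for bust and 1 to 21 for the final sum, a simple threshold policy (“hit while the sum is below 15”) stands in for an arbitrary player policy, and the draw_card() helper from the environment sketch is reused. The sampled matrices are column-stochastic so that the expected-returns computation matches the P^T * R * D expression above.

```python
import numpy as np

N_END = 22          # index 0 = bust, indices 1..21 = final sum

def end_index(total):
    return 0 if total < 1 or total > 21 else total

def play_out(start, threshold, samples=20_000):
    """One column of a stochastic matrix: distribution over end states when
    starting from `start` and hitting while the sum is below `threshold`."""
    counts = np.zeros(N_END)
    for _ in range(samples):
        total = start
        while 1 <= total <= 21 and total < threshold:
            total += draw_card()
        counts[end_index(total)] += 1
    return counts / samples

def rewards_matrix():
    """R[k, l]: reward when the player ends in state k and the dealer in state l."""
    R = np.zeros((N_END, N_END))
    for k in range(N_END):
        for l in range(N_END):
            if k == 0:
                R[k, l] = -1           # player bust loses, even if the dealer also busts
            elif l == 0:
                R[k, l] = +1           # dealer bust (player not bust) wins
            else:
                R[k, l] = np.sign(k - l)
    return R

# P[:, i]: end-state distribution for a player starting on sum i+1 who sticks at 15.
# D[:, j]: end-state distribution for the dealer starting on card j+1 (sticks at 17).
P = np.column_stack([play_out(s, threshold=15) for s in range(1, 11)])
D = np.column_stack([play_out(s, threshold=17) for s in range(1, 11)])
R = rewards_matrix()

expected = P.T @ R @ D                     # expected reward for each (player start, dealer start)
print("policy value:", expected.mean())    # average over the 100 equally likely starting states
```

Sweeping the threshold (or, more generally, trying a different range of player sums for each dealer card) and comparing the resulting policy values is one way to search for the optimal policy without any RL.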