Telling a computer to perform an action based on an input isn’t too hard. Teaching a computer to learn what action to take based on what it sees is a whole different challenge. Now imagine that the computer wont even know if the action is good or bad until some unknown point in the future – how hard would that be? Well, let’s find out as we take a look at a machine learning algorithm called Q-Learning.
Q-Learning is a type of reinforcement machine learning. One (simple) way to implement the algorithm is to use something called a Q-Table. All that really means is that we will have a table which maps from (state, action) pairs, to a value of the expected potential reward for taking the action from that state. In other words, as we fill out the values in the table, a machine learning agent can look up the potential values of any action at a given state, pick the one with the highest reward, and execute it as a policy for solving a problem. We will continue to follow along with the original blog post here, as we analyze Arthur Juliani’s solution:
import gym import numpy as np env = gym.make('FrozenLake-v0') #Initialize table with all zeros Q = np.zeros([env.observation_space.n,env.action_space.n]) # Set learning parameters lr = .8 y = .95 num_episodes = 2000 #create lists to contain total rewards and steps per episode #jList =  rList =  for i in range(num_episodes): #Reset environment and get first new observation s = env.reset() rAll = 0 d = False j = 0 #The Q-Table learning algorithm while j < 99: j+=1 #Choose an action by greedily (with noise) picking from Q table a = np.argmax(Q[s,:] + np.random.randn(1,env.action_space.n)*(1./(i+1))) #Get new state and reward from environment s1,r,d,_ = env.step(a) #Update Q-Table with new knowledge Q[s,a] = Q[s,a] + lr*(r + y*np.max(Q[s1,:]) - Q[s,a]) rAll += r s = s1 if d == True: break #jList.append(j) rList.append(rAll) print "Score over time: " + str(sum(rList)/num_episodes) print "Final Q-Table Values" print Q
The code above came from here. Copy & paste the code and run it to make sure everything works. On one run (note that your results may vary), after printing the Q table I saw the following values:
[[ 1.18181511e-01, 1.05839621e-02, 5.93369485e-03, 5.28426147e-04], [ 0.00000000e+00, 5.08521011e-03, 2.31402787e-03, 5.81610901e-02], [ 3.58928767e-03, 2.29724031e-03, 2.30403831e-02, 0.00000000e+00], [ 1.25818424e-03, 2.06772726e-03, 2.04618843e-03, 1.28835672e-02], [ 1.45386252e-01, 0.00000000e+00, 2.60545654e-03, 8.42432295e-03], [ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00], [ 2.86682494e-04, 2.65293005e-07, 8.33202120e-04, 1.24716625e-05], [ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00], [ 3.38969412e-03, 2.58918580e-03, 1.52705296e-03, 1.51976034e-01], [ 0.00000000e+00, 2.83306822e-01, 8.81415294e-04, 3.89093922e-03], [ 8.20269416e-01, 1.14104585e-03, 4.87808190e-04, 0.00000000e+00], [ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00], [ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00], [ 0.00000000e+00, 2.43200000e-02, 5.62247331e-01, 0.00000000e+00], [ 0.00000000e+00, 4.77429893e-01, 0.00000000e+00, 0.00000000e+00], [ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00]]
You might recognize this as a 2 dimensional array. If so, you’re right – we have 16 rows (for possible game states) by 4 columns (for possible actions). However, I don’t find it very intuitive to look at the Q-Table in this way. It would help for states to be positioned relative to each other like they are on the gameboard, and for actions to be positioned over their state toward the direction they provide. I have overlaid grid lines on top of the tile map I made earlier, with each tile split into four sub sections by an ‘X’. The space in each slice of the tile represents the value of taking an action to move in a direction of that side. I also rounded the values at two decimal places for easier readability. Take a look at another round of completed Q-Table data with this format:
To interpret this data the way our machine learning agent does, you must first look at whatever state the environment is in (what tile is the agent located at). Then you look at each of the values for each of the actions at that state (the 4 slices on the tile). The “best” action to take is the action with the highest value.
Let’s walk through the ideal solution according to this table. *Quick tip* remember that the tiles are indexed in the same order we read (from left to right, and top to bottom) with the values from 0 to 15 like so:
0, 1, 2, 3 4, 5, 6, 7 8, 9, 10, 11 12, 13, 14, 15
- [Tile 0] Beginning in the upper left corner, we see that the highest value (0.08) is on the left slice. The computer is relying on random wind to push it down a tile [to 4], because the path to the right is more risky.
- [Tile 4] Again we also have the highest value (0.13) on the left slice. Any other action has a chance of falling in the hole and thus ending the game without any reward, therefore the computer is again relying on random wind to push it down a tile [to 8].
- [Tile 8] Now the highest value (0.19) is on the top slice. The computer is hoping for random wind to push it toward the right [to 9] even though it may actually backtrack into a position that is still safe [at 4].
- [Tile 9] One of our first “obvious” answers, from this position, the computer tries to go down [to 13]. Any random wind pushing it to the right or left would still be a safe position, so there is nothing to worry about.
- [Tile 13] Another “obvious” answer, here we try to go to the right [to 14]. Any random wind up or down will still be a safe position.
- [Tile 14] We’ve almost won – it would be tempting for any human to just choose to go right, but the computer (not being in any rush) takes the safest approach and actually tries to go down (into the wall) hoping for random wind to push it onto the goal [at 15]. Going right is still “safe”, but in the event that random wind pushed the agent up, it would end up taking a circular route [Tile 10, 9, 13, 14] to get back to where it had been (to avoid any possibility of falling in holes). Whereas by going down it will either end up in the same tile, backtrack to Tile 13 then forward to 14 again, or get lucky and land on the goal.
Now that I’ve explained what the values mean, let’s observe how they are learned over time:
Hopefully you are able to gain some extra intuitions about what the code is doing. I observed things like:
- The values flood outward from the goal tile, because it is the only one that generates a reward (note that I played back frames more slowly at the very beginning to make sure you could see this).
- The values fluctuate – they can go up or down based on the outcome of each round of the game, but should generally learn a truth that favors certain actions over others.
- The agent has learned to favor the safest path, even when it is a potentially longer one.
- None of the slices on the holes or goal ever hold any value. This is because no actions were taken on the environment once those states were reached, therefore the Q-Table is never updated at those entries.
I’d like to focus a bit more on that first bullet point, because I think it is really important to understand what happened here. Initially, we started out with a blank slate – the values all over the table were zero. Reaching the goal, and earning a reward from the environment, was the ONLY way to introduce a new value to the table at that time. Therefore, when the agent first began learning, the ONLY way it could determine that ANY action was good, was to be lucky enough to take a series of actions leading all the way to the goal, all chosen at random, and only the very last action taken would be “learned” as a good choice.
After having reached the goal on at least one session, the table now had an action with a positive value. Any followup session would have an opportunity to learn both from reaching the goal, as well as from reaching a tile that has its own reward from having reached the goal in the past. With each new session, the chain of rewards can potentially grow another tile longer, and this is how the agent can ultimately learn to do things even when the reward for its actions is delayed until some point in the future.
Next, consider the second bullet point – that the values in the table can both increase and decrease. As each action is taken our agent attempts to “learn” based on the maximum potential reward of the tile it reached. When following the “correct” path, such as when moving from tile 14 to tile 15, we grow by some learning rate multiplied by a positive reward causing the tile’s own action (and therefore the tile position itself) to grow in value. This results in a sort of gradient where the value of each tile in the path should generally grow until it reaches the goal. This also helps indicate when the agent might be going the wrong direction, even when moving along the correct path, because the maximum value of the next tile would be less than the value of the tile it had been standing on. Such an action would then “decrease” the value of taking an action that led to backtracking.
Before we do a deep dive of the code for the solution, I want to point out a few features of the numpy module. Feel free to open up a python console window and follow along:
import numpy as np
Start by importing the module. I used the same abbreviation, ‘np’, as the solution code so we dont get confused.
table = np.zeros([4,3]) table # prints to the console: # array([[ 0., 0., 0.], # [ 0., 0., 0.], # [ 0., 0., 0.], # [ 0., 0., 0.]])
Here I created an N-dimensional numpy array. The array I passed in defines the number of dimensions (the length of the array) as well as the capacity of each dimension (the values at each index in the array). Each entry in the table uses a default value of zero because of the “zeroes” method used to create it. Note that the documentation suggests we could have passed an int or a tuple of ints, but the array also worked, and matched the style of code in the solution.
table[0,1] = 3.14 table # prints to the console: # array([[ 0. , 3.14, 0. ], # [ 0. , 0. , 0. ], # [ 0. , 0. , 0. ], # [ 0. , 0. , 0. ]])
You can index into an N-dimensional numpy array just like you would with a normal array. Here I set one of the table’s values to pi.
table[0,:] # prints to the console: # array([ 0. , 3.14, 0. ])
Here I have read and printed the whole contents at the first row of the array.
list = np.random.randn(1,3) # array([[-1.45347796, 0.71221563, 0.54473448]])
Zeros isnt the only way to create an N-dimensional array- here we create a 1×3 array of random values and stored it in a variable named “list”.
list * 3 # array([[-4.36043389, 2.1366469 , 1.63420343]])
One interesting feature of an N-Dimensional array is that you can scale each of its elements simply by multiplying by a scalar as I did above. Note that I didn’t assign the result back to “list” (but I could have), I merely allowed it to print to the console.
table[0,:] + list # array([[-1.45347796, 3.85221563, 0.54473448]])
Here I have added values from a row in the first table, with the array of random values. Note that this is a component-wise operation, which means it adds the values of matched indices.
Again, I didn’t assign the result either to the original table or to the list of random numbers, I merely printed the result of the operation.
np.argmax(list) # 1
The argmax function returns the index of the largest element in an array. For the random value array that we created earlier, the middle element at index ‘1’ held the largest value, so that index was returned.
Now, let’s dive a little deeper and look at each line of the solution code:
import gym import numpy as np
The first two lines import code from other modules. This is pretty standard practice in Python as well as many other languages. The gym module is what allows us to create and work with the Frozen Lake environment. The numpy module handles a lot of scientific computing and will be used in this code for its powerful N-dimensional array object. The last bit “as np” allows us to refer to the module with an abbreviated syntax, not that “numpy” was that long in my opinion. Oh well.
env = gym.make('FrozenLake-v0')
Here we are creating an environment based on the Frozen Lake game we’ve been using all along.
Q = np.zeros([env.observation_space.n,env.action_space.n])
Here we create our Q-Table, which is the lookup table we described above. We create it as a 2 dimensional array using the “numpy” module. The capacity of the table is 16 x 4 which is obtained by the values returned from “env.observation_space.n” (16) and “env.action_space.n” (4). Every entry in the table will be populated with a value of ‘0’ because we create the array using the call to “np.zeros”.
lr = .8
This represents a “learning rate”, which is a sort of magic number constant that can have a big impact on the performance of your algorithm. You can try with different values and observe the different ways it causes the A.I. to learn. Values closer to zero mean that the agent will “exploit” its previous knowledge while values closer to one mean that the agent will “explore” by more heavily considering recent knowledge. Essentially you want to find the right balance between learning speed (number closer to 1) and learning quality (number closer to 0).
y = .95
This represents a “discount factor” which is applied to the impact that future expected rewards have on our learning. Values closer to zero care more about current rewards while values closer to one favor long-term higher rewards. Like the “learning rate”, it is a sort of magic number that will need to be experimented with. It was likely represented with ‘y’ because that character looks kind of like the lower-case Greek letter gamma, which is normally seen in related math formulas.
num_episodes = 2000
The “num_episodes” is the number of attempts we want to take at solving the environment. We will want to try repeatedly because we need a lot of wins and losses to really develop an understanding of the (state, action) value pairs. If your learning rate was lower, you would probably need more episodes to learn from, but might end up with a better end result, so consider this yet another constant you can tweak to your liking.
for i in range(num_episodes):
The outer loop goes from 0 to the “num_episodes” (2000) constant declared previously. So the next lines of code will be repeated many times.
s = env.reset()
Each time we begin a new pass on solving the game, we need to reset the environment to its starting position. ‘s’ is the state which is implemented as an index of the possible board positions. Upon reset, the state will be ‘0’ which is the first state, and appears in the upper left hand corner designated with the letter ‘S’.
d = False
The ‘d’ will be a flag indicating whether or not the board has reached an end-game position. We initialize it to false so that it will continue playing.
j = 0 while j < 99: j+=1
The ‘j’ is the iterator for an inner loop. Each episode will be allowed 99 ‘turns’ with which to try and solve the environment. This is to prevent scenarios, particularly early in the learning process, where the computer might continually pick choices that never result in an end game state. In theory it could do so forever. The first line of the inner loop causes the iterator to increment so that the loop will eventually terminate even if the done flag never triggers.
The author probably would have used a simpler
for j in range(99): syntax, but appears to have wanted to use the value of the iterator even outside of the loop for diagnostic purposes.
a = np.argmax(Q[s,:] + np.random.randn(1,env.action_space.n)*(1./(i+1)))
This step chooses an action based in part on the learned values of the Q-Table, and in part by a random noise. ‘Q[s,:]’ means the array of values for the actions at state ‘s’. ‘np.random.randn(1,env.action_space.n)’ will create a list of random numbers that is the same length as our number of actions and we will add the two together. We use ‘(1./(i+1))’ to scale the random values over time. The higher ‘i’ becomes, the less influence the random number generator will have on our decision because (1/1) is much larger than (1/2000). In other words, when we first start learning, we use a lot of random decisions because we have no history of actions to rely on. As we play more and more rounds of the game, we rely more and more on the results of past actions so that what we have learned is more important than making random choices. ‘np.argmax’ picks the index of the largest value in the resulting array which means it is picking the action with the highest ‘expected’ reward.
s1,r,d,_ = env.step(a)
Using the value of ‘a’, which is the action index we chose on the previous line, we will apply an action to the environment using the “step” call. There are four return values from this method including: ‘s1’ which is the new state that has been entered. ‘r’ is the reward for reaching the new state. ‘d’ is the flag indicating whether or not the game is done such as by falling in a hole or reaching the goal. The final parameter ‘_’ is unused – it holds debug info about the environment but should not be used in any attempt to solve it.
Q[s,a] = Q[s,a] + lr*(r + y*np.max(Q[s1,:]) - Q[s,a])
Now that we have chosen an action and gotten a result, we can update our Q-Table with what we’ve found. This line is based on a formula called the Bellman equation. Note that it uses our magic number constants defined above. Let’s try to break the line down a bit more to make sure its dynamic parts are fully understood. Imagine that we are doing another pass on the table starting from where it left off, so the values in the table will match the chart I have above. For now, let’s create an example based on the following:
- ‘s’ is state ‘0’ (the upper left tile)
- ‘a’ is action ‘0’ (move left)
- Q[s,a] is Q[0,0] which currently holds the value of ‘0.08’
- ‘s1’ is ‘4’ (due to random wind)
- ‘r’ is ‘0’ (only the goal tile provides a reward)
- Q[s1:] refers to the entire array of action values at state ‘s1’
- np.max(Q[s1,:]) is ‘0.13’ (the highest possible reward at state ‘s1’)
Now we know enough to rewrite the equation by substituting real values like so:
Q[0,0] = 0.08 + 0.8 * (0.0 + 0.95 * 0.13 - 0.08) # This results in a value of (0.1148) which would continue to strengthen / increase the expected reward for moving left
Make sense so far? Let’s continue…
s = s1
Now that we have updated the Q-Table, we are free to update our local reference of what the current state is – which is the resulting state based on the action we took.
if d == True: break
If we reached the goal, or fell in a hole, we need to abort the loop, because the game has ended and there is no point to taking further actions upon the environment. We check for these conditions by the reference to ‘d’ which indicates when the environment is done, and we use the ‘break’ statement to break out of the inner loop whenever done is true.
Note that I didn’t bother to comment on a few lines of this code such as anything referring to “jList”, because it was commented out anyway. I also didn’t comment on the “rList” which merely held an array of reward values – one entry for each episode. This code is merely diagnostic and has nothing to do with training the machine learning agent.
In this lesson we discussed the implementation of a Q-Table for a mchine learning agent. The code “solved” the Frozen Lake environment, such that the solution resulted in a table that mapped from a state and action, to a potential reward. We also covered the bits of the numpy module that were important to complete the problem.
Overall, we actually succeeded at writing and (hopefully) understanding a reinforcement machine learning algorithm. Cool as it is, we can’t stop here. Any game I would be interested to make is going to require a lot more states as well as actions, and our Q-Table wont scale well for that purpose. Don’t worry, there are solutions that do scale well – one of which is a neural network. Keep following along and we will take a look at one of those next.
If you find value in my blog, you can support its continued development by becoming my patron. Visit my Patreon page here. Thanks!