Q-Learning Agents, Part 3

A Q-Table greatly simplified the challenge of helping a computer agent “learn” to solve an environment. Unfortunately, this particular approach doesn’t scale well to the kinds of applications I would like to create. To help overcome this next hurdle, we will raise the complexity a bit and approach the Frozen Lake environment again, this time using a neural network.

Optional Learning

One of my goals is to enable my readers to follow along and have a general intuition of everything that is happening here – even without knowledge of the advanced math driving it all. However, if the thought of learning the math doesn’t scare you away, it would certainly be beneficial to review topics such as Linear Algebra, Calculus and Statistics. There are a variety of excellent (and free) resources out there to help you:

  • The YouTube channel 3Blue1Brown has a ton of very good videos that build intuition for advanced topics, like what a matrix is really doing. He has a whole series on neural networks, as well as on relevant advanced math like linear algebra and calculus.
  • Khan Academy also provides a ton of excellent video lectures covering everything from elementary to advanced math (and plenty outside of math as well).
  • Udacity also provides a ton of excellent material. I’ve taken several classes from them and have enjoyed them all so far.

Neural Networks

If you’re already familiar with what a neural network is, feel free to skip ahead. Otherwise, I want to give a very quick and simple overview of how I understand them. At the same time, I will show how they kind of relate to the Q-Table which we used in the previous lesson to solve the environment.

As a programmer, I relate a neural network to something like a method that takes an array of inputs and returns an array of outputs. A real neural network, just like a method body, can be relatively simple or complex. You don’t necessarily have to know any of the implementation details in order to use either one.

// Imaginary representation of a neural net as a function in C-Sharp
public float[] EvaluateNeuralNet(float[] input) {
    // Do advanced calculations and potentially hidden stuff here...
    return new float[] { /* output here */  };
}

Imagine invoking our “EvaluateNeuralNet” function with an input made of an array of 16 state toggles, where one of them will hold a value of ‘1’ for being active, and all of the others will hold a value of ‘0’ for being inactive, such as:

[1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0] // Represents state '0'
[0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0] // Represents state '1'
...
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1] // Represents state '15'

Tip:

This style of input is referred to as one-hot. You’ll probably hear terms like one-hot encoding or one-hot vector, etc. One-hot is used to represent categorical values. In our example, the state ‘0’ could mean ‘upper left corner tile’ which is a category of where the player is located. If the value was intended to represent something like a ‘spatial coordinate’, a ‘count’ of something, or a ‘measurement’, then we would have treated it as a single input rather than as separate inputs.
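To make the distinction concrete, here is a minimal numpy sketch (numpy is a module we will import later in this lesson anyway); the variable names are just mine for illustration:

import numpy as np

# Encode a categorical state as a one-hot vector...
num_states = 16
state = 3                      # a category: which tile the player is on
one_hot = np.zeros(num_states)
one_hot[state] = 1.0           # -> [0, 0, 0, 1, 0, ..., 0]

# ...whereas a measurement or count stays a single input value.
temperature = 72.5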

When invoked, some logic will be executed or calculations will be made, and the end result will be a new array of 4 values, representing the estimated value for taking an action given the active state. These could be exactly the same values you would see from a query to our Q-Table:

// Return value for input of state '0'
[0.08,0.02,0.02,0.03] // Represents expected values of actions for: Left, Down, Right, Up.  We would select the index with the highest value.

The “body” of our example method represents the calculations that occur when running a neural network. Whereas we thought in arrays for our method input and output, the neural network is thinking in matrices, and the final output is the result of matrix multiplication(s). At a minimum, there will be an “input” matrix which is multiplied by a “weights” matrix to produce the “output” matrix. More complex neural networks may also include things like hidden layers and activation functions, but you won’t need those for this lesson.

Just like our Q-Table “learned” values in a 16×4 table, our neural network will “learn” the values for its 16×4 matrix of weights. Because we are using a one-hot vector as input, any product of the input by the weights will be “masked” such that only the weights for the active state have any influence on the result. In theory, the entire weights matrix could hold values identical to our Q-Table.
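If you want to convince yourself of that “masking” behavior, here is a small numpy sketch (the weight values are random stand-ins, not learned values) showing that a one-hot row vector times a 16×4 matrix simply reads out the matching row, exactly like a table lookup by state:

import numpy as np

weights = np.random.uniform(0, 0.01, size=(16, 4))   # stand-in for a learned weights matrix
state = 0
one_hot = np.identity(16)[state:state+1]              # shape (1, 16), one-hot for state '0'

output = one_hot.dot(weights)                          # shape (1, 4)
print(np.allclose(output, weights[state]))             # True: identical to reading row 'state'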

So what is the difference between our Neural Network and the Q-Table? Why is a neural network able to scale better than a table? Due to the simplicity of the implementation of this lesson’s neural network, I’d have to say there isn’t actually much difference. However, imagine a problem where you didn’t need a one-hot input. The same 16×4 matrix of weights will still “solve” problems for any combination of values used as input. For example, a lot of neural network examples show how to classify digits where the input is the values in the flattened array of pixels making up the image. Not only do you have multiple ‘activated’ input nodes, but the values range from 0-1 in intensity as well. That mixed sort of input is definitely not something you can easily query in a standard table.

TensorFlow

The solution for this lesson relies heavily on the TensorFlow module. Training a neural network requires some pretty advanced math, but common algorithms like gradient descent are already provided and are highly optimized to take advantage of your hardware wherever possible. Let’s take a moment to learn a bit more about how to work with it all. Open up a python console window and feel free to follow along:

import tensorflow as tf

We’ll get started by importing the module. We can use “as tf” to abbreviate our calls instead of having to type out the full module name each time. Note that this module may take a while to load – it’s a big one.

Not surprisingly, given the module’s name, you will primarily work on something called a tensor. As a programmer, I think of it as a sort of “base class” from which other data structures inherit. A scalar (single value) is like a 0-D tensor, a vector is like a 1-D tensor, and a typical matrix is like a 2-D tensor, etc. Tensors can be made up of higher dimensions as well.
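As a quick illustration (using the same TensorFlow 1.x style as the rest of this lesson, and assuming the ‘import tensorflow as tf’ from above):

scalar = tf.constant(5.0)                  # 0-D tensor, shape ()
vector = tf.constant([1.0, 2.0, 3.0])      # 1-D tensor, shape (3,)
matrix = tf.constant([[1.0, 2.0],
                      [3.0, 4.0]])         # 2-D tensor, shape (2, 2)
print(scalar.shape, vector.shape, matrix.shape)
# output:
# () (3,) (2, 2)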

a = tf.constant(5.0)
b = tf.constant(6.0)
c = a * b

Here I have created three very simple tensor objects. If these were normal python datatypes, I could print them to the screen and see their value immediately. For example, when printing ‘c’ I might expect to see the value ’30’ appear in the console. What I actually see looks like this:

<tf.Tensor 'mul:0' shape=() dtype=float32>

The tensor’s value hasn’t actually been evaluated yet. It knows that there is going to be a tensor that is dependent on the multiplication of two other tensors, but it is waiting on the computation until I tell it to actually run. As it stands I have only put together a sort of dataflow “graph”.

You run operations in tensorflow using a session.

sess = tf.Session()
sess.run(c)
# output:
# 30.0

There are two ways to create a session – with or without a context manager. Here I created a session without a context manager, so I am responsible for closing it later. I wanted to leave it open so we could continue to experiment with a variety of operations. Next I called “run” and passed in our ‘c’ tensor. I think of the ‘c’ tensor as the last leaf of the graph that needs to be evaluated – because it has dependencies on ‘a’ and ‘b’, they will automatically be evaluated as needed. The result of running with the current graph is to see the value ‘30.0’ print to the screen.
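For reference, the context-manager form (which the solution code further below uses) looks like this; the name ‘temp_sess’ is just mine for the example:

with tf.Session() as temp_sess:
    print(temp_sess.run(c))
# output:
# 30.0
# the session closes automatically when the 'with' block ends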

d = tf.constant([[1.0, 2.0], [3.0, 4.0]])
e = tf.constant([[1.0, 1.0], [0.0, 1.0]])
f = tf.matmul(d, e)
sess.run(f)
# output:
# array([[ 1.,  3.],
#        [ 3.,  7.]], dtype=float32)

Here is another example to reinforce the same ideas. I have defined three new tensors: ‘d’, ‘e’, and ‘f’, all of which are 2-D tensors much like a typical matrix. The tensors ‘d’ and ‘e’ are multiplied via “matmul” which performs matrix multiplication. Like before, this wasn’t evaluated until we used the session object to “run” the final result of tensor ‘f’. Note that the leaf of this graph has no dependencies on the earlier three tensors (‘a’, ‘b’ and ‘c’) so I don’t see their calculated result (‘30.0’) included in the output.

When we created our Q-Table in the previous lesson, we initialized all of its values to zero. When working with a neural network, you initialize a similar table-like matrix of weights; however, it is standard practice to initialize the weights with small, different, non-zero values. We can create random tensors to serve this purpose like so:

w = tf.random_uniform([16,4],0,0.01)
sess.run(w)
# output:
# array([[ 0.00441311,  0.003696  ,  0.00391775,  0.00458831],
#        [ 0.00571358,  0.00573707,  0.00489775,  0.0091996 ],
#        [ 0.000183  ,  0.00866533,  0.00849367,  0.0064985 ],
#        [ 0.00794172,  0.00897235,  0.00482526,  0.00513851],
#        [ 0.00893636,  0.00359473,  0.00211888,  0.00951563],
#        [ 0.00529519,  0.00864596,  0.00043187,  0.00362178],
#        [ 0.00973868,  0.00188461,  0.00794218,  0.00229512],
#        [ 0.00604649,  0.00714333,  0.00995134,  0.0032349 ],
#        [ 0.00851274,  0.00915071,  0.00586296,  0.00170361],
#        [ 0.00238598,  0.0092519 ,  0.00845877,  0.00192152],
#        [ 0.0046787 ,  0.00372767,  0.00333749,  0.00426622],
#        [ 0.00719282,  0.0004046 ,  0.00793462,  0.00073979],
#        [ 0.00330698,  0.00515023,  0.00336977,  0.0009353 ],
#        [ 0.00931754,  0.0094718 ,  0.00650125,  0.00393471],
#        [ 0.00257977,  0.00671494,  0.00363974,  0.00679468],
#        [ 0.00610296,  0.00429768,  0.00274958,  0.00202579]], dtype=float32)

Now we have a ‘16×4’ tensor of values ranging between ‘0’ and ‘0.01’. This is a great start, but there’s a small problem. Try running sess.run(w) a couple more times. You should see that every time it runs, the ‘w’ tensor will be holding a different set of values. We need the initial values to persist between calls to “run” so that we can adjust them over time and allow the neural network to learn.

In order to persist state between calls to “run”, we’ll need to wrap the weights tensor in a variable. The session will then store the variable in memory until the session is closed and its memory is released.

W = tf.Variable(w)
init = tf.initialize_all_variables()
sess.run(init)

When working with variables, I must also initialize them. Note that ‘tf.initialize_all_variables’ is deprecated so you’ll probably see a warning printed to the screen. I used this version because it matches the solution code, but feel free to use ‘tf.global_variables_initializer’ instead.
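If you would rather avoid the deprecation warning, the non-deprecated equivalent (still TensorFlow 1.x) looks like this:

init = tf.global_variables_initializer()
sess.run(init)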

Now that we have created a variable and initialized it, run this a few times (note the capital ‘W’): sess.run(W). The same values should be printed every time!

There is one more special kind of tensor we need to discuss called a placeholder. You can use this placeholder while building up a graph of other tensors and operations, but wait to provide values for it until you “run” your session. This is how we will provide different “input” to our neural network on each step of its training, and is also how the neural network will actually know which state is active as we take actions on our gym environment.

inputs1 = tf.placeholder(shape=[1,16],dtype=tf.float32)

Let’s take a quick detour into numpy for a convenient way to feed our placeholder tensor:

import numpy as np

Don’t forget to import the module.

states = np.identity(16)
states
# output:
# array([[ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
#        [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
#        [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
#        [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
#        [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
#        [ 0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
#        [ 0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
#        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
#        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
#        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
#        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
#        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
#        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
#        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.],
#        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
#        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.]])

The code above creates an identity matrix, which means that all of the values are ‘0’, except for a diagonal where they are ‘1’. This is convenient because each row matches the representation of a state as a one-hot vector.

s = 0
t = states[s:s+1]
t
# output:
# array([[ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])

Given a state ‘s’ I can grab the corresponding one-hot vector by grabbing a slice of our identity matrix with a range of [s:s+1]. Above we grabbed the one-hot vector for state ‘0’.

Now we have a great way to “feed” our placeholder tensor. We could simply print it like so:

sess.run(inputs1,feed_dict={inputs1:t})
# output:
# array([[ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]], dtype=float32)

Or we could use it in calculations like so:

Qout = tf.matmul(inputs1,W)
sess.run(Qout,feed_dict={inputs1:t})
# output:
# array([[ 0.00382317,  0.00165965,  0.0075581 ,  0.00908442]], dtype=float32)

The solution code includes a variety of other operations like creating a gradient descent optimizer and using it to train our neural network’s weights. Their use follows the same patterns we’ve already demonstrated – build a graph, then run it. Any other helpful info I have, I’ll just provide as we break down the actual solution code.
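If you’d like to see that “build a graph, then run it” training pattern in isolation before diving into the real thing, here is a tiny sketch that uses gradient descent to pull a single made-up weight toward a target value. All of the names here (‘x’, ‘target’, ‘weight’, and so on) are my own for the example; the real solution below trains the 16×4 weights matrix in exactly the same style:

x = tf.placeholder(dtype=tf.float32)
target = tf.placeholder(dtype=tf.float32)
weight = tf.Variable(0.0)
prediction = weight * x
toy_loss = tf.square(target - prediction)
trainer_toy = tf.train.GradientDescentOptimizer(learning_rate=0.1)
toy_train = trainer_toy.minimize(toy_loss, var_list=[weight])   # only adjust 'weight'

sess.run(weight.initializer)          # initialize just the new variable
for _ in range(100):
    sess.run(toy_train, feed_dict={x: 1.0, target: 3.0})
print(sess.run(weight))
# output (approximately):
# 3.0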

I’ve only scratched the surface of this module of course. If you want to go deeper, there are a ton of learning resources available online.

Solution Breakdown

For convenience, here is the entire solution taken from its original source. As a reminder, I am not the author of the following code, I am merely commenting on it.

import gym
import numpy as np
import random
import tensorflow as tf
import matplotlib.pyplot as plt
%matplotlib inline

env = gym.make('FrozenLake-v0')
tf.reset_default_graph()

#These lines establish the feed-forward part of the network used to choose actions
inputs1 = tf.placeholder(shape=[1,16],dtype=tf.float32)
W = tf.Variable(tf.random_uniform([16,4],0,0.01))
Qout = tf.matmul(inputs1,W)
predict = tf.argmax(Qout,1)

#Below we obtain the loss by taking the sum of squares difference between the target and prediction Q values.
nextQ = tf.placeholder(shape=[1,4],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(nextQ - Qout))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
updateModel = trainer.minimize(loss)

init = tf.initialize_all_variables()

# Set learning parameters
y = .99
e = 0.1
num_episodes = 2000
#create lists to contain total rewards and steps per episode
jList = []
rList = []
with tf.Session() as sess:
    sess.run(init)
    for i in range(num_episodes):
        #Reset environment and get first new observation
        s = env.reset()
        rAll = 0
        d = False
        j = 0
        #The Q-Network
        while j < 99:
            j+=1
            #Choose an action by greedily (with e chance of random action) from the Q-network
            a,allQ = sess.run([predict,Qout],feed_dict={inputs1:np.identity(16)[s:s+1]})
            if np.random.rand(1) < e:
                a[0] = env.action_space.sample()
            #Get new state and reward from environment
            s1,r,d,_ = env.step(a[0])
            #Obtain the Q' values by feeding the new state through our network
            Q1 = sess.run(Qout,feed_dict={inputs1:np.identity(16)[s1:s1+1]})
            #Obtain maxQ' and set our target value for chosen action.
            maxQ1 = np.max(Q1)
            targetQ = allQ
            targetQ[0,a[0]] = r + y*maxQ1
            #Train our network using target and predicted Q values
            _,W1 = sess.run([updateModel,W],feed_dict={inputs1:np.identity(16)[s:s+1],nextQ:targetQ})
            rAll += r
            s = s1
            if d == True:
                #Reduce chance of random action as we train the model.
                e = 1./((i/50) + 10)
                break
        jList.append(j)
        rList.append(rAll)
print "Percent of succesful episodes: " + str(sum(rList)/num_episodes) + "%"
plt.plot(rList)
plt.plot(jList)

After running the code above (plus some simple modifications to save the output; a sketch of one possibility follows the observations below), I recreated the final lookup graph as I had done for the Q-Table.

Here are a few observations I’ve made:

  • It successfully solved the problem, and found the same winning path as the Q-Table solution. (To check for yourself, verify which side of each tile has the highest value).
  • The neural network seems less “confident” in the answers because the values of each slice are more similar than they are with the Q-Table. It could be the case that different “magic” number constants would have produced a better result, but it could also be due to the randomness that goes along with stochastic gradient descent (the optimizer used to train the neural net).
  • Just like before, none of the “end” state weights were updated, but note that in this case they retain their random initial weights instead of ‘0.0’.
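The original post doesn’t show those output-saving modifications, but a minimal version (not the author’s exact code; the filename and color map are just my choices) might read the learned weights back out of the session before it closes and plot them, for example:

# Place inside the 'with tf.Session() as sess:' block, after the training loop.
learned_weights = sess.run(W)                   # a plain 16x4 numpy array
np.save('frozen_lake_weights.npy', learned_weights)

plt.imshow(learned_weights, cmap='viridis')     # rows = states, columns = actions
plt.xlabel('Action (Left, Down, Right, Up)')
plt.ylabel('State')
plt.show()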

You might notice that there is a lot of code similar to the previous lesson’s solution. With the basics of tensorflow also covered, I hope that most of this already makes sense. To be sure, let’s go ahead and break it down:

import gym
import numpy as np
import random
import tensorflow as tf
import matplotlib.pyplot as plt
%matplotlib inline

You should already be very familiar with loading modules. We have used gym and numpy before, and we learned about tensorflow in this lesson. It doesn’t appear that the random module is actually used. That leaves ‘matplotlib.pyplot’ as the only important new material – it allows you to make pretty graphs. The last bit, %matplotlib inline, refers to a magic function in IPython whose purpose is to include your graphs in your Jupyter notebook. It won’t work in the python console window (you’ll get ‘SyntaxError: invalid syntax’), but don’t worry because it is not needed to solve the challenge.

env = gym.make('FrozenLake-v0')
tf.reset_default_graph()

You should already be familiar with creating the frozen lake environment, but the call to reset the tensorflow graph is new. Since we aren’t explicitly creating our session with our own graph, it will use a default graph instead. This call just makes sure we’re beginning with a clean slate.

#These lines establish the feed-forward part of the network used to choose actions
inputs1 = tf.placeholder(shape=[1,16],dtype=tf.float32)
W = tf.Variable(tf.random_uniform([16,4],0,0.01))
Qout = tf.matmul(inputs1,W)
predict = tf.argmax(Qout,1)

This set of operations makes up our neural network:

  • ‘inputs1’ is a placeholder tensor which we will feed with a one-hot vector representing the current game state. It is a 1×16 tensor.
  • ‘W’ is a variable tensor that is initialized with random values between ‘0’ and ‘0.01’, but then its values will persist for any additional calls to “run”. ‘W’ represents the weights of our neural network. You can think of these weights as being similar to the values of the Q-Table from the previous lesson – it indicates the value of taking an action at a given state. It is a 16×4 tensor.
  • ‘Qout’ is a tensor which holds the result of a matrix multiplication between our input and weights. The values it holds are the predicted reward values of the actions that can be taken from the active state. It is a 1×4 tensor.
  • ‘predict’ is a tensor holding the index of the ‘Qout’ action with the highest value. It is the neural network’s “choice” for us to apply to the gym environment.

Because ‘predict’ has dependencies on ‘Qout’ which has dependencies on ‘inputs1’ and ‘W’, we know that all of these lines will be evaluated if we “run” our ‘predict’ tensor.

#Below we obtain the loss by taking the sum of squares difference between the target and prediction Q values.
nextQ = tf.placeholder(shape=[1,4],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(nextQ - Qout))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
updateModel = trainer.minimize(loss)

These lines of code relate to how we train our neural network.

  • ‘nextQ’ – this is another placeholder Tensor. It will be populated with new target Q-values that we want to train the network toward. It is a 1×4 tensor.
  • ‘loss’ – this tensor holds a single value: the sum of the squares of the differences between our ‘nextQ’ and ‘Qout’ tensors. It is the quantity used in training our neural network, and it drives how much the weights get adjusted based on the amount of difference (a short numeric example follows this list).
  • ‘trainer’ – Gradient descent is a common algorithm used to train a neural network. It involves a lot of complex math that tensorflow handles auto-magically for us. Its purpose is to update the weights of our neural network, much like we used the Bellman equation to update the entries in our Q-Table. In this case, we only need to experiment with one magic number – the “learning rate”. Like before, we are trying to balance between speed and quality. I have seen a range of values used: as high as ’10’ and as low as ‘0.00001’. A number like ‘0.1’ might be considered a typical starting place, although perhaps even that might be on the high end. Learning rates that are too large risk ‘diverging’ from the correct answer, whereas learning rates that are too small can require too many steps of training to be practical.
  • ‘updateModel’ – this is the operation which will try to minimize the “loss” of our neural network. The minimize call computes and applies gradients towards this purpose.
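To put illustrative numbers on that loss calculation (the values below are made up, not taken from an actual run):

Qout_example  = np.array([[0.08, 0.02, 0.02, 0.03]])   # network's current prediction
nextQ_example = np.array([[0.08, 0.05, 0.02, 0.03]])   # target we want to move toward

loss_example = np.sum(np.square(nextQ_example - Qout_example))
print(loss_example)
# output (approximately):
# 0.0009    # only the one differing entry contributes: (0.05 - 0.02)**2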

init = tf.initialize_all_variables()

# Set learning parameters
y = .99
e = 0.1
num_episodes = 2000
#create lists to contain total rewards and steps per episode
jList = []
rList = []

Because we are using tensorflow variables, we must initialize them – we covered that in the overview above. The learning parameters should feel similar to our previous solution. We have seen a ‘discount factor’ (‘y’) before, and also have used a ‘num_episodes’ constant before. They are just constants so that our numbers are better ‘documented’ in our code. The ‘e’ is new, and will be used to add a bit of extra random exploration to our game – the value it holds will decrease over time. The two lists ‘jList’ and ‘rList’ are for diagnostic purposes and don’t relate to the process of solving this challenge, so I will simply ignore them.

with tf.Session() as sess:
    sess.run(init)
    for i in range(num_episodes):
        #Reset environment and get first new observation
        s = env.reset()
        rAll = 0
        d = False
        j = 0
        #The Q-Network
        while j < 99:
            j+=1

Next we start a large body of code beginning with the keyword ‘with’. This creates a tensorflow session using a context manager. In other words, when the main body ends, the session will automatically close and have its memory cleared. Next, we “run” the “init” for our variables. Then we start our outer loop (one pass per episode) and inner loop (one pass per turn) of training just like we did in the previous lesson. This includes resetting the gym environment and setting a few local variables such as whether or not we are done with the simulation. Refer back to that lesson if you need further clarification.

#Choose an action by greedily (with e chance of random action) from the Q-network
a,allQ = sess.run([predict,Qout],feed_dict={inputs1:np.identity(16)[s:s+1]})
if np.random.rand(1) < e:
    a[0] = env.action_space.sample()

Here we see another way to invoke the “run” method – by passing a list of things to evaluate. Technically ‘Qout’ will already be evaluated because we are also evaluating ‘predict’, but in order to get a convenient reference to its output we included it in the list. The method will return one value for each element in the list, which we unpack on the left-hand side. Also remember that because our ‘predict’ has a dependency on the placeholder tensor ‘inputs1’, we must feed the values for that tensor now. The value we feed is based on selecting a row of the 16×16 identity matrix just like we did when I first introduced placeholder tensors in the tensorflow section above.

The ‘a’ will hold the evaluation of ‘predict’ which is the index of the action with the highest value. The ‘allQ’ will hold the values of all of the actions available at the current state.

Finally, there is a quick roll to see if a random number is less than our exploration ‘e’. If so, we override the chosen action with a random sample from the environment.

#Get new state and reward from environment
s1,r,d,_ = env.step(a[0])

This should look familiar by now: we are applying the chosen action to our environment, which returns the new state, reward, done flag, and unused diagnostic info.

#Obtain the Q' values by feeding the new state through our network
Q1 = sess.run(Qout,feed_dict={inputs1:np.identity(16)[s1:s1+1]})

Using the new state, obtained by applying our action, we run another pass on our neural network to get the estimated Q values for the new state.

#Obtain maxQ' and set our target value for chosen action.
maxQ1 = np.max(Q1)
targetQ = allQ
targetQ[0,a[0]] = r + y*maxQ1

Now we store a reference to the highest value (‘maxQ1’) returned by the actions of the next state (‘Q1’), which we will use to help train our network’s weights. We store the ‘allQ’ values in ‘targetQ’ (note this is a reference to the same array rather than an independent copy, which is fine here because ‘allQ’ isn’t reused afterward). Then we update the one value in ‘targetQ’ corresponding to the index of the action we actually selected, ‘a[0]’. The new value is the reward we earned for this step, plus some discounted portion of the value of the best action at the next state, ‘maxQ1’.

It is probably worth pointing out that we did not directly update the weights matrix of our neural network. We have only updated a ‘copy’ of the output of the network. This updated variant will be our teaching data, and the difference between this teaching data, and the scores that were previously predicted will be applied as the ‘loss’ function that tells our gradient descent optimizer how to modify the network’s weights.
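Here is what that update looks like with made-up numbers (none of these come from an actual run; ‘a0’ stands in for ‘a[0]’):

y = 0.99
r = 0.0                                          # reward returned by env.step
allQ = np.array([[0.08, 0.02, 0.02, 0.03]])      # network output for the current state
Q1   = np.array([[0.01, 0.05, 0.02, 0.03]])      # network output for the next state
a0   = 0                                         # the action that was actually taken

maxQ1 = np.max(Q1)                               # 0.05
targetQ = allQ                                   # same array object, not a copy
targetQ[0, a0] = r + y * maxQ1                   # 0.0 + 0.99 * 0.05 = 0.0495
print(targetQ)
# output (approximately):
# [[0.0495 0.02 0.02 0.03]]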

#Train our network using target and predicted Q values
_,W1 = sess.run([updateModel,W],feed_dict={inputs1:np.identity(16)[s:s+1],nextQ:targetQ})

Now that we have the teaching data, we can run the “updateModel” code. Remember that this object has a pretty large graph of dependencies. It knows it is going to use a loss function to train a neural network using gradient descent. It knows the loss function is between the passed ‘nextQ’ and a calculated ‘Qout’. It knows that ‘Qout’ is a matrix multiply between the ‘inputs1’ and the variable representing our matrix weights ‘W’. The run will return values (_,W1) but the ‘_’ indicates an unused value, and ‘W1’ isn’t actually used either.

            rAll += r
            s = s1
            if d == True:
                #Reduce chance of random action as we train the model.
                e = 1./((i/50) + 10)
                break
        jList.append(j)
        rList.append(rAll)
print "Percent of succesful episodes: " + str(sum(rList)/num_episodes) + "%"
plt.plot(rList)
plt.plot(jList)

Most of the rest of this is diagnostic, such as accumulating ‘rAll’ with the value of ‘r’, appending values to the ‘jList’ and ‘rList’, and printing the results. However, there is also an important step just after checking whether or not the environment has entered a done state. As the comment says, the author is reducing the value of ‘e’, which is our exploration frequency. Here are a couple of values as ‘i’ changes over time (a quick snippet to reproduce them follows the list):

  • [i is 0]; 1./(( 0/50) + 10) = 0.1
  • [i is 50]; 1./(( 50/50) + 10) = 0.09 (rounded)
  • [i is 100]; 1./(( 100/50) + 10) = 0.08 (rounded)
  • [i is 1000]; 1./((1000/50) + 10) = 0.03 (rounded)
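Note that the original code is Python 2, where ‘i/50’ is integer division; in Python 3 you would write ‘i//50’ to reproduce the same schedule:

for i in [0, 50, 100, 1000]:
    e = 1. / ((i // 50) + 10)
    print(i, round(e, 2))
# output:
# 0 0.1
# 50 0.09
# 100 0.08
# 1000 0.03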

Summary

In this lesson we introduced neural networks, learned to work with the TensorFlow module, and finally looked at yet another solution of the frozen lake environment that used a neural network instead of a Q-Table. At this point we have finished covering all of the material from the original post that inspired this mini series of lessons. If you enjoyed it, be sure to check out that post, which links several more lessons, each more challenging than the last, but also much more capable. By the end, you could be making A.I. smart enough to play Atari games!

If you find value in my blog, you can support its continued development by becoming my patron. Visit my Patreon page here. Thanks!
