Q-Learning Agents, Part 1

Machine Learning provides us an interesting way to solve special kinds of problems. If you’re just playing around, you may see that creating a good problem to work with can be a lot of work on its own. OpenAI gym has recognized this challenge and provided a great solution. They have created a whole collection of different “environments” that are perfectly suited to machine learning. To help us get started, we will be looking at one of the easy challenges which we can solve using Q-Learning.


This mini-series is based on this blog post. I wholeheartedly recommend you read the original post and support the author there. While I found the original post to be excellent, I also felt that it was too advanced for me upon my first encounter. My goal is to elaborate a lot more and try to make the ideas within more accessible to beginners, myself included. I am also only a beginning student of machine learning, but I feel that one of the best ways to learn is to try to teach. If I say something wrong along the way, feel free to point it out.


I am assuming at least a little familiarity with Python (version 3.5+ is needed for gym), including the ability to install it and any needed modules, and how to read and write basic Python code. If you need to brush up a little first, here are a couple of links that might help:

Frozen Lake

The OpenAI gym environment we will use for this lesson is called Frozen Lake. You can check out the home page for this environment here. It is one of the “Toy text” environments, all of which are categorized as Easy. Frozen Lake is actually just a 4×4 grid of letters, where each letter represents something about the world:

  • S: is where the player (S)tarts the game
  • F: is (F)rozen ice, and is safe to walk on
  • H: is a (H)ole, and will cause you to lose the game
  • G: is the (G)oal, and when you reach this tile you win!

The game board looks something like this:

  SFFF
  FHFH
  FFFH
  HFFG

I know that letters aren’t exactly exciting to look at, but try to keep in mind that this merely needs to represent an environment, and that we could choose to render it just as easily with any sort of asset in any theme of our choosing. Below I have created a tile map for some sort of dungeon using the exact same layout. A hero occupies the starting square, spikes are used in place of holes, and stairs appear at the goal tile as an exit for this level:

As you might have guessed, the only input needed for this game is directional input – to move from square to square. You can move in any of the four directions: up, down, left, or right. When you provide input, you “should” move to the next tile in the direction indicated. You are allowed to provide any input from any tile, even if it doesn’t seem to provide any value. For example, in the start position, using the input for up or left would just mean you bump into a wall and end up in the same place you started.

However, this environment is actually a little more complex than it first appears. In this game, you are not guaranteed to move in the direction of your input. There is a chance that random wind, not to mention slippery ice, will result in a movement perpendicular to the direction you tried for. For example, you might provide input with the intention of moving right, but end up in the tile above or below your current one instead. This element of randomness is what provides a challenge sufficient for machine learning. Otherwise, it would be a trivial matter to simply hard code a path to follow and be done with it – no machine learning necessary.


A game with completely predictable outcomes for each input would be called “deterministic”, while games with probabilistic outcomes are called “stochastic”.
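To make the difference concrete, here is a small sketch in plain Python (no gym required). The helper names are my own, and I am assuming that the intended direction and the two perpendicular directions are all equally likely, which is just one way a stochastic game could behave.

```python
import random
from collections import Counter

# Hypothetical lookup table: the two directions perpendicular to each input.
PERPENDICULAR = {
    "left": ("up", "down"),
    "right": ("up", "down"),
    "up": ("left", "right"),
    "down": ("left", "right"),
}

def deterministic_direction(intended):
    """In a deterministic game you always go where you asked to go."""
    return intended

def stochastic_direction(intended, rng):
    """In a stochastic game the outcome is drawn at random.

    Assumption: the intended direction and both perpendicular
    directions each have the same chance of being chosen.
    """
    side_a, side_b = PERPENDICULAR[intended]
    return rng.choice([intended, side_a, side_b])

rng = random.Random(0)  # seeded so the experiment is repeatable
counts = Counter(stochastic_direction("right", rng) for _ in range(3000))
print(deterministic_direction("right"))
print(counts)
```

Running it shows that asking for “right” only sometimes produces “right”, and never produces “left”.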

Manual Play

To reinforce the ideas behind the environment, it can be a good idea to play around with it manually before you start programming your solution. Feel free to use whatever IDE you are most comfortable with, but for something this simple, even the Python console window will work. Try following along:

import gym
env = gym.make('FrozenLake-v0')

We begin by importing the gym module, and then create the environment for Frozen Lake as we had mentioned above. We will be using the “env” object a lot to determine what the environment looks like, where we are, what actions we can take, etc.

s = env.reset()
print(s)  # prints 0

Before you start playing with the environment, it is important to call “reset” on it. This makes sure that everything is properly configured for you. It also returns the value of the initial state which I have stored in a variable ‘s’. I then printed ‘s’ on the next line which showed that we are currently in state ‘0’.

During play, and especially during training, you are very likely to encounter an end-game state. Any tile with a hole or with the goal will cause the game to end, and then you must call “reset” again if you want to continue to play or train your agent.

env.render()

Probably the first thing you will be curious about is what the environment looks like. You can display the environment with the “render” method. In the terminal, it outputs a grid of letters just like I showed before, but it also highlights my position with a red box:

Even just by looking, you might be able to tell how many possible “states” exist for this game. There will be one state per tile that you can reach. Since the game board is a 4×4 grid, there will be 16 states, indexed from 0 to 15. The states are laid out in the following pattern (left to right then top to bottom):

  0,  1,  2,  3
  4,  5,  6,  7
  8,  9, 10, 11
 12, 13, 14, 15
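If you ever need to go back and forth between a state index and a position on the board, the layout above boils down to a little arithmetic. This is just a sketch with function names of my own choosing; gym itself only hands you the index.

```python
GRID_WIDTH = 4  # the board is a 4x4 grid

def state_to_position(state):
    """Convert a state index (0-15) into a (row, column) pair."""
    row, col = divmod(state, GRID_WIDTH)
    return row, col

def position_to_state(row, col):
    """Convert a (row, column) pair back into a state index."""
    return row * GRID_WIDTH + col

print(state_to_position(5))      # (1, 1)
print(position_to_state(3, 3))   # 15
```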

Some environments will be more complex, so you can also query the number of states by printing the value of “env.observation_space.n” to the console:

print(env.observation_space.n)  # will print 16

Similarly you may want to query the number of possible actions which can be taken. I mentioned earlier that we are able to move in four directions, so you might have already guessed the number of possible actions:

print(env.action_space.n)  # will print 4

Here I simply printed the value of the environment’s action space. The output is “4” which tells me there are four actions I can take. These actions are indexed by values from 0 to 3 and their purposes are as follows:

  • 0 = Left
  • 1 = Down
  • 2 = Right
  • 3 = Up

To actually perform an action, you use a method called “step” and pass the index of an action:

env.step(1)  # 1 = Down
env.render()

Note that I also called render after applying an action. This time, the rendered output will also show a labeled direction just above the game board, which happens to be the direction I “tried” to go. But did you actually go that direction? I didn’t – the random wind and ice kicked in, and even though I tried to move Down, I actually moved Right.

action = env.action_space.sample()
observation, reward, done, info = env.step(action)

In this example, I picked a random action from the available actions and applied it. This might be handy when you are in the “exploration” phase of machine learning, and you don’t have any idea about what a good or bad action would be. Not surprisingly, my random choice didn’t perform very well, and I fell in a hole. As a human I can see this easily in the rendered output because the red highlight sits over a tile marked by the letter ‘H’. In order to inform our machine learning agent of the same ideas, we will need to use the four values that a call to “env.step” returns:

  1. observation – this is the new state of our board game after having applied the action. In my case, it held the value ‘5’ which is the state of the game when I have fallen in the hole just southeast of the starting tile.
  2. reward – this is a reward earned based on entering the new state. Earning a positive reward is how the machine learning agent will see that an action was good. If we reach the goal, the value will be ‘1.0’. In my case, the reward was ‘0.0’ indicating that there is no benefit to falling in a hole. In fact, every reward will be zero until we reach the goal.
  3. done – this will either be true or false and indicates whether the game has ended. Falling in a hole or reaching the goal will both result in this value being true. At any other location the value will be false.
  4. info – this is intended as diagnostic information only. It helps you understand what happened and why. You aren’t actually supposed to use it for training your agent (that is considered cheating), so I will mostly ignore it for now. But what do you suppose it means for ‘prob’ to be 0.333…? See if you can guess while playing.
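To see how “reward” and “done” relate to the tile you land on, here is a rough sketch in plain Python. It is not gym's actual code: the function name is mine, and I am assuming the default 4×4 layout where state 5 is a hole and state 15 is the goal.

```python
# Assumed default 4x4 layout, flattened so the index matches the state number.
LAKE = "SFFF" "FHFH" "FFFH" "HFFG"

def reward_and_done(state):
    """Return the (reward, done) pair for entering the given state."""
    tile = LAKE[state]
    reward = 1.0 if tile == "G" else 0.0   # only the goal pays out
    done = tile in "HG"                    # holes and the goal end the game
    return reward, done

print(reward_and_done(5))    # (0.0, True)  - fell in a hole
print(reward_and_done(15))   # (1.0, True)  - reached the goal
print(reward_and_done(1))    # (0.0, False) - still on the ice
```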

Some environments might choose to use a negative reward as a punishment for a bad action like falling in a hole. Other environments might even use small negative rewards for each step, which makes an agent want to “hurry” toward the goal – otherwise, it might be happy with the scenic route. The way you distribute the reward to a machine learning agent can have a huge impact on what it actually learns. The environments we will be using are already configured to distribute appropriate rewards automatically, so you won’t have to worry too much about it yet.
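As a purely hypothetical illustration of those alternative schemes (Frozen Lake does not do this), a reward function with a punishment for holes and a small cost per step might look something like the sketch below. The penalty values are numbers I made up for the example.

```python
def shaped_reward(tile, step_penalty=-0.01, hole_penalty=-1.0, goal_reward=1.0):
    """Hypothetical reward scheme: punish holes and charge a little per step."""
    if tile == "G":
        return goal_reward
    if tile == "H":
        return hole_penalty
    return step_penalty

# Two made-up routes to the goal, written as the tiles entered along the way.
direct_route = ["F"] * 5 + ["G"]   # six moves
scenic_route = ["F"] * 9 + ["G"]   # ten moves

direct_total = sum(shaped_reward(tile) for tile in direct_route)
scenic_total = sum(shaped_reward(tile) for tile in scenic_route)
print(direct_total > scenic_total)  # True
```

With a per-step cost, the shorter route earns a higher total, which is what nudges an agent to hurry.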

Whether I notice it in the rendered output or by seeing that “done” is true, the game has ended, so I must now reset the environment to play again.

s = env.reset()

Try to play on your own until you reach the goal. It might be a little hard due to the random wind. How did you end up winning? Did you always use the direction you wanted to go even though it might have a chance of you falling in a hole?

One strategy might be to play it safe. For example, imagine you are just south of the starting tile, so there is a hole to your right. You probably want to go down at this point, but if you use down as input, there is a chance you will move left or right. Moving left wouldn’t be a problem because there is a wall and you will just stay put. But if you move right, you fall in a hole and it’s game over! The alternative is to try something like using left as your input. If you actually go the direction you input then you end up staying still, but if the random forces kick in you will either end up in the tile above or below your current position, both of which are safe locations.
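You can check this reasoning with a few lines of plain Python. The sketch below is my own and makes two assumptions: the board uses the default 4×4 layout, and a slip always sends you in one of the two directions perpendicular to your input. It lists every tile you could land on for a given state and action.

```python
# Assumed default 4x4 layout.
LAKE = ["SFFF", "FHFH", "FFFH", "HFFG"]
SIZE = 4
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
DELTAS = {LEFT: (0, -1), DOWN: (1, 0), RIGHT: (0, 1), UP: (-1, 0)}

def move(state, direction):
    """Move one tile in a direction; walls keep you inside the grid."""
    row, col = divmod(state, SIZE)
    d_row, d_col = DELTAS[direction]
    row = min(max(row + d_row, 0), SIZE - 1)
    col = min(max(col + d_col, 0), SIZE - 1)
    return row * SIZE + col

def possible_tiles(state, action):
    """Map each reachable next state to its tile letter."""
    # With this numbering, action - 1 and action + 1 are the perpendicular moves.
    directions = [(action - 1) % 4, action, (action + 1) % 4]
    result = {}
    for direction in directions:
        next_state = move(state, direction)
        row, col = divmod(next_state, SIZE)
        result[next_state] = LAKE[row][col]
    return result

print(possible_tiles(4, DOWN))  # includes state 5, which is a hole
print(possible_tiles(4, LEFT))  # only safe tiles
```

From state 4 (just south of the start), trying Down can land you on the hole at state 5, while trying Left can only leave you on states 0, 4, or 8.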


In this lesson, I introduced you to OpenAI gym. We explored one of its simple text-based environments and learned how to interact with it. Hopefully you followed along, and perhaps even beat the game yourself! Whether you did or not, don’t worry, because in the next post we will begin looking at how we can train the computer to learn to play all on its own.

