From pixels to RL policy
If you are interested in learning RL fundamentals, a good place to start is Frozen Lake, a deceptively simple yet challenging environment that distills the core concepts of reinforcement learning with a touch of 1990s video game nostalgia. In the animated .gif to the right, an elf (your agent) navigates an icy landscape in search of treasure, all while dodging hidden holes.
Frozen Lake, originally from OpenAI and now under the stewardship of the Farama Foundation, offers a hands-on introduction to Q-learning. But what is Q? In the context of reinforcement learning, it refers to an action-value function (the Q-value). Technically speaking, Q-learning is a model-free reinforcement learning algorithm where the goal is to learn the quality (Q-value) of actions. In other words, Q represents the agent’s best guess at the total reward it can get by:
- Taking a specific action in its current situation
- Then making the best possible choices for the rest of the game
Let’s break this down using Frozen Lake. Here, “best guess” means an estimate of how well each step is aligned with ultimately getting the treasure. The game itself involves a grid, like a small chessboard, but instead of black and white squares, you have safe ice and treacherous holes. Our elf starts in one corner; we’ll use zero-based coordinates, so the top-left cell of the 4×4 (or 8×8) grid is (0, 0) rather than (1, 1). Our elf needs to make it to the opposite corner ((3, 3) on the 4×4 map) without experiencing hypothermic shock or getting lost in the tiny Frozen Lake world.
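To make that indexing concrete, here’s a minimal sketch (assuming the row-major numbering Gymnasium uses, where the state is row × 4 + col on the 4×4 map) of how a grid position maps to the single integer state the agent actually observes:

```python
N_COLS = 4  # width of the 4x4 Frozen Lake map

def to_state(row: int, col: int) -> int:
    """Map a (row, col) grid position to the flat state index the agent sees."""
    return row * N_COLS + col

print(to_state(0, 0))  # 0  -> the elf's starting corner
print(to_state(3, 3))  # 15 -> the treasure in the opposite corner
```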
The Q-learning update rule, which is derived from the Bellman equation, is:
\( Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] \)
Where:
- \( Q(s_t, a_t) \) is the Q-value of taking action \( a_t \) in state \( s_t \). In Frozen Lake, this represents how good it is for our elf to take a specific action (e.g., move right) when in a particular position on the lake.
- \( \alpha \) is the learning rate. This determines how quickly our elf updates its knowledge based on new experiences. A higher \( \alpha \) means the elf is more adaptable to changes in the icy environment.
- \( r_{t+1} \) is the reward received after taking action \( a_t \). For our elf, this could be a positive reward for reaching the treasure, a negative reward for falling into a hole, or a small positive reward for safely moving on ice.
- \( \gamma \) is the discount factor. This represents how much the elf values future rewards compared to immediate ones. A low \( \gamma \) makes the elf focus on immediate safety, while a high \( \gamma \) encourages long-term planning to reach the treasure, even if it means taking a longer path.
- \( \max_{a'} Q(s_{t+1}, a') \) is the maximum estimated Q-value for the next state \( s_{t+1} \) across all possible actions \( a' \). This is like our elf looking ahead and considering the best possible outcome from its next position on the lake.
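Put into code, a single Q-learning update might look like the following minimal sketch, assuming the Q-values are stored in a NumPy array indexed as `Q[state, action]` (the variable names `alpha` and `gamma` mirror the symbols above):

```python
import numpy as np

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Apply one Q-learning update: nudge Q(s, a) toward the TD target."""
    best_next = np.max(Q[next_state])        # max_a' Q(s_{t+1}, a')
    td_target = reward + gamma * best_next   # r_{t+1} plus discounted lookahead
    Q[state, action] += alpha * (td_target - Q[state, action])
    return Q
```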
In essence, Q-learning in Frozen Lake is about:
- States: Each square on the grid represents a unique state. So, in a 4×4 Frozen Lake, you have 16 states.
- Actions: The elf can choose from four actions: move up, down, left, or right.
- Rewards: In the default environment, the elf gets a reward of +1 for reaching the treasure and 0 for everything else; many tutorials shape this further with a penalty for falling into a hole (e.g., -1) or a small reward for safely moving on ice (e.g., +0.1) to nudge the elf’s behavior.
- Q-Matrix: This is where the learning magic happens. Q-learning uses a table to store a “Q-value” for each state-action pair. This value represents the expected reward for taking a particular action from a particular state.
- Learning: The elf starts by exploring randomly. As it interacts with the environment (moving, hitting holes, finding treasure), it updates the Q-values in the table. Over time, the Q-table becomes a roadmap, guiding the elf towards actions with higher expected rewards.
As the Q-matrix evolves through this learning process, it gradually shapes the elf’s decision-making strategy. This strategy, which determines how the elf chooses actions based on its current state, is what we call a “policy” in reinforcement learning.
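In symbols, that greedy policy simply picks the action with the highest Q-value in each state:

\( \pi(s) = \arg\max_{a} Q(s, a) \)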
Below, check out what the Q matrix might look like after the elf explores the board for some time. Initially, all of the cells will be zeroed out.
| State | Left | Down | Right | Up |
|-------|------|------|-------|------|
| 0 | 0.29 | 0.09 | 0.24 | 0.09 |
| 1 | 0.04 | 0.03 | 0.01 | 0.24 |
| 2 | 0.03 | 0.06 | 0.16 | 0.03 |
| 3 | 0.02 | 0.04 | 0.02 | 0.07 |
| 4 | 0.29 | 0.05 | 0.09 | 0.03 |
| 5 | 0.00 | 0.00 | 0.00 | 0.00 |
| 6 | 0.00 | 0.00 | 0.23 | 0.00 |
| 7 | 0.00 | 0.00 | 0.00 | 0.00 |
| 8 | 0.15 | 0.10 | 0.04 | 0.37 |
| 9 | 0.30 | 0.42 | 0.10 | 0.12 |
| 10 | 0.39 | 0.06 | 0.04 | 0.08 |
| 11 | 0.00 | 0.00 | 0.00 | 0.00 |
| 12 | 0.00 | 0.00 | 0.00 | 0.00 |
| 13 | 0.09 | 0.32 | 0.48 | 0.03 |
| 14 | 0.27 | 0.75 | 0.28 | 0.50 |
| 15 | 0.32 | 0.40 | 0.50 | 1.00 |
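As a rough sketch of how you might turn such a table into moves (assuming it is stored as a 16×4 NumPy array called `q_table`, with columns in Frozen Lake’s Left, Down, Right, Up action order), the greedy choice for each state is just the column with the largest value; for example, state 14 above maps to Down (0.75) and state 0 to Left (0.29):

```python
import numpy as np

actions = ["Left", "Down", "Right", "Up"]   # Frozen Lake's action order
q_table = np.zeros((16, 4))                 # stand-in for the learned values above

greedy = np.argmax(q_table, axis=1)         # best action index for every state
for state, a in enumerate(greedy):
    print(f"state {state:2d} -> {actions[a]}")
```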
Over many episodes of trial-and-error, the elf becomes adept at navigating Frozen Lake, consistently finding the treasure while avoiding a chilly demise.
Now, let’s break down this implementation, highlighting the main components that help our little elf evolve from a random explorer to a master of icy navigation.
Speaking of exploration, a quick diversion on epsilon (ε), which in RL is a parameter used in exploration strategies. It shows up in the epsilon-greedy algorithm where it represents the probability of taking a random exploratory action instead of exploiting the current knowledge. In other words, epsilon is our elf’s curiosity meter. A high epsilon means our elf is feeling adventurous, willing to try new paths and potentially discover hidden shortcuts. A low epsilon indicates a more cautious elf, sticking to the paths it knows best.
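A minimal sketch of that epsilon-greedy choice, assuming a NumPy Q-table and Gymnasium’s `env.action_space.sample()` for picking a random move, could look like this:

```python
import numpy as np

rng = np.random.default_rng()

def choose_action(Q, state, epsilon, action_space):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return action_space.sample()     # adventurous elf: try a random move
    return int(np.argmax(Q[state]))      # cautious elf: take the best-known move
```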
Breaking down the code
Here’s a brief overview of the Python that follows, which you can run in the linked Colab notebook to see the elf in action.
- Setting the Stage: We start by importing the necessary libraries and creating our Frozen Lake playground using Gymnasium (the successor to OpenAI Gym). This gives us a 4×4 grid where our elf can practice its ice-skating skills.
- Q-Table Initialization: Just like a blank map, we create a Q-table filled with zeros. This table will serve as our elf’s treasure map, eventually resembling a well-defined guide for the elf’s journey.
- The Learning Loop: Our elf doesn’t become an expert overnight; it needs to practice for thousands of episodes. In each episode, the elf starts at the beginning (state reset). Next, it decides whether to explore randomly (with probability epsilon) or exploit its current knowledge. Then, the elf takes an action, observes the result, and updates its Q-table accordingly.
The full Python is available in Google Colab, where you can see the elf in action, and below via GitHub Gist. In this example, the agent interacts with the environment 10,000 times. Turns out, you don’t need pointy shoes to master icy terrain. Just some trial and error and a well-tuned Q-table!
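If the notebook or Gist isn’t handy, here’s a minimal, self-contained sketch of that kind of training loop; the hyperparameter values (`alpha`, `gamma`, and the epsilon decay schedule) are illustrative choices, not necessarily the ones used in the notebook:

```python
import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True)
Q = np.zeros((env.observation_space.n, env.action_space.n))

alpha, gamma = 0.1, 0.99                      # learning rate, discount factor
epsilon, eps_min, eps_decay = 1.0, 0.01, 0.999
n_episodes = 10_000
rng = np.random.default_rng()

for episode in range(n_episodes):
    state, _ = env.reset()
    done = False
    while not done:
        # Explore with probability epsilon, otherwise exploit the Q-table.
        if rng.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Q-learning update toward the TD target.
        Q[state, action] += alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )
        state = next_state

    # Gradually shift from exploration to exploitation.
    epsilon = max(eps_min, epsilon * eps_decay)

env.close()
```

After training, the greedy policy `np.argmax(Q, axis=1)` should steer the elf to the treasure on most runs; the slippery ice keeps it from being perfect every time.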