
Slip and slide into reinforcement learning with the Frozen Lake challenge

By Brian Buntz | August 21, 2024

[Image: the Frozen Lake environment. Credit: Farama.org]

Remember when Deep Blue mastered chess and AlphaGo conquered Go, the board game with more possible positions than there are particles in the known universe? Or when cars learned to drive themselves? Reinforcement learning (RL) plays an essential role in each of these examples, and it is having a significant impact on R&D across an array of fields. In autonomous vehicles, it works in concert with sensor fusion, computer vision, and control systems to make real-time decisions in complex, ever-changing environments. From robots learning to automate suturing to quadruped robots mastering agile locomotion over rough terrain, RL is pushing engineering boundaries through trial and error, letting algorithms learn and adapt much as humans do, only faster and at a far larger scale.

From pixels to RL policy

If you are interested in learning RL fundamentals, a good place to start is Frozen Lake, a deceptively simple yet challenging environment that distills the core concepts of reinforcement learning with a touch of 1990s video game nostalgia. In the animated GIF of the environment, an elf (your agent) navigates an icy landscape in search of treasure, all while dodging hidden holes.

Frozen Lake, originally from OpenAI and now under the stewardship of the Farama Foundation, offers a hands-on introduction to Q-learning. But what is Q? In the context of reinforcement learning, it refers to an action-value function (the Q-value). Technically speaking, Q-learning is a model-free reinforcement learning algorithm where the goal is to learn the quality (Q-value) of actions. In other words, Q represents the agent’s best guess at the total reward it can get by:

  1. Taking a specific action in its current situation
  2. Then making the best possible choices for the rest of the game

Let’s break this down using Frozen Lake. Here, the best guess is about which steps ultimately lead to the treasure. The game itself involves a grid, like a small chessboard, but instead of black and white squares, you have safe ice and treacherous holes. Our elf starts in one corner. Let’s use [0,0] rather than [1,1] to refer to the top-left cell of the 4×4 (or 8×8) grid. Our elf needs to make it to the opposite corner ([3,3] on the 4×4 map) without experiencing hypothermic shock or getting lost in the tiny Frozen Lake world.
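A quick bookkeeping note for anyone following along in code: Gymnasium flattens the grid into integer state indices, numbering the cells row by row, so the 4×4 map gives states 0 through 15. A minimal sketch of that mapping (the helper name here is ours, not part of the library):

```python
# In Gymnasium's FrozenLake, the observation is a single integer:
# current_row * n_cols + current_col, with rows and columns starting at 0.
def cell_to_state(row, col, n_cols=4):
    return row * n_cols + col

print(cell_to_state(0, 0))  # 0  -> the elf's top-left starting square
print(cell_to_state(3, 3))  # 15 -> the bottom-right goal on the 4x4 map
```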

The Q-learning update rule, which is built on the Bellman equation, is:

\( Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] \)

Where:

\( Q(s_t, a_t) \) is the Q-value of taking action \( a_t \) in state \( s_t \). In Frozen Lake, this represents how good it is for our elf to take a specific action (e.g., move right) when in a particular position on the lake.

\( \alpha \) is the learning rate. This determines how quickly our elf updates its knowledge based on new experiences. A higher \( \alpha \) means the elf is more adaptable to changes in the icy environment.

\( r_{t+1} \) is the reward received after taking action \( a_t \). For our elf, this could be a positive reward for reaching the treasure, a negative reward for falling into a hole, or a small positive reward for safely moving on ice.

\( \gamma \) is the discount factor. This represents how much the elf values future rewards compared to immediate ones. A low \( \gamma \) makes the elf focus on immediate safety, while a high \( \gamma \) encourages long-term planning to reach the treasure, even if it means taking a longer path.

\( \max_{a'} Q(s_{t+1}, a') \) is the maximum estimated Q-value for the next state \( s_{t+1} \) across all possible actions \( a' \). This is like our elf looking ahead and considering the best possible outcome from its next position on the lake.
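Put together in code, a single learning step is just a few lines. Here is a minimal sketch, assuming the Q-values live in a NumPy array indexed by state and action (the function and variable names are illustrative, not taken from the article’s notebook):

```python
import numpy as np

# One Q-learning update: nudge Q(s, a) toward the Bellman target.
def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    best_next = np.max(Q[next_state])        # max_a' Q(s_{t+1}, a')
    td_target = reward + gamma * best_next   # r_{t+1} + discounted best next value
    td_error = td_target - Q[state, action]  # how far off the current estimate is
    Q[state, action] += alpha * td_error     # move a fraction alpha toward the target
    return Q
```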

In essence, Q-learning in Frozen Lake is about:

  • States: Each square on the grid represents a unique state. So, in a 4×4 Frozen Lake, you have 16 states.
  • Actions: The elf can choose from four actions: move up, down, left, or right.
  • Rewards: In the default Frozen Lake environment, the elf gets +1 for reaching the treasure and 0 for everything else, including falling into a hole. Tutorials sometimes reshape the rewards, penalizing holes (e.g., -1) or adding a small bonus for each safe step, to speed up learning.
  • Q-Matrix: This is where the learning magic happens. Q-learning uses a table to store a “Q-value” for each state-action pair. This value represents the expected reward for taking a particular action from a particular state.
  • Learning: The elf starts by exploring randomly. As it interacts with the environment (moving, hitting holes, finding treasure), it updates the Q-values in the table. Over time, the Q-table becomes a roadmap, guiding the elf towards actions with higher expected rewards.

As the Q-matrix evolves through this learning process, it gradually shapes the elf’s decision-making strategy. This strategy, which determines how the elf chooses actions based on its current state, is what we call a “policy” in reinforcement learning.
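To make the idea of a policy concrete, here is a small sketch, again assuming the Q-table is a NumPy array with one row per state and one column per action. The greedy policy simply picks, for each state, the action with the highest Q-value:

```python
import numpy as np

n_states, n_actions = 16, 4          # 4x4 Frozen Lake: 16 states, 4 moves
Q = np.zeros((n_states, n_actions))  # every Q-value starts at zero

# Once training has filled in the table, the policy is read straight off it:
# for each state, take the action with the highest Q-value
# (Frozen Lake encodes actions as 0=Left, 1=Down, 2=Right, 3=Up).
greedy_policy = np.argmax(Q, axis=1)  # one action index per state
```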

Below, check out what the Q matrix might look like after the elf explores the board for some time. Initially, all of the cells will be zeroed out.

State   Left   Down   Right   Up
0       0.29   0.09   0.24    0.09
1       0.04   0.03   0.01    0.24
2       0.03   0.06   0.16    0.03
3       0.02   0.04   0.02    0.07
4       0.29   0.05   0.09    0.03
5       0.00   0.00   0.00    0.00
6       0.00   0.00   0.23    0.00
7       0.00   0.00   0.00    0.00
8       0.15   0.10   0.04    0.37
9       0.30   0.42   0.10    0.12
10      0.39   0.06   0.04    0.08
11      0.00   0.00   0.00    0.00
12      0.00   0.00   0.00    0.00
13      0.09   0.32   0.48    0.03
14      0.27   0.75   0.28    0.50
15      0.32   0.40   0.50    1.00

Over many episodes of trial-and-error, the elf becomes adept at navigating Frozen Lake, consistently finding the treasure while avoiding a chilly demise.

Now, let’s break down this implementation, highlighting the main components that help our little elf evolve from a random explorer to a master of icy navigation.

Speaking of exploration, a quick diversion on epsilon (ε), which in RL is a parameter used in exploration strategies. It shows up in the epsilon-greedy algorithm where it represents the probability of taking a random exploratory action instead of exploiting the current knowledge. In other words, epsilon is our elf’s curiosity meter. A high epsilon means our elf is feeling adventurous, willing to try new paths and potentially discover hidden shortcuts. A low epsilon indicates a more cautious elf, sticking to the paths it knows best.
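Here is a minimal epsilon-greedy sketch (hypothetical function name; a NumPy random generator stands in for whatever randomness the notebook uses):

```python
import numpy as np

rng = np.random.default_rng()

def choose_action(Q, state, epsilon):
    # Explore: with probability epsilon, pick a random action
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    # Exploit: otherwise pick the best-known action for this state
    return int(np.argmax(Q[state]))
```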

Breaking down the code

Here’s a brief overview of the Python that follows, which is executable in the Colab link where you can see the elf in action.

  • Setting the Stage: We start by importing the necessary libraries and creating our Frozen Lake playground using Gymnasium (the successor to OpenAI Gym). This gives us a 4×4 grid where our elf can practice its ice-skating skills.
  • Q-Table Initialization: Just like a blank map, we create a Q-table filled with zeros. This table will serve as our elf’s treasure map, eventually resembling a well-defined guide for the elf’s journey.
  • The Learning Loop: Our elf doesn’t become an expert overnight; it needs to practice for thousands of episodes. In each episode, the elf starts at the beginning (state reset). Next, it decides whether to explore randomly (with probability epsilon) or exploit its current knowledge. Then, the elf takes an action, observes the result, and updates its Q-table accordingly.

Here’s the Python in Google Colab, where you can see the elf in action; the same code appears below via GitHub Gist. In this example, the agent interacts with the environment 10,000 times. Turns out, you don’t need pointy shoes to master icy terrain. Just some trial and error and a well-tuned Q-table!
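For reference, here is a minimal, self-contained sketch of that kind of training loop, using Gymnasium’s FrozenLake-v1 environment and the update rule above. The hyperparameter values are illustrative rather than the exact settings from the notebook:

```python
import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True)

n_states = env.observation_space.n   # 16 states on the 4x4 grid
n_actions = env.action_space.n       # 4 actions: Left, Down, Right, Up
Q = np.zeros((n_states, n_actions))  # start with a blank Q-table

alpha, gamma = 0.1, 0.99             # learning rate and discount factor
epsilon, eps_min, eps_decay = 1.0, 0.01, 0.0005
rng = np.random.default_rng(0)

for episode in range(10_000):
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy: explore randomly or exploit current knowledge
        if rng.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Q-learning update (the Bellman-based rule from above)
        Q[state, action] += alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )
        state = next_state

    # Gradually shift from exploration toward exploitation
    epsilon = max(eps_min, epsilon - eps_decay)

print(np.round(Q, 2))  # the learned Q-table, one row per state
```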
