Reinforcement Learning: Q-Learning Tutorial!
Learn Q Learning through a practical example
Introduction
Are you interested in learning reinforcement learning through practical examples? Then this article is for you! In this tutorial, I will explain how to create a very simple reinforcement learning algorithm: together we will train an agent to play "FrozenLake-v1". But before we dive into the code, we will go over some basic definitions and the principle of Q-learning.
Requirements:
Basic understanding of RL: agents, environments, actions, rewards.
Basic understanding of MDPs (Markov Decision Processes).
Python 3.x installed with the Gym library and NumPy.
What is reinforcement learning?
RL (reinforcement learning) is a sub-field of machine learning. Its main objective is to allow an agent (a robot, an algorithm, etc.), placed in an environment whose state changes in response to its actions, to choose the actions that maximize its rewards. The agent performs trials and improves its policy according to the rewards provided by the environment.
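To make this concrete, here is a minimal sketch of that agent-environment loop using Gym's classic API (the same API we use in the rest of this tutorial); the agent simply picks random actions and collects whatever rewards the environment returns:

import gym

env = gym.make("FrozenLake-v1")
state = env.reset()                     # the agent observes the initial state
done = False
while not done:
    action = env.action_space.sample()  # the agent chooses an action (here: at random)
    state, reward, done, info = env.step(action)  # the environment returns a new state and a reward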
What is Q-learning?
Before defining Q-learning, let's take a quick look at the definition of the Q-function: Q is the expected value of the return, where the return is defined by the following formula:
$$R(\tau) = \sum_{t=0}^{T} \gamma^{t} \times r_{t}$$
This means that the return is the discounted sum of the rewards collected along a trajectory. So if the agent runs several experiments starting from state S0 and taking an action A0, the Q-function is the average of all the returns the agent will get. In Q-learning, the agent tries to learn a policy that maximizes the Q-function.
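As a quick illustration of this formula, here is a small sketch (the reward values are arbitrary, chosen just for the example) that computes the return of a single trajectory with a discount factor of 0.99:

gamma = 0.99
rewards = [0, 0, 1]  # rewards collected along one hypothetical trajectory
trajectory_return = sum(gamma**t * r for t, r in enumerate(rewards))
print(trajectory_return)  # 0.9801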
The Bellman Equation
The main objective of the agent in Q-learning is to maximize the Q-function. The Bellman equation characterizes the optimal Q-function, and solving it yields the optimal policy:
$$Q^{*}(s,a) = r(s,a) + \gamma \times \max_{a'} Q^{*}(s',a')$$
In practice, we define a matrix containing all the Q-values, then at every step we update the Q-value of the current state-action pair by a quantity equal to the difference between a target (the immediate reward plus the discounted maximum Q-value of the next state) and the current Q-value, all multiplied by the learning rate, as shown in the next equation:
$$Q(s,a) = Q(s,a) + lr \times \left( r(s,a) + \gamma \times \max_{a'} Q(s',a') - Q(s,a) \right)$$
where lr is the learning rate. This means that at every step the previous Q-value is updated by the quantity:
$$lr \times \left( r(s,a) + \gamma \times \max_{a'} Q(s',a') - Q(s,a) \right)$$
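To make the update rule concrete, here is a small sketch of it as a helper function (update_q is a name chosen here for illustration, not something provided by Gym; it assumes NumPy is imported as np, which we do in the next step):

def update_q(Q, state, action, reward, next_state, lr=0.1, gamma=0.99):
    # Target: immediate reward plus the discounted best Q-value of the next state
    target = reward + gamma * np.max(Q[next_state, :])
    # Move the current estimate a fraction lr of the way towards the target
    Q[state, action] = Q[state, action] + lr * (target - Q[state, action])
    return Q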
Practical Example:
1 - Import dependencies
The first thing we will do is import the dependencies we will need throughout this tutorial:
import numpy as np
import gym
import random
import time
Gym is an open-source Python library for developing and comparing reinforcement learning algorithms. We can use it to load environments such as "FrozenLake-v1".
2 - Create the environment
Now let's build the environment. We will use the game called "FrozenLake-v1": the grid contains 4 rows and 4 columns, the agent starts at position S (the start) and tries to reach position G (the goal), and it must avoid stepping into holes, otherwise the episode ends.
env = gym.make("FrozenLake-v1")
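If you want to see the 4x4 grid (S = start, F = frozen surface, H = hole, G = goal), you can reset the environment and render it; with the classic text-based renderer this prints the default map to the console (depending on your Gym version, you may need to pass a render_mode argument to gym.make):

env.reset()
env.render()
# SFFF
# FHFH
# FFFH
# HFFG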
3 - Construct the Q matrix
Now, let's construct the Q matrix:
state_space_size = env.observation_space.n
action_space_size = env.action_space.n
Q_values = np.zeros((state_space_size, action_space_size))
The matrix Q_values contains the Q-values, where the rows are the states and the columns are the actions. env.observation_space.n is the number of states (16 in our case) and env.action_space.n is the number of actions (4 in our case).
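You can quickly check that the table has the expected shape:

print(Q_values.shape)  # (16, 4) -> 16 states x 4 actions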
4 - Define some parameters
Now, let's define some parameters:
episodes_number = 10000
max_steps_per_episode = 100
learning_rate = 0.1
discount_rate = 0.99
episodes_number represents the number of episodes (i.e. how many times the game will be played), and max_steps_per_episode represents the maximum number of steps per episode. We cap the number of steps so that a single episode cannot run forever.
5 - Exploration-exploitation trade-off
We will now define what is called the exploration-exploitation trade-off. At first, we do not want our model to exploit the environment, i.e. to only choose the actions with the highest Q-value; we want it to explore new states. Later on, we want the agent to exploit the environment rather than explore it.
exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.001
rewards_all_episodes = []
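To get a feel for how quickly the agent shifts from exploring to exploiting, here is a small sketch that evaluates, at a few example episodes, the decay formula we will use during training:

for ep in [0, 1000, 5000, 10000]:
    eps = min_exploration_rate + (max_exploration_rate - min_exploration_rate) \
          * np.exp(-exploration_decay_rate * ep)
    print(ep, round(eps, 3))
# 0 1.0
# 1000 0.374
# 5000 0.017
# 10000 0.01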
6 - Training the agent
Now, let's train our agent in the environment.
# Set up the agent and environment
rewards_all_episodes = []
exploration_rate = max_exploration_rate

# Train the agent over a specified number of episodes
for episode in range(episodes_number):
    # Reset episode-specific variables
    state = env.reset()
    done = False
    current_episode_rewards = 0.0

    # Interact with the environment for at most max_steps_per_episode steps
    for step in range(max_steps_per_episode):
        # Choose an action using the exploration-exploitation trade-off
        exploration_rate_thresh = random.uniform(0, 1)
        if exploration_rate > exploration_rate_thresh:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q_values[state, :])

        # Take the chosen action and update the Q-value of (state, action)
        new_state, reward, done, _ = env.step(action)
        Q_values[state, action] = (1 - learning_rate) * Q_values[state, action] \
            + learning_rate * (reward + discount_rate * np.max(Q_values[new_state, :]))

        state = new_state
        current_episode_rewards += reward

        # Exit the episode loop if the environment signals that the episode is over
        if done:
            break

    # Decay the exploration rate and track the episode's rewards
    exploration_rate = min_exploration_rate + (max_exploration_rate - min_exploration_rate) \
        * np.exp(-exploration_decay_rate * episode)
    rewards_all_episodes.append(current_episode_rewards)
7 - Visualize the results
reward_per_thousand_episodes = np.split(np.array(rewards_all_episodes), episodes_number // 1000)
count = 1000

print("Average reward per thousand episodes")
for r in reward_per_thousand_episodes:
    print(count, ':', str(sum(r / 1000)))
    count += 1000

# Print the updated Q-table
print("Q-table")
print(Q_values)
8 - Watch it play!
Now, let's run the following code to watch our agent play the "FrozenLake-v1" game!
# Play FrozenLake with the learned Q-table
for episode in range(4):
    state = env.reset()
    done = False
    time.sleep(1)

    for _ in range(max_steps_per_episode):
        env.render()
        time.sleep(0.3)

        # Always pick the action with the highest Q-value (pure exploitation)
        action = np.argmax(Q_values[state, :])
        new_state, rew, done, info = env.step(action)

        if done:
            env.render()
            if rew == 1:
                print("You won!")
            else:
                print("You fell into a hole")
            time.sleep(3)
            break

        state = new_state
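When you are done playing, it is good practice to release the environment's resources:

env.close()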
Bonus!
Subscribe to my newsletter now and receive free courses, valuable tips, and cheat sheets in PDF format! Stay up-to-date with the latest news and trends in your field and gain access to exclusive content only available to subscribers. Join our community today and start learning for free!
Thank you for reading, and let's connect! 🤝
Feel free to subscribe to my email newsletter and connect with me on Twitter. If you liked this article, follow me so you don't miss the upcoming ones!
See you soon :)