1️⃣ Q-Learning
Q-Learning is a model-free, off-policy reinforcement learning algorithm used to find the optimal action-value function, Q(s, a), for an agent in a given state, s, taking a specific action, a. The optimal action-value function represents the maximum expected long-term reward an agent can receive by taking a specific action in a given state and then acting optimally thereafter.
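Formally (a standard definition, stated here for reference rather than taken verbatim from this text), the optimal action-value function is the expected discounted sum of rewards:

Q*(s, a) = E[ r₀ + γ·r₁ + γ²·r₂ + … | s₀ = s, a₀ = a, optimal actions thereafter ]

where γ is the discount factor introduced below.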
The Q-function is updated using the Bellman equation, which expresses the Q-value of a state and an action in terms of the Q-value of the next state and the reward received for taking that action. The Bellman equation is defined as:

Q(s, a) = r + γ·max(Q(s', a'))
where r is the immediate reward received after taking action a in state s, γ is the discount factor, s' is the next state, a' ranges over the actions available in s', and max(Q(s', a')) is the maximum expected long-term reward the agent can receive in the next state s' by taking any action a'.
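As a minimal sketch of how this target is computed in code (assuming Q is a NumPy array indexed as Q[state, action], with hypothetical placeholder values; the full example is below):

import numpy as np

Q = np.zeros((20, 4))          # hypothetical Q-table: 20 states, 4 actions
gamma = 0.9                    # discount factor
reward, next_state = 0.0, 5    # hypothetical transition outcome
# Bellman target: immediate reward plus the discounted best Q-value of the next state
target = reward + gamma * np.max(Q[next_state])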
The agent uses the Q-function to choose the action that maximizes the expected long-term reward in each state. This is known as the greedy policy.
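Concretely, the greedy action for a state can be read off the Q-table with an argmax (a sketch reusing the hypothetical Q-table above):

import numpy as np

Q = np.zeros((20, 4))   # hypothetical Q-table
state = 3               # hypothetical current state
# Greedy policy: pick the action with the highest Q-value in this state
greedy_action = int(np.argmax(Q[state]))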
The Q-Learning algorithm starts with an arbitrary Q-function and updates it using the Bellman equation as the agent interacts with the environment. The update rule is defined as:

Q(s, a) ← Q(s, a) + α·(r + γ·max(Q(s', a')) − Q(s, a))
where α is the learning rate, which determines how strongly the agent weights the new information against its current estimate.
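This update is a single line in code; a sketch with the same hypothetical Q-table and placeholder transition as above:

import numpy as np

Q = np.zeros((20, 4))                              # hypothetical Q-table
alpha, gamma = 0.1, 0.9                            # learning rate and discount factor
state, action, reward, next_state = 3, 1, 0.0, 5   # hypothetical transition
# Move Q(s, a) a fraction alpha of the way toward the Bellman target
Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])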
Example
In this example, the agent uses a Q-table to represent the Q-function and updates it using the Bellman equation. The agent selects actions using an exploration-exploitation strategy: with probability epsilon it chooses a random action, and with probability 1-epsilon it chooses the action that maximizes the Q-value. The select_action and take_action functions would need to be implemented for the specific problem you are trying to solve; a sketch of select_action follows, and the grid-world listing below defines take_action.
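A select_action implementing the epsilon-greedy strategy just described might look like this (the function is named in the text above, but this particular signature is an assumption):

import numpy as np

def select_action(Q, state, epsilon, num_actions):
    # With probability epsilon, explore: pick a uniformly random action
    if np.random.rand() < epsilon:
        return np.random.randint(num_actions)
    # Otherwise exploit: pick the action with the highest Q-value
    return int(np.argmax(Q[state]))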
It's important to note that Q-Learning is a model-free algorithm: it doesn't require a model of the environment. Q-Learning is also an off-policy algorithm: it doesn't need to follow the current policy during the learning process. This allows Q-Learning to learn from any experience it gathers, even when it is not following the optimal policy.
Python code
import numpy as np

# Define the number of rows and columns in the grid-world
n = 5
m = 4

# Define the number of states and actions
num_states = n * m
num_actions = 4

# Define the Q-table with dimensions [states, actions]
Q = np.zeros((num_states, num_actions))

# Define the learning rate, discount factor and exploration rate
alpha = 0.1
gamma = 0.9
epsilon = 0.1

# Define the number of episodes
num_episodes = 1000

# Define the initial and goal states
initial_state = 0
goal_state = (n - 1) * m + m - 1

# Define the take_action function
def take_action(state, action):
    if action == 0:    # move up
        next_state = state - m if state - m >= 0 else state
    elif action == 1:  # move down
        next_state = state + m if state + m < num_states else state
    elif action == 2:  # move left
        next_state = state - 1 if state % m != 0 else state
    elif action == 3:  # move right
        next_state = state + 1 if (state + 1) % m != 0 else state
    else:
        next_state = state
    # Reward of 1 for reaching the goal, 0 otherwise
    reward = 1 if next_state == goal_state else 0
    return next_state, reward

# Q-Learning loop
for episode in range(num_episodes):
    # Initialize the state
    state = initial_state
    reached_goal = False
    while not reached_goal:
        # Select an action using an epsilon-greedy policy
        if np.random.rand() < epsilon:
            action = np.random.randint(num_actions)
        else:
            action = np.argmax(Q[state])
        # Take the action and observe the next state and reward
        next_state, reward = take_action(state, action)
        if next_state == goal_state:
            reached_goal = True
        # Update the Q-table
        Q[state, action] = Q[state, action] + alpha * (reward + gamma * np.max(Q[next_state, :]) - Q[state, action])
        # Update the state
        state = next_state

# Print the final state (the goal state reached in the last episode)
print(state)
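After training, the learned greedy policy can be read directly out of the Q-table. A small sketch (an addition to the original listing, assuming the script above has just run):

# Extract the greedy policy: for each state, the action with the highest Q-value
policy = np.argmax(Q, axis=1)
print(policy.reshape(n, m))   # one greedy action per grid cell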