1️⃣Q-Learning
Q-Learning is a model-free, off-policy reinforcement learning algorithm used to find the optimal action-value function, Q(s,a), for an agent taking a specific action, a, in a given state, s. The optimal action-value function represents the maximum expected long-term reward an agent can receive by taking action a in state s and acting optimally thereafter.
The Q-function is updated using the Bellman equation, which expresses the Q-value of a state and an action in terms of the Q-values of the next state and the reward received for taking that action. The Bellman equation is defined as:

$$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$$
where r is the immediate reward received after taking action a in state s, γ is the discount factor, a' is the action taken in the next state s', and max(Q(s',a')) is the maximum expected long-term reward the agent can receive in the next state s' by taking any action a'.
The agent uses the Q-function to choose the action that maximizes the expected long-term reward in each state. This is known as the greedy action policy.
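As a small illustration of the greedy action policy, choosing the greedy action is just an argmax over the Q-values of the current state (the Q-table below is a made-up example, not learned values):

```python
import numpy as np

# Hypothetical Q-table: 3 states x 2 actions (values are illustrative only)
Q = np.array([
    [0.1, 0.5],   # state 0
    [0.8, 0.2],   # state 1
    [0.0, 0.0],   # state 2
])

def greedy_action(Q, state):
    """Return the action with the highest Q-value in the given state."""
    return int(np.argmax(Q[state]))

print(greedy_action(Q, 0))  # 1 (0.5 > 0.1)
print(greedy_action(Q, 1))  # 0 (0.8 > 0.2)
```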
The Q-Learning algorithm starts with an arbitrary Q-function and updates it using the Bellman equation as the agent interacts with the environment. The update rule is defined as:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
where α is the learning rate, which determines how much the agent weights the new information against its current estimate.
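A single application of this update rule can be sketched as follows; the table shape, transition, and hyperparameter values are arbitrary placeholders chosen for illustration:

```python
import numpy as np

alpha = 0.1   # learning rate (illustrative value)
gamma = 0.9   # discount factor (illustrative value)

# Hypothetical Q-table: 2 states x 2 actions, initialized to zero
Q = np.zeros((2, 2))

def q_update(Q, s, a, r, s_next, alpha, gamma):
    """One Bellman update: Q(s,a) += alpha * (r + gamma*max_a' Q(s',a') - Q(s,a))."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# One transition: from state 0, take action 1, receive reward 1.0, land in state 1
Q = q_update(Q, 0, 1, 1.0, 1, alpha, gamma)
print(Q[0, 1])  # 0.1  = alpha * (1.0 + 0.9*0 - 0)
```

Because the Q-table starts at zero, the update simply moves Q(0,1) a fraction α of the way toward the target r + γ·max Q(s',a') = 1.0.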
Example
In this example, the agent uses a Q-table to represent the Q-function and updates it using the Bellman equation. The agent selects actions with an epsilon-greedy exploration-exploitation strategy: with probability epsilon it chooses a random action, and with probability 1 − epsilon it chooses the action that maximizes the Q-value. The `select_action` and `take_action` functions would need to be implemented for the specific problem you are trying to solve.
It's important to note that Q-Learning is a model-free algorithm: it doesn't require a model of the environment. Q-Learning is also an off-policy algorithm: it doesn't need to follow the current policy during the learning process. This allows Q-Learning to learn from any experience it has, even experience gathered while not following the optimal policy.
Python code
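The code itself is not reproduced on this page, so the following is a minimal, self-contained sketch of the approach described above: tabular Q-learning with epsilon-greedy action selection. The toy environment (`ToyChain`), its rewards, and all hyperparameter values are illustrative assumptions, and `select_action`/`take_action` stand in for the problem-specific functions mentioned in the example:

```python
import random

random.seed(0)  # for reproducibility of this sketch

class ToyChain:
    """Hypothetical 5-state chain: move left/right, reward 1.0 at the far end."""
    n_states = 5
    n_actions = 2  # 0 = left, 1 = right

    def reset(self):
        self.state = 0
        return self.state

    def take_action(self, action):
        step = 1 if action == 1 else -1
        self.state = max(0, min(self.n_states - 1, self.state + step))
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done

def select_action(Q, state, epsilon, n_actions):
    """Epsilon-greedy: random action with prob epsilon, greedy otherwise."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[state][a])

def q_learning(env, episodes=300, alpha=0.1, gamma=0.9, epsilon=0.2):
    Q = [[0.0] * env.n_actions for _ in range(env.n_states)]
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = select_action(Q, state, epsilon, env.n_actions)
            next_state, reward, done = env.take_action(action)
            # Bellman update toward r + gamma * max_a' Q(s', a'),
            # with no bootstrapping from the terminal state
            target = reward + (0.0 if done else gamma * max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q

Q = q_learning(ToyChain())
# After training, the greedy policy should move right in every non-terminal state
print([max(range(2), key=lambda a: Q[s][a]) for s in range(4)])
```

Even though exploration causes the agent to take many suboptimal (leftward) actions, the learned Q-values still converge toward the optimal policy, because the max over next-state actions in the update target makes the learning off-policy.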