2️⃣SARSA

SARSA is a popular on-policy reinforcement learning algorithm. Its name stands for State-Action-Reward-State-Action, the tuple (s, a, r, s', a') used in each update. The algorithm estimates the action-value function of the current policy: the expected return from taking a given action in a given state and following the current policy afterwards.

SARSA is similar to Q-Learning, another popular reinforcement learning algorithm, but there is a key difference. Q-Learning is an off-policy algorithm that estimates the action-value function of the optimal policy, whereas SARSA is an on-policy algorithm that estimates the action-value function of the policy the agent is currently following.

The main idea behind SARSA is to update the Q-table using the action the agent actually takes in the next state, rather than the action with the highest Q-value in that state. The learned values therefore reflect the current policy, including its exploratory moves.
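The difference between the two update rules can be sketched in a few lines. This is a minimal illustration, assuming a NumPy array `Q` indexed as `Q[state, action]`; the function names, the learning rate `alpha`, and the discount factor `gamma` are illustrative choices that do not come from the text above.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: bootstrap from the action actually chosen in s_next."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap from the best action in s_next, taken or not."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

if __name__ == "__main__":
    # Two states, two actions; in state 1 the second action looks much better.
    Q_sarsa = np.array([[0.0, 0.0], [1.0, 5.0]])
    Q_qlearn = Q_sarsa.copy()
    # The agent is about to take the weaker action (index 0) in state 1.
    sarsa_update(Q_sarsa, s=0, a=0, r=0.0, s_next=1, a_next=0, alpha=0.5, gamma=1.0)
    q_learning_update(Q_qlearn, s=0, a=0, r=0.0, s_next=1, alpha=0.5, gamma=1.0)
    print(Q_sarsa[0, 0], Q_qlearn[0, 0])
```

The only line that differs is the target: SARSA needs `a_next`, the action the policy really selected, while Q-Learning ignores it and takes the maximum.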

Analogy:

An analogy for the SARSA algorithm would be a person trying to navigate a new city to reach a specific destination.

The person is the agent and the city is the environment.
The different streets and intersections in the city are the states.
The different directions the person can take (e.g. turn left, turn right, go straight) are the actions.
The reward is tied to the distance to the destination: moves that bring the person closer are rewarded, and extra steps are penalized.

The person starts at an initial location in the city and must make a series of decisions (take actions) in order to reach the destination. At each intersection, the person decides which direction to go (takes an action) based on the current policy. The person navigates with a map (the Q-table), which they update as they explore more of the city.

As in SARSA, the person has a current navigation policy and may change it as they learn more about the city. The person also uses an epsilon-greedy rule: most of the time they take the direction their map rates highest, and occasionally they try a different direction to explore.
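The epsilon-greedy choice described here can be sketched as follows. This is only an illustration: the function name `epsilon_greedy`, the `Q[state, action]` array layout, and the use of a NumPy random generator are assumptions, not part of the original text.

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the best-rated one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: any direction
    return int(np.argmax(Q[state]))            # exploit: follow the map

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # One intersection (state 0) with three directions; the third is rated highest.
    Q = np.array([[0.2, 0.1, 0.9]])
    picks = [epsilon_greedy(Q, 0, epsilon=0.1, rng=rng) for _ in range(1000)]
    print("share of greedy picks:", picks.count(2) / len(picks))
```

A larger `epsilon` means more exploration of unfamiliar streets; a smaller one means the person sticks more closely to what the map already says.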

Because SARSA is on-policy, the person evaluates and adapts the policy they are actually following, exploratory detours included. Q-Learning, by contrast, is off-policy: the person estimates the value of the optimal route even while wandering away from it.

Example

Python code
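What follows is a minimal sketch of tabular SARSA, not a reference implementation. It assumes a hypothetical one-dimensional "street" of six intersections with the destination at the far end, a reward of -1 per step, and illustrative hyperparameters (`alpha`, `gamma`, `epsilon`, `episodes`, `max_steps`); none of these names or values come from the text above.

```python
import numpy as np

N_STATES = 6            # hypothetical corridor: states 0..5, goal at state 5
LEFT, RIGHT = 0, 1
GOAL = N_STATES - 1

def step(state, action):
    """Move one cell; every step costs -1 until the goal is reached."""
    next_state = max(state - 1, 0) if action == LEFT else state + 1
    done = next_state == GOAL
    return next_state, -1.0, done

def choose_action(Q, state, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(2))
    return int(np.argmax(Q[state]))

def train_sarsa(episodes=2000, alpha=0.1, gamma=0.99, epsilon=0.1,
                max_steps=200, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, 2))
    for _ in range(episodes):
        state = 0
        action = choose_action(Q, state, epsilon, rng)
        for _ in range(max_steps):
            next_state, reward, done = step(state, action)
            next_action = choose_action(Q, next_state, epsilon, rng)
            # SARSA target: uses the action actually selected for the next state.
            # No bootstrapping from a terminal state.
            target = reward if done else reward + gamma * Q[next_state, next_action]
            Q[state, action] += alpha * (target - Q[state, action])
            if done:
                break
            state, action = next_state, next_action
    return Q

if __name__ == "__main__":
    Q = train_sarsa()
    print(np.round(Q, 2))
    greedy = np.argmax(Q[:GOAL], axis=1)
    print("greedy policy:", ["left" if a == LEFT else "right" for a in greedy])
```

Each loop iteration consumes exactly one (state, action, reward, next state, next action) tuple, which is where the name SARSA comes from. Note that the next action is chosen before the update and then reused as the action for the following step.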

