What is a policy in reinforcement learning?
In plain words, in the simplest case, a policy π is a function that takes as input a state s and returns an action a. That is: π(s) → a
In this way, the policy is typically used by the agent to decide what action a should be performed when it is in a given state s.
Sometimes, the policy can be stochastic instead of deterministic. In such a case, instead of returning a unique action a, the policy returns a probability distribution over a set of actions.
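To make the difference concrete, here is a minimal sketch of both kinds of policy. The state names, action names, and probabilities are hypothetical, chosen only for illustration; a real agent would learn them rather than hard-code them.

```python
import random

# Hypothetical states and actions, chosen only for illustration.
ACTIONS = ["left", "right"]

# Deterministic policy: each state maps to exactly one action, pi(s) -> a.
deterministic_policy = {"s0": "right", "s1": "right", "s2": "left"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "s0": {"left": 0.1, "right": 0.9},
    "s1": {"left": 0.5, "right": 0.5},
    "s2": {"left": 0.8, "right": 0.2},
}


def act_deterministic(state):
    """Return the single action the policy prescribes for this state."""
    return deterministic_policy[state]


def act_stochastic(state, rng=random):
    """Sample one action according to the policy's distribution for this state."""
    dist = stochastic_policy[state]
    actions = list(dist)
    weights = [dist[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

Calling act_deterministic("s0") always gives the same action, while repeated calls to act_stochastic("s0") will mostly, but not always, return "right".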
In general, the goal of any RL algorithm is to learn an optimal policy that achieves a specific goal.
The definition is correct, though not instantly obvious if you see it for the first time. Let me put it this way: a policy is an agent's strategy.
For example, imagine a world where a robot moves across the room and the task is to get to the target point (x, y), where it gets a reward. Here:
- The room is the environment
- The robot's current position is the state
A policy is what an agent does to accomplish this task:
- dumb robots just wander around randomly until they accidentally end up in the right place (policy #1)
- others may, for some reason, learn to go along the walls most of the route (policy #2)
- smart robots plan the route in their "head" and go straight to the goal (policy #3)
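A rough sketch of how two of these policies could look in code, assuming a hypothetical 5x5 grid room with the robot starting at (0, 0) and the target at (4, 4). The grid size, move names, and helper functions are my own assumptions for illustration, and the wall-following policy #2 is left out to keep it short.

```python
import random

# Hypothetical 5x5 room; the robot starts at (0, 0) and the target is at (4, 4).
SIZE = 5
GOAL = (4, 4)
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def step(state, action):
    """Apply a move, keeping the robot inside the room (walls block movement)."""
    dx, dy = MOVES[action]
    x = min(max(state[0] + dx, 0), SIZE - 1)
    y = min(max(state[1] + dy, 0), SIZE - 1)
    return (x, y)


def random_policy(state, rng):
    """Policy #1: ignore the state and pick any move at random."""
    return rng.choice(list(MOVES))


def direct_policy(state, rng):
    """Policy #3 (simplified): always move toward the goal."""
    if state[0] < GOAL[0]:
        return "right"
    if state[1] < GOAL[1]:
        return "up"
    if state[0] > GOAL[0]:
        return "left"
    return "down"


def steps_to_goal(policy, start=(0, 0), max_steps=10_000, seed=0):
    """Run one episode and count how many steps the policy needs."""
    rng = random.Random(seed)
    state, steps = start, 0
    while state != GOAL and steps < max_steps:
        state = step(state, policy(state, rng))
        steps += 1
    return steps
```

In this setup steps_to_goal(direct_policy) needs exactly 8 steps (4 right, 4 up), whereas steps_to_goal(random_policy) can never do better than that and usually does much worse.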
Obviously, some policies are better than others, and there are multiple ways to assess them, namely the state-value function and the action-value function. The goal of RL is to learn the best policy. Now the definition should make more sense (note that in this context, "time" is better understood as "state"):
A policy defines the learning agent's way of behaving at a given time.
Formally
More formally, we should first define a Markov Decision Process (MDP) as a tuple (S, A, P, R, γ), where:
- S is a finite set of states
- A is a finite set of actions
- P is a state transition probability matrix (the probability of ending up in each state, for each current state and each action)
- R is a reward function, given a state and an action
- γ is a discount factor, between 0 and 1
Then, a policy π is a probability distribution over actions given states. That is, it gives the likelihood of every action when an agent is in a particular state (of course, I'm skipping a lot of details here). This definition corresponds to the second part of your definition.
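As a sketch of how these pieces fit together, here is a tiny MDP written out as nested lists, a stochastic policy over it, and iterative policy evaluation, which is one standard way to compute the state-value function of a fixed policy. The two-state MDP and all its numbers are made up for illustration, and the function name evaluate_policy is my own.

```python
# Hypothetical two-state, two-action MDP; every number below is made up.
STATES = [0, 1]
ACTIONS = [0, 1]
GAMMA = 0.9

# P[s][a][s2]: probability of landing in s2 after taking action a in state s.
P = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
]

# R[s][a]: expected immediate reward for taking action a in state s.
R = [
    [0.0, 1.0],
    [0.0, 2.0],
]

# pi[s][a]: probability that the policy picks action a in state s.
pi = [
    [0.5, 0.5],
    [0.1, 0.9],
]


def evaluate_policy(pi, P, R, gamma, tol=1e-10, max_iters=100_000):
    """Iterative policy evaluation: repeatedly apply the Bellman expectation backup."""
    v = [0.0 for _ in STATES]
    for _ in range(max_iters):
        new_v = [
            sum(
                pi[s][a] * (R[s][a] + gamma * sum(P[s][a][s2] * v[s2] for s2 in STATES))
                for a in ACTIONS
            )
            for s in STATES
        ]
        # Stop once the values no longer change noticeably between sweeps.
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    return v
```

For instance, with these made-up numbers a policy that always picks action 1 in state 1 stays there forever collecting reward 2 per step, so its value in that state is 2 / (1 - 0.9) = 20. Comparing the values returned for different policies is one way to say that one policy is better than another.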
I highly recommend David Silver's RL course available on YouTube. The first two lectures focus particularly on MDPs and policies.