What is the difference between Q-learning and Value Iteration?

You are 100% right that if we knew the transition probabilities and reward for every transition in Q-learning, it would be pretty unclear why we would use it instead of model-based learning or how it would even be fundamentally different. After all, transition probabilities and rewards are the two components of the model used in value iteration - if you have them, you have a model.

The key is that, in Q-learning, the agent does not know state transition probabilities or rewards. The agent only discovers that there is a reward for going from one state to another via a given action when it does so and receives a reward. Similarly, it only figures out what transitions are available from a given state by ending up in that state and looking at its options. If state transitions are stochastic, it learns the probability of transitioning between states by observing how frequently different transitions occur.

A possible source of confusion here is that you, as the programmer, might know exactly how rewards and state transitions are set up. In fact, when you're first designing a system, odds are that you do as this is pretty important to debugging and verifying that your approach works. But you never tell the agent any of this - instead you force it to learn on its own through trial and error. This is important if you want to create an agent that is capable of entering a new situation that you don't have any prior knowledge about and figuring out what to do. Alternately, if you don't care about the agent's ability to learn on its own, Q-learning might also be necessary if the state-space is too large to repeatedly enumerate. Having the agent explore without any starting knowledge can be more computationally tractable.


Value iteration is used when you have transition probabilities, that means when you know the probability of getting from state x into state x' with action a. In contrast, you might have a black box which allows you to simulate it, but you're not actually given the probability. So you are model-free. This is when you apply Q learning.

Also what is learned is different. With value iteration, you learn the expected cost when you are given a state x. With q-learning, you get the expected discounted cost when you are in state x and apply action a.

Here are the algorithms:

I'm currently writing down quite a bit about reinforcement learning for an exam. You might also be interested in my lecture notes. However, they are mostly in German.