Section 3: Reinforcement Learning: Markov Decision Process, Q-Learning
Key Learning Points
- Explain the Markov decision process (MDP), one of the most powerful models for describing a single agent's sequential decision problem under uncertainty and a building block for multiagent learning.
- Explain the Bellman equation and show how to solve a known MDP, i.e., generate an optimal policy, using the VALUE-ITERATION algorithm (a sketch follows this list).
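VALUE-ITERATION repeatedly applies the Bellman update until the value estimates stop changing. Below is a minimal Python sketch of the algorithm; the tiny two-state MDP, its action names, and its rewards are invented purely for illustration and are not part of the exercises.

```python
# Minimal value-iteration sketch. The two-state MDP below is made up for
# illustration only.

GAMMA = 0.9  # discount factor

# P[s][a] is a list of (probability, next_state) pairs.
P = {
    "a": {"stay": [(1.0, "a")], "go": [(0.8, "b"), (0.2, "a")]},
    "b": {"stay": [(1.0, "b")], "go": [(0.8, "a"), (0.2, "b")]},
}
# R[s] is the immediate reward associated with state s.
R = {"a": 0.0, "b": 10.0}

def value_iteration(P, R, gamma=GAMMA, sweeps=100):
    V = {s: 0.0 for s in P}  # start from an initial value of zero in each state
    for _ in range(sweeps):
        # Bellman update: V(s) = max_a [ R(s) + gamma * sum_s' P(s'|s, a) * V(s') ]
        V = {
            s: max(
                R[s] + gamma * sum(p * V[s2] for p, s2 in outcomes)
                for outcomes in P[s].values()
            )
            for s in P
        }
    return V

print(value_iteration(P, R))
```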
Activities
- Read the following materials:
- Appendix C of the textbook.
- Tutorial on Q-learning: A painless tutorial on Q-Learning and its Java code.
- Section 7.4.1 of the textbook.
- Watch the following videos:
- Markov Decision Process (MDP) Tutorial
- Value Iteration
- Reinforcement Learning
- Do the following exercises:
- Marvin is a robot that inhabits a 3 × 3 grid and can only move north, south, east, and west. Marvin's wheels sometimes spin, so on each action there is a 0.2 probability that Marvin remains stationary. Marvin receives a reward of 1 whenever it visits the centre square. (A sketch of the transition model follows this exercise.)
- Draw an MDP that describes Marvin’s problem.
- What is the optimal policy?
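One possible way to write down Marvin's transition model is sketched below. The exercise does not say what happens at the grid boundary, so the assumption that an intended move off the grid simply leaves Marvin in place (on top of the 0.2 slip probability) is ours.

```python
# Sketch of Marvin's transition model on the 3 x 3 grid. Assumption (not in
# the exercise): an intended move off the grid leaves Marvin in place.

MOVES = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}
SLIP = 0.2  # wheels spin: Marvin remains stationary with probability 0.2

def transition(state, action):
    """Return {next_state: probability} for a state (row, col) on the 3 x 3 grid."""
    row, col = state
    d_row, d_col = MOVES[action]
    target = (row + d_row, col + d_col)
    if not (0 <= target[0] < 3 and 0 <= target[1] < 3):
        target = state  # assumed: moving into the boundary keeps Marvin in place
    probs = {state: SLIP}
    probs[target] = probs.get(target, 0.0) + (1 - SLIP)
    return probs

def reward(next_state):
    """Reward of 1 whenever Marvin visits the centre square."""
    return 1.0 if next_state == (1, 1) else 0.0

print(transition((0, 0), "east"))  # {(0, 0): 0.2, (0, 1): 0.8}
```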
- Consider the following grid world game:
This is a stationary MDP with an infinite horizon. The agent can only be in one of the six locations. It receives the reward or punishment associated with a cell when it leaves that cell: a reward of 10 for leaving the bottom-middle square and a punishment of 100 for leaving the top-left square. In each iteration of the game, the agent must choose a direction to move: up, down, left, or right. With probability 0.8 it moves in the chosen direction, and with probability 0.1 each it moves in one of the two perpendicular directions. For example, if the agent chooses to move up, it moves up with probability 0.8, left with probability 0.1, and right with probability 0.1. If the agent bumps into a wall, it stays in its current location and receives no reward or punishment.
- Perform one step of value iteration and show the resulting value function for each state. Use an initial value of zero in each state and a discount factor of 0.9.
- What is the value in the top-left state after performing another step of value iteration?
- What is the optimal policy?
- What would the optimal policy be if the punishment in the top-left square were changed to each of the following values: 5, 0.0001, 0?
- Suppose we have the following values for the states:
If we select the top-left state and the action “right,” what is the value of Q[top-left, right] using asynchronous value iteration? Use a discount factor of 0.9 and the reward/punishment described in the original problem.
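For the Q[top-left, right] question, one asynchronous value-iteration backup for a single state–action pair can be written as below. This is a sketch under our reading of the rules above: the reward or punishment for a cell is collected only on outcomes that actually leave the cell (a wall bump earns nothing). The cell names, values, and outcome probabilities in the usage example are placeholders, not the contents of the exercise's figure.

```python
GAMMA = 0.9  # discount factor

def q_backup(state, outcomes, leave_reward, V, gamma=GAMMA):
    """One asynchronous backup: Q[s, a] = sum_s' P(s'|s, a) * (r(s, s') + gamma * V[s']).

    outcomes: list of (probability, next_state) pairs for this action.
    leave_reward: reward/punishment collected when the agent actually leaves `state`;
                  outcomes that keep the agent in place (a wall bump) earn 0 instead.
    """
    return sum(
        p * ((leave_reward if s2 != state else 0.0) + gamma * V[s2])
        for p, s2 in outcomes
    )

# Hypothetical usage: placeholder cell names, values, and outcomes.
V = {"top_left": 0.0, "top_middle": 0.0, "below_top_left": 0.0}
outcomes = [(0.8, "top_middle"),      # intended move ("right") succeeds
            (0.1, "top_left"),        # perpendicular move bumps a wall: stay, no punishment
            (0.1, "below_top_left")]  # other perpendicular move
print(q_backup("top_left", outcomes, leave_reward=-100, V=V))
```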
- Calculate the value V(i) for each state i in the table shown below, given a discount factor of 0.9. Hint: draw a diagram first. (A sketch that encodes the table as an MDP follows it.)
| State | Reward (state) | Action consequence | Probability that consequence occurs | Reward (action) |
|---|---|---|---|---|
| 1 | 0 | Go to 2 | 0.8 | −2 |
| 1 | 0 | Go to 4 | 0.2 | +1 |
| 2 | 0 | Go to 1 | 0.2 | −1 |
| 2 | 0 | Go to 3 | 0.8 | −2 |
| 3 | 0 | Go to 4 | 0.3 | −2 |
| 3 | 0 | Go to 5 | 0.7 | 10 |
| 4 | 0 | Go to 2 | 0.5 | −2 |
| 4 | 0 | Go to 3 | 0.5 | +3 |
| 5 | 10 | — | — | — |
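A possible encoding of the table is sketched below, with value iteration run to convergence. Two assumptions the exercise leaves open: state 5 is terminal (its value is just its reward of 10), and the action reward is received along with the transition, so V(i) = R(i) + Σ_c p_c (r_c + γ · V(next_c)).

```python
GAMMA = 0.9

# state: (state_reward, [(probability, action_reward, next_state), ...])
MDP = {
    1: (0, [(0.8, -2, 2), (0.2, +1, 4)]),
    2: (0, [(0.2, -1, 1), (0.8, -2, 3)]),
    3: (0, [(0.3, -2, 4), (0.7, 10, 5)]),
    4: (0, [(0.5, -2, 2), (0.5, +3, 3)]),
    5: (10, []),  # assumed terminal: no outgoing transitions
}

def backup(s, V, gamma=GAMMA):
    """V(i) = R(i) + sum over consequences c of p_c * (r_c + gamma * V(next_c))."""
    state_reward, outcomes = MDP[s]
    if not outcomes:
        return float(state_reward)
    return state_reward + sum(p * (r + gamma * V[s2]) for p, r, s2 in outcomes)

V = {s: 0.0 for s in MDP}
for _ in range(200):  # repeated sweeps converge quickly at gamma = 0.9
    V = {s: backup(s, V) for s in MDP}
print({s: round(v, 2) for s, v in V.items()})
```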
- Discuss the following question in the discussion forum: What are the differences between the value-iteration algorithm and Q-learning?
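As a starting point for the discussion, the sketch below shows the tabular Q-learning update. The key contrast is that value iteration backs up expected values using a known transition model, while Q-learning updates from individual sampled transitions (s, a, r, s′) and needs no model. The epsilon-greedy helper and the made-up states in the usage line are illustrative only.

```python
import random

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # learning rate, discount factor, exploration rate

def q_learning_update(Q, s, a, r, s2, actions):
    """Tabular Q-learning backup from one sampled transition (s, a, r, s2):
    Q[s, a] <- Q[s, a] + alpha * (r + gamma * max_a' Q[s2, a'] - Q[s, a]).
    Unlike value iteration, no transition probabilities are needed."""
    best_next = max(Q.get((s2, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + ALPHA * (r + GAMMA * best_next - old)

def epsilon_greedy(Q, s, actions):
    """Explore with probability epsilon, otherwise act greedily with respect to Q."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

# Hypothetical usage with made-up states and actions:
Q = {}
q_learning_update(Q, s="s1", a="go", r=-2.0, s2="s2", actions=["go", "stay"])
print(Q)
```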
Updated June 4, 2018, by FST Course Production Staff