Section 3: Reinforcement Learning: Markov Decision Process, Q-Learning
Key Learning Points
- Explain the Markov decision process (MDP), one of the most powerful models for describing a single agent's sequential decision problem under uncertainty and a building block for multiagent learning.
- Explain the Bellman equation and show how to solve a known MDP, i.e., generate an optimal policy, using the VALUE-ITERATION algorithm (a sketch follows this list).
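VALUE-ITERATION repeatedly applies the Bellman update until the value estimates stop changing. Below is a minimal Python sketch of the algorithm; the tiny two-state MDP, its action names, and its rewards are invented purely for illustration and are not part of the exercises.

```python
# Minimal value-iteration sketch. The two-state MDP below is made up for
# illustration only.

GAMMA = 0.9  # discount factor

# P[s][a] is a list of (probability, next_state) pairs.
P = {
    "a": {"stay": [(1.0, "a")], "go": [(0.8, "b"), (0.2, "a")]},
    "b": {"stay": [(1.0, "b")], "go": [(0.8, "a"), (0.2, "b")]},
}
# R[s] is the immediate reward associated with state s.
R = {"a": 0.0, "b": 10.0}

def value_iteration(P, R, gamma=GAMMA, sweeps=100):
    V = {s: 0.0 for s in P}  # start from an initial value of zero in each state
    for _ in range(sweeps):
        # Bellman update: V(s) = max_a [ R(s) + gamma * sum_s' P(s'|s, a) * V(s') ]
        V = {
            s: max(
                R[s] + gamma * sum(p * V[s2] for p, s2 in outcomes)
                for outcomes in P[s].values()
            )
            for s in P
        }
    return V

print(value_iteration(P, R))
```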
Activities
- Read the following materials:
- Appendix C of the textbook.
- Tutorial on Q-learning: A painless tutorial on Q-Learning and its Java code.
- Section 7.4.1 of the textbook.
- Watch the following videos:
- Markov Decision Process (MDP) Tutorial
- Value Iteration
- Reinforcement Learning
- Do the following exercises:
- Marvin is a robot that inhabits a 3 × 3 grid and can only move north, south, east, and west. Marvin's wheels sometimes spin, so on each action there is a 0.2 probability that Marvin remains stationary. Marvin receives a reward of 1 whenever it visits the centre square. (A sketch of the transition model follows this exercise.)
- Draw an MDP that describes Marvin’s problem.
- What is the optimal policy?
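One possible way to write down Marvin's transition model is sketched below. The exercise does not say what happens at the grid boundary, so the assumption that an intended move off the grid simply leaves Marvin in place (on top of the 0.2 slip probability) is ours.

```python
# Sketch of Marvin's transition model on the 3 x 3 grid. Assumption (not in
# the exercise): an intended move off the grid leaves Marvin in place.

MOVES = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}
SLIP = 0.2  # wheels spin: Marvin remains stationary with probability 0.2

def transition(state, action):
    """Return {next_state: probability} for a state (row, col) on the 3 x 3 grid."""
    row, col = state
    d_row, d_col = MOVES[action]
    target = (row + d_row, col + d_col)
    if not (0 <= target[0] < 3 and 0 <= target[1] < 3):
        target = state  # assumed: moving into the boundary keeps Marvin in place
    probs = {state: SLIP}
    probs[target] = probs.get(target, 0.0) + (1 - SLIP)
    return probs

def reward(next_state):
    """Reward of 1 whenever Marvin visits the centre square."""
    return 1.0 if next_state == (1, 1) else 0.0

print(transition((0, 0), "east"))  # {(0, 0): 0.2, (0, 1): 0.8}
```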
- Consider the following grid world game:
This is a stationary MDP with an infinite horizon. The agent can only be in one of the six locations. It receives the reward or punishment associated with a cell when it leaves that cell: a reward of 10 for leaving the bottom-middle square and a punishment of 100 for leaving the top-left square. In each iteration of the game, the agent must choose a direction to move: up, down, left, or right. With probability 0.8 it moves in the chosen direction, and with probability 0.1 each it moves in one of the two perpendicular directions. For example, if the agent chooses to move up, it moves up with probability 0.8, left with probability 0.1, and right with probability 0.1. If the agent bumps into a wall, it stays in its current location and receives no reward or punishment.
- Perform one step of value iteration and show the resulting value function for each state. Use an initial value of zero in each state and a discount factor of 0.9.
- What is the value in the top-left state after performing another step of value iteration?
- What is the optimal policy?
- What would the optimal policy be if the punishment in the top-left square were changed to each of the following values: 5, 0.0001, 0?
- Suppose we have the following values for the states:
If we select the top-left state and the action “right,” what is the value of Q[top-left, right] using asynchronous value iteration? Use a discount factor of 0.9 and the reward/punishment described in the original problem.
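For the Q[top-left, right] question, one asynchronous value-iteration backup for a single state–action pair can be written as below. This is a sketch under our reading of the rules above: the reward or punishment for a cell is collected only on outcomes that actually leave the cell (a wall bump earns nothing). The cell names, values, and outcome probabilities in the usage example are placeholders, not the contents of the exercise's figure.

```python
GAMMA = 0.9  # discount factor

def q_backup(state, outcomes, leave_reward, V, gamma=GAMMA):
    """One asynchronous backup: Q[s, a] = sum_s' P(s'|s, a) * (r(s, s') + gamma * V[s']).

    outcomes: list of (probability, next_state) pairs for this action.
    leave_reward: reward/punishment collected when the agent actually leaves `state`;
                  outcomes that keep the agent in place (a wall bump) earn 0 instead.
    """
    return sum(
        p * ((leave_reward if s2 != state else 0.0) + gamma * V[s2])
        for p, s2 in outcomes
    )

# Hypothetical usage: placeholder cell names, values, and outcomes.
V = {"top_left": 0.0, "top_middle": 0.0, "below_top_left": 0.0}
outcomes = [(0.8, "top_middle"),      # intended move ("right") succeeds
            (0.1, "top_left"),        # perpendicular move bumps a wall: stay, no punishment
            (0.1, "below_top_left")]  # other perpendicular move
print(q_backup("top_left", outcomes, leave_reward=-100, V=V))
```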
- Calculate the value V(i) for each state i in the table shown below, given a discount factor of 0.9. Hint: draw a diagram first. (A sketch that encodes the table as an MDP follows it.)
| State | Reward (state) | Action consequence | Probability that consequence occurs | Reward (action) |
|---|---|---|---|---|
| 1 | 0 | Go to 2 | 0.8 | −2 |
| 1 | 0 | Go to 4 | 0.2 | +1 |
| 2 | 0 | Go to 1 | 0.2 | −1 |
| 2 | 0 | Go to 3 | 0.8 | −2 |
| 3 | 0 | Go to 4 | 0.3 | −2 |
| 3 | 0 | Go to 5 | 0.7 | 10 |
| 4 | 0 | Go to 2 | 0.5 | −2 |
| 4 | 0 | Go to 3 | 0.5 | +3 |
| 5 | 10 | — | — | — |
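A possible encoding of the table is sketched below, with value iteration run to convergence. Two assumptions the exercise leaves open: state 5 is terminal (its value is just its reward of 10), and the action reward is received along with the transition, so V(i) = R(i) + Σ_c p_c (r_c + γ · V(next_c)).

```python
GAMMA = 0.9

# state: (state_reward, [(probability, action_reward, next_state), ...])
MDP = {
    1: (0, [(0.8, -2, 2), (0.2, +1, 4)]),
    2: (0, [(0.2, -1, 1), (0.8, -2, 3)]),
    3: (0, [(0.3, -2, 4), (0.7, 10, 5)]),
    4: (0, [(0.5, -2, 2), (0.5, +3, 3)]),
    5: (10, []),  # assumed terminal: no outgoing transitions
}

def backup(s, V, gamma=GAMMA):
    """V(i) = R(i) + sum over consequences c of p_c * (r_c + gamma * V(next_c))."""
    state_reward, outcomes = MDP[s]
    if not outcomes:
        return float(state_reward)
    return state_reward + sum(p * (r + gamma * V[s2]) for p, r, s2 in outcomes)

V = {s: 0.0 for s in MDP}
for _ in range(200):  # repeated sweeps converge quickly at gamma = 0.9
    V = {s: backup(s, V) for s in MDP}
print({s: round(v, 2) for s, v in V.items()})
```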
- Discuss the following question in the discussion forum: What are the differences between the value-iteration algorithm and Q-learning?
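As a starting point for the discussion, the sketch below shows the tabular Q-learning update. The key contrast is that value iteration backs up expected values using a known transition model, while Q-learning updates from individual sampled transitions (s, a, r, s′) and needs no model. The epsilon-greedy helper and the made-up states in the usage line are illustrative only.

```python
import random

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # learning rate, discount factor, exploration rate

def q_learning_update(Q, s, a, r, s2, actions):
    """Tabular Q-learning backup from one sampled transition (s, a, r, s2):
    Q[s, a] <- Q[s, a] + alpha * (r + gamma * max_a' Q[s2, a'] - Q[s, a]).
    Unlike value iteration, no transition probabilities are needed."""
    best_next = max(Q.get((s2, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + ALPHA * (r + GAMMA * best_next - old)

def epsilon_greedy(Q, s, actions):
    """Explore with probability epsilon, otherwise act greedily with respect to Q."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

# Hypothetical usage with made-up states and actions:
Q = {}
q_learning_update(Q, s="s1", a="go", r=-2.0, s2="s2", actions=["go", "stay"])
print(Q)
```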
Updated June 4, 2018, by FST Course Production Staff