temporal difference
cmburgul committed Sep 5, 2019
1 parent 05c3f77 commit 479767e
Showing 6 changed files with 1,495 additions and 0 deletions.
@@ -0,0 +1,356 @@
"cells": [
"cell_type": "markdown",
"metadata": {},
"source": [
"# Temporal-Difference Methods\n",
"In this notebook, you will write your own implementations of many Temporal-Difference (TD) methods.\n",
"While we have provided some starter code, you are welcome to erase these hints and write your code from scratch.\n",
"### Part 0: Explore CliffWalkingEnv\n",
"We begin by importing the necessary packages."
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"import gym\n",
"import numpy as np\n",
"from collections import defaultdict, deque\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"import check_test\n",
"from plot_utils import plot_values"
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the code cell below to create an instance of the [CliffWalking]( environment."
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"env = gym.make('CliffWalking-v0')"
"cell_type": "markdown",
"metadata": {},
"source": [
"The agent moves through a $4\\times 12$ gridworld, with states numbered as follows:\n",
"[[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],\n",
" [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23],\n",
" [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35],\n",
" [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]]\n",
"At the start of any episode, state `36` is the initial state. State `47` is the only terminal state, and the cliff corresponds to states `37` through `46`.\n",
"The agent has 4 potential actions:\n",
"UP = 0\n",
"RIGHT = 1\n",
"DOWN = 2\n",
"LEFT = 3\n",
"Thus, $\\mathcal{S}^+=\\{0, 1, \\ldots, 47\\}$, and $\\mathcal{A} =\\{0, 1, 2, 3\\}$. Verify this by running the code cell below."
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "markdown",
"metadata": {},
"source": [
"In this mini-project, we will build towards finding the optimal policy for the CliffWalking environment. The optimal state-value function is visualized below. Please take the time now to make sure that you understand _why_ this is the optimal state-value function.\n",
"_**Note**: You can safely ignore the values of the cliff \"states\" as these are not true states from which the agent can make decisions. For the cliff \"states\", the state-value function is not well-defined._"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# define the optimal state-value function\n",
"V_opt = np.zeros((4,12))\n",
"V_opt[0:13][0] = -np.arange(3, 15)[::-1]\n",
"V_opt[0:13][1] = -np.arange(3, 15)[::-1] + 1\n",
"V_opt[0:13][2] = -np.arange(3, 15)[::-1] + 2\n",
"V_opt[3][0] = -13\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part 1: TD Control: Sarsa\n",
"In this section, you will write your own implementation of the Sarsa control algorithm.\n",
"Your algorithm has four arguments:\n",
"- `env`: This is an instance of an OpenAI Gym environment.\n",
"- `num_episodes`: This is the number of episodes that are generated through agent-environment interaction.\n",
"- `alpha`: This is the step-size parameter for the update step.\n",
"- `gamma`: This is the discount rate. It must be a value between 0 and 1, inclusive (default value: `1`).\n",
"The algorithm returns as output:\n",
"- `Q`: This is a dictionary (of one-dimensional arrays) where `Q[s][a]` is the estimated action value corresponding to state `s` and action `a`.\n",
"Please complete the function in the code cell below.\n",
"(_Feel free to define additional functions to help you to organize your code._)"
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"def sarsa(env, num_episodes, alpha, gamma=1.0):\n",
" # initialize action-value function (empty dictionary of arrays)\n",
" Q = defaultdict(lambda: np.zeros(env.nA))\n",
" # initialize performance monitor\n",
" # loop over episodes\n",
" for i_episode in range(1, num_episodes+1):\n",
" # monitor progress\n",
" if i_episode % 100 == 0:\n",
" print(\"\\rEpisode {}/{}\".format(i_episode, num_episodes), end=\"\")\n",
" sys.stdout.flush() \n",
" \n",
" ## TODO: complete the function\n",
" \n",
" return Q"
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the next code cell to visualize the **_estimated_** optimal policy and the corresponding state-value function. \n",
"If the code cell returns **PASSED**, then you have implemented the function correctly! Feel free to change the `num_episodes` and `alpha` parameters that are supplied to the function. However, if you'd like to ensure the accuracy of the unit test, please do not change the value of `gamma` from the default."
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# obtain the estimated optimal policy and corresponding action-value function\n",
"Q_sarsa = sarsa(env, 5000, .01)\n",
"# print the estimated optimal policy\n",
"policy_sarsa = np.array([np.argmax(Q_sarsa[key]) if key in Q_sarsa else -1 for key in np.arange(48)]).reshape(4,12)\n",
"check_test.run_check('td_control_check', policy_sarsa)\n",
"print(\"\\nEstimated Optimal Policy (UP = 0, RIGHT = 1, DOWN = 2, LEFT = 3, N/A = -1):\")\n",
"# plot the estimated optimal state-value function\n",
"V_sarsa = ([np.max(Q_sarsa[key]) if key in Q_sarsa else 0 for key in np.arange(48)])\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part 2: TD Control: Q-learning\n",
"In this section, you will write your own implementation of the Q-learning control algorithm.\n",
"Your algorithm has four arguments:\n",
"- `env`: This is an instance of an OpenAI Gym environment.\n",
"- `num_episodes`: This is the number of episodes that are generated through agent-environment interaction.\n",
"- `alpha`: This is the step-size parameter for the update step.\n",
"- `gamma`: This is the discount rate. It must be a value between 0 and 1, inclusive (default value: `1`).\n",
"The algorithm returns as output:\n",
"- `Q`: This is a dictionary (of one-dimensional arrays) where `Q[s][a]` is the estimated action value corresponding to state `s` and action `a`.\n",
"Please complete the function in the code cell below.\n",
"(_Feel free to define additional functions to help you to organize your code._)"
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"def q_learning(env, num_episodes, alpha, gamma=1.0):\n",
" # initialize empty dictionary of arrays\n",
" Q = defaultdict(lambda: np.zeros(env.nA))\n",
" # loop over episodes\n",
" for i_episode in range(1, num_episodes+1):\n",
" # monitor progress\n",
" if i_episode % 100 == 0:\n",
" print(\"\\rEpisode {}/{}\".format(i_episode, num_episodes), end=\"\")\n",
" sys.stdout.flush()\n",
" \n",
" ## TODO: complete the function\n",
" \n",
" return Q"
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the next code cell to visualize the **_estimated_** optimal policy and the corresponding state-value function. \n",
"If the code cell returns **PASSED**, then you have implemented the function correctly! Feel free to change the `num_episodes` and `alpha` parameters that are supplied to the function. However, if you'd like to ensure the accuracy of the unit test, please do not change the value of `gamma` from the default."
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# obtain the estimated optimal policy and corresponding action-value function\n",
"Q_sarsamax = q_learning(env, 5000, .01)\n",
"# print the estimated optimal policy\n",
"policy_sarsamax = np.array([np.argmax(Q_sarsamax[key]) if key in Q_sarsamax else -1 for key in np.arange(48)]).reshape((4,12))\n",
"check_test.run_check('td_control_check', policy_sarsamax)\n",
"print(\"\\nEstimated Optimal Policy (UP = 0, RIGHT = 1, DOWN = 2, LEFT = 3, N/A = -1):\")\n",
"# plot the estimated optimal state-value function\n",
"plot_values([np.max(Q_sarsamax[key]) if key in Q_sarsamax else 0 for key in np.arange(48)])"
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part 3: TD Control: Expected Sarsa\n",
"In this section, you will write your own implementation of the Expected Sarsa control algorithm.\n",
"Your algorithm has four arguments:\n",
"- `env`: This is an instance of an OpenAI Gym environment.\n",
"- `num_episodes`: This is the number of episodes that are generated through agent-environment interaction.\n",
"- `alpha`: This is the step-size parameter for the update step.\n",
"- `gamma`: This is the discount rate. It must be a value between 0 and 1, inclusive (default value: `1`).\n",
"The algorithm returns as output:\n",
"- `Q`: This is a dictionary (of one-dimensional arrays) where `Q[s][a]` is the estimated action value corresponding to state `s` and action `a`.\n",
"Please complete the function in the code cell below.\n",
"(_Feel free to define additional functions to help you to organize your code._)"
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"def expected_sarsa(env, num_episodes, alpha, gamma=1.0):\n",
" # initialize empty dictionary of arrays\n",
" Q = defaultdict(lambda: np.zeros(env.nA))\n",
" # loop over episodes\n",
" for i_episode in range(1, num_episodes+1):\n",
" # monitor progress\n",
" if i_episode % 100 == 0:\n",
" print(\"\\rEpisode {}/{}\".format(i_episode, num_episodes), end=\"\")\n",
" sys.stdout.flush()\n",
" \n",
" ## TODO: complete the function\n",
" \n",
" return Q"
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the next code cell to visualize the **_estimated_** optimal policy and the corresponding state-value function. \n",
"If the code cell returns **PASSED**, then you have implemented the function correctly! Feel free to change the `num_episodes` and `alpha` parameters that are supplied to the function. However, if you'd like to ensure the accuracy of the unit test, please do not change the value of `gamma` from the default."
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# obtain the estimated optimal policy and corresponding action-value function\n",
"Q_expsarsa = expected_sarsa(env, 10000, 1)\n",
"# print the estimated optimal policy\n",
"policy_expsarsa = np.array([np.argmax(Q_expsarsa[key]) if key in Q_expsarsa else -1 for key in np.arange(48)]).reshape(4,12)\n",
"check_test.run_check('td_control_check', policy_expsarsa)\n",
"print(\"\\nEstimated Optimal Policy (UP = 0, RIGHT = 1, DOWN = 2, LEFT = 3, N/A = -1):\")\n",
"# plot the estimated optimal state-value function\n",
"plot_values([np.max(Q_expsarsa[key]) if key in Q_expsarsa else 0 for key in np.arange(48)])"
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
"nbformat": 4,
"nbformat_minor": 2
5 changes: 5 additions & 0 deletions temporal-difference/
@@ -0,0 +1,5 @@
# Temporal-Difference Methods

### Instructions

Follow the instructions in `Temporal_Difference.ipynb` to write your own implementations of many temporal-difference methods! The corresponding solutions can be found in `Temporal_Difference_Solution.ipynb`.
