Commit
Showing 6 changed files with 1,495 additions and 0 deletions.
356 changes: 356 additions & 0 deletions
temporal-difference/.ipynb_checkpoints/Temporal_Difference-checkpoint.ipynb
@@ -0,0 +1,356 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Temporal-Difference Methods\n",
"\n",
"In this notebook, you will write your own implementations of many Temporal-Difference (TD) methods.\n",
"\n",
"While we have provided some starter code, you are welcome to erase these hints and write your code from scratch.\n",
"\n",
"---\n",
"\n",
"### Part 0: Explore CliffWalkingEnv\n",
"\n",
"We begin by importing the necessary packages."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"import gym\n",
"import numpy as np\n",
"from collections import defaultdict, deque\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"\n",
"import check_test\n",
"from plot_utils import plot_values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the code cell below to create an instance of the [CliffWalking](https://github.com/openai/gym/blob/master/gym/envs/toy_text/cliffwalking.py) environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"env = gym.make('CliffWalking-v0')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The agent moves through a $4\\times 12$ gridworld, with states numbered as follows:\n",
"```\n",
"[[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11],\n",
" [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23],\n",
" [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35],\n",
" [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]]\n",
"```\n",
"At the start of any episode, state `36` is the initial state. State `47` is the only terminal state, and the cliff corresponds to states `37` through `46`.\n",
"\n",
"The agent has 4 potential actions:\n",
"```\n",
"UP = 0\n",
"RIGHT = 1\n",
"DOWN = 2\n",
"LEFT = 3\n",
"```\n",
"\n",
"Thus, $\\mathcal{S}^+=\\{0, 1, \\ldots, 47\\}$, and $\\mathcal{A} =\\{0, 1, 2, 3\\}$. Verify this by running the code cell below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(env.action_space)\n",
"print(env.observation_space)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this mini-project, we will build towards finding the optimal policy for the CliffWalking environment. The optimal state-value function is visualized below. Please take the time now to make sure that you understand _why_ this is the optimal state-value function.\n",
"\n",
"_**Note**: You can safely ignore the values of the cliff \"states\" as these are not true states from which the agent can make decisions. For the cliff \"states\", the state-value function is not well-defined._"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# define the optimal state-value function\n",
"V_opt = np.zeros((4,12))\n", | ||
"V_opt[0:13][0] = -np.arange(3, 15)[::-1]\n", | ||
"V_opt[0:13][1] = -np.arange(3, 15)[::-1] + 1\n", | ||
"V_opt[0:13][2] = -np.arange(3, 15)[::-1] + 2\n", | ||
"V_opt[3][0] = -13\n", | ||
"\n", | ||
"plot_values(V_opt)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Part 1: TD Control: Sarsa\n", | ||
"\n", | ||
"In this section, you will write your own implementation of the Sarsa control algorithm.\n", | ||
"\n", | ||
"Your algorithm has four arguments:\n", | ||
"- `env`: This is an instance of an OpenAI Gym environment.\n", | ||
"- `num_episodes`: This is the number of episodes that are generated through agent-environment interaction.\n", | ||
"- `alpha`: This is the step-size parameter for the update step.\n", | ||
"- `gamma`: This is the discount rate. It must be a value between 0 and 1, inclusive (default value: `1`).\n", | ||
"\n", | ||
"The algorithm returns as output:\n", | ||
"- `Q`: This is a dictionary (of one-dimensional arrays) where `Q[s][a]` is the estimated action value corresponding to state `s` and action `a`.\n", | ||
"\n", | ||
"Please complete the function in the code cell below.\n", | ||
"\n", | ||
"(_Feel free to define additional functions to help you to organize your code._)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"collapsed": true | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"def sarsa(env, num_episodes, alpha, gamma=1.0):\n", | ||
" # initialize action-value function (empty dictionary of arrays)\n", | ||
" Q = defaultdict(lambda: np.zeros(env.nA))\n", | ||
" # initialize performance monitor\n", | ||
" # loop over episodes\n", | ||
" for i_episode in range(1, num_episodes+1):\n", | ||
" # monitor progress\n", | ||
" if i_episode % 100 == 0:\n", | ||
" print(\"\\rEpisode {}/{}\".format(i_episode, num_episodes), end=\"\")\n", | ||
" sys.stdout.flush() \n", | ||
" \n", | ||
" ## TODO: complete the function\n", | ||
" \n", | ||
" return Q" | ||
] | ||
}, | ||
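{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before (or after) completing the `## TODO` above, you may find it helpful to study a minimal sketch of one possible approach. This is only an illustration under assumptions made here (the helper `epsilon_greedy_probs` and the `1/i_episode` epsilon decay are choices, not requirements), not the official solution. At every step, Sarsa applies the update\n",
"\n",
"$$Q(S_t, A_t) \\leftarrow Q(S_t, A_t) + \\alpha \\left( R_{t+1} + \\gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \\right)$$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# illustrative sketch only -- the helper name and the epsilon schedule are assumptions\n",
"def epsilon_greedy_probs(Q_s, epsilon, nA):\n",
"    # epsilon-greedy distribution: the greedy action gets the bulk of the mass,\n",
"    # with epsilon spread uniformly over all actions\n",
"    probs = np.ones(nA) * epsilon / nA\n",
"    probs[np.argmax(Q_s)] += 1 - epsilon\n",
"    return probs\n",
"\n",
"def sarsa_sketch(env, num_episodes, alpha, gamma=1.0):\n",
"    Q = defaultdict(lambda: np.zeros(env.nA))\n",
"    for i_episode in range(1, num_episodes + 1):\n",
"        epsilon = 1.0 / i_episode  # assumed decay schedule\n",
"        state = env.reset()\n",
"        action = np.random.choice(env.nA, p=epsilon_greedy_probs(Q[state], epsilon, env.nA))\n",
"        while True:\n",
"            next_state, reward, done, info = env.step(action)\n",
"            if done:\n",
"                # terminal transition: no successor action value to bootstrap from\n",
"                Q[state][action] += alpha * (reward - Q[state][action])\n",
"                break\n",
"            next_action = np.random.choice(env.nA, p=epsilon_greedy_probs(Q[next_state], epsilon, env.nA))\n",
"            # Sarsa bootstraps from the action that will actually be taken next\n",
"            Q[state][action] += alpha * (reward + gamma * Q[next_state][next_action] - Q[state][action])\n",
"            state, action = next_state, next_action\n",
"    return Q"
]
},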
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the next code cell to visualize the **_estimated_** optimal policy and the corresponding state-value function.\n",
"\n",
"If the code cell returns **PASSED**, then you have implemented the function correctly! Feel free to change the `num_episodes` and `alpha` parameters that are supplied to the function. However, if you'd like to ensure the accuracy of the unit test, please do not change the value of `gamma` from the default."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# obtain the estimated optimal policy and corresponding action-value function\n",
"Q_sarsa = sarsa(env, 5000, .01)\n",
"\n",
"# print the estimated optimal policy\n",
"policy_sarsa = np.array([np.argmax(Q_sarsa[key]) if key in Q_sarsa else -1 for key in np.arange(48)]).reshape(4,12)\n",
"check_test.run_check('td_control_check', policy_sarsa)\n",
"print(\"\\nEstimated Optimal Policy (UP = 0, RIGHT = 1, DOWN = 2, LEFT = 3, N/A = -1):\")\n",
"print(policy_sarsa)\n",
"\n",
"# plot the estimated optimal state-value function\n",
"V_sarsa = [np.max(Q_sarsa[key]) if key in Q_sarsa else 0 for key in np.arange(48)]\n",
"plot_values(V_sarsa)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part 2: TD Control: Q-learning\n",
"\n",
"In this section, you will write your own implementation of the Q-learning control algorithm.\n",
"\n",
"Your algorithm has four arguments:\n",
"- `env`: This is an instance of an OpenAI Gym environment.\n",
"- `num_episodes`: This is the number of episodes that are generated through agent-environment interaction.\n",
"- `alpha`: This is the step-size parameter for the update step.\n",
"- `gamma`: This is the discount rate. It must be a value between 0 and 1, inclusive (default value: `1`).\n",
"\n",
"The algorithm returns as output:\n",
"- `Q`: This is a dictionary (of one-dimensional arrays) where `Q[s][a]` is the estimated action value corresponding to state `s` and action `a`.\n",
"\n",
"Please complete the function in the code cell below.\n",
"\n",
"(_Feel free to define additional functions to help you to organize your code._)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def q_learning(env, num_episodes, alpha, gamma=1.0):\n",
"    # initialize empty dictionary of arrays\n",
"    Q = defaultdict(lambda: np.zeros(env.nA))\n",
"    # loop over episodes\n",
"    for i_episode in range(1, num_episodes+1):\n",
"        # monitor progress\n",
"        if i_episode % 100 == 0:\n",
"            print(\"\\rEpisode {}/{}\".format(i_episode, num_episodes), end=\"\")\n",
"            sys.stdout.flush()\n",
"        \n",
"        ## TODO: complete the function\n",
"        \n",
"    return Q"
]
},
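{
"cell_type": "markdown",
"metadata": {},
"source": [
"As with Sarsa, here is a minimal sketch of one possible approach (an illustration under the same assumptions, reusing the `epsilon_greedy_probs` helper from the Sarsa sketch above; it is not the official solution). The key difference is that Q-learning (sarsamax) bootstraps from the greedy successor value:\n",
"\n",
"$$Q(S_t, A_t) \\leftarrow Q(S_t, A_t) + \\alpha \\left( R_{t+1} + \\gamma \\max_a Q(S_{t+1}, a) - Q(S_t, A_t) \\right)$$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# illustrative sketch only -- reuses epsilon_greedy_probs from the Sarsa sketch\n",
"def q_learning_sketch(env, num_episodes, alpha, gamma=1.0):\n",
"    Q = defaultdict(lambda: np.zeros(env.nA))\n",
"    for i_episode in range(1, num_episodes + 1):\n",
"        epsilon = 1.0 / i_episode  # assumed decay schedule\n",
"        state = env.reset()\n",
"        while True:\n",
"            action = np.random.choice(env.nA, p=epsilon_greedy_probs(Q[state], epsilon, env.nA))\n",
"            next_state, reward, done, info = env.step(action)\n",
"            # bootstrap from the greedy value of the successor state;\n",
"            # Q[next_state] is all zeros for unvisited (including terminal) states\n",
"            Q[state][action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state][action])\n",
"            state = next_state\n",
"            if done:\n",
"                break\n",
"    return Q"
]
},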
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the next code cell to visualize the **_estimated_** optimal policy and the corresponding state-value function.\n",
"\n",
"If the code cell returns **PASSED**, then you have implemented the function correctly! Feel free to change the `num_episodes` and `alpha` parameters that are supplied to the function. However, if you'd like to ensure the accuracy of the unit test, please do not change the value of `gamma` from the default."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# obtain the estimated optimal policy and corresponding action-value function\n",
"Q_sarsamax = q_learning(env, 5000, .01)\n",
"\n",
"# print the estimated optimal policy\n",
"policy_sarsamax = np.array([np.argmax(Q_sarsamax[key]) if key in Q_sarsamax else -1 for key in np.arange(48)]).reshape(4,12)\n",
"check_test.run_check('td_control_check', policy_sarsamax)\n",
"print(\"\\nEstimated Optimal Policy (UP = 0, RIGHT = 1, DOWN = 2, LEFT = 3, N/A = -1):\")\n",
"print(policy_sarsamax)\n",
"\n",
"# plot the estimated optimal state-value function\n",
"plot_values([np.max(Q_sarsamax[key]) if key in Q_sarsamax else 0 for key in np.arange(48)])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part 3: TD Control: Expected Sarsa\n",
"\n",
"In this section, you will write your own implementation of the Expected Sarsa control algorithm.\n",
"\n",
"Your algorithm has four arguments:\n",
"- `env`: This is an instance of an OpenAI Gym environment.\n",
"- `num_episodes`: This is the number of episodes that are generated through agent-environment interaction.\n",
"- `alpha`: This is the step-size parameter for the update step.\n",
"- `gamma`: This is the discount rate. It must be a value between 0 and 1, inclusive (default value: `1`).\n",
"\n",
"The algorithm returns as output:\n",
"- `Q`: This is a dictionary (of one-dimensional arrays) where `Q[s][a]` is the estimated action value corresponding to state `s` and action `a`.\n",
"\n",
"Please complete the function in the code cell below.\n",
"\n",
"(_Feel free to define additional functions to help you to organize your code._)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def expected_sarsa(env, num_episodes, alpha, gamma=1.0):\n",
"    # initialize empty dictionary of arrays\n",
"    Q = defaultdict(lambda: np.zeros(env.nA))\n",
"    # loop over episodes\n",
"    for i_episode in range(1, num_episodes+1):\n",
"        # monitor progress\n",
"        if i_episode % 100 == 0:\n",
"            print(\"\\rEpisode {}/{}\".format(i_episode, num_episodes), end=\"\")\n",
"            sys.stdout.flush()\n",
"        \n",
"        ## TODO: complete the function\n",
"        \n",
"    return Q"
]
},
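{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once more, a minimal sketch of one possible approach (an illustration under assumptions, not the official solution; the fixed `epsilon = 0.005` below is a choice, and `epsilon_greedy_probs` comes from the Sarsa sketch above). Expected Sarsa bootstraps from the expected value of the successor state under the current epsilon-greedy policy:\n",
"\n",
"$$Q(S_t, A_t) \\leftarrow Q(S_t, A_t) + \\alpha \\left( R_{t+1} + \\gamma \\sum_a \\pi(a|S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t) \\right)$$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# illustrative sketch only -- the small fixed epsilon is an assumption\n",
"def expected_sarsa_sketch(env, num_episodes, alpha, gamma=1.0):\n",
"    Q = defaultdict(lambda: np.zeros(env.nA))\n",
"    for i_episode in range(1, num_episodes + 1):\n",
"        epsilon = 0.005  # small fixed exploration rate (an assumption)\n",
"        state = env.reset()\n",
"        while True:\n",
"            probs = epsilon_greedy_probs(Q[state], epsilon, env.nA)\n",
"            action = np.random.choice(env.nA, p=probs)\n",
"            next_state, reward, done, info = env.step(action)\n",
"            # expectation over the epsilon-greedy policy at the successor state\n",
"            next_probs = epsilon_greedy_probs(Q[next_state], epsilon, env.nA)\n",
"            expected_value = np.dot(next_probs, Q[next_state])\n",
"            Q[state][action] += alpha * (reward + gamma * expected_value - Q[state][action])\n",
"            state = next_state\n",
"            if done:\n",
"                break\n",
"    return Q"
]
},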
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the next code cell to visualize the **_estimated_** optimal policy and the corresponding state-value function.\n",
"\n",
"If the code cell returns **PASSED**, then you have implemented the function correctly! Feel free to change the `num_episodes` and `alpha` parameters that are supplied to the function. However, if you'd like to ensure the accuracy of the unit test, please do not change the value of `gamma` from the default."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# obtain the estimated optimal policy and corresponding action-value function\n",
"Q_expsarsa = expected_sarsa(env, 10000, 1)\n",
"\n",
"# print the estimated optimal policy\n",
"policy_expsarsa = np.array([np.argmax(Q_expsarsa[key]) if key in Q_expsarsa else -1 for key in np.arange(48)]).reshape(4,12)\n",
"check_test.run_check('td_control_check', policy_expsarsa)\n",
"print(\"\\nEstimated Optimal Policy (UP = 0, RIGHT = 1, DOWN = 2, LEFT = 3, N/A = -1):\")\n",
"print(policy_expsarsa)\n",
"\n",
"# plot the estimated optimal state-value function\n",
"plot_values([np.max(Q_expsarsa[key]) if key in Q_expsarsa else 0 for key in np.arange(48)])"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -0,0 +1,5 @@
# Temporal-Difference Methods

### Instructions

Follow the instructions in `Temporal_Difference.ipynb` to write your own implementations of many temporal-difference methods! The corresponding solutions can be found in `Temporal_Difference_Solution.ipynb`.