Commit
Showing 6 changed files with 1,495 additions and 0 deletions.
356 changes: 356 additions & 0 deletions
temporal-difference/.ipynb_checkpoints/Temporal_Difference-checkpoint.ipynb
@@ -0,0 +1,356 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Temporal-Difference Methods\n",
"\n",
"In this notebook, you will write your own implementations of many Temporal-Difference (TD) methods.\n",
"\n",
"While we have provided some starter code, you are welcome to erase these hints and write your code from scratch.\n",
"\n",
"---\n",
"\n",
"### Part 0: Explore CliffWalkingEnv\n",
"\n",
"We begin by importing the necessary packages."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"import gym\n",
"import numpy as np\n",
"from collections import defaultdict, deque\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"\n",
"import check_test\n",
"from plot_utils import plot_values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the code cell below to create an instance of the [CliffWalking](https://github.com/openai/gym/blob/master/gym/envs/toy_text/cliffwalking.py) environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"env = gym.make('CliffWalking-v0')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The agent moves through a $4\\times 12$ gridworld, with states numbered as follows:\n",
"```\n",
"[[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11],\n",
" [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23],\n",
" [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35],\n",
" [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]]\n",
"```\n",
"At the start of any episode, state `36` is the initial state. State `47` is the only terminal state, and the cliff corresponds to states `37` through `46`.\n",
"\n",
"The agent has 4 potential actions:\n",
"```\n",
"UP = 0\n",
"RIGHT = 1\n",
"DOWN = 2\n",
"LEFT = 3\n",
"```\n",
"\n",
"Thus, $\\mathcal{S}^+=\\{0, 1, \\ldots, 47\\}$, and $\\mathcal{A} =\\{0, 1, 2, 3\\}$. Verify this by running the code cell below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(env.action_space)\n",
"print(env.observation_space)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this mini-project, we will build towards finding the optimal policy for the CliffWalking environment. The optimal state-value function is visualized below. Please take the time now to make sure that you understand _why_ this is the optimal state-value function.\n",
"\n",
"_**Note**: You can safely ignore the values of the cliff \"states\" as these are not true states from which the agent can make decisions. For the cliff \"states\", the state-value function is not well-defined._"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# define the optimal state-value function\n",
"V_opt = np.zeros((4,12))\n", | ||
"V_opt[0:13][0] = -np.arange(3, 15)[::-1]\n", | ||
"V_opt[0:13][1] = -np.arange(3, 15)[::-1] + 1\n", | ||
"V_opt[0:13][2] = -np.arange(3, 15)[::-1] + 2\n", | ||
"V_opt[3][0] = -13\n", | ||
"\n", | ||
"plot_values(V_opt)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Part 1: TD Control: Sarsa\n", | ||
"\n", | ||
"In this section, you will write your own implementation of the Sarsa control algorithm.\n", | ||
"\n", | ||
"Your algorithm has four arguments:\n", | ||
"- `env`: This is an instance of an OpenAI Gym environment.\n", | ||
"- `num_episodes`: This is the number of episodes that are generated through agent-environment interaction.\n", | ||
"- `alpha`: This is the step-size parameter for the update step.\n", | ||
"- `gamma`: This is the discount rate. It must be a value between 0 and 1, inclusive (default value: `1`).\n", | ||
"\n", | ||
"The algorithm returns as output:\n", | ||
"- `Q`: This is a dictionary (of one-dimensional arrays) where `Q[s][a]` is the estimated action value corresponding to state `s` and action `a`.\n", | ||
"\n", | ||
"Please complete the function in the code cell below.\n", | ||
"\n", | ||
"(_Feel free to define additional functions to help you to organize your code._)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"collapsed": true | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"def sarsa(env, num_episodes, alpha, gamma=1.0):\n", | ||
" # initialize action-value function (empty dictionary of arrays)\n", | ||
" Q = defaultdict(lambda: np.zeros(env.nA))\n", | ||
" # initialize performance monitor\n", | ||
" # loop over episodes\n", | ||
" for i_episode in range(1, num_episodes+1):\n", | ||
" # monitor progress\n", | ||
" if i_episode % 100 == 0:\n", | ||
" print(\"\\rEpisode {}/{}\".format(i_episode, num_episodes), end=\"\")\n", | ||
" sys.stdout.flush() \n", | ||
" \n", | ||
" ## TODO: complete the function\n", | ||
" \n", | ||
" return Q" | ||
] | ||
}, | ||
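{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before (or after) completing the `## TODO` above, you may find it helpful to study a minimal sketch of one possible approach. This is only an illustration under assumptions made here (the helper `epsilon_greedy_probs` and the `1/i_episode` epsilon decay are choices, not requirements), not the official solution. At every step, Sarsa applies the update\n",
"\n",
"$$Q(S_t, A_t) \\leftarrow Q(S_t, A_t) + \\alpha \\left( R_{t+1} + \\gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \\right)$$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# illustrative sketch only -- the helper name and the epsilon schedule are assumptions\n",
"def epsilon_greedy_probs(Q_s, epsilon, nA):\n",
"    # epsilon-greedy distribution: the greedy action gets the bulk of the mass,\n",
"    # with epsilon spread uniformly over all actions\n",
"    probs = np.ones(nA) * epsilon / nA\n",
"    probs[np.argmax(Q_s)] += 1 - epsilon\n",
"    return probs\n",
"\n",
"def sarsa_sketch(env, num_episodes, alpha, gamma=1.0):\n",
"    Q = defaultdict(lambda: np.zeros(env.nA))\n",
"    for i_episode in range(1, num_episodes + 1):\n",
"        epsilon = 1.0 / i_episode  # assumed decay schedule\n",
"        state = env.reset()\n",
"        action = np.random.choice(env.nA, p=epsilon_greedy_probs(Q[state], epsilon, env.nA))\n",
"        while True:\n",
"            next_state, reward, done, info = env.step(action)\n",
"            if done:\n",
"                # terminal transition: no successor action value to bootstrap from\n",
"                Q[state][action] += alpha * (reward - Q[state][action])\n",
"                break\n",
"            next_action = np.random.choice(env.nA, p=epsilon_greedy_probs(Q[next_state], epsilon, env.nA))\n",
"            # Sarsa bootstraps from the action that will actually be taken next\n",
"            Q[state][action] += alpha * (reward + gamma * Q[next_state][next_action] - Q[state][action])\n",
"            state, action = next_state, next_action\n",
"    return Q"
]
},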
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the next code cell to visualize the **_estimated_** optimal policy and the corresponding state-value function.\n",
"\n",
"If the code cell returns **PASSED**, then you have implemented the function correctly! Feel free to change the `num_episodes` and `alpha` parameters that are supplied to the function. However, if you'd like to ensure the accuracy of the unit test, please do not change the value of `gamma` from the default."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# obtain the estimated optimal policy and corresponding action-value function\n",
"Q_sarsa = sarsa(env, 5000, .01)\n",
"\n",
"# print the estimated optimal policy\n",
"policy_sarsa = np.array([np.argmax(Q_sarsa[key]) if key in Q_sarsa else -1 for key in np.arange(48)]).reshape(4,12)\n",
"check_test.run_check('td_control_check', policy_sarsa)\n",
"print(\"\\nEstimated Optimal Policy (UP = 0, RIGHT = 1, DOWN = 2, LEFT = 3, N/A = -1):\")\n",
"print(policy_sarsa)\n",
"\n",
"# plot the estimated optimal state-value function\n",
"V_sarsa = [np.max(Q_sarsa[key]) if key in Q_sarsa else 0 for key in np.arange(48)]\n",
"plot_values(V_sarsa)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part 2: TD Control: Q-learning\n",
"\n",
"In this section, you will write your own implementation of the Q-learning control algorithm.\n",
"\n",
"Your algorithm has four arguments:\n",
"- `env`: This is an instance of an OpenAI Gym environment.\n",
"- `num_episodes`: This is the number of episodes that are generated through agent-environment interaction.\n",
"- `alpha`: This is the step-size parameter for the update step.\n",
"- `gamma`: This is the discount rate. It must be a value between 0 and 1, inclusive (default value: `1`).\n",
"\n",
"The algorithm returns as output:\n",
"- `Q`: This is a dictionary (of one-dimensional arrays) where `Q[s][a]` is the estimated action value corresponding to state `s` and action `a`.\n",
"\n",
"Please complete the function in the code cell below.\n",
"\n",
"(_Feel free to define additional functions to help you to organize your code._)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def q_learning(env, num_episodes, alpha, gamma=1.0):\n",
"    # initialize empty dictionary of arrays\n",
"    Q = defaultdict(lambda: np.zeros(env.nA))\n",
"    # loop over episodes\n",
"    for i_episode in range(1, num_episodes+1):\n",
"        # monitor progress\n",
"        if i_episode % 100 == 0:\n",
"            print(\"\\rEpisode {}/{}\".format(i_episode, num_episodes), end=\"\")\n",
"            sys.stdout.flush()\n",
"        \n",
"        ## TODO: complete the function\n",
"        \n",
"    return Q"
]
},
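{
"cell_type": "markdown",
"metadata": {},
"source": [
"As with Sarsa, here is a minimal sketch of one possible approach (an illustration under the same assumptions, reusing the `epsilon_greedy_probs` helper from the Sarsa sketch above; it is not the official solution). The key difference is that Q-learning (sarsamax) bootstraps from the greedy successor value:\n",
"\n",
"$$Q(S_t, A_t) \\leftarrow Q(S_t, A_t) + \\alpha \\left( R_{t+1} + \\gamma \\max_a Q(S_{t+1}, a) - Q(S_t, A_t) \\right)$$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# illustrative sketch only -- reuses epsilon_greedy_probs from the Sarsa sketch\n",
"def q_learning_sketch(env, num_episodes, alpha, gamma=1.0):\n",
"    Q = defaultdict(lambda: np.zeros(env.nA))\n",
"    for i_episode in range(1, num_episodes + 1):\n",
"        epsilon = 1.0 / i_episode  # assumed decay schedule\n",
"        state = env.reset()\n",
"        while True:\n",
"            action = np.random.choice(env.nA, p=epsilon_greedy_probs(Q[state], epsilon, env.nA))\n",
"            next_state, reward, done, info = env.step(action)\n",
"            # bootstrap from the greedy value of the successor state;\n",
"            # Q[next_state] is all zeros for unvisited (including terminal) states\n",
"            Q[state][action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state][action])\n",
"            state = next_state\n",
"            if done:\n",
"                break\n",
"    return Q"
]
},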
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the next code cell to visualize the **_estimated_** optimal policy and the corresponding state-value function.\n",
"\n",
"If the code cell returns **PASSED**, then you have implemented the function correctly! Feel free to change the `num_episodes` and `alpha` parameters that are supplied to the function. However, if you'd like to ensure the accuracy of the unit test, please do not change the value of `gamma` from the default."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# obtain the estimated optimal policy and corresponding action-value function\n",
"Q_sarsamax = q_learning(env, 5000, .01)\n",
"\n",
"# print the estimated optimal policy\n",
"policy_sarsamax = np.array([np.argmax(Q_sarsamax[key]) if key in Q_sarsamax else -1 for key in np.arange(48)]).reshape(4,12)\n",
"check_test.run_check('td_control_check', policy_sarsamax)\n",
"print(\"\\nEstimated Optimal Policy (UP = 0, RIGHT = 1, DOWN = 2, LEFT = 3, N/A = -1):\")\n",
"print(policy_sarsamax)\n",
"\n",
"# plot the estimated optimal state-value function\n",
"plot_values([np.max(Q_sarsamax[key]) if key in Q_sarsamax else 0 for key in np.arange(48)])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part 3: TD Control: Expected Sarsa\n",
"\n",
"In this section, you will write your own implementation of the Expected Sarsa control algorithm.\n",
"\n",
"Your algorithm has four arguments:\n",
"- `env`: This is an instance of an OpenAI Gym environment.\n",
"- `num_episodes`: This is the number of episodes that are generated through agent-environment interaction.\n",
"- `alpha`: This is the step-size parameter for the update step.\n",
"- `gamma`: This is the discount rate. It must be a value between 0 and 1, inclusive (default value: `1`).\n",
"\n",
"The algorithm returns as output:\n",
"- `Q`: This is a dictionary (of one-dimensional arrays) where `Q[s][a]` is the estimated action value corresponding to state `s` and action `a`.\n",
"\n",
"Please complete the function in the code cell below.\n",
"\n",
"(_Feel free to define additional functions to help you to organize your code._)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def expected_sarsa(env, num_episodes, alpha, gamma=1.0):\n",
"    # initialize empty dictionary of arrays\n",
"    Q = defaultdict(lambda: np.zeros(env.nA))\n",
"    # loop over episodes\n",
"    for i_episode in range(1, num_episodes+1):\n",
"        # monitor progress\n",
"        if i_episode % 100 == 0:\n",
"            print(\"\\rEpisode {}/{}\".format(i_episode, num_episodes), end=\"\")\n",
"            sys.stdout.flush()\n",
"        \n",
"        ## TODO: complete the function\n",
"        \n",
"    return Q"
]
},
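{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once more, a minimal sketch of one possible approach (an illustration under assumptions, not the official solution; the fixed `epsilon = 0.005` below is a choice, and `epsilon_greedy_probs` comes from the Sarsa sketch above). Expected Sarsa bootstraps from the expected value of the successor state under the current epsilon-greedy policy:\n",
"\n",
"$$Q(S_t, A_t) \\leftarrow Q(S_t, A_t) + \\alpha \\left( R_{t+1} + \\gamma \\sum_a \\pi(a|S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t) \\right)$$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# illustrative sketch only -- the small fixed epsilon is an assumption\n",
"def expected_sarsa_sketch(env, num_episodes, alpha, gamma=1.0):\n",
"    Q = defaultdict(lambda: np.zeros(env.nA))\n",
"    for i_episode in range(1, num_episodes + 1):\n",
"        epsilon = 0.005  # small fixed exploration rate (an assumption)\n",
"        state = env.reset()\n",
"        while True:\n",
"            probs = epsilon_greedy_probs(Q[state], epsilon, env.nA)\n",
"            action = np.random.choice(env.nA, p=probs)\n",
"            next_state, reward, done, info = env.step(action)\n",
"            # expectation over the epsilon-greedy policy at the successor state\n",
"            next_probs = epsilon_greedy_probs(Q[next_state], epsilon, env.nA)\n",
"            expected_value = np.dot(next_probs, Q[next_state])\n",
"            Q[state][action] += alpha * (reward + gamma * expected_value - Q[state][action])\n",
"            state = next_state\n",
"            if done:\n",
"                break\n",
"    return Q"
]
},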
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the next code cell to visualize the **_estimated_** optimal policy and the corresponding state-value function.\n",
"\n",
"If the code cell returns **PASSED**, then you have implemented the function correctly! Feel free to change the `num_episodes` and `alpha` parameters that are supplied to the function. However, if you'd like to ensure the accuracy of the unit test, please do not change the value of `gamma` from the default."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# obtain the estimated optimal policy and corresponding action-value function\n",
"Q_expsarsa = expected_sarsa(env, 10000, 1)\n",
"\n",
"# print the estimated optimal policy\n",
"policy_expsarsa = np.array([np.argmax(Q_expsarsa[key]) if key in Q_expsarsa else -1 for key in np.arange(48)]).reshape(4,12)\n",
"check_test.run_check('td_control_check', policy_expsarsa)\n",
"print(\"\\nEstimated Optimal Policy (UP = 0, RIGHT = 1, DOWN = 2, LEFT = 3, N/A = -1):\")\n",
"print(policy_expsarsa)\n",
"\n",
"# plot the estimated optimal state-value function\n",
"plot_values([np.max(Q_expsarsa[key]) if key in Q_expsarsa else 0 for key in np.arange(48)])"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -0,0 +1,5 @@
# Temporal-Difference Methods

### Instructions

Follow the instructions in `Temporal_Difference.ipynb` to write your own implementations of many temporal-difference methods! The corresponding solutions can be found in `Temporal_Difference_Solution.ipynb`.