Commit

Monte Carlo
cmburgul committed Sep 5, 2019
1 parent 37fdcaf commit 05c3f77
Showing 6 changed files with 923 additions and 0 deletions.
355 changes: 355 additions & 0 deletions monte-carlo/Monte_Carlo.ipynb
@@ -0,0 +1,355 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Monte Carlo Methods\n",
"\n",
"In this notebook, you will write your own implementations of many Monte Carlo (MC) algorithms. \n",
"\n",
"While we have provided some starter code, you are welcome to erase these hints and write your code from scratch.\n",
"\n",
"### Part 0: Explore BlackjackEnv\n",
"\n",
"We begin by importing the necessary packages."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"import gym\n",
"import numpy as np\n",
"from collections import defaultdict\n",
"\n",
"from plot_utils import plot_blackjack_values, plot_policy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the code cell below to create an instance of the [Blackjack](https://github.com/openai/gym/blob/master/gym/envs/toy_text/blackjack.py) environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"env = gym.make('Blackjack-v0')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each state is a 3-tuple of:\n",
"- the player's current sum $\\in \\{0, 1, \\ldots, 31\\}$,\n",
"- the dealer's face up card $\\in \\{1, \\ldots, 10\\}$, and\n",
"- whether or not the player has a usable ace (`no` $=0$, `yes` $=1$).\n",
"\n",
"The agent has two potential actions:\n",
"\n",
"```\n",
" STICK = 0\n",
" HIT = 1\n",
"```\n",
"Verify this by running the code cell below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(env.observation_space)\n",
"print(env.action_space)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Execute the code cell below to play Blackjack with a random policy. \n",
"\n",
"(_The code currently plays Blackjack three times - feel free to change this number, or to run the cell multiple times. The cell is designed for you to get some experience with the output that is returned as the agent interacts with the environment._)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for i_episode in range(3):\n",
" state = env.reset()\n",
" while True:\n",
" print(state)\n",
" action = env.action_space.sample()\n",
" state, reward, done, info = env.step(action)\n",
" if done:\n",
" print('End game! Reward: ', reward)\n",
" print('You won :)\\n') if reward > 0 else print('You lost :(\\n')\n",
" break"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part 1: MC Prediction\n",
"\n",
"In this section, you will write your own implementation of MC prediction (for estimating the action-value function). \n",
"\n",
"We will begin by investigating a policy where the player _almost_ always sticks if the sum of her cards exceeds 18. In particular, she selects action `STICK` with 80% probability if the sum is greater than 18; and, if the sum is 18 or below, she selects action `HIT` with 80% probability. The function `generate_episode_from_limit_stochastic` samples an episode using this policy. \n",
"\n",
"The function accepts as **input**:\n",
"- `bj_env`: This is an instance of OpenAI Gym's Blackjack environment.\n",
"\n",
"It returns as **output**:\n",
"- `episode`: This is a list of (state, action, reward) tuples (of tuples) and corresponds to $(S_0, A_0, R_1, \\ldots, S_{T-1}, A_{T-1}, R_{T})$, where $T$ is the final time step. In particular, `episode[i]` returns $(S_i, A_i, R_{i+1})$, and `episode[i][0]`, `episode[i][1]`, and `episode[i][2]` return $S_i$, $A_i$, and $R_{i+1}$, respectively."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def generate_episode_from_limit_stochastic(bj_env):\n",
" episode = []\n",
" state = bj_env.reset()\n",
" while True:\n",
" probs = [0.8, 0.2] if state[0] > 18 else [0.2, 0.8]\n",
" action = np.random.choice(np.arange(2), p=probs)\n",
" next_state, reward, done, info = bj_env.step(action)\n",
" episode.append((state, action, reward))\n",
" state = next_state\n",
" if done:\n",
" break\n",
" return episode"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Execute the code cell below to play Blackjack with the policy. \n",
"\n",
"(*The code currently plays Blackjack three times - feel free to change this number, or to run the cell multiple times. The cell is designed for you to gain some familiarity with the output of the `generate_episode_from_limit_stochastic` function.*)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for i in range(3):\n",
" print(generate_episode_from_limit_stochastic(env))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, you are ready to write your own implementation of MC prediction. Feel free to implement either first-visit or every-visit MC prediction; in the case of the Blackjack environment, the techniques are equivalent.\n",
"\n",
"Your algorithm has three arguments:\n",
"- `env`: This is an instance of an OpenAI Gym environment.\n",
"- `num_episodes`: This is the number of episodes that are generated through agent-environment interaction.\n",
"- `generate_episode`: This is a function that returns an episode of interaction.\n",
"- `gamma`: This is the discount rate. It must be a value between 0 and 1, inclusive (default value: `1`).\n",
"\n",
"The algorithm returns as output:\n",
"- `Q`: This is a dictionary (of one-dimensional arrays) where `Q[s][a]` is the estimated action value corresponding to state `s` and action `a`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def mc_prediction_q(env, num_episodes, generate_episode, gamma=1.0):\n",
" # initialize empty dictionaries of arrays\n",
" returns_sum = defaultdict(lambda: np.zeros(env.action_space.n))\n",
" N = defaultdict(lambda: np.zeros(env.action_space.n))\n",
" Q = defaultdict(lambda: np.zeros(env.action_space.n))\n",
" # loop over episodes\n",
" for i_episode in range(1, num_episodes+1):\n",
" # monitor progress\n",
" if i_episode % 1000 == 0:\n",
" print(\"\\rEpisode {}/{}.\".format(i_episode, num_episodes), end=\"\")\n",
" sys.stdout.flush()\n",
" \n",
" ## TODO: complete the function\n",
" \n",
" return Q"
]
},
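{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you would like a reference point before (or after) attempting the `## TODO` above, the next cell shows one possible completion: an every-visit update that averages the sampled returns observed for each state-action pair. It is only a sketch built on the same starter structure; the name `mc_prediction_q_sketch` is illustrative and is used so that it does not replace the function you implement yourself."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def mc_prediction_q_sketch(env, num_episodes, generate_episode, gamma=1.0):\n",
"    # illustrative sketch only -- the TODO above is yours to complete\n",
"    returns_sum = defaultdict(lambda: np.zeros(env.action_space.n))\n",
"    N = defaultdict(lambda: np.zeros(env.action_space.n))\n",
"    Q = defaultdict(lambda: np.zeros(env.action_space.n))\n",
"    for i_episode in range(1, num_episodes+1):\n",
"        if i_episode % 1000 == 0:\n",
"            print(\"\\rEpisode {}/{}.\".format(i_episode, num_episodes), end=\"\")\n",
"            sys.stdout.flush()\n",
"        # generate an episode with the supplied policy\n",
"        episode = generate_episode(env)\n",
"        states, actions, rewards = zip(*episode)\n",
"        # discount factors gamma^0, gamma^1, ..., gamma^T\n",
"        discounts = np.array([gamma**i for i in range(len(rewards)+1)])\n",
"        # every-visit MC: average the sampled returns for each (state, action) pair\n",
"        for i, (state, action) in enumerate(zip(states, actions)):\n",
"            returns_sum[state][action] += sum(rewards[i:] * discounts[:-(1+i)])\n",
"            N[state][action] += 1.0\n",
"            Q[state][action] = returns_sum[state][action] / N[state][action]\n",
"    return Q"
]
},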
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the cell below to obtain the action-value function estimate $Q$. We have also plotted the corresponding state-value function.\n",
"\n",
"To check the accuracy of your implementation, compare the plot below to the corresponding plot in the solutions notebook **Monte_Carlo_Solution.ipynb**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# obtain the action-value function\n",
"Q = mc_prediction_q(env, 500000, generate_episode_from_limit_stochastic)\n",
"\n",
"# obtain the corresponding state-value function\n",
"V_to_plot = dict((k,(k[0]>18)*(np.dot([0.8, 0.2],v)) + (k[0]<=18)*(np.dot([0.2, 0.8],v))) \\\n",
" for k, v in Q.items())\n",
"\n",
"# plot the state-value function\n",
"plot_blackjack_values(V_to_plot)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part 2: MC Control\n",
"\n",
"In this section, you will write your own implementation of constant-$\\alpha$ MC control. \n",
"\n",
"Your algorithm has four arguments:\n",
"- `env`: This is an instance of an OpenAI Gym environment.\n",
"- `num_episodes`: This is the number of episodes that are generated through agent-environment interaction.\n",
"- `alpha`: This is the step-size parameter for the update step.\n",
"- `gamma`: This is the discount rate. It must be a value between 0 and 1, inclusive (default value: `1`).\n",
"\n",
"The algorithm returns as output:\n",
"- `Q`: This is a dictionary (of one-dimensional arrays) where `Q[s][a]` is the estimated action value corresponding to state `s` and action `a`.\n",
"- `policy`: This is a dictionary where `policy[s]` returns the action that the agent chooses after observing state `s`.\n",
"\n",
"(_Feel free to define additional functions to help you to organize your code._)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def mc_control(env, num_episodes, alpha, gamma=1.0):\n",
" nA = env.action_space.n\n",
" # initialize empty dictionary of arrays\n",
" Q = defaultdict(lambda: np.zeros(nA))\n",
" # loop over episodes\n",
" for i_episode in range(1, num_episodes+1):\n",
" # monitor progress\n",
" if i_episode % 1000 == 0:\n",
" print(\"\\rEpisode {}/{}.\".format(i_episode, num_episodes), end=\"\")\n",
" sys.stdout.flush()\n",
" \n",
" ## TODO: complete the function\n",
" \n",
" return policy, Q"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the cell below to obtain the estimated optimal policy and action-value function. Note that you should fill in your own values for the `num_episodes` and `alpha` parameters."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# obtain the estimated optimal policy and action-value function\n",
"policy, Q = mc_control(env, ?, ?)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we plot the corresponding state-value function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# obtain the corresponding state-value function\n",
"V = dict((k,np.max(v)) for k, v in Q.items())\n",
"\n",
"# plot the state-value function\n",
"plot_blackjack_values(V)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we visualize the policy that is estimated to be optimal."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# plot the policy\n",
"plot_policy(policy)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The **true** optimal policy $\\pi_*$ can be found in Figure 5.2 of the [textbook](http://go.udacity.com/rl-textbook) (and appears below). Compare your final estimate to the optimal policy - how close are you able to get? If you are not happy with the performance of your algorithm, take the time to tweak the decay rate of $\\epsilon$, change the value of $\\alpha$, and/or run the algorithm for more episodes to attain better results.\n",
"\n",
"![True Optimal Policy](images/optimal.png)"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}