diff --git a/documentation/tutorials/beginner_colab.ipynb b/documentation/tutorials/beginner_colab.ipynb index 194d342e..7024e4c9 100644 --- a/documentation/tutorials/beginner_colab.ipynb +++ b/documentation/tutorials/beginner_colab.ipynb @@ -2,21 +2,16 @@ "cells": [ { "cell_type": "markdown", - "metadata": { - "id": "Tce3stUlHN0L" - }, "source": [ "##### Copyright 2020 The TensorFlow Authors." - ] + ], + "metadata": { + "id": "Tce3stUlHN0L" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "cellView": "form", - "id": "tuOe1ymfHZPu" - }, - "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", @@ -29,37 +24,40 @@ "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." - ] + ], + "outputs": [], + "metadata": { + "cellView": "form", + "id": "tuOe1ymfHZPu" + } }, { "cell_type": "markdown", + "source": [ + "# Build, train and evaluate Random Forests and Gradient Boosted Trees with TensorFlow Decision Forests\n", + "\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " View on TensorFlow.org\n", + " \n", + " Run in Google Colab\n", + " \n", + " View on GitHub\n", + " \n", + " Download notebook\n", + "
\n" + ], "metadata": { "id": "36EdAGhThQov" - }, - "source": [ - "# Build, train and evaluate models with TensorFlow Decision Forests\n", - "\n", - "\u003ctable class=\"tfo-notebook-buttons\" align=\"left\"\u003e\n", - " \u003ctd\u003e\n", - " \u003ca target=\"_blank\" href=\"https://www.tensorflow.org/decision_forests/tutorials/beginner_colab\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" /\u003eView on TensorFlow.org\u003c/a\u003e\n", - " \u003c/td\u003e\n", - " \u003ctd\u003e\n", - " \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/decision-forests/blob/main/documentation/tutorials/beginner_colab.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" /\u003eRun in Google Colab\u003c/a\u003e\n", - " \u003c/td\u003e\n", - " \u003ctd\u003e\n", - " \u003ca target=\"_blank\" href=\"https://github.com/tensorflow/decision-forests/blob/main/documentation/tutorials/beginner_colab.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" /\u003eView on GitHub\u003c/a\u003e\n", - " \u003c/td\u003e\n", - " \u003ctd\u003e\n", - " \u003ca href=\"https://storage.googleapis.com/tensorflow_docs/decision-forests/documentation/tutorials/beginner_colab.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/download_logo_32px.png\" /\u003eDownload notebook\u003c/a\u003e\n", - " \u003c/td\u003e\n", - "\u003c/table\u003e\n" - ] + } }, { "cell_type": "markdown", - "metadata": { - "id": "kvvDY0LVhuaW" - }, "source": [ "## Introduction\n", "\n", @@ -88,67 +86,66 @@ "\n", "Detailed documentation is available in the [user manual](https://github.com/tensorflow/decision-forests/documentation).\n", "The [example directory](https://github.com/tensorflow/decision-forests/examples) contains other end-to-end examples." 
- ] + ], + "metadata": { + "id": "kvvDY0LVhuaW" + } }, { "cell_type": "markdown", - "metadata": { - "id": "jK9tCTcwqq4k" - }, "source": [ "## Installing TensorFlow Decision Forests\n", "\n", "Install TF-DF by running the following cell." - ] + ], + "metadata": { + "id": "jK9tCTcwqq4k" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "Pa1Pf37RhEYN" - }, - "outputs": [], "source": [ "!pip install tensorflow_decision_forests" - ] + ], + "outputs": [], + "metadata": { + "id": "Pa1Pf37RhEYN" + } }, { "cell_type": "markdown", - "metadata": { - "id": "vZGda2dOe-hH" - }, "source": [ "Install [Wurlitzer](https://pypi.org/project/wurlitzer/) to display\n", "the detailed training logs. This is only needed in colabs." - ] + ], + "metadata": { + "id": "vZGda2dOe-hH" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "lk26uBSCe8Du" - }, - "outputs": [], "source": [ "!pip install wurlitzer" - ] + ], + "outputs": [], + "metadata": { + "id": "lk26uBSCe8Du" + } }, { "cell_type": "markdown", - "metadata": { - "id": "3oinwbhXlggd" - }, "source": [ "## Importing libraries" - ] + ], + "metadata": { + "id": "3oinwbhXlggd" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "52W45tmDjD64" - }, - "outputs": [], "source": [ "import tensorflow_decision_forests as tfdf\n", "\n", @@ -165,25 +162,24 @@ "\n", "from IPython.core.magic import register_line_magic\n", "from IPython.display import Javascript" - ] + ], + "outputs": [], + "metadata": { + "id": "52W45tmDjD64" + } }, { "cell_type": "markdown", - "metadata": { - "id": "0LPPwWxYxtDM" - }, "source": [ "The hidden code cell limits the output height in colab.\n" - ] + ], + "metadata": { + "id": "0LPPwWxYxtDM" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "cellView": "form", - "id": "2AhqJz3VmQM-" - }, - "outputs": [], "source": [ "#@title\n", "\n", @@ -195,66 +191,67 @@ " display(\n", " 
Javascript(\"google.colab.output.setIframeHeight(0, true, {maxHeight: \" +\n", " str(size) + \"})\"))" - ] + ], + "outputs": [], + "metadata": { + "cellView": "form", + "id": "2AhqJz3VmQM-" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "8gVQ-txtjFU4" - }, - "outputs": [], "source": [ "# Check the version of TensorFlow Decision Forests\n", "print(\"Found TensorFlow Decision Forests v\" + tfdf.__version__)" - ] + ], + "outputs": [], + "metadata": { + "id": "8gVQ-txtjFU4" + } }, { "cell_type": "markdown", - "metadata": { - "id": "QGRtRECujKeu" - }, "source": [ "## Training a Random Forest model\n", "\n", "In this section, we train, evaluate, analyse and export a binary classification Random Forest trained on the [Palmer's Penguins](https://allisonhorst.github.io/palmerpenguins/articles/intro.html) dataset.\n", "\n", - "\u003ccenter\u003e\n", - "\u003cimg src=\"https://allisonhorst.github.io/palmerpenguins/man/figures/palmerpenguins.png\" width=\"150\"/\u003e\u003c/center\u003e\n", + "
\n", + "
\n", "\n", "**Note:** The dataset was exported to a csv file without pre-processing: `library(palmerpenguins); write.csv(penguins, file=\"penguins.csv\", quote=F, row.names=F)`. " - ] + ], + "metadata": { + "id": "QGRtRECujKeu" + } }, { "cell_type": "markdown", - "metadata": { - "id": "3qsSU1RfmNiP" - }, "source": [ "### Load the dataset and convert it in a tf.Dataset" - ] + ], + "metadata": { + "id": "3qsSU1RfmNiP" + } }, { "cell_type": "markdown", - "metadata": { - "id": "9nJ5igfElg2I" - }, "source": [ "This dataset is very small (300 examples) and stored as a .csv-like file. Therefore, use Pandas to load it.\n", "\n", - "**Note:** Pandas is practical as you don't have to type in name of the input features to load them. For larger datasets (\u003e1M examples), using the\n", + "**Note:** Pandas is practical as you don't have to type in name of the input features to load them. For larger datasets (>1M examples), using the\n", "[TensorFlow Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) to read the files may be better suited.\n", "\n", "Let's assemble the dataset into a csv file (i.e. add the header), and load it:" - ] + ], + "metadata": { + "id": "9nJ5igfElg2I" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "44Jq6g_mJFmj" - }, - "outputs": [], "source": [ "# Download the dataset\n", "!wget -q https://storage.googleapis.com/download.tensorflow.org/data/palmer_penguins/penguins.csv -O /tmp/penguins.csv\n", @@ -264,27 +261,27 @@ "\n", "# Display the first 3 examples.\n", "dataset_df.head(3)" - ] + ], + "outputs": [], + "metadata": { + "id": "44Jq6g_mJFmj" + } }, { "cell_type": "markdown", - "metadata": { - "id": "23AewWT1lkIK" - }, "source": [ "The dataset contains a mix of numerical (e.g. `bill_depth_mm`), categorical\n", "(e.g. `island`) and missing features. 
TF-DF supports all these feature types natively (unlike NN-based models), therefore there is no need for preprocessing in the form of one-hot encoding, normalization or an extra `is_present` feature.\n", "\n", "Labels are a bit different: Keras metrics expect integers. The label (`species`) is stored as a string, so let's convert it into an integer." - ] + ], + "metadata": { + "id": "23AewWT1lkIK" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "uO_jz2sj0IBZ" - }, - "outputs": [], "source": [ "# Encode the categorical label into an integer.\n", "#\n", @@ -299,64 +296,65 @@ "print(f\"Label classes: {classes}\")\n", "\n", "dataset_df[label] = dataset_df[label].map(classes.index)" - ] + ], + "outputs": [], + "metadata": { + "id": "uO_jz2sj0IBZ" + } }, { "cell_type": "markdown", - "metadata": { - "id": "vwJjLFhbtozI" - }, "source": [ "Next, split the dataset into training and testing sets:" - ] + ], + "metadata": { + "id": "vwJjLFhbtozI" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "u7DEIxn2oB3U" - }, - "outputs": [], "source": [ "# Split the dataset into a training and a testing dataset.\n", "\n", "def split_dataset(dataset, test_ratio=0.30):\n", " \"\"\"Splits a pandas dataframe in two.\"\"\"\n", - " test_indices = np.random.rand(len(dataset)) \u003c test_ratio\n", + " test_indices = np.random.rand(len(dataset)) < test_ratio\n", " return dataset[~test_indices], dataset[test_indices]\n", "\n", "\n", "train_ds_pd, test_ds_pd = split_dataset(dataset_df)\n", "print(\"{} examples in training, {} examples for testing.\".format(\n", " len(train_ds_pd), len(test_ds_pd)))" - ] + ], + "outputs": [], + "metadata": { + "id": "u7DEIxn2oB3U" + } }, { "cell_type": "markdown", - "metadata": { - "id": "uWq7uQcCuBzO" - }, "source": [ "And finally, convert the pandas dataframes (`pd.DataFrame`) into TensorFlow datasets (`tf.data.Dataset`):" - ] + ], + "metadata": { + "id": "uWq7uQcCuBzO" + } }, { "cell_type": "code", 
"execution_count": null, - "metadata": { - "id": "qtXgUBKluTX0" - }, - "outputs": [], "source": [ "train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label)\n", "test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_ds_pd, label=label)" - ] + ], + "outputs": [], + "metadata": { + "id": "qtXgUBKluTX0" + } }, { "cell_type": "markdown", - "metadata": { - "id": "BRKLWIWNuOZ1" - }, "source": [ "**Notes:** `pd_dataframe_to_tf_dataset` could have converted the label to integer for you.\n", "\n", @@ -364,24 +362,23 @@ "\n", "- The learning algorithms work with a one-epoch dataset and without shuffling.\n", "- The batch size does not impact the training algorithm, but a small value might slow down reading the dataset.\n" - ] + ], + "metadata": { + "id": "BRKLWIWNuOZ1" + } }, { "cell_type": "markdown", - "metadata": { - "id": "mYAoyfYtqHG4" - }, "source": [ "### Train the model" - ] + ], + "metadata": { + "id": "mYAoyfYtqHG4" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "xete-FbuqJCV" - }, - "outputs": [], "source": [ "%set_cell_height 300\n", "\n", @@ -396,13 +393,14 @@ "# \"sys_pipes\" is optional. It enables the display of the training logs.\n", "with sys_pipes():\n", " model_1.fit(x=train_ds)" - ] + ], + "outputs": [], + "metadata": { + "id": "xete-FbuqJCV" + } }, { "cell_type": "markdown", - "metadata": { - "id": "OBnjxdip-MC0" - }, "source": [ "### Remarks\n", "\n", @@ -421,88 +419,88 @@ " is provided, it will only be used to show metrics.\n", "\n", "**Note:** A *Categorical-Set* feature is composed of a set of categorical values (while a *Categorical* is only one value). More details and examples are given later." 
- ] + ], + "metadata": { + "id": "OBnjxdip-MC0" + } }, { "cell_type": "markdown", - "metadata": { - "id": "tSdtNJUArBpl" - }, "source": [ "## Evaluate the model" - ] + ], + "metadata": { + "id": "tSdtNJUArBpl" + } }, { "cell_type": "markdown", - "metadata": { - "id": "Udtu_uS1paSu" - }, "source": [ "Let's evaluate our model on the test dataset." - ] + ], + "metadata": { + "id": "Udtu_uS1paSu" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "xUy4ULEMtDXB" - }, - "outputs": [], "source": [ "evaluation = model_1.evaluate(test_ds, return_dict=True)\n", "print()\n", "\n", "for name, value in evaluation.items():\n", " print(f\"{name}: {value:.4f}\")" - ] + ], + "outputs": [], + "metadata": { + "id": "xUy4ULEMtDXB" + } }, { "cell_type": "markdown", - "metadata": { - "id": "tlhfzZ34pfO4" - }, "source": [ "**Remark:** The test accuracy (0.86514) is close to the Out-of-bag accuracy\n", "(0.8672) shown in the training logs.\n", "\n", "See the **Model Self Evaluation** section below for more evaluation methods." - ] + ], + "metadata": { + "id": "tlhfzZ34pfO4" + } }, { "cell_type": "markdown", - "metadata": { - "id": "mHBFtUeElRYz" - }, "source": [ "## Prepare this model for TensorFlow Serving." 
- ] + ], + "metadata": { + "id": "mHBFtUeElRYz" + } }, { "cell_type": "markdown", - "metadata": { - "id": "JbC4lmgfr5Sm" - }, "source": [ "Export the model to the SavedModel format for later re-use e.g.\n", "[TensorFlow Serving](https://www.tensorflow.org/tfx/guide/serving).\n" - ] + ], + "metadata": { + "id": "JbC4lmgfr5Sm" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "08YWGr9U2fza" - }, - "outputs": [], "source": [ "model_1.save(\"/tmp/my_saved_model\")" - ] + ], + "outputs": [], + "metadata": { + "id": "08YWGr9U2fza" + } }, { "cell_type": "markdown", - "metadata": { - "id": "6-8R02_SXpbq" - }, "source": [ "## Plot the model\n", "\n", @@ -511,39 +509,39 @@ "Because of the difference in the way they are trained, some models are more interesting to plot than others. Because of the noise injected during training and the depth of the trees, plotting a Random Forest is less informative than plotting a CART or the first tree of a Gradient Boosted Tree.\n", "\n", "Nevertheless, let's plot the first tree of our Random Forest model:" - ] + ], + "metadata": { + "id": "6-8R02_SXpbq" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "KUIxf8N6Yjl0" - }, - "outputs": [], "source": [ "tfdf.model_plotter.plot_model_in_colab(model_1, tree_idx=0, max_depth=3)" - ] + ], + "outputs": [], + "metadata": { + "id": "KUIxf8N6Yjl0" + } }, { "cell_type": "markdown", - "metadata": { - "id": "cPcL_hDnY7Zy" - }, "source": [ - "The root node on the left contains the first condition (`bill_depth_mm \u003e= 16.55`), number of examples (240) and label distribution (the red-blue-green bar).\n", + "The root node on the left contains the first condition (`bill_depth_mm >= 16.55`), number of examples (240) and label distribution (the red-blue-green bar).\n", "\n", - "Examples that evaluates true to `bill_depth_mm \u003e= 16.55` are branched to the green path. 
The other ones are branched to the red path.\n", + "Examples that evaluate true to `bill_depth_mm >= 16.55` are branched to the green path. The other ones are branched to the red path.\n", "\n", "The deeper the nodes, the more `pure` they become i.e. the label distribution is biased toward a subset of classes. \n", "\n", "**Note:** Hover the mouse over the plot for details." - ] + ], + "metadata": { + "id": "cPcL_hDnY7Zy" + } }, { "cell_type": "markdown", - "metadata": { - "id": "-ob3ovQ2seVY" - }, "source": [ "## Model structure and feature importance\n", "\n", @@ -564,67 +562,67 @@ "Out-of-bag is only available for Random Forest) and the hyper-parameters (e.g.\n", "the *mean-decrease-in-accuracy* variable importance can be disabled in the\n", "hyper-parameters)." - ] + ], + "metadata": { + "id": "-ob3ovQ2seVY" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "kzXME28Lq7Il" - }, - "outputs": [], "source": [ "%set_cell_height 300\n", "model_1.summary()" - ] + ], + "outputs": [], + "metadata": { + "id": "kzXME28Lq7Il" + } }, { "cell_type": "markdown", - "metadata": { - "id": "d4ApRpUm02zU" - }, "source": [ "The information in ``summary`` is all available programmatically using the model inspector:" - ] + ], + "metadata": { + "id": "d4ApRpUm02zU" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "G3xuB3jN1Cww" - }, - "outputs": [], "source": [ "# The input features\n", "model_1.make_inspector().features()" - ] + ], + "outputs": [], + "metadata": { + "id": "G3xuB3jN1Cww" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "BZ2RBbU51L6s" - }, - "outputs": [], "source": [ "# The feature importances\n", "model_1.make_inspector().variable_importances()" - ] + ], + "outputs": [], + "metadata": { + "id": "BZ2RBbU51L6s" + } }, { "cell_type": "markdown", - "metadata": { - "id": "0zvyRJVk1aEk" - }, "source": [ "The content of the summary and the inspector depends on the learning algorithm 
(`tfdf.keras.RandomForestModel` in this case) and its hyper-parameters (e.g. `compute_oob_variable_importances=True` will trigger the computation of Out-of-bag variable importances for the Random Forest learner)." - ] + ], + "metadata": { + "id": "0zvyRJVk1aEk" + } }, { "cell_type": "markdown", - "metadata": { - "id": "tFVmrHtWXYKY" - }, "source": [ "## Model Self Evaluation\n", "\n", @@ -633,24 +631,24 @@ "**Note:** While this evaluation is computed during training, it is NOT computed on the training dataset and can be used as a low quality evaluation.\n", "\n", "The model self evaluation is available with the inspector's `evaluation()`:" - ] + ], + "metadata": { + "id": "tFVmrHtWXYKY" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "BZPzyIMmYmsI" - }, - "outputs": [], "source": [ "model_1.make_inspector().evaluation()" - ] + ], + "outputs": [], + "metadata": { + "id": "BZPzyIMmYmsI" + } }, { "cell_type": "markdown", - "metadata": { - "id": "vBSz-jE0Qss_" - }, "source": [ "## Plotting the training logs\n", "\n", @@ -664,36 +662,35 @@ "1. 
Using [TensorBoard](https://www.tensorflow.org/tensorboard)\n", "\n", "Let's try the options 2 and 3:\n" - ] + ], + "metadata": { + "id": "vBSz-jE0Qss_" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "ZbRk7xvpTKQG" - }, - "outputs": [], "source": [ "%set_cell_height 150\n", "model_1.make_inspector().training_logs()" - ] + ], + "outputs": [], + "metadata": { + "id": "ZbRk7xvpTKQG" + } }, { "cell_type": "markdown", - "metadata": { - "id": "WynFJCEbhuF_" - }, "source": [ "Let's plot it:" - ] + ], + "metadata": { + "id": "WynFJCEbhuF_" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "xzPH7Gggh0g1" - }, - "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", @@ -712,85 +709,86 @@ "plt.ylabel(\"Logloss (out-of-bag)\")\n", "\n", "plt.show()" - ] + ], + "outputs": [], + "metadata": { + "id": "xzPH7Gggh0g1" + } }, { "cell_type": "markdown", - "metadata": { - "id": "w1xzugBRhwuN" - }, "source": [ "This dataset is small. 
You can see the model converging almost immediately.\n", "\n", "Let's use TensorBoard:" - ] + ], + "metadata": { + "id": "w1xzugBRhwuN" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "5R_m-JmvU9tu" - }, - "outputs": [], "source": [ "# This cell start TensorBoard that can be slow.\n", "# Load the TensorBoard notebook extension\n", "%load_ext tensorboard\n", "# Google internal version\n", "# %load_ext google3.learning.brain.tensorboard.notebook.extension" - ] + ], + "outputs": [], + "metadata": { + "id": "5R_m-JmvU9tu" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "j6mp7K6HWwqQ" - }, - "outputs": [], "source": [ "# Clear existing results (if any)\n", "!rm -fr \"/tmp/tensorboard_logs\"" - ] + ], + "outputs": [], + "metadata": { + "id": "j6mp7K6HWwqQ" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "16NbLILYo124" - }, - "outputs": [], "source": [ "# Export the meta-data to tensorboard.\n", "model_1.make_inspector().export_to_tensorboard(\"/tmp/tensorboard_logs\")" - ] + ], + "outputs": [], + "metadata": { + "id": "16NbLILYo124" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "TSsN6aTXW0LJ" - }, - "outputs": [], "source": [ "# docs_infra: no_execute\n", "# Start a tensorboard instance.\n", "%tensorboard --logdir \"/tmp/tensorboard_logs\"" - ] + ], + "outputs": [], + "metadata": { + "id": "TSsN6aTXW0LJ" + } }, { "cell_type": "markdown", + "source": [ + "\n" + ], "metadata": { "id": "r_tlSccjZ8kE" - }, - "source": [ - "\u003c!-- \u003cimg class=\"tfo-display-only-on-site\" src=\"images/beginner_tensorboard.png\"/\u003e --\u003e\n" - ] + } }, { "cell_type": "markdown", - "metadata": { - "id": "phTUr6F1t-_E" - }, "source": [ "## Re-train the model with a different learning algorithm\n", "\n", @@ -801,63 +799,62 @@ "\n", "The learning algorithms are listed by calling `tfdf.keras.get_all_models()` or in the\n", "[learner 
list](https://github.com/google/yggdrasil-decision-forests/manual/learners)." - ] + ], + "metadata": { + "id": "phTUr6F1t-_E" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "OwEAAzUZq2m8" - }, - "outputs": [], "source": [ "tfdf.keras.get_all_models()" - ] + ], + "outputs": [], + "metadata": { + "id": "OwEAAzUZq2m8" + } }, { "cell_type": "markdown", - "metadata": { - "id": "xmzvuI78voD4" - }, "source": [ "The description of the learning algorithms and their hyper-parameters are also available in the [API reference](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf) and builtin help:" - ] + ], + "metadata": { + "id": "xmzvuI78voD4" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "2hONToBav4DE" - }, - "outputs": [], "source": [ "# help works anywhere.\n", "help(tfdf.keras.RandomForestModel)\n", "\n", "# ? only works in ipython or notebooks, it usually opens on a separate panel.\n", "tfdf.keras.RandomForestModel?" - ] + ], + "outputs": [], + "metadata": { + "id": "2hONToBav4DE" + } }, { "cell_type": "markdown", - "metadata": { - "id": "PuWEYvXaiwhk" - }, "source": [ "## Using a subset of features\n", "\n", "The previous example did not specify the features, so all the columns were used\n", "as input feature (except for the label). The following example shows how to\n", "specify input features." 
- ] + ], + "metadata": { + "id": "PuWEYvXaiwhk" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "sgn_LnRz3M7z" - }, - "outputs": [], "source": [ "feature_1 = tfdf.keras.FeatureUsage(name=\"bill_length_mm\")\n", "feature_2 = tfdf.keras.FeatureUsage(name=\"island\")\n", @@ -874,22 +871,23 @@ "model_2.fit(x=train_ds, validation_data=test_ds)\n", "\n", "print(model_2.evaluate(test_ds, return_dict=True))" - ] + ], + "outputs": [], + "metadata": { + "id": "sgn_LnRz3M7z" + } }, { "cell_type": "markdown", - "metadata": { - "id": "zvM84cgCmbUR" - }, "source": [ "**Note:** As expected, the accuracy is lower than previously." - ] + ], + "metadata": { + "id": "zvM84cgCmbUR" + } }, { "cell_type": "markdown", - "metadata": { - "id": "MFmqpivc7x7p" - }, "source": [ "**TF-DF** attaches a **semantics** to each feature. This semantics controls how\n", "the feature is used by the model. The following semantics are currently supported:\n", @@ -915,15 +913,14 @@ "In some cases, the inferred semantics is incorrect. For example: An Enum stored as an integer is semantically categorical, but it will be detected as numerical. In this case, you should specify the semantic argument in the input. The `education_num` field of the Adult dataset is a classic example.\n", "\n", "This dataset doesn't contain such a feature. However, for the demonstration, we will make the model treat the `year` as a categorical feature:" - ] + ], + "metadata": { + "id": "MFmqpivc7x7p" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "RNRIwLYC8zrp" - }, - "outputs": [], "source": [ "%set_cell_height 300\n", "\n", @@ -937,22 +934,23 @@ "\n", "with sys_pipes():\n", " model_3.fit(x=train_ds, validation_data=test_ds)" - ] + ], + "outputs": [], + "metadata": { + "id": "RNRIwLYC8zrp" + } }, { "cell_type": "markdown", - "metadata": { - "id": "2AQaNwihcpP7" - }, "source": [ "Note that `year` is in the list of CATEGORICAL features (unlike the first run)." 
- ] + ], + "metadata": { + "id": "2AQaNwihcpP7" + } }, { "cell_type": "markdown", - "metadata": { - "id": "GYrw7nKN40Vm" - }, "source": [ "## Hyper-parameters\n", "\n", @@ -963,29 +961,28 @@ "Alternatively, you can find them on the [TensorFlow Decision Forest Github](https://github.com/tensorflow/decision-forests/keras/wrappers_pre_generated.py) or the [Yggdrasil Decision Forest documentation](https://github.com/google/yggdrasil_decision_forests/documentation/learners).\n", "\n", "The default hyper-parameters of each algorithm approximately match the initial publication paper. To ensure consistency, new features and their matching hyper-parameters are always disabled by default. That's why it is a good idea to tune your hyper-parameters." - ] + ], + "metadata": { + "id": "GYrw7nKN40Vm" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "vHgPr4Pt43hv" - }, - "outputs": [], "source": [ "# A classical but slightly more complex model.\n", "model_6 = tfdf.keras.GradientBoostedTreesModel(\n", " num_trees=500, growing_strategy=\"BEST_FIRST_GLOBAL\", max_depth=8)\n", "model_6.fit(x=train_ds)" - ] + ], + "outputs": [], + "metadata": { + "id": "vHgPr4Pt43hv" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "uECgPGDc2P4p" - }, - "outputs": [], "source": [ "# A more complex, but possibly more accurate, model.\n", "model_7 = tfdf.keras.GradientBoostedTreesModel(\n", @@ -996,58 +993,59 @@ " categorical_algorithm=\"RANDOM\",\n", " )\n", "model_7.fit(x=train_ds)" - ] + ], + "outputs": [], + "metadata": { + "id": "uECgPGDc2P4p" + } }, { "cell_type": "markdown", - "metadata": { - "id": "Xk7wEmUZu3V0" - }, "source": [ "As new training methods are published and implemented, combinations of hyper-parameters can emerge as good or almost-always-better than the default parameters. 
To avoid changing the default hyper-parameter values, these good combinations are indexed and available as hyper-parameter templates.\n", "\n", "For example, the `benchmark_rank1` template is the best combination on our internal benchmarks. Those templates are versioned to allow training configuration stability e.g. `benchmark_rank1@v1`." - ] + ], + "metadata": { + "id": "Xk7wEmUZu3V0" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "LtrRhMhj3hSu" - }, - "outputs": [], "source": [ "# A good template of hyper-parameters.\n", "model_8 = tfdf.keras.GradientBoostedTreesModel(hyperparameter_template=\"benchmark_rank1\")\n", "model_8.fit(x=train_ds)" - ] + ], + "outputs": [], + "metadata": { + "id": "LtrRhMhj3hSu" + } }, { "cell_type": "markdown", - "metadata": { - "id": "FSDXcKXB3u6M" - }, "source": [ "The available templates are listed by `predefined_hyperparameters`. Note that different learning algorithms have different templates, even if the name is similar." - ] + ], + "metadata": { + "id": "FSDXcKXB3u6M" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "MQrWI2iv37Bo" - }, - "outputs": [], "source": [ "# The hyper-parameter templates of the Gradient Boosted Tree model.\n", "print(tfdf.keras.GradientBoostedTreesModel.predefined_hyperparameters())" - ] + ], + "outputs": [], + "metadata": { + "id": "MQrWI2iv37Bo" + } }, { "cell_type": "markdown", - "metadata": { - "id": "gcX4tov1_lwp" - }, "source": [ "## Feature Preprocessing\n", "\n", @@ -1074,15 +1072,14 @@ "\n", "In the next example, pre-process the `body_mass_g` feature into `body_mass_kg = body_mass_g / 1000`. The `bill_length_mm` is consumed without pre-processing. Note that such\n", "monotonic transformations generally have no impact on decision forest models." 
- ] + ], + "metadata": { + "id": "gcX4tov1_lwp" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "tGcIvTeKAApp" - }, - "outputs": [], "source": [ "%set_cell_height 300\n", "\n", @@ -1102,25 +1099,25 @@ "model_4.fit(x=train_ds)\n", "\n", "model_4.summary()" - ] + ], + "outputs": [], + "metadata": { + "id": "tGcIvTeKAApp" + } }, { "cell_type": "markdown", - "metadata": { - "id": "h1Bx3Feyjb2o" - }, "source": [ "The following example re-implements the same logic using TensorFlow Feature\n", "Columns." - ] + ], + "metadata": { + "id": "h1Bx3Feyjb2o" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "fnwe3sBt-yJk" - }, - "outputs": [], "source": [ "def g_to_kg(x):\n", " return x / 1000\n", @@ -1135,13 +1132,14 @@ "model_5 = tfdf.keras.RandomForestModel(preprocessing=preprocessing)\n", "model_5.compile(metrics=[\"accuracy\"])\n", "model_5.fit(x=train_ds)" - ] + ], + "outputs": [], + "metadata": { + "id": "fnwe3sBt-yJk" + } }, { "cell_type": "markdown", - "metadata": { - "id": "9vif6gsAjfzv" - }, "source": [ "## Training a regression model\n", "\n", @@ -1154,32 +1152,31 @@ "\n", "**Note:** The csv file is assembled by appending UCI's header and data files. No preprocessing was applied.\n", "\n", - "\u003ccenter\u003e\n", - "\u003cimg src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/3/33/LivingAbalone.JPG/800px-LivingAbalone.JPG\" width=\"200\"/\u003e\u003c/center\u003e" - ] + "
\n", + "
" + ], + "metadata": { + "id": "9vif6gsAjfzv" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "0uKI_Uy7RyWN" - }, - "outputs": [], "source": [ "# Download the dataset.\n", "!wget -q https://storage.googleapis.com/download.tensorflow.org/data/abalone_raw.csv -O /tmp/abalone.csv\n", "\n", "dataset_df = pd.read_csv(\"/tmp/abalone.csv\")\n", "print(dataset_df.head(3))" - ] + ], + "outputs": [], + "metadata": { + "id": "0uKI_Uy7RyWN" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "_gjrquQySU7Q" - }, - "outputs": [], "source": [ "# Split the dataset into a training and testing dataset.\n", "train_ds_pd, test_ds_pd = split_dataset(dataset_df)\n", @@ -1191,15 +1188,15 @@ "\n", "train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label, task=tfdf.keras.Task.REGRESSION)\n", "test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label, task=tfdf.keras.Task.REGRESSION)" - ] + ], + "outputs": [], + "metadata": { + "id": "_gjrquQySU7Q" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "t8fUhQKISqYT" - }, - "outputs": [], "source": [ "%set_cell_height 300\n", "\n", @@ -1212,15 +1209,15 @@ "# Train the model.\n", "with sys_pipes():\n", " model_7.fit(x=train_ds)" - ] + ], + "outputs": [], + "metadata": { + "id": "t8fUhQKISqYT" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "aSriIAaMSzwA" - }, - "outputs": [], "source": [ "# Evaluate the model on the test dataset.\n", "evaluation = model_7.evaluate(test_ds, return_dict=True)\n", @@ -1229,13 +1226,14 @@ "print()\n", "print(f\"MSE: {evaluation['mse']}\")\n", "print(f\"RMSE: {math.sqrt(evaluation['mse'])}\")" - ] + ], + "outputs": [], + "metadata": { + "id": "aSriIAaMSzwA" + } }, { "cell_type": "markdown", - "metadata": { - "id": "S54mR6i9jkhp" - }, "source": [ "## Training a ranking model\n", "\n", @@ -1266,17 +1264,16 @@ "\n", "In this example, use a sample of the\n", 
"[LETOR3](https://www.microsoft.com/en-us/research/project/letor-learning-rank-information-retrieval/#!letor-3-0)\n", - "dataset. More precisely, we want to download the `OHSUMED.zip` from [the LETOR3 repo](https://onedrive.live.com/?authkey=%21ACnoZZSZVfHPJd0\u0026id=8FEADC23D838BDA8%21107\u0026cid=8FEADC23D838BDA8). This dataset is stored in the\n", + "dataset. More precisely, we want to download the `OHSUMED.zip` from [the LETOR3 repo](https://onedrive.live.com/?authkey=%21ACnoZZSZVfHPJd0&id=8FEADC23D838BDA8%21107&cid=8FEADC23D838BDA8). This dataset is stored in the\n", "libsvm format, so we will need to convert it to csv." - ] + ], + "metadata": { + "id": "S54mR6i9jkhp" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "axD6x1ZivHCS" - }, - "outputs": [], "source": [ "%set_cell_height 200\n", "\n", @@ -1286,24 +1283,24 @@ "\n", "# Path to the train and test dataset using libsvm format.\n", "raw_dataset_path = os.path.join(os.path.dirname(archive_path),\"OHSUMED/Data/All/OHSUMED.txt\")" - ] + ], + "outputs": [], + "metadata": { + "id": "axD6x1ZivHCS" + } }, { "cell_type": "markdown", - "metadata": { - "id": "rcManr98ZGID" - }, "source": [ "The dataset is stored as a .txt file in a specific format, so first convert it into a csv file." 
- ] + ], + "metadata": { + "id": "rcManr98ZGID" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "mkiM9HJox-e8" - }, - "outputs": [], "source": [ "def convert_libsvm_to_csv(src_path, dst_path):\n", " \"\"\"Converts a libsvm ranking dataset into a flat csv file.\n", @@ -1335,15 +1332,15 @@ "\n", "# Display the first 3 examples.\n", "dataset_df.head(3)" - ] + ], + "outputs": [], + "metadata": { + "id": "mkiM9HJox-e8" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "wB7bWAja1G-o" - }, - "outputs": [], "source": [ "train_ds_pd, test_ds_pd = split_dataset(dataset_df)\n", "print(\"{} examples in training, {} examples for testing.\".format(\n", @@ -1351,39 +1348,39 @@ "\n", "# Display the first 3 examples of the training dataset.\n", "train_ds_pd.head(3)" - ] + ], + "outputs": [], + "metadata": { + "id": "wB7bWAja1G-o" + } }, { "cell_type": "markdown", - "metadata": { - "id": "YQKqN9zN4L00" - }, "source": [ "In this dataset, the `relevance` defines the ground-truth rank among rows of the same `group`." 
- ] + ], + "metadata": { + "id": "YQKqN9zN4L00" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "5QMbBkCEXxu_" - }, - "outputs": [], "source": [ "# Name of the relevance and grouping columns.\n", "relevance = \"relevance\"\n", "\n", "ranking_train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=relevance, task=tfdf.keras.Task.RANKING)\n", "ranking_test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_ds_pd, label=relevance, task=tfdf.keras.Task.RANKING)" - ] + ], + "outputs": [], + "metadata": { + "id": "5QMbBkCEXxu_" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "Ba1gb75SX1rr" - }, - "outputs": [], "source": [ "%set_cell_height 400\n", "\n", @@ -1394,13 +1391,14 @@ "\n", "with sys_pipes():\n", " model_8.fit(x=ranking_train_ds)" - ] + ], + "outputs": [], + "metadata": { + "id": "Ba1gb75SX1rr" + } }, { "cell_type": "markdown", - "metadata": { - "id": "spZCfxfR3VK0" - }, "source": [ "At this point, keras does not propose any ranking metrics. Instead, the training and validation (a GBDT uses a validation dataset) are shown in the training\n", "logs. In this case the loss is `LAMBDA_MART_NDCG5`, and the final (i.e. at\n", @@ -1410,20 +1408,23 @@ "the model. 
For this reason, the loss is -NDCG.\n", "\n", "As before, the model can be analysed:" - ] + ], + "metadata": { + "id": "spZCfxfR3VK0" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "L4N1R8fM4jFh" - }, - "outputs": [], "source": [ "%set_cell_height 400\n", "\n", "model_8.summary()" - ] + ], + "outputs": [], + "metadata": { + "id": "L4N1R8fM4jFh" + } } ], "metadata": { @@ -1439,5 +1440,5 @@ } }, "nbformat": 4, - "nbformat_minor": 0 -} + "nbformat_minor": 2 +} \ No newline at end of file diff --git a/documentation/tutorials/intermediate_colab.ipynb b/documentation/tutorials/intermediate_colab.ipynb index 98dac947..2cd64414 100644 --- a/documentation/tutorials/intermediate_colab.ipynb +++ b/documentation/tutorials/intermediate_colab.ipynb @@ -2,21 +2,16 @@ "cells": [ { "cell_type": "markdown", - "metadata": { - "id": "Tce3stUlHN0L" - }, "source": [ "##### Copyright 2020 The TensorFlow Authors." - ] + ], + "metadata": { + "id": "Tce3stUlHN0L" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "cellView": "form", - "id": "tuOe1ymfHZPu" - }, - "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", @@ -29,40 +24,42 @@ "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." 
- ] + ], + "outputs": [], + "metadata": { + "cellView": "form", + "id": "tuOe1ymfHZPu" + } }, { "cell_type": "markdown", - "metadata": { - "id": "8yo62ffS5TF5" - }, "source": [ "# Using text and neural network features\n", "\n", - "\u003ctable class=\"tfo-notebook-buttons\" align=\"left\"\u003e\n", - " \u003ctd\u003e\n", - " \u003ca target=\"_blank\" href=\"https://www.tensorflow.org/decision_forests/tutorials/intermediate_colab\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" /\u003eView on TensorFlow.org\u003c/a\u003e\n", - " \u003c/td\u003e\n", - " \u003ctd\u003e\n", - " \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/decision-forests/blob/main/documentation/tutorials/intermediate_colab.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" /\u003eRun in Google Colab\u003c/a\u003e\n", - " \u003c/td\u003e\n", - " \u003ctd\u003e\n", - " \u003ca target=\"_blank\" href=\"https://github.com/tensorflow/decision-forests/blob/main/documentation/tutorials/intermediate_colab.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" /\u003eView on GitHub\u003c/a\u003e\n", - " \u003c/td\u003e\n", - " \u003ctd\u003e\n", - " \u003ca href=\"https://storage.googleapis.com/tensorflow_docs/decision-forests/documentation/tutorials/intermediate_colab.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/download_logo_32px.png\" /\u003eDownload notebook\u003c/a\u003e\n", - " \u003c/td\u003e\n", - " \u003ctd\u003e\n", - " \u003ca href=\"https://tfhub.dev/google/universal-sentence-encoder/4\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/hub_logo_32px.png\" /\u003eSee TF Hub model\u003c/a\u003e\n", - " \u003c/td\u003e\n", - "\u003c/table\u003e\n" - ] + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " View on TensorFlow.org\n", + " \n", + " Run in Google Colab\n", + " \n", + " View on GitHub\n", + " \n", + " Download notebook\n", + " \n", + " See TF Hub model\n", + "
\n" + ], + "metadata": { + "id": "8yo62ffS5TF5" + } }, { "cell_type": "markdown", - "metadata": { - "id": "zrCwCCxhiAL7" - }, "source": [ "Welcome to the **Intermediate Colab** for **TensorFlow Decision Forests** (**TF-DF**).\n", "In this colab, you will learn about some more advanced capabilities of **TF-DF**, including how to deal with natural language features.\n", @@ -76,66 +73,65 @@ "1. Train a Random Forest that consumes text features using a [TensorFlow Hub](https://www.tensorflow.org/hub) module. In this setting (transfer learning), the module is already pre-trained on a large text corpus.\n", "\n", "1. Train a Gradient Boosted Decision Trees (GBDT) and a Neural Network together. The GBDT will consume the output of the Neural Network." - ] + ], + "metadata": { + "id": "zrCwCCxhiAL7" + } }, { "cell_type": "markdown", - "metadata": { - "id": "Rzskapxq7gdo" - }, "source": [ "## Setup" - ] + ], + "metadata": { + "id": "Rzskapxq7gdo" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "mZiInVYfffAb" - }, - "outputs": [], "source": [ "# Install TensorFlow Dececision Forests\n", "!pip install tensorflow_decision_forests" - ] + ], + "outputs": [], + "metadata": { + "id": "mZiInVYfffAb" + } }, { "cell_type": "markdown", - "metadata": { - "id": "2EFndCFdoJM5" - }, "source": [ "Install [Wurlitzer](https://pypi.org/project/wurlitzer/). It can be used to show\n", "the detailed training logs. This is only needed in colabs." - ] + ], + "metadata": { + "id": "2EFndCFdoJM5" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "L06XWRdSoLj5" - }, - "outputs": [], "source": [ "!pip install wurlitzer" - ] + ], + "outputs": [], + "metadata": { + "id": "L06XWRdSoLj5" + } }, { "cell_type": "markdown", - "metadata": { - "id": "i7PlfbnxYcPf" - }, "source": [ "Import the necessary libraries." 
- ] + ], + "metadata": { + "id": "i7PlfbnxYcPf" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "RsCV2oAS7gC_" - }, - "outputs": [], "source": [ "import tensorflow_decision_forests as tfdf\n", "\n", @@ -152,25 +148,24 @@ "\n", "from IPython.core.magic import register_line_magic\n", "from IPython.display import Javascript" - ] + ], + "outputs": [], + "metadata": { + "id": "RsCV2oAS7gC_" + } }, { "cell_type": "markdown", - "metadata": { - "id": "w2fsI0y5x5i5" - }, "source": [ "The hidden code cell limits the output height in colab." - ] + ], + "metadata": { + "id": "w2fsI0y5x5i5" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "cellView": "form", - "id": "jZXB4o6Tlu0i" - }, - "outputs": [], "source": [ "#@title\n", "\n", @@ -182,13 +177,15 @@ " display(\n", " Javascript(\"google.colab.output.setIframeHeight(0, true, {maxHeight: \" +\n", " str(size) + \"})\"))" - ] + ], + "outputs": [], + "metadata": { + "cellView": "form", + "id": "jZXB4o6Tlu0i" + } }, { "cell_type": "markdown", - "metadata": { - "id": "M_D4Ft4o65XT" - }, "source": [ "## Use raw text as features\n", "\n", @@ -199,28 +196,27 @@ "In this example, you'll will train a Random Forest on the [Stanford Sentiment Treebank](https://nlp.stanford.edu/sentiment/index.html) (SST) dataset. The objective of this dataset is to classify sentences as carrying a *positive* or *negative* sentiment. You'll will use the binary classification version of the dataset curated in [TensorFlow Datasets](https://www.tensorflow.org/datasets/catalog/glue#gluesst2).\n", "\n", "**Note:** Categorical-set features can be expensive to train. In this colab, we will train a small Random Forest with 20 trees." 
- ] + ], + "metadata": { + "id": "M_D4Ft4o65XT" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "SgEiFy23j14S" - }, - "outputs": [], "source": [ "# Install the nighly TensorFlow Datasets package\n", "# TODO: Remove when the release package is fixed.\n", "!pip install tfds-nightly -U --quiet" - ] + ], + "outputs": [], + "metadata": { + "id": "SgEiFy23j14S" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "uVN-j0E4Q1T3" - }, - "outputs": [], "source": [ "# Load the dataset\n", "import tensorflow_datasets as tfds\n", @@ -229,33 +225,33 @@ "# Display the first 3 examples of the test fold.\n", "for example in all_ds[\"test\"].take(3):\n", " print({attr_name: attr_tensor.numpy() for attr_name, attr_tensor in example.items()})" - ] + ], + "outputs": [], + "metadata": { + "id": "uVN-j0E4Q1T3" + } }, { "cell_type": "markdown", - "metadata": { - "id": "UHiQUWE2XDYN" - }, "source": [ "The dataset is modified as follows:\n", "\n", "1. The raw labels are integers in `{-1, 1}`, but the learning algorithm expects positive integer labels e.g. `{0, 1}`. Therefore, the labels are transformed as follows: `new_labels = (original_labels + 1) / 2`.\n", "1. A batch-size of 64 is applied to make reading the dataset more efficient.\n", - "1. The `sentence` attribute needs to be tokenized, i.e. `\"hello world\" -\u003e [\"hello\", \"world\"]`.\n", + "1. The `sentence` attribute needs to be tokenized, i.e. `\"hello world\" -> [\"hello\", \"world\"]`.\n", "\n", "\n", "**Note:** This example doesn't use the `test` split of the dataset as it does not have labels. If `test` split had labels, you could concatenate the `validation` fold into the `train` one (e.g. `all_ds[\"train\"].concatenate(all_ds[\"validation\"])`).\n", "\n", "**Details:** Some decision forest learning algorithms do not need a validation dataset (e.g. Random Forests) while others do (e.g. Gradient Boosted Trees in some cases). 
Since each learning algorithm under TF-DF can use validation data differently, TF-DF handles train/validation splits internally. As a result, when you have a training and validation sets, they can always be concatenated as input to the learning algorithm." - ] + ], + "metadata": { + "id": "UHiQUWE2XDYN" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "yqYDKTKdSPYw" - }, - "outputs": [], "source": [ "def prepare_dataset(example):\n", " label = (example[\"label\"] + 1) // 2\n", @@ -263,24 +259,24 @@ "\n", "train_ds = all_ds[\"train\"].batch(64).map(prepare_dataset)\n", "test_ds = all_ds[\"validation\"].batch(64).map(prepare_dataset)" - ] + ], + "outputs": [], + "metadata": { + "id": "yqYDKTKdSPYw" + } }, { "cell_type": "markdown", - "metadata": { - "id": "YYkIjROI9w43" - }, "source": [ "Finaly, train and evaluate the model as usual. TF-DF automatically detects multi-valued categorical features as categorical-set.\n" - ] + ], + "metadata": { + "id": "YYkIjROI9w43" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "mpxTtYo39wYZ" - }, - "outputs": [], "source": [ "%set_cell_height 300\n", "\n", @@ -293,49 +289,49 @@ "# Train the model.\n", "with sys_pipes():\n", " model_1.fit(x=train_ds)" - ] + ], + "outputs": [], + "metadata": { + "id": "mpxTtYo39wYZ" + } }, { "cell_type": "markdown", - "metadata": { - "id": "D9FMFGzwiHCt" - }, "source": [ "In the previous logs, note that `sentence` is a `CATEGORICAL_SET` feature.\n", "\n", "The model is evaluated as usual:" - ] + ], + "metadata": { + "id": "D9FMFGzwiHCt" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "cpf-wHl094S1" - }, - "outputs": [], "source": [ "evaluation = model_1.evaluate(test_ds)\n", "\n", "print(f\"BinaryCrossentropyloss: {evaluation[0]}\")\n", "print(f\"Accuracy: {evaluation[1]}\")" - ] + ], + "outputs": [], + "metadata": { + "id": "cpf-wHl094S1" + } }, { "cell_type": "markdown", - "metadata": { - "id": "YliBX4GtjncQ" 
- }, "source": [ "The training logs look as follows:" - ] + ], + "metadata": { + "id": "YliBX4GtjncQ" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "OnTTtBNmjpo7" - }, - "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", @@ -344,22 +340,23 @@ "plt.xlabel(\"Number of trees\")\n", "plt.ylabel(\"Out-of-bag accuracy\")\n", "pass" - ] + ], + "outputs": [], + "metadata": { + "id": "OnTTtBNmjpo7" + } }, { "cell_type": "markdown", - "metadata": { - "id": "d4qJ0ig3kgic" - }, "source": [ "More trees would probably be beneficial (I am sure of it because I tried :p)." - ] + ], + "metadata": { + "id": "d4qJ0ig3kgic" + } }, { "cell_type": "markdown", - "metadata": { - "id": "Iil_oyOhCNx6" - }, "source": [ "## Use a pretrained text embedding\n", "\n", @@ -377,35 +374,34 @@ "The second option is often preferable: Packaging the embedding in the model makes the model easier to use (and harder to misuse).\n", "\n", "First install TF-Hub:" - ] + ], + "metadata": { + "id": "Iil_oyOhCNx6" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "QfYGXim_DskC" - }, - "outputs": [], "source": [ "!pip install --upgrade tensorflow-hub" - ] + ], + "outputs": [], + "metadata": { + "id": "QfYGXim_DskC" + } }, { "cell_type": "markdown", - "metadata": { - "id": "kNSEhJgjEXww" - }, "source": [ "Unlike before, you don't need to tokenize the text." 
- ] + ], + "metadata": { + "id": "kNSEhJgjEXww" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "pS5SYqoScbOc" - }, - "outputs": [], "source": [ "def prepare_dataset(example):\n", " label = (example[\"label\"] + 1) // 2\n", @@ -413,15 +409,15 @@ "\n", "train_ds = all_ds[\"train\"].batch(64).map(prepare_dataset)\n", "test_ds = all_ds[\"validation\"].batch(64).map(prepare_dataset)\n" - ] + ], + "outputs": [], + "metadata": { + "id": "pS5SYqoScbOc" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "zHEsd8q_ESpC" - }, - "outputs": [], "source": [ "%set_cell_height 300\n", "\n", @@ -444,48 +440,49 @@ "\n", "with sys_pipes():\n", " model_2.fit(x=train_ds)" - ] + ], + "outputs": [], + "metadata": { + "id": "zHEsd8q_ESpC" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "xPLoDqiFKY18" - }, - "outputs": [], "source": [ "evaluation = model_2.evaluate(test_ds)\n", "\n", "print(f\"BinaryCrossentropyloss: {evaluation[0]}\")\n", "print(f\"Accuracy: {evaluation[1]}\")" - ] + ], + "outputs": [], + "metadata": { + "id": "xPLoDqiFKY18" + } }, { "cell_type": "markdown", - "metadata": { - "id": "WPsD3LyaMLHm" - }, "source": [ "Note that categorical sets represent text differently from a dense embedding, so it may be useful to use both strategies jointly." - ] + ], + "metadata": { + "id": "WPsD3LyaMLHm" + } }, { "cell_type": "markdown", - "metadata": { - "id": "37AGJamzboZQ" - }, "source": [ "## Train a decision tree and neural network together\n", "\n", "The previous example used a pre-trained Neural Network (NN) to \n", "process the text features before passing them to the Random Forest. 
This example will train both the Neural Network and the Random Forest from scratch.\n" - ] + ], + "metadata": { + "id": "37AGJamzboZQ" + } }, { "cell_type": "markdown", - "metadata": { - "id": "YJIxGwwzMkFl" - }, "source": [ "TF-DF's Decision Forests do not back-propagate gradients ([although this is the subject of ongoing research](https://arxiv.org/abs/2007.14761)). Therefore, the training happens in two stages:\n", "\n", @@ -503,79 +500,78 @@ "*: Training.\n", "```\n", "\n" - ] + ], + "metadata": { + "id": "YJIxGwwzMkFl" + } }, { "cell_type": "markdown", - "metadata": { - "id": "YSIvuAhzbjWO" - }, "source": [ "### Prepare the dataset\n", "\n", "This example uses the [Palmer's Penguins](https://allisonhorst.github.io/palmerpenguins/articles/intro.html) dataset. See the [Beginner colab](beginner_colab.ipynb) for details." - ] + ], + "metadata": { + "id": "YSIvuAhzbjWO" + } }, { "cell_type": "markdown", - "metadata": { - "id": "InUot_K2b3Mz" - }, "source": [ "First, download the raw data:" - ] + ], + "metadata": { + "id": "InUot_K2b3Mz" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "rNyaeCx0b1be" - }, - "outputs": [], "source": [ "!wget -q https://storage.googleapis.com/download.tensorflow.org/data/palmer_penguins/penguins.csv -O /tmp/penguins.csv" - ] + ], + "outputs": [], + "metadata": { + "id": "rNyaeCx0b1be" + } }, { "cell_type": "markdown", - "metadata": { - "id": "pNPZzQekb9z_" - }, "source": [ "Load a dataset into a Pandas Dataframe." - ] + ], + "metadata": { + "id": "pNPZzQekb9z_" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "9lA3peQ4sa9a" - }, - "outputs": [], "source": [ "dataset_df = pd.read_csv(\"/tmp/penguins.csv\")\n", "\n", "# Display the first 3 examples.\n", "dataset_df.head(3)" - ] + ], + "outputs": [], + "metadata": { + "id": "9lA3peQ4sa9a" + } }, { "cell_type": "markdown", - "metadata": { - "id": "v-_SZpRWcAoX" - }, "source": [ "\n", "Prepare the dataset for training." 
- ] + ], + "metadata": { + "id": "v-_SZpRWcAoX" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "rtyi8UoqtzhM" - }, - "outputs": [], "source": [ "label = \"species\"\n", "\n", @@ -584,21 +580,21 @@ "for col in dataset_df.columns:\n", " if dataset_df[col].dtype not in [str, object]:\n", " dataset_df[col] = dataset_df[col].fillna(0)" - ] + ], + "outputs": [], + "metadata": { + "id": "rtyi8UoqtzhM" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "GKrW5Yfjso0k" - }, - "outputs": [], "source": [ "# Split the dataset into a training and testing dataset.\n", "\n", "def split_dataset(dataset, test_ratio=0.30):\n", " \"\"\"Splits a panda dataframe in two.\"\"\"\n", - " test_indices = np.random.rand(len(dataset)) \u003c test_ratio\n", + " test_indices = np.random.rand(len(dataset)) < test_ratio\n", " return dataset[~test_indices], dataset[test_indices]\n", "\n", "train_ds_pd, test_ds_pd = split_dataset(dataset_df)\n", @@ -608,51 +604,51 @@ "# Convert the datasets into tensorflow datasets\n", "train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label)\n", "test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_ds_pd, label=label)" - ] + ], + "outputs": [], + "metadata": { + "id": "GKrW5Yfjso0k" + } }, { "cell_type": "markdown", - "metadata": { - "id": "ore7f6tgcOMh" - }, "source": [ "### Build the models\n", "\n", "Next create the neural network model using [Keras' functional style](https://www.tensorflow.org/guide/keras/functional). \n", "\n", "To keep the example simple this model only uses two inputs." 
- ] + ], + "metadata": { + "id": "ore7f6tgcOMh" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "S1Jfe4YteBqY" - }, - "outputs": [], "source": [ "input_1 = tf.keras.Input(shape=(1,), name=\"bill_length_mm\", dtype=\"float\")\n", "input_2 = tf.keras.Input(shape=(1,), name=\"island\", dtype=\"string\")\n", "\n", "nn_raw_inputs = [input_1, input_2]" - ] + ], + "outputs": [], + "metadata": { + "id": "S1Jfe4YteBqY" + } }, { "cell_type": "markdown", + "source": [ + "Use [`experimental.preprocessing` layers](https://www.tensorflow.org/guide/keras/preprocessing_layers) to convert the raw inputs to inputs apropriate for the neural network. " + ], "metadata": { "id": "ZjlvAUNGeDM8" - }, - "source": [ - "Use [`experimental.preprocessing` layers](https://www.tensorflow.org/guide/keras/preprocessing_layers) to convert the raw inputs to inputs apropriate for the neural netrwork. " - ] + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "9Q09Nkp6ei21" - }, - "outputs": [], "source": [ "# Normalization.\n", "Normalization = tf.keras.layers.experimental.preprocessing.Normalization\n", @@ -673,24 +669,24 @@ "normalized_input_2 = input_2_onehot(input_2_indexer(input_2))\n", "\n", "nn_processed_inputs = [normalized_input_1, normalized_input_2]" - ] + ], + "outputs": [], + "metadata": { + "id": "9Q09Nkp6ei21" + } }, { "cell_type": "markdown", - "metadata": { - "id": "ZCoQljyhelau" - }, "source": [ "Build the body of the neural network:" - ] + ], + "metadata": { + "id": "ZCoQljyhelau" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "KzocgbYNsH6y" - }, - "outputs": [], "source": [ "y = tf.keras.layers.Concatenate()(nn_processed_inputs)\n", "y = tf.keras.layers.Dense(16, activation=tf.nn.relu6)(y)\n", @@ -701,51 +697,51 @@ "classification_output = tf.keras.layers.Dense(3)(y)\n", "\n", "nn_model = tf.keras.models.Model(nn_raw_inputs, classification_output)" - ] + ], + "outputs": [], + "metadata": { + 
"id": "KzocgbYNsH6y" + } }, { "cell_type": "markdown", - "metadata": { - "id": "zPbRKf1CfIrj" - }, "source": [ "This `nn_model` directly produces classification logits. \n", "\n", "Next create a decision forest model. This will operate on the high level features that the neural network extracts in the last layer before that classification head." - ] + ], + "metadata": { + "id": "zPbRKf1CfIrj" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "7fnpGNyTuXvH" - }, - "outputs": [], "source": [ "# To reduce the risk of mistakes, group both the decision forest and the\n", "# neural network in a single keras model.\n", "nn_without_head = tf.keras.models.Model(inputs=nn_model.inputs, outputs=last_layer)\n", "df_and_nn_model = tfdf.keras.RandomForestModel(preprocessing=nn_without_head)" - ] + ], + "outputs": [], + "metadata": { + "id": "7fnpGNyTuXvH" + } }, { "cell_type": "markdown", - "metadata": { - "id": "trq07lvMudlz" - }, "source": [ "### Train and evaluate the models\n", "\n", "The model will be trained in two stages. First train the neural network with its own classification head:" - ] + ], + "metadata": { + "id": "trq07lvMudlz" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "h4OyUWKiupuF" - }, - "outputs": [], "source": [ "%set_cell_height 300\n", "\n", @@ -756,71 +752,75 @@ "\n", "nn_model.fit(x=train_ds, validation_data=test_ds, epochs=10)\n", "nn_model.summary()" - ] + ], + "outputs": [], + "metadata": { + "id": "h4OyUWKiupuF" + } }, { "cell_type": "markdown", - "metadata": { - "id": "N2mgMZOpgMQp" - }, "source": [ "The neural network layers are shared between the two models. 
So now that the neural network is trained the decision forest model will be fit to the trained output of the neural network layers:" - ] + ], + "metadata": { + "id": "N2mgMZOpgMQp" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "JAc9niXqud7V" - }, - "outputs": [], "source": [ "%set_cell_height 300\n", "\n", "df_and_nn_model.compile(metrics=[\"accuracy\"])\n", "with sys_pipes():\n", " df_and_nn_model.fit(x=train_ds)" - ] + ], + "outputs": [], + "metadata": { + "id": "JAc9niXqud7V" + } }, { "cell_type": "markdown", - "metadata": { - "id": "HF8Ru2HSv1a5" - }, "source": [ "Now evaluate the composed model:" - ] + ], + "metadata": { + "id": "HF8Ru2HSv1a5" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "EPMlcObzuw89" - }, - "outputs": [], "source": [ "print(\"Evaluation:\", df_and_nn_model.evaluate(test_ds))" - ] + ], + "outputs": [], + "metadata": { + "id": "EPMlcObzuw89" + } }, { "cell_type": "markdown", - "metadata": { - "id": "awiHEznlv5sI" - }, "source": [ "Compare it to the Neural Network alone:" - ] + ], + "metadata": { + "id": "awiHEznlv5sI" + } }, { "cell_type": "code", "execution_count": null, - "metadata": { - "id": "--ompWYTvxM-" - }, - "outputs": [], "source": [ "print(\"Evaluation :\", nn_model.evaluate(test_ds))" - ] + ], + "outputs": [], + "metadata": { + "id": "--ompWYTvxM-" + } } ], "metadata": { @@ -836,5 +836,5 @@ } }, "nbformat": 4, - "nbformat_minor": 0 -} + "nbformat_minor": 2 +} \ No newline at end of file