diff --git a/documentation/tutorials/kaggle_beginner_example_classification.ipynb b/documentation/tutorials/kaggle_beginner_example_classification.ipynb new file mode 100644 index 00000000..bd796f04 --- /dev/null +++ b/documentation/tutorials/kaggle_beginner_example_classification.ipynb @@ -0,0 +1,1804 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Copyright 2022 The TensorFlow Authors." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MDBzBKC_pnXl" + }, + "source": [ + "# Structured Data Classification using TFDF\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " View on TensorFlow.org\n", + " \n", + " Run in Google Colab\n", + " \n", + " View on GitHub\n", + " \n", + " Download notebook\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MA9_xqWRpqZU" + }, + "source": [ + "## Introduction\n", + "\n", + "[TensorFlow Decision Forests](https://www.tensorflow.org/decision_forests)\n", + "is a collection of state-of-the-art algorithms of Decision Forest models\n", + "that are compatible with [Keras APIs](https://www.tensorflow.org/api_docs/python/tf/keras)\n", + ".\n", + "The models include [Random Forests](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/RandomForestModel),\n", + "[Gradient Boosted Trees](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/GradientBoostedTreesModel),\n", + "and [CART](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/CartModel),\n", + "and can be used for regression, classification, and ranking tasks.\n", + "For an introduction to [TFDF](https://www.tensorflow.org/decision_forests) without Kaggle, please refer to this [tutorial](https://www.tensorflow.org/decision_forests/tutorials/beginner_colab).\n", + "Decision Forests are a family of tree-based models including Random Forests and Gradient Boosted Trees. They are the best place to start when working with tabular data, and will often outperform neural networks.\n", + "\n", + "In this example we will use TensorFlow to train each of these on a dataset you load from a CSV file. This is a common pattern in practice. 
Roughly, your code will look as follows:\n", + "\n", + "```\n", + "import tensorflow_decision_forests as tfdf\n", + "import pandas as pd\n", + " \n", + "dataset = pd.read_csv(\"project/dataset.csv\")\n", + "tf_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(dataset, label=\"my_label\", task=tfdf.keras.Task.CLASSIFICATION)\n", + "\n", + "model = tfdf.keras.RandomForestModel()\n", + "model.fit(tf_dataset)\n", + " \n", + "model.summary()\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mrTx_bPrtd17" + }, + "source": [ + "### Setup" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dl6_Mdy7sUC7" + }, + "source": [ + "#### Install TensorFlow Decision Forests" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "lSDtXxIDseKq" + }, + "outputs": [], + "source": [ + "!pip install tensorflow_decision_forests --quiet" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Zr0eiHcyvG1m" + }, + "source": [ + "#### Import the library" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "IA1LNshiumEA" + }, + "outputs": [], + "source": [ + "# Scientific computing # \n", + "import numpy as np # NumPy Documentation - https://numpy.org/doc/stable/ \n", + "\n", + "# - Data processing - #\n", + "import pandas as pd # Pandas Documentation - https://pandas.pydata.org/docs/\n", + "\n", + "# ---- TensorFlow ---- #\n", + "import tensorflow as tf\n", + "import tensorflow_decision_forests as tfdf" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "CjdtV-KWvcWA" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TensorFlow v2.9.1\n", + "TensorFlow Decision Forests v0.2.7\n" + ] + } + ], + "source": [ + "print(\"TensorFlow v\" + tf.__version__)\n", + "print(\"TensorFlow Decision Forests v\" + tfdf.__version__)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vzX1j4YewLSr" + }, + 
"source": [ + "### Download the Titanic dataset\n", + "The [Titanic dataset](https://www.kaggle.com/competitions/titanic/overview/description) is an example of a binary classification problem in supervised learning. We are classifying the outcome of the passengers as either one of two classes, survived or did not survive the Titanic." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_vmqf8o60D37" + }, + "source": [ + "To run this notebook, you need to have a Kaggle account.\n", + "\n", + "If you do not have an account, you can create one here: [Kaggle Register](https://www.kaggle.com/account/login?phase=startRegisterTab&returnUrl=%2F) \n", + "\n", + "In order to get a token to use in the following cell, check out the [Authentication Section](https://www.kaggle.com/docs/api#authentication) of Kaggle API documentation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "ekAcuqTFvt3p" + }, + "outputs": [], + "source": [ + "#@title Enter your Kaggle token in order to fetch the dataset\n", + "\n", + "username = '' #@param {type:\"string\"}\n", + "key = '' #@param {type: \"string\"}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "JKyVN-lC0HOC" + }, + "outputs": [], + "source": [ + "#@title Configure Kaggle\n", + "try:\n", + " from google.colab import files, drive\n", + "\n", + " # Install and Configure Kaggle\n", + " import json\n", + "\n", + " token = {\n", + " \"username\":username,\n", + " \"key\":key\n", + " }\n", + "\n", + " # Installing kaggle\n", + " !pip install kaggle &> /dev/null\n", + "\n", + " # Creating .kaggle if necessary\n", + " !if [ -d .kaggle ]; then echo \".kaggle exists\"; else echo \".kaggle does not exist ... 
Creating it\"; mkdir .kaggle; if [ -d .kaggle ]; then echo \"Successfully created\"; else echo \"Error creating .kaggle\"; fi; fi\n", + "\n", + " with open('/content/.kaggle/kaggle.json', 'w') as file:\n", + " json.dump(token, file)\n", + "\n", + " # Creating .kaggle if necessary\n", + " !if [ -d ~/.kaggle ]; then echo \" ~/.kaggle exists\"; else echo \" ~/.kaggle does not exist ... Creating it\"; mkdir ~/.kaggle; if [ -d ~/.kaggle ]; then echo \"Successfully created\"; else echo \"Error creating ~/.kaggle\"; fi; fi\n", + " !cp /content/.kaggle/kaggle.json ~/.kaggle/kaggle.json\n", + "\n", + " # kaggle configuration\n", + " !kaggle config set -n path -v{/content}\n", + "\n", + " # Changing mode\n", + " !chmod 600 /root/.kaggle/kaggle.json\n", + "except Exception:\n", + " pass" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "J501lMS40lUR" + }, + "outputs": [], + "source": [ + "#@title Download Dataset\n", + "import os\n", + "\n", + "DOWNLOAD_LOCATION = \"/root/Downloads/\"\n", + "\n", + "if os.path.exists(DOWNLOAD_LOCATION):\n", + " if os.path.isdir(DOWNLOAD_LOCATION):\n", + " print(\"{} exists and is a directory\".format(DOWNLOAD_LOCATION))\n", + " else:\n", + " print(\"{} exists but is not a directory!!!\".format(DOWNLOAD_LOCATION))\n", + "else:\n", + " print(\"{} does not exist ... Creating it\".format(DOWNLOAD_LOCATION))\n", + " os.makedirs(DOWNLOAD_LOCATION)\n", + "\n", + "# Downloading\n", + "!kaggle competitions download -c titanic -p {DOWNLOAD_LOCATION}\n", + "\n", + "# Extracting archives\n", + "!cd {DOWNLOAD_LOCATION}; unzip -qq \\*.zip; rm -f *.zip" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PCFSJUjl2fuT" + }, + "source": [ + "## Load the dataset\n", + "Note: Pandas is practical as you don't have to type in name of the input features to load them. 
For larger datasets (>1M examples), reading the files with a [TensorFlow Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) may be better suited." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "18QhsN2L16wH" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Full train dataset shape is (891, 12)\n" + ] + } + ], + "source": [ + "train_file_path = os.path.join(DOWNLOAD_LOCATION, \"train.csv\")\n", + "train_full_data = pd.read_csv(train_file_path)\n", + "print(\"Full train dataset shape is {}\".format(train_full_data.shape))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J8KqgoL95mhw" + }, + "source": [ + "The data is composed of 12 columns and 891 entries. We can see all 12 columns of our dataset by printing the first 3 entries using the following code: \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "v4rywCtW2pfK" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
\n", + "
\n", + " \n", + " \n", + " \n", + "\n", + " \n", + "
\n", + "
\n", + " " + ], + "text/plain": [ + " PassengerId Survived Pclass \\\n", + "0 1 0 3 \n", + "1 2 1 1 \n", + "2 3 1 3 \n", + "\n", + " Name Sex Age SibSp \\\n", + "0 Braund, Mr. Owen Harris male 22.0 1 \n", + "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", + "2 Heikkinen, Miss. Laina female 26.0 0 \n", + "\n", + " Parch Ticket Fare Cabin Embarked \n", + "0 0 A/5 21171 7.2500 NaN S \n", + "1 0 PC 17599 71.2833 C85 C \n", + "2 0 STON/O2. 3101282 7.9250 NaN S " + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "train_full_data.head(3)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IYqyU_6MOrH8" + }, + "source": [ + "* 8 feature columns named `Pclass, Sex, Age, SibSp, Parch, Fare, Cabin, Embarked`.\n", + "* Label column named `Survived`.\n", + "* We will drop the following unnecessary columns : `PassengerId`, `Name` and `Ticket`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "shj-eSteOqPE" + }, + "outputs": [], + "source": [ + "train_full_data = train_full_data.drop(['PassengerId', 'Name', 'Ticket'], axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "SYvo-ty6QiHN" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
SurvivedPclassSexAgeSibSpParchFareCabinEmbarked
003male22.0107.2500NaNS
111female38.01071.2833C85C
213female26.0007.9250NaNS
\n", + "
\n", + " \n", + " \n", + " \n", + "\n", + " \n", + "
\n", + "
\n", + " " + ], + "text/plain": [ + " Survived Pclass Sex Age SibSp Parch Fare Cabin Embarked\n", + "0 0 3 male 22.0 1 0 7.2500 NaN S\n", + "1 1 1 female 38.0 1 0 71.2833 C85 C\n", + "2 1 3 female 26.0 0 0 7.9250 NaN S" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "train_full_data.head(3)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Qs070SbkMJix" + }, + "source": [ + "Refer to [Kaggle](https://www.kaggle.com/competitions/titanic/data) for a comprehensive guide to the data." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cwdbYZeTJP89" + }, + "source": [ + "## Exploratory Data Analysis (EDA)\n", + "Data scientists use exploratory analysis techniques to analyze and visualize large datasets. This process helps them identify the main characteristics of their data sets and develop effective strategies to get the answers they need. It can also help them spot anomalies and test hypotheses.\n", + "\n", + "For this dataset, there are some amazing notebooks already available on Kaggle. One of them is [EDA is fun](https://www.kaggle.com/code/prashant111/eda-is-fun#EDA-is-fun) by Prashant Banerjee." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2EpAa_q55Ke8" + }, + "source": [ + "## Prepare the dataset\n", + "This dataset contains a mix of numeric, categorical and missing features. TF-DF supports all these feature types natively, and no preprocessing is required. This is one advantage of tree-based models; making them a great entry point to TensorFlow and ML." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uQMX8Md3ISq0" + }, + "source": [ + "Convert the values stored in the `Survived` column to a list of values, where the list does not allow for duplicates. `Survived` has one of two values, 0 or 1." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "YmrDp4SL7hTw" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Label classes: [0, 1]\n" + ] + } + ], + "source": [ + "label=\"Survived\"\n", + "classes = train_full_data[label].unique().tolist()\n", + "print(f\"Label classes: {classes}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0NGJhK0R58Oa" + }, + "source": [ + "Split the dataset into training and validation:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "CW3ofmmI5xIr" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "611 examples in training, 280 examples in validation.\n" + ] + } + ], + "source": [ + "def split_dataset(dataset, test_ratio=0.30):\n", + " test_indices = np.random.rand(len(dataset)) < test_ratio\n", + " return dataset[~test_indices], dataset[test_indices]\n", + "\n", + "train_ds_pd, val_ds_pd = split_dataset(train_full_data)\n", + "print(\"{} examples in training, {} examples in validation.\".format(\n", + " len(train_ds_pd), len(val_ds_pd)))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "I0ZrYmer6tMp" + }, + "source": [ + "There's one more step required before you can train your model. You need to convert from Pandas format (`pd.DataFrame`) into TensorFlow format (`tf.data.Dataset`). A single-line helper function does this for you: \n", + "\n", + "```\n", + "tfdf.keras.pd_dataframe_to_tf_dataset(your_df, label='your_label', task=tfdf.keras.Task.CLASSIFICATION)\n", + "```\n", + "\n", + "`tf.data` is a high-[performance](https://www.tensorflow.org/guide/data_performance) data loading library, which is helpful when training neural networks with accelerators like [GPUs](https://cloud.google.com/gpu) and [TPUs](https://cloud.google.com/tpu). 
It is not necessary for tree-based models until you begin to do distributed training.\n", + "\n", + "Note that tf.data is a bit tricky to use, and has a learning curve. There are guides on [tensorflow.org/guide](https://www.tensorflow.org/guide) to help." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "DyAHpZ0R6B5R" + }, + "outputs": [], + "source": [ + "train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(\n", + " train_ds_pd, \n", + " label = label, \n", + " task = tfdf.keras.Task.CLASSIFICATION)\n", + "\n", + "val_ds = tfdf.keras.pd_dataframe_to_tf_dataset(\n", + " val_ds_pd, \n", + " label = label, \n", + " task = tfdf.keras.Task.CLASSIFICATION)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3m46QYDz8IB4" + }, + "source": [ + "## Create and train a Random Forest model " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "11yxinBK78qU" + }, + "outputs": [], + "source": [ + "model = tfdf.keras.RandomForestModel(task = tfdf.keras.Task.CLASSIFICATION)\n", + "model.compile(metrics=[\"accuracy\"]) # Optional, you can use this to include a list of eval metrics" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "_tQfGooA8OI2" + }, + "outputs": [], + "source": [ + "model.fit(x=train_ds)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YoqPROtT9A33" + }, + "source": [ + "## Visualize your model\n", + "One benefit of tree-based models is that you can easily visualize them. The default number of trees used in the Random Forest is 300. You can select a tree to display below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Cwv7-NXc8WUq" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "
\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "tfdf.model_plotter.plot_model_in_colab(model, tree_idx=0, max_depth=3)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RtGEzEGU9FsI" + }, + "source": [ + "## Evaluate the model on OOB data and the validation dataset\n", + "\n", + "Let's plot accuracy on OOB evaluation dataset as a function of the number of trees in the forest. One of the nice features about this particular hyperparameter is that larger values are usually better, and come with little risk aside from slowing down training." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "4nOZy6lX9CwJ" + }, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "import matplotlib.pyplot as plt\n", + "logs = model.make_inspector().training_logs()\n", + "plt.plot([log.num_trees for log in logs], [log.evaluation.accuracy for log in logs])\n", + "plt.xlabel(\"Number of trees\")\n", + "plt.ylabel(\"Accuracy (out-of-bag)\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KWlw6i0U9UcE" + }, + "source": [ + "You can also see some general stats on the OOB dataset:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "_nEjaF9Y9NjF" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "Evaluation(num_examples=611, accuracy=0.806873977086743, loss=0.7393123309627944, rmse=None, ndcg=None, aucs=None, auuc=None, qini=None)" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "inspector = model.make_inspector()\n", + "inspector.evaluation()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "W8OwVx569bbU" + }, + "source": [ + "Now, let's run an evaluation using the test data." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "KyH_XC1d9X9x" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1/1 [==============================] - 1s 958ms/step - loss: 0.0000e+00 - accuracy: 0.8679\n", + "loss: 0.0000\n", + "accuracy: 0.8679\n" + ] + } + ], + "source": [ + "evaluation = model.evaluate(x=val_ds, return_dict=True)\n", + "\n", + "for name, value in evaluation.items():\n", + " print(f\"{name}: {value:.4f}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TK0l4Qgxbwcq" + }, + "source": [ + "## Test Set Prediction\n", + "Now we will run predictions on `test.csv`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "BbC05cBKcTQ5" + }, + "outputs": [], + "source": [ + "test_file_path = os.path.join(DOWNLOAD_LOCATION, \"test.csv\")\n", + "test_data = pd.read_csv(test_file_path)\n", + "ids = test_data.pop('PassengerId')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "OUDeYu2zcYrk" + }, + "outputs": [], + "source": [ + "test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(\n", + " test_data, \n", + " task = tfdf.keras.Task.CLASSIFICATION)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4l13fn1seaHj" + }, + "source": [ + "Since the prediction can be either 0 (did not survive) or 1 (survived), let's convert the predicted float value to a binary value." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "8tY1evHAcoTH" + }, + "outputs": [], + "source": [ + "preds = model.predict(test_ds)\n", + "preds = preds >= 0.5\n", + "preds = preds.astype('int')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Jxtj1lp6csVQ" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
PassengerIdSurvived
08920
18930
28940
38950
48960
\n", + "
\n", + " \n", + " \n", + " \n", + "\n", + " \n", + "
\n", + "
\n", + " " + ], + "text/plain": [ + " PassengerId Survived\n", + "0 892 0\n", + "1 893 0\n", + "2 894 0\n", + "3 895 0\n", + "4 896 0" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "output = pd.DataFrame({'PassengerId': ids,\n", + " 'Survived': preds.squeeze()})\n", + "\n", + "output.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yZOhWNRcpM-G" + }, + "source": [ + "You can download the predicted output as a CSV file and do submission on the [Competition page](https://www.kaggle.com/competitions/titanic/submit) on Kaggle." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "j8FquAuufmtS" + }, + "outputs": [], + "source": [ + "output_filename = \"test_prediction_output.csv\"\n", + "output.to_csv(output_filename, index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "2ary3LNoffRA" + }, + "outputs": [ + { + "data": { + "application/javascript": [ + "\n", + " async function download(id, filename, size) {\n", + " if (!google.colab.kernel.accessAllowed) {\n", + " return;\n", + " }\n", + " const div = document.createElement('div');\n", + " const label = document.createElement('label');\n", + " label.textContent = `Downloading \"${filename}\": `;\n", + " div.appendChild(label);\n", + " const progress = document.createElement('progress');\n", + " progress.max = size;\n", + " div.appendChild(progress);\n", + " document.body.appendChild(div);\n", + "\n", + " const buffers = [];\n", + " let downloaded = 0;\n", + "\n", + " const channel = await google.colab.kernel.comms.open(id);\n", + " // Send a message to notify the kernel that we're ready.\n", + " channel.send({})\n", + "\n", + " for await (const message of channel.messages) {\n", + " // Send a message to notify the kernel that we're ready.\n", + " channel.send({})\n", + " if (message.buffers) {\n", + " for (const buffer of message.buffers) {\n", + " 
buffers.push(buffer);\n", + " downloaded += buffer.byteLength;\n", + " progress.value = downloaded;\n", + " }\n", + " }\n", + " }\n", + " const blob = new Blob(buffers, {type: 'application/binary'});\n", + " const a = document.createElement('a');\n", + " a.href = window.URL.createObjectURL(blob);\n", + " a.download = filename;\n", + " div.appendChild(a);\n", + " a.click();\n", + " div.remove();\n", + " }\n", + " " + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/javascript": [ + "download(\"download_1a8bee3f-c06f-4c3e-8d5a-94a8e2980331\", \"test_prediction_output.csv\", 2839)" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from google.colab import files\n", + "files.download('test_prediction_output.csv')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Lh3gxL9OHKbD" + }, + "source": [ + "# References\n", + "* Dive deep into \n", + " * [Random Forests](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/RandomForestModel)\n", + " * [Gradient Boosted Trees](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/GradientBoostedTreesModel)\n", + " * [CART](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/CartModel)\n", + " * [Keras API](https://www.tensorflow.org/api_docs/python/tf/keras)\n", + " * [TensorFlow Decision Forests (TF-DF)](https://www.tensorflow.org/decision_forests).\n", + "* [EDA is fun](https://www.kaggle.com/code/prashant111/eda-is-fun#EDA-is-fun) by Prashant Banerjee.\n", + "* TensorFlow Decision Forests tutorials which are a set of 3 very interesting tutorials.\n", + " * [Beginner Tutorial](https://www.tensorflow.org/decision_forests/tutorials/beginner_colab)\n", + " * [Intermediate Tutorial](https://www.tensorflow.org/decision_forests/tutorials/intermediate_colab)\n", + " * [Advanced 
Tutorial](https://www.tensorflow.org/decision_forests/tutorials/advanced_colab)\n", + "* The [TensorFlow Forum](https://discuss.tensorflow.org/) where one can get in touch with the TensorFlow community. Check it out if you haven't yet." + ] + } + ], + "metadata": { + "colab": { + "collapsed_sections": [], + "name": "kaggle_beginner_example_classification.ipynb", + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.9" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}