From 462243db4b15008c26b15a61e6a81127cb5f2520 Mon Sep 17 00:00:00 2001 From: Razan Alkhamisi <124253161+razanAlkhamisi@users.noreply.github.com> Date: Sun, 23 Mar 2025 12:50:37 +0300 Subject: [PATCH] Add files via upload --- lab-SVM.ipynb | 1438 +++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 1438 insertions(+) create mode 100644 lab-SVM.ipynb diff --git a/lab-SVM.ipynb b/lab-SVM.ipynb new file mode 100644 index 0000000..409e1d6 --- /dev/null +++ b/lab-SVM.ipynb @@ -0,0 +1,1438 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# SVM (Support Vector Machines)\n", + "\n", + "Estimated time needed: **15-30** minutes\n", + "\n", + "## Objectives\n", + "\n", + "After completing this lab you will be able to:\n", + "\n", + "* Use scikit-learn to Support Vector Machine to classify\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this notebook, you will use SVM (Support Vector Machines) to build and train a model using human cell records, and classify cells to whether the samples are benign or malignant.\n", + "\n", + "SVM works by mapping data to a high-dimensional feature space so that data points can be categorized, even when the data are not otherwise linearly separable. A separator between the categories is found, then the data is transformed in such a way that the separator could be drawn as a hyperplane. Following this, characteristics of new data can be used to predict the group to which a new record should belong.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

Table of contents

\n", + "\n", + "
\n", + "
    \n", + "
  1. Load the Cancer data\n", + "
  2. Modeling\n", + "
  3. Evaluation\n", + "
  4. Practice\n", + "
\n", + "
\n", + "
\n", + "
\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Collecting scikit-learn==0.23.1\n", + " Downloading scikit-learn-0.23.1.tar.gz (7.2 MB)\n", + " ---------------------------------------- 0.0/7.2 MB ? eta -:--:--\n", + " ---------------------------------------- 0.0/7.2 MB ? eta -:--:--\n", + " ---- ----------------------------------- 0.8/7.2 MB 3.4 MB/s eta 0:00:02\n", + " ---------- ----------------------------- 1.8/7.2 MB 4.6 MB/s eta 0:00:02\n", + " ----------------- ---------------------- 3.1/7.2 MB 5.3 MB/s eta 0:00:01\n", + " -------------------------- ------------- 4.7/7.2 MB 5.5 MB/s eta 0:00:01\n", + " --------------------------------- ------ 6.0/7.2 MB 5.8 MB/s eta 0:00:01\n", + " ---------------------------------------- 7.2/7.2 MB 5.7 MB/s eta 0:00:00\n", + " Installing build dependencies: started\n", + " Installing build dependencies: finished with status 'error'\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " error: subprocess-exited-with-error\n", + " \n", + " × pip subprocess to install build dependencies did not run successfully.\n", + " │ exit code: 1\n", + " ╰─> [128 lines of output]\n", + " Ignoring numpy: markers 'python_version == \"3.6\" and platform_system != \"AIX\" and platform_python_implementation == \"CPython\"' don't match your environment\n", + " Ignoring numpy: markers 'python_version == \"3.6\" and platform_system != \"AIX\" and platform_python_implementation != \"CPython\"' don't match your environment\n", + " Ignoring numpy: markers 'python_version == \"3.7\" and platform_system != \"AIX\"' don't match your environment\n", + " Ignoring numpy: markers 'python_version == \"3.6\" and platform_system == \"AIX\"' don't match your environment\n", + " Ignoring numpy: markers 'python_version == \"3.7\" and platform_system == \"AIX\"' don't match your environment\n", + " Ignoring numpy: markers 
'python_version >= \"3.8\" and platform_system == \"AIX\"' don't match your environment\n", + " Collecting setuptools\n", + " Downloading setuptools-77.0.3-py3-none-any.whl.metadata (6.6 kB)\n", + " Collecting wheel\n", + " Downloading wheel-0.45.1-py3-none-any.whl.metadata (2.3 kB)\n", + " Collecting Cython>=0.28.5\n", + " Downloading Cython-3.0.12-cp310-cp310-win_amd64.whl.metadata (3.6 kB)\n", + " Collecting numpy==1.17.3\n", + " Downloading numpy-1.17.3.zip (6.4 MB)\n", + " ---------------------------------------- 0.0/6.4 MB ? eta -:--:--\n", + " ---------------------------------------- 0.0/6.4 MB ? eta -:--:--\n", + " --- ------------------------------------ 0.5/6.4 MB 1.5 MB/s eta 0:00:04\n", + " ------------- -------------------------- 2.1/6.4 MB 4.5 MB/s eta 0:00:01\n", + " ----------------------------- ---------- 4.7/6.4 MB 7.3 MB/s eta 0:00:01\n", + " ---------------------------------------- 6.4/6.4 MB 7.5 MB/s eta 0:00:00\n", + " Preparing metadata (setup.py): started\n", + " Preparing metadata (setup.py): finished with status 'done'\n", + " Collecting scipy>=0.19.1\n", + " Downloading scipy-1.15.2-cp310-cp310-win_amd64.whl.metadata (60 kB)\n", + " INFO: pip is looking at multiple versions of scipy to determine which version is compatible with other requirements. This could take a while.\n", + " Downloading scipy-1.15.1-cp310-cp310-win_amd64.whl.metadata (60 kB)\n", + " Downloading scipy-1.15.0-cp310-cp310-win_amd64.whl.metadata (60 kB)\n", + " Downloading scipy-1.14.1-cp310-cp310-win_amd64.whl.metadata (60 kB)\n", + " Downloading scipy-1.14.0-cp310-cp310-win_amd64.whl.metadata (60 kB)\n", + " Downloading scipy-1.13.1-cp310-cp310-win_amd64.whl.metadata (60 kB)\n", + " Downloading scipy-1.13.0-cp310-cp310-win_amd64.whl.metadata (60 kB)\n", + " Downloading scipy-1.12.0-cp310-cp310-win_amd64.whl.metadata (60 kB)\n", + " INFO: pip is still looking at multiple versions of scipy to determine which version is compatible with other requirements. 
This could take a while.\n", + " Downloading scipy-1.11.4-cp310-cp310-win_amd64.whl.metadata (60 kB)\n", + " Downloading scipy-1.11.3-cp310-cp310-win_amd64.whl.metadata (60 kB)\n", + " Downloading scipy-1.11.2-cp310-cp310-win_amd64.whl.metadata (59 kB)\n", + " Downloading scipy-1.11.1-cp310-cp310-win_amd64.whl.metadata (59 kB)\n", + " Downloading scipy-1.10.1-cp310-cp310-win_amd64.whl.metadata (58 kB)\n", + " INFO: This is taking longer than usual. You might need to provide the dependency resolver with stricter constraints to reduce runtime. See https://pip.pypa.io/warnings/backtracking for guidance. If you want to abort this run, press Ctrl + C.\n", + " Downloading scipy-1.10.0-cp310-cp310-win_amd64.whl.metadata (58 kB)\n", + " Downloading scipy-1.9.3-cp310-cp310-win_amd64.whl.metadata (58 kB)\n", + " Downloading scipy-1.9.2-cp310-cp310-win_amd64.whl.metadata (58 kB)\n", + " Downloading scipy-1.9.1-cp310-cp310-win_amd64.whl.metadata (2.2 kB)\n", + " Downloading scipy-1.9.0-cp310-cp310-win_amd64.whl.metadata (2.2 kB)\n", + " Downloading scipy-1.8.1-cp310-cp310-win_amd64.whl.metadata (2.2 kB)\n", + " Downloading setuptools-77.0.3-py3-none-any.whl (1.3 MB)\n", + " ---------------------------------------- 0.0/1.3 MB ? eta -:--:--\n", + " ---------------------------------------- 1.3/1.3 MB 9.1 MB/s eta 0:00:00\n", + " Downloading wheel-0.45.1-py3-none-any.whl (72 kB)\n", + " Downloading Cython-3.0.12-cp310-cp310-win_amd64.whl (2.8 MB)\n", + " ---------------------------------------- 0.0/2.8 MB ? eta -:--:--\n", + " ---------------------- ----------------- 1.6/2.8 MB 7.6 MB/s eta 0:00:01\n", + " ------------------------------ --------- 2.1/2.8 MB 4.7 MB/s eta 0:00:01\n", + " ---------------------------------------- 2.8/2.8 MB 5.0 MB/s eta 0:00:00\n", + " Downloading scipy-1.8.1-cp310-cp310-win_amd64.whl (36.9 MB)\n", + " ---------------------------------------- 0.0/36.9 MB ? 
eta -:--:--\n", + " - -------------------------------------- 1.3/36.9 MB 6.7 MB/s eta 0:00:06\n", + " --- ------------------------------------ 2.9/36.9 MB 6.5 MB/s eta 0:00:06\n", + " ----- ---------------------------------- 5.0/36.9 MB 7.7 MB/s eta 0:00:05\n", + " ------- -------------------------------- 7.1/36.9 MB 8.4 MB/s eta 0:00:04\n", + " ---------- ----------------------------- 9.7/36.9 MB 8.9 MB/s eta 0:00:04\n", + " ------------ --------------------------- 11.5/36.9 MB 8.9 MB/s eta 0:00:03\n", + " --------------- ------------------------ 13.9/36.9 MB 9.3 MB/s eta 0:00:03\n", + " ----------------- ---------------------- 15.7/36.9 MB 9.2 MB/s eta 0:00:03\n", + " ------------------- -------------------- 18.4/36.9 MB 9.5 MB/s eta 0:00:02\n", + " ---------------------- ----------------- 20.7/36.9 MB 9.6 MB/s eta 0:00:02\n", + " ----------------------- ---------------- 22.0/36.9 MB 9.7 MB/s eta 0:00:02\n", + " ------------------------ --------------- 23.1/36.9 MB 9.4 MB/s eta 0:00:02\n", + " ------------------------- -------------- 23.9/36.9 MB 8.5 MB/s eta 0:00:02\n", + " --------------------------- ------------ 25.2/36.9 MB 8.5 MB/s eta 0:00:02\n", + " --------------------------- ------------ 25.2/36.9 MB 8.5 MB/s eta 0:00:02\n", + " ---------------------------- ----------- 26.2/36.9 MB 7.6 MB/s eta 0:00:02\n", + " ------------------------------ --------- 28.0/36.9 MB 7.6 MB/s eta 0:00:02\n", + " -------------------------------- ------- 29.9/36.9 MB 7.7 MB/s eta 0:00:01\n", + " ---------------------------------- ----- 32.0/36.9 MB 7.8 MB/s eta 0:00:01\n", + " ------------------------------------ --- 34.1/36.9 MB 7.9 MB/s eta 0:00:01\n", + " ------------------------------------- -- 34.6/36.9 MB 7.9 MB/s eta 0:00:01\n", + " --------------------------------------- 36.7/36.9 MB 7.9 MB/s eta 0:00:01\n", + " ---------------------------------------- 36.9/36.9 MB 7.7 MB/s eta 0:00:00\n", + " Building wheels for collected packages: numpy\n", + " Building wheel for 
numpy (setup.py): started\n", + " Building wheel for numpy (setup.py): finished with status 'error'\n", + " error: subprocess-exited-with-error\n", + " \n", + " × python setup.py bdist_wheel did not run successfully.\n", + " │ exit code: 1\n", + " ╰─> [15 lines of output]\n", + " Running from numpy source directory.\n", + " Traceback (most recent call last):\n", + " File \"\", line 2, in \n", + " File \"\", line 34, in \n", + " File \"C:\\Users\\razan\\AppData\\Local\\Temp\\pip-install-l0pp2l7k\\numpy_9dcc13a86398436b9ea86496a9c5ecfc\\setup.py\", line 443, in \n", + " setup_package()\n", + " File \"C:\\Users\\razan\\AppData\\Local\\Temp\\pip-install-l0pp2l7k\\numpy_9dcc13a86398436b9ea86496a9c5ecfc\\setup.py\", line 422, in setup_package\n", + " from numpy.distutils.core import setup\n", + " File \"C:\\Users\\razan\\AppData\\Local\\Temp\\pip-install-l0pp2l7k\\numpy_9dcc13a86398436b9ea86496a9c5ecfc\\numpy\\distutils\\core.py\", line 26, in \n", + " from numpy.distutils.command import config, config_compiler, \\\n", + " File \"C:\\Users\\razan\\AppData\\Local\\Temp\\pip-install-l0pp2l7k\\numpy_9dcc13a86398436b9ea86496a9c5ecfc\\numpy\\distutils\\command\\config.py\", line 20, in \n", + " from numpy.distutils.mingw32ccompiler import generate_manifest\n", + " File \"C:\\Users\\razan\\AppData\\Local\\Temp\\pip-install-l0pp2l7k\\numpy_9dcc13a86398436b9ea86496a9c5ecfc\\numpy\\distutils\\mingw32ccompiler.py\", line 34, in \n", + " from distutils.msvccompiler import get_build_version as get_build_msvc_version\n", + " ModuleNotFoundError: No module named 'distutils.msvccompiler'\n", + " [end of output]\n", + " \n", + " note: This error originates from a subprocess, and is likely not a problem with pip.\n", + " ERROR: Failed building wheel for numpy\n", + " Running setup.py clean for numpy\n", + " error: subprocess-exited-with-error\n", + " \n", + " × python setup.py clean did not run successfully.\n", + " │ exit code: 1\n", + " ╰─> [10 lines of output]\n", + " Running from 
numpy source directory.\n", + " \n", + " `setup.py clean` is not supported, use one of the following instead:\n", + " \n", + " - `git clean -xdf` (cleans all files)\n", + " - `git clean -Xdf` (cleans all versioned files, doesn't touch\n", + " files that aren't checked into the git repo)\n", + " \n", + " Add `--force` to your command to use it anyway if you must (unsupported).\n", + " \n", + " [end of output]\n", + " \n", + " note: This error originates from a subprocess, and is likely not a problem with pip.\n", + " ERROR: Failed cleaning build dir for numpy\n", + " Failed to build numpy\n", + " ERROR: Failed to build installable wheels for some pyproject.toml based projects (numpy)\n", + " [end of output]\n", + " \n", + " note: This error originates from a subprocess, and is likely not a problem with pip.\n", + "error: subprocess-exited-with-error\n", + "\n", + "× pip subprocess to install build dependencies did not run successfully.\n", + "│ exit code: 1\n", + "╰─> See above for output.\n", + "\n", + "note: This error originates from a subprocess, and is likely not a problem with pip.\n" + ] + } + ], + "source": [ + "!pip install scikit-learn==0.23.1" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "ename": "ModuleNotFoundError", + "evalue": "No module named 'piplite'", + "output_type": "error", + "traceback": [ + "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[1;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", + "Cell \u001b[1;32mIn[6], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mpiplite\u001b[39;00m\n\u001b[0;32m 2\u001b[0m \u001b[38;5;28;01mawait\u001b[39;00m piplite\u001b[38;5;241m.\u001b[39minstall([\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mpandas\u001b[39m\u001b[38;5;124m'\u001b[39m])\n\u001b[0;32m 3\u001b[0m 
\u001b[38;5;28;01mawait\u001b[39;00m piplite\u001b[38;5;241m.\u001b[39minstall([\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mmatplotlib\u001b[39m\u001b[38;5;124m'\u001b[39m])\n", + "\u001b[1;31mModuleNotFoundError\u001b[0m: No module named 'piplite'" + ] + } + ], + "source": [ + "import piplite\n", + "await piplite.install(['pandas'])\n", + "await piplite.install(['matplotlib'])\n", + "await piplite.install(['numpy'])\n", + "await piplite.install(['scikit-learn'])\n", + "await piplite.install(['scipy'])\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import pylab as pl\n", + "import numpy as np\n", + "import scipy.optimize as opt\n", + "from sklearn import preprocessing\n", + "from sklearn.model_selection import train_test_split\n", + "%matplotlib inline \n", + "import matplotlib.pyplot as plt" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "\n", + "def download(url,filename):\n", + " respons=requests.get(url,stream=True)\n", + " if respons.status_code==200:\n", + " with open(filename,'wb') as f:\n", + " for x in respons.iter_content(chunk_size=8192):\n", + " if x:\n", + " f.write(x)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "button": false, + "new_sheet": false, + "run_control": { + "read_only": false + } + }, + "source": [ + "

Load the Cancer data

\n", + "The example is based on a dataset that is publicly available from the UCI Machine Learning Repository [Asuncion and Newman, 2007](http://mlearn.ics.uci.edu/MLRepository.html). The dataset consists of several hundred human cell sample records, each of which contains the values of a set of cell characteristics. The fields in each record are:\n", + "\n", + "| Field name | Description |\n", + "| ----------- | --------------------------- |\n", + "| ID | Patient identifier |\n", + "| Clump | Clump thickness |\n", + "| UnifSize | Uniformity of cell size |\n", + "| UnifShape | Uniformity of cell shape |\n", + "| MargAdh | Marginal adhesion |\n", + "| SingEpiSize | Single epithelial cell size |\n", + "| BareNuc | Bare nuclei |\n", + "| BlandChrom | Bland chromatin |\n", + "| NormNucl | Normal nucleoli |\n", + "| Mit | Mitoses |\n", + "| Class | Benign or malignant |\n", + "\n", + "
\n", + "
\n", + "\n", + "For the purposes of this example, we're using a dataset that has a relatively small number of predictors in each record. To download the data, we will use the `download` helper defined above to fetch it from IBM Object Storage." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "button": false, + "new_sheet": false, + "run_control": { + "read_only": false + } + }, + "outputs": [], + "source": [ + "#Click here and press Shift+Enter\n", + "path=\"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/cell_samples.csv\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "button": false, + "new_sheet": false, + "run_control": { + "read_only": false + } + }, + "source": [ + "## Load Data From CSV File\n" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "download(path, \"cell_samples.csv\")" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "button": false, + "new_sheet": false, + "run_control": { + "read_only": false + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
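|   | ID | Clump | UnifSize | UnifShape | MargAdh | SingEpiSize | BareNuc | BlandChrom | NormNucl | Mit | Class |
| - | ------- | ----- | -------- | --------- | ------- | ----------- | ------- | ---------- | -------- | --- | ----- |
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |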
\n", + "
" + ], + "text/plain": [ + " ID Clump UnifSize UnifShape MargAdh SingEpiSize BareNuc \\\n", + "0 1000025 5 1 1 1 2 1 \n", + "1 1002945 5 4 4 5 7 10 \n", + "2 1015425 3 1 1 1 2 2 \n", + "3 1016277 6 8 8 1 3 4 \n", + "4 1017023 4 1 1 3 2 1 \n", + "\n", + " BlandChrom NormNucl Mit Class \n", + "0 3 1 1 2 \n", + "1 3 2 1 2 \n", + "2 3 1 1 2 \n", + "3 3 7 1 2 \n", + "4 3 1 1 2 " + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cell_df = pd.read_csv(\"cell_samples.csv\")\n", + "cell_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The ID field contains the patient identifiers. The characteristics of the cell samples from each patient are contained in fields Clump to Mit. The values are graded from 1 to 10, with 1 being the closest to benign.\n", + "\n", + "The Class field contains the diagnosis, as confirmed by separate medical procedures, as to whether the samples are benign (value = 2) or malignant (value = 4).\n", + "\n", + "Let's look at the distribution of the classes based on Clump thickness and Uniformity of cell size:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "ax=cell_df[cell_df['Class'] == 4][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='DarkBlue', label='malignant');\n", + "cell_df[cell_df['Class'] == 2][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='Yellow', label='benign',ax=ax);\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data pre-processing and selection\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's first look at columns data types:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "ID int64\n", + "Clump int64\n", + "UnifSize int64\n", + "UnifShape int64\n", + "MargAdh int64\n", + "SingEpiSize int64\n", + "BareNuc object\n", + "BlandChrom int64\n", + "NormNucl int64\n", + "Mit int64\n", + "Class int64\n", + "dtype: object" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cell_df.dtypes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It looks like the **BareNuc** column includes some values that are not numerical. 
We can drop those rows:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "ID int64\n", + "Clump int64\n", + "UnifSize int64\n", + "UnifShape int64\n", + "MargAdh int64\n", + "SingEpiSize int64\n", + "BareNuc int32\n", + "BlandChrom int64\n", + "NormNucl int64\n", + "Mit int64\n", + "Class int64\n", + "dtype: object" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cell_df = cell_df[pd.to_numeric(cell_df['BareNuc'], errors='coerce').notnull()]#If a value cannot be converted, it replaces it with NaN (errors='coerce').\n", + "\n", + "cell_df['BareNuc'] = cell_df['BareNuc'].astype('int')\n", + "cell_df.dtypes" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 5, 1, 1, 1, 2, 1, 3, 1, 1],\n", + " [ 5, 4, 4, 5, 7, 10, 3, 2, 1],\n", + " [ 3, 1, 1, 1, 2, 2, 3, 1, 1],\n", + " [ 6, 8, 8, 1, 3, 4, 3, 7, 1],\n", + " [ 4, 1, 1, 3, 2, 1, 3, 1, 1]], dtype=int64)" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "feature_df = cell_df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']]\n", + "X = np.asarray(feature_df)\n", + "X[0:5]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We want the model to predict the value of Class (that is, benign (=2) or malignant (=4)). 
As this field can have one of only two possible values, we need to change its measurement level to reflect this.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([2, 2, 2, 2, 2])" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cell_df['Class'] = cell_df['Class'].astype('int')\n", + "y = np.asarray(cell_df['Class'])\n", + "y [0:5]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Train/Test dataset\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We split our dataset into train and test set:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Train set: (546, 9) (546,)\n", + "Test set: (137, 9) (137,)\n" + ] + } + ], + "source": [ + "X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)\n", + "print ('Train set:', X_train.shape, y_train.shape)\n", + "print ('Test set:', X_test.shape, y_test.shape)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

Modeling (SVM with Scikit-learn)

\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The SVM algorithm offers a choice of kernel functions for performing its processing. Basically, mapping data into a higher-dimensional space is called kernelling. The mathematical function used for the transformation is known as the kernel function, and can be of different types, such as:\n", + "\n", + "```\n", + "1. Linear\n", + "2. Polynomial\n", + "3. Radial basis function (RBF)\n", + "4. Sigmoid\n", + "```\n", + "\n", + "Each of these functions has its own characteristics, pros and cons, and equation, but there is no easy way of knowing which function performs best with any given dataset, so we usually try different functions in turn and compare the results. Let's just use the default, RBF (Radial Basis Function), for this lab.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
SVC()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
" + ], + "text/plain": [ + "SVC()" + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from sklearn.svm import SVC\n", + "clf = SVC(kernel='rbf')\n", + "clf.fit(X_train, y_train) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "After being fitted, the model can then be used to predict new values:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([2, 4, 2, 4, 2])" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "yhat = clf.predict(X_test)\n", + "yhat [0:5]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

Evaluation

\n" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.metrics import classification_report, confusion_matrix\n", + "import itertools" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [], + "source": [ + "def plot_confusion_matrix(cm, classes,\n", + " normalize=False,\n", + " title='Confusion matrix',\n", + " cmap=plt.cm.Blues):\n", + " \"\"\"\n", + " This function prints and plots the confusion matrix.\n", + " Normalization can be applied by setting `normalize=True`.\n", + " \"\"\"\n", + " if normalize:\n", + " cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n", + " print(\"Normalized confusion matrix\")\n", + " else:\n", + " print('Confusion matrix, without normalization')\n", + "\n", + " print(cm)\n", + "\n", + " plt.imshow(cm, interpolation='nearest', cmap=cmap)\n", + " plt.title(title)\n", + " plt.colorbar()\n", + " tick_marks = np.arange(len(classes))\n", + " plt.xticks(tick_marks, classes, rotation=45)\n", + " plt.yticks(tick_marks, classes)\n", + "\n", + " fmt = '.2f' if normalize else 'd'\n", + " thresh = cm.max() / 2.\n", + " for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n", + " plt.text(j, i, format(cm[i, j], fmt),\n", + " horizontalalignment=\"center\",\n", + " color=\"white\" if cm[i, j] > thresh else \"black\")\n", + "\n", + " plt.tight_layout()\n", + " plt.ylabel('True label')\n", + " plt.xlabel('Predicted label')" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " precision recall f1-score support\n", + "\n", + " 2 1.00 0.94 0.97 90\n", + " 4 0.90 1.00 0.95 47\n", + "\n", + " accuracy 0.96 137\n", + " macro avg 0.95 0.97 0.96 137\n", + "weighted avg 0.97 0.96 0.96 137\n", + "\n", + "Confusion matrix, without normalization\n", + "[[85 5]\n", + " [ 0 47]]\n" + ] + }, + { + "data": { + 
"image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Compute confusion matrix\n", + "cnf_matrix = confusion_matrix(y_test, yhat, labels=[2,4])\n", + "np.set_printoptions(precision=2)\n", + "\n", + "print (classification_report(y_test, yhat))\n", + "\n", + "# Plot non-normalized confusion matrix\n", + "plt.figure()\n", + "plot_confusion_matrix(cnf_matrix, classes=['Benign(2)','Malignant(4)'],normalize= False, title='Confusion matrix')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can also easily use the **f1\\_score** from sklearn library:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.9639038982104676" + ] + }, + "execution_count": 38, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from sklearn.metrics import f1_score\n", + "f1_score(y_test, yhat, average='weighted') " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's try the jaccard index for accuracy:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.9444444444444444" + ] + }, + "execution_count": 39, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from sklearn.metrics import jaccard_score\n", + "jaccard_score(y_test, yhat,pos_label=2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

Practice

\n", + "Can you rebuild the model, but this time with a __linear__ kernel? You can use the __kernel='linear'__ option when you define the SVM. How does the accuracy change with the new kernel function?\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "F1-score: 0.9639\n" + ] + } + ], + "source": [ + "# write your code here\n", + "from sklearn.svm import SVC\n", + "\n", + "model=SVC(kernel='linear')\n", + "\n", + "model.fit(X_train, y_train) \n", + "\n", + "pre=model.predict(X_test)\n", + "print(\"F1-score: %.4f\" % f1_score(y_test, pre, average='weighted'))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
Click here for the solution\n", + "\n", + "```python\n", + "clf2 = SVC(kernel='linear')\n", + "clf2.fit(X_train, y_train) \n", + "yhat2 = clf2.predict(X_test)\n", + "print(\"Avg F1-score: %.4f\" % f1_score(y_test, yhat2, average='weighted'))\n", + "print(\"Jaccard score: %.4f\" % jaccard_score(y_test, yhat2,pos_label=2))\n", + "\n", + "```\n", + "\n", + "
\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Thank you for completing this lab!" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "SDA", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}
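
Reviewer note on the Evaluation cells: the weighted F1-score and the Jaccard index reported in the notebook can be checked by hand from the printed confusion matrix `[[85 5] [0 47]]`. The following is a minimal sketch using only the standard library; the helper names (`per_class_counts`, `f1`, `jaccard`, `weighted_f1`) are illustrative and not part of the notebook or of scikit-learn, and the matrix is assumed to have true labels as rows and predicted labels as columns in the order benign (2), malignant (4).

```python
# Confusion matrix copied from the notebook output.
# Assumed layout: rows = true labels, columns = predicted labels,
# in the order [benign (2), malignant (4)].
cm = [[85, 5],
      [0, 47]]


def per_class_counts(cm, k):
    """Return (tp, fp, fn) for the class at index k (hypothetical helper)."""
    n = len(cm)
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(n) if i != k)  # other classes predicted as k
    fn = sum(cm[k][j] for j in range(n) if j != k)  # class k predicted as other
    return tp, fp, fn


def f1(cm, k):
    """F1 = 2*TP / (2*TP + FP + FN) for class index k."""
    tp, fp, fn = per_class_counts(cm, k)
    return 2 * tp / (2 * tp + fp + fn)


def jaccard(cm, k):
    """Jaccard index = TP / (TP + FP + FN) for class index k."""
    tp, fp, fn = per_class_counts(cm, k)
    return tp / (tp + fp + fn)


def weighted_f1(cm):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = [sum(row) for row in cm]
    return sum(f1(cm, k) * support[k] for k in range(len(cm))) / sum(support)


print("Jaccard (benign): %.4f" % jaccard(cm, 0))
print("Weighted F1: %.4f" % weighted_f1(cm))
```

Under these assumptions the two values agree with the notebook's `jaccard_score(..., pos_label=2)` and `f1_score(..., average='weighted')` outputs to four decimal places, which suggests the labels passed to `plot_confusion_matrix` are in the right order.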