diff --git a/Minor Project/ProjectSubmission.md b/Minor Project/ProjectSubmission.md index e661c66..ca2386e 100644 --- a/Minor Project/ProjectSubmission.md +++ b/Minor Project/ProjectSubmission.md @@ -1,9 +1,9 @@ -Roll Number : < Roll no. allotted for this internship eg - 23470 > +Roll Number : 23489 -Student Name : < Your good name > +Student Name : Yashika Rajora -Project Title : < Problem statement allotted to you > +Project Title : Flowcast:Credit Card Approval Fraud Detection -Google Colab Link : < View only link of your Google Colab Notebook > +Google Colab Link : https://colab.research.google.com/drive/1b7R2oEDCqZRISLzhRYMv21H4X7VUC2ou?usp=sharing -Summary(Optional) : < Brief summary of your project > \ No newline at end of file +Summary(Optional) : < Brief summary of your project > diff --git a/customer_lifetime_value_prediction.ipynb b/customer_lifetime_value_prediction.ipynb new file mode 100644 index 0000000..16dfbfe --- /dev/null +++ b/customer_lifetime_value_prediction.ipynb @@ -0,0 +1,13063 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "F9WuH6oiRJFE" + }, + "outputs": [], + "source": [ + "# pip install kaggle" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "q6oQ1qeSO7j1" + }, + "outputs": [], + "source": [ + "# !pip install opendatasets" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "69wlOOtXRY7X" + }, + "outputs": [], + "source": [ + "# import opendatasets as od\n", + "# od.download(\"https://www.kaggle.com/datasets/sergeymedvedev/customer_segmentation?select=customer_segmentation.csv\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "t5DjR5ywT1W1" + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "# Try reading the file with 
Latin-1 encoding\n", + "data = pd.read_csv(\"/content/customer_segmentation/customer_segmentation.csv\", encoding='latin-1')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "RulfNo3ET4P-", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 293 + }, + "outputId": "2d6c99e0-b57e-4fcf-c876-03bc38c7f506" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " InvoiceNo StockCode Description Quantity \\\n", + "0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 \n", + "1 536365 71053 WHITE METAL LANTERN 6 \n", + "2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 \n", + "3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 \n", + "4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 \n", + "\n", + " InvoiceDate UnitPrice CustomerID Country \n", + "0 12/1/2010 8:26 2.55 17850.0 United Kingdom \n", + "1 12/1/2010 8:26 3.39 17850.0 United Kingdom \n", + "2 12/1/2010 8:26 2.75 17850.0 United Kingdom \n", + "3 12/1/2010 8:26 3.39 17850.0 United Kingdom \n", + "4 12/1/2010 8:26 3.39 17850.0 United Kingdom " + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountry
053636585123AWHITE HANGING HEART T-LIGHT HOLDER612/1/2010 8:262.5517850.0United Kingdom
153636571053WHITE METAL LANTERN612/1/2010 8:263.3917850.0United Kingdom
253636584406BCREAM CUPID HEARTS COAT HANGER812/1/2010 8:262.7517850.0United Kingdom
353636584029GKNITTED UNION FLAG HOT WATER BOTTLE612/1/2010 8:263.3917850.0United Kingdom
453636584029ERED WOOLLY HOTTIE WHITE HEART.612/1/2010 8:263.3917850.0United Kingdom
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "data" + } + }, + "metadata": {}, + "execution_count": 48 + } + ], + "source": [ + "data.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YCamZwYdO1L7" + }, + "source": [ + "

Customer Lifetime Value Prediction

" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8IfuOcu7O1L9" + }, + "source": [ + "We invest in customers (acquisition costs, offline ads, promotions, discounts & etc.) to generate revenue and be profitable. Naturally, these actions make some customers super valuable in terms of lifetime value but there are always some customers who pull down the profitability. We need to identify these behavior patterns, segment customers and act accordingly.\n", + "Calculating Lifetime Value is the easy part. First we need to select a time window. It can be anything like 3, 6, 12, 24 months. By the equation below, we can have Lifetime Value for each customer in that specific time window:\n", + "\n", + "**Lifetime Value: Total Gross Revenue - Total Cost**\n", + "\n", + "This equation now gives us the historical lifetime value. If we see some customers having very high negative lifetime value historically, it could be too late to take an action.\n", + "\n", + "We are going to build a simple machine learning model that predicts our customers lifetime value.\n", + "\n", + "

Lifetime Value Prediction

\n", + "\n", + "* Define an appropriate time frame for Customer Lifetime Value calculation\n", + "* Identify the features we are going to use to predict the future and create them\n", + "* Calculate lifetime value (LTV) for training the machine learning model\n", + "* Build and run the machine learning model\n", + "* Check if the model is useful\n", + "\n", + "**1. How to decide the timeframe**\n", + "\n", + "Deciding the time frame really depends on your industry, business model, strategy and more. For some industries, 1 year is a very long period, while for others it is very short. In our example, we will go ahead with 6 months.\n", + "\n", + "**2. Identifying the features for prediction**\n", + "\n", + "RFM scores for each customer ID (which we calculated in the previous article) are the perfect candidates for the feature set. To implement it correctly, we need to split our dataset. We will take 3 months of data, calculate RFM and use it for predicting the next 6 months. So we need to create two dataframes first and append RFM scores to them.\n", + "\n", + "After the first two steps, it is easy to calculate CLTV and train and test the model.\n", + "\n", + "- 1. Identifying the features \n", + "- 2. Importing necessary libraries and packages and reading files\n", + " - 2.1 Feature Engineering\n", + "- 3. Recency\n", + " - 3.1 Assigning a recency score \n", + " - 3.2 Ordering clusters\n", + "- 4. Frequency\n", + " - 4.1 Frequency clusters\n", + "- 5. Revenue\n", + " - 5.1 Revenue clusters\n", + "- 6. Overall score based on RFM Clustering \n", + "- 7. Customer Lifetime Value \n", + " - 7.1 Feature engineering\n", + "- 8. Machine Learning Model for Customer Lifetime Value Prediction \n", + "- 9. Final Clusters for Customer Lifetime Value \n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IuAe94sWO1L-" + }, + "source": [ + "

1. Identifying the features

" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "G4iLix3sO1L_" + }, + "source": [ + "

2. Importing relevant packages and libraries

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "T2LbA3IeUia8", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "0265df55-fac3-4986-a94b-104ab1eb60a1" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Error: File not found at ../input/customer_segmentation/customer_segmentation.csv\n", + "Please make sure the file exists in the correct location.\n", + "If you are using a platform like Kaggle, ensure the dataset is added to your notebook.\n" + ] + } + ], + "source": [ + "import os\n", + "import pandas as pd\n", + "\n", + "# Check if the file exists\n", + "file_path = '../input/customer_segmentation/customer_segmentation.csv'\n", + "if os.path.exists(file_path):\n", + " # Read data if the file exists\n", + " tx_data = pd.read_csv(file_path, encoding='cp1252')\n", + " print(\"Data loaded successfully!\")\n", + "else:\n", + " print(f\"Error: File not found at {file_path}\")\n", + " print(\"Please make sure the file exists in the correct location.\")\n", + " print(\"If you are using a platform like Kaggle, ensure the dataset is added to your notebook.\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "aXbTXM12O1L_" + }, + "outputs": [], + "source": [ + "\n", + "#import libraries\n", + "from __future__ import division\n", + "\n", + "from datetime import datetime, timedelta,date\n", + "import pandas as pd\n", + "%matplotlib inline\n", + "from sklearn.metrics import classification_report,confusion_matrix\n", + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", + "import seaborn as sns\n", + "from sklearn.cluster import KMeans\n", + "\n", + "\n", + "import plotly as py\n", + "import plotly.offline as pyoff\n", + "import plotly.graph_objs as go\n", + "\n", + "import xgboost as xgb\n", + "from sklearn.model_selection import KFold, cross_val_score, train_test_split\n", + "\n", + "import xgboost as xgb\n" + ] + }, + { + 
"cell_type": "code", + "execution_count": null, + "metadata": { + "id": "8wzIlmpbW71J", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 293 + }, + "outputId": "3c29b099-f7d7-41f4-f165-ad5d2cec5947" + }, + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/html": [ + " \n", + " " + ] + }, + "metadata": {} + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " InvoiceNo StockCode Description Quantity \\\n", + "0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 \n", + "1 536365 71053 WHITE METAL LANTERN 6 \n", + "2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 \n", + "3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 \n", + "4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 \n", + "\n", + " InvoiceDate UnitPrice CustomerID Country \n", + "0 12/1/2010 8:26 2.55 17850.0 United Kingdom \n", + "1 12/1/2010 8:26 3.39 17850.0 United Kingdom \n", + "2 12/1/2010 8:26 2.75 17850.0 United Kingdom \n", + "3 12/1/2010 8:26 3.39 17850.0 United Kingdom \n", + "4 12/1/2010 8:26 3.39 17850.0 United Kingdom " + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountry
053636585123AWHITE HANGING HEART T-LIGHT HOLDER612/1/2010 8:262.5517850.0United Kingdom
153636571053WHITE METAL LANTERN612/1/2010 8:263.3917850.0United Kingdom
253636584406BCREAM CUPID HEARTS COAT HANGER812/1/2010 8:262.7517850.0United Kingdom
353636584029GKNITTED UNION FLAG HOT WATER BOTTLE612/1/2010 8:263.3917850.0United Kingdom
453636584029ERED WOOLLY HOTTIE WHITE HEART.612/1/2010 8:263.3917850.0United Kingdom
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "tx_data" + } + }, + "metadata": {}, + "execution_count": 51 + } + ], + "source": [ + "import plotly.offline as pyoff\n", + "import pandas as pd\n", + "\n", + "# Initialize Plotly in notebook mode\n", + "pyoff.init_notebook_mode(connected=True)\n", + "\n", + "# Load data from CSV, specify encoding as 'cp1252'\n", + "tx_data = pd.read_csv('/content/customer_segmentation/customer_segmentation.csv', encoding='cp1252')\n", + "\n", + "# Display the first few rows of the DataFrame\n", + "tx_data.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sO2n6T_2O1MB" + }, + "source": [ + "We have all the crucial information we need:\n", + "Customer ID\n", + "Unit Price\n", + "Quantity\n", + "Invoice Date\n", + "Revenue = Active Customer Count * Order Count * Average Revenue per Order\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_0fdKfH-O1MC" + }, + "source": [ + "

2.1 Feature Engineering

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "huU-F1FBO1MC" + }, + "outputs": [], + "source": [ + "#converting the type of Invoice Date Field from string to datetime.\n", + "tx_data['InvoiceDate'] = pd.to_datetime(tx_data['InvoiceDate'])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "WFxG5zk0O1MD" + }, + "outputs": [], + "source": [ + "#creating YearMonth field for the ease of reporting and visualization\n", + "tx_data['InvoiceYearMonth'] = tx_data['InvoiceDate'].map(lambda date: 100*date.year + date.month)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "j2-PrRLxO1MD", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 300 + }, + "outputId": "80ee2cd2-eec6-4dd4-de8c-cb303a9d191b" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Quantity InvoiceDate UnitPrice \\\n", + "count 541909.000000 541909 541909.000000 \n", + "mean 9.552250 2011-07-04 13:34:57.156386048 4.611114 \n", + "min -80995.000000 2010-12-01 08:26:00 -11062.060000 \n", + "25% 1.000000 2011-03-28 11:34:00 1.250000 \n", + "50% 3.000000 2011-07-19 17:17:00 2.080000 \n", + "75% 10.000000 2011-10-19 11:27:00 4.130000 \n", + "max 80995.000000 2011-12-09 12:50:00 38970.000000 \n", + "std 218.081158 NaN 96.759853 \n", + "\n", + " CustomerID InvoiceYearMonth \n", + "count 406829.000000 541909.000000 \n", + "mean 15287.690570 201099.713989 \n", + "min 12346.000000 201012.000000 \n", + "25% 13953.000000 201103.000000 \n", + "50% 15152.000000 201107.000000 \n", + "75% 16791.000000 201110.000000 \n", + "max 18287.000000 201112.000000 \n", + "std 1713.600303 25.788703 " + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
QuantityInvoiceDateUnitPriceCustomerIDInvoiceYearMonth
count541909.000000541909541909.000000406829.000000541909.000000
mean9.5522502011-07-04 13:34:57.1563860484.61111415287.690570201099.713989
min-80995.0000002010-12-01 08:26:00-11062.06000012346.000000201012.000000
25%1.0000002011-03-28 11:34:001.25000013953.000000201103.000000
50%3.0000002011-07-19 17:17:002.08000015152.000000201107.000000
75%10.0000002011-10-19 11:27:004.13000016791.000000201110.000000
max80995.0000002011-12-09 12:50:0038970.00000018287.000000201112.000000
std218.081158NaN96.7598531713.60030325.788703
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"tx_data\",\n \"rows\": 8,\n \"fields\": [\n {\n \"column\": \"Quantity\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 196412.4226608867,\n \"min\": -80995.0,\n \"max\": 541909.0,\n \"num_unique_values\": 8,\n \"samples\": [\n 9.55224954743324,\n 10.0,\n 541909.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"InvoiceDate\",\n \"properties\": {\n \"dtype\": \"date\",\n \"min\": \"1970-01-01 00:00:00.000541909\",\n \"max\": \"2011-12-09 12:50:00\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"541909\",\n \"2011-07-04 13:34:57.156386048\",\n \"2011-10-19 11:27:00\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"UnitPrice\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 190752.07570771928,\n \"min\": -11062.06,\n \"max\": 541909.0,\n \"num_unique_values\": 8,\n \"samples\": [\n 4.611113626088513,\n 4.13,\n 541909.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"CustomerID\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 139204.1680069419,\n \"min\": 1713.600303321598,\n \"max\": 406829.0,\n \"num_unique_values\": 8,\n \"samples\": [\n 15287.690570239585,\n 16791.0,\n 406829.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"InvoiceYearMonth\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 148392.75129663633,\n \"min\": 25.788702574753856,\n \"max\": 541909.0,\n \"num_unique_values\": 8,\n \"samples\": [\n 201099.71398888004,\n 201110.0,\n 541909.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 54 + } + ], + "source": [ + "tx_data.describe()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "pVkRPygEO1ME", + "colab": { + "base_uri": 
"https://localhost:8080/", + "height": 1000 + }, + "outputId": "a4fa7729-9654-4a65-b3fa-78cff035be03" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Country\n", + "United Kingdom 495478\n", + "Germany 9495\n", + "France 8557\n", + "EIRE 8196\n", + "Spain 2533\n", + "Netherlands 2371\n", + "Belgium 2069\n", + "Switzerland 2002\n", + "Portugal 1519\n", + "Australia 1259\n", + "Norway 1086\n", + "Italy 803\n", + "Channel Islands 758\n", + "Finland 695\n", + "Cyprus 622\n", + "Sweden 462\n", + "Unspecified 446\n", + "Austria 401\n", + "Denmark 389\n", + "Japan 358\n", + "Poland 341\n", + "Israel 297\n", + "USA 291\n", + "Hong Kong 288\n", + "Singapore 229\n", + "Iceland 182\n", + "Canada 151\n", + "Greece 146\n", + "Malta 127\n", + "United Arab Emirates 68\n", + "European Community 61\n", + "RSA 58\n", + "Lebanon 45\n", + "Lithuania 35\n", + "Brazil 32\n", + "Czech Republic 30\n", + "Bahrain 19\n", + "Saudi Arabia 10\n", + "Name: count, dtype: int64" + ], + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
count
Country
United Kingdom495478
Germany9495
France8557
EIRE8196
Spain2533
Netherlands2371
Belgium2069
Switzerland2002
Portugal1519
Australia1259
Norway1086
Italy803
Channel Islands758
Finland695
Cyprus622
Sweden462
Unspecified446
Austria401
Denmark389
Japan358
Poland341
Israel297
USA291
Hong Kong288
Singapore229
Iceland182
Canada151
Greece146
Malta127
United Arab Emirates68
European Community61
RSA58
Lebanon45
Lithuania35
Brazil32
Czech Republic30
Bahrain19
Saudi Arabia10
\n", + "

" + ] + }, + "metadata": {}, + "execution_count": 55 + } + ], + "source": [ + "tx_data['Country'].value_counts()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Hs--bgmPO1ME" + }, + "source": [ + "Starting from this part, we will be focusing on UK data only (which has the most records). We can get the monthly active customers by counting unique CustomerIDs. The same analysis can be carried out for customers of other countries as well." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "BTwLXOvnO1ME" + }, + "outputs": [], + "source": [ + "#we will be using only UK data\n", + "tx_uk = tx_data.query(\"Country=='United Kingdom'\").reset_index(drop=True)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oLuM42fjO1ME" + }, + "source": [ + "**Segmentation Techniques**\n", + "\n", + "You can do many different segmentations according to what you are trying to achieve. If you want to increase retention rate, you can do a segmentation based on churn probability and take actions. But there are very common and useful segmentation methods as well. Now we are going to implement one of them to our business: RFM.\n", + "RFM stands for Recency - Frequency - Monetary Value. Theoretically we will have segments like below:\n", + "\n", + "* Low Value: Customers who are less active than others, not very frequent buyer/visitor and generates very low - zero - maybe negative revenue.\n", + "* Mid Value: In the middle of everything. Often using our platform (but not as much as our High Values), fairly frequent and generates moderate revenue.\n", + "* High Value: The group we don’t want to lose. High Revenue, Frequency and low Inactivity.\n", + "\n", + "As the methodology, we need to calculate Recency, Frequency and Monetary Value (we will call it Revenue from now on) and apply unsupervised machine learning to identify different groups (clusters) for each. 
Let’s jump into coding and see how to do RFM Clustering.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3-mxLNAsO1ME" + }, + "source": [ + "

3. Recency

\n", + "\n", + "To calculate recency, we need to find the most recent purchase date of each customer and see how many days they have been inactive. Once we have the number of inactive days for each customer, we will apply K-means clustering to assign customers a recency score.\n", + "\n", + "Let's go ahead and calculate that." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "En8CHNYJO1ME", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "outputId": "c0d02f79-a82c-4b17-8ee9-78898b07bde6" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " CustomerID\n", + "0 17850.0\n", + "1 13047.0\n", + "2 12583.0\n", + "3 13748.0\n", + "4 15100.0" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CustomerID
017850.0
113047.0
212583.0
313748.0
415100.0
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "tx_user", + "summary": "{\n \"name\": \"tx_user\",\n \"rows\": 4373,\n \"fields\": [\n {\n \"column\": \"CustomerID\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1722.390705427691,\n \"min\": 12346.0,\n \"max\": 18287.0,\n \"num_unique_values\": 4372,\n \"samples\": [\n 14633.0,\n 13050.0,\n 17975.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 57 + } + ], + "source": [ + "#create a generic user dataframe to keep CustomerID and new segmentation scores\n", + "tx_user = pd.DataFrame(tx_data['CustomerID'].unique())\n", + "tx_user.columns = ['CustomerID']\n", + "tx_user.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "AriZIC_qO1MF", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 538 + }, + "outputId": "6e1d50e1-a8e5-450e-cc3b-83c017dedeee" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " InvoiceNo StockCode Description Quantity \\\n", + "0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 \n", + "1 536365 71053 WHITE METAL LANTERN 6 \n", + "2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 \n", + "3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 \n", + "4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 \n", + "\n", + " InvoiceDate UnitPrice CustomerID Country InvoiceYearMonth \n", + "0 2010-12-01 08:26:00 2.55 17850.0 United Kingdom 201012 \n", + "1 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 201012 \n", + "2 2010-12-01 08:26:00 2.75 17850.0 United Kingdom 201012 \n", + "3 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 201012 \n", + "4 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 201012 " + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountryInvoiceYearMonth
053636585123AWHITE HANGING HEART T-LIGHT HOLDER62010-12-01 08:26:002.5517850.0United Kingdom201012
153636571053WHITE METAL LANTERN62010-12-01 08:26:003.3917850.0United Kingdom201012
253636584406BCREAM CUPID HEARTS COAT HANGER82010-12-01 08:26:002.7517850.0United Kingdom201012
353636584029GKNITTED UNION FLAG HOT WATER BOTTLE62010-12-01 08:26:003.3917850.0United Kingdom201012
453636584029ERED WOOLLY HOTTIE WHITE HEART.62010-12-01 08:26:003.3917850.0United Kingdom201012
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "tx_uk" + } + }, + "metadata": {}, + "execution_count": 58 + } + ], + "source": [ + "tx_uk.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7fguh8DxO1MF" + }, + "source": [ + "Since we are calculating recency, we need to know when each customer last bought something. Let us calculate the date of the last transaction for each customer." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "sLG_dwdsO1MF", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "outputId": "cdc273df-a8cb-4520-d883-f384ffa070e2" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " CustomerID MaxPurchaseDate\n", + "0 12346.0 2011-01-18 10:17:00\n", + "1 12747.0 2011-12-07 14:34:00\n", + "2 12748.0 2011-12-09 12:20:00\n", + "3 12749.0 2011-12-06 09:56:00\n", + "4 12820.0 2011-12-06 15:12:00" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CustomerIDMaxPurchaseDate
012346.02011-01-18 10:17:00
112747.02011-12-07 14:34:00
212748.02011-12-09 12:20:00
312749.02011-12-06 09:56:00
412820.02011-12-06 15:12:00
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "tx_max_purchase", + "summary": "{\n \"name\": \"tx_max_purchase\",\n \"rows\": 3950,\n \"fields\": [\n {\n \"column\": \"CustomerID\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1576.8483250815016,\n \"min\": 12346.0,\n \"max\": 18287.0,\n \"num_unique_values\": 3950,\n \"samples\": [\n 17160.0,\n 15758.0,\n 15349.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"MaxPurchaseDate\",\n \"properties\": {\n \"dtype\": \"date\",\n \"min\": \"2010-12-01 09:53:00\",\n \"max\": \"2011-12-09 12:49:00\",\n \"num_unique_values\": 3836,\n \"samples\": [\n \"2011-08-12 16:54:00\",\n \"2011-03-04 14:13:00\",\n \"2010-12-02 17:03:00\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 59 + } + ], + "source": [ + "#get the max purchase date for each customer and create a dataframe with it\n", + "tx_max_purchase = tx_uk.groupby('CustomerID').InvoiceDate.max().reset_index()\n", + "tx_max_purchase.columns = ['CustomerID','MaxPurchaseDate']\n", + "tx_max_purchase.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "gH4XygopO1MF", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "outputId": "af2ddd48-3839-4ad5-cb4d-7bf076b80074" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " CustomerID MaxPurchaseDate Recency\n", + "0 12346.0 2011-01-18 10:17:00 325\n", + "1 12747.0 2011-12-07 14:34:00 1\n", + "2 12748.0 2011-12-09 12:20:00 0\n", + "3 12749.0 2011-12-06 09:56:00 3\n", + "4 12820.0 2011-12-06 15:12:00 2" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CustomerIDMaxPurchaseDateRecency
012346.02011-01-18 10:17:00325
112747.02011-12-07 14:34:001
212748.02011-12-09 12:20:000
312749.02011-12-06 09:56:003
412820.02011-12-06 15:12:002
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "tx_max_purchase", + "summary": "{\n \"name\": \"tx_max_purchase\",\n \"rows\": 3950,\n \"fields\": [\n {\n \"column\": \"CustomerID\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1576.8483250815016,\n \"min\": 12346.0,\n \"max\": 18287.0,\n \"num_unique_values\": 3950,\n \"samples\": [\n 17160.0,\n 15758.0,\n 15349.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"MaxPurchaseDate\",\n \"properties\": {\n \"dtype\": \"date\",\n \"min\": \"2010-12-01 09:53:00\",\n \"max\": \"2011-12-09 12:49:00\",\n \"num_unique_values\": 3836,\n \"samples\": [\n \"2011-08-12 16:54:00\",\n \"2011-03-04 14:13:00\",\n \"2010-12-02 17:03:00\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Recency\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 100,\n \"min\": 0,\n \"max\": 373,\n \"num_unique_values\": 348,\n \"samples\": [\n 283,\n 129,\n 171\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 60 + } + ], + "source": [ + "# Compare the last transaction of the dataset with last transaction dates of the individual customer IDs.\n", + "tx_max_purchase['Recency'] = (tx_max_purchase['MaxPurchaseDate'].max() - tx_max_purchase['MaxPurchaseDate']).dt.days\n", + "tx_max_purchase.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "lqhsgB-MO1MF", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "outputId": "d25477ef-4dc8-4a2c-8153-420cd963c106" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " CustomerID Recency\n", + "0 17850.0 301\n", + "1 13047.0 31\n", + "2 13748.0 95\n", + "3 15100.0 329\n", + "4 15291.0 25" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CustomerIDRecency
017850.0301
113047.031
213748.095
315100.0329
415291.025
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "tx_user", + "summary": "{\n \"name\": \"tx_user\",\n \"rows\": 3950,\n \"fields\": [\n {\n \"column\": \"CustomerID\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1576.8483250815016,\n \"min\": 12346.0,\n \"max\": 18287.0,\n \"num_unique_values\": 3950,\n \"samples\": [\n 16189.0,\n 13740.0,\n 16023.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Recency\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 100,\n \"min\": 0,\n \"max\": 373,\n \"num_unique_values\": 348,\n \"samples\": [\n 76,\n 252,\n 248\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 61 + } + ], + "source": [ + "#merge this dataframe to our new user dataframe\n", + "tx_user = pd.merge(tx_user, tx_max_purchase[['CustomerID','Recency']], on='CustomerID')\n", + "tx_user.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gzr0uKd-O1MF" + }, + "source": [ + "\\\\

3.1 Assigning a recency score

\n", + "\n", + "We will use K-means clustering to assign a recency score. K-means needs the number of clusters as an input, so we use the elbow method to choose it: we fit K-means for a range of cluster counts and look for the point where the inertia (the within-cluster sum of squared errors) stops dropping sharply. The code and the inertia plot follow:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "PS3PXH7OO1MF", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "outputId": "0ea21c49-f623-4610-96db-ed8bfa04d151" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. 
Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n" + ] + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + "
" + ], + "image/png": "\n" + }, + "metadata": {} + } + ], + "source": [ + "import matplotlib.pyplot as plt\n", + "from sklearn.cluster import KMeans\n", + "\n", + "sse = {}  # inertia (sum of squared errors) for each k\n", + "tx_recency = tx_user[['Recency']].copy()  # copy so the loop never writes into a slice of tx_user\n", + "for k in range(1, 10):\n", + "    # fit on Recency only; labels from a previous k must not be used as a feature\n", + "    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(tx_recency[['Recency']])\n", + "    tx_recency[\"clusters\"] = kmeans.labels_  # cluster label for each recency value\n", + "    sse[k] = kmeans.inertia_  # SSE for this number of clusters\n", + "plt.figure()\n", + "plt.plot(list(sse.keys()), list(sse.values()))\n", + "plt.xlabel(\"Number of clusters\")\n", + "plt.ylabel(\"SSE\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aLVKWpNJO1MG" + }, + "source": [ + "Here the elbow appears at about 3 clusters. Depending on business requirements, we can use fewer or more clusters. We will use 4 in this example." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "AnxjkIHPO1MG", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "52e7b9ce-44e7-4d3d-fd7c-af94a2454667" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. 
Set the value of `n_init` explicitly to suppress the warning\n", + "\n" + ] + } + ], + "source": [ + "#build 4 clusters for recency and add it to dataframe\n", + "kmeans = KMeans(n_clusters=4)\n", + "tx_user['RecencyCluster'] = kmeans.fit_predict(tx_user[['Recency']])\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "jSZvJQNyO1MG", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "outputId": "9fc305ac-c7f8-4641-b2fd-f6de412da836" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " CustomerID Recency RecencyCluster\n", + "0 17850.0 301 1\n", + "1 13047.0 31 2\n", + "2 13748.0 95 0\n", + "3 15100.0 329 1\n", + "4 15291.0 25 2" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CustomerIDRecencyRecencyCluster
017850.03011
113047.0312
213748.0950
315100.03291
415291.0252
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "tx_user", + "summary": "{\n \"name\": \"tx_user\",\n \"rows\": 3950,\n \"fields\": [\n {\n \"column\": \"CustomerID\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1576.8483250815016,\n \"min\": 12346.0,\n \"max\": 18287.0,\n \"num_unique_values\": 3950,\n \"samples\": [\n 16189.0,\n 13740.0,\n 16023.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Recency\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 100,\n \"min\": 0,\n \"max\": 373,\n \"num_unique_values\": 348,\n \"samples\": [\n 76,\n 252,\n 248\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"RecencyCluster\",\n \"properties\": {\n \"dtype\": \"int32\",\n \"num_unique_values\": 4,\n \"samples\": [\n 2,\n 3,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 64 + } + ], + "source": [ + "tx_user.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "yAgz_qk5O1MG", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "outputId": "9a82ed31-8374-48d7-e362-a82be4f451da" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " count mean std min 25% 50% 75% \\\n", + "RecencyCluster \n", + "0 954.0 77.679245 22.850898 48.0 59.00 72.5 93.00 \n", + "1 478.0 304.393305 41.183489 245.0 266.25 300.0 336.00 \n", + "2 1950.0 17.488205 13.237058 0.0 6.00 16.0 28.00 \n", + "3 568.0 184.625000 31.753602 132.0 156.75 184.0 211.25 \n", + "\n", + " max \n", + "RecencyCluster \n", + "0 131.0 \n", + "1 373.0 \n", + "2 47.0 \n", + "3 244.0 " + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
countmeanstdmin25%50%75%max
RecencyCluster
0954.077.67924522.85089848.059.0072.593.00131.0
1478.0304.39330541.183489245.0266.25300.0336.00373.0
21950.017.48820513.2370580.06.0016.028.0047.0
3568.0184.62500031.753602132.0156.75184.0211.25244.0
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"tx_user\",\n \"rows\": 4,\n \"fields\": [\n {\n \"column\": \"RecencyCluster\",\n \"properties\": {\n \"dtype\": \"int32\",\n \"num_unique_values\": 4,\n \"samples\": [\n 1,\n 3,\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"count\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 674.06700952749,\n \"min\": 478.0,\n \"max\": 1950.0,\n \"num_unique_values\": 4,\n \"samples\": [\n 478.0,\n 568.0,\n 954.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"mean\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 126.17887851902452,\n \"min\": 17.488205128205127,\n \"max\": 304.39330543933056,\n \"num_unique_values\": 4,\n \"samples\": [\n 304.39330543933056,\n 184.625,\n 77.67924528301887\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"std\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 11.974125219307917,\n \"min\": 13.237058477856872,\n \"max\": 41.183489256909944,\n \"num_unique_values\": 4,\n \"samples\": [\n 41.183489256909944,\n 31.753601776012104,\n 22.850897908212414\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"min\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 107.38831407560136,\n \"min\": 0.0,\n \"max\": 245.0,\n \"num_unique_values\": 4,\n \"samples\": [\n 245.0,\n 132.0,\n 48.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"25%\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 114.65982295468626,\n \"min\": 6.0,\n \"max\": 266.25,\n \"num_unique_values\": 4,\n \"samples\": [\n 266.25,\n 156.75,\n 59.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"50%\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 125.73674549099267,\n \"min\": 16.0,\n \"max\": 300.0,\n 
\"num_unique_values\": 4,\n \"samples\": [\n 300.0,\n 184.0,\n 72.5\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"75%\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 135.78910962101,\n \"min\": 28.0,\n \"max\": 336.0,\n \"num_unique_values\": 4,\n \"samples\": [\n 336.0,\n 211.25,\n 93.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"max\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 141.4552343794083,\n \"min\": 47.0,\n \"max\": 373.0,\n \"num_unique_values\": 4,\n \"samples\": [\n 373.0,\n 244.0,\n 131.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 65 + } + ], + "source": [ + "tx_user.groupby('RecencyCluster')['Recency'].describe()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HUFV41OKO1MG" + }, + "source": [ + "

3.2 Ordering clusters

\n", + "\n", + "Each customer ID now has a cluster, but K-means numbers its clusters arbitrarily: cluster 2 is not better than cluster 1, for example. We want the cluster numbers to reflect how recent the customers' transactions are.\n", + "\n", + "We first compute the mean recency of each cluster and sort the clusters by that value. From the table above, the mean recency is ordered cluster 1 (about 304 days) > cluster 3 (about 185) > cluster 0 (about 78) > cluster 2 (about 17), so cluster 1 is the most inactive and cluster 2 is the most recent. We then number the sorted clusters 0, 1, 2, 3: cluster 1 becomes cluster 0, cluster 3 becomes cluster 1, cluster 0 becomes cluster 2 and cluster 2 becomes cluster 3. Finally, we drop the original cluster numbers and replace them with the new ones. The code is below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "8_1qP5jaO1MG" + }, + "outputs": [], + "source": [ + "# Renumber clusters so that the cluster number follows the order of the mean of target_field_name\n", + "def order_cluster(cluster_field_name, target_field_name, df, ascending):\n", + "    # mean of the target field for each cluster\n", + "    df_new = df.groupby(cluster_field_name)[target_field_name].mean().reset_index()\n", + "    # sort clusters by that mean; the position after sorting becomes the new cluster number\n", + "    df_new = df_new.sort_values(by=target_field_name, ascending=ascending).reset_index(drop=True)\n", + "    df_new['index'] = df_new.index\n", + "    # attach the new number to every row, then replace the old cluster column with it\n", + "    df_final = pd.merge(df, df_new[[cluster_field_name, 'index']], on=cluster_field_name)\n", + "    df_final = df_final.drop([cluster_field_name], axis=1)\n", + "    df_final = df_final.rename(columns={\"index\": cluster_field_name})\n", + "    return df_final\n", + "\n", + "# ascending=False: the highest mean recency (most inactive) gets cluster 0\n", + "tx_user = order_cluster('RecencyCluster', 'Recency', tx_user, False)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "xR2X2kg-O1MG", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "outputId": "f8df2fa7-501b-4b31-d997-bb192f6fe3ee" + }, + "outputs": [ + { + "output_type": 
"execute_result", + "data": { + "text/plain": [ + " CustomerID Recency RecencyCluster\n", + "0 17850.0 301 0\n", + "1 15100.0 329 0\n", + "2 18074.0 373 0\n", + "3 16250.0 260 0\n", + "4 13747.0 373 0" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CustomerIDRecencyRecencyCluster
017850.03010
115100.03290
218074.03730
316250.02600
413747.03730
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "tx_user", + "summary": "{\n \"name\": \"tx_user\",\n \"rows\": 3950,\n \"fields\": [\n {\n \"column\": \"CustomerID\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1576.8483250815016,\n \"min\": 12346.0,\n \"max\": 18287.0,\n \"num_unique_values\": 3950,\n \"samples\": [\n 13067.0,\n 17947.0,\n 16968.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Recency\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 100,\n \"min\": 0,\n \"max\": 373,\n \"num_unique_values\": 348,\n \"samples\": [\n 169,\n 0,\n 161\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"RecencyCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n \"max\": 3,\n \"num_unique_values\": 4,\n \"samples\": [\n 3,\n 1,\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 67 + } + ], + "source": [ + "tx_user.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "TkmwGorBO1MH", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "outputId": "e1eebde4-61b6-4100-bdaa-c06c34aaa5d6" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " count mean std min 25% 50% 75% \\\n", + "RecencyCluster \n", + "0 478.0 304.393305 41.183489 245.0 266.25 300.0 336.00 \n", + "1 568.0 184.625000 31.753602 132.0 156.75 184.0 211.25 \n", + "2 954.0 77.679245 22.850898 48.0 59.00 72.5 93.00 \n", + "3 1950.0 17.488205 13.237058 0.0 6.00 16.0 28.00 \n", + "\n", + " max \n", + "RecencyCluster \n", + "0 373.0 \n", + "1 244.0 \n", + "2 131.0 \n", + "3 47.0 " + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
countmeanstdmin25%50%75%max
RecencyCluster
0478.0304.39330541.183489245.0266.25300.0336.00373.0
1568.0184.62500031.753602132.0156.75184.0211.25244.0
2954.077.67924522.85089848.059.0072.593.00131.0
31950.017.48820513.2370580.06.0016.028.0047.0
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"tx_user\",\n \"rows\": 4,\n \"fields\": [\n {\n \"column\": \"RecencyCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n \"max\": 3,\n \"num_unique_values\": 4,\n \"samples\": [\n 1,\n 3,\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"count\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 674.06700952749,\n \"min\": 478.0,\n \"max\": 1950.0,\n \"num_unique_values\": 4,\n \"samples\": [\n 568.0,\n 1950.0,\n 478.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"mean\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 126.17887851902452,\n \"min\": 17.488205128205127,\n \"max\": 304.39330543933056,\n \"num_unique_values\": 4,\n \"samples\": [\n 184.625,\n 17.488205128205127,\n 304.39330543933056\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"std\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 11.974125219307917,\n \"min\": 13.237058477856872,\n \"max\": 41.183489256909944,\n \"num_unique_values\": 4,\n \"samples\": [\n 31.753601776012104,\n 13.237058477856872,\n 41.183489256909944\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"min\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 107.38831407560136,\n \"min\": 0.0,\n \"max\": 245.0,\n \"num_unique_values\": 4,\n \"samples\": [\n 132.0,\n 0.0,\n 245.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"25%\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 114.65982295468626,\n \"min\": 6.0,\n \"max\": 266.25,\n \"num_unique_values\": 4,\n \"samples\": [\n 156.75,\n 6.0,\n 266.25\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"50%\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 
125.73674549099267,\n \"min\": 16.0,\n \"max\": 300.0,\n \"num_unique_values\": 4,\n \"samples\": [\n 184.0,\n 16.0,\n 300.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"75%\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 135.78910962101,\n \"min\": 28.0,\n \"max\": 336.0,\n \"num_unique_values\": 4,\n \"samples\": [\n 211.25,\n 28.0,\n 336.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"max\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 141.4552343794083,\n \"min\": 47.0,\n \"max\": 373.0,\n \"num_unique_values\": 4,\n \"samples\": [\n 244.0,\n 47.0,\n 373.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 68 + } + ], + "source": [ + "tx_user.groupby('RecencyCluster')['Recency'].describe()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tQAvdnJmO1MH" + }, + "source": [ + "Great! The earlier cluster 1 is now cluster 0, the earlier cluster 3 is now cluster 1, the earlier cluster 0 is now cluster 2 and the earlier cluster 2 is now cluster 3. The clusters are now ordered by inactivity: cluster 0 is the most inactive and cluster 3 is the most active." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "k9DV2TEzO1MH" + }, + "source": [ + "

4. Frequency

\n", + "\n", + "To create frequency clusters, we need the total number of orders for each customer. We first calculate this and look at how frequency is distributed in our customer database." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "rRQH30JpO1MH" + }, + "outputs": [], + "source": [ + "# count the rows in tx_uk (non-null InvoiceDate values) per customer and store the result as Frequency\n", + "tx_frequency = tx_uk.groupby('CustomerID').InvoiceDate.count().reset_index()\n", + "tx_frequency.columns = ['CustomerID','Frequency']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "RR6vbiWkO1MH", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "outputId": "0da3050e-c1fc-480b-ab17-f99cec0eb5cc" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " CustomerID Frequency\n", + "0 12346.0 2\n", + "1 12747.0 103\n", + "2 12748.0 4642\n", + "3 12749.0 231\n", + "4 12820.0 59" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CustomerIDFrequency
012346.02
112747.0103
212748.04642
312749.0231
412820.059
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "tx_frequency", + "summary": "{\n \"name\": \"tx_frequency\",\n \"rows\": 3950,\n \"fields\": [\n {\n \"column\": \"CustomerID\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1576.8483250815016,\n \"min\": 12346.0,\n \"max\": 18287.0,\n \"num_unique_values\": 3950,\n \"samples\": [\n 17160.0,\n 15758.0,\n 15349.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Frequency\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 220,\n \"min\": 1,\n \"max\": 7983,\n \"num_unique_values\": 455,\n \"samples\": [\n 415,\n 154,\n 452\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 70 + } + ], + "source": [ + "tx_frequency.head() #how many orders does a customer have" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "_RNAanFYO1MH", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "outputId": "94dfcc91-7337-4e34-8c04-f899942b671e" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " CustomerID Recency RecencyCluster Frequency\n", + "0 17850.0 301 0 312\n", + "1 15100.0 329 0 6\n", + "2 18074.0 373 0 13\n", + "3 16250.0 260 0 24\n", + "4 13747.0 373 0 1" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CustomerIDRecencyRecencyClusterFrequency
017850.03010312
115100.032906
218074.0373013
316250.0260024
413747.037301
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "tx_user", + "summary": "{\n \"name\": \"tx_user\",\n \"rows\": 3950,\n \"fields\": [\n {\n \"column\": \"CustomerID\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1576.8483250815016,\n \"min\": 12346.0,\n \"max\": 18287.0,\n \"num_unique_values\": 3950,\n \"samples\": [\n 13067.0,\n 17947.0,\n 16968.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Recency\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 100,\n \"min\": 0,\n \"max\": 373,\n \"num_unique_values\": 348,\n \"samples\": [\n 169,\n 0,\n 161\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"RecencyCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n \"max\": 3,\n \"num_unique_values\": 4,\n \"samples\": [\n 3,\n 1,\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Frequency\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 220,\n \"min\": 1,\n \"max\": 7983,\n \"num_unique_values\": 455,\n \"samples\": [\n 253,\n 41,\n 604\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 71 + } + ], + "source": [ + "#add this data to our main dataframe\n", + "tx_user = pd.merge(tx_user, tx_frequency, on='CustomerID')\n", + "\n", + "tx_user.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "h3hMNQCRO1MH" + }, + "source": [ + "

4.1 Frequency clusters

" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CX9yWweXO1MI" + }, + "source": [ + "Determine the right number of clusters for K-Means by elbow method" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "UpKtowM3O1MI", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "outputId": "bd57aaaf-c32d-4925-9610-d73b32ed813d" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. 
Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n" + ] + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + "
" + ], + "image/png": "\n" + }, + "metadata": {} + } + ], + "source": [ + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "from sklearn.cluster import KMeans\n", + "\n", + "# Assuming tx_user is your DataFrame\n", + "# Create a copy of the 'Frequency' column\n", + "tx_recency = tx_user[['Frequency']].copy()\n", + "\n", + "sse = {} # Dictionary to store SSE for each k value\n", + "\n", + "# Loop over different values of k\n", + "for k in range(1, 10):\n", + " kmeans = KMeans(n_clusters=k, max_iter=1000).fit(tx_recency)\n", + " # Assign cluster labels to the DataFrame\n", + " tx_recency.loc[:, \"clusters\"] = kmeans.labels_\n", + " # Store the SSE value for the current number of clusters\n", + " sse[k] = kmeans.inertia_\n", + "\n", + "# Plotting the results\n", + "plt.figure()\n", + "plt.plot(list(sse.keys()), list(sse.values()))\n", + "plt.xlabel(\"Number of clusters\")\n", + "plt.ylabel(\"SSE\")\n", + "plt.show()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nQ2GXbYkO1MI" + }, + "source": [ + "By Elbow method, clusters number should be 4 as after 4, the graph goes down." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "N4NR3Wp-O1MI", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 295 + }, + "outputId": "03eb594c-dfea-4671-e69e-2e5b44cf1190" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. 
Set the value of `n_init` explicitly to suppress the warning\n", + "\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " count mean std min 25% 50% \\\n", + "FrequencyCluster \n", + "0 3496.0 49.525744 44.954212 1.0 15.0 33.0 \n", + "1 429.0 331.221445 133.856510 191.0 228.0 287.0 \n", + "2 22.0 1313.136364 505.934524 872.0 988.5 1140.0 \n", + "3 3.0 5917.666667 1805.062418 4642.0 4885.0 5128.0 \n", + "\n", + " 75% max \n", + "FrequencyCluster \n", + "0 73.0 190.0 \n", + "1 399.0 803.0 \n", + "2 1452.0 2782.0 \n", + "3 6555.5 7983.0 " + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
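| FrequencyCluster | count | mean | std | min | 25% | 50% | 75% | max |
| 0 | 3496.0 | 49.525744 | 44.954212 | 1.0 | 15.0 | 33.0 | 73.0 | 190.0 |
| 1 | 429.0 | 331.221445 | 133.856510 | 191.0 | 228.0 | 287.0 | 399.0 | 803.0 |
| 2 | 22.0 | 1313.136364 | 505.934524 | 872.0 | 988.5 | 1140.0 | 1452.0 | 2782.0 |
| 3 | 3.0 | 5917.666667 | 1805.062418 | 4642.0 | 4885.0 | 5128.0 | 6555.5 | 7983.0 |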
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"tx_user\",\n \"rows\": 4,\n \"fields\": [\n {\n \"column\": \"FrequencyCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n \"max\": 3,\n \"num_unique_values\": 4,\n \"samples\": [\n 1,\n 3,\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"count\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1683.8373832806224,\n \"min\": 3.0,\n \"max\": 3496.0,\n \"num_unique_values\": 4,\n \"samples\": [\n 429.0,\n 3.0,\n 3496.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"mean\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2730.7716948007696,\n \"min\": 49.525743707093824,\n \"max\": 5917.666666666667,\n \"num_unique_values\": 4,\n \"samples\": [\n 331.2214452214452,\n 5917.666666666667,\n 49.525743707093824\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"std\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 813.3004715715977,\n \"min\": 44.95421190788836,\n \"max\": 1805.0624181266787,\n \"num_unique_values\": 4,\n \"samples\": [\n 133.85651023921278,\n 1805.0624181266787,\n 44.95421190788836\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"min\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2176.0377600890415,\n \"min\": 1.0,\n \"max\": 4642.0,\n \"num_unique_values\": 4,\n \"samples\": [\n 191.0,\n 4642.0,\n 1.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"25%\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2275.9374030275962,\n \"min\": 15.0,\n \"max\": 4885.0,\n \"num_unique_values\": 4,\n \"samples\": [\n 228.0,\n 4885.0,\n 15.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"50%\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 
2368.4739109111306,\n \"min\": 33.0,\n \"max\": 5128.0,\n \"num_unique_values\": 4,\n \"samples\": [\n 287.0,\n 5128.0,\n 33.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"75%\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 3015.0696060234936,\n \"min\": 73.0,\n \"max\": 6555.5,\n \"num_unique_values\": 4,\n \"samples\": [\n 399.0,\n 6555.5,\n 73.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"max\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 3539.5894771380854,\n \"min\": 190.0,\n \"max\": 7983.0,\n \"num_unique_values\": 4,\n \"samples\": [\n 803.0,\n 7983.0,\n 190.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 73 + } + ], + "source": [ + "# Applying k-Means\n", + "kmeans=KMeans(n_clusters=4)\n", + "tx_user['FrequencyCluster']=kmeans.fit_predict(tx_user[['Frequency']])\n", + "\n", + "#order the frequency cluster\n", + "tx_user = order_cluster('FrequencyCluster', 'Frequency', tx_user, True )\n", + "tx_user.groupby('FrequencyCluster')['Frequency'].describe()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9oJXU2QlO1MJ" + }, + "source": [ + "After ordering, cluster 3 holds the customers with the highest purchase frequency and cluster 0 holds those with the lowest." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ilAR6e1zO1MJ" + }, + "source": [ + "

5. Revenue

\n", + "\n", + "Let’s see what our customer database looks like when we cluster customers by revenue. We will calculate the revenue for each customer, plot a histogram, and apply the same clustering method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "sKITGPzgO1MJ" + }, + "outputs": [], + "source": [ + "#calculate revenue for each customer\n", + "tx_uk['Revenue'] = tx_uk['UnitPrice'] * tx_uk['Quantity']\n", + "tx_revenue = tx_uk.groupby('CustomerID').Revenue.sum().reset_index()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "R7d63RXGO1MJ", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "outputId": "74920f4e-a00e-4473-dbc2-89cb5e5b48f5" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " CustomerID Revenue\n", + "0 12346.0 0.00\n", + "1 12747.0 4196.01\n", + "2 12748.0 29072.10\n", + "3 12749.0 3868.20\n", + "4 12820.0 942.34" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
| | CustomerID | Revenue |
| 0 | 12346.0 | 0.00 |
| 1 | 12747.0 | 4196.01 |
| 2 | 12748.0 | 29072.10 |
| 3 | 12749.0 | 3868.20 |
| 4 | 12820.0 | 942.34 |
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "tx_revenue", + "summary": "{\n \"name\": \"tx_revenue\",\n \"rows\": 3950,\n \"fields\": [\n {\n \"column\": \"CustomerID\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1576.8483250815016,\n \"min\": 12346.0,\n \"max\": 18287.0,\n \"num_unique_values\": 3950,\n \"samples\": [\n 17160.0,\n 15758.0,\n 15349.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Revenue\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6548.608224207446,\n \"min\": -4287.63,\n \"max\": 256438.49,\n \"num_unique_values\": 3878,\n \"samples\": [\n 1979.3,\n 264.65,\n 8727.61\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 75 + } + ], + "source": [ + "tx_revenue.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "McYXLZ78O1MJ", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "outputId": "c62fe818-7cd6-4310-d998-67c1c08a4640" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " CustomerID Recency RecencyCluster Frequency FrequencyCluster Revenue\n", + "0 17850.0 301 0 312 1 5288.63\n", + "1 15808.0 305 0 210 1 3724.77\n", + "2 13047.0 31 3 196 1 3079.10\n", + "3 14688.0 7 3 359 1 5107.38\n", + "4 16029.0 38 3 274 1 50992.61" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
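| | CustomerID | Recency | RecencyCluster | Frequency | FrequencyCluster | Revenue |
| 0 | 17850.0 | 301 | 0 | 312 | 1 | 5288.63 |
| 1 | 15808.0 | 305 | 0 | 210 | 1 | 3724.77 |
| 2 | 13047.0 | 31 | 3 | 196 | 1 | 3079.10 |
| 3 | 14688.0 | 7 | 3 | 359 | 1 | 5107.38 |
| 4 | 16029.0 | 38 | 3 | 274 | 1 | 50992.61 |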
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "tx_user", + "summary": "{\n \"name\": \"tx_user\",\n \"rows\": 3950,\n \"fields\": [\n {\n \"column\": \"CustomerID\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1576.8483250815016,\n \"min\": 12346.0,\n \"max\": 18287.0,\n \"num_unique_values\": 3950,\n \"samples\": [\n 14951.0,\n 14345.0,\n 12981.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Recency\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 100,\n \"min\": 0,\n \"max\": 373,\n \"num_unique_values\": 348,\n \"samples\": [\n 174,\n 353,\n 161\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"RecencyCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n \"max\": 3,\n \"num_unique_values\": 4,\n \"samples\": [\n 3,\n 1,\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Frequency\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 220,\n \"min\": 1,\n \"max\": 7983,\n \"num_unique_values\": 455,\n \"samples\": [\n 15,\n 320,\n 146\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"FrequencyCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 3,\n \"num_unique_values\": 4,\n \"samples\": [\n 0,\n 3,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Revenue\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6548.608224207446,\n \"min\": -4287.63,\n \"max\": 256438.49,\n \"num_unique_values\": 3878,\n \"samples\": [\n -1592.49,\n 532.94,\n 110.46\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 76 + } + ], + "source": [ + "#merge it with our main dataframe\n", + "tx_user = pd.merge(tx_user, tx_revenue, on='CustomerID')\n", + "tx_user.head()\n" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "n31rm5_IO1MJ" + }, + "source": [ + "We have some customers with negative revenue as well. Let’s continue and apply k-means clustering:\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "53T5G22lO1MJ" + }, + "source": [ + "**Elbow method to find out the optimum number of clusters for K-Means**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "xTLXwIJZO1MJ", + "scrolled": true, + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "outputId": "0d15de27-c7e3-4415-d207-87edb731dbb8" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + ":7: SettingWithCopyWarning:\n", + "\n", + "\n", + "A value is trying to be set on a copy of a slice from a DataFrame.\n", + "Try using .loc[row_indexer,col_indexer] = value instead\n", + "\n", + "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. 
Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + ":7: SettingWithCopyWarning:\n", + "\n", + "\n", + "A value is trying to be set on a copy of a slice from a DataFrame.\n", + "Try using .loc[row_indexer,col_indexer] = value instead\n", + "\n", + "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + ":7: SettingWithCopyWarning:\n", + "\n", + "\n", + "A value is trying to be set on a copy of a slice from a DataFrame.\n", + "Try using .loc[row_indexer,col_indexer] = value instead\n", + "\n", + "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + ":7: SettingWithCopyWarning:\n", + "\n", + "\n", + "A value is trying to be set on a copy of a slice from a DataFrame.\n", + "Try using .loc[row_indexer,col_indexer] = value instead\n", + "\n", + "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. 
Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + ":7: SettingWithCopyWarning:\n", + "\n", + "\n", + "A value is trying to be set on a copy of a slice from a DataFrame.\n", + "Try using .loc[row_indexer,col_indexer] = value instead\n", + "\n", + "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + ":7: SettingWithCopyWarning:\n", + "\n", + "\n", + "A value is trying to be set on a copy of a slice from a DataFrame.\n", + "Try using .loc[row_indexer,col_indexer] = value instead\n", + "\n", + "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + ":7: SettingWithCopyWarning:\n", + "\n", + "\n", + "A value is trying to be set on a copy of a slice from a DataFrame.\n", + "Try using .loc[row_indexer,col_indexer] = value instead\n", + "\n", + "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. 
Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + ":7: SettingWithCopyWarning:\n", + "\n", + "\n", + "A value is trying to be set on a copy of a slice from a DataFrame.\n", + "Try using .loc[row_indexer,col_indexer] = value instead\n", + "\n", + "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + ":7: SettingWithCopyWarning:\n", + "\n", + "\n", + "A value is trying to be set on a copy of a slice from a DataFrame.\n", + "Try using .loc[row_indexer,col_indexer] = value instead\n", + "\n", + "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", + "\n" + ] + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + "
" + ], + "image/png": "\n" + }, + "metadata": {} + } + ], + "source": [ + "from sklearn.cluster import KMeans\n", + "\n", + "sse={} # error\n", + "tx_recency = tx_user[['Revenue']]\n", + "for k in range(1, 10):\n", + " kmeans = KMeans(n_clusters=k, max_iter=1000).fit(tx_recency)\n", + " tx_recency[\"clusters\"] = kmeans.labels_ #cluster names corresponding to recency values\n", + " sse[k] = kmeans.inertia_ #sse corresponding to clusters\n", + "plt.figure()\n", + "plt.plot(list(sse.keys()), list(sse.values()))\n", + "plt.xlabel(\"Number of cluster\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GXgm-ZDkO1MJ" + }, + "source": [ + "From elbow's method, we find that clusters can be 3 or 4. Lets take 4 as the number of clusters" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K5eV2YpfO1MK" + }, + "source": [ + "

5.1. Revenue clusters
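One thing to watch in this section: `tx_user` already has a `Revenue` column by this point, so merging `tx_revenue` into it a second time produces suffixed duplicates (`Revenue_x`, `Revenue_y`) that then have to be dropped and renamed. The sketch below uses toy frames with made-up values to show the effect and one possible guard. It illustrates the pandas behaviour and is not the notebook's own code.

```python
import pandas as pd

# Toy frames; the names mirror the notebook but the values are invented for illustration.
tx_user = pd.DataFrame({'CustomerID': [1, 2, 3], 'Revenue': [10.0, 20.0, 30.0]})
tx_revenue = pd.DataFrame({'CustomerID': [1, 2, 3], 'Revenue': [10.0, 20.0, 30.0]})

# Merging again duplicates the overlapping column under the default suffixes.
merged = pd.merge(tx_user, tx_revenue, on='CustomerID')
print(list(merged.columns))

# Guard: only merge when the column is not already present.
if 'Revenue' not in tx_user.columns:
    tx_user = pd.merge(tx_user, tx_revenue, on='CustomerID')
print(list(tx_user.columns))
```

With the guard in place, re-running the cell leaves `tx_user` unchanged instead of accumulating suffixed columns.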

" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UOhe7qFR7ho3" + }, + "source": [ + "Let’s see how our customer database looks like when we cluster them based on revenue. We will calculate revenue\n", + "for each customer, plot a histogram and apply the same clustering method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "PxaBHYUZ7lYr" + }, + "outputs": [], + "source": [ + "#calculate revenue for each customer\n", + "tx_uk['Revenue'] = tx_uk['UnitPrice'] * tx_uk['Quantity']\n", + "tx_revenue = tx_uk.groupby('CustomerID').Revenue.sum().reset_index()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "fUoJewNL7qU0", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "outputId": "362054b0-6e8b-40fc-dad4-d4972ba6291b" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " CustomerID Revenue\n", + "0 12346.0 0.00\n", + "1 12747.0 4196.01\n", + "2 12748.0 29072.10\n", + "3 12749.0 3868.20\n", + "4 12820.0 942.34" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
| | CustomerID | Revenue |
| 0 | 12346.0 | 0.00 |
| 1 | 12747.0 | 4196.01 |
| 2 | 12748.0 | 29072.10 |
| 3 | 12749.0 | 3868.20 |
| 4 | 12820.0 | 942.34 |
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "tx_revenue", + "summary": "{\n \"name\": \"tx_revenue\",\n \"rows\": 3950,\n \"fields\": [\n {\n \"column\": \"CustomerID\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1576.8483250815016,\n \"min\": 12346.0,\n \"max\": 18287.0,\n \"num_unique_values\": 3950,\n \"samples\": [\n 17160.0,\n 15758.0,\n 15349.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Revenue\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6548.608224207446,\n \"min\": -4287.63,\n \"max\": 256438.49,\n \"num_unique_values\": 3878,\n \"samples\": [\n 1979.3,\n 264.65,\n 8727.61\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 79 + } + ], + "source": [ + "tx_revenue.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "EHlcZWt77vhH", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "outputId": "95cd9806-e20f-4900-8056-19f1f3909a66" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " CustomerID Recency RecencyCluster Frequency FrequencyCluster Revenue\n", + "0 17850.0 301 0 312 1 5288.63\n", + "1 15808.0 305 0 210 1 3724.77\n", + "2 13047.0 31 3 196 1 3079.10\n", + "3 14688.0 7 3 359 1 5107.38\n", + "4 16029.0 38 3 274 1 50992.61" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
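| | CustomerID | Recency | RecencyCluster | Frequency | FrequencyCluster | Revenue |
| 0 | 17850.0 | 301 | 0 | 312 | 1 | 5288.63 |
| 1 | 15808.0 | 305 | 0 | 210 | 1 | 3724.77 |
| 2 | 13047.0 | 31 | 3 | 196 | 1 | 3079.10 |
| 3 | 14688.0 | 7 | 3 | 359 | 1 | 5107.38 |
| 4 | 16029.0 | 38 | 3 | 274 | 1 | 50992.61 |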
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "tx_user", + "summary": "{\n \"name\": \"tx_user\",\n \"rows\": 3950,\n \"fields\": [\n {\n \"column\": \"CustomerID\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1576.8483250815016,\n \"min\": 12346.0,\n \"max\": 18287.0,\n \"num_unique_values\": 3950,\n \"samples\": [\n 14951.0,\n 14345.0,\n 12981.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Recency\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 100,\n \"min\": 0,\n \"max\": 373,\n \"num_unique_values\": 348,\n \"samples\": [\n 174,\n 353,\n 161\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"RecencyCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n \"max\": 3,\n \"num_unique_values\": 4,\n \"samples\": [\n 3,\n 1,\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Frequency\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 220,\n \"min\": 1,\n \"max\": 7983,\n \"num_unique_values\": 455,\n \"samples\": [\n 15,\n 320,\n 146\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"FrequencyCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 3,\n \"num_unique_values\": 4,\n \"samples\": [\n 0,\n 3,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Revenue\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6548.608224207446,\n \"min\": -4287.63,\n \"max\": 256438.49,\n \"num_unique_values\": 3878,\n \"samples\": [\n -1592.49,\n 532.94,\n 110.46\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 80 + } + ], + "source": [ + "#merge it with our main dataframe\n", + "tx_user = pd.merge(tx_user, tx_revenue, on='CustomerID')\n", + "tx_user = 
tx_user.drop(columns=['Revenue_y'])\n", + "tx_user=tx_user.rename(columns={'Revenue_x':'Revenue'})\n", + "tx_user.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3-0UCAqU8ADu" + }, + "source": [ + "We have some customers with negative revenue as well. Let’s continue and apply k-means clustering:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "j61HnJJ08FsZ" + }, + "outputs": [], + "source": [ + "#Elbow method to find out the optimum number of clusters for K-Means" + ] + }, + { + "cell_type": "code", + "source": [ + "tx_user" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 423 + }, + "id": "LhWFt4de1Kbz", + "outputId": "5fd75a8e-cb29-46c9-8697-bfa7cc353f57" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " CustomerID Recency RecencyCluster Frequency FrequencyCluster \\\n", + "0 17850.0 301 0 312 1 \n", + "1 15808.0 305 0 210 1 \n", + "2 13047.0 31 3 196 1 \n", + "3 14688.0 7 3 359 1 \n", + "4 16029.0 38 3 274 1 \n", + "... ... ... ... ... ... \n", + "3945 14056.0 0 3 1128 2 \n", + "3946 14456.0 4 3 977 2 \n", + "3947 12748.0 0 3 4642 3 \n", + "3948 17841.0 1 3 7983 3 \n", + "3949 14096.0 3 3 5128 3 \n", + "\n", + " Revenue \n", + "0 5288.63 \n", + "1 3724.77 \n", + "2 3079.10 \n", + "3 5107.38 \n", + "4 50992.61 \n", + "... ... \n", + "3945 8124.40 \n", + "3946 3047.63 \n", + "3947 29072.10 \n", + "3948 40340.78 \n", + "3949 57120.91 \n", + "\n", + "[3950 rows x 6 columns]" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
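| | CustomerID | Recency | RecencyCluster | Frequency | FrequencyCluster | Revenue |
| 0 | 17850.0 | 301 | 0 | 312 | 1 | 5288.63 |
| 1 | 15808.0 | 305 | 0 | 210 | 1 | 3724.77 |
| 2 | 13047.0 | 31 | 3 | 196 | 1 | 3079.10 |
| 3 | 14688.0 | 7 | 3 | 359 | 1 | 5107.38 |
| 4 | 16029.0 | 38 | 3 | 274 | 1 | 50992.61 |
| ... | ... | ... | ... | ... | ... | ... |
| 3945 | 14056.0 | 0 | 3 | 1128 | 2 | 8124.40 |
| 3946 | 14456.0 | 4 | 3 | 977 | 2 | 3047.63 |
| 3947 | 12748.0 | 0 | 3 | 4642 | 3 | 29072.10 |
| 3948 | 17841.0 | 1 | 3 | 7983 | 3 | 40340.78 |
| 3949 | 14096.0 | 3 | 3 | 5128 | 3 | 57120.91 |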
\n", + "

3950 rows × 6 columns

\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + " \n", + " \n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "tx_user", + "summary": "{\n \"name\": \"tx_user\",\n \"rows\": 3950,\n \"fields\": [\n {\n \"column\": \"CustomerID\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1576.8483250815016,\n \"min\": 12346.0,\n \"max\": 18287.0,\n \"num_unique_values\": 3950,\n \"samples\": [\n 14951.0,\n 14345.0,\n 12981.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Recency\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 100,\n \"min\": 0,\n \"max\": 373,\n \"num_unique_values\": 348,\n \"samples\": [\n 174,\n 353,\n 161\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"RecencyCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n \"max\": 3,\n \"num_unique_values\": 4,\n \"samples\": [\n 3,\n 1,\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Frequency\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 220,\n \"min\": 1,\n \"max\": 7983,\n \"num_unique_values\": 455,\n \"samples\": [\n 15,\n 320,\n 146\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"FrequencyCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 3,\n \"num_unique_values\": 4,\n \"samples\": [\n 0,\n 3,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Revenue\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6548.608224207446,\n \"min\": -4287.63,\n \"max\": 256438.49,\n \"num_unique_values\": 3878,\n \"samples\": [\n -1592.49,\n 532.94,\n 110.46\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 82 + } + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "JjKhsRjbJSeg", + "colab": { + "base_uri": 
"https://localhost:8080/", + "height": 1000 + }, + "outputId": "d5fe75f2-b71a-4b1c-8c7a-9d55109f4239" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. 
Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n", + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n" + ] + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + "
" + ], + "image/png": "\n" + }, + "metadata": {} + } + ], + "source": [ + "# Assuming 'tx_data' is your original DataFrame containing all transactional data\n", + "# Ensure it contains 'CustomerID', 'UnitPrice', 'Quantity', and any other relevant columns\n", + "\n", + "# Step 1: Calculate 'Revenue' for each transaction\n", + "tx_data['Revenue'] = tx_data['UnitPrice'] * tx_data['Quantity']\n", + "\n", + "# Step 2: Group by 'CustomerID' to sum up the revenue for each customer\n", + "tx_user2 = tx_data.groupby('CustomerID')['Revenue'].sum().reset_index()\n", + "\n", + "# Step 3: Prepare the tx_recency DataFrame with the 'Revenue' column\n", + "tx_recency = tx_user2[['Revenue']]\n", + "\n", + "# Step 4: Run KMeans clustering and calculate SSE for different cluster numbers\n", + "sse = {} # Store the sum of squared errors\n", + "for k in range(1, 10):\n", + " kmeans = KMeans(n_clusters=k, max_iter=1000).fit(tx_recency)\n", + " tx_recency[\"clusters\"] = kmeans.labels_ # Assign cluster labels\n", + " sse[k] = kmeans.inertia_ # Store SSE for each k\n", + "\n", + "# Step 5: Plotting the Elbow method graph\n", + "plt.figure()\n", + "plt.plot(list(sse.keys()), list(sse.values()))\n", + "plt.xlabel(\"Number of clusters\")\n", + "plt.ylabel(\"SSE\")\n", + "plt.show()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yA1j35uVO1MK" + }, + "source": [ + "

6. Overall Score based on RFM Clustering

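As a rough sketch of the scoring built in this section, assuming the three cluster columns are already ordered so that a higher label is better: the overall score is their row-wise sum, and the segment labels follow the thresholds used later in this notebook (above 2 is Mid-Value, above 4 is High-Value). The `toy` DataFrame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data: each cluster column holds an ordered label (0 = worst, 3 = best).
toy = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "RecencyCluster": [0, 1, 3, 3],
    "FrequencyCluster": [0, 1, 1, 3],
    "RevenueCluster": [0, 1, 1, 2],
})

# Overall score = sum of the three ordered cluster labels for each customer.
cluster_cols = ["RecencyCluster", "FrequencyCluster", "RevenueCluster"]
toy["OverallScore"] = toy[cluster_cols].sum(axis=1)

# Segment thresholds as used in the notebook: >2 is Mid-Value, >4 is High-Value.
toy["Segment"] = "Low-Value"
toy.loc[toy["OverallScore"] > 2, "Segment"] = "Mid-Value"
toy.loc[toy["OverallScore"] > 4, "Segment"] = "High-Value"

print(toy[["CustomerID", "OverallScore", "Segment"]])
```

Because the labels are summed with equal weight, the score only makes sense if every cluster column is ordered in the same direction (for recency, that means a more recent customer gets a higher label).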
\n", + "\n", + "We have scores (cluster numbers) for recency, frequency & revenue. Let’s create an overall score out of them\n" + ] + }, + { + "cell_type": "code", + "source": [ + "#apply clustering\n", + "kmeans = KMeans(n_clusters=4)\n", + "tx_user['RevenueCluster'] = kmeans.fit_predict(tx_user[['Revenue']])\n", + "\n", + "#order the cluster numbers\n", + "tx_user = order_cluster('RevenueCluster', 'Revenue',tx_user,True)\n", + "\n", + "#show details of the dataframe\n", + "tx_user.groupby('RevenueCluster')['Revenue'].describe()" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 295 + }, + "id": "neRDRJuQ2fZY", + "outputId": "2eea423d-149a-475f-aa69-f816638fa5ee" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", + "\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " count mean std min 25% \\\n", + "RevenueCluster \n", + "0 3687.0 907.254414 921.910820 -4287.63 263.115 \n", + "1 234.0 7760.699530 3637.173671 4330.67 5161.485 \n", + "2 27.0 43070.445185 15939.249588 25748.35 28865.490 \n", + "3 2.0 221960.330000 48759.481478 187482.17 204721.250 \n", + "\n", + " 50% 75% max \n", + "RevenueCluster \n", + "0 572.56 1258.220 4314.72 \n", + "1 6549.38 9142.305 21535.90 \n", + "2 36351.42 53489.790 88125.38 \n", + "3 221960.33 239199.410 256438.49 " + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
countmeanstdmin25%50%75%max
RevenueCluster
03687.0907.254414921.910820-4287.63263.115572.561258.2204314.72
1234.07760.6995303637.1736714330.675161.4856549.389142.30521535.90
227.043070.44518515939.24958825748.3528865.49036351.4253489.79088125.38
32.0221960.33000048759.481478187482.17204721.250221960.33239199.410256438.49
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"tx_user\",\n \"rows\": 4,\n \"fields\": [\n {\n \"column\": \"RevenueCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n \"max\": 3,\n \"num_unique_values\": 4,\n \"samples\": [\n 1,\n 3,\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"count\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1802.6677453152593,\n \"min\": 2.0,\n \"max\": 3687.0,\n \"num_unique_values\": 4,\n \"samples\": [\n 234.0,\n 2.0,\n 3687.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"mean\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 104010.82368070885,\n \"min\": 907.2544138866288,\n \"max\": 221960.33000000002,\n \"num_unique_values\": 4,\n \"samples\": [\n 7760.69952991453,\n 221960.33000000002,\n 907.2544138866288\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"std\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 21958.023661503135,\n \"min\": 921.9108197203295,\n \"max\": 48759.481477669535,\n \"num_unique_values\": 4,\n \"samples\": [\n 3637.173671171341,\n 48759.481477669535,\n 921.9108197203295\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"min\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 90329.53968761419,\n \"min\": -4287.63,\n \"max\": 187482.17,\n \"num_unique_values\": 4,\n \"samples\": [\n 4330.67,\n 187482.17,\n -4287.63\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"25%\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 97449.32308515052,\n \"min\": 263.115,\n \"max\": 204721.25,\n \"num_unique_values\": 4,\n \"samples\": [\n 5161.485,\n 204721.25,\n 263.115\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"50%\",\n \"properties\": {\n \"dtype\": 
\"number\",\n \"std\": 104908.33313947511,\n \"min\": 572.56,\n \"max\": 221960.33000000002,\n \"num_unique_values\": 4,\n \"samples\": [\n 6549.38,\n 221960.33000000002,\n 572.56\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"75%\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 111350.54989645646,\n \"min\": 1258.22,\n \"max\": 239199.41,\n \"num_unique_values\": 4,\n \"samples\": [\n 9142.305,\n 239199.41,\n 1258.22\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"max\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 115047.04827536555,\n \"min\": 4314.72,\n \"max\": 256438.49,\n \"num_unique_values\": 4,\n \"samples\": [\n 21535.9,\n 256438.49,\n 4314.72\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 84 + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5BiRQiYpO1MK" + }, + "source": [ + "Score 8 is our best customer, score 0 is our worst customer." + ] + }, + { + "cell_type": "code", + "source": [ + "tx_user" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 423 + }, + "id": "40YdmjGNyuN_", + "outputId": "e2021ff5-0e20-4c0b-8da2-b34c0f27b13a" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " CustomerID Recency RecencyCluster Frequency FrequencyCluster \\\n", + "0 17850.0 301 0 312 1 \n", + "1 14688.0 7 3 359 1 \n", + "2 13767.0 1 3 399 1 \n", + "3 15513.0 30 3 314 1 \n", + "4 14849.0 21 3 392 1 \n", + "... ... ... ... ... ... \n", + "3945 12748.0 0 3 4642 3 \n", + "3946 17841.0 1 3 7983 3 \n", + "3947 14096.0 3 3 5128 3 \n", + "3948 17450.0 7 3 351 1 \n", + "3949 18102.0 0 3 433 1 \n", + "\n", + " Revenue RevenueCluster \n", + "0 5288.63 1 \n", + "1 5107.38 1 \n", + "2 16945.71 1 \n", + "3 14520.08 1 \n", + "4 7904.28 1 \n", + "... ... ... 
\n", + "3945 29072.10 2 \n", + "3946 40340.78 2 \n", + "3947 57120.91 2 \n", + "3948 187482.17 3 \n", + "3949 256438.49 3 \n", + "\n", + "[3950 rows x 7 columns]" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CustomerIDRecencyRecencyClusterFrequencyFrequencyClusterRevenueRevenueCluster
017850.0301031215288.631
114688.07335915107.381
213767.013399116945.711
315513.0303314114520.081
414849.021339217904.281
........................
394512748.0034642329072.102
394617841.0137983340340.782
394714096.0335128357120.912
394817450.0733511187482.173
394918102.0034331256438.493
\n", + "

3950 rows × 7 columns

\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + " \n", + " \n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "tx_user", + "summary": "{\n \"name\": \"tx_user\",\n \"rows\": 3950,\n \"fields\": [\n {\n \"column\": \"CustomerID\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1576.8483250815016,\n \"min\": 12346.0,\n \"max\": 18287.0,\n \"num_unique_values\": 3950,\n \"samples\": [\n 17172.0,\n 13607.0,\n 13379.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Recency\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 100,\n \"min\": 0,\n \"max\": 373,\n \"num_unique_values\": 348,\n \"samples\": [\n 189,\n 307,\n 161\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"RecencyCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n \"max\": 3,\n \"num_unique_values\": 4,\n \"samples\": [\n 3,\n 1,\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Frequency\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 220,\n \"min\": 1,\n \"max\": 7983,\n \"num_unique_values\": 455,\n \"samples\": [\n 718,\n 500,\n 99\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"FrequencyCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 3,\n \"num_unique_values\": 4,\n \"samples\": [\n 0,\n 3,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Revenue\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6548.608224207447,\n \"min\": -4287.63,\n \"max\": 256438.49,\n \"num_unique_values\": 3878,\n \"samples\": [\n 102.83,\n 190.53,\n 314.69\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"RevenueCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 3,\n \"num_unique_values\": 4,\n \"samples\": [\n 0,\n 3,\n 1\n ],\n 
\"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 85 + } + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "NyWdB45wH8Cx" + }, + "outputs": [], + "source": [ + "# import pandas as pd\n", + "# from sklearn.cluster import KMeans\n", + "# from sklearn.preprocessing import StandardScaler\n", + "\n", + "# # Function to create clusters based on the provided number of clusters and column name\n", + "# def create_clusters(data, num_clusters, col_name):\n", + "# # Ensure the column exists in the DataFrame\n", + "# if col_name not in data.columns:\n", + "# raise ValueError(f\"Column '{col_name}' not found in the DataFrame.\")\n", + "\n", + "# # Standardize the data before clustering\n", + "# scaler = StandardScaler()\n", + "# scaled_data = scaler.fit_transform(data[[col_name]])\n", + "\n", + "# # Fit KMeans on the standardized data\n", + "# kmeans = KMeans(n_clusters=num_clusters, max_iter=1000, random_state=42)\n", + "# kmeans.fit(scaled_data)\n", + "\n", + "# # Add the cluster labels to the original DataFrame\n", + "# data[col_name + 'Cluster'] = kmeans.labels_\n", + "# return data\n", + "\n", + "# # Example DataFrame (replace with your own data loading step)\n", + "# # tx_user = pd.read_csv('path_to_your_user_data.csv')\n", + "\n", + "# # Define the number of clusters\n", + "# num_clusters = 4\n", + "\n", + "# # Check if required columns exist\n", + "# required_columns = ['Recency', 'Frequency', 'Revenue']\n", + "# missing_columns = [col for col in required_columns if col not in tx_user.columns]\n", + "# if missing_columns:\n", + "# raise ValueError(f\"Missing columns in the DataFrame: {missing_columns}\")\n", + "\n", + "# # Assuming tx_user has 'Recency', 'Frequency', and 'Revenue' columns\n", + "# tx_user = create_clusters(tx_user, num_clusters, 'Recency')\n", + "# tx_user = create_clusters(tx_user, num_clusters, 'Frequency')\n", + "# tx_user = create_clusters(tx_user, 
num_clusters, 'Revenue')\n", + "\n", + "# # Calculate the OverallScore\n", + "# tx_user['OverallScore'] = tx_user['RecencyCluster'] + tx_user['FrequencyCluster'] + tx_user['RevenueCluster']\n", + "\n", + "# # Group by OverallScore and calculate the mean of Recency, Frequency, and Revenue\n", + "# result = tx_user.groupby('OverallScore')[['Recency', 'Frequency', 'Revenue']].mean()\n", + "\n", + "# # Display the result\n", + "# print(result)\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "source": [ + "# Calculate overall score\n", + "tx_user['OverallScore'] = tx_user['RecencyCluster'] + tx_user['FrequencyCluster'] + tx_user['RevenueCluster']\n", + "\n", + "# Use mean() to see details for the selected columns\n", + "tx_user.groupby('OverallScore')[['Recency', 'Frequency', 'Revenue']].mean()\n" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 363 + }, + "id": "Ah7vxM_Szgrv", + "outputId": "8c84fa7f-3645-43a0-eac9-33e632c78c49" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Recency Frequency Revenue\n", + "OverallScore \n", + "0 304.584388 21.995781 303.339705\n", + "1 185.362989 32.596085 498.087546\n", + "2 78.991304 46.963043 868.082991\n", + "3 20.689610 68.419590 1091.416414\n", + "4 14.892617 271.755034 3607.097114\n", + "5 9.662162 373.290541 9136.946014\n", + "6 7.740741 876.037037 22777.914815\n", + "7 1.857143 1272.714286 103954.025714\n", + "8 1.333333 5917.666667 42177.930000" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
RecencyFrequencyRevenue
OverallScore
0304.58438821.995781303.339705
1185.36298932.596085498.087546
278.99130446.963043868.082991
320.68961068.4195901091.416414
414.892617271.7550343607.097114
59.662162373.2905419136.946014
67.740741876.03703722777.914815
71.8571431272.714286103954.025714
81.3333335917.66666742177.930000
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"tx_user\",\n \"rows\": 9,\n \"fields\": [\n {\n \"column\": \"OverallScore\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2,\n \"min\": 0,\n \"max\": 8,\n \"num_unique_values\": 9,\n \"samples\": [\n 7,\n 1,\n 5\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Recency\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 106.51315302965865,\n \"min\": 1.3333333333333333,\n \"max\": 304.584388185654,\n \"num_unique_values\": 9,\n \"samples\": [\n 1.8571428571428572,\n 185.3629893238434,\n 9.662162162162161\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Frequency\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1899.447716676171,\n \"min\": 21.9957805907173,\n \"max\": 5917.666666666667,\n \"num_unique_values\": 9,\n \"samples\": [\n 1272.7142857142858,\n 32.596085409252666,\n 373.2905405405405\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Revenue\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 34322.49841955528,\n \"min\": 303.3397046413502,\n \"max\": 103954.0257142857,\n \"num_unique_values\": 9,\n \"samples\": [\n 103954.0257142857,\n 498.0875462633452,\n 9136.946013513514\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 87 + } + ] + }, + { + "cell_type": "code", + "source": [ + "tx_user['Segment'] = 'Low-Value'\n", + "tx_user.loc[tx_user['OverallScore']>2,'Segment'] = 'Mid-Value'\n", + "tx_user.loc[tx_user['OverallScore']>4,'Segment'] = 'High-Value'" + ], + "metadata": { + "id": "jL7SNKmF4qyX" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "XMNspW2NO1MK" + }, + "outputs": [], + "source": [ + "\n", + "# else:\n", + "# print(\"Warning: 
'OverallScore' column not found. Cannot group and calculate means.\")\n", + "\n", + "# # --- (Code from ipython-input-164-5ca3f4963d0c) ---\n", + "# tx_user['Segment'] = 'Low-Value'import pandas as pd\n", + "# from sklearn.cluster import KMeans\n", + "\n", + "# # Assuming tx_user is already defined and populated with Recency, Frequency, and Revenue columns\n", + "# # For example:\n", + "# # tx_user = pd.read_csv('path_to_your_user_data.csv')\n", + "# # If you don't have Recency, Frequency calculated, you'll need to add that logic here\n", + "\n", + "# # Step 1: Create clusters for Recency, Frequency, and Revenue\n", + "# def create_clusters(data, num_clusters, col_name):\n", + "# kmeans = KMeans(n_clusters=num_clusters, max_iter=1000)\n", + "# # Check if column exists before fitting\n", + "# if col_name in data.columns:\n", + "# kmeans.fit(data[[col_name]])\n", + "# # Ensure the new cluster column is assigned back to the DataFrame\n", + "# data[col_name + 'Cluster'] = kmeans.labels_\n", + "# else:\n", + "# print(f\"Warning: Column '{col_name}' not found in DataFrame. Skipping clustering for this column.\")\n", + "# return data\n", + "\n", + "# # Define the number of clusters you want\n", + "# num_clusters = 4\n", + "\n", + "# # Create clusters for Recency, Frequency, and Revenue\n", + "# tx_user = create_clusters(tx_user, num_clusters, 'Recency')\n", + "# tx_user = create_clusters(tx_user, num_clusters, 'Frequency')\n", + "# tx_user = create_clusters(tx_user, num_clusters, 'Revenue')\n", + "\n", + "# # Step 2: Calculate OverallScore\n", + "# # Check if all cluster columns exist before calculating OverallScore\n", + "# if 'RecencyCluster' in tx_user.columns and 'FrequencyCluster' in tx_user.columns and 'RevenueCluster' in tx_user.columns:\n", + "# tx_user['OverallScore'] = tx_user['RecencyCluster'] + tx_user['FrequencyCluster'] + tx_user['RevenueCluster']\n", + "# else:\n", + "# print(\"Warning: One or more cluster columns are missing. 
Cannot calculate OverallScore.\")\n", + "\n", + "# # Step 3: Group by OverallScore and calculate mean of Recency, Frequency, and Revenue\n", + "# # Check if OverallScore column exists before grouping\n", + "# if 'OverallScore' in tx_user.columns:\n", + "# result = tx_user.groupby('OverallScore')[['Recency', 'Frequency', 'Revenue']].mean()\n", + "# # Display the result\n", + "# print(result)\n", + "# tx_user.loc[tx_user['OverallScore']>2,'Segment'] = 'Mid-Value'\n", + "# tx_user.loc[tx_user['OverallScore']>4,'Segment'] = 'High-Value'\n", + "# Use code with caution" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SLhYVJZtO1MK" + }, + "source": [ + "

7. Customer Lifetime Value

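This section takes each customer's revenue over a six-month window as the LTV target, since the dataset carries no cost information. A minimal sketch of that calculation on made-up data is shown here; the `tx` DataFrame, its values, and the `start`/`end` names are assumptions for illustration, while the window boundaries and the `m6_Revenue` column name mirror the notebook.

```python
import pandas as pd

# Hypothetical toy transactions (not from the dataset).
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3],
    "InvoiceDate": pd.to_datetime(
        ["2011-03-15", "2011-07-01", "2011-06-01", "2011-11-30", "2011-12-05"]
    ),
    "UnitPrice": [2.0, 3.0, 1.5, 4.0, 10.0],
    "Quantity": [5, 2, 10, -1, 1],  # a negative quantity represents a return
})

# Six-month window: 1 June 2011 (inclusive) to 1 December 2011 (exclusive).
start, end = pd.Timestamp("2011-06-01"), pd.Timestamp("2011-12-01")
in_window = (tx["InvoiceDate"] >= start) & (tx["InvoiceDate"] < end)
tx_6m = tx.loc[in_window].copy()

# With no cost data, LTV reduces to summed revenue per customer.
tx_6m["Revenue"] = tx_6m["UnitPrice"] * tx_6m["Quantity"]
ltv = (
    tx_6m.groupby("CustomerID")["Revenue"]
    .sum()
    .reset_index()
    .rename(columns={"Revenue": "m6_Revenue"})
)
print(ltv)
```

Returns inside the window reduce a customer's total, which is how a negative LTV can arise when returns outweigh purchases.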
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5BNtUPKDO1ML" + }, + "source": [ + "Since our feature set is ready, let’s calculate 6 months LTV for each customer which we are going to use for training our model.\n", + "\n", + "**Lifetime Value: Total Gross Revenue - Total Cost**\n", + "\n", + "There is no cost specified in the dataset. That’s why Revenue becomes our LTV directly.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "iVr0rk7NO1ML", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 538 + }, + "outputId": "f1bd50af-4929-4512-8ce6-6f1e7e1732f9" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " InvoiceNo StockCode Description Quantity \\\n", + "0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 \n", + "1 536365 71053 WHITE METAL LANTERN 6 \n", + "2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 \n", + "3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 \n", + "4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 \n", + "\n", + " InvoiceDate UnitPrice CustomerID Country \\\n", + "0 2010-12-01 08:26:00 2.55 17850.0 United Kingdom \n", + "1 2010-12-01 08:26:00 3.39 17850.0 United Kingdom \n", + "2 2010-12-01 08:26:00 2.75 17850.0 United Kingdom \n", + "3 2010-12-01 08:26:00 3.39 17850.0 United Kingdom \n", + "4 2010-12-01 08:26:00 3.39 17850.0 United Kingdom \n", + "\n", + " InvoiceYearMonth Revenue \n", + "0 201012 15.30 \n", + "1 201012 20.34 \n", + "2 201012 22.00 \n", + "3 201012 20.34 \n", + "4 201012 20.34 " + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountryInvoiceYearMonthRevenue
053636585123AWHITE HANGING HEART T-LIGHT HOLDER62010-12-01 08:26:002.5517850.0United Kingdom20101215.30
153636571053WHITE METAL LANTERN62010-12-01 08:26:003.3917850.0United Kingdom20101220.34
253636584406BCREAM CUPID HEARTS COAT HANGER82010-12-01 08:26:002.7517850.0United Kingdom20101222.00
353636584029GKNITTED UNION FLAG HOT WATER BOTTLE62010-12-01 08:26:003.3917850.0United Kingdom20101220.34
453636584029ERED WOOLLY HOTTIE WHITE HEART.62010-12-01 08:26:003.3917850.0United Kingdom20101220.34
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "tx_uk" + } + }, + "metadata": {}, + "execution_count": 90 + } + ], + "source": [ + "tx_uk.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "2cm4zAl-O1ML", + "scrolled": true, + "colab": { + "base_uri": "https://localhost:8080/", + "height": 303 + }, + "outputId": "fa9e2726-4781-46d7-bbde-483057d75dc3" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "count 495478\n", + "mean 2011-07-04 05:01:41.098131456\n", + "min 2010-12-01 08:26:00\n", + "25% 2011-03-27 12:06:00\n", + "50% 2011-07-19 11:47:00\n", + "75% 2011-10-20 10:41:00\n", + "max 2011-12-09 12:49:00\n", + "Name: InvoiceDate, dtype: object" + ], + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
InvoiceDate
count495478
mean2011-07-04 05:01:41.098131456
min2010-12-01 08:26:00
25%2011-03-27 12:06:00
50%2011-07-19 11:47:00
75%2011-10-20 10:41:00
max2011-12-09 12:49:00
\n", + "

" + ] + }, + "metadata": {}, + "execution_count": 91 + } + ], + "source": [ + "tx_uk['InvoiceDate'].describe()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aVs_AWrVO1ML" + }, + "source": [ + "The first invoice dates from 1 December 2010. We consider transactions from March 2011 onwards (so that the customers are not new) and split them into two windows: a 3-month window (1 March to 31 May 2011) and a 6-month window (1 June to 30 November 2011)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Wiz_GDbYYxMP" + }, + "outputs": [], + "source": [ + "from datetime import datetime\n", + "\n", + "tx_3m = tx_uk[(tx_uk.InvoiceDate < datetime(2011,6,1)) & (tx_uk.InvoiceDate >= datetime(2011,3,1))].reset_index(drop=True) # 3-month window\n", + "tx_6m = tx_uk[(tx_uk.InvoiceDate >= datetime(2011,6,1)) & (tx_uk.InvoiceDate < datetime(2011,12,1))].reset_index(drop=True) # 6-month window" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "7SkELym7O1ML" + }, + "outputs": [], + "source": [ + "# Calculate revenue per transaction, then total it per customer in a new dataframe\n", + "tx_6m['Revenue'] = tx_6m['UnitPrice'] * tx_6m['Quantity']\n", + "tx_user_6m = tx_6m.groupby('CustomerID')['Revenue'].sum().reset_index()\n", + "tx_user_6m.columns = ['CustomerID','m6_Revenue']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "mYAEizUTO1ML", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "outputId": "0ab10fca-b0ef-4f50-a12b-ab56275824c0" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " CustomerID m6_Revenue\n", + "0 12747.0 1666.11\n", + "1 12748.0 18679.01\n", + "2 12749.0 2323.04\n", + "3 12820.0 561.53\n", + "4 12822.0 918.98" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CustomerIDm6_Revenue
012747.01666.11
112748.018679.01
212749.02323.04
312820.0561.53
412822.0918.98
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "tx_user_6m", + "summary": "{\n \"name\": \"tx_user_6m\",\n \"rows\": 3167,\n \"fields\": [\n {\n \"column\": \"CustomerID\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1572.9401142163365,\n \"min\": 12747.0,\n \"max\": 18287.0,\n \"num_unique_values\": 3167,\n \"samples\": [\n 13805.0,\n 15130.0,\n 13115.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"m6_Revenue\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 4782.390775124077,\n \"min\": -4287.63,\n \"max\": 180469.05,\n \"num_unique_values\": 3117,\n \"samples\": [\n 2687.56,\n 1638.68,\n 465.96\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 94 + } + ], + "source": [ + "tx_user_6m.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "sfzhm8egO1ML", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 472 + }, + "outputId": "44db0832-c93d-44fb-93c4-9a9d267e2602" + }, + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/plain": [ + "
" + ], + "image/png": "\n" + }, + "metadata": {} + } + ], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "# Plot the histogram\n", + "plt.hist(tx_user_6m['m6_Revenue'], bins=30, edgecolor='black')\n", + "\n", + "# Set the title and labels\n", + "plt.title('6m Revenue')\n", + "plt.xlabel('Revenue')\n", + "plt.ylabel('Count')\n", + "\n", + "# Display the plot\n", + "plt.show()\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RbWoD9C9O1ML" + }, + "source": [ + "Histogram clearly shows we have customers with negative LTV. We have some outliers too. Filtering out the outliers makes sense to have a proper machine learning model." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TuIbH_htO1MM" + }, + "source": [ + "Ok, next step. We will merge our 3 months and tx_uk and also merge 6 months dataframe and tx_uk to see correlations between LTV and the feature set we have." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "s3ywyb_JO1MM", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 313 + }, + "outputId": "2506c89e-802b-41bd-8927-fd39d6ff3584" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " CustomerID Recency RecencyCluster Frequency FrequencyCluster Revenue \\\n", + "0 17850.0 301 0 312 1 5288.63 \n", + "1 14688.0 7 3 359 1 5107.38 \n", + "2 13767.0 1 3 399 1 16945.71 \n", + "3 15513.0 30 3 314 1 14520.08 \n", + "4 14849.0 21 3 392 1 7904.28 \n", + "\n", + " RevenueCluster OverallScore Segment \n", + "0 1 2 Low-Value \n", + "1 1 5 High-Value \n", + "2 1 5 High-Value \n", + "3 1 5 High-Value \n", + "4 1 5 High-Value " + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CustomerIDRecencyRecencyClusterFrequencyFrequencyClusterRevenueRevenueClusterOverallScoreSegment
017850.0301031215288.6312Low-Value
114688.07335915107.3815High-Value
213767.013399116945.7115High-Value
315513.0303314114520.0815High-Value
414849.021339217904.2815High-Value
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "tx_user", + "summary": "{\n \"name\": \"tx_user\",\n \"rows\": 3950,\n \"fields\": [\n {\n \"column\": \"CustomerID\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1576.8483250815016,\n \"min\": 12346.0,\n \"max\": 18287.0,\n \"num_unique_values\": 3950,\n \"samples\": [\n 17172.0,\n 13607.0,\n 13379.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Recency\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 100,\n \"min\": 0,\n \"max\": 373,\n \"num_unique_values\": 348,\n \"samples\": [\n 189,\n 307,\n 161\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"RecencyCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n \"max\": 3,\n \"num_unique_values\": 4,\n \"samples\": [\n 3,\n 1,\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Frequency\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 220,\n \"min\": 1,\n \"max\": 7983,\n \"num_unique_values\": 455,\n \"samples\": [\n 718,\n 500,\n 99\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"FrequencyCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 3,\n \"num_unique_values\": 4,\n \"samples\": [\n 0,\n 3,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Revenue\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6548.608224207447,\n \"min\": -4287.63,\n \"max\": 256438.49,\n \"num_unique_values\": 3878,\n \"samples\": [\n 102.83,\n 190.53,\n 314.69\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"RevenueCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 3,\n \"num_unique_values\": 4,\n \"samples\": [\n 0,\n 3,\n 1\n ],\n 
\"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"OverallScore\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n \"max\": 8,\n \"num_unique_values\": 9,\n \"samples\": [\n 7,\n 5,\n 6\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Segment\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Low-Value\",\n \"High-Value\",\n \"Mid-Value\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 96 + } + ], + "source": [ + "tx_user.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "npq34CtaO1MM", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 538 + }, + "outputId": "a10d8f0f-d3a1-4ccf-d957-b81bd8f932db" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " InvoiceNo StockCode Description Quantity \\\n", + "0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 \n", + "1 536365 71053 WHITE METAL LANTERN 6 \n", + "2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 \n", + "3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 \n", + "4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 \n", + "\n", + " InvoiceDate UnitPrice CustomerID Country \\\n", + "0 2010-12-01 08:26:00 2.55 17850.0 United Kingdom \n", + "1 2010-12-01 08:26:00 3.39 17850.0 United Kingdom \n", + "2 2010-12-01 08:26:00 2.75 17850.0 United Kingdom \n", + "3 2010-12-01 08:26:00 3.39 17850.0 United Kingdom \n", + "4 2010-12-01 08:26:00 3.39 17850.0 United Kingdom \n", + "\n", + " InvoiceYearMonth Revenue \n", + "0 201012 15.30 \n", + "1 201012 20.34 \n", + "2 201012 22.00 \n", + "3 201012 20.34 \n", + "4 201012 20.34 " + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountryInvoiceYearMonthRevenue
053636585123AWHITE HANGING HEART T-LIGHT HOLDER62010-12-01 08:26:002.5517850.0United Kingdom20101215.30
153636571053WHITE METAL LANTERN62010-12-01 08:26:003.3917850.0United Kingdom20101220.34
253636584406BCREAM CUPID HEARTS COAT HANGER82010-12-01 08:26:002.7517850.0United Kingdom20101222.00
353636584029GKNITTED UNION FLAG HOT WATER BOTTLE62010-12-01 08:26:003.3917850.0United Kingdom20101220.34
453636584029ERED WOOLLY HOTTIE WHITE HEART.62010-12-01 08:26:003.3917850.0United Kingdom20101220.34
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "tx_uk" + } + }, + "metadata": {}, + "execution_count": 97 + } + ], + "source": [ + "tx_uk.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "rq3lDDlTO1MM" + }, + "outputs": [], + "source": [ + "tx_merge = pd.merge(tx_user, tx_user_6m, on='CustomerID', how='left') # left join: keeps every customer in tx_user; customers with no purchases in the 6-month window get NaN for m6_Revenue" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "zLvMF5H0O1MM" + }, + "outputs": [], + "source": [ + "tx_merge = tx_merge.fillna(0)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true, + "id": "lLMnrL9OO1MM", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 557 + }, + "outputId": "34051a3e-fd21-4a10-fdaf-a46d81245ca0" + }, + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/plain": [ + "
" + ], + "image/png": "\n" + }, + "metadata": {} + } + ], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "# Filter the data\n", + "tx_graph = tx_merge.query(\"m6_Revenue < 50000\")\n", + "\n", + "# Create the figure and axis objects\n", + "plt.figure(figsize=(10, 6))\n", + "\n", + "# Plot the scatter plots for each segment\n", + "plt.scatter(\n", + " tx_graph.query(\"Segment == 'Low-Value'\")['OverallScore'],\n", + " tx_graph.query(\"Segment == 'Low-Value'\")['m6_Revenue'],\n", + " color='blue',\n", + " s=7**2, # Marker size (matplotlib sizes markers by area, hence s=7^2)\n", + " alpha=0.8,\n", + " label='Low'\n", + ")\n", + "\n", + "plt.scatter(\n", + " tx_graph.query(\"Segment == 'Mid-Value'\")['OverallScore'],\n", + " tx_graph.query(\"Segment == 'Mid-Value'\")['m6_Revenue'],\n", + " color='green',\n", + " s=9**2, # Marker size\n", + " alpha=0.5,\n", + " label='Mid'\n", + ")\n", + "\n", + "plt.scatter(\n", + " tx_graph.query(\"Segment == 'High-Value'\")['OverallScore'],\n", + " tx_graph.query(\"Segment == 'High-Value'\")['m6_Revenue'],\n", + " color='red',\n", + " s=11**2, # Marker size\n", + " alpha=0.9,\n", + " label='High'\n", + ")\n", + "\n", + "# Set the title and labels\n", + "plt.title('LTV')\n", + "plt.xlabel('RFM Score')\n", + "plt.ylabel('6m LTV')\n", + "\n", + "# Add a legend\n", + "plt.legend()\n", + "\n", + "# Show the plot\n", + "plt.show()\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RCMmivKsO1MM" + }, + "source": [ + "We can visualise correlation between overall RFM score and revenue. Positive correlation is quite visible here. High RFM score means high LTV.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UPJ__rwiO1MM" + }, + "source": [ + "Before building the machine learning model, we need to identify what is the type of this machine learning problem. LTV itself is a regression problem. A machine learning model can predict the $ value of the LTV. But here, we want LTV segments. 
Segments are more actionable and easier to communicate to other people. By applying K-means clustering, we can identify our existing LTV groups and build segments on top of them.\n", + "\n", + "From a business perspective, we need to treat customers differently based on their predicted LTV. For this example, we will apply clustering and use 3 segments (the right number of segments depends on your business dynamics and goals):\n", + "* Low LTV\n", + "* Mid LTV\n", + "* High LTV\n", + "\n", + "We are going to apply K-means clustering to decide the segments and observe their characteristics.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "keUkjb1LO1MM" + }, + "outputs": [], + "source": [ + "#remove outliers\n", + "tx_merge = tx_merge[tx_merge['m6_Revenue']\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CustomerIDRecencyRecencyClusterFrequencyFrequencyClusterRevenueRevenueClusterOverallScoreSegmentm6_Revenue
017850.0301031215288.6312Low-Value0.00
114688.07335915107.3815High-Value1702.06
414849.021339217904.2815High-Value5498.07
613468.01330615656.7515High-Value1813.09
717690.029325814748.4515High-Value2616.15
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + " \n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "tx_merge", + "summary": "{\n \"name\": \"tx_merge\",\n \"rows\": 3910,\n \"fields\": [\n {\n \"column\": \"CustomerID\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1575.3246255165336,\n \"min\": 12346.0,\n \"max\": 18287.0,\n \"num_unique_values\": 3910,\n \"samples\": [\n 16209.0,\n 16172.0,\n 17500.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Recency\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 100,\n \"min\": 0,\n \"max\": 373,\n \"num_unique_values\": 348,\n \"samples\": [\n 189,\n 307,\n 161\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"RecencyCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n \"max\": 3,\n \"num_unique_values\": 4,\n \"samples\": [\n 3,\n 1,\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Frequency\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 131,\n \"min\": 1,\n \"max\": 2782,\n \"num_unique_values\": 437,\n \"samples\": [\n 163,\n 299,\n 340\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"FrequencyCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 2,\n \"num_unique_values\": 3,\n \"samples\": [\n 1,\n 0,\n 2\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Revenue\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1896.155427288516,\n \"min\": -4287.63,\n \"max\": 21535.9,\n \"num_unique_values\": 3838,\n \"samples\": [\n 2584.4,\n 2126.93,\n 741.26\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"RevenueCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 1,\n \"num_unique_values\": 2,\n \"samples\": [\n 0,\n 1\n ],\n 
\"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"OverallScore\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n \"max\": 6,\n \"num_unique_values\": 7,\n \"samples\": [\n 2,\n 5\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Segment\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Low-Value\",\n \"High-Value\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"m6_Revenue\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1057.0800280963301,\n \"min\": -4287.63,\n \"max\": 8432.68,\n \"num_unique_values\": 3077,\n \"samples\": [\n 4016.29,\n 173.76\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 102 + } + ], + "source": [ + "tx_merge.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "XWUBYq8rO1MN", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 382 + }, + "outputId": "0416683c-4c66-4eac-da71-2ecfb68f4d1f" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning:\n", + "\n", + "The default value of `n_init` will change from 10 to 'auto' in 1.4. 
Set the value of `n_init` explicitly to suppress the warning\n", + "\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " CustomerID Recency RecencyCluster Frequency FrequencyCluster Revenue \\\n", + "0 17850.0 301 0 312 1 5288.63 \n", + "1 14688.0 7 3 359 1 5107.38 \n", + "4 14849.0 21 3 392 1 7904.28 \n", + "6 13468.0 1 3 306 1 5656.75 \n", + "7 17690.0 29 3 258 1 4748.45 \n", + "\n", + " RevenueCluster OverallScore Segment m6_Revenue LTVCluster \n", + "0 1 2 Low-Value 0.00 0 \n", + "1 1 5 High-Value 1702.06 2 \n", + "4 1 5 High-Value 5498.07 1 \n", + "6 1 5 High-Value 1813.09 2 \n", + "7 1 5 High-Value 2616.15 2 " + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CustomerIDRecencyRecencyClusterFrequencyFrequencyClusterRevenueRevenueClusterOverallScoreSegmentm6_RevenueLTVCluster
017850.0301031215288.6312Low-Value0.000
114688.07335915107.3815High-Value1702.062
414849.021339217904.2815High-Value5498.071
613468.01330615656.7515High-Value1813.092
717690.029325814748.4515High-Value2616.152
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "tx_merge", + "summary": "{\n \"name\": \"tx_merge\",\n \"rows\": 3910,\n \"fields\": [\n {\n \"column\": \"CustomerID\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1575.3246255165336,\n \"min\": 12346.0,\n \"max\": 18287.0,\n \"num_unique_values\": 3910,\n \"samples\": [\n 16209.0,\n 16172.0,\n 17500.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Recency\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 100,\n \"min\": 0,\n \"max\": 373,\n \"num_unique_values\": 348,\n \"samples\": [\n 189,\n 307,\n 161\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"RecencyCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n \"max\": 3,\n \"num_unique_values\": 4,\n \"samples\": [\n 3,\n 1,\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Frequency\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 131,\n \"min\": 1,\n \"max\": 2782,\n \"num_unique_values\": 437,\n \"samples\": [\n 163,\n 299,\n 340\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"FrequencyCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 2,\n \"num_unique_values\": 3,\n \"samples\": [\n 1,\n 0,\n 2\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Revenue\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1896.155427288516,\n \"min\": -4287.63,\n \"max\": 21535.9,\n \"num_unique_values\": 3838,\n \"samples\": [\n 2584.4,\n 2126.93,\n 741.26\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"RevenueCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 1,\n \"num_unique_values\": 2,\n \"samples\": [\n 0,\n 1\n ],\n \"semantic_type\": 
\"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"OverallScore\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n \"max\": 6,\n \"num_unique_values\": 7,\n \"samples\": [\n 2,\n 5\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Segment\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Low-Value\",\n \"High-Value\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"m6_Revenue\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1057.0800280963301,\n \"min\": -4287.63,\n \"max\": 8432.68,\n \"num_unique_values\": 3077,\n \"samples\": [\n 4016.29,\n 173.76\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"LTVCluster\",\n \"properties\": {\n \"dtype\": \"int32\",\n \"num_unique_values\": 3,\n \"samples\": [\n 0,\n 2\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 103 + } + ], + "source": [ + "#creating 3 clusters\n", + "kmeans = KMeans(n_clusters=3)\n", + "tx_merge['LTVCluster'] = kmeans.fit_predict(tx_merge[['m6_Revenue']])\n", + "\n", + "tx_merge.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "XPs3Nq6MO1MN", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 174 + }, + "outputId": "45b881df-f289-43dd-87ab-26b6f6a4a259" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " count mean std min 25% 50% \\\n", + "LTVCluster \n", + "0 2960.0 277.296717 282.043305 -4287.63 0.0000 229.110 \n", + "1 794.0 1609.586914 549.435635 941.95 1148.8350 1482.900 \n", + "2 156.0 4645.661795 1345.674897 3129.27 3537.7325 4256.115 \n", + "\n", + " 75% max \n", + "LTVCluster \n", + "0 452.9325 940.83 \n", + "1 1945.7975 3113.70 \n", + "2 5497.9800 8432.68 " + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
countmeanstdmin25%50%75%max
LTVCluster
02960.0277.296717282.043305-4287.630.0000229.110452.9325940.83
1794.01609.586914549.435635941.951148.83501482.9001945.79753113.70
2156.04645.6617951345.6748973129.273537.73254256.1155497.98008432.68
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"tx_cluster\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"LTVCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n \"max\": 2,\n \"num_unique_values\": 3,\n \"samples\": [\n 0,\n 1,\n 2\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"count\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1469.7514529107746,\n \"min\": 156.0,\n \"max\": 2960.0,\n \"num_unique_values\": 3,\n \"samples\": [\n 2960.0,\n 794.0,\n 156.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"mean\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2238.874765777621,\n \"min\": 277.2967172297297,\n \"max\": 4645.661794871795,\n \"num_unique_values\": 3,\n \"samples\": [\n 277.2967172297297,\n 1609.5869143576826,\n 4645.661794871795\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"std\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 553.2943440760093,\n \"min\": 282.0433045759854,\n \"max\": 1345.6748974689422,\n \"num_unique_values\": 3,\n \"samples\": [\n 282.0433045759854,\n 549.4356347614342,\n 1345.6748974689422\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"min\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 3811.0208905401364,\n \"min\": -4287.63,\n \"max\": 3129.27,\n \"num_unique_values\": 3,\n \"samples\": [\n -4287.63,\n 941.95,\n 3129.27\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"25%\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1804.7254807074078,\n \"min\": 0.0,\n \"max\": 3537.7325,\n \"num_unique_values\": 3,\n \"samples\": [\n 0.0,\n 1148.835,\n 3537.7325\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"50%\",\n \"properties\": {\n \"dtype\": 
\"number\",\n \"std\": 2060.7231768786896,\n \"min\": 229.11,\n \"max\": 4256.115,\n \"num_unique_values\": 3,\n \"samples\": [\n 229.11,\n 1482.9,\n 4256.115\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"75%\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2591.6259599843074,\n \"min\": 452.9325,\n \"max\": 5497.98,\n \"num_unique_values\": 3,\n \"samples\": [\n 452.9325,\n 1945.7975,\n 5497.98\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"max\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 3854.450429909994,\n \"min\": 940.83,\n \"max\": 8432.68,\n \"num_unique_values\": 3,\n \"samples\": [\n 940.83,\n 3113.7,\n 8432.68\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 104 + } + ], + "source": [ + "#order cluster number based on LTV\n", + "tx_merge = order_cluster('LTVCluster', 'm6_Revenue',tx_merge,True)\n", + "\n", + "#creating a new cluster dataframe\n", + "tx_cluster = tx_merge.copy()\n", + "\n", + "#see details of the clusters\n", + "tx_cluster.groupby('LTVCluster')['m6_Revenue'].describe()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pPG-wmbUO1MN" + }, + "source": [ + "We have finished LTV clustering; the characteristics of each cluster are shown above.\n", + "\n", + "Cluster 2 is the best, with an average LTV of about 4.6k, whereas cluster 0 is the worst, with an average of about 277." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "l8g5GKAvO1MN" + }, + "source": [ + "There are a few more steps before training the machine learning model:\n", + "* Feature engineering.\n", + "* Convert categorical columns to numerical columns.\n", + "* Check the correlation of features against our label, the LTV clusters.\n", + "* Split our feature set and label (LTV) into X and y. We use X to predict y.\n", + "* Create training and test datasets. 
Training set will be used for building the machine learning model. We will apply our model to Test set to see its real performance.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "wRCSfnIQO1MN", + "scrolled": true, + "colab": { + "base_uri": "https://localhost:8080/", + "height": 313 + }, + "outputId": "27a54dc2-28e2-4e47-83c5-92a91930b528" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " CustomerID Recency RecencyCluster Frequency FrequencyCluster Revenue \\\n", + "0 17850.0 301 0 312 1 5288.63 \n", + "1 13093.0 266 0 170 0 7741.47 \n", + "2 15032.0 255 0 55 0 4464.10 \n", + "3 16000.0 2 3 9 0 12393.70 \n", + "4 15749.0 234 1 15 0 21535.90 \n", + "\n", + " RevenueCluster OverallScore Segment m6_Revenue LTVCluster \n", + "0 1 2 Low-Value 0.0 0 \n", + "1 1 1 Low-Value 0.0 0 \n", + "2 1 1 Low-Value 0.0 0 \n", + "3 1 4 Mid-Value 0.0 0 \n", + "4 1 2 Low-Value 0.0 0 " + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
CustomerIDRecencyRecencyClusterFrequencyFrequencyClusterRevenueRevenueClusterOverallScoreSegmentm6_RevenueLTVCluster
017850.0301031215288.6312Low-Value0.00
113093.0266017007741.4711Low-Value0.00
215032.025505504464.1011Low-Value0.00
316000.0239012393.7014Mid-Value0.00
415749.0234115021535.9012Low-Value0.00
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "tx_cluster", + "summary": "{\n \"name\": \"tx_cluster\",\n \"rows\": 3910,\n \"fields\": [\n {\n \"column\": \"CustomerID\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1575.3246255165336,\n \"min\": 12346.0,\n \"max\": 18287.0,\n \"num_unique_values\": 3910,\n \"samples\": [\n 15925.0,\n 16967.0,\n 14652.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Recency\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 100,\n \"min\": 0,\n \"max\": 373,\n \"num_unique_values\": 348,\n \"samples\": [\n 174,\n 319,\n 203\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"RecencyCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n \"max\": 3,\n \"num_unique_values\": 4,\n \"samples\": [\n 3,\n 2,\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Frequency\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 131,\n \"min\": 1,\n \"max\": 2782,\n \"num_unique_values\": 437,\n \"samples\": [\n 380,\n 27,\n 247\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"FrequencyCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 2,\n \"num_unique_values\": 3,\n \"samples\": [\n 1,\n 0,\n 2\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Revenue\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1896.155427288516,\n \"min\": -4287.63,\n \"max\": 21535.9,\n \"num_unique_values\": 3838,\n \"samples\": [\n 176.6,\n 338.8,\n 1430.94\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"RevenueCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 1,\n \"num_unique_values\": 2,\n \"samples\": [\n 0,\n 1\n ],\n \"semantic_type\": 
\"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"OverallScore\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n \"max\": 6,\n \"num_unique_values\": 7,\n \"samples\": [\n 2,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Segment\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"Low-Value\",\n \"Mid-Value\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"m6_Revenue\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1057.0800280963301,\n \"min\": -4287.63,\n \"max\": 8432.68,\n \"num_unique_values\": 3077,\n \"samples\": [\n 283.84,\n 124.14999999999999\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"LTVCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 2,\n \"num_unique_values\": 3,\n \"samples\": [\n 0,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 105 + } + ], + "source": [ + "tx_cluster.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cwIcGIBYO1MN" + }, + "source": [ + "

7.1 Feature Engineering
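As a rough illustration of the encoding step in this section, the sketch below shows what one-hot encoding does to a categorical column such as Segment, using only the standard library. The helper name one_hot and the three sample rows are hypothetical; the notebook itself relies on pd.get_dummies, which produces the same kind of Segment_* indicator columns.

```python
# Stdlib-only sketch of one-hot encoding.
# The rows below are illustrative and are not taken from the notebook's data.
def one_hot(rows, column):
    '''Replace `column` in each row dict with one boolean column per category.'''
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every field except the categorical one being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f'{column}_{category}'] = (row[column] == category)
        encoded.append(new_row)
    return encoded


customers = [
    {'CustomerID': 1, 'Segment': 'Low-Value'},
    {'CustomerID': 2, 'Segment': 'Mid-Value'},
    {'CustomerID': 3, 'Segment': 'High-Value'},
]
encoded_customers = one_hot(customers, 'Segment')
for row in encoded_customers:
    print(row)
```

Each segment becomes its own boolean column, so no artificial ordering between segments is implied.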

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "k_dd1HfjO1MN", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 243 + }, + "outputId": "0b9102cf-2ad1-49c4-d60d-e80d639b9cd8" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " CustomerID Recency RecencyCluster Frequency FrequencyCluster Revenue \\\n", + "0 17850.0 301 0 312 1 5288.63 \n", + "1 13093.0 266 0 170 0 7741.47 \n", + "2 15032.0 255 0 55 0 4464.10 \n", + "3 16000.0 2 3 9 0 12393.70 \n", + "4 15749.0 234 1 15 0 21535.90 \n", + "\n", + " RevenueCluster OverallScore m6_Revenue LTVCluster Segment_High-Value \\\n", + "0 1 2 0.0 0 False \n", + "1 1 1 0.0 0 False \n", + "2 1 1 0.0 0 False \n", + "3 1 4 0.0 0 False \n", + "4 1 2 0.0 0 False \n", + "\n", + " Segment_Low-Value Segment_Mid-Value \n", + "0 True False \n", + "1 True False \n", + "2 True False \n", + "3 False True \n", + "4 True False " + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
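| | CustomerID | Recency | RecencyCluster | Frequency | FrequencyCluster | Revenue | RevenueCluster | OverallScore | m6_Revenue | LTVCluster | Segment_High-Value | Segment_Low-Value | Segment_Mid-Value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 17850.0 | 301 | 0 | 312 | 1 | 5288.63 | 1 | 2 | 0.0 | 0 | False | True | False |
| 1 | 13093.0 | 266 | 0 | 170 | 0 | 7741.47 | 1 | 1 | 0.0 | 0 | False | True | False |
| 2 | 15032.0 | 255 | 0 | 55 | 0 | 4464.10 | 1 | 1 | 0.0 | 0 | False | True | False |
| 3 | 16000.0 | 2 | 3 | 9 | 0 | 12393.70 | 1 | 4 | 0.0 | 0 | False | False | True |
| 4 | 15749.0 | 234 | 1 | 15 | 0 | 21535.90 | 1 | 2 | 0.0 | 0 | False | True | False |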
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "tx_class", + "summary": "{\n \"name\": \"tx_class\",\n \"rows\": 3910,\n \"fields\": [\n {\n \"column\": \"CustomerID\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1575.3246255165336,\n \"min\": 12346.0,\n \"max\": 18287.0,\n \"num_unique_values\": 3910,\n \"samples\": [\n 15925.0,\n 16967.0,\n 14652.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Recency\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 100,\n \"min\": 0,\n \"max\": 373,\n \"num_unique_values\": 348,\n \"samples\": [\n 174,\n 319,\n 203\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"RecencyCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n \"max\": 3,\n \"num_unique_values\": 4,\n \"samples\": [\n 3,\n 2,\n 0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Frequency\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 131,\n \"min\": 1,\n \"max\": 2782,\n \"num_unique_values\": 437,\n \"samples\": [\n 380,\n 27,\n 247\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"FrequencyCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 2,\n \"num_unique_values\": 3,\n \"samples\": [\n 1,\n 0,\n 2\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Revenue\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1896.155427288516,\n \"min\": -4287.63,\n \"max\": 21535.9,\n \"num_unique_values\": 3838,\n \"samples\": [\n 176.6,\n 338.8,\n 1430.94\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"RevenueCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 1,\n \"num_unique_values\": 2,\n \"samples\": [\n 0,\n 1\n ],\n \"semantic_type\": 
\"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"OverallScore\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 0,\n \"max\": 6,\n \"num_unique_values\": 7,\n \"samples\": [\n 2,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"m6_Revenue\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1057.0800280963301,\n \"min\": -4287.63,\n \"max\": 8432.68,\n \"num_unique_values\": 3077,\n \"samples\": [\n 283.84,\n 124.14999999999999\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"LTVCluster\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 2,\n \"num_unique_values\": 3,\n \"samples\": [\n 0,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Segment_High-Value\",\n \"properties\": {\n \"dtype\": \"boolean\",\n \"num_unique_values\": 2,\n \"samples\": [\n true,\n false\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Segment_Low-Value\",\n \"properties\": {\n \"dtype\": \"boolean\",\n \"num_unique_values\": 2,\n \"samples\": [\n false,\n true\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Segment_Mid-Value\",\n \"properties\": {\n \"dtype\": \"boolean\",\n \"num_unique_values\": 2,\n \"samples\": [\n true,\n false\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 106 + } + ], + "source": [ + "#convert categorical columns to numerical\n", + "tx_class = pd.get_dummies(tx_cluster) #There is only one categorical variable segment\n", + "tx_class.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "YuVfI9ASO1MN", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 491 + }, + "outputId": "6ab184d3-330c-4b48-f1a8-8f9479a8bb74" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": 
{ + "text/plain": [ + "LTVCluster 1.000000\n", + "m6_Revenue 0.877872\n", + "Revenue 0.775550\n", + "RevenueCluster 0.605556\n", + "Frequency 0.569399\n", + "OverallScore 0.542147\n", + "FrequencyCluster 0.515290\n", + "Segment_High-Value 0.496939\n", + "RecencyCluster 0.358319\n", + "Segment_Mid-Value 0.189268\n", + "CustomerID -0.030020\n", + "Recency -0.350378\n", + "Segment_Low-Value -0.378387\n", + "Name: LTVCluster, dtype: float64" + ], + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
| | LTVCluster |
| --- | --- |
| LTVCluster | 1.000000 |
| m6_Revenue | 0.877872 |
| Revenue | 0.775550 |
| RevenueCluster | 0.605556 |
| Frequency | 0.569399 |
| OverallScore | 0.542147 |
| FrequencyCluster | 0.515290 |
| Segment_High-Value | 0.496939 |
| RecencyCluster | 0.358319 |
| Segment_Mid-Value | 0.189268 |
| CustomerID | -0.030020 |
| Recency | -0.350378 |
| Segment_Low-Value | -0.378387 |
\n", + "

" + ] + }, + "metadata": {}, + "execution_count": 107 + } + ], + "source": [ + "#calculate and show correlations with the label\n", + "corr_matrix = tx_class.corr()\n", + "corr_matrix['LTVCluster'].sort_values(ascending=False)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "6mQIIrm3O1MN" + }, + "outputs": [], + "source": [ + "#create X and y: X is the feature set and y is the label (LTVCluster)\n", + "X = tx_class.drop(['LTVCluster', 'm6_Revenue'], axis=1)\n", + "y = tx_class['LTVCluster']\n", + "\n", + "#split into training and test sets\n", + "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=56)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wGeL5dMlO1MN" + }, + "source": [ + "The correlations with LTVCluster show that Revenue, Frequency and the RFM scores will be useful features for our machine learning models.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_I_atw61O1MO" + }, + "source": [ + "
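The correlation ranking above comes from DataFrame.corr, which uses the Pearson coefficient by default. A minimal sketch of that formula follows, with made-up toy values rather than the notebook's data; the function name pearson and all numbers are assumptions for illustration only.

```python
import math


def pearson(xs, ys):
    '''Pearson correlation: covariance divided by the product of the spreads.'''
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values (hypothetical): revenue rises with the LTV cluster, recency falls.
revenue = [100.0, 250.0, 400.0, 900.0, 1500.0]
recency = [300, 200, 150, 30, 5]
ltv_cluster = [0, 0, 1, 1, 2]

print(round(pearson(revenue, ltv_cluster), 3))
print(round(pearson(recency, ltv_cluster), 3))
```

Values near +1 or -1 indicate a strong linear relationship with the label; values near 0, like the one reported for CustomerID, indicate little linear signal.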

8. Machine Learning Model for Customer Lifetime Value Prediction

" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "z4ttDEDmO1MO" + }, + "source": [ + "Since there are three LTV clusters (high, mid and low LTV), we will perform multiclass classification." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "8IwktUt1O1MO", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "1507fc23-d1ca-49b9-c775-b5524055fd3b" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Accuracy of XGB classifier on training set: 0.95\n", + "Accuracy of XGB classifier on test set: 0.92\n" + ] + } + ], + "source": [ + "#XGBoost multiclass classification model\n", + "ltv_xgb_model = xgb.XGBClassifier(max_depth=5, learning_rate=0.1, n_jobs=-1).fit(X_train, y_train)\n", + "\n", + "print('Accuracy of XGB classifier on training set: {:.2f}'\n", + " .format(ltv_xgb_model.score(X_train, y_train)))\n", + "print('Accuracy of XGB classifier on test set: {:.2f}'\n", + " .format(ltv_xgb_model.score(X_test[X_train.columns], y_test)))\n", + "\n", + "#predicted clusters for the test set, used for the per-class report\n", + "y_pred = ltv_xgb_model.predict(X_test)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QyNtxmQpO1MO" + }, + "source": [ + "Accuracy looks good on both the training set (0.95) and the test set (0.92).
Let's also check precision, recall and F1-score." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "gPB-DmRUO1MO", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "c2b91906-d279-46e9-bd8c-24f8ea09195a" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + " precision recall f1-score support\n", + "\n", + " 0 0.95 0.94 0.95 145\n", + " 1 0.80 0.84 0.82 43\n", + " 2 1.00 0.88 0.93 8\n", + "\n", + " accuracy 0.92 196\n", + " macro avg 0.92 0.89 0.90 196\n", + "weighted avg 0.92 0.92 0.92 196\n", + "\n" + ] + } + ], + "source": [ + "print(classification_report(y_test, y_pred))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eowRMitRO1MO" + }, + "source": [ + "

9. Final Clusters for Customer Lifetime Value
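The per-class scores in the classification report can be recomputed from a confusion matrix. The matrix below is an assumption: it was back-calculated from the rounded report values and the supports (145, 43, 8), not printed by the notebook, and the helper per_class_scores is hypothetical.

```python
# Confusion matrix back-calculated from the rounded report (an assumption).
# Rows are true clusters, columns are predicted clusters.
confusion = [
    [137, 8, 0],
    [7, 36, 0],
    [0, 1, 7],
]


def per_class_scores(matrix):
    '''Return precision, recall, F1 and support for every class.'''
    scores = {}
    for k in range(len(matrix)):
        true_pos = matrix[k][k]
        predicted_k = sum(row[k] for row in matrix)  # column total
        actual_k = sum(matrix[k])                    # row total (support)
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[k] = {'precision': precision, 'recall': recall,
                     'f1': f1, 'support': actual_k}
    return scores


scores = per_class_scores(confusion)
for cluster, s in scores.items():
    print(cluster, round(s['precision'], 2), round(s['recall'], 2),
          round(s['f1'], 2), s['support'])
```

Precision divides the diagonal entry by its column total (everything predicted as that cluster), while recall divides it by its row total (everything that truly belongs to that cluster).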

\n", + "\n", + "- **Cluster 0**: Strong precision (0.95), recall (0.94) and F1-score (0.95) on the largest class (145 test samples)\n", + "- **Cluster 1**: Weakest class; precision (0.80), recall (0.84) and F1-score (0.82) need improvement\n", + "- **Cluster 2**: Perfect precision (1.00) but lower recall (0.88); with only 8 test samples these scores are noisy estimates" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NmU-1l4WO1MO" + }, + "source": [ + "If the model tells us a customer belongs to cluster 0, it will be correct about 95 times out of 100 (precision). The model also identifies 94% of the actual cluster 0 customers (recall).\n", + "\n", + "The other clusters need more work. For example, the model detects only 84% of Mid LTV customers (cluster 1), and only 80% of its cluster 1 predictions are correct.\n", + "\n", + "**Possible actions to improve performance**\n", + "\n", + "- Add more features and improve feature engineering\n", + "- Try models other than XGBoost\n", + "- Apply hyperparameter tuning to the current model\n", + "- Add more data to the model if possible\n" + ] + } + ], + "metadata": { + "colab": { + "provenance": [], + "include_colab_link": true + }, + "kaggle": { + "accelerator": "none", + "dataSources": [ + { + "datasetId": 302641, + "sourceId": 618141, + "sourceType": "datasetVersion" + } + ], + "dockerImageVersionId": 29867, + "isGpuEnabled": false, + "isInternetEnabled": true, + "language": "python", + "sourceType": "notebook" + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.6" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file