Initial example

jer2ig · jer2ig · commit de75b4bafc55 · 2025-11-12T16:06:16.000-08:00
diff --git a/doc/examples/index.rst b/doc/examples/index.rst
@@ -33,6 +33,7 @@ General Examples
     py_double_ml_plm_irm_hetfx.ipynb
     py_double_ml_meets_flaml.ipynb
     py_double_ml_rdflex.ipynb
+    py_double_ml_lplr.ipynb
 
 
 Effect Heterogeneity
diff --git a/doc/examples/py_double_ml_lplr.ipynb b/doc/examples/py_double_ml_lplr.ipynb
@@ -0,0 +1,205 @@
+{
+ "cells": [
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {
+    "collapsed": false
+   },
+   "source": [
+    "# Python: Log-Odds Effects for Logistic PLR Models\n",
+    "\n",
+    "In this simple example, we illustrate how the [DoubleML](https://docs.doubleml.org/stable/index.html) package can be used to estimate the change in log-odds due to a treatment in a logistic partial linear regression model, [DoubleMLLPLR](https://docs.doubleml.org/stable/guide/models.html#logistic-partial-linear-regression-lplr)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2025-11-12T23:42:30.920222Z",
+     "start_time": "2025-11-12T23:42:30.915753Z"
+    }
+   },
+   "source": [
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "import doubleml as dml\n",
+    "\n",
+    "from doubleml.plm.datasets import make_lplr_LZZ2020"
+   ],
+   "outputs": [],
+   "execution_count": 3
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Data\n",
+    "\n",
+    "We use a data generating process with a known treatment effect to create synthetic data, so that the estimates can be compared to the true effect. The data generating process is adapted and extended from [Liu et al. (2020)](https://academic.oup.com/ectj/article-abstract/24/3/559/6296639).\n",
+    "\n",
+    "The documentation of the data generating process can be found [here](https://docs.doubleml.org/stable/api/datasets.html).\n",
+    "\n",
+    "The data generating process supports both binary and continuous treatments. In this example we consider a continuous treatment. Both the treatment assignment (if binary) and the balancing of the outcome variable can be adjusted."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2025-11-13T00:05:27.845205Z",
+     "start_time": "2025-11-13T00:05:27.835022Z"
+    }
+   },
+   "source": [
+    "np.random.seed(42)\n",
+    "data = make_lplr_LZZ2020(n_obs=1000, dim_x=20, alpha=0.5, treatment=\"continuous\")\n",
+    "print(data)"
+   ],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "================== DoubleMLData Object ==================\n",
+      "\n",
+      "------------------ Data summary      ------------------\n",
+      "Outcome variable: y\n",
+      "Treatment variable(s): ['d']\n",
+      "Covariates: ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20']\n",
+      "Instrument variable(s): None\n",
+      "No. Observations: 1000\n",
+      "\n",
+      "------------------ DataFrame info    ------------------\n",
+      "<class 'pandas.core.frame.DataFrame'>\n",
+      "RangeIndex: 1000 entries, 0 to 999\n",
+      "Columns: 23 entries, X1 to p\n",
+      "dtypes: float64(23)\n",
+      "memory usage: 179.8 KB\n",
+      "\n"
+     ]
+    }
+   ],
+   "execution_count": 32
+  },
+  {
+   "metadata": {},
+   "cell_type": "markdown",
+   "source": [
+    "## Model\n",
+    "\n",
+    "The logistic partial linear regression (LPLR) model is specified as follows:\n",
+    "\n",
+    "$$\\mathbb{E} [Y | D, X] = \\mathbb{P} (Y=1 | D, X) = \\text{expit} \\{\\beta_0 D + r_0 (X) \\}$$\n",
+    "\n",
+    "where $Y$ is the binary outcome variable and $D$ is the policy variable of interest.\n",
+    "The high-dimensional vector $X = (X_1, \\ldots, X_p)$ consists of other confounding covariates.\n",
+    "$\\text{expit}$ is the logistic function (the inverse of the logit link)\n",
+    "\n",
+    "$$\\text{expit} ( x ) = \\frac{1}{1 + e^{-x}}$$\n",
+    "\n",
+    "Equivalently, the log-odds of the outcome, $\\text{logit} \\, \\mathbb{P} (Y=1 | D, X) = \\beta_0 D + r_0 (X)$, follow a partially linear model. The coefficient $\\beta_0$ can be interpreted as the change in log-odds due to a one-unit increase in the treatment variable $D$, holding all other covariates constant."
+   ]
+  },
+  {
+   "metadata": {},
+   "cell_type": "markdown",
+   "source": [
+    "Next, we define the learners for the nuisance functions and fit the [LPLR model](https://docs.doubleml.org/stable/guide/models.html#logistic-partial-linear-regression-lplr).\n",
+    "The correct type of learner (regressor or classifier) must be used for each nuisance function.\n",
+    "\n",
+    "- `ml_M` is a model of the outcome. Here, since the outcome is binary, we use a classifier.\n",
+    "- `ml_t` is a model of the log-odds. This must always be a regressor.\n",
+    "- `ml_m` is a model of the treatment. Here, since the treatment is continuous, we use a regressor. In the case of a binary treatment, a classifier must be used."
+   ]
+  },
+  {
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2025-11-13T00:05:47.340376Z",
+     "start_time": "2025-11-13T00:05:31.657594Z"
+    }
+   },
+   "cell_type": "code",
+   "source": [
+    "# Learners for the nuisance functions\n",
+    "from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor\n",
+    "random_forest_reg = RandomForestRegressor()\n",
+    "random_forest_class = RandomForestClassifier()\n",
+    "\n",
+    "np.random.seed(4242)\n",
+    "# ml_M: classifier (binary outcome); ml_t and ml_m: regressors\n",
+    "dml_lplr = dml.DoubleMLLPLR(data,\n",
+    "                            ml_M=random_forest_class,\n",
+    "                            ml_t=random_forest_reg,\n",
+    "                            ml_m=random_forest_reg,\n",
+    "                            n_folds=5)\n",
+    "print(\"Training LPLR Model\")\n",
+    "dml_lplr.fit()\n",
+    "\n",
+    "print(dml_lplr.summary)"
+   ],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Training LPLR Model\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/Users/julius/Projects/DoubleMLLogit/.venv/lib/python3.13/site-packages/sklearn/utils/deprecation.py:132: FutureWarning: 'force_all_finite' was renamed to 'ensure_all_finite' in 1.6 and will be removed in 1.8.\n",
+      "  warnings.warn(\n",
+      "/Users/julius/Projects/DoubleMLLogit/.venv/lib/python3.13/site-packages/sklearn/utils/deprecation.py:132: FutureWarning: 'force_all_finite' was renamed to 'ensure_all_finite' in 1.6 and will be removed in 1.8.\n",
+      "  warnings.warn(\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "      coef   std err         t     P>|t|     2.5 %    97.5 %\n",
+      "d  0.35212  0.100429  3.506179  0.000455  0.155284  0.548957\n"
+     ]
+    }
+   ],
+   "execution_count": 33
+  },
+  {
+   "metadata": {},
+   "cell_type": "code",
+   "outputs": [],
+   "execution_count": null,
+   "source": ""
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3.10.6 64-bit",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.3"
+  },
+  "vscode": {
+   "interpreter": {
+    "hash": "ac5e9af40c2048901fb5e070f7bbe2ca12417b0669992742e66f016e0e17b88e"
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}