
Update Tutorial (#100)
FANGAreNotGnu authored Nov 11, 2024
1 parent b0e469a commit 9c1576c
Showing 1 changed file with 186 additions and 28 deletions.
214 changes: 186 additions & 28 deletions docs/tutorials/autogluon-assistant-quick-start.ipynb
@@ -1,9 +1,8 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "998885f294556807",
"id": "1e174bb9",
"metadata": {},
"source": [
"# AutoGluon Assistant - Quick Start\n",
@@ -18,7 +17,8 @@
"We will cover:\n",
"- Setting up AutoGluon Assistant\n",
"- Preparing Your Data\n",
"- Using AutoGluon Assistant\n",
"- Using AutoGluon Assistant (via Command Line Interface)\n",
"- Using AutoGluon Assistant (through Python Programming)\n",
"\n",
"By the end of this tutorial, you'll be able to build highly accurate ML solutions on your own data using just natural language instructions. Let's get started with the installation!"
]
@@ -39,7 +39,7 @@
"metadata": {},
"outputs": [],
"source": [
"!pip install git+https://github.com/autogluon/autogluon-assistant.git#egg=autogluon-assistant[dev]"
"!pip install autogluon-assistant"
]
},
{
@@ -129,22 +129,17 @@
"metadata": {},
"outputs": [],
"source": [
"import requests, os\n",
"\n",
"# Create directory and download example files\n",
"os.makedirs(\"./toy_data\", exist_ok=True)\n",
"for f in [\"train.csv\", \"test.csv\", \"descriptions.txt\"]:\n",
" open(f\"toy_data/{f}\", \"wb\").write(\n",
" requests.get(f\"https://raw.githubusercontent.com/autogluon/autogluon-assistant/main/toy_data/{f}\").content\n",
" )"
"%%bash\n",
"wget https://automl-mm-bench.s3.us-east-1.amazonaws.com/aga/data/aga_sample_data.zip\n",
"unzip aga_sample_data.zip"
]
},
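Once the archive is unzipped, a quick sanity check confirms the sample files landed where the later cells expect them. This is a minimal sketch; the expected file list is taken from the tutorial's own description of the data directory:

```python
from pathlib import Path

# Files the later cells expect (per the tutorial's file list; adjust if the
# archive layout differs in your version of the sample data)
expected = ["train.csv", "test.csv", "descriptions.txt"]
data_dir = Path("./toy_data")

missing = [name for name in expected if not (data_dir / name).is_file()]
if missing:
    print(f"Missing files: {missing} - re-run the download cell")
else:
    print("All sample files are in place")
```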
{
"cell_type": "markdown",
"id": "c810125a2b8aa286",
"metadata": {},
"source": [
"That's it! We now have:\n",
"That's it! We now have (under `./toy_data`):\n",
"\n",
"- `train.csv`: Training data with labeled examples\n",
"- `test.csv`: Test data for making predictions\n",
@@ -181,7 +176,7 @@
"id": "ec8a61ef4291bc39",
"metadata": {},
"source": [
"## Using AutoGluon Assistant\n",
"## Using AutoGluon Assistant (via Command Line Interface)\n",
"\n",
"Now that we have our data ready, let's use AutoGluon Assistant to build our ML model. The simplest way to use AutoGluon Assistant is through the command line - no coding required! After installing the package, you can run it directly from your terminal:"
]
@@ -193,36 +188,181 @@
"metadata": {},
"outputs": [],
"source": [
"#TODO: remove the requirement of config files\n",
"!autogluon-assistant ./toy_data"
"%%bash\n",
"autogluon-assistant ./toy_data \\\n",
" --presets medium_quality # (Optional) Choose prediction quality level:\n",
" # Options: medium_quality, high_quality, best_quality (default)"
]
},
{
"cell_type": "markdown",
"id": "b25c839d",
"id": "8adda2cc",
"metadata": {},
"source": [
"```\n",
"INFO:root:Starting AutoGluon-Assistant\n",
"INFO:root:Presets: medium_quality\n",
"INFO:root:Loading default config from: /media/deephome/autogluon-assistant/src/autogluon_assistant/configs/medium_quality.yaml\n",
"INFO:root:Successfully loaded config\n",
"🤖 Welcome to AutoGluon-Assistant \n",
"Will use task config:\n",
"{\n",
" 'infer_eval_metric': True,\n",
" 'detect_and_drop_id_column': False,\n",
" 'task_preprocessors_timeout': 3600,\n",
" 'save_artifacts': {'enabled': False, 'append_timestamp': True, 'path': './aga-artifacts'},\n",
" 'feature_transformers': None,\n",
" 'autogluon': {'predictor_init_kwargs': {}, 'predictor_fit_kwargs': {'presets': 'medium_quality', 'time_limit': 600}},\n",
" 'llm': {\n",
" 'provider': 'bedrock',\n",
" 'api_key_location': 'BEDROCK_API_KEY',\n",
" 'model': 'anthropic.claude-3-5-sonnet-20241022-v2:0',\n",
" 'max_tokens': 512,\n",
" 'proxy_url': None,\n",
" 'temperature': 0,\n",
" 'verbose': True\n",
" }\n",
"}\n",
"Task path: /media/deephome/testdir/toy_data\n",
"Task loaded!\n",
"TabularPredictionTask(name=toy_data, description=, 3 datasets)\n",
"INFO:botocore.credentials:Found credentials in environment variables.\n",
"INFO:autogluon_assistant.llm.llm:AGA is using model anthropic.claude-3-5-sonnet-20241022-v2:0 from Bedrock to assist you with the task.\n",
"INFO:autogluon_assistant.assistant:Task understanding starts...\n",
"INFO:autogluon_assistant.task_inference.task_inference:description: data_description_file: You are solving this data science tasks of binary classification: \\nThe dataset presented here (the spaceship dataset) comprises a lot of features, including both numerical and categorical features. Some of the features are missing, with nan value. We have splitted the dataset into three parts of train, valid and test. Your task is to predict the Transported item, which is a binary label with True and False. The evaluation metric is the classification accuracy.\\n\n",
"INFO:autogluon_assistant.task_inference.task_inference:train_data: /media/deephome/testdir/toy_data/train.csv\n",
"Loaded data from: /media/deephome/testdir/toy_data/train.csv | Columns = 16 / 16 | Rows = 1000 -> 1000\n",
"INFO:autogluon_assistant.task_inference.task_inference:test_data: /media/deephome/testdir/toy_data/test.csv\n",
"Loaded data from: /media/deephome/testdir/toy_data/test.csv | Columns = 16 / 16 | Rows = 1000 -> 1000\n",
"INFO:autogluon_assistant.task_inference.task_inference:WARNING: Failed to identify the sample_submission_data of the task, it is set to None.\n",
"INFO:autogluon_assistant.task_inference.task_inference:label_column: Transported\n",
"INFO:autogluon_assistant.task_inference.task_inference:problem_type: binary\n",
"INFO:autogluon_assistant.task_inference.task_inference:eval_metric: accuracy\n",
"INFO:autogluon_assistant.assistant:Total number of prompt tokens: 1582\n",
"INFO:autogluon_assistant.assistant:Total number of completion tokens: 155\n",
"INFO:autogluon_assistant.assistant:Task understanding complete!\n",
"INFO:autogluon_assistant.assistant:Automatic feature generation is disabled. \n",
"Model training starts...\n",
"INFO:autogluon_assistant.predictor:Fitting AutoGluon TabularPredictor\n",
"INFO:autogluon_assistant.predictor:predictor_init_kwargs: {'learner_kwargs': {'ignored_columns': []}, 'label': 'Transported', 'problem_type': 'binary', 'eval_metric': 'accuracy'}\n",
"INFO:autogluon_assistant.predictor:predictor_fit_kwargs: {'presets': 'medium_quality', 'time_limit': 600}\n",
"No path specified. Models will be saved in: \"AutogluonModels/ag-20241111_055131\"\n",
"Verbosity: 2 (Standard Logging)\n",
"=================== System Info ===================\n",
"AutoGluon Version: 1.1.1\n",
"Python Version: 3.10.14\n",
"Operating System: Linux\n",
"Platform Machine: x86_64\n",
"Platform Version: #54~20.04.1-Ubuntu SMP Fri Oct 6 22:04:33 UTC 2023\n",
"CPU Count: 96\n",
"Memory Avail: 1030.28 GB / 1121.80 GB (91.8%)\n",
"Disk Space Avail: 64.75 GB / 860.63 GB (7.5%)\n",
"===================================================\n",
"Presets specified: ['medium_quality']\n",
"Beginning AutoGluon training ... Time limit = 600s\n",
"AutoGluon will save models to \"AutogluonModels/ag-20241111_055131\"\n",
"Train Data Rows: 1000\n",
"Train Data Columns: 15\n",
"Label Column: Transported\n",
"Problem Type: binary\n",
"Preprocessing data ...\n",
"Selected class <--> label mapping: class 1 = True, class 0 = False\n",
"Using Feature Generators to preprocess the data ...\n",
"Fitting AutoMLPipelineFeatureGenerator...\n",
" Available Memory: 1055013.00 MB\n",
" Train Data (Original) Memory Usage: 0.48 MB (0.0% of available memory)\n",
" Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.\n",
" Stage 1 Generators:\n",
" Fitting AsTypeFeatureGenerator...\n",
" Stage 2 Generators:\n",
" Fitting FillNaFeatureGenerator...\n",
" Stage 3 Generators:\n",
" Fitting IdentityFeatureGenerator...\n",
" Fitting CategoryFeatureGenerator...\n",
" Fitting CategoryMemoryMinimizeFeatureGenerator...\n",
" Stage 4 Generators:\n",
" Fitting DropUniqueFeatureGenerator...\n",
" Stage 5 Generators:\n",
" Fitting DropDuplicatesFeatureGenerator...\n",
" Unused Original Features (Count: 1): ['PassengerId']\n",
" These features were not used to generate any of the output features. Add a feature generator compatible with these features to utilize them.\n",
" Features can also be unused if they carry very little information, such as being categorical but having almost entirely unique values or being duplicates of other features.\n",
" These features do not need to be present at inference time.\n",
" ('object', []) : 1 | ['PassengerId']\n",
" Types of features in original data (raw dtype, special dtypes):\n",
" ('float', []) : 7 | ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', ...]\n",
" ('object', []) : 7 | ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'Name', ...]\n",
" Types of features in processed data (raw dtype, special dtypes):\n",
" ('category', []) : 7 | ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'Name', ...]\n",
" ('float', []) : 7 | ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', ...]\n",
" 0.1s = Fit runtime\n",
" 14 features in original data used to generate 14 features in processed data.\n",
" Train Data (Processed) Memory Usage: 0.06 MB (0.0% of available memory)\n",
"Data preprocessing and feature engineering runtime = 0.1s ...\n",
"AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'\n",
" To change this, specify the eval_metric parameter of Predictor()\n",
"Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 800, Val Rows: 200\n",
"User-specified model hyperparameters to be fit:\n",
"{\n",
" 'NN_TORCH': {},\n",
" 'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],\n",
" 'CAT': {},\n",
" 'XGB': {},\n",
" 'FASTAI': {},\n",
" 'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],\n",
" 'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],\n",
" 'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],\n",
"}\n",
"Fitting 13 L1 models ...\n",
"Fitting model: KNeighborsUnif ... Training model for up to 599.9s of the 599.9s of remaining time.\n",
" 0.805 = Validation score (accuracy)\n",
" 0.04s = Training runtime\n",
" 0.04s = Validation runtime\n",
"Fitting model: KNeighborsDist ... Training model for up to 599.82s of the 599.82s of remaining time.\n",
" 0.79 = Validation score (accuracy)\n",
" 0.03s = Training runtime\n",
" 0.03s = Validation runtime\n",
"Fitting model: LightGBMXT ... Training model for up to 599.75s of the 599.75s of remaining time.\n",
" 0.83 = Validation score (accuracy)\n",
" 0.87s = Training runtime\n",
" 0.01s = Validation runtime\n",
"\n",
"......\n",
"\n",
"Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 581.72s of remaining time.\n",
" Ensemble Weights: {'LightGBMLarge': 0.4, 'NeuralNetTorch': 0.25, 'NeuralNetFastAI': 0.2, 'CatBoost': 0.15}\n",
" 0.855 = Validation score (accuracy)\n",
" 0.12s = Training runtime\n",
" 0.0s = Validation runtime\n",
"AutoGluon training complete, total runtime = 18.41s ... Best model: WeightedEnsemble_L2 | Estimated inference throughput: 4025.3 rows/s (200 batch size)\n",
"TabularPredictor saved. To load, use: predictor = TabularPredictor.load(\"AutogluonModels/ag-20241111_055131\")\n",
"Model training complete!\n",
"Prediction starts...\n",
"Prediction complete! Outputs written to aga-output-20241111_055149.csv\n",
"```"
]
},
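As the last log line shows, the CLI writes its predictions to a timestamped CSV in the working directory. A small hedged sketch for picking up the most recent run (the `aga-output-*.csv` pattern is an assumption read off the log above):

```python
import glob
import os

# Each CLI run writes a file like aga-output-20241111_055149.csv; sorting by
# modification time resolves repeated runs to the newest output.
candidates = sorted(glob.glob("aga-output-*.csv"), key=os.path.getmtime)
latest = candidates[-1] if candidates else None
print(f"Latest prediction file: {latest}")
```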
{
"cell_type": "markdown",
"id": "1ff4c018",
"metadata": {},
"source": [
"## Using AutoGluon Assistant (through Python Programming)\n",
"\n",
"Let's also look at how to use AutoGluon Assistant programmatically in Python:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "362ff589bb29d77d",
"metadata": {
"tags": [
"hide-output"
]
},
"metadata": {},
"outputs": [],
"source": [
"from autogluon_assistant import AutogluonAssistant\n",
"\n",
"# Initialize the assistant\n",
"assistant = AutogluonAssistant()\n",
"from autogluon_assistant import run_assistant\n",
"\n",
"# Run the assistant\n",
"output_file = assistant.predict(data_dir=\"./toy_data\")"
"output_file = run_assistant(task_path=\"./toy_data\", presets=\"medium_quality\")"
]
},
{
@@ -240,11 +380,29 @@
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"predictions = pd.read_csv(output_file)\n",
"print(\"\\nFirst few predictions:\")\n",
"print(predictions.head())"
]
},
{
"cell_type": "markdown",
"id": "52a7d48a",
"metadata": {},
"source": [
"```\n",
"First few predictions:\n",
" Transported\n",
"0 True\n",
"1 False\n",
"2 True\n",
"3 True\n",
"4 True\n",
"```"
]
},
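Beyond eyeballing the head, it can help to check the class balance of the predictions. The snippet below is a hedged illustration that inlines the five sample rows shown above; on a real run you would build `predictions` from the output CSV instead:

```python
import pandas as pd

# Literal frame mirroring the sample head above; replace with
# pd.read_csv(output_file) on a real run.
predictions = pd.DataFrame({"Transported": [True, False, True, True, True]})

# Count how many rows were predicted True vs. False
counts = predictions["Transported"].value_counts()
print(counts)
```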
{
"cell_type": "markdown",
"id": "79eb2f75ce0e5eed",
@@ -254,7 +412,7 @@
"source": [
"## Conclusion\n",
"\n",
"In this quickstart tutorial, we saw how AutoGluon Assistant simplifies the entire ML pipeline by allowing users to solve machine learning problems with minimal efforts. With just a data directory, AutoGluon Assistant handles the entire process from data understanding to prediction generation. Check out the other tutorials to learn more about customizing the configuration (WIP), using different LLM providers, and handling various types of ML tasks.\n",
"In this quickstart tutorial, we saw how AutoGluon Assistant simplifies the ML pipeline, letting users solve machine learning problems with minimal effort. With just a data directory, AutoGluon Assistant handles the full process from data understanding to prediction generation. Check out the other tutorials (WIP) to learn more about customizing the configuration, using different LLM providers, and handling various types of ML tasks.\n",
"\n",
"Want to dive deeper? Explore our GitHub repository for more advanced features and examples."
]
