EV Toolbox Regression

Overview

The EV Toolbox Regression project runs regression analysis on datasets stored in Box. The workflow downloads the datasets, combines them, runs the regression, and uploads the results back to Box, so the whole process from data retrieval to result storage is handled by one script.

The tool is written in Python. It uses pandas for data manipulation, scikit-learn for the regression model, and the Box SDK for Box integration.


Features

  • Box Integration: Securely connect to Box to download and upload files.
  • Dataset Combination: Automatically combine datasets using a specified key column.
  • Regression Analysis: Perform linear regression and output:
    • Regression coefficients in a single-row CSV format.
    • R² score, Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) in the terminal.
  • File Overwriting: Automatically replaces existing files in Box when uploading new results.
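As a rough illustration of the Regression Analysis feature, the sketch below fits a linear model with scikit-learn and builds the single-row coefficient table along with R², MSE, and RMSE. It is not the project's regression_utils.py: the function name `fit_and_summarize`, the toy data, and the choice to score the model on the same rows it was fit on are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


def fit_and_summarize(df, feature_columns, target_column):
    """Fit ordinary least squares; return (single-row coefficient frame, stats)."""
    X = df[feature_columns]
    y = df[target_column]
    model = LinearRegression().fit(X, y)
    predictions = model.predict(X)  # assumption: evaluated on the training rows

    mse = mean_squared_error(y, predictions)
    stats = {
        "r2": r2_score(y, predictions),
        "mse": mse,
        "rmse": float(np.sqrt(mse)),  # RMSE is the square root of MSE
    }

    # One row: a column per feature coefficient, then the intercept.
    row = dict(zip(feature_columns, model.coef_))
    row["Intercept"] = model.intercept_
    return pd.DataFrame([row]), stats


# Toy data (illustrative only): y = 2*a - 3*b + 5, so the fit is exact.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [0.0, 1.0, 0.0, 2.0]})
toy["y"] = 2 * toy["a"] - 3 * toy["b"] + 5

coefficients, stats = fit_and_summarize(toy, ["a", "b"], "y")
print(coefficients.to_csv(index=False))
print(f"R² Score: {stats['r2']:.4f}")
```

Calling `coefficients.to_csv("reg_coefficients.csv", index=False)` would write the same table to disk instead of printing it.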

Requirements

  • Python: Version 3.8 or later.
  • Libraries:
    • pandas
    • scikit-learn
    • boxsdk
    • python-dotenv (imported as dotenv)
    • requests
  • Box Account: Required for file storage and authentication.

Setup Instructions

1. Clone the Repository

git clone https://github.com/your-repo/ev-toolbox-regression.git
cd ev-toolbox-regression

2. Install Dependencies

Create a virtual environment and install the required Python packages:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

3. Set Up .env File

Create a .env file in the project root with the following content:

# Box Credentials
BOX_CLIENT_ID=your_client_id
BOX_CLIENT_SECRET=your_client_secret
BOX_ACCESS_TOKEN=your_access_token
BOX_REFRESH_TOKEN=your_refresh_token
BOX_REDIRECT_URI=http://localhost

# File IDs
BOX_FILE_IDS=your_file_id_1,your_file_id_2

# Output Folder ID
BOX_OUTPUT_FOLDER_ID=your_output_folder_id

# Regression Parameters
TARGET_COLUMN=Y_house_price_of_unit_area
FEATURE_COLUMNS=X1_transaction_date,X2_house_age,X3_distance_to_MRT,X4_number_of_convenience_stores,X5_latitude,X6_longitude
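BOX_FILE_IDS and FEATURE_COLUMNS are comma-separated lists, so they need to be split before use. The sketch below shows one way to read and validate these settings once they are in the environment (for example after python-dotenv's `load_dotenv()` has run). The helper names `split_csv` and `load_settings` and the returned dictionary layout are hypothetical, not taken from the project.

```python
import os


def split_csv(value):
    """Split a comma-separated setting into a list, dropping blank entries."""
    return [part.strip() for part in value.split(",") if part.strip()]


def load_settings(env=os.environ):
    """Read the workflow settings; raise KeyError if a required key is missing."""
    required = [
        "BOX_FILE_IDS",
        "BOX_OUTPUT_FOLDER_ID",
        "TARGET_COLUMN",
        "FEATURE_COLUMNS",
    ]
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise KeyError("Missing settings: " + ", ".join(missing))
    return {
        "file_ids": split_csv(env["BOX_FILE_IDS"]),
        "output_folder_id": env["BOX_OUTPUT_FOLDER_ID"],
        "target_column": env["TARGET_COLUMN"],
        "feature_columns": split_csv(env["FEATURE_COLUMNS"]),
    }


# Example with a plain dict standing in for the loaded environment.
example_env = {
    "BOX_FILE_IDS": "your_file_id_1, your_file_id_2",
    "BOX_OUTPUT_FOLDER_ID": "your_output_folder_id",
    "TARGET_COLUMN": "Y_house_price_of_unit_area",
    "FEATURE_COLUMNS": "X1_transaction_date,X2_house_age",
}
settings = load_settings(example_env)
print(settings["file_ids"])
```

Failing early on a missing key keeps configuration mistakes from surfacing later as harder-to-read Box or pandas errors.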

Usage

1. Run the Workflow

Run the main.py script to execute the end-to-end process:

python main.py

The workflow performs the following steps:

1. Authenticate with Box.
2. Download the datasets specified in BOX_FILE_IDS.
3. Combine the datasets using the No column as the key.
4. Perform regression analysis:
    • Outputs regression coefficients as a CSV (reg_coefficients.csv).
    • Displays model statistics (R², MSE, RMSE) in the terminal.
5. Upload the results back to the Box folder specified in BOX_OUTPUT_FOLDER_ID.
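The combination step (joining the downloaded datasets on the No column) could look roughly like the sketch below. It is not the project's data_utils.py: the function names, the inner join, and the toy values are assumptions. The column check raises the same kind of ValueError described under Troubleshooting.

```python
from functools import reduce

import pandas as pd


def combine_on_key(frames, key="No"):
    """Join a list of DataFrames on a shared key column (inner join assumed)."""
    for index, frame in enumerate(frames):
        if key not in frame.columns:
            raise ValueError(f"Dataset {index} has no '{key}' column")
    return reduce(
        lambda left, right: left.merge(right, on=key, how="inner"), frames
    )


def require_columns(df, columns):
    """Raise ValueError if any requested column is absent from df."""
    missing = [name for name in columns if name not in df.columns]
    if missing:
        raise ValueError(f"Columns missing from the combined dataset: {missing}")


# Toy inputs (illustrative values only).
features = pd.DataFrame({"No": [1, 2, 3], "X2_house_age": [10.0, 5.5, 20.0]})
targets = pd.DataFrame(
    {"No": [2, 3, 4], "Y_house_price_of_unit_area": [40.1, 35.2, 50.0]}
)

combined = combine_on_key([features, targets])
require_columns(combined, ["X2_house_age", "Y_house_price_of_unit_area"])
print(combined)
```

With an inner join, rows whose No value appears in only one dataset are dropped; a different `how` value would keep them with missing values instead.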

2. Reauthorize the Application

If tokens expire, the script will prompt you to reauthorize the application:

1. Follow the provided authorization URL.
2. Log in to your Box account and click "Grant Access".
3. Paste the authorization code into the terminal. The code is the value after "code=" at the end of the URL of the page you are redirected to.
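If copying the code by hand is error-prone, the value can be pulled out of the redirected URL with the standard library. This is only a convenience sketch; the function name `extract_auth_code` and the example URL are hypothetical, and the script itself expects the bare code to be pasted.

```python
from urllib.parse import parse_qs, urlparse


def extract_auth_code(redirected_url):
    """Return the value of the 'code' query parameter from a redirect URL."""
    query = parse_qs(urlparse(redirected_url).query)
    codes = query.get("code")
    if not codes:
        raise ValueError("No 'code' parameter found in the redirected URL")
    return codes[0]


# Example redirect to the BOX_REDIRECT_URI from the .env file (made-up code).
example_url = "http://localhost/?state=abc&code=EXAMPLE_CODE_123"
print(extract_auth_code(example_url))
```

Authorization codes are short-lived, so the code should be pasted soon after the redirect.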

File Descriptions

  • main.py: Orchestrates the entire workflow, including authentication, file handling, dataset combination, regression, and result upload.
  • data_utils.py: Contains functions for combining datasets and handling file downloads/uploads with Box.
  • regression_utils.py: Handles regression analysis, calculates statistics, and saves coefficients to a CSV file.
  • token_manager.py: Manages Box API authentication, token refresh, and reauthorization.
  • .env: Configuration file for Box credentials, file IDs, output folder ID, and regression parameters.

Example Output

Terminal Output

Authenticated User: John Doe (ID: 12345678901)
Datasets combined successfully.
Regression Statistics:
R² Score: 0.5824
Mean Squared Error: 77.1317
Root Mean Squared Error: 8.7825
Regression coefficients saved to data/reg_coefficients.csv.
File uploaded successfully: reg_coefficients.csv

reg_coefficients.csv

X1_transaction_date X2_house_age X3_distance_to_MRT X4_number_of_convenience_stores X5_latitude X6_longitude Intercept
5.146227462979936 -0.269695448 -0.004487461 1.133276905 225.472976 -12.423601 -14437.101
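The coefficient file can be used to compute a prediction by hand: multiply each feature value by its coefficient, sum the products, and add the intercept. The sketch below assumes the file is a comma-separated CSV with one header row, one value row, and an Intercept column, as in the example above; the function name and the toy coefficients are hypothetical.

```python
import csv
import io


def predict_from_csv(csv_text, feature_values):
    """Apply a single-row coefficient CSV to one observation."""
    row = next(csv.DictReader(io.StringIO(csv_text)))
    coefficients = {name: float(value) for name, value in row.items()}
    intercept = coefficients.pop("Intercept")
    # Linear model: intercept + sum(coefficient * feature value)
    return intercept + sum(
        coefficients[name] * feature_values[name] for name in coefficients
    )


# Toy coefficients (not the values from the example output).
toy_csv = "a,b,Intercept\n2.0,-3.0,5.0\n"
prediction = predict_from_csv(toy_csv, {"a": 1.0, "b": 2.0})
print(prediction)  # 5 + 2*1 - 3*2 = 1.0
```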

Troubleshooting

  • ValueError: One or more specified columns are missing in the combined dataset.

    • Ensure all specified columns in the .env file exist in the combined dataset.
    • Verify the input datasets contain the required columns and are properly combined.
  • BoxAPIException: item_name_in_use

    • This error occurs when an upload targets a Box folder that already contains a file with the same name.
    • The script is configured to overwrite files in Box. If the issue persists, verify that the upload_file_to_box function is correctly replacing files.
  • Authentication Error: Refresh token has expired

    • Reauthorize the application by following the URL provided in the terminal and entering the new authorization code.
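For the item_name_in_use case, one common way to overwrite a file is to look for an existing file with the same name and upload a new version of it, falling back to a normal upload otherwise. The sketch below is not the project's upload_file_to_box function; the helper name and the lookup-by-name strategy are assumptions. It relies on the boxsdk methods `Folder.get_items`, `Folder.upload`, and `File.update_contents`, and takes an already authenticated client as an argument.

```python
def upload_or_overwrite(client, folder_id, local_path, file_name):
    """Upload local_path to a Box folder, replacing a same-named file if present.

    `client` is expected to be an authenticated boxsdk Client (or any object
    exposing the same folder()/file() interface).
    """
    folder = client.folder(folder_id)
    for item in folder.get_items():
        if item.type == "file" and item.name == file_name:
            # Same name already exists: upload a new version of that file.
            return client.file(item.id).update_contents(local_path)
    # No name clash: create the file in the folder.
    return folder.upload(local_path, file_name=file_name)
```

Listing the folder costs an extra API call per upload; for large folders, catching the item_name_in_use error and then updating the conflicting file may be a cheaper alternative.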

License

This project is licensed under the MIT License.