The EV Toolbox Regression project runs regression analysis on datasets stored in Box. The workflow downloads the datasets, combines them, runs the regression, and uploads the results back to Box, so the whole process is handled end to end.
This tool is built in Python, using pandas and scikit-learn for data manipulation and machine learning, and the Box SDK for integration with Box.
- Box Integration: Securely connect to Box to download and upload files.
- Dataset Combination: Automatically combine datasets using a specified key column.
- Regression Analysis: Perform linear regression and output:
  - Regression coefficients in a single-row CSV format.
  - R² score, Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) in the terminal.
- File Overwriting: Automatically replaces existing files in Box when uploading new results.
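The file-overwriting behaviour listed above could be sketched roughly as follows. This is an illustration, not the project's actual upload_file_to_box function: the name upload_or_replace and its parameters are hypothetical, and client is assumed to be an authenticated boxsdk Client (or any object exposing the same folder, file, get_items, upload, and update_contents calls).

```python
def upload_or_replace(client, folder_id, local_path, file_name):
    """Upload local_path to a Box folder, replacing a same-named file if present.

    Hypothetical helper: the repository's upload_file_to_box may differ.
    """
    folder = client.folder(folder_id)

    # Look for an existing file with the same name in the target folder.
    for item in folder.get_items():
        if item.type == "file" and item.name == file_name:
            # Upload a new version instead of creating a duplicate,
            # which avoids the item_name_in_use conflict.
            return client.file(item.id).update_contents(local_path)

    # No name clash: create a new file in the folder.
    return folder.upload(local_path, file_name=file_name)
```

Checking for the name first keeps the upload to a single call in the common case; an alternative design would attempt the upload and fall back to update_contents only when Box reports a name conflict.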
- Python: Version 3.8 or later.
- Libraries:
  - pandas
  - scikit-learn
  - boxsdk
  - python-dotenv (imported as dotenv)
  - requests
- Box Account: Required for file storage and authentication.
Clone the repository and change into the project directory:
git clone https://github.com/your-repo/ev-toolbox-regression.git
cd ev-toolbox-regression
Create a virtual environment and install the required Python packages:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
Create a .env file in the project root with the following content:
# Box Credentials
BOX_CLIENT_ID=your_client_id
BOX_CLIENT_SECRET=your_client_secret
BOX_ACCESS_TOKEN=your_access_token
BOX_REFRESH_TOKEN=your_refresh_token
BOX_REDIRECT_URI=http://localhost
# File IDs
BOX_FILE_IDS=your_file_id_1,your_file_id_2
# Output Folder ID
BOX_OUTPUT_FOLDER_ID=your_output_folder_id
# Regression Parameters
TARGET_COLUMN=Y_house_price_of_unit_area
FEATURE_COLUMNS=X1_transaction_date,X2_house_age,X3_distance_to_MRT,X4_number_of_convenience_stores,X5_latitude,X6_longitude
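As a rough sketch of how these settings might be read, the snippet below parses the comma-separated values into Python lists. The helper names split_csv and load_settings are hypothetical and not taken from the project. It assumes the variables are already in the environment, for example after calling load_dotenv() from python-dotenv.

```python
import os


def split_csv(value):
    """Split a comma-separated string, dropping blanks and surrounding spaces."""
    return [part.strip() for part in value.split(",") if part.strip()]


def load_settings(env=os.environ):
    """Collect the regression-related settings from a mapping of variables.

    Hypothetical helper; the project's own loading code may differ.
    """
    return {
        "file_ids": split_csv(env.get("BOX_FILE_IDS", "")),
        "output_folder_id": env.get("BOX_OUTPUT_FOLDER_ID", ""),
        "target_column": env.get("TARGET_COLUMN", ""),
        "feature_columns": split_csv(env.get("FEATURE_COLUMNS", "")),
    }
```

Passing the mapping as an argument makes the parsing easy to check with a plain dictionary, without touching the real environment.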
Run the main.py script to execute the end-to-end process:
python main.py
The workflow performs the following steps:
1. Authenticate with Box.
2. Download the datasets specified in BOX_FILE_IDS.
3. Combine the datasets using the No column as the key.
4. Perform regression analysis:
   - Output the regression coefficients as a CSV (reg_coefficients.csv).
   - Display model statistics (R², MSE, RMSE) in the terminal.
5. Upload the results back to the Box folder specified in BOX_OUTPUT_FOLDER_ID.
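Steps 3 and 4 above can be sketched with pandas and scikit-learn as shown below. This is a minimal illustration under stated assumptions, not the project's implementation: the function names combine_on_key and fit_regression are hypothetical, the datasets are assumed to be joined with an inner merge, and the statistics are computed on the same rows used for fitting (the README does not say whether a train/test split is used).

```python
from functools import reduce

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


def combine_on_key(frames, key="No"):
    """Inner-join a list of DataFrames on a shared key column (assumed join type)."""
    return reduce(lambda left, right: left.merge(right, on=key, how="inner"), frames)


def fit_regression(df, feature_columns, target_column):
    """Fit a linear regression and return (single-row coefficients frame, stats)."""
    X = df[feature_columns]
    y = df[target_column]
    model = LinearRegression().fit(X, y)

    # Statistics on the fitting data itself (an assumption of this sketch).
    predictions = model.predict(X)
    mse = mean_squared_error(y, predictions)
    stats = {"r2": r2_score(y, predictions), "mse": mse, "rmse": mse ** 0.5}

    # One row: one column per feature coefficient, plus the intercept.
    row = dict(zip(feature_columns, model.coef_))
    row["Intercept"] = model.intercept_
    return pd.DataFrame([row]), stats
```

The returned frame could then be written with to_csv(path, index=False) to produce a single-row coefficients file.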
If the tokens expire, the script prompts you to reauthorize the application:
1. Follow the provided authorization URL.
2. Log in to your Box account and click "Grant Access".
3. Paste the authorization code you receive into the terminal. The code appears at the end of the redirected page's URL, after "code=".
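If copying the code by hand is error-prone, the value after "code=" can be pulled out of the redirected URL with the standard library. The helper name extract_auth_code is hypothetical and only illustrates the idea; the script itself expects the code to be pasted.

```python
from urllib.parse import parse_qs, urlparse


def extract_auth_code(redirect_url):
    """Return the value of the 'code' query parameter from a redirect URL."""
    query = parse_qs(urlparse(redirect_url).query)
    codes = query.get("code")
    if not codes:
        raise ValueError("No 'code' parameter found in the redirect URL.")
    return codes[0]
```

Parsing the query string rather than slicing the text keeps this working when the URL carries other parameters alongside the code.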
- main.py: Orchestrates the entire workflow, including authentication, file handling, dataset combination, regression, and result upload.
- data_utils.py: Contains functions for combining datasets and handling file downloads/uploads with Box.
- regression_utils.py: Handles regression analysis, calculates statistics, and saves coefficients to a CSV file.
- token_manager.py: Manages Box API authentication, token refresh, and reauthorization.
- .env: Configuration file for Box credentials, file IDs, output folder ID, and regression parameters.
Authenticated User: John Doe (ID: 12345678901)
Datasets combined successfully.
Regression Statistics:
R² Score: 0.5824
Mean Squared Error: 77.1317
Root Mean Squared Error: 8.7825
Regression coefficients saved to data/reg_coefficients.csv.
File uploaded successfully: reg_coefficients.csv
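For reference, the three statistics printed above follow the standard definitions, sketched here in plain Python. The function name regression_stats is hypothetical; the project itself may rely on scikit-learn's metrics instead.

```python
import math


def regression_stats(y_true, y_pred):
    """Compute R², MSE, and RMSE from actual and predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n

    # Residual and total sums of squares.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)

    mse = ss_res / n
    return {"r2": 1 - ss_res / ss_tot, "mse": mse, "rmse": math.sqrt(mse)}
```

RMSE is the square root of MSE, which is why the two figures in the sample output are consistent: the square root of 77.1317 rounds to 8.7825.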
X1_transaction_date | X2_house_age | X3_distance_to_MRT | X4_number_of_convenience_stores | X5_latitude | X6_longitude | Intercept |
---|---|---|---|---|---|---|
5.146227462979936 | -0.269695448 | -0.004487461 | 1.133276905 | 225.472976 | -12.423601 | -14437.101 |
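A coefficients file in this layout can be turned back into predictions. The sketch below assumes a comma-separated file with one header row, one value row, and an Intercept column as in the table above; load_coefficients and predict are hypothetical names, not part of the project.

```python
import csv
import io


def load_coefficients(csv_text):
    """Parse a single-row coefficients CSV into a {column: value} dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    row = next(reader)
    return {name: float(value) for name, value in row.items()}


def predict(coefficients, features):
    """Apply the linear model: intercept plus the sum of coefficient * feature."""
    total = coefficients["Intercept"]
    for name, value in features.items():
        total += coefficients[name] * value
    return total
```

For example, with illustrative values rather than the project's data, a file containing `a,b,Intercept` and `2.0,-1.0,0.5` gives `predict(coeffs, {"a": 3, "b": 4})` equal to 0.5 + 6 - 4 = 2.5.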
- ValueError: One or more specified columns are missing in the combined dataset.
  - Ensure all specified columns in the .env file exist in the combined dataset.
  - Verify the input datasets contain the required columns and are properly combined.
- BoxAPIException: item_name_in_use
  - This error occurs when attempting to upload a file that already exists in the Box folder.
  - The script is configured to overwrite files in Box. If the issue persists, verify that the upload_file_to_box function is correctly replacing files.
- Authentication Error: Refresh token has expired
  - Reauthorize the application by following the URL provided in the terminal and entering the new authorization code.
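When diagnosing the missing-columns error above, it helps to see which names are absent. The check below is a hypothetical sketch (the function name and the message are not the project's) that lists them explicitly. It takes the available column names as any iterable, for example the columns of the combined DataFrame.

```python
def check_required_columns(available, feature_columns, target_column):
    """Raise ValueError naming every required column that is not available."""
    available = set(available)
    required = list(feature_columns) + [target_column]
    missing = [name for name in required if name not in available]
    if missing:
        raise ValueError(
            "Missing columns in the combined dataset: " + ", ".join(missing)
        )
```

Comparing the reported names against FEATURE_COLUMNS and TARGET_COLUMN in the .env file usually reveals a typo or a dataset that was not merged as expected.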
This project is licensed under the MIT License.