
Commit 59bc4a0

Merge branch 'feature/boston_housing' of https://github.com/davidjurado/mlcube_examples into feature/boston_housing

2 parents: 933bfea + cecbc54


11 files changed: +492 −0 lines changed


boston_housing/.gitignore

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
project/raw_dataset.txt
project/processed_dataset.csv
mlcube/workspace/data
mlcube/run
mlcube/tasks

boston_housing/README.md

Lines changed: 230 additions & 0 deletions
@@ -0,0 +1,230 @@
# Packing an existing project into MLCube

In this tutorial we're going to use the [Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html). We'll take an existing implementation, create the files needed to pack it into MLCube, and execute all the tasks.

## Original project code

At first we have only four files: one for package dependencies and three scripts, one per task: download data, preprocess data and train.

```bash
├── project
│   ├── 01_download_dataset.py
│   ├── 02_preprocess_dataset.py
│   ├── 03_train.py
│   └── requirements.txt
```

The most important thing we need to remember about these scripts is their input parameters (example invocations are shown after the list):

* 01_download_dataset.py

**--data_dir** : Dataset download path; inside this folder a new file called raw_dataset.txt will be created.

* 02_preprocess_dataset.py

**--data_dir** : Folder path containing the raw dataset file; when finished, a new file called processed_dataset.csv will be created.

* 03_train.py

**--dataset_file_path** : Processed dataset file path. Note: this is the full path to the csv file.

**--n_estimators** : Number of boosting stages to perform. In this case we're using a gradient boosting regressor.
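For reference, the original scripts can also be run standalone. The following is only an illustrative example; the paths are placeholders and assume the dataset lives in a local `data/` folder:

```bash
# Create the example data folder
mkdir -p ./data

# Download the raw dataset (creates ./data/raw_dataset.txt)
python 01_download_dataset.py --data_dir ./data

# Convert the raw file into ./data/processed_dataset.csv
python 02_preprocess_dataset.py --data_dir ./data

# Train the gradient boosting regressor on the processed csv
python 03_train.py --dataset_file_path ./data/processed_dataset.csv --n_estimators 100
```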
## MLCube structure

We'll need a couple of files for MLCube. First we'll create a folder called **mlcube** at the same level as the project folder, with the following structure (for this tutorial the files are already in place):

```bash
├── mlcube
│   ├── mlcube.yaml
│   └── workspace
│       └── parameters.yaml
└── project
    ├── 01_download_dataset.py
    ├── 02_preprocess_dataset.py
    ├── 03_train.py
    └── requirements.txt
```

In the following steps we'll describe each file.
## Define tasks execution scripts

In general we'll have a script for each task, and there are different ways to describe their execution from a main handler file. In this tutorial we'll use a function from the Python subprocess module:

* subprocess.Popen()

When a Python script has no input parameters (or maybe just one) we can describe its execution from Python code as follows:

```Python
import subprocess
# Store the full command in a variable
command = "python my_task.py --single_parameter input"
# Split the command, this will give us the list:
# ['python', 'my_task.py', '--single_parameter', 'input']
command_args = command.split()
# Execute the command as a new process
process = subprocess.Popen(command_args, cwd=".")
# Wait for the process to finish
process.wait()
```
### MLCube File: mlcube/workspace/parameters.yaml

When a script has multiple input parameters it becomes hard to store the full command in a single variable. In this case we can create a shell script describing all the arguments (and even add some extra functionality), which is useful because we can define the input parameters as environment variables.

We can use the **mlcube/workspace/parameters.yaml** file to describe all the input parameters we'll use (this file is already provided, please take a look and study its content). The idea is to describe all the parameters in this file and then use this single file as an input for the task. We can then read the content of the parameters file in Python and set all the parameters as environment variables. Finally, with the environment variables set, we can execute a shell script with our implementation.

The way we execute all these steps in Python is described below.

```Python
import os
import subprocess
import yaml

# Path to the parameters file received as a task input
# (e.g. mlcube/workspace/parameters.yaml)
parameters_file = "parameters.yaml"

# Read the file and store the parameters in a variable
with open(parameters_file, 'r') as stream:
    parameters = yaml.safe_load(stream)
# Get the system's environment
env = os.environ.copy()
# We can add a single new environment variable as follows
env.update({
    'NEW_ENV_VARIABLE': "my_new_env_variable",
})
# Add all the parameters we got from the parameters file
env.update(parameters)
# Execute the shell script with the updated environment
process = subprocess.Popen("./run_and_time.sh", cwd=".", env=env)
# Wait for the process to finish
process.wait()
```
### Shell script

In this tutorial we already have a shell script containing the steps to run the train task: **project/run_and_time.sh**. Please take a look and study its content.
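To give an idea of the pattern (not the exact content of the provided file), a minimal sketch of such a script could look like the one below. Here we assume the processed dataset path is also exposed as an environment variable named DATASET_FILE_PATH by the Python entrypoint; the real script in the repository may differ:

```bash
#!/bin/bash
# Minimal sketch of a "run and time" wrapper for the train task.
set -e

start=$(date +%s)

# N_ESTIMATORS comes from mlcube/workspace/parameters.yaml,
# DATASET_FILE_PATH is assumed to be set by the caller.
python 03_train.py \
  --dataset_file_path "${DATASET_FILE_PATH}" \
  --n_estimators "${N_ESTIMATORS}"

end=$(date +%s)
echo "Training took $((end - start)) seconds"
```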
### MLCube Command

We are targeting pull-type installation, so MLCube images should be available on Docker Hub. If not, try this:

```bash
mlcube run ... -Pdocker.build_strategy=auto
```

Parameters defined in mlcube.yaml can be overridden using the syntax param=value, for example:

```bash
mlcube run --task=download_data data_dir=absolute_path_to_custom_dir
```

Also, users can override the workspace directory by using:

```bash
mlcube run --task=download_data --workspace=absolute_path_to_custom_dir
```

Note: Sometimes overriding the workspace path can fail for some tasks because the input parameter parameters_file must be specified explicitly. To solve this, use:

```bash
mlcube run --task=train --workspace=absolute_path_to_custom_dir parameters_file=$(pwd)/workspace/parameters.yaml
```
### MLCube Python entrypoint file

At this point we know how to execute the task scripts from Python code, so now we can create a file that defines how to run each task.

This file is located at **project/mlcube.py**; it is the main file that serves as the entrypoint to run all tasks.

This file is already provided, please take a look and study its content.
## Dockerize the project

We'll create a Dockerfile with the steps needed to run the project; at the end we define the execution of the **mlcube.py** file as the entrypoint. This file is located at **project/Dockerfile**.

This file is already provided, please take a look and study its content.

When creating the docker image we'll need to run the docker build command inside the project folder. The command that we'll use is:

`docker build . -t mlcommons/boston_housing:0.0.1 -f Dockerfile`

Keep in mind the tag that we just described; it must match the image defined in the MLCube configuration.
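After the build finishes you can check that the image exists locally under the expected tag, for example:

```bash
docker images mlcommons/boston_housing
```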
At this point our solution folder structure should look like this:

```bash
├── mlcube
│   ├── mlcube.yaml
│   └── workspace
│       └── parameters.yaml
└── project
    ├── 01_download_dataset.py
    ├── 02_preprocess_dataset.py
    ├── 03_train.py
    ├── Dockerfile
    ├── mlcube.py
    ├── requirements.txt
    └── run_and_time.sh
```

### Define MLCube files

Inside the mlcube folder we'll need to define the following files.
### mlcube/platforms/docker.yaml

This file contains the description of the platform that we'll use to run MLCube, in this case Docker. In the container definition we'll have the following subfields:

* command: Main command to run, in this case docker
* run_args: In this field we'll define all the arguments used to run the docker container, e.g. --rm, --gpus, etc.
* image: Image to use, in this case we'll need to use the same image tag from the docker build command.

This file is already provided, please take a look and study its content.
### MLCube task definition file

The file located at **mlcube/mlcube.yaml** contains the definition of all the tasks and their parameters.

This file is already provided, please take a look and study its content.

With this file we have finished packing the project into MLCube! Now we can set up the project and run all the tasks.
### Project setup

```bash
# Create Python environment
virtualenv -p python3 ./env && source ./env/bin/activate

# Install MLCube and MLCube docker runner from GitHub repository
# (normally, users will just run `pip install mlcube mlcube_docker`)
git clone https://github.com/mlcommons/mlcube && cd mlcube/mlcube
python setup.py bdist_wheel && pip install --force-reinstall ./dist/mlcube-* && cd ..
cd ./runners/mlcube_docker && python setup.py bdist_wheel && pip install --force-reinstall --no-deps ./dist/mlcube_docker-* && cd ../../..

# Fetch the boston housing example from GitHub
git clone https://github.com/mlcommons/mlcube_examples && cd ./mlcube_examples
git fetch origin pull/27/head:feature/boston_housing && git checkout feature/boston_housing
cd ./boston_housing/mlcube
```
### Dataset

The [Boston Housing Dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) will be downloaded and processed. Sizes of the dataset in each step:

| Dataset Step                   | MLCube Task       | Format   | Size   |
|--------------------------------|-------------------|----------|--------|
| Download (raw dataset)         | download_data     | txt file | ~52 KB |
| Preprocess (processed dataset) | preprocess_data   | csv file | ~40 KB |
| Total                          | (After all tasks) | All      | ~92 KB |
### Tasks execution

```bash
# Download Boston housing dataset. Default path = /workspace/data
# To override it, use data_dir=DATA_DIR
mlcube run --task download_data

# Preprocess Boston housing dataset, this will convert raw .txt data to .csv format
# It will use the DATA_DIR path defined in the previous step
mlcube run --task preprocess_data

# Run training.
# Parameters to override: dataset_file_path=DATASET_FILE_PATH parameters_file=PATH_TO_TRAINING_PARAMS
mlcube run --task train
```
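As an illustrative example (paths are placeholders), the override syntax described above can be combined with these tasks to keep the whole pipeline in a custom folder:

```bash
# Hypothetical end-to-end run using a custom data directory
mlcube run --task download_data data_dir=$(pwd)/workspace/my_data
mlcube run --task preprocess_data data_dir=$(pwd)/workspace/my_data
mlcube run --task train dataset_file_path=$(pwd)/workspace/my_data/processed_dataset.csv
```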

boston_housing/mlcube/mlcube.yaml

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
name: MLCommons Boston Housing
description: MLCommons Boston Housing example
authors:
  - {name: "MLCommons Best Practices Working Group"}

platform:
  accelerator_count: 0

docker:
  # Image name.
  image: mlcommons/boston_housing:0.0.1
  # Docker build context relative to $MLCUBE_ROOT. Default is `build`.
  build_context: "../project"
  # Docker file name within docker build context, default is `Dockerfile`.
  build_file: "Dockerfile"

tasks:
  download_data:
    # Download boston housing dataset
    parameters:
      # Directory where dataset will be saved.
      outputs: {data_dir: data/}
  preprocess_data:
    # Preprocess dataset
    parameters:
      # Same directory location where dataset was downloaded
      inputs: {data_dir: data/}
  train:
    # Train gradient boosting regressor model
    parameters:
      # Processed dataset file
      inputs: {dataset_file_path: data/processed_dataset.csv, parameters_file: parameters.yaml}
boston_housing/mlcube/workspace/parameters.yaml

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
N_ESTIMATORS: "500"
boston_housing/project/01_download_dataset.py

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
"""Download the raw Boston Housing Dataset"""
import os
import argparse
import requests

DATASET_URL = "http://lib.stat.cmu.edu/datasets/boston"


def download_dataset(data_dir):
    """Download dataset and store it in a given path.
    Args:
        data_dir (str): Dataset download path."""

    request = requests.get(DATASET_URL)
    file_name = "raw_dataset.txt"
    file_path = os.path.join(data_dir, file_name)
    with open(file_path, 'wb') as f:
        f.write(request.content)
    print(f"\nRaw dataset saved at: {file_path}")


def main():

    parser = argparse.ArgumentParser(description='Download dataset')
    parser.add_argument('--data_dir', required=True,
                        help='Dataset download path')
    args = parser.parse_args()

    data_dir = args.data_dir
    download_dataset(data_dir)


if __name__ == '__main__':
    main()
boston_housing/project/02_preprocess_dataset.py

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
"""Preprocess the dataset and save in CSV format"""
import os
import argparse
import pandas as pd

def process_data(data_dir):
    """Process raw dataset and save it in CSV format.
    Args:
        data_dir (str): Folder path containing dataset."""

    col_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "PRICE"]
    raw_file = os.path.join(data_dir, "raw_dataset.txt")
    print(f"\nProcessing raw file: {raw_file}")

    # Each record in the raw file spans two physical lines; stitch the
    # even rows (first 11 values) and odd rows (last 3 values) back together.
    df = pd.read_csv(raw_file, skiprows=22, header=None, delim_whitespace=True)
    df_even = df[df.index % 2 == 0].reset_index(drop=True)
    df_odd = df[df.index % 2 == 1].iloc[:, :3].reset_index(drop=True)
    df_odd.columns = [11, 12, 13]
    dataset = df_even.join(df_odd)
    dataset.columns = col_names

    output_file = os.path.join(data_dir, "processed_dataset.csv")
    dataset.to_csv(output_file, index=False)
    print(f"Processed dataset saved at: {output_file}")


def main():

    parser = argparse.ArgumentParser(description='Preprocess dataset')
    parser.add_argument('--data_dir', required=True,
                        help='Folder containing dataset file')
    args = parser.parse_args()

    data_dir = args.data_dir
    process_data(data_dir)


if __name__ == '__main__':
    main()

boston_housing/project/03_train.py

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
"""Train gradient boosting regressor on Boston housing dataset"""
import os
import argparse
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor


def train(dataset_file_path, n_estimators):
    df = pd.read_csv(dataset_file_path)

    data = df.drop(['PRICE'], axis=1)
    target = df[['PRICE']]
    X_train, X_test, Y_train, Y_test = train_test_split(data, target, test_size=0.25)

    clf = GradientBoostingRegressor(n_estimators=n_estimators, verbose=1)
    clf.fit(X_train, Y_train.values.ravel())

    train_predicted = clf.predict(X_train)
    train_expected = Y_train
    # squared=False makes mean_squared_error return the RMSE
    train_rmse = mean_squared_error(train_predicted, train_expected, squared=False)

    test_predicted = clf.predict(X_test)
    test_expected = Y_test
    test_rmse = mean_squared_error(test_predicted, test_expected, squared=False)

    print(f"\nTRAIN RMSE:\t{train_rmse}")
    print(f"TEST RMSE:\t{test_rmse}")

def main():

    parser = argparse.ArgumentParser(description='Train model')
    parser.add_argument('--dataset_file_path', required=True,
                        help='Processed dataset file path')
    parser.add_argument('--n_estimators', type=int, default=100,
                        help='number of boosting stages to perform')
    args = parser.parse_args()

    dataset_file_path = args.dataset_file_path
    n_estimators = args.n_estimators
    train(dataset_file_path, n_estimators)


if __name__ == '__main__':
    main()
