diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..00967fc
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,2 @@
+infer_container.sif
+train_container.sif
\ No newline at end of file
diff --git a/Dockerfile.infer b/Dockerfile.infer
index 9c9b020..9836ea3 100644
--- a/Dockerfile.infer
+++ b/Dockerfile.infer
@@ -1,7 +1,29 @@
+# Dockerfile for the inference server
+# This Dockerfile is used to create a Docker image for the inference server
+# It is based on the Python 3.9 slim image
+FROM python:3.9-slim
+
+# Set working directory
+# This is the directory where the inference server will be executed
+# and where the model will be loaded
 WORKDIR /app
+
+# Copy the list of Python packages required for the serving script
 COPY requirements.txt .
-COPY server.py .
-FROM python:3.9-slim
-CMD ["python", "server.py"]
+
+# Install the Python dependencies inside the container
 RUN pip install --no-cache-dir -r requirements.txt
+
+# Copy the serving script to the working directory
+# moved after dependencies
+COPY server.py .
+
+# Expose the port that the inference server will listen on
+# This is the port that the server will use to communicate with clients
+# The default port for the inference server is 8080
 EXPOSE 8080
+
+# Run the inference server
+CMD ["python", "server.py"]
+
+
diff --git a/Dockerfile.train b/Dockerfile.train
index 06ffaf9..00bbbad 100644
--- a/Dockerfile.train
+++ b/Dockerfile.train
@@ -1,12 +1,42 @@
-FROM
+ARG OWNER=jupyter
+ARG BASE_CONTAINER=$OWNER/scipy-notebook:python-3.11.5
+FROM $BASE_CONTAINER
 
 # TODO: Set a working directory
+# This is the directory where the training script will be executed
+# and where the model will be saved
+WORKDIR /home/jovyan/app
+
+# Switch to root to create model output dir
+USER root
+RUN mkdir -p /home/jovyan/app/models
+# Give ownership to jovyan so it can write to it
+RUN chown -R jovyan /home/jovyan/app
+
+# Switch back to jovyan user
+USER jovyan
 
 # TODO: Copy the requirements.txt file to the working directory
+# This file contains the list of Python packages required for the training script
+# The requirements.txt file should be in the same directory as the Dockerfile
+COPY requirements.txt .
 
-# TODO: Install the Python dependencies
+# TODO: Install the Python dependencies inside the container
+# Use --no-cache-dir to avoid caching the packages
+# Useful for keeping the image size smaller
+RUN pip install --no-cache-dir -r requirements.txt
 
 # TODO: Copy the training script (train.py) to the working directory
+# This script contains the code to train the model
+# and save it to a specified location
+# The script should be in the same directory as the Dockerfile
+# If the script is in a different directory, adjust the path accordingly
+# For example, if the script is in a subdirectory called 'src', use:
+# COPY src/train.py .
+# If the script is in a parent directory, use:
+# COPY ../train.py .
+COPY train.py .
 
 # TODO: Run the training script that generates the model
-CMD [...]
+# This command will be executed when the container starts
+CMD ["python", "train.py"]
diff --git a/README.md b/README.md
index 8098afb..390b090 100644
--- a/README.md
+++ b/README.md
@@ -5,16 +5,16 @@ In this project, you will train, run and serve a machine learning model using Do
 
 ## Deliverables
 
-- [ ] Clone this repository to your personal github account
-- [ ] Containerize training the machine learning model
-- [ ] Containerize serving of the machine learning model
-- [ ] Train and run the machine learning model using Docker
-- [ ] Run the Docker container serving the machine learning model
-- [ ] Store the Docker images on your personal account on Docker Hub
-- [ ] Provide the resulting Dockerfiles in GitHub
-- [ ] Build an Apptainer image on a HPC of your choice
-- [ ] Provide the logs of the slurm job in GitHub
-- [ ] Document the steps in a text document in GitHub
+- [x] Clone this repository to your personal github account
+- [x] Containerize training the machine learning model
+- [x] Containerize serving of the machine learning model
+- [x] Train and run the machine learning model using Docker
+- [x] Run the Docker container serving the machine learning model
+- [x] Store the Docker images on your personal account on Docker Hub
+- [x] Provide the resulting Dockerfiles in GitHub
+- [x] Build an Apptainer image on a HPC of your choice
+- [x] Provide the logs of the slurm job in GitHub
+- [x] Document the steps in a text document in GitHub
 
 ## Proposed steps
 - containerize and run training the machine learning model
diff --git a/Steps.md b/Steps.md
new file mode 100644
index 0000000..668839f
--- /dev/null
+++ b/Steps.md
@@ -0,0 +1,119 @@
+## Containerize and run training the machine learning model
+
+1. I didn't pull a new image with the needed dependencies because I already had one from the training session with the libraries that are going to be used (Repository: jupyter/scipy-notebook; TAG: python-3.11.5)
+
+2. 
I then modified the Dockerfile.train recipe following the advice proposed in the Dockerfile
+
+3. Proceeded to create a first version, but it was missing some info, so I created a second one
+
+4. Built the training image using
+`docker build . --tag train:version2 -f Dockerfile.train`
+
+5. Ran the training container using
+`docker run --rm --volume "$PWD"/app/models:/app/models train:version2`
+
+6. Once completed, this resulted in: iris_model.pkl
+
+## Containerizing and serving the ML model
+
+7. Started by pulling the proposed image using
+`docker pull python:3.9-slim`
+
+8. Modified Dockerfile.infer
+
+9. Built the image using
+`docker build . --tag infer:version1 -f Dockerfile.infer`
+
+10. Ran the container and mounted the model directory
+`docker run --rm -p 8080:8080 -v "$PWD"/app/models:/app/models infer:version1`
+
+11. Tested it in another terminal using
+`curl http://localhost:8080/`, which returned "Welcome to Docker Lab"
+
+## Storing images on Docker Hub
+
+12. I then logged in to Docker Hub via VS Code
+`docker login`
+
+13. Then retagged my images for Docker Hub
+```bash
+docker tag train:version2 agrondin1/train:version2
+docker tag infer:version1 agrondin1/infer:version1
+```
+
+14. Finally pushed the images to Docker Hub
+```bash
+docker push agrondin1/train:version2
+docker push agrondin1/infer:version1
+```
+
+## Building an Apptainer image on the HPC
+
+15. I first connected to https://login.hpc.ugent.be
+
+16. Proceeded to start a shell session with 1 node and 4 cores for 4 hours on the donphan cluster
+
+17. I then entered my scratch directory:
+`cd scratch/gent/491/vsc49179`
+
+18. 
Then went on to build the Apptainer images: `nano build_apptainer_images.sh`
+```bash
+#!/bin/bash
+#SBATCH --job-name=apptainer_build_all
+#SBATCH --output=apptainer_build.log
+#SBATCH --time=01:00:00
+#SBATCH --ntasks=1
+
+# Apptainer is available system-wide so no need to load a module
+
+# Build training image
+apptainer build train_container.sif docker://agrondin1/train:version2
+
+# Build inference image
+apptainer build infer_container.sif docker://agrondin1/infer:version1
+```
+
+19. Submitted the job to Slurm
+`sbatch build_apptainer_images.sh`
+
+This managed to create the infer container but not the train one, so I relaunched it with only the training-image build.
+
+20. I then switched to the VS Code server and modified the script to create the containers like so
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=job_submission
+#SBATCH --output=apptainer_build.log
+#SBATCH --partition=donphan
+#SBATCH --mem=8G
+#SBATCH --time=00:30:00
+# Apptainer is available system-wide so no need to load a module
+```
+
+```bash
+# Build training image
+apptainer build --fakeroot train_container.sif docker://agrondin1/train:version2
+```
+
+```bash
+# Build inference image
+apptainer build --fakeroot infer_container.sif docker://agrondin1/infer:version1
+```
+
+```bash
+mv train_container.sif $VSC_SCRATCH/.
+mv infer_container.sif $VSC_SCRATCH/.
+```
+
+
+This finally created both images plus the log of the Slurm job. Unfortunately, the log is not complete: the first run created the *.sif files, but I had forgotten to set `#SBATCH --output=apptainer_build.log`, so I had to run the job again, and the log now only states that the files already exist.
+
+21. I copied the files to my user home directory:
+`cp apptainer_build.log infer_container.sif train_container.sif /user/gent/491/vsc49179`
+
+22. 
Then downloaded the files and pushed everything except the *.sif files, because they exceeded the file size limit supported by GitHub:
+```bash
+remote: error: File infer_container.sif is 120.25 MB; this exceeds GitHub's file size limit of 100.00 MB
+remote: error: File train_container.sif is 1246.62 MB; this exceeds GitHub's file size limit of 100.00 MB
+```
+
diff --git a/app/models/iris_model.pkl b/app/models/iris_model.pkl
new file mode 100644
index 0000000..86ebede
Binary files /dev/null and b/app/models/iris_model.pkl differ
diff --git a/apptainer_build.log b/apptainer_build.log
new file mode 100644
index 0000000..3bd35c2
--- /dev/null
+++ b/apptainer_build.log
@@ -0,0 +1,4 @@
+FATAL: While checking build target: build target 'train_container.sif' already exists. Use --force if you want to overwrite it
+FATAL: While checking build target: build target 'infer_container.sif' already exists. Use --force if you want to overwrite it
+mv: 'train_container.sif' and '/scratch/gent/491/vsc49179/./train_container.sif' are the same file
+mv: 'infer_container.sif' and '/scratch/gent/491/vsc49179/./infer_container.sif' are the same file