2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
infer_container.sif
train_container.sif
28 changes: 25 additions & 3 deletions Dockerfile.infer
@@ -1,7 +1,29 @@
# Dockerfile for the inference server
# This Dockerfile is used to create a Docker image for the inference server
# It is based on the Python 3.9 slim image
FROM python:3.9-slim

# Set working directory
# This is the directory where the inference server will be executed
# and where the model will be loaded
WORKDIR /app

# Copy the list of Python packages required for the serving script
COPY requirements.txt .

# Install the Python dependencies inside the container
RUN pip install --no-cache-dir -r requirements.txt

# Copy the serving script to the working directory
# (moved after installing dependencies so the pip layer can be cached)
COPY server.py .

# Expose the port that the inference server will listen on
# This is the port that the server will use to communicate with clients
# The default port for the inference server is 8080
EXPOSE 8080

# Run the inference server
CMD ["python", "server.py"]


36 changes: 33 additions & 3 deletions Dockerfile.train
@@ -1,12 +1,42 @@
ARG OWNER=jupyter
ARG BASE_CONTAINER=$OWNER/scipy-notebook:python-3.11.5
FROM $BASE_CONTAINER

# Set the working directory
# This is the directory where the training script will be executed
# and where the model will be saved
WORKDIR /home/jovyan/app

# Switch to root to create model output dir
USER root
RUN mkdir -p /home/jovyan/app/models
# Give ownership to jovyan so it can write to it
RUN chown -R jovyan /home/jovyan/app

# Switch back to jovyan user
USER jovyan

# Copy the requirements.txt file to the working directory
# This file contains the list of Python packages required for the training script
# The requirements.txt file should be in the same directory as the Dockerfile
COPY requirements.txt .

# Install the Python dependencies inside the container
# Use --no-cache-dir to avoid caching the packages
# Useful for keeping the image size smaller
RUN pip install --no-cache-dir -r requirements.txt

# Copy the training script (train.py) to the working directory
# This script contains the code to train the model and save it to a specified location
# If the script lives elsewhere, adjust the path accordingly,
# e.g. COPY src/train.py . for a script in a src/ subdirectory
COPY train.py .

# Run the training script that generates the model
# This command will be executed when the container starts
CMD ["python", "train.py"]
20 changes: 10 additions & 10 deletions README.md
@@ -5,16 +5,16 @@ In this project, you will train, run and serve a machine learning model using Docker

## Deliverables

- [x] Clone this repository to your personal github account
- [x] Containerize training the machine learning model
- [x] Containerize serving of the machine learning model
- [x] Train and run the machine learning model using Docker
- [x] Run the Docker container serving the machine learning model
- [x] Store the Docker images on your personal account on Docker Hub
- [x] Provide the resulting Dockerfiles in GitHub
- [x] Build an Apptainer image on a HPC of your choice
- [x] Provide the logs of the slurm job in GitHub
- [x] Document the steps in a text document in GitHub

## Proposed steps - containerize and run training the machine learning model

119 changes: 119 additions & 0 deletions Steps.md
@@ -0,0 +1,119 @@
## Containerize and run training the machine learning model

1. I didn't pull a new base image with the needed dependencies because I already had one from the training session with the libraries that are going to be used (Repository: jupyter/scipy-notebook; TAG: python-3.11.5)

2. I then modified the Dockerfile.train recipe following the advice proposed in its comments

3. Proceeded to create a first version, but it was missing some information, so I created a second one

4. Built the training image using
`docker build . --tag train:version2 -f Dockerfile.train`

5. Ran the training container using
`docker run --rm --volume "$PWD"/app/models:/app/models train:version2`

6. Once completed, this produced `iris_model.pkl`
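For context, here is a minimal sketch of what a train.py producing this file could look like. It is an illustration only (the actual train.py ships with the repository), and the `models/` output path is an assumption based on the Dockerfile.

```python
# Hypothetical sketch of train.py, not the repository's actual script:
# fit a simple classifier on the iris dataset and pickle it.
import pickle
from pathlib import Path

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load the iris dataset and train a basic model
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

# Save the model into a models/ folder under the container's working directory
out_dir = Path("models")
out_dir.mkdir(parents=True, exist_ok=True)
with open(out_dir / "iris_model.pkl", "wb") as f:
    pickle.dump(model, f)

print(f"Saved model to {out_dir / 'iris_model.pkl'}")
```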

## Containerize and serve the machine learning model

7. Started by pulling the proposed image using
`docker pull python:3.9-slim`

8. Modified the Dockerfile.infer accordingly

9. Built the image using
`docker build . --tag infer:version1 -f Dockerfile.infer`

10. Ran the serving container, publishing the port and mounting the model directory
`docker run --rm -p 8080:8080 -v "$PWD"/app/models:/app/models infer:version1`
> **Reviewer comment (Contributor):** note the port flag here: it must be `-p 8080:8080`; a typo such as `-- p` means the port is not published as specified.


11. Tested it from another terminal using
`curl http://localhost:8080/`, which returned "Welcome to Docker Lab"
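For context, here is a minimal Flask sketch of a server.py consistent with this test. It is hypothetical (the real server.py ships with the repository) and assumes Flask is listed in requirements.txt and that the model is available at /app/models/iris_model.pkl.

```python
# Hypothetical sketch of the inference server, not the repository's actual server.py.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the model produced by the training container (path is an assumption)
with open("/app/models/iris_model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/")
def index():
    # Matches the response observed with curl in step 11
    return "Welcome to Docker Lab"

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON such as {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    return jsonify({"prediction": model.predict(features).tolist()})

if __name__ == "__main__":
    # Bind to all interfaces so the port published with -p 8080:8080 is reachable
    app.run(host="0.0.0.0", port=8080)
```

A prediction could then be requested with something like `curl -X POST -H "Content-Type: application/json" -d '{"features": [[5.1, 3.5, 1.4, 0.2]]}' http://localhost:8080/predict` (the /predict endpoint is part of this sketch, not necessarily of the real server).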

## Storing the images on Docker Hub

12. I then logged in to Docker Hub from the VS Code terminal
`docker login`

13. Then retagged my images for Docker Hub
```bash
docker tag train:version2 agrondin1/train:version2
docker tag infer:version1 agrondin1/infer:version1
```

14. Finally pushed the images to Docker Hub
```bash
docker push agrondin1/train:version2
docker push agrondin1/infer:version1
```
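As an optional sanity check (not part of the original steps), the pushed images can be pulled back on any machine with Docker installed:

```bash
# Verify the images are available on Docker Hub
docker pull agrondin1/train:version2
docker pull agrondin1/infer:version1
```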

## Building an Apptainer image on the HPC

15. I first connected to https://login.hpc.ugent.be

16. Started an interactive shell session with 1 node and 4 cores for 4 hours on the donphan cluster
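For reference, a roughly equivalent interactive session could also be requested from a Slurm command line; this is a sketch, and the exact flags and workflow on the UGent web portal may differ:

```bash
# Hypothetical command-line equivalent of the interactive session above
srun --partition=donphan --nodes=1 --cpus-per-task=4 --time=04:00:00 --pty bash
```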

17. I then entered my scratch directory:
`cd scratch/gent/491/vsc49179`

18. Then went on to write the build script for the Apptainer images: `nano build_apptainer_images.sh`
```bash
#!/bin/bash
#SBATCH --job-name=apptainer_build_all
#SBATCH --output=apptainer_build.log
#SBATCH --time=01:00:00
#SBATCH --ntasks=1

# Apptainer is available system-wide so no need to load a module

# Build training image
apptainer build train_container.sif docker://agrondin1/train:version2

# Build inference image
apptainer build infer_container.sif docker://agrondin1/infer:version1
```

19. Submitted the job to Slurm
`sbatch build_apptainer_images.sh`

This created the infer container but not the train container, so I relaunched it with only the training-image build.

20. I then switched to the VS Code server and modified the script to create the containers like so:

```bash
#!/bin/bash
#SBATCH --job-name=job_submission
#SBATCH --output=apptainer_build.log
#SBATCH --partition=donphan
#SBATCH --mem=8G
#SBATCH --time=00:30:00

# Apptainer is available system-wide so no need to load a module

# Build training image
apptainer build --fakeroot train_container.sif docker://agrondin1/train:version2

# Build inference image
apptainer build --fakeroot infer_container.sif docker://agrondin1/infer:version1

# Move the images to the scratch directory
mv train_container.sif $VSC_SCRATCH/.
mv infer_container.sif $VSC_SCRATCH/.
```


And then I finally managed to create both images, plus the log of the Slurm job. Unfortunately, the .log is not complete: the *.sif files were built in a run where I had forgotten to use `#SBATCH --output=apptainer_build.log`, so I had to run the job again, and now the log just states that the files already exist.
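Although running the images on the cluster was not part of these steps, a sketch of how the resulting .sif files could be used looks roughly like this; the bind paths are assumptions derived from the two Dockerfiles:

```bash
# Hypothetical usage of the built images on the HPC
# Run the training image, binding a scratch folder for the model output
apptainer run --bind "$VSC_SCRATCH/models:/home/jovyan/app/models" "$VSC_SCRATCH/train_container.sif"

# Run the inference image, binding the same folder where the server expects the model
apptainer run --bind "$VSC_SCRATCH/models:/app/models" "$VSC_SCRATCH/infer_container.sif"
```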

21. I copied the files to my user home directory:
`cp apptainer_build.log infer_container.sif train_container.sif /user/gent/491/vsc49179`

22. Then finally downloaded the files and pushed everything except the *.sif files (hence the new `.gitignore` entries), because they exceeded the file size limit supported by GitHub:
```bash
remote: error: File infer_container.sif is 120.25 MB; this exceeds GitHub's file size limit of 100.00 MB
remote: error: File train_container.sif is 1246.62 MB; this exceeds GitHub's file size limit of 100.00 MB
```

> **Reviewer comment (Contributor):** Thanks for your efforts @AdrienG9 - no need to provide the sif files! ;-)

Binary file added app/models/iris_model.pkl
Binary file not shown.
4 changes: 4 additions & 0 deletions apptainer_build.log
@@ -0,0 +1,4 @@
FATAL: While checking build target: build target 'train_container.sif' already exists. Use --force if you want to overwrite it
FATAL: While checking build target: build target 'infer_container.sif' already exists. Use --force if you want to overwrite it
mv: 'train_container.sif' and '/scratch/gent/491/vsc49179/./train_container.sif' are the same file
mv: 'infer_container.sif' and '/scratch/gent/491/vsc49179/./infer_container.sif' are the same file
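As these FATAL messages indicate, rebuilding over existing images requires the --force flag, for example:

```bash
# Overwrite an existing image when rebuilding (as suggested by the messages above)
apptainer build --force train_container.sif docker://agrondin1/train:version2
```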