2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
infer_container.sif
train_container.sif
28 changes: 25 additions & 3 deletions Dockerfile.infer
@@ -1,7 +1,29 @@
# Dockerfile for the inference server
# This Dockerfile is used to create a Docker image for the inference server
# It is based on the Python 3.9 slim image
FROM python:3.9-slim

# Set working directory
# This is the directory where the inference server will be executed
# and where the model will be loaded
WORKDIR /app

# Copy the list of Python packages required for the serving script
COPY requirements.txt .

# Install the Python dependencies inside the container
RUN pip install --no-cache-dir -r requirements.txt

# Copy the serving script to the working directory
# (moved after installing dependencies so the pip layer can be cached)
COPY server.py .

# Expose the port that the inference server will listen on
# This is the port that the server will use to communicate with clients
# The default port for the inference server is 8080
EXPOSE 8080

# Run the inference server
CMD ["python", "server.py"]


36 changes: 33 additions & 3 deletions Dockerfile.train
@@ -1,12 +1,42 @@
ARG OWNER=jupyter
ARG BASE_CONTAINER=$OWNER/scipy-notebook:python-3.11.5
FROM $BASE_CONTAINER

# Set the working directory
# This is the directory where the training script will be executed
# and where the model will be saved
WORKDIR /home/jovyan/app

# Switch to root to create model output dir
USER root
RUN mkdir -p /home/jovyan/app/models
# Give ownership to jovyan so it can write to it
RUN chown -R jovyan /home/jovyan/app

# Switch back to jovyan user
USER jovyan

# Copy the requirements.txt file to the working directory
# This file contains the list of Python packages required for the training script
# The requirements.txt file should be in the same directory as the Dockerfile
COPY requirements.txt .

# Install the Python dependencies inside the container
# Use --no-cache-dir to avoid caching the packages
# Useful for keeping the image size smaller
RUN pip install --no-cache-dir -r requirements.txt

# Copy the training script (train.py) to the working directory
# This script contains the code to train the model and save it to a specified location
# If the script lives elsewhere, adjust the path accordingly,
# e.g. COPY src/train.py . for a script in a src/ subdirectory
COPY train.py .

# Run the training script that generates the model
# This command will be executed when the container starts
CMD ["python", "train.py"]
20 changes: 10 additions & 10 deletions README.md
@@ -5,16 +5,16 @@ In this project, you will train, run and serve a machine learning model using Docker

## Deliverables

- [x] Clone this repository to your personal github account
- [x] Containerize training the machine learning model
- [x] Containerize serving of the machine learning model
- [x] Train and run the machine learning model using Docker
- [x] Run the Docker container serving the machine learning model
- [x] Store the Docker images on your personal account on Docker Hub
- [x] Provide the resulting Dockerfiles in GitHub
- [x] Build an Apptainer image on a HPC of your choice
- [x] Provide the logs of the slurm job in GitHub
- [x] Document the steps in a text document in GitHub

## Proposed steps - containerize and run training the machine learning model

119 changes: 119 additions & 0 deletions Steps.md
@@ -0,0 +1,119 @@
## Containerize and run training the machine learning model

1. I didn't pull a new base image with the needed dependencies because I already had one from the training session with the libraries that are going to be used (Repository: jupyter/scipy-notebook; TAG: python-3.11.5)

2. I then modified the Dockerfile.train recipe following the advice proposed in its comments

3. Proceeded to create a first version, but it was missing some information, so I created a second one

4. Built the training image using
`docker build . --tag train:version2 -f Dockerfile.train`

5. Ran the training container using
`docker run --rm --volume "$PWD"/app/models:/app/models train:version2`

6. Once completed, this produced `iris_model.pkl`
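For context, here is a minimal sketch of what a train.py producing this file could look like. It is an illustration only (the actual train.py ships with the repository), and the `models/` output path is an assumption based on the Dockerfile.

```python
# Hypothetical sketch of train.py, not the repository's actual script:
# fit a simple classifier on the iris dataset and pickle it.
import pickle
from pathlib import Path

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load the iris dataset and train a basic model
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

# Save the model into a models/ folder under the container's working directory
out_dir = Path("models")
out_dir.mkdir(parents=True, exist_ok=True)
with open(out_dir / "iris_model.pkl", "wb") as f:
    pickle.dump(model, f)

print(f"Saved model to {out_dir / 'iris_model.pkl'}")
```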

## Containerize and serve the machine learning model

7. Started by pulling the proposed image using
`docker pull python:3.9-slim`

8. Modified the Dockerfile.infer accordingly

9. Built the image using
`docker build . --tag infer:version1 -f Dockerfile.infer`

10. Ran the serving container, publishing the port and mounting the model directory
`docker run --rm -p 8080:8080 -v "$PWD"/app/models:/app/models infer:version1`
> **Reviewer comment (Contributor):** note the port flag here: it must be `-p 8080:8080`; a typo such as `-- p` means the port is not published as specified.


11. Tested it from another terminal using
`curl http://localhost:8080/`, which returned "Welcome to Docker Lab"
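For context, here is a minimal Flask sketch of a server.py consistent with this test. It is hypothetical (the real server.py ships with the repository) and assumes Flask is listed in requirements.txt and that the model is available at /app/models/iris_model.pkl.

```python
# Hypothetical sketch of the inference server, not the repository's actual server.py.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the model produced by the training container (path is an assumption)
with open("/app/models/iris_model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/")
def index():
    # Matches the response observed with curl in step 11
    return "Welcome to Docker Lab"

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON such as {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    return jsonify({"prediction": model.predict(features).tolist()})

if __name__ == "__main__":
    # Bind to all interfaces so the port published with -p 8080:8080 is reachable
    app.run(host="0.0.0.0", port=8080)
```

A prediction could then be requested with something like `curl -X POST -H "Content-Type: application/json" -d '{"features": [[5.1, 3.5, 1.4, 0.2]]}' http://localhost:8080/predict` (the /predict endpoint is part of this sketch, not necessarily of the real server).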

## Storing the images on Docker Hub

12. I then logged in to Docker Hub from the VS Code terminal
`docker login`

13. Then retagged my images for Docker Hub
```bash
docker tag train:version2 agrondin1/train:version2
docker tag infer:version1 agrondin1/infer:version1
```

14. Finally pushed the images to Docker Hub
```bash
docker push agrondin1/train:version2
docker push agrondin1/infer:version1
```
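As an optional sanity check (not part of the original steps), the pushed images can be pulled back on any machine with Docker installed:

```bash
# Verify the images are available on Docker Hub
docker pull agrondin1/train:version2
docker pull agrondin1/infer:version1
```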

## Building an Apptainer image on the HPC

15. I first connected to https://login.hpc.ugent.be

16. Started an interactive shell session with 1 node and 4 cores for 4 hours on the donphan cluster
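For reference, a roughly equivalent interactive session could also be requested from a Slurm command line; this is a sketch, and the exact flags and workflow on the UGent web portal may differ:

```bash
# Hypothetical command-line equivalent of the interactive session above
srun --partition=donphan --nodes=1 --cpus-per-task=4 --time=04:00:00 --pty bash
```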

17. I then entered my scratch directory:
`cd scratch/gent/491/vsc49179`

18. Then went on to write the build script for the Apptainer images: `nano build_apptainer_images.sh`
```bash
#!/bin/bash
#SBATCH --job-name=apptainer_build_all
#SBATCH --output=apptainer_build.log
#SBATCH --time=01:00:00
#SBATCH --ntasks=1

# Apptainer is available system-wide so no need to load a module

# Build training image
apptainer build train_container.sif docker://agrondin1/train:version2

# Build inference image
apptainer build infer_container.sif docker://agrondin1/infer:version1
```

19. Submitted the job to Slurm
`sbatch build_apptainer_images.sh`

This created the infer container but not the train container, so I relaunched it with only the training-image build.

20. I then switched to the VS Code server and modified the script to create the containers like so:

```bash
#!/bin/bash
#SBATCH --job-name=job_submission
#SBATCH --output=apptainer_build.log
#SBATCH --partition=donphan
#SBATCH --mem=8G
#SBATCH --time=00:30:00

# Apptainer is available system-wide so no need to load a module

# Build training image
apptainer build --fakeroot train_container.sif docker://agrondin1/train:version2

# Build inference image
apptainer build --fakeroot infer_container.sif docker://agrondin1/infer:version1

# Move the images to the scratch directory
mv train_container.sif $VSC_SCRATCH/.
mv infer_container.sif $VSC_SCRATCH/.
```


And then I finally managed to create both images, plus the log of the Slurm job. Unfortunately, the .log is not complete: the *.sif files were built in a run where I had forgotten to use `#SBATCH --output=apptainer_build.log`, so I had to run the job again, and now the log just states that the files already exist.
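Although running the images on the cluster was not part of these steps, a sketch of how the resulting .sif files could be used looks roughly like this; the bind paths are assumptions derived from the two Dockerfiles:

```bash
# Hypothetical usage of the built images on the HPC
# Run the training image, binding a scratch folder for the model output
apptainer run --bind "$VSC_SCRATCH/models:/home/jovyan/app/models" "$VSC_SCRATCH/train_container.sif"

# Run the inference image, binding the same folder where the server expects the model
apptainer run --bind "$VSC_SCRATCH/models:/app/models" "$VSC_SCRATCH/infer_container.sif"
```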

21. I copied the files to my user home directory:
`cp apptainer_build.log infer_container.sif train_container.sif /user/gent/491/vsc49179`

22. Then finally downloaded the files and pushed everything except the *.sif files (hence the new `.gitignore` entries), because they exceeded the file size limit supported by GitHub:
```bash
remote: error: File infer_container.sif is 120.25 MB; this exceeds GitHub's file size limit of 100.00 MB
remote: error: File train_container.sif is 1246.62 MB; this exceeds GitHub's file size limit of 100.00 MB
```

> **Reviewer comment (Contributor):** Thanks for your efforts @AdrienG9 - no need to provide the sif files! ;-)

Binary file added app/models/iris_model.pkl
Binary file not shown.
4 changes: 4 additions & 0 deletions apptainer_build.log
@@ -0,0 +1,4 @@
FATAL: While checking build target: build target 'train_container.sif' already exists. Use --force if you want to overwrite it
FATAL: While checking build target: build target 'infer_container.sif' already exists. Use --force if you want to overwrite it
mv: 'train_container.sif' and '/scratch/gent/491/vsc49179/./train_container.sif' are the same file
mv: 'infer_container.sif' and '/scratch/gent/491/vsc49179/./infer_container.sif' are the same file
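As these FATAL messages indicate, rebuilding over existing images requires the --force flag, for example:

```bash
# Overwrite an existing image when rebuilding (as suggested by the messages above)
apptainer build --force train_container.sif docker://agrondin1/train:version2
```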