Commit 912551e

Initial commit
0 parents  commit 912551e

45 files changed: +1160 -0 lines changed

.dvc/.gitignore

+3
@@ -0,0 +1,3 @@
1+
/config.local
2+
/tmp
3+
/cache

.dvc/config

Whitespace-only changes.

.dvcignore

+3
@@ -0,0 +1,3 @@
1+
# Add patterns of files dvc should ignore, which could improve
2+
# the performance. Learn more at
3+
# https://dvc.org/doc/user-guide/dvcignore

.github/workflows/.gitkeep

+1
@@ -0,0 +1 @@
1+

.gitignore

+2
@@ -0,0 +1,2 @@
1+
.venv/
2+
dvc_plots

README.md

+45
@@ -0,0 +1,45 @@
1+
# workshop-uncool-mlops
2+
3+
![Overview](./docs/imgs/overview.png)
4+
5+
![Issue Labeler](./docs/imgs/issue-labeler.jpg)
6+
7+
- :star: -> https://github.com/iterative/dvc
8+
- :star: -> https://github.com/iterative/dvclive
9+
- :star: -> https://github.com/iterative/cml
10+
- :star: -> https://github.com/huggingface/transformers
11+
12+
# Before we start
13+
14+
- Fork this repo https://github.com/iterative/workshop-uncool-mlops
15+
- Clone **your fork**.
16+
17+
- Install:
18+
19+
```console
20+
$ python -m venv .venv
21+
$ source .venv/bin/activate
22+
$ python -m pip install --upgrade pip
23+
$ pip install wheel
24+
$ pip install -r requirements.txt
25+
```
26+
27+
- Create a [GitHub personal access token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token)
28+
29+
```console
30+
$ export GITHUB_TOKEN={YOUR_GITHUB_TOKEN}
31+
```
32+
33+
# Current status
34+
35+
1. [Local Reproducibility](./docs/1-local-reproducibility.md)
36+
37+
# Workshop
38+
39+
2. [Shared Reproducibility](./docs/2-shared-reproducibility.md)
40+
41+
3. [Online Reproducibility](./docs/3-online-reproducibility.md)
42+
43+
4. [Deployment](./docs/4-deployment.md)
44+
45+
5. [Automation](./docs/5-automation.md)

docs/1-local-reproducibility.md

+54
@@ -0,0 +1,54 @@
1+
# Local reproducibility
2+
3+
We have a [DVC Pipeline](https://dvc.org/doc/user-guide/project-structure/pipelines-files) defined in the [dvc.yaml](../dvc.yaml) file.
4+
5+
The pipeline is composed of stages using Python scripts, defined in [src](../src/):
6+
7+
```mermaid
8+
flowchart TD
9+
node2[eval]
10+
node3[get-data]
11+
node4[split-data]
12+
node5[train]
13+
node3-->node4
14+
node4-->node2
15+
node4-->node5
16+
node5-->node2
17+
```
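
The execution order implied by the flowchart above can be sketched with Python's standard-library `graphlib` (Python 3.9+). This only illustrates dependency ordering and is not how DVC itself schedules stages; the `stage_deps` dictionary is transcribed by hand from the diagram rather than read from `dvc.yaml`.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on.
# Hand-copied from the flowchart; an assumption, not parsed from dvc.yaml.
stage_deps = {
    "get-data": set(),
    "split-data": {"get-data"},
    "train": {"split-data"},
    "eval": {"split-data", "train"},
}

# static_order() yields every stage after all of its dependencies.
run_order = list(TopologicalSorter(stage_deps).static_order())
print(run_order)
```

Because `eval` depends on both `split-data` and `train`, it always comes last.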
18+
19+
We use [DVC Params](https://dvc.org/doc/command-reference/params), defined in [params.yaml](../params.yaml), to configure the pipeline.
20+
21+
The pipeline enables local `reproducibility` and can be run with `dvc repro` / `dvc exp run`:
22+
23+
```console
24+
$ export GITHUB_TOKEN={YOUR_GITHUB_TOKEN}
25+
$ export LOGURU_LEVEL=INFO
26+
$ dvc exp run -S train.epochs=8
27+
```
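
To make the `-S train.epochs=8` override above more concrete, here is a minimal sketch of applying a dotted `key=value` override to a nested parameters dictionary. It is a conceptual illustration, not DVC's implementation; `apply_override`, `parse_value`, and the sample dictionary are hypothetical, and the real contents of `params.yaml` may differ.

```python
from copy import deepcopy


def parse_value(raw: str):
    """Illustrative parsing: try int, then float, otherwise keep the string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def apply_override(params: dict, override: str) -> dict:
    """Return a copy of params with one 'dotted.key=value' override applied."""
    key_path, raw_value = override.split("=", 1)
    updated = deepcopy(params)  # leave the caller's dictionary untouched
    *parents, leaf = key_path.split(".")
    node = updated
    for part in parents:
        node = node.setdefault(part, {})  # walk (or create) nested sections
    node[leaf] = parse_value(raw_value)
    return updated


# Hypothetical stand-in for the parsed contents of params.yaml.
params = {"train": {"epochs": 2}}
print(apply_override(params, "train.epochs=8"))
```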
28+
29+
30+
The pipeline generates [DVC Metrics](https://dvc.org/doc/command-reference/metrics) and [DVC Plots](https://dvc.org/doc/command-reference/plots) to evaluate model performance, which can be found in [outs](../outs):
31+
32+
```console
33+
$ dvc exp diff
34+
```
35+
36+
```console
37+
$ dvc plots diff --open
38+
```
39+
40+
Because the metrics and plots files are small enough to be tracked by `git`, after we run the pipeline we can share the results with others:
41+
42+
```
43+
git add dvc.lock outs
44+
git push
45+
```
46+
47+
You can connect the repo to https://studio.iterative.ai/ for a better visualization of the metrics, parameters, and plots associated with each commit:
48+
49+
https://studio.iterative.ai/user/daavoo/views/workshop-uncool-mlops-5fgmd70rkt
50+
51+
52+
However, the rest of the outputs are gitignored because they are too big to be tracked by `git`.
53+
54+
![Bigger Boat](./imgs/bigger-boat.jpg)

docs/2-shared-reproducibility.md

+50
@@ -0,0 +1,50 @@
1+
# Shared Reproducibility
2+
3+
DVC remotes provide a location to store arbitrarily large files and directories.
4+
5+
![DVC Remote](./imgs/dvc-remote.png)
6+
7+
First, create a new folder in your [Google Drive](https://drive.google.com), navigate to it, and copy the last part of its URL.
8+
9+
![Google Drive](./imgs/gdrive.png)
10+
11+
You can now add a DVC remote to our project:
12+
13+
```bash
14+
dvc remote add --default myremote gdrive://{COPY PASTED GDRIVE URL}
15+
```
16+
17+
---
18+
19+
> More info: https://dvc.org/doc/command-reference/remote/add#description
20+
21+
---
22+
23+
The results of the pipeline can now be shared with others by using [dvc push](https://dvc.org/doc/command-reference/push) and [dvc pull](https://dvc.org/doc/command-reference/pull).
24+
25+
```console
26+
dvc push -j 4
27+
```
28+
29+
You will be prompted for Google Drive credentials the first time you run `dvc push` or `dvc pull`.
30+
31+
32+
```bash
33+
# Researcher A
34+
# Updates hparam
35+
dvc repro
36+
git add . && git commit -m "Updated hparam"
37+
git push && dvc push
38+
```
39+
40+
```bash
41+
# Researcher B
42+
git pull && dvc pull
43+
# Receives changes
44+
```
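
The hand-off above works because `dvc.lock`, which Git tracks, records a content hash for each pipeline output, so `dvc pull` knows which objects to fetch from the remote. Below is a minimal sketch of hashing a file in chunks, assuming MD5 (the hash DVC records for local files by default). The `file_md5` helper and the sample bytes are hypothetical; this is not DVC's own code.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so large artifacts never sit fully in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file standing in for a large model artifact.
with tempfile.TemporaryDirectory() as tmp_dir:
    artifact = Path(tmp_dir) / "model.bin"  # hypothetical file name
    artifact.write_bytes(b"hello")
    artifact_hash = file_md5(artifact)

print(artifact_hash)
```

Two researchers whose outputs hash to the same value share the same stored object, which is why only changed artifacts need to be transferred.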
45+
46+
## Other Remotes
47+
48+
These commands work the same regardless of the remote type. See all the supported storage types:
49+
50+
https://dvc.org/doc/command-reference/remote/add#supported-storage-types

docs/3-online-reproducibility.md

+140
@@ -0,0 +1,140 @@
1+
# Online Reproducibility
2+
3+
## Add a new secret `PERSONAL_GITHUB_TOKEN`
4+
5+
- Create a personal access token
6+
7+
https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token
8+
9+
- Create a new secret and name it `PERSONAL_GITHUB_TOKEN`
10+
11+
## Grant GitHub access to DVC Remote
12+
13+
You need to grant GitHub access to the DVC Remote.
14+
15+
Get the credentials.
16+
17+
```bash
18+
cat ".dvc/tmp/gdrive-user-credentials.json"
19+
```
20+
21+
And create a new GitHub secret called `GDRIVE_CREDENTIALS_DATA` to store them.
22+
23+
With this, GitHub runners will be able to pull and push all the changes generated by the pipeline.
24+
25+
## Pull Request workflow
26+
27+
You can create a new *GitHub Actions workflow* that runs when a Pull Request is opened or updated.
28+
29+
This workflow will use `DVC` to reproduce the pipeline and update the large artifacts tracked by DVC.
30+
31+
In addition, it will use `CML` to post a **report** with the `DVC` metrics, params, and plots ([cml send-comment](https://cml.dev/doc/ref/send-comment)). It will also update the artifacts tracked by Git ([cml pr](https://cml.dev/doc/ref/pr)).
32+
33+
![Report Metrics](./imgs/report-metrics.png)
34+
35+
![Report Plots](./imgs/report-plots.png)
36+
37+
<details>
38+
<summary>Create and fill `.github/workflows/on_pr.yml`</summary>
39+
40+
```yaml
41+
name: DVC & CML Workflow
42+
43+
on:
44+
  pull_request:
45+
46+
  # Allows you to run this workflow manually from the Actions tab
47+
  workflow_dispatch:
48+
49+
jobs:
50+
  build:
51+
    runs-on: ubuntu-latest
52+
    container: docker://ghcr.io/iterative/cml:latest
53+
54+
    steps:
55+
      - uses: actions/checkout@v2
56+
        with:
57+
          fetch-depth: 0
58+
59+
      - name: Setup
60+
        run: |
61+
          pip install -r requirements.txt
62+
63+
      - name: Run DVC pipeline
64+
        env:
65+
          GITHUB_TOKEN: ${{ secrets.PERSONAL_GITHUB_TOKEN }}
66+
          GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
67+
        run: |
68+
          dvc repro --pull
69+
70+
      - name: Share changes
71+
        env:
72+
          GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
73+
        run: |
74+
          dvc push
75+
76+
      - name: Create a P.R. with CML
77+
        env:
78+
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
79+
        run: |
80+
          cml pr --auto-merge "dvc.lock" "outs/*.json" "outs/eval" "outs/train_metrics"
81+
82+
      - name: CML Report
83+
        env:
84+
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
85+
        run: |
86+
          echo "## Metrics & Params" >> report.md
87+
88+
          dvc exp diff main --show-md >> report.md
89+
          cml send-comment --pr --update report.md
90+
91+
          echo "## Plots" >> report.md
92+
93+
          echo "### Eval Loss" >> report.md
94+
          dvc plots diff \
95+
            --target outs/train_metrics/scalars/eval_loss.tsv --show-vega main > vega.json
96+
          vl2png vega.json -s 1.5 | cml-publish --md >> report.md
97+
98+
          echo "### Eval Accuracy" >> report.md
99+
          dvc plots diff \
100+
            --target outs/train_metrics/scalars/eval_accuracy.tsv --show-vega main > vega.json
101+
          vl2png vega.json -s 1.5 | cml-publish --md >> report.md
102+
103+
          echo "### Confusion Matrix" >> report.md
104+
          dvc plots diff \
105+
            --target outs/eval/plots/confusion_matrix.json --show-vega main > vega.json
106+
          vl2png vega.json -s 1.5 | cml-publish --md >> report.md
107+
108+
          cml send-comment --pr --update report.md
109+
```
110+
</details>
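
To give a feel for what the "Metrics & Params" section of the report contains, here is a minimal sketch that renders a Markdown table comparing two sets of metrics. It only approximates the idea behind `dvc exp diff main --show-md`; the column layout, the `metrics_diff_md` helper, and the metric values are assumptions made for illustration and do not reproduce DVC's actual output.

```python
def metrics_diff_md(baseline: dict, current: dict) -> str:
    """Render a Markdown table of metrics present in both runs, with their change."""
    rows = [
        "| Metric | main | workspace | Change |",
        "|---|---|---|---|",
    ]
    # Only compare metrics that exist on both sides; sort for stable output.
    for name in sorted(baseline.keys() & current.keys()):
        old, new = baseline[name], current[name]
        rows.append(f"| {name} | {old:.4f} | {new:.4f} | {new - old:+.4f} |")
    return "\n".join(rows)


# Hypothetical metric values, not results from this repository.
main_metrics = {"eval_accuracy": 0.80, "eval_loss": 0.50}
branch_metrics = {"eval_accuracy": 0.85, "eval_loss": 0.45}

report_table = metrics_diff_md(main_metrics, branch_metrics)
print(report_table)
```

Appending a table like this to `report.md` is what lets the comment on the Pull Request show at a glance whether a parameter change helped.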
111+
112+
## Reproduce Online
113+
114+
And now you can reproduce the pipeline from the web:
115+
116+
### From GitHub UI
117+
118+
- Edit `params.yaml` from the GitHub Interface.
119+
120+
- Change `train.epochs`.
121+
122+
- Select `Create a new branch for this commit and start a pull request`
123+
124+
### From Studio
125+
126+
- Go to https://studio.iterative.ai (It's free)
127+
- Connect your GitHub account.
128+
- Add a new view.
129+
130+
> More info: https://dvc.org/doc/studio
131+
132+
- Click on `Run new experiment` button.
133+
134+
## More compute
135+
136+
In the above workflow we are using the default GitHub runners to train our model.
137+
138+
While this is enough for our use case (small dataset, small model), other projects often require more compute resources.
139+
140+
[CML Self-Hosted Runners](https://cml.dev/doc/self-hosted-runners) allow you to allocate cloud instances (or on-premise machines) and use them in your GitHub Actions workflow.
