diff --git a/.dvc/.gitignore b/.dvc/.gitignore +3
diff --git a/.dvc/config b/.dvc/config
diff --git a/.dvcignore b/.dvcignore +3
diff --git a/.github/workflows/.gitkeep b/.github/workflows/.gitkeep +1
diff --git a/.gitignore b/.gitignore +2
diff --git a/README.md b/README.md +45
diff --git a/docs/1-local-reproducibility.md b/docs/1-local-reproducibility.md +54
diff --git a/docs/2-shared-reproducibility.md b/docs/2-shared-reproducibility.md +50
diff --git a/docs/3-online-reproducibility.md b/docs/3-online-reproducibility.md +140
@@ -0,0 +1,3 @@
+/config.local
+/tmp
+/cache
@@ -0,0 +1,3 @@
+# Add patterns of files dvc should ignore, which could improve
+# the performance. Learn more at
+# https://dvc.org/doc/user-guide/dvcignore
@@ -0,0 +1 @@
+
@@ -0,0 +1,2 @@
+.venv/
+dvc_plots
@@ -0,0 +1,45 @@
+# workshop-uncool-mlops
+
+![Overview](./docs/imgs/overview.png)
+
+![Issue Labeler](./docs/imgs/issue-labeler.jpg)
+
+- :star: -> https://github.com/iterative/dvc
+- :star: -> https://github.com/iterative/dvclive
+- :star: -> https://github.com/iterative/cml
+- :star: -> https://github.com/huggingface/transformers
+
+# Before we start
+
+- Fork this repo https://github.com/iterative/workshop-uncool-mlops
+- Clone **your fork**.
+
+- Install:
+
+```console
+$ python -m venv .venv
+$ source .venv/bin/activate
+$ python -m pip install --upgrade pip
+$ pip install wheel
+$ pip install -r requirements.txt
+```
+
+- Create a [GitHub personal access token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token)
+
+```console
+$ export GITHUB_TOKEN={YOUR_GITHUB_TOKEN}
+```
+
+# Current status
+
+1. [Local Reproducibility](./docs/1-local-reproducibility.md)
+
+# Workshop
+
+2. [Shared Reproducibility](./docs/2-shared-reproducibility.md)
+
+3. [Online Reproducibility](./docs/3-online-reproducibility.md)
+
+4. [Deployment](./docs/4-deployment.md)
+
+5. [Automation](./docs/5-automation.md)
@@ -0,0 +1,54 @@
+# Local reproducibility
+
+We have a [DVC Pipeline](https://dvc.org/doc/user-guide/project-structure/pipelines-files) defined in the [dvc.yaml](../dvc.yaml) file.
+
+The pipeline is composed of stages using Python scripts, defined in [src](../src/):
+
+```mermaid
+flowchart TD
+        node2[eval]
+        node3[get-data]
+        node4[split-data]
+        node5[train]
+        node3-->node4
+        node4-->node2
+        node4-->node5
+        node5-->node2
+```
+
+We use [DVC Params](https://dvc.org/doc/command-reference/params), defined in [params.yaml](../params.yaml), to configure the pipeline.
+
+The pipeline enables local reproducibility and can be run with `dvc repro` or `dvc exp run`:
+
+```console
+$ export GITHUB_TOKEN={YOUR_GITHUB_TOKEN}
+$ export LOGURU_LEVEL=INFO
+$ dvc exp run -S train.epochs=8
+```
+
+
+The pipeline generates [DVC Metrics](https://dvc.org/doc/command-reference/metrics) and [DVC Plots](https://dvc.org/doc/command-reference/plots) to evaluate model performance, which can be found in [outs](../outs).
+
+```console
+$ dvc exp diff
+```
+
+```console
+$ dvc plots diff --open
+```
+
+Because the metrics and plots files are small enough to be tracked by `git`, after we run the pipeline we can share the results with others:
+
+```console
+$ git add dvc.lock outs
+$ git commit -m "Run experiment"
+$ git push
+```
+
+You can connect the repo with https://studio.iterative.ai/ to get a better visualization of the metrics, parameters and plots associated with each commit:
+
+https://studio.iterative.ai/user/daavoo/views/workshop-uncool-mlops-5fgmd70rkt
+
+
+However, the rest of the outputs are gitignored because they are too big to be tracked by `git`.
+
+![Bigger Boat](./imgs/bigger-boat.jpg)
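
As an illustration of the stage order implied by the flowchart in `docs/1-local-reproducibility.md`, here is a minimal Python sketch. It is not how DVC itself is implemented. The `PIPELINE` mapping and the `execution_order` helper are hypothetical names introduced here, the dependencies are copied from the flowchart edges, and `graphlib` assumes Python 3.9 or newer.

```python
from graphlib import TopologicalSorter

# Stage -> stages it depends on, copied from the flowchart edges
# (node3-->node4, node4-->node2, node4-->node5, node5-->node2).
# PIPELINE is a hypothetical name used only for this illustration.
PIPELINE = {
    "get-data": set(),
    "split-data": {"get-data"},
    "train": {"split-data"},
    "eval": {"split-data", "train"},
}


def execution_order(graph):
    """Return the stages in an order where every dependency comes first."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(" -> ".join(execution_order(PIPELINE)))
```

Because `eval` depends on both `split-data` and `train`, this graph admits only one valid order, which is the order a tool has to respect when reproducing the pipeline.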
@@ -0,0 +1,50 @@
+# Shared Reproducibility
+
+DVC remotes provide a location to store arbitrarily large files and directories.
+
+![DVC Remote](./imgs/dvc-remote.png)
+
+First, create a new folder in your [Google Drive](https://drive.google.com), navigate to the new folder, and copy the last part of the URL (the folder ID).
+
+![Google Drive](./imgs/gdrive.png)
+
+You can now add a DVC remote to our project:
+
+```bash
+dvc remote add --default myremote gdrive://{COPY PASTED GDRIVE URL}
+```
+
+---
+
+> More info: https://dvc.org/doc/command-reference/remote/add#description
+
+---
+
+The results of the pipeline can now be shared with others by using [dvc push](https://dvc.org/doc/command-reference/push) and [dvc pull](https://dvc.org/doc/command-reference/pull).
+
+```console
+$ dvc push -j 4
+```
+
+You will be prompted for Google Drive credentials the first time you run `dvc push` or `dvc pull`.
+
+
+```bash
+# Researcher A
+# Updates hparam
+dvc repro
+git add . && git commit -m "Updated hparam"
+git push && dvc push
+```
+
+```bash
+# Researcher B
+git pull && dvc pull
+# Receives changes
+```
+
+## Other Remotes
+
+These commands work the same regardless of the remote type. See all the supported storage types:
+
+https://dvc.org/doc/command-reference/remote/add#supported-storage-types
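
The step of copying the last part of the Google Drive folder URL can be sketched in Python as follows. `gdrive_remote_url` is a hypothetical helper, not part of DVC, and the folder ID in the example is made up. The sketch assumes the folder ID is the final path segment of the URL.

```python
from urllib.parse import urlparse


def gdrive_remote_url(folder_url: str) -> str:
    """Build a gdrive:// remote URL from a Google Drive folder URL.

    Assumes the folder ID is the last path segment; query strings such
    as ?usp=sharing are ignored because urlparse keeps them out of .path.
    """
    path = urlparse(folder_url).path.rstrip("/")
    folder_id = path.rsplit("/", 1)[-1]
    if not folder_id:
        raise ValueError(f"no folder ID found in {folder_url!r}")
    return f"gdrive://{folder_id}"


if __name__ == "__main__":
    # Hypothetical folder URL; the ID is invented for illustration.
    print(gdrive_remote_url("https://drive.google.com/drive/folders/0AbCdEfGhIjKlMn"))
```

The returned string is the value to pass as the last argument of `dvc remote add --default myremote`.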
@@ -0,0 +1,140 @@
+# Online Reproducibility
+
+## Add new secret `PERSONAL_GITHUB_TOKEN`
+
+- Create a personal access token
+
+https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token
+
+- Create a new secret and name it `PERSONAL_GITHUB_TOKEN`
+
+## Grant GitHub access to DVC Remote
+
+You need to grant GitHub access to the DVC Remote.
+
+Get the credentials.
+
+```bash
+cat ".dvc/tmp/gdrive-user-credentials.json"
+```
+
+And create a new GitHub secret called `GDRIVE_CREDENTIALS_DATA` to store them.
+
+With this, GitHub runners will be able to pull and push all the changes generated by the pipeline.
+
+## Pull Request workflow
+
+You can create a new *GitHub Actions workflow* that runs when a new Pull Request is created.
+
+This workflow will use `DVC` to reproduce the pipeline and update the large artifacts tracked by DVC.
+
+In addition, it will use `CML` to post a **report** with the `DVC` metrics, params, and plots ([cml send-comment](https://cml.dev/doc/ref/send-comment)). It will also update the artifacts tracked by Git ([cml pr](https://cml.dev/doc/ref/pr)).
+
+![Report Metrics](./imgs/report-metrics.png)
+
+![Report Plots](./imgs/report-plots.png)
+
+<details>
+<summary>Create and fill `.github/workflows/on_pr.yml`</summary>
+
+```yaml
+name: DVC & CML Workflow
+
+on:
+  pull_request:
+
+  # Allows you to run this workflow manually from the Actions tab
+  workflow_dispatch:
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    container: docker://ghcr.io/iterative/cml:latest
+
+    steps:
+      - uses: actions/checkout@v2
+        with:
+          fetch-depth: 0
+
+      - name: Setup
+        run: |
+          pip install -r requirements.txt
+
+      - name: Run DVC pipeline
+        env:
+          GITHUB_TOKEN: ${{ secrets.PERSONAL_GITHUB_TOKEN }}
+          GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
+        run: |
+          dvc repro --pull
+
+      - name: Share changes
+        env:
+          GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
+        run: |
+          dvc push
+
+      - name: Create a P.R. with CML 
+        env:
+          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        run: |
+          cml pr --auto-merge "dvc.lock" "outs/*.json" "outs/eval"  "outs/train_metrics"
+
+      - name: CML Report
+        env:
+          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        run: |
+          echo "## Metrics & Params" >> report.md
+
+          dvc exp diff main --show-md >> report.md
+          cml send-comment --pr --update report.md
+                  
+          echo "## Plots" >> report.md
+
+          echo "### Eval Loss" >> report.md
+          dvc plots diff \
+            --target outs/train_metrics/scalars/eval_loss.tsv --show-vega main > vega.json
+          vl2png vega.json -s 1.5 | cml-publish --md  >> report.md
+
+          echo "### Eval Accuracy" >> report.md
+          dvc plots diff \
+            --target outs/train_metrics/scalars/eval_accuracy.tsv --show-vega main > vega.json
+          vl2png vega.json -s 1.5 | cml-publish --md  >> report.md
+
+          echo "### Confusion Matrix" >> report.md
+          dvc plots diff \
+            --target outs/eval/plots/confusion_matrix.json --show-vega main > vega.json
+          vl2png vega.json -s 1.5 | cml-publish --md  >> report.md
+
+          cml send-comment --pr --update report.md
+```
+</details>
+
+## Reproduce Online
+
+And now you can reproduce the pipeline from the web:
+
+### From GitHub UI
+
+- Edit `params.yaml` from the GitHub Interface.
+
+- Change `train.epochs`.
+
+- Select `Create a new branch for this commit and start a pull request`
+
+### From Studio
+
+- Go to https://studio.iterative.ai (It's free)
+- Connect your GitHub account.
+- Add a new view.
+
+> More info: https://dvc.org/doc/studio
+
+- Click on the `Run new experiment` button.
+
+## More compute
+
+In the above workflow we are using the default GitHub runners to train our model.
+
+While this is enough for our use case (small dataset, small model), real projects often require more compute resources.
+
+[CML Self-Hosted Runners](https://cml.dev/doc/self-hosted-runners) allow you to allocate cloud instances (or on-premise machines) and use them in your GitHub Actions workflow.
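
Whichever runner is used, the pull request workflow in `docs/3-online-reproducibility.md` only works if the secrets it reads are present. The following is a minimal sketch of a pre-flight check, assuming the two environment variables set in the `Run DVC pipeline` step. `check_secrets` and `REQUIRED` are hypothetical names, not part of DVC or CML, and the JSON check assumes the secret holds the contents of `gdrive-user-credentials.json`.

```python
import json
import os

# Environment variables read by the "Run DVC pipeline" step.
# REQUIRED is a hypothetical name used only in this sketch.
REQUIRED = ("GITHUB_TOKEN", "GDRIVE_CREDENTIALS_DATA")


def check_secrets(env):
    """Return a list of problems found in env; an empty list means OK."""
    problems = [f"{name} is not set" for name in REQUIRED if not env.get(name)]
    creds = env.get("GDRIVE_CREDENTIALS_DATA")
    if creds:
        try:
            # The secret is expected to be the JSON credentials file content.
            json.loads(creds)
        except json.JSONDecodeError:
            problems.append("GDRIVE_CREDENTIALS_DATA is not valid JSON")
    return problems


if __name__ == "__main__":
    for problem in check_secrets(os.environ):
        print(problem)
```

Running a check like this as an early workflow step would fail fast with a readable message instead of a late authentication error from `dvc repro --pull`.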