Commit b959893

Split dockerfile into multi-stage build to separate pre-req builds, runner, and full layers
1 parent 3e52b94 commit b959893

10 files changed, +323 -54 lines changed

.dockerignore

Lines changed: 1 addition & 0 deletions
@@ -7,3 +7,4 @@ lightning_logs
 wandb
 **/test_data/**/**/*.tif
 **/project_data
+one_off_projects

.github/workflows/build_test.yaml

Lines changed: 90 additions & 27 deletions
@@ -55,45 +55,108 @@ jobs:
           sudo rm -rf /opt/ghc
           sudo rm -rf /usr/local/share/boost
 
-      # Add deploy key and clone olmoearth_pretrain repository.
-      - name: Start ssh-agent and add olmoearth_pretrain deploy key
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v3
+
+      # ============================================================================
+      # SSH Multi-Key Setup for Private Repository Access
+      # ============================================================================
+      # The Docker build needs to clone from multiple private GitHub repositories:
+      # - allenai/rslearn
+      # - allenai/olmoearth_pretrain
+      # - allenai/olmoearth_run
+      #
+      # PROBLEM: GitHub deploy keys are scoped to a single repository. We need
+      # separate deploy keys for each repo. However, when SSH connects to github.com
+      # with multiple keys loaded in the agent, it tries them in order. If the first
+      # key authenticates successfully but lacks access to a specific repository,
+      # GitHub returns "Repository not found" and SSH stops trying other keys.
+      #
+      # SOLUTION: We use SSH host aliases to route each repository to its specific key.
+      # webfactory/ssh-agent sets up the SSH agent on the runner, but Docker builds
+      # run in an isolated container that doesn't have access to the runner's SSH
+      # config or git URL rewrites. So we must recreate this setup inside the container.
+
+      - name: Setup SSH agent with deploy keys
         uses: webfactory/[email protected]
         with:
-          ssh-private-key: ${{ secrets.DEPLOY_KEY_FOR_HELIOS_CLONE }}
-      - name: Clone olmoearth_pretrain and update requirements-extra.txt.
+          ssh-private-key: |
+            ${{ secrets.DEPLOY_KEY_FOR_OLMOEARTH_PRETRAIN_CLONE }}
+            ${{ secrets.DEPLOY_KEY_FOR_OLMOEARTH_RUN_CLONE }}
+
+      - name: Debug SSH keys
         run: |
-          mkdir docker_build
-          git clone git@github.com:allenai/olmoearth_pretrain.git docker_build/olmoearth_pretrain
-          git -C docker_build/olmoearth_pretrain checkout 0cd82b4784bc2f246a23f6da98a9bab27761c1ba
-          echo "olmoearth_pretrain @ /opt/rslearn_projects/docker_build/olmoearth_pretrain/" >> requirements-extra.txt
+          echo "=== SSH Agent Keys ==="
+          ssh-add -L
+          echo ""
+          echo "=== SSH Config (if any) ==="
+          cat ~/.ssh/config || echo "No SSH config found"
+          echo ""
+          echo "=== Git Config URL Rewrites ==="
+          git config --global --get-regexp 'url\..*\.insteadof' || echo "No git URL rewrites found"
 
-      # Same thing for olmoearth_run repository.
-      - name: Start ssh-agent and add olmoearth_run deploy key
-        uses: webfactory/[email protected]
-        with:
-          ssh-private-key: ${{ secrets.DEPLOY_KEY_FOR_OLMOEARTH_RUN_CLONE }}
-      - name: Clone olmoearth_run and update requirements-extra.txt.
+      # Prepare all SSH configuration files for Docker build
+      # The SSH agent runs on the runner but we need actual key files and config
+      # inside the Docker container. We pre-build everything here in the workflow
+      # to keep the Dockerfile simple and free of CI-specific logic.
+      - name: Prepare SSH keys for Docker build
         run: |
-          git clone git@github.com:allenai/olmoearth_run.git docker_build/olmoearth_run
-          echo "olmoearth_run @ /opt/rslearn_projects/docker_build/olmoearth_run/" >> requirements-extra.txt
+          mkdir -p .docker-ssh
+
+          # Write SSH private keys
+          echo "${{ secrets.DEPLOY_KEY_FOR_OLMOEARTH_PRETRAIN_CLONE }}" > .docker-ssh/olmoearth_pretrain_key
+          echo "${{ secrets.DEPLOY_KEY_FOR_OLMOEARTH_RUN_CLONE }}" > .docker-ssh/olmoearth_run_key
+          chmod 600 .docker-ssh/*_key
 
-      - name: Build and push Docker image
+          # Create SSH config with host aliases
+          cat > .docker-ssh/config << 'EOF'
+          Host github-olmoearth-pretrain
+            HostName github.com
+            IdentityFile /root/.ssh/olmoearth_pretrain_key
+            IdentitiesOnly yes
+
+          Host github-olmoearth-run
+            HostName github.com
+            IdentityFile /root/.ssh/olmoearth_run_key
+            IdentitiesOnly yes
+          EOF
+
+          # Create modified requirements files with host aliases
+          sed 's|git@github\.com/allenai/olmoearth_pretrain|git@github-olmoearth-pretrain/allenai/olmoearth_pretrain|g' \
+            requirements-olmoearth_pretrain.txt > .docker-ssh/requirements-olmoearth_pretrain.txt
+          sed 's|git@github\.com/allenai/olmoearth_run|git@github-olmoearth-run/allenai/olmoearth_run|g' \
+            requirements-olmoearth_run.txt > .docker-ssh/requirements-olmoearth_run.txt
+
+      ## REMOVE BEFORE MERGE! ####
+      # - name: Clone olmoearth repositories and update requirements-extra.txt
+      # run: |
+      # mkdir docker_build
+      # git clone git@github.com:allenai/olmoearth_pretrain.git docker_build/olmoearth_pretrain
+      # git -C docker_build/olmoearth_pretrain checkout 0cd82b4784bc2f246a23f6da98a9bab27761c1ba
+      # echo "olmoearth_pretrain @ /opt/rslearn_projects/docker_build/olmoearth_pretrain/" >> requirements-extra.txt
+      # git clone git@github.com:allenai/olmoearth_run.git docker_build/olmoearth_run
+      # echo "olmoearth_run @ /opt/rslearn_projects/docker_build/olmoearth_run/" >> requirements-extra.txt
+      ############################
+
+      - name: Build and push Docker images with Bake
         id: build-push
-        uses: docker/build-push-action@v6
-        with:
-          context: .
-          push: true
-          tags: ${{ steps.meta.outputs.tags }}
-          labels: ${{ steps.meta.outputs.labels }}
-          build-args: |
-            GIT_USERNAME=${{ secrets.GIT_USERNAME }}
-            GIT_TOKEN=${{ secrets.GIT_TOKEN }}
+        run: |
+          docker buildx bake \
+            --allow=ssh \
+            --file ./docker-bake.hcl \
+            --file ${{ steps.meta.outputs.bake-file }} \
+            --set "*.cache-from=type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:buildcache" \
+            --set "*.cache-to=type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:buildcache,mode=max" \
+            --metadata-file /tmp/bake-metadata.json \
+            --push
 
       - name: Store Image Names
         # We need the docker image name downstream in test & deploy. This saves the full docker image names to outputs
         id: image-names
         run: |-
-          GHCR_IMAGE="${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}@${{ steps.build-push.outputs.digest }}"
+          # Extract the digest for the 'full' target from bake metadata
+          FULL_DIGEST=$(cat /tmp/bake-metadata.json | jq -r '.full."containerimage.digest"')
+          GHCR_IMAGE="${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}@${FULL_DIGEST}"
           GHCR_IMAGE=`echo ${GHCR_IMAGE} | tr '[:upper:]' '[:lower:]'` # docker requires that all image names be lowercase
           echo "ghcr.io Docker image name is ${GHCR_IMAGE}"
           echo "ghcr_image_name=\"${GHCR_IMAGE}\"" >> $GITHUB_OUTPUT
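
The host-alias rewrite performed by the "Prepare SSH keys for Docker build" step can be tried in isolation. The following is a minimal sketch, not the project's actual data: the two requirement lines are hypothetical (the contents of `requirements-olmoearth_pretrain.txt` are not part of this commit) and assume the `git+ssh://git@github.com/...` URL form that the `sed` pattern targets.

```shell
#!/bin/sh
set -eu

# Hypothetical requirement lines; the real requirements-*.txt contents are not shown in this diff.
sample='olmoearth_pretrain @ git+ssh://git@github.com/allenai/olmoearth_pretrain.git
olmoearth_run @ git+ssh://git@github.com/allenai/olmoearth_run.git'

# Same substitution as the workflow step: only the olmoearth_pretrain URL is routed
# through its SSH host alias; the olmoearth_run URL is left for the second sed call.
rewritten=$(printf '%s\n' "$sample" |
  sed 's|git@github\.com/allenai/olmoearth_pretrain|git@github-olmoearth-pretrain/allenai/olmoearth_pretrain|g')

printf '%s\n' "$rewritten"
```

Because each `sed` expression names one repository path, a line pointing at a different repository passes through unchanged, which is what lets the workflow keep one rewritten file per deploy key.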

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@ olmoearth_run_data/nandi/finetune/annotation*.geojson
 lightning_logs/
 /tmp*/
 docker_build
+.docker-ssh/
 
 # misc from one off projects
 one_off_projects/2025_07_joint_finetune/data

Dockerfile

Lines changed: 115 additions & 25 deletions
@@ -1,37 +1,127 @@
-FROM pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime@sha256:7db0e1bf4b1ac274ea09cf6358ab516f8a5c7d3d0e02311bed445f7e236a5d80
+###
+### docker build --ssh default=/path/to/private.key -f Dockerfile .
 
-RUN apt update
-RUN apt install -y libpq-dev ffmpeg libsm6 libxext6 git wget
+ARG BASE=ubuntu:22.04
+ARG BASE_PYTORCH=pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime@sha256:7db0e1bf4b1ac274ea09cf6358ab516f8a5c7d3d0e02311bed445f7e236a5d80
+ARG PLATFORM=linux/amd64
 
-# Install Go (used for Satlas smooth_point_labels_viterbi.go).
-RUN wget https://go.dev/dl/go1.22.12.linux-amd64.tar.gz
-RUN rm -rf /usr/local/go && tar -C /usr/local -xzf go1.22.12.linux-amd64.tar.gz
-ENV PATH="${PATH}:/usr/local/go/bin"
+FROM --platform=${PLATFORM} ${BASE} AS tippecanoe
 
-# Install tippecanoe (used by forest loss driver).
-RUN apt install -y build-essential libsqlite3-dev zlib1g-dev
-RUN git clone https://github.com/mapbox/tippecanoe /opt/tippecanoe
-WORKDIR /opt/tippecanoe
+RUN apt-get update && apt-get install -y --no-install-recommends build-essential ca-certificates curl libsqlite3-dev zlib1g-dev
+
+RUN mkdir -p /tmp/tippecanoe && curl -L https://github.com/mapbox/tippecanoe/archive/refs/tags/1.36.0.tar.gz | tar -xz --strip 1 -C /tmp/tippecanoe
+WORKDIR /tmp/tippecanoe
 RUN make -j
-RUN make install
+RUN PREFIX=/opt/tippecanoe make install
+
+# To use this:
+# COPY --from=tippecanoe /opt/tippecanoe /opt/tippecanoe
+# ENV PATH="/opt/tippecanoe/bin:${PATH}"
+
+FROM --platform=${PLATFORM} ${BASE} AS golang
+
+## Build Satlas smooth_point_labels_viterbi.go (Requires golang)
+RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates curl
+RUN curl -L https://go.dev/dl/go1.22.12.linux-amd64.tar.gz | tar -xz -C /usr/local
+ENV PATH="/usr/local/go/bin:${PATH}"
+
+COPY rslp/satlas/scripts /tmp/smooth_point_labels_viterbi/
+WORKDIR /tmp/smooth_point_labels_viterbi/
+RUN go build smooth_point_labels_viterbi.go
+
+# To use this:
+# COPY --from=golang /tmp/smooth_point_labels_viterbi/smooth_point_labels_viterbi /usr/local/bin/smooth_point_labels_viterbi
+
+FROM --platform=${PLATFORM} pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime@sha256:7db0e1bf4b1ac274ea09cf6358ab516f8a5c7d3d0e02311bed445f7e236a5d80 AS base
+
+RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates git openssh-client
+
+# Pin GitHub's host keys so SSH won't prompt
+RUN mkdir -p -m 700 /root/.ssh && \
+    ssh-keyscan -t rsa,ecdsa,ed25519 github.com >> /root/.ssh/known_hosts
+
+COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
+
+COPY . /opt/rslearn_projects/
+
+# ============================================================================
+# SSH Multi-Key Configuration for GitHub Actions
+# ============================================================================
+# This section handles SSH authentication for multiple private GitHub repositories
+# when building in CI/CD environments that use separate deploy keys per repository.
+#
+# CONTEXT:
+# - GitHub deploy keys are scoped to a single repository for security
+# - We need to clone from multiple repos: rslearn, olmoearth_pretrain, olmoearth_run
+# - Each repository has its own deploy key stored as a GitHub Actions secret
+#
+# THE PROBLEM:
+# When SSH tries to connect to github.com with multiple keys in the agent:
+# 1. SSH tries the first key
+# 2. If it authenticates successfully, SSH stops trying other keys
+# 3. If that key lacks access to a specific repo, GitHub returns "Repository not found"
+# 4. The connection fails even though other keys in the agent might have access
+#
+# This happens because SSH authenticates at the HOST level (github.com), not the
+# repository level. Once any key authenticates with github.com, SSH considers the
+# connection successful and doesn't try other keys.
+#
+# THE SOLUTION:
+# We use SSH host aliases to make SSH think each repository is on a different host:
+# - git@github-olmoearth-pretrain:allenai/olmoearth_pretrain.git
+# - git@github-olmoearth-run:allenai/olmoearth_run.git
+#
+# Each alias points to github.com but specifies which SSH key to use via IdentityFile.
+# With IdentitiesOnly=yes, SSH only tries the specified key for that alias.
+#
+# IMPLEMENTATION:
+# The GitHub Actions workflow writes deploy keys to .docker-ssh/ and sets
+# USE_SSH_KEYS_FROM_BUILD=true. This code then:
+# 1. Copies the key files to ~/.ssh/
+# 2. Creates SSH config with host aliases mapped to specific keys
+# 3. Rewrites requirements*.txt files to use the host aliases
+#
+# LOCAL DEVELOPMENT:
+# Local developers typically have a single SSH key with access to all repositories,
+# so this complexity is unnecessary. When USE_SSH_KEYS_FROM_BUILD=false (default),
+# this entire setup is skipped and standard SSH agent forwarding is used.
+
+ARG USE_SSH_KEYS_FROM_BUILD=false
+RUN if [ "$USE_SSH_KEYS_FROM_BUILD" = "true" ] && [ -d /opt/rslearn_projects/.docker-ssh ]; then \
+        echo "Setting up SSH keys from build context..." && \
+        cp /opt/rslearn_projects/.docker-ssh/*_key /root/.ssh/ && \
+        cp /opt/rslearn_projects/.docker-ssh/config /root/.ssh/config && \
+        chmod 600 /root/.ssh/*_key /root/.ssh/config && \
+        cp /opt/rslearn_projects/.docker-ssh/requirements-olmoearth_pretrain.txt /opt/rslearn_projects/requirements-olmoearth_pretrain.txt && \
+        cp /opt/rslearn_projects/.docker-ssh/requirements-olmoearth_run.txt /opt/rslearn_projects/requirements-olmoearth_run.txt && \
+        echo "SSH multi-key setup complete."; \
+    else \
+        echo "Using default SSH configuration (single key or SSH agent)."; \
+    fi
+
+RUN --mount=type=ssh \
+    --mount=type=cache,target=/root/.cache/uv \
+    uv pip install --system /opt/rslearn_projects[olmoearth_run,olmoearth_pretrain]
 
+FROM base AS full
+
+COPY --from=tippecanoe /opt/tippecanoe /opt/tippecanoe
+COPY --from=golang /tmp/smooth_point_labels_viterbi/smooth_point_labels_viterbi /usr/local/bin/smooth_point_labels_viterbi
 COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
 
-# Install rslearn.
-# We use git clone and then git checkout instead of git clone -b so that the user could
-# specify a commit name or branch instead of only accepting a branch.
+ENV PATH="/opt/tippecanoe/bin:${PATH}"
+
+## Install rslearn.
+## We use git clone and then git checkout instead of git clone -b so that the user could
+## specify a commit name or branch instead of only accepting a branch.
 ARG RSLEARN_BRANCH=master
-RUN git clone https://github.com/allenai/rslearn.git /opt/rslearn
+RUN --mount=type=ssh git clone git@github.com:allenai/rslearn.git /opt/rslearn
 WORKDIR /opt/rslearn
 RUN git checkout $RSLEARN_BRANCH
-RUN uv pip install --system /opt/rslearn[extra]
+RUN --mount=type=cache,target=/root/.cache/uv \
+    uv pip install --system /opt/rslearn[extra]
 
-# Install rslearn_projects.
 COPY . /opt/rslearn_projects/
-RUN uv pip install --system /opt/rslearn_projects[dev,extra]
-
-# Build Satlas smooth_point_labels_viterbi.go program.
-WORKDIR /opt/rslearn_projects/rslp/satlas/scripts
-RUN go build smooth_point_labels_viterbi.go
-
-WORKDIR /opt/rslearn_projects
+RUN --mount=type=ssh \
+    --mount=type=cache,target=/root/.cache/uv \
+    uv pip install --system /opt/rslearn_projects[dev,extra]
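
The alias-to-key mapping described in the Dockerfile's "THE SOLUTION" comment can be sanity-checked without Docker. This is a sketch under stated assumptions: the scratch directory stands in for `.docker-ssh/`, and the `awk` check is an illustrative helper, not something this commit contains. It reports each `Host` alias that pins a key with `IdentitiesOnly yes`.

```shell
#!/bin/sh
set -eu

# Scratch directory standing in for .docker-ssh/ (hypothetical location for this sketch).
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Same host-alias layout the workflow writes into .docker-ssh/config.
cat > "$workdir/config" << 'EOF'
Host github-olmoearth-pretrain
  HostName github.com
  IdentityFile /root/.ssh/olmoearth_pretrain_key
  IdentitiesOnly yes

Host github-olmoearth-run
  HostName github.com
  IdentityFile /root/.ssh/olmoearth_run_key
  IdentitiesOnly yes
EOF

# Print "alias key" for every Host block that pins a key with IdentitiesOnly yes.
mapping=$(awk '
  $1 == "Host"           { alias = $2; key = "" }
  $1 == "IdentityFile"   { key = $2 }
  $1 == "IdentitiesOnly" && $2 == "yes" && key != "" { print alias, key }
' "$workdir/config")

printf '%s\n' "$mapping"
```

Where an OpenSSH client is available, `ssh -F "$workdir/config" -G github-olmoearth-run` should show the options SSH would actually resolve for that alias, which is a closer check than parsing the file by hand.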

README.md

Lines changed: 37 additions & 0 deletions
@@ -33,6 +33,43 @@ environment variable:
     echo "RSLP_PREFIX=project_data/" > .env
 
 
+Docker Build
+------------
+
+The project includes a multi-stage Docker build for creating containerized environments.
+
+### Prerequisites
+
+- Docker with BuildKit support
+- SSH access to the following private repositories:
+  - `github.com:allenai/rslearn`
+  - `github.com:allenai/olmoearth_pretrain`
+  - `github.com:allenai/olmoearth_run`
+- Your SSH key should be loaded in your SSH agent
+
+### Building Images
+
+To build all targets:
+
+    docker buildx bake
+
+To build a specific target:
+
+    docker buildx bake base  # Base image with core dependencies
+    docker buildx bake full  # Full image with all tools and rslearn
+
+### Build Targets
+
+- **base**: Core PyTorch environment with rslearn_projects and olmoearth dependencies
+- **full**: Includes rslearn, tippecanoe, and additional development tools
+
+### Notes for Local Development
+
+Local developers with a single SSH key that has access to all repositories can build directly
+without additional configuration. The multi-key SSH setup is only required in CI/CD environments
+where separate deploy keys are used for each repository.
+
+
 Applications
 ------------
 

SECRETS.md

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+This is a listing of all GitHub Actions secrets used in this repository and what they are for.
+They are mostly managed by @favyen2 and @joshhvulcan.
+
+| Secret Name                                  | Description                                                       |
+|----------------------------------------------|-------------------------------------------------------------------|
+| AWS_ACCESS_KEY_ID                            | AWS Access Key for accessing requestor pays buckets.              |
+| AWS_SECRET_ACCESS_KEY                        | AWS Secret Access Key for accessing requestor pays buckets.       |
+| BEAKER_ADDR                                  | Address of the beaker API                                         |
+| BEAKER_BUDGET                                | Which beaker budget to use when submitting jobs.                  |
+| BEAKER_CLUSTER_INFERENCE                     | Which beaker cluster to use.                                      |
+| BEAKER_TOKEN                                 | Beaker token                                                      |
+| BEAKER_TOKEN_2                               | Beaker token                                                      |
+| BEAKER_USERNAME                              | Beaker username. (Favyen?)                                        |
+| BEAKER_WORKSPACE                             | Which beaker workspace to use.                                    |
+| DEPLOY_KEY_FOR_ESRUN_CLONE                   | Deploy key to use when cloning earthsystem_run repository.        |
+| DEPLOY_KEY_FOR_HELIOS_CLONE                  | Deploy key to use when cloning helios repo.                       |
+| DEPLOY_KEY_FOR_OLMOEARTH_PRETRAIN_CLONE      | Deploy key to use when cloning olmoearth_pretrain repo.           |
+| DEPLOY_KEY_FOR_OLMOEARTH_RUN_CLONE           | Deploy key to use when cloning olmoearth_run repo.                |
+| DOCKER_BUILD_PAT_JOSH                        | Personal access token from Josh for docker builds (still needed?) |
+| FOREST_LOSS_DRIVER_INFERENCE_SERVICE_ACCOUNT | GCP Service Account to use for Forest Loss Driver                 |
+| GCP_PROJECT_ID                               | GCP Project ID for forest loss                                    |
+| GCP_USER                                     |                                                                   |
+| GCP_VM_DEPLOYER_CREDENTIALS                  | GCP Service account for deploying VMs (?)                         |
+| GHCR_PAT_PULL_DOCKER_IMAGE                   | GHCR personal access token for (reading?/writing?) images         |
+| GIT_TOKEN                                    | Is this still used?                                               |
+| GIT_USERNAME                                 | Is this still used?                                               |
+| GOOGLE_CREDENTIALS                           | Is this still used?                                               |
+| PLANET_API_KEY_NICFI                         | Planet API key                                                    |
+| RSLP_PREFIX                                  | RSLP Prefix (?)                                                   |
