S3 Migration from Linode to AWS

# S3 Migration from Linode to AWS

https://redhat.atlassian.net/browse/COCKPIT-1833

## Rationale

Moving any serious amount of test workload to ARR or TF requires all qcow downloads to be fast and reliable, i.e. in close proximity to the EC2 instances. As we can't have any truly local cache with either solution, let's move our primary image store and logs from Linode to ARR. That will also "deal with" all the excess traffic due to scrapers, who by now have learned to get past our Anubis proxy.

## Current image/log stores

The Cockpit CI infrastructure currently uses three S3 buckets on Linode:
- **cockpit-images.us-east-1** - Public qcow2 test images (US region)
- **cockpit-images.eu-central-1** - Private/internal RHEL qcow2 images (EU region)
- **cockpit-logs.us-east-1** - CI test logs, artifacts, and Prometheus metrics

These buckets are being migrated to AWS S3 as a prerequisite for moving the CI infrastructure itself from the company-internal PSI OpenStack cluster to AWS EC2. The S3 migration happens first; the compute migration follows.

### Anubis Proxy Decision

**Initial approach: Public-read buckets without Anubis**

For the initial migration, we'll make the AWS S3 buckets public-read (matching the original Linode setup before scrapers became a problem). This keeps the migration simple and maintains existing access patterns.

The current Anubis deployment has become ineffective anyway, so it needs revisiting/tuning regardless of the migration. We can re-add/fix the Anubis proxy later if scraper problems actually emerge.

**AWS traffic cost model (future state after CI moves to AWS EC2):**
- ✅ **FREE**: EC2↔S3 in same region (infrastructure traffic - ~90% of volume)
- ✅ **FREE**: First 100GB/month internet egress (developers)
- 💰 **$0.09/GB**: Developer downloads beyond 100GB/month (negligible volume)

Sources:
- [AWS S3 Pricing 2026: All Tiers, Egress & Hidden Costs](https://leanopstech.com/blog/aws-s3-pricing-2026/)
- [AWS Data Transfer Pricing 2026: All Egress Fees](https://leanopstech.com/blog/aws-data-transfer-pricing-2026/)
- [Amazon S3 Pricing Explained](https://go-cloud.io/amazon-s3-pricing/)

**Current state**: CI runs on PSI OpenStack cluster, so all S3 traffic is internet egress until CI migration completes.
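
To make the cost model above concrete, here is a minimal sketch of the egress arithmetic. It only uses the two figures from the list (100 GB free, $0.09/GB beyond that); the function name and the example traffic volumes are made up for illustration, and storage and request charges are deliberately left out.

```python
FREE_EGRESS_GB = 100      # free internet egress per month (from the list above)
EGRESS_RATE_USD = 0.09    # per GB beyond the free tier (from the list above)


def monthly_egress_cost(internet_gb: float, same_region_gb: float = 0.0) -> float:
    """Estimate the monthly S3 data-transfer bill in USD.

    same_region_gb (EC2<->S3 in the same region) is accepted only to make
    explicit that it does not contribute to the bill.
    """
    billable_gb = max(0.0, internet_gb - FREE_EGRESS_GB)
    return round(billable_gb * EGRESS_RATE_USD, 2)


if __name__ == "__main__":
    # Hypothetical month: 9 TB of CI traffic inside AWS, 150 GB of developer downloads.
    print(monthly_egress_cost(internet_gb=150, same_region_gb=9000))  # 4.5
```

Until the CI moves to EC2, all CI traffic counts as `internet_gb`, which is why the current state is the expensive one.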

---

## Migration Plan

### 🔨 Phase 1: Preparation & Discovery

**Timeline**: Day 1-2 of migration week

**Tasks:**
1. **Review current Linode lifecycle policies**
   - [x] Check existing log retention settings via `s3-lifecycle` script to replicate on AWS:
     - Logs bucket: 90-day expiration policy
     - Image buckets: No lifecycle policy (obsolete images are pruned explicitly by tooling)

2. **Design AWS bucket structure**
   - [x] Region: **us-east-1** (target region for future EC2 infrastructure)
   - [x] Bucket names (must be globally unique on AWS):
     - Images: `cockpit-ci-images` (single bucket for all images)
     - Logs: `cockpit-ci-logs`
   - [x] Storage class: **S3 Standard for both**. Intelligent-Tiering not worth the monitoring fees given many small files per test run and 90-day lifecycle.

3. **Plan access control strategy**
   - [x] **Images bucket**: Per-file ACL approach (like before bots commit 04fbfc3483)
     - Non-RHEL images: Upload with `public-read` ACL
     - RHEL images (filename starts with `rhel`): Upload with `authenticated-read` or private ACL
     - Bucket itself can be private, access controlled at object level
   - [x] **Logs bucket**: public-read for GET, authenticated PUT/DELETE
   - [x] Plan IAM roles/credentials for CI infrastructure:
     - Both buckets: `s3:PutObject`, `s3:DeleteObject`, `s3:GetObject`, `s3:ListBucket`
     - Both buckets: `s3:PutObjectAcl` (to set per-file ACLs during upload)

4. **Set up/verify team access**
   - [ ] @thrix: create web UI accounts for Jelle and Lis, or better yet, [front-door-ci rover group](https://rover.redhat.com/groups/group/front-door-ci)
   - [x] @allisonkarlitskaya confirms that she can log into ARR EC2 as poweruser
   - [ ] @jelly confirms that he can log into ARR EC2 as poweruser
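
The IAM permissions planned in task 3 could be expressed as a policy document along these lines. This is a sketch, not the policy actually attached to the account: the statement IDs and the function name are invented here, and the bucket names are the ones from task 2. The one non-obvious detail is that `s3:ListBucket` applies to the bucket ARN, while the object actions apply to `<bucket>/*`.

```python
import json

BUCKETS = ["cockpit-ci-images", "cockpit-ci-logs"]  # names from task 2


def runtime_policy(buckets):
    """Build an IAM policy document granting the CI runtime permissions."""
    bucket_arns = [f"arn:aws:s3:::{name}" for name in buckets]
    object_arns = [f"{arn}/*" for arn in bucket_arns]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object-level actions, including per-file ACLs on upload
                "Sid": "ObjectAccess",
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject",
                    "s3:DeleteObject",
                    "s3:GetObject",
                    "s3:PutObjectAcl",
                ],
                "Resource": object_arns,
            },
            {
                # Listing is a bucket-level action
                "Sid": "ListBuckets",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": bucket_arns,
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(runtime_policy(BUCKETS), indent=2))
```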

### ✅ Phase 2: AWS Infrastructure Setup

**Timeline**: Day 2-3 of migration week

**Tasks:**
1. **Create S3 buckets**:
   - [x] Create/update playbook (`cockpituous/ansible/aws/setup-s3-buckets.yml`) to create buckets in us-east-1
     - Images bucket (single bucket, per-file ACLs)
     - Logs bucket (bucket policy for public read)
     - Ensure they all have correct Service*/AppCode tags
   - [x] AWS URL format: `https://<bucket>.s3.us-east-1.amazonaws.com/`
   - [x] Add bucket names to `ansible/aws/aws_defaults.yml`
   - [x] Document S3 buckets in `ansible/aws/README.md`

2. **Configure bucket policies and CORS via Ansible**
   - [x] Images bucket: Allow per-object ACLs (`ObjectWriter` ownership)
   - [x] Images bucket: Block Public Access settings (allow per-file ACLs, block bucket policies)
   - [x] Logs bucket: Bucket policy for public-read GET, authenticated PUT/DELETE
   - [x] Logs bucket: `BucketOwnerEnforced` ownership (ACLs disabled, simpler)
   - [x] Set up lifecycle policies (90-day expiration on logs bucket)
   - Note: CORS not needed - tested without it

3. **Create IAM access keys and deploy credentials**
   - [x] @thrix: create [arr-cockpit-bootstrap](https://us-east-1.console.aws.amazon.com/iam/home#/users/details/arr-cockpit-bootstrap?section=permissions) IAM user with S3 admin privileges for `cockpit*` resources (for deployment)
   - [x] @thrix: modify the [arr-cockpit](https://us-east-1.console.aws.amazon.com/iam/home#/users/details/arr-cockpit?section=permissions) account to allow S3 `cockpit*` bucket operations (PutObject, DeleteObject, GetObject, PutObjectAcl) and EC2 operations (for runtime)
   - [x] Store the `arr-cockpit` S3 access key in ci-secrets.git `s3-keys/`
   - [x] Store the `arr-cockpit-bootstrap` S3 access key in bitwarden
   - [x] Deploy the new secret to GitHub/OpenShift/PSI (existing docs/scripts)

4. **Run Ansible playbooks to provision infrastructure**
   - [x] Execute bucket creation playbooks (`ansible-playbook aws/setup-s3-buckets.yml`)
   - [x] Verify buckets created successfully (via `aws s3api` commands)
   - [x] Test bucket policies with sample uploads/downloads
   - [x] Verified per-file ACLs on images bucket (public-read and private both work)
   - [x] Verified bucket policy on logs bucket (public read works without ACLs)
   - [x] Verified 90-day lifecycle policy on logs bucket
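
For reference, the logs bucket settings from task 2 (public-read GET via bucket policy, 90-day expiration) correspond roughly to the two JSON documents built below. This is a sketch of the S3 policy and lifecycle document shapes, not a copy of what `setup-s3-buckets.yml` contains; the statement ID, rule ID, and function names are assumptions. Writes need no statement here because they are granted to the IAM user, not to the public.

```python
import json

LOGS_BUCKET = "cockpit-ci-logs"


def public_read_policy(bucket):
    """Bucket policy that lets anyone GET objects; nothing else is public."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PublicReadGetObject",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


def expiration_lifecycle(days):
    """Lifecycle configuration that expires every object after `days` days."""
    return {
        "Rules": [
            {
                "ID": f"expire-after-{days}-days",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: applies to all objects
                "Expiration": {"Days": days},
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(public_read_policy(LOGS_BUCKET), indent=2))
    print(json.dumps(expiration_lifecycle(90), indent=2))
```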

### ✅ Phase 3: Logs Migration & Testing

**Timeline**: Day 3 of migration week

Logs are simpler - no per-file ACLs, no developer IAM credentials needed, switch directly to AWS.

First, set up local job-runner S3 sink testing against the current Linode S3. Create `/tmp/job-runner.toml` with:

```toml
[logs]
driver = 's3'

[logs.s3]
# bots lib/stores.py LOG_STORE
url = 'https://cockpit-logs.us-east-1.linodeobjects.com/'
key = [{file="/run/user/1000/ci-secrets/s3-keys/cockpit-logs.us-east-1.linodeobjects.com"}]
proxy_url = 'https://logs-cockpit.apps.ocp.cloud.ci.centos.org/'
acl = 'authenticated-read'
```

and then run a test:
```sh
./job-runner --debug --config-file /tmp/job-runner.toml run cockpit-project/bots --sha ae69819bc6bec3d32c038e40179f3953d6010c2f --context test-s3 --title "Local job-runner test with S3 logging"
```

→ [job log](https://logs-cockpit.apps.ocp.cloud.ci.centos.org/cockpit-project/bots/test-s3/ae69819bc6be/log.html)

**Code changes:**

**bots repository:** (https://github.com/cockpit-project/bots/pull/9013)
- [x] `bots/.github/workflows/issue-comment.yml` - Move logs bucket URL to AWS
- [x] `bots/lib/stores.py` - Switch LOG_STORE to AWS
- [x] Drop `bots/s3-lifecycle`, done with Ansible now

**Testing:**
- [x] Adjust /tmp/job-runner.toml to the below contents, run the same bots test with the new AWS S3

```toml
[logs]
driver = 's3'

[logs.s3]
# bots lib/stores.py LOG_STORE
url = 'https://cockpit-ci-logs.s3.us-east-1.amazonaws.com/'
key = [{file="/run/user/1000/ci-secrets/s3-keys/s3.us-east-1.amazonaws.com"}]
acl = ''  # BucketOwnerEnforced - bucket policy handles public access, ACLs disabled
```

→ [job log](https://cockpit-ci-logs.s3.us-east-1.amazonaws.com/cockpit-project/bots/test-s3/ae69819bc6be/log.html)

- [x] Test `prometheus-stats` logs upload to AWS

```sh
./prometheus-stats --db /tmp/test-results.db --s3 https://cockpit-ci-logs.s3.us-east-1.amazonaws.com/prometheus --verbose
```

→ [successful](https://cockpit-ci-logs.s3.us-east-1.amazonaws.com/prometheus)
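
When checking test runs like the ones above, it helps to predict where the log should land. The layout below (`<repo>/<context>/<first 12 characters of the SHA>/log.html`) is inferred purely from the two job log links in this section, not from the job-runner source, so treat it as an assumption; the function name is made up.

```python
from urllib.parse import urljoin


def expected_log_url(base_url, repo, context, sha):
    """Guess the log URL for a job-runner test run.

    base_url is the store URL, or the proxy_url when one is configured.
    """
    prefix = f"{repo}/{context}/{sha[:12]}/"
    return urljoin(base_url, prefix + "log.html")


if __name__ == "__main__":
    print(expected_log_url(
        "https://cockpit-ci-logs.s3.us-east-1.amazonaws.com/",
        "cockpit-project/bots",
        "test-s3",
        "ae69819bc6bec3d32c038e40179f3953d6010c2f",
    ))
```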

**cockpituous repository:** (https://github.com/cockpit-project/cockpituous/pull/698)
- [x] `cockpituous/ansible/roles/tasks-systemd/tasks/main.yml` - Configure log uploads to AWS
- [x] `cockpituous/metrics/README.md` - Document AWS S3 setup
- [x] `cockpituous/metrics/metrics.yaml` - Update Prometheus scrape target to AWS bucket

**ci-secrets**:
- [x] add `AWS_KEY_LOGS` to point to AWS store, and deploy secret to bots image-build env


### Phase 4: Images Migration & Testing

**Timeline**: Day 4 of migration week

**Rationale**: Images are more complex - per-file ACLs, dual-write, need to revert scraper workaround.

**Code changes:**

**bots repository:**
- [ ] `bots/lib/stores.py` - Add AWS images bucket URL to PUBLIC_STORES (keep Linode URLs)
- [ ] `bots/image-upload` - Revert commit 04fbfc3483 to restore per-file ACL logic, then implement dual-write to both Linode and AWS buckets
- [ ] `bots/image-download` - Add AWS bucket URL support (fallback from AWS to Linode)
- [ ] `bots/README.md` - Document both Linode and AWS S3 buckets during transition
- [ ] `bots/job-runner.toml` - Add AWS S3 URL examples
- [ ] `bots/s3-lifecycle` - Add AWS URL examples in comments

**Implementation approach:**
- Images uploaded to BOTH Linode and AWS (dual-write)
- Weekly image refresh will naturally populate AWS
- ACL logic: `public = not os.path.basename(source).startswith('rhel')`
- Keep embedded Linode fallback key in lib/s3.py for backward compatibility
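
A rough sketch of how the ACL rule and dual-write above could fit together. Only the `public = ...` expression comes from this plan; `dual_write`, the `upload` callable, and the choice of `private` (rather than `authenticated-read`) for RHEL images are assumptions for illustration, not the actual `image-upload` code. The uploader is passed in so the logic can be tried without S3 access.

```python
import os


def acl_for_image(source):
    """Map an image path to a canned ACL: RHEL images stay private."""
    public = not os.path.basename(source).startswith("rhel")
    return "public-read" if public else "private"


def dual_write(source, stores, upload):
    """Upload `source` to every store, returning the stores that failed.

    upload(store_url, source, acl) is a hypothetical callable that raises
    OSError on failure. One failing store does not block the others.
    """
    acl = acl_for_image(source)
    failed = []
    for store in stores:
        try:
            upload(store, source, acl)
        except OSError as exc:
            failed.append((store, exc))
    return failed
```

Collecting failures instead of aborting means an outage on one side still leaves the image available on the other; the caller decides whether a partial upload counts as an error. On Linode, public and private images live in different buckets, so the caller would pick the matching Linode store for each image.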

**RHEL image proxy setup (cockpituous repository):**
- [ ] Best option: Configure images S3 proxy to allow unrestricted access from the Red Hat VPN VPC
- [ ] If that doesn't work with the per-file ACLs: Create a third "private-images" S3 bucket with global policy, and allow RH VPN VPC access
- [ ] If that doesn't work either: Implement a proxy as AWS Lambda function
- [ ] Worst option (ongoing maintenance/cost): persistent EC2 instance in RH VPC that proxies access to private images
    - [ ] Create `ansible/aws/launch-images-proxy.yml` playbook (similar to launch-webhook.yml)
    - EC2 instance in Red Hat internal VPC (like launch-tasks.yml subnet)
    - Install nginx to proxy RHEL images from cockpit-ci-images bucket
    - Use arr-cockpit credentials to authenticate to S3
    - Only accessible from Red Hat internal network
    - [ ] Create nginx configuration role for proxying S3 with authentication
    - [ ] Deploy proxy instance to AWS
    - [ ] Update `bots/lib/stores.py` to add proxy URL for RHEL images
    - [ ] Document proxy URL in `ansible/aws/README.md`

**Testing:**
- [ ] Trigger image refresh for `cirros` image (small and fast)
- [ ] Verify upload succeeds to both Linode and AWS buckets
- [ ] Verify `cirros` has public-read ACL on AWS (non-RHEL image)
- [ ] Test download from AWS bucket using `image-download`
- [ ] Verify checksums match between Linode and AWS
- [ ] Test RHEL image upload to verify private ACL (stays private on AWS)
- [ ] Test RHEL image download through nginx proxy from Red Hat network
- [ ] Verify proxy authenticates to S3 correctly for RHEL images
- [ ] Verify direct S3 access to RHEL images fails (private ACL working)
- [ ] Test that Linode fallback key still works for old images
- [ ] Monitor upload/download speeds to AWS
- [ ] Check for any timeout or reliability issues
- [ ] Begin monitoring AWS costs
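
For the checksum comparison in the list above, a streaming SHA-256 avoids loading multi-gigabyte qcow2 files into memory. This is a generic sketch with made-up function names; it works on any two binary file-like objects, for example the same image downloaded once from each store.

```python
import hashlib


def sha256_of(stream, chunk_size=1 << 20):
    """Hash a binary file-like object in 1 MiB chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def same_content(stream_a, stream_b):
    """True if both streams hash to the same SHA-256 digest."""
    return sha256_of(stream_a) == sha256_of(stream_b)
```

Usage, with hypothetical local paths: `same_content(open("linode/cirros.qcow2", "rb"), open("aws/cirros.qcow2", "rb"))`.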

### Phase 5: Production Cutover & Dual-Write Period

**Timeline**: Day 5 of migration week (cutover), then ~2 months dual-write

**Approach**: No historical data migration - dual-write new content only

**Cutover actions (Day 5):**
1. [ ] Merge code changes from Phases 3 and 4
2. [ ] Restart/deploy services to pick up new configuration
3. [ ] Monitor closely for issues

**Dual-write strategy:**
1. [ ] **New test logs** → AWS only
   - Old Linode logs remain accessible (read-only)
   - Will be cleaned up in Phase 6 (~2 months)

2. [ ] **New/refreshed images** → BOTH Linode AND AWS
   - Weekly image refreshes populate both buckets
   - After ~1 month, all active images will be on AWS
   - Public images: dual-write to both buckets
   - RHEL images: dual-write to both private buckets

3. [ ] **Downloads during dual-write:**
   - Code tries AWS first, falls back to Linode for old images
   - Linode fallback key still embedded for old images
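
The download order described in item 3 amounts to trying an ordered list of stores. A minimal sketch, assuming a hypothetical `fetch(store_url, name)` callable that returns the image bytes and raises `FileNotFoundError` when the store does not have the image; the real `image-download` may signal missing objects differently.

```python
def download_with_fallback(name, stores, fetch):
    """Return (store_url, data) from the first store that has `name`.

    `stores` is ordered by preference: AWS first, Linode last.
    """
    for store in stores:
        try:
            return store, fetch(store, name)
        except FileNotFoundError:
            continue  # not in this store (e.g. not yet refreshed); try the next
    raise FileNotFoundError(f"{name} not found in any store")
```

Returning the store alongside the data makes it easy to log how often the Linode fallback is still hit, which is a useful signal for when Phase 6 cleanup is safe.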

**Why this works:**
- Images rebuild weekly, so all active images are on AWS within 1-2 weeks
- No Linode egress costs (no bulk migration)
- No data validation complexity
- Gradual, low-risk transition

**Dual-write period**: ~2 months to ensure safe cleanup

### Phase 6: Cutover & Cleanup (~2 months after dual-write starts)

**Tasks:**
1. **Verify AWS has all active content**
   - [ ] Ensure all current images available on AWS
   - [ ] Verify new logs going to AWS successfully
   - [ ] Check that old Linode logs are no longer referenced

2. **Remove Linode from production configs**
   - [ ] Remove Linode URLs from `bots/lib/stores.py` (PUBLIC_STORES, LOG_STORE)
   - [ ] Remove Linode bucket configurations from `bots/.github/workflows/issue-comment.yml`
   - [ ] Remove Linode configurations from `cockpituous/ansible/roles/tasks-systemd/tasks/main.yml`
   - [ ] Update `cockpituous/metrics/metrics.yaml` to only scrape from AWS
   - [ ] Remove Linode S3 credentials from GitHub Secrets, ci-secrets repo (deployed to infrastructure)
   - [ ] Update `bots/README.md` to drop the S3 token setup/talking to Lis, check cockpituous as well
   - [ ] Remove embedded Linode fallback key from `bots/lib/s3.py`
   - [ ] Remove `S3_KEY_LOGS` from ci-secrets.git and bots image-refresh env

3. **Monitor for issues**
   - [ ] Watch for failed downloads (missing images that weren't dual-written)
   - [ ] Monitor AWS costs vs. expected
   - [ ] Check for authentication failures

4. **Keep Anubis proxy available** (no changes)
   - [ ] Leave logs-proxy deployment in place but not actively used
   - [ ] Can enable/fix later if scraper issues emerge
   - [ ] Update documentation to note it's available but not in use

5. **Clean up Linode**
   - [ ] Clean up old test logs (>2 months old)
   - [ ] Remove old images that weren't refreshed
   - [ ] Cancel Linode object storage subscription
   - [ ] Archive final bucket inventory for records if needed

---

## Critical Files Reference

**bots repository:**
- `bots/lib/stores.py` - Bucket URL configuration
- `bots/lib/s3.py` - S3 authentication and signing
- `bots/lib/aio/s3.py` - Async S3 for logging
- `bots/.github/workflows/issue-comment.yml` - CI secrets
- `bots/image-upload` - Image upload logic with per-file ACL (commit 04fbfc3483)
- `bots/image-download` - Image download logic
- `bots/prometheus-stats` - Metrics upload

**cockpituous repository:**
- `cockpituous/ansible/aws/setup-s3-buckets.yml` - S3 bucket provisioning playbook
- `cockpituous/ansible/aws/launch-images-proxy.yml` - RHEL images nginx proxy deployment (Phase 4)
- `cockpituous/ansible/aws/aws_defaults.yml` - Bucket names and AWS defaults
- `cockpituous/ansible/aws/README.md` - AWS infrastructure documentation
- `cockpituous/ansible/roles/tasks-systemd/tasks/main.yml` - Ansible CI config
- `cockpituous/ansible/roles/images-proxy/` - nginx proxy role for RHEL images (Phase 4)
- `cockpituous/logs-proxy/s3-proxy.py` - S3 proxy implementation
- `cockpituous/logs-proxy/proxy.yaml` - Anubis deployment
- `cockpituous/metrics/metrics.yaml` - Prometheus config

---

## Decisions Made

1. **AWS Region**: us-east-1 (target region for future EC2 infrastructure)
2. **Bucket consolidation**: Single region, no separate EU bucket (at least initially -- not sure if we can have that with our ARR account)
3. **Access control**:
   - Images bucket: Per-file ACLs (non-RHEL images get `public-read`, RHEL images stay `private`)
   - Logs bucket: Bucket policy for public read (simpler than per-file ACLs since all logs have same access pattern)
4. **IAM credentials strategy**:
   - `arr-cockpit-bootstrap`: S3 admin for `cockpit*` resources (deployment/infrastructure)
   - `arr-cockpit`: S3 read/write/ACL for `cockpit*` buckets + EC2 operations (CI runtime)
   - Single shared credentials for both buckets (same access key in both s3-keys files)
   - No per-developer IAM users - too complex to manage
5. **RHEL image access**:
   - **Decision**: nginx proxy in Red Hat internal VPC (Phase 4)
   - Proxy authenticates to S3 using arr-cockpit credentials
   - Serves RHEL images to Red Hat internal network only
   - Similar to launch-webhook.yml but in internal VPC subnet (like launch-tasks.yml)
   - No per-developer IAM credentials needed
6. **Scraper protection**: Monitor after migration, add/fix Anubis later only if needed
7. **Public fallback key**: Keep embedded Linode key until Phase 6 cleanup (for old images). No AWS fallback key needed initially. Can reinstate later if Anubis is reinstated.
8. **CORS**: Not needed - logs work without it
9. **Migration timeline**:
   - **Phases 1-5: Next week** (blocking AWS compute migration)
   - **Phase 6 cleanup: ~2 months later**
10. **Data migration**: **NONE** - Images rebuild weekly anyway, new images will naturally appear on AWS

