
S3 Migration from Linode to AWS #9003

@martinpitt

https://redhat.atlassian.net/browse/COCKPIT-1833

Rationale

Moving any serious amount of test workload to ARR or TF requires all qcow downloads to be fast and reliable, i.e. in close proximity to the EC2 instances. As we can't have any truly local cache with either solution, let's move our primary image store and logs from Linode to ARR. That will also "deal with" all the excess traffic due to scrapers, who by now have learned to get past our Anubis proxy.

Current image/log stores

The Cockpit CI infrastructure currently uses three S3 buckets on Linode:

  • cockpit-images.us-east-1 - Public qcow2 test images (US region)
  • cockpit-images.eu-central-1 - Private/internal RHEL qcow2 images (EU region)
  • cockpit-logs.us-east-1 - CI test logs, artifacts, and Prometheus metrics

These buckets are being migrated to AWS S3 as a prerequisite for moving the CI infrastructure itself from the company internal PSI OpenStack cluster to AWS EC2. The S3 migration happens first, then the compute migration will follow.

Anubis Proxy Decision

Initial approach: Public-read buckets without Anubis

For the initial migration, we'll serve the public content in the AWS S3 buckets as public-read, without Anubis in front (matching the original Linode setup before scrapers became a problem). This keeps the migration simple and maintains existing access patterns.

The current Anubis deployment has become ineffective anyway, so it needs revisiting/tuning regardless of the migration. We can re-add/fix Anubis proxy later if scraper problems actually emerge.

AWS traffic cost model (future state after CI moves to AWS EC2):

  • FREE: EC2↔S3 in same region (infrastructure traffic - ~90% of volume)
  • FREE: First 100GB/month internet egress (developers)
  • 💰 $0.09/GB: Developer downloads beyond 100GB/month (negligible volume)

Current state: CI runs on PSI OpenStack cluster, so all S3 traffic is internet egress until CI migration completes.
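
The cost model above can be turned into a rough estimate. This is a minimal sketch that uses only the figures listed above (100 GB free tier, $0.09/GB beyond it); the 5000 GB monthly volume in the example is a made-up number for illustration, and the flat rate ignores any volume tiers AWS may apply.

```python
def monthly_egress_cost(total_gb, in_region_fraction, free_gb=100.0, price_per_gb=0.09):
    """Estimate the monthly internet egress cost in USD.

    total_gb: all S3 download traffic per month
    in_region_fraction: share served to EC2 in the same region (free)
    """
    if not 0.0 <= in_region_fraction <= 1.0:
        raise ValueError("in_region_fraction must be between 0 and 1")
    internet_gb = total_gb * (1.0 - in_region_fraction)
    # Only traffic beyond the free tier is billed
    billable_gb = max(0.0, internet_gb - free_gb)
    return round(billable_gb * price_per_gb, 2)


if __name__ == "__main__":
    volume = 5000  # hypothetical monthly volume in GB
    # Current state: CI still on PSI, so everything is internet egress
    print("before CI migration:", monthly_egress_cost(volume, 0.0))
    # Future state: ~90% of the traffic stays inside us-east-1
    print("after CI migration:", monthly_egress_cost(volume, 0.9))
```

The gap between the two numbers is the cost of the interim period during which CI still runs on PSI.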


Migration Plan

🔨 Phase 1: Preparation & Discovery

Timeline: Day 1-2 of migration week

Tasks:

  1. Review current Linode lifecycle policies

    • Check existing log retention settings via s3-lifecycle script to replicate on AWS:
      • Logs bucket: 90-day expiration policy
      • Image buckets: No lifecycle policy (obsolete images are pruned explicitly by tooling)
  2. Design AWS bucket structure

    • Region: us-east-1 (target region for future EC2 infrastructure)
    • Bucket names (must be globally unique on AWS):
      • Images: cockpit-ci-images (single bucket for all images)
      • Logs: cockpit-ci-logs
    • Storage class: S3 Standard for both. Intelligent-Tiering not worth the monitoring fees given many small files per test run and 90-day lifecycle.
  3. Plan access control strategy

    • Images bucket: Per-file ACL approach (like before bots commit 04fbfc3)
      • Non-RHEL images: Upload with public-read ACL
      • RHEL images (filename starts with rhel): Upload with private ACL (on AWS, authenticated-read would grant read access to any authenticated AWS account, not only ours)
      • Bucket itself can be private, access controlled at object level
    • Logs bucket: public-read for GET, authenticated PUT/DELETE
    • Plan IAM roles/credentials for CI infrastructure:
      • Both buckets: s3:PutObject, s3:DeleteObject, s3:GetObject, s3:ListBucket
      • Both buckets: s3:PutObjectAcl (to set per-file ACLs during upload)
  4. Set up/Verify team access
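
The runtime permissions from step 3 could be expressed as an IAM policy document along these lines. This is a sketch only: the real policy is managed on the ARR side, the Sid values are made up, and the bucket names are the ones proposed in step 2. Note that s3:ListBucket applies to the bucket ARN, while the object actions apply to the objects inside it.

```python
import json

# Bucket names proposed in step 2
BUCKETS = ["cockpit-ci-images", "cockpit-ci-logs"]


def ci_runtime_policy(buckets):
    """Build an IAM policy document for the CI runtime credentials."""
    object_actions = [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObjectAcl",  # needed to set per-file ACLs during upload
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ObjectAccess",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": object_actions,
                "Resource": [f"arn:aws:s3:::{b}/*" for b in buckets],
            },
            {
                "Sid": "ListBuckets",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{b}" for b in buckets],
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(ci_runtime_policy(BUCKETS), indent=2))
```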

✅ Phase 2: AWS Infrastructure Setup

Timeline: Day 2-3 of migration week

Tasks:

  1. Create S3 buckets:

    • Create/update playbook (cockpituous/ansible/aws/setup-s3-buckets.yml) to create buckets in us-east-1
      • Images bucket (single bucket, per-file ACLs)
      • Logs bucket (bucket policy for public read)
      • Ensure they all have correct Service*/AppCode tags
    • AWS URLs format: https://<bucket>.s3.us-east-1.amazonaws.com/
    • Add bucket names to ansible/aws/aws_defaults.yml
    • Document S3 buckets in ansible/aws/README.md
  2. Configure bucket policies and CORS via Ansible

    • Images bucket: Allow per-object ACLs (ObjectWriter ownership)
    • Images bucket: Block Public Access settings (allow per-file ACLs, block bucket policies)
    • Logs bucket: Bucket policy for public-read GET, authenticated PUT/DELETE
    • Logs bucket: BucketOwnerEnforced ownership (ACLs disabled, simpler)
    • Set up lifecycle policies (90-day expiration on logs bucket)
    • Note: CORS not needed - tested without it
  3. Create IAM access keys and deploy credentials

    • @thrix: create arr-cockpit-bootstrap IAM user with S3 admin privileges for cockpit* resources (for deployment)
    • @thrix: modify the arr-cockpit account to allow S3 cockpit* bucket (PutObject, GetObject, DeleteObject, PutObjectAcl, ListBucket) and EC2 operations (for runtime)
    • Store the arr-cockpit S3 access key in ci-secrets.git s3-keys/
    • Store the arr-cockpit-bootstrap S3 access key in bitwarden
    • Deploy the new secret to GitHub/OpenShift/PSI (existing docs/scripts)
  4. Run Ansible playbooks to provision infrastructure

    • Execute bucket creation playbooks (ansible-playbook aws/setup-s3-buckets.yml)
    • Verify buckets created successfully (via aws s3api commands)
    • Test bucket policies with sample uploads/downloads
    • Verified per-file ACLs on images bucket (public-read and private both work)
    • Verified bucket policy on logs bucket (public read works without ACLs)
    • Verified 90-day lifecycle policy on logs bucket
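
For reference, the two logs bucket settings from step 2 correspond to documents of roughly the following shape. The provisioning itself is done by the Ansible playbook; this sketch only builds the documents as plain Python data, and the Sid and rule ID strings are made up.

```python
import json


def public_read_policy(bucket):
    """Bucket policy that lets anyone GET objects; writes still need credentials."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PublicReadGetObject",  # hypothetical statement ID
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


def expiration_lifecycle(days):
    """Lifecycle configuration that expires every object after `days` days."""
    return {
        "Rules": [
            {
                "ID": f"expire-after-{days}-days",  # hypothetical rule ID
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix matches all objects
                "Expiration": {"Days": days},
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(public_read_policy("cockpit-ci-logs"), indent=2))
    print(json.dumps(expiration_lifecycle(90), indent=2))
```

If boto3 is at hand, these are the structures that `put_bucket_policy` (as a JSON string) and `put_bucket_lifecycle_configuration` accept, which can be handy for a one-off comparison against what the playbook applied.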

✅ Phase 3: Logs Migration & Testing

Timeline: Day 3 of migration week

Logs are simpler - no per-file ACLs, no developer IAM credentials needed, switch directly to AWS.

First, set up local job-runner S3 sink testing against the current Linode S3. Create /tmp/job-runner.toml with:

[logs]
driver = 's3'

[logs.s3]
# bots lib/stores.py LOG_STORE
url = 'https://cockpit-logs.us-east-1.linodeobjects.com/'
key = [{file="/run/user/1000/ci-secrets/s3-keys/cockpit-logs.us-east-1.linodeobjects.com"}]
proxy_url = 'https://logs-cockpit.apps.ocp.cloud.ci.centos.org/'
acl = 'authenticated-read'

and then run a test:

./job-runner --debug --config-file /tmp/job-runner.toml run cockpit-project/bots --sha ae69819bc6bec3d32c038e40179f3953d6010c2f --context test-s3 --title "Local job-runner test with S3 logging"

job log

Code changes:

bots repository: (#9013)

  • bots/.github/workflows/issue-comment.yml - Move logs bucket URL to AWS
  • bots/lib/stores.py - Switch LOG_STORE to AWS
  • Drop bots/s3-lifecycle, done with Ansible now

Testing:

  • Adjust /tmp/job-runner.toml to the below contents, run the same bots test with the new AWS S3
[logs]
driver = 's3'

[logs.s3]
# bots lib/stores.py LOG_STORE
url = 'https://cockpit-ci-logs.s3.us-east-1.amazonaws.com/'
key = [{file="/run/user/1000/ci-secrets/s3-keys/s3.us-east-1.amazonaws.com"}]
acl = ''  # BucketOwnerEnforced - bucket policy handles public access, ACLs disabled

job log

  • Test prometheus-stats logs upload to AWS
./prometheus-stats --db /tmp/test-results.db --s3 https://cockpit-ci-logs.s3.us-east-1.amazonaws.com/prometheus --verbose

successful

cockpituous repository: (cockpit-project/cockpituous#698)

  • cockpituous/ansible/roles/tasks-systemd/tasks/main.yml - Configure log uploads to AWS
  • cockpituous/metrics/README.md - Document AWS S3 setup
  • cockpituous/metrics/metrics.yaml - Update Prometheus scrape target to AWS bucket

ci-secrets:

  • add AWS_KEY_LOGS to point to AWS store, and deploy secret to bots image-build env

Phase 4: Images Migration & Testing

Timeline: Day 4 of migration week

Rationale: Images are more complex - per-file ACLs, dual-write, need to revert scraper workaround.

Code changes:

bots repository:

  • bots/lib/stores.py - Add AWS images bucket URL to PUBLIC_STORES (keep Linode URLs)
  • bots/image-upload - Revert commit 04fbfc3 to restore per-file ACL logic, then implement dual-write to both Linode and AWS buckets
  • bots/image-download - Add AWS bucket URL support (fallback from AWS to Linode)
  • bots/README.md - Document both Linode and AWS S3 buckets during transition
  • bots/job-runner.toml - Add AWS S3 URL examples
  • bots/s3-lifecycle - No changes needed (dropped in Phase 3, lifecycle is handled by Ansible now)

Implementation approach:

  • Images uploaded to BOTH Linode and AWS (dual-write)
  • Weekly image refresh will naturally populate AWS
  • ACL logic: public = not os.path.basename(source).startswith('rhel')
  • Keep embedded Linode fallback key in lib/s3.py for backward compatibility
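
The ACL rule and the dual-write loop could look roughly like this. It is a sketch, not the actual image-upload code: `upload` is a hypothetical callable standing in for whatever performs the S3 PUT, and the store names in the test are placeholders. RHEL images get the private canned ACL here.

```python
import os


def acl_for_image(source):
    """Pick the canned ACL for an image file, using the rule from the plan."""
    public = not os.path.basename(source).startswith('rhel')
    return 'public-read' if public else 'private'


def dual_write(source, stores, upload):
    """Upload `source` to every store and report per-store results.

    upload(source, store, acl) is a hypothetical callable that raises
    OSError on failure. Returns a dict mapping store -> None on success,
    or the exception on failure, so one failing store does not hide the
    outcome of the others.
    """
    acl = acl_for_image(source)
    results = {}
    for store in stores:
        try:
            upload(source, store, acl)
            results[store] = None
        except OSError as exc:
            results[store] = exc
    return results
```

Collecting errors per store instead of aborting on the first failure keeps a Linode outage from blocking the AWS upload during the transition, and vice versa.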

RHEL image proxy setup (cockpituous repository):

  • Best option: Configure images S3 proxy to allow unrestricted access from the Red Hat VPN VPC
  • If that doesn't work with the per-file ACLs: Create a third "private-images" S3 bucket with global policy, and allow RH VPN VPC access
  • If that doesn't work either: Implement a proxy as AWS Lambda function
  • Worst option (ongoing maintenance/cost): persistent EC2 instance in RH VPC that proxies access to private images
    • Create ansible/aws/launch-images-proxy.yml playbook (similar to launch-webhook.yml)
    • EC2 instance in Red Hat internal VPC (like launch-tasks.yml subnet)
    • Install nginx to proxy RHEL images from cockpit-ci-images bucket
    • Use arr-cockpit credentials to authenticate to S3
    • Only accessible from Red Hat internal network
    • Create nginx configuration role for proxying S3 with authentication
    • Deploy proxy instance to AWS
    • Update bots/lib/stores.py to add proxy URL for RHEL images
    • Document proxy URL in ansible/aws/README.md

Testing:

  • Trigger image refresh for cirros image (small and fast)
  • Verify upload succeeds to both Linode and AWS buckets
  • Verify cirros has public-read ACL on AWS (non-RHEL image)
  • Test download from AWS bucket using image-download
  • Verify checksums match between Linode and AWS
  • Test RHEL image upload to verify private ACL (stays private on AWS)
  • Test RHEL image download through nginx proxy from Red Hat network
  • Verify proxy authenticates to S3 correctly for RHEL images
  • Verify direct S3 access to RHEL images fails (private ACL working)
  • Test that Linode fallback key still works for old images
  • Monitor upload/download speeds to AWS
  • Check for any timeout or reliability issues
  • Begin monitoring AWS costs
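
For the checksum comparison step, a small helper like the following would do once the same image has been downloaded from both stores. This is a sketch; the file paths in the usage comment are placeholders.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        # Read in 1 MiB chunks so multi-GB qcow2 images do not need to fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files have identical SHA-256 digests."""
    return sha256_of(path_a) == sha256_of(path_b)


# Usage (placeholder paths):
#   same_content('/tmp/from-linode/cirros.qcow2', '/tmp/from-aws/cirros.qcow2')
```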

Phase 5: Production Cutover & Dual-Write Period

Timeline: Day 5 of migration week (cutover), then ~2 months dual-write

Approach: No historical data migration - dual-write new content only

Cutover actions (Day 5):

  1. Merge code changes from Phases 3 and 4
  2. Restart/deploy services to pick up new configuration
  3. Monitor closely for issues

Dual-write strategy:

  1. New test logs → AWS only

    • Old Linode logs remain accessible (read-only)
    • Will be cleaned up in Phase 6 (~2 months)
  2. New/refreshed images → BOTH Linode AND AWS

    • Weekly image refreshes populate both buckets
    • After ~1 month, all active images will be on AWS
    • Public images: dual-write to both buckets
    • RHEL images: dual-write to the private Linode bucket and, with private ACL, to the AWS images bucket
  3. Downloads during dual-write:

    • Code tries AWS first, falls back to Linode for old images
    • Linode fallback key still embedded for old images
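
The download order described in step 3 amounts to a simple fallback loop. This is a sketch under assumptions: `fetch` is a hypothetical callable that returns the payload or raises OSError, the AWS URL follows the bucket name and URL format given in Phases 1 and 2, and the Linode images URL is assumed by analogy with the logs URL.

```python
# AWS first, Linode second; the Linode URL is an assumption (see above)
IMAGE_STORES = [
    'https://cockpit-ci-images.s3.us-east-1.amazonaws.com/',
    'https://cockpit-images.us-east-1.linodeobjects.com/',
]


def download_with_fallback(name, stores, fetch):
    """Try each store in order; return (url, payload) from the first that works.

    fetch(url) is a hypothetical callable that raises OSError on failure.
    """
    errors = []
    for store in stores:
        url = store.rstrip('/') + '/' + name
        try:
            return url, fetch(url)
        except OSError as exc:
            # Remember the failure and move on to the next store
            errors.append((url, exc))
    raise FileNotFoundError(f"{name} not found in any store: {errors}")
```

Catching OSError also covers `urllib.error.URLError` and `HTTPError`, which are subclasses of it, so a urllib-based `fetch` would fit without extra handling.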

Why this works:

  • Images rebuild weekly, so all active images will be on AWS within 1-2 weeks
  • No Linode egress costs (no bulk migration)
  • No data validation complexity
  • Gradual, low-risk transition

Dual-write period: ~2 months to ensure safe cleanup

Phase 6: Cutover & Cleanup (~2 months after dual-write starts)

Tasks:

  1. Verify AWS has all active content

    • Ensure all current images available on AWS
    • Verify new logs going to AWS successfully
    • Check that old Linode logs are no longer referenced
  2. Remove Linode from production configs

    • Remove Linode URLs from bots/lib/stores.py (PUBLIC_STORES, LOG_STORE)
    • Remove Linode bucket configurations from bots/.github/workflows/issue-comment.yml
    • Remove Linode configurations from cockpituous/ansible/roles/tasks-systemd/tasks/main.yml
    • Update cockpituous/metrics/metrics.yaml to only scrape from AWS
    • Remove Linode S3 credentials from GitHub Secrets, ci-secrets repo (deployed to infrastructure)
    • Update bots/README.md to drop the S3 token setup/talking to Lis, check cockpituous as well
    • Remove embedded Linode fallback key from bots/lib/s3.py
    • Remove S3_KEY_LOGS from ci-secrets.git and bots image-refresh env
  3. Monitor for issues

    • Watch for failed downloads (missing images that weren't dual-written)
    • Monitor AWS costs vs. expected
    • Check for authentication failures
  4. Keep Anubis proxy available (no changes)

    • Leave logs-proxy deployment in place but not actively used
    • Can enable/fix later if scraper issues emerge
    • Update documentation to note it's available but not in use
  5. Clean up Linode

    • Clean up old test logs (>2 months old)
    • Remove old images that weren't refreshed
    • Cancel Linode object storage subscription
    • Archive final bucket inventory for records if needed
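
Before declaring step 2 done, it may help to scan the affected files for leftover Linode references. This is a sketch that works on already-loaded file contents; the search pattern assumes that all Linode endpoints contain linodeobjects.com, as the logs URL in Phase 3 does.

```python
import re

# Assumption: every Linode object storage URL contains this host suffix
LINODE_PATTERN = re.compile(r'linodeobjects\.com')


def find_linode_references(files):
    """Return (path, line_number, line) for every line mentioning Linode storage.

    files: mapping of file path -> file content as text
    """
    hits = []
    for path, text in files.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            if LINODE_PATTERN.search(line):
                hits.append((path, line_no, line.strip()))
    return hits
```

Feeding it the contents of the files listed in step 2 (for example lib/stores.py and the issue-comment workflow) gives a quick checklist of what is still pointing at Linode.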

Critical Files Reference

bots repository:

  • bots/lib/stores.py - Bucket URL configuration
  • bots/lib/s3.py - S3 authentication and signing
  • bots/lib/aio/s3.py - Async S3 for logging
  • bots/.github/workflows/issue-comment.yml - CI secrets
  • bots/image-upload - Image upload logic (per-file ACL was removed in commit 04fbfc3, to be restored in Phase 4)
  • bots/image-download - Image download logic
  • bots/prometheus-stats - Metrics upload

cockpituous repository:

  • cockpituous/ansible/aws/setup-s3-buckets.yml - S3 bucket provisioning playbook
  • cockpituous/ansible/aws/launch-images-proxy.yml - RHEL images nginx proxy deployment (Phase 4)
  • cockpituous/ansible/aws/aws_defaults.yml - Bucket names and AWS defaults
  • cockpituous/ansible/aws/README.md - AWS infrastructure documentation
  • cockpituous/ansible/roles/tasks-systemd/tasks/main.yml - Ansible CI config
  • cockpituous/ansible/roles/images-proxy/ - nginx proxy role for RHEL images (Phase 4)
  • cockpituous/logs-proxy/s3-proxy.py - S3 proxy implementation
  • cockpituous/logs-proxy/proxy.yaml - Anubis deployment
  • cockpituous/metrics/metrics.yaml - Prometheus config

Decisions Made

  1. AWS Region: us-east-1 (target region for future EC2 infrastructure)
  2. Bucket consolidation: Single region, no separate EU bucket (at least initially -- not sure if we can have that with our ARR account)
  3. Access control:
    • Images bucket: Per-file ACLs (non-RHEL images get public-read, RHEL images stay private)
    • Logs bucket: Bucket policy for public read (simpler than per-file ACLs since all logs have same access pattern)
  4. IAM credentials strategy:
    • arr-cockpit-bootstrap: S3 admin for cockpit* resources (deployment/infrastructure)
    • arr-cockpit: S3 read/write/ACL for cockpit* buckets + EC2 operations (CI runtime)
    • A single shared credential for both buckets (same access key in both s3-keys files)
    • No per-developer IAM users - too complex to manage
  5. RHEL image access:
    • Decision: nginx proxy in Red Hat internal VPC (Phase 4)
    • Proxy authenticates to S3 using arr-cockpit credentials
    • Serves RHEL images to Red Hat internal network only
    • Similar to launch-webhook.yml but in internal VPC subnet (like launch-tasks.yml)
    • No per-developer IAM credentials needed
  6. Scraper protection: Monitor after migration, add/fix Anubis later only if needed
  7. Public fallback key: Keep embedded Linode key until Phase 6 cleanup (for old images). No AWS fallback key needed initially. Can reinstate later if Anubis is reinstated.
  8. CORS: Not needed - logs work without it
  9. Migration timeline:
    • Phases 1-5: Next week (blocking AWS compute migration)
    • Phase 6 cleanup: ~2 months later
  10. Data migration: NONE - Images rebuild weekly anyway, new images will naturally appear on AWS
