S3 Migration from Linode to AWS
https://redhat.atlassian.net/browse/COCKPIT-1833
Rationale
Moving any serious amount of test workload to ARR or TF requires all qcow downloads to be fast and reliable, i.e. in close proximity to the EC2 instances. As we can't have any truly local cache with either solution, let's move our primary image store and logs from Linode to ARR. That will also "deal with" all the excess traffic due to scrapers, who by now have learned to get past our Anubis proxy.
Current image/log stores
The Cockpit CI infrastructure currently uses three S3 buckets on Linode:
cockpit-images.us-east-1 - Public qcow2 test images (US region)
cockpit-logs.us-east-1 - CI test logs, artifacts, and Prometheus metrics
These buckets are being migrated to AWS S3 as a prerequisite for moving the CI infrastructure itself from the company internal PSI OpenStack cluster to AWS EC2. The S3 migration happens first, then the compute migration will follow.
Anubis Proxy Decision
Initial approach: Public-read buckets without Anubis
For the initial migration, we'll make public AWS S3 buckets public-read (matching the original Linode setup before scrapers became a problem). This keeps the migration simple and maintains existing access patterns.
The current Anubis deployment has become ineffective anyway, so it needs revisiting/tuning regardless of the migration. We can re-add/fix Anubis proxy later if scraper problems actually emerge.
AWS traffic cost model (future state after CI moves to AWS EC2):
✅ FREE: EC2↔S3 in same region (infrastructure traffic - ~90% of volume)
✅ FREE: First 100GB/month internet egress (developers)
Sources:
Current state: CI runs on PSI OpenStack cluster, so all S3 traffic is internet egress until CI migration completes.
Single shared credentials for both buckets (same access key in both s3-keys files)
No per-developer IAM users - too complex to manage
RHEL image access:
Decision: nginx proxy in Red Hat internal VPC (Phase 4)
Proxy authenticates to S3 using arr-cockpit credentials
Serves RHEL images to Red Hat internal network only
Similar to launch-webhook.yml but in internal VPC subnet (like launch-tasks.yml)
No per-developer IAM credentials needed
Scraper protection: Monitor after migration, add/fix Anubis later only if needed
Public fallback key: Keep embedded Linode key until Phase 6 cleanup (for old images). No AWS fallback key needed initially. Can reinstate later if Anubis is reinstated.
CORS: Not needed - logs work without it
Migration timeline:
Phases 1-5: Next week (blocking AWS compute migration)
Phase 6 cleanup: ~2 months later
Data migration: NONE - Images rebuild weekly anyway, new images will naturally appear on AWS
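The AWS traffic cost model in this plan (in-region EC2↔S3 traffic is free, and the first 100 GB/month of internet egress is free) can be turned into a quick back-of-the-envelope check. The following is only a sketch: the function name and its parameters are illustrative, the 90% in-region share is the rough estimate from the plan, and no per-GB price is assumed.

```python
def billable_egress_gb(total_gb, in_region_fraction=0.9, free_tier_gb=100.0):
    """Estimate the monthly internet egress (in GB) above the free allowance.

    total_gb:            all S3 download traffic per month
    in_region_fraction:  share served to EC2 in the same region (free)
    free_tier_gb:        free internet egress per month
    """
    if not 0.0 <= in_region_fraction <= 1.0:
        raise ValueError("in_region_fraction must be between 0 and 1")
    internet_gb = total_gb * (1.0 - in_region_fraction)
    return max(0.0, internet_gb - free_tier_gb)


if __name__ == "__main__":
    # Future state: most traffic stays inside the region.
    print(billable_egress_gb(5000))
    # Current state: CI still runs on PSI, so everything is internet egress.
    print(billable_egress_gb(5000, in_region_fraction=0.0))
```

Setting `in_region_fraction=0.0` models the period before the compute migration, when every download leaves AWS.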
Migration Plan
🔨 Phase 1: Preparation & Discovery
Timeline: Day 1-2 of migration week
Tasks:
Review current Linode lifecycle policies
s3-lifecycle script to replicate on AWS:
Design AWS bucket structure
cockpit-ci-images (single bucket for all images)
cockpit-ci-logs
Plan access control strategy
public-read ACL
rhel): Upload with authenticated-read or private ACL
s3:PutObject, s3:DeleteObject, s3:GetObject, s3:ListBucket
s3:PutObjectAcl (to set per-file ACLs during upload)
Set up/Verify team access
✅ Phase 2: AWS Infrastructure Setup
Timeline: Day 2-3 of migration week
Tasks:
Create S3 buckets:
cockpituous/ansible/aws/setup-s3-buckets.yml) to create buckets in us-east-1
https://<bucket>.s3.us-east-1.amazonaws.com/
ansible/aws/aws_defaults.yml
ansible/aws/README.md
Configure bucket policies and CORS via Ansible
ObjectWriter ownership)
BucketOwnerEnforced ownership (ACLs disabled, simpler)
Create IAM access keys and deploy credentials
cockpit* resources (for deployment)
cockpit* bucket (PutObject, DeleteObject, GetObject, PutObjectAcl, DeleteObject) and EC2 operations (for runtime)
arr-cockpit S3 access key in ci-secrets.git s3-keys/
arr-cockpit-bootstrap S3 access key in bitwarden
Run Ansible playbooks to provision infrastructure
ansible-playbook aws/setup-s3-buckets.yml)
aws s3api commands)
✅ Phase 3: Logs Migration & Testing
Timeline: Day 3 of migration week
Logs are simpler - no per-file ACLs, no developer IAM credentials needed, switch directly to AWS.
First, set up local job-runner S3 sink testing with the current Linode S3. Create /tmp/job-runner.toml with
and then run a test:
→ job log
Code changes:
bots repository: (#9013)
bots/.github/workflows/issue-comment.yml - Move logs bucket URL to AWS
bots/lib/stores.py - Switch LOG_STORE to AWS
bots/s3-lifecycle, done with Ansible now
Testing:
→ job log
prometheus-stats logs upload to AWS → successful
cockpituous repository: (cockpit-project/cockpituous#698)
cockpituous/ansible/roles/tasks-systemd/tasks/main.yml - Configure log uploads to AWS
cockpituous/metrics/README.md - Document AWS S3 setup
cockpituous/metrics/metrics.yaml - Update Prometheus scrape target to AWS bucket
ci-secrets:
AWS_KEY_LOGS to point to AWS store, and deploy secret to bots image-build env
Phase 4: Images Migration & Testing
Timeline: Day 4 of migration week
Rationale: Images are more complex - per-file ACLs, dual-write, need to revert scraper workaround.
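The per-file ACL rule this phase restores is small enough to sketch in isolation. The function below wraps the `public = not os.path.basename(source).startswith('rhel')` check from the implementation approach. The function name is hypothetical, and choosing `private` for RHEL images follows the "RHEL images stay private" decision (the plan also mentions `authenticated-read` as an option).

```python
import os


def acl_for_image(source):
    """Pick the canned S3 ACL for an image file, based on its file name.

    Everything whose base name starts with 'rhel' is kept private;
    all other images are uploaded world-readable.
    """
    public = not os.path.basename(source).startswith('rhel')
    return 'public-read' if public else 'private'


if __name__ == "__main__":
    # File names are made up for illustration.
    for path in ("images/cirros.qcow2", "images/rhel-example.qcow2"):
        print(path, "->", acl_for_image(path))
```

Because the check uses the base name only, a directory called `rhel/` does not make a non-RHEL image private.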
Code changes:
bots repository:
bots/lib/stores.py - Add AWS images bucket URL to PUBLIC_STORES (keep Linode URLs)
bots/image-upload - Revert commit 04fbfc3 to restore per-file ACL logic, then implement dual-write to both Linode and AWS buckets
bots/image-download - Add AWS bucket URL support (fallback from AWS to Linode)
bots/README.md - Document both Linode and AWS S3 buckets during transition
bots/job-runner.toml - Add AWS S3 URL examples
bots/s3-lifecycle - Add AWS URL examples in comments
Implementation approach:
public = not os.path.basename(source).startswith('rhel')
RHEL image proxy setup (cockpituous repository):
ansible/aws/launch-images-proxy.yml playbook (similar to launch-webhook.yml)
bots/lib/stores.py to add proxy URL for RHEL images
ansible/aws/README.md
Testing:
cirros image (small and fast)
cirros has public-read ACL on AWS (non-RHEL image)
image-download
Phase 5: Production Cutover & Dual-Write Period
Timeline: Day 5 of migration week (cutover), then ~2 months dual-write
Approach: No historical data migration - dual-write new content only
Cutover actions (Day 5):
Dual-write strategy:
New test logs → AWS only
New/refreshed images → BOTH Linode AND AWS
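As a rough illustration of the dual-write routing just listed, the sketch below maps a content kind to the stores it is written to. The store labels, function names, and the injected `upload` callable are all assumptions made for the example. This is not how `bots/image-upload` is actually structured.

```python
# Store labels are placeholders for this sketch, not names used by the bots code.
LINODE = "linode"
AWS = "aws"


def upload_targets(kind):
    """Return the stores that new content of the given kind is written to."""
    if kind == "log":
        return [AWS]              # new test logs go to AWS only
    if kind == "image":
        return [LINODE, AWS]      # new/refreshed images go to both stores
    raise ValueError(f"unknown content kind: {kind!r}")


def dual_write(kind, name, upload):
    """Call upload(store, name) for every target; return the stores written."""
    written = []
    for store in upload_targets(kind):
        upload(store, name)       # an exception here aborts the remaining writes
        written.append(store)
    return written


if __name__ == "__main__":
    print(dual_write("image", "cirros.qcow2", lambda store, name: None))
```

Passing the upload function in keeps the routing rule testable without touching any real bucket.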
Downloads during dual-write:
Why this works:
Dual-write period: ~2 months to ensure safe cleanup
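During the dual-write period, older images may exist only on Linode, which is why the plan has `image-download` fall back from AWS to Linode. A minimal sketch of that fallback order follows. The function name, the `fetch` callable, and the store labels are hypothetical, and real code would fetch over HTTP rather than through an injected function.

```python
def download_with_fallback(name, stores, fetch):
    """Try each store in order and return (store, data) from the first hit.

    fetch(store, name) is expected to return the object's bytes or raise
    OSError (for example FileNotFoundError) if the store cannot serve it.
    """
    errors = []
    for store in stores:
        try:
            return store, fetch(store, name)
        except OSError as exc:
            errors.append(f"{store}: {exc}")
    raise FileNotFoundError(f"{name} not found in any store ({'; '.join(errors)})")


if __name__ == "__main__":
    def fake_fetch(store, name):
        # Pretend the image has not been rebuilt on AWS yet.
        if store == "aws":
            raise FileNotFoundError(name)
        return b"qcow2 data"

    print(download_with_fallback("old-image.qcow2", ["aws", "linode"], fake_fetch))
```

Listing AWS first means in-region downloads win as soon as an image has been refreshed there, with Linode serving only what has not been rebuilt yet.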
Phase 6: Cutover & Cleanup (~2 months after dual-write starts)
Tasks:
Verify AWS has all active content
Remove Linode from production configs
bots/lib/stores.py (PUBLIC_STORES, LOG_STORE)
bots/.github/workflows/issue-comment.yml
cockpituous/ansible/roles/tasks-systemd/tasks/main.yml
cockpituous/metrics/metrics.yaml to only scrape from AWS
bots/README.md to drop the S3 token setup/talking to Lis, check cockpituous as well
bots/lib/s3.py
S3_KEY_LOGS from ci-secrets.git and bots image-refresh env
Monitor for issues
Keep Anubis proxy available (no changes)
Clean up Linode
Critical Files Reference
bots repository:
bots/lib/stores.py - Bucket URL configuration
bots/lib/s3.py - S3 authentication and signing
bots/lib/aio/s3.py - Async S3 for logging
bots/.github/workflows/issue-comment.yml - CI secrets
bots/image-upload - Image upload logic with per-file ACL (commit 04fbfc3)
bots/image-download - Image download logic
bots/prometheus-stats - Metrics upload
cockpituous repository:
cockpituous/ansible/aws/setup-s3-buckets.yml - S3 bucket provisioning playbook
cockpituous/ansible/aws/launch-images-proxy.yml - RHEL images nginx proxy deployment (Phase 4)
cockpituous/ansible/aws/aws_defaults.yml - Bucket names and AWS defaults
cockpituous/ansible/aws/README.md - AWS infrastructure documentation
cockpituous/ansible/roles/tasks-systemd/tasks/main.yml - Ansible CI config
cockpituous/ansible/roles/images-proxy/ - nginx proxy role for RHEL images (Phase 4)
cockpituous/logs-proxy/s3-proxy.py - S3 proxy implementation
cockpituous/logs-proxy/proxy.yaml - Anubis deployment
cockpituous/metrics/metrics.yaml - Prometheus config
Decisions Made
public-read, RHEL images stay private)
arr-cockpit-bootstrap: S3 admin for cockpit* resources (deployment/infrastructure)
arr-cockpit: S3 read/write/ACL for cockpit* buckets + EC2 operations (CI runtime)
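To make the arr-cockpit runtime permissions more concrete, here is a sketch of what the S3 part of such an IAM policy could look like, built as a Python dict and printed as JSON. It assumes the action list from the access control planning (s3:PutObject, s3:DeleteObject, s3:GetObject, s3:ListBucket, s3:PutObjectAcl) and the `cockpit*` bucket pattern. The EC2 permissions are left out, and the policy actually deployed by the Ansible playbooks may differ.

```python
import json


def runtime_s3_policy(bucket_pattern="cockpit*"):
    """Build an IAM policy document for read/write/ACL access to matching buckets."""
    bucket_arn = f"arn:aws:s3:::{bucket_pattern}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so it targets the bucket ARN.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Object-level actions target the objects inside the bucket.
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:PutObjectAcl",
                    "s3:DeleteObject",
                ],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(runtime_s3_policy(), indent=2))
```

Splitting bucket-level and object-level actions into separate statements keeps each action paired with the resource type it applies to.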