feat(relocation) Use a shared bucket for relocations#116108
Conversation
There were more file moves :| Instead of copying all that data around lets put the upload/export in the right place to start with. Now that we have a shared bucket there is no reason to move files around.
With a shared bucket we don't need ot shuffle files around nearly as much. This should address both the slow RPC durations as well as the overall performance of relocations.
Transfer operations don't touch relocation blobs anymore.
The storage backend caches file contents and this cost me an hour of debugging.
Because retries are initialized with the old relocation bucket path, we
have to move the file when processing a retry. The file needs to be
moved because cloudbuild tooling copies all of `runs/{uuid}/in` which
doesn't cover the raw-relocation-data.tar. I thought conditionally
copying the file during processing was simpler than having cloudbuild
configuration always copy files twice.
| # If the currrent UUID is not in the bucket_path, we are processing | ||
| # a retry and need to copy the file. | ||
| if uuid not in raw_relocation_file.bucket_path: | ||
| new_path = relocation_raw_data_path(UUID(uuid)) | ||
| with relocation_storage.open(raw_relocation_file.bucket_path) as f: | ||
| contents = f.read() | ||
| relocation_storage.save(new_path, BytesIO(contents)) | ||
| raw_relocation_file.bucket_path = new_path | ||
| raw_relocation_file.save() |
There was a problem hiding this comment.
I don't love this as we end up copying the file again. It did seem better to do this for retries than to always copy the relocation data file twice (once via bucket_path and once with the directory glob).
| file=file, | ||
| kind=RelocationFile.Kind.RAW_USER_DATA.value, | ||
| bucket_path=relocation_file.bucket_path, | ||
| file=relocation_file.file, |
There was a problem hiding this comment.
Retry copies null bucket path
Medium Severity
Failed-relocation retry reuses relocation_file.bucket_path on the new RelocationFile without checking it is set. Failures created before shared-bucket uploads can have a null path while data still lives on the legacy File record, so the retried run cannot open raw data from storage.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 93dc700. Configure here.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit ba3a35b. Configure here.
| relocation_storage = get_relocation_storage() | ||
| if not relocation_storage.exists(raw_relocation_file.bucket_path): | ||
| retry_task() |
There was a problem hiding this comment.
Bug: Calling retry_task() for a transient "file not in storage" error incorrectly consumes the relocation's retry budget, leading to a permanent failure after a few attempts.
Severity: HIGH
Suggested Fix
The retry_task() call should not be caught by the retry_task_or_fail_relocation context manager. The task should be retried in a way that does not consume the relocation's fatal error budget. This could be achieved by calling retry_task() outside the context manager or by modifying the context manager to ignore RetryTaskError.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: src/sentry/relocation/tasks/process.py#L515-L517
Potential issue: In the `uploading_complete` task, when a file is not found in storage,
`retry_task()` is called to handle what is considered a transient condition. However,
this call is wrapped by the `retry_task_or_fail_relocation` context manager, which
catches the `RetryTaskError` and decrements a limited retry counter. Once this retry
budget is exhausted, the context manager calls `fail_relocation()`, permanently marking
the entire relocation as FAILED. This incorrectly treats an expected, transient error as
a fatal one, leading to premature and permanent relocation failures.
| relocation_storage = get_relocation_storage() | ||
| blobsize = relocation_storage.size(relocation_raw_data_path(uuid)) | ||
| # TODO(cells) Remove this once RelocationFile.file is optional. | ||
| file = File.objects.create(name="stub", type=RELOCATION_FILE_TYPE, size=blobsize) | ||
|
|
||
| # This write ensures that the entire chain triggered by `uploading_start` remains | ||
| # idempotent, since only one (relocation_uuid, relocation_file_kind) pairing can exist |
There was a problem hiding this comment.
Bug: Catching an IntegrityError leaves the transaction in an aborted state, leading to a TransactionManagementError and an infinite retry loop that repeatedly schedules a Celery task.
Severity: CRITICAL
Suggested Fix
When an IntegrityError is caught, the transaction should be explicitly rolled back to a savepoint created before the failing create() call. This will allow the transaction to be committed successfully, preventing the TransactionManagementError and the subsequent infinite retry loop. The pass statement should be replaced with logic to handle the transaction state correctly.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: src/sentry/relocation/services/relocation_export/impl.py#L114-L120
Potential issue: When handling an `IntegrityError` from
`RelocationFile.objects.create()` for idempotency, the exception is caught, but the
underlying database transaction is left in an aborted state. The code then proceeds to
schedule an `uploading_complete` task via Celery. When the `atomic_transaction` block
exits, Django raises a `TransactionManagementError` because the transaction was aborted.
The calling function catches this error but fails to delete the `transfer` record,
causing an infinite retry loop where the `uploading_complete` task is repeatedly
scheduled on each attempt.


We've had a number of problems with relocations timing out or consuming too much memory as export files are moved between control + cells. The RPC calls and file operations would be much simpler if there was only a single shared bucket, instead of one bucket per-cell. Futhermore, I've removed all the of the file copying which should improve performance and reliability.
We've already provisioned a new shared bucket and configuration changes to use that bucket are in getsentry/ops#20737
There is still some cleanup work to be done and I've added some TODOs for what I plan to address in the short term.
Refs INFRENG-318