Skip to content

hyku 7.x release candidate ready: Make uploads dir cleanup happen and safe#2978

Open
aprilrieger wants to merge 7 commits intomainfrom
make-uploads-dir-cleanup-happen-and-safe
Open

hyku 7.x release candidate ready: Make uploads dir cleanup happen and safe#2978
aprilrieger wants to merge 7 commits intomainfrom
make-uploads-dir-cleanup-happen-and-safe

Conversation

@aprilrieger
Copy link
Copy Markdown
Collaborator

@aprilrieger aprilrieger commented Mar 19, 2026

Add Hyrax upload staging cleanup jobs and rake task

Summary

Adds cleanup for Hyrax on-disk Carrierwave staging files under each tenant's upload root (HYRAX_UPLOAD_PATH/<tenant> or public/uploads/<tenant>). The cleanup targets the Carrierwave subdirectory (hyrax/uploaded_file/file/) within each tenant's upload root, walking per-upload-ID directories and removing files that have been successfully ingested or are very old.

How it works

Ingestion detection uses the Hyrax::UploadedFile database record: a file is considered ingested if file_set_uri is present on the corresponding record. Files without a DB record or with a nil file_set_uri are treated as not yet ingested (orphaned).

Two-tier age thresholds:

  • Files with a confirmed file_set_uri (ingested) are deleted after DELETE_INGESTED_AFTER_DAYS (default 180).
  • All remaining files — orphaned or missing a DB record — are unconditionally deleted after DELETE_ALL_AFTER_DAYS (default 730), preventing indefinite disk growth.

Scheduling: rake hyku:cleanup_uploads iterates all tenants via Account.find_each and enqueues a CleanupUploadFilesJob per tenant via perform_later (ActiveJob), so deployers can use Sidekiq, GoodJob, or any configured queue adapter. Wire the task from cron / Kubernetes CronJob as needed.

Tenants using S3 uploads (where Site.account.s3_bucket is present) are skipped since they have no local staging tree. Tenants whose upload path does not exist on disk are also skipped.

Note: Operators must schedule the rake task themselves (or call the job directly per tenant); nothing is auto-registered in GoodJob cron or any scheduler by default.

Optional ENV variables

Variable Default Description
DELETE_INGESTED_AFTER_DAYS 180 Days after which ingested staged files are deleted
DELETE_ALL_AFTER_DAYS 730 Days after which any staged file is deleted regardless of ingest status

Job architecture

  • CleanupUploadFilesJob — receives the tenant's uploads_path, builds the Carrierwave directory path (<uploads_path>/hyrax/uploaded_file/file), and enqueues a CleanupSubDirectoryJob for it. Both jobs are declared non_tenant_job so they run outside Apartment tenant switching (the tenant name is passed explicitly).
  • CleanupSubDirectoryJob — walks per-upload-ID subdirectories, checks each file's age and ingest status via Hyrax::UploadedFile.find_by, deletes qualifying files, and cleans up empty directories afterward. Logs progress every 100 deletions.

Changes

  • Add app/jobs/cleanup_upload_files_job.rb — top-level job that locates the Carrierwave staging dir and fans out to CleanupSubDirectoryJob.
  • Add app/jobs/cleanup_sub_directory_job.rb — file-level cleanup job with age + ingest-status checks and empty directory pruning.
  • Add lib/tasks/uploads_cleanup.rakehyku:cleanup_uploads rake task that enqueues cleanup per tenant, skipping S3 and missing-path tenants.
  • Add spec/jobs/cleanup_upload_files_job_spec.rb — specs covering Carrierwave dir existence/absence, parameter forwarding, and default values.
  • Add spec/jobs/cleanup_sub_directory_job_spec.rb — specs covering ingested file deletion, age thresholds, orphan handling, non-file entries, configurable thresholds, missing DB records, and empty directory cleanup.

@aprilrieger aprilrieger added minor-ver for release notes hyku 7 labels Mar 19, 2026
@aprilrieger aprilrieger changed the title Make uploads dir cleanup happen and safe hyku 7.x release candidate ready: Make uploads dir cleanup happen and safe Mar 19, 2026
@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 19, 2026

Test Results

    3 files  ± 0      3 suites  ±0   15m 31s ⏱️ -31s
2 332 tests +43  2 276 ✅ +43  56 💤 ±0  0 ❌ ±0 
2 359 runs  +43  2 301 ✅ +43  58 💤 ±0  0 ❌ ±0 

Results for commit e152e47. ± Comparison against base commit 45a5bb5.

This pull request removes 42 and adds 85 tests. Note that renamed tests count towards both.
spec.abilities.work_ability_spec ‑ Hyrax::Ability::WorkAbility when work depositor Etd permissions is expected not to be able to destroy f34ec3f9-0d3d-4047-b686-79362c84c60c
spec.abilities.work_ability_spec ‑ Hyrax::Ability::WorkAbility when work depositor Etd permissions is expected not to be able to edit 600ea6ad-f0e2-4279-b134-59580b44c8f1
spec.abilities.work_ability_spec ‑ Hyrax::Ability::WorkAbility when work depositor Etd permissions is expected not to be able to read 7794d39e-ac49-4393-9829-33a6f4a521f5
spec.abilities.work_ability_spec ‑ Hyrax::Ability::WorkAbility when work depositor Etd permissions is expected not to be able to update 2ff3af82-6113-486b-869a-6272d6c72c12
spec.abilities.work_ability_spec ‑ Hyrax::Ability::WorkAbility when work depositor FileSet permissions is expected not to be able to destroy 1aa9cab2-218d-421b-ac9a-8e1d7fd761fd
spec.abilities.work_ability_spec ‑ Hyrax::Ability::WorkAbility when work depositor FileSet permissions is expected not to be able to edit d58f78da-0f9f-4c2f-9510-6092e5f9930e
spec.abilities.work_ability_spec ‑ Hyrax::Ability::WorkAbility when work depositor FileSet permissions is expected not to be able to read 33d3ae14-ac56-49a1-a7e6-1f4956eaca54
spec.abilities.work_ability_spec ‑ Hyrax::Ability::WorkAbility when work depositor FileSet permissions is expected not to be able to update 0e20bd12-ac9d-4755-8462-dd685bb0aa30
spec.abilities.work_ability_spec ‑ Hyrax::Ability::WorkAbility when work depositor GenericWork permissions is expected not to be able to destroy 2baf535f-d8c5-4222-8e39-94a5cdee0d9d
spec.abilities.work_ability_spec ‑ Hyrax::Ability::WorkAbility when work depositor GenericWork permissions is expected not to be able to edit a9bcc5d5-2869-48a3-8b25-890f655a1e52
…
spec.abilities.work_ability_spec ‑ Hyrax::Ability::WorkAbility when work depositor Etd permissions is expected not to be able to destroy 470f3656-a89e-47f1-bc43-a6040643649f
spec.abilities.work_ability_spec ‑ Hyrax::Ability::WorkAbility when work depositor Etd permissions is expected not to be able to edit 024bb91a-3ed4-41dd-ad95-7fea1dcea411
spec.abilities.work_ability_spec ‑ Hyrax::Ability::WorkAbility when work depositor Etd permissions is expected not to be able to read bd6c34fe-2624-455d-b461-aeefeeacbbd3
spec.abilities.work_ability_spec ‑ Hyrax::Ability::WorkAbility when work depositor Etd permissions is expected not to be able to update 61250fac-86ea-492a-babf-20c72eaca7d7
spec.abilities.work_ability_spec ‑ Hyrax::Ability::WorkAbility when work depositor FileSet permissions is expected not to be able to destroy ab5899ae-431f-4156-b03e-6b6a84f7651c
spec.abilities.work_ability_spec ‑ Hyrax::Ability::WorkAbility when work depositor FileSet permissions is expected not to be able to edit 9324b65a-ccc2-43b9-9723-d849fcb6cdf9
spec.abilities.work_ability_spec ‑ Hyrax::Ability::WorkAbility when work depositor FileSet permissions is expected not to be able to read 31683f55-dc45-4896-b7df-6562084917be
spec.abilities.work_ability_spec ‑ Hyrax::Ability::WorkAbility when work depositor FileSet permissions is expected not to be able to update 411b8a1c-55f2-4098-a051-6c25cee38117
spec.abilities.work_ability_spec ‑ Hyrax::Ability::WorkAbility when work depositor GenericWork permissions is expected not to be able to destroy b02dcfd7-7448-416e-9ec1-9c655e7cd2d6
spec.abilities.work_ability_spec ‑ Hyrax::Ability::WorkAbility when work depositor GenericWork permissions is expected not to be able to edit 37eb1d91-ef67-41de-afb9-933fbbb86342
…

♻️ This comment has been updated with latest results.

@aprilrieger aprilrieger requested review from maxkadel March 20, 2026 15:32
@gleithner
Copy link
Copy Markdown
Collaborator

Just tested this out and it's not working for my test deployment. I'm not sure what the HEX_TOP_DIR_PATTERN with top_level_directories is trying to do. The comments say that "hyrax" is a protected top level directory, but if I search the uploaded_files they are all in a top level directory called hyrax. For example,
HYRAX_UPLOAD_PATH=/app/samvera/hyrax-webapp/tmp/uploads
example upload_path passed to CleanupUploadFilesJob from the rake: "/app/samvera/hyrax-webapp/tmp/uploads/75759336-1753-4498-aee4-8a0616117caf"
example file: "/app/samvera/hyrax-webapp/tmp/uploads/75759336-1753-4498-aee4-8a0616117caf/hyrax/uploaded_file/file/3/nga_bench_elizabeth_curtis.jpg"

With this example, the File.basename in top_level_directories ends up returning "hyrax", which doesn't match the pattern, and so it's not processed.

@aprilrieger
Copy link
Copy Markdown
Collaborator Author

@gleithner I am curious to know the instance configuration you are using to compare and troubleshoot further.

@aprilrieger aprilrieger changed the title hyku 7.x release candidate ready: Make uploads dir cleanup happen and safe DRAFT: Make uploads dir cleanup happen and safe Mar 20, 2026
@gleithner
Copy link
Copy Markdown
Collaborator

I was assuming this task was meant to clean the Carrierwave(Hyrax::UploadedFileUploader) files that it places in its store_dir(config.upload_path) and cache_dir(config.cache_path)

Based on a default Hyku installation:
config.upload_path defaults to HYRAX_UPLOAD_PATH/ or public/uploads/
config.cache_path sticks with the Hyrax default path which is HYRAX_CACHE_PATH or tmp/cache

I'm unaware of anything else that is saved to the config.upload_path other than the Carrierwave store_dir files which are saved under the directory "hyrax"(This path being set in Hyrax in app/uploaders/hyrax/uploaded_file_uploader.rb). As far as I can tell, no other directories should exist in the config.upload_path in Hyku 6.x+, although I do have old notes in which Rob stated that branding used to default to the upload_path before Hyrax implemented branding.

I'm fairly certain the Carrierwave cache_dir files can be deleted without any issue, but I've been wondering about the store_dir files for a while because I wasn't sure at what point Hyku/Hyrax is done with processing them so that they are safe to delete. I was also confused by the hex named directories that you are targeting and what would generate them, but again as far as I can tell these directories do not exist in a default Hyku install in the upload_path.
And to be specific on Hyku version, all of this is based on looking at Hyku 6.x but I don't see any path changes as of Hyku 7.x that would change things.

I hope that this clarifies my confusion.

@aprilrieger
Copy link
Copy Markdown
Collaborator Author

Thank you for the early review @gleithner!

The issue was that it wasn't targeting the proper path as you stated. We had ported this over and were trying to integrate the pattern from an older Hyku instance, so appreciate the feedback. The current version now targets the Carrierwave staging subdirectory (hyrax/uploaded_file/file/) under each tenant's upload root and uses the Hyrax::UploadedFile record's file_set_uri to determine ingest status. I've updated the PR description to accurately reflect what's on the branch. Let me know if you see anything else that needs adjusting, or if you want to try it out on your end!

maxkadel
maxkadel previously approved these changes Mar 24, 2026
Copy link
Copy Markdown
Collaborator

@maxkadel maxkadel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! As long as it works I'm happy.

Non-blocking thought: It might be worth double checking that there aren't directories that are missing the tenant part of the path because of past potential bugs (I think I've seen that before), and that the S3 clients don't have these directories for ephemeral uploads.

Copy link
Copy Markdown
Collaborator

@maxkadel maxkadel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks April!


# Avoid ApplicationJob's retry_on(StandardError) burning 5 attempts
# and swallowing the error when the retry block is present.
discard_on ArgumentError do |_job, error|
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know about discard_on, I like that!

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right?! i thought that was pretty neato myself!

@aprilrieger aprilrieger changed the title DRAFT: Make uploads dir cleanup happen and safe Make uploads dir cleanup happen and safe Mar 26, 2026
@aprilrieger aprilrieger changed the title Make uploads dir cleanup happen and safe hyku 7.x release candidate ready: Make uploads dir cleanup happen and safe Mar 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hyku 7 minor-ver for release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants