hyku 7.x release candidate ready: Make uploads dir cleanup happen and safe #2978
aprilrieger wants to merge 7 commits into main
…buting back the cleaning up of old uploads from the uploads directory if already uploaded to fcrepo
Test Results: 3 files ±0, 3 suites ±0, 15m 31s ⏱️ -31s. Results for commit e152e47. ± Comparison against base commit 45a5bb5. This pull request removes 42 and adds 85 tests. Note that renamed tests count towards both. ♻️ This comment has been updated with latest results.
Just tested this out and it's not working for my test deployment. I'm not sure what the HEX_TOP_DIR_PATTERN with top_level_directories is trying to do. The comments say that "hyrax" is a protected top-level directory, but if I search the uploaded_files they are all in a top-level directory called hyrax. With paths like that, the File.basename in top_level_directories ends up returning "hyrax", which doesn't match the pattern, and so it's not processed.
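The mismatch described in this comment can be reproduced in plain Ruby. This is only a sketch: the regex is a hypothetical stand-in for HEX_TOP_DIR_PATTERN (the real pattern is not shown in this thread), and the sample paths are illustrative rather than taken from a real deployment.

```ruby
# Hypothetical stand-in for HEX_TOP_DIR_PATTERN; the real regex in the PR may differ.
ASSUMED_HEX_TOP_DIR_PATTERN = /\A[0-9a-f]{2}\z/

# Illustrative top-level entries under a tenant's upload root (assumed paths).
top_level_paths = [
  "/app/uploads/tenant-a/hyrax", # where Carrierwave store_dir files live
  "/app/uploads/tenant-a/3f"     # the kind of hex-named directory the pattern expects
]

# Mirrors the behaviour described in the comment: take the basename of each
# top-level directory and test it against the pattern.
def processable?(path, pattern)
  File.basename(path).match?(pattern)
end

results = top_level_paths.to_h do |path|
  [File.basename(path), processable?(path, ASSUMED_HEX_TOP_DIR_PATTERN)]
end

results.each { |name, matched| puts "#{name}: #{matched ? 'processed' : 'skipped'}" }
```

Under these assumptions the "hyrax" directory is skipped, which is consistent with the cleanup never reaching the Carrierwave files.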
@gleithner I am curious to know the instance configuration you are using to compare and troubleshoot further.
I was assuming this task was meant to clean the Carrierwave (Hyrax::UploadedFileUploader) files that it places in its store_dir (config.upload_path) and cache_dir (config.cache_path). Based on a default Hyku installation, I'm unaware of anything else that is saved to the config.upload_path other than the Carrierwave store_dir files, which are saved under the directory "hyrax" (this path being set in Hyrax in app/uploaders/hyrax/uploaded_file_uploader.rb). As far as I can tell, no other directories should exist in the config.upload_path in Hyku 6.x+, although I do have old notes in which Rob stated that branding used to default to the upload_path before Hyrax implemented branding. I'm fairly certain the Carrierwave cache_dir files can be deleted without any issue, but I've been wondering about the store_dir files for a while because I wasn't sure at what point Hyku/Hyrax is done with processing them so that they are safe to delete. I was also confused by the hex-named directories that you are targeting and what would generate them, but again, as far as I can tell, these directories do not exist in the upload_path of a default Hyku install. I hope that this clarifies my confusion.
Thank you for the early review @gleithner! The issue was that it wasn't targeting the proper path as you stated. We had ported this over and were trying to integrate the pattern from an older Hyku instance, so appreciate the feedback. The current version now targets the Carrierwave staging subdirectory (
maxkadel left a comment
Looks good! As long as it works I'm happy.
Non-blocking thought: It might be worth double checking that there aren't directories that are missing the tenant part of the path because of past potential bugs (I think I've seen that before), and that the S3 clients don't have these directories for ephemeral uploads.
maxkadel left a comment
Looks good, thanks April!
# Avoid ApplicationJob's retry_on(StandardError) burning 5 attempts
# and swallowing the error when the retry block is present.
discard_on ArgumentError do |_job, error|
I didn't know about discard_on, I like that!
Right?! I thought that was pretty neato myself!
Add Hyrax upload staging cleanup jobs and rake task
Summary
Adds cleanup for Hyrax on-disk Carrierwave staging files under each tenant's upload root (HYRAX_UPLOAD_PATH/<tenant> or public/uploads/<tenant>). The cleanup targets the Carrierwave subdirectory (hyrax/uploaded_file/file/) within each tenant's upload root, walking per-upload-ID directories and removing files that have been successfully ingested or are very old.
How it works
Ingestion detection uses the Hyrax::UploadedFile database record: a file is considered ingested if file_set_uri is present on the corresponding record. Files without a DB record or with a nil file_set_uri are treated as not yet ingested (orphaned).
Two-tier age thresholds:
- Files with a file_set_uri (ingested) are deleted after DELETE_INGESTED_AFTER_DAYS (default 180).
- Files of any status, including orphans, are deleted after DELETE_ALL_AFTER_DAYS (default 730), preventing indefinite disk growth.
Scheduling:
rake hyku:cleanup_uploads iterates all tenants via Account.find_each and enqueues a CleanupUploadFilesJob per tenant via perform_later (ActiveJob), so deployers can use Sidekiq, GoodJob, or any configured queue adapter. Wire the task from cron / Kubernetes CronJob as needed.
Tenants using S3 uploads (where Site.account.s3_bucket is present) are skipped since they have no local staging tree. Tenants whose upload path does not exist on disk are also skipped.
Note: Operators must schedule the rake task themselves (or call the job directly per tenant); nothing is auto-registered in GoodJob cron or any scheduler by default.
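The tenant selection rules above can be sketched in plain Ruby. This is an illustration only, not the rake task from the PR: Tenant is a hypothetical Struct standing in for Account, its fields (name, s3_bucket, uploads_path) are assumed names, and the enqueue step is represented by returning the selected tenant names instead of calling CleanupUploadFilesJob.perform_later.

```ruby
require "tmpdir"

# Hypothetical stand-in for an Account record; the field names are assumptions.
Tenant = Struct.new(:name, :s3_bucket, :uploads_path, keyword_init: true)

# Returns the tenants that would receive a cleanup job. Tenants backed by S3
# and tenants whose upload path is missing on disk are skipped.
def tenants_to_clean(tenants)
  tenants.reject do |tenant|
    uses_s3 = !tenant.s3_bucket.to_s.strip.empty?
    uses_s3 || !Dir.exist?(tenant.uploads_path)
  end
end

selected_names = Dir.mktmpdir do |root|
  local_path = File.join(root, "tenant-local")
  Dir.mkdir(local_path)

  tenants = [
    Tenant.new(name: "tenant-local", s3_bucket: nil, uploads_path: local_path),
    Tenant.new(name: "tenant-s3", s3_bucket: "example-bucket", uploads_path: File.join(root, "tenant-s3")),
    Tenant.new(name: "tenant-missing", s3_bucket: nil, uploads_path: File.join(root, "missing"))
  ]

  # In a real task, this is the point where one job per tenant would be enqueued.
  tenants_to_clean(tenants).map(&:name)
end

puts selected_names.inspect
```

Keeping the filter separate from the enqueue call makes the skip rules easy to test without a queue adapter.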
Optional ENV variables

| Variable | Default |
| --- | --- |
| DELETE_INGESTED_AFTER_DAYS | 180 |
| DELETE_ALL_AFTER_DAYS | 730 |

Job architecture
- CleanupUploadFilesJob: receives the tenant's uploads_path, builds the Carrierwave directory path (<uploads_path>/hyrax/uploaded_file/file), and enqueues a CleanupSubDirectoryJob for it. Both jobs are declared non_tenant_job so they run outside Apartment tenant switching (the tenant name is passed explicitly).
- CleanupSubDirectoryJob: walks per-upload-ID subdirectories, checks each file's age and ingest status via Hyrax::UploadedFile.find_by, deletes qualifying files, and cleans up empty directories afterward. Logs progress every 100 deletions.
Changes
- app/jobs/cleanup_upload_files_job.rb: top-level job that locates the Carrierwave staging dir and fans out to CleanupSubDirectoryJob.
- app/jobs/cleanup_sub_directory_job.rb: file-level cleanup job with age + ingest-status checks and empty directory pruning.
- lib/tasks/uploads_cleanup.rake: hyku:cleanup_uploads rake task that enqueues cleanup per tenant, skipping S3 and missing-path tenants.
- spec/jobs/cleanup_upload_files_job_spec.rb: specs covering Carrierwave dir existence/absence, parameter forwarding, and default values.
- spec/jobs/cleanup_sub_directory_job_spec.rb: specs covering ingested file deletion, age thresholds, orphan handling, non-file entries, configurable thresholds, missing DB records, and empty directory cleanup.
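To make the age and ingest rules concrete, here is a minimal sketch of the per-file decision and the directory walk in plain Ruby. It is not the PR's implementation: the ingested_ids Hash stands in for the Hyrax::UploadedFile.find_by lookup, the method names (delete_file?, cleanup_staging_dir) are hypothetical, and both the use of file mtime as the age and the inclusive threshold comparison are assumptions.

```ruby
require "fileutils"
require "tmpdir"

SECONDS_PER_DAY = 86_400

# Decide whether a staged file should be removed. `ingested` is true when the
# matching record has a file_set_uri; orphans are passed in as false.
def delete_file?(age_days:, ingested:, ingested_after: 180, all_after: 730)
  return true if age_days >= all_after # second tier: anything very old
  ingested && age_days >= ingested_after # first tier: ingested and old enough
end

# Walk <carrierwave_dir>/<upload_id>/*, delete qualifying files, then prune
# upload-ID directories that end up empty. `ingested_ids` is a hypothetical
# lookup (upload ID => true) standing in for a database query.
def cleanup_staging_dir(carrierwave_dir, ingested_ids:, now: Time.now, **thresholds)
  deleted = []
  Dir.children(carrierwave_dir).sort.each do |upload_id|
    id_dir = File.join(carrierwave_dir, upload_id)
    next unless File.directory?(id_dir)

    Dir.children(id_dir).each do |name|
      path = File.join(id_dir, name)
      next unless File.file?(path) # skip non-file entries

      age_days = (now - File.mtime(path)) / SECONDS_PER_DAY
      ingested = ingested_ids.fetch(upload_id, false)
      next unless delete_file?(age_days: age_days, ingested: ingested, **thresholds)

      File.delete(path)
      deleted << path
    end
    Dir.rmdir(id_dir) if Dir.empty?(id_dir)
  end
  deleted
end

# Usage: four uploads with different ages (in days) and ingest states.
remaining = Dir.mktmpdir do |dir|
  now = Time.now
  { "1" => 200, "2" => 200, "3" => 800, "4" => 10 }.each do |id, age|
    FileUtils.mkdir_p(File.join(dir, id))
    path = File.join(dir, id, "file.pdf")
    File.write(path, "x")
    stamp = now - (age * SECONDS_PER_DAY)
    File.utime(stamp, stamp, path) # backdate the file to simulate its age
  end

  # Uploads 1 and 4 are ingested; 2 and 3 are orphans.
  cleanup_staging_dir(dir, ingested_ids: { "1" => true, "4" => true }, now: now)
  Dir.children(dir).sort
end

puts remaining.inspect
```

In this sketch upload 1 (ingested, 200 days) and upload 3 (orphan, 800 days) are removed along with their directories, while upload 2 (orphan, 200 days) and upload 4 (ingested, 10 days) are kept.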