The Content Harvester Component
amy wieliczka edited this page Mar 28, 2023
Content Harvest Component
- runs in Docker; check the README for run instructions
- content_harvester/by_registry_endpoint.py runs the content harvester for registry endpoints with the function harvest_endpoint(url)
- content_harvester/by_collection.py runs the content harvester for a given collection with the function harvest_collection({"collection_id": 12345, "rikolti_mapper_type": "mapper_name"})
- content_harvester/by_page.py runs the content harvester for a given page with the function harvest_page_content(collection_id=12345, page_filename="1.jsonl", rikolti_mapper_type="mapper_name")
harvest_page_content:
- creates a ContentHarvester with a persistent s3 client and http client
- uses get_mapped_records(collection_id, page_filename, s3_client) to read a mapped metadata file (either locally or on s3) and return a list of records
- uses ContentHarvester.harvest(record) to harvest content for each record
- warns about cases where the record has no thumbnail
- adds a content key to the record; its value is a dictionary whose keys 'thumbnail', 'media', and 'children' are all optional
- writes the list of mapped records (either locally or to s3) to a jsonl file
- returns a report of thumbnail source counts by mimetype and thumbnail counts by mimetype (to see how many derivatives were generated), media source counts by mimetype and media counts by mimetype (to see how many derivatives were generated), a count of children encountered while processing, and a count of the total number of records
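The steps above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the Rikolti implementation: StubHarvester, the in-memory record layout (thumbnail_source and media_source dicts carrying a mimetype), and the report key names are all assumptions made for the example. The real function reads and writes jsonl files through get_mapped_records and an s3 client.

```python
from collections import Counter


class StubHarvester:
    """Hypothetical stand-in for ContentHarvester; performs no downloads."""

    def harvest(self, record):
        content = {}
        # Assumed layout: optional 'thumbnail_source' / 'media_source' dicts.
        if record.get("thumbnail_source"):
            # Pretend every thumbnail derivative is a JPEG (an assumption).
            content["thumbnail"] = {"mimetype": "image/jpeg"}
        media_src = record.get("media_source")
        if media_src:
            content["media"] = {"mimetype": media_src["mimetype"]}
        record["content"] = content
        return record


def harvest_page_sketch(records, harvester):
    """Harvest each record, warn on missing thumbnails, and tally a report."""
    report = {
        "thumb_src_mimetypes": Counter(),
        "thumb_mimetypes": Counter(),
        "media_src_mimetypes": Counter(),
        "media_mimetypes": Counter(),
        "children": 0,
        "records": 0,
    }
    for record in records:
        harvester.harvest(record)
        content = record["content"]
        if "thumbnail" in content:
            src_type = record["thumbnail_source"]["mimetype"]
            report["thumb_src_mimetypes"][src_type] += 1
            report["thumb_mimetypes"][content["thumbnail"]["mimetype"]] += 1
        else:
            print(f"warning: no thumbnail for record {record.get('id')}")
        if "media" in content:
            src_type = record["media_source"]["mimetype"]
            report["media_src_mimetypes"][src_type] += 1
            report["media_mimetypes"][content["media"]["mimetype"]] += 1
        report["children"] += len(content.get("children", []))
        report["records"] += 1
    return report


records = [
    {
        "id": "a",
        "thumbnail_source": {"mimetype": "image/png"},
        "media_source": {"mimetype": "application/pdf"},
    },
    {"id": "b"},  # no thumbnail source, so this record triggers the warning
]
report = harvest_page_sketch(records, StubHarvester())
```

Comparing the source counter with the derivative counter for the same kind of content (for example thumb_src_mimetypes against thumb_mimetypes) is what makes it possible to see how many derivatives were generated.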
ContentHarvester.harvest(record):
- finds the media source in the record, downloads the source to the docker container's local filesystem, and if the media source's nuxeo_type == SampleCustomPicture, generates a jp2 using the derivatives module, before optionally uploading to s3 (if settings.CONTENT_DEST is not 'local')
- finds the thumbnail source in the record, downloads the source to the docker container's local filesystem (if it was not already downloaded by the media harvest process), and uses the derivatives module to make a thumbnail, before optionally uploading to s3 (if settings.CONTENT_DEST is not 'local')
- searches for a children folder in the settings.METADATA_SRC location (locally, or on s3)
- runs ContentHarvester.harvest(child_record) recursively for each child record found
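The recursion over child records can be illustrated with a small sketch. It is not the actual ContentHarvester code: the function name harvest_sketch, the children_by_parent lookup (a stand-in for the children folder under settings.METADATA_SRC), the record fields, the "CustomFile" type value, and the derivative file names are all hypothetical. Downloads, derivative generation, and s3 uploads are replaced by simple path manipulation.

```python
def harvest_sketch(record, children_by_parent):
    """Attach a 'content' dict to the record, recursing into child records.

    children_by_parent is a hypothetical in-memory replacement for looking
    up a children folder locally or on s3.
    """
    content = {}

    media_src = record.get("media_source")
    if media_src:
        path = media_src["filename"]
        # Only SampleCustomPicture sources get a jp2 derivative.
        if media_src.get("nuxeo_type") == "SampleCustomPicture":
            path = path.rsplit(".", 1)[0] + ".jp2"
        content["media"] = {"path": path}

    if record.get("thumbnail_source"):
        # Assumed naming scheme for the generated thumbnail.
        content["thumbnail"] = {"path": record["id"] + "-thumb.jpg"}

    children = children_by_parent.get(record["id"], [])
    if children:
        # Each child is harvested the same way as its parent.
        content["children"] = [
            harvest_sketch(child, children_by_parent) for child in children
        ]

    record["content"] = content
    return record


parent = {
    "id": "p1",
    "media_source": {"filename": "scan.tif", "nuxeo_type": "SampleCustomPicture"},
    "thumbnail_source": {"filename": "scan.tif"},
}
children_by_parent = {
    "p1": [
        {"id": "c1", "media_source": {"filename": "page.pdf", "nuxeo_type": "CustomFile"}},
    ],
}
result = harvest_sketch(parent, children_by_parent)
```

Because every key in content is optional, a child with no thumbnail source and no children of its own ends up with only a 'media' entry.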
Derivatives Module defines:
- make_thumbnail(source_file_path, mimetype)
- make_jp2(source_file_path, mimetype)
Along with several helper functions.