Conversation

VarshaUN
Collaborator

As discussed with mentors, I have added the following:

  • archiving.py
    Added DownloadStore abstract base class and implementations (LocalFilesystemProvider, S3LikeProvider, SftpProvider) for storing downloads with SHA256-based deduplication and metadata.

  • settings.py
    Initialized download_store based on DOWNLOAD_ARCHIVING_PROVIDER (localstorage, s3, sftp) with configuration validation and error logging.

  • input.py
Added add_input_from_url and add_input_from_upload to archive URL downloads and uploaded files using download_store, with a fallback to the project input directory when archiving is disabled. Integrated with the InputSource model for metadata storage.

Enhances input handling for pipelines, supporting deduplicated storage and retrieval of inputs across local, S3, and SFTP backends.

Still in progress.

Signed-off-by: Varsha U N [email protected]

@VarshaUN VarshaUN marked this pull request as draft August 18, 2025 12:25
@AyanSinhaMahapatra
Member

@VarshaUN thanks for the PR. You need to address a few overall issues before we can start reviewing the code in more detail; see the comments below:

A couple of issues with the general direction of the PR, as discussed in #1685 (comment):

We could have a simple base class to get/put files in the archive and a local filesystem implementation for now, enabled with a global setting.

  • we asked for a working implementation of the feature with the local filesystem only, and only then to go for S3 (it's fine to create data/abstract classes), as S3 is much harder to test. Functionality that cannot be tested is very unlikely to be merged.

CORE: have a feature in the base Pipeline class and settings to enable that

  • Your code does not interact with any pipelines at all, and no unit tests or other tests are added, so there is no way for us to test the functions/features.

The local storage looks like this:

  • We need some tests to show this is being stored correctly.
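A self-contained test along these lines could demonstrate that storage and deduplication behave correctly. `MinimalLocalStore` here is a hypothetical stand-in defined inline for illustration, not the PR's actual provider class:

```python
import hashlib
import shutil
import tempfile
import unittest
from pathlib import Path


class MinimalLocalStore:
    """Hypothetical minimal local archive provider, for illustration only."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, path):
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        target = self.root / digest
        if not target.exists():
            shutil.copy2(path, target)
        return digest


class LocalStoreTest(unittest.TestCase):
    def test_put_is_deduplicated(self):
        tmp = Path(tempfile.mkdtemp())
        source = tmp / "input.txt"
        source.write_text("content")
        store = MinimalLocalStore(tmp / "archive")
        digest1 = store.put(source)
        digest2 = store.put(source)
        self.assertEqual(digest1, digest2)
        self.assertTrue((store.root / digest1).exists())
        # Only one archived file despite two put() calls.
        self.assertEqual(len(list(store.root.iterdir())), 1)


if __name__ == "__main__":
    unittest.main()
```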

Presently, input archives are downloaded with `download_missing_inputs(self)` and stored in the `/input/` directory for each project, as specified in `WORK_DIRECTORIES = ["input", "output", "codebase", "tmp"]`. This needs to move into a central location for each instance of scancode.io, and there is a lot of code change required to do that. This is not addressed yet in the PR.
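As a rough illustration of the central-location idea, the sketch below fans archived files out under a single instance-wide root keyed by SHA256, alongside the legacy per-project layout. The archive root, the fan-out scheme, and the helper names are all hypothetical, not the project's actual configuration:

```python
from pathlib import Path

# Hypothetical setting: one archive root shared by the whole scancode.io
# instance, instead of each project keeping its own input/ directory.
CENTRAL_ARCHIVE_ROOT = Path("/var/scancodeio/downloads")

# Per-project work directories as they exist today.
WORK_DIRECTORIES = ["input", "output", "codebase", "tmp"]


def archived_path(sha256):
    """Return the instance-wide path for a download, fanned out by a
    two-character digest prefix to avoid one huge flat directory."""
    return CENTRAL_ARCHIVE_ROOT / sha256[:2] / sha256


def project_input_dir(project_work_dir):
    """Return the legacy per-project input directory."""
    return Path(project_work_dir) / "input"
```

With a layout like this, projects would reference archived downloads by digest rather than each holding its own copy, which is the part requiring the larger code change mentioned above.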

@VarshaUN VarshaUN changed the title Add download archiving system with LocalFilesystem, S3, and SFTP providers Add download archiving system with LocalFilesystem provider Sep 15, 2025
Signed-off-by: Varsha U N <[email protected]>
This reverts commit cd04f3f1062f3ac8c78af3a7b0ed042633f5b375.
This reverts commit b6d2342873168e53865e8f39185a9602de191b7f.
This reverts commit ca2f49f505bd5c951b5f270d4b218a69848a6de9.