
[SP-2587] Add directory simhash, modify concatenated names to remove extensions #119


Open · wants to merge 17 commits into base: main
Conversation

matiasdaloia
Contributor

@matiasdaloia matiasdaloia commented May 22, 2025

Summary by CodeRabbit

  • New Features

    • Improved folder scan filtering by skipping directories based on their extensions as well as names.
    • Folder hashing scan now provides additional information, including directory hash, language extension statistics, and enhanced directory name analysis.
    • Added CycloneDX output format support for folder scan results.
    • Introduced a rank-threshold filter to show higher quality matches based on rank values.
  • Bug Fixes

    • Ensured safer handling of relative paths in folder hashing output.
  • Documentation

    • Updated documentation to reflect removal of obsolete command-line options for folder scans.
    • Enhanced client documentation with new options for result ranking and output format.
  • Chores

    • Updated command-line interface: replaced --threshold with --rank-threshold (default 5) and removed --best-match for folder scans.
    • Updated scanning API and related logic to use new ranking and categorization fields.
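The CLI change above can be sketched with argparse; the option names come from the summary, and DEFAULT_HFH_RANK_THRESHOLD mirrors the new constant, but everything else here is illustrative and not the real src/scanoss/cli.py code:

```python
import argparse

# Mirrors the new constant described in src/scanoss/constants.py
DEFAULT_HFH_RANK_THRESHOLD = 5

# Hypothetical sketch: --threshold and --best-match are gone,
# --rank-threshold takes their place, and cyclonedx is a new format choice.
parser = argparse.ArgumentParser(prog='folder-scan')
parser.add_argument(
    '--rank-threshold',
    type=int,
    default=DEFAULT_HFH_RANK_THRESHOLD,
    help='Only show matches with a rank value at or below this threshold',
)
parser.add_argument(
    '--format',
    choices=['json', 'cyclonedx'],
    default='json',
    help='Output format for folder scan results',
)

args = parser.parse_args(['--rank-threshold', '3', '--format', 'cyclonedx'])
print(args.rank_threshold, args.format)
```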

@matiasdaloia matiasdaloia self-assigned this May 22, 2025
@matiasdaloia matiasdaloia added the enhancement New feature or request label May 22, 2025

coderabbitai bot commented May 22, 2025

Walkthrough

This update modifies folder hashing and scanning features. It updates type hints, filtering logic, and simhash calculation in file and directory processing. The protobuf schema evolves with new and removed fields. CLI arguments for folder scanning are changed, and the scanning logic is refactored to use a unified rank threshold parameter. CycloneDX output support is added. Documentation is updated accordingly.

Changes

File(s) Change Summary
src/scanoss/file_filters.py Added Optional type hint for scan_root, enhanced directory filtering with skipped directory extensions for folder hashing scans.
src/scanoss/scanners/folder_hasher.py Renamed constant to MAXIMUM_FILE_NAME_LENGTH, changed DirectoryFile key type, updated constructor signatures, enhanced simhash logic with unique file/dir names and extension counts, improved debug logging.
src/scanoss/api/scanning/v2/scanoss_scanning_pb2.py Updated protobuf schema: removed best_match and threshold, added rank_threshold, category, query_limit, sim_hash_dir_names, and lang_extensions map in HFHRequest; extended nested messages and added new Version message in HFHResponse.
docs/source/index.rst Removed documentation for --best-match and --threshold arguments in the folder-scan command; added --rank-threshold and cyclonedx format.
src/scanoss/cli.py Removed --best-match, renamed --threshold to --rank-threshold, adjusted defaults, updated argument handling and help text, added cyclonedx format choice.
src/scanoss/scanners/scanner_hfh.py Refactored to use rank_threshold parameter, removed best_match and threshold, updated request dictionary in scanning logic, implemented CycloneDX output support.
src/scanoss/constants.py Added new constant DEFAULT_HFH_RANK_THRESHOLD set to 5.
CLIENT_HELP.md Added documentation for --rank-threshold and --format options in folder-scan command with example usage.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI
    participant ScannerHFH
    participant FolderHasher
    participant gRPC_Server

    User->>CLI: Run folder-scan with --rank-threshold
    CLI->>ScannerHFH: Initialize with rank_threshold
    ScannerHFH->>FolderHasher: Prepare folder hash data
    ScannerHFH->>gRPC_Server: Send scan request (rank_threshold, sim_hash_dir_names, lang_extensions)
    gRPC_Server-->>ScannerHFH: Return scan results
    ScannerHFH-->>CLI: Output results
    CLI-->>User: Display scan output


Poem

In the warren of code, a new path we hash,
With simhash and filters, we make a quick dash.
Old flags are retired, new thresholds in play,
Extensions are counted in a bunny’s own way.
With a hop and a scan, the results are in sight—
🐇✨ Folder hashing just got more right!


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 628eeca and e8f040a.

📒 Files selected for processing (1)
  • src/scanoss/file_filters.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/scanoss/file_filters.py
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: build


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🔭 Outside diff range comments (1)
src/scanoss/file_filters.py (1)

514-521: 💡 Verification agent

🧩 Analysis chain

Update the doc-string & type handling for scan_root
The parameter was changed to Optional[str], but the doc-string (lines 518-521) and the run-time handling (line 530) still assume a non-None str. Calling relative_to() with a None value will raise, so an early-return or guard clause is advisable, and the docs should be synchronised.

@@
-    scan_root (str): Root directory to scan or fingerprint
+    scan_root (Optional[str]): Root directory to scan or fingerprint.
+        When None the paths are returned as absolute paths.

🏁 Script executed:

#!/bin/bash
# Locate the file_filters.py file
file=$(fd file_filters.py -t f)
echo "Inspecting: $file"
# Print lines around the doc-string and the runtime handling of scan_root
sed -n '500,580p' "$file"

Length of output: 406


🏁 Script executed:

#!/bin/bash
# Locate the src/scanoss/file_filters.py file specifically
file=$(fd file_filters.py -t f | grep 'src/scanoss/file_filters.py')
echo "Inspecting: $file"
# Print lines around the doc-string and the runtime handling of scan_root
sed -n '500,580p' "$file"

Length of output: 3643


Update doc-string for scan_root in get_filtered_files_from_files
The scan_root parameter is now Optional[str], so the doc-string should reflect the new type and describe the None-case behavior.

• File: src/scanoss/file_filters.py
• Method: get_filtered_files_from_files

Suggested diff:

     Args:
         files (List[str]): List of files to scan or fingerprint
-        scan_root (str): Root directory to scan or fingerprint
+        scan_root (Optional[str]): Root directory to scan or fingerprint.
+            When None, returned paths will be absolute.
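The guard this comment asks for can be sketched as follows; the function name and shape are hypothetical, not the real get_filtered_files_from_files:

```python
from pathlib import Path
from typing import List, Optional

def filtered_paths(files: List[str], scan_root: Optional[str]) -> List[str]:
    # Guard for the Optional scan_root: when None, skip relative_to()
    # entirely and return the paths as given (absolute).
    if scan_root is None:
        return list(files)
    return [str(Path(f).relative_to(scan_root)) for f in files]

print(filtered_paths(['/repo/src/a.py'], '/repo'))  # relative to the root
print(filtered_paths(['/repo/src/a.py'], None))     # unchanged, absolute
```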
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5ad4793 and 5cca6f3.

📒 Files selected for processing (2)
  • src/scanoss/file_filters.py (3 hunks)
  • src/scanoss/scanners/folder_hasher.py (4 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/scanoss/scanners/folder_hasher.py (5)
src/scanoss/utils/simhash.py (4)
  • simhash (125-130)
  • WordFeatureSet (163-169)
  • fingerprint (100-109)
  • vectorize_bytes (84-97)
src/scanoss/scanners/scanner_hfh.py (1)
  • present (129-131)
src/scanoss/results.py (1)
  • present (273-275)
src/scanoss/utils/abstract_presenter.py (1)
  • present (28-55)
src/scanoss/scanners/container_scanner.py (1)
  • present (381-383)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: build
🔇 Additional comments (2)
src/scanoss/scanners/folder_hasher.py (2)

80-82: Constructor signature change breaks backward compatibility
config is now mandatory (FolderHasherConfig, not Optional), yet callers in the tree (e.g. tests, CLI entry points) may still pass None. Consider providing a sensible default:

def __init__(self, scan_dir: str, config: Optional[FolderHasherConfig] = None, ...):
    config = config or FolderHasherConfig()

Without this, existing integrations will raise TypeError.
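The suggested default behaves like this in isolation; FolderHasherConfig here is a stand-in dataclass, not the real class:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FolderHasherConfig:  # stand-in for the real config class
    rank_threshold: int = 5

class FolderHasher:
    # Backward-compatible sketch: callers passing None (or nothing) still work,
    # because a default config is created on the fly.
    def __init__(self, scan_dir: str, config: Optional[FolderHasherConfig] = None):
        self.scan_dir = scan_dir
        self.config = config or FolderHasherConfig()

h = FolderHasher('.')  # no config supplied: default is created
print(h.config.rank_threshold)
```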


256-270: Potential misuse of vectorize_bytes input
file_hashes.append(file.key) appends the entire CRC digest; vectorize_bytes will iterate through these 8-byte sequences as separate features, which is fine. If key accidentally becomes List[bytes] (see earlier issue) you will instead pass a list of lists causing the FNV hash to fail.

Fixing the type annotation as suggested earlier avoids this failure path.
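Why the annotation matters can be shown with a toy FNV-1a hash standing in for the one used by vectorize_bytes; the real implementation lives in src/scanoss/utils/simhash.py and may differ:

```python
from typing import List

def fnv1a_64(data: bytes) -> int:
    # Minimal FNV-1a over a byte sequence; a stand-in, not the project's hash.
    h = 0xCBF29CE484222325
    for b in data:
        h ^= b
        h = (h * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return h

def vectorize(features: List[bytes]) -> List[int]:
    return [fnv1a_64(f) for f in features]

digest = b'\x01\x02\x03\x04\x05\x06\x07\x08'  # an 8-byte CRC-style digest
print(vectorize([digest]))  # fine: each feature is a bytes object

# vectorize([[digest]]) would raise TypeError inside fnv1a_64, because
# iterating a list yields bytes objects, not ints to XOR with.
```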


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (1)
src/scanoss/scanners/folder_hasher.py (1)

38-41: key parameter type should be bytes, not List[bytes].

As noted in the previous review, CRC64.get_hash_buff() returns an 8-byte digest (bytes object), and the code expects a bytes object, not a list. This type annotation is incorrect and will cause runtime errors.

Apply this diff to fix the type annotation:

-    def __init__(self, path: str, key: List[bytes], key_str: str):
+    def __init__(self, path: str, key: bytes, key_str: str):
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b84ae46 and 7278a6d.

📒 Files selected for processing (5)
  • src/scanoss/api/common/v2/scanoss_common_pb2.py (2 hunks)
  • src/scanoss/api/common/v2/scanoss_common_pb2_grpc.py (1 hunks)
  • src/scanoss/api/scanning/v2/scanoss_scanning_pb2.py (2 hunks)
  • src/scanoss/api/scanning/v2/scanoss_scanning_pb2_grpc.py (5 hunks)
  • src/scanoss/scanners/folder_hasher.py (6 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
src/scanoss/api/common/v2/scanoss_common_pb2_grpc.py

4-4: warnings imported but unused

Remove unused import: warnings

(F401)


20-20: f-string without any placeholders

Remove extraneous f prefix

(F541)

src/scanoss/api/scanning/v2/scanoss_scanning_pb2_grpc.py

4-4: warnings imported but unused

Remove unused import: warnings

(F401)


7-7: Line too long (122 > 120)

(E501)


22-22: f-string without any placeholders

Remove extraneous f prefix

(F541)

src/scanoss/api/scanning/v2/scanoss_scanning_pb2.py

30-30: Line too long (2520 > 120)

(E501)


37-37: Line too long (391 > 120)

(E501)


49-49: Line too long (130 > 120)

(E501)


51-51: Line too long (144 > 120)

(E501)

🪛 Pylint (3.3.7)
src/scanoss/api/scanning/v2/scanoss_scanning_pb2_grpc.py

[error] 47-47: Module 'scanoss.api.scanning.v2.scanoss_scanning_pb2' has no 'HFHRequest' member

(E1101)


[error] 48-48: Module 'scanoss.api.scanning.v2.scanoss_scanning_pb2' has no 'HFHResponse' member

(E1101)


[error] 112-112: Module 'scanoss.api.common.v2.scanoss_common_pb2' has no 'EchoRequest' member

(E1101)


[error] 113-113: Module 'scanoss.api.common.v2.scanoss_common_pb2' has no 'EchoResponse' member

(E1101)


[error] 139-139: Module 'scanoss.api.scanning.v2.scanoss_scanning_pb2' has no 'HFHRequest' member

(E1101)


[error] 140-140: Module 'scanoss.api.scanning.v2.scanoss_scanning_pb2' has no 'HFHResponse' member

(E1101)

src/scanoss/api/scanning/v2/scanoss_scanning_pb2.py

[error] 9-9: No name 'runtime_version' in module 'google.protobuf'

(E0611)

src/scanoss/api/common/v2/scanoss_common_pb2.py

[error] 9-9: No name 'runtime_version' in module 'google.protobuf'

(E0611)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: build
🔇 Additional comments (6)
src/scanoss/api/common/v2/scanoss_common_pb2.py (1)

1-50: Auto-generated protobuf code looks good.

The runtime version validation and code reorganization follow standard protobuf practices. The pylint warning about runtime_version can be safely ignored as it's available in protobuf 5.27.2.

src/scanoss/api/scanning/v2/scanoss_scanning_pb2_grpc.py (1)

44-49: Registered method flags are correctly added.

The addition of _registered_method=True to all gRPC method registrations and calls is correct and aligns with gRPC 1.67.0 requirements.

Also applies to: 88-88, 122-122, 149-149

src/scanoss/api/scanning/v2/scanoss_scanning_pb2.py (1)

1-67: Auto-generated protobuf schema updates look good.

The schema enhancements adding sim_hash_dir_names and lang_extensions fields align with the folder hasher implementation. Line length warnings for serialized descriptors can be safely ignored in auto-generated code.

src/scanoss/scanners/folder_hasher.py (3)

197-205: Good error handling for relative path calculation.

The implementation properly handles the case where relative_to() might fail with a ValueError, providing a sensible fallback.
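The fallback pattern being praised looks roughly like this; safe_relative is an illustrative name, not the real method:

```python
from pathlib import Path

def safe_relative(path: str, root: str) -> str:
    # Try to compute a path relative to root; when the path is not under
    # root, relative_to() raises ValueError and we fall back to the original.
    try:
        return str(Path(path).relative_to(root))
    except ValueError:
        return path

print(safe_relative('/proj/src/main.py', '/proj'))  # src/main.py
print(safe_relative('/elsewhere/x.py', '/proj'))    # /elsewhere/x.py
```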


232-294: Enhanced hash calculation with directory and language tracking.

The implementation correctly:

  • Tracks unique file names without extensions
  • Collects unique directory names
  • Counts file extensions by language
  • Computes separate simhashes for files and directories

The logic is well-structured and aligns with the protobuf schema changes.
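The bookkeeping listed above can be sketched in a few lines; the names and return shape are assumptions, not the real folder_hasher.py code:

```python
import os
from collections import Counter
from typing import Dict, List, Set, Tuple

def folder_name_stats(paths: List[str]) -> Tuple[Set[str], Set[str], Dict[str, int]]:
    # Collect unique file names without extensions, unique directory names,
    # and a count of file extensions (illustrative stand-in logic).
    file_names: Set[str] = set()
    dir_names: Set[str] = set()
    ext_counts: Counter = Counter()
    for p in paths:
        stem, ext = os.path.splitext(os.path.basename(p))
        file_names.add(stem)
        if ext:
            ext_counts[ext.lstrip('.')] += 1
        parent = os.path.dirname(p)
        if parent and parent != '.':
            dir_names.add(os.path.basename(parent))
    return file_names, dir_names, dict(ext_counts)

names, dirs, exts = folder_name_stats(['src/a.py', 'src/b.py', 'docs/a.md'])
print(sorted(names), sorted(dirs), exts)
```

Separate simhashes for the file-name and directory-name sets would then be computed from these collections.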


252-255: Consider validating directory extraction logic.

The logic extracts the last directory component but excludes cases where it equals the full path. Ensure this handles all edge cases correctly, especially for files in the root directory.

#!/bin/bash
# Description: Check how the directory extraction logic handles edge cases

# Create a test script to verify directory extraction behavior
cat > test_dir_extraction.py << 'EOF'
import os

test_cases = [
    "file.txt",  # File in root
    "./file.txt",  # File in current dir
    "dir/file.txt",  # File in subdirectory
    "dir/subdir/file.txt",  # File in nested directory
    "/absolute/path/file.txt",  # Absolute path
]

for path in test_cases:
    current_directory = os.path.dirname(path)
    if current_directory and current_directory != '.':
        last_directory = os.path.basename(current_directory)
        valid = last_directory != current_directory and last_directory not in ['.', '..']
        print(f"Path: {path}")
        print(f"  Current dir: {current_directory}")
        print(f"  Last dir: {last_directory}")
        print(f"  Valid: {valid}")
    else:
        print(f"Path: {path} - Skipped (empty or current dir)")
    print()
EOF

python test_dir_extraction.py
rm test_dir_extraction.py

@matiasdaloia force-pushed the feature/mdaloia/SP-2587-SCANOSS-PY-Add-directory-hashes-to-folder-hashing-command branch from 162716e to 450ac5e on June 30, 2025 11:42

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7e137c4 and 450ac5e.

📒 Files selected for processing (7)
  • docs/source/index.rst (0 hunks)
  • src/scanoss/api/scanning/v2/scanoss_scanning_pb2.py (1 hunks)
  • src/scanoss/cli.py (3 hunks)
  • src/scanoss/constants.py (1 hunks)
  • src/scanoss/file_filters.py (3 hunks)
  • src/scanoss/scanners/folder_hasher.py (6 hunks)
  • src/scanoss/scanners/scanner_hfh.py (6 hunks)
💤 Files with no reviewable changes (1)
  • docs/source/index.rst
✅ Files skipped from review due to trivial changes (1)
  • src/scanoss/constants.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/scanoss/file_filters.py
  • src/scanoss/cli.py
  • src/scanoss/scanners/folder_hasher.py
🧰 Additional context used
🪛 Flake8 (7.2.0)
src/scanoss/api/scanning/v2/scanoss_scanning_pb2.py

[error] 25-25: indentation is not a multiple of 4

(E111)


[error] 26-26: indentation is not a multiple of 4

(E111)


[error] 27-27: indentation is not a multiple of 4

(E111)


[error] 28-28: indentation is not a multiple of 4

(E111)


[error] 29-29: indentation is not a multiple of 4

(E111)


[error] 30-30: indentation is not a multiple of 4

(E111)


[error] 31-31: indentation is not a multiple of 4

(E111)


[error] 32-32: indentation is not a multiple of 4

(E111)


[error] 33-33: indentation is not a multiple of 4

(E111)


[error] 33-33: missing whitespace around operator

(E225)


[error] 34-34: indentation is not a multiple of 4

(E111)


[error] 34-34: missing whitespace around operator

(E225)


[error] 35-35: indentation is not a multiple of 4

(E111)


[error] 35-35: missing whitespace around operator

(E225)


[error] 36-36: indentation is not a multiple of 4

(E111)


[error] 36-36: missing whitespace around operator

(E225)


[error] 37-37: indentation is not a multiple of 4

(E111)


[error] 37-37: missing whitespace around operator

(E225)


[error] 38-38: indentation is not a multiple of 4

(E111)


[error] 38-38: missing whitespace around operator

(E225)


[error] 39-39: indentation is not a multiple of 4

(E111)


[error] 39-39: missing whitespace around operator

(E225)


[error] 40-40: indentation is not a multiple of 4

(E111)


[error] 40-40: missing whitespace around operator

(E225)


[error] 41-41: indentation is not a multiple of 4

(E111)


[error] 41-41: missing whitespace around operator

(E225)


[error] 42-42: indentation is not a multiple of 4

(E111)


[error] 42-42: missing whitespace around operator

(E225)


[error] 43-43: indentation is not a multiple of 4

(E111)


[error] 43-43: missing whitespace around operator

(E225)


[error] 44-44: indentation is not a multiple of 4

(E111)


[error] 44-44: missing whitespace around operator

(E225)


[error] 45-45: indentation is not a multiple of 4

(E111)


[error] 45-45: missing whitespace around operator

(E225)


[error] 46-46: indentation is not a multiple of 4

(E111)


[error] 46-46: missing whitespace around operator

(E225)


[error] 47-47: indentation is not a multiple of 4

(E111)


[error] 47-47: missing whitespace around operator

(E225)


[error] 48-48: indentation is not a multiple of 4

(E111)


[error] 48-48: missing whitespace around operator

(E225)

🪛 Ruff (0.11.9)
src/scanoss/api/scanning/v2/scanoss_scanning_pb2.py

19-19: Line too long (2500 > 120)

(E501)


26-26: Line too long (391 > 120)

(E501)


30-30: Line too long (130 > 120)

(E501)


32-32: Line too long (144 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: build
🔇 Additional comments (4)
src/scanoss/api/scanning/v2/scanoss_scanning_pb2.py (1)

19-48: Protobuf-generated code - formatting issues are expected and should not be manually fixed.

This file is automatically generated by the protocol buffer compiler as indicated by the header comment "DO NOT EDIT!". The static analysis warnings about indentation, line length, and missing whitespace around operators are expected in generated code and should not be manually addressed, as any changes would be overwritten when the protobuf definitions are recompiled.

The schema changes reflected in this generated code (replacing best_match/threshold with rank_threshold and adding new fields) align with the API evolution described in the summary.

src/scanoss/scanners/scanner_hfh.py (3)

32-33: LGTM - New imports align with functionality additions.

The imports for DEFAULT_HFH_RANK_THRESHOLD and CycloneDx are properly utilized in the constructor default parameter and the new CycloneDX output formatting method respectively.


57-57: LGTM - Constructor properly updated for new API.

The constructor changes correctly:

  • Add the new rank_threshold parameter with appropriate default from constants
  • Update the docstring to reflect the new parameter
  • Store the parameter as an instance variable

This aligns with the API evolution from separate best_match/threshold parameters to a unified rank_threshold.

Also applies to: 67-67, 95-95


106-106: LGTM - Scan method correctly updated for new API.

The gRPC request dictionary is properly updated to use rank_threshold instead of the removed best_match and threshold parameters, consistent with the protobuf schema changes.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 450ac5e and 44584b6.

📒 Files selected for processing (1)
  • src/scanoss/scanners/scanner_hfh.py (6 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/scanoss/scanners/scanner_hfh.py (3)
src/scanoss/cyclonedx.py (1)
  • CycloneDx (37-302)
src/scanoss/scanossbase.py (1)
  • print_stderr (45-49)
src/scanoss/scanossgrpc.py (1)
  • get_dependencies (248-255)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: build
🔇 Additional comments (5)
src/scanoss/scanners/scanner_hfh.py (5)

32-33: LGTM: Clean import additions.

The new imports are well-organized and support the updated functionality for rank threshold configuration and CycloneDX output formatting.


57-57: LGTM: Improved parameter consolidation.

The change from separate best_match and threshold parameters to a unified rank_threshold with a default constant simplifies the API and improves maintainability.


67-67: LGTM: Updated documentation reflects the parameter change.

The docstring accurately describes the new rank_threshold parameter's purpose.


95-95: LGTM: Instance variable assignment is correct.

The assignment properly stores the rank threshold for use in the scan method.


106-106: LGTM: Request parameter updated consistently.

The gRPC request now correctly uses the unified rank_threshold parameter instead of the previous separate parameters.

Comment on lines 165 to +209
def _format_cyclonedx_output(self) -> str:
    raise NotImplementedError('CycloneDX output is not implemented')
    if not self.scanner.scan_results:
        return ''
    try:
        if 'results' not in self.scanner.scan_results or not self.scanner.scan_results['results']:
            self.base.print_stderr('ERROR: No scan results found')
            return ''

        first_result = self.scanner.scan_results['results'][0]

        best_match_components = [c for c in first_result.get('components', []) if c.get('order') == 1]
        if not best_match_components:
            self.base.print_stderr('ERROR: No best match component found')
            return ''

        best_match_component = best_match_components[0]
        if not best_match_component.get('versions'):
            self.base.print_stderr('ERROR: No versions found for best match component')
            return ''

        best_match_version = best_match_component['versions'][0]
        purl = best_match_component['purl']

        get_dependencies_json_request = {
            'files': [
                {
                    'file': f'{best_match_component["name"]}:{best_match_version["version"]}',
                    'purls': [{'purl': purl, 'requirement': best_match_version['version']}],
                }
            ]
        }

        decorated_scan_results = self.scanner.client.get_dependencies(get_dependencies_json_request)

        cdx = CycloneDx(self.base.debug, self.output_file)
        scan_results = {}
        for f in decorated_scan_results['files']:
            scan_results[f['file']] = [f]
        if not cdx.produce_from_json(scan_results, self.output_file):
            error_msg = 'ERROR: Failed to produce CycloneDX output'
            self.base.print_stderr(error_msg)
            raise ValueError(error_msg)
    except Exception as e:
        self.base.print_stderr(f'ERROR: Failed to get license information: {e}')
        return None

🛠️ Refactor suggestion

⚠️ Potential issue

Fix inconsistent return type and improve error handling.

The method shows significant improvement over previous versions by adding proper data validation and error handling. However, there are still some issues:

  1. Inconsistent return type: Line 209 returns None while other error cases return empty string
  2. Exception handling could be more granular: The broad try-catch might mask specific errors
  3. Method complexity: The method is quite long and handles multiple responsibilities

Apply this diff to fix the inconsistent return type:

         except Exception as e:
             self.base.print_stderr(f'ERROR: Failed to get license information: {e}')
-            return None
+            return ''

Consider refactoring this method into smaller helper methods for better maintainability:

+    def _get_best_match_component(self, scan_results: dict) -> dict:
+        """Extract and validate the best match component from scan results."""
+        if 'results' not in scan_results or not scan_results['results']:
+            raise ValueError('No scan results found')
+        
+        first_result = scan_results['results'][0]
+        best_match_components = [c for c in first_result.get('components', []) if c.get('order') == 1]
+        if not best_match_components:
+            raise ValueError('No best match component found')
+        
+        best_match_component = best_match_components[0]
+        if not best_match_component.get('versions'):
+            raise ValueError('No versions found for best match component')
+        
+        return best_match_component
+
+    def _create_dependencies_request(self, component: dict, version: dict, purl: str) -> dict:
+        """Create the dependencies request JSON."""
+        return {
+            'files': [
+                {
+                    'file': f'{component["name"]}:{version["version"]}',
+                    'purls': [{'purl': purl, 'requirement': version['version']}],
+                }
+            ]
+        }
🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 165-165: Either all return statements in a function should return an expression, or none of them should.

(R1710)

🤖 Prompt for AI Agents
In src/scanoss/scanners/scanner_hfh.py between lines 165 and 209, the method
_format_cyclonedx_output has inconsistent return types, returning an empty
string on some errors but None on exceptions. To fix this, ensure all error
paths return the same type, preferably an empty string for consistency.
Additionally, refactor the broad try-except block into more specific exception
handling to avoid masking errors. Finally, split the method into smaller helper
functions to reduce complexity and improve maintainability.
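The hazard R1710 flags is easy to reproduce in isolation. `render` below is a hypothetical stand-in, not the scanner code: one error path returns `''`, another returns `None`, and any caller that does `len(result)` or string concatenation breaks only on the second path.

```python
def render(data):
    """Mixed error returns: '' on missing input, None on a caught exception."""
    if not data:
        return ''          # error path A: falsy input
    try:
        return ','.join(data)
    except TypeError:
        return None        # error path B: join over non-strings


assert render([]) == ''            # safe: len('') works
assert render([1, 2]) is None      # caller's len(...) would now raise TypeError
```

Normalizing both paths to `''` (as the suggested helpers above do by raising `ValueError` and letting one handler return) keeps the declared `-> str` contract honest.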

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 44584b6 and 7333c97.

📒 Files selected for processing (3)
  • CLIENT_HELP.md (1 hunks)
  • docs/source/index.rst (1 hunks)
  • src/scanoss/cli.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/source/index.rst
🧰 Additional context used
🪛 LanguageTool
CLIENT_HELP.md

[uncategorized] ~490-~490: Loose punctuation mark.
Context: ...ate higher quality matches. - --format: Result output format (json or cyclonedx...

(UNLIKELY_OPENING_PUNCTUATION)

🪛 Ruff (0.11.9)
src/scanoss/cli.py

635-635: Line too long (157 > 120)

(E501)

🪛 GitHub Actions: Lint
src/scanoss/cli.py

[error] 635-635: Ruff: Line too long (157 > 120) (E501)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: build
🔇 Additional comments (4)
CLIENT_HELP.md (1)

488-496: Excellent documentation for new CLI options.

The new Options section clearly documents the --rank-threshold and --format parameters with helpful descriptions and examples. The explanation that lower rank values indicate higher quality matches is particularly valuable for users.

src/scanoss/cli.py (3)

57-57: LGTM: Constant import added.

The import of DEFAULT_HFH_RANK_THRESHOLD aligns with the new rank threshold functionality.


627-627: LGTM: CycloneDX format support added.

Adding 'cyclonedx' to the format choices correctly implements the new output format option documented in the help file.


1961-1961: LGTM: Scanner constructor updated correctly.

The change from setting scanner.best_match and scanner.threshold properties to passing rank_threshold directly to the constructor aligns with the API changes described in the summary.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (1)
src/scanoss/scanners/folder_hasher.py (1)

38-41: Type annotation inconsistency causes runtime error

The type annotation change creates a mismatch. CRC64.get_hash_buff() returns bytes (line 159), but the annotation now claims List[bytes]. This will cause confusion and potential runtime errors when file.key is used in vectorize_bytes() which expects individual bytes objects, not lists.

-    def __init__(self, path: str, key: List[bytes], key_str: str):
+    def __init__(self, path: str, key: bytes, key_str: str):
🧹 Nitpick comments (1)
src/scanoss/scanners/folder_hasher.py (1)

18-18: Misleading constant name with current usage

The constant is renamed to MAXIMUM_FILE_NAME_LENGTH but line 143 uses <= MAXIMUM_FILE_NAME_LENGTH, which suggests files with names shorter than or equal to this length are kept. This naming is confusing - either rename to FILE_NAME_LENGTH_LIMIT or change the logic to use > for exclusion.

-MAXIMUM_FILE_NAME_LENGTH = 32
+FILE_NAME_LENGTH_LIMIT = 32
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7333c97 and 1d25624.

📒 Files selected for processing (1)
  • src/scanoss/scanners/folder_hasher.py (7 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: build
🔇 Additional comments (2)
src/scanoss/scanners/folder_hasher.py (2)

197-214: Improved relative path calculation

The relative path calculation with proper error handling is a good improvement over the previous implementation. This addresses the environment-dependency issue mentioned in past comments.


287-290: Good addition of debug logging

The debug logging provides valuable insights into the simhash calculation process, making debugging easier.

ext_without_dot = extension[1:]
extension_map[ext_without_dot] = extension_map.get(ext_without_dot, 0) + 1

current_directory.removeprefix(self.scan_dir)

⚠️ Potential issue

String method call has no effect

removeprefix() returns a new string but the result isn't assigned back to current_directory. This line is a no-op.

-            current_directory.removeprefix(self.scan_dir)
+            current_directory = current_directory.removeprefix(self.scan_dir)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
current_directory.removeprefix(self.scan_dir)
- current_directory.removeprefix(self.scan_dir)
+ current_directory = current_directory.removeprefix(self.scan_dir)
🤖 Prompt for AI Agents
In src/scanoss/scanners/folder_hasher.py at line 252, the call to removeprefix()
on current_directory does not modify the string in place and its result is not
assigned back. Fix this by assigning the result of
current_directory.removeprefix(self.scan_dir) back to current_directory to
ensure the prefix is actually removed.
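The no-op is easy to demonstrate: `str.removeprefix` (added in Python 3.9) returns a new string and never mutates its receiver, so the result must be reassigned. A 3.8-compatible helper is sketched alongside; the paths here are illustrative only:

```python
path = '/scan/dir/src/pkg'
scan_dir = '/scan/dir'

path.removeprefix(scan_dir)          # no-op: the new string is discarded
assert path == '/scan/dir/src/pkg'   # original unchanged

stripped = path.removeprefix(scan_dir).lstrip('/')
assert stripped == 'src/pkg'


def strip_prefix(s: str, prefix: str) -> str:
    """Python 3.8-compatible equivalent of str.removeprefix."""
    return s[len(prefix):] if s.startswith(prefix) else s


assert strip_prefix(path, scan_dir) == '/src/pkg'
```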

Comment on lines 267 to +276
selected_names.sort()
concatenated_names = ''.join(selected_names)

if len(concatenated_names.encode('utf-8')) < MINIMUM_CONCATENATED_NAME_LENGTH:
return {
'name_hash': None,
'content_hash': None,
}
return {'name_hash': None, 'content_hash': None, 'dir_hash': None, 'lang_extensions': None}

# Concatenate the unique file names without the extensions, adding a space and sorting them alphabetically
unique_file_names_list = list(unique_file_names)
unique_file_names_list.sort()
concatenated_names = ' '.join(unique_file_names_list)

🛠️ Refactor suggestion

Variable reassignment creates confusion and potential bugs

The variable concatenated_names is first assigned with ''.join(selected_names) (line 268), used for length validation (line 270), then immediately reassigned with a completely different value ' '.join(unique_file_names_list) (line 276). This is confusing and makes the length check potentially meaningless since it's checking the length of concatenated file names with extensions, but then using concatenated file names without extensions for simhash calculation.

         selected_names.sort()
-        concatenated_names = ''.join(selected_names)
+        concatenated_names_with_ext = ''.join(selected_names)
 
-        if len(concatenated_names.encode('utf-8')) < MINIMUM_CONCATENATED_NAME_LENGTH:
+        if len(concatenated_names_with_ext.encode('utf-8')) < MINIMUM_CONCATENATED_NAME_LENGTH:
             return {'name_hash': None, 'content_hash': None, 'dir_hash': None, 'lang_extensions': None}
 
         # Concatenate the unique file names without the extensions, adding a space and sorting them alphabetically
         unique_file_names_list = list(unique_file_names)
         unique_file_names_list.sort()
-        concatenated_names = ' '.join(unique_file_names_list)
+        concatenated_names_without_ext = ' '.join(unique_file_names_list)

Then update line 283:

-        names_simhash = simhash(WordFeatureSet(concatenated_names.encode('utf-8')))
+        names_simhash = simhash(WordFeatureSet(concatenated_names_without_ext.encode('utf-8')))
📝 Committable suggestion


Suggested change
selected_names.sort()
concatenated_names = ''.join(selected_names)
if len(concatenated_names.encode('utf-8')) < MINIMUM_CONCATENATED_NAME_LENGTH:
return {
'name_hash': None,
'content_hash': None,
}
return {'name_hash': None, 'content_hash': None, 'dir_hash': None, 'lang_extensions': None}
# Concatenate the unique file names without the extensions, adding a space and sorting them alphabetically
unique_file_names_list = list(unique_file_names)
unique_file_names_list.sort()
concatenated_names = ' '.join(unique_file_names_list)
selected_names.sort()
concatenated_names_with_ext = ''.join(selected_names)
if len(concatenated_names_with_ext.encode('utf-8')) < MINIMUM_CONCATENATED_NAME_LENGTH:
return {'name_hash': None, 'content_hash': None, 'dir_hash': None, 'lang_extensions': None}
# Concatenate the unique file names without the extensions, adding a space and sorting them alphabetically
unique_file_names_list = list(unique_file_names)
unique_file_names_list.sort()
concatenated_names_without_ext = ' '.join(unique_file_names_list)
names_simhash = simhash(WordFeatureSet(concatenated_names_without_ext.encode('utf-8')))
🤖 Prompt for AI Agents
In src/scanoss/scanners/folder_hasher.py around lines 267 to 276, avoid
reassigning the variable concatenated_names with two different concatenations
that differ in content and format, which causes confusion and invalidates the
length check. Instead, use separate variables for the concatenation used in the
length validation (with extensions and no spaces) and for the concatenation used
later (without extensions and joined by spaces). Update the length check to use
the appropriate variable and ensure the simhash calculation uses the correctly
concatenated names without extensions.
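A standalone sketch of the two-variable fix. The threshold value and the `os.path.splitext`-based stem stripping are assumptions for illustration; the real code derives `unique_file_names` elsewhere:

```python
import os

MINIMUM_CONCATENATED_NAME_LENGTH = 32  # assumed threshold for illustration


def name_keys(file_names):
    # Length gate: names *with* extensions, sorted, no separator
    with_ext = ''.join(sorted(file_names))
    if len(with_ext.encode('utf-8')) < MINIMUM_CONCATENATED_NAME_LENGTH:
        return None
    # Simhash input: unique stems *without* extensions, sorted, space-joined
    stems = sorted({os.path.splitext(n)[0] for n in file_names})
    return with_ext, ' '.join(stems)


assert name_keys(['a.py']) is None  # too short: fails the length gate
_, without_ext = name_keys(['alpha.py', 'beta.py', 'delta.py', 'epsilon.py', 'gamma.py'])
assert without_ext == 'alpha beta delta epsilon gamma'
```

Keeping the two concatenations in separately named variables makes it explicit that the length gate and the simhash input are intentionally computed from different forms of the names.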

Comment on lines +281 to +284
concatenated_directories = ' '.join(unique_directories_list)

names_simhash = simhash(WordFeatureSet(concatenated_names.encode('utf-8')))
dir_simhash = simhash(WordFeatureSet(concatenated_directories.encode('utf-8')))

💡 Verification agent

🧩 Analysis chain

Potential issue with empty directory simhash

If no valid directories are found (empty unique_directories), concatenated_directories will be an empty string, and WordFeatureSet('') might produce unexpected simhash results.

Verify how WordFeatureSet handles empty strings and ensure directory simhash behaves correctly when no directories are present.


🏁 Script executed:

#!/bin/bash
# Description: Check how WordFeatureSet handles empty strings in the simhash utility

ast-grep --pattern 'class WordFeatureSet {
  $$$
}'

# Also check if there are any tests for empty string handling
rg -A 5 -B 5 "WordFeatureSet.*empty|empty.*WordFeatureSet"

Length of output: 114


🏁 Script executed:

#!/bin/bash
# Locate any local definitions of WordFeatureSet
rg -n "class WordFeatureSet" -t py || rg -n "def WordFeatureSet" -t py

# If none found locally, attempt to inspect the external simhash library
python3 - <<'EOF'
import inspect
try:
    from simhash import WordFeatureSet
    print(inspect.getsource(WordFeatureSet))
except Exception as e:
    print("Error fetching WordFeatureSet from simhash:", e)
EOF

Length of output: 173


🏁 Script executed:

#!/bin/bash
# Show the WordFeatureSet class definition and methods
sed -n '150,260p' src/scanoss/utils/simhash.py

# Show the simhash() function to see how features are processed
sed -n '1,100p' src/scanoss/utils/simhash.py

Length of output: 4683


🏁 Script executed:

#!/bin/bash
# Extract the simhash function and its helper fingerprint to see handling of empty feature lists
rg -n "def simhash" -n 1,200 src/scanoss/utils/simhash.py -A 20

rg -n "def fingerprint" -n 1,200 src/scanoss/utils/simhash.py -A 20

Length of output: 3159


Clarify empty‐directory simhash behavior

We verified that when concatenated_directories is empty,
WordFeatureSet(b'') → no features → vector of zeros → fingerprint sets all 64 bits → dir_simhash == 0xFFFFFFFFFFFFFFFF.

If you’d rather treat “no directories” as a distinct value (e.g. 0), add a guard:

• In src/scanoss/scanners/folder_hasher.py around lines 281–284:

- dir_simhash = simhash(WordFeatureSet(concatenated_directories.encode('utf-8')))
+ if not unique_directories_list:
+     dir_simhash = 0
+ else:
+     dir_simhash = simhash(WordFeatureSet(concatenated_directories.encode('utf-8')))
📝 Committable suggestion


Suggested change
concatenated_directories = ' '.join(unique_directories_list)
names_simhash = simhash(WordFeatureSet(concatenated_names.encode('utf-8')))
dir_simhash = simhash(WordFeatureSet(concatenated_directories.encode('utf-8')))
concatenated_directories = ' '.join(unique_directories_list)
names_simhash = simhash(WordFeatureSet(concatenated_names.encode('utf-8')))
if not unique_directories_list:
dir_simhash = 0
else:
dir_simhash = simhash(WordFeatureSet(concatenated_directories.encode('utf-8')))
🤖 Prompt for AI Agents
In src/scanoss/scanners/folder_hasher.py around lines 281 to 284, the simhash
calculation for concatenated_directories results in a value of
0xFFFFFFFFFFFFFFFF when the input is empty, which may be misleading. Add a guard
to check if concatenated_directories is empty before computing the simhash; if
it is empty, set dir_simhash to 0 explicitly to represent the absence of
directories distinctly.
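The guard can be sketched independently of the project's simhash utilities. `hash_fn` below stands in for the `simhash(WordFeatureSet(...))` pipeline and is an assumption:

```python
def dir_hash_or_sentinel(directories, hash_fn):
    # An empty feature set would otherwise fingerprint to 0xFFFFFFFFFFFFFFFF;
    # return 0 as a distinct "no directories" sentinel instead.
    if not directories:
        return 0
    concatenated = ' '.join(sorted(directories))
    return hash_fn(concatenated.encode('utf-8'))


assert dir_hash_or_sentinel([], len) == 0
assert dir_hash_or_sentinel(['src', 'tests'], len) == len(b'src tests')
```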

@matiasdaloia matiasdaloia force-pushed the feature/mdaloia/SP-2587-SCANOSS-PY-Add-directory-hashes-to-folder-hashing-command branch from 1d25624 to 2933009 Compare June 30, 2025 16:25
@eeisegn eeisegn self-requested a review June 30, 2025 17:06
ext_without_dot = extension[1:]
extension_map[ext_without_dot] = extension_map.get(ext_without_dot, 0) + 1

current_directory.removeprefix(self.scan_dir)
Contributor

Why are we doing this for every directory? Surely we can do this once and make everything else relative?

Contributor

Also, this doesn't work on Python 3.8 (`str.removeprefix` was added in 3.9).

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
src/scanoss/scanners/folder_hasher.py (3)

38-41: Fix type annotation: key should be bytes, not List[bytes]

This is the same issue flagged in previous reviews. The CRC64.get_hash_buff() method returns a single bytes object, not a list of bytes objects.

Apply this fix:

-    def __init__(self, path: str, key: List[bytes], key_str: str):
+    def __init__(self, path: str, key: bytes, key_str: str):

243-262: Fix string method with no effect and improve directory path handling

Line 252 contains the same issue flagged in previous reviews - replace() returns a new string but the result isn't assigned.

-            current_directory.replace(self.scan_dir, '', 1).lstrip(os.path.sep)
+            current_directory = current_directory.replace(self.scan_dir, '', 1).lstrip(os.path.sep)

Additionally, consider using Path operations for more robust path handling:

-            current_directory.replace(self.scan_dir, '', 1).lstrip(os.path.sep)
-            parts = current_directory.split(os.path.sep)
+            try:
+                rel_dir = Path(current_directory).relative_to(self.scan_dir)
+                parts = rel_dir.parts
+            except ValueError:
+                parts = Path(current_directory).parts
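The `Path`-based variant suggested above behaves as follows (POSIX paths assumed; the example paths are illustrative):

```python
from pathlib import Path


def relative_parts(current_directory: str, scan_dir: str):
    # Prefer a true relative computation; fall back to the raw parts when
    # current_directory is not under scan_dir (relative_to raises ValueError)
    try:
        return Path(current_directory).relative_to(scan_dir).parts
    except ValueError:
        return Path(current_directory).parts


assert relative_parts('/scan/dir/src/pkg', '/scan/dir') == ('src', 'pkg')
assert relative_parts('/elsewhere/pkg', '/scan/dir') == ('/', 'elsewhere', 'pkg')
```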

267-276: Fix variable reassignment that invalidates length check

This is the same confusing variable reassignment issue from previous reviews. The concatenated_names variable is used for length validation, then immediately reassigned with different content.

         selected_names.sort()
-        concatenated_names = ''.join(selected_names)
+        concatenated_names_with_ext = ''.join(selected_names)
 
-        if len(concatenated_names.encode('utf-8')) < MINIMUM_CONCATENATED_NAME_LENGTH:
+        if len(concatenated_names_with_ext.encode('utf-8')) < MINIMUM_CONCATENATED_NAME_LENGTH:
             return {'name_hash': None, 'content_hash': None, 'dir_hash': None, 'lang_extensions': None}
 
         # Concatenate the unique file names without the extensions, adding a space and sorting them alphabetically
         unique_file_names_list = list(unique_file_names)
         unique_file_names_list.sort()
-        concatenated_names = ' '.join(unique_file_names_list)
+        concatenated_names_without_ext = ' '.join(unique_file_names_list)

Then update line 283:

-        names_simhash = simhash(WordFeatureSet(concatenated_names.encode('utf-8')))
+        names_simhash = simhash(WordFeatureSet(concatenated_names_without_ext.encode('utf-8')))
🧹 Nitpick comments (1)
src/scanoss/scanners/folder_hasher.py (1)

284-284: Handle empty directory case for consistent simhash behavior

When unique_directories_list is empty, WordFeatureSet(b'') produces a simhash of 0xFFFFFFFFFFFFFFFF. Consider making this explicit:

-        dir_simhash = simhash(WordFeatureSet(concatenated_directories.encode('utf-8')))
+        if not unique_directories_list:
+            dir_simhash = 0
+        else:
+            dir_simhash = simhash(WordFeatureSet(concatenated_directories.encode('utf-8')))
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 21def1b and 628eeca.

📒 Files selected for processing (2)
  • src/scanoss/file_filters.py (3 hunks)
  • src/scanoss/scanners/folder_hasher.py (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/scanoss/file_filters.py
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: build
🔇 Additional comments (7)
src/scanoss/scanners/folder_hasher.py (7)

18-18: LGTM: Constant renaming reflects correct usage

The renaming from MINIMUM_FILE_NAME_LENGTH to MAXIMUM_FILE_NAME_LENGTH accurately reflects its usage as an upper bound constraint in the file filtering logic.


80-80: LGTM: Making config parameter non-optional improves API clarity

Requiring an explicit FolderHasherConfig prevents potential issues with default values and makes the dependency explicit.


143-143: LGTM: Correct usage of the renamed constant

The file filtering logic correctly uses MAXIMUM_FILE_NAME_LENGTH as an upper bound constraint.


197-214: LGTM: Improved path handling with proper error recovery

The safe relative path calculation with try-except block and fallback logic handles edge cases gracefully. The addition of new hash fields (sim_hash_dir_names, lang_extensions) enhances the folder hashing functionality.


232-234: LGTM: New data structures for enhanced directory hashing

The introduction of unique_file_names, unique_directories, and extension_map enables more sophisticated hashing that includes directory structure and file type information.


287-291: LGTM: Comprehensive debug logging

The debug logging provides valuable insights into the unique file names, directories, and computed hashes, which will help with troubleshooting and validation.


299-299: LGTM: Explicit parameter types improve method signature

Making the optional parameters explicit with type annotations enhances code clarity and IDE support.

Labels
enhancement New feature or request

2 participants
2 participants