[SP-2587] Add directory simhash, modify concatenated names to remove extensions #119
Conversation
Walkthrough

This update modifies folder hashing and scanning features. It updates type hints, filtering logic, and simhash calculation in file and directory processing. The protobuf schema evolves with new and removed fields. CLI arguments for folder scanning are changed, and the scanning logic is refactored to use a unified rank-threshold parameter. CycloneDX output support is added. Documentation is updated accordingly.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant ScannerHFH
    participant FolderHasher
    participant gRPC_Server
    User->>CLI: Run folder-scan with --rank-threshold
    CLI->>ScannerHFH: Initialize with rank_threshold
    ScannerHFH->>FolderHasher: Prepare folder hash data
    ScannerHFH->>gRPC_Server: Send scan request (rank_threshold, sim_hash_dir_names, lang_extensions)
    gRPC_Server-->>ScannerHFH: Return scan results
    ScannerHFH-->>CLI: Output results
    CLI-->>User: Display scan output
```
Actionable comments posted: 3
🔭 Outside diff range comments (1)

src/scanoss/file_filters.py (1)

Lines 514-521: 💡 Verification agent 🧩 Analysis chain

**Update the doc-string & type handling for `scan_root`**

The parameter was changed to `Optional[str]`, but the doc-string (lines 518-521) and the run-time handling (line 530) still assume a non-None `str`. Calling `relative_to()` with a `None` value will raise, so an early-return or guard clause is advisable, and the docs should be synchronised.

```diff
@@
-        scan_root (str): Root directory to scan or fingerprint
+        scan_root (Optional[str]): Root directory to scan or fingerprint.
+            When None the paths are returned as absolute paths.
```
🏁 Script executed:

```bash
#!/bin/bash
# Locate the file_filters.py file
file=$(fd file_filters.py -t f)
echo "Inspecting: $file"
# Print lines around the doc-string and the runtime handling of scan_root
sed -n '500,580p' "$file"
```

Length of output: 406

🏁 Script executed:

```bash
#!/bin/bash
# Locate the src/scanoss/file_filters.py file specifically
file=$(fd file_filters.py -t f | grep 'src/scanoss/file_filters.py')
echo "Inspecting: $file"
# Print lines around the doc-string and the runtime handling of scan_root
sed -n '500,580p' "$file"
```

Length of output: 3643
**Update doc-string for `scan_root` in `get_filtered_files_from_files`**

The `scan_root` parameter is now `Optional[str]`, so the doc-string should reflect the new type and describe the None-case behavior.

• File: `src/scanoss/file_filters.py`
• Method: `get_filtered_files_from_files`

Suggested diff:

```diff
 Args:
     files (List[str]): List of files to scan or fingerprint
-    scan_root (str): Root directory to scan or fingerprint
+    scan_root (Optional[str]): Root directory to scan or fingerprint.
+        When None, returned paths will be absolute.
```
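The suggested early-return guard for a None `scan_root` could look like the following minimal sketch. The function name and parameters mirror the review's description of `get_filtered_files_from_files`; the body is purely illustrative and is not the project's actual implementation:

```python
import os
from pathlib import Path
from typing import List, Optional


def get_filtered_files_from_files(files: List[str], scan_root: Optional[str]) -> List[str]:
    """Illustrative sketch: paths relative to scan_root, or absolute when it is None."""
    if scan_root is None:
        # Guard clause: with no root to relativise against, fall back to absolute paths
        # instead of letting relative_to() raise on a None argument.
        return [os.path.abspath(f) for f in files]
    root = Path(scan_root)
    return [str(Path(f).relative_to(root)) for f in files]
```

This keeps the None case explicit instead of deferring the failure to `relative_to()`.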
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/scanoss/file_filters.py (3 hunks)
- src/scanoss/scanners/folder_hasher.py (4 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/scanoss/scanners/folder_hasher.py (5)
- src/scanoss/utils/simhash.py (4): simhash (125-130), WordFeatureSet (163-169), fingerprint (100-109), vectorize_bytes (84-97)
- src/scanoss/scanners/scanner_hfh.py (1): present (129-131)
- src/scanoss/results.py (1): present (273-275)
- src/scanoss/utils/abstract_presenter.py (1): present (28-55)
- src/scanoss/scanners/container_scanner.py (1): present (381-383)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: build
🔇 Additional comments (2)
src/scanoss/scanners/folder_hasher.py (2)
Lines 80-82: **Constructor signature change breaks backward compatibility**

`config` is now mandatory (`FolderHasherConfig`, not `Optional`), yet callers in the tree (e.g. tests, CLI entry points) may still pass `None`. Consider providing a sensible default:

```python
def __init__(self, scan_dir: str, config: Optional[FolderHasherConfig] = None, ...):
    config = config or FolderHasherConfig()
```

Without this, existing integrations will raise `TypeError`.
Lines 256-270: **Potential misuse of `vectorize_bytes` input**

`file_hashes.append(file.key)` appends the entire CRC digest; `vectorize_bytes` will iterate through these 8-byte sequences as separate features, which is fine. If `key` accidentally becomes `List[bytes]` (see earlier issue) you will instead pass a list of lists, causing the FNV hash to fail. Fixing the type annotation as suggested earlier avoids this failure path.
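The failure path is easy to reproduce with a hypothetical FNV-1a based vectorizer. This is a sketch only: the real `vectorize_bytes` lives in `src/scanoss/utils/simhash.py` and may differ, but any hash that iterates a `bytes` feature byte by byte breaks the same way when handed a list instead:

```python
from typing import List


def fnv1a_64(data: bytes) -> int:
    """FNV-1a over one bytes feature; iterating bytes yields ints, as the XOR expects."""
    h = 0xCBF29CE484222325
    for byte in data:
        h ^= byte
        h = (h * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return h


def vectorize_features(features: List[bytes]) -> List[int]:
    return [fnv1a_64(f) for f in features]


file_hashes = [b"\x01\x02\x03\x04\x05\x06\x07\x08"]  # one 8-byte CRC64-style digest
ok = vectorize_features(file_hashes)  # works: each element is bytes

try:
    vectorize_features([file_hashes])  # List[List[bytes]]: the failure path
except TypeError:
    pass  # h ^= byte raises, because iterating a list yields bytes objects, not ints
```

This is why the annotation fix matters: with `key: bytes` the type checker flags the list-of-lists case before it reaches the hash.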
Actionable comments posted: 4
♻️ Duplicate comments (1)
src/scanoss/scanners/folder_hasher.py (1)
Lines 38-41: **`key` parameter type should be `bytes`, not `List[bytes]`.**

As noted in the previous review, `CRC64.get_hash_buff()` returns an 8-byte digest (bytes object), and the code expects a bytes object, not a list. This type annotation is incorrect and will cause runtime errors. Apply this diff to fix the type annotation:

```diff
-    def __init__(self, path: str, key: List[bytes], key_str: str):
+    def __init__(self, path: str, key: bytes, key_str: str):
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- src/scanoss/api/common/v2/scanoss_common_pb2.py (2 hunks)
- src/scanoss/api/common/v2/scanoss_common_pb2_grpc.py (1 hunks)
- src/scanoss/api/scanning/v2/scanoss_scanning_pb2.py (2 hunks)
- src/scanoss/api/scanning/v2/scanoss_scanning_pb2_grpc.py (5 hunks)
- src/scanoss/scanners/folder_hasher.py (6 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)

src/scanoss/api/common/v2/scanoss_common_pb2_grpc.py

- 4-4: `warnings` imported but unused; remove unused import (F401)
- 20-20: f-string without any placeholders; remove extraneous `f` prefix (F541)

src/scanoss/api/scanning/v2/scanoss_scanning_pb2_grpc.py

- 4-4: `warnings` imported but unused; remove unused import (F401)
- 7-7: Line too long (122 > 120) (E501)
- 22-22: f-string without any placeholders; remove extraneous `f` prefix (F541)

src/scanoss/api/scanning/v2/scanoss_scanning_pb2.py

- 30-30: Line too long (2520 > 120) (E501)
- 37-37: Line too long (391 > 120) (E501)
- 49-49: Line too long (130 > 120) (E501)
- 51-51: Line too long (144 > 120) (E501)
🪛 Pylint (3.3.7)

src/scanoss/api/scanning/v2/scanoss_scanning_pb2_grpc.py

- [error] 47-47: Module 'scanoss.api.scanning.v2.scanoss_scanning_pb2' has no 'HFHRequest' member (E1101)
- [error] 48-48: Module 'scanoss.api.scanning.v2.scanoss_scanning_pb2' has no 'HFHResponse' member (E1101)
- [error] 112-112: Module 'scanoss.api.common.v2.scanoss_common_pb2' has no 'EchoRequest' member (E1101)
- [error] 113-113: Module 'scanoss.api.common.v2.scanoss_common_pb2' has no 'EchoResponse' member (E1101)
- [error] 139-139: Module 'scanoss.api.scanning.v2.scanoss_scanning_pb2' has no 'HFHRequest' member (E1101)
- [error] 140-140: Module 'scanoss.api.scanning.v2.scanoss_scanning_pb2' has no 'HFHResponse' member (E1101)

src/scanoss/api/scanning/v2/scanoss_scanning_pb2.py

- [error] 9-9: No name 'runtime_version' in module 'google.protobuf' (E0611)

src/scanoss/api/common/v2/scanoss_common_pb2.py

- [error] 9-9: No name 'runtime_version' in module 'google.protobuf' (E0611)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: build
🔇 Additional comments (6)
src/scanoss/api/common/v2/scanoss_common_pb2.py (1)

Lines 1-50: **Auto-generated protobuf code looks good.**

The runtime version validation and code reorganization follow standard protobuf practices. The pylint warning about `runtime_version` can be safely ignored, as it is available in protobuf 5.27.2.

src/scanoss/api/scanning/v2/scanoss_scanning_pb2_grpc.py (1)

Lines 44-49: **Registered method flags are correctly added.**

The addition of `_registered_method=True` to all gRPC method registrations and calls is correct and aligns with gRPC 1.67.0 requirements. Also applies to: 88-88, 122-122, 149-149.
src/scanoss/api/scanning/v2/scanoss_scanning_pb2.py (1)

Lines 1-67: **Auto-generated protobuf schema updates look good.**

The schema enhancements adding the `sim_hash_dir_names` and `lang_extensions` fields align with the folder hasher implementation. Line-length warnings for serialized descriptors can be safely ignored in auto-generated code.

src/scanoss/scanners/folder_hasher.py (3)
Lines 197-205: **Good error handling for relative path calculation.**

The implementation properly handles the case where `relative_to()` might fail with a ValueError, providing a sensible fallback.

Lines 232-294: **Enhanced hash calculation with directory and language tracking.**

The implementation correctly:

- Tracks unique file names without extensions
- Collects unique directory names
- Counts file extensions by language
- Computes separate simhashes for files and directories

The logic is well-structured and aligns with the protobuf schema changes.

Lines 252-255: **Consider validating directory extraction logic.**

The logic extracts the last directory component but excludes cases where it equals the full path. Ensure this handles all edge cases correctly, especially for files in the root directory.

```bash
#!/bin/bash
# Description: Check how the directory extraction logic handles edge cases
# Create a test script to verify directory extraction behavior
cat > test_dir_extraction.py << 'EOF'
import os

test_cases = [
    "file.txt",                 # File in root
    "./file.txt",               # File in current dir
    "dir/file.txt",             # File in subdirectory
    "dir/subdir/file.txt",      # File in nested directory
    "/absolute/path/file.txt",  # Absolute path
]

for path in test_cases:
    current_directory = os.path.dirname(path)
    if current_directory and current_directory != '.':
        last_directory = os.path.basename(current_directory)
        valid = last_directory != current_directory and last_directory not in ['.', '..']
        print(f"Path: {path}")
        print(f"  Current dir: {current_directory}")
        print(f"  Last dir: {last_directory}")
        print(f"  Valid: {valid}")
    else:
        print(f"Path: {path} - Skipped (empty or current dir)")
    print()
EOF
python test_dir_extraction.py
rm test_dir_extraction.py
```
Commits updated from `162716e` to `450ac5e`.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- docs/source/index.rst (0 hunks)
- src/scanoss/api/scanning/v2/scanoss_scanning_pb2.py (1 hunks)
- src/scanoss/cli.py (3 hunks)
- src/scanoss/constants.py (1 hunks)
- src/scanoss/file_filters.py (3 hunks)
- src/scanoss/scanners/folder_hasher.py (6 hunks)
- src/scanoss/scanners/scanner_hfh.py (6 hunks)

💤 Files with no reviewable changes (1)

- docs/source/index.rst

✅ Files skipped from review due to trivial changes (1)

- src/scanoss/constants.py

🚧 Files skipped from review as they are similar to previous changes (3)

- src/scanoss/file_filters.py
- src/scanoss/cli.py
- src/scanoss/scanners/folder_hasher.py
🧰 Additional context used
🪛 Flake8 (7.2.0)

src/scanoss/api/scanning/v2/scanoss_scanning_pb2.py

- [error] lines 25-48: indentation is not a multiple of 4 (E111)
- [error] lines 33-48: missing whitespace around operator (E225)
🪛 Ruff (0.11.9)

src/scanoss/api/scanning/v2/scanoss_scanning_pb2.py

- 19-19: Line too long (2500 > 120) (E501)
- 26-26: Line too long (391 > 120) (E501)
- 30-30: Line too long (130 > 120) (E501)
- 32-32: Line too long (144 > 120) (E501)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: build
🔇 Additional comments (4)
src/scanoss/api/scanning/v2/scanoss_scanning_pb2.py (1)

Lines 19-48: **Protobuf-generated code: formatting issues are expected and should not be manually fixed.**

This file is automatically generated by the protocol buffer compiler, as indicated by the header comment "DO NOT EDIT!". The static analysis warnings about indentation, line length, and missing whitespace around operators are expected in generated code and should not be manually addressed, as any changes would be overwritten when the protobuf definitions are recompiled. The schema changes reflected in this generated code (replacing `best_match`/`threshold` with `rank_threshold` and adding new fields) align with the API evolution described in the summary.

src/scanoss/scanners/scanner_hfh.py (3)

Lines 32-33: **LGTM: new imports align with functionality additions.**

The imports for `DEFAULT_HFH_RANK_THRESHOLD` and `CycloneDx` are properly utilized in the constructor default parameter and the new CycloneDX output formatting method, respectively.

Lines 57-57: **LGTM: constructor properly updated for new API.**

The constructor changes correctly:

- Add the new `rank_threshold` parameter with an appropriate default from constants
- Update the docstring to reflect the new parameter
- Store the parameter as an instance variable

This aligns with the API evolution from separate `best_match`/`threshold` parameters to a unified `rank_threshold`. Also applies to: 67-67, 95-95.
Lines 106-106: **LGTM: scan method correctly updated for new API.**

The gRPC request dictionary is properly updated to use `rank_threshold` instead of the removed `best_match` and `threshold` parameters, consistent with the protobuf schema changes.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- src/scanoss/scanners/scanner_hfh.py (6 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)

src/scanoss/scanners/scanner_hfh.py (3)

- src/scanoss/cyclonedx.py (1): CycloneDx (37-302)
- src/scanoss/scanossbase.py (1): print_stderr (45-49)
- src/scanoss/scanossgrpc.py (1): get_dependencies (248-255)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: build
🔇 Additional comments (5)
src/scanoss/scanners/scanner_hfh.py (5)

Lines 32-33: **LGTM: clean import additions.**

The new imports are well organized and support the updated functionality for rank-threshold configuration and CycloneDX output formatting.

Lines 57-57: **LGTM: improved parameter consolidation.**

The change from separate `best_match` and `threshold` parameters to a unified `rank_threshold` with a default constant simplifies the API and improves maintainability.

Lines 67-67: **LGTM: updated documentation reflects the parameter change.**

The docstring accurately describes the new `rank_threshold` parameter's purpose.

Lines 95-95: **LGTM: instance variable assignment is correct.**

The assignment properly stores the rank threshold for use in the scan method.

Lines 106-106: **LGTM: request parameter updated consistently.**

The gRPC request now correctly uses the unified `rank_threshold` parameter instead of the previous separate parameters.
```python
def _format_cyclonedx_output(self) -> str:
    raise NotImplementedError('CycloneDX output is not implemented')
    if not self.scanner.scan_results:
        return ''
    try:
        if 'results' not in self.scanner.scan_results or not self.scanner.scan_results['results']:
            self.base.print_stderr('ERROR: No scan results found')
            return ''

        first_result = self.scanner.scan_results['results'][0]

        best_match_components = [c for c in first_result.get('components', []) if c.get('order') == 1]
        if not best_match_components:
            self.base.print_stderr('ERROR: No best match component found')
            return ''

        best_match_component = best_match_components[0]
        if not best_match_component.get('versions'):
            self.base.print_stderr('ERROR: No versions found for best match component')
            return ''

        best_match_version = best_match_component['versions'][0]
        purl = best_match_component['purl']

        get_dependencies_json_request = {
            'files': [
                {
                    'file': f'{best_match_component["name"]}:{best_match_version["version"]}',
                    'purls': [{'purl': purl, 'requirement': best_match_version['version']}],
                }
            ]
        }

        decorated_scan_results = self.scanner.client.get_dependencies(get_dependencies_json_request)

        cdx = CycloneDx(self.base.debug, self.output_file)
        scan_results = {}
        for f in decorated_scan_results['files']:
            scan_results[f['file']] = [f]
        if not cdx.produce_from_json(scan_results, self.output_file):
            error_msg = 'ERROR: Failed to produce CycloneDX output'
            self.base.print_stderr(error_msg)
            raise ValueError(error_msg)
    except Exception as e:
        self.base.print_stderr(f'ERROR: Failed to get license information: {e}')
        return None
```
🛠️ Refactor suggestion
Fix inconsistent return type and improve error handling.
The method shows significant improvement over previous versions by adding proper data validation and error handling. However, there are still some issues:
- Inconsistent return type: line 209 returns `None` while other error cases return an empty string
- Exception handling could be more granular: the broad try/except might mask specific errors
- Method complexity: the method is quite long and handles multiple responsibilities

Apply this diff to fix the inconsistent return type:
```diff
         except Exception as e:
             self.base.print_stderr(f'ERROR: Failed to get license information: {e}')
-            return None
+            return ''
```
Consider refactoring this method into smaller helper methods for better maintainability:

```diff
+    def _get_best_match_component(self, scan_results: dict) -> dict:
+        """Extract and validate the best match component from scan results."""
+        if 'results' not in scan_results or not scan_results['results']:
+            raise ValueError('No scan results found')
+
+        first_result = scan_results['results'][0]
+        best_match_components = [c for c in first_result.get('components', []) if c.get('order') == 1]
+        if not best_match_components:
+            raise ValueError('No best match component found')
+
+        best_match_component = best_match_components[0]
+        if not best_match_component.get('versions'):
+            raise ValueError('No versions found for best match component')
+
+        return best_match_component
+
+    def _create_dependencies_request(self, component: dict, version: dict, purl: str) -> dict:
+        """Create the dependencies request JSON."""
+        return {
+            'files': [
+                {
+                    'file': f'{component["name"]}:{version["version"]}',
+                    'purls': [{'purl': purl, 'requirement': version['version']}],
+                }
+            ]
+        }
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Suggested change:

```python
def _format_cyclonedx_output(self) -> str:
    if not self.scanner.scan_results:
        return ''
    try:
        if 'results' not in self.scanner.scan_results or not self.scanner.scan_results['results']:
            self.base.print_stderr('ERROR: No scan results found')
            return ''
        first_result = self.scanner.scan_results['results'][0]
        best_match_components = [c for c in first_result.get('components', []) if c.get('order') == 1]
        if not best_match_components:
            self.base.print_stderr('ERROR: No best match component found')
            return ''
        best_match_component = best_match_components[0]
        if not best_match_component.get('versions'):
            self.base.print_stderr('ERROR: No versions found for best match component')
            return ''
        best_match_version = best_match_component['versions'][0]
        purl = best_match_component['purl']
        get_dependencies_json_request = {
            'files': [
                {
                    'file': f'{best_match_component["name"]}:{best_match_version["version"]}',
                    'purls': [{'purl': purl, 'requirement': best_match_version['version']}],
                }
            ]
        }
        decorated_scan_results = self.scanner.client.get_dependencies(get_dependencies_json_request)
        cdx = CycloneDx(self.base.debug, self.output_file)
        scan_results = {}
        for f in decorated_scan_results['files']:
            scan_results[f['file']] = [f]
        if not cdx.produce_from_json(scan_results, self.output_file):
            error_msg = 'ERROR: Failed to produce CycloneDX output'
            self.base.print_stderr(error_msg)
            raise ValueError(error_msg)
    except Exception as e:
        self.base.print_stderr(f'ERROR: Failed to get license information: {e}')
        return ''

def _get_best_match_component(self, scan_results: dict) -> dict:
    """Extract and validate the best match component from scan results."""
    if 'results' not in scan_results or not scan_results['results']:
        raise ValueError('No scan results found')
    first_result = scan_results['results'][0]
    best_match_components = [c for c in first_result.get('components', []) if c.get('order') == 1]
    if not best_match_components:
        raise ValueError('No best match component found')
    best_match_component = best_match_components[0]
    if not best_match_component.get('versions'):
        raise ValueError('No versions found for best match component')
    return best_match_component

def _create_dependencies_request(self, component: dict, version: dict, purl: str) -> dict:
    """Create the dependencies request JSON."""
    return {
        'files': [
            {
                'file': f'{component["name"]}:{version["version"]}',
                'purls': [{'purl': purl, 'requirement': version['version']}],
            }
        ]
    }
```
🧰 Tools
🪛 Pylint (3.3.7)
- [refactor] 165-165: Either all return statements in a function should return an expression, or none of them should. (R1710)
🤖 Prompt for AI Agents
In src/scanoss/scanners/scanner_hfh.py between lines 165 and 209, the method
_format_cyclonedx_output has inconsistent return types, returning an empty
string on some errors but None on exceptions. To fix this, ensure all error
paths return the same type, preferably an empty string for consistency.
Additionally, refactor the broad try-except block into more specific exception
handling to avoid masking errors. Finally, split the method into smaller helper
functions to reduce complexity and improve maintainability.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
CLIENT_HELP.md
(1 hunks)docs/source/index.rst
(1 hunks)src/scanoss/cli.py
(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- docs/source/index.rst
🧰 Additional context used
🪛 LanguageTool

CLIENT_HELP.md

- [uncategorized] ~490-~490: Loose punctuation mark. Context: "...ate higher quality matches. - `--format`: Result output format (json or cyclonedx..." (UNLIKELY_OPENING_PUNCTUATION)

🪛 Ruff (0.11.9)

src/scanoss/cli.py

- 635-635: Line too long (157 > 120) (E501)

🪛 GitHub Actions: Lint

src/scanoss/cli.py

- [error] 635-635: Ruff: Line too long (157 > 120) (E501)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: build
🔇 Additional comments (4)
CLIENT_HELP.md (1)

Lines 488-496: **Excellent documentation for the new CLI options.**

The new Options section clearly documents the `--rank-threshold` and `--format` parameters with helpful descriptions and examples. The explanation that lower rank values indicate higher quality matches is particularly valuable for users.

src/scanoss/cli.py (3)

Lines 57-57: **LGTM: constant import added.**

The import of `DEFAULT_HFH_RANK_THRESHOLD` aligns with the new rank-threshold functionality.

Lines 627-627: **LGTM: CycloneDX format support added.**

Adding 'cyclonedx' to the format choices correctly implements the new output format option documented in the help file.

Lines 1961-1961: **LGTM: scanner constructor updated correctly.**

The change from setting the `scanner.best_match` and `scanner.threshold` properties to passing `rank_threshold` directly to the constructor aligns with the API changes described in the summary.
Actionable comments posted: 3
♻️ Duplicate comments (1)
src/scanoss/scanners/folder_hasher.py (1)

Lines 38-41: **Type annotation inconsistency causes runtime errors**

The type annotation change creates a mismatch. `CRC64.get_hash_buff()` returns `bytes` (line 159), but the annotation now claims `List[bytes]`. This will cause confusion and potential runtime errors when `file.key` is used in `vectorize_bytes()`, which expects individual `bytes` objects, not lists.

```diff
-    def __init__(self, path: str, key: List[bytes], key_str: str):
+    def __init__(self, path: str, key: bytes, key_str: str):
```
🧹 Nitpick comments (1)
src/scanoss/scanners/folder_hasher.py (1)

Lines 18-18: **Misleading constant name for current usage**

The constant is renamed to `MAXIMUM_FILE_NAME_LENGTH`, but line 143 uses `<= MAXIMUM_FILE_NAME_LENGTH`, which suggests files with names shorter than or equal to this length are kept. This naming is confusing: either rename to `FILE_NAME_LENGTH_LIMIT` or change the logic to use `>` for exclusion.

```diff
-MAXIMUM_FILE_NAME_LENGTH = 32
+FILE_NAME_LENGTH_LIMIT = 32
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- src/scanoss/scanners/folder_hasher.py (7 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: build
🔇 Additional comments (2)
src/scanoss/scanners/folder_hasher.py (2)

Lines 197-214: **Improved relative path calculation**

The relative path calculation with proper error handling is a good improvement over the previous implementation. This addresses the environment-dependency issue mentioned in past comments.

Lines 287-290: **Good addition of debug logging**

The debug logging provides valuable insights into the simhash calculation process, making debugging easier.
```python
                ext_without_dot = extension[1:]
                extension_map[ext_without_dot] = extension_map.get(ext_without_dot, 0) + 1

            current_directory.removeprefix(self.scan_dir)
```
**String method call has no effect**

`removeprefix()` returns a new string, but the result isn't assigned back to `current_directory`. This line is a no-op.

```diff
-            current_directory.removeprefix(self.scan_dir)
+            current_directory = current_directory.removeprefix(self.scan_dir)
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
current_directory = current_directory.removeprefix(self.scan_dir)
```
🤖 Prompt for AI Agents
In src/scanoss/scanners/folder_hasher.py at line 252, the call to removeprefix()
on current_directory does not modify the string in place and its result is not
assigned back. Fix this by assigning the result of
current_directory.removeprefix(self.scan_dir) back to current_directory to
ensure the prefix is actually removed.
```diff
 selected_names.sort()
 concatenated_names = ''.join(selected_names)

 if len(concatenated_names.encode('utf-8')) < MINIMUM_CONCATENATED_NAME_LENGTH:
-    return {
-        'name_hash': None,
-        'content_hash': None,
-    }
+    return {'name_hash': None, 'content_hash': None, 'dir_hash': None, 'lang_extensions': None}

 # Concatenate the unique file names without the extensions, adding a space and sorting them alphabetically
 unique_file_names_list = list(unique_file_names)
 unique_file_names_list.sort()
 concatenated_names = ' '.join(unique_file_names_list)
```
🛠️ Refactor suggestion

Variable reassignment creates confusion and potential bugs

The variable `concatenated_names` is first assigned `''.join(selected_names)` (line 268), used for length validation (line 270), then immediately reassigned a completely different value, `' '.join(unique_file_names_list)` (line 276). This is confusing and makes the length check potentially meaningless: it checks the length of the concatenated file names with extensions, but the simhash calculation then uses the concatenated file names without extensions.

```diff
 selected_names.sort()
-concatenated_names = ''.join(selected_names)
+concatenated_names_with_ext = ''.join(selected_names)

-if len(concatenated_names.encode('utf-8')) < MINIMUM_CONCATENATED_NAME_LENGTH:
+if len(concatenated_names_with_ext.encode('utf-8')) < MINIMUM_CONCATENATED_NAME_LENGTH:
     return {'name_hash': None, 'content_hash': None, 'dir_hash': None, 'lang_extensions': None}

 # Concatenate the unique file names without the extensions, adding a space and sorting them alphabetically
 unique_file_names_list = list(unique_file_names)
 unique_file_names_list.sort()
-concatenated_names = ' '.join(unique_file_names_list)
+concatenated_names_without_ext = ' '.join(unique_file_names_list)
```

Then update line 283:

```diff
-names_simhash = simhash(WordFeatureSet(concatenated_names.encode('utf-8')))
+names_simhash = simhash(WordFeatureSet(concatenated_names_without_ext.encode('utf-8')))
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
selected_names.sort()
concatenated_names_with_ext = ''.join(selected_names)

if len(concatenated_names_with_ext.encode('utf-8')) < MINIMUM_CONCATENATED_NAME_LENGTH:
    return {'name_hash': None, 'content_hash': None, 'dir_hash': None, 'lang_extensions': None}

# Concatenate the unique file names without the extensions, adding a space and sorting them alphabetically
unique_file_names_list = list(unique_file_names)
unique_file_names_list.sort()
concatenated_names_without_ext = ' '.join(unique_file_names_list)
names_simhash = simhash(WordFeatureSet(concatenated_names_without_ext.encode('utf-8')))
```
🤖 Prompt for AI Agents
In src/scanoss/scanners/folder_hasher.py around lines 267 to 276, avoid
reassigning the variable concatenated_names with two different concatenations
that differ in content and format, which causes confusion and invalidates the
length check. Instead, use separate variables for the concatenation used in the
length validation (with extensions and no spaces) and for the concatenation used
later (without extensions and joined by spaces). Update the length check to use
the appropriate variable and ensure the simhash calculation uses the correctly
concatenated names without extensions.
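The separation the refactor asks for can be sketched in isolation. This is illustrative only: the `MINIMUM_CONCATENATED_NAME_LENGTH` value is assumed, and SHA-256 stands in for the PR's `simhash(WordFeatureSet(...))` call:

```python
import hashlib

MINIMUM_CONCATENATED_NAME_LENGTH = 32  # assumed value for illustration

def name_hashes(file_names):
    """Sketch of the suggested separation: one string for the length
    check (names with extensions, no separator) and another for the
    hash input (names without extensions, space-separated)."""
    selected = sorted(file_names)
    names_with_ext = ''.join(selected)
    if len(names_with_ext.encode('utf-8')) < MINIMUM_CONCATENATED_NAME_LENGTH:
        return None
    without_ext = sorted({n.rsplit('.', 1)[0] for n in selected})
    names_without_ext = ' '.join(without_ext)
    # stand-in for simhash(WordFeatureSet(...)) used in the PR
    return hashlib.sha256(names_without_ext.encode('utf-8')).hexdigest()
```

Because the two strings have distinct names, the length check and the hash input can no longer silently drift apart.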
```python
concatenated_directories = ' '.join(unique_directories_list)

names_simhash = simhash(WordFeatureSet(concatenated_names.encode('utf-8')))
dir_simhash = simhash(WordFeatureSet(concatenated_directories.encode('utf-8')))
```
💡 Verification agent
🧩 Analysis chain
Potential issue with empty directory simhash

If no valid directories are found (empty `unique_directories`), `concatenated_directories` will be an empty string, and `WordFeatureSet('')` might produce unexpected simhash results.

Verify how `WordFeatureSet` handles empty strings and ensure the directory simhash behaves correctly when no directories are present.
🏁 Script executed:
```shell
#!/bin/bash
# Description: Check how WordFeatureSet handles empty strings in the simhash utility
ast-grep --pattern 'class WordFeatureSet {
  $$$
}'
# Also check if there are any tests for empty string handling
rg -A 5 -B 5 "WordFeatureSet.*empty|empty.*WordFeatureSet"
```
Length of output: 114
🏁 Script executed:
```shell
#!/bin/bash
# Locate any local definitions of WordFeatureSet
rg -n "class WordFeatureSet" -t py || rg -n "def WordFeatureSet" -t py
# If none found locally, attempt to inspect the external simhash library
python3 - <<'EOF'
import inspect
try:
    from simhash import WordFeatureSet
    print(inspect.getsource(WordFeatureSet))
except Exception as e:
    print("Error fetching WordFeatureSet from simhash:", e)
EOF
```
Length of output: 173
🏁 Script executed:
```shell
#!/bin/bash
# Show the WordFeatureSet class definition and methods
sed -n '150,260p' src/scanoss/utils/simhash.py
# Show the simhash() function to see how features are processed
sed -n '1,100p' src/scanoss/utils/simhash.py
```
Length of output: 4683
🏁 Script executed:
```shell
#!/bin/bash
# Extract the simhash function and its helper fingerprint to see handling of empty feature lists
rg -n "def simhash" src/scanoss/utils/simhash.py -A 20
rg -n "def fingerprint" src/scanoss/utils/simhash.py -A 20
```
Length of output: 3159
Clarify empty-directory simhash behavior

We verified that when `concatenated_directories` is empty, `WordFeatureSet(b'')` yields no features, so the weight vector stays all zeros, `fingerprint` sets all 64 bits, and `dir_simhash` == `0xFFFFFFFFFFFFFFFF`.

If you'd rather treat "no directories" as a distinct value (e.g. 0), add a guard in `src/scanoss/scanners/folder_hasher.py` around lines 281-284:

```diff
-dir_simhash = simhash(WordFeatureSet(concatenated_directories.encode('utf-8')))
+if not unique_directories_list:
+    dir_simhash = 0
+else:
+    dir_simhash = simhash(WordFeatureSet(concatenated_directories.encode('utf-8')))
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
concatenated_directories = ' '.join(unique_directories_list)

names_simhash = simhash(WordFeatureSet(concatenated_names.encode('utf-8')))
if not unique_directories_list:
    dir_simhash = 0
else:
    dir_simhash = simhash(WordFeatureSet(concatenated_directories.encode('utf-8')))
```
🤖 Prompt for AI Agents
In src/scanoss/scanners/folder_hasher.py around lines 281 to 284, the simhash
calculation for concatenated_directories results in a value of
0xFFFFFFFFFFFFFFFF when the input is empty, which may be misleading. Add a guard
to check if concatenated_directories is empty before computing the simhash; if
it is empty, set dir_simhash to 0 explicitly to represent the absence of
directories distinctly.
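A toy sketch of the fingerprint behavior described above, assuming bit i is set when weight i is non-negative, which reproduces the all-ones result for an empty feature set (`fingerprint_sketch` and `dir_hash` are hypothetical names, not the library's API):

```python
def fingerprint_sketch(weights):
    """Toy 64-bit simhash fingerprint: set bit i when weights[i] >= 0.
    An empty feature set leaves all 64 weights at zero, so every bit
    ends up set and the result is 0xFFFFFFFFFFFFFFFF."""
    f = 0
    for i, w in enumerate(weights):
        if w >= 0:
            f |= 1 << i
    return f

def dir_hash(weights_or_none):
    """Guard pattern from the suggestion: treat 'no directories' as 0."""
    return 0 if weights_or_none is None else fingerprint_sketch(weights_or_none)
```

With the guard, the absence of directories is distinguishable from a directory set whose weights happen to all be non-negative.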
… align with golang miner (branch updated from commit 1d25624 to 2933009, Compare)
```python
ext_without_dot = extension[1:]
extension_map[ext_without_dot] = extension_map.get(ext_without_dot, 0) + 1
```

```python
current_directory.removeprefix(self.scan_dir)
```
why are we doing this for every directly. surely we can do this once and make everything else relative?
also, this doesn't work on python 3.8
Actionable comments posted: 0
♻️ Duplicate comments (3)
src/scanoss/scanners/folder_hasher.py (3)
38-41: Fix type annotation: `key` should be `bytes`, not `List[bytes]`

This is the same issue flagged in previous reviews. The `CRC64.get_hash_buff()` method returns a single `bytes` object, not a list of bytes objects. Apply this fix:

```diff
-    def __init__(self, path: str, key: List[bytes], key_str: str):
+    def __init__(self, path: str, key: bytes, key_str: str):
```
243-262: Fix string method with no effect and improve directory path handling

Line 252 contains the same issue flagged in previous reviews: `replace()` returns a new string but the result isn't assigned.

```diff
-current_directory.replace(self.scan_dir, '', 1).lstrip(os.path.sep)
+current_directory = current_directory.replace(self.scan_dir, '', 1).lstrip(os.path.sep)
```

Additionally, consider using `Path` operations for more robust path handling:

```diff
-current_directory.replace(self.scan_dir, '', 1).lstrip(os.path.sep)
-parts = current_directory.split(os.path.sep)
+try:
+    rel_dir = Path(current_directory).relative_to(self.scan_dir)
+    parts = rel_dir.parts
+except ValueError:
+    parts = Path(current_directory).parts
```
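The `Path.relative_to` pattern suggested above can be illustrated on its own; the helper name and the paths are hypothetical:

```python
from pathlib import Path

def relative_parts(current_directory: str, scan_dir: str):
    """Path components of current_directory relative to scan_dir,
    falling back to its absolute components when it is not a sub-path."""
    try:
        return Path(current_directory).relative_to(scan_dir).parts
    except ValueError:  # raised when current_directory is outside scan_dir
        return Path(current_directory).parts
```

Unlike the `replace`-based approach, `relative_to` raises `ValueError` rather than silently passing the input through, which makes the out-of-tree case explicit.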
267-276: Fix variable reassignment that invalidates length check

This is the same confusing variable reassignment issue from previous reviews: the `concatenated_names` variable is used for length validation, then immediately reassigned with different content.

```diff
 selected_names.sort()
-concatenated_names = ''.join(selected_names)
+concatenated_names_with_ext = ''.join(selected_names)

-if len(concatenated_names.encode('utf-8')) < MINIMUM_CONCATENATED_NAME_LENGTH:
+if len(concatenated_names_with_ext.encode('utf-8')) < MINIMUM_CONCATENATED_NAME_LENGTH:
     return {'name_hash': None, 'content_hash': None, 'dir_hash': None, 'lang_extensions': None}

 # Concatenate the unique file names without the extensions, adding a space and sorting them alphabetically
 unique_file_names_list = list(unique_file_names)
 unique_file_names_list.sort()
-concatenated_names = ' '.join(unique_file_names_list)
+concatenated_names_without_ext = ' '.join(unique_file_names_list)
```

Then update line 283:

```diff
-names_simhash = simhash(WordFeatureSet(concatenated_names.encode('utf-8')))
+names_simhash = simhash(WordFeatureSet(concatenated_names_without_ext.encode('utf-8')))
```
🧹 Nitpick comments (1)
src/scanoss/scanners/folder_hasher.py (1)
284-284: Handle empty directory case for consistent simhash behavior

When `unique_directories_list` is empty, `WordFeatureSet(b'')` produces a simhash of `0xFFFFFFFFFFFFFFFF`. Consider making this explicit:

```diff
-dir_simhash = simhash(WordFeatureSet(concatenated_directories.encode('utf-8')))
+if not unique_directories_list:
+    dir_simhash = 0
+else:
+    dir_simhash = simhash(WordFeatureSet(concatenated_directories.encode('utf-8')))
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/scanoss/file_filters.py (3 hunks)
- src/scanoss/scanners/folder_hasher.py (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/scanoss/file_filters.py
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: build
🔇 Additional comments (7)
src/scanoss/scanners/folder_hasher.py (7)
18-18: LGTM: Constant renaming reflects correct usage

The renaming from `MINIMUM_FILE_NAME_LENGTH` to `MAXIMUM_FILE_NAME_LENGTH` accurately reflects its usage as an upper bound constraint in the file filtering logic.

80-80: LGTM: Making `config` parameter non-optional improves API clarity

Requiring an explicit `FolderHasherConfig` prevents potential issues with default values and makes the dependency explicit.

143-143: LGTM: Correct usage of the renamed constant

The file filtering logic correctly uses `MAXIMUM_FILE_NAME_LENGTH` as an upper bound constraint.

197-214: LGTM: Improved path handling with proper error recovery

The safe relative path calculation with try-except block and fallback logic handles edge cases gracefully. The addition of new hash fields (`sim_hash_dir_names`, `lang_extensions`) enhances the folder hashing functionality.

232-234: LGTM: New data structures for enhanced directory hashing

The introduction of `unique_file_names`, `unique_directories`, and `extension_map` enables more sophisticated hashing that includes directory structure and file type information.

287-291: LGTM: Comprehensive debug logging

The debug logging provides valuable insights into the unique file names, directories, and computed hashes, which will help with troubleshooting and validation.

299-299: LGTM: Explicit parameter types improve method signature

Making the optional parameters explicit with type annotations enhances code clarity and IDE support.
Summary by CodeRabbit

New Features

Bug Fixes

Documentation

Chores
- Replaced `--threshold` with `--rank-threshold` (default 5) and removed `--best-match` for folder scans.