[SP-2587] Add directory simhash, modify concatenated names to remove extensions #119
base: main
Walkthrough
The updates modify type hints and filtering logic in file filters, and enhance the folder hashing process with additional simhash calculations for directories and file names. Method signatures are updated for clarity and stricter typing, and the hashing methods now track and return more granular directory and file name uniqueness information.
Sequence Diagram(s)
sequenceDiagram
participant User
participant FolderHasher
participant DirectoryNode
User->>FolderHasher: Initialize with scan_dir and config
FolderHasher->>DirectoryNode: Traverse directory tree
FolderHasher->>FolderHasher: _hash_calc(node)
FolderHasher->>FolderHasher: Collect unique file and dir names
FolderHasher->>FolderHasher: Compute simhashes (content, name, dir)
FolderHasher->>User: Return hash dictionary with content_hash, name_hash, dir_hash
Actionable comments posted: 3
🔭 Outside diff range comments (1)
src/scanoss/file_filters.py (1)
514-521: 💡 Verification agent 🧩 Analysis chain
Update the doc-string & type handling for scan_root
The parameter was changed to Optional[str], but the doc-string (lines 518-521) and the run-time handling (line 530) still assume a non-None str. Calling relative_to() with a None value will raise, so an early return or guard clause is advisable, and the docs should be synchronised.
@@
- scan_root (str): Root directory to scan or fingerprint
+ scan_root (Optional[str]): Root directory to scan or fingerprint.
+     When None the paths are returned as absolute paths.
🏁 Script executed:
#!/bin/bash
# Locate the file_filters.py file
file=$(fd file_filters.py -t f)
echo "Inspecting: $file"
# Print lines around the doc-string and the runtime handling of scan_root
sed -n '500,580p' "$file"
Length of output: 406
🏁 Script executed:
#!/bin/bash
# Locate the src/scanoss/file_filters.py file specifically
file=$(fd file_filters.py -t f | grep 'src/scanoss/file_filters.py')
echo "Inspecting: $file"
# Print lines around the doc-string and the runtime handling of scan_root
sed -n '500,580p' "$file"
Length of output: 3643
Update doc-string for scan_root in get_filtered_files_from_files
The scan_root parameter is now Optional[str], so the doc-string should reflect the new type and describe the None-case behavior.
• File: src/scanoss/file_filters.py
• Method: get_filtered_files_from_files
Suggested diff:
  Args:
      files (List[str]): List of files to scan or fingerprint
-     scan_root (str): Root directory to scan or fingerprint
+     scan_root (Optional[str]): Root directory to scan or fingerprint.
+         When None, returned paths will be absolute.
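The guard clause mentioned in this comment could look roughly like the sketch below. It is not the code from file_filters.py: the helper name to_reported_paths and the fallback for files outside the root are assumptions made for illustration only.

```python
from pathlib import Path
from typing import List, Optional


def to_reported_paths(files: List[str], scan_root: Optional[str] = None) -> List[str]:
    """Hypothetical helper: relative paths when scan_root is set, absolute otherwise."""
    # Guard: only build a root when one was supplied, so relative_to() never sees None
    root = Path(scan_root).resolve() if scan_root is not None else None
    reported = []
    for name in files:
        path = Path(name).resolve()
        if root is None:
            # None case described in the doc-string: return absolute paths
            reported.append(str(path))
            continue
        try:
            reported.append(str(path.relative_to(root)))
        except ValueError:
            # Assumption: files outside scan_root fall back to their absolute path
            reported.append(str(path))
    return reported
```

With this shape, the doc-string and the run-time behavior agree: a None root yields absolute paths rather than an exception from relative_to().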
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/scanoss/file_filters.py (3 hunks)
src/scanoss/scanners/folder_hasher.py (4 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/scanoss/scanners/folder_hasher.py (5)
src/scanoss/utils/simhash.py (4)
  simhash (125-130)
  WordFeatureSet (163-169)
  fingerprint (100-109)
  vectorize_bytes (84-97)
src/scanoss/scanners/scanner_hfh.py (1)
  present (129-131)
src/scanoss/results.py (1)
  present (273-275)
src/scanoss/utils/abstract_presenter.py (1)
  present (28-55)
src/scanoss/scanners/container_scanner.py (1)
  present (381-383)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: build
🔇 Additional comments (2)
src/scanoss/scanners/folder_hasher.py (2)
80-82: Constructor signature change breaks backward compatibility
config is now mandatory (FolderHasherConfig, not Optional), yet callers in the tree (e.g. tests, CLI entry points) may still pass None. Consider providing a sensible default:
def __init__(self, scan_dir: str, config: Optional[FolderHasherConfig] = None, ...)
    config = config or FolderHasherConfig()
Without this, existing integrations will raise TypeError.
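A self-contained sketch of the suggested default follows. The FolderHasherConfig fields here are hypothetical stand-ins, since the real class is defined in scanoss and is not shown in this review.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FolderHasherConfig:
    # Hypothetical fields for illustration; the real config lives in scanoss
    debug: bool = False
    quiet: bool = False


class FolderHasher:
    def __init__(self, scan_dir: str, config: Optional[FolderHasherConfig] = None):
        self.scan_dir = scan_dir
        # Fall back to a default config when the caller passes None or omits it
        self.config = config if config is not None else FolderHasherConfig()
```

In plain Python, omitting a required argument raises TypeError at the call site, whereas explicitly passing None to a non-Optional parameter is not checked at run time and usually surfaces later, typically as an AttributeError on first use. The default above covers both cases.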
256-270: Potential misuse of vectorize_bytes input
file_hashes.append(file.key) appends the entire CRC digest; vectorize_bytes will iterate through these 8-byte sequences as separate features, which is fine. If key accidentally becomes List[bytes] (see earlier issue) you will instead pass a list of lists, causing the FNV hash to fail. Fixing the type annotation as suggested earlier avoids this failure path.
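To illustrate why the feature type matters, here is a generic simhash sketch over byte features using 64-bit FNV-1a. It is not the implementation in src/scanoss/utils/simhash.py; the function names and the explicit type check are assumptions for this example.

```python
from typing import List

FNV_OFFSET = 0xCBF29CE484222325  # 64-bit FNV offset basis
FNV_PRIME = 0x100000001B3        # 64-bit FNV prime
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a hash of a byte string."""
    h = FNV_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & MASK64
    return h


def vectorize(features: List[bytes]) -> List[int]:
    """Sum +1/-1 per bit position across the hashes of all features."""
    vector = [0] * 64
    for feature in features:
        # A nested list (List[List[bytes]]) would reach the hash as a list; reject it early
        if not isinstance(feature, (bytes, bytearray)):
            raise TypeError(f"expected a bytes feature, got {type(feature).__name__}")
        h = fnv1a_64(feature)
        for bit in range(64):
            vector[bit] += 1 if (h >> bit) & 1 else -1
    return vector


def fingerprint(vector: List[int]) -> int:
    """Collapse the weight vector into a 64-bit fingerprint."""
    result = 0
    for bit, weight in enumerate(vector):
        if weight >= 0:
            result |= 1 << bit
    return result
```

Passing a flat list of digests works, while a list of lists fails on the first element, which matches the failure path the comment describes.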
Summary by CodeRabbit
New Features
Bug Fixes
Refactor