-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refine ScanCode.io d2d pipeline for JavaScript using symbol mapping #1622
Conversation
keshav-space
commented
Mar 12, 2025
- Map source and deployed javascript files based on Jensen–Shannon divergence
- Map source and deployed javascript files based on Jensen–Shannon divergence Signed-off-by: Keshav Priyadarshi <[email protected]>
Signed-off-by: Keshav Priyadarshi <[email protected]>
20ea1f3
to
e800be8
Compare
@keshav-space what's the status on this PR? Are we ready to merge? If not, what's left to do? |
@tdruez It's not ready for merge yet. I need to add string mapping for JavaScript, and I'm also testing and adding improvements to JavaScript symbol matching to reduce false positive matches. |
Signed-off-by: Keshav Priyadarshi <[email protected]>
@keshav-space Sounds good. Let me know when it's ready to go as we need this one merged in the back2source context ;) |
Signed-off-by: Keshav Priyadarshi <[email protected]>
Signed-off-by: Keshav Priyadarshi <[email protected]>
@tdruez this is ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@keshav-space Overall looks great.
I think we are missing a few unit test for some new simple functions, such as is_decomposed_javascript
, get_symbols_probability_distribution
, ...
scanpipe/pipes/symbolmap.py
Outdated
common_symbols_ratio > matching_ratio | ||
or common_symbols_unique_ratio > matching_ratio | ||
): | ||
return True, stats | ||
elif source_symbols_count > SMALL_FILE_SYMBOLS_THRESHOLD and ( | ||
common_symbols_ratio > MATCHING_RATIO_RUST_SMALL_FILE | ||
or common_symbols_unique_ratio > MATCHING_RATIO_RUST_SMALL_FILE | ||
elif source_symbols_count > small_file_threshold and ( | ||
common_symbols_ratio > matching_ratio_small_file | ||
or common_symbols_unique_ratio > matching_ratio_small_file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we wrap those conditions in a explicit variable name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should remove this. This change is no longer needed since we're not using match_source_symbols_to_binary
for javascript symbol matching.
Signed-off-by: Keshav Priyadarshi <[email protected]>
Signed-off-by: Keshav Priyadarshi <[email protected]>