Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Process input files in parallel #26

Merged
merged 1 commit into from
Nov 8, 2024

Conversation

ricekot
Copy link
Contributor

@ricekot ricekot commented Nov 7, 2024

Process input source files in parallel to reduce the total AST generation time. The time taken to generate an AST is cut down by more than half with parallel processing.

About the changes

A PLINQ query is used to parallelize the foreach loops that iterate over files in the input directory one by one.

Since the AsParallel method is called on an IEnumerable<FileInfo>, PLINQ should opt for chunk partitioning (dynamic batch sizes) by default, as opposed to a range partitioning (fixed batch sizes) approach as suggested in #11.

Refs:

Benchmark

A fresh clone of the bitwarden/server repo was created locally (at commit 4adcecb).

Then, a scan was run against the cloned repo twice, once from the main branch of this repo and once with the code changes in this PR.

time dotnet run --project DotNetAstGen/ -i /tmp/bitwarden-server -o /tmp/bitwarden-server-ast/

The results of the scans are added below:

Without the changes from this PR (sequential processing):

________________________________________________________
Executed in   88.05 secs    fish           external
   usr time   87.45 secs   49.00 micros   87.45 secs
   sys time    4.36 secs  662.00 micros    4.36 secs

With the changes from this PR (parallel processing)

________________________________________________________
Executed in   39.36 secs    fish           external
   usr time  158.99 secs    0.08 millis  158.99 secs
   sys time   10.21 secs    1.48 millis   10.21 secs

Thus, AST generation with parallel processing was 48.69 seconds or about 55% faster.

I did multiple runs and the result was similar each time.

For reference these runs were done on a machine running macOS 15.0.1 with Apple M3 Pro and 36GiB RAM.

Copy link
Collaborator

@DavidBakerEffendi DavidBakerEffendi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the change! Much needed. I recently did something similar for the Ruby counterpart. Just have one request on the moving of a logging line I observed

@@ -218,6 +211,8 @@ private static void _SummaryForDLLFile(FileInfo filePath, string? exclusionRegex
return;
}

_logger?.LogInformation("Parsing file {fileName}", fullPath);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for moving this? Would this not end up logging all types of files?

Copy link
Contributor Author

@ricekot ricekot Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the existing behaviour too I believe, see line 192 which logs the file name in the loop.
https://github.com/joernio/DotNetAstGen/pull/26/files#diff-600f8abdf835e81e081c02e55f3bed18ecb1fdd2c80333c77184fefc67c6ad92L192

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see the looping has changed, my bad. Looks all good to me then

@@ -218,6 +211,8 @@ private static void _SummaryForDLLFile(FileInfo filePath, string? exclusionRegex
return;
}

_logger?.LogInformation("Parsing file {fileName}", fullPath);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see the looping has changed, my bad. Looks all good to me then

@DavidBakerEffendi DavidBakerEffendi merged commit 8e95368 into joernio:main Nov 8, 2024
DavidBakerEffendi added a commit to joernio/joern that referenced this pull request Nov 10, 2024
Uses new version of `DotNetAstGen` that uses a parallelized version. Credits to @ricekot

joernio/DotNetAstGen#26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants