Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise without unrolling #940

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Raise without unrolling #940

wants to merge 1 commit into from

Conversation

wsmoses
Copy link
Member

@wsmoses wsmoses commented Mar 18, 2025

No description provided.

@wsmoses
Copy link
Member Author

wsmoses commented Mar 18, 2025

this is blocked waiting for new jll

@giordano
Copy link
Member

giordano commented Mar 19, 2025

This is looking good, based on PRONTOLab/GB-25#79 (comment). Don't have performance numbers though.

@wsmoses
Copy link
Member Author

wsmoses commented Mar 19, 2025

@giordano if you can confirm runtime perf is reasonable by comparison lets merge

@giordano
Copy link
Member

I'll need to do some testing on GPU, profiling on CPU is broken according to @Pangoraw so we don't have any numbers there.

@wsmoses
Copy link
Member Author

wsmoses commented Mar 19, 2025

we can still do lazy @Btime on outside

@giordano
Copy link
Member

Uhm, this seems to degrade performance a lot: according to profiling information, on A100 3000 iterations take 58s with Reactant v0.2.45 (Reactant_jll v0.0.92), on this branch (and Reactant_jll v0.0.93) they take 4m 15s

@Pangoraw
Copy link
Collaborator

Yeah without the parallelize pass the raised code will be pretty inefficient.

@giordano
Copy link
Member

giordano commented Mar 19, 2025

Uhm, maybe the profiler was messing up timing a lot (I got lots of warnings about large overhead), with @time I get a smaller difference (but still disfavouring this change): v0.2.95

julia> @time rloop!(model, ConcreteRNumber(3000));
 57.401839 seconds (926.27 k allocations: 49.752 MiB, 1.75% compilation time)

julia> @time rloop!(model, ConcreteRNumber(6000));
112.590276 seconds (484 allocations: 13.555 KiB)

julia> @time rloop!(model, ConcreteRNumber(10000));
187.604489 seconds (484 allocations: 13.555 KiB)

this PR:

julia> @time rloop!(model, ConcreteRNumber(3000));
 72.680004 seconds (926.16 k allocations: 49.762 MiB, 1.32% compilation time)

julia> @time rloop!(model, ConcreteRNumber(6000));
143.340644 seconds (484 allocations: 13.555 KiB)

julia> @time rloop!(model, ConcreteRNumber(10000));
238.512447 seconds (484 allocations: 13.555 KiB)

Edit: I think there are two problems in the profiling data:

  • I was looking at the reactant_loop_:XLA GPU module event on the CPU, which sometimes is accurate, but other times just continues for a long time after the actual run is finished. I should have selected the GPU events like this
    image
    (note the ☑️ on the GPU line to see those events in the pane below)
  • ...but the GPU data goes up to 7.1 seconds, while the whole run should have been like 10x of that.

XLA did show warnings about CUPTI data being dropped because some buffers were full, but this is a bit rubbish.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants