Raise without unrolling #940

wsmoses · 2025-03-18T12:33:13Z

No description provided.

wsmoses · 2025-03-18T15:15:35Z

This is blocked waiting for a new Reactant_jll release.

giordano · 2025-03-19T14:02:10Z

This is looking good, based on PRONTOLab/GB-25#79 (comment). Don't have performance numbers though.

wsmoses · 2025-03-19T14:31:41Z

@giordano if you can confirm runtime perf is reasonable by comparison, let's merge

giordano · 2025-03-19T14:34:22Z

I'll need to do some testing on GPU; profiling on CPU is broken according to @Pangoraw, so we don't have any numbers there.

wsmoses · 2025-03-19T14:39:21Z

we can still do a lazy @btime from the outside

giordano · 2025-03-19T17:00:03Z

Uhm, this seems to degrade performance a lot: according to profiling information, on an A100, 3000 iterations take 58s with Reactant v0.2.45 (Reactant_jll v0.0.92), while on this branch (with Reactant_jll v0.0.93) they take 4m 15s.

Pangoraw · 2025-03-19T17:10:53Z

Yeah, without the parallelize pass the raised code will be pretty inefficient.

giordano · 2025-03-19T17:13:04Z

Uhm, maybe the profiler was messing up the timing a lot (I got lots of warnings about large overhead). With @time I get a smaller difference (but still disfavouring this change).

v0.2.45:

julia> @time rloop!(model, ConcreteRNumber(3000));
 57.401839 seconds (926.27 k allocations: 49.752 MiB, 1.75% compilation time)

julia> @time rloop!(model, ConcreteRNumber(6000));
112.590276 seconds (484 allocations: 13.555 KiB)

julia> @time rloop!(model, ConcreteRNumber(10000));
187.604489 seconds (484 allocations: 13.555 KiB)

this PR:

julia> @time rloop!(model, ConcreteRNumber(3000));
 72.680004 seconds (926.16 k allocations: 49.762 MiB, 1.32% compilation time)

julia> @time rloop!(model, ConcreteRNumber(6000));
143.340644 seconds (484 allocations: 13.555 KiB)

julia> @time rloop!(model, ConcreteRNumber(10000));
238.512447 seconds (484 allocations: 13.555 KiB)
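To make the comparison easier to read, here is a minimal sketch that turns the two sets of `@time` results above into slowdown ratios and average per-iteration costs. The numbers are copied from the output above; the variable names are illustrative only, and the 3000-iteration runs include a small amount of compilation time, so their figures are slightly inflated.

```julia
# Wall-clock times in seconds, copied from the @time output above,
# in the same order as the iteration counts passed to rloop!.
iterations = [3000, 6000, 10000]
baseline   = [57.401839, 112.590276, 187.604489]   # released version
this_pr    = [72.680004, 143.340644, 238.512447]   # this branch

# A ratio greater than 1 means this branch is slower than the baseline.
slowdown = this_pr ./ baseline

# Average cost of a single iteration, in milliseconds.
ms_per_iter_baseline = 1000 .* baseline ./ iterations
ms_per_iter_this_pr  = 1000 .* this_pr ./ iterations

for (n, s, b, p) in zip(iterations, slowdown, ms_per_iter_baseline, ms_per_iter_this_pr)
    println("n = $n: slowdown = $(round(s; digits = 2))x, ",
            "$(round(b; digits = 1)) ms/iter vs $(round(p; digits = 1)) ms/iter")
end
```

The ratio comes out at roughly 1.27x for all three iteration counts, so the regression measured with `@time` scales with the number of iterations rather than being a fixed startup cost.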

Edit: I think there are two problems in the profiling data:

I was looking at the reactant_loop_:XLA GPU module event on the CPU, which is sometimes accurate, but at other times just continues for a long time after the actual run has finished. I should have selected the GPU events instead, like this

(note the ☑️ on the GPU line to see those events in the pane below)
...but the GPU data only goes up to 7.1 seconds, while the whole run should have taken about 10x that.

XLA did show warnings about CUPTI data being dropped because some buffers were full, but this is a bit rubbish.

Raise without unrolling

8ffcfc3

giordano force-pushed the nounroll branch from f09b902 to 8ffcfc3 on March 19, 2025 02:46
