## Description
I used heaptrack to analyze the memory allocation patterns of my compiler which uses regalloc2. I used a large benchmark (LLVM library) which has 1257955 basic blocks in 104913 functions. Out of a total of 46780667 allocations, 46779011 of them come from regalloc2 and only 1656 come from other parts of the compiler.
My compiler uses a similar structure[^1] to Cranelift, where a `Context` structure allows memory allocations to be reused across compilation units. Specifically, this allows various `Vec`s and `HashMap`s to grow to a sufficient size only once while being reused many times for compiling multiple functions.
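The reuse pattern described above can be sketched in a few lines. This is a hypothetical illustration (the `Context` and `compile` names here are illustrative, not Cranelift's or regalloc2's actual API): scratch buffers live in a long-lived struct and are cleared, not dropped, between functions, so their heap capacity carries over.

```rust
// Hypothetical sketch of allocation reuse across compilation units.
struct Context {
    scratch: Vec<u32>,
}

impl Context {
    fn new() -> Self {
        Context { scratch: Vec::new() }
    }

    // "Compile" one function using the reusable scratch buffer.
    fn compile(&mut self, input: &[u32]) -> u32 {
        self.scratch.clear(); // keeps capacity: no free + malloc on reuse
        self.scratch.extend_from_slice(input);
        self.scratch.iter().sum()
    }
}

fn main() {
    let mut ctx = Context::new();
    assert_eq!(ctx.compile(&[1, 2, 3]), 6);
    let cap = ctx.scratch.capacity();
    // The second call reuses the first call's allocation.
    assert_eq!(ctx.compile(&[4, 5]), 9);
    assert!(ctx.scratch.capacity() >= cap);
}
```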
regalloc2 unfortunately doesn't follow this design:

- A fresh `Env` structure is allocated and freed for each function. No memory allocations are preserved.
- Heavy use of temporary `SmallVec`s and `HashSet`s. Heaptrack reports that 19865290 allocations (42% of the total) are "temporary": allocations which are immediately followed by a free, with no other allocations in between.
- Heavy use of `BTreeMap`, which needs to allocate many nodes separately and cannot reuse memory, unlike `HashMap` and `Vec`.
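The `BTreeMap` point can be demonstrated with std directly: `Vec::clear` retains the backing allocation, while `BTreeMap::clear` drops every node, so refilling the map has to allocate all of its nodes again.

```rust
use std::collections::BTreeMap;

fn main() {
    // Vec keeps its heap allocation across clear(), so it can be reused...
    let mut v: Vec<u32> = (0..1000).collect();
    let cap = v.capacity();
    v.clear();
    assert!(v.capacity() >= cap); // capacity retained, no reallocation needed

    // ...but BTreeMap drops all of its nodes on clear(); refilling it must
    // allocate every node again.
    let mut m: BTreeMap<u32, u32> = (0..1000).map(|i| (i, i)).collect();
    m.clear();
    assert!(m.is_empty());
    m.extend((0..1000).map(|i| (i, i))); // reallocates nodes
    assert_eq!(m.len(), 1000);
}
```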
## Heaptrack results
Here are the major allocation hotspots shown by heaptrack:
In `try_to_allocate_bundle_to_reg`:

- 16158324 temporary allocations from the `FxHashSet`.
- 9677435 allocations from inserting liveranges into `PRegData::allocations` (a `LiveRangeSet` backed by a `BTreeMap`).
In `merge_bundles`:

- 4478881 allocations from pushing a `VReg` to `SpillSet::vregs` (a `SmallVec`).
- 1895443 allocations from appending `LiveRangeListEntry`s to `LiveBundle::ranges` (a `SmallVec`).
In `insert_use_into_liverange`:

- 2578957 allocations from pushing a `Use` to `LiveRange::uses` (a `SmallVec`).
In `add_liverange_to_vreg`:

- 1432853 allocations from pushing a `LiveRangeListEntry` to `VRegData::ranges` (a `SmallVec`).
In `create_pregs_and_vregs`:

- 948582 allocations from pushing to `Env::vregs`, `Env::allocs`, and `Env::inst_alloc_offsets` (all `Vec`s).
In `Env::new`:

- 1154043 allocations from initializing the various `Vec`s with `Vec::with_capacity`.
## Performance impact
Benchmarking shows that approximately 10% of the time is spent in memory allocation functions (`malloc`, `realloc`, `free`). An additional 10% of the time is spent in `memcpy`/`memmove`, which could be due to vector reallocation.
Compiling with multiple threads can cause additional contention on the global allocator. This was noticed on musl targets, where the global allocator doesn't have thread-local caches: compiler performance on 6 threads (on a 6-core machine) was reduced to that of a single thread, whereas other allocators allow almost-perfect scaling across threads.
## Proposed solution
I believe the best way to improve regalloc2 would be to use a design similar to Cranelift's, with a `Context` structure that preserves memory allocations across multiple invocations of the register allocator.
- Temporary `SmallVec`s and `HashSet`s should be replaced with a single instance in `Context`.
- The many `BTreeMap`s should be replaced with `cranelift-bforest`, which maintains a pool of nodes that can be reused.
- The many `SmallVec`s should be replaced with `EntityList` from `cranelift-entity` and a `ListPool` in `Context`.
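To make the `ListPool` suggestion concrete, here is a toy sketch of the pooled-list idea (this is not `cranelift-entity`'s real API, and unlike the real `ListPool` it doesn't support growing a list after creation): every list's storage lives in one shared pool, so freeing all lists is a single `clear()` that preserves the backing allocation for the next function, instead of one heap free per `SmallVec` spill.

```rust
// Toy sketch of a pooled list: handles index into one shared Vec.
struct ListPool {
    data: Vec<u32>,
}

#[derive(Clone, Copy)]
struct ListHandle {
    start: usize,
    len: usize,
}

impl ListPool {
    fn new() -> Self {
        ListPool { data: Vec::new() }
    }

    // Create a list by appending its elements to the shared pool.
    fn alloc(&mut self, elems: &[u32]) -> ListHandle {
        let start = self.data.len();
        self.data.extend_from_slice(elems);
        ListHandle { start, len: elems.len() }
    }

    fn get(&self, h: ListHandle) -> &[u32] {
        &self.data[h.start..h.start + h.len]
    }

    // One clear per function instead of one free per list.
    fn clear(&mut self) {
        self.data.clear();
    }
}

fn main() {
    let mut pool = ListPool::new();
    let a = pool.alloc(&[1, 2, 3]);
    let b = pool.alloc(&[4, 5]);
    assert_eq!(pool.get(a), &[1, 2, 3]);
    assert_eq!(pool.get(b), &[4, 5]);
    let cap = pool.data.capacity();
    pool.clear();
    assert!(pool.data.capacity() >= cap); // allocation preserved for reuse
}
```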
Since this would require a dependency on the `cranelift-entity` and `cranelift-bforest` crates, this is also a good opportunity to resolve #62 and switch to using proper indexed types internally.
[^1]: In fact I am using the `cranelift-entity` crate directly; it's very well written.