Commit 8efd1f0

authored
Enhance GC devdocs (#60256)
Cleans up the writing in the GC devdocs and adds a bit more technical detail in a few sections.
1 parent cf40898 commit 8efd1f0

1 file changed: +47 -34 lines changed


doc/src/devdocs/gc.md

Lines changed: 47 additions & 34 deletions
@@ -1,61 +1,74 @@
-# Garbage Collection in Julia
+# Julia Garbage Collector (GC) Internals
 
 ## Introduction
 
-Julia has a non-moving, partially concurrent, parallel, generational and mostly precise mark-sweep collector (an interface
-for conservative stack scanning is provided as an option for users who wish to call Julia from C).
+Julia implements a garbage collector (GC) to automate dynamic memory management. Julia's GC is:
+
+- **Mark-sweep**: the object graph is traced starting from a root-set (e.g., global variables and local variables on the stack) to determine the set of live objects.
+- **Non-moving**: objects are not relocated to a different memory address.
+- **Parallel**: multiple threads can be used during the marking and sweeping phases.
+- **Partially concurrent**: the runtime provides an option to scavenge pool-allocated memory blocks (e.g., call `madvise` on these blocks on Linux) concurrently with Julia user code.
+- **Generational**: objects are partitioned into generations according to how many collection cycles they've survived. Younger generations are collected more often.
+- **Mostly precise**: Julia optionally supports conservative stack scanning for users who interoperate with foreign languages like C.
 
 ## Allocation
 
-Julia uses two types of allocators, the size of the allocation request determining which one is used. Objects up to 2k
-bytes are allocated on a per-thread free-list pool allocator, while objects larger than 2k bytes are allocated through libc
-malloc.
+Julia uses two types of allocators, depending on the size of the allocation request.
+
+### Small Object Allocation
+
+Sufficiently small objects, up to 2k bytes, are allocated through a per-thread free-list pool allocator.
 
-Julia’s pool allocator partitions objects on different size classes, so that a memory page managed by the pool allocator
-(which spans 4 operating system pages on 64bit platforms) only contains objects of the same size class. Each memory
-page from the pool allocator is paired with some page metadata stored on per-thread lock-free lists. The page metadata contains information such as whether the page has live objects at all, number of free slots, and offsets to the first and last objects in the free-list contained in that page. These metadata are used to optimize the collection phase: a page which has no live objects at all may be returned to the operating system without any need of scanning it, for example.
+Julia's pool allocator often has better runtime performance than `libc` `malloc` for small allocations. Additionally, using a custom pool allocator enables a few optimizations during the sweeping phase (e.g., concurrent scavenging).
 
-While a page that has no objects may be returned to the operating system, its associated metadata is permanently
-allocated and may outlive the given page. As mentioned above, metadata for allocated pages are stored on per-thread lock-free
-lists. Metadata for free pages, however, may be stored into three separate lock-free lists depending on whether the page has been mapped but never accessed (`page_pool_clean`), or whether the page has been lazily sweeped and it's waiting to be madvised by a background GC thread (`page_pool_lazily_freed`), or whether the page has been madvised (`page_pool_freed`).
+The pool allocator segregates objects into different size classes. Each large memory block (16k bytes) managed by the pool allocator only contains objects belonging to the same size class.
 
-Julia's pool allocator follows a "tiered" allocation discipline. When requesting a memory page for the pool allocator, Julia will:
+Each pool-allocated memory block is paired with a metadata structure containing information such as whether the block has live objects at all, the number of free memory slots in the block, the offsets to the first and last objects in the block, etc. This metadata is used to aggregate statistics such as the number of objects freed during a collection cycle. It's also used to optimize the sweeping phase of the GC: blocks that have no live objects whatsoever don't need to be linearly scanned during the sweeping phase.
 
-- Try to claim a page from `page_pool_lazily_freed`, which contains pages which were empty on the last stop-the-world phase, but not yet madvised by a concurrent sweeper GC thread.
+Julia's pool allocator stores memory blocks into different global lock-free lists depending on whether the block has been mapped but never accessed (`page_pool_clean`), whether the block has been lazily swept and is waiting to be scavenged by a background GC thread (`page_pool_lazily_freed`), or whether the block has been scavenged (`page_pool_freed`).
 
-- If it failed claiming a page from `page_pool_lazily_freed`, it will try to claim a page from `page_pool_clean`, which contains pages which were mmaped on a previous page allocation request but never accessed.
+The pool allocator uses this partitioning of blocks to implement a tiered allocation discipline. When it requests a fresh memory block, it will:
 
-- If it failed claiming a page from `pool_page_clean` and from `page_pool_lazily_freed`, it will try to claim a page
-from `page_pool_freed`, which contains pages which have already been madvised by a concurrent sweeper GC thread and whose underlying virtual address can be recycled.
+- Try to claim a block from `page_pool_lazily_freed`, which contains blocks that were empty during the last stop-the-world phase, but haven't been madvised by a concurrent scavenger GC thread yet.
+
+- If it failed to claim a block from `page_pool_lazily_freed`, it will try to claim a block from `page_pool_clean`, which contains blocks mapped on a previous block allocation request but never accessed.
+
-- If it failed in all of the attempts mentioned above, it will mmap a batch of pages, claim one page for itself, and
-insert the remaining pages into `page_pool_clean`.
+- If it failed to claim a block from `page_pool_clean` and from `page_pool_lazily_freed`, it will try to claim a block from `page_pool_freed`, which contains blocks already scavenged by a concurrent scavenger GC thread and whose underlying virtual address can be recycled.
+
+- If it failed in all of the attempts mentioned above, it will map a batch of operating system pages, partition them into memory blocks, claim one block for itself, and insert the remaining blocks into `page_pool_clean`.
 
 ![Diagram of tiered pool allocation](./img/gc-tiered-allocation.jpg)
 
+### Large Object Allocation
+
+Sufficiently large objects, above the 2k byte threshold mentioned in the previous section, are allocated through `libc` `malloc`. Large allocations are typically less performance-critical than small allocations, as they occur less frequently.
+
+Although Julia currently uses `libc` `malloc`, it also supports pre-loading other dynamic memory allocators (e.g., `jemalloc`).
+
 ## Marking and Generational Collection
 
-Julia’s mark phase is implemented through a parallel iterative depth-first-search over the object graph. Julia’s collector is non-moving, so object age information can’t be determined through the memory region in which the object resides alone, but has to be somehow encoded in the object header or on a side table. The lowest two bits of an object’s header are used to store, respectively, a mark bit that is set when an object is scanned during the mark phase and an age bit for the generational collection.
+Julia’s mark phase is implemented through a parallel depth-first-search that traverses the object graph to determine which objects are alive.
+
+Julia stores age information for its generational GC in the object header: the lowest two bits of an object’s header store a mark bit, set when an object is marked, and an age bit, set when the object is promoted. Because Julia’s GC is non-moving, object age information can’t be determined from the object's memory address alone, as in GC implementations that allocate young objects in certain memory regions and relocate them to other regions during promotion.
 
-Generational collection is implemented through sticky bits: objects are only pushed to the mark-stack, and therefore
-traced, if their mark-bits are not set. When objects reach the oldest generation, their mark-bits are not reset during
-the so-called "quick-sweep", which leads to these objects not being traced in a subsequent mark phase. A "full-sweep",
-however, causes the mark-bits of all objects to be reset, leading to all objects being traced in a subsequent mark phase.
-Objects are promoted to the next generation during every sweep phase they survive. On the mutator side, field writes
-are intercepted through a write barrier that pushes an object’s address into a per-thread remembered set if the object is
-in the last generation, and if the object at the field being written is not. Objects in this remembered set are then traced
-during the mark phase.
+Generational collection is implemented through sticky bits: objects are only pushed to the mark-stack, and therefore traced, if their mark-bits have not been set. When objects reach the oldest generation, their mark-bits aren't reset during a quick sweep, so these objects aren't traced during a subsequent mark phase. A full sweep, however, resets the mark-bits of all objects, so all of them are traced in a subsequent collection.
+
+When the mutator is running, a write barrier intercepts field writes and pushes an object’s address into a per-thread remembered set if the reference crosses generations (i.e., an old object ends up pointing to a young one). Objects in this remembered set are then traced during the next mark phase.
 
 ## Sweeping
 
-Sweeping of object pools for Julia may fall into two categories: if a given page managed by the pool allocator contains at least one live object, then a free-list must be threaded through its dead objects; if a given page contains no live objects at all, then its underlying physical memory may be returned to the operating system through, for instance, the use of madvise system calls on Linux.
+If a memory block managed by the pool allocator contains at least one live object, the sweeping phase threads a free-list through its dead objects; if it doesn't, the block is scavenged and its underlying physical memory may be returned to the operating system through, for instance, `madvise` on Linux.
+
+The linear scan of memory blocks that have at least one live object can be run with multiple threads. If concurrent page sweeping is enabled through the flag `--gcthreads=X,1`, the GC scavenges memory blocks concurrently with the mutator.
+
+During the stop-the-world phase of the collector, memory blocks containing no live objects are initially pushed into `page_pool_lazily_freed`. The background scavenger thread is then woken up, removes blocks from `page_pool_lazily_freed`, scavenges them (e.g., through `madvise` on Linux), and inserts them into `page_pool_freed`. `page_pool_lazily_freed` is also shared with mutator threads. This can improve performance in allocation-heavy multithreaded workloads: mutator threads can often avoid a page fault during allocation (incurred by touching a freshly mapped operating system page or a madvised page) by directly claiming a block from `page_pool_lazily_freed`. In these workloads, the scavenger thread also needs to scavenge fewer blocks, since some have already been claimed by the mutators.
 
-The first category of sweeping is parallelized through work-stealing. For the second category of sweeping, if concurrent page sweeping is enabled through the flag `--gcthreads=X,1` we perform the madvise system calls in a background sweeper thread, concurrently with the mutator threads. During the stop-the-world phase of the collector, pool allocated pages which contain no live objects are initially pushed into the `pool_page_lazily_freed`. The background sweeping thread is then woken up and is responsible for removing pages from `pool_page_lazily_freed`, calling madvise on them, and inserting them into `pool_page_freed`. As described above, `pool_page_lazily_freed` is also shared with mutator threads. This implies that on allocation-heavy multithreaded workloads, mutator threads would often avoid a page fault on allocation (coming from accessing a fresh mmaped page or accessing a madvised page) by directly allocating from a page in `pool_page_lazily_freed`, while the background sweeper thread needs to madvise a reduce number of pages given some of them were already claimed by the mutators.
+## Memory Accounting
 
-## Heuristics
+The GC determines the heap size by adding the bytes in use by pool-allocated memory blocks to the bytes in use by objects allocated through the large object allocator. Previously, the heap size was measured by adding up the bytes of live objects, but not of live memory blocks; that definition ignores fragmentation, which can lead to inaccurate GC decisions.
 
-GC heuristics tune the GC by changing the size of the allocation interval between garbage collections.
+## GC Trigger Heuristics
 
-The GC heuristics measure how big the heap size is after a collection and set the next collection according to the algorithm described by https://dl.acm.org/doi/10.1145/3563323, in summary, it argues that the heap target should have a square root relationship with the live heap, and that it should also be scaled by how fast the GC is freeing objects and how fast the mutators are allocating. The heuristics measure the heap size by counting the number of pages that are in use and the objects that use malloc. Previously we measured the heap size by counting the alive objects, but that doesn't take into account fragmentation which could lead to bad decisions, that also meant that we used thread local information (allocations) to make decisions about a process wide (when to GC), measuring pages means the decision is global.
+Julia's GC heuristics are based on `MemBalancer` (https://dl.acm.org/doi/10.1145/3563323). They decide when to trigger a collection and whether that collection should be quick or full. The heuristics adjust the number of bytes the mutator can allocate before triggering a collection cycle by measuring metrics such as the allocation rate, the freeing rate, and the current heap size.
 
-The GC will do full collections when the heap size reaches 80% of the maximum allowed size.
+Independently of allocation rates, freeing rates, or GC times, Julia will always trigger a full collection if the heap size exceeds 80% of a memory upper bound, specified through `--heap-size-hint` or determined by reading system information.
