78 changes: 75 additions & 3 deletions Source/MLX/GPU.swift
@@ -10,6 +10,32 @@ import Metal
/// ``activeMemory`` is the memory held by currently active ``MLXArray`` values and
/// ``cacheMemory`` is recently used memory that can be recycled.
///
/// ## Memory Management and Buffer Recycling
///
/// MLX uses a buffer recycling system to optimize performance. When MLXArrays
/// are no longer needed (such as intermediate computation results), their buffers
/// are not immediately deallocated. Instead, they are placed in a buffer pool where they
/// can be reused by later computations that need buffers of a similar size.
///
/// This recycling strategy is particularly important during model inference:
/// - Initial model weights might use ~500MB
/// - Each token generation creates intermediate buffers (e.g., ~1MB for the first token)
/// - As sequence length grows, buffer sizes increase but previous smaller buffers
/// remain in the pool waiting for reuse
/// - By the end of a long inference run, you may see several GB of cached memory
/// from accumulated buffers of various sizes
///
/// The buffer pool policy is based on Metal's `recommendedMaxWorkingSetSize`, which
/// scales with available physical memory. Systems with more RAM will cache more buffers.
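///
/// For illustration, a minimal sketch of watching the pool around a single
/// computation (`forwardPass(_:)` and `input` are placeholders for any MLX work):
///
/// ```swift
/// let before = GPU.cacheMemory
/// let output = forwardPass(input)   // placeholder computation producing MLXArrays
/// eval(output)                      // force evaluation so buffers are actually allocated
/// print("cache grew by \(GPU.cacheMemory - before) bytes")
/// ```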
///
/// ## Cache Size Optimization
///
/// The optimal cache size varies significantly by workload. While an unconstrained cache
/// can grow to several GB, developers often find that relatively small cache sizes
/// (e.g., 2MB) perform just as well for their specific use cases. The best approach
/// is to experiment with different cache limits and measure performance for your
/// particular workload.
///
/// Control the size of ``cacheMemory`` via ``GPU/set(cacheLimit:)``
/// and the overall memory limit with ``GPU/set(memoryLimit:relaxed:)``.
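///
/// For example (the byte values are illustrative, not recommendations):
///
/// ```swift
/// GPU.set(cacheLimit: 2 * 1024 * 1024)           // cap the buffer cache at 2 MB
/// GPU.set(memoryLimit: 4 * 1024 * 1024 * 1024)   // cap total MLX allocation at 4 GB
/// ```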
///
@@ -39,6 +65,23 @@ public enum GPU {
/// ``activeMemory`` is the memory held by currently active ``MLXArray`` values and
/// ``cacheMemory`` is recently used memory that can be recycled.
///
/// ## Understanding Memory Growth During Inference
///
/// During model inference with an **unconstrained cache size**, you'll typically see memory usage patterns like:
/// - **Initial**: ~500MB (model weights) + minimal cache
/// - **After first token**: +~1MB intermediates → cache grows as buffers are recycled
/// - **After 100 tokens**: Cache may be ~500MB (accumulated smaller buffers)
/// - **After 500 tokens**: Cache may be ~9.9GB (buffers of various sizes waiting for reuse)
///
/// The cache grows because each token generation needs slightly larger buffers
/// (longer sequences), but smaller buffers from previous tokens remain cached
/// for potential reuse. Running inference again will reuse these cached buffers
/// without additional memory growth.
///
/// **Important**: These large cache sizes can be controlled by setting appropriate
/// cache limits via ``GPU/set(cacheLimit:)``. The cache limit defaults to the
/// memory limit but can be set much lower to constrain memory usage.
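///
/// A sketch of watching these counters during generation
/// (`generateNextToken()` stands in for your own per-token step):
///
/// ```swift
/// for _ in 0..<500 {
///     _ = generateNextToken()
///     print("active: \(GPU.activeMemory / 1_048_576) MB, cache: \(GPU.cacheMemory / 1_048_576) MB")
/// }
/// ```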
///
/// Control the size of ``cacheMemory`` via ``GPU/set(cacheLimit:)``
/// and the overall memory limit with ``GPU/set(memoryLimit:relaxed:)``.
///
@@ -139,7 +182,16 @@ public enum GPU {
/// Get the cache size in bytes.
///
/// The cache includes memory not currently used that has not been returned
/// to the system allocator.
/// to the system allocator. This represents buffers from previous
/// computations that are kept in a buffer pool for potential reuse.
///
/// During model inference, this can grow significantly as buffers of various
/// sizes accumulate from intermediate computations. Each token generation
/// may need slightly larger buffers, causing smaller cached buffers to
/// remain unused while new, larger buffers are allocated.
///
/// The cache size is controlled by the cache limit (see ``set(cacheLimit:)``).
/// When the limit is exceeded, older cached buffers are freed on the next allocation.
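///
/// A sketch of reacting to pool growth (the 512 MB threshold and 64 MB limit
/// are illustrative):
///
/// ```swift
/// if GPU.cacheMemory > 512 * 1024 * 1024 {
///     GPU.set(cacheLimit: 64 * 1024 * 1024)   // excess buffers are trimmed on the next allocation
/// }
/// ```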
public static var cacheMemory: Int {
var result: size_t = 0
mlx_get_cache_memory(&result)
@@ -200,7 +252,22 @@ public enum GPU {
/// from the cache on the next allocation. To disable the cache,
/// set the limit to 0.
///
/// The cache limit defaults to the memory limit.
/// The cache limit defaults to the memory limit, which may allow very
/// large cache sizes on systems with abundant RAM. For memory-constrained
/// applications or to prevent excessive memory growth during long inference
/// runs, consider setting a much lower cache limit.
///
/// ## Performance Optimization
///
/// The optimal cache size varies by workload. Many developers find that
/// relatively small cache sizes (e.g., 2MB) perform just as well as
/// unconstrained cache sizes for their specific use cases. Experiment
/// with different values and measure performance to find the best setting
/// for your workload.
///
/// **Important**: The limit is enforced at allocation time, not when buffers
/// are returned to the cache. This means you may observe cache sizes
/// temporarily exceeding the limit until the next allocation triggers cleanup.
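///
/// A benchmarking sketch for that experiment (`runWorkload()` is a placeholder
/// for your own inference or training loop):
///
/// ```swift
/// let clock = ContinuousClock()
/// for limit in [0, 2 * 1024 * 1024, 64 * 1024 * 1024] {
///     GPU.set(cacheLimit: limit)   // 0 disables the cache entirely
///     let elapsed = clock.measure { runWorkload() }
///     print("cacheLimit \(limit): \(elapsed)")
/// }
/// ```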
///
/// Returns the previous cache limit.
public static func set(cacheLimit: Int) {
@@ -235,7 +302,12 @@ public enum GPU {
/// swap) if `relaxed` is true.
///
/// The memory limit defaults to 1.5 times the maximum recommended working set
/// size reported by the device ([recommendedMaxWorkingSetSize](https://developer.apple.com/documentation/metal/mtldevice/recommendedmaxworkingsetsize))
/// size reported by the device ([recommendedMaxWorkingSetSize](https://developer.apple.com/documentation/metal/mtldevice/recommendedmaxworkingsetsize)).
///
/// **Important**: This limit controls total MLX memory allocation. The cache limit
/// (see ``set(cacheLimit:)``) defaults to this value, so systems with large memory
/// limits may cache many GB of buffers. Consider setting a lower cache limit for
/// memory-constrained applications.
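///
/// For a memory-constrained app, the two limits might be combined like this
/// (the exact values are illustrative):
///
/// ```swift
/// GPU.set(memoryLimit: 2 * 1024 * 1024 * 1024, relaxed: false)   // hard cap: 2 GB total
/// GPU.set(cacheLimit: 32 * 1024 * 1024)                          // keep at most 32 MB of pooled buffers
/// ```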
public static func set(memoryLimit: Int, relaxed: Bool = true) {
queue.sync {
_memoryLimit = memoryLimit