78 changes: 75 additions & 3 deletions Source/MLX/GPU.swift
@@ -10,6 +10,32 @@ import Metal
/// ``activeMemory`` is the memory held by currently active ``MLXArray`` values and
/// ``cacheMemory`` is recently used memory that can be recycled.
///
/// ## Memory Management and Buffer Recycling
///
/// MLX uses a buffer recycling system to optimize performance. When MLXArrays
/// are no longer needed (such as intermediate computation results), their buffers
/// are not immediately deallocated. Instead, they are placed in a buffer pool where they
/// can be reused by later computations that need buffers of a similar size.
///
/// This recycling strategy is particularly important during model inference:
/// - Initial model weights might use ~500MB
/// - Each token generation creates intermediate buffers (e.g., ~1MB for the first token)
/// - As sequence length grows, buffer sizes increase but previous smaller buffers
/// remain in the pool waiting for reuse
/// - By the end of a long inference run, you may see several GB of cached memory
/// from accumulated buffers of various sizes
///
/// The buffer pool policy is based on Metal's `recommendedMaxWorkingSetSize`, which
/// scales with available physical memory. Systems with more RAM will cache more buffers.
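///
/// For illustration, a minimal sketch of watching the pool around a single
/// computation (`forwardPass(_:)` and `input` are placeholders for any MLX work):
///
/// ```swift
/// let before = GPU.cacheMemory
/// let output = forwardPass(input)   // placeholder computation producing MLXArrays
/// eval(output)                      // force evaluation so buffers are actually allocated
/// print("cache grew by \(GPU.cacheMemory - before) bytes")
/// ```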
///
/// ## Cache Size Optimization
///
/// The optimal cache size varies significantly by workload. While an unconstrained cache
/// can grow to several GB, developers often find that relatively small cache sizes
/// (e.g., 2MB) perform just as well for their specific use cases. The best approach
/// is to experiment with different cache limits and measure performance for your
/// particular workload.
///
/// Control the size of ``cacheMemory`` via ``GPU/set(cacheLimit:)``
/// and the overall memory limit with ``GPU/set(memoryLimit:relaxed:)``.
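///
/// For example (the byte values are illustrative, not recommendations):
///
/// ```swift
/// GPU.set(cacheLimit: 2 * 1024 * 1024)           // cap the buffer cache at 2 MB
/// GPU.set(memoryLimit: 4 * 1024 * 1024 * 1024)   // cap total MLX allocation at 4 GB
/// ```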
///
@@ -39,6 +65,23 @@ public enum GPU {
/// ``activeMemory`` is the memory held by currently active ``MLXArray`` values and
/// ``cacheMemory`` is recently used memory that can be recycled.
///
/// ## Understanding Memory Growth During Inference
///
/// During model inference with an **unconstrained cache size**, you'll typically see memory usage patterns like:
/// - **Initial**: ~500MB (model weights) + minimal cache
/// - **After first token**: +~1MB intermediates → cache grows as buffers are recycled
/// - **After 100 tokens**: Cache may be ~500MB (accumulated smaller buffers)
/// - **After 500 tokens**: Cache may be ~9.9GB (buffers of various sizes waiting for reuse)
///
/// The cache grows because each token generation needs slightly larger buffers
/// (longer sequences), but smaller buffers from previous tokens remain cached
/// for potential reuse. Running inference again will reuse these cached buffers
/// without additional memory growth.
///
/// **Important**: These large cache sizes can be controlled by setting appropriate
/// cache limits via ``GPU/set(cacheLimit:)``. The cache limit defaults to the
/// memory limit but can be set much lower to constrain memory usage.
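///
/// A sketch of watching these counters during generation
/// (`generateNextToken()` stands in for your own per-token step):
///
/// ```swift
/// for _ in 0..<500 {
///     _ = generateNextToken()
///     print("active: \(GPU.activeMemory / 1_048_576) MB, cache: \(GPU.cacheMemory / 1_048_576) MB")
/// }
/// ```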
///
/// Control the size of ``cacheMemory`` via ``GPU/set(cacheLimit:)``
/// and the overall memory limit with ``GPU/set(memoryLimit:relaxed:)``.
///
@@ -139,7 +182,16 @@ public enum GPU {
/// Get the cache size in bytes.
///
/// The cache includes memory not currently used that has not been returned
/// to the system allocator.
/// to the system allocator. This represents buffers from previous
/// computations that are kept in a buffer pool for potential reuse.
///
/// During model inference, this can grow significantly as buffers of various
/// sizes accumulate from intermediate computations. Each token generation
/// may need slightly larger buffers, causing smaller cached buffers to
/// remain unused while new, larger buffers are allocated.
///
/// The cache size is controlled by the cache limit (see ``set(cacheLimit:)``).
/// When the limit is exceeded, older cached buffers are freed on the next allocation.
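///
/// A sketch of reacting to pool growth (the 512 MB threshold and 64 MB limit
/// are illustrative):
///
/// ```swift
/// if GPU.cacheMemory > 512 * 1024 * 1024 {
///     GPU.set(cacheLimit: 64 * 1024 * 1024)   // excess buffers are trimmed on the next allocation
/// }
/// ```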
public static var cacheMemory: Int {
var result: size_t = 0
mlx_get_cache_memory(&result)
@@ -200,7 +252,22 @@ public enum GPU {
/// from the cache on the next allocation. To disable the cache,
/// set the limit to 0.
///
/// The cache limit defaults to the memory limit.
/// The cache limit defaults to the memory limit, which may allow very
/// large cache sizes on systems with abundant RAM. For memory-constrained
/// applications or to prevent excessive memory growth during long inference
/// runs, consider setting a much lower cache limit.
///
/// ## Performance Optimization
///
/// The optimal cache size varies by workload. Many developers find that
/// relatively small cache sizes (e.g., 2MB) perform just as well as
/// unconstrained cache sizes for their specific use cases. Experiment
/// with different values and measure performance to find the best setting
/// for your workload.
///
/// **Important**: The limit is enforced at allocation time, not when buffers
/// are returned to the cache. This means you may observe cache sizes
/// temporarily exceeding the limit until the next allocation triggers cleanup.
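///
/// A benchmarking sketch for that experiment (`runWorkload()` is a placeholder
/// for your own inference or training loop):
///
/// ```swift
/// let clock = ContinuousClock()
/// for limit in [0, 2 * 1024 * 1024, 64 * 1024 * 1024] {
///     GPU.set(cacheLimit: limit)   // 0 disables the cache entirely
///     let elapsed = clock.measure { runWorkload() }
///     print("cacheLimit \(limit): \(elapsed)")
/// }
/// ```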
///
/// Returns the previous cache limit.
public static func set(cacheLimit: Int) {
@@ -235,7 +302,12 @@ public enum GPU {
/// swap) if `relaxed` is true.
///
/// The memory limit defaults to 1.5 times the maximum recommended working set
/// size reported by the device ([recommendedMaxWorkingSetSize](https://developer.apple.com/documentation/metal/mtldevice/recommendedmaxworkingsetsize))
/// size reported by the device ([recommendedMaxWorkingSetSize](https://developer.apple.com/documentation/metal/mtldevice/recommendedmaxworkingsetsize)).
///
/// **Important**: This limit controls total MLX memory allocation. The cache limit
/// (see ``set(cacheLimit:)``) defaults to this value, so systems with large memory
/// limits may cache many GB of buffers. Consider setting a lower cache limit for
/// memory-constrained applications.
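///
/// For a memory-constrained app, the two limits might be combined like this
/// (the exact values are illustrative):
///
/// ```swift
/// GPU.set(memoryLimit: 2 * 1024 * 1024 * 1024, relaxed: false)   // hard cap: 2 GB total
/// GPU.set(cacheLimit: 32 * 1024 * 1024)                          // keep at most 32 MB of pooled buffers
/// ```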
public static func set(memoryLimit: Int, relaxed: Bool = true) {
queue.sync {
_memoryLimit = memoryLimit