GPU-L2-Cache-Optimizer

📑 Documentation

📄 Final Report - Full analysis of cache scaling experiments, architectural insights, and hardware modeling.

📄 Research Proposal - Initial background, motivation, and project plan.

🎞️ Presentation Slides - Slide deck covering methodology and key results.

🔍 Project Overview

This project explores the simulation and optimization of extended L2 cache architectures in GPUs using GPGPU-Sim, with the goal of improving performance for memory-bound CUDA workloads. It combines architectural experimentation, benchmark profiling, and area-energy tradeoff analysis to guide future GPU memory hierarchy design.

Author: Huy G Nguyen
Advisor: Prof. Kevin Skadron
Course: CS 6501 CPU/GPU Memory Systems, Spring 2025
Department of Computer Science, University of Virginia

🚀 Key Contributions

  • Performance Gains: Larger L2 caches significantly reduce memory stalls and DRAM traffic for memory-bound workloads.
  • Workload Sensitivity: Applications with high L2 miss rates and strong spatial/temporal locality (e.g., NW, DWT2D) see the most benefit.
  • Design Tradeoffs:
    • Associativity scaling reduces conflict misses for irregular access patterns.
    • Set count scaling helps with large working sets.
    • Gains plateau beyond 2–4× L2 size increases.
  • Hardware Cost: Larger caches increase area and dynamic energy by 2–2.5×, necessitating workload-aware cache sizing.

🧪 Tools & Benchmarks

  • Simulator: GPGPU-Sim v4.0 (Volta V100 config); L2 geometry is swept via the simulator's config file (see the sketch after this list)
  • Modeling Tool: CACTI 6.5 (32nm, 128B line)
  • Benchmarks:
    • Memory-bound: BFS, KMeans, Gaussian, NW, LUD, DWT2D
    • Compute-bound (control): Nearest Neighbor, Pathfinder
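
The cache-scaling experiments come down to editing the L2 entry in GPGPU-Sim's gpgpusim.config. Below is a minimal sketch of how such a sweep could be scripted, assuming GPGPU-Sim's documented cache-string layout (`<sets>:<line_size>:<assoc>` followed by replacement/write/MSHR fields); the paths, multipliers, and helper names are illustrative placeholders, not the exact scripts or values used in this project.

```python
import re
from pathlib import Path

def scale_dl2(config_text: str, set_mult: int = 1, assoc_mult: int = 1) -> str:
    """Scale the L2 geometry in a gpgpusim.config string.

    Only the <sets>:<line_size>:<assoc> head of the -gpgpu_cache:dl2 value is
    rewritten; the replacement, write-policy, and MSHR fields from the
    baseline config are left untouched.
    """
    pattern = re.compile(r"(-gpgpu_cache:dl2\s+(?:\w:)?)(\d+):(\d+):(\d+)")

    def bump(m: re.Match) -> str:
        sets, line, assoc = int(m.group(2)), int(m.group(3)), int(m.group(4))
        return f"{m.group(1)}{sets * set_mult}:{line}:{assoc * assoc_mult}"

    return pattern.sub(bump, config_text)

if __name__ == "__main__":
    base = Path("configs/QV100/gpgpusim.config")    # hypothetical baseline V100 config path
    sweeps = {
        "2x_sets":  dict(set_mult=2),               # set-count scaling
        "2x_assoc": dict(assoc_mult=2),             # associativity scaling
        "4x_size":  dict(set_mult=2, assoc_mult=2), # combined 4x capacity
    }
    for label, knobs in sweeps.items():
        out = Path(f"configs/QV100_{label}/gpgpusim.config")
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(scale_dl2(base.read_text(), **knobs))
```

Each generated config directory can then be used to rerun the same benchmark binaries, so that everything except the L2 geometry stays fixed across runs.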

📈 Results Summary

Extending the L2 cache yielded up to a 50% performance improvement for memory-bound CUDA applications with large working sets. In contrast, compute-bound benchmarks showed negligible gains, reinforcing that the benefit of cache scaling is workload-dependent.

  • Associativity scaling: Best for reducing conflict misses in irregular workloads (e.g., Gaussian, LUD).
  • Set count scaling: Best for large spatially-reused datasets (e.g., KMeans, DWT2D).
  • Tradeoff: Quadrupling L2 size increases area and dynamic energy by roughly 2.5×, so cache sizing must respect system-level area and power constraints (see the sketch below).
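
To reason about whether a given speedup justifies the extra silicon, the simulator and CACTI outputs can be folded into the efficiency metrics mentioned under Future Work (performance-per-area, ED²). The sketch below uses placeholder numbers purely for illustration; it is not data from the report, and the field names are assumptions about how the GPGPU-Sim cycle counts and CACTI area/energy figures might be organized.

```python
from dataclasses import dataclass

@dataclass
class L2Design:
    name: str
    cycles: float     # simulated kernel cycles (from GPGPU-Sim)
    area_mm2: float   # L2 array area (from CACTI)
    energy_nj: float  # total L2 dynamic energy: energy per access x access count

def compare(base: L2Design, scaled: L2Design) -> None:
    speedup = base.cycles / scaled.cycles
    # Performance-per-area: throughput (1/cycles) normalized by L2 area.
    ppa_ratio = (1 / scaled.cycles / scaled.area_mm2) / (1 / base.cycles / base.area_mm2)
    # Energy-delay^2 (lower is better): weights latency reduction more heavily than energy.
    ed2_ratio = (scaled.energy_nj * scaled.cycles ** 2) / (base.energy_nj * base.cycles ** 2)
    print(f"{scaled.name}: {speedup:.2f}x speedup, "
          f"{ppa_ratio:.2f}x perf/area, ED^2 ratio {ed2_ratio:.2f}")

# Placeholder inputs only; real values come from simulation and CACTI runs.
baseline = L2Design("baseline L2", cycles=1.00e6, area_mm2=10.0, energy_nj=1.0e5)
quad_l2  = L2Design("4x L2",       cycles=0.67e6, area_mm2=25.0, energy_nj=2.5e5)
compare(baseline, quad_l2)
```

Depending on the inputs, a healthy speedup can still look unattractive once area and energy are factored in, which is the motivation for the workload-aware sizing point above.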

🔮 Future Work

  • Extend analysis to PPA (performance-per-area) and ED² (energy-delay²) metrics
  • Sweep a broader range of input sizes to identify L2 stress thresholds
  • Evaluate new GPU architectures (varying SM/core count, L1 size)
  • Explore software-level optimizations (prefetching, data placement)
  • Test on ML workloads like CNNs and Transformers
  • Use Roofline modeling for further compute-bound analysis

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.


This work provides foundational insight into L2 cache scaling for future GPU architectures, especially in the context of AI and HPC workloads.
