
TC_core

RTL Low-level micro-architecture description of the datapath of Tensor Cores (Tensor Core Units, or Matrix Cores)


🧩 Overview

This repository collects several fundamental building blocks used in the datapath of Tensor Core units (on-chip hardware accelerators commonly found in GPUs and processors) [1] [2] [3].

A Tensor Core Unit (TCU), also referred to as a Matrix Core, is a Domain-Specific Architecture (DSA) designed to accelerate $m \times n \times k$ matrix multiplications. It is a fundamental building block in modern AI accelerators and is commonly integrated into today's processors and GPUs. At its core, a TCU executes the fused matrix operation:

$$D = A \times B + C$$

where $A$ and $B$ are the input matrices with shapes ($m \times k$) and ($k \times n$), respectively, while $C$ and $D$, both of shape ($m \times n$), are the accumulation and output matrices, respectively. Supported operand formats include half- (FP16) and single-precision (FP32) floating point, as well as integer (INT) and custom formats, e.g., Posit16, Posit32, or FP8.
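As a quick sanity check of the operation and shapes above, the following NumPy sketch (illustrative only, not part of the VHDL sources) performs a $4 \times 4 \times 4$ fused multiply-accumulate with FP16 inputs and FP32 accumulation, a common mixed-precision configuration for TCUs:

```python
import numpy as np

# Tile shape assumed for illustration: m = n = k = 4.
m, n, k = 4, 4, 4
rng = np.random.default_rng(0)

# FP16 input operands A (m x k) and B (k x n).
A = rng.standard_normal((m, k)).astype(np.float16)
B = rng.standard_normal((k, n)).astype(np.float16)
# FP32 accumulator C (m x n).
C = rng.standard_normal((m, n)).astype(np.float32)

# Fused operation D = A x B + C: multiply FP16 operands,
# accumulate in FP32 (mixed precision).
D = A.astype(np.float32) @ B.astype(np.float32) + C
print(D.shape)  # (4, 4)
```

Note that the products are widened to FP32 before accumulation; accumulating in the wider format is what preserves precision across the $k$-dimension sum.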

(Figure: block diagram of a $4 \times 4 \times 4$ TCU built from 16 Dot-Product Units.)

As shown in the illustration, a $4 \times 4 \times 4$ TCU comprises 16 Dot-Product Units (DPUs). Each DPU contains a layer of multipliers followed by multiple layers of adders, forming the pipeline that performs high-throughput matrix multiplications. Importantly, every adder and multiplier is itself built from lower-level components such as shifters, lead-zero counters (LZCs), and integer adders/multipliers, illustrating the hierarchical design complexity of the accelerator.

The synthesizable VHDL IP cores are designed for easy integration as a coprocessor or accelerator in processor-based systems.

Suited to use cases such as embedded systems, SoC design, and digital signal processing, the cores offer:

  • 🔧 Configurable parameters
  • 🧪 Fully testbenched with simulation support
  • 📚 Clean documentation with example integrations

📁 Directory Structure

TC_core/
│
├── README.md               # Overview of the project
│
├── DPU_core/
│   ├── DPU_FP_32           # HDL files of the DPU description
│   ├── files               # Scripting files for running the TB through ModelSim
│   └── TB                  # TB files for DPU
│
├── TCU_FP32_pipe/
│   ├── HW_sources          # HDL files for the integration of DPUs as the TCU core
│   └── TB                  # TB files for the verification of the TCU core
└── ...                         # TCUs with other shapes and number formats

🎲 Architectural Simulation Tools

The PyopenTCU tool is an architectural-level model of the TCU core that includes scheduling, dispatching, and memory-hierarchy management (i.e., register files and buffers), following the semantics of SASS MMA instructions [1] [4].
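To give a feel for what such scheduling involves (independent of PyopenTCU's actual API, which this sketch does not reproduce), the loop nest below shows how a larger GEMM decomposes into $4 \times 4 \times 4$ TCU tile operations: iterate over output tiles and accumulate partial products along the $k$ dimension.

```python
import numpy as np

T = 4  # TCU tile dimension (4x4x4), as in the core described above

def tcu_mma(a_tile, b_tile, c_tile):
    # One fused tile operation D = A x B + C, as executed by the TCU.
    return a_tile @ b_tile + c_tile

def tiled_gemm(A, B, C):
    """Dispatch an (m x n x k) GEMM as a sequence of TxTxT TCU ops.

    Assumes m, n, k are multiples of T for simplicity.
    """
    m, k = A.shape
    _, n = B.shape
    D = C.copy()
    for i in range(0, m, T):          # output-tile rows
        for j in range(0, n, T):      # output-tile columns
            for p in range(0, k, T):  # accumulate along k
                D[i:i+T, j:j+T] = tcu_mma(
                    A[i:i+T, p:p+T], B[p:p+T, j:j+T], D[i:i+T, j:j+T])
    return D

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 8))
B = rng.standard_normal((8, 8))
C = np.zeros((8, 8))
assert np.allclose(tiled_gemm(A, B, C), A @ B + C)
```

A real scheduler must additionally decide which tiles live in the register file and buffers at each step; that residency management is the part the datapath RTL in this repository does not cover.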

🎲 Additional documentation

🙌 Credits

📬 Contact

For questions, suggestions, or collaboration, feel free to reach out:
