Additional metadata for Sspesa #27

magnusjahre · 2025-11-13T06:50:23Z

As discussed in the previous meeting, Bruce and I have been working on an optional extension for Sspesa implementations that want to provide more metadata per sample, and thereby create more informative time-based profiles. This PR covers the changes that we propose.

bcstrongx · 2025-11-15T01:22:40Z

src/body.adoc

+
+The `shpmsdetails` CSRs provide optional extensions targeted at implementations that want to provide metadata in addition to `shpmspc` and `shpmsdata` for a collected sample. The goal of these extensions is to attribute execution time to all instructions retired in the sampled cycle, for cores that can retire more than one instruction per cycle. The primary method is to record the PC of the first instruction and then, during post-acquisition decoding, assign (fractional) execution time to each retired instruction. This assumes the instructions executed sequentially.
+
+There are, however, cases where the retired instructions include one or possibly more taken branches or other control transfers, e.g., interrupts or exceptions. For these cases, the `shpmsdetailsdata` CSR records how many control transfers (CTs) occurred in the sampled cycle and their locations in the retired sequence. If the core implements the CTR (Control Transfer Records, Ssctr) trace hardware, then the last control transfer target entry or entries (in the trace) can be used to determine the destination of the transfer(s), so additional instructions can be decoded from the binary being executed and assigned execution times. If CTR is not included in the core, two additional CSRs (`shpmsdetailsbt1` and `shpmsdetailsbt2`) are defined in this specification to record the targets.


We already have an ISA extension for recording branch history, and I don't think ARC will look kindly on adding another. If you want a lower-cost (shorter) branch history mechanism, the better approach would be a Smctr/Ssctr extension that supports smaller depth values. Today the minimum is 16; your extension could add 2, 4, and 8.

bcstrongx · 2025-11-15T01:28:42Z

src/body.adoc

+The sample details mechanism consists of additional SXLEN-bit read/write WARL registers called `shpmsdetailsbt1,2` and `shpmsdetailsdata`. The `shpmsdetailsbt1,2` registers store up to two branch target addresses if at least one taken branch occurs among the instructions that the core retired in the cycle the sample was taken. The rationale for supporting up to two branches is that cores with higher retire widths (e.g., 8) will typically implement Ssctr.

-WARNING: _Should we say that software should default to using `shpmspc` as-is, if it does not know of any custom algorithm for using `shpmsdata`?  Though that could result in the appearance of support for precise-attribution that in fact is not precise.  Perhaps if `shpmsdata` is not hardcoded to 0 and no custom algorithm is reported then software shouldn't report support for precise attribution?_
+NOTE: _The support for the `shpmsdetails` extension should be discoverable via the implementation's performance event JSON file._


That's not the kind of thing included in the JSON file, which holds events and details associated with them. I expected these CSRs to be part of a new extension (maybe Sspesatpp, where tpp = time proportional profiling?), so software could use the ISA string to determine whether they are implemented.
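
A minimal sketch of the ISA-string check suggested above. It assumes the underscore-separated form of the ISA string (as in `rv64imafdc_zicsr`) with no version suffixes, and it uses `sspesatpp` only because that is the tentative name floated in this comment; neither the name nor the helper `has_extension` is defined anywhere in the spec.

```python
def has_extension(isa_string: str, name: str) -> bool:
    """Return True if a multi-letter extension appears in an ISA string.

    Assumption: multi-letter extensions are separated by underscores and
    carry no version suffix. The first component (e.g. "rv64imafdc") holds
    the base ISA and single-letter extensions and is skipped.
    """
    parts = isa_string.lower().split("_")
    return name.lower() in parts[1:]


# "sspesatpp" is a hypothetical extension name taken from the review comment.
print(has_extension("rv64imafdc_zicsr_sspesatpp", "Sspesatpp"))
```

A profiler could run such a check once at start-up and fall back to plain `shpmspc` attribution when the extension is absent.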

bcstrongx · 2025-11-18T00:59:16Z

src/body.adoc

+
+As briefly described previously, sampling software can calculate the addresses of the retiring instructions by assuming sequential execution starting from the PC in `shpmspc`. This, however, does not take into account that there may be taken branches or other control transfers among the retiring instructions. The `CTs` field thus reports the number of taken branches among the retiring instructions. If `CTs` = 0, sampling software knows that all the retired instructions executed sequentially. If `CTs` = 1, the instruction stream jumped to the address in `shpmsdetailsbt1` at the instruction count recorded in the `CTOFFSET1` field. If `CTs` = 2, `CTOFFSET2` holds the instruction count at which a second branch (or control transfer) occurred, and `shpmsdetailsbt2` holds the target address at which decoding continues. If `CTs` > 2, attribution will have some bias since only two branch targets can be communicated, as defined by this specification. If this is a concern, implementors should consider implementing CTR (Control Transfer Records) hardware since it can be used to identify all control transfers.
+
+The purpose of the `RSTATE` and `CAUSE` fields is to enable creating Per-Instruction Cycle Stacks (PICS). PICS break down the time attributed to each sampled instruction so that its impact on overall execution time can be recorded and subsequently reported. The breakdown follows the state of the retire stage in each sample, as recorded in the `RSTATE` field. `RSTATE` is a two-bit field because there are four fundamental retire states: retiring, stalling, drained, and flushed. The `CAUSE` field adds information about why an instruction slowed down "ideal" execution by recording the performance events the instruction was subjected to. The first set of bits covers the most common event types and is fixed; the remaining bits are implementation specific.
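
As a rough illustration of how post-processing software might build PICS from the quoted text, the sketch below accumulates one sample's worth of time per `(pc, RSTATE)` pair. The numeric `RSTATE` encoding, the `(pcs, rstate)` sample shape, and the equal split among the instructions of a sample are all assumptions made for illustration; the hunk does not define them.

```python
from collections import defaultdict
from enum import IntEnum


class RState(IntEnum):
    # Assumed encoding: the hunk names the four states but not their values.
    RETIRING = 0
    STALLING = 1
    DRAINED = 2
    FLUSHED = 3


def build_pics(samples):
    """Accumulate per-instruction cycle stacks.

    samples: iterable of (pcs, rstate) pairs, where pcs lists the
    instruction addresses the sample is attributed to (hypothetical shape).
    Each sample carries a weight of 1.0, split equally among its pcs.
    """
    stacks = defaultdict(lambda: defaultdict(float))
    for pcs, rstate in samples:
        share = 1.0 / len(pcs)
        for pc in pcs:
            stacks[pc][RState(rstate)] += share
    return stacks


samples = [([0x1000, 0x1004], 0), ([0x1004], 1), ([0x1004], 1)]
stacks = build_pics(samples)
for pc, stack in sorted(stacks.items()):
    print(hex(pc), {state.name: time for state, time in stack.items()})
```

Which instruction a non-retiring sample should be attributed to is exactly the open question raised in the comment below, so the sketch leaves that choice to whoever produces `pcs`.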


I think much more explanation of when the hardware chooses RSTATE>00 is needed. How does the hardware know which state to choose when no instructions are retiring? What if a flush is followed by an ITLB miss on the subsequent restart, what should happen in that case?

bcstrongx · 2025-11-18T01:03:59Z

src/body.adoc

+
+The purpose of the `RSTATE` and `CAUSE` fields is to enable creating Per-Instruction Cycle Stacks (PICS). PICS break down the time attributed to each sampled instruction so that its impact on overall execution time can be recorded and subsequently reported. The breakdown follows the state of the retire stage in each sample, as recorded in the `RSTATE` field. `RSTATE` is a two-bit field because there are four fundamental retire states: retiring, stalling, drained, and flushed. The `CAUSE` field adds information about why an instruction slowed down "ideal" execution by recording the performance events the instruction was subjected to. The first set of bits covers the most common event types and is fixed; the remaining bits are implementation specific.
+
+If the core is unable to reach its ideal retire bandwidth in the cycle the sample is taken, implementations should blame the loss of retire bandwidth on a single instruction because instructions retire in program order. The typical case is that the oldest instruction is to blame, e.g., a long-latency load stalled at the head of the ROB, but a younger instruction can also block retirement. The latter case typically occurs in the cycles where the core transitions between the retiring state and the non-retiring states (stalling, drained, flushed). For example, consider a cycle in which a four-wide core retires its two oldest instructions but a long-latency load blocks further retirement. In this case, implementations should report `CAUSE` for the long-latency load, and this is enabled through the `CAUSEOFFSET` field. `CAUSEOFFSET` = 0 encodes that the `CAUSE` field is invalid or not implemented; `CAUSEOFFSET` = 1 points to the oldest retiring instruction, `CAUSEOFFSET` = 2 to the second oldest instruction, etc.
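
The `CAUSEOFFSET` rule quoted above could be decoded roughly as follows. This is a sketch only: it assumes software already holds the addresses of the instructions at the head of the retire window in program order (oldest first, including the blocking instruction), and the helper name `blamed_pc` is hypothetical.

```python
from typing import Optional, Sequence


def blamed_pc(causeoffset: int, pcs: Sequence[int]) -> Optional[int]:
    """Map CAUSEOFFSET to the address of the instruction CAUSE refers to.

    CAUSEOFFSET = 0 means CAUSE is invalid or not implemented.
    CAUSEOFFSET = k (k >= 1) selects the k-th oldest instruction.
    pcs is assumed to be ordered oldest first.
    """
    if causeoffset == 0:
        return None
    index = causeoffset - 1
    if index >= len(pcs):
        return None  # offset points past what software could reconstruct
    return pcs[index]


# Four-wide example from the text: two instructions retire, the third
# (a long-latency load at the assumed address 0x1008) blocks retirement.
window = [0x1000, 0x1004, 0x1008, 0x100C]
print(blamed_pc(3, window) == 0x1008)
```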


So is CAUSE presumed to be a series of bits, each associated with an event? That should probably be explicit.

Do you really want CAUSE associated only with retiring instructions? I would think that those events would be most interesting when no instruction is retiring, to indicate why there's no retirement. I don't really see what use they are for retiring instructions; you can use base Sspesa to find out which retired instructions are incurring events. Why a new method that's limited to a few events?

This is also quite expensive. If I'm following correctly, tracking each of these CAUSE events per instruction means adding 10 bits (or however many are implemented) per ROB entry. I would not expect even a big, high-performance implementation to support that. Are you sure you want to track these events per instruction?

bcstrongx · 2025-11-18T01:13:14Z

src/body.adoc

+| WARL| Bits for future expansion.
+|====
+
+As briefly described previously, sampling software can calculate the addresses of the retiring instructions by assuming sequential execution starting from the PC in `shpmspc`. This, however, does not take into account that there may be taken branches or other control transfers among the retiring instructions. The `CTs` field thus reports the number of taken branches among the retiring instructions. If `CTs` = 0, sampling software knows that all the retired instructions executed sequentially. If `CTs` = 1, the instruction stream jumped to the address in `shpmsdetailsbt1` at the instruction count recorded in the `CTOFFSET1` field. If `CTs` = 2, `CTOFFSET2` holds the instruction count at which a second branch (or control transfer) occurred, and `shpmsdetailsbt2` holds the target address at which decoding continues. If `CTs` > 2, attribution will have some bias since only two branch targets can be communicated, as defined by this specification. If this is a concern, implementors should consider implementing CTR (Control Transfer Records) hardware since it can be used to identify all control transfers.
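
The decoding rule in this hunk can be sketched in a few lines. Several things here are assumptions and not taken from the proposal: that the number of retired instructions is known to software, that `CTOFFSET1`/`CTOFFSET2` are 1-based positions of the transferring instruction within the retired group, and that instruction lengths come from a decoder callback (defaulting to 4 bytes, which ignores compressed instructions). The function names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def retired_pcs(first_pc: int, n_retired: int,
                ct_offsets: Sequence[int], ct_targets: Sequence[int],
                insn_len: Callable[[int], int] = lambda pc: 4) -> List[int]:
    """Reconstruct the addresses of the instructions retired in one cycle.

    first_pc   : value of shpmspc
    ct_offsets : CTOFFSET1/CTOFFSET2 (assumed 1-based), one per known transfer
    ct_targets : shpmsdetailsbt1/shpmsdetailsbt2, in the same order
    """
    jumps = dict(zip(ct_offsets, ct_targets))
    pcs = []
    pc = first_pc
    for position in range(1, n_retired + 1):
        pcs.append(pc)
        # After a control transfer, decoding continues at the recorded target.
        pc = jumps[position] if position in jumps else pc + insn_len(pc)
    return pcs


def attribute_sample(pcs: Sequence[int], weight: float = 1.0) -> Dict[int, float]:
    """Split one sample's weight equally among the retired instructions."""
    share = weight / len(pcs)
    profile: Dict[int, float] = {}
    for pc in pcs:
        profile[pc] = profile.get(pc, 0.0) + share
    return profile


# CTs = 1: the second of four retired instructions is a taken branch.
pcs = retired_pcs(0x1000, 4, [2], [0x2000])
print([hex(pc) for pc in pcs], attribute_sample(pcs))
```

With `CTs` > 2 only the first two transfers can be passed in, so any addresses after the third transfer would be wrong, which is the bias the text describes.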


The hardcoded support for 2 branches per cycle feels like this was defined with a particular implementation in mind. So implementations that can only retire 1 taken branch per cycle have extra bits/CSRs to implement, and those that support >2 have to jump up to CTR (while still having to pay the tax of CTOFFSET1/2 and shpmsdetailsbt1/2). A more flexible solution would be better here.

bcstrongx · 2025-11-18T01:46:12Z

src/body.adoc

+
+This register can also be used to generate an overflow interrupt to flag software to read the CSR sample registers, thus not requiring a programmable performance counter to be dedicated for this use.
+
+NOTE: _It is recommended that hardware be included to prevent values from being written to the register that would cause a period shorter than a minimum duration, say, 2^8 cycle counts; otherwise the downstream hardware may not be able to handle the data rates. This can be done with a comparator that checks whether the upper bits [30:8] are all 1's._
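
For readers trying to picture the comparator in the NOTE, here is a small software model. It assumes the register counts upward and overflows out of bit 30, so that a written value with bits [30:8] all set leaves at most 2^8 counts before overflow; the hunk does not state the counting direction, and `period_too_short` is a hypothetical name.

```python
# Mask selecting bits [30:8] of the written value.
UPPER_MASK = ((1 << 31) - 1) & ~((1 << 8) - 1)


def period_too_short(value: int) -> bool:
    """Model of the comparator: True when bits [30:8] are all 1's."""
    return (value & UPPER_MASK) == UPPER_MASK


print(hex(UPPER_MASK))
print(period_too_short(0x7FFFFF00), period_too_short(0x7FFFFEFF))
```

What the hardware does when the comparator fires (ignore the write, or legalize the value) is not specified in the hunk and is left out of the model.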


Why? We don't suggest that for counters today. Do we really want to spend HW on this?

bcstrongx · 2025-11-18T01:48:43Z

src/body.adoc

+
+=== Sspesa Sample Record with Extensions
+
+The sample record for precise time-based event sampling includes all of the sample data collected during execution of the sampled instruction.  For RV64 the record is 64 bytes, while for RV32 the record is 16 bytes.


I'm confused. What record? Is this written out somewhere? Or are you just summarizing the CSRs to be collected by the ISR? Why is shpmsdata ignored? Are you assuming that implementations that include your extension can always record the precise PC for any instruction?

bcstrongx · 2025-11-18T01:50:43Z

src/body.adoc

+[NOTE]
+====
+_Bits in the above WARL field may be subsequently defined as details are added to the spec, including bits defining core hardware resources needed for software decoding/interpreting Sspesa records._
+====


Why is this added? It looks like your extension doesn't use shpmsdata at all.

bcstrongx · 2025-11-18T01:51:38Z

src/body.adoc

 ====

-NOTE: _When creating time-based profiles, the value in `shpmspc` can be combined with implementation-specific metadata in `shpmsdata` and Control Transfer Records (Ssctr) to account for instruction parallelism by obtaining the addresses of all instructions that retired in the cycle the sample was taken. If Ssctr is not available, implementations should strive to avoid bias when selecting the value for `shpmspc`._
+NOTE: _When creating time-based profiles, the value in `shpmspc` can be combined with implementation-specific metadata in `shpmsdata` and Control Transfer Records (Ssctr) to account for instruction parallelism by obtaining the addresses of all instructions that retired in the cycle the sample was taken. If Ssctr is not available, additional CSRs have been defined for control transfer target addresses to be recorded (details below)._


I wouldn't alter the original comment here to remove the suggestion of avoiding bias. What you're doing is adding a 3rd alternative.

bcstrongx · 2025-11-18T01:53:55Z

src/body.adoc

-NOTE: _When creating time-based profiles, the value in `shpmspc` can be combined with implementation-specific metadata in `shpmsdata` and Control Transfer Records (Ssctr) to account for instruction parallelism by obtaining the addresses of all instructions that retired in the cycle the sample was taken. If Ssctr is not available, implementations should strive to avoid bias when selecting the value for `shpmspc`._
+NOTE: _When creating time-based profiles, the value in `shpmspc` can be combined with implementation-specific metadata in `shpmsdata` and Control Transfer Records (Ssctr) to account for instruction parallelism by obtaining the addresses of all instructions that retired in the cycle the sample was taken. If Ssctr is not available, additional CSRs have been defined for control transfer target addresses to be recorded (details below)._
+
+==== Hardware Performance Monitor Sample Details  (`shpmsdetails`)


Are any of these CSRs, or the fields within them, optional?

Magnus Jahre added 10 commits September 29, 2025 19:07

A first stab at providing more metadata

73becf1

Merge remote-tracking branch 'origin/main' into shpmsdata-details

3aa1a3b

Another pass after discussions with Bruce

e235784

Edits from Bruce

0bf2a5a

Minor typography edits

cbf405b

Redefinition by Bruce with some minor edits (typos/layout) by me

cb53122

The edits I want to make that require review

fb2f128

Added the CAUSEOFFSET field

5b4da72

Reverted docs-resources to make the PR clean

ad6d24a

Removed trailing whitespace

1053172

bcstrongx requested changes Nov 18, 2025
