riscv · magnusjahre · Sep 29, 2025 · Oct 29, 2025 · Oct 29, 2025 · Oct 30, 2025
diff --git a/src/body.adoc b/src/body.adoc
@@ -17,9 +17,9 @@ NOTE: _In contrast to Sspesa, Smpdis/Sspdis (see <<smpdis>>) create profiles in
 
 Performance profiles obtained through sampling are approximate for two reasons. The first is systematic error (or bias) which reduces profile accuracy by attributing a counter overflow to a different instruction than the one that caused it.  Sspesa aims to eliminate such inaccuracies for events that support precise attribution, and to minimize it for other events.  For example, upon overflow of a counter configured to count an L1 data cache miss event that supports precise attribution, an Sspesa implementation allows the PC of the instruction that caused the counter overflow to be discerned.
 
-Time-based profiles are more complicated because no specific instruction caused the cycle counter to overflow. A time-based profile is however most useful when each instruction is represented in the profile in proportion to its impact on overall execution time. To create a bias-free profile, the implementation must therefore attribute each sample to the instruction(s) for which the core is exposing execution time (i.e., retiring) in the cycle the counter overflows; this is known as time-proportional attribution. Systematic errors can hence be eliminated by adopting time-proportional attribution policies, but this may not always be practical (e.g., due to implementation overheads). For this reason, it is strongly recommended that Sspesa implementations minimize attribution bias when creating time-based profiles.
+Time-based profiles are different because no specific instruction causes the cycle counter to overflow; the overflow instead effectively selects, at random, an instruction at retirement. A time-based profile is useful because each instruction is represented in the profile in proportion to its impact on overall execution time. To create a bias-free profile, the implementation must therefore attribute each sample to the instruction(s) for which the core is exposing execution time (i.e., retiring) in the cycle the counter overflows; this is known as time-proportional attribution. Systematic errors can hence be eliminated by adopting time-proportional attribution policies, although doing so may add implementation overhead. For this reason, it is strongly recommended that Sspesa implementations minimize attribution bias when creating time-based profiles.
 
-Statistical (or random) error is the second reason why all sample-based profiles are approximate. This is a byproduct of sampling events, rather than collecting attribution information for every event occurrence.  This source of imprecision can be mitigated by increasing the sample rate, though even sampling at 4kHz (the default for Linux perf) typically shows negligible error.
+Statistical (or random) error is the second reason why all sample-based profiles are approximate. This is a byproduct of sampling: only a limited number of samples is collected, rather than attribution information for every event occurrence.  This source of imprecision can be mitigated by increasing the sample rate, though even sampling at 4 kHz (the default for Linux perf) typically shows negligible error.
 
 === CSRs
 
@@ -29,7 +29,7 @@ Statistical (or random) error is the second reason why all sample-based profiles
 
 NOTE: _The dependence on `__x__ip`.LCOFIP serves to ensure that only the first overflow leading to an LCOFI updates `shpmspc`.  If additional counters overflow before the LCOFI trap, the recorded sample PC is not overwritten.  This ensures that the sample PC is consistent with the point where CTR is frozen, if `__x__ctrctl`.LCOFIFRZ=1. The ID of the counter that caused the sample PC update is recorded in `shpmsdata`.CNTRID._
 
-.Sample PC Register for SXLEN=64
+.Sample PC Register for SXLEN=64 (`shpmspc`)
 [%unbreakable]
 [wavedrom, , svg]
 ....
@@ -58,7 +58,7 @@ NOTE: _As an example, for simultaneous overflow of `mhpmcounter5` and `mhpmcount
 
 The format of the remaining bits is implementation-defined, serving to provide any metadata necessary to derive the sample PC from `shpmspc` and any other relevant state.  For implementations where `shpmspc` captures the precise sample PC, or where no additional metadata is needed, the remaining `shpmsdata` bits may be hardwired to 0.
 
-.Sample Metadata Register for SXLEN=64
+.Sample Metadata Register for SXLEN=64 (`shpmsdata`)
 [%unbreakable]
 [wavedrom, , svg]
 ....
@@ -70,14 +70,137 @@ The format of the remaining bits is implementation-defined, serving to provide a
 
 Access to `shpmsdata` matches that of `shpmspc` above, and `shpmsdata` captures sample metadata for the same cases where `shpmspc` captures a sample PC.
 
+[NOTE]
+====
+_Bits in the above WARL field may be subsequently defined as details are added to the spec, including bits defining core hardware resources needed for software decoding/interpreting Sspesa records._
+====
+
 [NOTE]
 ====
 _In modern, superscalar implementations, the microarchitecture may be optimized such that the full PC of each retired instruction is not maintained throughout the pipeline.  The `shpmsdata` register provides a standard means by which such implementations can provide precise attribution, using a reference PC (`shpmspc`) and custom metadata that can be used by implementation-specific software algorithms to discern the appropriate sample PC._
 ====
 
-NOTE: _When creating time-based profiles, the value in `shpmspc` can be combined with implementation-specific metadata in `shpmsdata` and Control Transfer Records (Ssctr) to account for instruction parallelism by obtaining the addresses of all instructions that retired in the cycle the sample was taken. If Ssctr is not available, implementations should strive to avoid bias when selecting the value for `shpmspc`._
+NOTE: _When creating time-based profiles, the value in `shpmspc` can be combined with implementation-specific metadata in `shpmsdata` and Control Transfer Records (Ssctr) to account for instruction parallelism by obtaining the addresses of all instructions that retired in the cycle the sample was taken. If Ssctr is not available, the additional CSRs defined below can record the control transfer target addresses instead._
+
+==== Hardware Performance Monitor Sample Details  (`shpmsdetails`)
+
+The `shpmsdetails` CSRs provide optional extensions targeted at implementations that want to provide metadata in addition to `shpmspc` and `shpmsdata` for a collected sample. The goal of these extensions is to attribute execution time to all instructions retired in the sampled cycle, for cores that can retire more than one instruction per cycle.  The primary method is to record the PC of the first instruction and then, during post-acquisition decoding, assign (fractional) execution time to each retired instruction.  This assumes that the instructions executed sequentially.
+
+There are, however, cases where the retired instructions include one or possibly more taken branches or other control transfers, e.g., interrupts or exceptions.  For these cases, the `shpmsdetailsdata` CSR records how many control transfers (CTs) occurred in the sampled cycle and their locations in the retired sequence. If the core implements the Control Transfer Records (CTR, Ssctr) trace hardware, then the last control transfer target entry or entries in the trace can be used to determine the destination of the transfer(s), so that additional instructions can be decoded from the executable binary and assigned execution time.  If CTR is not included in the core, two additional CSRs (`shpmsdetailsbt1` and `shpmsdetailsbt2`) are defined in this specification to record the targets.
+
+.Sample Details Branch Target PC Register for SXLEN=64  (`shpmsdetailsbt1`)
+[%unbreakable]
+[wavedrom, , svg]
+....
+{reg: [
+    {bits:  64, name: 'Sample Branch Target PC 1'},
+], config:{lanes: 1, hspace:1024}}
+....
+
+.Sample Details Branch Target PC Register for SXLEN=64  (`shpmsdetailsbt2`)
+[%unbreakable]
+[wavedrom, , svg]
+....
+{reg: [
+    {bits:  64, name: 'Sample Branch Target PC 2'},
+], config:{lanes: 1, hspace:1024}}
+....
+
+The sample details mechanism consists of additional SXLEN-bit read/write WARL registers called `shpmsdetailsbt1`, `shpmsdetailsbt2`, and `shpmsdetailsdata`. The `shpmsdetailsbt1` and `shpmsdetailsbt2` registers store up to two branch target addresses if at least one taken branch occurs among the instructions that the core retired in the cycle the sample was taken. The rationale for supporting up to two branches is that cores with higher retire widths (e.g., 8) will typically implement Ssctr.
 
-WARNING: _Should we say that software should default to using `shpmspc` as-is, if it does not know of any custom algorithm for using `shpmsdata`?  Though that could result in the appearance of support for precise-attribution that in fact is not precise.  Perhaps if `shpmsdata` is not hardcoded to 0 and no custom algorithm is reported then software shouldn't report support for precise attribution?_
+NOTE: _The support for the `shpmsdetails` extension should be discoverable via the implementation's performance event JSON file._
+
+The `shpmsdetailsdata` register provides the information needed to interpret the `shpmsdetailsbt1` and `shpmsdetailsbt2` registers, along with additional sample metadata. A further purpose of this and the other `shpmsdetails` CSRs is to provide a standard interface for communicating to profiling software the reason(s) why an instruction was sampled.
+
+.Sample Details Data Register (`shpmsdetailsdata`)
+[wavedrom, , svg]
+....
+{reg: [
+    {bits:  4, name: 'INSTRET'},
+    {bits:  2, name: 'RSTATE'},
+    {bits:  4, name: 'CTs'},
+    {bits:  4, name: 'CTOFFSET1'},
+    {bits:  4, name: 'CTOFFSET2'},
+    {bits:  4, name: 'CAUSEOFFSET'},
+    {bits:  10, name: 'CAUSE'},
+    {bits: 32, name: 'WARL', type: 1},
+], config:{lanes: 2, hspace:1024}}
+....
+
+[cols="15%,85%",options="header"]
+|====
+| Field | Description
+| INSTRET | The number of instructions retired in the sampled cycle, counting 16-bit and 32-bit instructions alike.
+| RSTATE | The state of the *Retire* stage in the cycle the sample is taken. +
+ `00` signifies that the core retired at least one instruction. +
+ `01` means that no instructions retired because the pipeline is stalled. +
+ `10` reports that retire is empty due to a pipeline flush. +
+ `11` reports that retire is empty because it was not supplied with instructions.
+| CTs | (CT=Control Transfer) The number of deviations from sequential control flow (typically taken branches) that occurred among the instructions that retired in the cycle the sample was taken. Such deviations can also include interrupts and exceptions.
+| CTOFFSET1 | If CTs > 0, this field records the offset, in instructions from the first (oldest) retired instruction (whose PC is stored in `shpmspc`), at which the first control transfer occurred. The PC of the control transfer target is in `shpmsdetailsbt1`, which profiling software uses to assign sample information to the instructions at that target. For example, CTOFFSET1 equals 1 when a control transfer (a non-sequential execution address) occurred between the first and second retired instructions.
+| CTOFFSET2 | If CTs > 1, this field holds the offset, in instructions from the first retired instruction, at which the second control transfer occurred. The PC of the target of this control transfer is saved in `shpmsdetailsbt2`.
+| CAUSEOFFSET | This field reports which of the retired instructions the `CAUSE` field applies to. It is offset-by-one encoded, i.e., the value 1 refers to the first retired instruction, 2 refers to the second retired instruction, etc. The value 0 reports that no cause information is provided.
+| CAUSE | This field provides further information about why an instruction blocked retirement, e.g., the (combinations of) performance event(s) that a blocking instruction was subjected to. This specification recommends fixing the first set of CAUSE bits to be I$ miss, ITLB miss, branch mispredict, D$ miss, and DTLB miss.
+| WARL | Bits for future expansion.
+|====
+
+As described above, sampling software can calculate the addresses of the retiring instructions by assuming sequential execution starting from the PC in `shpmspc`. This, however, does not account for taken branches or other control transfers among the retiring instructions. The `CTs` field therefore reports the number of control transfers among the retiring instructions. If `CTs` = 0, sampling software knows that all the retired instructions executed sequentially. If `CTs` = 1, the instruction stream jumped to the address in `shpmsdetailsbt1` at the instruction count recorded in the `CTOFFSET1` field. If `CTs` = 2, `CTOFFSET2` holds the instruction count at which a second branch (or other control transfer) occurred, and `shpmsdetailsbt2` holds the target address from which decoding continues. If `CTs` > 2, attribution will have some bias, since this specification allows only two branch targets to be communicated. If this is a concern, implementers should consider implementing CTR (Control Transfer Records) hardware, since it can be used to identify all control transfers.
+
+The purpose of the `RSTATE` and `CAUSE` fields is to enable creating Per-Instruction Cycle Stacks (PICS). PICS break down the time attributed to each sampled instruction in order to report its impact on overall execution time. This is done by partitioning each instruction's contribution to overall execution time according to the state of the retire stage in each sample, as recorded in the `RSTATE` field. `RSTATE` is a two-bit field because there are four fundamental retire states: retiring, stalled, flushed, and drained. The `CAUSE` field adds further detail by recording the performance events that an instruction slowing down "ideal" execution was subjected to. The first set of `CAUSE` bits covers the most common event types and is fixed; the remaining bits are implementation-specific.
+
+If the core is unable to reach its ideal retire bandwidth in the cycle the sample is taken, implementations should blame the loss of retire bandwidth on a single instruction, because instructions retire in program order. The typical case is that the oldest instruction is to blame, e.g., a long-latency load stalled at the head of the reorder buffer (ROB), but a younger instruction can also block retirement. The latter case typically occurs in the cycles where the core transitions between the retiring state and the non-retiring states (stalled, drained, flushed). For example, consider a cycle in which a four-wide core retires its two oldest instructions but a long-latency load blocks further retirement. In this case, implementations should report `CAUSE` for the long-latency load, which the `CAUSEOFFSET` field makes possible. `CAUSEOFFSET` = 0 encodes that the `CAUSE` field is invalid or not implemented; `CAUSEOFFSET` = 1 therefore points to the oldest retiring instruction, `CAUSEOFFSET` = 2 to the second-oldest instruction, etc.
+
+For the non-retiring sample states, hardware will use `shpmspc` (the Sample PC register) to point to the instruction that caused that state.
+
+NOTE: _Even if the meaning of some of the bits in the `CAUSE` field is implementation-specific, the software required for creating PICS can be reused across implementations. The reason is that creating PICS requires partitioning samples according to the values of the `RSTATE` and `CAUSE` fields, and this task is independent of the specific meaning of each value._
+
+.Sample Details Cycle Counter (`shpmsdetailscc`)
+[%unbreakable]
+[wavedrom, , svg]
+....
+{reg: [
+{bits:  31, name: 'SAMPLEINITVALUE'},
+{bits:  1, name: 'AR'}
+], config:{lanes: 1, hspace:1024}}
+....
+
+[cols="20%,80%",options="header"]
+|====
+| Field | Description
+|SAMPLEINITVALUE | Initial value that hardware loads into the count-up cycle counter after an overflow. When the counter reaches all ones, the next clock cycle triggers a sample, the count-up counter is reloaded with the initial value, and counting begins again. Hardware implementations have the discretion to use fewer than 31 bits for the counter; the number of implemented bits determines the longest sample period.  (For example, a 24-bit counter with an initial value of 0 overflows after 2^24 cycles, about 16.8 ms at a 1 GHz clock.)
+| AR | When the AR bit is set to 1 and sampling has started, sampling enters AutoRun mode.  This mode is primarily for sampling that streams records to a Trace Unit. When AR is 0, AutoRun is disabled and samples are read by an interrupt handler.
+|====
+
+The `shpmsdetailscc` register enables streaming sample data to a memory buffer. For example, if the output of Sspesa is connected to a Trace Unit, the hardware takes a sample and sends it to the Trace Unit to be formed into a message and output. The value in the SAMPLEINITVALUE field of `shpmsdetailscc` is loaded into a count-up cycle counter. When the counter overflows, a sample is taken and streamed to the trace, the counter is reloaded with the SAMPLEINITVALUE value, and counting restarts.
+
+This register can also be used to generate an overflow interrupt to flag software to read the CSR sample registers, thus not requiring a programmable performance counter to be dedicated for this use.
+
+NOTE: _It is recommended that hardware prevent values from being written to the register that would cause a sample period shorter than a minimum duration, e.g., 2^8 cycles; otherwise, downstream hardware may not be able to handle the data rate. This can be done with a comparator that detects when bits [30:8] are all 1s._
+
+=== Sspesa Sample Record with Extensions
+
+The sample record for precise time-based event sampling includes all of the sample data collected during execution of the sampled instruction.  For RV64 the record is 32 bytes (four 64-bit registers), while for RV32 the record is 16 bytes (four 32-bit registers).
+
+.Sspesa Sample Record with Extensions, for RV64
+[cols="5%,90%,5%",options="header",grid=rows]
+|===
+| 63 || 0
+3+^| Pesa Program Counter (shpmspc)
+3+^| Pesa Sample Details Data (shpmsdetailsdata)
+3+^| Pesa Branch Target PC Register 1 (shpmsdetailsbt1)
+3+^| Pesa Branch Target PC Register 2 (shpmsdetailsbt2)
+|===
+
+
+.Sspesa Sample Record with Extensions, for RV32
+[cols="5%,90%,5%",options="header",grid=rows]
+|===
+| 31 || 0
+3+^| Pesa Program Counter (shpmspc)
+3+^| Pesa Sample Details Data (shpmsdetailsdata[31:0])
+3+^| Pesa Branch Target PC Register 1 (shpmsdetailsbt1)
+3+^| Pesa Branch Target PC Register 2 (shpmsdetailsbt2)
+|===
 
 == Precise Local Counter Overflow Interrupt ISA Extension (Ssplcofi)