135 changes: 129 additions & 6 deletions src/body.adoc
@@ -17,9 +17,9 @@ NOTE: _In contrast to Sspesa, Smpdis/Sspdis (see <<smpdis>>) create profiles in

Performance profiles obtained through sampling are approximate for two reasons. The first is systematic error (or bias) which reduces profile accuracy by attributing a counter overflow to a different instruction than the one that caused it. Sspesa aims to eliminate such inaccuracies for events that support precise attribution, and to minimize it for other events. For example, upon overflow of a counter configured to count an L1 data cache miss event that supports precise attribution, an Sspesa implementation allows the PC of the instruction that caused the counter overflow to be discerned.

Time-based profiles are more complicated because no specific instruction caused the cycle counter to overflow. A time-based profile is however most useful when each instruction is represented in the profile in proportion to its impact on overall execution time. To create a bias-free profile, the implementation must therefore attribute each sample to the instruction(s) for which the core is exposing execution time (i.e., retiring) in the cycle the counter overflows; this is known as time-proportional attribution. Systematic errors can hence be eliminated by adopting time-proportional attribution policies, but this may not always be practical (e.g., due to implementation overheads). For this reason, it is strongly recommended that Sspesa implementations minimize attribution bias when creating time-based profiles.
Time-based profiles are different because no specific instruction causes the cycle counter to overflow; the cycle count instead randomly selects an instruction at retirement. A time-based profile is useful because each instruction is represented in the profile in proportion to its impact on overall execution time. To create a bias-free profile, the implementation must therefore attribute each sample to the instruction(s) for which the core is exposing execution time (i.e., retiring) in the cycle the counter overflows; this is known as time-proportional attribution. Systematic errors can hence be eliminated by adopting time-proportional attribution policies; however, this may require additional overhead to implement. For this reason, it is strongly recommended that Sspesa implementations minimize attribution bias when creating time-based profiles.

Statistical (or random) error is the second reason why all sample-based profiles are approximate. This is a byproduct of sampling events, rather than collecting attribution information for every event occurrence. This source of imprecision can be mitigated by increasing the sample rate, though even sampling at 4kHz (the default for Linux perf) typically shows negligible error.
Statistical (or random) error is the second reason why all sample-based profiles are approximate. This is a byproduct of sampling: only a limited number of samples is collected, rather than one for every instruction. This source of imprecision can be mitigated by increasing the sample rate, though even sampling at 4 kHz (the default for Linux perf) typically shows negligible error.
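
As a rough illustration of how statistical error shrinks as more samples are collected, the sketch below applies the standard binomial approximation for the standard error of a sampled proportion. The function name and the example numbers (one second of sampling at 4 kHz, an instruction that accounts for 10% of execution time) are illustrative assumptions and not part of the specification.

```python
import math

def profile_std_error(fraction: float, num_samples: int) -> float:
    """Standard error of an estimated time fraction.

    Treats each sample as an independent draw (binomial approximation).
    """
    return math.sqrt(fraction * (1.0 - fraction) / num_samples)

# Assumed example: 4 kHz sampling for 1 second yields 4000 samples.
samples = 4000
err = profile_std_error(0.10, samples)
print(f"standard error: {err:.4f}")

# Quadrupling the number of samples halves the error.
print(f"with 4x samples: {profile_std_error(0.10, 4 * samples):.4f}")
```

Under these assumptions the error falls with the square root of the sample count, so longer runs or higher sample rates tighten the profile only gradually.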

=== CSRs

@@ -29,7 +29,7 @@ Statistical (or random) error is the second reason why all sample-based profiles

NOTE: _The dependence on `__x__ip`.LCOFIP serves to ensure that only the first overflow leading to an LCOFI updates `shpmspc`. If additional counters overflow before the LCOFI trap, the recorded sample PC is not overwritten. This ensures that the sample PC is consistent with the point where CTR is frozen, if `__x__ctrctl`.LCOFIFRZ=1. The ID of the counter that caused the sample PC update is recorded in `shpmsdata`.CNTRID._

.Sample PC Register for SXLEN=64
.Sample PC Register for SXLEN=64 (`shpmspc`)
[%unbreakable]
[wavedrom, , svg]
....
@@ -58,7 +58,7 @@ NOTE: _As an example, for simultaneous overflow of `mhpmcounter5` and `mhpmcount

The format of the remaining bits is implementation-defined, serving to provide any metadata necessary to derive the sample PC from `shpmspc` and any other relevant state. For implementations where `shpmspc` captures the precise sample PC, or where no additional metadata is needed, the remaining `shpmsdata` bits may be hardwired to 0.

.Sample Metadata Register for SXLEN=64
.Sample Metadata Register for SXLEN=64 (`shpmsdata`)
[%unbreakable]
[wavedrom, , svg]
....
@@ -70,14 +70,137 @@ The format of the remaining bits is implementation-defined, serving to provide a

Access to `shpmsdata` matches that of `shpmspc` above, and `shpmsdata` captures sample metadata for the same cases where `shpmspc` captures a sample PC.

[NOTE]
====
_Bits in the above WARL field may be subsequently defined as details are added to the spec, including bits defining core hardware resources needed for software decoding/interpreting Sspesa records._
====
Collaborator

Why is this added? It looks like your extension doesn't use shpmsdata at all.


[NOTE]
====
_In modern, superscalar implementations, the microarchitecture may be optimized such that the full PC of each retired instruction is not maintained throughout the pipeline. The `shpmsdata` register provides a standard means by which such implementations can provide precise attribution, using a reference PC (`shpmspc`) and custom metadata that can be used by implementation-specific software algorithms to discern the appropriate sample PC._
====

NOTE: _When creating time-based profiles, the value in `shpmspc` can be combined with implementation-specific metadata in `shpmsdata` and Control Transfer Records (Ssctr) to account for instruction parallelism by obtaining the addresses of all instructions that retired in the cycle the sample was taken. If Ssctr is not available, implementations should strive to avoid bias when selecting the value for `shpmspc`._
NOTE: _When creating time-based profiles, the value in `shpmspc` can be combined with implementation-specific metadata in `shpmsdata` and Control Transfer Records (Ssctr) to account for instruction parallelism by obtaining the addresses of all instructions that retired in the cycle the sample was taken. If Ssctr is not available, additional CSRs are defined to record control transfer target addresses (details below)._
Collaborator

I wouldn't alter the original comment here to remove the suggestion of avoiding bias. What you're doing is adding a 3rd alternative.


==== Hardware Performance Monitor Sample Details (`shpmsdetails`)
Collaborator

Are any of these CSRs, or the fields within them, optional?


The `shpmsdetails` CSRs provide optional extensions targeted at implementations that want to provide metadata in addition to `shpmspc` and `shpmsdata` for a collected sample. The goal of these extensions is to attribute execution time to all instructions retired in the sampled cycle, for cores that can retire more than one instruction per cycle. The primary method is to record the PC of the first instruction and then, during post-acquisition decoding, assign (fractional) execution time to each retired instruction. This assumes that the instructions executed sequentially.

There are, however, cases where the retired instructions include one or possibly more taken branches or other control transfers, e.g., interrupts or exceptions. For these cases, the `shpmsdetailsdata` CSR records how many control transfers (CTs) occurred in the sampled cycle and their locations in the retired sequence. If the core implements the Control Transfer Records (CTR, Ssctr) hardware, then the most recent control transfer target entry or entries can be used to determine the destination of the transfer(s), so that additional instructions can be decoded from the program binary and assigned execution time. If CTR is not included in the core, then two additional CSRs (`shpmsdetailsbt1` and `shpmsdetailsbt2`) are defined in this specification to record the targets.
Collaborator

We already have an ISA extension for recording branch history, I don't think ARC will look kindly on adding another. If you want a lower-cost (shorter) branch history mechanism, the better approach would be a Smctr/Ssctr extension that supports smaller depth values. Today the minimum is 16, your extension could add 2, 4, and 8.


.Sample Details Branch Target PC Register for SXLEN=64 (`shpmsdetailsbt1`)
[%unbreakable]
[wavedrom, , svg]
....
{reg: [
{bits: 64, name: 'Sample Branch Target PC 1'},
], config:{lanes: 1, hspace:1024}}
....

.Sample Details Branch Target PC Register for SXLEN=64 (`shpmsdetailsbt2`)
[%unbreakable]
[wavedrom, , svg]
....
{reg: [
{bits: 64, name: 'Sample Branch Target PC 2'},
], config:{lanes: 1, hspace:1024}}
....

The sample details mechanism consists of additional SXLEN-bit read/write WARL registers called `shpmsdetailsbt1`, `shpmsdetailsbt2`, and `shpmsdetailsdata`. The `shpmsdetailsbt1` and `shpmsdetailsbt2` registers store up to two branch target addresses if at least one taken branch occurs among the instructions that the core retired in the cycle the sample was taken. The rationale for supporting up to two branches is that cores with higher retire widths (e.g., 8) will typically implement Ssctr.

WARNING: _Should we say that software should default to using `shpmspc` as-is, if it does not know of any custom algorithm for using `shpmsdata`? Though that could result in the appearance of support for precise-attribution that in fact is not precise. Perhaps if `shpmsdata` is not hardcoded to 0 and no custom algorithm is reported then software shouldn't report support for precise attribution?_
NOTE: _The support for the `shpmsdetails` extension should be discoverable via the implementation's performance event JSON file._
Collaborator

That's not the kind of thing included in the JSON file, which holds events and details associated with them. I expected these CSRs to be part of a new extension (maybe Sspesatpp, where tpp = time proportional profiling?), so software could use the ISA string to determine whether they are implemented.


The `shpmsdetailsdata` register provides the information needed to interpret the `shpmsdetailsbt1` and `shpmsdetailsbt2` registers, along with additional sample metadata. These CSRs also give profiling software a standard interface for learning the reason(s) why an instruction was sampled.

.Sample Details Data Register (`shpmsdetailsdata`)
[wavedrom, , svg]
....
{reg: [
{bits: 4, name: 'INSTRET'},
{bits: 2, name: 'RSTATE'},
{bits: 4, name: 'CTs'},
{bits: 4, name: 'CTOFFSET1'},
{bits: 4, name: 'CTOFFSET2'},
{bits: 4, name: 'CAUSEOFFSET'},
{bits: 10, name: 'CAUSE'},
{bits: 32, name: 'WARL', type: 1},
], config:{lanes: 2, hspace:1024}}
....

[cols="15%,85%",options="header"]
|====
| Field | Description
| INSTRET | The number of instructions retired, regardless of whether they are 16-bit or 32-bit instructions.
| RSTATE | The state of the *Retire* stage in the cycle the sample is taken. +
`00` signifies that the core retired at least one instruction. +
`01` means that no instructions retired because the pipeline is stalled. +
`10` reports that retire is empty due to a pipeline flush. +
`11` reports that retire is empty because it was not supplied with instructions.
| CTs | (CT=Control Transfer) The number of deviations from sequential control flow (typically taken branches) that occurred among the instructions that retired in the cycle the sample was taken. Breaks from sequential flow can include interrupts and exceptions.
| CTOFFSET1 | If CTs > 0, this field records the offset from the first (oldest) retired instruction, pointed to by the sampled PC, stored in `shpmspc`, at which the first branch occurred. The PC of the branch target is in `shpmsdetailsbt1` and this is used by recording software to assign sample information to that instruction. For example, CTOFFSET1 equals 1 when a control transfer (a non-sequential execution address) occurred between the first and second retired instruction.
| CTOFFSET2 | If CTs > 1, `CTOFFSET2` holds the instruction count offset from the first retired instruction at which the second control transfer occurred. The PC of the branch target for this instruction is saved in `shpmsdetailsbt2`.
| CAUSEOFFSET | This field reports which of the retired instructions the `CAUSE` field applies to and is offset-by-one encoded, i.e., the value 1 refers to the first retired instruction, 2 refers to the second retired instruction, etc. The value 0 reports that no cause information is provided.
| CAUSE | This field provides further information about why an instruction blocked retire, e.g., the (combinations of) performance event(s) that a blocking instruction was subjected to. This specification recommends fixing the first set of CAUSE bits to be I$ miss, ITLB miss, branch mispredict, D$ miss, and DTLB miss.
| WARL| Bits for future expansion.
|====
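
To make the field layout concrete, the following sketch unpacks a raw `shpmsdetailsdata` value. It assumes the fields are packed starting at bit 0 in the order the register figure lists them (INSTRET in bits 3:0, RSTATE in 5:4, and so on up to CAUSE in 31:22); the diff does not state bit positions explicitly, so treat that ordering, the `SampleDetails` type, and the short `RSTATE_NAMES` labels as assumptions made for illustration.

```python
from typing import NamedTuple

class SampleDetails(NamedTuple):
    instret: int
    rstate: int
    cts: int
    ctoffset1: int
    ctoffset2: int
    causeoffset: int
    cause: int

# (name, width) pairs in figure order; assumed to start at the least significant bit.
_FIELDS = [("instret", 4), ("rstate", 2), ("cts", 4), ("ctoffset1", 4),
           ("ctoffset2", 4), ("causeoffset", 4), ("cause", 10)]

# Shorthand labels for the four RSTATE encodings described in the table.
RSTATE_NAMES = {0: "retiring", 1: "stalled", 2: "flushed", 3: "drained"}

def decode_details(value: int) -> SampleDetails:
    """Split a raw register value into its fields (upper WARL bits are ignored)."""
    fields = {}
    shift = 0
    for name, width in _FIELDS:
        fields[name] = (value >> shift) & ((1 << width) - 1)
        shift += width
    return SampleDetails(**fields)

# Example: 2 instructions retired, 1 control transfer at offset 1.
raw = 2 | (1 << 6) | (1 << 10)
details = decode_details(raw)
print(details, RSTATE_NAMES[details.rstate])
```

Driving the decoder from a table of names and widths keeps it easy to adjust if the final field positions differ from the assumed ones.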

As briefly described previously, sampling software can calculate the addresses of the retiring instructions by assuming sequential execution starting from the PC in `shpmspc`. This, however, does not take into account that there may be taken branches or other control transfers among the retiring instructions. The `CTs` field therefore reports the number of taken branches (or other control transfers) among the retiring instructions. If `CTs` = 0, sampling software knows that all the retired instructions executed sequentially. If `CTs` = 1, the instruction stream jumped to the address in `shpmsdetailsbt1` at the instruction count recorded in the `CTOFFSET1` field. If `CTs` = 2, `CTOFFSET2` holds the instruction count at which a second branch (or control transfer) occurred, and `shpmsdetailsbt2` holds the target address from which decoding continues. If `CTs` > 2, attribution will have some bias, since this specification can communicate only two branch targets. If this is a concern, implementors should consider implementing CTR (Control Transfer Records) hardware, since it can be used to identify all control transfers.
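
The address reconstruction described above can be sketched as follows. This is a minimal illustration and not a reference decoder: `insn_len` is a hypothetical callback that returns the length in bytes of the instruction at a given PC (for example, by decoding the program binary), and `CTOFFSETn` is read as the zero-based position of the first instruction executed at the branch target, which matches the `CTOFFSET1` = 1 example in the field table.

```python
from typing import Callable, List

def retired_pcs(spc: int, instret: int, cts: int,
                ctoffset1: int, ctoffset2: int,
                bt1: int, bt2: int,
                insn_len: Callable[[int], int]) -> List[int]:
    """Return the PCs of the instructions retired in the sampled cycle.

    insn_len(pc) is a hypothetical helper giving the instruction length in bytes.
    """
    if instret == 0:
        return []
    # Map "position in the retired sequence" to the control transfer target.
    targets = {}
    if cts >= 1:
        targets[ctoffset1] = bt1
    if cts >= 2:
        targets[ctoffset2] = bt2
    # With cts > 2, only the first two transfers are known, so later PCs may be wrong.
    pcs = [spc]
    for position in range(1, instret):
        prev = pcs[-1]
        if position in targets:
            pcs.append(targets[position])
        else:
            pcs.append(prev + insn_len(prev))
    return pcs

# Example: four 4-byte instructions with one taken branch before the third.
pcs = retired_pcs(0x1000, 4, 1, 2, 0, 0x2000, 0, lambda pc: 4)
print([hex(pc) for pc in pcs])
```

Each returned PC would then receive an equal share (1/INSTRET) of the sample when building a time-proportional profile.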
Collaborator

The hardcoded support for 2 branches per cycle feels like this was defined with a particular implementation in mind. So implementations that can only retire 1 taken branch per cycle have extra bits/CSRs to implement, and those that support >2 have to jump up to CTR (while still having to pay the tax of CTOFFSET1/2 and shpmdetailsbt1/2). A more flexible solution would be better here.


The purpose of the `RSTATE` and `CAUSE` fields is to enable the creation of Per-Instruction Cycle Stacks (PICS). PICS break down the time attributed to each sampled instruction according to the state of the retire stage in each sample, as recorded in the `RSTATE` field, in order to report the instruction's impact on overall execution time. `RSTATE` is a two-bit field because there are four fundamental retire states: retiring, stalled, flushed, and drained. The `CAUSE` field adds the performance events that an instruction slowing down "ideal" execution was subjected to. The first set of `CAUSE` bits covers the most common event types and is fixed; the remaining bits are implementation-specific.
Collaborator

I think much more explanation of when the hardware chooses RSTATE>00 is needed. How does the hardware know which state to choose when no instructions are retiring? What if a flush is followed by an ITLB miss on the subsequent restart, what should happen in that case?


If the core is unable to reach its ideal retire bandwidth in the cycle the sample is taken, implementations should blame the loss of retire bandwidth on a single instruction because instructions retire in program order. The typical case is that the oldest instruction is to blame, e.g., a long-latency load stalled at the head of the ROB, but a younger instruction can also block retirement. The latter case typically occurs in the cycles where the core transitions between the retiring state and the non-retiring states (stalled, drained, flushed). For example, consider a cycle in which a four-wide core retires its two oldest instructions but a long-latency load blocks further retirement. In this case, implementations should report `CAUSE` for the long-latency load, and this is enabled through the `CAUSEOFFSET` field. `CAUSEOFFSET` = 0 encodes that the `CAUSE` field is invalid or not implemented; `CAUSEOFFSET` = 1 therefore points to the oldest retiring instruction, `CAUSEOFFSET` = 2 to the second oldest instruction, etc.
Collaborator

So is CAUSE presumed to be a series of bits, each associated with an event? That should probably be explicit.

Collaborator

Do you really want CAUSE associated only with retiring instructions? I would think that those events would be most interesting when no instruction is retiring, to indicate why there's no retirement. I don't really see what use they are for retiring instructions, you can use base Sspesa to find out which retired instructions are incurring events. Why a new method that's limited to a few events?

Collaborator

This is also quite expensive. If I'm following correctly, tracking each of these CAUSE events per instruction means adding 10 bits (or however many are implemented) per ROB entry. I would not expect even a big, high-performance implementation to support that. Are you sure you want to track these events per instruction?


For the non-retiring sample states, hardware will use the `shpmspc` Sample PC register to point to the instruction that caused the appropriate state.
Collaborator

What about shpmsdata? Are you assuming that the HW has the PC for each ROB instruction available? Also, the spec already recommends that the attribution PC in such cases is that of the at-ret instruction. Is there ever a case where that isn't the instruction "that caused the appropriate state"?

Collaborator

And when you say "caused the appropriate state", you mean the RSTATE?


NOTE: _Even if the meaning of some of the bits in the `CAUSE` field is implementation-specific, the software required for creating PICS can be reused across implementations. The reason is that creating PICS requires partitioning samples according to the values of the `RSTATE` or `CAUSE` fields, and this task is independent of the specific meaning of each value._
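
The partitioning step mentioned in the note can be sketched in a few lines. The example is a simplification under stated assumptions: each sample has already been reduced to one `(pc, rstate, cause)` tuple (no fractional splitting across several retiring instructions), and `build_pics` is a hypothetical name rather than an existing tool.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Sample = Tuple[int, int, int]  # (pc, rstate, cause)

def build_pics(samples: Iterable[Sample]) -> Dict[int, Dict[Tuple[int, int], int]]:
    """Count samples per PC, partitioned by the raw (RSTATE, CAUSE) values.

    The code never interprets the CAUSE bits, so it works unchanged
    for implementation-specific encodings.
    """
    pics: Dict[int, Dict[Tuple[int, int], int]] = defaultdict(lambda: defaultdict(int))
    for pc, rstate, cause in samples:
        pics[pc][(rstate, cause)] += 1
    return pics

# Example: three samples on one load, two of them stalled with the same cause bit set.
samples = [(0x1000, 1, 0b1000), (0x1000, 1, 0b1000), (0x1000, 0, 0), (0x1008, 2, 0b100)]
for pc, stack in build_pics(samples).items():
    print(hex(pc), dict(stack))
```

Because the keys are the raw field values, only the final labeling of each stack component needs implementation-specific knowledge.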

.Sample Details Cycle Counter (`shpmsdetailscc`)
[%unbreakable]
[wavedrom, , svg]
....
{reg: [
{bits: 31, name: 'SAMPLEINITVALUE'},
{bits: 1, name: 'AR'}
], config:{lanes: 1, hspace:1024}}
....

[cols="20%,80%",options="header"]
|====
| Field | Description
|SAMPLEINITVALUE | Initial value that hardware loads into the count-up cycle counter after an overflow. When the counter reaches all ones, the next clock cycle causes a sample trigger, the count-up counter is reloaded with the initial value, and counting begins again. Note that hardware implementations have the discretion to use fewer than 31 bits for the counter; the number of bits defines the longest sample period. (A 24-bit counter gives a maximum period of 2^24 cycles, or about 16.8 ms with a 1 GHz clock.)
Collaborator

What is "the count-up cycle counter"? Is it the cycle counter? Or a Zihpm counter programmed to count CPU cycles? What if there are multiple?

Collaborator

All RV counters are "count-up", so I'm not sure why that is called out here.

| AR | Sampling goes into AutoRun mode when the AR bit is set to 1 and sampling has started. This is primarily for sampling that streams records to a Trace Unit. When AR is 0, AutoRun is disabled and samples are read by an interrupt handler.
|====

The `shpmsdetailscc` register enables streaming data to a memory buffer. For example, if the output of Sspesa is connected to a Trace Unit, the hardware will take a sample and send it to the Trace Unit to be formed into a message and output. The value in the `SAMPLEINITVALUE` field of `shpmsdetailscc` is loaded into a count-up cycle counter. The counter runs and, when it overflows, a sample is taken and streamed to the trace, the count-up cycle counter is reloaded with the contents of this field, and counting restarts.
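
The relationship between the reload value and the sample period can be worked through with a short sketch. It assumes the behavior described for `SAMPLEINITVALUE`: the counter starts at the initial value, counts up to all ones, and triggers on the following cycle, which gives a period of 2^N minus the initial value for an N-bit counter. The helper names and the 1 GHz clock are illustrative assumptions.

```python
def init_value_for_period(period_cycles: int, counter_bits: int = 31) -> int:
    """Initial value that yields one sample every period_cycles cycles."""
    max_period = 1 << counter_bits
    if not 1 <= period_cycles <= max_period:
        raise ValueError("period does not fit in the counter")
    return max_period - period_cycles

def period_seconds(init_value: int, clock_hz: float, counter_bits: int = 31) -> float:
    """Sample period in seconds for a given initial value and clock."""
    return ((1 << counter_bits) - init_value) / clock_hz

# Example: a 4 kHz sample rate on an assumed 1 GHz core needs 250,000 cycles per sample.
init = init_value_for_period(250_000)
print(init, period_seconds(init, 1e9))

# Longest period of a 24-bit counter at 1 GHz (initial value 0).
print(period_seconds(0, 1e9, counter_bits=24))
```

If an implementation reloads one cycle later than assumed here, the period grows by one cycle and the initial value should be adjusted accordingly.
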
Collaborator

If we're adding HW counter-reloading for Zihpm counters I think it would make sense to do it in a more general manner, rather than only supporting this one specific method of sampling. And there should be a consumer of these samples. If you're assuming above that the overflow signal is fed into a TE as an external trigger, best to say that.


This register can also be used to generate an overflow interrupt to flag software to read the CSR sample registers, thus not requiring a programmable performance counter to be dedicated for this use.
Collaborator

How does this register do that? Why wouldn't software use the associated OF bit to pend an LCOFI on overflow?


NOTE: _It is recommended that hardware prevent values from being written to the register that would cause a period shorter than a minimal duration, say, 2^8 cycles; otherwise the downstream hardware may not be able to handle the data rates. This can be done with a comparator that checks whether the upper bits [30:8] are all 1s._
Collaborator

Why? We don't suggest that for counters today. Do we really want to spend HW on this?


=== Sspesa Sample Record with Extensions

The sample record for precise time-based event sampling includes all of the sample data collected during execution of the sampled instruction. For RV64 the record is 32 bytes (four 64-bit registers), while for RV32 the record is 16 bytes.
Collaborator

I'm confused. What record? Is this written out somewhere? Or are you just summarizing the CSRs to be collected by the ISR? Why is shpmsdata ignored, are you assuming that implementations that include your extension can always record the precise PC for any instruction?


.Sspesa Sample Record with Extensions, for RV64
[cols="5%,90%,5%",options="header",grid=rows]
|===
| 63 || 0
3+^| Pesa Program Counter (shpmspc)
3+^| Pesa Sample Details Data (shpmsdetailsdata)
3+^| Pesa Branch Target PC Register1 (shpmsdetailsbt1)
3+^| Pesa Branch Target PC Register2 (shpmsdetailsbt2)
|===


.Sspesa Sample Record with Extensions, for RV32
[cols="5%,90%,5%",options="header",grid=rows]
|===
| 31 || 0
3+^| Pesa Program Counter (shpmspc)
3+^| Pesa Sample Details Data (shpmsdetailsdata[31:0])
3+^| Pesa Branch Target PC Register1 (shpmsdetailsbt1)
3+^| Pesa Branch Target PC Register2 (shpmsdetailsbt2)
|===

== Precise Local Counter Overflow Interrupt ISA Extension (Ssplcofi)
