Additional metadata for Sspesa #27
| The `shpmsdetails` CSRs provide optional extensions for implementations that want to record metadata for a collected sample in addition to `shpmspc` and `shpmsdata`. The goal of these extensions is to attribute execution time to all instructions retired in the sampled cycle on cores that can retire more than one instruction per cycle. The primary method is to record the PC of the first retired instruction and then, during post-acquisition decoding, assign a (fractional) share of the execution time to each retired instruction. This assumes the instructions executed sequentially. | ||
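The sequential attribution described above can be sketched as follows. This is a minimal illustration, not part of the proposal: the number of retired instructions (`n_retired`) and the `insn_length` callback (which would come from decoding the binary) are assumptions, since the text does not say how software obtains them.

```python
from collections import defaultdict


def attribute_sample(first_pc, n_retired, insn_length, weight=1.0):
    """Split one sample's weight evenly across sequentially retired instructions.

    first_pc    -- PC of the oldest retired instruction (as read from shpmspc)
    n_retired   -- instructions retired in the sampled cycle (assumed to be known)
    insn_length -- hypothetical callback: PC -> instruction length in bytes
    """
    shares = {}
    pc = first_pc
    for _ in range(n_retired):
        shares[pc] = shares.get(pc, 0.0) + weight / n_retired
        pc += insn_length(pc)  # sequential execution is assumed here
    return shares


def build_profile(samples, insn_length):
    """Accumulate fractional shares from (first_pc, n_retired) samples."""
    profile = defaultdict(float)
    for first_pc, n_retired in samples:
        for pc, share in attribute_sample(first_pc, n_retired, insn_length).items():
            profile[pc] += share
    return dict(profile)
```

For example, with fixed 4-byte instructions, a sample that retired four instructions starting at `0x1000` gives each of the four PCs a share of 0.25.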
| There are, however, cases where the retired instructions include one or possibly more taken branches or other control transfers, e.g., interrupts or exceptions. For these cases, the `shpmsdetailsdata` CSR records how many control transfers (CTs) occurred in the sampled cycle and their locations in the retired sequence. If the core implements the CTR (Control Transfer Records, Ssctr) trace hardware, the most recent control transfer target entry or entries in the trace can be used to determine the destination of the transfer(s), so that additional instructions can be decoded from the executable binary and assigned execution time. If CTR is not included in the core, two additional CSRs (`shpmsdetailsbt1` and `shpmsdetailsbt2`) are defined in this specification to record the targets. |
We already have an ISA extension for recording branch history, I don't think ARC will look kindly on adding another. If you want a lower-cost (shorter) branch history mechanism, the better approach would be a Smctr/Ssctr extension that supports smaller depth values. Today the minimum is 16, your extension could add 2, 4, and 8.
| The sample details mechanism consists of additional SXLEN-bit read/write WARL registers called `shpmsdetailsbt1,2` and `shpmsdetailsdata`. The `shpmsdetailsbt1,2` registers store up to two branch target addresses if at least one taken branch occurs among the instructions that the core retired in the cycle the sample was taken. The rationale for supporting up to two branches is that cores with higher retire widths (e.g., 8) will typically implement Ssctr. | ||
| WARNING: _Should we say that software should default to using `shpmspc` as-is, if it does not know of any custom algorithm for using `shpmsdata`? Though that could result in the appearance of support for precise-attribution that in fact is not precise. Perhaps if `shpmsdata` is not hardcoded to 0 and no custom algorithm is reported then software shouldn't report support for precise attribution?_ | ||
| NOTE: _The support for the `shpmsdetails` extension should be discoverable via the implementation's performance event JSON file._ |
That's not the kind of thing included in the JSON file, which holds events and details associated with them. I expected these CSRs to be part of a new extension (maybe Sspesatpp, where tpp = time proportional profiling?), so software could use the ISA string to determine whether they are implemented.
| The purpose of the `RSTATE` and `CAUSE` fields is to enable the creation of Per-Instruction Cycle Stacks (PICS). PICS break down the time attributed to each sampled instruction so that its impact on overall execution time can be recorded and subsequently reported. The breakdown is made according to the state of the retire stage in each sample, as recorded in the `RSTATE` field. `RSTATE` is a two-bit field because there are four fundamental retire states: retiring, stalling, drained, and flushed. The `CAUSE` field provides additional information about why an instruction slowed down "ideal" execution by recording the performance events the instruction was subjected to. The first set of bits covers the most common event types and is fixed; the remaining bits are implementation specific. |
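As a rough illustration of how post-processing software might build PICS from decoded samples, consider the sketch below. The numeric `RSTATE` encoding is an assumption made for the example (the text names the four states but not their values), and the `(pc, rstate, weight)` sample tuples are a hypothetical intermediate format.

```python
from collections import defaultdict

# Assumed encoding for illustration only; the text does not define the values.
RSTATE_NAMES = {0: "retiring", 1: "stalling", 2: "drained", 3: "flushed"}


def build_pics(samples):
    """Build per-instruction cycle stacks.

    samples -- iterable of (pc, rstate, weight) tuples, one per attributed
               instruction; weight is that instruction's share of the sample
    Returns {pc: {state_name: accumulated_weight}}.
    """
    stacks = defaultdict(lambda: defaultdict(float))
    for pc, rstate, weight in samples:
        stacks[pc][RSTATE_NAMES[rstate]] += weight
    return {pc: dict(components) for pc, components in stacks.items()}
```

Each instruction's stack then shows how much of its attributed time fell into each retire state, which is the breakdown the paragraph describes.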
I think much more explanation of when the hardware chooses RSTATE>00 is needed. How does the hardware know which state to choose when no instructions are retiring? What if a flush is followed by an ITLB miss on the subsequent restart, what should happen in that case?
| If the core is unable to reach its ideal retire bandwidth in the cycle the sample is taken, implementations should blame the loss of retire bandwidth on a single instruction, because instructions retire in program order. The typical case is that the oldest instruction is to blame, e.g., a long-latency load stalled at the head of the ROB, but a younger instruction can also block retirement. The latter case typically occurs in the cycles where the core transitions between the retiring state and the non-retiring states (stalling, drained, flushed). For example, consider a cycle in which a four-wide core retires its two oldest instructions but a long-latency load blocks further retirement. In this case, implementations should report `CAUSE` for the long-latency load, which is enabled through the `CAUSEOFFSET` field. `CAUSEOFFSET` = 0 encodes that the `CAUSE` field is invalid or not implemented; `CAUSEOFFSET` = 1 therefore points to the oldest retiring instruction, `CAUSEOFFSET` = 2 to the second-oldest instruction, etc. |
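The `CAUSEOFFSET` encoding above could be interpreted by software roughly as follows. This is a sketch under the assumption that software has already reconstructed the PCs of the instructions at the head of the retire window, oldest first (`window_pcs` is a hypothetical name, and the list may include the instruction that blocked retirement).

```python
def blamed_instruction(causeoffset, window_pcs):
    """Return the PC that the cause bits describe, or None if they are invalid.

    causeoffset -- value of the CAUSEOFFSET field (0 means invalid/not implemented)
    window_pcs  -- hypothetical list of PCs in the retire window, oldest first
    """
    if causeoffset == 0:
        return None
    if causeoffset > len(window_pcs):
        raise ValueError("CAUSEOFFSET points past the reconstructed window")
    return window_pcs[causeoffset - 1]  # the offset is 1-based
```

In the four-wide example from the text, the blocking load is the third-oldest instruction in the window, so a value of 3 would select it.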
So is CAUSE presumed to be a series of bits, each associated with an event? That should probably be explicit.
Do you really want CAUSE associated only with retiring instructions? I would think that those events would be most interesting when no instruction is retiring, to indicate why there's no retirement. I don't really see what use they are for retiring instructions, you can use base Sspesa to find out which retired instructions are incurring events. Why a new method that's limited to a few events?
This is also quite expensive. If I'm following correctly, tracking each of these CAUSE events per instruction means adding 10 bits (or however many are implemented) per ROB entry. I would not expect even a big, high-performance implementation to support that. Are you sure you want to track these events per instruction?
| | WARL| Bits for future expansion. | ||
| |==== | ||
| As briefly described previously, sampling software can calculate the addresses of the retiring instructions by assuming sequential execution starting from the PC in `shpmspc`. This, however, does not take into account that there may be taken branches or other control transfers among the retiring instructions. The `CTs` field therefore reports the number of taken branches among the retiring instructions. If `CTs` = 0, sampling software knows that all the retired instructions executed sequentially. If `CTs` = 1, the instruction stream jumped to the address in `shpmsdetailsbt1` at the instruction count recorded in the `CTOFFSET1` field. If `CTs` = 2, `CTOFFSET2` holds the instruction count at which a second branch (or control transfer) occurred, and `shpmsdetailsbt2` holds the target address from which decoding continues. If `CTs` > 2, attribution will have some bias, since only two branch targets can be communicated as defined by this specification. If this is a concern, implementors should consider implementing CTR (Control Transfer Records) hardware, since it can be used to identify all control transfers. |
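The address reconstruction described above might look like the following sketch. Several details are assumptions made for illustration: `CTOFFSETn` is treated as the 1-based position of the transferring instruction in the retired sequence, the number of retired instructions (`n_retired`) is taken as known, and `insn_length` is a hypothetical callback backed by a decoder.

```python
def retired_pcs(shpmspc, n_retired, cts, ctoffsets, targets, insn_length):
    """Return (pcs, complete) for the instructions retired in the sampled cycle.

    cts         -- value of the CTs field
    ctoffsets   -- (CTOFFSET1, CTOFFSET2), assumed 1-based positions
    targets     -- (shpmsdetailsbt1, shpmsdetailsbt2)
    insn_length -- hypothetical callback: PC -> instruction length in bytes
    complete is False when CTs > 2, since only two targets are recorded.
    """
    # Map the position of each recorded control transfer to its target.
    redirects = {ctoffsets[i]: targets[i] for i in range(min(cts, 2))}
    pcs = []
    pc = shpmspc
    for position in range(1, n_retired + 1):
        pcs.append(pc)
        if position in redirects:
            pc = redirects[position]   # continue decoding at the target
        else:
            pc += insn_length(pc)      # fall through sequentially
    return pcs, cts <= 2
```

With `CTs` = 1, `CTOFFSET1` = 2 and a target of `0x2000`, four retired 4-byte instructions starting at `0x1000` would decode as `0x1000`, `0x1004`, `0x2000`, `0x2004` under these assumptions.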
The hardcoded support for 2 branches per cycle feels like this was defined with a particular implementation in mind. So implementations that can only retire 1 taken branch per cycle have extra bits/CSRs to implement, and those that support >2 have to jump up to CTR (while still having to pay the tax of CTOFFSET1/2 and shpmsdetailsbt1/2). A more flexible solution would be better here.
| This register can also be used to generate an overflow interrupt to flag software to read the CSR sample registers, thus not requiring a programmable performance counter to be dedicated for this use. | ||
| NOTE: _It is recommended that hardware prevent values from being written to the register that would result in a period shorter than a minimum duration, say 2^8 cycle counts; otherwise the downstream hardware may not be able to handle the data rates. This can be done with a comparator that checks whether the upper bits [30:8] are all 1's._ |
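The comparator in this note can be modelled in software as below. This is only a sketch: it assumes an up-counting register whose overflow is taken out of bit 30, which the note implies but does not state, and the constant and function names are hypothetical.

```python
MIN_PERIOD_LOG2 = 8   # minimum period of 2^8 counts, as suggested in the note
COUNTER_MSB = 30      # highest bit examined by the comparator

# Mask covering bits [30:8].
UPPER_MASK = ((1 << (COUNTER_MSB + 1)) - 1) & ~((1 << MIN_PERIOD_LOG2) - 1)


def period_too_short(value):
    """True when bits [30:8] are all 1's.

    Under the assumed up-counting behaviour, at most 2^8 counts then remain
    before the counter overflows out of bit 30.
    """
    return (value & UPPER_MASK) == UPPER_MASK
```

A hardware implementation would apply the same check to the value being written and legalize it as permitted for a WARL field; how it legalizes is not specified here.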
Why? We don't suggest that for counters today. Do we really want to spend HW on this?
| === Sspesa Sample Record with Extensions | ||
| The sample record for precise time-based event sampling includes all of the sample data collected during execution of the sampled instruction. For RV64 the record is 64 bytes, while for RV32 the record is 16 bytes. |
I'm confused. What record? Is this written out somewhere? Or are you just summarizing the CSRs to be collected by the ISR? Why is shpmsdata ignored, are you assuming that implementations that include your extension can always record the precise PC for any instruction?
| [NOTE] | ||
| ==== | ||
| _Bits in the above WARL field may be subsequently defined as details are added to the spec, including bits defining core hardware resources needed for software decoding/interpreting Sspesa records._ | ||
| ==== |
Why is this added? It looks like your extension doesn't use shpmsdata at all.
| ==== | ||
| NOTE: _When creating time-based profiles, the value in `shpmspc` can be combined with implementation-specific metadata in `shpmsdata` and Control Transfer Records (Ssctr) to account for instruction parallelism by obtaining the addresses of all instructions that retired in the cycle the sample was taken. If Ssctr is not available, implementations should strive to avoid bias when selecting the value for `shpmspc`._ | ||
| NOTE: _When creating time-based profiles, the value in `shpmspc` can be combined with implementation-specific metadata in `shpmsdata` and Control Transfer Records (Ssctr) to account for instruction parallelism by obtaining the addresses of all instructions that retired in the cycle the sample was taken. If Ssctr is not available, additional CSRs have been defined for control transfer target addresses to be recorded (details below)._ |
I wouldn't alter the original comment here to remove the suggestion of avoiding bias. What you're doing is adding a 3rd alternative.
| ==== Hardware Performance Monitor Sample Details (`shpmsdetails`) |
Are any of these CSRs, or the fields within them, optional?
As discussed in the previous meeting, Bruce and I have been working on an optional extension for Sspesa implementations that want to provide more metadata per sample, and thereby create more informative time-based profiles. This PR covers the changes that we propose.