From bed083b9b134de4865f1a9f449cae8c23694f819 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Daniel=20Gracia=20P=C3=A9rez?= <50202128+dgptha@users.noreply.github.com> Date: Tue, 26 Nov 2024 23:58:09 +0100 Subject: [PATCH] Add first 3 sections of Perf Counters chapter --- src/chapters/pmc.adoc | 637 +++++++++++++++++++++++++++++++++++++++ src/fusa-whitepaper.adoc | 1 + 2 files changed, 638 insertions(+) create mode 100644 src/chapters/pmc.adoc diff --git a/src/chapters/pmc.adoc b/src/chapters/pmc.adoc new file mode 100644 index 0000000..0325f7f --- /dev/null +++ b/src/chapters/pmc.adoc @@ -0,0 +1,637 @@ +[[ch-pmc]] +== Performance counters + +=== Safety needs + +Performance counters are used to monitor and analyse safety-critical +solutions in systems which include sources of indeterminism. +They were primarily designed for application profiling and performance tuning, +but the ability to observe the behaviour of the hardware as an application is +executed is valuable for development and runtime monitoring of safety-critical +systems. + +==== Features + +Performance Counters, frequently referred to as PMCs (Performance Monitoring +Counters), count events occurring in a core or processor. +The exact nature of the events to count depends on the actual core/processor +architecture, but typical events include: +cache and TLB usage (hit/miss), number of executed instructions and clock +cycles, stalled cycles, load/store accesses, branch prediction statistics, +and number of exceptions raised. + +The following table provides a detailed breakdown of the main performance counter types, +along with some details on the information that would be needed to ease +verification and validation of safety-related systems, as well as the +implementation of safety measures. +The table also indicates the source (the core or another SoC IP) from which the event usually +comes. 
+ +.Non-exhaustive breakdown of the main performance counter types +[cols="2,5a,1",] +|=== +|*Performance counter type* |*Detail* |*Usual scope* + +|Local cache usage (hit, miss, ...) +|Accesses broken down with enough detail to determine how many accesses remain +local, how many are propagated to the following cache/memory level (e.g. +misses, dirty evictions), and whether accesses are reads or writes. +For those accesses propagated to the following level, enough information to +break down across word-only and full-line accesses is needed, as well as whether +they are reads or writes. + +NOTE: Load and store accesses are not +necessarily the same as load and store instructions, since instructions +may be merged into fewer accesses (e.g. if accessing the same cache line), +among other optimizations. + +|Core + +|Shared cache usage (hit, miss, ...) +|Analogous to local cache usage counters, possibly broken down by core. +|SoC + +|Memory accesses +|Number of transactions broken down across reads and writes, page hits and +misses, and any other relevant parameters that may lead to significantly +different latencies. +Ideally, these statistics should also be available per bank/rank +separately. +|SoC + +|TLB usage (hit, miss, ...) +|Analogous to local cache usage counters. +Accesses may be broken down across different TLB tables (e.g. across regular +TLB and large-pages TLB). +|Core/SoC + +|Exceptions +|Broken down across exception types (e.g. division by zero, access to an +unauthorized address, etc.). +|Core + +|Interrupts +|Broken down per interrupt type and, if possible, by the source of such +interrupts (e.g. a core, an external device, etc.). +Statistics on how often an interrupt is suspended due to the arrival of +a higher-priority interrupt, or ignored because it was +already raised but not yet served (e.g. a timer interrupt arriving before the +previous timer interrupt has been served), are also of interest. 
+|Core/SoC + +|Executed instructions +|Broken down per instruction type, especially separating loads, stores, integer +ALU, floating point ALU, branches. Further breakdowns (e.g. short latency vs. +long latency ALU) may also be convenient. +If instructions can be executed speculatively (e.g. after a predicted branch), +then it may be convenient having separate counts for fetched/decoded and +committed instructions. + +From experience, loads and stores are particularly sensitive to multicore +interference. +|Core + +|Local stall cycles +|Stall cycles in core-local resources, such as fetch, ALU, full write buffer, +etc. broken down per resource. +Note that stall cycles will typically be counted on a per-resource basis, hence +are potentially overlapping (e.g. if stalls occur in the write-buffer and fetch +stage simultaneously in the same cycle, they will be counted in both components +despite occurring in the same cycle). +|Core + +|Shared component stall cycles +|Analogous to local stall cycles, but broken down per core. For instance, stalls +due to a full read queue, miss queue, fill buffer, etc. of a shared cache are +highly recommended to be broken down per core. +This is generally very relevant for any shared component in the path to memory +(shared caches, interconnects, memory controller). + +Note that some accesses may be produced by some devices rather than by cores +directly (e.g. DMA, PCIe, Ethernet controllers accessing the memory controller), +so those stalls need to be counted separately. +|SoC + +|Branch predictions +|Correctly predicted and mispredicted branches. +If the target address is also predicted, then predictions should also be broken +down across correctly predicted and mispredicted branch targets. +|Core + +|Timestamp (number of clock cycles) +|This is generally a default counter. +|Core + +|Multicore interference +a|This category includes, for instance, the following: + +1. 
Core-to-core stalls: how many of the stalls experienced by a core have been + caused by another core. + This may not always be easy to determine, but as far as it can be monitored, it + is valuable information for verification & validation, and diagnostics as + part of safety measures. + +2. Latency measurement: ability to measure the latency of requests in specific + resources, typically maximum latencies. + This is important for WCET (worst-case execution time) estimation, since + those latencies may be implementation dependent and difficult to document or + obtain from the documentation. +|SoC + +|Peripheral and DMA controllers usage +|Access counts, data transferred, etc. broken down as much as possible (e.g. +read/write, source/destination) for peripheral controllers such as PCIe, +Ethernet, UART, SPI, etc. as well as DMA controllers. +|SoC + +|Error counts +|Some components, such as caches and memories, include error +detection and/or correction capabilities (e.g. SECDED). +Counters for errors detected and corrected, optionally along with error logs, +are convenient per component and, if possible, broken down across subcomponents +(e.g. banks). +|Core/SoC +|=== + +In addition to the performance counters themselves, a programmable threshold feature per +performance counter should be provided. +Programmable thresholds allow an action to be triggered whenever a counter +exceeds its threshold. +For example, a cache miss counter or a pipeline flush counter can have an +associated programmable threshold that, once exceeded, triggers an action (e.g. an +interrupt). +Frequently, this threshold feature can also be implemented as an overflow +feature if the counters can be preloaded by software. +In some cases, software can use it to enforce quotas. +Note that resource allocation in safety-critical systems may also need other +hardware support. 
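The overflow-based threshold described above can be illustrated with a small software model. This is only a sketch under stated assumptions: `pmc_model_t`, `pmc_arm_quota` and `pmc_count_event` are hypothetical names introduced for illustration, the counter is simulated in plain C rather than accessed through any real register interface, and the sticky `overflow` flag stands in for whatever action (e.g. an interrupt) the hardware would raise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of a hardware event counter (illustration only). */
typedef struct {
    uint64_t value;    /* current count, kept within the implemented width */
    unsigned width;    /* implemented counter width in bits, 1..64 */
    bool     overflow; /* sticky flag modelling the "threshold reached" action */
} pmc_model_t;

/* Bit mask covering the implemented counter width. */
static uint64_t pmc_mask(unsigned width)
{
    return (width >= 64) ? UINT64_MAX : ((UINT64_C(1) << width) - 1);
}

/* Preload the counter so that it wraps to zero on the quota-th event. */
static void pmc_arm_quota(pmc_model_t *c, unsigned width, uint64_t quota)
{
    assert(width >= 1 && width <= 64);
    assert(quota >= 1);
    c->width = width;
    /* (2^width - quota) modulo 2^width, using unsigned wrap-around. */
    c->value = (UINT64_C(0) - quota) & pmc_mask(width);
    c->overflow = false;
}

/* Model one monitored event: increment and flag a wrap-around to zero. */
static void pmc_count_event(pmc_model_t *c)
{
    c->value = (c->value + 1) & pmc_mask(c->width);
    if (c->value == 0) {
        c->overflow = true;
    }
}
```

In this model, arming an 8-bit counter with a quota of 3 preloads the value 253, and the flag is set on the third counted event. A real implementation would typically re-arm the counter at the start of each monitoring period.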
+ +Another feature useful for safety analysis and control is the programmable +filtering of the performance counters when applicable. +Following the cache miss counter example, the filtering capability causes the +counter to be updated only on cache misses to specific address ranges. +However, the kind of filtering provided heavily depends on the event semantics, +e.g. address range, event source, etc. + +Quota and filtering features can be helpful for software control mechanisms in +mixed-criticality systems to ensure the safety of critical applications. +For example, the software control mechanism can exploit both features to filter +actions of the non-critical applications (e.g. a cache miss counter that only +counts addresses mapped to non-critical applications) and to raise an interrupt +that stops the execution of these applications when the quota is exceeded +(e.g. a fixed number of cache misses). +Likewise, when designing a system these features can be helpful to debug +(filter) specific applications running in the system and to raise signals and/or +alarms when a state is reached (quota). + +==== Level + +Performance Counters in the context of Safety are needed at both the SoC and +core levels. +The level or scope where a counter is deployed depends on the location of the +component. +For instance, instruction counts, branch-related statistics and the like +generally occur at the scope of the core, and hence, that is the right level for them. +Others, such as memory and peripheral controller related counters, must +clearly be placed at the SoC level. +Some others, such as those related to shared caches, may fall in either +category, namely core or SoC, depending on the specific implementation. +For instance, a shared cache may be a standalone component, hence belonging to +the SoC level, or part of a cluster of cores so that the cores and the shared +cache cannot be deployed separately. 
In the latter case, the level for the +performance counters can be assumed to be the cores themselves. + +Hypervisors, OSes and RTOSes can implement further counters at the software level, +either to complement hardware counters, or as an alternative to hardware +counters if the latter do not exist for some event types. +Software-based counters are appropriate for monitoring software-visible events such +as those related to peripherals, DMA, and even some memories +(e.g. some flash memories). +Such components may only be accessible through specific hypervisor/OS/RTOS +services, and hence, those software layers can implement software counters to +monitor activities related to those devices (e.g., access counts, data +transferred). Other software-visible events, such as interrupts and exceptions, +can also be monitored with software counters implemented in the +hypervisor/OS/RTOS. + +==== Importance + +Performance counters are important for timing-sensitive applications that are +implemented on architectures where there can be timing interference between +processes or cores, or other sources of indeterminism. + +Performance counters can be used at any criticality level. +The higher the criticality, the greater the need for them. + +In general, whether performance counters are needed or convenient depends not only +on the criticality level of the functionality being considered, +but also on the characteristics of the hardware and software platform. +For instance, if the SoC provides a high degree of isolation across cores so +that interference is low and limited by construction, or fully controllable by +software means, then having performance counters to monitor interference, or to +break down activity across cores, may not be required. 
+In this case, one could simply perform analyses in isolation, develop estimates +based on some access counters, and not implement any safety measures requiring +performance counters, since overruns during operation would not relate to how +hardware resources are shared. + +Therefore, there is no _one-size-fits-all_ solution in terms of performance +counters but, in general, a higher number of performance counters, more detailed +breakdowns and more per-core information mean that the SoC becomes easier to +integrate into safety-relevant systems due to the reduced costs for +verification, validation and implementation of safety measures. + +Hand-in-hand with the deployed performance counters, one cannot forget the +importance of properly documenting them in the corresponding technical reference +manuals. +It is often the case that counters are described only with their names or with +one-liners. +Such descriptions bring uncertainty and hence, even though a performance +counter may be of much use, it may end up being ignored simply because there is +not enough information or evidence that the counter provides the required +information. +Therefore, it is of paramount importance to provide, along +with the performance counters, detailed documentation of what they really measure. + +//// + +[#anchor-12]####1.4 Justification + +From the “justification cell”: explain why we need it. Explain the +safety properties (see the “main” chapter) that this/these attribute(s) +fulfill. Could be illustrated with an example (at a later stage, we +could have common examples across the white paper. To be +decided/implemented later when we have more material). + +This section provides first the scope of why performance counters are +needed in safety-related systems and then reviews specific uses through +some examples. + +[#anchor-13]####1.4.1 Traceability to standards + +We might move this as a subsection to “Justification” (once we write the +text). 
+ +Performance Counters can be used as the basis for meeting safety +requirements related to a variety of safety needs such as “freedom from +interference” (ISO 26262), “resource usage tests” (ISO 26262), and +“interference channel characterization” (CAST-32A), as well as for +processes related to timing estimation, critical configuration setting +validation and random hardware fault management. + +Putting performance counters in the context of the product life-cycle of +safety-relevant systems, we foresee their need in at least three +different phases of the product life-cycle, as detailed next: + +* {blank} ++ +____ +During verification, performance counters are needed for estimation +purposes, such as those related to timing, memory usage, peripheral +usage, etc. +____ +* {blank} ++ +____ +During validation, test campaigns are conducted and performance counters +are typically used to assess real usage of resources against estimates, +and to diagnose misbehavior since counters can provide detailed +information on the source of the misbehavior. +____ +* {blank} ++ +____ +At run-time, the integrity or assurance level of the functionality at +hand determines the safety measures needed as part of the system +architecture. Some of those safety measures may include monitoring, +quota and/or diagnostics capabilities to proactively avoid failures, or +to react to specific events to avoid failures by taking corrective +actions promptly and precisely (e.g. degrading the system by dropping +the offending task). +____ + +In all those cases, evidence obtained from performance counters can be +used to feed certification documentation. + +[#anchor-14]####1.4.2 Specific uses of performance counters + +Without being exhaustive, this section identifies a number of uses of +performance counters in the context of safety-relevant systems. 
+ +[#anchor-15]####Example 1: WCET estimation + +Performance counters can be used for measurement-based timing analysis, +or to feed some input data related to, for instance, latencies into +static timing analysis. In particular, one can use performance counters +to measure the number of accesses to each shared resource and the +maximum latency experienced under stressful scenarios in each shared +resource, and then compute the execution time expected if all accesses +experience those worst-case latencies. + +In the context of automotive systems, it is also common to attempt to +optimize the timing behavior of critical tasks without such a process +being a strict WCET estimation process as one could have in other +domains such as avionics. In that case, performance counters can be used +to feed timing models to find the best task scheduling in terms of +timespan based on the timing model. + +[#anchor-16]####Example 2: resource usage validation and diagnostics + +Performance counters can be used to measure accesses to different +resources (e.g. peripheral devices, DRAM memory), as well as data +transferred during the validation phase of a subsystem to check that +specific bandwidth bounds are not exceeded. + +Another example relates to assess whether timing deadlines are exceeded +or not. If they are exceeded, performance counters can provide a precise +and detailed snapshot of the use of resources for the task experiencing +the overrun as well as for the potentially offending tasks. Such +information can allow a quick diagnosis of the source of the overrun. In +fact, those counters can be used even if no overrun is experienced, to +predict future overruns as further integration occurs, by revealing +whether some specific resources are highly stressed and hence, whether +consolidating additional applications may lead to resource +overutilization. 
+ +[#anchor-17]####Example 3: resource usage monitoring and diagnostics + +As for example 2, performance counters can be used during operation +analogously to the validation process, but to implement safety measures. +For instance, some counters can be read periodically to detect whether +any task is abusing any resource or exhibiting any other type of +misbehavior that may affect other tasks. Similarly, instead of +monitoring those counters, one may let tasks run and, upon a failure to +finish by a given deadline or to finish enough jobs in a given time +period, diagnose the cause of the excessive duration by referring to the +performance monitoring counters. Note that diagnostics information can +be used not only for instantaneous decisions, but also to track some +history and, for instance, if a task experiences overruns too +frequently, switch to a different precomputed task schedule. + +[#anchor-18]####Example 4: quota allocation + +If performance counters allow programming quotas (e.g. maximum number of +accesses or data transferred in a given resource), safety measures can +be implemented atop. One can set a maximum number of DRAM accesses for a +task in a given period of time to limit the amount of interference such +a task can cause on others. Upon reaching such limit, quota-related +counters may raise an interrupt so that the hypervisor/OS/RTOS performs +an appropriate corrective action by, for instance, dropping the specific +job of this task if it may affect more critical ones, or drop other +tasks if this one is highly critical and becomes more vulnerable to +interference. + +[#anchor-19]####Example 5: management of random hardware faults + +Performance counters related to errors detected and/or corrected may be +used to detect intermittent and permanent faults. For instance, SECDED +codes deployed along with some DRAM memories may allow detecting and +correcting transient faults due to, for instance, particle strikes. 
+However, performance counters may allow tracking whether those errors +occur too frequently or too concentrated in a specific component (e.g. a +DRAM DIMM). In that case, if errors exceed specific predefined +thresholds, performance counters can be used to trigger the replacement +of some components (e.g. a DIMM) or perform a hardware fix (e.g. a cache +line being replaced by a spare one) to avoid having unprotected +components if the correction capabilities are devoted to correct +permanent or intermittent errors, which would make transient faults not +be correctable. + +[#anchor-20]####1.4.3 Contribution to safety properties + +This section refers to the safety properties presented in the main +chapter of this white paper and how performance counters address them: + +* {blank} ++ +____ +Availability: Performance counters can be used to monitor or control the +correct real-time behavior of the system, the bounded impact of +interference channels, the correct usage of resources... +____ +* {blank} ++ +____ +Reliability: Performance counters can be used to detect or control the +over-consumption of resources that could provoke an excessive thermal +dissipation. They can be used to measure the occurrences of errors. +____ +* {blank} ++ +____ +Observability: Performance counters add observation capabilities that +can be used during SW/HW development and at run-time. +____ + +[arabic] +. {blank} ++ +____ +[#anchor-21]####RISC-V solutions +____ + +From the “RISC-V leverage cell”. Here we identify the existing RISC-V +features which help fulfilling the attribute. + +The RISC-V Privileged ISA Specification (Section 3.1.10) outlines a +basic hardware performance counters facility for M-Mode. In particular, +the following counters are included: + +* {blank} ++ +____ +Machine cycle counter (_mcycle_) CSR, counting the number of clock +cycles executed by the processor core on which the hart is running. 
+____ +* {blank} ++ +____ +Machine instruction retired counter (_minstret_) CSR, counting the +number of instructions that the hart has retired. +____ +* {blank} ++ +____ +Machine performance monitoring counters (_mhpmcounter3_ - +_mhpmcounter31_), counting platform-specific events. An additional set +of Event Selector CSRs (_mhpmevent3_ - _mhpmevent31_) control which +specific event causes the correspondent counter to increment. +____ + +RISC-V performance counters are 64-bit wide. In RV32 processors, they +are accessed via two 32-bit CSRs for their LSB and MSB portions. + +The RISC-V Unprivileged ISA Specification (Chapter 8) defines with the +Zicntr and Zihpm extensions an analogous facility for unprivileged +hardware performance counters, including the Cycle Counter (cycle) CSR, +the Instruction Retired Counter (instret) CSR and 29 additional +Performance Monitoring Counters (_hpmcounter3_ - _hpmcounter31_). + +The Privileged ISA Specification (Section 10.1.4) also addresses the +Supervisor Software case, specifying that it uses the same hardware +performance monitoring facility as user-mode software. + +It should be noted that additional CSRs are defined to provide control +over counter activation (Machine Counter-Inhibit CSR, RISC-V Privileged +ISA Specification Section 3.1.12) or availability of the hardware +performance-monitoring counters to the next-lowest privileged mode +(Counter-Enable Register CSR, RISC-V Privileged ISA Specification +Sections 3.1.11 and 10.1.15, for machine and supervisor modes +respectively). + +The RISC-V Privileged ISA Specification (Chapter 17) defines the +Sscofpmf extension providing performance counters overflow and mode +filtering capabilities for machine and supervisor modes. 
The overflow +capability allows the implementation of quotas as identified in the +Features section of this chapter (Section 1.1), while the mode filtering +capabilities partially addresses the filtering capabilities identified +in the same section, but limited to execution modes. Note that the +overflow capability does not apply to the mandatory cycle and instret +counters. + +[arabic] +. {blank} ++ +____ +[#anchor-22]####Recommendations +____ + +From the “RISC-V gap” cell. If needed for this attribute, we create +subsections to delineate SW/API and HW. + +____ +[#anchor-23]####3.1 Identified Gaps in existing specifications. +____ + +The standard Hardware Performance Monitoring facility and extensions +defined by the RISC-V specifications, see previous section, provide an +important base to address the implementation of safety-related hardware +performance counters. The following desirable features, not addressed by +the RISC-V specification, can be highlighted: + +[arabic] +. {blank} ++ +____ +Event specification: besides the identification of specific events +causing a counter to increment, it would be desirable to provide the +possibility of specifying a family of events (i.e. events that have to +be recorded at the same time) or specifying non-event conditions (i.e. +counting the number of clock cycles for which a certain event does not +occur). +____ +. {blank} ++ +____ +Filtering capabilities: the Sscofpmf extension provides mode-filtering +capabilities, nevertheless it would be desirable to provide other +event-filtering capabilities, such as comparison or edge detection, or +the initiator/target of the transaction (core ID for instance). +____ +. {blank} ++ +____ +Linked counters: it would be desirable to provide the capability of +linking multiple counters, defining chains of events to be monitored. +____ +. {blank} ++ +____ +Quota allocation (cf. example 4 above): upon reaching the defined +threshold, an interrupt would be triggered. 
An implementation would be +to preload a value in the counter and trigger an interrupt when the +counter overflows. +____ +. {blank} ++ +____ +Standardized event description: the description of events should be +standardized as much as possible among the different RISC-V processor +implementations. This is important to allow the development of software +solutions (e.g. hypervisors) capable of addressing the different +processor implementations as long as the events are available in those +cores. At the time of this writing the Performance Events TG is already +addressing this feature at the core level. +____ + +____ +[#anchor-24]####3.2 Possible Gaps in Implementation +____ + +[arabic] +. {blank} ++ +____ +Availability of SoC-level counters: monitoring harts or SoC resource +usage (e.g. use of shared resources) requires the definition of counters +outside the core. A MMIO architecture could be considered for the +implementation, with Machine Timer Registers (_mtime_ and _mtimecmp_) +constituting a valuable reference in this sense. +____ +. {blank} ++ +____ +Support for counter management: support at software and configuration +level to guarantee the availability of safety related counters (e.g. +preventing disabling the counters) while granting the user access to +specific resources. It should be noted that some degree of protection is +already guaranteed by the existing privileged architecture, as remarked +in the previous section. +____ + +____ +[#anchor-25]## 3.3 Safety Usage +____ + +[arabic] +. {blank} ++ +____ +Mcountinhibit: While this register allows stopping the counter from +incrementing to save energy consumption or to prevent side channel +security attacks, it may result in violation of some safety requirements +or usage which depends on the counter being always active. 
The designer +of the system should weigh the tradeoffs depending on the overall system +requirements before using this register and/or devise additional logic +such as authentication of the client(s) that have access to this +register. +____ + +[arabic] +. {blank} ++ +____ +[#anchor-26]####Relevant activities +____ + +Jérôme will prepare this section for June 16th meeting. + +[#anchor-27]####Related external bodies + +Not present in the blueprint + +Performance counters usually have very diverse specifications on +different processors (Power, x86…). + +Linux features the perf command to instrument performance counters. +Other OSes and vendors provide similar tools. + +[#anchor-28]####Related chapters + +Other related chapters in the white paper + +Performance counters can be used to monitor the effect of QoS policies, +or even to dynamically influence them. Refer to the [Ref: Configurable +QoS] chapter. + +Performance counters are obviously used to monitor cache performance. +Refer to the [Ref: caches] chapter. + +Performance counters can be used to measure the occurrences of certain +(obviously not fatal) errors. Refer to the [Ref: Error reporting and +management] chapter. + +SoC-level performance counters and monitoring are needed to implement +some features identified in the [Ref: Multicore interference monitoring +support] chapter. +//// \ No newline at end of file diff --git a/src/fusa-whitepaper.adoc b/src/fusa-whitepaper.adoc index 95e08be..b843d67 100644 --- a/src/fusa-whitepaper.adoc +++ b/src/fusa-whitepaper.adoc @@ -71,6 +71,7 @@ include::contributors.adoc[] include::intro.adoc[] include::chapter2.adoc[] +include::chapters/pmc.adoc[] // The index must precede the bibliography include::index.adoc[]