diff --git a/src/chapters/qos.adoc b/src/chapters/qos.adoc
new file mode 100644
index 0000000..06130c7
--- /dev/null
+++ b/src/chapters/qos.adoc
@@ -0,0 +1,317 @@
+[#sec:qos]
+## Configurable Quality of Service (QoS) and Priority Management
+
+[#sec:qos:safety]
+### Safety needs
+
+Safety standards define processes of safety assessment and hazard analysis to
+allow each task to be categorized according to its criticality, e.g., ASIL levels
+in ISO 26262 cite:[iso26262:2018], DAL levels in DO-178C cite:[do178c:2012], and
+software criticality categories in ECSS cite:[ecss:2024].
+In a given system, if the tasks that share hardware resources vary in their
+criticality levels, or if safety-critical and non-safety-critical tasks can
+coexist, the system is said to be of mixed criticality.
+
+The criticality of a task and the standard according to which its code is
+developed have a direct impact on the number and type of safety mechanisms that
+need to be implemented. As a consequence, the acceptable error rates of the
+tasks differ according to their criticality level.
+A less critical or non-critical task must therefore be prevented from
+interfering with the execution of a more critical task, i.e., freedom from
+interference must be ensured.
+
+Even if the tasks have the same criticality level, it is sometimes necessary to
+ensure freedom from interference if the tasks rely on each other to perform some
+safety function.
+Such cases could be exposed, for example, by a dependent failure analysis, as
+is the case with ISO 26262.
+
+Freedom from interference, limited or complete, is often implemented by building
+on Quality of Service (QoS) and priority-related features, where QoS is the
+minimal end-to-end performance that is guaranteed in advance by a service level
+agreement (SLA) to an application.
+The performance may be measured by metrics such as instructions per cycle (IPC),
+latency of servicing work, etc.
+
+Note that, in the context of this document, we consider priority-related
+features (i.e., those features controlling the access to resources according to
+priorities assigned to tasks) a subset of QoS features.
+
+QoS features, which can be implemented at hardware and/or software level, can be
+used to guarantee specific maximum latencies, minimum bandwidth, minimum cache
+space, and the like, for specific tasks with safety and performance
+requirements.
+These guarantees can be fully enforced to achieve complete freedom from
+interference (a “hard” guarantee), or they can allow the system to miss targets
+some of the time up to an agreed-upon threshold, achieving a limited degree of
+freedom from interference (a “soft” or “best-effort” guarantee).
+
+In this document, we cover QoS features in the context of tasks with safety
+requirements.
+
+[#sec:qos:safety:features]
+#### Features
+
+Our aim is to have configurable QoS that allows exercising control over the
+performance of tasks.
+Such control can take multiple forms, including (this is not intended to be an
+exhaustive list):
+
+* End-to-end latency bounds
+* Throughput bounds
+* Space allocation bounds
+
+QoS features can be used to guarantee a performance target for a task by
+granting that task a share of the resources.
+However, since that share of the resources then becomes unavailable to other
+tasks, any task that depends upon those resources is likely to experience a
+performance drop.
+
+QoS features are allocated to tasks and/or components with different degrees of
+temporal flexibility.
+Some examples are:
+
+* Static allocation: a task has a given QoS level that never changes across
+  different executions.
+* Semi-static allocation: each execution of a given task may have a different
+  QoS level, but that level remains constant during the complete execution of
+  the task.
+* Dynamic allocation: QoS levels may change during the execution of the task.
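As a rough illustration of the three allocation styles listed above, the following Python sketch classifies a recorded history of QoS levels for one task. The function name `classify_allocation` and the trace format (one list of integer QoS levels per execution of the task) are assumptions made for this example only; they do not come from any safety standard or RISC-V specification.

```python
from typing import Sequence


def classify_allocation(executions: Sequence[Sequence[int]]) -> str:
    """Classify how the QoS level of one task varied over time.

    `executions` holds one entry per execution of the task; each entry lists
    the QoS levels observed during that execution, in order (hypothetical
    trace format, chosen for illustration).
    """
    if not executions or any(len(levels) == 0 for levels in executions):
        raise ValueError("each execution needs at least one observed QoS level")

    # Dynamic: the level changed while some execution was in progress.
    if any(len(set(levels)) > 1 for levels in executions):
        return "dynamic"

    # Every execution kept a single level; compare the levels across executions.
    per_execution_level = {levels[0] for levels in executions}
    return "static" if len(per_execution_level) == 1 else "semi-static"


print(classify_allocation([[2, 2], [2], [2, 2, 2]]))  # static
print(classify_allocation([[1, 1], [3, 3]]))          # semi-static
print(classify_allocation([[1, 2], [2, 2]]))          # dynamic
```

Note that the sketch classifies observed behaviour only: a semi-static policy that happens to pick the same level for every execution would be reported as static.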
+
+Along with configurable QoS levels, we aim to have metrics (e.g., performance
+counters) that show how tasks use the system, so that the appropriate QoS
+level can be selected for each task (and potentially for each shared resource).
+The discussion of performance counters can be found in the corresponding
+chapter.
+
+[#sec:qos:safety:level]
+#### Level
+
+While QoS could be applied to single-core (and single-threaded) processors
+(e.g., for temperature concerns, or to manage interference across tasks running
+serially), it is particularly relevant for multi-core configurations where tasks
+run concurrently sharing hardware resources. Single-core concerns can be viewed
+as a (small) subset of multi-core concerns.
+Hence, the level at which QoS is relevant is typically the SoC, with particular
+emphasis on the shared hardware resources.
+
+[#sec:qos:safety:importance]
+#### Importance
+
+If more than one application or process needs to coexist on the same platform,
+then configurable QoS is an important solution to mitigate interference
+channels, which is required at every criticality level.
+
+[#sec:qos:safety:justification]
+#### Justification
+
+Without QoS support, a task (referred to as application in CAST32A
+cite:[cast32:2016]) may delay another by creating contention over a shared
+resource, which could be processor cycles or any of the physical resources.
+This leads to a reduction in the availability of the system.
+
+In avionics, the CAST32A guideline -- now superseded by EASA
+AMC 20-193 cite:[amc20193:2022] and FAA AC 20-193 cite:[ac20193:2024] --
+mandates that all interference channels must be identified and mitigated.
+A task of any criticality shall not impact the execution of another task,
+including its execution time (robust partitioning).
+
+In automotive, ISO 26262 part 6 (software) identifies freedom from interference
+as a requirement across different software partitions.
+Annex D further lists relevant faults that can arise in the absence of freedom
+from interference, one of which is as follows:
+
+* Timing and execution faults: blocking of execution, deadlocks, livelocks,
+  incorrect allocation of execution time (i.e., exceeding allocated time
+  budgets), and incorrect synchronization across software elements.
+
+ISO 26262 also mandates dependent failure analysis (i.e., analysis of failures
+that occur as a consequence of a previous failure) to identify and limit the
+impact of a failure, which aims to make the system more reliable.
+QoS support, partitioning, or both are likely to be mandated as an outcome
+of this analysis.
+
+Note, however, that not all incarnations of QoS support are appropriate to
+mitigate timing interference in the context of safety.
+For instance, dynamic features are generally ill-advised since they may
+challenge the certification process.
+Examples of dynamic features are QoS features that vary
+autonomously (i.e., without being specifically instructed by the affected
+safety-critical task), such as a bus arbiter that aims to keep similar
+waiting times across tasks in different cores.
+
+[#sec:qos:rv]
+### RISC-V solutions
+
+The most relevant set of features in the context of RISC-V can be found in the
+“RISC-V Quality-of-Service (QoS) Identifiers (*Ssqosid*)” v1.0 extension
+ratified on 2024/06/29 cite:[ssqosid:2024].
+While that document provides the specification of the QoS identifiers, the Fast
+Track ISA Extension Proposal with the same name cite:[ft-qosid:2023] also
+includes motivation and use cases for those QoS identifiers.
+
+The proposal describes two types of identifiers that we briefly summarize here
+for reference, although we strongly suggest that the reader review the original
+and complete documents. Note that in the discussion below, as well as in the
+rest of the document, we refer to tasks as the software unit of interest.
+In the aforementioned documents about QoS IDs, the reference unit of interest
+is the hart (abbreviation for “hardware thread”).
+While harts and tasks are different types of entities, we use task for
+consistency with the rest of the chapters and assume during our discussions
+below that each hart runs a single task (or none).
+The two types of identifiers defined by the *Ssqosid* extension are:
+
+* Resource Control Identifiers (RCIDs): Each RCID covers a set of shared
+  resources. Each task with a given RCID gets access to a specific service level
+  (QoS) from those resources, and shares them with all other tasks with the same
+  RCID.
+** Example: RCID1 could correspond to 25% bandwidth of a bus, ways 1 and 2 of a
+   shared L2 cache, and 8 entries in the request queue of the memory controller.
+   RCID2 could correspond to 50% of bandwidth, ways 3 and 4, and 16 entries.
+   RCID3 could correspond to 75%, way 3, and 16 entries.
+   We could map task A to RCID1 and task B to RCID3.
+   This would allow sharing the bus (25% vs 75%), the shared L2 cache (ways 1-2
+   vs way 3), and the memory controller as long as its queue has at least 24
+   entries (8 vs 16 entries).
+   We could also map another task (task C) to RCID3, which would make tasks B
+   and C compete for their 75% of bandwidth, cache way 3, and the 16 entries in
+   the request queue (resources in RCID3 are guaranteed only in the aggregate,
+   but B and C compete without constraint for RCID3 resources).
+   If we have an additional task D allocated to RCID2, then guarantees would not
+   be feasible (e.g., 150% bandwidth required in the bus).
+*** Note that the RCID allocated to a given task can be changed dynamically.
+* Monitoring Counter Identifiers (MCIDs): Each MCID is mapped to a specific monitor
+  of each resource with QoS capabilities, hence typically providing information
+  about the usage of that resource by tasks with that MCID (e.g., 20% bandwidth
+  utilization, 50% space allocated). In general, more than one monitor per
+  component may be needed for safety reasons (e.g., L2 cache dirty evictions,
+  L2 stall cycles in the eviction buffer, etc.).
+  Therefore, MCIDs are not directly amenable to safety uses unless some tricks
+  are played:
+** One could create multiple virtual components with QoS support (e.g., as many
+   as monitors required) and make RCIDs have no effect on those components,
+   but let them have a monitor each.
+   Yet, while feasible, this is an anomalous use of MCIDs.
+
+Overall, RCIDs need to be carefully set and allocated, and modified dynamically
+in a controlled fashion (if modified at all).
+MCIDs could serve the purpose of accessing safety-relevant monitors, but they do
+not generally match safety needs.
+
+[#sec:qos:recom]
+### Recommendations
+
+[#sec:qos:recom:enforcement]
+#### QoS enforcement
+
+RCIDs provide a sufficiently powerful abstraction that allows defining any set of
+constraints in any shared component that may be needed for safety reasons.
+This abstraction allows setting constraints for diverse components,
+including interconnects, cache memories, queues, and any others.
+Yet, defining constraints must be done with special care since nothing prevents
+using RCIDs with incompatible or potentially problematic constraints across
+tasks running in different harts.
+For instance, it is possible to run tasks whose aggregated bandwidth allocated
+in an interconnect is above 100%, which would be incompatible in practice, or
+with potentially problematic cache allocations (e.g., task A uses ways 1 and 2,
+and task B ways 2 and 3) that provide neither partitioning nor full sharing.
+Also, specific combinations of RCIDs, if used by different concurrent tasks,
+could lead to issues such as priority inversion if not defined and used with
+care.
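The kind of consistency check described above can be sketched in Python. This is a minimal illustration under stated assumptions, not part of the *Ssqosid* specification: the `RcidConfig` structure, the `check_rcids` function, and the idea of a single request-queue capacity are hypothetical, and the values reuse the RCID1/RCID2/RCID3 example given earlier in this chapter. The sketch also assumes that cache way sets should be either disjoint (partitioning) or identical (full sharing).

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class RcidConfig:
    """Hypothetical per-RCID service level (not an Ssqosid data structure)."""
    bus_bandwidth_pct: int      # share of bus bandwidth, in percent
    cache_ways: frozenset       # L2 ways this RCID may allocate into
    queue_entries: int          # memory-controller request-queue entries


def check_rcids(configs, task_to_rcid, queue_capacity):
    """Return a list of problems for the RCIDs used by concurrent tasks."""
    # Tasks sharing an RCID share its resources, so count each RCID once.
    active = sorted(set(task_to_rcid.values()))
    problems = []

    total_bw = sum(configs[r].bus_bandwidth_pct for r in active)
    if total_bw > 100:
        problems.append(f"bus bandwidth oversubscribed: {total_bw}%")

    total_q = sum(configs[r].queue_entries for r in active)
    if total_q > queue_capacity:
        problems.append(f"queue oversubscribed: {total_q} > {queue_capacity}")

    # Way sets should be disjoint (partitioning) or identical (full sharing).
    for a, b in combinations(active, 2):
        wa, wb = configs[a].cache_ways, configs[b].cache_ways
        if wa & wb and wa != wb:
            problems.append(
                f"RCID{a} and RCID{b} partially overlap in ways {sorted(wa & wb)}"
            )
    return problems


configs = {
    1: RcidConfig(25, frozenset({1, 2}), 8),
    2: RcidConfig(50, frozenset({3, 4}), 16),
    3: RcidConfig(75, frozenset({3}), 16),
}
# Tasks A, B, C as in the earlier example: no problems expected.
print(check_rcids(configs, {"A": 1, "B": 3, "C": 3}, queue_capacity=24))
# Adding task D on RCID2 oversubscribes the bus (150%) and the queue,
# and RCID2/RCID3 overlap partially in way 3.
print(check_rcids(configs, {"A": 1, "B": 3, "C": 3, "D": 2}, queue_capacity=24))
```

A check of this kind would most naturally run wherever RCIDs are assigned to tasks (for example at configuration time), so that infeasible combinations are rejected before the tasks execute concurrently.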
+
+Based on their definition, RCIDs could allow expressing virtually any set of
+constraints, such as end-to-end constraints (e.g., end-to-end memory latencies),
+but how to map RCIDs to specific QoS constraints is completely
+implementation-dependent.
+Therefore, from an ISA perspective, no further ISA support is needed to realize
+end-to-end constraints.
+
+One could use RCIDs to express multiple constraints even for a single shared
+resource, for instance, the virtual channel to use and the bandwidth
+allocated within that virtual channel for a NoC, as well as the allocated cache
+space and the number of entries allocated in multiple queues in such a cache (to
+hold miss requests, eviction requests, etc.).
+Since RCIDs can be changed dynamically, even if associated with harts, one could
+keep an RCID per task and update the RCID of the hart upon a context switch.
+Hence, the scope at which to use RCIDs is completely software-dependent and
+virtually any required scope can be realized with RCIDs.
+
+RCID management can likely be implemented in the operating system or the
+hypervisor.
+One could, for instance, link RCIDs to scheduling priorities to provide a simple
+user interface.
+
+It remains to be defined how those RCIDs are actually implemented at the
+microarchitectural level, but such a definition is beyond the scope of the
+RISC-V ISA specifications.
+Hence, while tagging requests with RCIDs and propagating those RCIDs across
+cascade requests in other components could be an appropriate implementation,
+whether this or another implementation is used is beyond the scope of this
+document.
+
+[#sec:qos:recom:monitors]
+#### QoS-relevant monitors
+
+MCIDs offer a single monitor per component which, for safety purposes, may fall
+short since QoS decisions may be based on multiple monitors.
+For instance, one may decide to increase or decrease the service for a task in a
+shared L2 cache based on how often such a task accesses the cache, whether it
+performs read or write requests, experiences hits or misses, keeps occupancy of
+specific queues high or low, etc.
+The fact that multiple such metrics would have to be covered by a single monitor
+can be regarded as a limitation and some form of safety extension may be needed.
+
+As explained before, virtual components can be defined to provide as
+many monitors as required per physical or logical component.
+While this trick would be practically doable, it can be regarded as an
+inappropriate use of MCIDs.
+Hence, this further encourages the definition of appropriate safety extensions
+for safety-related monitoring in general, and safety-related QoS monitoring in
+particular.
+
+Safety extensions for monitoring could consist of having an arbitrarily large
+(or large enough) set of memory-mapped monitors so that a given task can access
+as much information as needed.
+These safety extensions could be easily combined with the current MCID
+definition so that the MCID is used to choose the appropriate set of monitors
+to read.
+Different tasks with different MCIDs may want to read the same monitor, which
+may be mapped into multiple memory locations (e.g., overall interconnect
+utilization), or different per-task monitors (e.g., individual interconnect
+utilization).
+
+[#sec:qos:recom:propagation]
+#### QoS ID propagation
+
+Finally, a concern spanning both RCIDs and MCIDs is RCID/MCID
+propagation.
+A number of microarchitectural events such as cache dirty evictions, cascade
+requests of the coherence protocol, and I/O-generated activity are hard to
+attribute to specific tasks.
+For instance, in the case of a dirty line eviction from cache, one could
+attribute such a request to the task evicting the line or to the one that
+originally modified the line.
+RCIDs and MCIDs are agnostic to those choices, which are fully
+implementation-dependent (e.g., one may use a specific RCID/MCID for
+I/O-generated activity), but it is important to make sound use of RCIDs and
+MCIDs for those types of requests as well, because they may have non-negligible
+performance effects (e.g., dirty cache line evictions may occur frequently and saturate memory access).
+
+[#sec:qos:activities]
+### Relevant activities
+
+#### Related external bodies
+
+None identified.
+
+#### Related chapters
+The goal of QoS and priority-related features overlaps quite significantly with
+that of time partitioning since both types of features are generally used to
+mitigate multi-core interference channels.
+Hence, the xref:sec:partitioning[xrefstyle=full] is related to this chapter,
+xref:sec:qos[xrefstyle=full].
+
+Also, QoS support often relies on performance monitoring counters to make QoS
+decisions.
+Hence, the xref:sec:pmc[xrefstyle=full] is related to this chapter,
+xref:sec:qos[xrefstyle=full].
diff --git a/src/fusa-whitepaper.adoc b/src/fusa-whitepaper.adoc
index 274d36a..4b1f2fa 100644
--- a/src/fusa-whitepaper.adoc
+++ b/src/fusa-whitepaper.adoc
@@ -74,6 +74,7 @@ include::contributors.adoc[]
 include::intro.adoc[]
 include::chapter2.adoc[]
 include::chapters/pmc.adoc[]
+include::chapters/qos.adoc[]
 
 // The index must precede the bibliography
 include::index.adoc[]