Add QoS chapter
dgptha committed Dec 1, 2024
1 parent 7cf81b2 commit 82df320
Showing 2 changed files with 318 additions and 0 deletions.
317 changes: 317 additions & 0 deletions src/chapters/qos.adoc
@@ -0,0 +1,317 @@
[#sec:qos]
## Configurable Quality of Service (QoS) and Priority Management

[#sec:qos:safety]
### Safety needs

Safety standards define safety assessment and hazard analysis processes that
allow each task to be categorized according to its criticality, e.g., ASILs
in ISO 26262 cite:[iso26262:2018], DALs in DO-178C cite:[do178c:2012], and
software criticality categories in ECSS cite:[ecss:2024].
If the tasks that share hardware resources in a given system vary in their
criticality levels, or if safety-critical and non-safety-critical tasks
coexist, the system is said to be a mixed-criticality system.

The criticality of a task and the standard according to which its code is
developed have a direct impact on the number and type of safety mechanisms that
need to be implemented. As a consequence, the acceptable error rates of the
tasks differ according to their criticality level.
A less critical or non-critical task must therefore be prevented from
interfering with the execution of a more critical task; that is, freedom from
interference must be ensured.

Even if the tasks have the same criticality level, it is sometimes necessary to
ensure freedom from interference if the tasks rely on each other to perform some
safety function.
Such cases could be exposed, for example, by a dependent failure analysis, as
is the case in ISO 26262.

Freedom from interference, limited or complete, is often implemented by building
on Quality of Service (QoS) and priority-related features, where QoS is the
minimal end-to-end performance that is guaranteed in advance by a service level
agreement (SLA) to an application.
The performance may be measured by metrics such as instructions per cycle (IPC),
latency of servicing work, etc.

Note that, in the context of this document, we consider priority-related
features (i.e., those features controlling the access to resources according to
priorities assigned to tasks) a subset of QoS features.

QoS features, which can be implemented at the hardware and/or software level,
can be used to guarantee specific maximum latencies, minimum bandwidth, minimum
cache space, and the like, for specific tasks with safety and performance
requirements.
These guarantees can be fully enforced to achieve complete freedom from
interference (a “hard” guarantee), or allow the system to miss targets some of
the time up to some agreed-upon threshold, achieving a limited degree of freedom
from interference (a “soft” or “best-effort” guarantee).
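The difference between the two kinds of guarantee can be illustrated with a minimal sketch. It only checks recorded latencies after the fact and does not enforce anything; the function name, the sample values, and the 10% threshold are hypothetical and chosen purely for illustration.

```python
def meets_guarantee(latencies, bound, miss_threshold=0.0):
    """Return True if the observed latencies satisfy the latency bound.

    miss_threshold is the agreed-upon fraction of samples allowed to exceed
    the bound: 0.0 models a hard guarantee, a positive value a soft one.
    """
    if not latencies:
        return True  # nothing observed, nothing violated
    misses = sum(1 for value in latencies if value > bound)
    return misses / len(latencies) <= miss_threshold


# Hypothetical latency samples (in cycles) for one task; one sample exceeds 100.
samples = [80, 95, 90, 120, 85, 88, 92, 99, 97, 91]

# Hard guarantee: no sample may exceed the bound.
hard_ok = meets_guarantee(samples, bound=100)
# Soft guarantee: up to 10% of the samples may exceed the bound.
soft_ok = meets_guarantee(samples, bound=100, miss_threshold=0.10)
```

With these samples the hard check fails, because of the single 120-cycle sample, while the soft check passes.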

In this document, we cover QoS features in the context of tasks with safety
requirements.

[#sec:qos:safety:features]
#### Features

Our aim is to have configurable QoS that allows control over the
performance of tasks.
Such control can be exercised in multiple forms, such as (not intended to be an
exhaustive list):

* End-to-end latency bounds
* Throughput bounds
* Space allocation bounds

QoS features can be used to guarantee a performance target for a task by
granting such task a share of the resources.
However, since that share of the resources then becomes unavailable to other
tasks, any task that depends upon those resources is likely to experience a
performance drop.

QoS features are allocated to tasks and/or components with different degrees of
temporal flexibility.
Some examples are:

* Static allocation: a task has a given QoS level that never changes across
different executions.
* Semi-static allocation: each execution of a given task may have a different
QoS level, but such level remains constant during the complete execution of
the task.
* Dynamic allocation: QoS levels may change during the execution of the task.

Along with configurable QoS levels, we aim to have metrics (e.g., performance
counters) that assess how tasks are using the system, so that the appropriate
QoS level can be selected for each task (and potentially for each shared
resource).
The discussion of performance counters can be found in the corresponding
chapter.
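The three allocation policies listed above differ only in when a new QoS level may be assigned. The following sketch encodes that rule; the names `Allocation` and `may_change_level` are hypothetical and are not taken from any specification.

```python
from enum import Enum


class Allocation(Enum):
    STATIC = "static"            # same QoS level in every execution
    SEMI_STATIC = "semi-static"  # may differ per execution, fixed within one
    DYNAMIC = "dynamic"          # may change while the task is running


def may_change_level(policy, task_running, first_execution):
    """Return True if the policy permits assigning a new QoS level now."""
    if policy is Allocation.STATIC:
        # The level is chosen once, before the first execution starts.
        return first_execution and not task_running
    if policy is Allocation.SEMI_STATIC:
        # The level may be re-chosen between executions, never during one.
        return not task_running
    return True  # DYNAMIC: changes are allowed at any time
```

A scheduler could call such a predicate before reprogramming QoS settings, so that the chosen policy is checked in one place.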

[#sec:qos:safety:level]
#### Level

While QoS could be applied to single-core (and single-threaded) processors
(e.g., for temperature concerns, or to manage interference across tasks running
serially), it is particularly relevant for multi-core configurations where tasks
run concurrently sharing hardware resources. Single-core concerns can be viewed
as a (small) subset of multi-core concerns.
Hence, the level at which QoS is relevant is typically the SoC, with particular
emphasis on the shared hardware resources.

[#sec:qos:safety:importance]
#### Importance

If more than one application or process needs to coexist on the same platform,
then configurable QoS is an important means of mitigating interference
channels, and such mitigation is required at every criticality level.

[#sec:qos:safety:justification]
#### Justification

Without QoS support, a task (referred to as an application in CAST-32A
cite:[cast32:2016]) may delay another by creating contention over a shared
resource, which could be processor cycles or any of the physical resources.
This leads to a reduction in the availability of the system.

In avionics, the CAST-32A guideline -- now superseded by EASA
AMC 20-193 cite:[amc20193:2022] and FAA AC 20-193 cite:[ac20193:2024] --
mandates that all interference channels be identified and mitigated.
A task of any criticality shall not impact the execution of another task,
including its execution time (robust partitioning).

In automotive, ISO 26262 part 6 (software) identifies freedom from interference
as a requirement across different software partitions.
Annex D further lists relevant faults that can arise in the absence of freedom
from interference, one of which is as follows:

* Timing and execution faults: blocking of execution, deadlocks, livelocks,
incorrect allocation of execution time (i.e. exceeding allocated time
budgets), and incorrect synchronization across software elements.

ISO 26262 also mandates dependent failure analysis (i.e., analysis of failures
that occur as a consequence of a previous failure) to identify and limit the
impact of a failure, which aims to make the system more reliable.
QoS support, partitioning, or both are likely to be mandated as an outcome
of this analysis.

Note, however, that not all incarnations of QoS support are appropriate to
mitigate timing interference in the context of safety.
For instance, dynamic features are generally ill-advised since they may
complicate the certification process.
Examples of dynamic features include QoS features that vary
autonomously (i.e., without being specifically instructed by the affected
safety-critical task), such as a bus arbiter that aims to keep similar
waiting times across tasks in different cores.

[#sec:qos:rv]
### RISC-V solutions

The most relevant set of features in the context of RISC-V can be found in the
“RISC-V Quality-of-Service (QoS) Identifiers (*Ssqosid*)” v1.0 extension,
ratified on 2024/06/29 cite:[ssqosid:2024].
While that document provides the specification of the QoS identifiers, the Fast
Track ISA Extension Proposal with the same name cite:[ft-qosid:2023] also
includes motivation and use cases for those QoS identifiers.

The proposal describes two types of identifiers that we briefly summarize here
for reference, although we strongly suggest that the reader review the original
and complete documents. Note that in the discussion below, as well as in the
rest of the document, we refer to tasks as the software unit of interest.
In the aforementioned documents about QoS IDs, the reference unit of interest
is the hart (abbreviation for “hardware thread”).
While harts and tasks are different types of entities, we use the term task for
consistency with the rest of the chapters and assume in the discussion
below that each hart runs a single task (or none).
The two types of identifiers defined by the *Ssqosid* extension are:

* Resource Control Identifiers (RCIDs): Each RCID covers a set of shared
resources. Each task with such RCID gets access to a specific service level
(QoS) from those resources, and shares them with all other tasks with the same
RCID.
** Example: RCID1 could correspond to 25% bandwidth of a bus, ways 1 and 2 of a
shared L2 cache, and 8 entries in the request queue of the memory controller.
RCID2 could correspond to 50% of bandwidth, ways 3 and 4, and 16 entries.
RCID3 could correspond to 75%, way 3, and 16 entries.
We could map task A to RCID1 and task B to RCID3.
This would allow sharing the bus (25% vs 75%), the shared L2 cache (ways 1-2
vs way 3), and the memory controller as long as its queue has at least 24
entries (8 vs 16 entries).
We could also map another task (task C) to RCID3, which would make tasks B
and C compete for their 75% of bandwidth, cache way 3, and the 16 entries in
the request queue (resources in RCID3 are guaranteed only in the aggregate,
but B and C compete without constraint for RCID3 resources).
If we have an additional task D allocated to RCID2, then guarantees would not
be feasible (e.g., 150% bandwidth required in the bus).
*** Note that the RCID allocated to a given task can be changed dynamically.
* Monitor Counter Identifiers (MCIDs): Each MCID is mapped to a specific monitor
of each resource with QoS capabilities, hence typically providing information
about the usage of that resource by tasks with such MCID (e.g., 20% bandwidth
utilization, 50% space allocated). In general, more than one monitor per
component may be needed for safety reasons (e.g., L2 cache dirty evictions,
L2 stall cycles in the eviction buffer, etc).
Therefore, MCIDs are not directly amenable to safety uses unless some tricks
are played:
** One could create multiple virtual components with QoS support (e.g., as many
as monitors required) and make RCIDs have no effect on those components,
but let them have a monitor each.
Yet, while feasible, this is an anomalous use of MCIDs.

Overall, RCIDs need to be carefully set and allocated, and modified dynamically
in a controlled fashion (if at all modified).
MCIDs could serve the purpose of accessing safety-relevant monitors, but they do
not generally match safety needs.

[#sec:qos:recom]
### Recommendations

[#sec:qos:recom:enforcement]
#### QoS enforcement
RCIDs provide a sufficiently powerful abstraction to define any set of
constraints that may be needed for safety reasons in any shared component,
including interconnects, cache memories, queues, and others.
Yet, constraints must be defined with special care, since nothing prevents
RCIDs with incompatible or potentially problematic constraints from being used
across tasks running in different harts.
For instance, it is possible to run tasks whose aggregated bandwidth allocation
in an interconnect exceeds 100%, which cannot be honored in practice, or
with potentially problematic cache allocations (e.g., task A uses ways 1 and 2,
and task B ways 2 and 3) that provide neither partitioning nor full sharing.
Also, specific combinations of RCIDs, if used by different concurrent tasks,
could lead to issues such as priority inversion if not defined and used with
care.
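Since nothing in the ISA rejects incompatible RCID configurations, system software may want to validate them before use. The sketch below checks a set of concurrently active RCIDs for oversubscribed bandwidth, oversubscribed queue entries, and partially overlapping cache ways. It reuses the RCID1 to RCID3 example values from the RISC-V solutions section; the table layout and the function `check_rcids` are hypothetical, since the mapping from RCIDs to constraints is implementation dependent.

```python
from itertools import combinations

# Hypothetical RCID table: bus bandwidth share (%), shared L2 cache ways, and
# memory-controller request-queue entries granted to each RCID.
RCIDS = {
    1: {"bandwidth": 25, "ways": {1, 2}, "queue": 8},
    2: {"bandwidth": 50, "ways": {3, 4}, "queue": 16},
    3: {"bandwidth": 75, "ways": {3}, "queue": 16},
}


def check_rcids(active, queue_size, table=RCIDS):
    """Return a list of problems for the RCIDs that are in use concurrently."""
    problems = []
    total_bw = sum(table[r]["bandwidth"] for r in active)
    if total_bw > 100:
        problems.append(f"bandwidth oversubscribed: {total_bw}%")
    total_queue = sum(table[r]["queue"] for r in active)
    if total_queue > queue_size:
        problems.append(f"queue oversubscribed: {total_queue} > {queue_size}")
    for a, b in combinations(sorted(active), 2):
        shared = table[a]["ways"] & table[b]["ways"]
        # Identical way sets mean full sharing; a partial overlap provides
        # neither partitioning nor full sharing.
        if shared and table[a]["ways"] != table[b]["ways"]:
            problems.append(
                f"RCID{a}/RCID{b} partially overlap on ways {sorted(shared)}"
            )
    return problems


ok = check_rcids({1, 3}, queue_size=24)      # tasks on RCID1 and RCID3 only
bad = check_rcids({1, 2, 3}, queue_size=24)  # a further task uses RCID2
```

Here `ok` is an empty list, whereas `bad` reports 150% bandwidth, 40 queue entries against 24 available, and the overlap of RCID2 and RCID3 on way 3. The check takes a set of RCIDs rather than a list of tasks because tasks that share an RCID share its allocation in the aggregate.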
Based on their definition, RCIDs could express virtually any set of
constraints, such as end-to-end constraints (e.g., end-to-end memory latencies),
but how RCIDs map to specific QoS constraints is completely implementation
dependent.
Therefore, from an ISA perspective, no further ISA support is needed to realize
end-to-end constraints.

One could use RCIDs to express multiple constraints even for a single shared
resource: for instance, the virtual channel to use and the bandwidth
allocated within that virtual channel for a NoC, as well as the allocated cache
space and the number of entries allocated in multiple queues in a cache (to
hold miss requests, eviction requests, etc.).

Since RCIDs can be changed dynamically, even though they are associated with
harts, one could keep an RCID per task and update the RCID of the hart upon a
context switch.
Hence, the scope at which to use RCIDs is completely software dependent and
virtually any required scope can be realized with RCIDs.
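The per-task RCID scheme can be modeled in a few lines. This is a software model only, not driver code: the `Hart` class stands in for a real hart, and the task table is hypothetical. The `srmcfg` name and the field positions (RCID in bits 11:0, MCID in bits 27:16) reflect our reading of the Ssqosid specification and should be checked against the ratified text.

```python
RCID_MASK = 0xFFF  # assumption: RCID occupies srmcfg bits 11:0
MCID_SHIFT = 16    # assumption: MCID occupies srmcfg bits 27:16
MCID_MASK = 0xFFF


def pack_srmcfg(rcid, mcid):
    """Build the value that software would write to srmcfg for a task."""
    if not 0 <= rcid <= RCID_MASK or not 0 <= mcid <= MCID_MASK:
        raise ValueError("identifier out of range")
    return (mcid << MCID_SHIFT) | rcid


class Hart:
    """Stand-in for a hart; srmcfg is a plain attribute, not a real CSR."""

    def __init__(self):
        self.srmcfg = 0


# Hypothetical per-task (RCID, MCID) pairs kept by the OS or hypervisor.
TASK_QOS = {"A": (1, 10), "B": (3, 11)}


def context_switch(hart, next_task):
    """Install the incoming task's RCID and MCID as part of the switch."""
    rcid, mcid = TASK_QOS[next_task]
    hart.srmcfg = pack_srmcfg(rcid, mcid)


hart = Hart()
context_switch(hart, "B")  # the hart now carries task B's identifiers
```

On real hardware the assignment to `hart.srmcfg` would be a CSR write performed by supervisor-level software, and the number of usable RCID and MCID values depends on the implementation.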
RCID management can likely be implemented in the operating system or the
hypervisor.
One could, for instance, link RCIDs to scheduling priorities to provide a simple
user interface.

It remains to be defined how those RCIDs are effectively implemented at the
microarchitectural level, but such a definition is beyond the RISC-V ISA
specifications.
Hence, while tagging requests with RCIDs and propagating those RCIDs across
cascade requests in other components could be an appropriate implementation,
whether this or another implementation is used is beyond the scope of this
document.

[#sec:qos:recom:monitors]
#### QoS-relevant monitors
MCIDs offer a single monitor per component, which may fall short for safety
purposes since QoS choices may be made based on multiple monitors.
For instance, one may decide to increase or decrease the service for a task in a
shared L2 cache based on how often such a task accesses the cache, whether it
performs read or write requests, experiences hits or misses, keeps occupancy of
specific queues high or low, etc.
The fact that multiple such metrics would have to be covered by a single monitor
can be regarded as a limitation, and some form of safety extension may be needed.

As explained before, virtual components can be used as a way to define as
many MCIDs as required per physical or logical component.
While this trick would be practically doable, it can be regarded as an
inappropriate use of MCIDs.
Hence, this further encourages the definition of appropriate safety extensions
for safety-related monitoring in general, and safety-related QoS monitoring in
particular.

Safety extensions for monitoring could consist of having an arbitrarily large
(or large enough) set of memory-mapped monitors so that a given task can access
as much information as needed.
These safety extensions could be easily combined with the current MCID
definition so that the MCID is used to choose the appropriate set of monitors
to read.
Different tasks with different MCIDs may want to read the same monitor, which
may be mapped into multiple memory locations (e.g., overall interconnect
utilization), or different per-task monitors (e.g., individual interconnect
utilization).

[#sec:qos:recom:propagation]
#### QoS IDs propagation
Finally, a concern spanning both RCIDs and MCIDs is RCID/MCID
propagation.
A number of microarchitectural events such as cache dirty evictions, cascade
requests of the coherence protocol, and I/O generated activity are hard to
attribute to specific tasks.
For instance, in the case of a dirty line eviction from a cache, one could
attribute the request to the task evicting the line or to the one that
originally modified the line.
RCIDs and MCIDs are agnostic to those choices, which are fully implementation
dependent (e.g., one may use a specific RCID/MCID for I/O generated activity).
However, it is important to make sound use of RCIDs and MCIDs for those types of
requests as well, because they may have non-negligible performance effects (e.g.,
dirty cache line evictions may occur frequently and saturate memory access).

[#sec:qos:activities]
### Relevant activities

#### Related external bodies

None identified.

#### Related chapters
The goal of QoS and priority-related features overlaps quite significantly with
that of time partitioning since both types of features are generally used to
mitigate multicore interference channels.
Hence, the xref:sec:partitioning[xrefstyle=full] is related to this chapter,
xref:sec:qos[xrefstyle=full].
Also, QoS support often relies on performance monitoring counters to make QoS
decisions.
Hence, the xref:sec:pmc[xrefstyle=full] is related to this chapter,
xref:sec:qos[xrefstyle=full].
1 change: 1 addition & 0 deletions src/fusa-whitepaper.adoc
@@ -74,6 +74,7 @@ include::contributors.adoc[]
include::intro.adoc[]
include::chapter2.adoc[]
include::chapters/pmc.adoc[]
include::chapters/qos.adoc[]

// The index must precede the bibliography
include::index.adoc[]