From cd754e5230caeac8710046c78946ab203d4e8850 Mon Sep 17 00:00:00 2001
From: THS-on
Date: Thu, 24 Jun 2021 20:37:48 +0200
Subject: [PATCH] Proposal for revocation severity levels and context (#47)

Signed-off-by: Thore Sommer
---
 46_revocation_severity_and_context.md | 334 ++++++++++++++++++++++++++
 1 file changed, 334 insertions(+)
 create mode 100644 46_revocation_severity_and_context.md

diff --git a/46_revocation_severity_and_context.md b/46_revocation_severity_and_context.md
new file mode 100644
index 0000000..a6b11e7
--- /dev/null
+++ b/46_revocation_severity_and_context.md
@@ -0,0 +1,334 @@
+
+# enhancement-46: Context and severity levels for revocations
+
+- [Release Signoff Checklist](#release-signoff-checklist)
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+  - [User Stories (optional)](#user-stories-optional)
+    - [Story 1](#story-1)
+    - [Story 2](#story-2)
+  - [Notes/Constraints/Caveats (optional)](#notesconstraintscaveats-optional)
+  - [Risks and Mitigations](#risks-and-mitigations)
+- [Design Details](#design-details)
+  - [Test Plan](#test-plan)
+  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+- [Infrastructure Needed (optional)](#infrastructure-needed-optional)
+
+## Release Signoff Checklist
+
+- [ ] Enhancement issue in release milestone, which links to pull request in [keylime/enhancements]
+- [ ] Core members have approved the issue with the label `implementable`
+- [ ] Design details are appropriately documented
+- [ ] Test plan is in place
+- [ ] User-facing documentation has been created in [keylime/keylime-docs]
+
+## Summary
+This enhancement proposal adds tagging of revocation events and the capability
+to differentiate and classify revocation events by adding severity levels.
+
+## Motivation
+Currently Keylime operates in a binary state: a device is either verified or it is not.
+
+Not all revocation events have to be handled with equal severity.
+Depending on the use case, a failing PCR caused by a firmware update is handled
+differently than IMA policy failures.
+
+Granular reporting of revocation events allows the user to make more informed
+decisions if a failure occurs. The additional context also helps with
+reconstructing why a revocation was issued.
+
+This enhancement allows users to use Keylime in new scenarios like exam
+environments where a person must make an informed decision if a device is
+actually compromised.
+
+### Goals
+ * Allow the user to specify a severity level and context information for every
+   part in Keylime that causes a potential revocation.
+ * Tag all parts of Keylime that cause revocations with a unique id.
+ * Provide the option for Keylime to add context to revocation events.
+   This is useful if, for example, event log parsing gains support for dynamic
+   policies.
+ * Extend the verifier API and tenant to add those rules to Keylime.
+ * Provide all the necessary information about an event for future revocation
+   mechanisms.
+
+### Non-Goals
+ * The classification of revocation events is highly environment and
+   configuration specific. It is not a goal to provide a default configuration
+   other than classifying all events with the highest severity.
+ * Changing the Keylime databases to keep track of which events were already
+   sent. This would break the assumption that the databases are "non historical".
+ * Implementing a new way of adding revocation mechanisms.
+
+## Proposal
+This proposal consists of two parts. Part A can be implemented without part B.
+
+### Part A - Tagging of all parts of Keylime that can cause a revocation
+In the current model any failing check causes the agent to no longer be
+verified, and no further information about what exactly failed is recorded.
+Checks that are still outstanding are not executed after the first failure
+occurs.
+
+Part A changes that by tagging each component in Keylime that might cause a
+revocation event and by trying to evaluate all checks instead of aborting when
+one check fails.
+
+### Part B - Extending the revocation events to support severity levels
+This part builds on the capabilities gained from part A and adds severity level
+functionality to the current revocation mechanism.
+
+Currently revocation messages are sent if the communication with the agent goes
+wrong, the quote is not valid or one of the PCR-based checks fails. This part
+changes that behavior by introducing the concept of severity levels: messages
+are now sent if an event occurs with a severity level higher than the currently
+recorded one.
+
+Instead of no longer polling an agent once a failure has occurred, the agent is
+added back for checking if the failure is recoverable. If it is not
+recoverable, a revocation event with the highest severity is generated.
+
+The user can specify a severity level for an event on an agent-by-agent basis.
+
+### User Stories (optional)
+User stories belong to part B of this proposal. Part A only provides internal
+design changes for future enhancements.
+
+#### Story 1
+* User specifies that all `ima` events have a severity level of `warning` and
+  all failures of `pcr_validation` have a severity level of `err`.
+* Agent B removes agents from a system if any revocation message with the
+  severity level `err` is generated.
+* Agent A has a file that fails the IMA check and the failure object contains an
+  event with the event id `ima.ng-sig.hashfailed`.
+* Failure object is evaluated and the highest severity level is `warning`.
+* Agent A has not had a failure before, so a revocation message is sent with the
+  severity level `warning` and for agent A the `severity_level` is set to
+  `warning`.
+* Agent B ignores the revocation message based on the level.
+
+#### Story 2
+* Setup is the same as at the end of Story 1.
+* Agent A still has a file that fails the IMA check and the failure object
+  contains an event with the event id `ima.ng-sig.hashfailed`.
+* Failure object is evaluated and the highest severity level is `warning`.
+* The `severity_level` of agent A is already `warning`, so no revocation message
+  gets sent.
+
+#### Story 3
+* Setup is the same as at the end of Story 1.
+* PCR 10 of agent A now has a wrong value and the failure object now contains an
+  event with the event id `ima.ng-sig.hashfailed` and also one with
+  `pcr_validation.pcr10`.
+* Failure object is evaluated and the highest severity level is `err`.
+* `err` is a higher severity level than `warning`, so a revocation message with
+  the severity level `err` is sent and for agent A the `severity_level` is
+  set to `err`.
+* Agent B removes agent A from a system.
+
+### Notes/Constraints/Caveats
+Part B of this proposal is limited by the database design of Keylime. More
+flexible solutions can be implemented outside of Keylime once an API for
+external revocation mechanisms is implemented.
+
+### Risks and Mitigations
+The option to classify the revocation events opens the possibility that events
+are not handled. To mitigate this, all revocation events that are not explicitly
+classified by the user are assigned the highest severity level.
+
+The same concept applies to revocation events that are caused by failures that
+make the agent irrecoverable. Those also always need to be treated with the
+highest severity by default, because otherwise some events won't be caught if an
+irrecoverable failure is triggered first.
+
+## Design Details
+
+### Part A
+#### Event tagging
+Events are tagged with an event id, which has the following schema:
+`component.[sub_component].event`.
+
+* `component`: Name of the part of Keylime where the event was generated.
+  * Current components would be: `quote_validation`, `pcr_validation`,
+    `measured_boot`, `ima` and `internal`.
+* `sub_component`: Used if it is useful to separate a component into
+  sub-components. One example would be IMA checks.
+* `event`: The actual event itself.
+
+An example event id for a failing static PCR check would be `pcr_validation.pcr10`.
+
+In most cases the event ids are static, but for dynamic policies the
+`sub_component` and `event` can be generated automatically.
+
+The motivation behind that schema is to allow Keylime to adjust the granularity
+of the events where needed and give the user an easy way to assign a severity
+level to a whole subset of Keylime.
+
+Events can carry a context. It can be either a static string or a JSON object
+that contains more information about the event. The context is optional, and it
+is generally assumed that two events from the same agent with the same event id
+have the same severity and are the same event.
+
+#### Collecting events
+Instead of returning early if one check in a component fails, a new failure
+object will be introduced to collect the events. All parts of the validation
+process (such as `check_quote` and `check_pcrs`) will append their generated
+events to that object instead of returning false.
+
+If the validation process cannot advance without a step that failed, the failure
+object must be marked as irrecoverable, and only then can a function return
+without validating any further. This will be the case if, for example, quote
+validation failed.
+
+Recoverable in this context means that the validation can continue even though
+that check did not succeed. It does not mean that the agent can get back into a
+verified state if the check succeeds again in the future.
+
+Without the implementation of part B, a revocation event will be sent if any
+check failed and the polling of the agent will be stopped.
+
+### Part B
+#### Severity levels and user rules
+The severity level is described by a label. By default the following labels are
+available: crit, err, warning, notice, info and debug.
+They are strictly ordered from highest to lowest severity. The labels are
+configurable in `keylime.conf` to allow finer granularity if necessary.
+
+The user can specify severity levels for event ids.
+One rule contains the following attributes:
+ * `event_id`: The event id to match or a regex rule for it.
+ * `severity_level`: The severity level that the matched events have.
+
+To make the rules future proof with dynamic policies, it is possible to specify a
+regular expression for that matching. This introduces more complexity on the
+parsing and matching side, but allows for flexible rules. Rules are evaluated in
+top-down order and the first matching rule is used. They are supplied as a JSON
+array string to the verifier API (or the tenant).
+
+Rules are added to the agent similar to how it is currently done for
+`mb_refstate`. A new attribute `revocation_rules` is added to the agent data in
+the verifier to hold that information. The rules can be specified on a per-agent
+basis when the agent is added to the verifier.
+
+The tenant is extended to support that functionality.
+
+#### Changes to the state machine and revocation events
+We keep the current model of the states, but modify the behavior of the failure
+states. If we are in a failure state that is not recoverable, the polling of
+the agent is stopped; otherwise the agent is still added back for normal polling.
+
+If the failure object is marked as irrecoverable, the state of the agent after
+that should be `QUOTE_FAILED_IRRECOVERABLE`, a revocation event with the highest
+severity level gets generated and the agent gets removed from polling.
+
+A new column called `severity_level` is added to the agent table in the
+verifier. It contains the highest severity level of any event generated for
+that agent.
+
+A new field called `severity_level` is added to the revocation message. It
+contains the highest severity level among the events that caused the message to
+be sent.
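
The rule matching described above can be illustrated with a minimal sketch. It is not part of the proposal or of Keylime's code: the names `SEVERITY_LABELS`, `classify` and `highest_severity` are hypothetical, and treating every `event_id` attribute as a regular expression matched with `re.fullmatch` is an assumption. The sketch applies the rules top-down with the first match winning, and falls back to the highest severity for events that no rule classifies.

```python
import json
import re

# Default labels from the proposal, ordered from highest to lowest severity.
SEVERITY_LABELS = ["crit", "err", "warning", "notice", "info", "debug"]


def severity_rank(label):
    """Return a rank where a larger number means a more severe label."""
    return len(SEVERITY_LABELS) - SEVERITY_LABELS.index(label)


def classify(event_id, rules):
    """Return the severity of one event id using top-down, first-match rules."""
    for rule in rules:
        # Assumption: every rule's event_id is treated as a regular expression.
        if re.fullmatch(rule["event_id"], event_id):
            return rule["severity_level"]
    # Events without an explicit rule get the highest severity level.
    return SEVERITY_LABELS[0]


def highest_severity(event_ids, rules):
    """Return the most severe label among all events, or None if there are none."""
    levels = [classify(event_id, rules) for event_id in event_ids]
    return max(levels, key=severity_rank, default=None)


# Hypothetical rule set, supplied as a JSON array string as in the user stories.
rules = json.loads(r"""
[
  {"event_id": "ima\\..*", "severity_level": "warning"},
  {"event_id": "pcr_validation\\..*", "severity_level": "err"}
]
""")

print(highest_severity(["ima.ng-sig.hashfailed"], rules))  # warning
print(highest_severity(["ima.ng-sig.hashfailed", "pcr_validation.pcr10"], rules))  # err
```

With these rules the first call reproduces Story 1 and the second reproduces Story 3. An event id such as `measured_boot.example` (a made-up id) would be classified as `crit` because no rule matches it.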
+
+The `process_agent` function gets a new argument that can contain a failure
+object. If the status is `QUOTE_FAILED`, the events from the failure object are
+evaluated against the user-specified rules. If the highest resulting severity
+level is higher than the one saved in `severity_level`, a revocation message is
+sent and `severity_level` is updated to the new highest severity level.
+
+This check against `severity_level` is done to prevent spamming the agents
+with messages that don't contain new information.
+
+### Test Plan
+
+### Upgrade / Downgrade Strategy
+A new column `severity_level` gets added to the table `verifiermain`.
+Otherwise Keylime will still operate in the old binary state by default.
+
+New fields are introduced in the API, so an API version update is needed to
+indicate that change.
+
+### Dependency requirements
+No additional dependencies should be required.
+
+## Drawbacks
+ * This will add additional complexity for features that cause revocation events.
+ * Part B changes the current binary state of the agent's verification status.
+
+## Alternatives
+ * Only implement the tagging (part A) and completely redesign the revocation mechanism.
+
+## Infrastructure Needed (optional)
+