Error bit should be used only to validate message payload

* [x] Proposed
* [ ] Prototype: Not Started
* [ ] Implementation: Not Started
* [ ] Specification: Not Started

## Summary


As described in the [Binary Protocol](https://github.com/harp-tech/protocol/blob/016171aa9fb802b2a6f4871d1fdf5caa2d4c2a05/BinaryProtocol-8bit.md?plain=1#L145), the `Error` flag was intended to be used for payload validation only. This proposal clarifies what this means exactly and aims to strongly discourage use of this bit to signal anything else, including device error states or errors in execution of long-actions.

## Motivation


The state of error reporting and handling in the Harp protocol is not great. Currently, the only explicit error reporting information is the `Error` flag in the message format. Because of this, there has been confusion over when this bit should be used. Specifically, some devices have used this bit to signal internal error state or failure to execute a long-action.

This is a problem because by default errors reported through this bit will cause an exception to be raised by the `Device` node in clients, and communication will terminate. If we start assuming that the state of device internal logic needs to be propagated and handled through this bit, this would greatly complicate error handling in workflows.

At best, new operators may need to be introduced. At worst, this will drive people to always turn on the `IgnoreErrors` property and start being blind to all kinds of validation errors which would then silently propagate unchecked.

The `Error` flag can be limited to a single bit in the message protocol if we assume the context required to interpret the `Error` can be derived exclusively from the message contents. This already works if we match message contents to the register specification in the `device.yml` and is what is used in the vast majority of cases.

If instead what we want is to communicate issues with the ongoing device state, then more information than just 1-bit is definitely required. There is currently no way to map a 1-bit flag to a meaningful error message on a register-by-register basis. We would need at least an error code, and ideally a documented register address containing more information about error states. This is why we do not think it is appropriate to try and force logical errors into this single bit. For instance, we may not be able to distinguish between a logical error and a payload type, or out-of-bounds error.

The other problem of using this bit to communicate device state errors, is there is no way to specify and document this behavior in the `device.yml` metadata. While format errors such as message and payload types, or allowed value ranges, are declared in the register metadata, the information about special uses of the error bit cannot easily be made explicit anywhere, leaving people wondering whether they need to worry about logical errors.

By clarifying that error states in the device must be communicated via their own registers, we will be promoting explicitness in the protocol, and encouraging device developers to create specific group masks with error codes, and error register documentation for any known device-specific failure modes.

## Detailed Design


Improve the binary protocol specification to clarify that the `Error` flag should not be used for anything else other than immediate validation of payload conformity. Adapting from the original specification (annotated):

---
The field [`Command`] has an Error flag on the 4th least significant bit. When this bit is set it means that an error has occurred.
**THIS IS TOO GENERAL, WE NEED TO CONSTRAIN WHICH ERRORS ARE ALLOWED HERE**

~~Examples of~~ possible errors can be: **THIS SHOULD BE A COMPLETE LIST OF ERRORS**

  1. The controller tries to read from **(or write to)** a register that doesn’t exist; **THIS IS FINE**
  2. The controller tries to write on a read-only register. **THIS IS MISSING**
  3. The controller tries to write invalid data to a certain register; **NEED TO CLARIFY THE MEANING OF INVALID DATA**
  4. The [`PayloadType`] doesn’t match the target register specification. **THIS IS FINE**
  5. The length of the payload doesn't match the target register specification. **THIS IS MISSING**
---

As noted in **bold** the current specification is too ambiguous. The proposal is to constrain the `Error` flag to only signal payload compliance with the device schema so that errors signaled by this bit are only protocol errors.

As for the examples, even if we make these examples a complete list, we would still have an issue with the meaning of "invalid data". From the context in the above specification, we propose at least the following clarifications:
  - CONSIDER using the `Error` flag only for `Reply` messages.
  - DO NOT set the `Error` flag on messages of type `Event`.
  - DO NOT set the `Error` flag on messages sent from the device as a result of long-actions triggered by a command, or any invalid state entered by the device as a result of external or internal actions not caused by an invalid command.

A final clarification to be discussed is how we should recommend handling situations in which data might be "valid" contingent on the state of other registers, or other internal device state. As a general principle, this should be avoided as much as possible, as it requires the controller to keep track of device state.

However, this is sometimes unavoidable, especially for actions which may require configuration of a number of registers prior to the action being triggered (e.g. starting a pulse train, moving a motor), and where the action might be invalid only for specific combinations of register values (i.e. each register value may be "valid" by itself, but not the specific combination of all the register values taken together).

We propose here that this situation should NOT be handled by replying to the write command with a message flagged with the `Error` bit.

Instead, this should either be handled by sending a message from another register, or replying to the write with an alternative register value which is allowed (e.g. consider the case where setting an invalid frame rate results in setting the register to the closest allowed frame rate).

If sending a message from another register, the documentation of that register should clearly document the values indicating error, e.g. through a group mask.

## Drawbacks


Waiting for review.

## Alternatives


The impact of not clarifying expected behavior of the `Error` flag is proliferation of unexpected behavior in `Reply` messages, undocumented failure modes, and growing complexity in handling device event streams.

A complementary proposal might be to introduce a specific register address for device error handling which could be monitored by the generic `Device` node and resolved by the controller to throw specific exceptions following lookup using the `device.yml`.

## Unresolved Questions


None at the moment.

## Design Meetings

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Error bit should be used only to validate message payload #71

Summary

Motivation

Detailed Design

Drawbacks

Alternatives

Unresolved Questions

Design Meetings

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Error bit should be used only to validate message payload #71

Description

Summary

Motivation

Detailed Design

Drawbacks

Alternatives

Unresolved Questions

Design Meetings

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions