Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error bit should be used only to validate message payload #71

Open
1 of 4 tasks
glopesdev opened this issue Feb 1, 2025 · 0 comments
Open
1 of 4 tasks

Error bit should be used only to validate message payload #71

glopesdev opened this issue Feb 1, 2025 · 0 comments
Labels
proposal Request for a new feature

Comments

@glopesdev
Copy link
Collaborator

glopesdev commented Feb 1, 2025

  • Proposed
  • Prototype: Not Started
  • Implementation: Not Started
  • Specification: Not Started

Summary

As described in the Binary Protocol, the Error flag was intended to be used for payload validation only. This proposal clarifies what this means exactly and aims to strongly discourage use of this bit to signal anything else, including device error states or errors in execution of long-actions.

Motivation

The state of error reporting and handling in the Harp protocol is not great. Currently, the only explicit error reporting information is the Error flag in the message format. Because of this, there has been confusion over when this bit should be used. Specifically, some devices have used this bit to signal internal error state or failure to execute a long-action.

This is a problem because by default errors reported through this bit will cause an exception to be raised by the Device node in clients, and communication will terminate. If we start assuming that the state of device internal logic needs to be propagated and handled through this bit, this would greatly complicate error handling in workflows.

At best, new operators may need to be introduced. At worst, this will drive people to always turn on the IgnoreErrors property and start being blind to all kinds of validation errors which would then silently propagate unchecked.

The Error flag can be limited to a single bit in the message protocol if we assume the context required to interpret the Error can be derived exclusively from the message contents. This already works if we match message contents to the register specification in the device.yml and is what is used in the vast majority of cases.

If instead what we want is to communicate issues with the ongoing device state, then more information than just 1-bit is definitely required. There is currently no way to map a 1-bit flag to a meaningful error message on a register-by-register basis. We would need at least an error code, and ideally a documented register address containing more information about error states. This is why we do not think it is appropriate to try and force logical errors into this single bit. For instance, we may not be able to distinguish between a logical error and a payload type, or out-of-bounds error.

The other problem of using this bit to communicate device state errors, is there is no way to specify and document this behavior in the device.yml metadata. While format errors such as message and payload types, or allowed value ranges, are declared in the register metadata, the information about special uses of the error bit cannot easily be made explicit anywhere, leaving people wondering whether they need to worry about logical errors.

By clarifying that error states in the device must be communicated via their own registers, we will be promoting explicitness in the protocol, and encouraging device developers to create specific group masks with error codes, and error register documentation for any known device-specific failure modes.

Detailed Design

Improve the binary protocol specification to clarify that the Error flag should not be used for anything else other than immediate validation of payload conformity. Adapting from the original specification (annotated):


The field [Command] has an Error flag on the 4th least significant bit. When this bit is set it means that an error has occurred.
THIS IS TOO GENERAL, WE NEED TO CONSTRAIN WHICH ERRORS ARE ALLOWED HERE

Examples of possible errors can be: THIS SHOULD BE A COMPLETE LIST OF ERRORS

  1. The controller tries to read from (or write to) a register that doesn’t exist; THIS IS FINE
  2. The controller tries to write on a read-only register. THIS IS MISSING
  3. The controller tries to write invalid data to a certain register; NEED TO CLARIFY THE MEANING OF INVALID DATA
  4. The [PayloadType] doesn’t match the target register specification. THIS IS FINE
  5. The length of the payload doesn't match the target register specification. THIS IS MISSING

As noted in bold the current specification is too ambiguous. The proposal is to constrain the Error flag to only signal payload compliance with the device schema so that errors signaled by this bit are only protocol errors.

As for the examples, even if we make these examples a complete list, we would still have an issue with the meaning of "invalid data". From the context in the above specification, we propose at least the following clarifications:

  • CONSIDER using the Error flag only for Reply messages.
  • DO NOT set the Error flag on messages of type Event.
  • DO NOT set the Error flag on messages sent from the device as a result of long-actions triggered by a command, or any invalid state entered by the device as a result of external or internal actions not caused by an invalid command.

A final clarification to be discussed is how we should recommend handling situations in which data might be "valid" contingent on the state of other registers, or other internal device state. As a general principle, this should be avoided as much as possible, as it requires the controller to keep track of device state.

However, this is sometimes unavoidable, especially for actions which may require configuration of a number of registers prior to the action being triggered (e.g. starting a pulse train, moving a motor), and where the action might be invalid only for specific combinations of register values (i.e. each register value may be "valid" by itself, but not the specific combination of all the register values taken together).

We propose here that this situation should NOT be handled by replying to the write command with a message flagged with the Error bit.

Instead, this should either be handled by sending a message from another register, or replying to the write with an alternative register value which is allowed (e.g. consider the case where setting an invalid frame rate results in setting the register to the closest allowed frame rate).

If sending a message from another register, the documentation of that register should clearly document the values indicating error, e.g. through a group mask.

Drawbacks

Waiting for review.

Alternatives

The impact of not clarifying expected behavior of the Error flag is proliferation of unexpected behavior in Reply messages, undocumented failure modes, and growing complexity in handling device event streams.

A complementary proposal might be to introduce a specific register address for device error handling which could be monitored by the generic Device node and resolved by the controller to throw specific exceptions following lookup using the device.yml.

Unresolved Questions

None at the moment.

Design Meetings

@glopesdev glopesdev added the proposal Request for a new feature label Feb 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
proposal Request for a new feature
Projects
None yet
Development

No branches or pull requests

1 participant