Generalizing and Optimizing Centralized Federated Execution #1626

edwardalee · 2023-03-06T15:58:35Z

edwardalee
Mar 6, 2023
Maintainer

Sparse Communication

@Jakio815 observed a problematic pattern with centralized coordination of federated execution that is represented by the following program:

Suppose that the Sender, which triggers at 100ms intervals, only occasionally sends an output message, say, on average, every few seconds. The problem currently is that both Sender and Receiver communicate with the RTI every 100ms even though nothing interesting is happening. This communication is captured nicely by @ChadliaJerad's (still rough) prototype trace visualizer as follows:

The RTI is on the left, the Sender in the middle, and the Receiver on the right. I've skipped the initial messages to focus more on the steady-state behavior. At logical time 0, the Receiver sends an LTC (Logical Tag Complete) message with tag 0 followed by a NET (Next Event Tag) with value 2s (the timeout time). It is telling the RTI that its event queue is empty, and that, absent network inputs, it has nothing to do until time 2s (shutdown time).

In this trace, the Sender next sends LTC (100ms), which causes the RTI to send TAG (Tag Advance Grant) (100ms) to the Receiver. This latter message, however, is unnecessary because the RTI knows that the Receiver has nothing to do until 2s and that no message was sent by the Sender with a tag of 100ms or earlier. First optimization: (easy) Eliminate this TAG message.

Second optimization: (a bit harder) Eliminate the LTC messages from the Sender to the RTI as well. Idea: To do this, when the RTI receives NET(2s) from the Receiver, it should forward that message to all federates upstream of the Receiver. Those federates should maintain a barrier tag b that is the least such tag they have received (they will need a queue of such tags because the least will drop off when the federate sends an LTC matching or exceeding the tag).
Given such a barrier b, a federate is not required to send an LTC(g) for any g < b unless it sends an output message at g.
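
The barrier rule above can be sketched as follows. This is only an illustrative model, not code from reactor-c: the class `BarrierTracker`, its method names, the `FOREVER` sentinel used when no barrier has been received, and the representation of a tag as a `(time_ns, microstep)` tuple are all assumptions made for the example.

```python
import heapq

MSEC = 1_000_000           # nanoseconds per millisecond
FOREVER = (2**63 - 1, 0)   # assumed sentinel: "no barrier known"


class BarrierTracker:
    """Hypothetical federate-side helper; tags are (time_ns, microstep) tuples."""

    def __init__(self):
        self._barriers = []  # min-heap of NET tags forwarded by the RTI

    def on_forwarded_net(self, tag):
        """Record a downstream NET tag forwarded by the RTI."""
        heapq.heappush(self._barriers, tag)

    def barrier(self):
        """Return b, the least forwarded tag still pending."""
        return self._barriers[0] if self._barriers else FOREVER

    def must_send_ltc(self, g, sent_output):
        """LTC(g) may be skipped only if g < b and no output was sent at g."""
        return sent_output or g >= self.barrier()

    def on_ltc_sent(self, g):
        """Drop every barrier that LTC(g) matches or exceeds."""
        while self._barriers and self._barriers[0] <= g:
            heapq.heappop(self._barriers)
```

In the Sender/Receiver example, after the RTI forwards NET(2s), `must_send_ltc((100 * MSEC, 0), sent_output=False)` is false, so the Sender can stay silent at the 100ms ticks where it produces no output.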

There are many interesting variants of this example (local timer at Receiver, physical action at Receiver), but discussion of those will have to wait for followup postings here.

lhstrh · 2023-03-06T16:29:27Z

lhstrh
Mar 6, 2023
Maintainer

Just to make sure that credit goes where credit's due: it was @byeong-gil who identified the problem after spending a lot of time parsing through the logs manually. He also constructed this sequence diagram that we included in our paper:

@Jakio815 (co-author) was in charge of setting up the Raspberry Pis, polling the range sensors, and coding up the application logic.

1 reply

edwardalee Mar 6, 2023
Maintainer Author

Oops, apologies @byeong-gil for attributing to @Jakio815 .

byeonggiljun · 2023-03-07T00:43:40Z

byeonggiljun
Mar 7, 2023
Collaborator

I tried to resolve

First optimization: (easy) Eliminate this TAG message.

on a branch reactor-c/rti-optimizations.

And there is a draft PR reactor-c/pull/175.

0 replies

ChadliaJerad · 2023-03-09T23:51:29Z

ChadliaJerad
Mar 9, 2023
Collaborator

Here are the visualizations of the communication within SparceSender, before and after the optimization done by @byeong-gil. The positions of the RTI and the federates are the same as described by @edwardalee above.

Before	After

The number of exchanged messages is significantly reduced in this example.

1 reply

edwardalee Mar 10, 2023
Maintainer Author

Nice! Looks like a further optimization would be to suppress the sending of ABS (null messages). Actually, I don't think those are needed on any path that is not on a cycle, except possibly at the logical start time.

byeonggiljun · 2023-03-20T13:13:37Z

byeonggiljun
Mar 20, 2023
Collaborator

This proposal is for

Second optimization: (a bit harder) Eliminate the LTC messages from the Sender to the RTI as well. Idea: To do this, when the RTI receives NET(2s) from the Receiver, it should forward that message to all federates upstream of the Receiver. Those federates should maintain a barrier tag b that is the least such tag they have received (they will need a queue of such tags because the least will drop off when the federate sends an LTC matching or exceeding the tag).
Given such a barrier b, a federate is not required to send an LTC(g) for any g < b unless it sends an output message at g.

Overview

I suggest adding a new message sent by the RTI. The purpose of this message is to let the RTI notify upstream reactors: 'You don't have to send any message until this tag unless you produce an output.' This idea comes from @Lostroh. For now, I call this message message_required_tag. There are two reasons to introduce a new message type rather than simply forwarding NET. First, this message's role does not match the definition of MSG_TYPE_NEXT_EVENT_TAG. Second, a new message type lets us make the RTI more efficient: in some situations, the RTI does not have to forward every NET message. I describe those situations below.
The diagram below shows a UML diagram of the communication. I couldn't use tracing because I have not implemented this proposal yet. The figure on the left (Current) represents the same situation as the After figure in @ChadliaJerad's comment. ABS messages represent port-absent messages. Also, in the Goal figure, lines marked with --x denote eliminated messages.

Current	Goal

Message Required Tag

This message is a notification by the RTI to the federate that some downstream federates have events at this tag.
Thus, the federate only has to send LTC and ABS (port absent) messages whose tags are equal to the message_required_tag. If it has no upstream federates, it can also limit its NET messages to those whose tags are equal to the message_required_tag. If it has any upstream federates, it cannot skip any NET messages because it has to receive TAG from the RTI.

When should the RTI send Message Required Tag?

The basic concept of this message is forwarding NET. So I'll describe situations in which the RTI doesn't have to forward NET.
The RTI doesn't have to forward NET(t1) from federate D to the federate U if

It has already sent message_required_tag with tag t1 to federate U. This can happen if U has other downstream federates.
The RTI can already send TAG(t1) to federate D. This is the case when the RTI has already received NET(t2) from U with t2 later than t1 (t2 > t1), or LTC(t3) from U with t3 later than or equal to t1 (t3 >= t1).
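
A minimal sketch of these two skip conditions, exactly as stated in this comment. The function name `can_skip_forward`, its parameters, and the `(time_ns, microstep)` tag tuples are hypothetical; the actual RTI keeps its per-federate state differently.

```python
from typing import Optional, Set, Tuple

Tag = Tuple[int, int]  # (time in ns, microstep); an assumed representation


def can_skip_forward(
    t1: Tag,
    already_sent_to_u: Set[Tag],
    last_net_from_u: Optional[Tag],
    last_ltc_from_u: Optional[Tag],
) -> bool:
    """Return True if the RTI need not forward NET(t1) from D to U."""
    # Condition 1: message_required_tag(t1) was already sent to U,
    # e.g. on behalf of another downstream federate of U.
    if t1 in already_sent_to_u:
        return True
    # Condition 2: TAG(t1) can already be granted to D, because U
    # reported NET(t2) with t2 > t1 or LTC(t3) with t3 >= t1.
    if last_net_from_u is not None and last_net_from_u > t1:
        return True
    if last_ltc_from_u is not None and last_ltc_from_u >= t1:
        return True
    return False
```

Note that the NET comparison is strict while the LTC comparison is not, mirroring t2 > t1 and t3 >= t1 above.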

How do upstream federates handle this message?

Upstream federates have to manage a priority queue that stores message_required_tag values. It is sorted in tag order, and an element is removed when the federate advances its tag beyond that element. At initialization, the queue contains only a forever tag.
Every time an upstream federate is about to send ABS or LTC, it compares the tag it intends to send with the earliest message_required_tag. If they are not identical, it doesn't have to send those messages. When the federate receives message_required_tag(t_m) and its current tag t_c is larger than t_m (t_c > t_m), it has to send LTC(t_m) and appropriate NET to the RTI.

Things to discuss

message_required_tag is only a temporary name. Let's find a better one.
About this situation,

It already sends message_required_tag with tag t1 to the federate U. This can happen if U has another downstream federates.

how can we know that the RTI already sends the message_required_tag of tag t1? I think we can use tag information in struct federate_t (e.g. next_tag), but I cannot give a concrete formulation yet.
In this line,

When the federate receives message_required_tag(t_m) and its current tag t_c is larger than t_m (t_c > t_m), it has to send LTC(t_m) and appropriate NET to the RTI.

What will be the appropriate NET? We can send the current tag or wait some time and send the next event tag of the current tag.
We have to consider the race condition carefully.

Steps

Add a new message type to the RTI and make the RTI simply forward the NET using that message type
Do not forward NET messages in the situations I described above
Do not send unnecessary messages
- Make a priority queue for storing message_required_tag
- Delete elements of the queue when the current tag exceeds the element
- Add logic to send LTC and NET messages when receiving a message_required_tag that is smaller than the current tag
- Do not send unnecessary ABS, LTC messages
- Do not send unnecessary NET messages when the federate does not have any upstream federates

0 replies

edwardalee · 2023-03-20T15:09:09Z

edwardalee
Mar 20, 2023
Maintainer Author

This proposal sounds great to me! Maybe rather than message_required_tag we could call it next_downstream_event_tag (NDET).

A couple of modifications are needed. On this:

The federate only has to send LTC and ABS (port absent) messages which are equal to the message_required_tag. If it has no upstream federates, it also can send NET messages which are equal to the message_required_tag only. If it has any upstream federates, it cannot skip any NET messages because it has to receive TAG from the RTI.

I think these two "equal to" should be "equal to or greater than".

When should the RTI send Message Required Tag?

I think the above isn't quite right. You also need to consider the case when D has another upstream federate U'. When it receives a message from U', D may send a new NET to the RTI. This new NET may require a new NDET. I think the condition is this: When the RTI receives a NET(g1) from downstream federate D, it should send an NDET to upstream federate U if:

The most recent TAG(g2) from U satisfies g2 < g1. (most recent initializes to NEVER).
Otherwise, I think the NDET has to be sent.

How do upstream federates handle this message?

Again, I think you are close, but it isn't quite right.

"If they are not identical" should be "greater than or equal to".
There is a race condition: U could choose not to send ABS(g1), then block on a network input (hence cannot send LTC(g1)), then later receive NDET(g2) with g2 <= g1. I think that with the above policy, we get deadlock. I think the right policy has to involve sending ABS(g1) in response to the NDET(g2).

Things to discuss

On this:

how can we know that the RTI already sends the message_required_tag of tag t1?

If the above modified policy is right, the RTI doesn't need to keep track of this. But there might be some optimization that I'm missing where the NDET does not need to be sent. I just don't think the optimization given above is quite right.

1 reply

byeonggiljun Mar 21, 2023
Collaborator

I appreciate your feedback!
I'll start to implement this on branch reactor-c/rti-NET-forwarding and draft PR (reactor-c/pull/176).

cmnrd · 2023-03-22T07:21:12Z

cmnrd
Mar 22, 2023
Maintainer

I think that @byeong-gil's proposal goes very much in the direction that I am also taking with the coordination of enclaves in C++ (see #1665). I will document how this works ASAP and also present it in the next Wednesday meeting. The bottom line is, however, that even the RTI can be eliminated. In the right picture in @byeong-gil's post, the Receiver could also send message_required_tag directly to the sender and the sender can send the NET directly to the receiver.

2 replies

byeonggiljun Mar 27, 2023
Collaborator

I have a question on this:

The bottom line is, however, that even the RTI can be eliminated.

Could we eliminate the RTI and send messages directly to a federate in the centralized coordination?

lhstrh Mar 27, 2023
Maintainer

Whether a federate can advance time, in principle, depends on the state of its upstream neighbors. Rather than going through the RTI, a federate could communicate with the neighbors directly.

edwardalee · 2023-03-22T07:35:44Z

edwardalee
Mar 22, 2023
Maintainer Author

I think this is a great direction! Keep in mind that when the receiver then receives a NET, it will need to compare it to its own (running) NET and forward the minimum to its own downstream receivers. The running NET needs to be the minimum of NETs received from upstream senders and its own earliest event queue tag. The tricky part then becomes handling cycles, but I'm sure that is manageable too.
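
As a rough sketch of that running NET: the function name `running_net`, the dictionary of upstream NETs, and the `(time_ns, microstep)` tag tuples are assumptions for illustration. The sketch also ignores any delays on connections and does nothing about cycles.

```python
from typing import Dict, Tuple

Tag = Tuple[int, int]          # (time in ns, microstep); assumed representation
FOREVER: Tag = (2**63 - 1, 0)  # assumed sentinel for "no event pending"


def running_net(upstream_nets: Dict[str, Tag],
                earliest_local_event: Tag = FOREVER) -> Tag:
    """Minimum of the NETs received from upstream senders and the
    earliest tag on this federate's own event queue."""
    return min(list(upstream_nets.values()) + [earliest_local_event])
```

The receiver would forward this value downstream whenever a newly received NET or a change in its own event queue alters the minimum.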

0 replies

edwardalee · 2023-03-27T06:24:21Z

edwardalee
Mar 27, 2023
Maintainer Author

One issue is that for the federates to communicate with each other directly, they need each other’s IP addresses. Currently, they get these from the RTI (for physical connections). To join a federation, each federate only needs to know the IP address of the RTI.

Also, the start time and time to stop in response to a call to request_stop() are a distributed consensus. Without an RTI, these will need to be implemented differently. Transient federates will also need to be rethought.

I’m sure these are all solvable, and perhaps a gradual approach is best. Keep the RTI for now, but bypass it as much as possible. Work towards eliminating it…

0 replies

cmnrd · 2023-03-29T11:07:28Z

cmnrd
Mar 29, 2023
Maintainer

One issue is that for the federates to communicate with each other directly, they need each other’s IP addresses. Currently, they get these from the RTI (for physical connections). To join a federation, each federate only needs to know the IP address of the RTI.

This problem is solved by the established middlewares (be it commercial or open source) and I still don't think we should solve it ourselves unless we absolutely need to. That said, if we don't need auto-discovery, we can also simply pass the addresses to the federates at deployment.

0 replies

cmnrd · 2023-03-31T13:43:48Z

cmnrd
Mar 31, 2023
Maintainer

I ran into an issue with cycles in the current enclave coordination scheme that might also be relevant for the discussion here. While cycles (with delays) work in principle, they can become super inefficient if delays are small. In the EnclaveCycle.lf test, the ping reactor sends a message to pong every 100ms. The pong reactor then replies with a 50ms delay. This works fine. However, when I reduce the delay to 10ms a problem appears. Under the current scheme, ping will ask pong if it is safe to process the tag (100ms, 0). pong does not know because it has an input from ping and consequently it asks ping if it is safe to process the tag (90ms, 0). ping does not know and consequently asks pong for tag (90ms, 0) which in turn will ask ping for tag (80ms, 0). This continues for a while, until the tag (10ms, 0) at ping is reached, which has been released before because ping processed an event at this tag.

The smaller the delay becomes, the worse the situation gets. So far, I don't have a good idea how to break this loop. The enclaves will need to be able to detect that they are located within a loop somehow. I am posting this here, because this might also be an interesting corner case for any of the optimizations discussed for the current federated execution.
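
To make the growth concrete, here is a small back-of-the-envelope model of the query chain described above. It is a simplification, not the actual enclave coordination code: the function name `query_chain` is made up, and it assumes that ping has already released the tag equal to the delay (the tag of pong's first reply) and that each query from pong steps back by exactly one delay.

```python
def query_chain(target_ms, delay_ms):
    """Return the (asker, asked, tag_ms) queries needed before ping may
    process target_ms, under the simplifying assumptions stated above."""
    if delay_ms <= 0:
        raise ValueError("delay must be positive")
    released_ms = delay_ms  # assumed: ping already processed pong's first reply
    chain = []
    tag = target_ms
    while tag > released_ms:
        chain.append(("ping", "pong", tag))  # ping asks pong about tag
        tag -= delay_ms
        chain.append(("pong", "ping", tag))  # pong asks ping about tag - delay
    return chain


if __name__ == "__main__":
    for delay in (50, 10, 1):
        print(delay, len(query_chain(100, delay)))
```

In this model, reaching the tag at 100ms takes 2 queries with a 50ms delay, 18 with a 10ms delay, and 198 with a 1ms delay, so the chain length grows inversely with the delay.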

1 reply

edwardalee Mar 31, 2023
Maintainer Author

This is solved in federated execution using the transitive NET, which has to have cycle detection. I'm not totally sure how to realize the cycle detection without an RTI, but I suspect it is not hard.

byeonggiljun · 2023-05-18T07:55:42Z

byeonggiljun
May 18, 2023
Collaborator

@edwardalee Hello, Professor Lee, I'd like to ask something about your reply that I quoted below.

This proposal sounds great to me! Maybe rather than message_required_tag we could call next_downstream_event_tag (NDET).

A couple of modifications are needed. On this:

The federate only has to send LTC and ABS (port absent) messages which are equal to the message_required_tag. If it has no upstream federates, it also can send NET messages which are equal to the message_required_tag only. If it has any upstream federates, it cannot skip any NET messages because it has to receive TAG from the RTI.

I think these two "equal to" should be "equal to or greater than".

When should the RTI send Message Required Tag?

I think the above isn't quite right. You need to also consider the case when D has another upstream federate U'. When it receives a message from the U', D may send a new NET to the RTI. This new NET may require a new NDET. I think the condition is this: When the RTI receives a NET(g1) from downstream federate D, it should send a NDET to upstream federate U if:

The most recent TAG(g2) from U satisfies g2 < g1. (most recent initializes to NEVER).
Otherwise, I think the NDET has to be sent.

How do upstream federates handle this message?

Again, I think you are close, but it isn't quite right.

"If they are not identical" should be "greater than or equal to".

There is a race condition: U could choose not to send ABS(g1), then block on a network input (hence cannot send LTC(g1)), then later receive NDET(g2) with g2 <= g1. I think that with the above policy, we get deadlock. I think the right policy has to involve sending ABS(g1) in response to the NDET(g2).

Things to discuss

On this:

how can we know that the RTI already sends the message_required_tag of tag t1?

If the above modified policy is right, the RTI doesn't need to keep track of this. But there might be some optimization that I'm missing where the NDET does not need to be sent. I just don't think the optimization given above is quite right.

Here,

When the RTI receives a NET(g1) from downstream federate D, it should send a NDET to upstream federate U if:

The most recent TAG(g2) from U satisfies g2 < g1. (most recent initializes to NEVER).
Otherwise, I think the NDET has to be sent.

I think you wanted to say "Otherwise, I think the NDET doesn't have to be sent". Is it right?

Further, I think the RTI should send a NDET to upstream federate U if

The most recent LTC(g2) (instead of TAG) from U satisfies g2 < g1. (most recent initializes to NEVER).

Let me explain the reason. Assume that there is a downstream federate D that has two upstream federates U and U', and that neither of them has any upstream federates. When the RTI receives NET(g1) from federate D, it has to determine whether or not to send NDETs to U and U'. If the RTI received LTC(g2) from U and LTC(g3) from U' earlier and g2 < g3, TAG(g2) must be granted to all federates. Finally, in the case g2 < g1 <= g3, the RTI does not have to send a NDET to U'. Thus, I think that the RTI needs to compare LTC from upstream federates before sending NDET messages.
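
The LTC-based rule proposed in this comment can be sketched as below. It only illustrates the proposal as written here, not a settled policy: the function name `upstreams_needing_ndet`, the `NEVER` sentinel value, and the `(time_ns, microstep)` tag tuples are assumptions.

```python
MSEC = 1_000_000        # nanoseconds per millisecond
NEVER = (-(2**63), 0)   # assumed sentinel: no LTC received yet


def upstreams_needing_ndet(g1, last_ltc):
    """Given NET(g1) from D and the most recent LTC tag of each upstream
    federate (NEVER if none), return the upstream federates whose
    LTC(g2) satisfies g2 < g1, i.e. those that would receive an NDET."""
    return sorted(u for u, g2 in last_ltc.items() if g2 < g1)


# Example from the text: g2 < g1 <= g3, so only U gets an NDET.
example = upstreams_needing_ndet(
    (200 * MSEC, 0),
    {"U": (100 * MSEC, 0), "U'": (300 * MSEC, 0)},
)
```

Here `example` is `["U"]`: U' has already completed a tag at or beyond g1, so under this rule it receives no NDET.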

2 replies

edwardalee May 18, 2023
Maintainer Author

Here,

When the RTI receives a NET(g1) from downstream federate D, it should send a NDET to upstream federate U if:

The most recent TAG(g2) from U satisfies g2 < g1. (most recent initializes to NEVER).
Otherwise, I think the NDET has to be sent.
I think you wanted to say "Otherwise, I think the NDET doesn't have to be sent". Is it right?

Further, I think the RTI should send a NDET to upstream federate U if

The most recent LTC(g2) (instead of TAG) from U satisfies g2 < g1. (most recent initializes to NEVER).

Let me explain the reason. Assume that there is a downstream federate D that has two upstream federates U and U', and that neither of them has any upstream federates. When the RTI receives NET(g1) from federate D, it has to determine whether or not to send NDETs to U and U'. If the RTI received LTC(g2) from U and LTC(g3) from U' earlier and g2 < g3, TAG(g2) must be granted to all federates. Finally, in the case g2 < g1 <= g3, the RTI does not have to send a NDET to U'. Thus, I think that the RTI needs to compare LTC from upstream federates before sending NDET messages.

This still doesn't sound quite right to me. I don't think the "TAG(g2) must be granted to all federates" is right. Shouldn't this depend on g1? I think maybe the right response from the RTI to a NET(g1) from D should be like this:

At least one of these conditions will be true, so there will always be some response from the RTI.

byeonggiljun May 18, 2023
Collaborator

Thank you for replying, professor!

I apologize for confusing you. The intended meaning of "TAG(g2) must be granted to all federates" was "TAG(g2) must have been granted to all federates when the RTI received LTC messages from upstream federates.".

Also, I agree with your figure! Again, thank you for giving me the direction.
