Generalizing and Optimizing Centralized Federated Execution #1626
Replies: 11 comments 8 replies
-
Just to make sure that credit goes where credit's due: it was @byeong-gil who identified the problem after spending a lot of time parsing through the logs manually. He also constructed this sequence diagram that we included in our paper:
-
I tried to resolve
on the branch reactor-c/rti-optimizations. There is also a draft PR, reactor-c/pull/175.
-
Here are the visualizations of the communication within
The number of exchanged messages is significantly reduced in this example.
-
This proposal is for
Overview
I suggest adding a new message from the RTI. The purpose of this message is to let the RTI tell upstream reactors: "You don't have to send any message until this tag unless you produce an output." This is the idea from @Lostroh. I temporarily name this message
Message Required Tag
This message is a notification by the RTI to the federate that some downstream federates have events at this tag.
When should the RTI send Message Required Tag?
The basic concept of this message is forwarding NET. So I'll describe situations in which the RTI doesn't have to forward NET.
How do upstream federates handle this message?
Upstream federates have to manage a priority queue that stores
Things to discuss
Steps
-
This proposal sounds great to me! Maybe rather than
A couple of modifications are needed. On this:
When should the RTI send Message Required Tag?
I think the above isn't quite right. You also need to consider the case where D has another upstream federate U'. When it receives a message from U', D may send a new NET to the RTI. This new NET may require a new NDET. I think the condition is this: when the RTI receives a NET(g1) from downstream federate D, it should send an NDET to upstream federate U if:
How do upstream federates handle this message?
Again, I think you are close, but it isn't quite right.
Things to discuss
On this:
-
I think that @byeong-gil's proposal goes very much in the direction that I am also taking with the coordination of enclaves in C++ (see #1665). I will document how this works ASAP and also present it in the next Wednesday meeting. The bottom line is, however, that even the RTI can be eliminated. In the right picture in @byeong-gil's post, the Receiver could also send
-
I think this is a great direction! Keep in mind that when the receiver receives a NET, it will need to compare it to its own (running) NET and forward the minimum to its own downstream receivers. The running NET needs to be the minimum of the NETs received from upstream senders and the earliest tag on its own event queue. The tricky part then becomes handling cycles, but I'm sure that is manageable too.
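To make the running-NET rule concrete, here is a minimal sketch. The function name, the dictionary of upstream NETs, and the use of plain integers (milliseconds) for tags are all assumptions for illustration; real tags are (time, microstep) pairs, and this is not code from reactor-c or reactor-cpp.

```python
def running_net(upstream_nets, earliest_local_tag):
    """Return the NET a federate would forward downstream.

    upstream_nets: hypothetical dict mapping upstream name -> latest NET received.
    earliest_local_tag: earliest tag on the local event queue, or None if empty.
    Returns None when nothing is known yet.
    """
    candidates = list(upstream_nets.values())
    if earliest_local_tag is not None:
        candidates.append(earliest_local_tag)
    # The running NET is the minimum over everything that could still
    # produce an event here: upstream senders and the local queue.
    return min(candidates) if candidates else None


# Upstream A promises nothing before 500 ms, B nothing before 300 ms,
# and the local queue has an event at 400 ms.
print(running_net({"A": 500, "B": 300}, 400))  # 300
# With an empty local queue, only the upstream NETs matter.
print(running_net({"A": 500}, None))           # 500
```

This sketch ignores connection delays and cycles, which is where the tricky part mentioned above comes in.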
-
One issue is that for the federates to communicate with each other directly, they need each other’s IP addresses. Currently, they get these from the RTI (for physical connections). To join a federation, each federate only needs to know the IP address of the RTI. Also, the start time and the time to stop in response to a call to request_stop() are decided by distributed consensus. Without an RTI, these will need to be implemented differently. Transient federates will also need to be rethought. I’m sure these are all solvable, and perhaps a gradual approach is best. Keep the RTI for now, but bypass it as much as possible. Work towards eliminating it…
-
This problem is already solved by the established middlewares (be they commercial or open source), and I still don't think we should solve it ourselves unless we absolutely need to. That said, if we don't need auto-discovery, we can also simply pass the addresses to the federates at deployment.
-
I ran into an issue with cycles in the current enclave coordination scheme that might also be relevant for the discussion here. While cycles (with delays) work in principle, they can become super inefficient if delays are small. In the EnclaveCycle.lf test, the The smaller the delay becomes, the worse the situation gets. So far, I don't have a good idea of how to break this loop. The enclaves will somehow need to be able to detect that they are located within a loop. I am posting this here because this might also be an interesting corner case for any of the optimizations discussed for the current federated execution.
-
@edwardalee Hello, Professor Lee, I'd like to ask something about your reply that I quoted below.
I think you wanted to say "Otherwise, I think the NDET doesn't have to be sent". Is that right? Further, I think the RTI should send an NDET to upstream federate U if
Let me explain the reason. Let's assume that there is a downstream federate D that has two upstream federates U and U', and that neither upstream federate has any upstream federates of its own. When the RTI receives NET(g1) from federate D, it has to determine whether or not it should send NDETs to U and U'. If the RTI received LTC(g2) from U and LTC(g3) from U' earlier and g2 < g3, TAG(g2) must be granted to all federates. Finally, in the case g2 < g1 <= g3, the RTI does not have to send an NDET to U'. Thus, I think that the RTI needs to compare the LTCs from upstream federates before sending NDET messages.
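The U / U' example can be sketched as a small check. This is only one reading of the example above, not the agreed-upon condition: it assumes the RTI skips the NDET for any upstream federate whose last LTC is already at or past g1. The function name and the integer tags are hypothetical.

```python
def upstreams_needing_ndet(g1, last_ltc):
    """Return the upstream federates that would get NDET(g1).

    g1: the tag in the NET(g1) received from downstream federate D.
    last_ltc: hypothetical dict mapping upstream name -> tag of its last LTC.
    Assumption: an upstream that has already completed a tag >= g1
    gains nothing from being told about g1.
    """
    return [name for name, ltc in last_ltc.items() if ltc < g1]


# g2 = 1 (from U), g3 = 3 (from U'), and D reports NET(g1 = 2),
# so g2 < g1 <= g3: only U needs the NDET.
print(upstreams_needing_ndet(2, {"U": 1, "U'": 3}))  # ['U']
```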
-
Sparse Communication
@Jakio815 observed a problematic pattern with centralized coordination of federated execution that is represented by the following program:
Suppose that the Sender, which triggers at 100ms intervals, only occasionally sends an output message, say, on average, every few seconds. The problem currently is that both Sender and Receiver communicate with the RTI every 100ms even though nothing interesting is happening. This communication is captured nicely by @ChadliaJerad's (still rough) prototype trace visualizer as follows:
The RTI is on the left, the Sender in the middle, and the Receiver on the right. I've skipped the initial messages to focus more on the steady-state behavior. At logical time 0, the Receiver sends an LTC (Logical Tag Complete) message with tag 0 followed by a NET (Next Event Tag) with value 2s (the timeout time). It is telling the RTI that its event queue is empty, and that, absent network inputs, it has nothing to do until time 2s (shutdown time).
In this trace, the Sender next sends LTC (100ms), which causes the RTI to send TAG (Tag Advance Grant) (100ms) to the Receiver. This latter message, however, is unnecessary because the RTI knows that Receiver has nothing to do until 2s and that no message was sent by the Sender with tag of 100ms or earlier. First optimization: (easy) Eliminate this TAG message.
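As a rough illustration of the first optimization, the RTI-side check might look like the sketch below. The function and parameter names are made up for this post, tags are plain integers in milliseconds rather than (time, microstep) pairs, and `in_transit_tags` is an assumed bookkeeping list of message tags forwarded to the Receiver that no earlier TAG has covered.

```python
def tag_is_redundant(g, receiver_net, in_transit_tags):
    """Return True if the RTI can skip sending TAG(g) to the Receiver.

    g: tag the upstream federate just completed (its LTC).
    receiver_net: latest NET reported by the Receiver, or None if unknown.
    in_transit_tags: tags of messages sent toward the Receiver (assumed list).
    """
    # The Receiver has told us it has nothing to do until after g.
    receiver_idle_past_g = receiver_net is not None and receiver_net > g
    # No message with tag g or earlier is waiting to be processed.
    nothing_to_deliver = all(t > g for t in in_transit_tags)
    return receiver_idle_past_g and nothing_to_deliver


# The trace above: Receiver sent NET(2 s), Sender completes 100 ms, no messages.
print(tag_is_redundant(100, 2000, []))     # True
# If the Sender had sent a message with tag 100 ms, the TAG is needed.
print(tag_is_redundant(100, 2000, [100]))  # False
```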
Second optimization: (a bit harder) Eliminate the LTC messages from the Sender to the RTI as well. Idea: To do this, when the RTI receives NET(2s) from the Receiver, it should forward that message to all federates upstream of the Receiver. Those federates should maintain a barrier tag b that is the least such tag they have received (they will need a queue of such tags because the least will drop off when the federate sends an LTC matching or exceeding the tag).
Given such a barrier b, a federate is not required to send an LTC(g) for any g < b unless it sends an output message at g.
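A federate-side sketch of the barrier idea follows. The class and method names are hypothetical, tags are integers in milliseconds for brevity, and the behavior when no barrier has been received (fall back to always sending the LTC, as today) is an assumption rather than something settled here.

```python
import heapq


class BarrierQueue:
    """Min-queue of forwarded NET tags; the smallest one is the barrier b."""

    def __init__(self):
        self._heap = []

    def add(self, tag):
        # Called when the RTI forwards a downstream NET to this federate.
        heapq.heappush(self._heap, tag)

    def barrier(self):
        return self._heap[0] if self._heap else None

    def must_send_ltc(self, g, produced_output):
        b = self.barrier()
        if produced_output or b is None:
            return True   # outputs always count; no barrier means no savings
        return g >= b     # LTC(g) may be skipped only for g < b

    def on_ltc_sent(self, g):
        # Barriers at or below g have been matched or exceeded; drop them.
        while self._heap and self._heap[0] <= g:
            heapq.heappop(self._heap)


q = BarrierQueue()
q.add(2000)                           # forwarded NET(2 s) from the Receiver
print(q.must_send_ltc(100, False))    # False: quiet tag below the barrier
print(q.must_send_ltc(100, True))     # True: an output was sent at this tag
print(q.must_send_ltc(2000, False))   # True: the barrier has been reached
```

A heap is used because, as noted above, the least tag drops off once an LTC matches or exceeds it and the next-least tag becomes the barrier.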
There are many interesting variants of this example (local timer at Receiver, physical action at Receiver), but discussion of those will have to wait for followup postings here.