Refactoring the shutdown process to fix a payment count bug #235

Open: bjohnson5 wants to merge 8 commits into main

bjohnson5
Collaborator

Closes #222

Contributor

@carlaKC carlaKC left a comment

Taken a high level look at this and it looks reasonable! Much smaller diff than I was expecting as well, which is nice.

I do (sadly) think we should bite the bullet and add some meta-tests that test all this shutdown business, so that we can square it away once and for all.

Things like:

  • Spin up simulation with producers that finish their tasks, assert that we shut down
  • Shutdown with track payment currently executing, and make sure that we exit

I imagine it's going to be a hell of a lot of mocking, but I think it'll really be worthwhile in the long term. Happy to collaborate on a POC test to figure out the best way to get some utils set up to share the burden a bit!

@bjohnson5
Collaborator Author

@carlaKC Absolutely agree that some tests should be added for this. Is there a test that requires mocking that would be good to look at as an example or would this be the first one of this kind?

@carlaKC
Contributor

carlaKC commented Mar 31, 2025

Is there a test that requires mocking that would be good to look at as an example or would this be the first one of this kind?

The mocks in lib.rs should give you a decent idea of the way we've been approaching mocking. These are just gnarly because we'll want to mock internal_run and then assert whatever set of events we're trying to test. It will be more complicated than the quite simple tests that we currently have, but nothing fundamentally different.

@bjohnson5
Collaborator Author

For reference here is what is done in this PR:

  • Removed an unnecessary clone on the event_sender in internal_run so that the sender will be dropped and the receivers will close.

  • Removed the tokio::select! statement and shutdown listener from consume_simulation_results because it reads from a channel receiver, which closes when all of its corresponding senders are dropped.

  • Removed the tokio::select! statement and shutdown listener from consume_events because it reads from a channel receiver, which closes when all of its corresponding senders are dropped.

  • Removed the tokio::select! statement and shutdown listener from produce_simulation_results because it reads from a channel receiver, which closes when all of its corresponding senders are dropped.

  • Refactored track_payment_result to use a timer to wait a given amount of time for current tracking to complete. This function is also now responsible for shutting down the node implementations that are tracking payments (node.track_payment) with a local trigger/listener.
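
A minimal sketch of why the leftover clone in the first bullet matters. It uses std::sync::mpsc as a stand-in for the tokio channels in simln (an assumption, made so the example has no dependencies), and the function name is hypothetical. A receiver only reports that it is closed once every sender, including clones, has been dropped:

```rust
use std::sync::mpsc::{channel, TryRecvError};

/// Creates a channel, drops the "working" sender, and reports what the
/// receiver observes. `keep_extra_clone` models a leftover sender clone
/// that outlives the task that was supposed to close the channel.
fn receiver_state_after_drop(keep_extra_clone: bool) -> TryRecvError {
    let (sender, receiver) = channel::<u32>();

    // The extra clone (if any) stays alive until the end of this function.
    let _extra = if keep_extra_clone {
        Some(sender.clone())
    } else {
        None
    };

    // The task that owned `sender` has finished.
    drop(sender);

    // No message was ever sent, so try_recv always returns an error:
    // Empty while a sender is still alive, Disconnected once all are gone.
    receiver.try_recv().unwrap_err()
}
```

With the clone kept, the receiver sees Empty and a blocking recv would wait forever; without it, the receiver sees Disconnected and the consumer can return on its own.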

The new shutdown flow looks like this:

  1. The shutdown is triggered by either A.) The producer tasks finish due to payment count being met, B.) The total time has been met, or C.) The user presses ctrlc
  2. The run_results_logger gets the shutdown signal and returns.
  3. The produce_events function gets the shutdown signal (if it was not the one to trigger it) and returns. This drops its sender.
  4. If simln is in the middle of tracking a payment, the track_payment_result function gets the shutdown signal and starts a timer.
  5. The consume_events function quits its recv (because produce_events dropped its sender), and then drops the consume_events sender.
  6. The produce_simulation_results function quits its recv (because consume_events dropped its sender), and then drops the produce_simulation_results sender.
  7. The consume_simulation_results function quits its recv because produce_simulation_results dropped its sender.
  8. If simln is in the middle of tracking a payment, either the timer expires and it shuts down or the payments finish getting tracked and it shuts down.
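
The cascade in steps 3 and 5 to 7 can be sketched as below. This is an illustration only: std threads and std::sync::mpsc stand in for tokio tasks and channels, and forward and run_pipeline are hypothetical names, not functions from this PR.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

/// One pipeline stage: forwards items until its input closes, then returns.
/// Returning drops `output`, which in turn closes the next stage's input.
fn forward(input: Receiver<u32>, output: Sender<u32>) {
    // `recv` returns Err once every sender for `input` has been dropped.
    while let Ok(item) = input.recv() {
        if output.send(item).is_err() {
            break; // downstream is gone, nothing left to do
        }
    }
}

/// Runs producer -> stage -> stage -> final consumer with no shutdown
/// signal at all, and returns everything the final consumer received.
fn run_pipeline(payment_count: u32) -> Vec<u32> {
    let (events_tx, events_rx) = channel();
    let (results_tx, results_rx) = channel();
    let (output_tx, output_rx) = channel();

    // Producer: sends `payment_count` events, then drops its sender.
    let producer = thread::spawn(move || {
        for payment in 0..payment_count {
            if events_tx.send(payment).is_err() {
                break;
            }
        }
    });
    // Roughly analogous to consume_events and produce_simulation_results.
    let first_stage = thread::spawn(move || forward(events_rx, results_tx));
    let second_stage = thread::spawn(move || forward(results_rx, output_tx));

    // Final consumer: iterates until the last sender upstream is dropped.
    let seen: Vec<u32> = output_rx.iter().collect();

    producer.join().expect("producer panicked");
    first_stage.join().expect("first stage panicked");
    second_stage.join().expect("second stage panicked");
    seen
}
```

Every stage exits purely because the stage before it dropped its sender, which is why the per-stage shutdown listeners become unnecessary.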

@carlaKC
Contributor

carlaKC commented Apr 7, 2025

Nice! Thanks for the thorough walkthrough.

The shutdown is triggered by either A.) The producer tasks finish due to payment count being met, B.) The total time has been met, or C.) The user presses ctrlc

Interested that we don't hit any unexpected errors that make us close out. Is that because we'll keep running even if we fail to send a payment (eg, one of the nodes has shut down)?

@bjohnson5
Collaborator Author

Ok yes, good point. Another way the shutdown is triggered is D.) An error is thrown while setting up consumers and producers or while the producers and consumers are running.

@bjohnson5
Collaborator Author

@carlaKC I made a first attempt at adding some shutdown tests. Let me know what you think. There is probably some cleanup to do and a few other test cases we could add but this is a starting point.

@bjohnson5 bjohnson5 requested a review from carlaKC April 8, 2025 21:38
Contributor

@carlaKC carlaKC left a comment

Took a high level look at the tests and they look great, approach ACK

@@ -1691,4 +1693,224 @@ mod tests {

assert!(result.is_ok());
}

#[tokio::test]
async fn test_shutdown_timeout() {
Contributor

Once we have #81, we'll be able to replace the clock with a test clock so that this can be much more precise.

I'm tempted to put this test off until then; I've had a lot of bad experiences with timing-based tests falling apart on the potato that GitHub runs CI on.

Collaborator Author

Yeah I wondered about that. Most of the shutdown tests have some kind of timing aspect... checking the runtime against expected runtime or sleeping for a few seconds before manually shutting down. Should we wait to do any of this until we figure out a better way to handle timing?

Contributor

How about tokio::time::advance?
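
One possible shape for the test clock mentioned in this thread is sketched here. The Clock trait, ManualClock, and grace_period_expired are hypothetical names for illustration and are not taken from #81 or from this PR. tokio also provides tokio::time::pause and tokio::time::advance (behind its test-util feature) as a way to control time in tests.

```rust
use std::cell::Cell;
use std::time::Duration;

/// Hypothetical clock abstraction; the design that #81 lands on may differ.
trait Clock {
    /// Time elapsed since the clock was created.
    fn elapsed(&self) -> Duration;
}

/// Test clock that only moves when the test tells it to.
struct ManualClock {
    now: Cell<Duration>,
}

impl ManualClock {
    fn new() -> Self {
        ManualClock { now: Cell::new(Duration::ZERO) }
    }

    /// Moves the clock forward without sleeping.
    fn advance(&self, by: Duration) {
        self.now.set(self.now.get() + by);
    }
}

impl Clock for ManualClock {
    fn elapsed(&self) -> Duration {
        self.now.get()
    }
}

/// Example of timing logic written against the trait: has the shutdown
/// grace period that started at `started_at` run out yet?
fn grace_period_expired(clock: &dyn Clock, started_at: Duration, grace: Duration) -> bool {
    clock.elapsed() >= started_at + grace
}
```

A test can then call advance instead of sleeping, so the result does not depend on how fast the CI machine is.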

let mut mock_node_2 = MockLightningNode::new();

// Set up node 1 expectations
mock_node_1.expect_get_info().return_const(node_1.clone());
Contributor

This is great! Would be nice if we could pull some of this out into more generic helpers so that we can compose tests like this more easily.

Edit: you did this in the next commit, awesome. Happy for you to squash that now.

Collaborator Author

Yeah, this PR has several merge commits in it where I updated from master. Would you prefer just squashing the whole PR into 1 commit?

Contributor

Happy to have multiple commits of changes, but I would like to get rid of the merge commits in the PR with a rebase.

Collaborator Author

Rebased to remove the merge commits. Cleaner looking git history now.

Contributor

@carlaKC carlaKC left a comment

Looking great!

Would be nice to add some coverage for the cases where we have multiple activities, specifically:

  1. One with a count, one without -> make sure the second one keeps running
  2. Two activities, one has a permanent error -> make sure both exit

Also think that we need some coverage for the track_payment/timer stuff if possible, even if it's not end to end (I suspect this will be tricky to mock)

}
}
};
let output = output_receiver.recv().await;
Contributor

Above, "In the multiple-producer case, a single producer shutting down does not drop all sending channels so the consumer will not exit and a trigger is required" won't apply anymore so can be deleted

Collaborator Author

Do you mean just remove that part of the comment?

Contributor

Yeah, just remove the old comment that will no longer apply

let (stop, listen) = triggered::trigger();

// Timer for waiting after getting the shutdown signal in order for current tracking to complete
let mut timer: Option<tokio::time::Sleep> = None;
Contributor

nit: move tokio::time::Sleep to top level imports

},

// Trigger and listener to stop the implementation specific track payment functions (node.track_payment())
let (stop, listen) = triggered::trigger();
Contributor

nit: rename these to track_payment_trigger/track_payment_listener so this is a little more readable.

Some(_) = conditional_sleeper(timer) => {
log::error!("Track payment failed for {}. The shutdown timer expired.", hex::encode(hash.0));
stop.trigger();
timer = None;
Contributor

If we set this to None and our select is biased to always select the top case, don't we hit a loop here?

  • listener.is_triggered() && timer.is_none() = T
    • timer = (sleep 3 seconds)
  • conditional_sleeper = T (after 3 seconds)
    • stop.trigger()
    • timer = None
  • listener.clone().is_triggered() && timer.is_none() = T
    • Loop repeats

I think that we can remove biased here (we don't need it) and not set timer = None. Even without biased, the select will randomly choose a branch that's ready, so we could end up looping a few times.
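
The loop described above can be reproduced with a small model. This is a simplified stand-in written for illustration, not the PR's code: the TimerState enum and count_timer_arms are hypothetical, timer expiry is treated as instantaneous, and branches are checked top to bottom to mimic a biased select.

```rust
/// Models `timer: Option<Sleep>` plus whether the sleep has already fired.
#[derive(PartialEq)]
enum TimerState {
    NotStarted, // timer is None
    Running,    // timer is Some and has not fired yet
    Fired,      // timer fired and was left in place
}

/// Counts how often the shutdown timer gets armed within `max_iterations`
/// loop iterations. `reset_timer_on_expiry` models setting `timer = None`
/// in the expiry branch.
fn count_timer_arms(reset_timer_on_expiry: bool, max_iterations: usize) -> usize {
    let shutdown_triggered = true; // the shutdown listener has already fired
    let mut stopped = false;       // whether stop.trigger() has been called
    let mut timer = TimerState::NotStarted;
    let mut arms = 0;

    for _ in 0..max_iterations {
        if shutdown_triggered && timer == TimerState::NotStarted {
            // Top branch: shutdown seen and no timer yet, so start one.
            timer = TimerState::Running;
            arms += 1;
        } else if timer == TimerState::Running {
            // Expiry branch: give up on tracking and signal the node.
            stopped = true;
            timer = if reset_timer_on_expiry {
                TimerState::NotStarted // makes the top branch ready again
            } else {
                TimerState::Fired
            };
        } else if stopped {
            // Tracking branch: the node saw the stop signal and returned.
            break;
        }
    }
    arms
}
```

In this model, resetting the timer re-arms it on every other iteration and the tracking branch is never reached, while leaving it in place arms it exactly once before the loop exits.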

@carlaKC
Contributor

carlaKC commented Apr 22, 2025

Also think that it would be nice to make a note about this in dev docs to explain a shorter version of what we've discussed in the issue (happy for that to be a followup or do it myself once this is in).

@bjohnson5 bjohnson5 force-pushed the 222-fixing-payment-count branch from f55644b to e031ddd on April 23, 2025 16:40
@bjohnson5
Collaborator Author

Looking great!

Would be nice to add some coverage for the cases where we have multiple activities, specifically:

  1. One with a count, one without -> make sure the second one keeps running
  2. Two activities, one has a permanent error -> make sure both exit

Also think that we need some coverage for the track_payment/timer stuff if possible, even if it's not end to end (I suspect this will be tricky to mock)

Added two more tests for these cases. I will have to think through the timer test case some more; that will be tricky.

@bjohnson5
Collaborator Author

Summary of remaining tasks for this PR:

  • Add explanation to the developer docs
  • Add a test case for track_payment timer expiration shutdown
  • Think about how best to handle time-based tests (maybe tokio::time::advance, maybe reference #81, Feature: Simulation Time)
  • Think about more shutdown test coverage that is needed

Development

Successfully merging this pull request may close these issues.

Bug: count in defined activity produces count-1 events (one less than expected)
2 participants