Refactoring the shutdown process to fix a payment count bug #235

bjohnson5 · 2025-03-21T17:15:16Z

Closes #222

carlaKC

Taken a high level look at this and it looks reasonable! Much smaller diff than I was expecting as well, which is nice.

I do (sadly) think we should bite the bullet and add some meta-tests that test all this shutdown business, so that we can square it away once and for all.

Things like:

- Spin up simulation with producers that finish their tasks, assert that we shut down
- Shutdown with track payment currently executing, and make sure that we exit

I imagine it's going to be a hell of a lot of mocking, but I think it'll really be worthwhile in the long term. Happy to collaborate on a POC test to figure out the best way to get some utils set up to share the burden a bit!

bjohnson5 · 2025-03-31T19:24:43Z

@carlaKC Absolutely agree that some tests should be added for this. Is there a test that requires mocking that would be good to look at as an example or would this be the first one of this kind?

carlaKC · 2025-03-31T19:35:00Z

> Is there a test that requires mocking that would be good to look at as an example or would this be the first one of this kind?

The mocks in lib.rs should give you a decent idea of the way we've been approaching mocking. These are just gnarly because we'll want to mock internal_run and then assert whatever set of events we're trying to test. It will be more complicated than the quite simple tests that we currently have, but nothing fundamentally different.

bjohnson5 · 2025-03-31T20:52:24Z

For reference here is what is done in this PR:

- Removed an unnecessary clone on the event_sender in internal_run so that the sender will be dropped and the receivers will close.
- Removed the tokio::select! statement and shutdown listener from consume_simulation_results because this is a channel receiver and will close when all of its corresponding senders are dropped.
- Removed the tokio::select! statement and shutdown listener from consume_events because this is a channel receiver and will close when all of its corresponding senders are dropped.
- Removed the tokio::select! statement and shutdown listener from produce_simulation_results because this is a channel receiver and will close when all of its corresponding senders are dropped.
- Refactored track_payment_result to use a timer to wait a given amount of time for current tracking to complete. This function is also now responsible for shutting down the node implementations that are tracking payments (node.track_payment) with a local trigger/listener.
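
Why the stray clone matters can be seen in a small sketch. It is only an illustration under assumptions: it uses std::sync::mpsc and threads instead of the tokio channels and tasks in simln, and receiver_outcome is a hypothetical helper, not code from this PR. A receiver only reports that the channel is closed once every sender, including forgotten clones, has been dropped.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Hypothetical helper: sends one event from a worker, then reports what the
/// receiver observes once the worker has finished and dropped its sender.
fn receiver_outcome(keep_extra_sender: bool) -> RecvTimeoutError {
    let (sender, receiver) = mpsc::channel::<u32>();

    // The worker gets its own clone and drops it when the thread ends.
    let worker_sender = sender.clone();
    let worker = thread::spawn(move || {
        worker_sender.send(1).expect("receiver is still alive");
    });
    worker.join().expect("worker thread panicked");

    // Either keep the original sender alive (the "unnecessary clone" case)
    // or drop it so that no senders remain.
    let held = if keep_extra_sender {
        Some(sender)
    } else {
        drop(sender);
        None
    };

    // The single queued event is still delivered in both cases.
    assert_eq!(receiver.recv().ok(), Some(1));

    // With a live sender the receiver just waits (Timeout); with none left
    // it learns that the channel is closed (Disconnected).
    let outcome = receiver
        .recv_timeout(Duration::from_millis(50))
        .expect_err("no further events were sent");
    drop(held);
    outcome
}
```

Calling receiver_outcome(true) models the old behavior, where the receiver never sees the channel close and an explicit shutdown listener is needed; receiver_outcome(false) models the behavior after the clone is removed.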

The new shutdown flow looks like this:

1. The shutdown is triggered by one of: (A) the producer tasks finish because the payment count has been met, (B) the total time has been reached, or (C) the user presses Ctrl-C.
2. The run_results_logger gets the shutdown signal and returns.
3. The produce_events function gets the shutdown signal (if it was not the one to trigger it) and returns. This drops its sender.
4. If simln is in the middle of tracking a payment, the track_payment_result function gets the shutdown signal and starts a timer.
5. The consume_events function quits its recv (because produce_events dropped its sender), and then drops the consume_events sender.
6. The produce_simulation_results function quits its recv (because consume_events dropped its sender), and then drops the produce_simulation_results sender.
7. The consume_simulation_results function quits its recv because produce_simulation_results dropped its sender.
8. If simln is in the middle of tracking a payment, either the timer expires and it shuts down or the payments finish getting tracked and it shuts down.
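
The channel cascade in this flow can be sketched with plain threads and std::sync::mpsc channels. This is a simplified model under stated assumptions, not the simln implementation: simln uses tokio tasks and channels, and the stage and run_pipeline names and the u32 payloads are hypothetical stand-ins for the real functions and event types. No shutdown listener is involved; each stage exits because the stage before it dropped its sender.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

/// Forwards items until `input` closes, then returns. Returning drops
/// `output`, which in turn closes the next stage's receiver.
fn stage(input: Receiver<u32>, output: Sender<u32>) {
    // Iterating a Receiver ends once all of its senders are dropped.
    for item in input {
        if output.send(item).is_err() {
            break; // the downstream receiver has already gone away
        }
    }
}

/// Hypothetical three-hop pipeline: producer -> stage -> stage -> collector.
fn run_pipeline(payment_count: u32) -> Vec<u32> {
    let (event_tx, event_rx) = channel::<u32>();
    let (payment_tx, payment_rx) = channel::<u32>();
    let (result_tx, result_rx) = channel::<u32>();

    // Stand-in for produce_events: stops once the payment count is met,
    // dropping `event_tx` when the closure returns.
    let producer = thread::spawn(move || {
        for i in 0..payment_count {
            event_tx.send(i).expect("first stage is still receiving");
        }
    });
    // Stand-ins for consume_events and produce_simulation_results.
    let consume_events = thread::spawn(move || stage(event_rx, payment_tx));
    let produce_results = thread::spawn(move || stage(payment_rx, result_tx));

    // Stand-in for consume_simulation_results: collects until the channel closes.
    let results: Vec<u32> = result_rx.into_iter().collect();

    for handle in vec![producer, consume_events, produce_results] {
        handle.join().expect("pipeline thread panicked");
    }
    results
}
```

Every item sent before the producer stops still reaches the final consumer, which mirrors the intent of the refactor: nothing is cut off early by a shutdown signal arriving before the last payment has been processed.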

carlaKC · 2025-04-07T13:02:01Z

Nice! Thanks for the thorough walkthrough.

> The shutdown is triggered by either A.) The producer tasks finish due to payment count being met, B.) The total time has been met, or C.) The user presses ctrlc

Interested that we don't hit any unexpected errors that make us close out. Is that because we'll keep running even if we fail to send a payment (eg, one of the nodes has shut down)?

bjohnson5 · 2025-04-07T14:50:32Z

Ok yes, good point. Another way the shutdown is triggered is D.) An error is thrown while setting up consumers and producers or while the producers and consumers are running.

bjohnson5 · 2025-04-08T21:38:12Z

@carlaKC I made a first attempt at adding some shutdown tests. Let me know what you think. There is probably some cleanup to do and a few other test cases we could add but this is a starting point.

carlaKC

Took a high level look at the tests and they look great, approach ACK

carlaKC · 2025-04-14T17:57:45Z

simln-lib/src/lib.rs

@@ -1691,4 +1693,224 @@ mod tests {

        assert!(result.is_ok());
    }
+
+    #[tokio::test]
+    async fn test_shutdown_timeout() {


Once we have #81, we'll be able to replace the clock with a test clock so that this can be much more precise.

I'm tempted to put this test off until then; I've had a lot of bad experiences with timing-based tests falling apart on the potato that GitHub runs CI on.

Yeah I wondered about that. Most of the shutdown tests have some kind of timing aspect... checking the runtime against expected runtime or sleeping for a few seconds before manually shutting down. Should we wait to do any of this until we figure out a better way to handle timing?

How about tokio advance?
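
For context, tokio's test-util feature provides tokio::time::pause and tokio::time::advance for controlling time in tests. The test-clock idea mentioned in this thread can also be pictured without tokio. The sketch below is purely illustrative: Clock, ManualClock and total_time_reached are hypothetical names and are not taken from simln or from #81. The point is that a test moves time forward explicitly instead of sleeping, so the outcome does not depend on how fast the CI machine is.

```rust
use std::cell::Cell;
use std::time::Duration;

/// Hypothetical clock abstraction: time elapsed since the simulation started.
trait Clock {
    fn elapsed(&self) -> Duration;
}

/// Test clock that only moves when the test tells it to.
struct ManualClock {
    now: Cell<Duration>,
}

impl ManualClock {
    fn new() -> Self {
        ManualClock {
            now: Cell::new(Duration::from_secs(0)),
        }
    }

    /// Moves the clock forward without any real waiting.
    fn advance(&self, by: Duration) {
        self.now.set(self.now.get() + by);
    }
}

impl Clock for ManualClock {
    fn elapsed(&self) -> Duration {
        self.now.get()
    }
}

/// Hypothetical check for the "total time has been met" shutdown condition.
fn total_time_reached(clock: &dyn Clock, total_time: Duration) -> bool {
    clock.elapsed() >= total_time
}
```

A test built on such a clock would advance it to just before and just past the configured total time and assert on the shutdown decision at each point, with no wall-clock sleeps involved.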

carlaKC · 2025-04-14T17:58:24Z

simln-lib/src/lib.rs

+        let mut mock_node_2 = MockLightningNode::new();
+
+        // Set up node 1 expectations
+        mock_node_1.expect_get_info().return_const(node_1.clone());


This is great! Would be nice if we could pull some of this out into more generic helpers so that we can compose tests like this more easily.

Edit: you did this in the next commit, awesome. Happy for you to squash that now.

Yeah, this PR has several merge commits in it where I updated from master. Would you prefer just squashing the whole PR into 1 commit?

Happy to have multiple commits of changes, but I would like to get rid of the merge commits in the PR with a rebase.

Rebased to remove the merge commits. Cleaner looking git history now.

carlaKC

Looking great!

Would be nice to add some coverage for the cases where we have multiple activities, specifically:

- One with a count, one without -> make sure the second one keeps running
- Two activities, one has a permanent error -> make sure both exit

Also think that we need some coverage for the track_payment/timer stuff if possible, even if it's not end to end (I suspect this will be tricky to mock)

carlaKC · 2025-04-22T18:47:45Z

simln-lib/src/lib.rs

-                                }
-                            }
-                        };
+        let output = output_receiver.recv().await;


Above, "In the multiple-producer case, a single producer shutting down does not drop all sending channels so the consumer will not exit and a trigger is required" won't apply anymore so can be deleted

Do you mean just remove that part of the comment?

Yeah, just remove the old comment that will no longer apply

carlaKC · 2025-04-22T18:48:50Z

simln-lib/src/lib.rs

+            let (stop, listen) = triggered::trigger();
+
+            // Timer for waiting after getting the shutdown signal in order for current tracking to complete
+            let mut timer: Option<tokio::time::Sleep> = None;


nit: move tokio::time::Sleep to top level imports

carlaKC · 2025-04-22T18:51:36Z

simln-lib/src/lib.rs

-                },
+
+            // Trigger and listener to stop the implementation specific track payment functions (node.track_payment())
+            let (stop, listen) = triggered::trigger();


nit: rename these to track_payment_trigger/track_payment_listener so this is a little more readable.

carlaKC · 2025-04-22T18:57:01Z

simln-lib/src/lib.rs

+                    Some(_) = conditional_sleeper(timer) => {
+                        log::error!("Track payment failed for {}. The shutdown timer expired.", hex::encode(hash.0));
+                        stop.trigger();
+                        timer = None;


If we set this to None and our select is biased to always select the top case, don't we hit a loop here?

1. listener.is_triggered() && timer.is_none() = T
2. timer = (sleep 3 seconds)
3. conditional_sleeper = T (after 3 seconds)
4. stop.trigger()
5. timer = None
6. listener.clone().is_triggered() && timer.is_none() = T
7. Loop repeats

I think we can remove biased here (we don't need it) and also stop setting timer = None. Dropping biased alone is not enough: the select would then pick randomly among the ready branches, so we could still end up looping a few times.

@f3r10 This is one issue that should be worked out with the track_payment_result function.
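
The re-arming problem described in this thread can be reproduced in a small synchronous model. It is a sketch under assumptions, not the tokio code from the PR: ShutdownTimer and count_timer_arms are hypothetical, each loop pass stands in for one select! wake-up, and the model simply stops once the timer has expired. Using Option<Sleep> means None covers both "not started" and "already fired", which is what lets the first branch match again.

```rust
/// Hypothetical three-state replacement for `Option<Sleep>`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ShutdownTimer {
    NotStarted,
    Running,
    Expired,
}

/// Counts how often the shutdown timer gets armed over `iterations` passes.
/// `reset_to_none` models the `timer = None` assignment after the timer fires.
fn count_timer_arms(reset_to_none: bool, iterations: usize) -> usize {
    // Once triggered, the shutdown listener stays triggered.
    let shutdown_triggered = true;
    let mut timer = ShutdownTimer::NotStarted;
    let mut arms = 0;

    for _ in 0..iterations {
        match timer {
            // Branch 1: shutdown seen and no timer yet, so arm the timer.
            ShutdownTimer::NotStarted if shutdown_triggered => {
                timer = ShutdownTimer::Running;
                arms += 1;
            }
            // Branch 2: the timer fires and payment tracking is stopped.
            ShutdownTimer::Running => {
                timer = if reset_to_none {
                    // Same state as "never started": branch 1 matches again.
                    ShutdownTimer::NotStarted
                } else {
                    ShutdownTimer::Expired
                };
            }
            // Terminal state: nothing left to arm.
            _ => break,
        }
    }
    arms
}
```

With reset_to_none set to true the timer is armed on every other pass for as long as the loop runs; with a distinct expired state it is armed exactly once.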

carlaKC · 2025-04-22T19:05:05Z

Also think that it would be nice to make a note about this in dev docs to explain a shorter version of what we've discussed in the issue (happy for that to be a followup or do it myself once this is in).

bjohnson5 · 2025-04-23T20:14:39Z

> Looking great!
>
> Would be nice to add some coverage for the cases where we have multiple activities, specifically:
>
> - One with a count, one without -> make sure the second one keeps running
> - Two activities, one has a permanent error -> make sure both exit
>
> Also think that we need some coverage for the track_payment/timer stuff if possible, even if it's not end to end (I suspect this will be tricky to mock)

Added two more tests for these cases. I will have to think through the timer test case some more, that will be tricky.

bjohnson5 · 2025-04-23T20:18:37Z

Summary of remaining tasks for this PR:

- Add explanation to the developer docs
- Add a test case for track_payment timer expiration shutdown
- Think about how to best handle time-based tests (maybe tokio::time::advance, maybe reference Feature: Simulation Time #81)
- Think about more shutdown test coverage that is needed

f3r10

Tested ACK

f3r10 · 2025-05-02T21:23:41Z

> Add a test case for track_payment timer expiration shutdown

I think that adding a sleep could do the trick for testing this part 🤔 :

        mock_node_1.expect_track_payment().returning(|_, _| {
            std::thread::sleep(tokio::time::Duration::from_millis(3000));
            Ok(crate::PaymentResult {
                htlc_count: 1,
                payment_outcome: crate::PaymentOutcome::Success,
            })
        });

bjohnson5 · 2025-05-05T20:45:29Z

Because the mocked function is not using the shutdown listener, I do not think simply adding a sleep will fully test this functionality. In fact, we probably need to do a little more in the mocking of track_payment to make all of these tests more realistic. Essentially the expect_track_payment().returning() needs to use the shutdown listener and have a tokio::select statement that waits for the sleep and shutdown listener. I am not exactly sure how to implement that yet though. If you have any suggestions that would be great!

In my head mock_node_1 would look like this:

        mock_node_1.expect_track_payment().returning(async |_, shutdown| {
            tokio::select! {
                biased;
                _ = shutdown => {
                    Err(LightningError::TrackPaymentError("Shutdown before tracking results".to_string()))
                },
                _ = tokio::time::sleep(tokio::time::Duration::from_millis(3000)) => { 
                    Ok(crate::PaymentResult {
                        htlc_count: 1,
                        payment_outcome: crate::PaymentOutcome::Success,
                    })
                }
            }
        });

But I don't think that is going to compile, and we would have to do some weird async stuff. Still, this would more accurately mock track_payment.

carlaKC · 2025-05-06T14:36:06Z

> In my head mock_node_1 would look like this:

If we can get this working, LGTM!

But otherwise:

We could also abandon ship on using mock here and spin up a specific impl of LightningNode that fits our requirements? It would be a bunch of boilerplate but I think that's an acceptable tradeoff vs starting with painful mocking x async?

carlaKC · 2025-06-09T18:05:34Z

Any updates here? Would be good to get this in before some other PRs give you a bunch of conflicts :')

bjohnson5 · 2025-06-11T19:52:12Z

Yes, sorry! There are still a few issues with this implementation and the tests that need to be worked out. I have been meaning to revisit it but just have not had the time. Hoping to get back to it next week, but if anyone has extra cycles and wants to attempt getting it across the finish line that would be great!

f3r10 · 2025-06-16T16:12:23Z

I could give it a try @bjohnson5 @carlaKC.
It would be something like this, right 🤔 ?

struct TestLightningNode;

#[async_trait]
impl LightningNode for TestLightningNode {
    async fn track_payment(
        &mut self,
        hash: &PaymentHash,
        shutdown: Listener,
    ) -> Result<PaymentResult, LightningError> {
        tokio::select! {
            biased;
            _ = shutdown => {
                Err(LightningError::TrackPaymentError("Shutdown before tracking results".to_string()))
            },
            _ = tokio::time::sleep(tokio::time::Duration::from_millis(3000)) => {
                Ok(crate::PaymentResult {
                    htlc_count: 1,
                    payment_outcome: crate::PaymentOutcome::Success,
                })
            }
        }
    }

    // ... the rest of the methods
}

bjohnson5 · 2025-06-17T14:04:59Z

@f3r10 Yes, we will need something like that for the test. The bigger issue is I am not 100% sure the track_payment_result timer logic is correct. The select statement inside the loop was not behaving exactly right with some of the tests. That is why I started realizing we need to be using an actual shutdown listener in the mocked node... so that we could test a shutdown and verify that the track_payment_result logic was correct. You might implement this test and then debug track_payment_result and see if you can find any issues. Thanks for the help!

f3r10 · 2025-06-20T18:23:21Z

@bjohnson5 I created a PR adding the test: bjohnson5#1
It appears to be working fine.
In the test, the first payment is tracked correctly. Then, a shutdown is triggered in a separate thread after 3 seconds, which results in starting the conditional_sleeper and returning a LightningError::TrackPaymentError for the second payment.

bjohnson5 · 2025-06-20T20:19:06Z

@f3r10 Awesome! Thanks for picking this up. That looks like a good solution to me. Now the challenge is to get this branch up to date with the latest from main. I took a quick look at the conflicts and it doesn't seem too bad. Just some additions of a clock and a few other items. @carlaKC may be able to help us resolve those conflicts in a way that doesn't break anything.

carlaKC · 2025-06-24T14:27:47Z

Yeah, conflicts don't look too major. Perhaps squash this down to one commit with all the changes and one with the tests and that'll make it a bit easier? I took a look at the rebase and the main difficulty was that there are changes that go back and forth, so that should help a bunch.

I made a half-hearted attempt at it here, but one of the tests is failing so I think I borked something. If you don't have time to do it lmk and I can give it a more serious try.

bjohnson5 · 2025-06-25T21:47:07Z

I took a quick look at your attempt and it seems correct to me. @f3r10 do you see anything obvious that would cause a test to fail?

carlaKC · 2025-06-26T17:49:51Z

I think that it may make sense to pause this until we've found some direction on deterministic events - seems like we can get rid of some of these consumers, which will simplify some of the shutdown stuff.

We'll still need a big chunk of this, but there might be a bit of an ugly rebase coming up. Let's save strength now, get that in and then push this over the finish line.

bjohnson5 requested a review from carlaKC March 21, 2025 17:41

bjohnson5 marked this pull request as ready for review March 21, 2025 17:41

bjohnson5 mentioned this pull request Mar 21, 2025

Bug: count in defined activity produces count-1 events (one less than expected) #222

Open

carlaKC reviewed Mar 28, 2025

bjohnson5 requested a review from carlaKC April 8, 2025 21:38

carlaKC reviewed Apr 14, 2025

carlaKC reviewed Apr 22, 2025

bjohnson5 force-pushed the 222-fixing-payment-count branch from f55644b to e031ddd Compare April 23, 2025 16:40

f3r10 reviewed May 2, 2025

bjohnson5 added 8 commits May 5, 2025 13:57

Refactoring the shutdown process to fix a payment count bug

8d5df47

Removing unnecessary clone so that the event sender will be dropped

58cd75b

Adding two tests to verify the shutdown process

da2c0fb

Cleaning up the shutdown tests so that they share setup code

e59417b

Adding a shutdown test for the manual shutdown case

f38f2e1

Adding a test for the error shutdown case

56badbd

Cleaning up comments, variable names, and imports

adab935

Adding shutdown test cases for multiple activities

cf68cb0

bjohnson5 force-pushed the 222-fixing-payment-count branch from db96a89 to cf68cb0 Compare May 5, 2025 18:57

Adding a test for the track_payment shutdown timer

3e6cdc9

bjohnson5 requested a review from carlaKC June 20, 2025 20:19

carlaKC removed their request for review June 30, 2025 13:25
