test: sync injected qsigshare in llmq signing test #7304

Closed
thepastaclaw wants to merge 1 commit into dashpay:develop from thepastaclaw:fix-7255-llmq-signing-p2p-sync

Conversation

@thepastaclaw thepastaclaw commented May 4, 2026

Summary

  • Flush the direct P2P qsigshare injection in
    feature_llmq_signing.py --spork21 with send_and_ping().
  • Build and inject the full threshold set of RPC-created signature shares.
    This avoids depending on the earlier async submit=true shares being relayed
    before the recovered-signature timeout.
  • Keep the existing recovered-signature assertions intact.

Fixes #7255.
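
The difference between a fire-and-forget send and a send followed by a ping round trip can be sketched with a toy model. This is only an illustration, not the test framework's implementation: `ToyPeer` and its internals are hypothetical, and the sketch assumes the peer handles messages from one connection strictly in order, so a pong implies every earlier message was already handled.

```python
import queue
import threading


class ToyPeer:
    """Hypothetical stand-in for a node that handles messages in order on one thread."""

    def __init__(self):
        self.inbox = queue.Queue()
        self.pongs = queue.Queue()
        self.processed = []
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            kind, payload = self.inbox.get()
            if kind == "ping":
                # The pong is produced only after all earlier messages were handled.
                self.pongs.put(payload)
            else:
                self.processed.append(payload)

    def send_message(self, payload):
        # Fire and forget: returns before the peer has handled the payload.
        self.inbox.put(("msg", payload))

    def send_and_ping(self, payload, timeout=5):
        # Send the payload, then block until the matching pong comes back.
        self.send_message(payload)
        self.inbox.put(("ping", "nonce"))
        assert self.pongs.get(timeout=timeout) == "nonce"


peer = ToyPeer()
peer.send_and_ping("qsigshare")
# Once send_and_ping returns, the peer has handled the payload.
assert peer.processed == ["qsigshare"]
```

In this model the round trip only shows that the message was handled at the message layer. Any work the peer does asynchronously afterwards can still be pending when the call returns.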

Validation

  • Passed locally on macOS arm64:

    python3 -m py_compile test/functional/feature_llmq_signing.py
    git diff --check
    python3 test/functional/feature_llmq_signing.py --spork21 \
      --configfile=/Users/claw/Projects/dash/test/config.ini
  • Repeated-run stress on macOS arm64 using the local build and this worktree's
    test source:

    python3 test/functional/test_runner.py -j10 --timeout-factor=1 \
      "feature_llmq_signing.py --spork21" \
      "feature_llmq_signing.py --spork21" \
      "feature_llmq_signing.py --spork21" \
      "feature_llmq_signing.py --spork21" \
      "feature_llmq_signing.py --spork21" \
      "feature_llmq_signing.py --spork21" \
      "feature_llmq_signing.py --spork21" \
      "feature_llmq_signing.py --spork21" \
      "feature_llmq_signing.py --spork21" \
      "feature_llmq_signing.py --spork21"

    Result: 8/10 passed under local -j10 resource contention. This reduces the
    observed local failure rate from the previously recorded 6/10 baseline on
    develop to 2/10.

github-actions Bot commented May 4, 2026

✅ No Merge Conflicts Detected

This PR currently has no conflicts with other open PRs.

Author

thepastaclaw commented May 4, 2026

✅ Review complete (commit d21e0a7)

@thepastaclaw
Author

@coderabbitai review

coderabbitai Bot commented May 4, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai Bot commented May 4, 2026

Walkthrough

The functional test for LLMQ signing was modified to change how a QSIGSHARE is delivered to the recovery member: the test now uses P2PInterface.send_and_ping(...) in place of P2PInterface.send_message(...), combining the send with a ping round trip so that the peer has processed the msg_qsigshare before the test continues.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name | Status | Explanation
Title check | ✅ Passed | The title accurately summarizes the main change: synchronizing an injected qsigshare message in the llmq signing test using a send-and-ping operation.
Linked Issues check | ✅ Passed | The code changes directly address the root cause identified in issue #7255 by flushing the injected qsigshare with sync_with_ping() to eliminate the async race condition causing intermittent timeouts.
Out of Scope Changes check | ✅ Passed | The single line change (replacing send_message() with send_and_ping()) is directly scoped to fixing the flakiness issue in feature_llmq_signing.py --spork21 with no extraneous modifications.
Description check | ✅ Passed | The pull request description is directly related to the changeset. It clearly describes the modification: changing from send_message() to send_and_ping() in the qsigshare injection, references the specific issue (#7255), and provides validation results from local testing.

Author

@thepastaclaw thepastaclaw left a comment

Code Review

Single-line test fix adding sync_with_ping() after a direct qsigshare injection in feature_llmq_signing.py to eliminate a documented race against wait_for_sigs. Both reviewers converged that the change is minimal and correct. Only a stylistic nitpick remains.

Reviewed commit: 7c4f221

💬 1 nitpick(s)

Comment thread test/functional/feature_llmq_signing.py Outdated
Collaborator

knst commented May 4, 2026

#7301 is fixing #7255

@thepastaclaw
Author

Thanks — I checked #7301. It adds qsendrecsigs message support plus InstantSend coverage, but it doesn't touch feature_llmq_signing.py or the direct qsigshare send path that timed out in #7255. This draft is narrower: it just flushes that injected qsigshare before wait_for_sigs(), so I think the two changes are complementary rather than duplicates.

@thepastaclaw thepastaclaw force-pushed the fix-7255-llmq-signing-p2p-sync branch from 7c4f221 to f670321 Compare May 5, 2026 23:41
@thepastaclaw
Author

@coderabbitai review

coderabbitai Bot commented May 5, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
test/functional/feature_llmq_signing.py (1)

59-60: ⚡ Quick win

Consider scaling the wait_for_sigs timeout by timeout_factor to fully eliminate CI-load flakiness.

send_and_ping guarantees the recovery member has received the share at the P2P layer, but the actual BLS share aggregation into a recovered signature happens asynchronously in a background thread. Under heavy CI multi-process load the 15-second ceiling at line 111 can still expire before recovery completes — exactly the scenario described in issue #7255.

The standard mitigation (also recommended in the original issue) is to multiply by self.options.timeout_factor:

♻️ Suggested change
     def wait_for_sigs(hasrecsigs, isconflicting1, isconflicting2, timeout):
-        self.wait_until(lambda: check_sigs(hasrecsigs, isconflicting1, isconflicting2), timeout = timeout)
+        self.wait_until(lambda: check_sigs(hasrecsigs, isconflicting1, isconflicting2), timeout = timeout * self.options.timeout_factor)

This covers every wait_for_sigs call site (lines 66, 111, 167, 171, 177, 201) in one change. The assert_sigs_nochange calls also pass a raw timeout to self.wait_until, but since those intentionally assert no change they should arguably stay fixed to avoid slow-running no-op waits.

Also applies to: 111-111

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/functional/feature_llmq_signing.py` around lines 59 - 60, The
wait_for_sigs helper currently calls self.wait_until(..., timeout=timeout) with
a raw timeout which can flake under CI; update the wait_for_sigs definition to
scale the passed timeout by self.options.timeout_factor (i.e. call
self.wait_until(..., timeout=timeout * self.options.timeout_factor)) so all
wait_for_sigs call sites (the ones at lines referenced in the review)
automatically get the scaled timeout; do not change the assert_sigs_nochange raw
timeout behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: c0d4e4b5-b0ad-4e2f-a290-12cb6c5b0eca

📥 Commits

Reviewing files that changed from the base of the PR and between 7c4f221 and f670321.

📒 Files selected for processing (1)
  • test/functional/feature_llmq_signing.py

@thepastaclaw thepastaclaw force-pushed the fix-7255-llmq-signing-p2p-sync branch from f670321 to 1dda821 Compare May 6, 2026 00:48
Author

@thepastaclaw thepastaclaw left a comment

Code Review

Single-line functional-test synchronization fix at the exact reviewed commit f670321. Replacing send_message() with send_and_ping() for the injected qsigshare forces the recovery member to process that direct P2P message before the recovered-signature wait continues, without changing production code or the test's assertions. I found no correctness, security, or Dash-specific interaction issues in the reviewed change.

Reviewed commit: f670321

Author

@thepastaclaw thepastaclaw left a comment

Code Review

The current head is a small, mechanically correct test-stability change for feature_llmq_signing.py --spork21: it converts RPC sig-share results into P2P QSIGSHARE payloads and flushes the direct send with send_and_ping(). The new full-threshold direct injection removes the recovered-signature assertion's previous dependence on the async submit=true relay path, which is a reasonable flake fix but leaves that positive relay behavior without coverage in this test.

Reviewed commit: 1dda821

🟡 1 suggestion(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `test/functional/feature_llmq_signing.py`:
- [SUGGESTION] lines 99-115: Direct injection no longer positively covers the submit=true relay path
  This block now rebuilds shares for members 0, 1, and 2 with `submit=false` and injects all of them directly into the selected recovery member. In this test `set_dash_llmq_test_params(5, 3)` overrides `LLMQ_TEST` to a 3-of-5 quorum, so those three injected shares are sufficient by themselves; the recovered-signature assertion at line 120 can pass even if the earlier default `submit=true` calls at lines 79 and 87 stop relaying shares under SPORK_21. Avoiding that async dependency is the right way to deflake this specific check, but it also means the test no longer has a positive assertion that submit=true shares are relayed and consumed. If that behavior is meant to stay covered here, add a separate deterministic assertion for the submit=true relay path rather than relying on this recovery step.

Comment on lines +99 to +115
         # 3. Sending enough sig shares received from RPC to the recovery member through P2P interface, should
         # result in a recovered sig. Build all threshold shares explicitly so this test does not depend on the
         # asynchronous submit=true shares above being relayed before the timeout expires.
         sig_shares = [
             rpc_sig_share_to_p2p(self.mninfo[i].get_node(self).quorum("sign", q_type, id, msgHash, quorumHash, False))
             for i in range(2)
         ]
         sig_shares.append(rpc_sig_share_to_p2p(sig_share_rpc_1))
         for mn in self.mninfo: # type: MasternodeInfo
             assert mn.get_node(self).getconnectioncount() == self.llmq_size
         # Get the current recovery member of the quorum
         q = self.nodes[0].quorum('selectquorum', q_type, id)
         mn: MasternodeInfo = self.get_mninfo(q['recoveryMembers'][0])
         # Open a P2P connection to it
         p2p_interface = mn.get_node(self).add_p2p_connection(P2PInterface())
-        # Send the last required QSIGSHARE message to the recovery member
-        p2p_interface.send_message(msg_qsigshare([sig_share]))
+        # Send the required QSIGSHARE messages to the recovery member
+        p2p_interface.send_and_ping(msg_qsigshare(sig_shares))
Author

🟡 Suggestion: Direct injection no longer positively covers the submit=true relay path

This block now rebuilds shares for members 0, 1, and 2 with submit=false and injects all of them directly into the selected recovery member. In this test set_dash_llmq_test_params(5, 3) overrides LLMQ_TEST to a 3-of-5 quorum, so those three injected shares are sufficient by themselves; the recovered-signature assertion at line 120 can pass even if the earlier default submit=true calls at lines 79 and 87 stop relaying shares under SPORK_21. Avoiding that async dependency is the right way to deflake this specific check, but it also means the test no longer has a positive assertion that submit=true shares are relayed and consumed. If that behavior is meant to stay covered here, add a separate deterministic assertion for the submit=true relay path rather than relying on this recovery step.

source: ['codex']
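
The coverage concern can be illustrated with a small counting model. This is a sketch under stated assumptions: `can_recover` is a hypothetical helper that reduces recovery to "at least threshold distinct member shares reached the recovery member", using the 3-of-5 parameters mentioned above, and it ignores all cryptographic detail.

```python
QUORUM_SIZE = 5  # from set_dash_llmq_test_params(5, 3) as described above
THRESHOLD = 3


def can_recover(share_members, threshold=THRESHOLD):
    """Hypothetical model: recovery needs shares from `threshold` distinct members."""
    return len(set(share_members)) >= threshold


# Previous flow: one share injected directly, the rest expected via submit=true relay.
relayed_ok = {0, 1}
relay_broken = set()
old_with_relay = can_recover({2} | relayed_ok)
old_without_relay = can_recover({2} | relay_broken)

# New flow: the full threshold set is injected directly.
new_without_relay = can_recover({0, 1, 2} | relay_broken)

assert old_with_relay
assert not old_without_relay  # the old check fails if relay stops working
assert new_without_relay      # the new check passes even if relay stops working
```

In this model the old recovery check implicitly exercised the relay path, while the new one cannot detect a broken relay, which is why a separate relay assertion is suggested.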

@thepastaclaw
Author

Checked the CodeRabbit timeout suggestion. No code change needed here: BitcoinTestFramework.wait_until() already passes self.options.timeout_factor through to wait_until_helper(), which multiplies the timeout internally, so changing wait_for_sigs() to multiply by self.options.timeout_factor would double-scale these waits. The fixed raw timeouts in assert_sigs_nochange() also already go through the same framework scaling.
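
The double-scaling effect can be shown with a toy model of the call chain. This is a sketch, not the framework code: `ToyFramework` and this `wait_until_helper` are hypothetical stand-ins that return the effective deadline in seconds instead of polling, assuming only what the comment above states (the helper multiplies the timeout by the factor it is passed).

```python
def wait_until_helper(predicate, *, timeout, timeout_factor=1.0):
    """Hypothetical model: return the effective timeout instead of polling."""
    return timeout * timeout_factor


class ToyFramework:
    def __init__(self, timeout_factor):
        self.timeout_factor = timeout_factor

    def wait_until(self, predicate, timeout):
        # The framework-level wrapper already forwards the factor to the helper.
        return wait_until_helper(predicate, timeout=timeout, timeout_factor=self.timeout_factor)


fw = ToyFramework(timeout_factor=4)

# Call site passes the raw timeout: scaled once.
scaled_once = fw.wait_until(lambda: True, timeout=15)

# Call site also multiplies by the factor: scaled twice.
scaled_twice = fw.wait_until(lambda: True, timeout=15 * fw.timeout_factor)

assert scaled_once == 60
assert scaled_twice == 240
```

With a factor of 4, the suggested call-site multiplication would turn a 15-second wait into a 240-second ceiling rather than the intended 60.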

@thepastaclaw thepastaclaw force-pushed the fix-7255-llmq-signing-p2p-sync branch from 1dda821 to d21e0a7 Compare May 6, 2026 04:38
Author

@thepastaclaw thepastaclaw left a comment

Code Review

Test-only change to feature_llmq_signing.py that hardens the spork21 LLMQ signing path. The split into a dedicated submit_true subtest plus explicit threshold-share construction with send_and_ping addresses a real flake risk. One minor coverage observation: the new submit_true subtest only asserts hasrecsig on the recovery member, not on all quorum members.

Reviewed commit: d21e0a7

💬 1 nitpick(s)

Comment on lines +100 to +106
        # Sign a distinct request through the default submit=true path to assert sig share relay and recovery.
        q_submit_true = self.nodes[0].quorum('selectquorum', q_type, submit_true_id)
        submit_true_recovery_member = self.get_mninfo(q_submit_true['recoveryMembers'][0])
        submit_true_signers = [mn for mn in self.mninfo if mn != submit_true_recovery_member][:3]
        for mn in submit_true_signers:
            mn.get_node(self).quorum("sign", q_type, submit_true_id, submit_true_msgHash)
        wait_for_recsig(submit_true_recovery_member, submit_true_id, submit_true_msgHash, 15)
Author

💬 Nitpick: submit=true subtest only asserts recovery on the recovery member, not all quorum members

The new submit_true_id/submit_true_msgHash flow waits only for hasrecsig on submit_true_recovery_member. Prior to this PR, the implicit submit=true assertion was the all-nodes wait_for_sigs(True, False, True, 15) at line 136. After the split, that final wait still covers all-nodes recsig propagation, but only for the P2P-injected id/msgHash path. The recsig announcement code path is the same regardless of how recovery was triggered, so this is a minor coverage delta rather than a real regression risk — but extending the assertion to all masternodes is cheap and preserves the original guarantee. Note the suggested fix in the suggestion field; you'd want to re-derive submit_true_signers similarly when applying it.

💡 Suggested change

-        # Sign a distinct request through the default submit=true path to assert sig share relay and recovery.
-        q_submit_true = self.nodes[0].quorum('selectquorum', q_type, submit_true_id)
-        submit_true_recovery_member = self.get_mninfo(q_submit_true['recoveryMembers'][0])
-        submit_true_signers = [mn for mn in self.mninfo if mn != submit_true_recovery_member][:3]
-        for mn in submit_true_signers:
-            mn.get_node(self).quorum("sign", q_type, submit_true_id, submit_true_msgHash)
-        wait_for_recsig(submit_true_recovery_member, submit_true_id, submit_true_msgHash, 15)
+        # Sign a distinct request through the default submit=true path to assert sig share relay and recovery
+        # propagates to every quorum member.
+        q_submit_true = self.nodes[0].quorum('selectquorum', q_type, submit_true_id)
+        submit_true_recovery_member = self.get_mninfo(q_submit_true['recoveryMembers'][0])
+        submit_true_signers = [mn for mn in self.mninfo if mn != submit_true_recovery_member][:3]
+        for mn in submit_true_signers:
+            mn.get_node(self).quorum("sign", q_type, submit_true_id, submit_true_msgHash)
+        self.wait_until(
+            lambda: all(
+                mn.get_node(self).quorum("hasrecsig", q_type, submit_true_id, submit_true_msgHash)
+                for mn in self.mninfo
+            ),
+            timeout=15,
+        )

source: ['codex']
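
The gap between the two assertions can be sketched with a toy polling loop. Everything here is hypothetical and for illustration only: `ToyMember` models a node's recovered-signature store as a set, and this `wait_until` returns a boolean on timeout, whereas the real framework helper fails the test instead.

```python
import time


def wait_until(predicate, timeout, poll_interval=0.01):
    """Poll predicate until it is true or `timeout` seconds pass; return the outcome."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll_interval)
    return predicate()


class ToyMember:
    """Hypothetical quorum member that remembers recovered signatures."""

    def __init__(self):
        self.recsigs = set()

    def hasrecsig(self, request_id, msg_hash):
        return (request_id, msg_hash) in self.recsigs


members = [ToyMember() for _ in range(5)]
recovery_member = members[0]
request = ("submit_true_id", "submit_true_msgHash")

# Recovery has happened on the recovery member, but nothing has propagated yet.
recovery_member.recsigs.add(request)
recovery_only = wait_until(lambda: recovery_member.hasrecsig(*request), timeout=0.1)
all_before = wait_until(lambda: all(m.hasrecsig(*request) for m in members), timeout=0.1)

# After propagation, the stricter all-members check passes as well.
for m in members:
    m.recsigs.add(request)
all_after = wait_until(lambda: all(m.hasrecsig(*request) for m in members), timeout=0.1)
```

In this model the recovery-member-only check passes in a state where the all-members check would still fail, which is the coverage delta the nitpick describes.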

@thepastaclaw
Author

Closing this as obsolete. The intermittent feature_llmq_signing.py --spork21 issue should be handled by the QSENDRECSIGS / quorum-connection fix path already in develop (e.g. c1cdb75 from #7289, and the related #7301 test coverage), not by hardening this individual assertion with extra direct-injected shares.

Development

Successfully merging this pull request may close these issues.

test: feature_llmq_signing.py --spork21 intermittent timeout in wait_for_sigs

2 participants