[BugFix] Order PPLSimplifyDedupRule before FilterMergeRule in HEP program (#7) by ryan-gh-bot · Pull Request #8 · RyanL1997/sql

ryan-gh-bot · 2026-05-28T22:22:02Z

Description

When a user where sits below dedup (with keepempty=false, the default), the synthetic IS NOT NULL filter that PPL emits for dedup ends up adjacent to the user filter above the scan. The HEP program in CalciteToolsHelper registered both FilterMergeRule and PPLSimplifyDedupRule via addRuleCollection, which lets HepPlanner schedule them in either order. When FilterMergeRule fires first, the two adjacent filters are merged into a single filter whose condition is AND(IS NOT NULL(field), <user predicate>). PPLSimplifyDedupRule's bottom operand calls mayBeFilterFromBucketNonNull, which only accepts a pure IS NOT NULL or AND of IS NOT NULLs — the merged condition fails the predicate, no LogicalDedup is produced, and the OpenSearch DedupPushdownRule has nothing to match. Dedup falls back to the row-number window form executed on the coordinator, which on large indices exceeds timeouts or triggers shard failures.

This change switches the HEP program to two sequential addRuleInstance calls (PPLSimplifyDedupRule first to fixpoint, then FilterMergeRule). Sequential addRuleInstance instructions guarantee deterministic ordering, so the simplify rule always sees the original adjacent-filter shape and dedup pushdown is preserved when a where clause is present.

Verification

Before (adversarial test demonstrating the failure mode — FilterMergeRule running before PPLSimplifyDedupRule):

LogicalProject(DEPTNO=[$7])
  LogicalFilter(condition=[<=($8, 1)])
    LogicalProject(EMPNO=[$0], ENAME=[$1], JOB=[$2], MGR=[$3], HIREDATE=[$4], SAL=[$5], COMM=[$6], DEPTNO=[$7], _row_number_dedup_=[ROW_NUMBER() OVER (PARTITION BY $7)])
      LogicalFilter(condition=[AND(>($5, 1000), IS NOT NULL($7))])
        LogicalTableScan(table=[[scott, EMP]])

The ROW_NUMBER window survives. mayBeFilterFromBucketNonNull rejects AND(>($5, 1000), IS NOT NULL($7)) because not every conjunct is IS NOT NULL, so PPLSimplifyDedupRule cannot fire and LogicalDedup is not produced. DedupPushdownRule has nothing to match in Volcano, so the dedup runs on the coordinator. This is captured by the new testFilterMergeBeforeSimplifyDedupBreaksPattern test in CalcitePPLDedupTest.

After (this branch, running the production HEP program from CalciteToolsHelper.optimize on a where + dedup + fields plan):

LogicalProject(DEPTNO=[$7])
  LogicalDedup(dedup_fields=[[$7]], allowed_dedup=[1], keepEmpty=[false], consecutive=[false])
    LogicalFilter(condition=[>($5, 1000)])
      LogicalTableScan(table=[[scott, EMP]])

LogicalDedup is produced. DedupPushdownRule can now match the LogicalDedup → LogicalProject → CalciteLogicalIndexScan pattern at Volcano time and push the dedup down as a composite aggregation + top_hits, restoring the same DSL shape that already worked without a where clause. Captured by the new testWhereThenDedupProducesLogicalDedup test.

Test run on this branch:

CalcitePPLDedupTest > testDedup1 PASSED
CalcitePPLDedupTest > testDedup2 PASSED
CalcitePPLDedupTest > testDedupExpr PASSED
CalcitePPLDedupTest > testDedupKeepEmpty1 PASSED
CalcitePPLDedupTest > testDedupKeepEmpty2 PASSED
CalcitePPLDedupTest > testFilterMergeBeforeSimplifyDedupBreaksPattern PASSED
CalcitePPLDedupTest > testRenameDedup PASSED
CalcitePPLDedupTest > testSortFieldProjectedAwayBeforeDedup PASSED
CalcitePPLDedupTest > testSortThenDedup PASSED
CalcitePPLDedupTest > testSortThenDedupWithEval PASSED
CalcitePPLDedupTest > testWhereThenDedupProducesLogicalDedup PASSED

Full :core:test, :ppl:test, and :opensearch:test runs all pass with no failures.

Related Issues

Resolves #7

Check List

New functionality includes testing.
New functionality has been documented.
New PPL command checklist all confirmed.
API changes companion pull request created.
Commits are signed per the DCO using --signoff or -s.
Public documentation issue/PR created.

This pull request was authored by @ryan-gh-bot in response to the maintainer command /fix on #7 (invoked by @RyanL1997). This is a maintainer-triggered automated contribution. Please review carefully before merging.

…gram (opensearch-project#7) When a user WHERE sits below dedup, the synthetic IS NOT NULL filter that PPL emits as part of dedup keepempty=false ends up adjacent to the user filter above the scan. The HEP program in CalciteToolsHelper registered both FilterMergeRule and PPLSimplifyDedupRule via addRuleCollection, which lets HepPlanner schedule them in either order. When FilterMergeRule fires first the two adjacent filters are merged into a single filter whose condition is AND(IS NOT NULL(field), <user predicate>); PPLSimplifyDedupRule's bottom operand calls mayBeFilterFromBucketNonNull, which only accepts a pure IS NOT NULL or AND-of-IS-NOT-NULLs, so the merged condition fails the predicate, no LogicalDedup is produced, and the OpenSearch DedupPushdownRule has nothing to match — dedup falls back to the row-number window form executed on the coordinator. Switch the HEP program to two sequential addRuleInstance calls (PPLSimplifyDedupRule first to fixpoint, then FilterMergeRule). Sequential addRuleInstance instructions guarantee deterministic ordering, so the simplify rule always sees the original adjacent-filter shape and dedup pushdown is preserved when a where clause is present. Signed-off-by: ryan-gh-bot <ryan-gh-bot@users.noreply.github.com>

Tasks 2-6 of the PR-reviewer-end-to-end work. The bridge can now see PR reviews and line-anchored review comments on bot-authored PRs, and handle re-invocations as 'revise the existing PR' instead of refusing. Untested end-to-end against a real review; that and the matching SOP update for the agent are next session. state.py: - Added AuthoredPR(branch, last_seen_review_id, last_seen_review_comment_id). - RepoState.bot_authored_prs: Dict[int, AuthoredPR]. Schema stays v2; the new field is additive and old v2 state files load fine. - State.record_authored_pr(repo, pr_number, branch). bridge.py — handle_fix_comment now records bot-authored PRs: - After a successful create_pull_request, calls state.record_authored_pr(repo, pr_number, branch_name) and persists. Without this, future polling cycles wouldn't fetch reviews on the new PR. bridge.py — _poll_repo_once now polls reviews + review-comments on bot-authored PRs: - After the existing /issues/comments loop, iterates over rs.bot_authored_prs and calls _poll_pr_reviews_once for each. - _poll_pr_reviews_once fetches list_pr_reviews and list_pr_review_comments, filters by per-PR cursors (last_seen_review_id, last_seen_review_comment_id), skips bot- authored and empty-body entries, and dispatches anything mentioning @<bot> through handle_fix_comment with a synthesized Comment (issue_number=pr_number, is_pull_request=True). Cursors persisted after every dispatch. 404 on a PR removes it from bot_authored_prs. bridge.py — format_conversation now takes a unified entries list, merging conversation comments, PR reviews, and PR review-comments: - New helpers _entry_from_comment, _entry_from_pr_review, _entry_from_pr_review_comment normalize each source into a flat dict (kind, id, user, body, at, state, path, line, html_url). - New fetch_full_thread(gh, repo, issue, is_pull_request) returns a flat list of entries from all relevant endpoints. - format_conversation sorts chronologically by 'at' and emits a CONVERSATION block where each entry header includes kind=, state=, file=, line= flags as appropriate. The triggering_id parameter (renamed from triggering_comment_id) flags whichever entry kicked off this invocation. - Both handle_comment and handle_fix_comment now use fetch_full_thread instead of calling list_comments_on_issue directly. bridge.py — handle_fix_comment recognizes revisions: - is_revision = is_pull_request && pr_number in rs.bot_authored_prs. - Revisions skip the standard prepare_workdir + fetch target/main flow and instead use prepare_for_revision (workdir_manager.py: fetches origin, fetches target best-effort, checkout -f origin/<branch> detached, git clean -fdx). Branch_name comes from the stored AuthoredPR.branch, not regenerated from the format string. - Revisions skip the remote_branch_exists collision check (the branch is supposed to exist). - has_unpushed_commits comparison base is origin/<branch> for revisions (the previous tip) vs. target/<base> for fresh fixes. - Push uses force=True (force-with-lease) for revisions so the agent can rebase onto target/<base> if needed; plain push for fresh fixes. - Revisions skip create_pull_request — GitHub auto-updates the open PR when we push to its head branch. Bridge fetches the PR's html_url via get_pull_request and posts a 'Pushed updates to <url>' follow-up on the PR conversation tab, quoting the latest commit subject. - If a revision agent decides not to commit (writes only .bot-response.md), the bridge posts that as a PR comment — handles 'explain X without changing code' and 'I disagree, here is why' cases. workdir_manager.py: - New prepare_for_revision(workdir, branch_name): the workdir prep used for PR-revision invocations. Fetches origin, fetches target (check=False), verifies origin/<branch> exists (raises WorkdirError if not), checks it out detached, cleans untracked files. Returns the resolved short sha. Still pending (next session): - agent fix.sop.md: small update teaching the agent about kind=review and kind=review-comment entries and the existing-branch workdir state for revisions. - Build agent package, install, restart bridge. - End-to-end test with a fresh PR review on RyanL1997/sql#8. - Doc pass. The bridge daemon currently running (PID 30890) is on the OLD code. After next session's restart, it will pick up these changes plus the SOP update. Signed-off-by: ryan-gh-bot <278766827+ryan-gh-bot@users.noreply.github.com> Co-authored-by: RyanL1997 <109499885+RyanL1997@users.noreply.github.com>

RyanL1997

thanks for the. change - just left some reviews

RyanL1997 · 2026-05-29T01:07:55Z

+   * fix is to enforce ordering: simplify-dedup runs to fixpoint first, then filter merging.
+   */
+  /**
+   * Regression test for issue #7: when a user {@code where} sits below {@code dedup}, the HEP


@ryan-gh-bot looks like we are having duplicated javadoc here - can you verify and fix it?

Done — pushed bd848efb1.

RyanL1997 · 2026-05-29T01:08:16Z

  }

-  /** Try to optimize the plan by using HepPlanner */
-  private static final List<RelOptRule> hepRuleList =


@ryan-gh-bot just for my knowledge why are we removing it?

That hepRuleList field only existed as the argument to the old addRuleCollection(hepRuleList) call. Once the builder switched to two sequential addRuleInstance(...) calls (which take single rules directly), nothing else referenced the list, so keeping it would just be dead code — and dropping it also lets the java.util.List and org.apache.calcite.plan.RelOptRule imports go away. The Javadoc that used to sit on the field moved up onto HEP_PROGRAM, which is now the thing whose ordering actually matters.

RyanL1997 · 2026-05-29T01:08:48Z

+            .addRuleInstance(
+                org.opensearch.sql.calcite.plan.rule.PPLSimplifyDedupRule.DEDUP_SIMPLIFY_RULE)
+            .build();
+    org.apache.calcite.plan.hep.HepPlanner planner =


@ryan-gh-bot not only here - can you avoid use inline import? we should do proper import for the class.

Done — pushed e9bcf1b49.

@test

…nsearch-project#8) Address review feedback on PR opensearch-project#8: testWhereThenDedupProducesLogicalDedup had two adjacent javadoc blocks above the @test method. Merge them into a single javadoc that retains the failure-mechanism explanation (mayBeFilterFromBucketNonNull rejecting the merged condition) along with the addRuleCollection -> addRuleInstance fix detail. Signed-off-by: ryan-gh-bot <ryan-gh-bot@users.noreply.github.com>

…test (opensearch-project#8) Add static imports for assertTrue/assertFalse and proper imports for HepPlanner, HepProgram, HepProgramBuilder, PPLSimplifyDedupRule, and LogicalDedup, replacing all inline fully-qualified names in the regression tests added for opensearch-project#7. Pure refactor; no behavior change. Signed-off-by: ryan-gh-bot <ryan-gh-bot@users.noreply.github.com>

…robustness Several fixes from interactive testing on RyanL1997/sql#8. Together they make the bot's PR-review interaction match a regular human contributor's behaviour: commits properly attributed, eyes appear on every review-comment within seconds of submission, conversational threaded replies, fast-forward push when there's no history rewrite, no spurious 'Working on...' chatter. Workdir-local commit identity: - New cfg.git_author_name / git_author_email (default to bot username and <bot>@users.noreply.github.com). - New workdir_manager.set_commit_identity writes them to the workdir's local config on every prep so commits are attributed to the bot, not the host's global identity. Multi-comment review acknowledgement: - _poll_pr_reviews_once is now two-phase: phase 1 adds eyes-emoji to all actionable review-comments upfront, phase 2 dispatches each through the agent. Maintainers immediately see that the bot saw all of their inline comments even if the agent processes them sequentially. Revision-mode push and confirmation: - has_unpushed_commits uses origin/<branch> as base for revisions (not target/<base_branch> which would never see the agent's new commits in revision mode). - .bot-pr-body.md presence is only required for fresh PR creation; revisions don't need a new PR body, and the bridge no longer silently skips the push when the agent correctly omits it. - Follow-up comment is now a tight 'Done — pushed [`abc1234`](url).' one-liner. New latest_commit_sha helper. Single source of truth for response format: - New load_format_rules / get_format_rule parses  blocks out of the agent's github-comment-format/SKILL.md at startup. - build_fix_prompt embeds the matching block in the prompt (new is_review_comment_dispatch parameter). The bridge no longer duplicates format rules anywhere; SKILL.md is the only place to edit them. Bridge and agent literally read the same file. - New cfg.agent_skill_format_path knob lets operators point at a different install location. Review-comment thread reply hygiene: - New github_client.add_pr_review_comment_reaction for the /pulls/comments/<id>/reactions endpoint. - New github_client.create_review_comment_reply for the /pulls/{n}/comments/{id}/replies endpoint. - Comment dataclass gains review_comment_id (Optional[int]). - _acknowledge for review-comment dispatches is silent eyes-emoji only; no 'Working on...' threaded reply. - _post_response_to_comment routes to the threaded-reply endpoint when review_comment_id is set. Defensive _strip_for_thread_reply removes any '## [Tag]' header or '_Automated response_' footer at the bridge boundary in case the agent regresses. - New _ack_body picks ack wording by dispatch source. Signed-off-by: ryan-gh-bot <278766827+ryan-gh-bot@users.noreply.github.com> Co-authored-by: RyanL1997 <109499885+RyanL1997@users.noreply.github.com>

ryan-gh-bot requested a review from RyanL1997 as a code owner May 28, 2026 22:22

ryan-gh-bot mentioned this pull request May 28, 2026

[BUG] Dedup pushdown disabled when combined with WHERE clause (FilterMergeRule defeats PPLSimplifyDedupRule) #7

Open

1 task

Repository owner deleted a comment from ryan-gh-bot May 28, 2026

ryan-gh-bot force-pushed the bot-fix-RyanL1997-sql-7 branch from aa53298 to 52f45f9 Compare May 28, 2026 23:45

Repository owner deleted a comment from ryan-gh-bot May 28, 2026

ryan-gh-bot force-pushed the bot-fix-RyanL1997-sql-7 branch from b306502 to 52f45f9 Compare May 28, 2026 23:55

Repository owner deleted a comment from ryan-gh-bot May 28, 2026

This comment was marked as outdated.

Sign in to view

ryan-gh-bot force-pushed the bot-fix-RyanL1997-sql-7 branch from a79b638 to 52f45f9 Compare May 29, 2026 00:11

Repository owner deleted a comment from ryan-gh-bot May 29, 2026

This comment was marked as off-topic.

Sign in to view

Repository owner deleted a comment from ryan-gh-bot May 29, 2026

ryan-gh-bot force-pushed the bot-fix-RyanL1997-sql-7 branch from 4531606 to 52f45f9 Compare May 29, 2026 01:07

RyanL1997 reviewed May 29, 2026

View reviewed changes

ryan-gh-bot added 2 commits May 29, 2026 01:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BugFix] Order PPLSimplifyDedupRule before FilterMergeRule in HEP program (#7)#8

[BugFix] Order PPLSimplifyDedupRule before FilterMergeRule in HEP program (#7)#8
ryan-gh-bot wants to merge 3 commits into
RyanL1997:mainfrom
ryan-gh-bot:bot-fix-RyanL1997-sql-7

ryan-gh-bot commented May 28, 2026

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as off-topic.

Uh oh!

RyanL1997 left a comment

Uh oh!

RyanL1997 May 29, 2026

Uh oh!

ryan-gh-bot May 29, 2026

Uh oh!

RyanL1997 May 29, 2026

Uh oh!

ryan-gh-bot May 29, 2026

Uh oh!

RyanL1997 May 29, 2026

Uh oh!

ryan-gh-bot May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ryan-gh-bot commented May 28, 2026

Description

Verification

Related Issues

Check List

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as off-topic.

Uh oh!

RyanL1997 left a comment

Choose a reason for hiding this comment

Uh oh!

RyanL1997 May 29, 2026

Choose a reason for hiding this comment

Uh oh!

ryan-gh-bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

RyanL1997 May 29, 2026

Choose a reason for hiding this comment

Uh oh!

ryan-gh-bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

RyanL1997 May 29, 2026

Choose a reason for hiding this comment

Uh oh!

ryan-gh-bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants