feat(helixops_saas): add shared dbt project and 17 benchmark tasks #137
Merged
Adds the OpsPilot SaaS analytics benchmark project as a new shared dbt project in ADE-bench. The project contains 28 models (11 staging, 11 intermediate, 6 marts) over an intentionally messy 11-table seed dataset covering accounts, workspaces, users, subscriptions, invoices, payments, and support tickets.

Key setup decisions:
- Seeds converted to dbt sources (source() refs, not ref()) to match existing ADE-bench shared project conventions
- Raw data loaded into shared/databases/duckdb/ops_pilot.duckdb as VARCHAR columns to preserve intentional messiness
- Hardcoded date anchors replaced with now() in three intermediate models (int_account_users, int_account_engagement, int_support_sla)
- DuckDB-only for now; Snowflake migration deferred

Verified: dbt run 28/28 PASS with dbt-core==1.10.11 + dbt-duckdb==1.9.3

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
25-task plan for creating ADE-bench tasks against the ops_pilot shared project. Covers all benchmark task ideas from the original handoff doc: Type A (remove-and-restore), Type B (genuine addition), and Type C (logic change). Includes common patterns, file manifest, and end-to-end verification steps.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…project

Creates task directories helixops_saas001-017 covering:
- Type A (remove-and-restore): billing_country, owner_team propagation, api_calls DAG propagation
- Type B (genuine addition): net_mrr trap, geo_segment, filter archived workspaces, dim_accounts_v2, SLA seeds
- Type C (logic fix): sandbox filter, department infer trap, email rename, Falcon Works sbx bug, rename+propagate, onboarding fees
- Refactor: inline int_monthly_revenue_prep as CTE
- Already-done: department (no-op, expected-pass for none agent)

Each task has task.yaml, setup.sh, solution.sh, solutions/ SQL, and tests/.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… diffs

Convert all 17 helixops_saas tasks from heredoc/cp-based setup and solution scripts to unified patch files, following the convention introduced in PR #139.

- setup.sh: uses `patch -p1 < /app/setup/changes.patch` for tasks with file modifications (001-006, 011, 014); tasks with no file changes unchanged
- solution.sh: uses `patch -p1 < /sage/solutions/changes.patch` for all tasks
- setup/changes.patch: unified diff from shared project baseline to broken state
- solutions/changes.patch: unified diff from broken state to correct solution
- Removes ~1500 lines of boilerplate SQL files from solutions/ directories

Includes new file creation (009 dim_accounts_v2, 015/016 seeds) and file deletion (012 removes int_monthly_revenue_prep) using /dev/null patch syntax.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
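The patch convention in this commit can be reproduced offline. What follows is a minimal sketch, not the script used in this PR: `make_patch` is a hypothetical helper built on Python's standard `difflib`, and the file paths and SQL in the usage lines are illustrative only. It emits `a/` and `b/` path prefixes so the result applies with `patch -p1`, and uses `/dev/null` for the side on which a file does not exist.

```python
import difflib


def make_patch(path, before, after):
    """Return a unified diff for one file.

    `before` / `after` are the file contents as strings; None means the file
    does not exist on that side (creation or deletion).
    Assumes contents end with a newline, since difflib does not emit the
    "\\ No newline at end of file" marker that patch(1) expects otherwise.
    """
    old = [] if before is None else before.splitlines(keepends=True)
    new = [] if after is None else after.splitlines(keepends=True)
    # patch -p1 strips the first path component, hence the a/ and b/ prefixes.
    fromfile = "/dev/null" if before is None else f"a/{path}"
    tofile = "/dev/null" if after is None else f"b/{path}"
    return "".join(difflib.unified_diff(old, new, fromfile=fromfile, tofile=tofile))


# Hypothetical examples: a modified model and a newly created one.
modified = make_patch(
    "models/staging/stg_accounts.sql",
    "select account_id, billing_country\n",
    "select account_id\n",
)
created = make_patch(
    "models/marts/dim_accounts_v2.sql",
    None,
    "select * from {{ ref('dim_accounts') }}\n",
)
```

Writing `created` to a `changes.patch` file and running `patch -p1` from the project root would create the new file, because the `--- /dev/null` header marks the old side as absent.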
…ken-state patches

The shared helixops_saas.duckdb will only contain raw tables (matching the airbnb pattern). Setup scripts for tasks 001-005, 011, and 013 previously ran dbt on a partial model selection, relying on pre-built views in the DB.

Change all partial `dbt run --select ...` calls to `dbt run || true` so the full project is built from raw tables before the patch establishes the broken state. Task 013's DB-mutation-only setup also gets a full `dbt run`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove dummy.sql from 11 tasks (was a placeholder to ensure tasks always failed before solution seeds were generated; SELECT 1 always returns a row which fails dbt tests, and the semicolon caused parser errors in dbt-fusion)
- Add AUTO_*_equality.sql and AUTO_*_existence.sql tests for all 17 tasks
- Add solution seed CSVs generated by sage agent for all 17 tasks
- Add ade_bench_equality_test.sql macros and _no-op.txt for all 17 tasks

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- harness.py: SET TimeZone = 'UTC' before COPY TO CSV to fix seeds generated on UTC+12 machines being 12h ahead of CI (UTC) values
- helixops_saas014/solutions/changes.patch: add missing stg_workspace_usage_daily hunk so sage can restore api_calls at the source; without this, the downstream int_workspace_daily_metrics.u.api_calls reference caused a compile error and total_api_calls_not_null test always failed
- helixops_saas005/tests: exclude days_since_last_login from equality comparison since it uses now() and changes daily
- Regenerated seeds for tasks 004, 005, 010, 014, 017 with UTC timezone fix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
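The 12-hour drift fixed in harness.py comes from rendering the same instant in two different session time zones. The sketch below illustrates the mechanism with Python's standard `datetime` only; the example instant and the `render` helper are hypothetical, and it is an assumption that the CSV export formats timestamps as session-local wall-clock strings in this way.

```python
from datetime import datetime, timedelta, timezone

# Arbitrary example instant, stored unambiguously in UTC.
instant = datetime(2026, 3, 24, 18, 0, tzinfo=timezone.utc)
utc_plus_12 = timezone(timedelta(hours=12))


def render(ts, tz):
    """Format an instant as a wall-clock string in the given zone."""
    return ts.astimezone(tz).strftime("%Y-%m-%d %H:%M:%S")


ci_value = render(instant, timezone.utc)   # what a UTC CI runner would write
dev_value = render(instant, utc_plus_12)   # what a UTC+12 machine would write

print(ci_value)   # 2026-03-24 18:00:00
print(dev_value)  # 2026-03-25 06:00:00
```

The two strings describe the same moment but differ as text, so a seed written on the UTC+12 machine fails a string-level equality test in CI. Pinning the session to UTC before exporting makes both environments produce the first form.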
AUTO_*.sql test files are regenerated on every harness run, so cols_to_exclude must be set in task.yaml's solution_seeds config rather than directly in the test SQL.

- helixops_saas005: exclude days_since_last_login (uses now())
- helixops_saas015: exclude ticket_age_days (uses now())
- helixops_saas016: exclude ticket_age_days (uses now())

Regenerated seeds for all three tasks with UTC timezone fix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
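Why the now()-derived columns have to be excluded can be shown with a small stand-in for the equality comparison. This is only a sketch of the idea: `rows_equal` is a hypothetical helper, not the ade_bench_equality_test macro, and the sample rows are invented.

```python
def rows_equal(actual, expected, cols_to_exclude=()):
    """Compare two lists of row dicts, ignoring row order and excluded columns."""
    def normalise(rows):
        return sorted(
            tuple(sorted((k, v) for k, v in row.items() if k not in cols_to_exclude))
            for row in rows
        )
    return normalise(actual) == normalise(expected)


# Seed captured on one day, model output built the next day (invented data).
seed = [{"user_id": "u1", "days_since_last_login": 3}]
today = [{"user_id": "u1", "days_since_last_login": 4}]

strict = rows_equal(today, seed)
relaxed = rows_equal(today, seed, cols_to_exclude={"days_since_last_login"})
```

`strict` is false even though nothing in the model changed, while `relaxed` is true, which is the behaviour the cols_to_exclude setting is meant to provide.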
**helixops_saas008**: solution patch was missing int_account_users.sql (which also selects a.account_status). Patch now renames all 5 occurrences. Seed regenerated with customer_status.

**helixops_saas012**: structural refactor (inline CTE) produces identical data, so equality alone can't distinguish broken from fixed state.
- test_setup now drops the int_monthly_revenue_prep relation before rebuilding fct_monthly_revenue, causing the none-agent build to fail when it still ref()s the dropped view
- Added manifest check test: fails if int_monthly_revenue_prep is still present in graph.nodes (i.e. the file was not deleted)
- Added drop_relation macro used by the test_setup operation

**ci.yml**: add helixops_saas017 to ALLOWED_TO_PASS; it is an intentional no-op task (department already exists) where both none and sage agents correctly produce passing output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
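The manifest check for helixops_saas012 amounts to asking whether a model is still registered in the project graph. The test in this PR reads graph.nodes; the sketch below applies the same idea to a manifest-shaped dictionary in Python. `orphan_model_present` is a hypothetical helper and the sample manifest is hand-written, using dbt's `model.<project>.<name>` node id convention.

```python
def orphan_model_present(manifest, model_name):
    """Return True if a model with this name is still registered in the manifest."""
    return any(
        node.get("resource_type") == "model" and node.get("name") == model_name
        for node in manifest.get("nodes", {}).values()
    )


# Hand-written stand-ins for the manifest before and after the refactor.
before_refactor = {
    "nodes": {
        "model.helixops_saas.fct_monthly_revenue": {
            "resource_type": "model", "name": "fct_monthly_revenue"},
        "model.helixops_saas.int_monthly_revenue_prep": {
            "resource_type": "model", "name": "int_monthly_revenue_prep"},
    }
}
after_refactor = {
    "nodes": {
        "model.helixops_saas.fct_monthly_revenue": {
            "resource_type": "model", "name": "fct_monthly_revenue"},
    }
}

still_there = orphan_model_present(before_refactor, "int_monthly_revenue_prep")
removed = not orphan_model_present(after_refactor, "int_monthly_revenue_prep")
```

A structural check like this complements the equality test, since the inlined CTE produces the same rows as the original two-model layout.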
Each description now captures the non-obvious aspect an agent must understand to solve the task (e.g. the column source, the trap in the prompt, or the structural requirement) rather than just restating the prompt.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
joellabes
commented
Mar 24, 2026
- 004: add no_hint prompt variant (same challenge, without the hint)
- 006: add compile+grep test_setup check to enforce upstream field reuse
- 007: fix prompt wording (hyphen separated); add equality seeds for int_account_billing_snapshot and mart_account_health
- 008: add equality seeds for dim_accounts, int_account_users, mart_account_health
- 009: rework to use dbt model versioning YAML (latest_version=1, v2 via _models.yml); add graph.nodes manifest check; fix tests to use versioned ref syntax ref('dim_accounts', v=2)
- 010: add stg_workspaces seed to verify staging layer is unchanged; update solution.sh to rebuild stg_workspaces
- 012: no change needed (orphan model check already in place)
- 013: add equality seeds for stg_invoice_line_items and int_invoice_finance
- 014: add equality seeds for all 6 intermediate models in the api_calls chain
- shared/scripts/run-dbt-test.sh: propagate test_setup exit code so compile-based checks can fail the task

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…for upstream field check

When mart_account_360 doesn't reference effective_monthly_value_usd, write a failing singular test file into /app/tests/ so dbt picks it up as a real test failure with proper results page output.

Reverts the || exit 1 approach in run-dbt-test.sh which caused an empty test list on the results page.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
joellabes
commented
Mar 26, 2026
Comment on lines +3 to +7

    description: Add net_mrr to mart_account_360 — correct solution reuses an upstream calculated field rather than recalculating from raw inputs
    prompts:
      - key: base
        prompt: |-
          Please add net_mrr to the account 360, based on contracted price less discount, divided by 12 if billed annually.
Collaborator
Author
Let's add another task which is like this one, but requires a formula to be brought through from a grandparent model. Won't be a variant prompt on 006, will need to work out what it actually is
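For reference, the formula in the quoted prompt can be worked through numerically. The sketch below is only an illustration of the arithmetic that the prompt's wording describes; it is not the upstream calculation the correct solution is meant to reuse. The function name, the assumption that the discount is a percentage between 0 and 100, and the `"annual"` billing-cycle value are all hypothetical.

```python
import math


def net_mrr(contracted_price_usd, discount_pct, billing_cycle):
    """Contracted price less discount, divided by 12 if billed annually.

    Assumptions (not confirmed by the project): discount_pct is expressed in
    percentage points, and annual billing is encoded as the string "annual".
    """
    net_price = contracted_price_usd * (1 - discount_pct / 100)
    return net_price / 12 if billing_cycle == "annual" else net_price


annual_example = net_mrr(1200, 10, "annual")    # 1200 less 10% = 1080, over 12 months
monthly_example = net_mrr(100, 10, "monthly")   # 100 less 10%, already monthly
```

Under these assumptions both examples come out at about 90 USD per month. Recomputing the value inline like this in the mart is the trap; the task expects the existing upstream field to be brought through instead.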
- saas007: fix geo_segment separator ' / ' -> '-' in patch; add no_location_hint prompt variant
- saas008: add stg_accounts to solution_seeds
- saas010: add int_workspace_roster, int_workspace_daily_metrics, int_support_sla to solution_seeds
- saas011: add hard prompt variant
- saas012: add hard prompt variant
- saas015: add compile+grep check: write failing test if int_support_sla doesn't reference sla_response_targets

Seeds for saas007/008/010 need regeneration via sage --seed run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…check

Agent created sla_priority_config.csv with 'standard' priority and joined using CASE WHEN ... ELSE 'standard' END, which maps 'low' tickets to 'standard' and produces numerically identical output, so the equality test cannot distinguish the two.

Compile int_support_sla and fail if compiled SQL contains 'standard' but not 'low': correct solution joins directly on t.priority so neither literal appears; agent's normalization approach contains 'standard'.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…seed and SQL

Two checks in test_setup:
1. grep any non-solution__ seed file for 'standard' (catches agents that create sla_priority_config.csv with 'standard' instead of 'low')
2. grep compiled int_support_sla for literal 'standard' in SQL (catches agents that normalize via CASE WHEN ... ELSE 'standard' END)

Both must be absent for the task to pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
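The two checks above can be expressed as a single predicate. The sketch below is a Python restatement of that logic, not the shell used in test_setup: `violates_standard_check` is a hypothetical helper, seed files are passed as a filename-to-contents mapping rather than read from disk, and the sample CSV columns and SQL are invented.

```python
def violates_standard_check(seed_files, compiled_sql):
    """Return True if either check finds the 'standard' literal.

    seed_files: mapping of seed filename -> file contents.
    compiled_sql: compiled SQL text of int_support_sla.
    """
    # Check 1: any seed that is not a solution__ seed mentions 'standard'.
    seed_hit = any(
        "standard" in contents.lower()
        for name, contents in seed_files.items()
        if not name.startswith("solution__")
    )
    # Check 2: the compiled model still contains the literal.
    sql_hit = "standard" in compiled_sql.lower()
    return seed_hit or sql_hit


# Invented examples of the normalising approach and the direct join.
agent_seeds = {"sla_priority_config.csv": "priority,target_hours\nstandard,24\n"}
agent_sql = (
    "select * from tickets t join sla s on s.priority = "
    "case when t.priority in ('urgent', 'high') then t.priority else 'standard' end"
)
direct_seeds = {"sla_response_targets.csv": "priority,target_hours\nlow,24\n"}
direct_sql = "select * from tickets t join sla s on s.priority = t.priority"

agent_fails = violates_standard_check(agent_seeds, agent_sql)
direct_passes = not violates_standard_check(direct_seeds, direct_sql)
```

Both checks are needed because an agent could keep the seed clean while normalising in SQL, or the reverse; either route yields output the equality test cannot tell apart from the intended solution.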
- saas007: regenerate 3 seeds with '-' separator (was ' / ')
- saas008: add solution__stg_accounts.csv + regenerate affected seeds
- saas010: add solution__int_workspace_roster/daily_metrics/support_sla.csv

All 4 tasks (including saas007.no_location_hint) pass 38/38 tests with sage.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…_usd pre-exposed

All four raw formula ingredients (billing_cycle, contracted_seats, discount_pct, list_price_usd) are already present in mart_account_360, making inline recalculation maximally tempting. Correct solution still reuses effective_monthly_value_usd from int_account_billing_snapshot.

Setup patch adds list_price_usd to both int_account_billing_snapshot and mart_account_360. Same compile check as saas006.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…made it annoying

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
- `helixops_saas` shared dbt project (28 models: 9 staging, 11 intermediate, 8 marts) with a pre-built DuckDB database
- 17 benchmark tasks (`helixops_saas001`–`helixops_saas017`) covering a wide range of dbt agent challenges