
feat(helixops_saas): add shared dbt project and 17 benchmark tasks #137

Merged
joellabes merged 21 commits into main from
feature/ops-pilot-shared-project
Mar 30, 2026

Conversation

@joellabes
Collaborator

@joellabes joellabes commented Mar 19, 2026

Summary

  • Adds the helixops_saas shared dbt project (28 models: 9 staging, 11 intermediate, 8 marts) with a pre-built DuckDB database
  • Adds 17 benchmark tasks (helixops_saas001 through helixops_saas017) covering a wide range of dbt agent challenges

joellabes and others added 5 commits March 18, 2026 14:09
Adds the OpsPilot SaaS analytics benchmark project as a new shared dbt
project in ADE-bench. The project contains 28 models (11 staging, 11
intermediate, 6 marts) over an intentionally messy 11-table seed dataset
covering accounts, workspaces, users, subscriptions, invoices, payments,
and support tickets.

Key setup decisions:
- Seeds converted to dbt sources (source() refs, not ref()) to match
  existing ADE-bench shared project conventions
- Raw data loaded into shared/databases/duckdb/ops_pilot.duckdb as
  VARCHAR columns to preserve intentional messiness
- Hardcoded date anchors replaced with now() in three intermediate
  models (int_account_users, int_account_engagement, int_support_sla)
- DuckDB-only for now; Snowflake migration deferred

Verified: dbt run 28/28 PASS with dbt-core==1.10.11 + dbt-duckdb==1.9.3

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
25-task plan for creating ADE-bench tasks against the ops_pilot shared
project. Covers all benchmark task ideas from the original handoff doc:
Type A (remove-and-restore), Type B (genuine addition), and Type C
(logic change). Includes common patterns, file manifest, and end-to-end
verification steps.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…project

Creates task directories helixops_saas001-017 covering:
- Type A (remove-and-restore): billing_country, owner_team propagation, api_calls DAG propagation
- Type B (genuine addition): net_mrr trap, geo_segment, filter archived workspaces, dim_accounts_v2, SLA seeds
- Type C (logic fix): sandbox filter, department infer trap, email rename, Falcon Works sbx bug, rename+propagate, onboarding fees
- Refactor: inline int_monthly_revenue_prep as CTE
- Already-done: department (no-op, expected-pass for none agent)

Each task has task.yaml, setup.sh, solution.sh, solutions/ SQL, and tests/.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@joellabes joellabes marked this pull request as draft March 19, 2026 03:06
joellabes and others added 8 commits March 23, 2026 15:23
… diffs

Convert all 17 helixops_saas tasks from heredoc/cp-based setup and solution
scripts to unified patch files, following the convention introduced in PR #139.

- setup.sh: uses `patch -p1 < /app/setup/changes.patch` for tasks with file
  modifications (001-006, 011, 014); tasks with no file changes unchanged
- solution.sh: uses `patch -p1 < /sage/solutions/changes.patch` for all tasks
- setup/changes.patch: unified diff from shared project baseline to broken state
- solutions/changes.patch: unified diff from broken state to correct solution
- Removes ~1500 lines of boilerplate SQL files from solutions/ directories

Includes new file creation (009 dim_accounts_v2, 015/016 seeds) and file
deletion (012 removes int_monthly_revenue_prep) using /dev/null patch syntax.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
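As a rough illustration of the `/dev/null` patch syntax mentioned above, the sketch below builds a unified diff that creates a new file, using Python's standard `difflib`. The model path and SQL body are hypothetical placeholders, not the contents of the real `changes.patch`; the `b/` prefix is there only so that `patch -p1` would strip it.

```python
import difflib


def new_file_patch(path, content):
    """Return a unified diff that creates `path` with `content`."""
    new_lines = content.splitlines()
    diff = difflib.unified_diff(
        [],                      # the "old" side is empty: the file does not exist yet
        new_lines,
        fromfile="/dev/null",    # conventional marker for a file being created
        tofile=f"b/{path}",      # the leading component is stripped by `patch -p1`
        lineterm="",
    )
    return "\n".join(diff) + "\n"


# Hypothetical path and SQL, for illustration only.
patch_text = new_file_patch(
    "models/marts/dim_accounts_v2.sql",
    "select *\nfrom {{ ref('dim_accounts') }}\n",
)
print(patch_text)
```

A deletion patch is the mirror image: the old file's lines on the left and `/dev/null` as the target on the right.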
…ken-state patches

The shared helixops_saas.duckdb will only contain raw tables (matching the
airbnb pattern). Setup scripts for tasks 001-005, 011, and 013 previously
ran dbt on a partial model selection, relying on pre-built views in the DB.

Change all partial `dbt run --select ...` calls to `dbt run || true` so
the full project is built from raw tables before the patch establishes the
broken state. Task 013's DB-mutation-only setup also gets a full `dbt run`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove dummy.sql from 11 tasks (was a placeholder to ensure tasks
  always failed before solution seeds were generated; SELECT 1 always
  returns a row which fails dbt tests, and the semicolon caused parser
  errors in dbt-fusion)
- Add AUTO_*_equality.sql and AUTO_*_existence.sql tests for all 17 tasks
- Add solution seed CSVs generated by sage agent for all 17 tasks
- Add ade_bench_equality_test.sql macros and _no-op.txt for all 17 tasks

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- harness.py: SET TimeZone = 'UTC' before COPY TO CSV to fix seeds generated
  on UTC+12 machines being 12h ahead of CI (UTC) values
- helixops_saas014/solutions/changes.patch: add missing stg_workspace_usage_daily
  hunk so sage can restore api_calls at the source; without this, the downstream
  int_workspace_daily_metrics.u.api_calls reference caused a compile error and
  total_api_calls_not_null test always failed
- helixops_saas005/tests: exclude days_since_last_login from equality comparison
  since it uses now() and changes daily
- Regenerated seeds for tasks 004, 005, 010, 014, 017 with UTC timezone fix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
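The 12-hour drift described above can be reproduced without DuckDB. This is a minimal sketch using only the standard library: the same instant is rendered as text in UTC and in a fixed UTC+12 offset, which is roughly what happens when a timestamp is written to CSV under the session's time zone. The sample instant is made up.

```python
from datetime import datetime, timedelta, timezone


def render(instant, tz):
    """Format an aware datetime as it would appear as text in time zone `tz`."""
    return instant.astimezone(tz).strftime("%Y-%m-%d %H:%M:%S")


# A made-up instant, for illustration only.
instant = datetime(2026, 3, 19, 18, 0, 0, tzinfo=timezone.utc)
utc_plus_12 = timezone(timedelta(hours=12))

ci_value = render(instant, timezone.utc)   # what a UTC CI runner writes
dev_value = render(instant, utc_plus_12)   # what a UTC+12 machine writes

print(ci_value)
print(dev_value)
```

Pinning the session to UTC before exporting makes both environments produce the first form, so seeds generated on either machine compare equal.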
AUTO_*.sql test files are regenerated on every harness run, so
cols_to_exclude must be set in task.yaml's solution_seeds config
rather than directly in the test SQL.

- helixops_saas005: exclude days_since_last_login (uses now())
- helixops_saas015: exclude ticket_age_days (uses now())
- helixops_saas016: exclude ticket_age_days (uses now())

Regenerated seeds for all three tasks with UTC timezone fix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
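To show why `now()`-derived columns need excluding, here is a small Python sketch of an equality comparison that ignores a configurable set of columns. It is an illustration only: the real check is the generated SQL equality test, and the `user_id` column and the row values are assumptions.

```python
def rows_equal(actual, expected, cols_to_exclude=()):
    """Compare two lists of row dicts, ignoring row order and excluded columns."""
    excluded = set(cols_to_exclude)

    def normalise(rows):
        stripped = [
            tuple(sorted((k, v) for k, v in row.items() if k not in excluded))
            for row in rows
        ]
        return sorted(stripped, key=repr)

    return normalise(actual) == normalise(expected)


# Seed generated on one day, model built the next day (made-up values).
expected = [{"user_id": "u1", "days_since_last_login": "3"}]
actual = [{"user_id": "u1", "days_since_last_login": "4"}]

strict = rows_equal(actual, expected)
relaxed = rows_equal(actual, expected, cols_to_exclude=["days_since_last_login"])
print(strict, relaxed)
```

Without the exclusion the comparison fails one day after the seed is generated, even though the model logic is unchanged.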
**helixops_saas008**: solution patch was missing int_account_users.sql
(which also selects a.account_status). Patch now renames all 5
occurrences. Seed regenerated with customer_status.

**helixops_saas012**: structural refactor (inline CTE) produces identical
data, so equality alone can't distinguish broken from fixed state.
- test_setup now drops the int_monthly_revenue_prep relation before
  rebuilding fct_monthly_revenue, causing the none-agent build to fail
  when it still ref()s the dropped view
- Added manifest check test: fails if int_monthly_revenue_prep is still
  present in graph.nodes (i.e. the file was not deleted)
- Added drop_relation macro used by the test_setup operation

**ci.yml**: add helixops_saas017 to ALLOWED_TO_PASS — it is an
intentional no-op task (department already exists) where both none and
sage agents correctly produce passing output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
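The manifest check for helixops_saas012 is implemented as a dbt test against `graph.nodes`; the sketch below expresses the same idea in Python against a dict shaped like dbt's `manifest.json`. The unique IDs and the project name in them are assumptions made for the example.

```python
def model_in_manifest(manifest, model_name):
    """Return True if a model named `model_name` is still present in the manifest."""
    return any(
        node.get("resource_type") == "model" and node.get("name") == model_name
        for node in manifest.get("nodes", {}).values()
    )


# Minimal stand-in for a parsed manifest.json (hypothetical unique IDs).
broken_state = {
    "nodes": {
        "model.helixops_saas.int_monthly_revenue_prep": {
            "resource_type": "model",
            "name": "int_monthly_revenue_prep",
        },
        "model.helixops_saas.fct_monthly_revenue": {
            "resource_type": "model",
            "name": "fct_monthly_revenue",
        },
    }
}
fixed_state = {
    "nodes": {
        "model.helixops_saas.fct_monthly_revenue": {
            "resource_type": "model",
            "name": "fct_monthly_revenue",
        },
    }
}

still_present = model_in_manifest(broken_state, "int_monthly_revenue_prep")
removed = not model_in_manifest(fixed_state, "int_monthly_revenue_prep")
print(still_present, removed)
```

Because the refactor leaves the output data identical, a structural check like this is what separates the broken state from the fixed one.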
Each description now captures the non-obvious aspect an agent must
understand to solve the task — e.g. the column source, the trap in the
prompt, or the structural requirement — rather than just restating the
prompt.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
joellabes and others added 2 commits March 25, 2026 07:44
- 004: add no_hint prompt variant (same challenge, without the hint)
- 006: add compile+grep test_setup check to enforce upstream field reuse
- 007: fix prompt wording (hyphen separated); add equality seeds for
       int_account_billing_snapshot and mart_account_health
- 008: add equality seeds for dim_accounts, int_account_users, mart_account_health
- 009: rework to use dbt model versioning YAML (latest_version=1, v2 via
       _models.yml); add graph.nodes manifest check; fix tests to use
       versioned ref syntax ref('dim_accounts', v=2)
- 010: add stg_workspaces seed to verify staging layer is unchanged;
       update solution.sh to rebuild stg_workspaces
- 012: no change needed (orphan model check already in place)
- 013: add equality seeds for stg_invoice_line_items and int_invoice_finance
- 014: add equality seeds for all 6 intermediate models in the api_calls chain
- shared/scripts/run-dbt-test.sh: propagate test_setup exit code so
  compile-based checks can fail the task

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…for upstream field check

When mart_account_360 doesn't reference effective_monthly_value_usd, write a
failing singular test file into /app/tests/ so dbt picks it up as a real test
failure with proper results page output. Reverts the || exit 1 approach in
run-dbt-test.sh which caused an empty test list on the results page.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
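A minimal sketch of the compile-and-check approach described above, assuming the compiled SQL is already available as a string. When the required field is missing, it writes a singular test that returns one row, which dbt reports as a failure. The function name, the test file name, and the sample SQL strings are hypothetical; the real check lives in the task's test_setup.

```python
import tempfile
from pathlib import Path


def write_failing_test_if_missing(compiled_sql, required_field, tests_dir, test_name):
    """Write a failing singular test unless `required_field` appears in the SQL."""
    if required_field.lower() in compiled_sql.lower():
        return None
    test_path = Path(tests_dir) / f"{test_name}.sql"
    # A singular test fails when its query returns rows; no trailing semicolon.
    test_path.write_text(
        f"-- {required_field} is not referenced in the compiled model\n"
        "select 1 as failure\n"
    )
    return test_path


with tempfile.TemporaryDirectory() as tests_dir:
    # Placeholder SQL strings, not the real compiled models.
    recalculated = "select some_inline_formula as net_mrr from billing"
    reused = "select effective_monthly_value_usd as net_mrr from billing"

    bad = write_failing_test_if_missing(
        recalculated, "effective_monthly_value_usd", tests_dir, "upstream_field_check"
    )
    good = write_failing_test_if_missing(
        reused, "effective_monthly_value_usd", tests_dir, "upstream_field_check_ok"
    )
    bad_written = bad is not None and bad.exists()
    bad_body = bad.read_text() if bad_written else ""

print(bad_written, good)
```

Routing the failure through a test file, rather than a non-zero exit code, keeps the failure visible alongside the other test results.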
Comment on lines +3 to +7
description: Add net_mrr to mart_account_360 — correct solution reuses an upstream calculated field rather than recalculating from raw inputs
prompts:
  - key: base
    prompt: |-
      Please add net_mrr to the account 360, based on contracted price less discount, divided by 12 if billed annually.
Collaborator Author


Let's add another task which is like this one, but requires a formula to be brought through from a grandparent model. Won't be a variant prompt on 006, will need to work out what it actually is

joellabes and others added 4 commits March 30, 2026 13:52
- saas007: fix geo_segment separator ' / ' -> '-' in patch; add no_location_hint prompt variant
- saas008: add stg_accounts to solution_seeds
- saas010: add int_workspace_roster, int_workspace_daily_metrics, int_support_sla to solution_seeds
- saas011: add hard prompt variant
- saas012: add hard prompt variant
- saas015: add compile+grep check — write failing test if int_support_sla doesn't reference sla_response_targets

Seeds for saas007/008/010 need regeneration via sage --seed run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…check

Agent created sla_priority_config.csv with 'standard' priority and joined
using CASE WHEN ... ELSE 'standard' END, which maps 'low' tickets to
'standard' and produces numerically identical output — equality test cannot
distinguish. Compile int_support_sla and fail if compiled SQL contains
'standard' but not 'low': correct solution joins directly on t.priority so
neither literal appears; agent's normalization approach contains 'standard'.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…seed and SQL

Two checks in test_setup:
1. grep any non-solution__ seed file for 'standard' (catches agents that
   create sla_priority_config.csv with 'standard' instead of 'low')
2. grep compiled int_support_sla for literal 'standard' in SQL (catches
   agents that normalize via CASE WHEN ... ELSE 'standard' END)

Both must be absent for the task to pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
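The two checks above can be sketched in Python as follows. This is an approximation of the grep-based test_setup, not its actual implementation: seed files are passed in as a name-to-contents mapping, and the SQL check looks for the quoted literal `'standard'`. The sample seed and SQL contents are made up.

```python
def sla_normalisation_failures(seed_files, compiled_sql, forbidden="standard"):
    """Return a list of reasons the task should fail; empty means both checks pass."""
    failures = []
    # Check 1: any non-solution seed that mentions the forbidden priority value.
    for name, text in seed_files.items():
        if name.startswith("solution__"):
            continue
        if forbidden in text:
            failures.append(f"seed {name} contains {forbidden!r}")
    # Check 2: the compiled model contains the forbidden value as a SQL literal.
    if f"'{forbidden}'" in compiled_sql:
        failures.append(f"compiled SQL contains literal '{forbidden}'")
    return failures


# Made-up inputs, for illustration only.
agent_seeds = {"sla_priority_config.csv": "priority,target_hours\nstandard,24\n"}
agent_sql = "case when t.priority = 'high' then 'high' else 'standard' end"
direct_seeds = {"sla_response_targets.csv": "priority,target_hours\nlow,24\n"}
direct_sql = "join targets s on s.priority = t.priority"

agent_failures = sla_normalisation_failures(agent_seeds, agent_sql)
direct_failures = sla_normalisation_failures(direct_seeds, direct_sql)
print(agent_failures, direct_failures)
```

A direct join on `t.priority` trips neither check, while the normalising approach trips both.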
- saas007: regenerate 3 seeds with '-' separator (was ' / ')
- saas008: add solution__stg_accounts.csv + regenerate affected seeds
- saas010: add solution__int_workspace_roster/daily_metrics/support_sla.csv

All 4 tasks (including saas007.no_location_hint) pass 38/38 tests with sage.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@joellabes joellabes marked this pull request as ready for review March 30, 2026 01:57
joellabes and others added 2 commits March 30, 2026 15:39
…_usd pre-exposed

All four raw formula ingredients (billing_cycle, contracted_seats, discount_pct,
list_price_usd) are already present in mart_account_360, making inline
recalculation maximally tempting. Correct solution still reuses
effective_monthly_value_usd from int_account_billing_snapshot.

Setup patch adds list_price_usd to both int_account_billing_snapshot and
mart_account_360. Same compile check as saas006.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…made it annoying

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@joellabes joellabes merged commit 3ffb1c1 into main Mar 30, 2026
9 checks passed
@joellabes joellabes deleted the feature/ops-pilot-shared-project branch March 30, 2026 06:37