Merged
43 commits
0e50137
feat!: initial state of the redesign
constantinius Jan 27, 2026
a8f5a01
feat: add model override system, enhanced reporting, skip configurati…
constantinius Jan 28, 2026
5f511d1
feat: re-added several JS and Py frameworks
constantinius Jan 29, 2026
a08c905
fix: test case setups for simple tests
constantinius Jan 29, 2026
a2f08af
fix: making google-genai an agent framework and fixing litellm
constantinius Jan 29, 2026
9375f9b
feat: adding more options to filter and improve logging
constantinius Jan 29, 2026
158fcb1
feat: add causeAPIError feature for testing error scenarios
constantinius Jan 29, 2026
34e430c
feat: add causeAPIError support to Anthropic, LangChain, and LiteLLM …
constantinius Jan 29, 2026
470259b
feat: implement streaming/blocking(non-streaming) tests
constantinius Jan 29, 2026
2c89496
feat: add streaming support for various LLM frameworks
constantinius Jan 29, 2026
2613d99
fix: templates for various LLM frameworks around multi-turn tests
constantinius Jan 29, 2026
405d264
feat: add command to only render templates
constantinius Jan 29, 2026
488fa7f
fix: rendering tests first before conducting the tests
constantinius Jan 29, 2026
ec9ef48
fix: making the JS framework tests work
constantinius Jan 30, 2026
df839fb
fix: fixing JS templates
constantinius Jan 30, 2026
19310a2
feat: add concurrency
constantinius Jan 30, 2026
062e076
fix: using built-in argparser
constantinius Jan 30, 2026
e6b6785
fix: update default SDK version to test
constantinius Jan 30, 2026
43f308e
fix: better reporting
constantinius Jan 30, 2026
723b1d0
fix: update CLAUDE.md to be representative
constantinius Jan 30, 2026
df56eb2
fix: update Makefile and package.json
constantinius Jan 30, 2026
12ac8f2
feat: add vision test and templates for LLMs
constantinius Jan 30, 2026
60ec4fe
feat: adding vision test for agent frameworks
constantinius Jan 30, 2026
1060b86
feat: add tests for very long messages to be truncated
constantinius Jan 30, 2026
1bf778c
feat: re-arranged agent tests and added new ones
constantinius Jan 30, 2026
eabeea4
fix: framework discovery
constantinius Jan 30, 2026
b372ea4
fix: enhance framework discovery by filtering out incomplete configur…
constantinius Jan 30, 2026
4812e16
fix: adjust action for changes
constantinius Feb 2, 2026
e3d6e53
fix: adjust readme for new layout
constantinius Feb 2, 2026
4b52e62
fix: updating Readme
constantinius Feb 2, 2026
b4de862
feat: making checks re-usable across test cases
constantinius Feb 2, 2026
7e48cfb
feat: better test case structuring
constantinius Feb 2, 2026
66cea7f
fix: update Readme
constantinius Feb 2, 2026
2ffc8b2
fix: renamed checks and improved layout
constantinius Feb 3, 2026
5fba24f
feat: adding tool related checks checkToolCalls, checkAvailableTools,…
constantinius Feb 3, 2026
a82d307
feat: re-add the report generation logic
constantinius Feb 3, 2026
49e8227
feat: add check for binary redaction
constantinius Feb 3, 2026
29106ff
fix: action to use new ctrf file layout
constantinius Feb 3, 2026
776fcf5
chore: update to latest framework version
constantinius Feb 3, 2026
ffab3d5
feat: add mastra as a JS/agent test
constantinius Feb 6, 2026
5b0bc06
docs: cleanup and updated documentation files
constantinius Feb 6, 2026
fa11632
feat: several goodies for report visualization: showing spans, groupi…
constantinius Feb 6, 2026
4839967
chore: documenting the new CLI features
constantinius Feb 6, 2026
11 changes: 7 additions & 4 deletions .env.example
@@ -1,4 +1,7 @@
OPENAI_API_KEY= # get from 1Password
GOOGLE_GENAI_API_KEY= #can be personal since google has a generous free tier>
ANTHROPIC_API_KEY= # get from Claude Console
SENTRY_DSN= # Currently not needed
# API Keys for AI providers
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_GENAI_API_KEY=...

# Sentry Configuration (optional)
SENTRY_DSN=
6 changes: 5 additions & 1 deletion .gitignore
@@ -55,4 +55,8 @@ Thumbs.db
coverage/
.nyc_output/
test-results/
*.log
*.log

# Test execution directory
runs/
.secrets
956 changes: 492 additions & 464 deletions CLAUDE.md

Large diffs are not rendered by default.

45 changes: 26 additions & 19 deletions HUMANS.md
@@ -1,25 +1,27 @@
This repo (hopefully) contains everything needed to test Sentry SDK AI integrations for Python and JavaScript.

Quick start and other goodies can be found in (./README.md).
Quick start and other goodies can be found in [README.md](./README.md).

The entire repo was made with Claude Code, and all of the major changes (like refactorings, adding SDKs, etc.) should be done by an agent. Most directories contain README.md files that the agent is instructed to read and update when needed. `.claude/settings.json` make sure it can't read this file or your `.env`
The entire repo was made with Claude Code, and all of the major changes (like refactorings, adding SDKs, etc.) should be done by an agent. Most directories contain README.md files that the agent is instructed to read and update when needed. `.claude/settings.json` makes sure it can't read this file or your `.env`

### The Gist

- There is a separate project directory for every integration, ensuring they are independent of each other.
- The setup file contains code that should be executed before the tests and, in most cases, contains only Sentry SDK initialization with the correct options and mock transport.
- Every test performs real LLM calls.
- After it is done, the mock transport is used to extract all of the envelopes the SDK would send.
- The validator then extracts relevant spans and checks against the fixture.
- The result is reported in CTRF as JSON, HTML, and printed in the console.
- Test definitions (TypeScript) + framework templates (Nunjucks) = generated test files
- A span collector HTTP server acts as a mock Sentry endpoint
- Tests make real LLM calls and the Sentry SDK captures spans
- The validator runs check functions against captured spans
- Results are reported as CTRF JSON, HTML, and printed to the console

#### What this can do:

Assert that AI integrations:

- correctly initialize
- capture all relevant spans in correct order/hierarchy
- correctly capture available attributes
- correctly capture available attributes (model, tokens, messages, tool calls)
- properly handle streaming vs blocking modes
- properly handle sync vs async execution (Python)
- trim long messages and redact binary content

#### What this can't do:

@@ -30,22 +32,27 @@ Assert that:

### JS vs. Py

Some parts of the test logic are implemented twice (once for JS and once for Python). They can never be exactly the same, but it is vital that they are as close to each other as possible in terms of overall functionality, file names, function names, variable names, etc.
The test cases are defined once in TypeScript and then rendered for each framework using Nunjucks templates. While the templates differ between JS and Python, they aim to produce equivalent behavior. Framework-specific quirks are handled in templates and the `skip` configuration.

### Adding another AI SDK integration

- Should be a matter of prompting an agent to do so.
- Make sure to repeat that it should be consistent with the other SDKs.
- Double-check if it wrote BS tests just to have them pass.
- Make sure it DID NOT change any fixtures or validators to make the tests pass.
- Make sure the package versions are pinned.
- Should be a matter of prompting an agent to do so
- Make sure to repeat that it should be consistent with the other SDKs
- Double-check if it wrote BS tests just to have them pass
- Make sure it DID NOT change any check functions or skip configurations to make the tests pass
- Make sure the package versions are pinned in `config.json`

### Adding more test cases

- The fixture should be written and double-checked by a human.
- The case should be implemented for all SDKs where it makes sense.
- If needed, it can be split into 2 flavors - "agentic" and "low-level."
- The check functions should be written and double-checked by a human
- The case should be implemented for all SDKs where it makes sense
- Test cases are split by type: `llm` (low-level SDKs) and `agent` (agentic frameworks)
- Use existing check functions from `checks.ts` when possible
- Add new checks to `checks.ts` if they're reusable across tests
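
Reviewer sketch (not part of the diff): one way a reusable check could look, assuming a check is a plain function over the captured spans. The `CapturedSpan` and `CheckResult` shapes, the `requireAttribute` and `runChecks` helpers, and the sample attribute keys are all hypothetical illustrations; the actual contents of `checks.ts` are not shown in this PR view.

```typescript
// Hypothetical span shape; the real span collector payload may differ.
interface CapturedSpan {
  op: string;
  description: string;
  attributes: Record<string, unknown>;
}

// Hypothetical result shape for a single check.
interface CheckResult {
  passed: boolean;
  message: string;
}

// A check is a pure function over all captured spans, so it can be
// shared between test cases without knowing which framework produced them.
type Check = (spans: CapturedSpan[]) => CheckResult;

// Builds a check that requires `key` on every span whose op equals `op`.
function requireAttribute(op: string, key: string): Check {
  return (spans) => {
    const relevant = spans.filter((s) => s.op === op);
    if (relevant.length === 0) {
      return { passed: false, message: `no spans with op "${op}"` };
    }
    const missing = relevant.filter((s) => s.attributes[key] === undefined);
    return missing.length === 0
      ? { passed: true, message: `all ${relevant.length} span(s) have "${key}"` }
      : { passed: false, message: `${missing.length} span(s) missing "${key}"` };
  };
}

// Runs every check against the same span list and collects the results.
function runChecks(spans: CapturedSpan[], checks: Check[]): CheckResult[] {
  return checks.map((check) => check(spans));
}

// Sample data only; op names and attribute keys are assumptions.
const exampleSpans: CapturedSpan[] = [
  {
    op: "gen_ai.chat",
    description: "chat example-model",
    attributes: { "gen_ai.request.model": "example-model" },
  },
  { op: "http.client", description: "POST /v1/chat", attributes: {} },
];

const results = runChecks(exampleSpans, [
  requireAttribute("gen_ai.chat", "gen_ai.request.model"),
  requireAttribute("gen_ai.chat", "gen_ai.usage.input_tokens"),
]);
console.log(results);
```

Keeping checks free of framework knowledge is what would make them reusable: a test case then only lists which checks apply, and a human can review each check once.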

### Versioning

- Every SDK has an independent Sentry SDK installation. This can be using the CLI but if you do so, make sure that the bersion is bumped everywhere.
- Every framework has an independent Sentry SDK version specified in `config.json`
- The `sentryVersions` array can include specific versions or `"latest"`
- Framework versions are specified in the `versions` array
- Dependencies use `"framework"` as version to inherit from the framework version
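
Reviewer sketch (not part of the diff): how the versioning rules above might expand into concrete test targets. Only the `versions` and `sentryVersions` arrays and the `"latest"` and `"framework"` values come from the text; the `FrameworkConfig` shape, the `name` and `dependencies` fields, the `expandTargets` helper, and all package names and version numbers are assumptions made for illustration.

```typescript
// Hypothetical view of a framework's config.json.
interface FrameworkConfig {
  name: string;
  versions: string[]; // framework versions to test
  sentryVersions: string[]; // specific versions or "latest"
  dependencies: Record<string, string>; // "framework" inherits the framework version
}

// One concrete combination to install and run.
interface RunTarget {
  framework: string;
  frameworkVersion: string;
  sentryVersion: string;
  dependencies: Record<string, string>;
}

// Expands the config into every framework version x Sentry SDK version pair,
// replacing the "framework" placeholder with the framework version in use.
function expandTargets(config: FrameworkConfig): RunTarget[] {
  const targets: RunTarget[] = [];
  for (const frameworkVersion of config.versions) {
    for (const sentryVersion of config.sentryVersions) {
      const dependencies: Record<string, string> = {};
      for (const pkg of Object.keys(config.dependencies)) {
        const version = config.dependencies[pkg];
        dependencies[pkg] = version === "framework" ? frameworkVersion : String(version);
      }
      targets.push({
        framework: config.name,
        frameworkVersion,
        sentryVersion,
        dependencies,
      });
    }
  }
  return targets;
}

// Sample values only; none of these packages or versions come from the PR.
const exampleConfig: FrameworkConfig = {
  name: "example-framework",
  versions: ["1.2.0", "1.3.0"],
  sentryVersions: ["latest", "9.0.0"],
  dependencies: { "example-framework-core": "framework", "helper-lib": "4.5.6" },
};

const targets = expandTargets(exampleConfig);
console.log(targets);
```

Under these assumptions the two framework versions and two Sentry SDK versions yield four targets, and `example-framework-core` always follows the framework version while `helper-lib` stays pinned.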