feat!: redesign of the AI testing framework #30
Merged
…on, and live status display
- Add causeAPIError flag to TestDefinition interface
- Update base templates (Python/JS) with error handling wrapper
- Implement respx-based API mocking for OpenAI Python template
- Add respx dependency to OpenAI Python config
- Create Basic Error LLM Test case that validates error capture
- Simplify OpenAI Python template (remove verbose debug output)
…Python templates
- Add respx/httpx imports and inject_api_error block to all 3 templates
- Add respx dependency to config.json for each framework
- Simplify templates (remove verbose debug output, use kwargs pattern)
… checkResponseToolCalls and checks for input message schema checkInputMessagesSchema
and fix the one for trimming
Key features
Test cases are now separate from their implementation for each framework integration under test.
Test cases are rendered with templates for a given framework, which significantly improves test coverage across the many variations in how a framework can be used.
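As a rough illustration of rendering one test case through a framework template, here is a minimal sketch. Only the `causeAPIError` flag and the `inject_api_error` block are named in this PR's commits; the other `TestDefinition` fields, the `{{key}}` placeholder syntax, the `run_llm` helper in the template, and the function names are assumptions made up for this example, not the framework's actual API.

```typescript
// Hypothetical shape: only causeAPIError is named in the PR; other fields are assumed.
interface TestDefinition {
  name: string;
  model: string;
  prompt: string;
  causeAPIError?: boolean;
}

// Illustrative Python template. run_llm is a made-up helper, not a real library call.
const PYTHON_TEMPLATE = [
  "# test: {{name}}",
  "{{errorSetup}}",
  "response = run_llm(model={{model}}, prompt={{prompt}})",
].join("\n");

// Replace every {{key}} placeholder; fail loudly if a value is missing.
function renderTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match: string, key: string) => {
    const value = values[key];
    if (value === undefined) {
      throw new Error("missing template value: " + key);
    }
    return value;
  });
}

// Turn one framework-agnostic test case into source code for one framework.
// JSON.stringify is used as a simple stand-in for quoting string literals.
function renderTest(template: string, def: TestDefinition): string {
  return renderTemplate(template, {
    name: def.name,
    model: JSON.stringify(def.model),
    prompt: JSON.stringify(def.prompt),
    errorSetup: def.causeAPIError
      ? "# inject_api_error: mock the API so the call fails"
      : "# no error injection",
  });
}

const rendered = renderTest(PYTHON_TEMPLATE, {
  name: "Basic Error LLM Test",
  model: "gpt-4o-mini",
  prompt: "Say hi",
  causeAPIError: true,
});
console.log(rendered);
```

The same `TestDefinition` could be passed to a different template per framework, which is where the coverage gain described above would come from.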
The testing framework has a new layout that runs everything in TypeScript/JavaScript; only Python test execution uses the Python interpreter. Spans are no longer intercepted in the transport. Instead, a span collector receives them.
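The collector idea can be sketched as a small in-memory store that checks query after a test run. This is a minimal sketch under stated assumptions: the span fields, the `SpanCollector` class, the `gen_ai.chat` operation name, and the `gen_ai.request.model` attribute key are all chosen for illustration and may not match the real implementation, which presumably receives spans from the test process rather than through direct method calls.

```typescript
// Hypothetical span shape; the field names are assumptions.
interface CollectedSpan {
  traceId: string;
  spanId: string;
  op: string;
  data: Record<string, unknown>;
}

// Hypothetical collector: stores every span it receives, in arrival order.
class SpanCollector {
  private spans: CollectedSpan[] = [];

  // Called once for each span reported by the instrumented test process.
  receive(span: CollectedSpan): void {
    this.spans.push(span);
  }

  // All collected spans with the given operation name.
  byOp(op: string): CollectedSpan[] {
    return this.spans.filter((s) => s.op === op);
  }

  // Reset between test cases.
  clear(): void {
    this.spans = [];
  }
}

// A check inspects collected spans and returns a list of failure messages.
// An empty list means all assertions passed.
function checkChatSpanHasModel(collector: SpanCollector): string[] {
  const failures: string[] = [];
  const chatSpans = collector.byOp("gen_ai.chat"); // assumed op name
  if (chatSpans.length === 0) {
    failures.push("no gen_ai.chat span collected");
  }
  for (const span of chatSpans) {
    if (typeof span.data["gen_ai.request.model"] !== "string") {
      failures.push("span " + span.spanId + " has no request model");
    }
  }
  return failures;
}

const collector = new SpanCollector();
collector.receive({
  traceId: "t1",
  spanId: "s1",
  op: "gen_ai.chat",
  data: { "gen_ai.request.model": "gpt-4o-mini" },
});
console.log(checkChatSpanHasModel(collector));
```

Keeping checks as plain functions over collected spans means they do not depend on how the spans were transported.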
Tests are now laid out as follows:
<type>/<platform>/<framework>/<test-case>/<check>, where:
- type: the basic kind of thing under test, currently LLMs (non-agentic) and agents; MCP and embeddings are planned for later
- platform: either JS or Python; this will also be expanded
- framework: the framework integration to test, e.g. openai, anthropic or langgraph
- test-case: the test case to run, a complete scenario with a given purpose
- check: a specific check to run, with one or more assertions
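To make the five-level layout concrete, the sketch below parses such an identifier into a typed object. The exact segment spellings (`llm`, `agent`, `js`, `python`) and the example test-case and check names are assumptions for illustration; the PR only describes the levels, not their literal values.

```typescript
// Assumed spellings for the first two levels; the real directory names may differ.
type TestType = "llm" | "agent";
type Platform = "js" | "python";

interface TestPath {
  type: TestType;
  platform: Platform;
  framework: string;
  testCase: string;
  check: string;
}

const TEST_TYPES: readonly string[] = ["llm", "agent"];
const PLATFORMS: readonly string[] = ["js", "python"];

// Parse "<type>/<platform>/<framework>/<test-case>/<check>" into its parts.
function parseTestPath(path: string): TestPath {
  const parts = path.split("/");
  if (parts.length !== 5 || parts.some((p) => p.length === 0)) {
    throw new Error("expected 5 non-empty segments, got: " + path);
  }
  const [type = "", platform = "", framework = "", testCase = "", check = ""] = parts;
  if (TEST_TYPES.indexOf(type) === -1) {
    throw new Error("unknown type: " + type);
  }
  if (PLATFORMS.indexOf(platform) === -1) {
    throw new Error("unknown platform: " + platform);
  }
  return {
    type: type as TestType,
    platform: platform as Platform,
    framework,
    testCase,
    check,
  };
}

// Hypothetical test-case and check names, used only as an example.
const parsed = parseTestPath("llm/python/openai/basic-error/captures-error");
console.log(parsed.framework, parsed.testCase, parsed.check);
```

An identifier of this shape could also serve as a filter key, for example to run every check for one framework on one platform.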