tutorial for testing framework #631
Conversation
I think this is missing a bit of interactivity. Right now it's very much just "here's how you add some test cases".
Some ideas:
- What do I do with results?
- How do I improve my app (could just show how tweaking/improving system prompt or some tool description makes some metric increase)
- How should I think about coming up with new evaluation cases/metrics?
- When/how often should I run tests?
- Would be nice to have a section where you mock out an LLM call in your evaluator or app if you want to, e.g., run the tests on every local change (see the sketch below)
- Should also encourage people to click on the experiment link and explore the results there
- Share with others on team?
I think it would be nice to make the app imperfect at first, then add evals to show it's imperfect, then improve the app, then show the improvement using evals.
The velocity and feedback loop is the real value over other tutorials that use `evaluate`.
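On the mocking idea above, here's a rough sketch of what that section could show, assuming the LangSmith pytest integration (`@pytest.mark.langsmith` plus `langsmith.testing`); `my_agent`, `call_model`, and `route_question` are hypothetical names standing in for the app under test, not anything from the tutorial:

```python
# Sketch: stub out the model call so the suite can run on every local change
# without hitting a real LLM. `my_agent`, `call_model`, and `route_question`
# are placeholder names for the app under test.
from unittest.mock import patch

import pytest
from langsmith import testing as t


@pytest.mark.langsmith  # assumes the LangSmith pytest plugin is installed
def test_routes_weather_question_to_search_tool():
    question = "What's the weather in SF right now?"
    fake_tool_call = {"tool": "search", "args": {"query": "weather in SF"}}

    # Replace the real LLM call with a canned response.
    with patch("my_agent.call_model", return_value=fake_tool_call):
        from my_agent import route_question

        result = route_question(question)

    t.log_inputs({"question": question})
    t.log_outputs({"tool": result["tool"]})
    assert result["tool"] == "search"
```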
## Write tests

Now that we have defined our agent, let's write a few tests to ensure basic functionality.
In this tutorial we are going to test whether the agent's tool calling abilities are working,
I'd structure this as:
In this tutorial we are going to test:
- The agent can ignore irrelevant questions.
- The agent's tool-calling abilities.
- The agent can answer complex questions that involve using all of the tools.
- The agent's answer is grounded in search results, using an LLM-as-a-judge evaluator.
Would also make sure that the order here matches the tests below.
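For reference, a minimal sketch of what the tool-calling test could look like with the LangSmith pytest integration; the `my_agent` import, the agent's `messages` state shape, and the `search` tool name are assumptions about the tutorial's app, not taken from it:

```python
import pytest
from langsmith import testing as t

from my_agent import agent  # hypothetical import of the agent built earlier


@pytest.mark.langsmith  # log this test run as a LangSmith experiment
def test_agent_uses_search_tool():
    question = "What is the current population of Tokyo?"
    t.log_inputs({"question": question})

    # Assumes a LangGraph-style agent whose state carries a `messages` list.
    result = agent.invoke({"messages": [("user", question)]})
    tool_calls = [
        call["name"]
        for message in result["messages"]
        for call in getattr(message, "tool_calls", []) or []
    ]

    t.log_outputs({"tool_calls": tool_calls})
    assert "search" in tool_calls  # the agent should have reached for search
```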
### Test 4: LLM as a judge

For LLM as a judge, we are going to ensure that the agent's answer is grounded in the search results.
In order to trace the LLM-as-a-judge call separately from our agent, we will use the LangSmith-provided `trace_feedback` context manager.
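Roughly, the judge could look like the sketch below: a second model grades whether the agent's answer is supported by its retrieved sources, and the call is wrapped in `trace_feedback()` so it is traced as feedback rather than as part of the agent run. The `run_agent_with_sources` helper, the judge prompt, and the model name are assumptions for illustration only:

```python
import pytest
from openai import OpenAI

from langsmith import testing as t
from langsmith.wrappers import wrap_openai

from my_agent import run_agent_with_sources  # hypothetical helper returning (answer, sources)

judge_client = wrap_openai(OpenAI())  # wrapping the client also traces the judge call


@pytest.mark.langsmith
def test_answer_is_grounded_in_search_results():
    question = "Who won the 2023 Women's World Cup?"
    t.log_inputs({"question": question})

    answer, sources = run_agent_with_sources(question)
    t.log_outputs({"answer": answer})

    # Trace the LLM-as-a-judge call separately from the agent run.
    with t.trace_feedback():
        verdict = judge_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    "Is the ANSWER fully supported by the SOURCES? Reply YES or NO.\n\n"
                    f"SOURCES:\n{sources}\n\nANSWER:\n{answer}"
                ),
            }],
        )
        grounded = "YES" in verdict.choices[0].message.content.upper()
        t.log_feedback(key="grounded_in_search_results", score=int(grounded))

    assert grounded
```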
Can we link out to the SDK docs for `trace_feedback` and `wrapEvaluator`? I couldn't find them.

To Jacob's point on "What do I do with results?": can we show a screenshot of the results of these tests in LangSmith?
Co-authored-by: Tanushree <[email protected]>
I think it's ok. It's also still just "paste these test cases into your file", which is fine, but I think it would benefit from something like showing how you can use test results to prompt engineer and improve your app.
Co-authored-by: jacoblee93 <[email protected]>
No description provided.