tutorial for testing framework #631
Conversation
I think this is missing a bit of interactivity. Right now it's very much just "here's how you add some test cases".
Some ideas:
- What do I do with results?
- How do I improve my app (could just show how tweaking/improving system prompt or some tool description makes some metric increase)
- How should I think about coming up with new evaluation cases/metrics?
- When/how often should I run tests?
- Would be nice to have a section where you mock out an LLM call in your evaluator or app if you want to, e.g., run the tests on every local change (see the sketch below)
- Should also encourage people to click on the experiment link and explore the results there
- Share with others on team?
I think it would be nice to make the app imperfect at first, then add evals to show it's imperfect, then improve the app, then show the improvement using evals.
The velocity and feedback loop is the real value over other tutorials that use `evaluate`.
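On the mocking idea above, here's a rough sketch of what that section could show, assuming the LangSmith pytest integration (`@pytest.mark.langsmith` plus `langsmith.testing`); `my_agent`, `call_model`, and `route_question` are hypothetical names standing in for the app under test, not anything from the tutorial:

```python
# Sketch: stub out the model call so the suite can run on every local change
# without hitting a real LLM. `my_agent`, `call_model`, and `route_question`
# are placeholder names for the app under test.
from unittest.mock import patch

import pytest
from langsmith import testing as t


@pytest.mark.langsmith  # assumes the LangSmith pytest plugin is installed
def test_routes_weather_question_to_search_tool():
    question = "What's the weather in SF right now?"
    fake_tool_call = {"tool": "search", "args": {"query": "weather in SF"}}

    # Replace the real LLM call with a canned response.
    with patch("my_agent.call_model", return_value=fake_tool_call):
        from my_agent import route_question

        result = route_question(question)

    t.log_inputs({"question": question})
    t.log_outputs({"tool": result["tool"]})
    assert result["tool"] == "search"
```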
## Write tests

Now that we have defined our agent, let's write a few tests to ensure basic functionality.
In this tutorial we are going to test whether the agent's tool calling abilities are working,
I'd structure this as:
In this tutorial we are going to test:
- The agent can ignore irrelevant questions.
- The agent's tool-calling abilities.
- The agent can answer complex questions that involve using all of the tools.
- The agent's answer is grounded in search results, using an LLM-as-a-judge evaluator.
Would also make sure that the order here matches the tests below.
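For reference, a minimal sketch of what the tool-calling test could look like with the LangSmith pytest integration; the `my_agent` import, the agent's `messages` state shape, and the `search` tool name are assumptions about the tutorial's app, not taken from it:

```python
import pytest
from langsmith import testing as t

from my_agent import agent  # hypothetical import of the agent built earlier


@pytest.mark.langsmith  # log this test run as a LangSmith experiment
def test_agent_uses_search_tool():
    question = "What is the current population of Tokyo?"
    t.log_inputs({"question": question})

    # Assumes a LangGraph-style agent whose state carries a `messages` list.
    result = agent.invoke({"messages": [("user", question)]})
    tool_calls = [
        call["name"]
        for message in result["messages"]
        for call in getattr(message, "tool_calls", []) or []
    ]

    t.log_outputs({"tool_calls": tool_calls})
    assert "search" in tool_calls  # the agent should have reached for search
```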
### Test 4: LLM as a judge

For LLM as a judge, we are going to ensure that the agent's answer is grounded in the search results.
In order to trace the LLM-as-a-judge call separately from our agent, we will use the LangSmith-provided `trace_feedback` context manager.
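Roughly, the judge could look like the sketch below: a second model grades whether the agent's answer is supported by its retrieved sources, and the call is wrapped in `trace_feedback()` so it is traced as feedback rather than as part of the agent run. The `run_agent_with_sources` helper, the judge prompt, and the model name are assumptions for illustration only:

```python
import pytest
from openai import OpenAI

from langsmith import testing as t
from langsmith.wrappers import wrap_openai

from my_agent import run_agent_with_sources  # hypothetical helper returning (answer, sources)

judge_client = wrap_openai(OpenAI())  # wrapping the client also traces the judge call


@pytest.mark.langsmith
def test_answer_is_grounded_in_search_results():
    question = "Who won the 2023 Women's World Cup?"
    t.log_inputs({"question": question})

    answer, sources = run_agent_with_sources(question)
    t.log_outputs({"answer": answer})

    # Trace the LLM-as-a-judge call separately from the agent run.
    with t.trace_feedback():
        verdict = judge_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    "Is the ANSWER fully supported by the SOURCES? Reply YES or NO.\n\n"
                    f"SOURCES:\n{sources}\n\nANSWER:\n{answer}"
                ),
            }],
        )
        grounded = "YES" in verdict.choices[0].message.content.upper()
        t.log_feedback(key="grounded_in_search_results", score=int(grounded))

    assert grounded
```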
Can we link out to the SDK docs for `trace_feedback` and `wrapEvaluator`? I couldn't find them.

To Jacob's point on "What do I do with results?": can we show a screenshot of the results of these tests in LangSmith?
Co-authored-by: Tanushree <[email protected]>
I think it's ok. It's also still just "paste these test cases into your file", which is fine, but I think it would benefit from something like showing how you can use test results to prompt engineer and improve your app.
Co-authored-by: jacoblee93 <[email protected]>
No description provided.