
Commit 998da0f

pytest/vitest/jest docs (#623)
Co-authored-by: jacoblee93 <[email protected]>
1 parent 089271c commit 998da0f

12 files changed (+1013 / -347 lines)

docs/evaluation/concepts/index.mdx

+48 -14
@@ -1,20 +1,17 @@
# Evaluation concepts

-The pace of AI application development is often limited by high-quality evaluations.
-Evaluations are methods designed to assess the performance and capabilities of AI applications.
+The quality and development speed of AI applications are often limited by the availability of high-quality evaluation datasets and metrics, which enable you to both optimize and test your applications.

-Good evaluations make it easy to iteratively improve prompts, select models, test architectures, and ensure that deployed applications continue to perform as expected.
LangSmith makes building high-quality evaluations easy.
+This guide explains the LangSmith evaluation framework and AI evaluation techniques more broadly.
+The building blocks of the LangSmith framework are:

-This guide explains the key concepts behind the LangSmith evaluation framework and evaluations for AI applications more broadly.
-The core components of LangSmith evaluations are:
-
-- [**Datasets**:](/evaluation/concepts#datasets) Collections of test inputs and, optionally, reference outputs for your applications.
-- [**Evaluators**](/evaluation/concepts#evaluators): Functions for scoring the outputs generated by applications given dataset inputs.
+- [**Datasets**](/evaluation/concepts#datasets): Collections of test inputs and reference outputs.
+- [**Evaluators**](/evaluation/concepts#evaluators): Functions for scoring outputs.

## Datasets

-A dataset contains a collection of examples used for evaluating an application.
+A dataset is a collection of examples used for evaluating an application. An example is a test input and reference output pair.

![Dataset](./static/dataset_concept.png)

@@ -141,7 +138,7 @@ Learn [how run pairwise evaluations](/evaluation/how_to_guides/evaluate_pairwise
## Experiment

Each time we evaluate an application on a dataset, we are conducting an experiment.
-An experiment is a single execution of the example inputs in your dataset through your application.
+An experiment contains the results of running a specific version of your application on the dataset.
Typically, we will run multiple experiments on a given dataset, testing different configurations of our application (e.g., different prompts or LLMs).
In LangSmith, you can easily view all the experiments associated with your dataset.
Additionally, you can [compare multiple experiments in a comparison view](/evaluation/how_to_guides/compare_experiment_results).
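As a sketch of what running an experiment looks like in code, the snippet below runs a target function and a simple custom evaluator over a dataset using the SDK's `evaluate` helper. The target, dataset name, and evaluator logic are illustrative, and the exact import path may differ across SDK versions.

```python
# Minimal sketch: each call to evaluate() records one experiment on the dataset.
# Assumes LANGSMITH_API_KEY is set.
from langsmith.evaluation import evaluate

def my_app(inputs: dict) -> dict:
    # Placeholder target: call your real application (prompt + LLM, chain, agent, ...).
    return {"answer": "Stub answer to: " + inputs["question"]}

def contains_reference(run, example) -> dict:
    # Custom evaluator: crude check that the reference answer appears in the output.
    predicted = (run.outputs or {}).get("answer", "")
    reference = (example.outputs or {}).get("answer", "")
    return {"key": "contains_reference", "score": int(reference.lower() in predicted.lower())}

# Running this twice with different prompts/models (and experiment prefixes)
# gives you two experiments to compare side by side in LangSmith.
evaluate(
    my_app,
    data="qa-smoke-test",              # dataset name (illustrative)
    evaluators=[contains_reference],
    experiment_prefix="baseline",
)
```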
@@ -224,6 +221,42 @@ Online evaluators are generally intended to be run server-side. LangSmith has bu

![Online](./static/online.png)

+## Testing
+
+### Evaluations vs testing
+
+Testing and evaluation are closely related, overlapping concepts that are often confused.
+
+**An evaluation measures performance according to one or more metrics.**
+Evaluation metrics can be fuzzy or subjective, and are more useful in relative terms than absolute ones.
+That is, they're often used to compare two systems against each other rather than to assert something about an individual system.
+
+**Testing asserts correctness.**
+A system can only be deployed if it passes all tests.
+
+Evaluation metrics can be *turned into* tests.
+For example, you can write regression tests to assert that any new version of a system must outperform some baseline version on the relevant evaluation metrics.
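A minimal sketch of that kind of regression test is below. The `run_app` helper and the keyword-overlap metric are illustrative placeholders, not part of any SDK.

```python
# Hypothetical regression test: the candidate version must score at least as well
# as the baseline on a small evaluation set. run_app() is a placeholder for
# invoking a specific version of your application.
import statistics

EXAMPLES = [
    {"question": "What does LangSmith help with?", "reference": "tracing and evaluating LLM applications"},
    # ... more examples
]

def run_app(question: str, version: str) -> str:
    raise NotImplementedError("Call the given version of your application here.")

def overlap(output: str, reference: str) -> float:
    # Toy metric: fraction of reference words that appear in the output.
    words = reference.lower().split()
    return sum(word in output.lower() for word in words) / len(words)

def mean_score(version: str) -> float:
    return statistics.mean(
        overlap(run_app(ex["question"], version), ex["reference"]) for ex in EXAMPLES
    )

def test_candidate_does_not_regress():
    assert mean_score("candidate") >= mean_score("baseline")
```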
+
+It can also be more resource-efficient to run tests and evaluations together if your system is expensive to run and you have overlapping datasets for your tests and evaluations.
+
+You can also choose to write evaluations using standard software testing tools like `pytest` or `vitest/jest` out of convenience.
+
+### Using `pytest` and `vitest/jest`
+
+The LangSmith SDKs come with integrations for [pytest](./how_to_guides/pytest) and [`vitest/jest`](./how_to_guides/vitest_jest).
+These make it easy to:
+- Track test results in LangSmith
+- Write evaluations as tests
+
+Tracking test results in LangSmith makes it easy to share results, compare systems, and debug failing tests.
+
+Writing evaluations as tests can be useful when each example you want to evaluate on requires custom logic for running the application and/or evaluators.
+The standard evaluation flows assume that you can run your application and evaluators in the same way on every example in a dataset.
+But for more complex systems or comprehensive evals, you may want to evaluate specific subsets of your system with specific types of inputs and metrics.
+These types of heterogeneous evals are much easier to write as a suite of distinct test cases that all get tracked together rather than using the standard evaluate flow.
+
+Using testing tools is also helpful when you want to *both* evaluate your system's outputs *and* assert some basic things about them.
+
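As a rough illustration of the pytest integration described above, the sketch below logs a test case and a fuzzy metric to LangSmith while also asserting basic correctness. The `@pytest.mark.langsmith` marker and the `langsmith.testing` helpers follow the linked pytest guide; names may differ by SDK version, so treat the exact API as an assumption and check the guide.

```python
# Sketch: an eval written as a pytest test and tracked in LangSmith (beta integration).
import pytest
from langsmith import testing as t

def generate_sql(question: str) -> str:
    # Placeholder for your application logic.
    return "SELECT COUNT(*) FROM users;"

@pytest.mark.langsmith  # logs this test case and its results to LangSmith
def test_count_query_is_valid_sql():
    question = "How many users signed up?"
    t.log_inputs({"question": question})

    sql = generate_sql(question)
    t.log_outputs({"sql": sql})

    # Log a fuzzy evaluation metric for this example...
    t.log_feedback(key="mentions_count", score=int("COUNT" in sql.upper()))

    # ...and also assert basic correctness, like any other test.
    assert sql.strip().upper().startswith("SELECT")
```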
## Application-specific techniques

Below, we will discuss evaluation of a few specific, popular LLM applications.
@@ -348,13 +381,13 @@ Summarization is one specific type of free-form writing. The evaluation aim is t
| Faithfulness | Is the summary grounded in the source documents (e.g., no hallucinations)? | No | Yes - [prompt](https://smith.langchain.com/hub/langchain-ai/summary-hallucination-evaluator) | Yes |
| Helpfulness | Is summary helpful relative to user need? | No | Yes - [prompt](https://smith.langchain.com/hub/langchain-ai/summary-helpfulness-evaluator) | Yes |

-### Classification / Tagging
+### Classification and tagging

-Classification / Tagging applies a label to a given input (e.g., for toxicity detection, sentiment analysis, etc). Classification / Tagging evaluation typically employs the following components, which we will review in detail below:
+Classification and tagging apply a label to a given input (e.g., for toxicity detection, sentiment analysis, etc.). Classification/tagging evaluation typically employs the following components, which we will review in detail below:

-A central consideration for Classification / Tagging evaluation is whether you have a dataset with `reference` labels or not. If not, users frequently want to define an evaluator that uses criteria to apply label (e.g., toxicity, etc) to an input (e.g., text, user-question, etc). However, if ground truth class labels are provided, then the evaluation objective is focused on scoring a Classification / Tagging chain relative to the ground truth class label (e.g., using metrics such as precision, recall, etc).
+A central consideration for classification/tagging evaluation is whether you have a dataset with `reference` labels or not. If not, users frequently want to define an evaluator that uses criteria to apply a label (e.g., toxicity, etc.) to an input (e.g., text, a user question, etc.). However, if ground truth class labels are provided, then the evaluation objective is focused on scoring a classification/tagging chain relative to the ground truth class labels (e.g., using metrics such as precision, recall, etc.).

-If ground truth reference labels are provided, then it's common to simply define a [custom heuristic evaluator](./how_to_guides/custom_evaluator) to compare ground truth labels to the chain output. However, it is increasingly common given the emergence of LLMs simply use `LLM-as-judge` to perform the Classification / Tagging of an input based upon specified criteria (without a ground truth reference).
+If ground truth reference labels are provided, then it's common to simply define a [custom heuristic evaluator](./how_to_guides/custom_evaluator) to compare ground truth labels to the chain output. However, given the emergence of LLMs, it is increasingly common to simply use `LLM-as-judge` to perform the classification/tagging of an input based upon specified criteria (without a ground truth reference).

`Online` or `Offline` evaluation is feasible when using `LLM-as-judge` with the `Reference-free` prompt used. In particular, this is well suited to `Online` evaluation when a user wants to tag / classify application input (e.g., for toxicity, etc).
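To make the ground-truth case concrete, here is a minimal sketch of a heuristic evaluator plus an experiment-level summary evaluator. The `label` field names are illustrative, and the `(run, example)` and `(runs, examples)` signatures follow the custom evaluator guide linked above; double-check them against the current SDK. These could be passed to the standard evaluate flow (e.g. as `evaluators=[exact_match]` and `summary_evaluators=[toxic_recall]`) or reused inside the pytest/vitest integrations described earlier.

```python
# Sketch: heuristic evaluators for classification/tagging with reference labels.
# The "label" field is illustrative and depends on your dataset schema.

def exact_match(run, example) -> dict:
    # Per-example metric: 1 if the predicted label matches the reference label.
    predicted = (run.outputs or {}).get("label")
    expected = (example.outputs or {}).get("label")
    return {"key": "exact_match", "score": int(predicted == expected)}

def toxic_recall(runs, examples) -> dict:
    # Summary metric over the whole experiment: recall for the "toxic" class.
    relevant = retrieved = 0
    for run, example in zip(runs, examples):
        if (example.outputs or {}).get("label") == "toxic":
            relevant += 1
            if (run.outputs or {}).get("label") == "toxic":
                retrieved += 1
    return {"key": "toxic_recall", "score": retrieved / relevant if relevant else 0.0}
```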

@@ -363,3 +396,4 @@ If ground truth reference labels are provided, then it's common to simply define
| Accuracy | Standard definition | Yes | No | No |
| Precision | Standard definition | Yes | No | No |
| Recall | Standard definition | Yes | No | No |
+

docs/evaluation/how_to_guides/index.md

+4 -3
@@ -45,11 +45,12 @@ Evaluate and improve your application before deploying it.
- [Print detailed logs (Python only)](../../observability/how_to_guides/tracing/output_detailed_logs)
- [Run an evaluation locally (beta, Python only)](./how_to_guides/local)

-## Unit testing
+## Testing integrations

-Unit test your system to identify bugs and regressions.
+Run evals using your favorite testing tools:

-- [Unit test applications (Python only)](./how_to_guides/unit_testing)
+- [Run evals with pytest (beta)](./how_to_guides/pytest)
+- [Run evals with Vitest/Jest (beta)](./how_to_guides/vitest_jest)

## Online evaluation
