docs/evaluation/concepts/index.mdx (+48 -14)
@@ -1,20 +1,17 @@
# Evaluation concepts

-The pace of AI application development is often limited by high-quality evaluations.
-Evaluations are methods designed to assess the performance and capabilities of AI applications.
+The quality and development speed of AI applications are often limited by high-quality evaluation datasets and metrics, which enable you to both optimize and test your applications.

-Good evaluations make it easy to iteratively improve prompts, select models, test architectures, and ensure that deployed applications continue to perform as expected.
LangSmith makes building high-quality evaluations easy.
+This guide explains the LangSmith evaluation framework and AI evaluation techniques more broadly.
+The building blocks of the LangSmith framework are:

-This guide explains the key concepts behind the LangSmith evaluation framework and evaluations for AI applications more broadly.
-The core components of LangSmith evaluations are:
-
-- [**Datasets**:](/evaluation/concepts#datasets) Collections of test inputs and, optionally, reference outputs for your applications.
-- [**Evaluators**](/evaluation/concepts#evaluators): Functions for scoring the outputs generated by applications given dataset inputs.
+- [**Datasets**](/evaluation/concepts#datasets): Collections of test inputs and reference outputs.
+- [**Evaluators**](/evaluation/concepts#evaluators): Functions for scoring outputs.

## Datasets

-A dataset contains a collection of examples used for evaluating an application.
+A dataset is a collection of examples used for evaluating an application. An example is a pair of a test input and a reference output.
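For concreteness, here is a minimal sketch of creating a dataset of examples with the Python SDK. The dataset name and example fields are placeholders, and the exact `create_examples` signature can differ between SDK versions.

```python
from langsmith import Client

client = Client()

# Create a dataset to hold evaluation examples (name and description are placeholders).
dataset = client.create_dataset(
    "qa-example-dataset",
    description="Test inputs and reference outputs for a Q&A app.",
)

# Each example pairs a test input with a reference output.
client.create_examples(
    inputs=[
        {"question": "What is LangSmith?"},
        {"question": "What does an evaluator return?"},
    ],
    outputs=[
        {"answer": "A platform for tracing and evaluating LLM applications."},
        {"answer": "A score (and optionally a key and comment) for a run."},
    ],
    dataset_id=dataset.id,
)
```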
@@ -141,7 +138,7 @@ Learn [how to run pairwise evaluations](/evaluation/how_to_guides/evaluate_pairwise
## Experiment

Each time we evaluate an application on a dataset, we are conducting an experiment.
-An experiment is a single execution of the example inputs in your dataset through your application.
+An experiment contains the results of running a specific version of your application on the dataset.
Typically, we will run multiple experiments on a given dataset, testing different configurations of our application (e.g., different prompts or LLMs).
In LangSmith, you can easily view all the experiments associated with your dataset.
Additionally, you can [compare multiple experiments in a comparison view](/evaluation/how_to_guides/compare_experiment_results).
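As an illustration, here is a minimal sketch of kicking off an experiment with the Python SDK's `evaluate` entry point. The application, dataset name, and evaluator are placeholders, and the supported evaluator signatures vary between SDK versions.

```python
from langsmith import evaluate  # in older SDK versions: from langsmith.evaluation import evaluate


# Placeholder application under test; in practice this calls your real app.
def my_app(inputs: dict) -> dict:
    return {"answer": f"Echo: {inputs['question']}"}


# A simple custom evaluator that scores each output against the reference output.
# Recent Python SDKs accept evaluators with arguments named like these; older
# versions use a (run, example) signature instead.
def exact_match(outputs: dict, reference_outputs: dict) -> dict:
    return {"key": "exact_match", "score": outputs["answer"] == reference_outputs["answer"]}


# Each call to evaluate() produces one experiment on the dataset.
results = evaluate(
    my_app,
    data="qa-example-dataset",  # placeholder dataset name
    evaluators=[exact_match],
    experiment_prefix="baseline-prompt",
)
```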
@@ -224,6 +221,42 @@ Online evaluators are generally intended to be run server-side. LangSmith has bu
+## Testing
+
+### Evaluations vs testing
+
+Testing and evaluation are closely related, overlapping concepts that often get confused.
+
+**An evaluation measures performance according to one or more metrics.**
+Evaluation metrics can be fuzzy or subjective, and are more useful in relative terms than absolute ones.
+That is, they're often used to compare two systems against each other rather than to assert something about an individual system.
+
+**Testing asserts correctness.**
+A system can only be deployed if it passes all tests.
+
+Evaluation metrics can be *turned into* tests.
+For example, you can write regression tests to assert that any new version of a system must outperform some baseline version on the relevant evaluation metrics.
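A minimal sketch of such a regression test with plain `pytest`, assuming you have some way to run the candidate evaluation and to load the recorded baseline score; both helpers below are placeholders.

```python
# Placeholder: run the evaluation for the candidate version of the system and
# return per-example metric results. In practice this would call your real
# evaluation flow.
def run_candidate_experiment() -> list[dict]:
    return [{"correctness": i % 10 != 0} for i in range(20)]


# Placeholder: load the aggregate score recorded for the baseline version.
def load_baseline_score(metric: str) -> float:
    return 0.85


def test_new_version_beats_baseline():
    results = run_candidate_experiment()
    candidate_score = sum(r["correctness"] for r in results) / len(results)
    # Regression gate: fail the test suite if the candidate falls below the baseline.
    assert candidate_score >= load_baseline_score("correctness")
```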
+
+It can also be more resource-efficient to run tests and evaluations together if your system is expensive to run and you have overlapping datasets for your tests and evaluations.
+
+You can also choose to write evaluations using standard software testing tools like `pytest` or `vitest/jest` out of convenience.
+
+### Using `pytest` and `vitest/jest`
+
+The LangSmith SDKs come with integrations for [pytest](./how_to_guides/pytest) and [`vitest/jest`](./how_to_guides/vitest_jest).
+These make it easy to:
+- Track test results in LangSmith
+- Write evaluations as tests
+
+Tracking test results in LangSmith makes it easy to share results, compare systems, and debug failing tests.
+
+Writing evaluations as tests can be useful when each example you want to evaluate on requires custom logic for running the application and/or evaluators.
+The standard evaluation flows assume that you can run your application and evaluators in the same way on every example in a dataset.
+But for more complex systems or comprehensive evals, you may want to evaluate specific subsets of your system with specific types of inputs and metrics.
+These types of heterogeneous evals are much easier to write as a suite of distinct test cases that all get tracked together rather than using the standard evaluate flow.
+
+Using testing tools is also helpful when you want to *both* evaluate your system's outputs *and* assert some basic things about them.
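A rough sketch of what an evaluation written as a pytest test can look like, assuming the `@pytest.mark.langsmith` marker and the `langsmith.testing` logging helpers described in the linked pytest guide; the application under test and the field names are placeholders, and helper names may differ between SDK versions.

```python
import pytest
from langsmith import testing as t


def generate_sql(question: str) -> str:
    """Placeholder for the application under test."""
    return "SELECT * FROM users;"


@pytest.mark.langsmith  # track this test case as part of a LangSmith experiment
def test_sql_generation_selects_all_users():
    question = "Return all users"
    t.log_inputs({"question": question})

    sql = generate_sql(question)
    t.log_outputs({"sql": sql})
    t.log_reference_outputs({"sql": "SELECT * FROM users;"})

    # Log a metric for the experiment *and* assert something basic for CI.
    t.log_feedback(key="exact_match", score=sql == "SELECT * FROM users;")
    assert "users" in sql.lower()
```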
## Application-specific techniques

Below, we will discuss evaluation of a few specific, popular LLM applications.
@@ -348,13 +381,13 @@ Summarization is one specific type of free-form writing. The evaluation aim is t
| Faithfulness | Is the summary grounded in the source documents (e.g., no hallucinations)? | No | Yes - [prompt](https://smith.langchain.com/hub/langchain-ai/summary-hallucination-evaluator) | Yes |
| Helpfulness | Is the summary helpful relative to the user's need? | No | Yes - [prompt](https://smith.langchain.com/hub/langchain-ai/summary-helpfulness-evaluator) | Yes |

-### Classification / Tagging
+### Classification and tagging

-Classification / Tagging applies a label to a given input (e.g., for toxicity detection, sentiment analysis, etc). Classification / Tagging evaluation typically employs the following components, which we will review in detail below:
+Classification and tagging apply a label to a given input (e.g., for toxicity detection, sentiment analysis, etc.). Classification/tagging evaluation typically employs the following components, which we will review in detail below:

-A central consideration for Classification / Tagging evaluation is whether you have a dataset with `reference` labels or not. If not, users frequently want to define an evaluator that uses criteria to apply label (e.g., toxicity, etc) to an input (e.g., text, user-question, etc). However, if ground truth class labels are provided, then the evaluation objective is focused on scoring a Classification / Tagging chain relative to the ground truth class label (e.g., using metrics such as precision, recall, etc).
+A central consideration for classification/tagging evaluation is whether you have a dataset with `reference` labels or not. If not, users frequently want to define an evaluator that uses criteria to apply a label (e.g., toxicity, etc.) to an input (e.g., text, a user question, etc.). However, if ground truth class labels are provided, then the evaluation objective is to score a classification/tagging chain relative to the ground truth class label (e.g., using metrics such as precision, recall, etc.).

-If ground truth reference labels are provided, then it's common to simply define a [custom heuristic evaluator](./how_to_guides/custom_evaluator) to compare ground truth labels to the chain output. However, it is increasingly common given the emergence of LLMs simply use `LLM-as-judge` to perform the Classification / Tagging of an input based upon specified criteria (without a ground truth reference).
+If ground truth reference labels are provided, then it's common to simply define a [custom heuristic evaluator](./how_to_guides/custom_evaluator) to compare ground truth labels to the chain output. However, given the emergence of LLMs, it is increasingly common to simply use `LLM-as-judge` to perform the classification/tagging of an input based upon specified criteria (without a ground truth reference).

`Online` or `Offline` evaluation is feasible when using `LLM-as-judge` with a `Reference-free` prompt. In particular, this is well suited to `Online` evaluation when a user wants to tag/classify application input (e.g., for toxicity, etc.).
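A minimal sketch of the custom heuristic evaluator mentioned above for the case where ground truth labels exist: an exact-match comparison of the predicted label against the reference label. The `label` field name and the evaluator signature are illustrative and depend on how your dataset and application outputs are structured.

```python
# A minimal custom heuristic evaluator for classification/tagging when
# ground-truth labels exist. The "label" field name is a placeholder for
# however your dataset and application outputs are structured.
def correct_label(outputs: dict, reference_outputs: dict) -> dict:
    predicted = outputs["label"]
    expected = reference_outputs["label"]
    return {"key": "correct_label", "score": predicted == expected}


# Example usage with in-memory dicts; in practice this function would be
# passed to the evaluation flow as one of the evaluators.
print(correct_label({"label": "toxic"}, {"label": "toxic"}))  # {'key': 'correct_label', 'score': True}
```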
@@ -363,3 +396,4 @@ If ground truth reference labels are provided, then it's common to simply define
| Accuracy | Standard definition | Yes | No | No |
| Precision | Standard definition | Yes | No | No |