
Commit 3db1e3a

Merge branch 'main' into isaac/attachmentpaths
2 parents: d3494df + e9ec569

62 files changed: +2869 -652 lines


Makefile (+5 -3)

@@ -10,15 +10,17 @@ build-api-ref:
 	. .venv/bin/activate
 	$(PYTHON) -m pip install --upgrade pip
 	$(PYTHON) -m pip install --upgrade uv
-	cd langsmith-sdk && ../$(PYTHON) -m uv pip install -r python/docs/requirements.txt
+	$(PYTHON) -m uv pip install -r langsmith-sdk/python/docs/requirements.txt
+	$(PYTHON) -m uv pip install -e langsmith-sdk/python
 	$(PYTHON) langsmith-sdk/python/docs/create_api_rst.py
 	LC_ALL=C $(PYTHON) -m sphinx -T -E -b html -d langsmith-sdk/python/docs/_build/doctrees -c langsmith-sdk/python/docs langsmith-sdk/python/docs langsmith-sdk/python/docs/_build/html -j auto
 	$(PYTHON) langsmith-sdk/python/docs/scripts/custom_formatter.py langsmith-sdk/docs/_build/html/
-
+	cd langsmith-sdk/js && yarn && yarn run build:typedoc --useHostedBaseUrlForAbsoluteLinks true --hostedBaseUrl "https://$${VERCEL_URL:-docs.smith.langchain.com}/reference/js/"
 
 vercel-build: install-vercel-deps build-api-ref
 	mkdir -p static/reference/python
 	mv langsmith-sdk/python/docs/_build/html/* static/reference/python/
+	mkdir -p static/reference/js
+	mv langsmith-sdk/js/_build/api_refs/* static/reference/js/
 	rm -rf langsmith-sdk
 	NODE_OPTIONS="--max-old-space-size=5000" yarn run docusaurus build
-

docs/administration/how_to_guides/organization_management/create_account_api_key.mdx (+2 -7)

@@ -34,12 +34,7 @@ The API key will be shown only once, so make sure to copy it and store it in a s
 
 ## Configure the SDK
 
-You may set the following environment variables in addition to `LANGCHAIN_API_KEY` (or equivalently `LANGSMITH_API_KEY`).
+You may set the following environment variables in addition to `LANGSMITH_API_KEY`.
 These are only required if using the EU instance.
 
-:::info
-`LANGCHAIN_HUB_API_URL` is only required if using the legacy langchainhub sdk
-:::
-
-`LANGCHAIN_ENDPOINT=`<RegionalUrl type='api' link={false} />
-`LANGCHAIN_HUB_API_URL=`<RegionalUrl type='hub' link={false} />
+`LANGSMITH_ENDPOINT=`<RegionalUrl type='api' link={false} />
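
With this change, configuring the Python SDK for the EU instance reduces to setting two environment variables before constructing a client. A minimal sketch (the endpoint value below is illustrative; use the regional URL rendered by `<RegionalUrl type='api' />` above):

```python
import os

from langsmith import Client

# Illustrative values only: the key is a placeholder, and the EU endpoint URL
# stands in for whatever the regional docs render for your region.
os.environ["LANGSMITH_API_KEY"] = "<your-api-key>"
os.environ["LANGSMITH_ENDPOINT"] = "https://eu.api.smith.langchain.com"  # EU instance only

client = Client()  # picks the API key and endpoint up from the environment
```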

docs/administration/how_to_guides/organization_management/manage_organization_by_api.mdx (+4 -4)

@@ -174,10 +174,10 @@ import requests
 
 
 def main():
-    api_key = os.environ["LANGCHAIN_API_KEY"]
-    # LANGCHAIN_ORGANIZATION_ID is not a standard environment variable in the SDK, just used for this example
-    organization_id = os.environ["LANGCHAIN_ORGANIZATION_ID"]
-    base_url = os.environ.get("LANGCHAIN_ENDPOINT") or "https://api.smith.langchain.com"
+    api_key = os.environ["LANGSMITH_API_KEY"]
+    # LANGSMITH_ORGANIZATION_ID is not a standard environment variable in the SDK, just used for this example
+    organization_id = os.environ["LANGSMITH_ORGANIZATION_ID"]
+    base_url = os.environ.get("LANGSMITH_ENDPOINT") or "https://api.smith.langchain.com"
     headers = {
         "Content-Type": "application/json",
         "X-API-Key": api_key,

docs/evaluation/concepts/index.mdx (+50 -16)

@@ -1,20 +1,17 @@
 # Evaluation concepts
 
-The pace of AI application development is often limited by high-quality evaluations.
-Evaluations are methods designed to assess the performance and capabilities of AI applications.
+The quality and development speed of AI applications is often limited by high-quality evaluation datasets and metrics, which enable you to both optimize and test your applications.
 
-Good evaluations make it easy to iteratively improve prompts, select models, test architectures, and ensure that deployed applications continue to perform as expected.
 LangSmith makes building high-quality evaluations easy.
+This guide explains the LangSmith evaluation framework and AI evaluation techniques more broadly.
+The building blocks of the LangSmith framework are:
 
-This guide explains the key concepts behind the LangSmith evaluation framework and evaluations for AI applications more broadly.
-The core components of LangSmith evaluations are:
-
-- [**Datasets**:](/evaluation/concepts#datasets) Collections of test inputs and, optionally, reference outputs for your applications.
-- [**Evaluators**](/evaluation/concepts#evaluators): Functions for scoring the outputs generated by applications given dataset inputs.
+- [**Datasets**:](/evaluation/concepts#datasets) Collections of test inputs and reference outputs.
+- [**Evaluators**](/evaluation/concepts#evaluators): Functions for scoring outputs.
 
 ## Datasets
 
-A dataset contains a collection of examples used for evaluating an application.
+A dataset is a collection of examples used for evaluating an application. An example is a test input, reference output pair.
 
 ![Dataset](./static/dataset_concept.png)
 
@@ -101,7 +98,7 @@ There are a number of ways to define and run evaluators:
 - **Custom code**: Define [custom evaluators](/evaluation/how_to_guides/custom_evaluator) as Python or TypeScript functions and run them client-side using the SDKs or server-side via the UI.
 - **Built-in evaluators**: LangSmith has a number of built-in evaluators that you can configure and run via the UI.
 
-You can run evaluators using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python) and TypeScript), via the [Prompt Playground](../../prompt_engineering/concepts#prompt-playground), or by configuring [Rules](../../observability/how_to_guides/monitoring/rules) to automatically run them on particular tracing projects or datasets.
+You can run evaluators using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python) and [TypeScript](https://docs.smith.langchain.com/reference/js)), via the [Prompt Playground](../../prompt_engineering/concepts#prompt-playground), or by configuring [Rules](../../observability/how_to_guides/monitoring/rules) to automatically run them on particular tracing projects or datasets.
 
 #### Evaluation techniques
 
@@ -141,7 +138,7 @@ Learn [how run pairwise evaluations](/evaluation/how_to_guides/evaluate_pairwise
 ## Experiment
 
 Each time we evaluate an application on a dataset, we are conducting an experiment.
-An experiment is a single execution of the example inputs in your dataset through your application.
+An experiment contains the results of running a specific version of your application on the dataset.
 Typically, we will run multiple experiments on a given dataset, testing different configurations of our application (e.g., different prompts or LLMs).
 In LangSmith, you can easily view all the experiments associated with your dataset.
 Additionally, you can [compare multiple experiments in a comparison view](/evaluation/how_to_guides/compare_experiment_results).
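
In SDK terms, each evaluation run over a dataset corresponds to one `evaluate` call, which produces one experiment. A rough sketch, assuming a recent `langsmith` Python SDK; the application, evaluator, and dataset name are hypothetical:

```python
from langsmith.evaluation import evaluate


def my_app(inputs: dict) -> dict:
    # Hypothetical application under test.
    return {"answer": inputs["question"].upper()}


def exact_match(run, example) -> dict:
    # Custom evaluator: compare the run's output to the example's reference output.
    score = run.outputs["answer"] == example.outputs["answer"]
    return {"key": "exact_match", "score": int(score)}


# Each call like this produces one experiment on the dataset in LangSmith.
results = evaluate(
    my_app,
    data="my-dataset",            # hypothetical dataset name
    evaluators=[exact_match],
    experiment_prefix="baseline",
)
```
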
@@ -165,7 +162,7 @@ It is offline because we're evaluating on a pre-compiled set of data.
 An online evaluation, on the other hand, is one in which we evaluate a deployed application's outputs on real traffic, in near realtime.
 Offline evaluations are used for testing a version(s) of your application pre-deployment.
 
-You can run offline evaluations client-side using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python) and TypeScript). You can run them server-side via the [Prompt Playground](../../prompt_engineering/concepts#prompt-playground) or by configuring [automations](/observability/how_to_guides/monitoring/rules) to run certain evaluators on every new experiment against a specific dataset.
+You can run offline evaluations client-side using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python) and [TypeScript](https://docs.smith.langchain.com/reference/js)). You can run them server-side via the [Prompt Playground](../../prompt_engineering/concepts#prompt-playground) or by configuring [automations](/observability/how_to_guides/monitoring/rules) to run certain evaluators on every new experiment against a specific dataset.
 
 ![Offline](./static/offline.png)
 
@@ -224,6 +221,42 @@ Online evaluators are generally intended to be run server-side. LangSmith has bu
 
 ![Online](./static/online.png)
 
+## Testing
+
+### Evaluations vs testing
+
+Testing and evaluation are very similar and overlapping concepts that often get confused.
+
+**An evaluation measures performance according to a metric(s).**
+Evaluation metrics can be fuzzy or subjective, and are more useful in relative terms than absolute ones.
+That is, they're often used to compare two systems against each other rather than to assert something about an individual system.
+
+**Testing asserts correctness.**
+A system can only be deployed if it passes all tests.
+
+Evaluation metrics can be *turned into* tests.
+For example, you can write regression tests to assert that any new version of a system must outperform some baseline version of the system on the relevant evaluation metrics.
+
+It can also be more resource efficient to run tests and evaluations together if your system is expensive to run and you have overlapping datasets for your tests and evaluations.
+
+You can also choose to write evaluations using standard software testing tools like `pytest` or `vitest/jest` out of convenience.
+
+### Using `pytest` and `vitest/jest`
+
+The LangSmith SDKs come with integrations for [pytest](./how_to_guides/pytest) and [`vitest/jest`](./how_to_guides/vitest_jest).
+These make it easy to:
+- Track test results in LangSmith
+- Write evaluations as tests
+
+Tracking test results in LangSmith makes it easy to share results, compare systems, and debug failing tests.
+
+Writing evaluations as tests can be useful when each example you want to evaluate on requires custom logic for running the application and/or evaluators.
+The standard evaluation flows assume that you can run your application and evaluators in the same way on every example in a dataset.
+But for more complex systems or comprehensive evals, you may want to evaluate specific subsets of your system with specific types of inputs and metrics.
+These types of heterogenous evals are much easier to write as a suite of distinct test cases that all get tracked together rather than using the standard evaluate flow.
+
+Using testing tools is also helpful when you want to *both* evaluate your system's outputs *and* assert some basic things about them.
+
 ## Application-specific techniques
 
 Below, we will discuss evaluation of a few specific, popular LLM applications.
@@ -348,13 +381,13 @@ Summarization is one specific type of free-form writing. The evaluation aim is t
 | Faithfulness | Is the summary grounded in the source documents (e.g., no hallucinations)? | No | Yes - [prompt](https://smith.langchain.com/hub/langchain-ai/summary-hallucination-evaluator) | Yes |
 | Helpfulness | Is summary helpful relative to user need? | No | Yes - [prompt](https://smith.langchain.com/hub/langchain-ai/summary-helpfulness-evaluator) | Yes |
 
-### Classification / Tagging
+### Classification and tagging
 
-Classification / Tagging applies a label to a given input (e.g., for toxicity detection, sentiment analysis, etc). Classification / Tagging evaluation typically employs the following components, which we will review in detail below:
+Classification and tagging apply a label to a given input (e.g., for toxicity detection, sentiment analysis, etc). Classification/tagging evaluation typically employs the following components, which we will review in detail below:
 
-A central consideration for Classification / Tagging evaluation is whether you have a dataset with `reference` labels or not. If not, users frequently want to define an evaluator that uses criteria to apply label (e.g., toxicity, etc) to an input (e.g., text, user-question, etc). However, if ground truth class labels are provided, then the evaluation objective is focused on scoring a Classification / Tagging chain relative to the ground truth class label (e.g., using metrics such as precision, recall, etc).
+A central consideration for classification/tagging evaluation is whether you have a dataset with `reference` labels or not. If not, users frequently want to define an evaluator that uses criteria to apply label (e.g., toxicity, etc) to an input (e.g., text, user-question, etc). However, if ground truth class labels are provided, then the evaluation objective is focused on scoring a classification/tagging chain relative to the ground truth class label (e.g., using metrics such as precision, recall, etc).
 
-If ground truth reference labels are provided, then it's common to simply define a [custom heuristic evaluator](./how_to_guides/custom_evaluator) to compare ground truth labels to the chain output. However, it is increasingly common given the emergence of LLMs simply use `LLM-as-judge` to perform the Classification / Tagging of an input based upon specified criteria (without a ground truth reference).
+If ground truth reference labels are provided, then it's common to simply define a [custom heuristic evaluator](./how_to_guides/custom_evaluator) to compare ground truth labels to the chain output. However, it is increasingly common given the emergence of LLMs simply use `LLM-as-judge` to perform the classification/tagging of an input based upon specified criteria (without a ground truth reference).
 
 `Online` or `Offline` evaluation is feasible when using `LLM-as-judge` with the `Reference-free` prompt used. In particular, this is well suited to `Online` evaluation when a user wants to tag / classify application input (e.g., for toxicity, etc).
 
@@ -363,3 +396,4 @@ If ground truth reference labels are provided, then it's common to simply define
 | Accuracy | Standard definition | Yes | No | No |
 | Precision | Standard definition | Yes | No | No |
 | Recall | Standard definition | Yes | No | No |
+
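
The Testing section added above notes that evaluation metrics can be turned into regression tests, for example with `pytest`. A toy, self-contained sketch of that idea follows; the application, metric, and baseline value are all made up, and the SDK's pytest integration linked in the diff can additionally track such results in LangSmith:

```python
# test_regression.py -- run with `pytest`
EXAMPLES = [
    {"inputs": {"question": "2 + 2"}, "reference": "4"},
    {"inputs": {"question": "capital of France"}, "reference": "Paris"},
]

BASELINE_ACCURACY = 0.5  # score recorded for the previous version of the system


def run_app(inputs: dict) -> str:
    """Stand-in for the application being evaluated."""
    return {"2 + 2": "4", "capital of France": "Paris"}[inputs["question"]]


def exact_match(output: str, reference: str) -> float:
    """Toy evaluation metric."""
    return 1.0 if output.strip() == reference else 0.0


def test_new_version_beats_baseline():
    # The evaluation metric turned into a test: the new version must score at
    # least as well as the recorded baseline on the dataset.
    scores = [exact_match(run_app(ex["inputs"]), ex["reference"]) for ex in EXAMPLES]
    assert sum(scores) / len(scores) >= BASELINE_ACCURACY
```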

docs/evaluation/how_to_guides/evaluate_with_attachments.mdx (+5 -3)

@@ -107,7 +107,7 @@ if (!response.ok) {
 const pdfArrayBuffer = await fetchArrayBuffer(pdfUrl);
 const wavArrayBuffer = await fetchArrayBuffer(wavUrl);
 const pngArrayBuffer = await fetchArrayBuffer(pngUrl);\n
-// Create the LangSmith client (Ensure LANGCHAIN_API_KEY is set in env)
+// Create the LangSmith client (Ensure LANGSMITH_API_KEY is set in env)
 const langsmithClient = new Client();\n
 // Create a unique dataset name
 const datasetName = "attachment-test-dataset:" + uuid4().substring(0, 8);\n
@@ -240,11 +240,12 @@ def file_qa(inputs, attachments): # Read the audio bytes from the reader and enc
 `,
 `The target function you are evaluating must have two positional arguments in order to consume the attachments associated with the example, the first must be called \`inputs\` and the second must be called \`attachments\`.
 - The \`inputs\` argument is a dictionary that contains the input data for the example, excluding the attachments.
-- The \`attachments\` argument is a dictionary that maps the attachment name to a dictionary containing a presigned url and a reader of the bytes content of the file. Either can be used to read the bytes of the file:
+- The \`attachments\` argument is a dictionary that maps the attachment name to a dictionary containing a presigned url, mime_type, and a reader of the bytes content of the file. You can use either the presigned url or the reader to get the file contents.
 Each value in the attachments dictionary is a dictionary with the following structure:
 \`\`\`
 {
 "presigned_url": str,
+"mime_type": str,
 "reader": BinaryIO
 }
 \`\`\`
@@ -315,8 +316,9 @@ image_answer: imageCompletion.choices[0].message.content,
 `In the TypeScript SDK, the \`config\` argument is used to pass in the attachments to the target function if \`includeAttachments\` is set to \`true\`.\n
 The \`config\` will contain \`attachments\` which is an object mapping the attachment name to an object of the form:\n
 \`\`\`
-{\n
+{
 presigned_url: string,
+mime_type: string,
 }
 \`\`\``
 ),
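
To make the Python description above concrete, here is a minimal sketch of a target function with the `(inputs, attachments)` signature; the attachment name "my_pdf" is hypothetical, and each attachment value follows the presigned_url / mime_type / reader structure shown in the diff:

```python
from typing import Any, Dict


def file_qa(inputs: Dict[str, Any], attachments: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    # Each attachment value looks like:
    # {"presigned_url": str, "mime_type": str, "reader": BinaryIO}
    pdf = attachments["my_pdf"]              # hypothetical attachment name
    pdf_bytes = pdf["reader"].read()         # or download via pdf["presigned_url"]
    # A real target would feed the bytes to a model; here we just echo metadata.
    return {
        "question": inputs.get("question"),
        "mime_type": pdf["mime_type"],
        "num_bytes": len(pdf_bytes),
    }
```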

docs/evaluation/how_to_guides/index.md (+4 -3)

@@ -45,11 +45,12 @@ Evaluate and improve your application before deploying it.
 - [Print detailed logs (Python only)](../../observability/how_to_guides/tracing/output_detailed_logs)
 - [Run an evaluation locally (beta, Python only)](./how_to_guides/local)
 
-## Unit testing
+## Testing integrations
 
-Unit test your system to identify bugs and regressions.
+Run evals using your favorite testing tools:
 
-- [Unit test applications (Python only)](./how_to_guides/unit_testing)
+- [Run evals with pytest (beta)](./how_to_guides/pytest)
+- [Run evals with Vitest/Jest (beta)](./how_to_guides/vitest_jest)
 
 ## Online evaluation
