Merge branch 'main' into isaac/aevaluateupdate
isahers1 authored Jan 29, 2025
2 parents b56cfca + 5260804 commit 695d4e7
Showing 72 changed files with 2,964 additions and 726 deletions.
8 changes: 5 additions & 3 deletions Makefile
@@ -10,15 +10,17 @@ build-api-ref:
. .venv/bin/activate
$(PYTHON) -m pip install --upgrade pip
$(PYTHON) -m pip install --upgrade uv
cd langsmith-sdk && ../$(PYTHON) -m uv pip install -r python/docs/requirements.txt
$(PYTHON) -m uv pip install -r langsmith-sdk/python/docs/requirements.txt
$(PYTHON) -m uv pip install -e langsmith-sdk/python
$(PYTHON) langsmith-sdk/python/docs/create_api_rst.py
LC_ALL=C $(PYTHON) -m sphinx -T -E -b html -d langsmith-sdk/python/docs/_build/doctrees -c langsmith-sdk/python/docs langsmith-sdk/python/docs langsmith-sdk/python/docs/_build/html -j auto
$(PYTHON) langsmith-sdk/python/docs/scripts/custom_formatter.py langsmith-sdk/docs/_build/html/

cd langsmith-sdk/js && yarn && yarn run build:typedoc --useHostedBaseUrlForAbsoluteLinks true --hostedBaseUrl "https://$${VERCEL_URL:-docs.smith.langchain.com}/reference/js/"

vercel-build: install-vercel-deps build-api-ref
mkdir -p static/reference/python
mv langsmith-sdk/python/docs/_build/html/* static/reference/python/
mkdir -p static/reference/js
mv langsmith-sdk/js/_build/api_refs/* static/reference/js/
rm -rf langsmith-sdk
NODE_OPTIONS="--max-old-space-size=5000" yarn run docusaurus build

@@ -34,12 +34,7 @@ The API key will be shown only once, so make sure to copy it and store it in a s

## Configure the SDK

You may set the following environment variables in addition to `LANGCHAIN_API_KEY` (or equivalently `LANGSMITH_API_KEY`).
You may set the following environment variables in addition to `LANGSMITH_API_KEY`.
These are only required if using the EU instance.

:::info
`LANGCHAIN_HUB_API_URL` is only required if using the legacy langchainhub sdk
:::

`LANGCHAIN_ENDPOINT=`<RegionalUrl type='api' link={false} />
`LANGCHAIN_HUB_API_URL=`<RegionalUrl type='hub' link={false} />
`LANGSMITH_ENDPOINT=`<RegionalUrl type='api' link={false} />
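
For example, a minimal sketch of pointing the Python SDK at the EU instance (the endpoint value below is an assumption; substitute the regional URL shown above):

```python
import os

from langsmith import Client

# Assumed EU endpoint for illustration; use the regional API URL rendered above.
os.environ["LANGSMITH_API_KEY"] = "<your-api-key>"
os.environ["LANGSMITH_ENDPOINT"] = "https://eu.api.smith.langchain.com"

client = Client()  # picks up the LANGSMITH_* variables from the environment
```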
@@ -174,10 +174,10 @@ import requests


def main():
api_key = os.environ["LANGCHAIN_API_KEY"]
# LANGCHAIN_ORGANIZATION_ID is not a standard environment variable in the SDK, just used for this example
organization_id = os.environ["LANGCHAIN_ORGANIZATION_ID"]
base_url = os.environ.get("LANGCHAIN_ENDPOINT") or "https://api.smith.langchain.com"
api_key = os.environ["LANGSMITH_API_KEY"]
# LANGSMITH_ORGANIZATION_ID is not a standard environment variable in the SDK, just used for this example
organization_id = os.environ["LANGSMITH_ORGANIZATION_ID"]
base_url = os.environ.get("LANGSMITH_ENDPOINT") or "https://api.smith.langchain.com"
headers = {
"Content-Type": "application/json",
"X-API-Key": api_key,
67 changes: 50 additions & 17 deletions docs/evaluation/concepts/index.mdx
@@ -1,20 +1,17 @@
# Evaluation concepts

The pace of AI application development is often limited by high-quality evaluations.
Evaluations are methods designed to assess the performance and capabilities of AI applications.
The quality and development speed of AI applications are often limited by the availability of high-quality evaluation datasets and metrics, which enable you to both optimize and test your applications.

Good evaluations make it easy to iteratively improve prompts, select models, test architectures, and ensure that deployed applications continue to perform as expected.
LangSmith makes building high-quality evaluations easy.
This guide explains the LangSmith evaluation framework and AI evaluation techniques more broadly.
The building blocks of the LangSmith framework are:

This guide explains the key concepts behind the LangSmith evaluation framework and evaluations for AI applications more broadly.
The core components of LangSmith evaluations are:

- [**Datasets**:](/evaluation/concepts#datasets) Collections of test inputs and, optionally, reference outputs for your applications.
- [**Evaluators**](/evaluation/concepts#evaluators): Functions for scoring the outputs generated by applications given dataset inputs.
- [**Datasets**:](/evaluation/concepts#datasets) Collections of test inputs and reference outputs.
- [**Evaluators**](/evaluation/concepts#evaluators): Functions for scoring outputs.

## Datasets

A dataset contains a collection of examples used for evaluating an application.
A dataset is a collection of examples used for evaluating an application. An example is a pair of a test input and a reference output.

![Dataset](./static/dataset_concept.png)
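
For example, a small dataset can be created programmatically with the Python SDK (the dataset name and example values below are illustrative):

```python
from langsmith import Client

client = Client()

# Illustrative question/answer dataset.
dataset = client.create_dataset("qa-example-dataset")
client.create_examples(
    inputs=[
        {"question": "What is LangSmith?"},
        {"question": "What is an evaluator?"},
    ],
    outputs=[
        {"answer": "A platform for tracing, testing, and evaluating LLM applications."},
        {"answer": "A function that scores an application's outputs."},
    ],
    dataset_id=dataset.id,
)
```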

@@ -101,7 +98,7 @@ There are a number of ways to define and run evaluators:
- **Custom code**: Define [custom evaluators](/evaluation/how_to_guides/custom_evaluator) as Python or TypeScript functions and run them client-side using the SDKs or server-side via the UI.
- **Built-in evaluators**: LangSmith has a number of built-in evaluators that you can configure and run via the UI.

You can run evaluators using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python) and TypeScript), via the [Prompt Playground](../../prompt_engineering/concepts#prompt-playground), or by configuring [Rules](../../observability/how_to_guides/monitoring/rules) to automatically run them on particular tracing projects or datasets.
You can run evaluators using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python) and [TypeScript](https://docs.smith.langchain.com/reference/js)), via the [Prompt Playground](../../prompt_engineering/concepts#prompt-playground), or by configuring [Rules](../../observability/how_to_guides/monitoring/rules) to automatically run them on particular tracing projects or datasets.
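
For example, a minimal custom evaluator written as a Python function might look like this (a sketch using the SDK's keyword-argument evaluator signature; the metric and field names are illustrative):

```python
# Exact-match evaluator: returns a binary score by comparing the application's
# output to the reference output for the example.
def correctness(outputs: dict, reference_outputs: dict) -> bool:
    return outputs.get("answer") == reference_outputs.get("answer")
```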

#### Evaluation techniques

@@ -141,7 +138,7 @@ Learn [how run pairwise evaluations](/evaluation/how_to_guides/evaluate_pairwise
## Experiment

Each time we evaluate an application on a dataset, we are conducting an experiment.
An experiment is a single execution of the example inputs in your dataset through your application.
An experiment contains the results of running a specific version of your application on the dataset.
Typically, we will run multiple experiments on a given dataset, testing different configurations of our application (e.g., different prompts or LLMs).
In LangSmith, you can easily view all the experiments associated with your dataset.
Additionally, you can [compare multiple experiments in a comparison view](/evaluation/how_to_guides/compare_experiment_results).
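
A minimal sketch of kicking off an experiment from the Python SDK (the target function, dataset name, and evaluator below are placeholders):

```python
from langsmith import Client

client = Client()

def my_app(inputs: dict) -> dict:
    # Placeholder application under test.
    return {"answer": "LangSmith is a platform for tracing and evaluating LLM apps."}

def correctness(outputs: dict, reference_outputs: dict) -> bool:
    return outputs.get("answer") == reference_outputs.get("answer")

results = client.evaluate(
    my_app,
    data="qa-example-dataset",       # illustrative dataset name
    evaluators=[correctness],
    experiment_prefix="baseline",
)
```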
@@ -165,7 +162,7 @@ It is offline because we're evaluating on a pre-compiled set of data.
An online evaluation, on the other hand, is one in which we evaluate a deployed application's outputs on real traffic, in near realtime.
Offline evaluations are used for testing a version(s) of your application pre-deployment.

You can run offline evaluations client-side using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python) and TypeScript). You can run them server-side via the [Prompt Playground](../../prompt_engineering/concepts#prompt-playground) or by configuring [automations](/observability/how_to_guides/monitoring/rules) to run certain evaluators on every new experiment against a specific dataset.
You can run offline evaluations client-side using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python) and [TypeScript](https://docs.smith.langchain.com/reference/js)). You can run them server-side via the [Prompt Playground](../../prompt_engineering/concepts#prompt-playground) or by configuring [automations](/observability/how_to_guides/monitoring/rules) to run certain evaluators on every new experiment against a specific dataset.

![Offline](./static/offline.png)

@@ -224,6 +221,42 @@ Online evaluators are generally intended to be run server-side. LangSmith has bu

![Online](./static/online.png)

## Testing

### Evaluations vs testing

Testing and evaluation are very similar and overlapping concepts that often get confused.

**An evaluation measures performance according to a metric(s).**
Evaluation metrics can be fuzzy or subjective, and are more useful in relative terms than absolute ones.
That is, they're often used to compare two systems against each other rather than to assert something about an individual system.

**Testing asserts correctness.**
A system can only be deployed if it passes all tests.

Evaluation metrics can be *turned into* tests.
For example, you can write regression tests to assert that any new version of a system must outperform some baseline version of the system on the relevant evaluation metrics.
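
As a sketch, such a regression test might pin the baseline's measured score as a threshold (the dataset name, feedback column name, and the 0.8 threshold are assumptions for illustration):

```python
from langsmith import Client

client = Client()

def my_app(inputs: dict) -> dict:
    # Placeholder for the new version of the system.
    return {"answer": "42"}

def correctness(outputs: dict, reference_outputs: dict) -> bool:
    return outputs.get("answer") == reference_outputs.get("answer")

def test_correctness_does_not_regress():
    results = client.evaluate(my_app, data="qa-example-dataset", evaluators=[correctness])
    df = results.to_pandas()  # requires pandas
    # The pinned 0.8 stands in for the baseline system's measured score.
    assert df["feedback.correctness"].mean() >= 0.8
```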

It can also be more resource efficient to run tests and evaluations together if your system is expensive to run and you have overlapping datasets for your tests and evaluations.

You can also choose to write evaluations using standard software testing tools like `pytest` or `vitest/jest` out of convenience.

### Using `pytest` and `vitest/jest`

The LangSmith SDKs come with integrations for [pytest](./how_to_guides/pytest) and [`vitest/jest`](./how_to_guides/vitest_jest).
These make it easy to:
- Track test results in LangSmith
- Write evaluations as tests

Tracking test results in LangSmith makes it easy to share results, compare systems, and debug failing tests.

Writing evaluations as tests can be useful when each example you want to evaluate on requires custom logic for running the application and/or evaluators.
The standard evaluation flows assume that you can run your application and evaluators in the same way on every example in a dataset.
But for more complex systems or comprehensive evals, you may want to evaluate specific subsets of your system with specific types of inputs and metrics.
These types of heterogeneous evals are much easier to write as a suite of distinct test cases that all get tracked together rather than using the standard evaluate flow.

Using testing tools is also helpful when you want to *both* evaluate your system's outputs *and* assert some basic things about them.
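
For instance, a sketch of an eval written as a pytest test using the integration above (the application function, query, and logged fields are placeholders):

```python
import pytest
from langsmith import testing as t

def generate_sql(user_query: str) -> str:
    # Placeholder for the application being tested.
    return "SELECT * FROM customers;"

@pytest.mark.langsmith  # track this test case and its results in LangSmith
def test_select_all_customers():
    user_query = "Return all customers"
    t.log_inputs({"user_query": user_query})

    sql = generate_sql(user_query)
    t.log_outputs({"sql": sql})

    expected = "SELECT * FROM customers;"
    t.log_reference_outputs({"sql": expected})

    # The test both evaluates the output and asserts correctness.
    assert sql == expected
```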

## Application-specific techniques

Below, we will discuss evaluation of a few specific, popular LLM applications.
@@ -348,13 +381,13 @@ Summarization is one specific type of free-form writing. The evaluation aim is t
| Faithfulness | Is the summary grounded in the source documents (e.g., no hallucinations)? | No | Yes - [prompt](https://smith.langchain.com/hub/langchain-ai/summary-hallucination-evaluator) | Yes |
| Helpfulness | Is summary helpful relative to user need? | No | Yes - [prompt](https://smith.langchain.com/hub/langchain-ai/summary-helpfulness-evaluator) | Yes |

### Classification / Tagging
### Classification and tagging

Classification / Tagging applies a label to a given input (e.g., for toxicity detection, sentiment analysis, etc). Classification / Tagging evaluation typically employs the following components, which we will review in detail below:
Classification and tagging apply a label to a given input (e.g., for toxicity detection, sentiment analysis, etc). Classification/tagging evaluation typically employs the following components, which we will review in detail below:

A central consideration for Classification / Tagging evaluation is whether you have a dataset with `reference` labels or not. If not, users frequently want to define an evaluator that uses criteria to apply label (e.g., toxicity, etc) to an input (e.g., text, user-question, etc). However, if ground truth class labels are provided, then the evaluation objective is focused on scoring a Classification / Tagging chain relative to the ground truth class label (e.g., using metrics such as precision, recall, etc).
A central consideration for classification/tagging evaluation is whether you have a dataset with `reference` labels or not. If not, users frequently want to define an evaluator that uses criteria to apply a label (e.g., toxicity, etc.) to an input (e.g., text, user-question, etc.). However, if ground truth class labels are provided, then the evaluation objective is focused on scoring a classification/tagging chain relative to the ground truth class label (e.g., using metrics such as precision, recall, etc.).

If ground truth reference labels are provided, then it's common to simply define a [custom heuristic evaluator](./how_to_guides/custom_evaluator) to compare ground truth labels to the chain output. However, it is increasingly common given the emergence of LLMs simply use `LLM-as-judge` to perform the Classification / Tagging of an input based upon specified criteria (without a ground truth reference).
If ground truth reference labels are provided, then it's common to simply define a [custom heuristic evaluator](./how_to_guides/custom_evaluator) to compare ground truth labels to the chain output. However, given the emergence of LLMs, it is increasingly common to simply use `LLM-as-judge` to perform the classification/tagging of an input based upon specified criteria (without a ground truth reference).

`Online` or `Offline` evaluation is feasible when using `LLM-as-judge` with a `Reference-free` prompt. In particular, this is well suited to `Online` evaluation when a user wants to tag/classify application input (e.g., for toxicity, etc.).
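
For instance, a reference-free `LLM-as-judge` evaluator for toxicity tagging might be sketched as follows (the judge model, prompt, and feedback key are assumptions; any LLM client would work):

```python
from openai import OpenAI  # assumed judge-model client for illustration

oai_client = OpenAI()

def toxicity(inputs: dict) -> dict:
    # Reference-free judge: labels the raw input text without a ground truth label.
    prompt = (
        "Label the following text as 'toxic' or 'non-toxic'. "
        "Respond with only the label.\n\n" + inputs["text"]
    )
    response = oai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    label = response.choices[0].message.content.strip().lower()
    return {"key": "toxicity", "score": 1 if label == "toxic" else 0}
```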

@@ -396,4 +429,4 @@ to run at once.
### Caching

Lastly, you can also cache the API calls made in your experiment by setting the `LANGSMITH_CACHE_PATH` to a valid folder on your device with write access.
This will cause the API calls made in your experiment to be cached to disk, meaning future experiments that make the same API calls will be greatly sped up.
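
A minimal sketch of enabling the cache before running an experiment (the folder path is illustrative):

```python
import os

# Any writable folder works; this path is just an example.
os.environ["LANGSMITH_CACHE_PATH"] = "tests/cassettes"

# API calls made by subsequent experiment runs are written to and replayed from
# this folder, so repeated runs that make the same calls are much faster.
```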
35 changes: 21 additions & 14 deletions docs/evaluation/how_to_guides/evaluate_with_attachments.mdx
@@ -37,17 +37,18 @@ All of the below features are available in the following SDK versions:
<CodeTabs
tabs={[
PythonBlock(`import requests
import uuid\n
import uuid
from pathlib import Path\n
from langsmith import Client
from langsmith.schemas import ExampleUploadWithAttachments, Attachment\n
# Publicly available test files
pdf_url = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
wav_url = "https://openaiassets.blob.core.windows.net/$web/API/docs/audio/alloy.wav"
png_url = "https://www.w3.org/Graphics/PNG/nurbcup2si.png"\n
wav_url = "https://openaiassets.blob.core.windows.net/$web/API/docs/audio/alloy.wav"\n
# Fetch the files as bytes\n
pdf_bytes = requests.get(pdf_url).content
wav_bytes = requests.get(wav_url).content
png_bytes = requests.get(png_url).content\n
wav_bytes = requests.get(wav_url).content\n
# Define the LANGCHAIN_API_KEY environment variable with your API key
langsmith_client = Client()\n
dataset_name = "attachment-test-dataset:" + str(uuid.uuid4())[0:8]\n
@@ -70,20 +71,22 @@ example = ExampleUploadWithAttachments(
},
attachments={
"my_pdf": ("application/pdf", pdf_bytes),
"my_wav": ("audio/wav", wav_bytes),
"my_img": Attachment(mime_type="image/png", data=png_bytes)
"my_wav": Attachment(mime_type="audio/wav", data=wav_bytes),
"my_img": ("image/png", Path(__file__).parent / "my_img.png")
},
)\n
# Upload the examples with attachments
langsmith_client.upload_examples_multipart(dataset_id=dataset.id, uploads=[example])
# Must pass the dangerously_allow_filesystem flag to allow file paths
langsmith_client.upload_examples_multipart(dataset_id=dataset.id, uploads=[example], dangerously_allow_filesystem=True)
`,
`In the Python SDK, you can use the \`upload_examples_multipart\` method to upload examples with attachments.\n
Note that this is a different method from the standard \`create_examples\` method, which currently does not support attachments.\n
Utilize the \`ExampleUploadWithAttachments\` type to define examples with attachments.\n
Each \`Attachment\` requires:
- \`mime_type\` (str): The MIME type of the file (e.g., \`"image/png"\`).
- \`data\` (bytes): The binary content of the file.\n
You can also define an attachment with a tuple tuple of the form \`(mime_type, data)\` for convenience.
- \`data\` (bytes | Path): The binary content of the file, or the file path.\n
You can also define an attachment with a tuple of the form \`(mime_type, data)\` for convenience.\n
Note that to use the file path instead of the raw bytes, you need to set the \`dangerously_allow_filesystem\` option to \`True\`.\n
`
),
TypeScriptBlock(`import { Client } from "langsmith";
Expand All @@ -104,7 +107,7 @@ if (!response.ok) {
const pdfArrayBuffer = await fetchArrayBuffer(pdfUrl);
const wavArrayBuffer = await fetchArrayBuffer(wavUrl);
const pngArrayBuffer = await fetchArrayBuffer(pngUrl);\n
// Create the LangSmith client (Ensure LANGCHAIN_API_KEY is set in env)
// Create the LangSmith client (Ensure LANGSMITH_API_KEY is set in env)
const langsmithClient = new Client();\n
// Create a unique dataset name
const datasetName = "attachment-test-dataset:" + uuid4().substring(0, 8);\n
@@ -147,7 +150,9 @@ Note that this is a different method from the standard \`createExamples\` method
Each attachment requires either a \`Uint8Array\` or an \`ArrayBuffer\` as the data type.\n
- \`Uint8Array\`: Useful for handling binary data directly.
- \`ArrayBuffer\`: Represents fixed-length binary data, which can be converted to \`Uint8Array\` as needed.\n`),
- \`ArrayBuffer\`: Represents fixed-length binary data, which can be converted to \`Uint8Array\` as needed.\n
Note that you cannot directly pass in a file path in the TypeScript SDK, as accessing local files is not supported in all runtime environments.\n
`),
]}
groupId="client-language"
/>
@@ -235,11 +240,12 @@ def file_qa(inputs, attachments): # Read the audio bytes from the reader and enc
`,
`The target function you are evaluating must have two positional arguments in order to consume the attachments associated with the example, the first must be called \`inputs\` and the second must be called \`attachments\`.
- The \`inputs\` argument is a dictionary that contains the input data for the example, excluding the attachments.
- The \`attachments\` argument is a dictionary that maps the attachment name to a dictionary containing a presigned url and a reader of the bytes content of the file. Either can be used to read the bytes of the file:
- The \`attachments\` argument is a dictionary that maps the attachment name to a dictionary containing a presigned url, mime_type, and a reader of the bytes content of the file. You can use either the presigned url or the reader to get the file contents.
Each value in the attachments dictionary is a dictionary with the following structure:
\`\`\`
{
"presigned_url": str,
"mime_type": str,
"reader": BinaryIO
}
\`\`\`
@@ -310,8 +316,9 @@ image_answer: imageCompletion.choices[0].message.content,
`In the TypeScript SDK, the \`config\` argument is used to pass in the attachments to the target function if \`includeAttachments\` is set to \`true\`.\n
The \`config\` will contain \`attachments\` which is an object mapping the attachment name to an object of the form:\n
\`\`\`
{\n
{
presigned_url: string,
mime_type: string,
}
\`\`\``
),
10 changes: 6 additions & 4 deletions docs/evaluation/how_to_guides/index.md
@@ -45,17 +45,19 @@ Evaluate and improve your application before deploying it.
- [Print detailed logs (Python only)](../../observability/how_to_guides/tracing/output_detailed_logs)
- [Run an evaluation locally (beta, Python only)](./how_to_guides/local)

## Unit testing
## Testing integrations

Unit test your system to identify bugs and regressions.
Run evals using your favorite testing tools:

- [Unit test applications (Python only)](./how_to_guides/unit_testing)
- [Run evals with pytest (beta)](./how_to_guides/pytest)
- [Run evals with Vitest/Jest (beta)](./how_to_guides/vitest_jest)

## Online evaluation

Evaluate and monitor your system's live performance on production data.

- [Set up an online evaluator](../../observability/how_to_guides/monitoring/online_evaluations)
- [Set up an LLM-as-judge online evaluator](../../observability/how_to_guides/monitoring/online_evaluations#configure-llm-as-judge-evaluators)
- [Set up a custom code online evaluator](../../observability/how_to_guides/monitoring/online_evaluations#configure-custom-code-evaluators)
- [Create a few-shot evaluator](./how_to_guides/create_few_shot_evaluators)

## Automatic evaluation