Merge branch 'main' into isaac/aevaluateupdate
isahers1 authored Jan 29, 2025
2 parents b56cfca + 5260804 commit 695d4e7
Showing 72 changed files with 2,964 additions and 726 deletions.
8 changes: 5 additions & 3 deletions Makefile
@@ -10,15 +10,17 @@ build-api-ref:
. .venv/bin/activate
$(PYTHON) -m pip install --upgrade pip
$(PYTHON) -m pip install --upgrade uv
cd langsmith-sdk && ../$(PYTHON) -m uv pip install -r python/docs/requirements.txt
$(PYTHON) -m uv pip install -r langsmith-sdk/python/docs/requirements.txt
$(PYTHON) -m uv pip install -e langsmith-sdk/python
$(PYTHON) langsmith-sdk/python/docs/create_api_rst.py
LC_ALL=C $(PYTHON) -m sphinx -T -E -b html -d langsmith-sdk/python/docs/_build/doctrees -c langsmith-sdk/python/docs langsmith-sdk/python/docs langsmith-sdk/python/docs/_build/html -j auto
$(PYTHON) langsmith-sdk/python/docs/scripts/custom_formatter.py langsmith-sdk/docs/_build/html/

cd langsmith-sdk/js && yarn && yarn run build:typedoc --useHostedBaseUrlForAbsoluteLinks true --hostedBaseUrl "https://$${VERCEL_URL:-docs.smith.langchain.com}/reference/js/"

vercel-build: install-vercel-deps build-api-ref
mkdir -p static/reference/python
mv langsmith-sdk/python/docs/_build/html/* static/reference/python/
mkdir -p static/reference/js
mv langsmith-sdk/js/_build/api_refs/* static/reference/js/
rm -rf langsmith-sdk
NODE_OPTIONS="--max-old-space-size=5000" yarn run docusaurus build

@@ -34,12 +34,7 @@ The API key will be shown only once, so make sure to copy it and store it in a s

## Configure the SDK

You may set the following environment variables in addition to `LANGCHAIN_API_KEY` (or equivalently `LANGSMITH_API_KEY`).
You may set the following environment variables in addition to `LANGSMITH_API_KEY`.
These are only required if using the EU instance.

:::info
`LANGCHAIN_HUB_API_URL` is only required if using the legacy langchainhub sdk
:::

`LANGCHAIN_ENDPOINT=`<RegionalUrl type='api' link={false} />
`LANGCHAIN_HUB_API_URL=`<RegionalUrl type='hub' link={false} />
`LANGSMITH_ENDPOINT=`<RegionalUrl type='api' link={false} />
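
For example, a minimal sketch of pointing the Python SDK at the EU instance (the endpoint value below is an assumption; substitute the regional URL shown above):

```python
import os

from langsmith import Client

# Assumed EU endpoint for illustration; use the regional API URL rendered above.
os.environ["LANGSMITH_API_KEY"] = "<your-api-key>"
os.environ["LANGSMITH_ENDPOINT"] = "https://eu.api.smith.langchain.com"

client = Client()  # picks up the LANGSMITH_* variables from the environment
```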
@@ -174,10 +174,10 @@ import requests


def main():
api_key = os.environ["LANGCHAIN_API_KEY"]
# LANGCHAIN_ORGANIZATION_ID is not a standard environment variable in the SDK, just used for this example
organization_id = os.environ["LANGCHAIN_ORGANIZATION_ID"]
base_url = os.environ.get("LANGCHAIN_ENDPOINT") or "https://api.smith.langchain.com"
api_key = os.environ["LANGSMITH_API_KEY"]
# LANGSMITH_ORGANIZATION_ID is not a standard environment variable in the SDK, just used for this example
organization_id = os.environ["LANGSMITH_ORGANIZATION_ID"]
base_url = os.environ.get("LANGSMITH_ENDPOINT") or "https://api.smith.langchain.com"
headers = {
"Content-Type": "application/json",
"X-API-Key": api_key,
67 changes: 50 additions & 17 deletions docs/evaluation/concepts/index.mdx
@@ -1,20 +1,17 @@
# Evaluation concepts

The pace of AI application development is often limited by high-quality evaluations.
Evaluations are methods designed to assess the performance and capabilities of AI applications.
The quality and development speed of AI applications are often limited by the availability of high-quality evaluation datasets and metrics, which enable you to both optimize and test your applications.

Good evaluations make it easy to iteratively improve prompts, select models, test architectures, and ensure that deployed applications continue to perform as expected.
LangSmith makes building high-quality evaluations easy.
This guide explains the LangSmith evaluation framework and AI evaluation techniques more broadly.
The building blocks of the LangSmith framework are:

This guide explains the key concepts behind the LangSmith evaluation framework and evaluations for AI applications more broadly.
The core components of LangSmith evaluations are:

- [**Datasets**:](/evaluation/concepts#datasets) Collections of test inputs and, optionally, reference outputs for your applications.
- [**Evaluators**](/evaluation/concepts#evaluators): Functions for scoring the outputs generated by applications given dataset inputs.
- [**Datasets**:](/evaluation/concepts#datasets) Collections of test inputs and reference outputs.
- [**Evaluators**](/evaluation/concepts#evaluators): Functions for scoring outputs.

## Datasets

A dataset contains a collection of examples used for evaluating an application.
A dataset is a collection of examples used for evaluating an application. An example is a pair of a test input and a reference output.

![Dataset](./static/dataset_concept.png)
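
For example, a small dataset can be created programmatically with the Python SDK (the dataset name and example values below are illustrative):

```python
from langsmith import Client

client = Client()

# Illustrative question/answer dataset.
dataset = client.create_dataset("qa-example-dataset")
client.create_examples(
    inputs=[
        {"question": "What is LangSmith?"},
        {"question": "What is an evaluator?"},
    ],
    outputs=[
        {"answer": "A platform for tracing, testing, and evaluating LLM applications."},
        {"answer": "A function that scores an application's outputs."},
    ],
    dataset_id=dataset.id,
)
```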

@@ -101,7 +98,7 @@ There are a number of ways to define and run evaluators:
- **Custom code**: Define [custom evaluators](/evaluation/how_to_guides/custom_evaluator) as Python or TypeScript functions and run them client-side using the SDKs or server-side via the UI.
- **Built-in evaluators**: LangSmith has a number of built-in evaluators that you can configure and run via the UI.

You can run evaluators using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python) and TypeScript), via the [Prompt Playground](../../prompt_engineering/concepts#prompt-playground), or by configuring [Rules](../../observability/how_to_guides/monitoring/rules) to automatically run them on particular tracing projects or datasets.
You can run evaluators using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python) and [TypeScript](https://docs.smith.langchain.com/reference/js)), via the [Prompt Playground](../../prompt_engineering/concepts#prompt-playground), or by configuring [Rules](../../observability/how_to_guides/monitoring/rules) to automatically run them on particular tracing projects or datasets.
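
For example, a minimal custom evaluator written as a Python function might look like this (a sketch using the SDK's keyword-argument evaluator signature; the metric and field names are illustrative):

```python
# Exact-match evaluator: returns a binary score by comparing the application's
# output to the reference output for the example.
def correctness(outputs: dict, reference_outputs: dict) -> bool:
    return outputs.get("answer") == reference_outputs.get("answer")
```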

#### Evaluation techniques

@@ -141,7 +138,7 @@ Learn [how run pairwise evaluations](/evaluation/how_to_guides/evaluate_pairwise
## Experiment

Each time we evaluate an application on a dataset, we are conducting an experiment.
An experiment is a single execution of the example inputs in your dataset through your application.
An experiment contains the results of running a specific version of your application on the dataset.
Typically, we will run multiple experiments on a given dataset, testing different configurations of our application (e.g., different prompts or LLMs).
In LangSmith, you can easily view all the experiments associated with your dataset.
Additionally, you can [compare multiple experiments in a comparison view](/evaluation/how_to_guides/compare_experiment_results).
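
A minimal sketch of kicking off an experiment from the Python SDK (the target function, dataset name, and evaluator below are placeholders):

```python
from langsmith import Client

client = Client()

def my_app(inputs: dict) -> dict:
    # Placeholder application under test.
    return {"answer": "LangSmith is a platform for tracing and evaluating LLM apps."}

def correctness(outputs: dict, reference_outputs: dict) -> bool:
    return outputs.get("answer") == reference_outputs.get("answer")

results = client.evaluate(
    my_app,
    data="qa-example-dataset",       # illustrative dataset name
    evaluators=[correctness],
    experiment_prefix="baseline",
)
```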
@@ -165,7 +162,7 @@ It is offline because we're evaluating on a pre-compiled set of data.
An online evaluation, on the other hand, is one in which we evaluate a deployed application's outputs on real traffic, in near realtime.
Offline evaluations are used for testing a version(s) of your application pre-deployment.

You can run offline evaluations client-side using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python) and TypeScript). You can run them server-side via the [Prompt Playground](../../prompt_engineering/concepts#prompt-playground) or by configuring [automations](/observability/how_to_guides/monitoring/rules) to run certain evaluators on every new experiment against a specific dataset.
You can run offline evaluations client-side using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python) and [TypeScript](https://docs.smith.langchain.com/reference/js)). You can run them server-side via the [Prompt Playground](../../prompt_engineering/concepts#prompt-playground) or by configuring [automations](/observability/how_to_guides/monitoring/rules) to run certain evaluators on every new experiment against a specific dataset.

![Offline](./static/offline.png)

@@ -224,6 +221,42 @@ Online evaluators are generally intended to be run server-side. LangSmith has bu

![Online](./static/online.png)

## Testing

### Evaluations vs testing

Testing and evaluation are very similar and overlapping concepts that often get confused.

**An evaluation measures performance according to a metric(s).**
Evaluation metrics can be fuzzy or subjective, and are more useful in relative terms than absolute ones.
That is, they're often used to compare two systems against each other rather than to assert something about an individual system.

**Testing asserts correctness.**
A system can only be deployed if it passes all tests.

Evaluation metrics can be *turned into* tests.
For example, you can write regression tests to assert that any new version of a system must outperform some baseline version of the system on the relevant evaluation metrics.
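
As a sketch, such a regression test might pin the baseline's measured score as a threshold (the dataset name, feedback column name, and the 0.8 threshold are assumptions for illustration):

```python
from langsmith import Client

client = Client()

def my_app(inputs: dict) -> dict:
    # Placeholder for the new version of the system.
    return {"answer": "42"}

def correctness(outputs: dict, reference_outputs: dict) -> bool:
    return outputs.get("answer") == reference_outputs.get("answer")

def test_correctness_does_not_regress():
    results = client.evaluate(my_app, data="qa-example-dataset", evaluators=[correctness])
    df = results.to_pandas()  # requires pandas
    # The pinned 0.8 stands in for the baseline system's measured score.
    assert df["feedback.correctness"].mean() >= 0.8
```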

It can also be more resource efficient to run tests and evaluations together if your system is expensive to run and you have overlapping datasets for your tests and evaluations.

You can also choose to write evaluations using standard software testing tools like `pytest` or `vitest/jest` out of convenience.

### Using `pytest` and `vitest/jest`

The LangSmith SDKs come with integrations for [pytest](./how_to_guides/pytest) and [`vitest/jest`](./how_to_guides/vitest_jest).
These make it easy to:
- Track test results in LangSmith
- Write evaluations as tests

Tracking test results in LangSmith makes it easy to share results, compare systems, and debug failing tests.

Writing evaluations as tests can be useful when each example you want to evaluate on requires custom logic for running the application and/or evaluators.
The standard evaluation flows assume that you can run your application and evaluators in the same way on every example in a dataset.
But for more complex systems or comprehensive evals, you may want to evaluate specific subsets of your system with specific types of inputs and metrics.
These types of heterogeneous evals are much easier to write as a suite of distinct test cases that all get tracked together rather than using the standard evaluate flow.

Using testing tools is also helpful when you want to *both* evaluate your system's outputs *and* assert some basic things about them.
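
For instance, a sketch of an eval written as a pytest test using the integration above (the application function, query, and logged fields are placeholders):

```python
import pytest
from langsmith import testing as t

def generate_sql(user_query: str) -> str:
    # Placeholder for the application being tested.
    return "SELECT * FROM customers;"

@pytest.mark.langsmith  # track this test case and its results in LangSmith
def test_select_all_customers():
    user_query = "Return all customers"
    t.log_inputs({"user_query": user_query})

    sql = generate_sql(user_query)
    t.log_outputs({"sql": sql})

    expected = "SELECT * FROM customers;"
    t.log_reference_outputs({"sql": expected})

    # The test both evaluates the output and asserts correctness.
    assert sql == expected
```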

## Application-specific techniques

Below, we will discuss evaluation of a few specific, popular LLM applications.
@@ -348,13 +381,13 @@ Summarization is one specific type of free-form writing. The evaluation aim is t
| Faithfulness | Is the summary grounded in the source documents (e.g., no hallucinations)? | No | Yes - [prompt](https://smith.langchain.com/hub/langchain-ai/summary-hallucination-evaluator) | Yes |
| Helpfulness | Is summary helpful relative to user need? | No | Yes - [prompt](https://smith.langchain.com/hub/langchain-ai/summary-helpfulness-evaluator) | Yes |

### Classification / Tagging
### Classification and tagging

Classification / Tagging applies a label to a given input (e.g., for toxicity detection, sentiment analysis, etc). Classification / Tagging evaluation typically employs the following components, which we will review in detail below:
Classification and tagging apply a label to a given input (e.g., for toxicity detection, sentiment analysis, etc). Classification/tagging evaluation typically employs the following components, which we will review in detail below:

A central consideration for Classification / Tagging evaluation is whether you have a dataset with `reference` labels or not. If not, users frequently want to define an evaluator that uses criteria to apply label (e.g., toxicity, etc) to an input (e.g., text, user-question, etc). However, if ground truth class labels are provided, then the evaluation objective is focused on scoring a Classification / Tagging chain relative to the ground truth class label (e.g., using metrics such as precision, recall, etc).
A central consideration for classification/tagging evaluation is whether you have a dataset with `reference` labels or not. If not, users frequently want to define an evaluator that uses criteria to apply a label (e.g., toxicity, etc.) to an input (e.g., text, user-question, etc.). However, if ground truth class labels are provided, then the evaluation objective is focused on scoring a classification/tagging chain relative to the ground truth class label (e.g., using metrics such as precision, recall, etc.).

If ground truth reference labels are provided, then it's common to simply define a [custom heuristic evaluator](./how_to_guides/custom_evaluator) to compare ground truth labels to the chain output. However, it is increasingly common given the emergence of LLMs simply use `LLM-as-judge` to perform the Classification / Tagging of an input based upon specified criteria (without a ground truth reference).
If ground truth reference labels are provided, then it's common to simply define a [custom heuristic evaluator](./how_to_guides/custom_evaluator) to compare ground truth labels to the chain output. However, given the emergence of LLMs, it is increasingly common to simply use `LLM-as-judge` to perform the classification/tagging of an input based upon specified criteria (without a ground truth reference).

`Online` or `Offline` evaluation is feasible when using `LLM-as-judge` with a `Reference-free` prompt. In particular, this is well suited to `Online` evaluation when a user wants to tag/classify application input (e.g., for toxicity, etc.).
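
For instance, a reference-free `LLM-as-judge` evaluator for toxicity tagging might be sketched as follows (the judge model, prompt, and feedback key are assumptions; any LLM client would work):

```python
from openai import OpenAI  # assumed judge-model client for illustration

oai_client = OpenAI()

def toxicity(inputs: dict) -> dict:
    # Reference-free judge: labels the raw input text without a ground truth label.
    prompt = (
        "Label the following text as 'toxic' or 'non-toxic'. "
        "Respond with only the label.\n\n" + inputs["text"]
    )
    response = oai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    label = response.choices[0].message.content.strip().lower()
    return {"key": "toxicity", "score": 1 if label == "toxic" else 0}
```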

@@ -396,4 +429,4 @@ to run at once.
### Caching

Lastly, you can also cache the API calls made in your experiment by setting the `LANGSMITH_CACHE_PATH` to a valid folder on your device with write access.
This will cause the API calls made in your experiment to be cached to disk, meaning future experiments that make the same API calls will be greatly sped up.
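
A minimal sketch of enabling the cache before running an experiment (the folder path is illustrative):

```python
import os

# Any writable folder works; this path is just an example.
os.environ["LANGSMITH_CACHE_PATH"] = "tests/cassettes"

# API calls made by subsequent experiment runs are written to and replayed from
# this folder, so repeated runs that make the same calls are much faster.
```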
35 changes: 21 additions & 14 deletions docs/evaluation/how_to_guides/evaluate_with_attachments.mdx
@@ -37,17 +37,18 @@ All of the below features are available in the following SDK versions:
<CodeTabs
tabs={[
PythonBlock(`import requests
import uuid\n
import uuid
from pathlib import Path\n
from langsmith import Client
from langsmith.schemas import ExampleUploadWithAttachments, Attachment\n
# Publicly available test files
pdf_url = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
wav_url = "https://openaiassets.blob.core.windows.net/$web/API/docs/audio/alloy.wav"
png_url = "https://www.w3.org/Graphics/PNG/nurbcup2si.png"\n
wav_url = "https://openaiassets.blob.core.windows.net/$web/API/docs/audio/alloy.wav"\n
# Fetch the files as bytes\n
pdf_bytes = requests.get(pdf_url).content
wav_bytes = requests.get(wav_url).content
png_bytes = requests.get(png_url).content\n
wav_bytes = requests.get(wav_url).content\n
# Define the LANGCHAIN_API_KEY environment variable with your API key
langsmith_client = Client()\n
dataset_name = "attachment-test-dataset:" + str(uuid.uuid4())[0:8]\n
@@ -70,20 +71,22 @@ example = ExampleUploadWithAttachments(
},
attachments={
"my_pdf": ("application/pdf", pdf_bytes),
"my_wav": ("audio/wav", wav_bytes),
"my_img": Attachment(mime_type="image/png", data=png_bytes)
"my_wav": Attachment(mime_type="audio/wav", data=wav_bytes),
"my_img": ("image/png", Path(__file__).parent / "my_img.png")
},
)\n
# Upload the examples with attachments
langsmith_client.upload_examples_multipart(dataset_id=dataset.id, uploads=[example])
# Must pass the dangerously_allow_filesystem flag to allow file paths
langsmith_client.upload_examples_multipart(dataset_id=dataset.id, uploads=[example], dangerously_allow_filesystem=True)
`,
`In the Python SDK, you can use the \`upload_examples_multipart\` method to upload examples with attachments.\n
Note that this is a different method from the standard \`create_examples\` method, which currently does not support attachments.\n
Utilize the \`ExampleUploadWithAttachments\` type to define examples with attachments.\n
Each \`Attachment\` requires:
- \`mime_type\` (str): The MIME type of the file (e.g., \`"image/png"\`).
- \`data\` (bytes): The binary content of the file.\n
You can also define an attachment with a tuple tuple of the form \`(mime_type, data)\` for convenience.
- \`data\` (bytes | Path): The binary content of the file, or the file path.\n
You can also define an attachment with a tuple of the form \`(mime_type, data)\` for convenience.\n
Note that to use the file path instead of the raw bytes, you need to set the \`dangerously_allow_filesystem\` option to \`True\`.\n
`
),
TypeScriptBlock(`import { Client } from "langsmith";
Expand All @@ -104,7 +107,7 @@ if (!response.ok) {
const pdfArrayBuffer = await fetchArrayBuffer(pdfUrl);
const wavArrayBuffer = await fetchArrayBuffer(wavUrl);
const pngArrayBuffer = await fetchArrayBuffer(pngUrl);\n
// Create the LangSmith client (Ensure LANGCHAIN_API_KEY is set in env)
// Create the LangSmith client (Ensure LANGSMITH_API_KEY is set in env)
const langsmithClient = new Client();\n
// Create a unique dataset name
const datasetName = "attachment-test-dataset:" + uuid4().substring(0, 8);\n
@@ -147,7 +150,9 @@ Note that this is a different method from the standard \`createExamples\` method
Each attachment requires either a \`Uint8Array\` or an \`ArrayBuffer\` as the data type.\n
- \`Uint8Array\`: Useful for handling binary data directly.
- \`ArrayBuffer\`: Represents fixed-length binary data, which can be converted to \`Uint8Array\` as needed.\n`),
- \`ArrayBuffer\`: Represents fixed-length binary data, which can be converted to \`Uint8Array\` as needed.\n
Note that you cannot directly pass in a file path in the TypeScript SDK, as accessing local files is not supported in all runtime environments.\n
`),
]}
groupId="client-language"
/>
@@ -235,11 +240,12 @@ def file_qa(inputs, attachments): # Read the audio bytes from the reader and enc
`,
`The target function you are evaluating must have two positional arguments in order to consume the attachments associated with the example, the first must be called \`inputs\` and the second must be called \`attachments\`.
- The \`inputs\` argument is a dictionary that contains the input data for the example, excluding the attachments.
- The \`attachments\` argument is a dictionary that maps the attachment name to a dictionary containing a presigned url and a reader of the bytes content of the file. Either can be used to read the bytes of the file:
- The \`attachments\` argument is a dictionary that maps the attachment name to a dictionary containing a presigned url, mime_type, and a reader of the bytes content of the file. You can use either the presigned url or the reader to get the file contents.
Each value in the attachments dictionary is a dictionary with the following structure:
\`\`\`
{
"presigned_url": str,
"mime_type": str,
"reader": BinaryIO
}
\`\`\`
@@ -310,8 +316,9 @@ image_answer: imageCompletion.choices[0].message.content,
`In the TypeScript SDK, the \`config\` argument is used to pass in the attachments to the target function if \`includeAttachments\` is set to \`true\`.\n
The \`config\` will contain \`attachments\` which is an object mapping the attachment name to an object of the form:\n
\`\`\`
{\n
{
presigned_url: string,
mime_type: string,
}
\`\`\``
),
10 changes: 6 additions & 4 deletions docs/evaluation/how_to_guides/index.md
@@ -45,17 +45,19 @@ Evaluate and improve your application before deploying it.
- [Print detailed logs (Python only)](../../observability/how_to_guides/tracing/output_detailed_logs)
- [Run an evaluation locally (beta, Python only)](./how_to_guides/local)

## Unit testing
## Testing integrations

Unit test your system to identify bugs and regressions.
Run evals using your favorite testing tools:

- [Unit test applications (Python only)](./how_to_guides/unit_testing)
- [Run evals with pytest (beta)](./how_to_guides/pytest)
- [Run evals with Vitest/Jest (beta)](./how_to_guides/vitest_jest)

## Online evaluation

Evaluate and monitor your system's live performance on production data.

- [Set up an online evaluator](../../observability/how_to_guides/monitoring/online_evaluations)
- [Set up an LLM-as-judge online evaluator](../../observability/how_to_guides/monitoring/online_evaluations#configure-llm-as-judge-evaluators)
- [Set up a custom code online evaluator](../../observability/how_to_guides/monitoring/online_evaluations#configure-custom-code-evaluators)
- [Create a few-shot evaluator](./how_to_guides/create_few_shot_evaluators)

## Automatic evaluation