
Commit 3db1e3a

Merge branch 'main' into isaac/attachmentpaths
2 parents: d3494df + e9ec569

62 files changed: +2869 -652 lines


Makefile (+5 -3)

@@ -10,15 +10,17 @@ build-api-ref:
 	. .venv/bin/activate
 	$(PYTHON) -m pip install --upgrade pip
 	$(PYTHON) -m pip install --upgrade uv
-	cd langsmith-sdk && ../$(PYTHON) -m uv pip install -r python/docs/requirements.txt
+	$(PYTHON) -m uv pip install -r langsmith-sdk/python/docs/requirements.txt
+	$(PYTHON) -m uv pip install -e langsmith-sdk/python
 	$(PYTHON) langsmith-sdk/python/docs/create_api_rst.py
 	LC_ALL=C $(PYTHON) -m sphinx -T -E -b html -d langsmith-sdk/python/docs/_build/doctrees -c langsmith-sdk/python/docs langsmith-sdk/python/docs langsmith-sdk/python/docs/_build/html -j auto
 	$(PYTHON) langsmith-sdk/python/docs/scripts/custom_formatter.py langsmith-sdk/docs/_build/html/
-
+	cd langsmith-sdk/js && yarn && yarn run build:typedoc --useHostedBaseUrlForAbsoluteLinks true --hostedBaseUrl "https://$${VERCEL_URL:-docs.smith.langchain.com}/reference/js/"
 
 vercel-build: install-vercel-deps build-api-ref
 	mkdir -p static/reference/python
 	mv langsmith-sdk/python/docs/_build/html/* static/reference/python/
+	mkdir -p static/reference/js
+	mv langsmith-sdk/js/_build/api_refs/* static/reference/js/
 	rm -rf langsmith-sdk
 	NODE_OPTIONS="--max-old-space-size=5000" yarn run docusaurus build
-

docs/administration/how_to_guides/organization_management/create_account_api_key.mdx (+2 -7)

@@ -34,12 +34,7 @@ The API key will be shown only once, so make sure to copy it and store it in a s
 
 ## Configure the SDK
 
-You may set the following environment variables in addition to `LANGCHAIN_API_KEY` (or equivalently `LANGSMITH_API_KEY`).
+You may set the following environment variables in addition to `LANGSMITH_API_KEY`.
 These are only required if using the EU instance.
 
-:::info
-`LANGCHAIN_HUB_API_URL` is only required if using the legacy langchainhub sdk
-:::
-
-`LANGCHAIN_ENDPOINT=`<RegionalUrl type='api' link={false} />
-`LANGCHAIN_HUB_API_URL=`<RegionalUrl type='hub' link={false} />
+`LANGSMITH_ENDPOINT=`<RegionalUrl type='api' link={false} />
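
With this change, configuring the Python SDK for the EU instance reduces to setting two environment variables before constructing a client. A minimal sketch (the endpoint value below is illustrative; use the regional URL rendered by `<RegionalUrl type='api' />` above):

```python
import os

from langsmith import Client

# Illustrative values only: the key is a placeholder, and the EU endpoint URL
# stands in for whatever the regional docs render for your region.
os.environ["LANGSMITH_API_KEY"] = "<your-api-key>"
os.environ["LANGSMITH_ENDPOINT"] = "https://eu.api.smith.langchain.com"  # EU instance only

client = Client()  # picks the API key and endpoint up from the environment
```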

docs/administration/how_to_guides/organization_management/manage_organization_by_api.mdx (+4 -4)

@@ -174,10 +174,10 @@ import requests
 
 
 def main():
-    api_key = os.environ["LANGCHAIN_API_KEY"]
-    # LANGCHAIN_ORGANIZATION_ID is not a standard environment variable in the SDK, just used for this example
-    organization_id = os.environ["LANGCHAIN_ORGANIZATION_ID"]
-    base_url = os.environ.get("LANGCHAIN_ENDPOINT") or "https://api.smith.langchain.com"
+    api_key = os.environ["LANGSMITH_API_KEY"]
+    # LANGSMITH_ORGANIZATION_ID is not a standard environment variable in the SDK, just used for this example
+    organization_id = os.environ["LANGSMITH_ORGANIZATION_ID"]
+    base_url = os.environ.get("LANGSMITH_ENDPOINT") or "https://api.smith.langchain.com"
     headers = {
         "Content-Type": "application/json",
         "X-API-Key": api_key,

docs/evaluation/concepts/index.mdx (+50 -16)

@@ -1,20 +1,17 @@
 # Evaluation concepts
 
-The pace of AI application development is often limited by high-quality evaluations.
-Evaluations are methods designed to assess the performance and capabilities of AI applications.
+The quality and development speed of AI applications is often limited by high-quality evaluation datasets and metrics, which enable you to both optimize and test your applications.
 
-Good evaluations make it easy to iteratively improve prompts, select models, test architectures, and ensure that deployed applications continue to perform as expected.
 LangSmith makes building high-quality evaluations easy.
+This guide explains the LangSmith evaluation framework and AI evaluation techniques more broadly.
+The building blocks of the LangSmith framework are:
 
-This guide explains the key concepts behind the LangSmith evaluation framework and evaluations for AI applications more broadly.
-The core components of LangSmith evaluations are:
-
-- [**Datasets**:](/evaluation/concepts#datasets) Collections of test inputs and, optionally, reference outputs for your applications.
-- [**Evaluators**](/evaluation/concepts#evaluators): Functions for scoring the outputs generated by applications given dataset inputs.
+- [**Datasets**:](/evaluation/concepts#datasets) Collections of test inputs and reference outputs.
+- [**Evaluators**](/evaluation/concepts#evaluators): Functions for scoring outputs.
 
 ## Datasets
 
-A dataset contains a collection of examples used for evaluating an application.
+A dataset is a collection of examples used for evaluating an application. An example is a test input, reference output pair.
 
 ![Dataset](./static/dataset_concept.png)
 
@@ -101,7 +98,7 @@ There are a number of ways to define and run evaluators:
 - **Custom code**: Define [custom evaluators](/evaluation/how_to_guides/custom_evaluator) as Python or TypeScript functions and run them client-side using the SDKs or server-side via the UI.
 - **Built-in evaluators**: LangSmith has a number of built-in evaluators that you can configure and run via the UI.
 
-You can run evaluators using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python) and TypeScript), via the [Prompt Playground](../../prompt_engineering/concepts#prompt-playground), or by configuring [Rules](../../observability/how_to_guides/monitoring/rules) to automatically run them on particular tracing projects or datasets.
+You can run evaluators using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python) and [TypeScript](https://docs.smith.langchain.com/reference/js)), via the [Prompt Playground](../../prompt_engineering/concepts#prompt-playground), or by configuring [Rules](../../observability/how_to_guides/monitoring/rules) to automatically run them on particular tracing projects or datasets.
 
 #### Evaluation techniques
 
@@ -141,7 +138,7 @@ Learn [how run pairwise evaluations](/evaluation/how_to_guides/evaluate_pairwise
 ## Experiment
 
 Each time we evaluate an application on a dataset, we are conducting an experiment.
-An experiment is a single execution of the example inputs in your dataset through your application.
+An experiment contains the results of running a specific version of your application on the dataset.
 Typically, we will run multiple experiments on a given dataset, testing different configurations of our application (e.g., different prompts or LLMs).
 In LangSmith, you can easily view all the experiments associated with your dataset.
 Additionally, you can [compare multiple experiments in a comparison view](/evaluation/how_to_guides/compare_experiment_results).
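
In SDK terms, each evaluation run over a dataset corresponds to one `evaluate` call, which produces one experiment. A rough sketch, assuming a recent `langsmith` Python SDK; the application, evaluator, and dataset name are hypothetical:

```python
from langsmith.evaluation import evaluate


def my_app(inputs: dict) -> dict:
    # Hypothetical application under test.
    return {"answer": inputs["question"].upper()}


def exact_match(run, example) -> dict:
    # Custom evaluator: compare the run's output to the example's reference output.
    score = run.outputs["answer"] == example.outputs["answer"]
    return {"key": "exact_match", "score": int(score)}


# Each call like this produces one experiment on the dataset in LangSmith.
results = evaluate(
    my_app,
    data="my-dataset",            # hypothetical dataset name
    evaluators=[exact_match],
    experiment_prefix="baseline",
)
```
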
@@ -165,7 +162,7 @@ It is offline because we're evaluating on a pre-compiled set of data.
 An online evaluation, on the other hand, is one in which we evaluate a deployed application's outputs on real traffic, in near realtime.
 Offline evaluations are used for testing a version(s) of your application pre-deployment.
 
-You can run offline evaluations client-side using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python) and TypeScript). You can run them server-side via the [Prompt Playground](../../prompt_engineering/concepts#prompt-playground) or by configuring [automations](/observability/how_to_guides/monitoring/rules) to run certain evaluators on every new experiment against a specific dataset.
+You can run offline evaluations client-side using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python) and [TypeScript](https://docs.smith.langchain.com/reference/js)). You can run them server-side via the [Prompt Playground](../../prompt_engineering/concepts#prompt-playground) or by configuring [automations](/observability/how_to_guides/monitoring/rules) to run certain evaluators on every new experiment against a specific dataset.
 
 ![Offline](./static/offline.png)
 
@@ -224,6 +221,42 @@ Online evaluators are generally intended to be run server-side. LangSmith has bu
 
 ![Online](./static/online.png)
 
+## Testing
+
+### Evaluations vs testing
+
+Testing and evaluation are very similar and overlapping concepts that often get confused.
+
+**An evaluation measures performance according to a metric(s).**
+Evaluation metrics can be fuzzy or subjective, and are more useful in relative terms than absolute ones.
+That is, they're often used to compare two systems against each other rather than to assert something about an individual system.
+
+**Testing asserts correctness.**
+A system can only be deployed if it passes all tests.
+
+Evaluation metrics can be *turned into* tests.
+For example, you can write regression tests to assert that any new version of a system must outperform some baseline version of the system on the relevant evaluation metrics.
+
+It can also be more resource efficient to run tests and evaluations together if your system is expensive to run and you have overlapping datasets for your tests and evaluations.
+
+You can also choose to write evaluations using standard software testing tools like `pytest` or `vitest/jest` out of convenience.
+
+### Using `pytest` and `vitest/jest`
+
+The LangSmith SDKs come with integrations for [pytest](./how_to_guides/pytest) and [`vitest/jest`](./how_to_guides/vitest_jest).
+These make it easy to:
+- Track test results in LangSmith
+- Write evaluations as tests
+
+Tracking test results in LangSmith makes it easy to share results, compare systems, and debug failing tests.
+
+Writing evaluations as tests can be useful when each example you want to evaluate on requires custom logic for running the application and/or evaluators.
+The standard evaluation flows assume that you can run your application and evaluators in the same way on every example in a dataset.
+But for more complex systems or comprehensive evals, you may want to evaluate specific subsets of your system with specific types of inputs and metrics.
+These types of heterogenous evals are much easier to write as a suite of distinct test cases that all get tracked together rather than using the standard evaluate flow.
+
+Using testing tools is also helpful when you want to *both* evaluate your system's outputs *and* assert some basic things about them.
+
 ## Application-specific techniques
 
 Below, we will discuss evaluation of a few specific, popular LLM applications.
@@ -348,13 +381,13 @@ Summarization is one specific type of free-form writing. The evaluation aim is t
 | Faithfulness | Is the summary grounded in the source documents (e.g., no hallucinations)? | No | Yes - [prompt](https://smith.langchain.com/hub/langchain-ai/summary-hallucination-evaluator) | Yes |
 | Helpfulness | Is summary helpful relative to user need? | No | Yes - [prompt](https://smith.langchain.com/hub/langchain-ai/summary-helpfulness-evaluator) | Yes |
 
-### Classification / Tagging
+### Classification and tagging
 
-Classification / Tagging applies a label to a given input (e.g., for toxicity detection, sentiment analysis, etc). Classification / Tagging evaluation typically employs the following components, which we will review in detail below:
+Classification and tagging apply a label to a given input (e.g., for toxicity detection, sentiment analysis, etc). Classification/tagging evaluation typically employs the following components, which we will review in detail below:
 
-A central consideration for Classification / Tagging evaluation is whether you have a dataset with `reference` labels or not. If not, users frequently want to define an evaluator that uses criteria to apply label (e.g., toxicity, etc) to an input (e.g., text, user-question, etc). However, if ground truth class labels are provided, then the evaluation objective is focused on scoring a Classification / Tagging chain relative to the ground truth class label (e.g., using metrics such as precision, recall, etc).
+A central consideration for classification/tagging evaluation is whether you have a dataset with `reference` labels or not. If not, users frequently want to define an evaluator that uses criteria to apply label (e.g., toxicity, etc) to an input (e.g., text, user-question, etc). However, if ground truth class labels are provided, then the evaluation objective is focused on scoring a classification/tagging chain relative to the ground truth class label (e.g., using metrics such as precision, recall, etc).
 
-If ground truth reference labels are provided, then it's common to simply define a [custom heuristic evaluator](./how_to_guides/custom_evaluator) to compare ground truth labels to the chain output. However, it is increasingly common given the emergence of LLMs simply use `LLM-as-judge` to perform the Classification / Tagging of an input based upon specified criteria (without a ground truth reference).
+If ground truth reference labels are provided, then it's common to simply define a [custom heuristic evaluator](./how_to_guides/custom_evaluator) to compare ground truth labels to the chain output. However, it is increasingly common given the emergence of LLMs simply use `LLM-as-judge` to perform the classification/tagging of an input based upon specified criteria (without a ground truth reference).
 
 `Online` or `Offline` evaluation is feasible when using `LLM-as-judge` with the `Reference-free` prompt used. In particular, this is well suited to `Online` evaluation when a user wants to tag / classify application input (e.g., for toxicity, etc).
 
@@ -363,3 +396,4 @@ If ground truth reference labels are provided, then it's common to simply define
 | Accuracy | Standard definition | Yes | No | No |
 | Precision | Standard definition | Yes | No | No |
 | Recall | Standard definition | Yes | No | No |
+
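
The Testing section added above notes that evaluation metrics can be turned into regression tests, for example with `pytest`. A toy, self-contained sketch of that idea follows; the application, metric, and baseline value are all made up, and the SDK's pytest integration linked in the diff can additionally track such results in LangSmith:

```python
# test_regression.py -- run with `pytest`
EXAMPLES = [
    {"inputs": {"question": "2 + 2"}, "reference": "4"},
    {"inputs": {"question": "capital of France"}, "reference": "Paris"},
]

BASELINE_ACCURACY = 0.5  # score recorded for the previous version of the system


def run_app(inputs: dict) -> str:
    """Stand-in for the application being evaluated."""
    return {"2 + 2": "4", "capital of France": "Paris"}[inputs["question"]]


def exact_match(output: str, reference: str) -> float:
    """Toy evaluation metric."""
    return 1.0 if output.strip() == reference else 0.0


def test_new_version_beats_baseline():
    # The evaluation metric turned into a test: the new version must score at
    # least as well as the recorded baseline on the dataset.
    scores = [exact_match(run_app(ex["inputs"]), ex["reference"]) for ex in EXAMPLES]
    assert sum(scores) / len(scores) >= BASELINE_ACCURACY
```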

docs/evaluation/how_to_guides/evaluate_with_attachments.mdx (+5 -3)

@@ -107,7 +107,7 @@ if (!response.ok) {
 const pdfArrayBuffer = await fetchArrayBuffer(pdfUrl);
 const wavArrayBuffer = await fetchArrayBuffer(wavUrl);
 const pngArrayBuffer = await fetchArrayBuffer(pngUrl);\n
-// Create the LangSmith client (Ensure LANGCHAIN_API_KEY is set in env)
+// Create the LangSmith client (Ensure LANGSMITH_API_KEY is set in env)
 const langsmithClient = new Client();\n
 // Create a unique dataset name
 const datasetName = "attachment-test-dataset:" + uuid4().substring(0, 8);\n
@@ -240,11 +240,12 @@ def file_qa(inputs, attachments): # Read the audio bytes from the reader and enc
 `,
 `The target function you are evaluating must have two positional arguments in order to consume the attachments associated with the example, the first must be called \`inputs\` and the second must be called \`attachments\`.
 - The \`inputs\` argument is a dictionary that contains the input data for the example, excluding the attachments.
-- The \`attachments\` argument is a dictionary that maps the attachment name to a dictionary containing a presigned url and a reader of the bytes content of the file. Either can be used to read the bytes of the file:
+- The \`attachments\` argument is a dictionary that maps the attachment name to a dictionary containing a presigned url, mime_type, and a reader of the bytes content of the file. You can use either the presigned url or the reader to get the file contents.
 Each value in the attachments dictionary is a dictionary with the following structure:
 \`\`\`
 {
 "presigned_url": str,
+"mime_type": str,
 "reader": BinaryIO
 }
 \`\`\`
@@ -315,8 +316,9 @@ image_answer: imageCompletion.choices[0].message.content,
 `In the TypeScript SDK, the \`config\` argument is used to pass in the attachments to the target function if \`includeAttachments\` is set to \`true\`.\n
 The \`config\` will contain \`attachments\` which is an object mapping the attachment name to an object of the form:\n
 \`\`\`
-{\n
+{
 presigned_url: string,
+mime_type: string,
 }
 \`\`\``
 ),
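
To make the Python description above concrete, here is a minimal sketch of a target function with the `(inputs, attachments)` signature; the attachment name "my_pdf" is hypothetical, and each attachment value follows the presigned_url / mime_type / reader structure shown in the diff:

```python
from typing import Any, Dict


def file_qa(inputs: Dict[str, Any], attachments: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    # Each attachment value looks like:
    # {"presigned_url": str, "mime_type": str, "reader": BinaryIO}
    pdf = attachments["my_pdf"]              # hypothetical attachment name
    pdf_bytes = pdf["reader"].read()         # or download via pdf["presigned_url"]
    # A real target would feed the bytes to a model; here we just echo metadata.
    return {
        "question": inputs.get("question"),
        "mime_type": pdf["mime_type"],
        "num_bytes": len(pdf_bytes),
    }
```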

docs/evaluation/how_to_guides/index.md (+4 -3)

@@ -45,11 +45,12 @@ Evaluate and improve your application before deploying it.
 - [Print detailed logs (Python only)](../../observability/how_to_guides/tracing/output_detailed_logs)
 - [Run an evaluation locally (beta, Python only)](./how_to_guides/local)
 
-## Unit testing
+## Testing integrations
 
-Unit test your system to identify bugs and regressions.
+Run evals using your favorite testing tools:
 
-- [Unit test applications (Python only)](./how_to_guides/unit_testing)
+- [Run evals with pytest (beta)](./how_to_guides/pytest)
+- [Run evals with Vitest/Jest (beta)](./how_to_guides/vitest_jest)
 
 ## Online evaluation
