|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "id": "40fb2b99-f188-4634-a11e-672e65752afa", |
| 6 | + "metadata": {}, |
| 7 | + "source": [ |
| 8 | +    "# LangSmith Evaluation Quickstart\n", |
| 9 | +    "\n", |
| 10 | +    "At a high level, an evaluation consists of the following steps:\n", |
| 11 | +    "\n", |
| 12 | +    "- Define the LLM application or target task to evaluate.\n", |
| 13 | +    "- Create or select a dataset to evaluate the LLM application on. Your evaluation criteria may require reference outputs in the dataset.\n", |
| 14 | +    "- Configure evaluators to score the application's outputs (usually by comparing them against the reference outputs / annotations).\n", |
| 15 | +    "- Run the evaluation and inspect the results.\n", |
| 16 | +    "\n", |
| 17 | +    "This tutorial walks through the evaluation of a very simple LLM application (a classifier) that labels input text as \"Toxic\" or \"Not toxic\"." |
| 18 | + ] |
| 19 | + }, |
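| | +  { |
| | +   "cell_type": "markdown", |
| | +   "id": "c0e1f2a3-0000-4000-8000-000000000001", |
| | +   "metadata": {}, |
| | +   "source": [ |
| | +    "Before running the notebook, the LangSmith SDK and the OpenAI client need credentials. Below is a minimal setup sketch, not part of the original notebook: the environment-variable names are the standard LangSmith/OpenAI ones, and the placeholder values are assumptions you must replace with your own keys." |
| | +   ] |
| | +  }, |
| | +  { |
| | +   "cell_type": "code", |
| | +   "execution_count": null, |
| | +   "id": "c0e1f2a3-0000-4000-8000-000000000002", |
| | +   "metadata": {}, |
| | +   "outputs": [], |
| | +   "source": [ |
| | +    "import os\n", |
| | +    "\n", |
| | +    "# Enable LangSmith tracing and point the SDK at your workspace (placeholder values)\n", |
| | +    "os.environ[\"LANGCHAIN_TRACING_V2\"] = \"true\"\n", |
| | +    "os.environ[\"LANGCHAIN_API_KEY\"] = \"<your-langsmith-api-key>\"\n", |
| | +    "\n", |
| | +    "# The classifier below calls the OpenAI API directly\n", |
| | +    "os.environ[\"OPENAI_API_KEY\"] = \"<your-openai-api-key>\"" |
| | +   ] |
| | +  }, |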
| 20 | + { |
| 21 | + "cell_type": "markdown", |
| 22 | + "id": "9c8c8225-42bd-4c9b-adeb-62c83f80c9d3", |
| 23 | + "metadata": {}, |
| 24 | + "source": [ |
| 25 | +    "## 1. Define the target task\n", |
| 26 | +    "\n", |
| 27 | +    "We define a simple evaluation target: an LLM pipeline that classifies text as toxic or non-toxic, with tracing enabled so that the inputs and outputs of each step in the pipeline are captured." |
| 28 | + ] |
| 29 | + }, |
| 30 | + { |
| 31 | + "cell_type": "code", |
| 32 | + "execution_count": 1, |
| 33 | + "id": "4cb8b089-8d3c-4f56-b5d3-2929dcb49c26", |
| 34 | + "metadata": {}, |
| 35 | + "outputs": [], |
| 36 | + "source": [ |
| 37 | + "from langsmith import traceable, wrappers\n", |
| 38 | + "from openai import Client\n", |
| 39 | + "\n", |
| 40 | +    "# Wrap the OpenAI client so its calls are traced in LangSmith\n", |
| 41 | + "openai = wrappers.wrap_openai(Client())\n", |
| 42 | + "\n", |
| 43 | +    "# Mark the function as traceable\n", |
| 44 | + "@traceable\n", |
| 45 | + "def label_text(text):\n", |
| 46 | +    "    # Build the message list: a system message plus the user message\n", |
| 47 | + " messages = [\n", |
| 48 | + " {\n", |
| 49 | + " \"role\": \"system\",\n", |
| 50 | +    "            \"content\": \"Please review the user query below and determine whether it contains any form of harmful behavior, such as insults, threats, or highly negative comments. If it does, reply 'Toxic'; if not, reply 'Not toxic'.\",\n", |
| 51 | + " },\n", |
| 52 | + " {\"role\": \"user\", \"content\": text},\n", |
| 53 | + " ]\n", |
| 54 | + " \n", |
| 55 | +    "    # Call the chat model to classify the text\n", |
| 56 | + " result = openai.chat.completions.create(\n", |
| 57 | + " messages=messages, model=\"gpt-3.5-turbo\", temperature=0\n", |
| 58 | + " )\n", |
| 59 | + " \n", |
| 60 | +    "    # Return the content of the model's reply (the predicted label)\n", |
| 61 | + " return result.choices[0].message.content\n" |
| 62 | + ] |
| 63 | + }, |
| 64 | + { |
| 65 | + "cell_type": "markdown", |
| 66 | + "id": "4c58435a-b044-4939-9902-21e77ec47d42", |
| 67 | + "metadata": {}, |
| 68 | + "source": [ |
| 69 | +    "## 2. Create or select an evaluation dataset\n", |
| 70 | +    "\n", |
| 71 | +    "Next, we create an evaluation dataset with examples of toxic and non-toxic text. Each example in the dataset consists of three dictionaries/objects:\n", |
| 72 | +    "- outputs: the reference labels or other context found in the dataset\n", |
| 73 | +    "- inputs: the inputs to the pipeline\n", |
| 74 | +    "- metadata: any other metadata stored with the example\n", |
| 75 | +    "\n", |
| 76 | +    "These dictionaries/objects can have arbitrary keys and values, but the keys must be consistent across all examples.\n", |
| 77 | +    "\n", |
| 78 | +    "The values can take any form, such as strings, numbers, lists, or dictionaries, but for this tutorial we only use strings." |
| 79 | + ] |
| 80 | + }, |
| 81 | + { |
| 82 | + "cell_type": "code", |
| 83 | + "execution_count": 2, |
| 84 | + "id": "7b54f22e-17ae-41f2-a137-76e84fef9b49", |
| 85 | + "metadata": {}, |
| 86 | + "outputs": [], |
| 87 | + "source": [ |
| 88 | + "from langsmith import Client\n", |
| 89 | + "\n", |
| 90 | +    "# Create the LangSmith client\n", |
| 91 | + "client = Client()\n", |
| 92 | + "\n", |
| 93 | +    "# Example (text, reference label) pairs for the dataset\n", |
| 94 | +    "examples = [\n", |
| 95 | +    "    (\"Shut up, idiot\", \"Toxic\"), # toxic\n", |
| 96 | +    "    (\"You're a wonderful person\", \"Not toxic\"), # not toxic\n", |
| 97 | +    "    (\"This is the worst thing ever\", \"Toxic\"), # toxic\n", |
| 98 | +    "    (\"I had a great day today\", \"Not toxic\"), # not toxic\n", |
| 99 | +    "    (\"Nobody likes you\", \"Toxic\"), # toxic\n", |
| 100 | +    "    (\"This is unacceptable. I want to speak to the manager.\", \"Not toxic\"), # not toxic\n", |
| 101 | + "]\n", |
| 102 | + "\n", |
| 103 | +    "# Dataset name\n", |
| 104 | + "dataset_name = \"Toxic Queries\" \n", |
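| | +    "# Note (assumption): create_dataset may fail if a dataset named \"Toxic Queries\" already exists\n", |
| | +    "# in your workspace; in that case reuse the existing dataset (e.g. guard with client.has_dataset(...))\n", |
| | +    "# or pick a new name.\n", |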
| 105 | + "dataset = client.create_dataset(dataset_name=dataset_name)\n", |
| 106 | + "\n", |
| 107 | +    "# Split the examples into inputs and outputs\n", |
| 108 | + "inputs, outputs = zip(\n", |
| 109 | + " *[({\"text\": text}, {\"label\": label}) for text, label in examples]\n", |
| 110 | + ")\n", |
| 111 | + "\n", |
| 112 | +    "# Create the examples and add them to the dataset\n", |
| 113 | + "client.create_examples(inputs=inputs, outputs=outputs, dataset_id=dataset.id)" |
| 114 | + ] |
| 115 | + }, |
| 116 | + { |
| 117 | + "cell_type": "markdown", |
| 118 | + "id": "353b3e32-0f28-4de4-8749-03337905385f", |
| 119 | + "metadata": {}, |
| 120 | + "source": [ |
| 121 | +    "## 3. Configure evaluators\n", |
| 122 | +    "\n", |
| 123 | +    "Create an evaluator that scores the model's outputs by comparing them against the annotations in the dataset." |
| 124 | + ] |
| 125 | + }, |
| 126 | + { |
| 127 | + "cell_type": "code", |
| 128 | + "execution_count": 3, |
| 129 | + "id": "0559ea2a-082d-4836-92cd-7473711ee79a", |
| 130 | + "metadata": {}, |
| 131 | + "outputs": [], |
| 132 | + "source": [ |
| 133 | + "from langsmith.schemas import Example, Run\n", |
| 134 | + "\n", |
| 135 | +    "# Evaluator: check whether the predicted label is correct\n", |
| 136 | +    "def correct_label(root_run: Run, example: Example) -> dict:\n", |
| 137 | +    "    # Compare the run's output (label_text returns a plain string, which LangSmith stores under the \"output\" key) with the example's reference label\n", |
| 138 | +    "    score = root_run.outputs.get(\"output\") == example.outputs.get(\"label\")\n", |
| 139 | +    "    # Return a dict with the score and the feedback key\n", |
| 140 | + " return {\"score\": int(score), \"key\": \"correct_label\"}" |
| 141 | + ] |
| 142 | + }, |
| 143 | + { |
| 144 | + "cell_type": "markdown", |
| 145 | + "id": "8fe3233b-4762-48fe-bc72-3924a5bc03f6", |
| 146 | + "metadata": {}, |
| 147 | + "source": [ |
| 148 | +    "## 4. Run the evaluation and inspect the results\n", |
| 149 | +    "\n", |
| 150 | +    "We use the `evaluate` method to run the evaluation. It takes the following arguments:\n", |
| 151 | +    "\n", |
| 152 | +    "- function: takes an input dictionary or object and returns an output dictionary or object\n", |
| 153 | +    "- data: the name or UUID of the LangSmith dataset to evaluate on, or an iterator of examples\n", |
| 154 | +    "- evaluators: a list of evaluators used to score the outputs of the function\n", |
| 155 | +    "- experiment_prefix: a string to prefix the experiment name with. If not provided, a name will be generated automatically." |
| 156 | + ] |
| 157 | + }, |
| 158 | + { |
| 159 | + "cell_type": "code", |
| 160 | + "execution_count": 20, |
| 161 | + "id": "eeec0c29-5e85-46e1-915b-619b68627d63", |
| 162 | + "metadata": {}, |
| 163 | + "outputs": [ |
| 164 | + { |
| 165 | + "ename": "ImportError", |
| 166 | + "evalue": "cannot import name 'evaluate' from 'langsmith.evaluation' (/home/ubuntu/miniconda3/envs/langchain/lib/python3.10/site-packages/langsmith/evaluation/__init__.py)", |
| 167 | + "output_type": "error", |
| 168 | + "traceback": [ |
| 169 | + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", |
| 170 | + "\u001b[0;31mImportError\u001b[0m Traceback (most recent call last)", |
| 171 | + "Cell \u001b[0;32mIn[20], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mlangsmith\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mevaluation\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m evaluate\n\u001b[1;32m 3\u001b[0m \u001b[38;5;66;03m# 数据集名称\u001b[39;00m\n\u001b[1;32m 4\u001b[0m dataset_name \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mToxic Queries\u001b[39m\u001b[38;5;124m\"\u001b[39m\n", |
| 172 | + "\u001b[0;31mImportError\u001b[0m: cannot import name 'evaluate' from 'langsmith.evaluation' (/home/ubuntu/miniconda3/envs/langchain/lib/python3.10/site-packages/langsmith/evaluation/__init__.py)" |
| 173 | + ] |
| 174 | + } |
| 175 | + ], |
| 176 | + "source": [ |
| 177 | + "from langsmith.evaluation import evaluate\n", |
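| | +    "# NOTE (assumption): this import requires a recent langsmith SDK; on older versions it raises the\n", |
| | +    "# ImportError recorded in this cell's output, and upgrading (e.g. `pip install -U langsmith`)\n", |
| | +    "# should resolve it.\n", |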
| 178 | + "\n", |
| 179 | +    "# Dataset name\n", |
| 180 | +    "dataset_name = \"Toxic Queries\"\n", |
| 181 | +    "\n", |
| 184 | +    "# Run the evaluation\n", |
| 185 | +    "results = evaluate(\n", |
| 186 | +    "    # Apply label_text to each example's input\n", |
| 187 | +    "    lambda inputs: label_text(inputs[\"text\"]),\n", |
| 188 | +    "    data=dataset_name, # dataset name\n", |
| 189 | +    "    evaluators=[correct_label], # score with the correct_label evaluator\n", |
| 190 | +    "    experiment_prefix=\"Toxic Queries\", # experiment name prefix\n", |
| 191 | +    "    description=\"Testing the baseline system.\", # optional description\n", |
| 192 | + ")" |
| 193 | + ] |
| 194 | + }, |
| 195 | + { |
| 196 | + "cell_type": "code", |
| 197 | + "execution_count": null, |
| 198 | + "id": "bfea48d3-c461-4576-9efe-3ae4af6bd084", |
| 199 | + "metadata": {}, |
| 200 | + "outputs": [], |
| 201 | + "source": [] |
| 202 | + }, |
| 203 | + { |
| 204 | + "cell_type": "code", |
| 205 | + "execution_count": null, |
| 206 | + "id": "3432c5d1-6e8f-4c42-b18c-a566645c4f40", |
| 207 | + "metadata": {}, |
| 208 | + "outputs": [], |
| 209 | + "source": [] |
| 210 | + }, |
| 211 | + { |
| 212 | + "cell_type": "markdown", |
| 213 | + "id": "e0f8ea90-1b5f-4761-8f2d-9e19d3b61e15", |
| 214 | + "metadata": {}, |
| 215 | + "source": [ |
| 216 | +    "## Rewriting the RAG bot with LCEL\n", |
| | +    "\n", |
| | +    "This section builds a retrieval index and a simple `RagBot` on top of the raw OpenAI client; a sketch of an equivalent LCEL chain follows at the end of the section." |
| 217 | + ] |
| 218 | + }, |
| 219 | + { |
| 220 | + "cell_type": "code", |
| 221 | + "execution_count": 23, |
| 222 | + "id": "46817304-1e17-4ca1-a5ba-faebd80c3728", |
| 223 | + "metadata": {}, |
| 224 | + "outputs": [], |
| 225 | + "source": [ |
| 226 | +    "### Indexing\n", |
| 227 | + "\n", |
| 228 | + "from bs4 import BeautifulSoup as Soup\n", |
| 229 | + "from langchain_community.vectorstores import Chroma\n", |
| 230 | + "from langchain_openai import OpenAIEmbeddings\n", |
| 231 | + "from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader\n", |
| 232 | + "from langchain_text_splitters import RecursiveCharacterTextSplitter\n", |
| 233 | + "\n", |
| 234 | +    "# Load the documents\n", |
| 235 | + "url = \"https://python.langchain.com/v0.1/docs/expression_language/\"\n", |
| 236 | + "loader = RecursiveUrlLoader(\n", |
| 237 | + " url=url, max_depth=20, extractor=lambda x: Soup(x, \"html.parser\").text\n", |
| 238 | + ")\n", |
| 239 | + "docs = loader.load()\n", |
| 240 | + "\n", |
| 241 | +    "# Split the documents into chunks\n", |
| 242 | + "text_splitter = RecursiveCharacterTextSplitter(chunk_size=4500, chunk_overlap=200)\n", |
| 243 | + "splits = text_splitter.split_documents(docs)\n", |
| 244 | + "\n", |
| 245 | +    "# Embed the chunks and store them in Chroma\n", |
| 246 | + "vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())\n", |
| 247 | + "\n", |
| 248 | +    "# Create a retriever\n", |
| 249 | + "retriever = vectorstore.as_retriever()" |
| 250 | + ] |
| 251 | + }, |
| 252 | + { |
| 253 | + "cell_type": "code", |
| 254 | + "execution_count": null, |
| 255 | + "id": "096e3129-8e5e-42b9-8c42-d59f072f20c5", |
| 256 | + "metadata": {}, |
| 257 | + "outputs": [], |
| 258 | + "source": [ |
| 259 | +    "### The RAG bot\n", |
| 260 | + "\n", |
| 261 | + "import openai\n", |
| 262 | + "from langsmith import traceable\n", |
| 263 | + "from langsmith.wrappers import wrap_openai\n", |
| 264 | + "\n", |
| 265 | + "class RagBot:\n", |
| 266 | + "\n", |
| 267 | + " def __init__(self, retriever, model: str = \"gpt-4-0125-preview\"):\n", |
| 268 | + " self._retriever = retriever\n", |
| 269 | +    "        # Wrap the client so LLM calls are traced\n", |
| 270 | + " self._client = wrap_openai(openai.Client())\n", |
| 271 | + " self._model = model\n", |
| 272 | + "\n", |
| 273 | + " @traceable()\n", |
| 274 | + " def retrieve_docs(self, question):\n", |
| 275 | +    "        # Call the retriever to fetch relevant documents\n", |
| 276 | + " return self._retriever.invoke(question)\n", |
| 277 | + "\n", |
| 278 | + " @traceable()\n", |
| 279 | + " def invoke_llm(self, question, docs):\n", |
| 280 | +    "        # Call the LLM to generate a response\n", |
| 281 | + " response = self._client.chat.completions.create(\n", |
| 282 | + " model=self._model,\n", |
| 283 | + " messages=[\n", |
| 284 | + " {\n", |
| 285 | + " \"role\": \"system\",\n", |
| 286 | +    "                    \"content\": \"You are a helpful AI coding assistant with expertise in LCEL. Use the following docs to answer the user's question with a concise code solution.\\n\\n\"\n", |
| 287 | + " f\"## 文档\\n\\n{docs}\",\n", |
| 288 | + " },\n", |
| 289 | + " {\"role\": \"user\", \"content\": question},\n", |
| 290 | + " ],\n", |
| 291 | + " )\n", |
| 292 | + "\n", |
| 293 | +    "        # Evaluators will expect \"answer\" and \"contexts\" keys\n", |
| 294 | + " return {\n", |
| 295 | + " \"answer\": response.choices[0].message.content,\n", |
| 296 | + " \"contexts\": [str(doc) for doc in docs],\n", |
| 297 | + " }\n", |
| 298 | + "\n", |
| 299 | + " @traceable()\n", |
| 300 | + " def get_answer(self, question: str):\n", |
| 301 | +    "        # Retrieve documents, then generate the answer\n", |
| 302 | + " docs = self.retrieve_docs(question)\n", |
| 303 | + " return self.invoke_llm(question, docs)\n", |
| 304 | + "\n", |
| 305 | +    "# Create a RagBot instance\n", |
| 306 | + "rag_bot = RagBot(retriever)" |
| 307 | + ] |
| 308 | + }, |
| 309 | + { |
| 310 | + "cell_type": "code", |
| 311 | + "execution_count": 32, |
| 312 | + "id": "431bbdb3-d4a3-445a-9cfc-2e62adff3ad0", |
| 313 | + "metadata": {}, |
| 314 | + "outputs": [ |
| 315 | + { |
| 316 | + "data": { |
| 317 | + "text/plain": [ |
| 318 | + "\"To build a RAG (Retrieval-Augmented Generation) chain in LangChain Expression Language (LCEL), you integrate components that handle retrieval (searching for relevant information from a database or document collection) and generation (creating responses based on the retrieved information). The LCEL document provided doesn't go into specifics about a RAG chain configuration, but based on the principles of LCEL, I can guide you through constructing a simplified RAG chain using hypothetical LCEL com\"" |
| 319 | + ] |
| 320 | + }, |
| 321 | + "execution_count": 32, |
| 322 | + "metadata": {}, |
| 323 | + "output_type": "execute_result" |
| 324 | + } |
| 325 | + ], |
| 326 | + "source": [ |
| 327 | + "response = rag_bot.get_answer(\"How to build a RAG chain in LCEL?\")\n", |
| 328 | + "response[\"answer\"][:500]" |
| 329 | + ] |
| 330 | + }, |
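| | +  { |
| | +   "cell_type": "markdown", |
| | +   "id": "f7a9c1d2-0000-4000-8000-0000000000a1", |
| | +   "metadata": {}, |
| | +   "source": [ |
| | +    "The `RagBot` above calls the OpenAI client directly rather than using LCEL. Below is a minimal sketch, not part of the original notebook, of how the same retrieve-then-generate flow could be expressed as an LCEL chain. It reuses the `retriever` built in the indexing cell; the prompt wording and the `gpt-4-0125-preview` model name are assumptions carried over from `RagBot`. Because every step is a Runnable, the resulting chain gets `invoke`/`stream`/`batch` for free and is traced in LangSmith when tracing is enabled." |
| | +   ] |
| | +  }, |
| | +  { |
| | +   "cell_type": "code", |
| | +   "execution_count": null, |
| | +   "id": "f7a9c1d2-0000-4000-8000-0000000000a2", |
| | +   "metadata": {}, |
| | +   "outputs": [], |
| | +   "source": [ |
| | +    "### LCEL sketch (assumed rewrite of RagBot, reusing `retriever` from above)\n", |
| | +    "\n", |
| | +    "from langchain_core.output_parsers import StrOutputParser\n", |
| | +    "from langchain_core.prompts import ChatPromptTemplate\n", |
| | +    "from langchain_core.runnables import RunnablePassthrough\n", |
| | +    "from langchain_openai import ChatOpenAI\n", |
| | +    "\n", |
| | +    "def format_docs(docs):\n", |
| | +    "    # Join the retrieved documents into a single context string\n", |
| | +    "    return \"\\n\\n\".join(doc.page_content for doc in docs)\n", |
| | +    "\n", |
| | +    "prompt = ChatPromptTemplate.from_messages(\n", |
| | +    "    [\n", |
| | +    "        (\"system\", \"You are a helpful AI coding assistant with expertise in LCEL. Use the following docs to answer the user's question with a concise code solution.\\n\\n## Docs\\n\\n{context}\"),\n", |
| | +    "        (\"user\", \"{question}\"),\n", |
| | +    "    ]\n", |
| | +    ")\n", |
| | +    "\n", |
| | +    "llm = ChatOpenAI(model=\"gpt-4-0125-preview\", temperature=0)\n", |
| | +    "\n", |
| | +    "# The retriever output feeds the prompt's {context}; the question passes through unchanged\n", |
| | +    "rag_chain = (\n", |
| | +    "    {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n", |
| | +    "    | prompt\n", |
| | +    "    | llm\n", |
| | +    "    | StrOutputParser()\n", |
| | +    ")\n", |
| | +    "\n", |
| | +    "# Example usage:\n", |
| | +    "# rag_chain.invoke(\"How to build a RAG chain in LCEL?\")" |
| | +   ] |
| | +  }, |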
| 331 | + { |
| 332 | + "cell_type": "code", |
| 333 | + "execution_count": null, |
| 334 | + "id": "1b4ca951-0ed8-41c5-adb9-694776a7a2e7", |
| 335 | + "metadata": {}, |
| 336 | + "outputs": [], |
| 337 | + "source": [] |
| 338 | + } |
| 339 | + ], |
| 340 | + "metadata": { |
| 341 | + "kernelspec": { |
| 342 | + "display_name": "Python 3 (ipykernel)", |
| 343 | + "language": "python", |
| 344 | + "name": "python3" |
| 345 | + }, |
| 346 | + "language_info": { |
| 347 | + "codemirror_mode": { |
| 348 | + "name": "ipython", |
| 349 | + "version": 3 |
| 350 | + }, |
| 351 | + "file_extension": ".py", |
| 352 | + "mimetype": "text/x-python", |
| 353 | + "name": "python", |
| 354 | + "nbconvert_exporter": "python", |
| 355 | + "pygments_lexer": "ipython3", |
| 356 | + "version": "3.10.14" |
| 357 | + } |
| 358 | + }, |
| 359 | + "nbformat": 4, |
| 360 | + "nbformat_minor": 5 |
| 361 | +} |