|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "id": "40fb2b99-f188-4634-a11e-672e65752afa", |
| 6 | + "metadata": {}, |
| 7 | + "source": [ |
| 8 | +    "# LangSmith Evaluation Quickstart\n", |
| 9 | +    "\n", |
| 10 | +    "At a high level, an evaluation consists of the following steps:\n", |
| 11 | +    "\n", |
| 12 | +    "- Define the LLM application or target task to evaluate.\n", |
| 13 | +    "- Create or select a dataset to evaluate the LLM application on. Your evaluation criteria may require reference outputs in the dataset.\n", |
| 14 | +    "- Configure evaluators to score the application's outputs (usually by comparing them against the reference outputs / annotations).\n", |
| 15 | +    "- Run the evaluation and inspect the results.\n", |
| 16 | +    "\n", |
| 17 | +    "This tutorial walks through the evaluation of a very simple LLM application (a classifier) that labels input text as \"Toxic\" or \"Not toxic\"." |
| 18 | + ] |
| 19 | + }, |
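| | +  { |
| | +   "cell_type": "markdown", |
| | +   "id": "c0e1f2a3-0000-4000-8000-000000000001", |
| | +   "metadata": {}, |
| | +   "source": [ |
| | +    "Before running the notebook, the LangSmith SDK and the OpenAI client need credentials. Below is a minimal setup sketch, not part of the original notebook: the environment-variable names are the standard LangSmith/OpenAI ones, and the placeholder values are assumptions you must replace with your own keys." |
| | +   ] |
| | +  }, |
| | +  { |
| | +   "cell_type": "code", |
| | +   "execution_count": null, |
| | +   "id": "c0e1f2a3-0000-4000-8000-000000000002", |
| | +   "metadata": {}, |
| | +   "outputs": [], |
| | +   "source": [ |
| | +    "import os\n", |
| | +    "\n", |
| | +    "# Enable LangSmith tracing and point the SDK at your workspace (placeholder values)\n", |
| | +    "os.environ[\"LANGCHAIN_TRACING_V2\"] = \"true\"\n", |
| | +    "os.environ[\"LANGCHAIN_API_KEY\"] = \"<your-langsmith-api-key>\"\n", |
| | +    "\n", |
| | +    "# The classifier below calls the OpenAI API directly\n", |
| | +    "os.environ[\"OPENAI_API_KEY\"] = \"<your-openai-api-key>\"" |
| | +   ] |
| | +  }, |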
| 20 | + { |
| 21 | + "cell_type": "markdown", |
| 22 | + "id": "9c8c8225-42bd-4c9b-adeb-62c83f80c9d3", |
| 23 | + "metadata": {}, |
| 24 | + "source": [ |
| 25 | +    "## 1. Define the target task\n", |
| 26 | +    "\n", |
| 27 | +    "We define a simple evaluation target: an LLM pipeline that classifies text as toxic or non-toxic, with tracing enabled so that the inputs and outputs of each step in the pipeline are captured." |
| 28 | + ] |
| 29 | + }, |
| 30 | + { |
| 31 | + "cell_type": "code", |
| 32 | + "execution_count": 1, |
| 33 | + "id": "4cb8b089-8d3c-4f56-b5d3-2929dcb49c26", |
| 34 | + "metadata": {}, |
| 35 | + "outputs": [], |
| 36 | + "source": [ |
| 37 | + "from langsmith import traceable, wrappers\n", |
| 38 | + "from openai import Client\n", |
| 39 | + "\n", |
| 40 | +    "# Wrap the OpenAI client so its calls are traced in LangSmith\n", |
| 41 | + "openai = wrappers.wrap_openai(Client())\n", |
| 42 | + "\n", |
| 43 | +    "# Mark the function as traceable\n", |
| 44 | + "@traceable\n", |
| 45 | + "def label_text(text):\n", |
| 46 | +    "    # Build the message list: a system message plus the user message\n", |
| 47 | + " messages = [\n", |
| 48 | + " {\n", |
| 49 | + " \"role\": \"system\",\n", |
| 50 | +    "            \"content\": \"Please review the user query below and determine whether it contains any form of harmful behavior, such as insults, threats, or highly negative comments. If it does, reply 'Toxic'; if not, reply 'Not toxic'.\",\n", |
| 51 | + " },\n", |
| 52 | + " {\"role\": \"user\", \"content\": text},\n", |
| 53 | + " ]\n", |
| 54 | + " \n", |
| 55 | +    "    # Call the chat model to classify the text\n", |
| 56 | + " result = openai.chat.completions.create(\n", |
| 57 | + " messages=messages, model=\"gpt-3.5-turbo\", temperature=0\n", |
| 58 | + " )\n", |
| 59 | + " \n", |
| 60 | +    "    # Return the content of the model's reply (the predicted label)\n", |
| 61 | + " return result.choices[0].message.content\n" |
| 62 | + ] |
| 63 | + }, |
| 64 | + { |
| 65 | + "cell_type": "markdown", |
| 66 | + "id": "4c58435a-b044-4939-9902-21e77ec47d42", |
| 67 | + "metadata": {}, |
| 68 | + "source": [ |
| 69 | +    "## 2. Create or select an evaluation dataset\n", |
| 70 | +    "\n", |
| 71 | +    "Next, we create an evaluation dataset with examples of toxic and non-toxic text. Each example in the dataset consists of three dictionaries/objects:\n", |
| 72 | +    "- outputs: the reference labels or other context found in the dataset\n", |
| 73 | +    "- inputs: the inputs to the pipeline\n", |
| 74 | +    "- metadata: any other metadata stored with the example\n", |
| 75 | +    "\n", |
| 76 | +    "These dictionaries/objects can have arbitrary keys and values, but the keys must be consistent across all examples.\n", |
| 77 | +    "\n", |
| 78 | +    "The values can take any form, such as strings, numbers, lists, or dictionaries, but for this tutorial we only use strings." |
| 79 | + ] |
| 80 | + }, |
| 81 | + { |
| 82 | + "cell_type": "code", |
| 83 | + "execution_count": 2, |
| 84 | + "id": "7b54f22e-17ae-41f2-a137-76e84fef9b49", |
| 85 | + "metadata": {}, |
| 86 | + "outputs": [], |
| 87 | + "source": [ |
| 88 | + "from langsmith import Client\n", |
| 89 | + "\n", |
| 90 | +    "# Create the LangSmith client\n", |
| 91 | + "client = Client()\n", |
| 92 | + "\n", |
| 93 | +    "# Example (text, reference label) pairs for the dataset\n", |
| 94 | +    "examples = [\n", |
| 95 | +    "    (\"Shut up, idiot\", \"Toxic\"), # toxic\n", |
| 96 | +    "    (\"You're a wonderful person\", \"Not toxic\"), # not toxic\n", |
| 97 | +    "    (\"This is the worst thing ever\", \"Toxic\"), # toxic\n", |
| 98 | +    "    (\"I had a great day today\", \"Not toxic\"), # not toxic\n", |
| 99 | +    "    (\"Nobody likes you\", \"Toxic\"), # toxic\n", |
| 100 | +    "    (\"This is unacceptable. I want to speak to the manager.\", \"Not toxic\"), # not toxic\n", |
| 101 | + "]\n", |
| 102 | + "\n", |
| 103 | +    "# Dataset name\n", |
| 104 | + "dataset_name = \"Toxic Queries\" \n", |
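| | +    "# Note (assumption): create_dataset may fail if a dataset named \"Toxic Queries\" already exists\n", |
| | +    "# in your workspace; in that case reuse the existing dataset (e.g. guard with client.has_dataset(...))\n", |
| | +    "# or pick a new name.\n", |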
| 105 | + "dataset = client.create_dataset(dataset_name=dataset_name)\n", |
| 106 | + "\n", |
| 107 | +    "# Split the examples into inputs and outputs\n", |
| 108 | + "inputs, outputs = zip(\n", |
| 109 | + " *[({\"text\": text}, {\"label\": label}) for text, label in examples]\n", |
| 110 | + ")\n", |
| 111 | + "\n", |
| 112 | +    "# Create the examples and add them to the dataset\n", |
| 113 | + "client.create_examples(inputs=inputs, outputs=outputs, dataset_id=dataset.id)" |
| 114 | + ] |
| 115 | + }, |
| 116 | + { |
| 117 | + "cell_type": "markdown", |
| 118 | + "id": "353b3e32-0f28-4de4-8749-03337905385f", |
| 119 | + "metadata": {}, |
| 120 | + "source": [ |
| 121 | +    "## 3. Configure evaluators\n", |
| 122 | +    "\n", |
| 123 | +    "Create an evaluator that scores the model's outputs by comparing them against the annotations in the dataset." |
| 124 | + ] |
| 125 | + }, |
| 126 | + { |
| 127 | + "cell_type": "code", |
| 128 | + "execution_count": 3, |
| 129 | + "id": "0559ea2a-082d-4836-92cd-7473711ee79a", |
| 130 | + "metadata": {}, |
| 131 | + "outputs": [], |
| 132 | + "source": [ |
| 133 | + "from langsmith.schemas import Example, Run\n", |
| 134 | + "\n", |
| 135 | +    "# Evaluator: check whether the predicted label is correct\n", |
| 136 | +    "def correct_label(root_run: Run, example: Example) -> dict:\n", |
| 137 | +    "    # Compare the run's output (label_text returns a plain string, which LangSmith stores under the \"output\" key) with the example's reference label\n", |
| 138 | +    "    score = root_run.outputs.get(\"output\") == example.outputs.get(\"label\")\n", |
| 139 | +    "    # Return a dict with the score and the feedback key\n", |
| 140 | + " return {\"score\": int(score), \"key\": \"correct_label\"}" |
| 141 | + ] |
| 142 | + }, |
| 143 | + { |
| 144 | + "cell_type": "markdown", |
| 145 | + "id": "8fe3233b-4762-48fe-bc72-3924a5bc03f6", |
| 146 | + "metadata": {}, |
| 147 | + "source": [ |
| 148 | +    "## 4. Run the evaluation and inspect the results\n", |
| 149 | +    "\n", |
| 150 | +    "We use the `evaluate` method to run the evaluation. It takes the following arguments:\n", |
| 151 | +    "\n", |
| 152 | +    "- function: takes an input dictionary or object and returns an output dictionary or object\n", |
| 153 | +    "- data: the name or UUID of the LangSmith dataset to evaluate on, or an iterator of examples\n", |
| 154 | +    "- evaluators: a list of evaluators used to score the outputs of the function\n", |
| 155 | +    "- experiment_prefix: a string to prefix the experiment name with. If not provided, a name will be generated automatically." |
| 156 | + ] |
| 157 | + }, |
| 158 | + { |
| 159 | + "cell_type": "code", |
| 160 | + "execution_count": 20, |
| 161 | + "id": "eeec0c29-5e85-46e1-915b-619b68627d63", |
| 162 | + "metadata": {}, |
| 163 | + "outputs": [ |
| 164 | + { |
| 165 | + "ename": "ImportError", |
| 166 | + "evalue": "cannot import name 'evaluate' from 'langsmith.evaluation' (/home/ubuntu/miniconda3/envs/langchain/lib/python3.10/site-packages/langsmith/evaluation/__init__.py)", |
| 167 | + "output_type": "error", |
| 168 | + "traceback": [ |
| 169 | + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", |
| 170 | + "\u001b[0;31mImportError\u001b[0m Traceback (most recent call last)", |
| 171 | + "Cell \u001b[0;32mIn[20], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mlangsmith\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mevaluation\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m evaluate\n\u001b[1;32m 3\u001b[0m \u001b[38;5;66;03m# 数据集名称\u001b[39;00m\n\u001b[1;32m 4\u001b[0m dataset_name \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mToxic Queries\u001b[39m\u001b[38;5;124m\"\u001b[39m\n", |
| 172 | + "\u001b[0;31mImportError\u001b[0m: cannot import name 'evaluate' from 'langsmith.evaluation' (/home/ubuntu/miniconda3/envs/langchain/lib/python3.10/site-packages/langsmith/evaluation/__init__.py)" |
| 173 | + ] |
| 174 | + } |
| 175 | + ], |
| 176 | + "source": [ |
| 177 | + "from langsmith.evaluation import evaluate\n", |
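| | +    "# NOTE (assumption): this import requires a recent langsmith SDK; on older versions it raises the\n", |
| | +    "# ImportError recorded in this cell's output, and upgrading (e.g. `pip install -U langsmith`)\n", |
| | +    "# should resolve it.\n", |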
| 178 | + "\n", |
| 179 | +    "# Dataset name\n", |
| 180 | +    "dataset_name = \"Toxic Queries\"\n", |
| 181 | +    "\n", |
| 184 | +    "# Run the evaluation\n", |
| 185 | +    "results = evaluate(\n", |
| 186 | +    "    # Apply label_text to each example's input\n", |
| 187 | +    "    lambda inputs: label_text(inputs[\"text\"]),\n", |
| 188 | +    "    data=dataset_name, # dataset name\n", |
| 189 | +    "    evaluators=[correct_label], # score with the correct_label evaluator\n", |
| 190 | +    "    experiment_prefix=\"Toxic Queries\", # experiment name prefix\n", |
| 191 | +    "    description=\"Testing the baseline system.\", # optional description\n", |
| 192 | + ")" |
| 193 | + ] |
| 194 | + }, |
| 195 | + { |
| 196 | + "cell_type": "code", |
| 197 | + "execution_count": null, |
| 198 | + "id": "bfea48d3-c461-4576-9efe-3ae4af6bd084", |
| 199 | + "metadata": {}, |
| 200 | + "outputs": [], |
| 201 | + "source": [] |
| 202 | + }, |
| 203 | + { |
| 204 | + "cell_type": "code", |
| 205 | + "execution_count": null, |
| 206 | + "id": "3432c5d1-6e8f-4c42-b18c-a566645c4f40", |
| 207 | + "metadata": {}, |
| 208 | + "outputs": [], |
| 209 | + "source": [] |
| 210 | + }, |
| 211 | + { |
| 212 | + "cell_type": "markdown", |
| 213 | + "id": "e0f8ea90-1b5f-4761-8f2d-9e19d3b61e15", |
| 214 | + "metadata": {}, |
| 215 | + "source": [ |
| 216 | +    "## Rewriting the RAG bot with LCEL\n", |
| | +    "\n", |
| | +    "This section builds a retrieval index and a simple `RagBot` on top of the raw OpenAI client; a sketch of an equivalent LCEL chain follows at the end of the section." |
| 217 | + ] |
| 218 | + }, |
| 219 | + { |
| 220 | + "cell_type": "code", |
| 221 | + "execution_count": 23, |
| 222 | + "id": "46817304-1e17-4ca1-a5ba-faebd80c3728", |
| 223 | + "metadata": {}, |
| 224 | + "outputs": [], |
| 225 | + "source": [ |
| 226 | +    "### Indexing\n", |
| 227 | + "\n", |
| 228 | + "from bs4 import BeautifulSoup as Soup\n", |
| 229 | + "from langchain_community.vectorstores import Chroma\n", |
| 230 | + "from langchain_openai import OpenAIEmbeddings\n", |
| 231 | + "from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader\n", |
| 232 | + "from langchain_text_splitters import RecursiveCharacterTextSplitter\n", |
| 233 | + "\n", |
| 234 | +    "# Load the documents\n", |
| 235 | + "url = \"https://python.langchain.com/v0.1/docs/expression_language/\"\n", |
| 236 | + "loader = RecursiveUrlLoader(\n", |
| 237 | + " url=url, max_depth=20, extractor=lambda x: Soup(x, \"html.parser\").text\n", |
| 238 | + ")\n", |
| 239 | + "docs = loader.load()\n", |
| 240 | + "\n", |
| 241 | +    "# Split the documents into chunks\n", |
| 242 | + "text_splitter = RecursiveCharacterTextSplitter(chunk_size=4500, chunk_overlap=200)\n", |
| 243 | + "splits = text_splitter.split_documents(docs)\n", |
| 244 | + "\n", |
| 245 | +    "# Embed the chunks and store them in Chroma\n", |
| 246 | + "vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())\n", |
| 247 | + "\n", |
| 248 | +    "# Create a retriever\n", |
| 249 | + "retriever = vectorstore.as_retriever()" |
| 250 | + ] |
| 251 | + }, |
| 252 | + { |
| 253 | + "cell_type": "code", |
| 254 | + "execution_count": null, |
| 255 | + "id": "096e3129-8e5e-42b9-8c42-d59f072f20c5", |
| 256 | + "metadata": {}, |
| 257 | + "outputs": [], |
| 258 | + "source": [ |
| 259 | +    "### The RAG bot\n", |
| 260 | + "\n", |
| 261 | + "import openai\n", |
| 262 | + "from langsmith import traceable\n", |
| 263 | + "from langsmith.wrappers import wrap_openai\n", |
| 264 | + "\n", |
| 265 | + "class RagBot:\n", |
| 266 | + "\n", |
| 267 | + " def __init__(self, retriever, model: str = \"gpt-4-0125-preview\"):\n", |
| 268 | + " self._retriever = retriever\n", |
| 269 | +    "        # Wrap the client so LLM calls are traced\n", |
| 270 | + " self._client = wrap_openai(openai.Client())\n", |
| 271 | + " self._model = model\n", |
| 272 | + "\n", |
| 273 | + " @traceable()\n", |
| 274 | + " def retrieve_docs(self, question):\n", |
| 275 | +    "        # Call the retriever to fetch relevant documents\n", |
| 276 | + " return self._retriever.invoke(question)\n", |
| 277 | + "\n", |
| 278 | + " @traceable()\n", |
| 279 | + " def invoke_llm(self, question, docs):\n", |
| 280 | +    "        # Call the LLM to generate a response\n", |
| 281 | + " response = self._client.chat.completions.create(\n", |
| 282 | + " model=self._model,\n", |
| 283 | + " messages=[\n", |
| 284 | + " {\n", |
| 285 | + " \"role\": \"system\",\n", |
| 286 | +    "                    \"content\": \"You are a helpful AI coding assistant with expertise in LCEL. Use the following docs to answer the user's question with a concise code solution.\\n\\n\"\n", |
| 287 | + " f\"## 文档\\n\\n{docs}\",\n", |
| 288 | + " },\n", |
| 289 | + " {\"role\": \"user\", \"content\": question},\n", |
| 290 | + " ],\n", |
| 291 | + " )\n", |
| 292 | + "\n", |
| 293 | +    "        # Evaluators will expect \"answer\" and \"contexts\" keys\n", |
| 294 | + " return {\n", |
| 295 | + " \"answer\": response.choices[0].message.content,\n", |
| 296 | + " \"contexts\": [str(doc) for doc in docs],\n", |
| 297 | + " }\n", |
| 298 | + "\n", |
| 299 | + " @traceable()\n", |
| 300 | + " def get_answer(self, question: str):\n", |
| 301 | +    "        # Retrieve documents, then generate the answer\n", |
| 302 | + " docs = self.retrieve_docs(question)\n", |
| 303 | + " return self.invoke_llm(question, docs)\n", |
| 304 | + "\n", |
| 305 | +    "# Create a RagBot instance\n", |
| 306 | + "rag_bot = RagBot(retriever)" |
| 307 | + ] |
| 308 | + }, |
| 309 | + { |
| 310 | + "cell_type": "code", |
| 311 | + "execution_count": 32, |
| 312 | + "id": "431bbdb3-d4a3-445a-9cfc-2e62adff3ad0", |
| 313 | + "metadata": {}, |
| 314 | + "outputs": [ |
| 315 | + { |
| 316 | + "data": { |
| 317 | + "text/plain": [ |
| 318 | + "\"To build a RAG (Retrieval-Augmented Generation) chain in LangChain Expression Language (LCEL), you integrate components that handle retrieval (searching for relevant information from a database or document collection) and generation (creating responses based on the retrieved information). The LCEL document provided doesn't go into specifics about a RAG chain configuration, but based on the principles of LCEL, I can guide you through constructing a simplified RAG chain using hypothetical LCEL com\"" |
| 319 | + ] |
| 320 | + }, |
| 321 | + "execution_count": 32, |
| 322 | + "metadata": {}, |
| 323 | + "output_type": "execute_result" |
| 324 | + } |
| 325 | + ], |
| 326 | + "source": [ |
| 327 | + "response = rag_bot.get_answer(\"How to build a RAG chain in LCEL?\")\n", |
| 328 | + "response[\"answer\"][:500]" |
| 329 | + ] |
| 330 | + }, |
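| | +  { |
| | +   "cell_type": "markdown", |
| | +   "id": "f7a9c1d2-0000-4000-8000-0000000000a1", |
| | +   "metadata": {}, |
| | +   "source": [ |
| | +    "The `RagBot` above calls the OpenAI client directly rather than using LCEL. Below is a minimal sketch, not part of the original notebook, of how the same retrieve-then-generate flow could be expressed as an LCEL chain. It reuses the `retriever` built in the indexing cell; the prompt wording and the `gpt-4-0125-preview` model name are assumptions carried over from `RagBot`. Because every step is a Runnable, the resulting chain gets `invoke`/`stream`/`batch` for free and is traced in LangSmith when tracing is enabled." |
| | +   ] |
| | +  }, |
| | +  { |
| | +   "cell_type": "code", |
| | +   "execution_count": null, |
| | +   "id": "f7a9c1d2-0000-4000-8000-0000000000a2", |
| | +   "metadata": {}, |
| | +   "outputs": [], |
| | +   "source": [ |
| | +    "### LCEL sketch (assumed rewrite of RagBot, reusing `retriever` from above)\n", |
| | +    "\n", |
| | +    "from langchain_core.output_parsers import StrOutputParser\n", |
| | +    "from langchain_core.prompts import ChatPromptTemplate\n", |
| | +    "from langchain_core.runnables import RunnablePassthrough\n", |
| | +    "from langchain_openai import ChatOpenAI\n", |
| | +    "\n", |
| | +    "def format_docs(docs):\n", |
| | +    "    # Join the retrieved documents into a single context string\n", |
| | +    "    return \"\\n\\n\".join(doc.page_content for doc in docs)\n", |
| | +    "\n", |
| | +    "prompt = ChatPromptTemplate.from_messages(\n", |
| | +    "    [\n", |
| | +    "        (\"system\", \"You are a helpful AI coding assistant with expertise in LCEL. Use the following docs to answer the user's question with a concise code solution.\\n\\n## Docs\\n\\n{context}\"),\n", |
| | +    "        (\"user\", \"{question}\"),\n", |
| | +    "    ]\n", |
| | +    ")\n", |
| | +    "\n", |
| | +    "llm = ChatOpenAI(model=\"gpt-4-0125-preview\", temperature=0)\n", |
| | +    "\n", |
| | +    "# The retriever output feeds the prompt's {context}; the question passes through unchanged\n", |
| | +    "rag_chain = (\n", |
| | +    "    {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n", |
| | +    "    | prompt\n", |
| | +    "    | llm\n", |
| | +    "    | StrOutputParser()\n", |
| | +    ")\n", |
| | +    "\n", |
| | +    "# Example usage:\n", |
| | +    "# rag_chain.invoke(\"How to build a RAG chain in LCEL?\")" |
| | +   ] |
| | +  }, |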
| 331 | + { |
| 332 | + "cell_type": "code", |
| 333 | + "execution_count": null, |
| 334 | + "id": "1b4ca951-0ed8-41c5-adb9-694776a7a2e7", |
| 335 | + "metadata": {}, |
| 336 | + "outputs": [], |
| 337 | + "source": [] |
| 338 | + } |
| 339 | + ], |
| 340 | + "metadata": { |
| 341 | + "kernelspec": { |
| 342 | + "display_name": "Python 3 (ipykernel)", |
| 343 | + "language": "python", |
| 344 | + "name": "python3" |
| 345 | + }, |
| 346 | + "language_info": { |
| 347 | + "codemirror_mode": { |
| 348 | + "name": "ipython", |
| 349 | + "version": 3 |
| 350 | + }, |
| 351 | + "file_extension": ".py", |
| 352 | + "mimetype": "text/x-python", |
| 353 | + "name": "python", |
| 354 | + "nbconvert_exporter": "python", |
| 355 | + "pygments_lexer": "ipython3", |
| 356 | + "version": "3.10.14" |
| 357 | + } |
| 358 | + }, |
| 359 | + "nbformat": 4, |
| 360 | + "nbformat_minor": 5 |
| 361 | +} |