Commit 2bfc161

committed May 23, 2024
add langsmith demo
1 parent e3e2343 commit 2bfc161

File tree

3 files changed (+911 −0 lines)

‎langchain/langsmith/evaluation.ipynb

+361
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "40fb2b99-f188-4634-a11e-672e65752afa",
   "metadata": {},
   "source": [
    "# LangSmith Evaluation Quick Start\n",
    "\n",
    "At a high level, the evaluation process consists of the following steps:\n",
    "\n",
    "- Define the LLM application or target task.\n",
    "- Create or select a dataset to evaluate the LLM application against. Your evaluation criteria may require expected outputs in the dataset.\n",
    "- Configure evaluators to score the LLM application's outputs (typically by comparing them with the expected outputs / annotations).\n",
    "- Run the evaluation and inspect the results.\n",
    "\n",
    "This tutorial demonstrates the evaluation workflow for a very simple LLM application (a classifier) that labels input text as \"Toxic\" or \"Not toxic\"."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9c8c8225-42bd-4c9b-adeb-62c83f80c9d3",
   "metadata": {},
   "source": [
    "## 1. Define the Target Task\n",
    "\n",
    "We define a simple evaluation target: an LLM pipeline that classifies text as toxic or not toxic, with tracing enabled to capture the inputs and outputs of each step in the pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "4cb8b089-8d3c-4f56-b5d3-2929dcb49c26",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langsmith import traceable, wrappers\n",
    "from openai import Client\n",
    "\n",
    "# Wrap the OpenAI client\n",
    "openai = wrappers.wrap_openai(Client())\n",
    "\n",
    "# Mark the function as traceable\n",
    "@traceable\n",
    "def label_text(text):\n",
    "    # Build the message list: a system message and a user message\n",
    "    messages = [\n",
    "        {\n",
    "            \"role\": \"system\",\n",
    "            \"content\": \"Review the user query below and determine whether it contains any form of harmful behavior, such as insults, threats, or highly negative comments. If it does, reply 'Toxic'; if not, reply 'Not toxic'.\",\n",
    "        },\n",
    "        {\"role\": \"user\", \"content\": text},\n",
    "    ]\n",
    "\n",
    "    # Call the chat model to generate a reply\n",
    "    result = openai.chat.completions.create(\n",
    "        messages=messages, model=\"gpt-3.5-turbo\", temperature=0\n",
    "    )\n",
    "\n",
    "    # Return the model's reply content\n",
    "    return result.choices[0].message.content\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4c58435a-b044-4939-9902-21e77ec47d42",
   "metadata": {},
   "source": [
    "## 2. Create or Select an Evaluation Dataset\n",
    "\n",
    "Next, create an evaluation dataset (containing examples of toxic and non-toxic text). Each example in the dataset consists of three dictionaries/objects:\n",
    "- outputs: the reference labels or other context found in the dataset\n",
    "- inputs: the inputs to the pipeline\n",
    "- metadata: any other metadata stored with the example\n",
    "\n",
    "These dictionaries/objects can have arbitrary keys and values, but the keys must be consistent across all examples.\n",
    "\n",
    "The values in the examples can also take any form, such as strings, numbers, lists, or dictionaries, but for this tutorial we only use strings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "7b54f22e-17ae-41f2-a137-76e84fef9b49",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langsmith import Client\n",
    "\n",
    "# Create the client\n",
    "client = Client()\n",
    "\n",
    "# Create a dataset\n",
    "examples = [\n",
    "    (\"Shut up, idiot\", \"Toxic\"),  # toxic\n",
    "    (\"You're a wonderful person\", \"Not toxic\"),  # not toxic\n",
    "    (\"This is the worst thing ever\", \"Toxic\"),  # toxic\n",
    "    (\"I had a great day today\", \"Not toxic\"),  # not toxic\n",
    "    (\"Nobody likes you\", \"Toxic\"),  # toxic\n",
    "    (\"This is unacceptable. I want to speak to the manager.\", \"Not toxic\"),  # not toxic\n",
    "]\n",
    "\n",
    "# Dataset name\n",
    "dataset_name = \"Toxic Queries\"\n",
    "dataset = client.create_dataset(dataset_name=dataset_name)\n",
    "\n",
    "# Unpack inputs and outputs\n",
    "inputs, outputs = zip(\n",
    "    *[({\"text\": text}, {\"label\": label}) for text, label in examples]\n",
    ")\n",
    "\n",
    "# Create the examples and add them to the dataset\n",
    "client.create_examples(inputs=inputs, outputs=outputs, dataset_id=dataset.id)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "353b3e32-0f28-4de4-8749-03337905385f",
   "metadata": {},
   "source": [
    "## 3. Configure the Evaluator\n",
    "\n",
    "Create an evaluator that scores the model's output by comparing it with the annotations in the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "0559ea2a-082d-4836-92cd-7473711ee79a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langsmith.schemas import Example, Run\n",
    "\n",
    "# Evaluator: check whether the predicted label is correct\n",
    "def correct_label(root_run: Run, example: Example) -> dict:\n",
    "    # Check whether the root run's output matches the example's reference label\n",
    "    score = root_run.outputs.get(\"output\") == example.outputs.get(\"label\")\n",
    "    # Return a dict containing the score and the feedback key\n",
    "    return {\"score\": int(score), \"key\": \"correct_label\"}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8fe3233b-4762-48fe-bc72-3924a5bc03f6",
   "metadata": {},
   "source": [
    "## 4. Run the Evaluation and Inspect the Results\n",
    "\n",
    "Next, run the evaluation with the `evaluate` method, which accepts the following arguments:\n",
    "\n",
    "- function: takes an input dictionary or object and returns an output dictionary or object\n",
    "- data: the name or UUID of the LangSmith dataset to evaluate on, or an iterator of examples\n",
    "- evaluators: a list of evaluators used to score the function's outputs\n",
    "- experiment_prefix: a string used to prefix the experiment name. If not provided, a name is generated automatically."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "eeec0c29-5e85-46e1-915b-619b68627d63",
   "metadata": {},
   "outputs": [
    {
     "ename": "ImportError",
     "evalue": "cannot import name 'evaluate' from 'langsmith.evaluation' (/home/ubuntu/miniconda3/envs/langchain/lib/python3.10/site-packages/langsmith/evaluation/__init__.py)",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mImportError\u001b[0m Traceback (most recent call last)",
      "Cell \u001b[0;32mIn[20], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mlangsmith\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mevaluation\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m evaluate\n\u001b[1;32m 3\u001b[0m \u001b[38;5;66;03m# 数据集名称\u001b[39;00m\n\u001b[1;32m 4\u001b[0m dataset_name \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mToxic Queries\u001b[39m\u001b[38;5;124m\"\u001b[39m\n",
      "\u001b[0;31mImportError\u001b[0m: cannot import name 'evaluate' from 'langsmith.evaluation' (/home/ubuntu/miniconda3/envs/langchain/lib/python3.10/site-packages/langsmith/evaluation/__init__.py)"
     ]
    }
   ],
   "source": [
    "from langsmith.evaluation import evaluate\n",
    "\n",
    "# Dataset name\n",
    "dataset_name = \"Toxic Queries\"\n",
    "\n",
    "# evaluator = StringEvaluator(evaluation_name=\"toxic_judge\", grading_function=correct_label)\n",
    "\n",
    "# Run the evaluation\n",
    "results = evaluate(\n",
    "    # Process each input with the label_text function\n",
    "    lambda inputs: label_text(inputs[\"text\"]),\n",
    "    data=dataset_name,  # dataset name\n",
    "    evaluators=[correct_label],  # score with the correct_label evaluator\n",
    "    experiment_prefix=\"Toxic Queries\",  # experiment name prefix\n",
    "    description=\"Testing the baseline system.\",  # optional description\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bfea48d3-c461-4576-9efe-3ae4af6bd084",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3432c5d1-6e8f-4c42-b18c-a566645c4f40",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "e0f8ea90-1b5f-4761-8f2d-9e19d3b61e15",
   "metadata": {},
   "source": [
    "## Rewriting the RAG Bot with LCEL"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "46817304-1e17-4ca1-a5ba-faebd80c3728",
   "metadata": {},
   "outputs": [],
   "source": [
    "### Indexing\n",
    "\n",
    "from bs4 import BeautifulSoup as Soup\n",
    "from langchain_community.vectorstores import Chroma\n",
    "from langchain_openai import OpenAIEmbeddings\n",
    "from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader\n",
    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
    "\n",
    "# Load the documents\n",
    "url = \"https://python.langchain.com/v0.1/docs/expression_language/\"\n",
    "loader = RecursiveUrlLoader(\n",
    "    url=url, max_depth=20, extractor=lambda x: Soup(x, \"html.parser\").text\n",
    ")\n",
    "docs = loader.load()\n",
    "\n",
    "# Split the documents into chunks\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=4500, chunk_overlap=200)\n",
    "splits = text_splitter.split_documents(docs)\n",
    "\n",
    "# Embed the chunks and store them in Chroma\n",
    "vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())\n",
    "\n",
    "# Create the retriever\n",
    "retriever = vectorstore.as_retriever()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "096e3129-8e5e-42b9-8c42-d59f072f20c5",
   "metadata": {},
   "outputs": [],
   "source": [
    "### The RAG bot\n",
    "\n",
    "import openai\n",
    "from langsmith import traceable\n",
    "from langsmith.wrappers import wrap_openai\n",
    "\n",
    "class RagBot:\n",
    "\n",
    "    def __init__(self, retriever, model: str = \"gpt-4-0125-preview\"):\n",
    "        self._retriever = retriever\n",
    "        # Wrap the client to trace LLM calls\n",
    "        self._client = wrap_openai(openai.Client())\n",
    "        self._model = model\n",
    "\n",
    "    @traceable()\n",
    "    def retrieve_docs(self, question):\n",
    "        # Call the retriever to fetch relevant documents\n",
    "        return self._retriever.invoke(question)\n",
    "\n",
    "    @traceable()\n",
    "    def invoke_llm(self, question, docs):\n",
    "        # Call the LLM to generate a reply\n",
    "        response = self._client.chat.completions.create(\n",
    "            model=self._model,\n",
    "            messages=[\n",
    "                {\n",
    "                    \"role\": \"system\",\n",
    "                    \"content\": \"You are a helpful AI coding assistant with expertise in LCEL. Use the following documents to answer the user's question with a concise code solution.\\n\\n\"\n",
    "                    f\"## Docs\\n\\n{docs}\",\n",
    "                },\n",
    "                {\"role\": \"user\", \"content\": question},\n",
    "            ],\n",
    "        )\n",
    "\n",
    "        # Evaluators will expect \"answer\" and \"contexts\"\n",
    "        return {\n",
    "            \"answer\": response.choices[0].message.content,\n",
    "            \"contexts\": [str(doc) for doc in docs],\n",
    "        }\n",
    "\n",
    "    @traceable()\n",
    "    def get_answer(self, question: str):\n",
    "        # Retrieve documents, then generate the answer\n",
    "        docs = self.retrieve_docs(question)\n",
    "        return self.invoke_llm(question, docs)\n",
    "\n",
    "# Create a RagBot instance\n",
    "rag_bot = RagBot(retriever)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "431bbdb3-d4a3-445a-9cfc-2e62adff3ad0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"To build a RAG (Retrieval-Augmented Generation) chain in LangChain Expression Language (LCEL), you integrate components that handle retrieval (searching for relevant information from a database or document collection) and generation (creating responses based on the retrieved information). The LCEL document provided doesn't go into specifics about a RAG chain configuration, but based on the principles of LCEL, I can guide you through constructing a simplified RAG chain using hypothetical LCEL com\""
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "response = rag_bot.get_answer(\"How to build a RAG chain in LCEL?\")\n",
    "response[\"answer\"][:500]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1b4ca951-0ed8-41c5-adb9-694776a7a2e7",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
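The dataset-creation cell in the notebook turns `(text, label)` tuples into parallel `inputs`/`outputs` collections with `zip(*...)`. That unpacking step can be checked in plain Python, with no LangSmith client involved (the two example pairs below are taken from the notebook's dataset):

```python
# Example (text, label) pairs, as in the notebook's dataset cell
examples = [
    ("Shut up, idiot", "Toxic"),
    ("You're a wonderful person", "Not toxic"),
]

# Each pair becomes an inputs dict and an outputs dict; zip(*...) then splits
# the list of (inputs, outputs) tuples into two parallel tuples.
inputs, outputs = zip(
    *[({"text": text}, {"label": label}) for text, label in examples]
)

print(inputs)   # ({'text': 'Shut up, idiot'}, {'text': "You're a wonderful person"})
print(outputs)  # ({'label': 'Toxic'}, {'label': 'Not toxic'})
```

Because the keys (`text`, `label`) are identical across examples, the resulting `inputs` and `outputs` satisfy LangSmith's requirement that keys be consistent across the dataset.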

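The `correct_label` evaluator's scoring rule can also be exercised offline. The sketch below is an assumption-laden stand-in: it uses `SimpleNamespace` objects in place of LangSmith's `Run` and `Example` schemas, relying only on the fact that both expose an `outputs` mapping, as the notebook's evaluator does:

```python
from types import SimpleNamespace

def correct_label(root_run, example) -> dict:
    # Score 1 if the run's "output" matches the example's reference "label"
    score = root_run.outputs.get("output") == example.outputs.get("label")
    return {"score": int(score), "key": "correct_label"}

# Hypothetical stand-ins for langsmith.schemas.Run / Example, for local testing only
run = SimpleNamespace(outputs={"output": "Toxic"})
good = SimpleNamespace(outputs={"label": "Toxic"})
bad = SimpleNamespace(outputs={"label": "Not toxic"})

print(correct_label(run, good))  # {'score': 1, 'key': 'correct_label'}
print(correct_label(run, bad))   # {'score': 0, 'key': 'correct_label'}
```

The returned `key` is what LangSmith displays as the feedback metric name in the experiment view, and the integer `score` makes correct/incorrect averages straightforward to aggregate.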
0 commit comments
