title | description |
---|---|
Tracing | How to capture LLM inputs and outputs and evaluate them. |
import CloudSignup from '/snippets/cloud_signup.mdx';
import CreateProject from '/snippets/create_project.mdx';
This tutorial shows how to set up tracing for an LLM app, collect its inputs and outputs, view them in Evidently Cloud, and optionally run evaluations. You will use the following tools:
- **Tracely**: an open-source tracing library based on OpenTelemetry.
- **Evidently**: an open-source library to run LLM evaluations and interact with Evidently Cloud.
- **Evidently Cloud**: a web platform to view traces and run evaluations.
- **OpenAI**: used to simulate an LLM application.
Install the necessary libraries:
```python
! pip install evidently[llm]
! pip install tracely
! pip install openai
```
Import the required modules:
```python
import os
import openai
import time
import uuid

from tracely import init_tracing
from tracely import trace_event
from tracely import create_trace_event

from evidently.ui.workspace.cloud import CloudWorkspace
```
Optional: to load the traced dataset back to Python and run evals, also import:
```python
import pandas as pd

from evidently.future.datasets import Dataset
from evidently.future.datasets import DataDefinition
from evidently.future.datasets import Descriptor
from evidently.future.descriptors import *
from evidently.future.report import Report
from evidently.future.presets import TextEvals
from evidently.future.metrics import *
from evidently.future.tests import *
```
Set the OpenAI API key as an environment variable. You can create a key on the OpenAI API keys page; see the OpenAI docs.
os.environ["OPENAI_API_KEY"] = "YOUR_KEY"
Set up and initialize tracing:
```python
project_id = str(project.id)

init_tracing(
    address="https://app.evidently.cloud/",
    api_key="YOUR_API_TOKEN",
    project_id=project_id,
    export_name="TRACING_DATASET"
)
```
- `address`: the destination backend that stores the collected traces.
- `project_id`: the ID of the Evidently Project you just created. Go to the Home page, open the Project, and copy its ID from above the dashboard.
- `export_name`: the name of the resulting tracing dataset. All data sent with the same export name is grouped into a single dataset.
Let's create and trace a simple function that sends a list of questions to the LLM.
Initialize the OpenAI client with the API key:
```python
client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
```
Define the list of questions to answer:
```python
question_list = [
    "What is Evidently Python library?",
    "What is LLM observability?",
    "How is MLOps different from LLMOps?",
    "What is an LLM prompt?",
    "Why should you care about LLM safety?"
]
```
Instruct the assistant to answer questions, and use `create_trace_event` from Tracely to trace the execution of the function, treating each question as a separate session. The loop below goes through the list of questions, captures input arguments and outputs, and sends the data to Evidently Cloud:
```python
def qa_assistant(question):
    system_prompt = "You are a helpful assistant. Please answer the following question in one sentence."
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]
    return client.chat.completions.create(model="gpt-4o-mini", messages=messages).choices[0].message.content

# Iterate over the list of questions and pass each to the assistant
for question in question_list:
    session_id = str(uuid.uuid4())
    # Each question gets its own trace event and session ID
    with create_trace_event("qa", session_id=session_id) as event:
        response = qa_assistant(question=question)
        event.set_attribute("question", question)
        event.set_attribute("response", response)
        time.sleep(1)
```
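As an alternative to the context manager, the `trace_event` decorator imported above can wrap the function itself so that Tracely records its inputs and output automatically. A minimal sketch, assuming the default decorator behavior; the resulting attribute names will differ from the manual example above:

```python
@trace_event()  # records the function arguments and return value on the span
def qa_assistant_traced(question):
    system_prompt = "You are a helpful assistant. Please answer the following question in one sentence."
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]
    return client.chat.completions.create(model="gpt-4o-mini", messages=messages).choices[0].message.content

# Each call produces its own trace event
for question in question_list:
    qa_assistant_traced(question)
```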
Go to Evidently Cloud, open your Project, and navigate to "Traces" in the left menu. Open the traces you just sent. It might take a few moments until all the data appears.
You can now view, sort, export, and work with the traced dataset. You can switch between the Traces, Dataset, and Dialog views (select a session to see the dialog).
You can run evaluations on this dataset both in the Cloud and locally. For local evaluations, first load the dataset to your Python environment:
```python
traced_data = ws.load_dataset(dataset_id="YOUR_DATASET_ID")
# traced_data.head()
```
You can copy the dataset ID from the Traces page inside your Project.
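The loaded data converts to a pandas DataFrame, so you can optionally take a quick look at what came back. The `qa.question` and `qa.response` column names follow from the event name and attributes set above; the exact set of system columns may vary:

```python
df = pd.DataFrame(traced_data)

# Attribute columns are named "<event_name>.<attribute>", e.g. "qa.question" and "qa.response"
print(df.columns.tolist())
df.head()
```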
To run an evaluation, create an Evidently Dataset object and choose the descriptors:
```python
eval_dataset = Dataset.from_pandas(
    pd.DataFrame(traced_data),
    data_definition=DataDefinition(),
    descriptors=[
        SentenceCount("qa.response", alias="SentenceCount"),
        TextLength("qa.response", alias="Length"),
        Sentiment("qa.response", alias="Sentiment"),
    ])
```
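Optionally, preview the computed descriptor values locally before building the Report, assuming your Evidently version exposes the `as_dataframe()` helper on `Dataset`:

```python
# Each descriptor adds a new column (SentenceCount, Length, Sentiment) next to the original data
eval_dataset.as_dataframe().head()
```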
Now you can summarize the results and add conditional checks. This explicitly tests that every response is a single sentence and that each one is under 300 characters long.
```python
report = Report([
    TextEvals(),
    MaxValue(column="SentenceCount", tests=[eq(1)]),
    MaxValue(column="Length", tests=[lte(300)]),
])

my_eval = report.run(eval_dataset, None)
```
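If you are working in a notebook, you can optionally render the evaluation results inline before uploading them:

```python
# Displays the descriptor summaries and test results in the notebook cell
my_eval
```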
To upload the results to your Project:
```python
ws.add_run(project.id, my_eval, include_data=True)
```
You can go to your Project and open the Report.
Check the tutorial on LLM evaluations for more details, including how to run other evaluation methods, such as LLM-as-a-judge, and how to test for specific conditions.
Need help? Ask in our Discord community.