
Commit a14402e

ncybul and Yun-Kim authored
feat(llmobs): [MLOB-4258] add support for OpenAI server-side MCP calls (#15057)
## Description

This PR adds support for server-side MCP calls made via the OpenAI Responses API. In the Responses API, LLMs can invoke MCP tools on behalf of the client: the model asks the provided MCP server to list its available tools and then calls the relevant tool. Previously, our support for these interactions was poor: we did not capture any tool calls, tool results, or tool spans. This PR improves that by:

1. Capturing the `McpCall` output item and parsing it into a Tool Call and a Tool Result on the currently active LLM span.
2. Generating a Tool span to represent the server-side tool invocation. _(Note that this tool span is a child of the active LLM span, since it technically happens within the LLM operation.)_
3. Adding any tools returned from the MCP server to the LLM's available tools field.

## Manual Testing

I manually tested my changes with the following script:

```python
import asyncio

from agents import set_default_openai_client
from agents.tracing import set_tracing_disabled
from dd_internal_authentication.client import JWTDDToolAuthClientTokenManager
from openai import AsyncOpenAI

from ddtrace.llmobs import LLMObs

LLMObs.enable(ml_app="nicole-test", site="datadoghq.com")
set_tracing_disabled(True)

token = JWTDDToolAuthClientTokenManager.instance(
    name="rapid-ai-platform", datacenter="us1.staging.dog"
).get_token("rapid-ai-platform")

ai_gateway_client = AsyncOpenAI(
    base_url="https://ai-gateway.us1.staging.dog/v1",
    default_headers={"source": "nicole-test", "org-id": "2"},
    api_key=token,
)
set_default_openai_client(ai_gateway_client)


async def main():
    resp = await ai_gateway_client.responses.create(
        model="gpt-5",
        tools=[
            {
                "type": "mcp",
                "server_label": "dice_roller",
                "server_description": "Public dice-roller MCP server for testing.",
                "server_url": "https://dice-rolling-mcp.vercel.app/mcp",  # or use the FastMCP URL
                "require_approval": "never",
            },
        ],
        input="Roll 2d4+1",
    )
    print(resp)


if __name__ == "__main__":
    asyncio.run(main())
```

### Before

Before, our experience for these server-side MCP use cases was very poor. Running the script, I get this [trace](https://app.datadoghq.com/llm/traces?query=%40ml_app%3Anicole-test%20%40event_type%3Aspan%20%40parent_id%3Aundefined&agg_m=count&agg_m_source=base&agg_t=count&fromUser=false&sp=%5B%7B%22p%22%3A%7B%22eventId%22%3A%22AwAAAZo7t33VDIDW2gAAABhBWm83dDMzVkFBRHU0Q2I1ZDNYT0FBQUEAAAAkZjE5YTNiYmEtMGM3NC00NzVlLWFlZDQtY2ExZmVmYWU2ZDRkAABNvg%22%7D%2C%22i%22%3A%22llm-obs-panel%22%7D%5D&spanId=2741126858136455689&start=1761936196353&end=1761939796353&paused=false), which does not parse the server-side MCP calls:

<img width="2668" height="1304" alt="image" src="https://github.com/user-attachments/assets/51426be1-d670-45ad-b2f2-d3dab8a85f52" />

### After

With the changes in this PR, the [trace](https://app.datadoghq.com/llm/traces?query=%40ml_app%3Anicole-test%20%40event_type%3Aspan%20%40parent_id%3Aundefined&agg_m=count&agg_m_source=base&agg_t=count&fromUser=true&sp=%5B%7B%22p%22%3A%7B%22eventId%22%3A%22AwAAAZqSG1GZuwlyswAAABhBWnFTRzFHWkFBQy13SDVmUXVhNkFBQUEAAAAkZjE5YTkyMWMtODg0Mi00M2YwLWJlYmQtNjM5MTdiNjkzNTI0AAAckw%22%7D%2C%22i%22%3A%22llm-obs-panel%22%7D%5D&spanId=11798342076467341016&start=1763387074443&end=1763387974443&paused=false) looks much cleaner and correctly parses all the information related to the MCP usage.

#### Tool Calls and Tool Results are highlighted:

<img width="1988" height="442" alt="image" src="https://github.com/user-attachments/assets/8db2cdbf-425a-426b-ab85-3e5d13dd439e" />

#### Available Tools from the MCP server are captured:

<img width="2030" height="882" alt="image" src="https://github.com/user-attachments/assets/e7153191-b852-484a-817d-b5d24c687c2b" />

#### A separate tool span is emitted:

<img width="1376" height="612" alt="image" src="https://github.com/user-attachments/assets/7ceaa977-61ff-4917-af53-6a97babf86e5" />

I also tried this out with more than one tool call in this [trace](https://app.datadoghq.com/llm/traces?query=%40ml_app%3Anicole-test%20%40event_type%3Aspan%20%40parent_id%3Aundefined&agg_m=count&agg_m_source=base&agg_t=count&fromUser=false&sp=%5B%7B%22p%22%3A%7B%22eventId%22%3A%22AwAAAZqSHLq7ndIu6wAAABhBWnFTSExxN0FBRG1Md0RzUlVWbUFBQUEAAAAkZjE5YTkyMWMtZjcxMi00ZGU5LTg4ODItN2Q5NTdlNGU2MTliAAACpA%22%7D%2C%22i%22%3A%22llm-obs-panel%22%7D%5D&spanId=4722629383297745650&start=1763387152732&end=1763388052732&paused=false).

---------

Co-authored-by: Yun Kim <[email protected]>
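For context, a server-side MCP call surfaces in the Responses API output as an `mcp_call` item carrying `id`, `name`, `arguments`, and `output` fields (the fields this PR reads). The sketch below uses a plain dict with made-up values to show how one such item maps to both a Tool Call and a Tool Result on the active LLM span:

```python
# Illustrative "mcp_call" output item; field values are invented,
# but the field names match what the integration reads.
mcp_call_item = {
    "type": "mcp_call",
    "id": "mcp_12345",
    "name": "roll_dice",
    "arguments": '{"notation": "2d4+1"}',
    "output": "Rolled 2d4+1: total 7",
}

# One mcp_call item yields both a tool call (the request side)...
tool_call = {
    "tool_id": mcp_call_item["id"],
    "name": mcp_call_item["name"],
    "arguments": mcp_call_item["arguments"],
    "type": "mcp_call",
}

# ...and a tool result (the response side), sharing the same tool_id
# so the two can be linked in the trace view.
tool_result = {
    "tool_id": mcp_call_item["id"],
    "name": mcp_call_item["name"],
    "result": mcp_call_item["output"],
    "type": "mcp_tool_result",
}
```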
1 parent 9eb932e commit a14402e

File tree

7 files changed: +346 −21 lines changed

ddtrace/contrib/internal/openai/_endpoint_hooks.py

Lines changed: 33 additions & 0 deletions

```diff
@@ -7,7 +7,10 @@
 from ddtrace.contrib.internal.openai.utils import _loop_handler
 from ddtrace.contrib.internal.openai.utils import _process_finished_stream
 from ddtrace.internal.utils.version import parse_version
+from ddtrace.llmobs._constants import OAI_HANDOFF_TOOL_ARG
 from ddtrace.llmobs._integrations.base_stream_handler import make_traced_stream
+from ddtrace.llmobs._utils import _get_attr
+from ddtrace.llmobs._utils import safe_load_json


 API_VERSION = "v1"
@@ -520,6 +523,7 @@ class _ResponseHook(_BaseCompletionHook):

     def _record_response(self, pin, integration, span, args, kwargs, resp, error):
         resp = super()._record_response(pin, integration, span, args, kwargs, resp, error)
+        self._trace_mcp_tool_usage(pin, integration, resp)
         if not resp:
             integration.llmobs_set_tags(span, args=[], kwargs=kwargs, response=resp, operation="response")
             return resp
@@ -528,6 +532,35 @@ def _record_response(self, pin, integration, span, args, kwargs, resp, error):
         integration.llmobs_set_tags(span, args=[], kwargs=kwargs, response=resp, operation="response")
         return resp

+    def _trace_mcp_tool_usage(self, pin, integration, resp):
+        """Detect and trace server-side MCP tool usage in the response."""
+        if not resp:
+            return
+
+        messages = _get_attr(resp, "output", [])
+
+        if messages and isinstance(messages, list):
+            for item in messages:
+                message_type = _get_attr(item, "type", "")
+                if message_type == "mcp_call":
+                    self._create_mcp_tool_span(item, integration, pin)
+
+    def _create_mcp_tool_span(self, item, integration, pin):
+        """Creates and submits a tool span to LLMObs to represent a server-side MCP tool call."""
+        with integration.trace(pin, "client_tool_call", submit_to_llmobs=True, kind="tool") as span:
+            tool_id = str(_get_attr(item, "id", ""))
+            tool_name = str(_get_attr(item, "name", ""))
+            raw_arguments = _get_attr(item, "arguments", OAI_HANDOFF_TOOL_ARG)
+            tool_arguments = safe_load_json(str(raw_arguments))
+            tool_output = str(_get_attr(item, "output", ""))
+            integration.llmobs_set_tags(
+                span,
+                args=[],
+                kwargs={"name": tool_name, "arguments": tool_arguments, "tool_id": tool_id},
+                response=tool_output,
+                operation="tool",
+            )


 class _ResponseParseHook(_ResponseHook):
     OPERATION_ID = "parseResponse"
```
ddtrace/llmobs/_integrations/openai.py

Lines changed: 34 additions & 2 deletions

```diff
@@ -10,10 +10,12 @@
 from ddtrace.llmobs._constants import CACHE_READ_INPUT_TOKENS_METRIC_KEY
 from ddtrace.llmobs._constants import INPUT_DOCUMENTS
 from ddtrace.llmobs._constants import INPUT_TOKENS_METRIC_KEY
+from ddtrace.llmobs._constants import INPUT_VALUE
 from ddtrace.llmobs._constants import METADATA
 from ddtrace.llmobs._constants import METRICS
 from ddtrace.llmobs._constants import MODEL_NAME
 from ddtrace.llmobs._constants import MODEL_PROVIDER
+from ddtrace.llmobs._constants import NAME
 from ddtrace.llmobs._constants import OUTPUT_TOKENS_METRIC_KEY
 from ddtrace.llmobs._constants import OUTPUT_VALUE
 from ddtrace.llmobs._constants import PROXY_REQUEST
@@ -27,12 +29,15 @@
 from ddtrace.llmobs._integrations.utils import openai_set_meta_tags_from_response
 from ddtrace.llmobs._integrations.utils import update_proxy_workflow_input_output_value
 from ddtrace.llmobs._utils import _get_attr
+from ddtrace.llmobs._utils import safe_json
 from ddtrace.llmobs.types import Document
 from ddtrace.trace import Span


 log = get_logger(__name__)

+OPENAI_LLM_OPERATIONS = ("completion", "chat", "response")
+

 class OpenAIIntegration(BaseLLMIntegration):
     _integration_name = "openai"
@@ -105,7 +110,11 @@ def _llmobs_set_tags(
     ) -> None:
         """Sets meta tags and metrics for span events to be sent to LLMObs."""
         span_kind = (
-            "workflow" if span._get_ctx_item(PROXY_REQUEST) else "embedding" if operation == "embedding" else "llm"
+            "workflow"
+            if span._get_ctx_item(PROXY_REQUEST)
+            else "llm"
+            if operation in OPENAI_LLM_OPERATIONS
+            else operation
         )
         model_name = span.get_tag("openai.response.model") or span.get_tag("openai.request.model")
@@ -121,7 +130,9 @@
         elif operation == "embedding":
             self._llmobs_set_meta_tags_from_embedding(span, kwargs, response)
         elif operation == "response":
-            openai_set_meta_tags_from_response(span, kwargs, response)
+            openai_set_meta_tags_from_response(span, kwargs, response, self)
+        elif operation == "tool":
+            self._llmobs_set_tags_from_tool(span, kwargs, response)
         update_proxy_workflow_input_output_value(span, span_kind)
         metrics = self._extract_llmobs_metrics_tags(span, response, span_kind, kwargs)
         span._set_ctx_items(
@@ -153,6 +164,27 @@ def _llmobs_set_meta_tags_from_embedding(span: Span, kwargs: Dict[str, Any], res
             return
         span._set_ctx_item(OUTPUT_VALUE, "[{} embedding(s) returned]".format(len(resp.data)))

+    @staticmethod
+    def _llmobs_set_tags_from_tool(span: Span, kwargs: Dict[str, Any], response: Any) -> None:
+        """Extract tool name, arguments, and output from the request and response to be submitted to LLMObs."""
+        tool_id = kwargs.get("tool_id", "unknown_tool_id")
+        tool_name = kwargs.get("name", "unknown_tool")
+        tool_arguments = kwargs.get("arguments")
+        tool_output = response
+
+        span_name = "MCP Client Tool Call: {}".format(tool_name)
+        span.name = span_name
+
+        span._set_ctx_items(
+            {
+                SPAN_KIND: "tool",
+                NAME: span_name,
+                INPUT_VALUE: safe_json(tool_arguments) if tool_arguments is not None else "",
+                OUTPUT_VALUE: safe_json(tool_output) if tool_output is not None else "",
+                METADATA: {"tool_id": tool_id},
+            }
+        )
+
     @staticmethod
     def _extract_llmobs_metrics_tags(
         span: Span, resp: Any, span_kind: str, kwargs: Dict[str, Any]
```
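The reworked `span_kind` expression generalizes the old two-way ternary: proxy requests stay workflows, the three LLM operations map to `"llm"`, and any other operation (such as `"embedding"` or the new `"tool"`) passes through as its own kind. A standalone sketch of that selection logic:

```python
# Mirrors the span_kind conditional from the diff above, with the
# span/ctx lookup replaced by a plain boolean for illustration.
OPENAI_LLM_OPERATIONS = ("completion", "chat", "response")


def resolve_span_kind(is_proxy_request: bool, operation: str) -> str:
    return (
        "workflow"
        if is_proxy_request
        else "llm"
        if operation in OPENAI_LLM_OPERATIONS
        else operation
    )
```

This is why the `operation == "embedding"` special case could be dropped: non-LLM operations now flow through unchanged, so adding `"tool"` required no extra branch in the kind selection.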

ddtrace/llmobs/_integrations/openai_agents.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -231,7 +231,7 @@ def _llmobs_set_response_attributes(self, span: Span, oai_span: OaiSpanAdapter)
             span._set_ctx_item(INPUT_MESSAGES, messages)

         if oai_span.response and oai_span.response.output:
-            messages, tool_call_outputs = oai_span.llmobs_output_messages()
+            messages, tool_call_outputs, _ = oai_span.llmobs_output_messages()

             for tool_call_output in tool_call_outputs:
                 core.dispatch(
```

ddtrace/llmobs/_integrations/utils.py

Lines changed: 59 additions & 18 deletions

```diff
@@ -628,32 +628,38 @@ def _openai_parse_input_response_messages(
     return processed, tool_call_ids


-def openai_get_output_messages_from_response(response: Optional[Any]) -> List[Message]:
+def openai_get_output_messages_from_response(
+    response: Optional[Any], integration: Any = None
+) -> Tuple[List[Message], List[ToolDefinition]]:
     """
-    Parses the output to openai responses api into a list of output messages
+    Parses the output to openai responses api into a list of output messages and a list of
+    MCP tool definitions returned from the MCP server.

     Args:
         response: An OpenAI response object or dictionary containing output messages

     Returns:
         - A list of processed messages
+        - A list of MCP tool definitions
     """
     if not response:
-        return []
+        return [], []

     messages = _get_attr(response, "output", [])
     if not messages:
-        return []
+        return [], []

-    processed_messages, _ = _openai_parse_output_response_messages(messages)
+    processed_messages, _, mcp_tool_definitions = _openai_parse_output_response_messages(messages, integration)

-    return processed_messages
+    return processed_messages, mcp_tool_definitions


-def _openai_parse_output_response_messages(messages: List[Any]) -> Tuple[List[Message], List[ToolCall]]:
+def _openai_parse_output_response_messages(
+    messages: List[Any], integration: Any = None
+) -> Tuple[List[Message], List[ToolCall], List[ToolDefinition]]:
     """
     Parses output messages from the openai responses api into a list of processed messages
-    and a list of tool call outputs.
+    and a list of tool call outputs and a list of MCP tool definitions.

     Args:
         messages: A list of output messages
@@ -664,6 +670,7 @@ def _openai_parse_output_response_messages(messages: List[Any]) -> Tuple[List[Me
     """
     processed: List[Message] = []
     tool_call_outputs: List[ToolCall] = []
+    mcp_tool_definitions: List[ToolDefinition] = []

     for item in messages:
         message: Message = Message()
@@ -707,12 +714,41 @@ def _openai_parse_output_response_messages(messages: List[Any]) -> Tuple[List[Me
                     "role": "assistant",
                 }
             )
+        elif message_type == "mcp_call":
+            call_id = str(_get_attr(item, "id", ""))
+            name = str(_get_attr(item, "name", ""))
+            raw_arguments = _get_attr(item, "arguments", OAI_HANDOFF_TOOL_ARG)
+            arguments = safe_load_json(str(raw_arguments))
+            output = str(_get_attr(item, "output", ""))
+            tool_call_info = ToolCall(
+                tool_id=call_id,
+                arguments=arguments,
+                name=name,
+                type=str(message_type),
+            )
+            tool_call_outputs.append(tool_call_info)
+            tool_result_info = ToolResult(
+                name=name,
+                result=output,
+                tool_id=call_id,
+                type="mcp_tool_result",
+            )
+            message.update(
+                {
+                    "tool_calls": [tool_call_info],
+                    "tool_results": [tool_result_info],
+                    "role": "assistant",
+                }
+            )
+        elif message_type == "mcp_list_tools":
+            mcp_tool_definitions.extend(_openai_get_tool_definitions(_get_attr(item, "tools", [])))
+            continue
         else:
             message.update({"content": str(item), "role": "assistant"})

         processed.append(message)

-    return processed, tool_call_outputs
+    return processed, tool_call_outputs, mcp_tool_definitions


 def openai_get_metadata_from_response(
@@ -802,7 +838,9 @@ def _extract_chat_template_from_instructions(
     return chat_template


-def openai_set_meta_tags_from_response(span: Span, kwargs: Dict[str, Any], response: Optional[Any]) -> None:
+def openai_set_meta_tags_from_response(
+    span: Span, kwargs: Dict[str, Any], response: Optional[Any], integration: Any = None
+) -> None:
     """Extract input/output tags from response and set them as temporary "_ml_obs.meta.*" tags."""
     input_data = kwargs.get("input", [])
@@ -851,11 +889,11 @@ def openai_set_meta_tags_from_response(span: Span, kwargs: Dict[str, Any], respo
     metadata = span._get_ctx_item(METADATA) or {}
     metadata.update(openai_get_metadata_from_response(response))
     span._set_ctx_item(METADATA, metadata)
-    output_messages: List[Message] = openai_get_output_messages_from_response(response)
+    output_messages, mcp_tool_definitions = openai_get_output_messages_from_response(response, integration)
     span._set_ctx_item(OUTPUT_MESSAGES, output_messages)
     tools = _openai_get_tool_definitions(kwargs.get("tools") or [])
-    if tools:
-        span._set_ctx_item(TOOL_DEFINITIONS, tools)
+    if mcp_tool_definitions or tools:
+        span._set_ctx_item(TOOL_DEFINITIONS, tools + mcp_tool_definitions)


 def _openai_get_tool_definitions(tools: List[Any]) -> List[ToolDefinition]:
@@ -878,12 +916,14 @@ def _openai_get_tool_definitions(tools: List[Any]) -> List[ToolDefinition]:
                 schema=_get_attr(custom_tool, "format", {}),  # format is a dict
             )
         # chat API function access and response API tool access
-        # only handles FunctionToolParam and CustomToolParam for response API for now
+        # only handles FunctionToolParam, CustomToolParam and McpListToolsTool for response API for now
         else:
             tool_definition = ToolDefinition(
                 name=str(_get_attr(tool, "name", "")),
                 description=str(_get_attr(tool, "description", "")),
-                schema=_get_attr(tool, "parameters", {}) or _get_attr(tool, "format", {}),
+                schema=_get_attr(tool, "parameters", {})
+                or _get_attr(tool, "format", {})
+                or _get_attr(tool, "input_schema", {}),
             )
         if not any(tool_definition.values()):
             continue
@@ -1198,19 +1238,20 @@ def llmobs_input_messages(self) -> Tuple[List[Message], List[str]]:
         """
         return _openai_parse_input_response_messages(self.input, self.response_system_instructions)

-    def llmobs_output_messages(self) -> Tuple[List[Message], List[ToolCall]]:
+    def llmobs_output_messages(self) -> Tuple[List[Message], List[ToolCall], List[ToolDefinition]]:
         """Returns processed output messages for LLM Obs LLM spans.

         Returns:
             - A list of processed messages
             - A list of tool calls for span linking purposes
+            - A list of MCP tool definitions
         """
         if not self.response or not self.response.output:
-            return [], []
+            return [], [], []

         messages: List[Any] = self.response.output
         if not messages:
-            return [], []
+            return [], [], []

         if not isinstance(messages, list):
             messages = [messages]
```
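The parser above now routes three item kinds differently: `mcp_call` items become a message carrying both a tool call and a tool result, `mcp_list_tools` items contribute only tool definitions (no message is emitted, hence the `continue`), and everything else falls through to a plain assistant message. A simplified stand-in for that dispatch, operating on dicts instead of the real `ToolCall`/`ToolResult`/`ToolDefinition` types:

```python
def parse_output_items(items):
    # Simplified sketch of _openai_parse_output_response_messages:
    # split Responses API output items into messages, MCP tool calls,
    # and tool definitions advertised by the MCP server.
    messages, tool_calls, tool_defs = [], [], []
    for item in items:
        kind = item.get("type", "")
        if kind == "mcp_call":
            call = {"tool_id": item.get("id", ""), "name": item.get("name", "")}
            tool_calls.append(call)
            messages.append({"role": "assistant", "tool_calls": [call]})
        elif kind == "mcp_list_tools":
            # Tool listings only contribute definitions; no message is emitted.
            tool_defs.extend(item.get("tools", []))
        else:
            messages.append({"role": "assistant", "content": str(item)})
    return messages, tool_calls, tool_defs


msgs, calls, defs = parse_output_items(
    [
        {"type": "mcp_list_tools", "tools": [{"name": "roll"}]},
        {"type": "mcp_call", "id": "c1", "name": "roll"},
    ]
)
```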
Lines changed: 4 additions & 0 deletions

```diff
@@ -0,0 +1,4 @@
+features:
+  - |
+    openai, LLM Observability: This introduces support for capturing server-side MCP tool calls invoked via the OpenAI Responses API as a separate span.
```
4+
