
RFC: Client / Server Content capabilities #223

Open
wants to merge 13 commits into
base: main

evalstate
Contributor

@evalstate evalstate commented Mar 24, 2025

This PR adds a new contentTypes capability to the ClientCapabilities and a generatesHint Tool Annotation, allowing clients to advertise which MIME types they can render to Users and tokenize for LLM consumption. It also allows Tools to advertise the content types they may generate in a CallToolResult.

This enhancement works with the existing annotations system to optionally enable MCP Servers to adapt their content delivery to best match Host capabilities.

Motivation and Context

Different Host application/LLM pairs have different content handling requirements and capabilities (e.g. Chat Applications, IDEs, Video/Content Editing Suite, Agentic Applications).

This addition allows MCP Servers to make informed decisions to:

  • Select optimal formats for content
  • Use audience annotations more effectively
  • Gracefully enhance or degrade based on Host needs/preferences.

Update 2025-04-19:
The addition also enhances interoperability for implementors of the A2A protocol, which defines input and output modes for Agents. See AgentCard here and AgentSkill here.

How Has This Been Tested?

The extension has not been directly tested; however, some example scenarios are:

  • A Host application that can render but not tokenize audio/video can receive adapted Tool Results and Text Content for the LLM.
  • An MCP Server can choose to return either an application/pdf or downgrade to text/plain based on LLM capabilities.
  • An MCP Server can provide additional instructions in the Tool Result to guide the User to obtain content that could not otherwise be rendered/processed.
  • A client may choose not to include Tools that generate content types that cannot be handled.
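The PDF scenario above could be sketched roughly as follows. The `renders`/`tokenizes` field names come from this PR; the `choosePdfOrText` helper and its fallback rule are assumptions made for illustration and are not part of the proposal.

```typescript
// Shape of the proposed client capability (field names taken from this PR).
interface ContentTypes {
  renders?: string[];
  tokenizes?: string[];
}

// Hypothetical helper: return a PDF only when the Host hints that the LLM can
// tokenize it; otherwise downgrade to plain text. The lists are hints, so a
// missing list is treated here as "text only" (an assumption).
function choosePdfOrText(contentTypes?: ContentTypes): "application/pdf" | "text/plain" {
  const tokenizes = contentTypes?.tokenizes ?? [];
  return tokenizes.includes("application/pdf") ? "application/pdf" : "text/plain";
}

console.log(choosePdfOrText({ tokenizes: ["text/plain", "application/pdf"] }));
console.log(choosePdfOrText({ tokenizes: ["text/plain"] }));
```

Because the lists are described as non-exclusive, a real Server might still send the PDF alongside the text; this sketch only shows the simplest either/or choice.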

Breaking Changes

The change is backwards compatible.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

Additional context

This is not intended to be a complicated content-type negotiation protocol - but to provide a simple way for participating Hosts and Servers to provide better User Experiences across a range of deployment scenarios. The list of mime-types is intended to be indicative and neither restrictive nor exhaustive.

An agreed convention for Resources where audience: [User], priority: 1 is used to indicate content that should be rendered and not tokenized would further enhance the proposal. For example a PDF could be sent for rendering, with the main content sent as text/plain for the LLM.
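As a rough illustration of that convention, a Tool result might carry the PDF for the User and a plain-text rendition for the LLM. The types below are simplified local copies of the MCP content shapes; the `buildReportResult` helper, the URI, and the use of `audience: ["assistant"]` on the text block are assumptions, not something the proposal prescribes.

```typescript
// Simplified local copies of the MCP content shapes used in this sketch.
type Role = "user" | "assistant";
interface Annotations { audience?: Role[]; priority?: number; }
interface TextContent { type: "text"; text: string; annotations?: Annotations; }
interface EmbeddedBlobResource {
  type: "resource";
  resource: { uri: string; mimeType?: string; blob: string };
  annotations?: Annotations;
}
type ContentBlock = TextContent | EmbeddedBlobResource;

// Hypothetical helper: the PDF is marked for the User to view (audience user,
// priority 1), while a plain-text rendition is marked for the LLM.
function buildReportResult(pdfBase64: string, plainText: string): { content: ContentBlock[] } {
  return {
    content: [
      {
        type: "resource",
        resource: { uri: "file:///example/report.pdf", mimeType: "application/pdf", blob: pdfBase64 },
        annotations: { audience: ["user"], priority: 1 },
      },
      { type: "text", text: plainText, annotations: { audience: ["assistant"] } },
    ],
  };
}

// The base64 string is a placeholder, not a complete PDF.
const reportResult = buildReportResult("JVBERi0=", "Plain-text rendition of the report.");
console.log(reportResult.content.map((block) => block.type));
```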

I do not think a reciprocal server capability is necessary, as "Roots" provide the ability for the Host to provide arbitrary content to the server.

Update 2025-04-19
After consideration, a Server "generates" capability is appropriate. By convention Servers that support "Structured Outputs" would advertise "application/json" in their generates list.

Update 2025-04-24
Migrated Server "generates" capability to a generatesHint in ToolAnnotation. By convention Servers that support "Structured Outputs" would advertise the content type (e.g. application/json or application/xml). This would be compatible with the potential addition of a Schema related to this tool.

This PR has been opened for discussion and refinement, with additional documentation to be prepared if there is agreement in principle.

@evalstate evalstate changed the title Client Content capabilities RFC: Client / Server Content capabilities Apr 19, 2025
@evalstate
Contributor Author

evalstate commented Apr 19, 2025

A draft supplement is supplied below intended for inclusion in the documentation once the right place is identified if we progress with this PR.

ClientCapabilities contentTypes

  • renders[]: A non-exclusive list of MIME types intended to help Servers make informed decisions about content format selection. Servers may use this information along with audience annotations to target content appropriately. For example: ["text/plain","image/png","video/mp4"]
  • tokenizes[]: A non-exclusive list of MIME types to inform Servers what content types can be included in the LLM's context window. Servers may adapt their responses based on this such as providing alternative formats or using audience annotations to prioritise content appropriately. For example: ["text/plain","image/png"]
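A Host might advertise these lists during initialization along the following lines. This is a sketch only: the nesting of `contentTypes` directly under `capabilities` is inferred from the PR text, `ClientCapabilitiesSketch` models just the new field, and the `clientInfo` values are placeholders.

```typescript
interface ContentTypesCapability {
  renders?: string[];
  tokenizes?: string[];
}

// Only the field added by this proposal is modelled; other capabilities are omitted.
interface ClientCapabilitiesSketch {
  contentTypes?: ContentTypesCapability;
}

const advertised: ClientCapabilitiesSketch = {
  contentTypes: {
    renders: ["text/plain", "image/png", "video/mp4"], // can be shown to the User
    tokenizes: ["text/plain", "image/png"],            // can enter the LLM context
  },
};

// Hypothetical initialize request; the clientInfo values are placeholders.
const initializeRequest = {
  jsonrpc: "2.0",
  id: 1,
  method: "initialize",
  params: {
    protocolVersion: "2025-03-26",
    capabilities: advertised,
    clientInfo: { name: "example-host", version: "0.0.1" },
  },
};

console.log(JSON.stringify(initializeRequest, null, 2));
```

In this example the Host can play video for the User but cannot pass it to the model, which is the kind of gap the two separate lists are meant to expose.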

ServerCapabilities contentTypes

  • generates[]: A non-exclusive list of MIME types to inform Hosts what content types may be generated by a Server in a CallToolResult. Servers that support Structured Output SHOULD advertise application/json here.

ToolAnnotation generatesHint

  • generatesHint: A non-exclusive list of MIME types to inform Hosts what content types may be generated by a Server in a CallToolResult. Servers that support Structured Output SHOULD advertise the appropriate output (e.g. application/json) here.
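On the client side, the hint could be used to decide which Tools to expose. The sketch below assumes `generatesHint` sits under a Tool's `annotations`; the `ToolSketch` type, the tool names, and the keep/drop policy are all illustrative assumptions rather than behaviour defined by the proposal.

```typescript
interface ToolSketch {
  name: string;
  annotations?: { generatesHint?: string[] };
}

// One possible client policy (an assumption, not mandated by the proposal):
// keep a Tool when it carries no hint, or when at least one hinted type can
// be handled by the Host.
function selectUsableTools(tools: ToolSketch[], handled: string[]): ToolSketch[] {
  return tools.filter((tool) => {
    const hint = tool.annotations?.generatesHint;
    if (hint === undefined || hint.length === 0) return true;
    return hint.some((mimeType) => handled.includes(mimeType));
  });
}

// Hypothetical tool list.
const exampleTools: ToolSketch[] = [
  { name: "render_chart", annotations: { generatesHint: ["image/png"] } },
  { name: "export_video", annotations: { generatesHint: ["video/mp4"] } },
  { name: "lookup", annotations: { generatesHint: ["application/json", "text/plain"] } },
  { name: "legacy_tool" },
];

// Keeps render_chart, lookup and legacy_tool; drops export_video.
console.log(selectUsableTools(exampleTools, ["text/plain", "image/png"]).map((tool) => tool.name));
```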

Audience Annotations

For Resources annotated with audience=user, priority=1 the Host MAY choose not to present the full content of the Resource to the LLM.
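A Host could apply that rule with a small filter such as the one below. It is a sketch under stated assumptions: the `AnnotatedBlock` type is a simplified stand-in for MCP content, and treating "audience is exactly user" as the trigger is one reading of the convention, not a confirmed definition.

```typescript
type Role = "user" | "assistant";
interface AnnotatedBlock {
  type: "text" | "image" | "audio" | "resource";
  annotations?: { audience?: Role[]; priority?: number };
}

// True for a Resource aimed only at the User with priority 1.
function isUserOnlyTopPriority(block: AnnotatedBlock): boolean {
  const audience = block.annotations?.audience ?? [];
  return (
    block.type === "resource" &&
    audience.length === 1 &&
    audience[0] === "user" &&
    block.annotations?.priority === 1
  );
}

// Hypothetical Host-side filter: everything is rendered, but user-only
// priority-1 Resources are withheld from the LLM's context.
function blocksForModel<T extends AnnotatedBlock>(blocks: T[]): T[] {
  return blocks.filter((block) => !isUserOnlyTopPriority(block));
}

const mixed: AnnotatedBlock[] = [
  { type: "resource", annotations: { audience: ["user"], priority: 1 } },
  { type: "resource", annotations: { audience: ["user", "assistant"], priority: 1 } },
  { type: "text" },
];

console.log(blocksForModel(mixed).length);
```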

@evalstate evalstate marked this pull request as ready for review April 19, 2025 12:14
@patwhite

ServerCapabilities contentTypes

  • generates[]: A non-exclusive list of MIME types to inform Hosts what content types may be generated by a Server in a CallToolResult. Servers that support Structured Output SHOULD advertise application/json here.

The structured output piece strikes me as the biggest issue here - are you imagining that structured output always flows through an embedded resource? If not, the TextContent type is notably missing a mime type property, so as near as I can tell there's no mechanism to actually communicate a mime type like application/json to a client. If you're imagining this always goes through an embedded resource unless the return is explicitly text/plain, and only text/plain maps to the TextContent, based on the fact that many MCP servers are using the text type to return json, this proposal would double the number of calls for a json heavy MCP server if they wanted to adhere exactly to spec.

I'd also think it would be useful to have prescriptive documentation around this - I know this is a philosophical discussion, but is a PDF an image content type with a mime type of pdf or, since it's a flat file format, does it need to be an embedded resource? In the negotiation scenario, are you saying the tool call will either return a text type for text/plain (again in the example you gave), an image type for application/pdf, or an embedded resource, say if they're returning something structured?

High level, I like this idea as a negotiation, but there might need to be some supporting changes to handle the structured output piece efficiently, and the tools documentation would need to be updated with a minimal amount of guidance for how servers should treat these content type requests, AND there should be documentation for clients on recommendations for what is expected of the client / host if they send in a renders mime type.

@evalstate
Contributor Author

The structured output piece strikes me as the biggest issue here - are you imagining that structured output always flows through an embedded resource? If not, the TextContent type is notably missing a mime type property, so as near as I can tell there's no mechanism to actually communicate a mime type like application/json to a client. If you're imagining this always goes through an embedded resource unless the return is explicitly text/plain, and only text/plain maps to the TextContent, based on the fact that many MCP servers are using the text type to return json, this proposal would double the number of calls for a json heavy MCP server if they wanted to adhere exactly to spec.

The question on 371 was whether to use a TextResourceContents which has an optional MIME type and a uri. I would suggest continuing discussions on that aspect in #371, and consulting the schema for the data types under discussion.

I'd also think it would be useful to have prescriptive documentation around this - I know this is a philosophical discussion, but is a PDF an image content type with a mime type of pdf or, since it's a flat file format, does it need to be an embedded resource? In the negotiation scenario, are you saying the tool call will either return a text type for text/plain (again in the example you gave), an image type for application/pdf, or an embedded resource, say if they're returning something structured?

A PDF is a binary object that would be delivered as a "BlobResourceContents" with a MIME type of application/pdf. LLM Tokenization support for this varies hence Servers may choose to upgrade or downgrade based on known capabilities.

High level, I like this idea as a negotiation, but there might need to be some supporting changes to handle the structured output piece efficiently, and the tools documentation would need to be updated with a minimal amount of guidance for how servers should treat these content type requests, AND there should be documentation for clients on recommendations for what is expected of the client / host if they send in a renders mime type.

Questions on structured output specifically should be raised on #371. These are optional capabilities that Host and MCP Server Implementors can take advantage of to build enhanced applications.

@jonathanhefner
Contributor

TextContent type is notably missing a mime type property, so as near as I can tell there's no mechanism to actually communicate a mime type like application/json to a client

Per #180 (comment), I believe TextContent will support mimeType in the future.

@evalstate
Contributor Author

evalstate commented Apr 19, 2025

TextContent type is notably missing a mime type property, so as near as I can tell there's no mechanism to actually communicate a mime type like application/json to a client

Per #180 (comment), I believe TextContent will support mimeType in the future.

The suggestion was to use an EmbeddedResource of TextResourceContents type which contains a mimeType, a uri and text for the content. I think a mimeType on the TextContent itself would [potentially] be a good addition, but there is an alternative to TextContent.

Edited to say that mimeType on TextContent is potentially a good addition, my preference would be to fix it as text/plain though.
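The EmbeddedResource alternative mentioned here could look roughly like this. The types are simplified local copies of the existing MCP shapes; the `jsonAsEmbeddedResource` helper, the example URI, and the payload are hypothetical.

```typescript
// Simplified local copies of the existing MCP shapes.
interface TextResourceContents { uri: string; mimeType?: string; text: string; }
interface EmbeddedResource { type: "resource"; resource: TextResourceContents; }

// Hypothetical helper: wrap a JSON-serialisable value so that the MIME type
// travels with it, which a plain TextContent block cannot currently express.
function jsonAsEmbeddedResource(uri: string, value: unknown): EmbeddedResource {
  return {
    type: "resource",
    resource: { uri, mimeType: "application/json", text: JSON.stringify(value) },
  };
}

// The URI scheme and payload are placeholders.
const structured = jsonAsEmbeddedResource("example://results/1", { temperature: 21, unit: "C" });
console.log(structured.resource.mimeType, structured.resource.text);
```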

@patwhite

The question on 371 was whether to use a TextResourceContents which has an optional MIME type and a uri. I would suggest continuing discussions on that aspect in #371, and consulting the schema for the data types under discussion.

Ok, then this is basically dependent on #371 going through? That should be called out in the PR.

I'd also think it would be useful to have prescriptive documentation around this - I know this is a philosophical discussion, but is a PDF an image content type with a mime type of pdf or, since it's a flat file format, does it need to be an embedded resource? In the negotiation scenario, are you saying the tool call will either return a text type for text/plain (again in the example you gave), an image type for application/pdf, or an embedded resource, say if they're returning something structured?

A PDF is a binary object that would be delivered as a "BlobResourceContents" with a MIME type of application/pdf. LLM Tokenization support for this varies hence Servers may choose to upgrade or downgrade based on known capabilities.

This doesn't actually address the concern around documentation (though it nicely explains the thinking, but again, this is an intrinsic dependency on 371) - this PR has no documentation updates that state that a server SHOULD take these actions based on client supported mime types. It also gives no guidance on what the order of precedence is - speaking as someone working on the server side, without documentation it would be unclear and spotty how servers should respond to different client mime type capabilities. Should render be preferred over tokenize or vice versa, or should you return two content types if the renders and tokenizes are different sets? This kind of sneaks back to your comments on the search PR suggesting giving prescriptive guidance to the client developers on how to handle the search capability, but that same sort of guidance is helpful on the server side for potentially contradictory client mime types.

@evalstate
Contributor Author

evalstate commented Apr 19, 2025

This proposal has no dependency on #371 and predates it by 4 weeks.

@patwhite

This proposal has no dependency on #371 and predates it by 4 weeks.

Ah gotcha - then I guess the direct feedback is this PR should pick up the fields required to universally communicate back mime type on responses. Again, coming from the server side, I'm not sure how I'd support application/json or pdf (without looking at 371, which we're saying we shouldn't have to look at since it's not a dependency)

@evalstate
Contributor Author

evalstate commented Apr 19, 2025

This proposal has no dependency on #371 and predates it by 4 weeks.

Ah gotcha - then I guess the direct feedback is this PR should pick up the fields required to universally communicate back mime type on responses. Again, coming from the server side, I'm not sure how I'd support application/json or pdf (without looking at 371, which we're saying we shouldn't have to look at since it's not a dependency)

The content types are already within the protocol, and well documented here: https://modelcontextprotocol.io/specification/2025-03-26/server/resources

The terminology on MAY, SHOULD and so on are defined here: https://modelcontextprotocol.io/specification/2025-03-26

@patwhite

patwhite commented Apr 19, 2025

To put the request for documentation in context, here's the guidance offered by the HTTP spec on content types - it's multiple pages and includes recommendations on defaults, recommendations on sniffing, how to handle unknown responses from both the server and client side, and more. And, it's worth mentioning, the HTTP server use case is actually quite a bit simpler in so much as an http server really only has the ability to return a single result to a request and only gets a single accept header from the requestor. This is as opposed to MCP servers which can potentially return multiple responses and have a much more complex matrix of considerations for what to return, and with this PR actually present two distinct accept lists.

This PR creates a similar mechanism within MCP but without any actual documented guidance on what should be returned by default in the absence of accepted content types, which accept list should be given precedence, if it's acceptable or preferable to return multiple content blocks if multiple mime types are accepted by the client, etc.

Regarding the comment on content types being well documented, this is from your PR:

contentTypes | Advertises content types the Server may generate in a CallToolResult

I think this is what confused me, because as it's written here contentTypes only partially applies to CallToolResults (since mime is notably absent from the text response). So, maybe a slight rewording, OR pull an optional mime type onto the TextContent as part of this PR?

@patwhite

patwhite commented Apr 19, 2025

Actually, this all brings up an interesting question - right now, mime types are primarily on resources (plus image content and audio content) - would a server ever change the resources it presents to the client based on the accept lists?

@evalstate
Contributor Author

To put the request for documentation in context, here's the guidance offered by the HTTP spec on content types - it's multiple pages and includes recommendations on defaults, recommendations on sniffing, how to handle unknown responses from both the server and client side, and more. And, it's worth mentioning, the HTTP server use case is actually quite a bit simpler in so much as an http server really only has the ability to return a single result to a request and only gets a single accept header from the requestor. This is as opposed to MCP servers which can potentially return multiple responses and have a much more complex matrix of considerations for what to return, and with this PR actually present two distinct accept lists.

This PR creates a similar mechanism within MCP but without any actual documented guidance on what should be returned by default in the absence of accepted content types, which accept list should be given precedence, if it's acceptable or preferable to return multiple content blocks if multiple mime types are accepted by the client, etc.

Regarding the comment on content types being well documented, this is from your PR:

contentTypes | Advertises content types the Server may generate in a CallToolResult

I think this is what confused me, because as it's written here contentTypes only partially applies to CallToolResults (since mime is notably absent from the text response). So, maybe a slight rewording, OR pull an optional mime type onto the TextContent as part of this PR?

CallToolResult returns an array of content.

@patwhite

To put the request for documentation in context, here's the guidance offered by the HTTP spec on content types - it's multiple pages and includes recommendations on defaults, recommendations on sniffing, how to handle unknown responses from both the server and client side, and more. And, it's worth mentioning, the HTTP server use case is actually quite a bit simpler in so much as an http server really only has the ability to return a single result to a request and only gets a single accept header from the requestor. This is as opposed to MCP servers which can potentially return multiple responses and have a much more complex matrix of considerations for what to return, and with this PR actually present two distinct accept lists.
This PR creates a similar mechanism within MCP but without any actual documented guidance on what should be returned by default in the absence of accepted content types, which accept list should be given precedence, if it's acceptable or preferable to return multiple content blocks if multiple mime types are accepted by the client, etc.
Regarding the comment on content types being well documented, this is from your PR:
contentTypes | Advertises content types the Server may generate in a CallToolResult
I think this is what confused me, because as it's written here contentTypes only partially applies to CallToolResults (since mime is notably absent from the text response). So, maybe a slight rewording, OR pull an optional mime type onto the TextContent as part of this PR?

CallToolResult returns an array of content.

Yup, and my point is one of the array element options doesn't have a mime type.

@evalstate
Contributor Author

Actually, this all brings up an interesting question - right now, mime types are primarily on resources (plus image content and audio content) - would a server ever change the resources it presents to the client based on the accept lists?

From the introduction to this PR:

This enhancement works with the existing annotations system to optionally enable MCP Servers to adapt their content delivery to best match Host capabilities.

"adapt their content delivery" means Servers adjusting the outputs of Prompts, Resources or Tools based on the content type hints from the Client. This is a good point to clarify for this discussion, thank you.

@evalstate
Contributor Author

evalstate commented Apr 20, 2025

I've updated the comment in #371 to include an example CallToolResult and guidance text to make that clearer. Note this PR has been re-drafted and discussion moved to #356.

This is as opposed to MCP servers which can potentially return multiple responses and have a much more complex matrix of considerations for what to return, and with this PR actually present two distinct accept lists.

This PR is not proposing "accept lists", but optional content type hints. Since this PR is adding optional hints to the existing protocol, it may be more appropriate to start a separate discussion in the forums on that topic and whether MCP should contain that guidance.

As other points of discussion for this PR, I'd like to also get feedback on:

  • Whether generates should be marked on individual Tools with an assumed default of text/plain.
  • There was earlier discussion on a potential FileContent type that may make sense for larger items. @jerome3o-anthropic
  • Similar to the above, whether any conventions should be applied to using Roots to transfer larger Resources (perhaps a specialization of FileContent).
  • Linking to discussion on New Content Type for "UI" - will comment over on that thread - my understanding is the protocol already supports their need but want to confirm understanding.

@connor4312
Contributor

VS Code and many other clients allow users to change models on-the-fly, even in the same "chat session." With this proposal, if that happens, a client would need to stop and restart their MCP connection if they were to announce a different set of content types that they're able to tokenize. Since some servers can be stateful (e.g. playwright/puppeteer) this isn't something that can be done safely. I think we would need some way to announce a new (sub)set of capabilities to servers.

@evalstate
Contributor Author

I understand. I might think about this the other way though - if the Client can match content consumption/generation (e.g. image/*), then it can inform the User about the risk of shifting to a model which is text only. Or, if a text-only model is already in use, it can indicate to the User that this may be restrictive.
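A Host-side check along these lines might look like the sketch below. Wildcard patterns such as image/* are only mentioned informally in this thread, so the matching rule in `matchesMimeType` and the `typesLostOnSwitch` helper are assumptions for illustration.

```typescript
// Match a concrete MIME type against a pattern such as "image/*" or "text/plain".
function matchesMimeType(pattern: string, mimeType: string): boolean {
  if (pattern === "*/*" || pattern === mimeType) return true;
  if (pattern.endsWith("/*")) {
    // "image/*" becomes the prefix "image/"
    return mimeType.startsWith(pattern.slice(0, -1));
  }
  return false;
}

// Hypothetical check a Host could run before switching models: list the types
// that Tools hint they generate but the new model cannot tokenize.
function typesLostOnSwitch(generated: string[], newTokenizes: string[]): string[] {
  return generated.filter(
    (mimeType) => !newTokenizes.some((pattern) => matchesMimeType(pattern, mimeType)),
  );
}

// A text-only model would lose image/png; a model accepting image/* loses nothing.
console.log(typesLostOnSwitch(["text/plain", "image/png"], ["text/*"]));
console.log(typesLostOnSwitch(["text/plain", "image/png"], ["text/plain", "image/*"]));
```

A non-empty result is the point at which the Host could warn the User before the switch.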

@connor4312
Contributor

connor4312 commented May 29, 2025

In the current state of this PR, yes we might want to warn the user about the risk. But if there were a way to signal a change in capabilities, then it would 'just work' (given a well-implemented MCP server) and we wouldn't have to warn the user about anything 🙂

@evalstate
Contributor Author

There are loads of scenarios, and I think there is another idea going around about exposing direct model information to the MCP Server. I guess we need to figure out the right level of abstraction for the Protocol. We already have mid-lifecycle capability changes from Server to Client (e.g. ToolListChangedNotification), so it doesn't seem out of the question.

@jonathanhefner
Contributor

VS Code and many other clients allow users to change models on-the-fly, even in the same "chat session." With this proposal, if that happens, a client would need to stop and restart their MCP connection if they were to announce a different set of content types that they're able to tokenize. Since some servers can be stateful (e.g. playwright/puppeteer) this isn't something that can be done safely. I think we would need some way to announce a new (sub)set of capabilities to servers.

Instead of specifying content types as a capability during the initialization phase, what about specifying them as a _meta parameter on each relevant JSON-RPC request (similar to an Accept header)? That would avoid the need to restart.
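A per-request hint of this kind might be attached as in the sketch below. The `_meta` field on request params exists in MCP today, but the `contentTypes` key inside it, the `withContentTypes` helper, and the tool name are hypothetical.

```typescript
interface ContentTypes { renders?: string[]; tokenizes?: string[]; }

// Simplified shape of tools/call params; _meta is an open-ended object.
interface CallToolParams {
  name: string;
  arguments?: { [key: string]: unknown };
  _meta?: { [key: string]: unknown };
}

// Hypothetical helper: attach the hints under _meta.contentTypes (the key name
// is an assumption) while preserving any existing _meta entries.
function withContentTypes(params: CallToolParams, contentTypes: ContentTypes): CallToolParams {
  return { ...params, _meta: { ...params._meta, contentTypes } };
}

// Hypothetical request sent while a text-only model is selected.
const callRequest = {
  jsonrpc: "2.0",
  id: 7,
  method: "tools/call",
  params: withContentTypes(
    { name: "take_screenshot", arguments: { url: "https://example.com" } },
    { tokenizes: ["text/plain"] },
  ),
};

console.log(JSON.stringify(callRequest.params._meta));
```

Since the hint travels with each call, switching models only changes what the next request carries and the connection stays open.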

@connor4312
Contributor

connor4312 commented May 29, 2025

Maybe, though the MCP server may also want to change their prompts when the user's chat model changes. It's more likely they would change the prompt results, rather than the name/arguments that are announced to clients.

@jonathanhefner
Contributor

Instead of specifying content types as a capability during the initialization phase, what about specifying them as a _meta parameter on each relevant JSON-RPC request (similar to an Accept header)? That would avoid the need to restart.

Follow-up thought: perhaps we should add contentTypes to ClientCapabilities, and then, instead of adding a _meta.contentTypes param, we add a _meta.capabilities param. That would support not only contentTypes, but also sampling, roots, etc.

@evalstate
Contributor Author

I can't see the harm in Clients using the _meta field for that; the point is to advertise to the MCP Server what can be handled (not a guarantee that if it's sent it will be handled - it's a hint). Ultimately it's the Host app's choice whether to allow the User to select or to optimize model selection.

We have to be careful - at some point the abstractions between Client and Server get so leaky that MCP is an inhibitor rather than an interop enabler....!

@kentcdodds
Contributor

This issue is kind of related. I wonder if this could be combined or also supported in some way: #604

@dsp-ant
Member

dsp-ant commented Jun 10, 2025

I think there is a bigger issue here that is "content negotiation". @connor4312's point on changing requirements between tool calls is a good example. I defer this for now, but I have a strong suspicion that we want something different that is more akin to accept headers in HTTP for each request itself.

@dsp-ant dsp-ant modified the milestones: DRAFT 2025-06-XX, DRAFT-XX-XX Jun 10, 2025
@dsp-ant dsp-ant moved this from In Review to Consulting in Standards Track Jun 10, 2025
@connor4312
Contributor

Here's an idea of how dynamic capabilities could be represented:

diff --git a/schema/draft/schema.ts b/schema/draft/schema.ts
index c688dc3..f9d66b5 100644
--- a/schema/draft/schema.ts
+++ b/schema/draft/schema.ts
@@ -200,9 +200,10 @@ export interface InitializedNotification extends Notification {
 }
 
 /**
- * Capabilities a client may support. Known capabilities are defined here, in this schema, but this is not a closed set: any client can define its own, additional capabilities.
+ * Part of {@link ClientCapabilities} which are sent during initialization and
+ * cannot be changed during the course of a session.
  */
-export interface ClientCapabilities {
+export interface StaticClientCapabilities {
   /**
    * Experimental, non-standard capabilities that the client supports.
    */
@@ -224,6 +225,13 @@ export interface ClientCapabilities {
    * Present if the client supports elicitation from the server.
    */
   elicitation?: object;
+}
+
+/**
+ * Part of {@link ClientCapabilities} which can be dynamically changed during
+ * the course of a session.
+ */
+export interface DynamicClientCapabilities {
   /**
    * Present if the client advertises content types it can handle.
    */
@@ -239,6 +247,12 @@ export interface ClientCapabilities {
   };
 }
 
+
+/**
+ * Capabilities a client may support. Known capabilities are defined here, in this schema, but this is not a closed set: any client can define its own, additional capabilities.
+ */
+export interface ClientCapabilities extends DynamicClientCapabilities, StaticClientCapabilities {}
+
 /**
  * Capabilities that a server may support. Known capabilities are defined here, in this schema, but this is not a closed set: any server can define its own, additional capabilities.
  */
@@ -1333,6 +1347,22 @@ export interface ElicitResult extends Result {
   content?: { [key: string]: unknown };
 }
 
+/**
+ * A notification from the client to the server, informing it that its capabilities
+ * have changed. This is typically used when the client has updated its underlying
+ * model or configuration.
+ */
+export interface ClientCapabilitiesChangedNotification extends Notification {
+  method: "notifications/client_capabilities/changed";
+  params: {
+    /**
+     * The new client capabilities that the client supports.
+     */
+    capabilities: DynamicClientCapabilities;
+  };
+}
+
+
 /* Client messages */
 export type ClientRequest =
   | PingRequest
@@ -1353,7 +1383,8 @@ export type ClientNotification =
   | CancelledNotification
   | ProgressNotification
   | InitializedNotification
-  | RootsListChangedNotification;
+  | RootsListChangedNotification
+  | ClientCapabilitiesChangedNotification;
 
 export type ClientResult = EmptyResult | CreateMessageResult | ListRootsResult | ElicitResult;

As MCP and my understanding of it as a client implementor has grown, I no longer think per-request headers are ideal. Namely due to sampling: sampling requests and responses will represent different content types and they can be emitted async, outside the lifecycle of any particular client request, so I think a push mechanism for the client to announce changed capabilities is preferable.
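To make the push mechanism concrete, here is a sketch of both ends using the notification from the diff above. The method name and the `DynamicClientCapabilities` name come from that diff; the inner `renders`/`tokenizes` fields follow the PR description, and the `SessionState` class, along with its replace-rather-than-merge behaviour, is an assumption.

```typescript
interface ContentTypes { renders?: string[]; tokenizes?: string[]; }
interface DynamicClientCapabilities { contentTypes?: ContentTypes; }

// JSON-RPC form of the notification proposed in the diff.
interface ClientCapabilitiesChangedNotification {
  jsonrpc: "2.0";
  method: "notifications/client_capabilities/changed";
  params: { capabilities: DynamicClientCapabilities };
}

// Client side: built when the User switches to a different model.
function capabilitiesChanged(contentTypes: ContentTypes): ClientCapabilitiesChangedNotification {
  return {
    jsonrpc: "2.0",
    method: "notifications/client_capabilities/changed",
    params: { capabilities: { contentTypes } },
  };
}

// Server side: hypothetical per-session state. Replacing the stored value
// (rather than merging) is an assumption; the diff does not say which applies.
class SessionState {
  private dynamic: DynamicClientCapabilities = {};

  handle(notification: ClientCapabilitiesChangedNotification): void {
    this.dynamic = notification.params.capabilities;
  }

  canTokenize(mimeType: string): boolean {
    return this.dynamic.contentTypes?.tokenizes?.includes(mimeType) ?? false;
  }
}

const session = new SessionState();
session.handle(capabilitiesChanged({ tokenizes: ["text/plain", "image/png"] }));
console.log(session.canTokenize("image/png")); // multimodal model selected
session.handle(capabilitiesChanged({ tokenizes: ["text/plain"] }));
console.log(session.canTokenize("image/png")); // text-only model selected
```

Because the state lives with the session, it also covers asynchronous sampling traffic that is not tied to any single client request.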

@evalstate
Contributor Author

Tagging @kentcdodds and referring to #679

@kentcdodds
Contributor

I agree with @connor4312 here and (as a server implementer) I think that it would be useful to go both ways as well (so the server can announce changed capabilities as well as the client).

In general, what I mean by #679 is that both clients and servers should communicate both what they can offer and what they can accept.

Before now I hadn't considered the fact that these capabilities could change over time and I'm not sure I completely understand the use case there, but I do think that the client and server should both be able to communicate their full capabilities.

@connor4312
Contributor

I'm not sure I completely understand the use case there

VS Code and most other clients let you change the model you're using during a chat session, or even change it automatically depending on the query. Different models will have different sets of mimetypes they natively understand, and we don't want to have to restart MCP servers to announce updated capabilities when that happens.

@kentcdodds
Contributor

That makes complete sense. I don't want to take things too far off the rails here, but I think the discussion over on this issue has a bearing on what could happen in this PR as well: #679 (comment)

Projects
Status: Consulting
6 participants