
Gemini Context caching #3212


Open · sahanatvessel opened this issue Oct 6, 2024 · 3 comments · May be fixed by #6256

Comments

@sahanatvessel

Feature Description

Context caching is particularly well suited to scenarios where a substantial initial context is referenced repeatedly by shorter requests. Consider using context caching for use cases such as:

Chatbots with extensive system instructions
Repetitive analysis of lengthy video files
Recurring queries against large document sets
Frequent code repository analysis or bug fixing

Use Case

In a typical AI workflow, you might pass the same input tokens over and over to a model. Using the Gemini API context caching feature, you can pass some content to the model once, cache the input tokens, and then refer to the cached tokens for subsequent requests. At certain volumes, using cached tokens is lower cost than passing in the same corpus of tokens repeatedly.

When you cache a set of tokens, you can choose how long you want the cache to exist before the tokens are automatically deleted. This caching duration is called the time to live (TTL). If not set, the TTL defaults to 1 hour. The cost for caching depends on the input token size and how long you want the tokens to persist.
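For illustration, here is roughly what this looks like with the Google Gen AI Node SDK (a sketch assuming the @google/genai package and a placeholder largeDocumentText variable; see the docs linked below for the authoritative example):

import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Cache a large document once and pay for cached-token storage instead of resending it
const largeDocumentText = "..."; // placeholder for a large corpus
const cache = await ai.caches.create({
  model: "gemini-1.5-flash-001",
  config: {
    contents: [{ role: "user", parts: [{ text: largeDocumentText }] }],
    ttl: "3600s", // time to live; defaults to 1 hour if omitted
  },
});

// Subsequent requests reference the cached tokens by name
const response = await ai.models.generateContent({
  model: "gemini-1.5-flash-001",
  contents: "Summarize the cached document.",
  config: { cachedContent: cache.name },
});
console.log(response.text);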

Additional context

https://ai.google.dev/gemini-api/docs/caching?lang=node

@ItsWendell

Would love to have some utilities around this too for the Google and Vertex AI providers, especially for use cases where you want to use tools in combination with cached context. There is a documented way to pass a cachedContent key to the provider, which works, but as soon as tools are involved, it breaks:

AI_APICallError: Tool config, tools and system instruction should not be set in the request when using cached content.

      at null.<anonymous>
  (file:///node_modules/@ai-sdk/provider-utils/src/response-handler.ts:59:16)

This would likely require some re-working of how tools are passed on to the Google APIs when a cachedContent name is provided, since you might have tools defined in your cache. But for them to be executed, you still need them in your Vercel AI SDK client too, of course.
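For reference, the tool-free case that does work looks roughly like this (a sketch; I'm assuming the cachedContent option goes under the google provider-options namespace and that the cache was created beforehand, with a placeholder cache name):

import { generateText } from "ai";
import { createVertex } from "@ai-sdk/google-vertex";

const vertex = createVertex({ project: "your-gcp-project-id", location: "us-central1" });

// Works, as long as no tools / toolConfig / system instruction accompany the cached content
const { text } = await generateText({
  model: vertex("gemini-1.5-flash-001"),
  providerOptions: {
    // placeholder name; use the value returned by the cache-creation call
    google: { cachedContent: "projects/your-project/locations/us-central1/cachedContents/your-cache" },
  },
  prompt: "Summarize the cached document.",
});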

@ItsWendell

I was exploring this last night and wanted to share what it would take to make this work. What I'm sharing here is quite hack-y but should give us some pointers on how to properly implement context caching within the Vercel AI SDK, including support for caching systemInstruction, tools, and toolConfig alongside message history.

Based on my understanding and experimentation, the core challenge is that the Gemini API (via Google AI or Vertex AI) doesn't allow sending tools, toolConfig, or systemInstruction in a request that references cachedContent. These configurations need to be part of the cache itself.

Here's a rough outline of the approach I explored:

1. Modify Provider Request for Cached Calls

When using cachedContent, we need to prevent the Google AI / Vertex AI provider from sending tools, toolConfig, and systemInstruction in the API request body. For my POC, I intercepted the call within the provider options:

import { createVertex } from "@ai-sdk/google-vertex";
// Assuming other necessary imports

const provider = createVertex({
  // ... other options
  fetch: async (input, init) => {
    const request = new Request(input, init);
    const headers = request.headers;
    const contentType = headers.get("content-type");

    // Check if it's a JSON request likely targeting the generateContent endpoint
    if (contentType?.includes("application/json")) {
      // Clone the request to read the body safely
      const clonedRequest = request.clone();
      try {
        const body = (await clonedRequest.json()) as Record<string, unknown>;

        // If cachedContent is present, remove fields disallowed by the API
        if (body?.cachedContent) {
          delete body.tools;
          delete body.toolConfig;
          delete body.systemInstruction;

          // Create a new init object with the modified body
          const newInit = {
            ...init,
            body: JSON.stringify(body), // Reserialize the modified body
          };
          // Fetch with the modified init object
          return fetch(input, newInit);
        }
      } catch (error) {
        console.error("Error processing request body:", error);
        // Proceed with the original request if JSON parsing fails or it's not the expected structure
      }
    }
    // Fallback to the original fetch call
    return fetch(input, init);
  },
});

2. Separate Configurations and Create Cache Explicitly

Since the configurations (tools, toolConfig, systemInstruction) must be defined when creating the cache but omitted from subsequent API calls using the cache (as handled in Step 1), they need to be managed separately.

Let's first define our configurations separately.

import { GoogleGenAI, FunctionCallingConfigMode, type Schema } from "@google/genai";
import { tool } from "ai";
import { asSchema, convertJSONSchemaToOpenAPISchema } from "@ai-sdk/provider-utils";

// Define tools using Vercel AI SDK's tool definition
const tools = {
  think: tool({ /* ... think tool definition */ }),
  search: tool({ /* ... search tool definition */ }),
};

// Define system instruction
const systemInstructionText = "You're a friendly bot with a very long system instruction and potentially large files referenced in the cache.";

// Prepare tool definitions for Google API format
const functionDeclarations = Object.entries(tools).map(([name, toolDef]) => ({
  name,
  description: toolDef.description ?? "",
  parameters: convertJSONSchemaToOpenAPISchema(
    "jsonSchema" in toolDef.parameters
      ? toolDef.parameters.jsonSchema
      : asSchema(toolDef.parameters as any).jsonSchema ?? {}
  ) as Schema,
}));

// ToolConfig for Google API
const toolConfig = {
  functionCallingConfig: {
    mode: FunctionCallingConfigMode.AUTO,
    // allowedFunctionNames: Object.keys(tools), // Only needed if Mode is ANY
  },
};

Then, create the cache using the Google Gen AI SDK (@google/genai) directly:

// Initialize the Google Gen AI SDK (ensure auth is configured)
const genai = new GoogleGenAI({
  vertexai: true, // or false for the Gemini API (Google AI)
  project: "your-gcp-project-id",
  location: "us-central1", // Vertex AI region
  // googleAuthOptions: { ... }, // If needed
});

// Example conversation ID or unique identifier for the cache
const conversationId = "unique-conversation-123";

// Create the cache
const cache = await genai.caches.create({
  model: "gemini-1.5-flash-001", // Ensure this matches the model used later
  config: {
    displayName: `Cache for conversation ${conversationId}`,
    ttl: "600s", // How long the cache stays valid
    systemInstruction: {
      parts: [{ text: systemInstructionText }],
    },
    tools: [{ functionDeclarations }],
    toolConfig: toolConfig,
    contents: [ // Include initial context/files here
      {
        role: "user", // Can be 'user' or 'model'
        parts: [
          { text: "Here is a large document for context:" },
          {
            // Use a GCS URI for large files
            fileData: {
              fileUri: "gs://your-bucket-name/your-large-file.pdf",
              mimeType: "application/pdf",
            },
            // Alternatively, for small payloads, use an inlineData part instead:
            // inlineData: { data: Buffer.from("...").toString("base64"), mimeType: "text/plain" },
          },
        ],
      },
      // Add more history/context if needed
    ],
  },
});

console.log(`Cache created: ${cache.name}`);

Finally, use the cache name with the Vercel AI SDK's generateText (or streamText, etc.), providing the modified provider from Step 1:

import { generateText, type CoreMessage } from "ai";

// Use the modified provider and pass the cache name
const model = provider(cache.model); // Use the same model name as the cache

const messages: CoreMessage[] = [
  // Add only *new* messages not already in the cache's 'contents'
  { role: "user", content: "Summarize the document provided earlier." }
];

const result = await generateText({
  model: model,
  // Pass the cache name via provider options; the Step 1 fetch wrapper strips the disallowed fields
  providerOptions: {
    google: { cachedContent: cache.name },
  },
  system: systemInstructionText,
  tools: tools, // Still required here for executing functions
  messages: messages, // Only new messages
  maxSteps: 20, // Maximum number of steps / tool-call rounds
});

console.log(result.text);

This approach seems viable for both Google AI and Vertex AI providers. It's definitely a bit more involved as it requires interacting with the underlying Google SDK directly for cache management and modifying the provider's fetch behavior.
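For completeness, cache lifecycle management would also go through the Google SDK directly; a minimal sketch, again assuming the @google/genai caches API:

// Extend the TTL while the conversation is still active
await genai.caches.update({
  name: cache.name,
  config: { ttl: "600s" },
});

// Delete the cache once it is no longer needed
await genai.caches.delete({ name: cache.name });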

I'd be happy to collaborate on refining an approach, exploring cleaner abstractions within the SDK if possible, and perhaps opening a PR with a more integrated solution if this direction seems promising.

@YahngSungho

@ItsWendell

Just out of curiosity, if you have to go through all of that, is there still much point in using this Vercel library at all?
