
Gemini Context caching #3212


Open · sahanatvessel opened this issue Oct 6, 2024 · 3 comments · May be fixed by #6256

Comments

@sahanatvessel

Feature Description

Context caching is particularly well suited to scenarios where a substantial initial context is referenced repeatedly by shorter requests. Consider using context caching for use cases such as:

Chatbots with extensive system instructions
Repetitive analysis of lengthy video files
Recurring queries against large document sets
Frequent code repository analysis or bug fixing

Use Case

In a typical AI workflow, you might pass the same input tokens over and over to a model. Using the Gemini API context caching feature, you can pass some content to the model once, cache the input tokens, and then refer to the cached tokens for subsequent requests. At certain volumes, using cached tokens is lower cost than passing in the same corpus of tokens repeatedly.

When you cache a set of tokens, you can choose how long you want the cache to exist before the tokens are automatically deleted. This caching duration is called the time to live (TTL). If not set, the TTL defaults to 1 hour. The cost for caching depends on the input token size and how long you want the tokens to persist.
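For illustration, here is roughly what this looks like with the Google Gen AI Node SDK (a sketch assuming the @google/genai package and a placeholder largeDocumentText variable; see the docs linked below for the authoritative example):

import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Cache a large document once and pay for cached-token storage instead of resending it
const largeDocumentText = "..."; // placeholder for a large corpus
const cache = await ai.caches.create({
  model: "gemini-1.5-flash-001",
  config: {
    contents: [{ role: "user", parts: [{ text: largeDocumentText }] }],
    ttl: "3600s", // time to live; defaults to 1 hour if omitted
  },
});

// Subsequent requests reference the cached tokens by name
const response = await ai.models.generateContent({
  model: "gemini-1.5-flash-001",
  contents: "Summarize the cached document.",
  config: { cachedContent: cache.name },
});
console.log(response.text);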

Additional context

https://ai.google.dev/gemini-api/docs/caching?lang=node

@ItsWendell

Would love to have some utilities around this too for the Google and Vertex AI providers, especially for use cases where you want to use tools in combination with cached context. There is a documented way to pass a cachedContent key to the provider, which works, but as soon as tools are involved, it breaks:

AI_APICallError: Tool config, tools and system instruction should not be set in the request when using cached content.

      at null.<anonymous>
  (file:///node_modules/@ai-sdk/provider-utils/src/response-handler.ts:59:16)

This would likely require some re-working of how tools are passed on to the Google APIs when a cachedContent name is provided, since you might have tools defined in your cache. But for them to be executed, you still need them in your Vercel AI SDK client too, of course.
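For reference, the tool-free case that does work looks roughly like this (a sketch; I'm assuming the cachedContent option goes under the google provider-options namespace and that the cache was created beforehand, with a placeholder cache name):

import { generateText } from "ai";
import { createVertex } from "@ai-sdk/google-vertex";

const vertex = createVertex({ project: "your-gcp-project-id", location: "us-central1" });

// Works, as long as no tools / toolConfig / system instruction accompany the cached content
const { text } = await generateText({
  model: vertex("gemini-1.5-flash-001"),
  providerOptions: {
    // placeholder name; use the value returned by the cache-creation call
    google: { cachedContent: "projects/your-project/locations/us-central1/cachedContents/your-cache" },
  },
  prompt: "Summarize the cached document.",
});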

@ItsWendell

I was exploring this last night and wanted to share what it would take to make this work. What I'm sharing here is quite hack-y but should give us some pointers on how to properly implement context caching within the Vercel AI SDK, including support for caching systemInstruction, tools, and toolConfig alongside message history.

Based on my understanding and experimentation, the core challenge is that the Gemini API (via Google AI or Vertex AI) doesn't allow sending tools, toolConfig, or systemInstruction in a request that references cachedContent. These configurations need to be part of the cache itself.

Here's a rough outline of the approach I explored:

1. Modify Provider Request for Cached Calls

When using cachedContent, we need to prevent the Google AI / Vertex AI provider from sending tools, toolConfig, and systemInstruction in the API request body. For my POC, I intercepted the call within the provider options:

import { createVertex } from "@ai-sdk/google-vertex";
// Assuming other necessary imports

const provider = createVertex({
  // ... other options
  fetch: async (input, init) => {
    const request = new Request(input, init);
    const headers = request.headers;
    const contentType = headers.get("content-type");

    // Check if it's a JSON request likely targeting the generateContent endpoint
    if (contentType?.includes("application/json")) {
      // Clone the request to read the body safely
      const clonedRequest = request.clone();
      try {
        const body = (await clonedRequest.json()) as Record<string, unknown>;

        // If cachedContent is present, remove fields disallowed by the API
        if (body?.cachedContent) {
          delete body.tools;
          delete body.toolConfig;
          delete body.systemInstruction;

          // Create a new init object with the modified body
          const newInit = {
            ...init,
            body: JSON.stringify(body), // Reserialize the modified body
          };
          // Fetch with the modified init object
          return fetch(input, newInit);
        }
      } catch (error) {
        console.error("Error processing request body:", error);
        // Proceed with the original request if JSON parsing fails or it's not the expected structure
      }
    }
    // Fallback to the original fetch call
    return fetch(input, init);
  },
});

2. Separate Configurations and Create Cache Explicitly

Since the configurations (tools, toolConfig, systemInstruction) must be defined when creating the cache but omitted from subsequent API calls using the cache (as handled in Step 1), they need to be managed separately.

Let's first define our configurations separately.

import { GoogleGenAI, FunctionCallingConfigMode, type Schema } from "@google/genai";
import { tool } from "ai";
import { asSchema, convertJSONSchemaToOpenAPISchema } from "@ai-sdk/provider-utils";

// Define tools using Vercel AI SDK's tool definition
const tools = {
  think: tool({ /* ... think tool definition */ }),
  search: tool({ /* ... search tool definition */ }),
};

// Define system instruction
const systemInstructionText = "You're a friendly bot with a very long system instruction and potentially large files referenced in the cache.";

// Prepare tool definitions for Google API format
const functionDeclarations = Object.entries(tools).map(([name, toolDef]) => ({
  name,
  description: toolDef.description ?? "",
  parameters: convertJSONSchemaToOpenAPISchema(
    "jsonSchema" in toolDef.parameters
      ? toolDef.parameters.jsonSchema
      : asSchema(toolDef.parameters as any).jsonSchema ?? {}
  ) as Schema,
}));

// ToolConfig for Google API
const toolConfig = {
  functionCallingConfig: {
    mode: FunctionCallingConfigMode.AUTO,
    // allowedFunctionNames: Object.keys(tools), // Only needed if Mode is ANY
  },
};

Then, create the cache using the Google Gen AI SDK (@google/genai) directly:

// Initialize the Google Gen AI SDK (ensure auth is configured)
const genai = new GoogleGenAI({
  vertexai: true, // or false for the Gemini API (Google AI)
  project: "your-gcp-project-id",
  location: "us-central1", // Vertex AI region
  // googleAuthOptions: { ... }, // If needed
});

// Example conversation ID or unique identifier for the cache
const conversationId = "unique-conversation-123";

// Create the cache
const cache = await genai.caches.create({
  model: "gemini-1.5-flash-001", // Ensure this matches the model used later
  config: {
    displayName: `Cache for conversation ${conversationId}`,
    ttl: "600s", // How long the cache stays valid
    systemInstruction: {
      parts: [{ text: systemInstructionText }],
    },
    tools: [{ functionDeclarations }],
    toolConfig: toolConfig,
    contents: [ // Include initial context/files here
      {
        role: "user", // Can be 'user' or 'model'
        parts: [
          { text: "Here is a large document for context:" },
          {
            // Use a GCS URI for large files
            fileData: {
              fileUri: "gs://your-bucket-name/your-large-file.pdf",
              mimeType: "application/pdf",
            },
            // Alternatively, for small payloads, use an inlineData part instead:
            // inlineData: { data: Buffer.from("...").toString("base64"), mimeType: "text/plain" },
          },
        ],
      },
      // Add more history/context if needed
    ],
  },
});

console.log(`Cache created: ${cache.name}`);

Finally, use the cache name with the Vercel AI SDK's generateText (or streamText, etc.), providing the modified provider from Step 1:

import { generateText, type CoreMessage } from "ai";

// Use the modified provider and pass the cache name
const model = provider(cache.model); // Use the same model name as the cache

const messages: CoreMessage[] = [
  // Add only *new* messages not already in the cache's 'contents'
  { role: "user", content: "Summarize the document provided earlier." }
];

const result = await generateText({
  model: model,
  // Pass the cache name via provider options; the Step 1 fetch wrapper strips the disallowed fields
  providerOptions: {
    google: { cachedContent: cache.name },
  },
  system: systemInstructionText,
  tools: tools, // Still required here for executing functions
  messages: messages, // Only new messages
  maxSteps: 20, // Maximum number of steps / tool-call rounds
});

console.log(result.text);

This approach seems viable for both Google AI and Vertex AI providers. It's definitely a bit more involved as it requires interacting with the underlying Google SDK directly for cache management and modifying the provider's fetch behavior.
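For completeness, cache lifecycle management would also go through the Google SDK directly; a minimal sketch, again assuming the @google/genai caches API:

// Extend the TTL while the conversation is still active
await genai.caches.update({
  name: cache.name,
  config: { ttl: "600s" },
});

// Delete the cache once it is no longer needed
await genai.caches.delete({ name: cache.name });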

I'd be happy to collaborate on refining an approach, exploring cleaner abstractions within the SDK if possible, and perhaps opening a PR with a more integrated solution if this direction seems promising.

@YahngSungho

@ItsWendell

Just out of curiosity, if you have to go through all of that, is there still much point in using this Vercel library at all?
