[draft] Support STT with Google realtime API #1321

Draft
wants to merge 24 commits into
base: main

jayeshp19
Collaborator

No description provided.

changeset-bot bot commented Jan 2, 2025

⚠️ No Changeset found

Latest commit: 25003f1

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

@jayeshp19 jayeshp19 force-pushed the gemini-realtime-stt branch from 6407a82 to 663f44f Compare January 10, 2025 20:43
@jayeshp19 jayeshp19 force-pushed the gemini-realtime-stt branch from 663f44f to aee4c1c Compare January 10, 2025 21:03
Comment on lines +314 to +315
if self._model.capabilities.supports_truncate:
    user_msg = ChatMessage.create(
Member

Why is this only done when the model supports truncation? It looks like you are updating an item here, not truncating one.

Collaborator Author

Some methods are not implemented in Gemini. We maintain remote conversations in OpenAI, but not in Gemini, so we should prevent invoking those methods when using Gemini. The purpose of supports_truncate is to distinguish between the two.
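
To make the gating described above concrete, here is a minimal sketch. `Capabilities`, `FakeSession`, and `sync_user_message` are hypothetical stand-ins, not the classes in this PR; only the flag name `supports_truncate` comes from the diff.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Capabilities:
    # Mirrors the `supports_truncate` flag from the diff; the class is hypothetical.
    supports_truncate: bool


@dataclass
class FakeSession:
    """Stand-in for a realtime session; not the real LiveKit class."""

    capabilities: Capabilities
    remote_items: List[str] = field(default_factory=list)

    def create_item(self, text: str) -> None:
        # Backends without a remote conversation do not implement this.
        if not self.capabilities.supports_truncate:
            raise NotImplementedError("no remote conversation on this backend")
        self.remote_items.append(text)


def sync_user_message(session: FakeSession, text: str) -> bool:
    """Push a user message to the remote conversation only when supported."""
    if session.capabilities.supports_truncate:
        session.create_item(text)
        return True
    # Skip silently for backends that keep no remote conversation.
    return False
```

The guard keeps the unimplemented method from ever being reached. A flag named after the actual capability (for example, maintaining a remote conversation) might read more clearly than reusing `supports_truncate`, which is the concern raised in the question above.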

text="LiveKit is the platform for building realtime AI. The main use cases are to build AI voice agents. LiveKit also powers livestreaming apps, robotics, and video conferencing.",
role="assistant",
)
chat_ctx.append(text="What is the LiveKit Agents framework?", role="user")
Member

Does the last message have to be from the user in order for Gemini to respond first?

Collaborator Author

No, it can be either assistant or user.

@self._session.on("agent_speech_completed")
def _agent_speech_completed():
    self._update_state("listening")
    if self._playing_handle is not None and not self._playing_handle.done():
Member

Can you include comments on why this is needed?

def _agent_speech_completed():
    self._update_state("listening")
    if self._playing_handle is not None and not self._playing_handle.done():
        self._playing_handle.interrupt()
Member

Why should we interrupt here?

Collaborator Author

Because this function is also called when speech is interrupted. Gemini's behavior seems to have changed: it now returns server.turn_complete instead of server.interrupted on interruption, which is confusing. Either way, this function ends up being called, so we interrupt any playout that is still in progress.
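
A small sketch of the behavior described above, assuming a single handler serves both the normal completion and the interruption case. `FakePlayHandle` and `FakeAgent` are hypothetical stand-ins, not the types used in the PR.

```python
from typing import Optional


class FakePlayHandle:
    """Stand-in for the playout handle; not the real LiveKit type."""

    def __init__(self) -> None:
        self._done = False
        self.interrupted = False

    def done(self) -> bool:
        return self._done

    def finish(self) -> None:
        # Playout reached the end of the audio on its own.
        self._done = True

    def interrupt(self) -> None:
        # Playout was cut short.
        self.interrupted = True
        self._done = True


class FakeAgent:
    def __init__(self) -> None:
        self.state = "speaking"
        self.playing_handle: Optional[FakePlayHandle] = None

    def on_agent_speech_completed(self) -> None:
        # The server may report an interruption as a completed turn, so this
        # handler cannot tell the two cases apart from the event alone.
        self.state = "listening"
        handle = self.playing_handle
        # If audio is still playing, the "completion" was really an
        # interruption: stop local playout so the agent does not talk over
        # the user.
        if handle is not None and not handle.done():
            handle.interrupt()
```

When playout has already finished, the guard makes the handler a no-op apart from the state change, so calling it in both cases is safe.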

from typing import Any, Dict, List, Literal, Sequence, Union

from livekit.agents import llm

from google.genai import types # type: ignore
Member

theomonnom commented Jan 13, 2025

The code here is hard to follow. It's a shame we don't have types here: the structure of the dicts is unclear.
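
As an illustration of how the dict shapes could be documented, here is a sketch using `TypedDict`. The field names (`role`, `parts`, `text`) and the helper `build_turn` are assumptions made for the example; they are not taken from the code in this PR.

```python
from typing import List, Literal, TypedDict


class PartDict(TypedDict):
    # Hypothetical shape: a single text part of a turn.
    text: str


class TurnDict(TypedDict):
    # Hypothetical shape: one conversation turn.
    role: Literal["user", "model"]
    parts: List[PartDict]


def build_turn(role: Literal["user", "model"], text: str) -> TurnDict:
    """Build a turn dict whose structure is visible to type checkers."""
    return {"role": role, "parts": [{"text": text}]}
```

At runtime these are still plain dicts, so nothing changes for the code that consumes them, but the expected keys become visible to readers and to a type checker.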

self._transcriber.on("input_speech_done", self._on_input_speech_done)
self._agent_transcriber.on("input_speech_done", self._on_agent_speech_done)
# init dummy task
self._init_sync_task = asyncio.create_task(asyncio.sleep(0))
Member

This isn't really doing anything?

@theomonnom
Member

Where do we make sure the transcribed user speech is inside the chat_ctx and always before the generated agent speech?

@jayeshp19
Collaborator Author

This isn't really doing anything?

Yes, it is used directly by the base class. We need to keep the dummy task unless we guard it with capabilities.supports_truncate in the base class.
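
A minimal sketch of why such a placeholder can matter, assuming shared code awaits the task unconditionally. `GeminiLikeSession` and `caller_that_awaits_sync` are hypothetical names; only `asyncio.create_task(asyncio.sleep(0))` comes from the diff.

```python
import asyncio


class GeminiLikeSession:
    """Hypothetical session that has nothing to sync at startup."""

    def __init__(self) -> None:
        # Placeholder that completes almost immediately. Must be created
        # while an event loop is running, since create_task requires one.
        self._init_sync_task = asyncio.create_task(asyncio.sleep(0))


async def caller_that_awaits_sync(session: GeminiLikeSession) -> str:
    # Stand-in for shared code that awaits the task without checking
    # whether the backend actually performs an initial sync.
    await session._init_sync_task
    return "ready"


async def main() -> str:
    session = GeminiLikeSession()
    return await caller_that_awaits_sync(session)
```

Running `asyncio.run(main())` completes without error because the attribute exists and the task finishes on the next loop iteration. Guarding the access in the shared code would make the placeholder unnecessary.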

Where do we make sure the transcribed user speech is inside the chat_ctx and always before the generated agent speech?

I don't think we need it, as the transcriber and LLM are independent of each other.

@theomonnom
Member

theomonnom commented Jan 14, 2025

Yes, it is used directly by the base class. We need to keep the dummy task unless we guard it with capabilities.supports_truncate in the base class.

I'm not sure I follow; the base class is utils.EventEmitter[EventTypes].

I don't think we need it, as the transcriber and LLM are independent of each other.

How do we get the user messages inside the ChatContext?

@jayeshp19
Collaborator Author

I'm not sure I follow; the base class is utils.EventEmitter[EventTypes].

I mean multimodal.py

How do we get the user messages inside the ChatContext?

From here, when audio transcription is done: https://github.com/livekit/agents/pull/1321/files#diff-4b3e6842c9b1bf3130541b6b2fd18dcc7d1b0051285496eca0355e62938d13fbR351

@theomonnom
Member

theomonnom commented Jan 14, 2025

How do we get the user messages inside the ChatContext?

From here, when audio transcription is done: #1321 (files)

What I mean is that with unlucky timing, or if the VAD events differ, the data inside the chat context will not be "stable".

For example:

  • You could have multiple user messages for only one assistant message
  • The user messages could be appended after the assistant message (the order is wrong)
  • etc.

@jayeshp19
Collaborator Author

Where do we make sure the transcribed user speech is inside the chat_ctx and always before the generated agent speech?

What I mean is that with unlucky timing, or if the VAD events differ, the data inside the chat context will not be "stable".

For example:

  • You could have multiple user messages for only one assistant message
  • The user messages could be appended after the assistant message (the order is wrong)
  • etc.

User audio is usually processed in real-time, and we receive transcriptions quickly. However, you're right that these scenarios can occur.
Do you have any suggestions on how we can ensure the chat context remains stable and maintains the correct sequence?
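
One possible direction, sketched here as an illustration only and not something proposed in this thread: reserve a slot for the user message when the user turn starts, and fill it in once the transcript arrives. All names below (`Message`, `OrderedChatContext`, `start_user_turn`, `finish_user_turn`) are hypothetical and do not correspond to the ChatContext API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Message:
    role: str
    text: Optional[str] = None  # None until the transcript arrives


@dataclass
class OrderedChatContext:
    messages: List[Message] = field(default_factory=list)
    _pending: Dict[str, Message] = field(default_factory=dict)

    def start_user_turn(self, turn_id: str) -> None:
        # Reserve the position as soon as the user starts speaking.
        if turn_id in self._pending:
            return
        msg = Message(role="user")
        self._pending[turn_id] = msg
        self.messages.append(msg)

    def add_assistant(self, text: str) -> None:
        self.messages.append(Message(role="assistant", text=text))

    def finish_user_turn(self, turn_id: str, text: str) -> None:
        # Fill the reserved slot; the message keeps its original position.
        msg = self._pending.pop(turn_id, None)
        if msg is None:
            # No slot was reserved, so fall back to appending.
            self.messages.append(Message(role="user", text=text))
        else:
            msg.text = text

    def visible(self) -> List[Message]:
        # Hide slots whose transcript has not arrived yet.
        return [m for m in self.messages if m.text is not None]
```

With this layout, a transcript that arrives after the assistant reply still lands before it. The sketch does not address merging several user messages that belong to one assistant reply.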

3 participants