MicrosoftDocs · learn-build-service-prod · Feb 13, 2025
diff --git a/semantic-kernel/concepts/ai-services/TOC.yml b/semantic-kernel/concepts/ai-services/TOC.yml
@@ -6,4 +6,6 @@
 - name: Embedding generation
   href: embedding-generation/TOC.yml
 - name: AI Integrations
-  href: integrations.md
+  href: integrations.md
+- name: Realtime
+  href: realtime.md
diff --git a/semantic-kernel/concepts/ai-services/index.md b/semantic-kernel/concepts/ai-services/index.md
@@ -14,21 +14,23 @@ One of the main features of Semantic Kernel is its ability to add different AI s
 
 Within Semantic Kernel, there are interfaces for the most popular AI tasks. In the table below, you can see the services that are supported by each of the SDKs.
 
-| Services                          |  C#  | Python | Java | Notes |
-|-----------------------------------|:----:|:------:|:----:|-------|
-| [Chat completion](./chat-completion/index.md)                    | ✅ | ✅ | ✅ |
-| Text generation                  | ✅ | ✅ | ✅ |
-| Embedding generation (Experimental)     | ✅ | ✅ | ✅ |
-| Text-to-image  (Experimental)       | ✅ | ✅ | ❌ |
-| Image-to-text (Experimental)       | ✅ | ❌ | ❌ |
-| Text-to-audio (Experimental)       | ✅ | ✅ | ❌ | 
-| Audio-to-text (Experimental)       | ✅ | ✅ | ❌ | 
+| Services                                      |  C#   | Python | Java  | Notes |
+| --------------------------------------------- | :---: | :----: | :---: | ----- |
+| [Chat completion](./chat-completion/index.md) |   ✅   |   ✅    |   ✅   |
+| Text generation                               |   ✅   |   ✅    |   ✅   |
+| Embedding generation (Experimental)           |   ✅   |   ✅    |   ✅   |
+| Text-to-image (Experimental)                  |   ✅   |   ✅    |   ❌   |
+| Image-to-text (Experimental)                  |   ✅   |   ❌    |   ❌   |
+| Text-to-audio (Experimental)                  |   ✅   |   ✅    |   ❌   |
+| Audio-to-text (Experimental)                  |   ✅   |   ✅    |   ❌   |
+| [Realtime](./realtime.md) (Experimental)      |   ❌   |   ✅    |   ❌   |
 
 > [!TIP]
 > In most scenarios, you will only need to add chat completion to your kernel, but to support multi-modal AI, you can add any of the above services to your kernel.
 
 ## Next steps
+
 To learn more about each of the services, please refer to the specific articles for each service type. In each of the articles we provide sample code for adding the service to the kernel across multiple AI service providers.
 
 > [!div class="nextstepaction"]
-> [Learn about chat completion](./chat-completion/index.md)
+> [Learn about chat completion](./chat-completion/index.md)
diff --git a/semantic-kernel/concepts/ai-services/integrations.md b/semantic-kernel/concepts/ai-services/integrations.md
@@ -18,21 +18,22 @@ With the available AI connectors, developers can easily build AI agents with swa
 
 ### AI Services
 
-| Services                          |  C#  | Python | Java | Notes |
-|-----------------------------------|:----:|:------:|:----:|-------|
-| Text Generation                    | ✅ | ✅ | ✅ | Example: Text-Davinci-003 |
-| Chat Completion                    | ✅ | ✅ | ✅ | Example: GPT4, Chat-GPT |
-| Text Embeddings (Experimental)     | ✅ | ✅ | ✅ | Example: Text-Embeddings-Ada-002 |
-| Text to Image (Experimental)       | ✅ | ✅ | ❌ | Example: Dall-E |
-| Image to Text (Experimental)       | ✅ | ❌ | ❌ | Example: Pix2Struct |
-| Text to Audio (Experimental)       | ✅ | ✅ | ❌ | Example: Text-to-speech |
-| Audio to Text (Experimental)       | ✅ | ✅ | ❌ | Example: Whisper |
+| Services                       |  C#   | Python | Java  | Notes                            |
+| ------------------------------ | :---: | :----: | :---: | -------------------------------- |
+| Text Generation                |   ✅   |   ✅    |   ✅   | Example: Text-Davinci-003        |
+| Chat Completion                |   ✅   |   ✅    |   ✅   | Example: GPT4, Chat-GPT          |
+| Text Embeddings (Experimental) |   ✅   |   ✅    |   ✅   | Example: Text-Embeddings-Ada-002 |
+| Text to Image (Experimental)   |   ✅   |   ✅    |   ❌   | Example: Dall-E                  |
+| Image to Text (Experimental)   |   ✅   |   ❌    |   ❌   | Example: Pix2Struct              |
+| Text to Audio (Experimental)   |   ✅   |   ✅    |   ❌   | Example: Text-to-speech          |
+| Audio to Text (Experimental)   |   ✅   |   ✅    |   ❌   | Example: Whisper                 |
+| Realtime (Experimental)        |   ❌   |   ✅    |   ❌   | Example: gpt-4o-realtime-preview |
 
 ## Additional plugins
 
 If you want to extend the functionality of your AI agent, you can use plugins to integrate with other Microsoft services. Here are some of the plugins that are available for Semantic Kernel:
 
-| Plugin     | C#  | Python | Java | Description |
-| ---------- | :-: | :----: | :--: | ----------- |
-| Logic Apps | ✅  |   ✅   |  ✅  | Build workflows within Logic Apps using its available connectors and import them as plugins in Semantic Kernel. [Learn more](../plugins/adding-logic-apps-as-plugins.md). |
-| Azure Container Apps Dynamic Sessions | ✅  |   ✅   |  ❌  | With dynamic sessions, you can recreate the Code Interpreter experience from the Assistants API by effortlessly spinning up Python containers where AI agents can execute Python code. [Learn more](/azure/container-apps/sessions). |
+| Plugin                                |  C#   | Python | Java  | Description                                                                                                                                                                                                                          |
+| ------------------------------------- | :---: | :----: | :---: | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| Logic Apps                            |   ✅   |   ✅    |   ✅   | Build workflows within Logic Apps using its available connectors and import them as plugins in Semantic Kernel. [Learn more](../plugins/adding-logic-apps-as-plugins.md).                                                            |
+| Azure Container Apps Dynamic Sessions |   ✅   |   ✅    |   ❌   | With dynamic sessions, you can recreate the Code Interpreter experience from the Assistants API by effortlessly spinning up Python containers where AI agents can execute Python code. [Learn more](/azure/container-apps/sessions). |
diff --git a/semantic-kernel/concepts/ai-services/realtime.md b/semantic-kernel/concepts/ai-services/realtime.md
@@ -0,0 +1,189 @@
+---
+title: Realtime AI Integrations for Semantic Kernel 
+description: Learn about realtime multi-modal AI integrations available in Semantic Kernel.
+author: eavanvalkenburg
+ms.topic: conceptual
+ms.author: edvan
+ms.date: 02/26/2025
+ms.service: semantic-kernel
+---
+
+# Realtime Multi-modal APIs
+
+The first realtime API integration for Semantic Kernel has been added. It is currently available only in Python and is considered experimental. The underlying services are still being developed and are subject to change, so we might need to make breaking changes to the API in Semantic Kernel as we learn how customers use it and as we add other providers of these kinds of models and APIs.
+
+## Realtime Client abstraction
+
+To support different realtime APIs from different vendors, using different protocols, a new client abstraction has been added to the kernel. This client connects to the realtime service, sends and receives messages, and handles any errors that occur while connecting or exchanging messages.
+
+Given the way these models work, they can be considered agents more than regular chat completions. They take instructions rather than a system message, they keep their own internal state, and they can be invoked to do work on our behalf.
+
+### Realtime API
+
+Any realtime client implements the following methods:
+
+| Method           | Description                                                                                                        |
+| ---------------- | ------------------------------------------------------------------------------------------------------------------ |
+| `create_session` | Creates a new session                                                                                              |
+| `update_session` | Updates an existing session                                                                                        |
+| `delete_session` | Deletes an existing session                                                                                        |
+| `receive`        | An asynchronous generator method that listens for messages from the service and yields them as they arrive.        |
+| `send`           | Sends a message to the service                                                                                     |
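+
+The five methods above can be pictured as a small abstract interface. The following is a minimal sketch under stated assumptions, not the Semantic Kernel base class: `RealtimeClientSketch`, `EchoClient`, and every signature shown are hypothetical, and the echo behavior stands in for a real service.
+
+```python
+import asyncio
+from abc import ABC, abstractmethod
+from collections.abc import AsyncGenerator
+from typing import Any, Optional
+
+
+class RealtimeClientSketch(ABC):
+    """Hypothetical interface mirroring the five methods in the table above."""
+
+    @abstractmethod
+    async def create_session(self, settings: Optional[dict] = None) -> None: ...
+
+    @abstractmethod
+    async def update_session(self, settings: dict) -> None: ...
+
+    @abstractmethod
+    async def delete_session(self) -> None: ...
+
+    @abstractmethod
+    def receive(self) -> AsyncGenerator[Any, None]: ...
+
+    @abstractmethod
+    async def send(self, event: Any) -> None: ...
+
+
+class EchoClient(RealtimeClientSketch):
+    """Toy implementation: every message that is sent is received back."""
+
+    def __init__(self) -> None:
+        self.settings: dict = {}
+        self._queue: Optional[asyncio.Queue] = None
+
+    async def create_session(self, settings: Optional[dict] = None) -> None:
+        self.settings = dict(settings or {})
+        self._queue = asyncio.Queue()
+
+    async def update_session(self, settings: dict) -> None:
+        self.settings.update(settings)
+
+    async def delete_session(self) -> None:
+        # A None sentinel tells receive() to stop.
+        await self._queue.put(None)
+
+    async def receive(self) -> AsyncGenerator[Any, None]:
+        while (event := await self._queue.get()) is not None:
+            yield event
+
+    async def send(self, event: Any) -> None:
+        await self._queue.put(event)
+
+
+async def demo() -> list:
+    client = EchoClient()
+    await client.create_session({"voice": "alloy"})
+    await client.update_session({"voice": "shimmer"})
+    await client.send("hello")
+    await client.send("world")
+    await client.delete_session()
+    return [event async for event in client.receive()]
+
+
+received = asyncio.run(demo())
+```
+
+Keeping `receive` as an asynchronous generator lets the caller process each message as soon as it arrives instead of waiting for a complete response.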
+
+### Python implementations
+
+The Python version of Semantic Kernel currently supports the following realtime clients:
+
+| Client | Protocol  | Modalities   | Function calling enabled | Description                                                                                                                                                                                        |
+| ------ | --------- | ------------ | ------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| OpenAI | Websocket | Text & Audio | Yes                      | The OpenAI Realtime API is a websocket-based API that allows you to send and receive messages in realtime. This connector uses the OpenAI Python package to connect, send, and receive messages.   |
+| OpenAI | WebRTC    | Text & Audio | Yes                      | The OpenAI Realtime API is a WebRTC-based API that allows you to send and receive messages in realtime. It needs a WebRTC-compatible audio track at session creation time.                         |
+| Azure  | Websocket | Text & Audio | Yes                      | The Azure Realtime API is a websocket-based API that allows you to send and receive messages in realtime. This connector uses the same package as the OpenAI websocket connector.                  |
+
+## Getting started
+
+To get started with the Realtime API, you need to install the `semantic-kernel` package with the `realtime` extra.
+
+```bash
+pip install semantic-kernel[realtime]
+```
+
+Depending on how you want to handle audio, you might need additional packages to interface with speakers and microphones, like `pyaudio` or `sounddevice`.
+
+### Websocket clients
+
+Then you can create a realtime client and use it to open a session. The following example uses an `AzureRealtimeWebsocket` connection; you can replace `AzureRealtimeWebsocket` with `OpenAIRealtimeWebsocket` without any further changes.
+
+```python
+import asyncio
+
+from semantic_kernel.connectors.ai.open_ai import (
+    AzureRealtimeExecutionSettings,
+    AzureRealtimeWebsocket,
+    ListenEvents,
+)
+from semantic_kernel.contents import RealtimeAudioEvent, RealtimeTextEvent
+
+
+async def main() -> None:
+    # This uses environment variables to get the API key, endpoint, API version and deployment name.
+    realtime_client = AzureRealtimeWebsocket()
+    settings = AzureRealtimeExecutionSettings(voice="alloy")
+    async with realtime_client(settings=settings, create_response=True):
+        async for event in realtime_client.receive():
+            match event:
+                # Receiving a piece of audio (and sending it to an audio player that is not defined here)
+                case RealtimeAudioEvent():
+                    await audio_player.add_audio(event.audio)
+                # Receiving a piece of the audio transcript
+                case RealtimeTextEvent():
+                    # Semantic Kernel parses the transcript into a TextContent object captured in a RealtimeTextEvent
+                    print(event.text.text, end="")
+                case _:
+                    # OpenAI-specific events
+                    if event.service_type == ListenEvents.SESSION_UPDATED:
+                        print("Session updated")
+                    if event.service_type == ListenEvents.RESPONSE_CREATED:
+                        print("\nMosscap (transcript): ", end="")
+
+
+asyncio.run(main())
+```
+
+There are two important things to note. First, the `realtime_client` is an async context manager, which means that you can use it inside an async function with `async with` to create the session.
+Second, the `receive` method is an async generator, which means that you can use it in an `async for` loop to receive messages as they arrive.
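+
+The calling pattern described above can be tried without any service. This is a minimal sketch that assumes a stand-in class: `FakeRealtimeClient` and its scripted chunks are hypothetical and only mimic the calling convention (`client(...)` inside `async with`, then `async for` over `receive()`). It does not show how the Semantic Kernel connectors are implemented.
+
+```python
+import asyncio
+from collections.abc import AsyncGenerator
+
+
+class FakeRealtimeClient:
+    """Hypothetical stand-in that mimics the calling convention only."""
+
+    def __init__(self) -> None:
+        self.log: list = []
+        self._settings: dict = {}
+
+    def __call__(self, **settings) -> "FakeRealtimeClient":
+        # Remember the settings and return self so `async with client(...)` works.
+        self._settings = settings
+        return self
+
+    async def __aenter__(self) -> "FakeRealtimeClient":
+        self.log.append("session created")
+        return self
+
+    async def __aexit__(self, exc_type, exc, tb) -> None:
+        # Runs even if the body raises, so the session is always closed.
+        self.log.append("session deleted")
+
+    async def receive(self) -> AsyncGenerator[str, None]:
+        for chunk in ("Hello", ", ", "world"):
+            await asyncio.sleep(0)  # yield control, as a network read would
+            yield chunk
+
+
+async def main() -> tuple:
+    client = FakeRealtimeClient()
+    transcript = ""
+    async with client(voice="alloy", create_response=True):
+        async for chunk in client.receive():
+            transcript += chunk
+    return transcript, client.log
+
+
+transcript, log = asyncio.run(main())
+```
+
+Because `async with` and `async for` are only valid inside a coroutine, the loop lives in `main()` and is started with `asyncio.run`.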
+
+### WebRTC client
+
+Setting up a WebRTC connection is a bit more complex, so an extra parameter is needed when creating the client. This parameter, `audio_track`, needs to be an object that implements the `MediaStreamTrack` protocol of the `aiortc` package. This is also demonstrated in the samples that are linked below.
+
+To create a client that uses WebRTC, you would do the following:
+
+```python
+import asyncio
+
+from aiortc.mediastreams import MediaStreamTrack
+from semantic_kernel.connectors.ai.open_ai import (
+    ListenEvents,
+    OpenAIRealtimeExecutionSettings,
+    OpenAIRealtimeWebRTC,
+)
+
+
+class AudioRecorderWebRTC(MediaStreamTrack):
+    # Implement the MediaStreamTrack methods.
+    ...
+
+
+async def main() -> None:
+    realtime_client = OpenAIRealtimeWebRTC(audio_track=AudioRecorderWebRTC())
+    # Create the settings for the session
+    settings = OpenAIRealtimeExecutionSettings(
+        instructions="""
+You are a chat bot. Your name is Mosscap and
+you have one goal: figure out what people need.
+Your full name, should you need to know it, is
+Splendid Speckled Mosscap. You communicate
+effectively, but you tend to answer with long
+flowery prose.
+""",
+        voice="shimmer",
+    )
+    # AudioPlayer is a placeholder for your own audio player class; it is not defined here.
+    audio_player = AudioPlayer()
+    async with realtime_client(settings=settings, create_response=True):
+        async for event in realtime_client.receive():
+            match event.event_type:
+                # Receiving a piece of audio (and sending it to the audio player)
+                case "audio":
+                    await audio_player.add_audio(event.audio)
+                case "text":
+                    # The model returns both audio and a transcript of the audio, which we print
+                    print(event.text.text, end="")
+                case "service":
+                    # OpenAI-specific events
+                    if event.service_type == ListenEvents.SESSION_UPDATED:
+                        print("Session updated")
+                    if event.service_type == ListenEvents.RESPONSE_CREATED:
+                        print("\nMosscap (transcript): ", end="")
+
+
+asyncio.run(main())
+```
+
+Both of these samples receive the audio as a `RealtimeAudioEvent` and then pass it to an unspecified `audio_player` object.
+
+### Audio output callback
+
+In addition, there is a parameter called `audio_output_callback` on the `receive` method and on the class constructor. This callback is called before any further handling of the audio and receives a `numpy` array of the audio data, instead of the audio being parsed into `AudioContent` and returned as a `RealtimeAudioEvent` that you then handle yourself, which is what happens above. This has been shown to give smoother audio output because there is less overhead between the audio data coming in and it being handed to the player.
+
+This example shows how to define and use the `audio_output_callback`:
+
+```python
+import asyncio
+
+import numpy as np
+from aiortc.mediastreams import MediaStreamTrack
+from semantic_kernel.connectors.ai.open_ai import (
+    ListenEvents,
+    OpenAIRealtimeExecutionSettings,
+    OpenAIRealtimeWebRTC,
+)
+
+
+class AudioRecorderWebRTC(MediaStreamTrack):
+    # Implement the MediaStreamTrack methods.
+    ...
+
+
+class AudioPlayer:
+    async def play_audio(self, content: np.ndarray):
+        # Implement the audio player.
+        ...
+
+
+async def main() -> None:
+    realtime_client = OpenAIRealtimeWebRTC(audio_track=AudioRecorderWebRTC())
+    # Create the settings for the session
+    settings = OpenAIRealtimeExecutionSettings(
+        instructions="""
+You are a chat bot. Your name is Mosscap and
+you have one goal: figure out what people need.
+Your full name, should you need to know it, is
+Splendid Speckled Mosscap. You communicate
+effectively, but you tend to answer with long
+flowery prose.
+""",
+        voice="shimmer",
+    )
+    # Instantiate the player so that play_audio is a bound method
+    audio_player = AudioPlayer()
+    async with realtime_client(settings=settings, create_response=True):
+        async for event in realtime_client.receive(audio_output_callback=audio_player.play_audio):
+            match event.event_type:
+                # No need to handle case "audio"; the callback receives the audio
+                case "text":
+                    # The model returns both audio and a transcript of the audio, which we print
+                    print(event.text.text, end="")
+                case "service":
+                    # OpenAI-specific events
+                    if event.service_type == ListenEvents.SESSION_UPDATED:
+                        print("Session updated")
+                    if event.service_type == ListenEvents.RESPONSE_CREATED:
+                        print("\nMosscap (transcript): ", end="")
+
+
+asyncio.run(main())
+```
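+
+To see why the callback removes the need for an `"audio"` case, the dispatch can be modeled with the standard library alone. This is a minimal sketch under stated assumptions: `Event`, `INCOMING`, `receive`, and `Player` are hypothetical, a plain list of integers stands in for the `numpy` array, and the skip-the-event behavior follows the description above rather than the connector's source code.
+
+```python
+import asyncio
+from collections.abc import AsyncGenerator, Awaitable, Callable
+from dataclasses import dataclass
+from typing import Optional
+
+
+@dataclass
+class Event:
+    event_type: str  # "audio" or "text"
+    payload: object
+
+
+# Scripted messages standing in for what a service would send.
+INCOMING = [("audio", [1, 2]), ("text", "Hi"), ("audio", [3, 4])]
+
+AudioCallback = Callable[[list], Awaitable[None]]
+
+
+async def receive(
+    audio_output_callback: Optional[AudioCallback] = None,
+) -> AsyncGenerator[Event, None]:
+    for kind, payload in INCOMING:
+        if kind == "audio" and audio_output_callback is not None:
+            # Hand the raw audio straight to the callback and skip event creation.
+            await audio_output_callback(payload)
+            continue
+        yield Event(kind, payload)
+
+
+class Player:
+    def __init__(self) -> None:
+        self.samples: list = []
+
+    async def play_audio(self, content: list) -> None:
+        self.samples.extend(content)
+
+
+async def run(with_callback: bool) -> tuple:
+    player = Player()
+    callback = player.play_audio if with_callback else None
+    seen = [event.event_type async for event in receive(callback)]
+    return seen, player.samples
+
+
+without_cb = asyncio.run(run(False))
+with_cb = asyncio.run(run(True))
+```
+
+Without a callback, the loop sees every audio event and must forward it itself. With a callback, only the text event reaches the loop and the player receives the samples directly.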
+
+### Samples
+
+There are four samples in [our repo](https://github.com/microsoft/semantic-kernel/tree/main/python/samples/concepts/realtime). They cover the basics using both websockets and WebRTC, as well as a more complex setup that includes function calling. Finally, there is a more [complex demo](https://github.com/microsoft/semantic-kernel/tree/main/python/samples/demos/call_automation) that uses [Azure Communication Services](/azure/communication-services/) to allow you to call your Semantic Kernel enhanced realtime API.