[Docs] Add speech recognition with whisper use case #1971

Merged · 12 commits · Apr 4, 2025
2 changes: 1 addition & 1 deletion site/docs/getting-started/introduction.mdx
@@ -15,7 +15,7 @@ This library is friendly to PC and laptop execution, and optimized for resource

## Key Features and Benefits

- **📦 Pre-built Generative AI Pipelines:** Ready-to-use pipelines for text generation (LLMs), image generation (Diffuser-based), speech processing (Whisper), and visual language models (VLMs). See all [supported use cases](/docs/category/use-cases).
- **📦 Pre-built Generative AI Pipelines:** Ready-to-use pipelines for text generation (LLMs), image generation (Diffuser-based), speech recognition (Whisper), and visual language models (VLMs). See all [supported use cases](/docs/category/use-cases).
- **👣 Minimal Footprint:** Smaller binary size and reduced memory footprint compared to other frameworks.
- **🚀 Performance Optimization:** Hardware-specific optimizations for CPU, GPU, and NPU devices.
- **👨‍💻 Programming Language Support:** Comprehensive APIs in both Python and C++.
21 changes: 15 additions & 6 deletions site/docs/guides/streaming.mdx
@@ -7,7 +7,7 @@ sidebar_position: 3
For more interactive UIs during generation, you can stream output tokens.

:::info
Streaming is supported for both `LLMPipeline` and `VLMPipeline`.
Streaming is supported for `LLMPipeline`, `VLMPipeline` and `WhisperPipeline`.
:::
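Since the diff above extends streaming to `WhisperPipeline`, here is a minimal illustrative sketch of what Whisper streaming could look like in Python. It assumes `WhisperPipeline.generate` accepts the same callable `streamer` argument as `LLMPipeline` (receiving decoded text pieces), and uses placeholder model and audio paths.

```python
import librosa
import openvino_genai as ov_genai

model_path = "whisper-base-openvino"  # placeholder: folder with a converted Whisper model
raw_speech = librosa.load("sample.wav", sr=16000)[0].tolist()  # 16 kHz mono samples

pipe = ov_genai.WhisperPipeline(model_path, "CPU")

# Print each decoded piece of text as soon as it becomes available.
def streamer(text: str) -> ov_genai.StreamingStatus:
    print(text, end="", flush=True)
    return ov_genai.StreamingStatus.RUNNING

pipe.generate(raw_speech, max_new_tokens=100, streamer=streamer)
```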

## Streaming Function
@@ -18,6 +18,7 @@ In this example, a function outputs words to the console immediately upon genera
<TabItemPython>
```python showLineNumbers
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline(model_path, "CPU")

# highlight-start
@@ -86,6 +87,7 @@ You can also create your custom streamer for more sophisticated processing:
<TabItemPython>
```python showLineNumbers
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline(model_path, "CPU")

# highlight-start
@@ -95,8 +97,8 @@ You can also create your custom streamer for more sophisticated processing:
        super().__init__()
        # Initialization logic.

    def write(self, token_id) -> bool:
        # Custom decoding/tokens processing logic.
    def write(self, token: int | list[int]) -> ov_genai.StreamingStatus:
        # Custom processing logic for new decoded token(s).

        # The returned flag indicates whether generation should be stopped.
        return ov_genai.StreamingStatus.RUNNING
@@ -130,8 +132,15 @@ You can also create your custom streamer for more sophisticated processing:
// Create custom streamer class
class CustomStreamer: public ov::genai::StreamerBase {
public:
    bool write(int64_t token) {
        // Custom decoding/tokens processing logic.
    ov::genai::StreamingStatus write(int64_t token) {
        // Custom processing logic for new decoded token.

        // The returned flag indicates whether generation should be stopped.
        return ov::genai::StreamingStatus::RUNNING;
    };

    ov::genai::StreamingStatus write(const std::vector<int64_t>& tokens) {
        // Custom processing logic for new vector of decoded tokens.

        // The returned flag indicates whether generation should be stopped.
        return ov::genai::StreamingStatus::RUNNING;
@@ -168,5 +177,5 @@ You can also create your custom streamer for more sophisticated processing:
</LanguageTabs>

:::info
For a fully implemented iterable `CustomStreamer`, refer to the [multinomial_causal_lm](https://github.com/openvinotoolkit/openvino.genai/blob/releases/2025/0/samples/python/text_generation/multinomial_causal_lm.py) sample.
For a fully implemented iterable `CustomStreamer`, refer to the [multinomial_causal_lm](https://github.com/openvinotoolkit/openvino.genai/blob/master/samples/python/text_generation/multinomial_causal_lm.py) sample.
:::
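To make the updated `write` contract concrete, here is a minimal end-to-end sketch of a custom streamer in Python. The buffering logic is illustrative only and `model_path` is a placeholder; for the real reference implementation see the multinomial_causal_lm sample linked above.

```python
import openvino_genai as ov_genai

class CustomStreamer(ov_genai.StreamerBase):
    def __init__(self):
        super().__init__()
        self.buffer = []  # collect token ids for custom post-processing

    def write(self, token: int | list[int]) -> ov_genai.StreamingStatus:
        # token may be a single id or a list of ids, depending on the pipeline
        self.buffer.extend(token if isinstance(token, list) else [token])
        return ov_genai.StreamingStatus.RUNNING  # keep generating

    def end(self):
        # Called once generation finishes; flush any buffered output here.
        print(f"\nReceived {len(self.buffer)} tokens.")

model_path = "llm-model-openvino"  # placeholder: folder with a converted model
pipe = ov_genai.LLMPipeline(model_path, "CPU")
pipe.generate("What is OpenVINO?", max_new_tokens=100, streamer=CustomStreamer())
```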
2 changes: 1 addition & 1 deletion site/docs/supported-models/index.mdx
@@ -65,7 +65,7 @@ pip install timm einops
```
:::

## Speech Processing Models (Whisper-based)
## Speech Recognition Models (Whisper-based)

<WhisperModelsTable />

@@ -1,12 +1,5 @@
#### Basic Generation Configuration

1. Get the model default config with `get_generation_config()`
2. Modify parameters
3. Apply the updated config using one of the following methods:
- Use `set_generation_config(config)`
- Pass config directly to `generate()` (e.g. `generate(prompt, config)`)
- Specify options as inputs in the `generate()` method (e.g. `generate(prompt, max_new_tokens=100)`)

{/* Python and C++ code examples */}
{props.children}

@@ -21,6 +14,6 @@
- `top_p`: Selects from the smallest set of tokens whose cumulative probability exceeds p. Helps balance diversity and quality.
- `repetition_penalty`: Reduces the likelihood of repeating tokens. Values above 1.0 discourage repetition.

For the full list of generation parameters, refer to the [API reference](https://docs.openvino.ai/2025/api/genai_api/_autosummary/openvino_genai.GenerationConfig.html#openvino-genai-generationconfig).
For the full list of generation parameters, refer to the [Generation Config API](https://docs.openvino.ai/2025/api/genai_api/_autosummary/openvino_genai.GenerationConfig.html#openvino-genai-generationconfig).

:::
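As a brief illustration of these parameters, here is a sketch of passing them inline to `generate()` in Python. It assumes `do_sample=True` is required for sampling-based options such as `top_p` to take effect; `model_path` is a placeholder.

```python
import openvino_genai as ov_genai

model_path = "llm-model-openvino"  # placeholder: folder with a converted model
pipe = ov_genai.LLMPipeline(model_path, "CPU")

result = pipe.generate(
    "Write a short poem about spring.",
    max_new_tokens=100,
    do_sample=True,          # enable sampling so top_p has an effect
    top_p=0.9,
    repetition_penalty=1.1,
)
print(result)
```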
17 changes: 17 additions & 0 deletions site/docs/use-cases/_shared/_beam_search_generation.mdx
@@ -0,0 +1,17 @@
#### Optimizing Generation with Grouped Beam Search

Beam search helps explore multiple possible text completions simultaneously, often leading to higher quality outputs.

{/* Python and C++ code examples */}
{props.children}

:::info Understanding Beam Search Generation Parameters

- `max_new_tokens`: The maximum number of tokens to generate, excluding the number of tokens in the prompt. `max_new_tokens` has priority over `max_length`.
- `num_beams`: The number of beams for beam search. 1 disables beam search.
- `num_beam_groups`: The number of groups to divide `num_beams` into in order to ensure diversity among different groups of beams.
- `diversity_penalty`: This value is subtracted from a beam's score if it generates the same token as any beam from another group at a particular time.

For the full list of generation parameters, refer to the [Generation Config API](https://docs.openvino.ai/2025/api/genai_api/_autosummary/openvino_genai.GenerationConfig.html#openvino-genai-generationconfig).

:::
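For reference, a sketch of how these beam search parameters might be combined in Python, mirroring the text generation pipeline (`model_path` is a placeholder):

```python
import openvino_genai as ov_genai

model_path = "llm-model-openvino"  # placeholder: folder with a converted model
pipe = ov_genai.LLMPipeline(model_path, "CPU")

config = pipe.get_generation_config()
config.max_new_tokens = 100
config.num_beams = 8             # total number of beams
config.num_beam_groups = 4       # must evenly divide num_beams
config.diversity_penalty = 1.0   # push beam groups to produce different tokens

print(pipe.generate("Explain what beam search does.", config))
```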
@@ -0,0 +1,8 @@
#### Generation Configuration Workflow

1. Get the model default config with `get_generation_config()`
2. Modify parameters
3. Apply the updated config using one of the following methods:
- Use `set_generation_config(config)`
- Pass config directly to `generate()` (e.g. `generate(prompt, config)`)
- Specify options as inputs in the `generate()` method (e.g. `generate(prompt, max_new_tokens=100)`)
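A minimal Python sketch of this workflow and the three ways of applying the config (`model_path` is a placeholder):

```python
import openvino_genai as ov_genai

model_path = "llm-model-openvino"  # placeholder: folder with a converted model
pipe = ov_genai.LLMPipeline(model_path, "CPU")

# 1. Get the model default config
config = pipe.get_generation_config()

# 2. Modify parameters
config.max_new_tokens = 100

# 3. Apply it for all subsequent calls...
pipe.set_generation_config(config)
# ...or pass it to a single generate() call...
print(pipe.generate("What is OpenVINO?", config))
# ...or specify options directly as inputs
print(pipe.generate("What is OpenVINO?", max_new_tokens=100))
```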
@@ -1,3 +1,5 @@
import GenerationConfigurationWorkflow from '@site/docs/use-cases/_shared/_generation_configuration_workflow.mdx';

## Additional Usage Options

:::tip
@@ -6,6 +8,10 @@ Check out [Python](https://github.com/openvinotoolkit/openvino.genai/tree/master

### Use Different Generation Parameters

<GenerationConfigurationWorkflow />

#### Image Generation Configuration

You can adjust several parameters to control the image generation process, including dimensions and the number of inference steps:

<LanguageTabs>
@@ -65,7 +71,7 @@ You can adjust several parameters to control the image generation process, inclu
- `guidance_scale`: Balances prompt adherence vs. creativity. Higher values follow prompt more strictly, lower values allow more creative freedom.
- `rng_seed`: Controls randomness for reproducible results. Same seed produces identical images across runs.

For the full list of generation parameters, refer to the [API reference](https://docs.openvino.ai/2025/api/genai_api/_autosummary/openvino_genai.ImageGenerationConfig.html).
For the full list of generation parameters, refer to the [Image Generation Config API](https://docs.openvino.ai/2025/api/genai_api/_autosummary/openvino_genai.ImageGenerationConfig.html).

:::
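For reference, a sketch of these options with `Text2ImagePipeline` in Python, assuming the tensor-to-image conversion used in the image generation samples (`model_path` is a placeholder):

```python
import openvino_genai as ov_genai
from PIL import Image

model_path = "image-model-openvino"  # placeholder: folder with a converted diffusion model
pipe = ov_genai.Text2ImagePipeline(model_path, "CPU")

image_tensor = pipe.generate(
    "A watercolor painting of a mountain lake at sunrise",
    width=512,
    height=512,
    num_inference_steps=20,
    guidance_scale=7.5,  # higher values follow the prompt more strictly
    rng_seed=42,         # fixed seed for reproducible results
)

# The result is a tensor in NHWC layout; take the first image and save it.
Image.fromarray(image_tensor.data[0]).save("result.png")
```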

@@ -1,5 +1,6 @@
import BasicGenerationConfiguration from '@site/docs/use-cases/_shared/_basic_generation_configuration.mdx';
import ChatScenario from '@site/docs/use-cases/_shared/_chat_scenario.mdx';
import GenerationConfigurationWorkflow from '@site/docs/use-cases/_shared/_generation_configuration_workflow.mdx';
import Streaming from '@site/docs/use-cases/_shared/_streaming.mdx';

## Additional Usage Options
@@ -12,11 +13,14 @@ Check out [Python](https://github.com/openvinotoolkit/openvino.genai/tree/master

Similar to [text generation](/docs/use-cases/text-generation/#use-different-generation-parameters), VLM pipelines support various generation parameters to control the text output.

<GenerationConfigurationWorkflow />

<BasicGenerationConfiguration>
<LanguageTabs>
<TabItemPython>
```python
import openvino_genai as ov_genai

pipe = ov_genai.VLMPipeline(model_path, "CPU")

# Get default configuration
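A rough Python sketch of the full VLM flow, assuming the image is passed as a `uint8` `ov.Tensor` via the `image` keyword as in the VLM samples (`model_path` and `cat.png` are placeholders):

```python
import numpy as np
import openvino as ov
import openvino_genai as ov_genai
from PIL import Image

model_path = "vlm-model-openvino"  # placeholder: folder with a converted VLM
pipe = ov_genai.VLMPipeline(model_path, "CPU")

# Load the image as a uint8 tensor in NHWC layout
image = Image.open("cat.png").convert("RGB")
image_data = np.array(image.getdata(), dtype=np.uint8).reshape(1, image.size[1], image.size[0], 3)
image_tensor = ov.Tensor(image_data)

# Get, modify, and apply the generation config
config = pipe.get_generation_config()
config.max_new_tokens = 100
pipe.set_generation_config(config)

print(pipe.generate("Describe this image.", image=image_tensor))
```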
5 changes: 0 additions & 5 deletions site/docs/use-cases/speech-processing.md

This file was deleted.

@@ -0,0 +1,19 @@
import CodeBlock from '@theme/CodeBlock';

<CodeBlock language="cpp" showLineNumbers>
{`#include "openvino/genai/whisper_pipeline.hpp"
#include "audio_utils.hpp"
#include <iostream>

int main(int argc, char* argv[]) {
    std::filesystem::path models_path = argv[1];
    std::string wav_file_path = argv[2];

    // Read normalized 16 kHz audio samples from the WAV file
    ov::genai::RawSpeechInput raw_speech = utils::audio::read_wav(wav_file_path);

    // Construct the pipeline from the folder with the converted model
    ov::genai::WhisperPipeline pipe(models_path, "${props.device || 'CPU'}");
    auto result = pipe.generate(raw_speech, ov::genai::max_new_tokens(100));
    std::cout << result << std::endl;
}
`}
</CodeBlock>
@@ -0,0 +1,17 @@
import CodeBlock from '@theme/CodeBlock';

<CodeBlock language="python" showLineNumbers>
{`import openvino_genai as ov_genai
import librosa

def read_wav(filepath):
    # librosa loads the audio and resamples it to the 16 kHz mono input expected by WhisperPipeline
    raw_speech, samplerate = librosa.load(filepath, sr=16000)
    return raw_speech.tolist()

raw_speech = read_wav('sample.wav')

pipe = ov_genai.WhisperPipeline(model_path, "${props.device || 'CPU'}")
result = pipe.generate(raw_speech, max_new_tokens=100)
print(result)
`}
</CodeBlock>
@@ -0,0 +1,41 @@
import CodeExampleCPP from './_code_example_cpp.mdx';
import CodeExamplePython from './_code_example_python.mdx';

## Run Model Using OpenVINO GenAI

OpenVINO GenAI provides the [`WhisperPipeline`](https://docs.openvino.ai/2025/api/genai_api/_autosummary/openvino_genai.WhisperPipeline.html) API for inference of Whisper speech recognition models.
You can construct it directly from the folder containing the converted model.
It automatically loads the model, tokenizer, detokenizer, and the default generation configuration.

:::info
`WhisperPipeline` expects normalized audio in WAV format with a 16 kHz sampling rate as input.
:::

<LanguageTabs>
<TabItemPython>
<Tabs groupId="device">
<TabItem label="CPU" value="cpu">
<CodeExamplePython device="CPU" />
</TabItem>
<TabItem label="GPU" value="gpu">
<CodeExamplePython device="GPU" />
</TabItem>
</Tabs>
</TabItemPython>
<TabItemCpp>
<Tabs groupId="device">
<TabItem label="CPU" value="cpu">
<CodeExampleCPP device="CPU" />
</TabItem>
<TabItem label="GPU" value="gpu">
<CodeExampleCPP device="GPU" />
</TabItem>
</Tabs>
</TabItemCpp>
</LanguageTabs>

:::tip

Switch between CPU and GPU devices without any other code changes.

:::
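Beyond the basic call shown above, here is an illustrative sketch of additional Whisper generation options, assuming `WhisperGenerationConfig` exposes `language`, `task`, and `return_timestamps` as in the Whisper samples (`model_path` and `sample.wav` are placeholders):

```python
import librosa
import openvino_genai as ov_genai

model_path = "whisper-base-openvino"  # placeholder: folder with a converted Whisper model
raw_speech = librosa.load("sample.wav", sr=16000)[0].tolist()  # 16 kHz mono samples

pipe = ov_genai.WhisperPipeline(model_path, "CPU")

config = pipe.get_generation_config()
config.max_new_tokens = 100
config.language = "<|en|>"       # source language token
config.task = "transcribe"       # or "translate"
config.return_timestamps = True  # segment-level timestamps

result = pipe.generate(raw_speech, config)
for chunk in result.chunks:
    print(f"[{chunk.start_ts:.2f}s - {chunk.end_ts:.2f}s] {chunk.text}")
```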