3.3.0
Transformers.js v3.3: StyleTTS 2 (Kokoro) for state-of-the-art text-to-speech, Grounding DINO for zero-shot object detection
New models: StyleTTS 2, Grounding DINO
StyleTTS 2 for high-quality speech synthesis
See #1148 for more information and here for the list of supported models.
First, install the kokoro-js library, which uses Transformers.js, from NPM:
npm i kokoro-js
You can then generate speech as follows:
import { KokoroTTS } from "kokoro-js";
const model_id = "onnx-community/Kokoro-82M-ONNX";
const tts = await KokoroTTS.from_pretrained(model_id, {
  dtype: "q8", // Options: "fp32", "fp16", "q8", "q4", "q4f16"
});
const text = "Life is like a box of chocolates. You never know what you're gonna get.";
const audio = await tts.generate(text, {
  // Use `tts.list_voices()` to list all available voices
  voice: "af_bella",
});
audio.save("audio.wav");
Grounding DINO for zero-shot object detection
See #1137 for more information and here for the list of supported models.
Example: Zero-shot object detection with onnx-community/grounding-dino-tiny-ONNX using the pipeline API.
import { pipeline } from "@huggingface/transformers";
const detector = await pipeline("zero-shot-object-detection", "onnx-community/grounding-dino-tiny-ONNX");
const url = "http://images.cocodataset.org/val2017/000000039769.jpg";
const candidate_labels = ["a cat."];
const output = await detector(url, candidate_labels, {
  threshold: 0.3,
});
Example output:
[
  { score: 0.45316222310066223, label: "a cat", box: { xmin: 343, ymin: 23, xmax: 637, ymax: 372 } },
  { score: 0.36190420389175415, label: "a cat", box: { xmin: 12, ymin: 52, xmax: 317, ymax: 472 } },
]
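Each result pairs a confidence score with a pixel-space bounding box, so downstream code often filters by score and derives box dimensions. The following is a minimal sketch of that post-processing, using the two detections from the example output as sample data. The `summarizeDetections` helper and its `minScore` parameter are hypothetical names introduced here for illustration; they are not part of Transformers.js.

```javascript
// Sample detections copied from the example output above.
const detections = [
  { score: 0.45316222310066223, label: "a cat", box: { xmin: 343, ymin: 23, xmax: 637, ymax: 372 } },
  { score: 0.36190420389175415, label: "a cat", box: { xmin: 12, ymin: 52, xmax: 317, ymax: 472 } },
];

// Hypothetical helper (not a library function): keep detections scoring at
// least `minScore` and add the width, height, and area of each box in pixels.
function summarizeDetections(results, minScore = 0.4) {
  return results
    .filter(({ score }) => score >= minScore)
    .map(({ score, label, box }) => {
      const width = box.xmax - box.xmin;
      const height = box.ymax - box.ymin;
      return { label, score, width, height, area: width * height };
    });
}

// With a 0.4 cutoff, only the first detection (score ~0.45) remains.
const confident = summarizeDetections(detections, 0.4);
console.log(confident);
```

Filtering after the fact like this can be useful when the `threshold` passed to the detector is kept deliberately low and a stricter cutoff is applied per use case.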
Other improvements
- Add the RawAudio class by @Th3G33k in #682
- Update React guide for v3 by @sroussey in #1128
- Add option to skip special tokens in TextStreamer by @sroussey in #1139
New contributors
Full Changelog: 3.2.4...3.3.0