Image by Author | Canva
# Introduction
Open-source AI is experiencing a significant moment. With advancements in large language models, general machine learning, and now speech technologies, open-source models are rapidly narrowing the gap with proprietary systems. One of the most exciting entrants in this space is Microsoft’s open-source voice stack, VibeVoice. This model family is designed for natural, expressive, and interactive conversation, rivaling the quality of top-tier commercial offerings.
In this article, we will explore VibeVoice, download the model, and run inference on Google Colab using the GPU runtime. We will also cover troubleshooting for common issues that may arise while running model inference.
# Introduction to VibeVoice
VibeVoice is a next-generation Text-to-Speech (TTS) framework for creating expressive, long-form, multi-speaker audio such as podcasts and dialogues. Unlike traditional TTS, it excels in scalability, speaker consistency, and natural turn-taking.
Its core innovation lies in continuous acoustic and semantic tokenizers operating at 7.5 Hz, paired with a Large Language Model (Qwen2.5-1.5B) and a diffusion head for generating high-fidelity audio. This design enables up to 90 minutes of speech with 4 distinct speakers, surpassing prior systems.
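To appreciate why the 7.5 Hz frame rate matters, here is a back-of-the-envelope calculation (my own illustration, not the paper's exact accounting) comparing the sequence lengths the LLM has to model:
# Acoustic frames needed for 90 minutes of audio at 7.5 frames per second
frame_rate_hz = 7.5
minutes = 90
seconds = minutes * 60
print(f"{frame_rate_hz * seconds:,.0f} frames at 7.5 Hz")  # 40,500 frames

# A typical neural audio codec runs at tens of frames per second, e.g. 50 Hz
print(f"{50 * seconds:,} frames at 50 Hz")                  # 270,000 frames
The far shorter sequence is what makes modeling 90 minutes of multi-speaker audio tractable for a 1.5B-parameter LLM.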
VibeVoice is available as an open-source model on Hugging Face, with community-maintained code for easy experimentation and use.

# Getting Started with VibeVoice-1.5B
In this guide, we will learn how to clone the VibeVoice repository and run the demo by providing it with a text file to generate multi-speaker natural speech. It only takes around 5 minutes from setup to generating the audio.
// 1. Clone the community repository & install
First, clone the community version of the VibeVoice repository (vibevoice-community/VibeVoice), install it together with its required Python packages, and install the Hugging Face Hub library so you can download the model through the Python API.
Note: Before starting the Colab session, ensure your runtime type is set to T4 GPU.
!git clone -q --depth 1 https://github.com/vibevoice-community/VibeVoice.git /content/VibeVoice
%pip install -q -e /content/VibeVoice
%pip install -q -U huggingface_hub
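Before moving on, it is worth confirming that the GPU runtime is actually active. PyTorch is preinstalled in Colab, so a quick check looks like this:
import torch

# Verify that a CUDA device is visible to PyTorch
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No GPU detected - switch the runtime to T4 GPU first")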
// 2. Download the model snapshot from Hugging Face
Download the model repository using the Hugging Face snapshot API. This will download all the files from the microsoft/VibeVoice-1.5B repository.
from huggingface_hub import snapshot_download

snapshot_download(
    "microsoft/VibeVoice-1.5B",
    local_dir="/content/models/VibeVoice-1.5B",
    local_dir_use_symlinks=False,  # deprecated in newer huggingface_hub releases; safe to drop
)
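To verify the snapshot completed, list the local directory (the exact filenames depend on the repository contents):
import os

# The config, tokenizer files, and model weight shards should all be present
for name in sorted(os.listdir("/content/models/VibeVoice-1.5B")):
    print(name)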
// 3. Create a transcript with speakers
We will create a text file within Google Colab. For that, we will use the %%writefile magic function to provide the content. Below is a sample conversation between two speakers about KDnuggets.
%%writefile /content/my_transcript.txt
Speaker 1: Have you read the latest article on KDnuggets?
Speaker 2: Yes, it's one of the best resources for data science and AI.
Speaker 1: I like how KDnuggets always keeps up with the latest trends.
Speaker 2: Absolutely, it's a go-to platform for anyone in the AI community.
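The demo script parses each line by its Speaker N: prefix (you can see the resulting segments in the log below). A small sanity check, written for this guide, catches formatting slips before inference:
import re
from pathlib import Path

# Every non-empty line should start with "Speaker <number>:"
lines = Path("/content/my_transcript.txt").read_text().strip().splitlines()
for i, line in enumerate(lines, start=1):
    if not re.match(r"^Speaker \d+:", line):
        print(f"Line {i} is missing a 'Speaker N:' prefix: {line!r}")

speakers = sorted({line.split(":", 1)[0] for line in lines})
print(f"{len(lines)} segments across speakers: {speakers}")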
// 4. Run inference (multi-speaker)
Now, we will run the demo Python script within the VibeVoice repository. The script requires the model path, text file path, and speaker names.
Run #1: Map Speaker 1 → Alice, Speaker 2 → Frank
!python /content/VibeVoice/demo/inference_from_file.py \
    --model_path /content/models/VibeVoice-1.5B \
    --txt_path /content/my_transcript.txt \
    --speaker_names Alice Frank
As a result, you will see output like the following. The model uses CUDA to generate the audio, with Alice and Frank as the two speakers, and prints a generation summary that you can use for analysis.
Using device: cuda
Found 9 voice files in /content/VibeVoice/demo/voices
Available voices: en-Alice_woman, en-Carter_man, en-Frank_man, en-Mary_woman_bgm, en-Maya_woman, in-Samuel_man, zh-Anchen_man_bgm, zh-Bowen_man, zh-Xinran_woman
Reading script from: /content/my_transcript.txt
Found 4 speaker segments:
1. Speaker 1
Text preview: Speaker 1: Have you read the latest article on KDnuggets?...
2. Speaker 2
Text preview: Speaker 2: Yes, it's one of the best resources for data science and AI....
3. Speaker 1
Text preview: Speaker 1: I like how KDnuggets always keeps up with the latest trends....
4. Speaker 2
Text preview: Speaker 2: Absolutely, it's a go-to platform for anyone in the AI community....
Speaker mapping:
Speaker 2 -> Frank
Speaker 1 -> Alice
Speaker 1 ('Alice') -> Voice: en-Alice_woman.wav
Speaker 2 ('Frank') -> Voice: en-Frank_man.wav
Loading processor & model from /content/models/VibeVoice-1.5B
==================================================
GENERATION SUMMARY
==================================================
Input file: /content/my_transcript.txt
Output file: ./outputs/my_transcript_generated.wav
Speaker names: ['Alice', 'Frank']
Number of unique speakers: 2
Number of segments: 4
Prefilling tokens: 368
Generated tokens: 118
Total tokens: 486
Generation time: 28.27 seconds
Audio duration: 15.47 seconds
RTF (Real Time Factor): 1.83x
==================================================
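Here, RTF is simply the generation time divided by the audio duration, so values above 1.0 mean generation is slower than real time:
# Reproduce the summary's RTF from its own numbers
generation_time_s = 28.27
audio_duration_s = 15.47
print(f"RTF: {generation_time_s / audio_duration_s:.2f}x")  # 1.83x on a T4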
// 5. Play the audio in the notebook
We will now use IPython's Audio display function to listen to the generated audio within Colab.
from IPython.display import Audio, display
out_path = "/content/outputs/my_transcript_generated.wav"
display(Audio(out_path))

It took 28 seconds to generate the audio, and it sounds clear, natural, and smooth. I love it!
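If you also want the file on your machine, Colab's files helper can trigger a browser download:
from google.colab import files

# Downloads the generated WAV through the browser (Colab-only helper)
files.download("/content/outputs/my_transcript_generated.wav")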
Now, try again with different voice actors.
Run #2: Map Speaker 1 → Mary, Speaker 2 → Carter
!python /content/VibeVoice/demo/inference_from_file.py \
    --model_path /content/models/VibeVoice-1.5B \
    --txt_path /content/my_transcript.txt \
    --speaker_names Mary Carter
The audio generated was even better, with background music at the start and a smooth transition between speakers.
Found 9 voice files in /content/VibeVoice/demo/voices
Available voices: en-Alice_woman, en-Carter_man, en-Frank_man, en-Mary_woman_bgm, en-Maya_woman, in-Samuel_man, zh-Anchen_man_bgm, zh-Bowen_man, zh-Xinran_woman
Reading script from: /content/my_transcript.txt
Found 4 speaker segments:
1. Speaker 1
Text preview: Speaker 1: Have you read the latest article on KDnuggets?...
2. Speaker 2
Text preview: Speaker 2: Yes, it's one of the best resources for data science and AI....
3. Speaker 1
Text preview: Speaker 1: I like how KDnuggets always keeps up with the latest trends....
4. Speaker 2
Text preview: Speaker 2: Absolutely, it's a go-to platform for anyone in the AI community....
Speaker mapping:
Speaker 2 -> Carter
Speaker 1 -> Mary
Speaker 1 ('Mary') -> Voice: en-Mary_woman_bgm.wav
Speaker 2 ('Carter') -> Voice: en-Carter_man.wav
Loading processor & model from /content/models/VibeVoice-1.5B
Tip: If you are unsure which names are available, the script prints “Available voices:” on startup.
Common ones include:
en-Alice_woman, en-Carter_man, en-Frank_man, en-Mary_woman_bgm, en-Maya_woman, in-Samuel_man, zh-Anchen_man_bgm, zh-Bowen_man, zh-Xinran_woman
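The names come straight from the WAV files shipped with the repository, so you can also list them yourself (path taken from the log output above):
import os

# Aliases like "Alice" map to files such as en-Alice_woman.wav (see the run logs)
for f in sorted(os.listdir("/content/VibeVoice/demo/voices")):
    print(f)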
# Troubleshooting
// 1. Repo Doesn’t Have Demo Scripts?
The official Microsoft VibeVoice repository has been pulled and reset. Community reports indicate that some code and demos have been removed or are no longer accessible in the original location. If you find that the official repository is missing inference examples, please check a community mirror or archive that has preserved the original demos and instructions: https://github.com/vibevoice-community/VibeVoice
// 2. Slow Generation or CUDA Errors in Colab
Verify you are on a GPU runtime: Runtime → Change runtime type → Hardware accelerator: GPU (T4 or any available GPU).
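You can confirm that a GPU is actually attached by running this in a cell:
# Should print the attached GPU (e.g., Tesla T4) along with driver and CUDA versions
!nvidia-smi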
// 3. CUDA OOM (Out of Memory)
To reduce memory pressure, start by shortening the input text and reducing the generation length. If the script permits, lower the audio sample rate and/or adjust internal chunk sizes. Keep the batch size at 1, and consider a smaller model variant.
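Between runs, you can also release cached GPU memory before retrying (standard PyTorch calls, not a VibeVoice-specific fix):
import torch

# Free cached blocks held by PyTorch's allocator and report what remains
torch.cuda.empty_cache()
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")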
// 4. No Audio or Missing Outputs Folder
The script typically prints the final output path in the console; scroll up to find the exact location. If you still cannot find the file, search for it:
!find /content -name "*generated.wav"
// 5. Voice Names Not Found?
Copy the exact names listed under Available voices, or use the alias names (Alice, Frank, Mary, Carter) shown in the demo; they correspond to the .wav assets.
# Final Thoughts
For many projects, I would choose an open-source stack like VibeVoice over paid APIs. It is easy to integrate, flexible to customize, and suitable for a wide range of applications. It is also surprisingly light on GPU requirements, which is a significant advantage in resource-constrained environments.
And because VibeVoice is open source, you can expect the ecosystem around it to keep improving, including frameworks that enable faster generation, even on CPUs.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master’s degree in technology management and a bachelor’s degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.