Setting up a professional voice cloning workflow with VibeVoice | Alpha | PandaiTech

How to configure the ComfyUI workflow for VibeVoice, including uploading voice samples, formatting dialogue scripts, and adjusting key parameters like 'temperature' and 'steps' for optimal audio quality.

Key Insights

Diffusion Steps 'Sweet Spot' Tip

A value of 20 steps is the 'sweet spot' for VibeVoice. Going beyond 40 to 50 steps typically yields diminishing returns: there is no significant quality improvement, and the extra steps only cost GPU time.

GPU VRAM Management

If you have limited VRAM, set 'free memory after generate' to 'True'. This unloads the model once generation finishes, which helps prevent ComfyUI from running out of memory and crashing when you run other workflows afterwards.

Quality vs. Speed

The 7B model delivers highly realistic voice clones (a cloned Sam Altman voice is the example given) but demands a lot of VRAM and takes a long time to load. If you need quick results for a draft, use a smaller model instead.

Prompts

VibeVoice Dialogue Script Format

Target: VibeVoice Transcript Node
[Speaker 1] Hello, this is the first speaker. [Speaker 2] Hi there. I'm the second speaker. [Speaker 1] Nice to meet you. [Speaker 2] Nice to meet you, too.
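
If you assemble scripts programmatically, the format above can be generated and checked with a few lines of Python. This is a minimal sketch, not part of VibeVoice or ComfyUI: the function names build_transcript and parse_transcript are hypothetical, and it assumes the spoken text itself contains no square brackets.

```python
import re

# Matches a tag such as "[Speaker 1]" followed by that speaker's text.
# Assumption: the spoken text never contains a "[" character.
TAG_PATTERN = re.compile(r"\[Speaker (\d+)\]\s*([^\[]*)")


def build_transcript(turns):
    """Join (speaker_number, text) pairs into one bracket-tagged script."""
    parts = []
    for speaker, text in turns:
        text = text.strip()
        if not text:
            raise ValueError(f"Speaker {speaker} has an empty line")
        parts.append(f"[Speaker {speaker}] {text}")
    return " ".join(parts)


def parse_transcript(script):
    """Split a bracket-tagged script back into (speaker_number, text) pairs."""
    return [(int(num), text.strip()) for num, text in TAG_PATTERN.findall(script)]


if __name__ == "__main__":
    script = build_transcript([
        (1, "Hello, this is the first speaker."),
        (2, "Hi there. I'm the second speaker."),
    ])
    print(script)
```

Parsing the result back with parse_transcript is a quick way to confirm that every line carries a speaker tag before you paste the script into the transcript box.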

Step by Step

Configuring the VibeVoice Workflow in ComfyUI

  1. Download the VibeVoice workflow file, then drag and drop it into the ComfyUI interface.
  2. On the 'Speaker 1' node, click the 'upload' button to import a short audio clip of the voice you wish to clone.
  3. Repeat the same step for the 'Speaker 2' node if you require a second voice for the dialogue.
  4. Enter your script into the 'transcript' input box. Use the [Speaker 1] and [Speaker 2] format in square brackets to differentiate between speakers.
  5. Select a model in the 'model selection' section. Choose the '7B' version for the best audio quality if you have sufficient VRAM (approximately 17 GB is required).
  6. Set 'Attention Type' to 'auto' to allow the system to automatically detect the best acceleration method.
  7. Adjust the 'free memory after generate' setting. Set it to 'True' to clear the model from VRAM once finished, or 'False' if you intend to perform rapid, repetitive generations.
  8. Set 'Diffusion Steps' to 20 for an optimal balance between quality and speed.
  9. Select 'Seed' and set it to 'randomize' for a different result on every run, or 'fixed' to get the same result each time, which keeps the voice consistent between generations.
  10. Adjust 'Temperature' (lower values for consistency, higher for more creativity) and 'CFG' (which controls how closely the AI follows the text prompt).
  11. Click 'Queue Prompt' to start the inference and audio generation process.
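
The parameter choices in the steps above can be summarized as a small decision helper. This is only a sketch of the guidance in this guide: the function suggest_settings and its dictionary keys are hypothetical names that mirror the UI labels, not the node's actual input names, and treating anything below 17 GB as "limited VRAM" is an assumption. The guide does not name the smaller model, so it appears here as a placeholder. Temperature and CFG are left out because the guide gives no specific values for them.

```python
# Approximate VRAM the guide quotes for the 7B model.
VRAM_NEEDED_FOR_7B_GB = 17


def suggest_settings(vram_gb, rapid_iterations=False, reproducible=False):
    """Return suggested settings based on the guidance in this guide.

    The keys are hypothetical labels that mirror the UI, not real node inputs.
    """
    limited_vram = vram_gb < VRAM_NEEDED_FOR_7B_GB  # assumption: same threshold
    return {
        # "smaller model" is a placeholder; the guide does not name one.
        "model": "smaller model" if limited_vram else "7B",
        "attention_type": "auto",
        # Keep the model loaded only when VRAM is plentiful and you are
        # generating repeatedly; otherwise free it after each run.
        "free_memory_after_generate": limited_vram or not rapid_iterations,
        "diffusion_steps": 20,
        "seed_mode": "fixed" if reproducible else "randomize",
    }


if __name__ == "__main__":
    for key, value in suggest_settings(vram_gb=24, rapid_iterations=True).items():
        print(f"{key}: {value}")
```

The values still have to be entered by hand in the ComfyUI nodes; the helper only makes the trade-offs explicit.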

Using External Text Files for Transcripts

  1. Prepare a text file (.txt) containing the complete dialogue script.
  2. Save the file into the 'input' folder within your ComfyUI directory.
  3. Locate the text file node in the workflow. If it is bypassed (shown with a purple highlight), right-click it and select 'Bypass' (or press Ctrl+B) to switch the bypass off and reactivate the node. The purple highlight should disappear.
  4. Click the dropdown menu on that node and select the name of the text file you saved (e.g., transcript.txt).
  5. Drag the output from the text file node to the transcript input on the main VibeVoice node.
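
Steps 1 and 2 can also be scripted. The sketch below writes a script into the 'input' folder of a ComfyUI installation; the function name save_transcript and the example path are hypothetical, so point it at your own ComfyUI directory.

```python
from pathlib import Path


def save_transcript(script, comfyui_dir, filename="transcript.txt"):
    """Write the dialogue script into <comfyui_dir>/input and return its path."""
    input_dir = Path(comfyui_dir) / "input"
    if not input_dir.is_dir():
        # Fail early rather than create a folder in the wrong place.
        raise FileNotFoundError(f"No 'input' folder found in {comfyui_dir}")
    target = input_dir / filename
    target.write_text(script, encoding="utf-8")
    return target


if __name__ == "__main__":
    # Hypothetical location; replace with the path to your ComfyUI directory.
    path = save_transcript(
        "[Speaker 1] Hello. [Speaker 2] Hi there.",
        "ComfyUI",
    )
    print(f"Saved transcript to {path}")
```

After saving, the file should be selectable from the dropdown in step 4. You may need to refresh the ComfyUI page before a new file shows up.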
