Learning Timeline
Key Insights
Diffusion Steps 'Sweet Spot' Tip
Around 20 diffusion steps is the 'sweet spot' for VibeVoice. Going beyond roughly 40 to 50 steps yields diminishing returns: generation takes longer and uses more GPU time without a significant improvement in quality.
GPU VRAM Management
If you have limited VRAM, set 'free memory after generate' to 'True'. This clears the model from VRAM once generation finishes, which is essential to prevent ComfyUI from crashing when you run other workflows after generating audio.
Quality vs. Speed
The 7B model delivers very realistic voice clones (the example given is a clone of Sam Altman's voice) but demands a lot of VRAM and takes a long time to load. If you need quick results for a draft, use a smaller model instead.
Prompts
VibeVoice Dialogue Script Format
Target: VibeVoice Transcript Node
[Speaker 1] Hello, this is the first speaker. [Speaker 2] Hi there. I'm the second speaker. [Speaker 1] Nice to meet you. [Speaker 2] Nice to meet you, too.
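Before pasting a long script into the transcript node, it can help to check that every line of dialogue sits behind a speaker tag. The following is a minimal Python sketch, not part of VibeVoice or ComfyUI: the function name `parse_script` is hypothetical, and it assumes tags are written exactly as `[Speaker N]` with a number inside the brackets.

```python
import re

# Matches a "[Speaker N]" tag; the group captures the speaker number.
SPEAKER_TAG = re.compile(r"\[Speaker (\d+)\]")


def parse_script(script: str) -> list[tuple[int, str]]:
    """Split a tagged script into (speaker_number, text) turns.

    Hypothetical helper for checking a script before use; it only
    understands the "[Speaker N]" tag format shown above.
    """
    # With one capture group, re.split returns
    # [text_before_first_tag, number, text, number, text, ...].
    parts = SPEAKER_TAG.split(script)
    if parts[0].strip():
        raise ValueError("text found before the first [Speaker N] tag")

    turns = []
    for number, text in zip(parts[1::2], parts[2::2]):
        text = text.strip()
        if not text:
            raise ValueError(f"[Speaker {number}] has no text")
        turns.append((int(number), text))
    return turns


script = (
    "[Speaker 1] Hello, this is the first speaker. "
    "[Speaker 2] Hi there. I'm the second speaker. "
    "[Speaker 1] Nice to meet you. "
    "[Speaker 2] Nice to meet you, too."
)
turns = parse_script(script)
speakers = sorted({number for number, _ in turns})
```

Here `turns` holds the four lines of dialogue in order, and `speakers` lists which speaker numbers appear, which tells you how many voice clips the workflow needs.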
Step by Step
Configuring the VibeVoice Workflow in ComfyUI
- Download the VibeVoice workflow file, then drag and drop it into the ComfyUI interface.
- On the 'Speaker 1' node, click the 'upload' button to import a short audio clip of the voice you wish to clone.
- Repeat the same step for the 'Speaker 2' node if you require a second voice for the dialogue.
- Enter your script into the 'transcript' input box. Use the [Speaker 1] and [Speaker 2] format in square brackets to differentiate between speakers.
- Select a model in the 'model selection' section. Choose the '7B' version for the best audio quality if you have sufficient VRAM (approximately 17GB required).
- Set 'Attention Type' to 'auto' to allow the system to automatically detect the best acceleration method.
- Adjust the 'free memory after generate' setting. Set it to 'True' to clear the model from VRAM once finished, or 'False' if you intend to perform rapid, repetitive generations.
- Set 'Diffusion Steps' to 20 for an optimal balance between quality and speed.
- Set the 'Seed' control to 'randomize' for a different result on every run, or to 'fixed' to reproduce the same result and keep the voice consistent.
- Adjust 'Temperature' (lower values for consistency, higher for more creativity) and 'CFG' (which controls how closely the AI follows the text prompt).
- Click 'Queue Prompt' to start the inference and audio generation process.
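The settings from the steps above can be collected in one place and checked against the guidance in this guide. This is a rough Python sketch under stated assumptions: the class `VibeVoiceSettings`, its field names, and the function `review` are hypothetical and only mirror the labels in the ComfyUI interface, not the node's real input keys. 'Temperature' and 'CFG' are left as `None` because no specific values are recommended here.

```python
from dataclasses import dataclass
from typing import Optional

SEVEN_B_VRAM_GB = 17            # approximate requirement quoted for the 7B model
RECOMMENDED_STEPS = 20          # the 'sweet spot' for diffusion steps
DIMINISHING_RETURNS_STEPS = 40  # beyond this, quality gains are not significant


@dataclass
class VibeVoiceSettings:
    """Hypothetical container mirroring the node's UI labels."""
    model: str = "7B"
    attention_type: str = "auto"
    free_memory_after_generate: bool = True
    diffusion_steps: int = RECOMMENDED_STEPS
    seed_mode: str = "randomize"         # or "fixed"
    temperature: Optional[float] = None  # None = keep the node's current value
    cfg: Optional[float] = None          # None = keep the node's current value


def review(settings: VibeVoiceSettings, vram_gb: float) -> list[str]:
    """Return human-readable notes where settings depart from the guidance."""
    notes = []
    if settings.model == "7B" and vram_gb < SEVEN_B_VRAM_GB:
        notes.append("7B needs about 17 GB of VRAM; consider a smaller model.")
    if settings.diffusion_steps > DIMINISHING_RETURNS_STEPS:
        notes.append("More than 40 steps gives diminishing returns; try 20.")
    if settings.seed_mode not in ("randomize", "fixed"):
        notes.append("Seed mode should be 'randomize' or 'fixed'.")
    return notes


default_notes = review(VibeVoiceSettings(), vram_gb=24)
risky_notes = review(VibeVoiceSettings(diffusion_steps=50), vram_gb=12)
```

With the defaults and 24 GB of VRAM, `default_notes` is empty. With 50 steps on a 12 GB card, `risky_notes` contains two notes, one about the model size and one about the step count.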
Using External Text Files for Transcripts
- Prepare a text file (.txt) containing the complete dialogue script.
- Save the file into the 'input' folder within your ComfyUI directory.
- Locate the text input node in the workflow. It is bypassed (inactive) while it has a purple highlight. Right-click it and select 'Bypass' (or press Ctrl+B) to turn the bypass off, and confirm that the purple highlight disappears so the node is active.
- Click the dropdown menu on that node and select the name of the text file you saved (e.g., transcript.txt).
- Drag the output from the text file node to the transcript input on the main VibeVoice node.
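If you generate scripts programmatically, the first two steps above (preparing the .txt file and saving it to the 'input' folder) can be automated. The sketch below is an illustration only: `save_transcript` is a hypothetical helper, and the ComfyUI directory is passed in by the caller because its location depends on your installation. The demonstration uses a temporary directory as a stand-in for a real ComfyUI folder.

```python
import tempfile
from pathlib import Path


def save_transcript(script: str, comfyui_dir: Path,
                    name: str = "transcript.txt") -> Path:
    """Write a dialogue script into <comfyui_dir>/input and return its path.

    Hypothetical helper; it assumes the 'input' folder already exists,
    as it does in a normal ComfyUI directory.
    """
    if not name.endswith(".txt"):
        raise ValueError("transcript file name must end in .txt")
    input_dir = Path(comfyui_dir) / "input"
    if not input_dir.is_dir():
        raise FileNotFoundError(f"no 'input' folder under {comfyui_dir}")
    target = input_dir / name
    target.write_text(script, encoding="utf-8")
    return target


# Demonstration with a temporary stand-in for the ComfyUI directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "input").mkdir()
    saved = save_transcript("[Speaker 1] Hello. [Speaker 2] Hi there.", Path(tmp))
    saved_name = saved.name
    saved_text = saved.read_text(encoding="utf-8")
```

After the file is written, it still has to be selected in the dropdown of the text file node and connected to the transcript input, as described in the remaining steps.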