Generate and edit expressive speech with Step Audio EditX

Guide to using Step Audio EditX for text-to-speech generation, voice cloning, emotion control, and modifying existing audio styles.

Tags: Step Audio EditX, Audio Generation, Audio Editing

Segment Details

Source Video Time - 13:39

Duration 5.2 mins

Key Insights

Hardware Requirements & Optimization

The model has 3 billion parameters. 16 GB of VRAM is recommended for best performance, but it can run on consumer-grade GPUs with 12 GB or less by enabling CPU offloading.
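To put the VRAM figures in context, here is a rough back-of-the-envelope sketch of how much memory 3 billion parameters occupy at common precisions. It counts weights only; the precision the tool actually loads is not stated in this guide, so the `fp16` default is an assumption, and activations, caches, and audio encoding/decoding components add overhead on top.

```python
# Rough estimate of the memory needed just to hold model weights.
# Assumption: weights only. Activations, caches, and other runtime
# overhead are NOT counted, so real usage will be higher.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str = "fp16") -> float:
    """Return the weight footprint in GiB for a given precision."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 3)


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gib(3e9, dtype):.1f} GiB")
```

At 16-bit precision the weights alone come to roughly 5.6 GiB, which suggests why a 12 GB card can work once part of the load is offloaded to the CPU, and why more headroom helps.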

Low-Data Efficiency

Owing to its reinforcement-learning-based training, the model can produce high-fidelity voice clones from a reference clip of only about 5 seconds, whereas older models required minutes of training data.

Prompts

Voice Cloning Test Script

Target: Step Audio EditX

Underneath the courtyard is a large underground exhibition room which connects the two buildings.

Angry Emotion Test Script

Target: Step Audio EditX

Seriously, your call is very important to us. If it were important, you should pick up the phone. This is the last time I'm calling.

Whisper Style Test Script

Target: Step Audio EditX

I'm right here with you. You're safe. Everything is okay.

Step by Step

Cloning a Voice with Short Reference Audio

Prepare a reference audio file approximately 5 seconds in length containing the target voice.
Upload the reference audio into Step Audio EditX.
Input the text transcript you wish the cloned voice to speak.
Initiate the generation process to synthesize the new audio matching the reference tone and timbre.
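Before uploading, it can help to confirm that the reference clip really is about 5 seconds long. The sketch below uses only Python's standard `wave` module and assumes the clip is an uncompressed WAV file. The helper names and the 1-second tolerance are illustrative choices, not part of Step Audio EditX.

```python
import io
import wave


def wav_duration_seconds(source) -> float:
    """Return the duration of a WAV file (path or file-like object)."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_good_reference(source, target: float = 5.0, tolerance: float = 1.0) -> bool:
    """True if the clip length is within `tolerance` seconds of `target`.

    Both defaults are assumptions chosen for illustration.
    """
    return abs(wav_duration_seconds(source) - target) <= tolerance


def make_silent_wav(seconds: float, rate: int = 16000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence, for demonstration only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(is_good_reference(make_silent_wav(5.0)))   # clip of about 5 s
    print(is_good_reference(make_silent_wav(30.0)))  # clip far too long
```

With a real recording, pass the file path instead of the in-memory buffer, for example `is_good_reference("reference.wav")` (a hypothetical filename).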

Modifying Audio Emotion and Tone

Load the original audio clip and its corresponding text transcript into the editor.
Select the target emotional parameter (e.g., 'Angry', 'Fearful') or tonal style (e.g., 'Exaggerated').
Run the modification process.
Verify that the output reflects the change, such as increased intensity for anger or faster, more hurried pacing for fear.
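Listening is the real test, but the verification step can be supported with two crude numeric proxies: RMS loudness (for intensity) and clip duration (for pacing). The sketch below is a minimal illustration that assumes both clips are exported as mono 16-bit PCM WAV files and that neither clip is pure silence. The function names are hypothetical, and the synthetic tones merely stand in for real speech.

```python
import io
import math
import struct
import wave


def read_mono16(source):
    """Return (samples, sample_rate) from a mono 16-bit PCM WAV."""
    with wave.open(source, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM")
        frames = wav.readframes(wav.getnframes())
        rate = wav.getframerate()
    return struct.unpack(f"<{len(frames) // 2}h", frames), rate


def rms(samples) -> float:
    """Root-mean-square amplitude, a rough proxy for loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0


def compare_clips(original, edited) -> dict:
    """Ratios above 1.0 mean the edited clip is louder / longer."""
    a, rate_a = read_mono16(original)
    b, rate_b = read_mono16(edited)
    return {
        "loudness_ratio": rms(b) / rms(a),
        "duration_ratio": (len(b) / rate_b) / (len(a) / rate_a),
    }


def tone(amplitude: int, seconds: float, rate: int = 16000, freq: float = 220.0) -> io.BytesIO:
    """Synthetic sine tone as an in-memory WAV (demo stand-in for speech)."""
    n = int(seconds * rate)
    samples = [int(amplitude * math.sin(2 * math.pi * freq * i / rate)) for i in range(n)]
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(struct.pack(f"<{n}h", *samples))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    # A louder, shorter "edited" clip mimics a more intense, faster delivery.
    print(compare_clips(tone(4000, 2.0), tone(8000, 1.5)))
```

A loudness ratio above 1.0 is consistent with a more intense delivery, and a duration ratio below 1.0 with faster pacing for the same transcript. Neither number says anything about whether the emotion sounds convincing.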

Applying Style Transfers (Whisper/Roar)

Input the source audio file.
Choose a specific acoustic style preset, such as 'Whisper' or 'Roar'.
Process the audio to apply the transfer.
Review the result to ensure the vocal characteristics (e.g., breathiness for whisper) are applied while retaining the original message.

Inserting Paralinguistics (Breathing/Laughing)

Enter the base text transcript for the speech generation.
Place the cursor at the point in the text where a non-verbal sound is required.
Insert the specific paralinguistic instruction (e.g., add a 'breath' or 'laugh' marker).
Generate the audio to confirm the sound effect is seamlessly integrated between the spoken words.
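When preparing many transcripts, the marker placement described above can be scripted. The sketch below inserts a bracketed marker at a character offset and snaps it to the nearest following word boundary so that no word is split. The bracketed `[breath]` / `[laugh]` syntax is an assumption made for illustration; check the tool itself for the marker names and format it actually accepts.

```python
def insert_marker(text: str, position: int, marker: str) -> str:
    """Insert `[marker]` at a character offset without splitting a word.

    The `[marker]` syntax is a placeholder, not the tool's confirmed format.
    """
    if not 0 <= position <= len(text):
        raise ValueError("position is outside the text")
    # If the offset falls inside a word, advance to the end of that word.
    while (
        0 < position < len(text)
        and not text[position - 1].isspace()
        and not text[position].isspace()
    ):
        position += 1
    before = text[:position].rstrip()
    after = text[position:].lstrip()
    return " ".join(part for part in (before, f"[{marker}]", after) if part)


if __name__ == "__main__":
    script = "I'm right here with you. You're safe. Everything is okay."
    print(insert_marker(script, script.index("You're"), "breath"))
    # prints: I'm right here with you. [breath] You're safe. Everything is okay.
```

Snapping to a word boundary is a design choice: a cursor that lands mid-word almost always means "after this word", and a marker inside a word would likely be read as part of the text.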
