ByteDance's Volcano Engine Unveils Doubao Audio Generation Model 1.0 with Multimodal Input and Long-Form Voice Consistency
Volcano Engine, ByteDance’s cloud and AI platform, has officially released the Doubao Audio Generation Model 1.0 (Doubao-Seed-Audio 1.0), bringing a significant leap to AI-powered audio creation. For the first time, the model supports reference-based generation: feed it text, audio, or any combination of modalities as input, and it produces polished target audio end-to-end. Perhaps most impressively, it maintains consistent voice timbre for multiple characters across long-duration generation, slashing the need for tedious post-production voice correction.

With a single prompt, creators can orchestrate multi-character dialogue, emotional tone, background music, and environmental ambiance — producing a complete audio work with genuine narrative tension. This collapses the traditional workflow of recording voices, sound effects, and music separately before stitching them together in a multi-track editor.
From Single-Clip Tool to Audio Director
Traditional film-grade audio production requires recording dialogue, sound effects, and background scores independently, then manually aligning and mixing multiple tracks — a labor-intensive process demanding significant post-production skill. Doubao-Seed-Audio 1.0 compresses all of this into a single prompt, delivering ready-to-use, narratively complete audio without any multi-track editing, alignment, or mixing:
- Multi-character dialogue: Define multiple characters’ lines, tone, and emotional pacing in a single instruction, with consistent voice identity maintained across all roles.
- Non-verbal and stylistic cues: Embed laughter, sighs, pauses, and dialect accents directly into the prompt; the model reproduces them with precision, breathing life into conversations.
- Integrated music and effects: Background music and environmental sound effects are generated together with voices in one pass — no additional mixing required.
The shift is stark: a creator writes a description and receives a broadcast-ready audio drama, podcast episode, or brand audio spot in return.
Long-Form Consistency Without Character Drift
One of the hardest problems in long-form audio generation is consistency: making sure a character at minute 10 still sounds like the same person heard at minute 1. Doubao-Seed-Audio 1.0 tackles this through deep coupling between text-to-audio and reference-audio generation, sustaining a highly unified timbre across extended output. Creators no longer need to compare segments or repeatedly touch up voices; the model delivers consistent character sound in one go. This makes it viable for audiobooks, podcasts, and long episodic content.
Currently, the model supports up to 2 minutes of audio per generation pass. Using that output as a reference, creators can extend the audio in subsequent passes while maintaining timbre consistency, achieving controllable voice identity across much longer works.
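The extension workflow described above can be sketched as a simple loop. This is an illustration under stated assumptions, not the Volcano Ark API, whose interface is not described here: `generate_segment` is a hypothetical stand-in for whatever call produces one pass of audio, the script is assumed to be pre-split into chunks that each fit within the 2-minute limit, and feeding the previous pass's output forward as the reference is one plausible reading of "using that output as a reference".

```python
import math
from typing import Callable, List, Optional, TypeVar

Audio = TypeVar("Audio")  # whatever type the (hypothetical) generator returns

MAX_SECONDS_PER_PASS = 120  # the 2-minute per-pass limit stated in the article


def passes_needed(total_seconds: float,
                  per_pass: float = MAX_SECONDS_PER_PASS) -> int:
    """Minimum number of generation passes needed to cover total_seconds."""
    if total_seconds <= 0:
        return 0
    return math.ceil(total_seconds / per_pass)


def generate_long_form(
    chunks: List[str],
    generate_segment: Callable[[str, Optional[Audio]], Audio],
) -> List[Audio]:
    """Generate one segment per chunk, chaining references between passes.

    generate_segment(prompt, reference) is a hypothetical callable supplied
    by the caller; reference is None on the first pass and the previous
    pass's output afterwards, so each pass can match the earlier timbre.
    """
    segments: List[Audio] = []
    reference: Optional[Audio] = None
    for chunk in chunks:
        audio = generate_segment(chunk, reference)
        segments.append(audio)
        reference = audio  # assumption: the latest output becomes the next reference
    return segments
```

For example, under the stated limit a 10-minute episode (600 seconds) would take `passes_needed(600)` = 5 passes. A variant that always reuses the first segment as the reference would be an equally reasonable design; the article does not say which the model favors.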
Zero-Shot Multimodal Audio Creation
Doubao-Seed-Audio 1.0 supports multimodal inputs (text descriptions, reference audio, and more), generating high-quality target audio end-to-end without additional training. Creators can define character voices and expressive styles through a simple text prompt, or combine the prompt with a reference audio clip to quickly generate a sound that matches their needs. The barrier to professional voice creation has been lowered dramatically.
Beyond zero-shot generation, the model also achieves decoupled control over timbre and style. The same voice can adapt to different emotions, contexts, and expression scenarios; conversely, a single voice reference can produce differentiated performances across multiple character identities — a capability that dramatically expands flexibility for character dubbing, narrative performance, and creative audio production.
Availability
The Doubao Audio Generation Model 1.0 API is now open for invite-only testing on Volcano Ark (Volcano Engine’s ML platform). Individual users can try it directly through the Volcano Ark experience center with 30 minutes of free creation credit. For audio creators, the model is also set to roll out soon to products including Jianying (CapCut), Jimeng, and Fanqie.
The launch signals ByteDance’s deepening push into generative AI for creative workflows, positioning Doubao-Seed-Audio as a direct challenge to the fragmented, labor-intensive pipelines that have long defined professional audio production.