
arXiv | Hugging Face | GitHub Repo
Abstract
We introduce KALL-E, a novel autoregressive (AR) language model for text-to-speech (TTS) synthesis that operates by predicting the next distribution of continuous speech frames. Unlike existing methods, KALL-E directly models the continuous speech distribution conditioned on text, eliminating the need for any diffusion-based components. Specifically, we use a Flow-VAE to extract a continuous latent speech representation from waveforms, instead of relying on discrete speech tokens. A single AR Transformer is then trained to predict these continuous speech distributions from text, optimizing a Kullback–Leibler (KL) divergence loss as its objective. Experimental results demonstrate that KALL-E achieves superior speech synthesis quality and can adapt to a target speaker from a single sample. Importantly, KALL-E provides a more direct and effective approach for utilizing continuous speech representations in TTS.
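The KL objective described in the abstract, matching the AR model's predicted frame distribution to the Flow-VAE posterior, can be sketched as follows. This is a minimal illustration, not KALL-E's actual implementation: it assumes both distributions are diagonal Gaussians, and all names, shapes, and values are placeholders.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians,
    summed over the latent dimension and averaged over frames."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(axis=-1).mean()

# Toy shapes: T latent frames of dimension D (illustrative values).
T, D = 4, 8
rng = np.random.default_rng(0)
mu_post = rng.normal(size=(T, D))   # Flow-VAE posterior mean (placeholder)
logvar_post = np.zeros((T, D))      # Flow-VAE posterior log-variance (placeholder)
mu_pred = rng.normal(size=(T, D))   # AR Transformer's predicted mean (placeholder)
logvar_pred = np.zeros((T, D))      # AR Transformer's predicted log-variance (placeholder)

# Per-frame KL between posterior and prediction, as a scalar training loss.
loss = gaussian_kl(mu_post, logvar_post, mu_pred, logvar_pred)
```

The loss is zero exactly when the predicted parameters match the posterior, which is what makes it usable as a regression target for continuous frames.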
System Overview
[System overview figure]
Voice Cloning

Given a prompt audio clip and its transcript, KALL-E can generate speech for arbitrary target text in the prompt speaker's timbre.
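At inference time, generation with a next-frame-distribution model amounts to repeatedly predicting Gaussian parameters and sampling the next latent frame. The sketch below illustrates that loop under the same diagonal-Gaussian assumption; `ar_step`, the frame count, and all shapes are hypothetical stand-ins, not KALL-E's actual interface.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8  # latent frame dimension (illustrative)

def ar_step(prompt, history):
    """Stand-in for the AR Transformer: given the prompt frames and the
    frames generated so far, return (mu, logvar) of the next frame's
    Gaussian. A random stub here; the real model conditions on text too."""
    mu = rng.normal(scale=0.1, size=D)
    logvar = np.full(D, -2.0)
    return mu, logvar

prompt = rng.normal(size=(3, D))  # latent frames of the prompt audio (placeholder)
frames = []
for _ in range(5):  # generate 5 latent frames; real stopping is model-driven
    mu, logvar = ar_step(prompt, frames)
    # Sample the next continuous frame from the predicted distribution.
    frames.append(mu + np.exp(0.5 * logvar) * rng.normal(size=D))
# The generated latent frames would then be decoded to a waveform
# by the Flow-VAE decoder.
```

Sampling from a predicted distribution, rather than emitting a deterministic frame, is what lets the model keep the natural variability of speech without a separate diffusion stage.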
Sample 1
Prompt audio and text (voice_clone_prompt_en_1): Youth Congress always remained in highlights of the leading state news.
Target text and generated audio (voice_clone_target_en_1): There are narrow teeth on lateral margin of prothorax which are long.

Sample 2
Prompt audio and text (voice_clone_prompt_en_2): Without public access to the advisory, it is obviously impossible to comment in detail.
Target text and generated audio (voice_clone_target_en_2): The concept of the piece was to "make an orchestra speak".