Microsoft’s new text-to-speech model can replicate anyone’s voice in 3 seconds

Microsoft’s new text-to-speech model, VALL-E, can replicate anyone’s voice after listening to just a 3-second sample. VALL-E is a transformer-based text-to-speech model and a significant improvement over previous models, which needed lengthy training before they could produce a new voice. The intonation, emotional tone, and style of the sample voice also stay consistent across the generated speech. It is a major step toward more natural-sounding text-to-speech systems.

This article is reposted from: https://www.solidot.org/story?sid=73846