Microsoft's AI imitates anyone's voice and manner of speech after listening to just 3 seconds of the original

January 10, 2023  19:10

VALL-E, an artificial intelligence program developed by Microsoft, can imitate any human voice by listening to the original voice for just 3 seconds. It can even retain the timbre and emotional tone of the original.

The project builds on EnCodec, a technology developed by Meta. Whereas other text-to-speech methods usually synthesize speech by manipulating waveforms, Microsoft's approach analyzes a specific person's voice, breaks that information down into individual "tokens," and uses them to "teach" the AI, which then "imagines" how the voice would sound if that person uttered other phrases.
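To give a rough feel for what breaking audio down into "tokens" means, here is a minimal toy sketch in Python. It is not VALL-E or EnCodec code: a real neural codec uses a trained encoder and learned codebooks, while this example assumes two small hand-made codebooks and a synthetic 440 Hz tone as input. All names in it (`make_codebook`, `encode`, `decode`) are hypothetical. Each audio sample is replaced by a pair of small integers, where the second integer encodes the error left over by the first.

```python
import math

def make_codebook(levels, scale):
    """Hypothetical codebook: evenly spaced values in [-scale, scale]."""
    step = 2 * scale / (levels - 1)
    return [-scale + i * step for i in range(levels)]

def nearest(codebook, x):
    """Index of the codebook entry closest to x."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))

def encode(samples, codebooks):
    """Turn each sample into a list of token ids, one id per codebook.
    Every stage quantizes the residual left by the previous stage."""
    tokens = []
    for x in samples:
        residual = x
        ids = []
        for cb in codebooks:
            i = nearest(cb, residual)
            ids.append(i)
            residual -= cb[i]
        tokens.append(ids)
    return tokens

def decode(tokens, codebooks):
    """Rebuild approximate samples by summing the chosen codebook entries."""
    return [sum(cb[i] for cb, i in zip(codebooks, ids)) for ids in tokens]

# Assumed input: 10 ms of a 440 Hz tone sampled at 8 kHz (80 samples).
samples = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(80)]

# Assumed codebooks: a coarse one, then a finer one for the leftover error.
codebooks = [make_codebook(8, 1.0), make_codebook(8, 0.15)]

tokens = encode(samples, codebooks)
restored = decode(tokens, codebooks)
max_error = max(abs(a - b) for a, b in zip(samples, restored))

print(tokens[:5])           # first few token pairs
print(round(max_error, 4))  # worst-case reconstruction error
```

The sound is now a sequence of integers rather than a waveform, which is the kind of discrete representation a language-model-style system can learn to predict. The quality of the reconstruction depends on how many codebooks are stacked and how fine they are.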

VALL-E was "trained" on the LibriLight library, which contains 60,000 hours of English speech from over 7,000 speakers. Examples of the AI's output can be found on the project's website, and they are truly impressive.

In the Speaker Prompt column, you can listen to the three-second speech samples given to the AI to "learn" and imitate. In the Ground Truth column, the target phrases are spoken by the same person, and in the VALL-E column, the same phrases are performed by the VALL-E AI. For comparison, the Baseline column contains samples produced by traditional text-to-speech converters.

As the samples show, the AI not only gives the generated speech the appropriate emotional coloring, but also imitates the "acoustic environment" of the original sample. For example, if the original recording was made during a telephone conversation, the voice generated by the AI will also sound like a telephone conversation.

Such AI can be used in various fields, including for malicious purposes. To prevent misuse of the technology, Microsoft has not published VALL-E's code, so that no one can experiment with it. According to company representatives, they will do the same with other projects that carry a potential threat of abuse.
