Meta introduces speech AI models that identify over 4,000 spoken languages

May 23, 2023  20:08

Meta, formerly known as Facebook, has unveiled a new set of AI speech models, the Massively Multilingual Speech (MMS) project, which stands apart from the wave of ChatGPT-style chatbots. The models can identify more than 4,000 spoken languages and provide speech recognition and speech generation (text-to-speech) for more than 1,100 languages. To promote language diversity and foster further research, Meta has open-sourced MMS, making its models and code publicly available, with the stated aim of helping preserve the world's linguistic richness.

Speech recognition and text-to-speech models usually require extensive training on large volumes of audio data with transcription labels. For languages that are not widely spoken in industrialized nations, many of which are at risk of disappearing, such data often does not exist at all. Meta tackled this challenge with an unconventional source of audio: recordings of translated religious texts, such as the Bible, which have already been studied extensively in text-based translation research. These translations come with publicly available audio recordings of people reading the texts in many languages. By incorporating these unlabeled recordings, Meta's team expanded the model's language coverage to more than 4,000 languages.

At first glance, Meta's approach may raise concerns that the models would be biased toward Christian perspectives. Meta says this is not the case: despite the religious content of the audio, its analysis indicates that the models do not tend to produce religious language. Meta attributes this to its use of a connectionist temporal classification (CTC) approach, which is far more constrained than the large language models (LLMs) or sequence-to-sequence models used in some speech recognition systems. The fact that most of the religious recordings were read by male speakers also did not introduce a gender bias; the model performed equally well with male and female voices.
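To make that distinction concrete, the sketch below shows how a CTC objective works in PyTorch: the model emits a per-frame distribution over symbols and is scored against one fixed transcript, rather than generating text token by token the way a sequence-to-sequence decoder would. All sizes and tensors here are made-up placeholders for illustration, not part of Meta's released code.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
T, N, C = 50, 1, 32   # 50 acoustic frames, batch of 1, 32-symbol vocabulary (index 0 = CTC blank)
S = 12                # length of the reference transcript in symbols

# Per-frame log-probabilities over the vocabulary, e.g. a speech encoder's output
# followed by a linear layer and log-softmax. Random here, as a stand-in.
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)

# A fixed reference transcript (random symbol ids here). CTC scores the audio against
# exactly this label sequence; it never generates free-form text of its own.
targets = torch.randint(low=1, high=C, size=(N, S), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```

Because the output can only be a monotonic, frame-aligned labelling of the input audio, there is little room for the model to drift into unrelated, for example religious, phrasing.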

After making the data usable with an alignment model, Meta trained wav2vec 2.0, its self-supervised speech representation learning model that can learn from unlabeled audio. The combination of unconventional data sources and self-supervised learning paid off: according to Meta, the Massively Multilingual Speech models outperform existing models while covering ten times as many languages. Compared with OpenAI's Whisper in particular, MMS achieved a significantly lower word error rate while covering eleven times more languages.
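Word error rate, the metric behind that comparison, is simply the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. A minimal, self-contained computation (with made-up example strings) might look like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words (substitutions + insertions + deletions),
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical reference/hypothesis pair for illustration.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))  # 1 error / 6 words ≈ 0.167
```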

Meta acknowledges that the new models are not flawless. It points in particular to the risk of the speech-to-text model mistranscribing certain words or phrases, which, depending on the output, could produce offensive or inaccurate language. Meta emphasizes the importance of collaboration within the AI community to ensure that such technologies are developed responsibly.

With the release of MMS for open-source research, Meta aims to counter the trend of technology narrowing its support to the hundred or so languages favored by major tech companies. It envisions a world in which assistive technology, text-to-speech, and even virtual and augmented reality let people speak and learn in their native languages. By making information and technology accessible in people's preferred languages, Meta hopes to help keep the world's diverse languages alive and vital.
