As part of its push to build AI that can understand many languages, Meta has developed a new AI model called SeamlessM4T. According to Meta, the model can translate and transcribe over 100 languages across both text and speech.
SeamlessM4T represents a significant breakthrough in AI-powered speech-to-speech and speech-to-text translation. In a blog post shared with TechCrunch, Meta explains that its single model provides on-demand translation, enabling people who speak different languages to communicate effectively, and that it does so without the need for a separate language identification model.
SeamlessM4T follows No Language Left Behind, Meta’s text-to-text machine translation model, and the Universal Speech Translator, a direct speech-to-speech translation system notable for supporting the Hokkien language.
It also builds upon Massively Multilingual Speech, Meta’s framework offering speech recognition, language identification, and speech synthesis technology across more than 1,100 languages.
Although Meta’s efforts are notable, other entities are also investing resources in the development of advanced AI translation and transcription tools. Companies like Amazon, Microsoft, OpenAI, and various startups already offer commercial services and open-source models.
Google is also working on the Universal Speech Model, aiming to comprehend the world’s most spoken languages, while Mozilla’s Common Voice initiative aims to create a diverse collection of voices for training automatic speech recognition algorithms. Among these endeavours, SeamlessM4T stands out as an attempt to unite translation and transcription capabilities within a single model.
According to Meta, to create the model it scraped 4 million hours of speech and “tens of billions” of sentences of publicly accessible text from the internet. In an interview with TechCrunch, Juan Pino, a research scientist at Meta’s AI research division and a contributor to the project, did not disclose the sources of the data, stating only that there were “a variety” of them.
But Meta claims that the data it mined, which the company admits might contain personally identifiable information, wasn’t copyrighted and came primarily from open source. Meta used the scraped text and speech to create SeamlessAlign, the training data set for SeamlessM4T.
Researchers aligned 443,000 hours of speech with texts and created 29,000 hours of “speech-to-speech” alignments, which taught SeamlessM4T how to transcribe speech to text, translate text, generate speech from text, and even translate words spoken in one language into words in another language.
About SeamlessM4T
Meta claims that on an internal benchmark, SeamlessM4T handled background noise and “speaker variations,” such as differences in tone, better in speech-to-text tasks than the current state-of-the-art speech transcription model. It attributes this to the rich combination of speech and text data in the training data set, which Meta believes gives SeamlessM4T an advantage over speech-only and text-only models.
In a blog post, Meta said it believes SeamlessM4T represents a significant advance in the pursuit of universal multitask AI systems, delivering state-of-the-art results. However, it is worth considering the biases that the model might contain.
A recent article in The Conversation highlights various shortcomings of AI-powered translation, including instances of gender bias. A study published in the Proceedings of the National Academy of Sciences found that prominent speech recognition systems were twice as likely to inaccurately transcribe audio from Black speakers as from White speakers.
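Transcription accuracy of this kind is commonly measured as word error rate (WER): the number of word-level edits needed to turn a system's transcript into the reference transcript, divided by the length of the reference. The sketch below is only an illustration of that metric, not the method of any study cited here. The function names, the group labels, and the transcripts are all made up for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def mean_wer(pairs):
    """Average WER over (reference, hypothesis) transcript pairs."""
    return sum(word_error_rate(ref, hyp) for ref, hyp in pairs) / len(pairs)


# Hypothetical transcripts, invented for illustration (not real data).
samples = {
    "speaker_group_a": [
        ("turn the lights off", "turn the lights off"),
        ("call my sister now", "call my sister later"),
    ],
    "speaker_group_b": [
        ("turn the lights off", "turn the light of"),
        ("call my sister now", "call sister now"),
    ],
}

for group, pairs in samples.items():
    print(group, mean_wer(pairs))
```

Comparing the average WER across groups of speakers is one simple way to make a disparity like the one described above visible.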
In its blog post, Meta discloses that the model tends to “overgeneralize to masculine forms when translating from neutral terms” and performs better when translating from masculine references (such as pronouns like “he” in English) for most languages. Additionally, when gender information is absent, SeamlessM4T leans towards translating in the masculine form around 10% of the time. Meta speculates that this might be due to an “overrepresentation of masculine lexica” in the training data.
The tech giant argues that SeamlessM4T doesn’t produce an excessive amount of toxic text in its translations, a common problem with translation and generative AI text models. However, in specific languages like Bengali and Kyrgyz, the model generates more toxic translations related to socio-economic status and culture. In general, SeamlessM4T tends to exhibit more toxicity in translations dealing with sexual orientation and religion.
Meta points out that the public demo of SeamlessM4T incorporates a toxicity filter for both input and output speech. However, this filter is not enabled by default in the open-source release of the model.
Some of the challenges attributed to AI translation
Overuse of AI translation risks a loss of linguistic richness. Human interpreters make individual choices when translating, which leave a distinct stylistic fingerprint known informally as “translationese”; AI lacks those individualized choices. Although AI can offer more precise translations, that precision might come at the cost of variety and diversity in translations.
Perhaps because of this concern, Meta recommends against using SeamlessM4T for long-form translations or for certified translations, meaning those officially recognized by government agencies and translation authorities.
Moreover, Meta cautions against employing SeamlessM4T for medical or legal purposes, presumably to limit the harm a mistranslation could cause. This is a notable stance, as there have been instances where AI mistranslations have led to errors in law enforcement. For instance, a mistranslated text message resulted in police wrongly accusing a Kurdish man of financing terrorism, and in another case, a flawed translation led to a misunderstanding during a police car search, ultimately leading to the case’s dismissal.
Overall, AI translation may keep improving in accuracy, but possibly at the cost of diversity in translation. This highlights the need to use AI-powered tools sparingly, especially in sensitive situations.