The open-source model improves on Whisper by using multi-head attention to achieve speedup and reduced latency while retaining full speech recognition accuracy
aiOla, a leader in speech recognition technology, today announced the release of its new open-source AI model, Whisper-Medusa. The new model, based on a multi-head attention architecture, outperforms OpenAI’s Whisper, the most popular and best available AI speech recognition model, by running 50% faster with no loss in accuracy.
The automatic speech recognition market size is projected to grow to $7.14 billion this year. As voice becomes an integrated feature in most connected devices and AI chatbots, speech recognition has emerged as a vital technology field. Amid this rapid expansion, OpenAI disrupted the automatic speech recognition landscape by releasing Whisper, an open-source model considered superior to any other commercial or open-source speech recognition model available today. Whisper, with more than 5 million downloads per month, has become the gold standard for automatic speech recognition systems and is powering tens of thousands of applications.
aiOla’s new open-source model, Whisper-Medusa, greatly improves on Whisper’s speed by altering how the model predicts tokens. While Whisper predicts one token at a time, Whisper-Medusa can predict ten at a time, resulting in a 50% improvement in speech prediction speed and generation runtime. Given the significance of this advancement, aiOla has decided to release the model’s weights and code today on GitHub and Hugging Face for the community to access.
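To make the one-token-versus-many idea concrete, here is a minimal toy sketch in Python. It assumes the extra heads act as guessers whose proposals the base model then checks, which is how Medusa-style decoding is commonly described; the announcement itself does not spell out this step. The functions `next_token` and `propose` are hypothetical stand-ins for the base model and the additional heads, not part of the released code, and the loop only simulates a parallel verification pass by calling `next_token` once per position.

```python
from typing import Callable, List, Optional, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]           # hypothetical base model: prefix -> next token
Propose = Callable[[List[Token], int], List[Token]]  # hypothetical extra heads: prefix -> up to k guesses


def decode_one_at_a_time(next_token: NextToken, max_new: int, eos: Token) -> Tuple[List[Token], int]:
    """Baseline: one model pass per generated token."""
    out: List[Token] = []
    passes = 0
    while len(out) < max_new:
        tok = next_token(out)
        out.append(tok)
        passes += 1
        if tok == eos:
            break
    return out, passes


def decode_with_heads(next_token: NextToken, propose: Propose, max_new: int,
                      eos: Token, k: int = 10) -> Tuple[List[Token], int]:
    """Assumed scheme: heads guess k tokens, the base model keeps the agreeing prefix."""
    out: List[Token] = []
    passes = 0
    done = False
    while not done and len(out) < max_new:
        passes += 1  # counts one simulated verification pass
        guesses: List[Optional[Token]] = list(propose(out, k))
        # The trailing None lets the base model add one more token after k correct guesses.
        for guess in guesses + [None]:
            tok = next_token(out)  # what the base model would emit at this position
            out.append(tok)        # always keep the base model's token, so output is unchanged
            if tok == eos or len(out) >= max_new:
                done = True
                break
            if guess != tok:
                break  # first disagreement: drop the remaining guesses
    return out, passes


# Toy setup (entirely made up): the "model" recites a fixed sequence ending in EOS.
EOS = 0
TARGET = list(range(1, 24)) + [EOS]


def toy_next_token(prefix: List[Token]) -> Token:
    return TARGET[len(prefix)] if len(prefix) < len(TARGET) else EOS


def perfect_heads(prefix: List[Token], k: int) -> List[Token]:
    return TARGET[len(prefix):len(prefix) + k]


def useless_heads(prefix: List[Token], k: int) -> List[Token]:
    return [-1] * k


baseline_tokens, baseline_passes = decode_one_at_a_time(toy_next_token, 100, EOS)
fast_tokens, fast_passes = decode_with_heads(toy_next_token, perfect_heads, 100, EOS)
slow_tokens, slow_passes = decode_with_heads(toy_next_token, useless_heads, 100, EOS)
```

In this toy setup the output is identical in all three runs; only the number of passes changes with the quality of the guesses. The pass counts illustrate the mechanism and say nothing about the 50% figure reported for the real model.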
“Creating Whisper-Medusa was not an easy task, but its significance to the community is profound,” said Gill Hetz, VP of Research at aiOla. “Improving the speed and latency of LLMs is much easier to do than with automatic speech recognition systems. The encoder and decoder architectures present unique challenges due to the complexity of processing continuous audio signals and handling noise or accents. We addressed these challenges by employing our novel multi-head attention approach, which resulted in a model with nearly double the prediction speed while maintaining Whisper’s high levels of accuracy. It’s a major feat, and we are very proud to be the first in the industry to successfully leverage multi-head attention architecture for automatic speech recognition systems and bring it to the public.”
Whisper-Medusa, based on multi-head attention, is trained using weak supervision. In this process, the main components of Whisper are initially frozen while additional parameters are trained. This training process involves using Whisper to transcribe audio datasets and employing these transcriptions as labels for training Medusa’s additional token prediction modules. aiOla currently offers Whisper-Medusa as a 10-head model, with future plans to release a 20-head version with equivalent accuracy.
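The labeling step described above can be sketched as follows. This is a minimal illustration, not aiOla's training code: it assumes that Whisper's transcription is already available as a list of token IDs, and that head number h is trained to predict the token h + 1 positions ahead (the base model itself covers the next token). The function name `build_head_targets` and the offset convention are assumptions made for this example.

```python
from typing import Dict, List, Tuple


def build_head_targets(pseudo_labels: List[int], num_heads: int) -> Dict[int, List[Tuple[int, int]]]:
    """Pair each position t with the token head h should predict from there.

    pseudo_labels: token IDs taken from Whisper's own transcription (the weak labels).
    Assumed convention: head h (1-based) targets the token at position t + h + 1.
    """
    targets: Dict[int, List[Tuple[int, int]]] = {}
    for head in range(1, num_heads + 1):
        offset = head + 1
        # Positions too close to the end have no token that far ahead and are skipped.
        targets[head] = [
            (t, pseudo_labels[t + offset])
            for t in range(len(pseudo_labels) - offset)
        ]
    return targets


# Made-up token IDs standing in for a Whisper transcription.
example_labels = [10, 11, 12, 13, 14]
example_targets = build_head_targets(example_labels, num_heads=2)
```

With these example labels, head 1 gets the pairs (0, 12), (1, 13), (2, 14) and head 2 gets (0, 13), (1, 14). In an actual training run only the parameters of the added heads would be updated against such targets while the frozen Whisper components stay unchanged.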