Large general-purpose transformer models have recently become the mainstay of speech analysis. In particular, Whisper achieves state-of-the-art results in relevant tasks such as speech recognition, translation, language identification, and voice activity detection. However, Whisper models are not designed for real-time use, and this limitation makes them unsuitable for a wide range of practical applications. In this talk, we introduce Whispy, a system intended to bring live capabilities to the pretrained Whisper models. Thanks to a number of architectural optimisations, Whispy is able to consume live audio streams and generate coherent, high-quality voice transcriptions, while still maintaining a low computational cost. We evaluate the performance of our system on a large repository of publicly available technical recordings, investigating how the transcription mechanism introduced by Whispy impacts the Whisper output. Experimental results show that Whispy excels in robustness, promptness, and accuracy.
Antonio Bevilacqua received his Master's Degree in Computer Engineering from the University of Naples Federico II in March 2014. After a brief period working as a software developer in industry, in 2020 he was awarded a PhD in Machine Learning from the Insight Centre for Data Analytics, a research group at University College Dublin, Ireland. At the same institute, he then held a postdoctoral position, which ended in December 2022. Antonio's research interests centre on time-series analysis, with an emphasis on data obtained in healthcare contexts. Since October 2023, he has been working as a Machine Learning engineer at Meetecho.