Large Language Models (LLMs) are now applied across many tasks and input modalities. For speech-to-text tasks, a common approach is to connect Speech Foundation Models (SFMs) to LLMs through adapter modules.
But how much does each part of this chain really matter?
Francesco Cariaggi will discuss his latest research, in which he and his co-authors tested different combinations of adapters, LLMs (Mistral and Llama), and SFMs (Whisper and SeamlessM4T) on speech recognition and speech translation tasks.
The study shows that the choice of SFM makes the most significant difference, while the adapter has a moderate impact that depends on the specific SFM–LLM combination.
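To make the SFM → adapter → LLM pipeline concrete, here is a minimal sketch of what an adapter does in such systems: it compresses the speech encoder's output sequence in time and projects each frame into the LLM's embedding space. All dimensions and the specific adapter design (a strided linear projection) are illustrative assumptions, not the architectures compared in the paper.

```python
import numpy as np

def adapter(speech_features, w, stride=2):
    """Hypothetical adapter: temporally downsample the SFM output,
    then linearly project each frame into the LLM embedding dimension."""
    pooled = speech_features[::stride]   # keep every `stride`-th frame
    return pooled @ w                    # shape: (T // stride, d_llm)

rng = np.random.default_rng(0)
T, d_sfm, d_llm = 100, 1280, 4096        # illustrative sizes only
speech_features = rng.standard_normal((T, d_sfm))   # mock SFM encoder output
w = rng.standard_normal((d_sfm, d_llm)) * 0.01      # mock projection weights
llm_inputs = adapter(speech_features, w)
print(llm_inputs.shape)                  # (50, 4096)
```

The projected frames would then be prepended (or interleaved with a text prompt) as input embeddings to the LLM; real adapters are trained jointly with, or in place of, parts of this stack.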
This talk is based on the paper: How to Connect Speech Foundation Models and Large Language Models? What Matters and What Does Not, written by Francesco Verdini, Pierfrancesco Melucci, Stefano Perna, Francesco Cariaggi, Marco Gaido, Sara Papi, Szymon Mazurek, Marek Kasztelnik, Luisa Bentivogli, Sébastien Bratières, Paolo Merialdo, Simone Scardapane, 2024-2025.
Deep Learning Scientist
Francesco Cariaggi has been a Deep Learning Scientist at Pi School since 2021. After attending the School of AI in 2020, he started his journey as an AI Coach at Pi School. He specialises in speech AI, working at the intersection of language and technology.
He is actively contributing to Meetween, an EU-funded project developing AI-based solutions for the next generation of video conferencing platforms. The project aims to enable smooth, barrier-free collaboration across languages, geographies, and time zones.
Meetween began in January 2024 and will continue until the end of 2027, having received €7.1 million in funding from the European Commission’s Horizon Europe Framework Programme. The consortium comprises eight organisations from Europe and Turkey, including Translated (the coordinator), Fondazione Bruno Kessler, Karlsruhe Institute of Technology, ITUNOVA, TAUS, Zoom Video Communications Germany, Pi School, and Cyfronet.
The project is developing foundational AI technologies that integrate speech, text and video for video conferencing and adapt to users’ contexts and cultural specificities. Meetween’s roadmap includes releasing three generations of a large, multimodal AI foundation model for speech (SpeechLMM), as well as a large audiovisual speech dataset. Both will be openly available for research and commercial use.
Meetween’s mission is to support real-time speech-to-speech translation, face dubbing, summarisation and virtual assistant services, while promoting a European vision for AI that is safe, privacy-aware and grounded in ethics.
Explore Pi School of AI tech talks