Overview

Large language models (LLMs) have achieved remarkable success in natural language processing, where a model pre-trained on large amounts of data generalizes well to various downstream tasks through in-context learning. Though originally designed for text, LLMs have also demonstrated strong performance in other modalities, such as vision and speech. This has led to an emerging research topic in speech processing: spoken language models (SLMs).

A spoken language model is usually a fusion of a speech model and a text language model. The fusion may take different forms, such as combining a speech encoder with an LLM or using a joint vocabulary of speech and text tokens. Alternatively, one may train purely speech-based LLMs that model acoustic data directly, without any textual supervision. This can also be extended to modeling prosodic features of the spoken utterance.
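To make the first fusion form concrete, below is a minimal sketch of projecting speech encoder features into an LLM's embedding space so they can be consumed as a prefix alongside text tokens. It assumes PyTorch; the module names, dimensions, and adapter architecture are illustrative assumptions rather than any specific published system.

```python
# Minimal sketch: map frame-level speech features to an LLM's embedding
# dimension and prepend them to text embeddings. All names and sizes here
# are hypothetical placeholders for a real speech encoder and LLM.
import torch
import torch.nn as nn


class SpeechToLLMAdapter(nn.Module):
    """Projects speech encoder outputs into the LLM embedding space."""

    def __init__(self, speech_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(speech_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, frames, speech_dim) -> (batch, frames, llm_dim)
        return self.proj(speech_feats)


# Toy usage with random tensors standing in for real model outputs.
adapter = SpeechToLLMAdapter(speech_dim=512, llm_dim=1024)
speech_feats = torch.randn(2, 100, 512)   # e.g., speech encoder output
text_embeds = torch.randn(2, 16, 1024)    # e.g., embedded instruction tokens
prefix = adapter(speech_feats)
llm_inputs = torch.cat([prefix, text_embeds], dim=1)
print(llm_inputs.shape)  # torch.Size([2, 116, 1024])
```

In the joint-vocabulary alternative, speech is instead discretized into tokens (e.g., by a speech tokenizer) and appended to the text vocabulary, so a single decoder models both modalities; no projection layer is needed in that case.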

SLMs simplify the modeling of speech, making it easier to scale up to more data, languages, and tasks. A single model can often perform multiple speech processing tasks, such as speech recognition, speech translation, speech synthesis, and natural dialogue modeling. SLMs built on pre-trained LLMs can also exhibit strong instruction-following capabilities, which apply to the tasks above as well as to spoken language understanding tasks including intent classification, slot filling, and spoken question answering. This presents a promising avenue for developing “universal speech foundation models” that take natural language instructions as input and proficiently execute diverse downstream tasks.

Topics

This special session aims to promote and advance the study of SLMs. We anticipate a session format combining a panel discussion and poster presentations.

We welcome submissions on various topics related to spoken language models, including but not limited to:

  • Data creation
  • Speech representation learning (e.g., speech tokenizers)
  • Modeling architectures and algorithms
  • Training strategies (e.g., supervised fine-tuning, reinforcement learning)
  • Efficient adaptation of pre-trained models (e.g., adapters, low-rank adaptation)
  • Model compression (e.g., pruning, distillation, quantization)
  • Novel applications
  • Evaluation benchmarks and analysis methods
  • Fairness and bias

Paper Submission

Please follow the regular INTERSPEECH paper submission guidelines on the official website. Be sure to list “Spoken Language Models for Universal Speech Processing” as your paper’s subject area when submitting.

Important Dates for INTERSPEECH 2024:

  • Paper Submission Portal Open: 20 January 2024
  • Paper Submission Deadline: 2 March 2024
  • Paper Update Deadline: 9 March 2024
  • Paper Acceptance Notification: 6 June 2024

Organizers