Overview

Open Whisper-style Speech Models (OWSM, pronounced as “awesome”) are a series of speech foundation models developed by WAVLab at Carnegie Mellon University. We reproduce Whisper-style training using publicly available data and our open-source toolkit ESPnet. By publicly releasing data preparation scripts, training and inference code, pre-trained model weights and training logs, we aim to promote transparency and open science in large-scale speech pre-training.

Demo

  • Gradio demo
  • Colab notebook

Pre-trained models

We publicly release a series of pre-trained models. The training logs are also available for major models. We recommend using OWSM v3.1 or later versions for better performance and efficiency.

| Name | Data (hours) | Encoder | Parameters | Model Link | ESPnet Recipe |
| --- | --- | --- | --- | --- | --- |
| OWSM v1 | 38k | Transformer | 272M | espnet/owsm_v1 | egs2/owsm_v1/s2t1 |
| OWSM v2 | 129k | Transformer | 712M | espnet/owsm_v2 | egs2/owsm_v2/s2t1 |
| OWSM v2 | 129k | E-Branchformer | 739M | espnet/owsm_v2_ebranchformer | egs2/owsm_v2/s2t1 |
| OWSM v3 | 180k | Transformer | 889M | espnet/owsm_v3 | egs2/owsm_v3/s2t1 |
| OWSM v3.1 base | 180k | E-Branchformer | 101M | espnet/owsm_v3.1_ebf_base | egs2/owsm_v3.1/s2t1 |
| OWSM v3.1 small | 180k | E-Branchformer | 367M | Coming soon | egs2/owsm_v3.1/s2t1 |
| OWSM v3.1 medium | 180k | E-Branchformer | 1.02B | espnet/owsm_v3.1_ebf | egs2/owsm_v3.1/s2t1 |
| OWSM v3.1 medium license-free | 70k | E-Branchformer | 1.02B | Coming soon | Coming soon |

Data details

The latest OWSM v3.1 models are trained on a diverse combination of public datasets as listed below.

OWSM v3.1 training data mixtures
  • AIDATATANG
  • AISHELL-1
  • AMI
  • Babel
  • Common Voice
  • Googlei18n
  • CoVoST2
  • Fisher Callhome Spanish
  • Fisher (Switchboard)
  • FLEURS
  • GigaSpeech
  • GigaST
  • KsponSpeech
  • LibriSpeech
  • MagicData
  • Multilingual LibriSpeech
  • MuST-C
  • ReazonSpeech
  • Russian Open STT
  • SPGISpeech
  • TEDLIUM3
  • VCTK
  • VoxForge
  • VoxPopuli
  • WenetSpeech

The license-free model is trained on a subset of the above data with “free licenses”.

OWSM v3.1 license-free data
  • AMI: CC-BY-4.0
  • Common Voice: CC0-1.0
  • FLEURS: CC-BY-4.0
  • KsponSpeech: MIT
  • LibriSpeech: CC-BY-4.0
  • Multilingual LibriSpeech: CC-BY-4.0
  • VCTK: CC-BY-4.0

Inference

Similar to other ESPnet models, the pre-trained OWSM models can be easily downloaded and used in a Python script. Below are some examples using OWSM v3.1. For earlier versions (v2 and before), the language code should follow the two-letter format (e.g., <en>, <de>).
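
For example, a v2 model can be loaded with a two-letter language symbol. This is only a minimal sketch: espnet/owsm_v2_ebranchformer is the v2 model tag from the table above, and the Speech2Text interface itself is described in the following subsections.

from espnet2.bin.s2t_inference import Speech2Text

# minimal sketch for an older (v2) model: note the two-letter language code
s2t_v2 = Speech2Text.from_pretrained(
    model_tag="espnet/owsm_v2_ebranchformer",
    device="cuda",
    lang_sym="<en>",   # two-letter code; v3 and later use three-letter codes such as "<eng>"
    task_sym="<asr>",
)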

Language Identification

We pass the Hugging Face model tag when initializing Speech2Language. The model will be automatically downloaded from Hugging Face to a local cache directory.

from espnet2.bin.s2t_inference_language import Speech2Language
s2l = Speech2Language.from_pretrained(
    model_tag="espnet/owsm_v3.1_ebf",
    device="cuda",
    nbest=3,  # return nbest prediction and probability
)

import soundfile as sf
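# audio.wav should be a 16 kHz speech recording (OWSM models are trained on 16 kHz audio)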
speech, rate = sf.read("audio.wav")

result = s2l(speech)
print(result)
# list of tuples (language, probability)
# [('<eng>', 0.9994348883628845), ('<jpn>', 0.00010286537144565955), ('<rus>', 6.185896199895069e-05)]

Speech Recognition or Translation

We use Speech2Text for speech recognition or translation. We also pass the model tag so that the model is automatically downloaded. When initializing this object, we set the default values for lang_sym, task_sym and predict_time. These defaults can be overwritten in each call, which provides more flexibility. Note that the language must be known to use this functionality. If it is unknown, one can first perform language identification and then recognition or translation (see the sketch after the examples below).

from espnet2.bin.s2t_inference import Speech2Text
s2t = Speech2Text.from_pretrained(
    model_tag="espnet/owsm_v3.1_ebf",
    device="cuda",
    beam_size=5,
    ctc_weight=0.0,
    maxlenratio=0.0,
    # below are default values which can be overwritten in __call__
    lang_sym="<eng>",
    task_sym="<asr>",
    predict_time=False,
)

import soundfile as sf
speech, rate = sf.read("audio.wav")


# obtain the recognized text from the best hypothesis
result = s2t(speech)[0][-2]

# an optional text prompt can be passed
result = s2t(
    speech,
    text_prev="this is an optional prompt"
)[0][-2]

# lang_sym, task_sym, predict_time can be overwritten
result = s2t(
    speech,
    lang_sym="<eng>",
    task_sym="<st_zho>",    # translation into Chinese
    predict_time=True,
)[0][-2]
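
If the language is unknown, language identification and recognition can be chained: run Speech2Language first and pass the predicted symbol to Speech2Text. A minimal sketch reusing the s2l and s2t objects created above:

# chain language identification and recognition (reusing s2l and s2t from above)
lang = s2l(speech)[0][0]   # most likely language symbol, e.g. "<eng>"
result = s2t(speech, lang_sym=lang, task_sym="<asr>")[0][-2]
print(lang, result)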

Long-form Speech Recognition or Translation

OWSM processes an entire audio recording in a chunk-by-chunk manner. Each chunk has a fixed length of 30 seconds and is shifted based on the predicted timestamps. We still use Speech2Text, but we call its decode_long method.

from espnet2.bin.s2t_inference import Speech2Text
s2t = Speech2Text.from_pretrained(
    model_tag="espnet/owsm_v3.1_ebf",
    device="cuda",
    beam_size=5,
    ctc_weight=0.0,
    maxlenratio=0.0,
    # below are default values which can be overwritten in __call__
    lang_sym="<eng>",
    task_sym="<asr>",
)

import soundfile as sf
speech, rate = sf.read("covid.wav")

result = s2t.decode_long(speech)
# list of tuples (start_time, end_time, text)
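
The returned segments can be iterated over directly, for example to print each hypothesis with its predicted timestamps (a small usage sketch based on the output format above):

# print each decoded segment with its predicted start and end times
for start_time, end_time, text in result:
    print(start_time, end_time, text)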

Fine-tuning on custom data

Coming soon!

Papers

Please cite our papers if you use OWSM in your project.

We also collect other papers related to OWSM. Please contact Yifan Peng (yifanpen@andrew.cmu.edu) if you use OWSM in your work and would like it to be listed here.

OWSM applications
Foundational work used by OWSM