Overview

Open Whisper-style Speech Models (OWSM, pronounced as “awesome”) are a series of speech foundation models developed by WAVLab at Carnegie Mellon University. We reproduce Whisper-style training using publicly available data and our open-source toolkit ESPnet. By publicly releasing data preparation scripts, training and inference code, pre-trained model weights and training logs, we aim to promote transparency and open science in large-scale speech pre-training.

Demo

  • Gradio demo
  • Colab notebook

Pre-trained models

We publicly release a series of pre-trained models. The training logs are also available for major models. We recommend using OWSM v3.1 or later versions for better performance and efficiency.

| Name | Data (hours) | Encoder | Parameters | Model Link | ESPnet Recipe |
| --- | --- | --- | --- | --- | --- |
| OWSM v1 | 38k | Transformer | 272M | espnet/owsm_v1 | egs2/owsm_v1/s2t1 |
| OWSM v2 | 129k | Transformer | 712M | espnet/owsm_v2 | egs2/owsm_v2/s2t1 |
| OWSM v2 | 129k | E-Branchformer | 739M | espnet/owsm_v2_ebranchformer | egs2/owsm_v2/s2t1 |
| OWSM v3 | 180k | Transformer | 889M | espnet/owsm_v3 | egs2/owsm_v3/s2t1 |
| OWSM v3.1 base | 180k | E-Branchformer | 101M | espnet/owsm_v3.1_ebf_base | egs2/owsm_v3.1/s2t1 |
| OWSM v3.1 small | 180k | E-Branchformer | 367M | Coming soon | egs2/owsm_v3.1/s2t1 |
| OWSM v3.1 medium | 180k | E-Branchformer | 1.02B | espnet/owsm_v3.1_ebf | egs2/owsm_v3.1/s2t1 |
| OWSM v3.1 medium license-free | 70k | E-Branchformer | 1.02B | Coming soon | Coming soon |

Data details

The latest OWSM v3.1 models are trained on a diverse combination of public datasets as listed below.

OWSM v3.1 training data mixtures
  • AIDATATANG
  • AISHELL-1
  • AMI
  • Babel
  • Common Voice
  • Googlei18n
  • CoVoST2
  • Fisher Callhome Spanish
  • Fisher (Switchboard)
  • FLEURS
  • GigaSpeech
  • GigaST
  • KsponSpeech
  • LibriSpeech
  • MagicData
  • Multilingual LibriSpeech
  • MuST-C
  • ReazonSpeech
  • Russian Open STT
  • SPGISpeech
  • TEDLIUM3
  • VCTK
  • VoxForge
  • VoxPopuli
  • WenetSpeech

The license-free model is trained on a subset of the above data with “free licenses”.

OWSM v3.1 license-free data
  • AMI: CC-BY-4.0
  • Common Voice: CC0-1.0
  • FLEURS: CC-BY-4.0
  • KsponSpeech: MIT
  • LibriSpeech: CC-BY-4.0
  • Multilingual LibriSpeech: CC-BY-4.0
  • VCTK: CC-BY-4.0

Inference

Similar to other ESPnet models, the pre-trained OWSM models can be easily downloaded and used in a Python script. Below are some examples using OWSM v3.1. For earlier versions (v2 and before), the language code should follow the two-letter format (e.g., <en>, <de>).
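
For example, a v2 model can be loaded with a two-letter language symbol. This is only a minimal sketch: espnet/owsm_v2_ebranchformer is the v2 model tag from the table above, and the Speech2Text interface itself is described in the following subsections.

from espnet2.bin.s2t_inference import Speech2Text

# minimal sketch for an older (v2) model: note the two-letter language code
s2t_v2 = Speech2Text.from_pretrained(
    model_tag="espnet/owsm_v2_ebranchformer",
    device="cuda",
    lang_sym="<en>",   # two-letter code; v3 and later use three-letter codes such as "<eng>"
    task_sym="<asr>",
)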

Language Identification

We pass the Hugging Face model tag when initializing Speech2Language. The model will be automatically downloaded from Hugging Face to a local cache directory.

from espnet2.bin.s2t_inference_language import Speech2Language
s2l = Speech2Language.from_pretrained(
    model_tag="espnet/owsm_v3.1_ebf",
    device="cuda",
    nbest=3,  # return nbest prediction and probability
)

import soundfile as sf
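# audio.wav should be a 16 kHz speech recording (OWSM models are trained on 16 kHz audio)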
speech, rate = sf.read("audio.wav")

result = s2l(speech)
print(result)
# list of tuples (language, probability)
# [('<eng>', 0.9994348883628845), ('<jpn>', 0.00010286537144565955), ('<rus>', 6.185896199895069e-05)]

Speech Recognition or Translation

We use Speech2Text for speech recognition or translation. We also pass the model tag so that the model is automatically downloaded. When initializing this object, we set the default values for lang_sym, task_sym and predict_time. These defaults can be overwritten in each call, which provides more flexibility. Note that the language must be known to use this functionality. If it is unknown, one can first perform language identification and then recognition or translation (see the sketch after the examples below).

from espnet2.bin.s2t_inference import Speech2Text
s2t = Speech2Text.from_pretrained(
    model_tag="espnet/owsm_v3.1_ebf",
    device="cuda",
    beam_size=5,
    ctc_weight=0.0,
    maxlenratio=0.0,
    # below are default values which can be overwritten in __call__
    lang_sym="<eng>",
    task_sym="<asr>",
    predict_time=False,
)

import soundfile as sf
speech, rate = sf.read("audio.wav")


# obtain the recognized text from the best hypothesis
result = s2t(speech)[0][-2]

# an optional text prompt can be passed
result = s2t(
    speech,
    text_prev="this is an optional prompt"
)[0][-2]

# lang_sym, task_sym, predict_time can be overwritten
result = s2t(
    speech,
    lang_sym="<eng>",
    task_sym="<st_zho>",    # translation into Chinese
    predict_time=True,
)[0][-2]
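
If the language is unknown, language identification and recognition can be chained: run Speech2Language first and pass the predicted symbol to Speech2Text. A minimal sketch reusing the s2l and s2t objects created above:

# chain language identification and recognition (reusing s2l and s2t from above)
lang = s2l(speech)[0][0]   # most likely language symbol, e.g. "<eng>"
result = s2t(speech, lang_sym=lang, task_sym="<asr>")[0][-2]
print(lang, result)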

Long-form Speech Recognition or Translation

OWSM processes an entire audio recording in a chunk-by-chunk manner. Each chunk has a fixed length of 30 seconds and is shifted based on the predicted timestamps. We still use Speech2Text, but we call its decode_long method.

from espnet2.bin.s2t_inference import Speech2Text
s2t = Speech2Text.from_pretrained(
    model_tag="espnet/owsm_v3.1_ebf",
    device="cuda",
    beam_size=5,
    ctc_weight=0.0,
    maxlenratio=0.0,
    # below are default values which can be overwritten in __call__
    lang_sym="<eng>",
    task_sym="<asr>",
)

import soundfile as sf
speech, rate = sf.read("covid.wav")

result = s2t.decode_long(speech)
# list of tuples (start_time, end_time, text)
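
The returned segments can be iterated over directly, for example to print each hypothesis with its predicted timestamps (a small usage sketch based on the output format above):

# print each decoded segment with its predicted start and end times
for start_time, end_time, text in result:
    print(start_time, end_time, text)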

Fine-tuning on custom data

Coming soon!

Papers

Please cite our papers if you use OWSM in your project.

We also collect other papers related to OWSM. Please contact Yifan Peng (yifanpen@andrew.cmu.edu) if you use OWSM in your work and would like it to be listed here.

OWSM applications
Foundational work used by OWSM