XEUS - Towards Robust Speech Representation Learning for Thousands of Languages
Overview
WAVLab at Carnegie Mellon University presents XEUS - a Cross-lingual Encoder for Universal Speech. XEUS (pronounced “Zeus”) is an open-source speech foundation model trained on nearly 1.1 million hours of unlabeled speech data across 4,057 languages. XEUS sets the new state-of-the-art on the ML-SUPERB multilingual speech recognition benchmark, while also achieving strong results on different tasks on the English-only SUPERB evaluations. We open-source XEUS’ checkpoints, along with our training code and 4,000+ language speech data in this project page.
Project Contacts: William Chen, Shinji Watanabe
Model Info
XEUS is a 19-layer E-Branchformer encoder trained using the HuBERT masked prediction objective. XEUS is trained on 37 public datasets, along with our crawled data detailed below. This totals to roughly 1.1 million hours of speech. From these audio files, we generate 180 billion speech tokens as the prediction targets using a pre-trained WavLabLM. We augment the training task with acoustic denoising from WavLM, along with a new dereverberation objective that we propose. More details can be found in the paper.
| Name | Data (hours) | Parameters | Model Link | ESPnet Recipe | License | 
|---|---|---|---|---|---|
| XEUS | 1.1 million | 577M | HuggingFace | Coming soon | MIT | 
Released Data
| Name | Data (hours) | Languages | Link | License | 
|---|---|---|---|---|
| MMS ulab v2 | 8,900 | 4,023 | espnet/mms_ulab_v2 | CC BY-NC-SA 4.0 | 
| WikiTongues | 70 | ~700 | espnet/wikitongues | CC BY-NC-SA 4.0 | 
| JesusDramas | 645 | 430 | espnet/jesus_dramas | CC BY-NC-SA 4.0 | 
Intermediate Checkpoints and Logs
To encourage research on the training dynamics of large-scale speech models, we will also release the logs and intermediate checkpoints obtained from training XEUS. In total, we have around 200 total intermediate checkpoints. These will be made available in the near future.
Usage
Similar to other ESPnet models, the pre-trained XEUS model can be downloaded and used in a python script. The code for XEUS is still in progress of being merged into the main ESPnet repo. It can instead be used from the following fork:
pip install 'espnet @ git+https://github.com/wanchichen/espnet.git@ssl'
git lfs install
git clone https://huggingface.co/espnet/XEUS
Feature Extraction
from torch.nn.utils.rnn import pad_sequence
from espnet2.tasks.ssl import SSLTask
import soundfile as sf
device = "cuda" if torch.cuda.is_available() else "cpu"
xeus_model, xeus_train_args = SSLTask.build_model_from_file(
    config = None,
    ckpt = '/path/to/checkpoint/here/checkpoint.pth',
    device,
)
wavs, sampling_rate = sf.read('/path/to/audio.wav') # sampling rate should be 16000
wav_lengths = torch.LongTensor([len(wav) for wav in [wavs]]).to(device)
wavs = pad_sequence(torch.Tensor([wavs]), batch_first=True).to(device) 
# we recommend use_mask=True during fine-tuning
feats = xeus_model.encode(wavs, wav_lengths, use_mask=False, use_final_output=False)[0][-1] # take the output of the last layer -> batch_size x seq_len x hdim
Flash Attention
XEUS supports Flash Attention for more efficient training.
pip install flash-attn --no-build-isolation
[layer.use_flash_attn = True for layer in xeus_model.encoder.encoders]
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
  feats = xeus_model.encode(wavs, wav_lengths, use_mask=False, use_final_output=False)[0][-1]
Masking
The masking settings can be tuned to better fit certain downstream tasks during fine-tuning.
xeus_model.masker.mask_prob = 0.65 # default 0.8
xeus_model.masker.mask_length = 20 # default 10
xeus_model.masker.mask_selection = 'static' # default 'uniform'
xeus_model.train()
feats = xeus_model.encode(wavs, wav_lengths, use_mask=True, use_final_output=False)[0][-1]
Papers
If you use XEUS or our released data in your project, please consider citing the paper.