Introducing VERSA: A Comprehensive Speech and Audio Evaluation Toolkit

The WAVLab team is excited to announce the public release of VERSA (Versatile Evaluation of Speech and Audio), our comprehensive toolkit designed to revolutionize how researchers and developers evaluate speech and audio quality.

Why We Built VERSA

Audio quality assessment has long been fragmented across numerous specialized metrics, each requiring a different setup, set of dependencies, and data format. This fragmentation creates significant barriers for researchers and practitioners alike.

VERSA solves these problems by providing a unified framework that brings together over 80 evaluation metrics under a single, easy-to-use interface.

What Sets VERSA Apart

Comprehensive Coverage

VERSA provides seamless access to more than 80 evaluation and profiling metrics, with roughly ten times as many metric variants, covering speech, audio, and music.

Getting Started

Installation

git clone https://github.com/wavlab-speech/versa.git
cd versa
pip install .

For metrics requiring additional dependencies, our tools directory provides convenient installers.

Quick Example

Evaluating speech quality is as simple as:

python versa/bin/scorer.py \
    --score_config egs/speech.yaml \
    --gt /path/to/reference/audio \
    --pred /path/to/generated/audio \
    --output_file results \
    --io dir
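With --io dir, the --gt and --pred flags each point at a directory of audio files. As an illustrative sketch only (this is not VERSA's internal code, and it assumes reference and generated files are matched by shared filename), the pairing step might look like:

```python
from pathlib import Path


def pair_audio_files(gt_dir, pred_dir, ext=".wav"):
    """Pair reference and generated audio files that share a filename.

    Illustrative sketch: assumes the directories passed to --gt and
    --pred contain files matched by name. Not VERSA's implementation.
    """
    gt = {p.name: p for p in Path(gt_dir).glob(f"*{ext}")}
    pred = {p.name: p for p in Path(pred_dir).glob(f"*{ext}")}
    shared = sorted(set(gt) & set(pred))
    return [(gt[name], pred[name]) for name in shared]
```

Under this assumption, files present in only one of the two directories would simply be skipped rather than raising an error.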

Real-World Applications

VERSA enables researchers and developers to benchmark and compare speech, audio, and music generation systems with consistent, reproducible metrics.

Interactive Demo

Want to see VERSA in action? Try our interactive demo.

Join the Community

VERSA is open-source and community-driven. We welcome contributions and feedback.

Citation

If you find VERSA useful in your research, please cite our papers:

NAACL 2025 Demo Paper:

Shi, J., Shim, H., Tian, J., Arora, S., Wu, H., Petermann, D., Yip, J. Q., Zhang, Y., Tang, Y., Zhang, W., Alharthi, D. S., Huang, Y., Saito, K., Han, J., Zhao, Y., Donahue, C., & Watanabe, S. (2025). VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music. Proceedings of the 2025 Annual Conference of the North American Chapter of the Association for Computational Linguistics – System Demonstration Track. OpenReview

SLT 2024 Paper:

Shi, J., Tian, J., Wu, Y., Jung, J., Yip, J. Q., Masuyama, Y., Chen, W., Wu, Y., Tang, Y., Baali, M., Alharthi, D., Zhang, D., Deng, R., Srivastava, T., Wu, H., Liu, A., Raj, B., Jin, Q., Song, R., & Watanabe, S. (2024). ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs For Audio, Music, and Speech. 2024 IEEE Spoken Language Technology Workshop (SLT), 562-569. DOI: 10.1109/SLT61566.2024.10832289

Looking Forward

We’re committed to continuously improving VERSA with new metrics, enhanced usability, and expanded documentation. Stay tuned for upcoming features and don’t hesitate to reach out with suggestions!