VERSA: A Comprehensive Speech and Audio Evaluation Toolkit
The WAVLab team is excited to announce the public release of VERSA (Versatile Evaluation of Speech and Audio), our comprehensive toolkit designed to revolutionize how researchers and developers evaluate speech and audio quality.
Audio quality assessment has long been fragmented across numerous specialized metrics, each requiring different setups, dependencies, and formats. This fragmentation creates significant barriers for researchers and practitioners alike: every metric means another setup to learn, another set of dependencies to manage, and another input and output format to juggle.
VERSA solves these problems by providing a unified framework that brings together over 80 evaluation metrics under a single, easy-to-use interface.
VERSA provides seamless access to more than 80 evaluation and profiling metrics, with roughly ten times as many configurable variants, covering speech, audio, and music.
git clone https://github.com/wavlab-speech/versa.git
cd versa
pip install .
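To confirm the installation succeeded, a quick import check should work (this assumes the package installs under the versa name, which matches the repository layout):

python -c "import versa; print('VERSA is ready')"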
For metrics requiring additional dependencies, the tools directory in the repository provides convenient installers.
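For example, an installer can be run directly from that directory (the script name below is a placeholder, not an actual file; check tools/ for the installers that ship with the release):

cd tools
bash install_some_metric.sh  # hypothetical name; pick the installer for the metric you need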
Evaluating speech quality is as simple as:
python versa/bin/scorer.py \
--score_config egs/speech.yaml \
--gt /path/to/reference/audio \
--pred /path/to/generated/audio \
--output_file results \
--io dir
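The file passed via --score_config is a YAML description of which metrics to run. As a rough sketch of the idea (the metric names and parameters below are assumptions; consult the shipped egs/speech.yaml for the authoritative list):

# Sketch of a score config: a list of metric entries (names and params assumed)
- name: pesq        # assumed key for a perceptual quality metric
- name: stoi        # assumed key for an intelligibility metric
- name: mcd_f0      # assumed key for mel-cepstral distortion / F0 measures
  f0min: 40         # illustrative parameter values
  f0max: 800

Swapping in a different config lets you change the metric suite without touching the command itself.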
VERSA enables researchers and developers to run dozens of metrics over a set of generated audio with a single command and a single config file, and to compare systems on exactly the same footing; see the sketch below.
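For instance, comparing two systems against the same references is a short shell loop over the command above (the system names and output paths are illustrative):

for system in baseline proposed; do
  python versa/bin/scorer.py \
    --score_config egs/speech.yaml \
    --gt /path/to/reference/audio \
    --pred outputs/${system} \
    --output_file results_${system} \
    --io dir
done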
Want to see VERSA in action? Try our interactive demo.
VERSA is open-source and community-driven. We welcome contributions and feedback through the GitHub repository.
If you find VERSA useful in your research, please cite our papers:
NAACL 2025 Demo Paper:
Shi, J., Shim, H., Tian, J., Arora, S., Wu, H., Petermann, D., Yip, J. Q., Zhang, Y., Tang, Y., Zhang, W., Alharthi, D. S., Huang, Y., Saito, K., Han, J., Zhao, Y., Donahue, C., & Watanabe, S. (2025). VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music. Proceedings of the 2025 Annual Conference of the North American Chapter of the Association for Computational Linguistics – System Demonstration Track. OpenReview
SLT 2024 Paper:
Shi, J., Tian, J., Wu, Y., Jung, J., Yip, J. Q., Masuyama, Y., Chen, W., Wu, Y., Tang, Y., Baali, M., Alharthi, D., Zhang, D., Deng, R., Srivastava, T., Wu, H., Liu, A., Raj, B., Jin, Q., Song, R., & Watanabe, S. (2024). ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs For Audio, Music, and Speech. 2024 IEEE Spoken Language Technology Workshop (SLT), 562-569. DOI: 10.1109/SLT61566.2024.10832289
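For convenience, here is a BibTeX rendering of the two entries above (the keys and abbreviated author lists are our own; double-check against the official records before citing):

@inproceedings{shi2025versa,
  title     = {{VERSA}: A Versatile Evaluation Toolkit for Speech, Audio, and Music},
  author    = {Shi, J. and others},
  booktitle = {Proceedings of the 2025 Annual Conference of the North American Chapter of the Association for Computational Linguistics: System Demonstration Track},
  year      = {2025}
}

@inproceedings{shi2024espnetcodec,
  title     = {{ESPnet-Codec}: Comprehensive Training and Evaluation of Neural Codecs for Audio, Music, and Speech},
  author    = {Shi, J. and others},
  booktitle = {2024 IEEE Spoken Language Technology Workshop (SLT)},
  pages     = {562--569},
  year      = {2024},
  doi       = {10.1109/SLT61566.2024.10832289}
}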
We’re committed to continuously improving VERSA with new metrics, enhanced usability, and expanded documentation. Stay tuned for upcoming features and don’t hesitate to reach out with suggestions!