2025 Papers
- [Summarization] [EMNLP] Summarizing Speech: A Comprehensive Survey. In Proceedings of EMNLP 2025
- [ASR] [APSIPA] Phoneme-grapheme Dictionary-based Prompting for Robust Proper Noun Recognition in Japanese ASR. In Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) 2025
- [SLU&Dialogue] [ASRU] AURA: Agent for Understanding, Reasoning, and Automated Tool Use in Voice-Driven Tasks. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2025
- [Evaluation] [ASRU] VERSA-v2: A Modular and Scalable Toolkit for Speech and Audio Evaluation with Expanded Metrics, Visualization, and LLM Integration. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2025
- [SSL] [ASRU] Evaluating Self-Supervised Speech Models via Text-based LLMs. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2025
- [LID] [ASRU] Geolocation-Aware Robust Spoken Language Identification. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2025
- [Music] [ASRU] Robust Training of Singing Voice Synthesis Using Prior and Posterior Uncertainty. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2025
- [ASR&Diarization] [ASRU] Unifying Diarization, Separation, and ASR with Multi-Speaker Encoder. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2025
- [Tokenizer] [ASRU] PURE Codec: Progressive Unfolding of Residual Entropy for Speech Codec Learning. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2025
- [Compression] [ASRU] SSVD: Structured SVD for Parameter-Efficient Fine-Tuning and Benchmarking under Domain Shift in ASR. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2025
- [SE] [ASRU] Less is More: Data Curation Matters in Scaling Speech Enhancement. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2025
- [ASR] [ASRU] Spiralformer: Low Latency Encoder for Streaming Speech Recognition with Circular Layer Skipping and Early Exiting. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2025
- [SE] [ASRU] URGENT-PK: Perceptually-Aligned Ranking Model Designed for Speech Enhancement Competition. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2025
- [SE&Evaluation] [ASRU] Improving Speech Enhancement with Multi-Metric Supervision from Learned Quality Assessment. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2025
- [Dialogue] [ASRU] Streaming Endpointer for Spoken Dialogue using Neural Audio Codecs and Label-Delayed Training. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2025
- [Speech-LLM] [ASRU] Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2025
- [Audio] [WASPAA] OpenBEATs: A Fully Open-Source General-Purpose Audio Encoder. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2025
- [Audio] [WASPAA] Learning Robust Spatial Representations from Binaural Audio through Feature Distillation. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2025
- [Music&Evaluation] [ISMIR] Aligning Text-to-Music Evaluation with Human Preferences. In Proceedings of ISMIR 2025
- [TTS&Dataset] [Interspeech] The text-to-speech in the wild (TITW) dataset. In Proceedings of Interspeech 2025
- [ASR] [Interspeech] Improving Multilingual Speech Models on ML-SUPERB 2.0: Fine-tuning with Data Augmentation and LID-Aware CTC. In Proceedings of Interspeech 2025
- [Dataset&Dialogue] [Interspeech] Scalable Spontaneous Speech Dataset (SSSD): Crowdsourcing Data Collection to Promote Dialogue Research. In Proceedings of Interspeech 2025
- [SLU&Dialogue] [Interspeech] A Chain-of-Thought Reasoning Approach to E2E Spoken Dialogue Systems with an Open-Source Toolkit. In Proceedings of Interspeech 2025
- [Dataset] [Interspeech] CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset. In Proceedings of Interspeech 2025
- [ASR] [Interspeech] Exploring Linear Variant Transformers and k-NN Memory Inference for Long-Form ASR. In Proceedings of Interspeech 2025
- [AV] [Interspeech] The Multimodal Information Based Speech Processing (MISP) 2025 Challenge: Audio-Visual Diarization and Recognition. In Proceedings of Interspeech 2025
- [Evaluation] [Interspeech] Uni-VERSA: Versatile Evaluation of Speech with a Unified Framework. In Proceedings of Interspeech 2025
- [S2ST] [Interspeech] Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs. In Proceedings of Interspeech 2025
- [SE] [Interspeech] Interspeech 2025 URGENT Speech Enhancement Challenge. In Proceedings of Interspeech 2025
- [Summarization] [Interspeech] Pick and Summarize: Integrating Extractive and Abstractive Speech Summarization. In Proceedings of Interspeech 2025
- [SE] [Interspeech] Lessons Learned from the URGENT 2024 Speech Enhancement Challenge. In Proceedings of Interspeech 2025
- [Compression] [Interspeech] Context-Driven Dynamic Pruning for Large Multi-Modal Foundation Model. In Proceedings of Interspeech 2025
- [Speech-LLM] [Interspeech] OpusLM: A Family of Open Unified Speech Language Models. In Proceedings of Interspeech 2025
- [Health] [Interspeech] Explainable Depression Detection using Masked Hard Instance Mining. In Proceedings of Interspeech 2025
- [ASR] [Interspeech] OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning. In Proceedings of Interspeech 2025
- [ASR] [Interspeech] The ML-SUPERB 2.0 Challenge: Towards Inclusive ASR Benchmarking for All Language Varieties. In Proceedings of Interspeech 2025
- [Tokenizer] [Interspeech] On-device Streaming Discrete Speech Units. In Proceedings of Interspeech 2025
- [Dataset] [Interspeech] GALAXY: A Large-Scale Open-Domain Dataset for Multimodal Learning. In Proceedings of Interspeech 2025
- [Tokenizer] [Interspeech] Differentiable K-means for Fully-optimized Discrete Token-based ASR. In Proceedings of Interspeech 2025
- [ASR] [Interspeech] DYNAC: Dynamic Vocabulary based Non-Autoregressive Contextualization for Speech Recognition. In Proceedings of Interspeech 2025
- [SSL] [Interspeech] DiceHuBERT: Distilling HuBERT with a Self-Supervised Learning Objective. In Proceedings of Interspeech 2025
- [Speech-LLM] [ACL] SIQ: Exterminating Speech Intelligence Quotient Cross Cognitive Levels in Voice Understanding Large Language Models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics 2025
- [ASR&ST] [ICML] OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models. In Proceedings of the International Conference on Machine Learning (ICML) 2025
- [SLU&Dialogue] [NAACL] ESPnet-SDS: Unified Toolkit and Demo for Spoken Dialogue Systems. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2025
- [Evaluation] [NAACL] VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2025
- [Speech-LLM] [NAACL] ESPnet-SpeechLM: An Open Speech Language Model Toolkit. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2025
- [Pronunciation] [NAACL] Leveraging Allophony in Self-Supervised Speech Models for Atypical Pronunciation Assessment. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2025
- [Speech-LLM] [NAACL] VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2025
- [Evaluation] [ICLR] Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks. In Proceedings of the International Conference on Learning Representations (ICLR) 2025
- [Dialogue] [ICLR] Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics. In Proceedings of the International Conference on Learning Representations (ICLR) 2025
- [Compression] [ICLR] Context-aware Dynamic Pruning for Speech Foundation Models. In Proceedings of the International Conference on Learning Representations (ICLR) 2025
- [Speaker] [ICASSP] Speaker-IPL: Unsupervised Learning of Speaker Characteristics with i-Vector based Pseudo-Labels. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025
- [ASR] [ICASSP] Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025
- [ASR] [ICASSP] Hypothesis Clustering and Merging: MultiTalker Speech Recognition with Speaker Token Estimation. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025
- [TTS] [ICASSP] Preference Alignment Improves Language Model-Based TTS. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025
- [SSL] [ICASSP] Exploring Prediction Targets in Masked Pre-Training for Speech Foundation Models. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025
- [Speech-Text] [ICASSP] Bridging Speech and Text Foundation Models with ReShape Attention. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025
- [SSL&ASR] [ICASSP] Investigation of Spatial Self-Supervised Learning and Its Application to Target Speaker Speech Recognition. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025
- [AVSR] [AAAI] Enhancing Audiovisual Speech Recognition through Bifocal Preference Optimization. In Proceedings of the AAAI Conference on Artificial Intelligence 2025