Click on column names to sort.

Searching uses the logical AND of terms, e.g. Smith Interspeech matches all papers by Smith in any Interspeech. The order of terms is not significant.

Use double quotes for exact phrasal matches, e.g. "acoustic features".

Case is ignored.

Diacritics are optional, e.g. lefevre also matches lefèvre (but not vice versa). These matching rules are illustrated in the sketch after these notes.

It can be useful to turn off spell-checking for the search box in your browser preferences.

If you prefer to scroll rather than page through results, increase the number in the "show entries" dropdown.
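For the curious, here is a minimal Python sketch of the matching rules above (AND of terms, quoted phrases, case folding, one-way diacritic folding). It is an illustration only, not the site's actual implementation; the helper names strip_diacritics, parse_query and matches are invented for this example.

    import re
    import unicodedata

    def strip_diacritics(text):
        # Decompose accented characters and drop the combining marks,
        # so "lefèvre" becomes "lefevre".
        decomposed = unicodedata.normalize("NFD", text)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    def parse_query(query):
        # Quoted substrings become single phrasal terms; everything else
        # splits on whitespace. Term order never matters.
        return [phrase or word
                for phrase, word in re.findall(r'"([^"]+)"|(\S+)', query)]

    def matches(entry, query):
        # Case is ignored throughout.
        folded = entry.casefold()
        # Diacritics are folded on the entry side only, so an unaccented
        # query matches an accented entry but not vice versa.
        ascii_folded = strip_diacritics(folded)
        return all(term.casefold() in folded or term.casefold() in ascii_folded
                   for term in parse_query(query))

    # Every term must match (logical AND), in any order:
    paper = 'Example Paper Title Frédéric Lefèvre Interspeech 2021'
    assert matches(paper, 'lefevre interspeech')        # unaccented query matches
    assert matches(paper, 'interspeech lefevre')        # order is irrelevant
    assert matches(paper, '"example paper title"')      # exact phrasal match
    assert not matches(paper, 'lefevre icassp')         # one failing term fails all
    assert not matches('Plain Lefevre Entry', 'lefèvre')  # accented query, plain entry

Folding diacritics on the entry side only is what produces the one-way behaviour noted above.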

Interspeech 2021

Brno, Czechia, 30 August – 3 September 2021. General Chairs: Hynek Heřmanský, Honza Černocký; Technical Chairs: Lukáš Burget, Lori Lamel, Odette Scharenborg, Petr Motlicek.

Speech Synthesis: Other Topics

Conversion of Airborne to Bone-Conducted Speech with Deep Neural Networks Michael Pucher, Thomas Woltron

T5G2P: Using Text-to-Text Transfer Transformer for Grapheme-to-Phoneme Conversion Markéta Řezáčková, Jan Švec, Daniel Tihelka

Evaluating the Extrapolation Capabilities of Neural Vocoders to Extreme Pitch Values Olivier Perrotin, Hussein El Amouri, Gérard Bailly, Thomas Hueber

A Systematic Review and Analysis of Multilingual Data Strategies in Text-to-Speech for Low-Resource Languages Phat Do, Matt Coler, Jelske Dijkstra, Esther Klabbers

Disordered Speech

Acoustic Indicators of Speech Motor Coordination in Adults With and Without Traumatic Brain Injury Tanya Talkar, Nancy Pearl Solomon, Douglas S. Brungart, Stefanie E. Kuchinsky, Megan M. Eitel, Sara M. Lippa, Tracey A. Brickell, Louis M. French, Rael T. Lange, Thomas F. Quatieri

On Modeling Glottal Source Information for Phonation Assessment in Parkinson’s Disease J.C. Vásquez-Correa, Julian Fritsch, J.R. Orozco-Arroyave, Elmar Nöth, Mathew Magimai-Doss

Distortion of Voiced Obstruents for Differential Diagnosis Between Parkinson’s Disease and Multiple System Atrophy Khalid Daoudi, Biswajit Das, Solange Milhé de Saint Victor, Alexandra Foubert-Samier, Anne Pavy-Le Traon, Olivier Rascol, Wassilios G. Meissner, Virginie Woisard

A Study into Pre-Training Strategies for Spoken Language Understanding on Dysarthric Speech Pu Wang, Bagher BabaAli, Hugo Van hamme

EasyCall Corpus: A Dysarthric Speech Dataset Rosanna Turrisi, Arianna Braccia, Marco Emanuele, Simone Giulietti, Maura Pugliatti, Mariachiara Sensi, Luciano Fadiga, Leonardo Badino

Speech Signal Analysis and Representation II

A Benchmark of Dynamical Variational Autoencoders Applied to Speech Spectrogram Modeling Xiaoyu Bie, Laurent Girin, Simon Leglaive, Thomas Hueber, Xavier Alameda-Pineda

Fricative Phoneme Detection Using Deep Neural Networks and its Comparison to Traditional Methods Metehan Yurt, Pavan Kantharaju, Sascha Disch, Andreas Niedermeier, Alberto N. Escalante-B, Veniamin I. Morgenshtern

Identification of F1 and F2 in Speech Using Modified Zero Frequency Filtering RaviShankar Prasad, Mathew Magimai-Doss

Phoneme-to-Audio Alignment with Recurrent Neural Networks for Speaking and Singing Voice Yann Teytaut, Axel Roebel

Feature, Embedding and Neural Architecture for Speaker Recognition

Adaptive Convolutional Neural Network for Text-Independent Speaker Recognition Seong-Hu Kim, Yong-Hwa Park

Bidirectional Multiscale Feature Aggregation for Speaker Verification Jiajun Qi, Wu Guo, Bin Gu

Improving Time Delay Neural Network Based Speaker Recognition with Convolutional Block and Feature Aggregation Methods Yu-Jia Zhang, Yih-Wen Wang, Chia-Ping Chen, Chung-Li Lu, Bo-Cheng Chan

Improving Deep CNN Architectures with Variable-Length Training Samples for Text-Independent Speaker Verification Yanfeng Wu, Junan Zhao, Chenkai Guo, Jing Xu

Binary Neural Network for Speaker Verification Tinglong Zhu, Xiaoyi Qin, Ming Li

Mutual Information Enhanced Training for Speaker Embedding Youzhi Tu, Man-Wai Mak

Y-Vector: Multiscale Waveform Encoder for Speaker Embedding Ge Zhu, Fei Jiang, Zhiyao Duan

Phoneme-Aware and Channel-Wise Attentive Learning for Text Dependent Speaker Verification Yan Liu, Zheng Li, Lin Li, Qingyang Hong

Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding Hongning Zhu, Kong Aik Lee, Haizhou Li

Speech Synthesis: Toward End-to-End Synthesis II

TacoLPCNet: Fast and Stable TTS by Conditioning LPCNet on Mel Spectrogram Predictions Cheng Gong, Longbiao Wang, Ju Zhang, Shaotong Guo, Yuguang Wang, Jianwu Dang

FastPitchFormant: Source-Filter Based Decomposed Modeling for Speech Synthesis Taejun Bak, Jae-Sung Bae, Hanbin Bae, Young-Ik Kim, Hoon-Young Cho

Sequence-to-Sequence Learning for Deep Gaussian Process Based Speech Synthesis Using Self-Attention GP Layer Taiki Nakamura, Tomoki Koriyama, Hiroshi Saruwatari

Phonetic and Prosodic Information Estimation from Texts for Genuine Japanese End-to-End Text-to-Speech Naoto Kakegawa, Sunao Hara, Masanobu Abe, Yusuke Ijima

Information Sieve: Content Leakage Reduction in End-to-End Prosody Transfer for Expressive Speech Synthesis Xudong Dai, Cheng Gong, Longbiao Wang, Kaili Zhang

Deliberation-Based Multi-Pass Speech Synthesis Qingyun Dou, Xixin Wu, Moquan Wan, Yiting Lu, Mark J.F. Gales

Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, R.J. Skerry-Ryan, Yonghui Wu

Transformer-Based Acoustic Modeling for Streaming Speech Synthesis Chunyang Wu, Zhiping Xiu, Yangyang Shi, Ozlem Kalinli, Christian Fuegen, Thilo Koehler, Qing He

PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS Ye Jia, Heiga Zen, Jonathan Shen, Yu Zhang, Yonghui Wu

Speed up Training with Variable Length Inputs by Efficient Batching Strategies Zhenhao Ge, Lakshmish Kaushik, Masanori Omote, Saket Kumar

Speech Enhancement and Intelligibility

Funnel Deep Complex U-Net for Phase-Aware Speech Enhancement Yuhang Sun, Linju Yang, Huifeng Zhu, Jie Hao

Temporal Convolutional Network with Frequency Dimension Adaptive Attention for Speech Enhancement Qiquan Zhang, Qi Song, Aaron Nicolson, Tian Lan, Haizhou Li

Perceptual Contributions of Vowels and Consonant-Vowel Transitions in Understanding Time-Compressed Mandarin Sentences Changjie Pan, Feng Yang, Fei Chen

Transfer Learning for Speech Intelligibility Improvement in Noisy Environments Ritujoy Biswas, Karan Nathwani, Vinayak Abrol

Comparison of Remote Experiments Using Crowdsourcing and Laboratory Experiments on Speech Intelligibility Ayako Yamamoto, Toshio Irino, Kenichi Arai, Shoko Araki, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani

Know Your Enemy, Know Yourself: A Unified Two-Stage Framework for Speech Enhancement Wenzhe Liu, Andong Li, Yuxuan Ke, Chengshi Zheng, Xiaodong Li

Speech Enhancement with Weakly Labelled Data from AudioSet Qiuqiang Kong, Haohe Liu, Xingjian Du, Li Chen, Rui Xia, Yuxuan Wang

Improving Perceptual Quality by Phone-Fortified Perceptual Loss Using Wasserstein Distance for Speech Enhancement Tsun-An Hsieh, Cheng Yu, Szu-Wei Fu, Xugang Lu, Yu Tsao

MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement Szu-Wei Fu, Cheng Yu, Tsun-An Hsieh, Peter Plantinga, Mirco Ravanelli, Xugang Lu, Yu Tsao

A Spectro-Temporal Glimpsing Index (STGI) for Speech Intelligibility Prediction Amin Edraki, Wai-Yip Chan, Jesper Jensen, Daniel Fogerty

Self-Supervised Learning Based Phone-Fortified Speech Enhancement Yuanhang Qiu, Ruili Wang, Satwinder Singh, Zhizhong Ma, Feng Hou

Incorporating Embedding Vectors from a Human Mean-Opinion Score Prediction Model for Monaural Speech Enhancement Khandokar Md. Nayem, Donald S. Williamson

Restoring Degraded Speech via a Modified Diffusion Model Jianwei Zhang, Suren Jayasuriya, Visar Berisha

Spoken Dialogue Systems I

User-Initiated Repetition-Based Recovery in Multi-Utterance Dialogue Systems Hoang Long Nguyen, Vincent Renkens, Joris Pelemans, Srividya Pranavi Potharaju, Anil Kumar Nalamalapu, Murat Akbacak

Self-Supervised Dialogue Learning for Spoken Conversational Question Answering Nuo Chen, Chenyu You, Yuexian Zou

Act-Aware Slot-Value Predicting in Multi-Domain Dialogue State Tracking Ruolin Su, Ting-Wei Wu, Biing-Hwang Juang

Dialogue Situation Recognition for Everyday Conversation Using Multimodal Information Yuya Chiba, Ryuichiro Higashinaka

Neural Spoken-Response Generation Using Prosodic and Linguistic Context for Conversational Systems Yoshihiro Yamazaki, Yuya Chiba, Takashi Nose, Akinori Ito

Semantic Transportation Prototypical Network for Few-Shot Intent Detection Weiyuan Xu, Peilin Zhou, Chenyu You, Yuexian Zou

Domain-Specific Multi-Agent Dialog Policy Learning in Multi-Domain Task-Oriented Scenarios Li Tang, Yuke Si, Longbiao Wang, Jianwu Dang

Leveraging ASR N-Best in Deep Entity Retrieval Haoyu Wang, John Chen, Majid Laali, Kevin Durda, Jeff King, William Campbell, Yang Liu

Topics in ASR: Robustness, Feature Extraction, and Far-Field ASR

End-to-End Spelling Correction Conditioned on Acoustic Feature for Code-Switching Speech Recognition Shuai Zhang, Jiangyan Yi, Zhengkun Tian, Ye Bai, Jianhua Tao, Xuefei Liu, Zhengqi Wen

Phoneme Recognition Through Fine Tuning of Phonetic Representations: A Case Study on Luhya Language Varieties Kathleen Siminyu, Xinjian Li, Antonios Anastasopoulos, David R. Mortensen, Michael R. Marlo, Graham Neubig

Speech Acoustic Modelling Using Raw Source and Filter Components Erfan Loweimi, Zoran Cvetkovic, Peter Bell, Steve Renals

Noise Robust Acoustic Modeling for Single-Channel Speech Recognition Based on a Stream-Wise Transformer Architecture Masakiyo Fujimoto, Hisashi Kawai

IR-GAN: Room Impulse Response Generator for Far-Field Speech Recognition Anton Ratnarajah, Zhenyu Tang, Dinesh Manocha

Scaling Sparsemax Based Channel Selection for Speech Recognition with ad-hoc Microphone Arrays Junqi Chen, Xiao-Lei Zhang

Multi-Channel Transformer Transducer for Speech Recognition Feng-Ju Chang, Martin Radfar, Athanasios Mouchtaris, Maurizio Omologo

Data Augmentation Methods for End-to-End Speech Recognition on Distant-Talk Scenarios Emiru Tsunoo, Kentaro Shibata, Chaitanya Narisetty, Yosuke Kashiwagi, Shinji Watanabe

Leveraging Phone Mask Training for Phonetic-Reduction-Robust E2E Uyghur Speech Recognition Guodong Ma, Pengfei Hu, Jian Kang, Shen Huang, Hao Huang

Rethinking Evaluation in ASR: Are Our Models Robust Enough? Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Paden Tomasello, Jacob Kahn, Gilad Avidov, Ronan Collobert, Gabriel Synnaeve

Raw Waveform Encoder with Multi-Scale Globally Attentive Locally Recurrent Networks for End-to-End Speech Recognition Max W.Y. Lam, Jun Wang, Chao Weng, Dan Su, Dong Yu

Voice Activity Detection and Keyword Spotting

Attention-Based Cross-Modal Fusion for Audio-Visual Voice Activity Detection in Musical Video Streams Yuanbo Hou, Zhesong Yu, Xia Liang, Xingjian Du, Bilei Zhu, Zejun Ma, Dick Botteldooren

Noise-Tolerant Self-Supervised Learning for Audio-Visual Voice Activity Detection Ui-Hyun Kim

Noisy Student-Teacher Training for Robust Keyword Spotting Hyun-Jin Park, Pai Zhu, Ignacio Lopez Moreno, Niranjan Subrahmanya

Multi-Channel VAD for Transcription of Group Discussion Osamu Ichikawa, Kaito Nakano, Takahiro Nakayama, Hajime Shirouzu

Audio-Visual Information Fusion Using Cross-Modal Teacher-Student Learning for Voice Activity Detection in Realistic Environments Hengshun Zhou, Jun Du, Hang Chen, Zijun Jing, Shifu Xiong, Chin-Hui Lee

Enrollment-Less Training for Personalized Voice Activity Detection Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura

Voice Activity Detection for Live Speech of Baseball Game Based on Tandem Connection with Speech/Noise Separation Model Yuto Nonaka, Chee Siang Leow, Akio Kobayashi, Takehito Utsuro, Hiromitsu Nishizaki

FastICARL: Fast Incremental Classifier and Representation Learning with Efficient Budget Allocation in Audio Sensing Applications Young D. Kwon, Jagmohan Chauhan, Cecilia Mascolo

End-to-End Transformer-Based Open-Vocabulary Keyword Spotting with Location-Guided Local Attention Bo Wei, Meirong Yang, Tao Zhang, Xiao Tang, Xing Huang, Kyuhong Kim, Jaeyun Lee, Kiho Cho, Sung-Un Park

Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velázquez, Najim Dehak

A Lightweight Framework for Online Voice Activity Detection in the Wild Xuenan Xu, Heinrich Dinkel, Mengyue Wu, Kai Yu

Voice and Voicing

“See what I mean, huh?” Evaluating Visual Inspection of F0 Tracking in Nasal Grunts Aurélie Chlébowski, Nicolas Ballier

System Performance as a Function of Calibration Methods, Sample Size and Sampling Variability in Likelihood Ratio-Based Forensic Voice Comparison Bruce Xiao Wang, Vincent Hughes

Voicing Assimilations by French Speakers of German in Stop-Fricative Sequences Anne Bonneau

The Four-Way Classification of Stops with Voicing and Aspiration for Non-Native Speech Evaluation Titas Chakraborty, Vaishali Patil, Preeti Rao

Acoustic and Prosodic Correlates of Emotions in Urdu Speech Saba Urooj, Benazir Mumtaz, Sarmad Hussain, Ehsan ul Haq

Voicing Contrasts in the Singleton Stops of Palestinian Arabic: Production and Perception Nour Tamim, Silke Hamann

A Comparison of the Accuracy of Dissen and Keshet’s (2016) DeepFormants and Traditional LPC Methods for Semi-Automatic Speaker Recognition Thomas Coy, Vincent Hughes, Philip Harrison, Amelia J. Gully

MAP Adaptation Characteristics in Forensic Long-Term Formant Analysis Michael Jessen

Cross-Linguistic Speaker Individuality of Long-Term Formant Distributions: Phonetic and Forensic Perspectives Justin J.H. Lo

Sound Change in Spontaneous Bilingual Speech: A Corpus Study on the Cantonese n-l Merger in Cantonese-English Bilinguals Rachel Soo, Khia A. Johnson, Molly Babel

Characterizing Voiced and Voiceless Nasals in Mizo Wendy Lalhminghlui, Priyankoo Sarmah

The INTERSPEECH 2021 Computational Paralinguistics Challenge (ComParE) — COVID-19 Cough, COVID-19 Speech, Escalation & Primates

The INTERSPEECH 2021 Computational Paralinguistics Challenge: COVID-19 Cough, COVID-19 Speech, Escalation & Primates Björn W. Schuller, Anton Batliner, Christian Bergler, Cecilia Mascolo, Jing Han, Iulia Lefter, Heysem Kaya, Shahin Amiriparian, Alice Baird, Lukas Stappen, Sandra Ottl, Maurice Gerczuk, Panagiotis Tzirakis, Chloë Brown, Jagmohan Chauhan, Andreas Grammenos, Apinan Hasthanasombat, Dimitris Spathis, Tong Xia, Pietro Cicuta, Leon J.M. Rothkrantz, Joeri A. Zwerts, Jelle Treep, Casper S. Kaandorp

Transfer Learning-Based Cough Representations for Automatic Detection of COVID-19 Rubén Solera-Ureña, Catarina Botelho, Francisco Teixeira, Thomas Rolland, Alberto Abad, Isabel Trancoso

The Phonetic Footprint of COVID-19? P. Klumpp, T. Bocklet, T. Arias-Vergara, J.C. Vásquez-Correa, P.A. Pérez-Toro, S.P. Bayerl, J.R. Orozco-Arroyave, Elmar Nöth

Transfer Learning and Data Augmentation Techniques to the COVID-19 Identification Tasks in ComParE 2021 Edresson Casanova, Arnaldo Candido Jr., Ricardo Corso Fernandes Jr., Marcelo Finger, Lucas Rafael Stefanel Gris, Moacir Antonelli Ponti, Daniel Peixoto Pinto da Silva

Visual Transformers for Primates Classification and Covid Detection Steffen Illium, Robert Müller, Andreas Sedlmeier, Claudia Linnhoff-Popien

Deep-Learning-Based Central African Primate Species Classification with MixUp and SpecAugment Thomas Pellegrini

A Deep and Recurrent Architecture for Primate Vocalization Classification Robert Müller, Steffen Illium, Claudia Linnhoff-Popien

Introducing a Central African Primate Vocalisation Dataset for Automated Species Classification Joeri A. Zwerts, Jelle Treep, Casper S. Kaandorp, Floor Meewis, Amparo C. Koot, Heysem Kaya

Multi-Attentive Detection of the Spider Monkey Whinny in the (Actual) Wild Georgios Rizos, Jenna Lawson, Zhuoda Han, Duncan Butler, James Rosindell, Krystian Mikolajczyk, Cristina Banks-Leite, Björn W. Schuller

Identifying Conflict Escalation and Primates by Using Ensemble X-Vectors and Fisher Vector Features José Vicente Egas-López, Mercedes Vetráb, László Tóth, Gábor Gosztolya

Ensemble-Within-Ensemble Classification for Escalation Prediction from Speech Oxana Verkholyak, Denis Dresvyanskiy, Anastasia Dvoynikova, Denis Kotov, Elena Ryumina, Alena Velichko, Danila Mamontov, Wolfgang Minker, Alexey Karpov

Analysis by Synthesis: Using an Expressive TTS Model as Feature Extractor for Paralinguistic Speech Classification Dominik Schiller, Silvan Mertes, Pol van Rijn, Elisabeth André

Survey Talk 1: Heidi Christensen

Towards Automatic Speech Recognition for People with Atypical Speech Heidi Christensen

Embedding and Network Architecture for Speaker Recognition

Leveraging Speaker Attribute Information Using Multi Task Learning for Speaker Verification and Diarization Chau Luu, Peter Bell, Steve Renals

Spine2Net: SpineNet with Res2Net and Time-Squeeze-and-Excitation Blocks for Speaker Recognition Magdalena Rybicka, Jesús Villalba, Piotr Żelasko, Najim Dehak, Konrad Kowalczyk

Speaker Embeddings by Modeling Channel-Wise Correlations Themos Stafylakis, Johan Rohdin, Lukáš Burget

Multi-Task Neural Network for Robust Multiple Speaker Embedding Extraction Weipeng He, Petr Motlicek, Jean-Marc Odobez

ICSpk: Interpretable Complex Speaker Embedding Extractor from Raw Waveform Junyi Peng, Xiaoyang Qu, Jianzong Wang, Rongzhi Gu, Jing Xiao, Lukáš Burget, Jan Černocký

Speech Perception I

Prosodic Disambiguation Using Chironomic Stylization of Intonation with Native and Non-Native Speakers Xiao Xiao, Nicolas Audibert, Grégoire Locqueville, Christophe d'Alessandro, Barbara Kuhnert, Claire Pillot-Loiseau

Variation in Perceptual Sensitivity and Compensation for Coarticulation Across Adult and Child Naturally-Produced and TTS Voices Aleese Block, Michelle Cohn, Georgia Zellou

Extracting Different Levels of Speech Information from EEG Using an LSTM-Based Model Mohammad Jalilpour Monesi, Bernd Accou, Tom Francart, Hugo Van hamme

Word Competition: An Entropy-Based Approach in the DIANA Model of Human Word Comprehension Louis ten Bosch, Lou Boves

Time-to-Event Models for Analyzing Reaction Time Sequences Louis ten Bosch, Lou Boves

Models of Reaction Times in Auditory Lexical Decision: RTonset versus RToffset Sophie Brand, Kimberley Mulder, Louis ten Bosch, Lou Boves

Acoustic Event Detection and Acoustic Scene Classification

SpecMix: A Mixed Sample Data Augmentation Method for Training with Time-Frequency Domain Features Gwantae Kim, David K. Han, Hanseok Ko

SpecAugment++: A Hidden Space Data Augmentation Method for Acoustic Scene Classification Helin Wang, Yuexian Zou, Wenwu Wang

An Effective Mutual Mean Teaching Based Domain Adaptation Method for Sound Event Detection Xu Zheng, Yan Song, Li-Rong Dai, Ian McLoughlin, Lin Liu

Acoustic Scene Classification Using Kervolution-Based SubSpectralNet Ritika Nandi, Shashank Shekhar, Manjunath Mulimani

Event Specific Attention for Polyphonic Sound Event Detection Harshavardhan Sundar, Ming Sun, Chao Wang

AST: Audio Spectrogram Transformer Yuan Gong, Yu-An Chung, James Glass

Shallow Convolution-Augmented Transformer with Differentiable Neural Computer for Low-Complexity Classification of Variable-Length Acoustic Scene Soonshin Seo, Donghyun Lee, Ji-Hwan Kim

An Evaluation of Data Augmentation Methods for Sound Scene Geotagging Helen L. Bear, Veronica Morfi, Emmanouil Benetos

Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers Chiori Hori, Takaaki Hori, Jonathan Le Roux

Variational Information Bottleneck for Effective Low-Resource Audio Classification Shijing Si, Jianzong Wang, Huiming Sun, Jianhan Wu, Chuanyao Zhang, Xiaoyang Qu, Ning Cheng, Lei Chen, Jing Xiao

Improving Weakly Supervised Sound Event Detection with Self-Supervised Auxiliary Tasks Soham Deshmukh, Bhiksha Raj, Rita Singh

Acoustic Event Detection with Classifier Chains Tatsuya Komatsu, Shinji Watanabe, Koichi Miyazaki, Tomoki Hayashi

Diverse Modes of Speech Acquisition and Processing

Segment and Tone Production in Continuous Speech of Hearing and Hearing-Impaired Children Shu-Chuan Tseng, Yi-Fen Liu

Effect of Carrier Bandwidth on Understanding Mandarin Sentences in Simulated Electric-Acoustic Hearing Feng Wang, Jing Chen, Fei Chen

A Comparative Study of Different EMG Features for Acoustics-to-EMG Mapping Manthan Sharma, Navaneetha Gaddam, Tejas Umesh, Aditya Murthy, Prasanta Kumar Ghosh

Image-Based Assessment of Jaw Parameters and Jaw Kinematics for Articulatory Simulation: Preliminary Results Ajish K. Abraham, V. Sivaramakrishnan, N. Swapna, N. Manohar

An Attention Self-Supervised Contrastive Learning Based Three-Stage Model for Hand Shape Feature Representation in Cued Speech Jianrong Wang, Nan Gu, Mei Yu, Xuewei Li, Qiang Fang, Li Liu

Remote Smartphone-Based Speech Collection: Acceptance and Barriers in Individuals with Major Depressive Disorder Judith Dineley, Grace Lavelle, Daniel Leightley, Faith Matcham, Sara Siddi, Maria Teresa Peñarrubia-María, Katie M. White, Alina Ivan, Carolin Oetzmann, Sara Simblett, Erin Dawe-Lane, Stuart Bruce, Daniel Stahl, Yatharth Ranjan, Zulqarnain Rashid, Pauline Conde, Amos A. Folarin, Josep Maria Haro, Til Wykes, Richard J.B. Dobson, Vaibhav A. Narayan, Matthew Hotopf, Björn W. Schuller, Nicholas Cummins, The RADAR-CNS Consortium

An Automatic, Simple Ultrasound Biofeedback Parameter for Distinguishing Accurate and Misarticulated Rhotic Syllables Sarah R. Li, Colin T. Annand, Sarah Dugan, Sarah M. Schwab, Kathryn J. Eary, Michael Swearengen, Sarah Stack, Suzanne Boyce, Michael A. Riley, T. Douglas Mast

Silent versus Modal Multi-Speaker Speech Recognition from Ultrasound and Video Manuel Sam Ribeiro, Aciel Eshky, Korin Richmond, Steve Renals

RaSSpeR: Radar-Based Silent Speech Recognition David Ferreira, Samuel Silva, Francisco Curado, António Teixeira

Investigating Speech Reconstruction for Laryngectomees for Silent Speech Interfaces Beiming Cao, Nordine Sebkhi, Arpan Bhavsar, Omer T. Inan, Robin Samlan, Ted Mau, Jun Wang

Multi-Channel Speech Enhancement and Hearing Aids

LACOPE: Latency-Constrained Pitch Estimation for Speech Enhancement Hendrik Schröter, Tobias Rosenkranz, Alberto N. Escalante-B, Andreas Maier

Alpha-Stable Autoregressive Fast Multichannel Nonnegative Matrix Factorization for Joint Speech Enhancement and Dereverberation Mathieu Fontaine, Kouhei Sekiguchi, Aditya Arie Nugraha, Yoshiaki Bando, Kazuyoshi Yoshii

Microphone Array Generalization for Multichannel Narrowband Deep Speech Enhancement Siyuan Zhang, Xiaofei Li

Multiple Sound Source Localization Based on Interchannel Phase Differences in All Frequencies with Spectral Masks Hyungchan Song, Jong Won Shin

Cancellation of Local Competing Speaker with Near-Field Localization for Distributed ad-hoc Sensor Network Pablo Pérez Zarazaga, Mariem Bouafif Mansali, Tom Bäckström, Zied Lachiri

A Deep Learning Method to Multi-Channel Active Noise Control Hao Zhang, DeLiang Wang

Clarity-2021 Challenges: Machine Learning Challenges for Advancing Hearing Aid Processing Simone Graetzer, Jon Barker, Trevor J. Cox, Michael Akeroyd, John F. Culling, Graham Naylor, Eszter Porter, Rhoddy Viveros Muñoz

Optimising Hearing Aid Fittings for Speech in Noise with a Differentiable Hearing Loss Model Zehai Tu, Ning Ma, Jon Barker

Explaining Deep Learning Models for Speech Enhancement Sunit Sivasankaran, Emmanuel Vincent, Dominique Fohr

Minimum-Norm Differential Beamforming for Linear Array with Directional Microphones Weilong Huang, Jinwei Feng

Self-Supervision and Semi-Supervision for Neural ASR Training

Improving Streaming Transformer Based ASR Under a Framework of Self-Supervised Learning Songjun Cao, Yueteng Kang, Yanzhe Fu, Xiaoshuo Xu, Sining Sun, Yike Zhang, Long Ma

wav2vec-C: A Self-Supervised Model for Speech Representation Learning Samik Sadhu, Di He, Che-Wei Huang, Sri Harish Mallidi, Minhua Wu, Ariya Rastrow, Andreas Stolcke, Jasha Droppo, Roland Maas

On the Learning Dynamics of Semi-Supervised Training for ASR Electra Wallington, Benji Kershenbaum, Ondřej Klejch, Peter Bell

Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve, Michael Auli

Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition Yosuke Higuchi, Niko Moritz, Jonathan Le Roux, Takaaki Hori

A Comparison of Supervised and Unsupervised Pre-Training of End-to-End Models Ananya Misra, Dongseong Hwang, Zhouyuan Huo, Shefali Garg, Nikhil Siddhartha, Arun Narayanan, Khe Chai Sim

Semi-Supervision in ASR: Sequential MixMatch and Factorized TTS-Based Augmentation Zhehuai Chen, Andrew Rosenberg, Yu Zhang, Heiga Zen, Mohammadreza Ghodsi, Yinghui Huang, Jesse Emond, Gary Wang, Bhuvana Ramabhadran, Pedro J. Moreno

slimIPL: Language-Model-Free Iterative Pseudo-Labeling Tatiana Likhomanenko, Qiantong Xu, Jacob Kahn, Gabriel Synnaeve, Ronan Collobert

Phonetically Motivated Self-Supervised Speech Representation Learning Xianghu Yue, Haizhou Li

Improving RNN-T for Domain Scaling Using Semi-Supervised Training with Neural TTS Yan Deng, Rui Zhao, Zhong Meng, Xie Chen, Bing Liu, Jinyu Li, Yifan Gong, Lei He

Spoken Language Processing I

Speaker-Conversation Factorial Designs for Diarization Error Analysis Scott Seyfarth, Sundararajan Srinivasan, Katrin Kirchhoff

SmallER: Scaling Neural Entity Resolution for Edge Devices Ross McGowan, Jinru Su, Vince DiCocco, Thejaswi Muniyappa, Grant P. Strimel

Disfluency Detection with Unlabeled Data and Small BERT Models Johann C. Rocholl, Vicky Zayats, Daniel D. Walker, Noah B. Murad, Aaron Schneider, Daniel J. Liebling

Discriminative Self-Training for Punctuation Prediction Qian Chen, Wen Wang, Mengzhe Chen, Qinglin Zhang

Zero-Shot Joint Modeling of Multiple Spoken-Text-Style Conversion Tasks Using Switching Tokens Mana Ihori, Naoki Makishima, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura

A Noise Robust Method for Word-Level Pronunciation Assessment Binghuai Lin, Liyuan Wang

Targeted Keyword Filtering for Accelerated Spoken Topic Identification Jonathan Wintrode

Multimodal Speech Summarization Through Semantic Concept Learning Shruti Palaskar, Ruslan Salakhutdinov, Alan W. Black, Florian Metze

Enhancing Semantic Understanding with Self-Supervised Methods for Abstractive Dialogue Summarization Hyunjae Lee, Jaewoong Yun, Hyunjin Choi, Seongho Joe, Youngjune L. Gwon

Speaker Transition Patterns in Three-Party Conversation: Evidence from English, Estonian and Swedish Marcin Włodarczak, Emer Gilmartin

Voice Conversion and Adaptation II

Investigating Deep Neural Structures and their Interpretability in the Domain of Voice Conversion Samuel J. Broughton, Md. Asif Jalal, Roger K. Moore

Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-Stage Sequence-to-Sequence Training Kun Zhou, Berrak Sisman, Haizhou Li

Adversarial Voice Conversion Against Neural Spoofing Detectors Yi-Yang Ding, Li-Juan Liu, Yu Hu, Zhen-Hua Ling

An Improved StarGAN for Emotional Voice Conversion: Enhancing Voice Quality and Data Augmentation Xiangheng He, Junjie Chen, Georgios Rizos, Björn W. Schuller

TVQVC: Transformer Based Vector Quantized Variational Autoencoder with CTC Loss for Voice Conversion Ziyi Chen, Pengyuan Zhang

Enriching Source Style Transfer in Recognition-Synthesis Based Non-Parallel Voice Conversion Zhichao Wang, Xinyong Zhou, Fengyu Yang, Tao Li, Hongqiang Du, Lei Xie, Wendong Gan, Haitao Chen, Hai Li

S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations Jheng-hao Lin, Yist Y. Lin, Chung-Ming Chien, Hung-yi Lee

An Exemplar Selection Algorithm for Native-Nonnative Voice Conversion Christopher Liberatore, Ricardo Gutierrez-Osuna

Adversarially Learning Disentangled Speech Representations for Robust Multi-Factor Voice Conversion Jie Wang, Jingbei Li, Xintao Zhao, Zhiyong Wu, Shiyin Kang, Helen Meng

Many-to-Many Voice Conversion Based Feature Disentanglement Using Variational Autoencoder Manh Luong, Viet Anh Tran

Privacy-Preserving Machine Learning for Audio & Speech Processing

Privacy-Preserving Voice Anti-Spoofing Using Secure Multi-Party Computation Oubaïda Chouchane, Baptiste Brossier, Jorge Esteban Gamboa Gamboa, Thomas Lardy, Hemlata Tak, Orhan Ermis, Madhu R. Kamble, Jose Patino, Nicholas Evans, Melek Önen, Massimiliano Todisco

Configurable Privacy-Preserving Automatic Speech Recognition Ranya Aloufi, Hamed Haddadi, David Boyle

Adjunct-Emeritus Distillation for Semi-Supervised Language Model Adaptation Scott Novotney, Yile Gu, Ivan Bulyko

Communication-Efficient Agnostic Federated Averaging Jae Ro, Mingqing Chen, Rajiv Mathews, Mehryar Mohri, Ananda Theertha Suresh

Privacy-Preserving Feature Extraction for Cloud-Based Wake Word Verification Timm Koppelmann, Alexandru Nelus, Lea Schönherr, Dorothea Kolossa, Rainer Martin

PATE-AAE: Incorporating Adversarial Autoencoder into Private Aggregation of Teacher Ensembles for Spoken Command Classification Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee

Continual Learning for Fake Audio Detection Haoxin Ma, Jiangyan Yi, Jianhua Tao, Ye Bai, Zhengkun Tian, Chenglong Wang

Evaluating the Vulnerability of End-to-End Automatic Speech Recognition Models to Membership Inference Attacks Muhammad A. Shah, Joseph Szurley, Markus Mueller, Athanasios Mouchtaris, Jasha Droppo

SynthASR: Unlocking Synthetic Data for Speech Recognition Amin Fazel, Wei Yang, Yulan Liu, Roberto Barra-Chicote, Yixiong Meng, Roland Maas, Jasha Droppo

The First DiCOVA Challenge: Diagnosis of COVID-19 Using Acoustics

DiCOVA Challenge: Dataset, Task, and Baseline System for COVID-19 Diagnosis Using Acoustics Ananya Muguli, Lancelot Pinto, Nirmala R, Neeraj Sharma, Prashant Krishnan, Prasanta Kumar Ghosh, Rohit Kumar, Shrirama Bhat, Srikanth Raj Chetupalli, Sriram Ganapathy, Shreyas Ramoji, Viral Nanda

PANACEA Cough Sound-Based Diagnosis of COVID-19 for the DiCOVA 2021 Challenge Madhu R. Kamble, Jose A. Gonzalez-Lopez, Teresa Grau, Juan M. Espin, Lorenzo Cascioli, Yiqing Huang, Alejandro Gomez-Alanis, Jose Patino, Roberto Font, Antonio M. Peinado, Angel M. Gomez, Nicholas Evans, Maria A. Zuluaga, Massimiliano Todisco

Recognising Covid-19 from Coughing Using Ensembles of SVMs and LSTMs with Handcrafted and Deep Audio Features Vincent Karas, Björn W. Schuller

Detecting COVID-19 from Audio Recording of Coughs Using Random Forests and Support Vector Machines Isabella Södergren, Maryam Pahlavan Nodeh, Prakash Chandra Chhipa, Konstantina Nikolaidou, György Kovács

Diagnosis of COVID-19 Using Auditory Acoustic Cues Rohan Kumar Das, Maulik Madhavi, Haizhou Li

Classification of COVID-19 from Cough Using Autoregressive Predictive Coding Pretraining and Spectral Data Augmentation John Harvill, Yash R. Wani, Mark Hasegawa-Johnson, Narendra Ahuja, David Beiser, David Chestek

The DiCOVA 2021 Challenge — An Encoder-Decoder Approach for COVID-19 Recognition from Coughing Audio Gauri Deshpande, Björn W. Schuller

COVID-19 Detection from Spectral Features on the DiCOVA Dataset Kotra Venkata Sai Ritwik, Shareef Babu Kalluri, Deepu Vijayasenan

Cough-Based COVID-19 Detection with Contextual Attention Convolutional Neural Networks and Gender Information Adria Mallol-Ragolta, Helena Cuesta, Emilia Gómez, Björn W. Schuller

Contrastive Learning of Cough Descriptors for Automatic COVID-19 Preliminary Diagnosis Swapnil Bhosale, Upasana Tiwari, Rupayan Chakraborty, Sunil Kumar Kopparapu

Investigating Feature Selection and Explainability for COVID-19 Diagnostics from Cough Sounds Flavio Avila, Amir H. Poorjam, Deepak Mittal, Charles Dognin, Ananya Muguli, Rohit Kumar, Srikanth Raj Chetupalli, Sriram Ganapathy, Maneesh Singh

Show and Tell 1

Application for Detecting Depression, Parkinson’s Disease and Dysphonic Speech Gábor Kiss, Dávid Sztahó, Miklós Gábriel Tulics

Beey: More Than a Speech-to-Text Editor Lenka Weingartová, Veronika Volná, Ewa Balejová

Downsizing of Vocal-Tract Models to Line up Variations and Reduce Manufacturing Costs Takayuki Arai

ROXANNE Research Platform: Automate Criminal Investigations Maël Fabien, Shantipriya Parida, Petr Motlicek, Dawei Zhu, Aravind Krishnan, Hoang H. Nguyen

The LIUM Human Active Correction Platform for Speaker Diarization Alexandre Flucha, Anthony Larcher, Ambuj Mehrish, Sylvain Meignier, Florian Plaut, Nicolas Poupon, Yevhenii Prokopalo, Adrien Puertolas, Meysam Shamsi, Marie Tahon

On-Device Streaming Transformer-Based End-to-End Speech Recognition Yoo Rhee Oh, Kiyoung Park

Advanced Semi-Blind Speaker Extraction and Tracking Implemented in Experimental Device with Revolving Dense Microphone Array J. Čmejla, T. Kounovský, J. Janský, Jiri Malek, M. Rozkovec, Z. Koldovský

Keynote 1: Hermann Ney

Forty Years of Speech and Language Processing: From Bayes Decision Rule to Deep Learning Hermann Ney

ASR Technologies and Systems

Information Retrieval for ZeroSpeech 2021: The Submission by University of Wroclaw Jan Chorowski, Grzegorz Ciesielski, Jarosław Dzikowski, Adrian Łańcucki, Ricard Marxer, Mateusz Opala, Piotr Pusz, Paweł Rychlikowski, Michał Stypułkowski

Aligned Contrastive Predictive Coding Jan Chorowski, Grzegorz Ciesielski, Jarosław Dzikowski, Adrian Łańcucki, Ricard Marxer, Mateusz Opala, Piotr Pusz, Paweł Rychlikowski, Michał Stypułkowski

Neural Text Denormalization for Speech Transcripts Benjamin Suter, Josef Novak

Fearless Steps Challenge Phase-3 (FSC P3): Advancing SLT for Unseen Channel and Mission Data Across NASA Apollo Audio Aditya Joglekar, Seyed Omid Sadjadi, Meena Chandra-Shekar, Christopher Cieri, John H.L. Hansen

Phonation and Voicing

Voice Quality in Verbal Irony: Electroglottographic Analyses of Ironic Utterances in Standard Austrian German Hannah Leykum

Synchronic Fortition in Five Romance Languages? A Large Corpus-Based Study of Word-Initial Devoicing Mathilde Hutin, Yaru Wu, Adèle Jatteau, Ioana Vasilescu, Lori Lamel, Martine Adda-Decker

Glottal Stops in Upper Sorbian: A Data-Driven Approach Ivan Kraljevski, Maria Paola Bissiri, Frank Duckhorn, Constanze Tschoepe, Matthias Wolff

Cue Interaction in the Perception of Prosodic Prominence: The Role of Voice Quality Bogdan Ludusan, Petra Wagner, Marcin Włodarczak

Glottal Sounds in Korebaju Jenifer Vega Rodriguez, Nathalie Vallée

Automatic Classification of Phonation Types in Spontaneous Speech: Towards a New Workflow for the Characterization of Speakers’ Voice Quality Anaïs Chanclu, Imen Ben Amor, Cédric Gendrot, Emmanuel Ferragne, Jean-François Bonastre

Health and Affect I

Measuring Voice Quality Parameters After Speaker Pseudonymization Rob J.J.H. van Son

Audio-Visual Recognition of Emotional Engagement of People with Dementia Lars Steinert, Felix Putze, Dennis Küster, Tanja Schultz

Speaking Corona? Human and Machine Recognition of COVID-19 from Voice Pascal Hecker, Florian B. Pokorny, Katrin D. Bartl-Pokorny, Uwe Reichel, Zhao Ren, Simone Hantke, Florian Eyben, Dagmar M. Schuller, Bert Arnrich, Björn W. Schuller

Acoustic-Prosodic, Lexical and Demographic Cues to Persuasiveness in Competitive Debate Speeches Huyen Nguyen, Ralph Vente, David Lupea, Sarah Ita Levitan, Julia Hirschberg

Robust Speaker Recognition

Unsupervised Bayesian Adaptation of PLDA for Speaker Verification Bengt J. Borgström

The DKU-Duke-Lenovo System Description for the Fearless Steps Challenge Phase III Weiqing Wang, Danwei Cai, Jin Wang, Qingjian Lin, Xuyang Wang, Mi Hong, Ming Li

Improved Meta-Learning Training for Speaker Verification Yafeng Chen, Wu Guo, Bin Gu

Variational Information Bottleneck Based Regularization for Speaker Recognition Dan Wang, Yuanjie Dong, Yaxing Li, Yunfei Zi, Zhihui Zhang, Xiaoqi Li, Shengwu Xiong

Out of a Hundred Trials, How Many Errors Does Your Speaker Verifier Make? Niko Brümmer, Luciana Ferrer, Albert Swart

SpeakerStew: Scaling to Many Languages with a Triaged Multilingual Text-Dependent and Text-Independent Speaker Verification System Roza Chojnacka, Jason Pelecanos, Quan Wang, Ignacio Lopez Moreno

AntVoice Neural Speaker Embedding System for FFSVC 2020 Zhiming Wang, Furong Xu, Kaisheng Yao, Yuan Cheng, Tao Xiong, Huijia Zhu

Gradient Regularization for Noise-Robust Speaker Verification Jianchen Li, Jiqing Han, Hongwei Song

Deep Feature CycleGANs: Speaker Identity Preserving Non-Parallel Microphone-Telephone Domain Adaptation for Speaker Verification Saurabh Kataria, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velázquez, Najim Dehak

Scaling Effect of Self-Supervised Speech Models Jie Pu, Yuguang Yang, Ruirui Li, Oguz Elibol, Jasha Droppo

Joint Feature Enhancement and Speaker Recognition with Multi-Objective Task-Oriented Network Yibo Wu, Longbiao Wang, Kong Aik Lee, Meng Liu, Jianwu Dang

Multi-Level Transfer Learning from Near-Field to Far-Field Speaker Verification Li Zhang, Qing Wang, Kong Aik Lee, Lei Xie, Haizhou Li

Speaker Anonymisation Using the McAdams Coefficient Jose Patino, Natalia Tomashenko, Massimiliano Todisco, Andreas Nautsch, Nicholas Evans

Source Separation, Dereverberation and Echo Cancellation

Multi-Stream Gated and Pyramidal Temporal Convolutional Neural Networks for Audio-Visual Speech Separation in Multi-Talker Environments Yiyu Luo, Jing Wang, Liang Xu, Lidong Yang

TeCANet: Temporal-Contextual Attention Network for Environment-Aware Speech Dereverberation Helin Wang, Bo Wu, Lianwu Chen, Meng Yu, Jianwei Yu, Yong Xu, Shi-Xiong Zhang, Chao Weng, Dan Su, Dong Yu

Residual Echo and Noise Cancellation with Feature Attention Module and Multi-Domain Loss Function Jianjun Gu, Longbiao Cheng, Xingwei Sun, Junfeng Li, Yonghong Yan

MIMO Self-Attentive RNN Beamformer for Multi-Speaker Speech Separation Xiyun Li, Yong Xu, Meng Yu, Shi-Xiong Zhang, Jiaming Xu, Bo Xu, Dong Yu

Personalized PercepNet: Real-Time, Low-Complexity Target Voice Separation and Enhancement Ritwik Giri, Shrikant Venkataramani, Jean-Marc Valin, Umut Isik, Arvindh Krishnaswamy

Scene-Agnostic Multi-Microphone Speech Dereverberation Yochai Yemini, Ethan Fetaya, Haggai Maron, Sharon Gannot

Manifold-Aware Deep Clustering: Maximizing Angles Between Embedding Vectors Based on Regular Simplex Keitaro Tanaka, Ryosuke Sawata, Shusuke Takahashi

A Deep Learning Approach to Multi-Channel and Multi-Microphone Acoustic Echo Cancellation Hao Zhang, DeLiang Wang

Joint Online Multichannel Acoustic Echo Cancellation, Speech Dereverberation and Source Separation Yueyue Na, Ziteng Wang, Zhang Liu, Biao Tian, Qiang Fu

Should We Always Separate?: Switching Between Enhanced and Observed Signals for Overlapping Speech Recognition Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoyuki Kamo

Speech Signal Analysis and Representation I

Estimating Articulatory Movements in Speech Production with Transformer Networks Sathvik Udupa, Anwesha Roy, Abhayjeet Singh, Aravind Illa, Prasanta Kumar Ghosh

Unsupervised Multi-Target Domain Adaptation for Acoustic Scene Classification Dongchao Yang, Helin Wang, Yuexian Zou

Speech Decomposition Based on a Hybrid Speech Model and Optimal Segmentation Alfredo Esquivel Jaramillo, Jesper Kjær Nielsen, Mads Græsbøll Christensen

Dropout Regularization for Self-Supervised Learning of Transformer Encoder Speech Representation Jian Luo, Jianzong Wang, Ning Cheng, Jing Xiao

Noise Robust Pitch Stylization Using Minimum Mean Absolute Error Criterion Chiranjeevi Yarra, Prasanta Kumar Ghosh

An Attribute-Aligned Strategy for Learning Speech Representation Yu-Lin Huang, Bo-Hao Su, Y.-W. Peter Hong, Chi-Chun Lee

Raw Speech-to-Articulatory Inversion by Temporal Filtering and Decimation Abdolreza Sabzi Shahrebabaki, Sabato Marco Siniscalchi, Torbjørn Svendsen

Unsupervised Training of a DNN-Based Formant Tracker Jason Lilley, H. Timothy Bunnell

SUPERB: Speech Processing Universal PERformance Benchmark Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-yi Lee

Synchronising Speech Segments with Musical Beats in Mandarin and English Singing Cong Zhang, Jian Zhu

FRILL: A Non-Semantic Speech Embedding for Mobile Devices Jacob Peplinski, Joel Shor, Sachin Joglekar, Jake Garrison, Shwetak Patel

Pitch Contour Separation from Overlapping Speech Hiroki Mori

Do Sound Event Representations Generalize to Other Audio Tasks? A Case Study in Audio Transfer Learning Anurag Kumar, Yun Wang, Vamsi Krishna Ithapu, Christian Fuegen

Spoken Language Understanding I

Data Augmentation for Spoken Language Understanding via Pretrained Language Models Baolin Peng, Chenguang Zhu, Michael Zeng, Jianfeng Gao

FANS: Fusing ASR and NLU for On-Device SLU Martin Radfar, Athanasios Mouchtaris, Siegfried Kunzmann, Ariya Rastrow

Sequential End-to-End Intent and Slot Label Classification and Localization Yiran Cao, Nihal Potdar, Anderson R. Avila

DEXTER: Deep Encoding of External Knowledge for Named Entity Recognition in Virtual Assistants Deepak Muralidharan, Joel Ruben Antony Moniz, Weicheng Zhang, Stephen Pulman, Lin Li, Megan Barnes, Jingjing Pan, Jason Williams, Alex Acero

A Context-Aware Hierarchical BERT Fusion Network for Multi-Turn Dialog Act Detection Ting-Wei Wu, Ruolin Su, Biing-Hwang Juang

Pre-Training for Spoken Language Understanding with Joint Textual and Phonetic Representation Learning Qian Chen, Wen Wang, Qinglin Zhang

Predicting Temporal Performance Drop of Deployed Production Spoken Language Understanding Models Quynh Do, Judith Gaspers, Daniil Sorokin, Patrick Lehnen

Integrating Dialog History into End-to-End Spoken Language Understanding Systems Jatin Ganhotra, Samuel Thomas, Hong-Kwang J. Kuo, Sachindra Joshi, George Saon, Zoltán Tüske, Brian Kingsbury

Coreference Augmentation for Multi-Domain Task-Oriented Dialogue State Tracking Ting Han, Chongxuan Huang, Wei Peng

Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding Siddhant Arora, Alissa Ostapenko, Vijay Viswanathan, Siddharth Dalmia, Florian Metze, Shinji Watanabe, Alan W. Black

Topics in ASR: Adaptation, Transfer Learning, Children’s Speech, and Low-Resource Settings

Semantic Data Augmentation for End-to-End Mandarin Speech Recognition Jianwei Sun, Zhiyuan Tang, Hengxin Yin, Wei Wang, Xi Zhao, Shuaijiang Zhao, Xiaoning Lei, Wei Zou, Xiangang Li

Layer-Wise Fast Adaptation for End-to-End Multi-Accent Speech Recognition Xun Gong, Yizhou Lu, Zhikai Zhou, Yanmin Qian

Low Resource German ASR with Untranscribed Data Spoken by Non-Native Children — INTERSPEECH 2021 Shared Task SPAPL System Jinhan Wang, Yunzheng Zhu, Ruchao Fan, Wei Chu, Abeer Alwan

Robust Continuous On-Device Personalization for Automatic Speech Recognition Khe Chai Sim, Angad Chandorkar, Fan Gao, Mason Chua, Tsendsuren Munkhdalai, Françoise Beaufays

Speaker Normalization Using Joint Variational Autoencoder Shashi Kumar, Shakti P. Rath, Abhishek Pandey

The TAL System for the INTERSPEECH2021 Shared Task on Automatic Speech Recognition for Non-Native Children's Speech Gaopeng Xu, Song Yang, Lu Ma, Chengfei Li, Zhongqin Wu

On-the-Fly Aligned Data Augmentation for Sequence-to-Sequence ASR Tsz Kin Lam, Mayumi Ohta, Shigehiko Schamoni, Stefan Riezler

Zero-Shot Cross-Lingual Phonetic Recognition with External Language Embedding Heting Gao, Junrui Ni, Yang Zhang, Kaizhi Qian, Shiyu Chang, Mark Hasegawa-Johnson

Rapid Speaker Adaptation for Conformer Transducer: Attention and Bias Are All You Need Yan Huang, Guoli Ye, Jinyu Li, Yifan Gong

Best of Both Worlds: Robust Accented Speech Recognition with Adversarial Transfer Learning Nilaksh Das, Sravan Bodapati, Monica Sunkara, Sundararajan Srinivasan, Duen Horng Chau

Extending Pronunciation Dictionary with Automatically Detected Word Mispronunciations to Improve PAII’s System for Interspeech 2021 Non-Native Child English Close Track ASR Challenge Wei Chu, Peng Chang, Jing Xiao

Voice Conversion and Adaptation I

CVC: Contrastive Learning for Non-Parallel Voice Conversion Tingle Li, Yichen Liu, Chenxu Hu, Hang Zhao

A Preliminary Study of a Two-Stage Paradigm for Preserving Speaker Identity in Dysarthric Voice Conversion Wen-Chin Huang, Kazuhiro Kobayashi, Yu-Huai Peng, Ching-Feng Liu, Yu Tsao, Hsin-Min Wang, Tomoki Toda

One-Shot Voice Conversion with Speaker-Agnostic StarGAN Sefik Emre Eskimez, Dimitrios Dimitriadis, Kenichi Kumatani, Robert Gmyr

Fine-Tuning Pre-Trained Voice Conversion Model for Adding New Target Speakers with Limited Data Takeshi Koshizuka, Hidefumi Ohmura, Kouichi Katsurada

VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-Shot Voice Conversion Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, Helen Meng

StarGANv2-VC: A Diverse, Unsupervised, Non-Parallel Framework for Natural-Sounding Voice Conversion Yinghao Aaron Li, Ali Zare, Nima Mesgarani

Normalization Driven Zero-Shot Multi-Speaker Speech Synthesis Neeraj Kumar, Srishti Goel, Ankur Narang, Brejesh Lall

StarGAN-VC+ASR: StarGAN-Based Non-Parallel Voice Conversion Regularized by Automatic Speech Recognition Shoki Sakamoto, Akira Taniguchi, Tadahiro Taniguchi, Hirokazu Kameoka

Two-Pathway Style Embedding for Arbitrary Voice Conversion Xuexin Xu, Liang Shi, Jinhui Chen, Xunquan Chen, Jie Lian, Pingyuan Lin, Zhihong Zhang, Edwin R. Hancock

Non-Parallel Any-to-Many Voice Conversion by Replacing Speaker Statistics Yufei Liu, Chengzhu Yu, Wang Shuai, Zhenchuan Yang, Yang Chao, Weibin Zhang

Cross-Lingual Voice Conversion with a Cycle Consistency Loss on Linguistic Representation Yi Zhou, Xiaohai Tian, Zhizheng Wu, Haizhou Li

Improving Robustness of One-Shot Voice Conversion with Deep Discriminative Speaker Encoder Hongqiang Du, Lei Xie

Voice Quality Characterization for Clinical Voice Assessment: Voice Production, Acoustics, and Auditory Perception

Optimizing an Automatic Creaky Voice Detection Method for Australian English Speaking Females Hannah White, Joshua Penney, Andy Gibson, Anita Szakay, Felicity Cox

A Comparison of Acoustic Correlates of Voice Quality Across Different Recording Devices: A Cautionary Tale Joshua Penney, Andy Gibson, Felicity Cox, Michael Proctor, Anita Szakay

Investigating Voice Function Characteristics of Greek Speakers with Hearing Loss Using Automatic Glottal Source Feature Extraction Anna Sfakianaki, George P. Kafentzis

Automated Detection of Voice Disorder in the Saarbrücken Voice Database: Effects of Pathology Subset and Audio Materials Mark Huckvale, Catinca Buciuleac

Accelerometer-Based Measurements of Voice Quality in Children During Semi-Occluded Vocal Tract Exercise with a Narrow Straw in Air Steven M. Lulich, Rita R. Patel

Articulatory Coordination for Speech Motor Tracking in Huntington Disease Matthew Perez, Amrit Romana, Angela Roberts, Noelle Carlozzi, Jennifer Ann Miner, Praveen Dayalu, Emily Mower Provost

Modeling Dysphonia Severity as a Function of Roughness and Breathiness Ratings in the GRBAS Scale Carlos A. Ferrer, Efren Aragón, María E. Hdez-Díaz, Marc S. de Bodt, Roman Cmejla, Marina Englert, Mara Behlau, Elmar Nöth

Miscellaneous Topics in ASR

Golos: Russian Dataset for Speech Research Nikolay Karpov, Alexander Denisenko, Fedor Minkin

Radically Old Way of Computing Spectra: Applications in End-to-End ASR Samik Sadhu, Hynek Hermansky

Self-Supervised End-to-End ASR for Low Resource L2 Swedish Ragheb Al-Ghezi, Yaroslav Getman, Aku Rouhe, Raili Hildén, Mikko Kurimo

SPGISpeech: 5,000 Hours of Transcribed Financial Audio for Fully Formatted End-to-End Speech Recognition Patrick K. O’Neill, Vitaly Lavrukhin, Somshubra Majumdar, Vahid Noroozi, Yuekai Zhang, Oleksii Kuchaiev, Jagadeesh Balam, Yuliya Dovzhenko, Keenan Freyberg, Michael D. Shulman, Boris Ginsburg, Shinji Watanabe, Georg Kucsko

LeBenchmark: A Reproducible Framework for Assessing Self-Supervised Representation Learning from Speech Solène Evain, Ha Nguyen, Hang Le, Marcely Zanon Boito, Salima Mdhaffar, Sina Alisamir, Ziyi Tong, Natalia Tomashenko, Marco Dinarelli, Titouan Parcollet, Alexandre Allauzen, Yannick Estève, Benjamin Lecouteux, François Portet, Solange Rossato, Fabien Ringeval, Didier Schwab, Laurent Besacier

Phonetics I

Prosodic Accommodation in Face-to-Face and Telephone Dialogues Pavel Šturm, Radek Skarnitzl, Tomáš Nechanský

Dialect Features in Heterogeneous and Homogeneous Gheg Speaking Communities Josiane Riverin-Coutlée, Conceição Cunha, Enkeleida Kapia, Jonathan Harrington

An Exploration of the Acoustic Space of Rhotics and Laterals in Ruruuli Margaret Zellers, Alena Witzlack-Makarevich, Lilja Saeboe, Saudah Namyalo

Domain-Initial Strengthening in Turkish: Acoustic Cues to Prosodic Hierarchy in Stop Consonants Kubra Bodur, Sweeney Branje, Morgane Peirolo, Ingrid Tiscareno, James S. German

Target Speaker Detection, Localization and Separation

Auxiliary Loss Function for Target Speech Extraction and Recognition with Weak Supervision Based on Speaker Characteristics Katerina Zmolikova, Marc Delcroix, Desh Raj, Shinji Watanabe, Jan Černocký

Universal Speaker Extraction in the Presence and Absence of Target Speakers for Speech of One and Two Talkers Marvin Borsdorf, Chenglin Xu, Haizhou Li, Tanja Schultz

Using X-Vectors for Speech Activity Detection in Broadcast Streams Lukas Mateju, Frantisek Kynych, Petr Cerva, Jindrich Zdansky, Jiri Malek

Time Delay Estimation for Speaker Localization Using CNN-Based Parametrized GCC-PHAT Features Daniele Salvati, Carlo Drioli, Gian Luca Foresti

Real-Time Speaker Counting in a Cocktail Party Scenario Using Attention-Guided Convolutional Neural Network Midia Yousefi, John H.L. Hansen

Language and Accent Recognition

End-to-End Language Diarization for Bilingual Code-Switching Speech Hexin Liu, Leibny Paola García Perera, Xinyi Zhang, Justin Dauwels, Andy W.H. Khong, Sanjeev Khudanpur, Suzy J. Styles

Modeling and Training Strategies for Language Recognition Systems Raphaël Duroselle, Md. Sahidullah, Denis Jouvet, Irina Illina

A Weight Moving Average Based Alternate Decoupled Learning Algorithm for Long-Tailed Language Identification Hui Wang, Lin Liu, Yan Song, Lei Fang, Ian McLoughlin, Li-Rong Dai

Improving Accent Identification and Accented Speech Recognition Under a Framework of Self-Supervised Learning Keqi Deng, Songjun Cao, Long Ma

Exploring wav2vec 2.0 on Speaker Verification and Language Identification Zhiyun Fan, Meng Li, Shiyu Zhou, Bo Xu

Self-Supervised Phonotactic Representations for Language Identification G. Ramesh, C. Shiva Kumar, K. Sri Rama Murty

E2E-Based Multi-Task Learning Approach to Joint Speech and Accent Recognition Jicheng Zhang, Yizhou Peng, Van Tung Pham, Haihua Xu, Hao Huang, Eng Siong Chng

Excitation Source Feature Based Dialect Identification in Ao — A Low Resource Language Moakala Tzudir, Shikha Baghel, Priyankoo Sarmah, S.R. Mahadeva Prasanna

Low-Resource Speech Recognition

Low Resource ASR: The Surprising Effectiveness of High Resource Transliteration Shreya Khare, Ashish Mittal, Anuj Diwan, Sunita Sarawagi, Preethi Jyothi, Samarth Bharadwaj

Unsupervised Acoustic Unit Discovery by Leveraging a Language-Independent Subword Discriminative Feature Representation Siyuan Feng, Piotr Żelasko, Laureano Moro-Velázquez, Odette Scharenborg

Towards Unsupervised Phone and Word Segmentation Using Self-Supervised Vector-Quantized Neural Networks Herman Kamper, Benjamin van Niekerk

Speech SimCLR: Combining Contrastive and Reconstruction Objective for Self-Supervised Speech Representation Learning Dongwei Jiang, Wubo Li, Miao Cao, Wei Zou, Xiangang Li

Multilingual Transfer of Acoustic Word Embeddings Improves When Training on Languages Related to the Target Zero-Resource Language Christiaan Jacobs, Herman Kamper

Analyzing Speaker Information in Self-Supervised Models to Improve Zero-Resource Speech Processing Benjamin van Niekerk, Leanne Nortje, Matthew Baas, Herman Kamper

Unsupervised Neural-Based Graph Clustering for Variable-Length Speech Representation Discovery of Zero-Resource Languages Shun Takahashi, Sakriani Sakti, Satoshi Nakamura

Speech Representation Learning Combining Conformer CPC with Deep Cluster for the ZeroSpeech Challenge 2021 Takashi Maekaku, Xuankai Chang, Yuya Fujita, Li-Wei Chen, Shinji Watanabe, Alexander Rudnicky

Identifying Indicators of Vulnerability from Short Speech Segments Using Acoustic and Textual Features Xia Cui, Amila Gamage, Terry Hanley, Tingting Mu

The Zero Resource Speech Challenge 2021: Spoken Language Modelling Ewan Dunbar, Mathieu Bernard, Nicolas Hamilakis, Tu Anh Nguyen, Maureen de Seyssel, Patricia Rozé, Morgane Rivière, Eugene Kharitonov, Emmanuel Dupoux

Zero-Shot Federated Learning with New Classes for Audio Classification Gautham Krishna Gudur, Satheesh Kumar Perepu

AVLnet: Learning Audio-Visual Language Representations from Instructional Videos Andrew Rouditchenko, Angie Boggust, David Harwath, Brian Chen, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Hilde Kuehne, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James Glass

Speech Synthesis: Singing, Multimodal, Crosslingual Synthesis

N-Singer: A Non-Autoregressive Korean Singing Voice Synthesis System for Pronunciation Enhancement Gyeong-Hoon Lee, Tae-Woo Kim, Hanbin Bae, Min-Ji Lee, Young-Ik Kim, Hoon-Young Cho

Cross-Lingual Low Resource Speaker Adaptation Using Phonological Features Georgia Maniati, Nikolaos Ellinas, Konstantinos Markopoulos, Georgios Vamvoukakis, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis

Improve Cross-Lingual Text-To-Speech Synthesis on Monolingual Corpora with Pitch Contour Information Haoyue Zhan, Haitong Zhang, Wenjie Ou, Yue Lin

Cross-Lingual Voice Conversion with Disentangled Universal Linguistic Representations Zhenchuan Yang, Weibin Zhang, Yufei Liu, Xiaofen Xing

EfficientSing: A Chinese Singing Voice Synthesis System Using Duration-Free Acoustic Model and HiFi-GAN Vocoder Zhengchen Liu, Chenfeng Miao, Qingying Zhu, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao

Cross-Lingual Speaker Adaptation Using Domain Adaptation and Speaker Consistency Loss for Text-To-Speech Synthesis Detai Xin, Yuki Saito, Shinnosuke Takamichi, Tomoki Koriyama, Hiroshi Saruwatari

Incorporating Cross-Speaker Style Transfer for Multi-Language Text-to-Speech Zengqiang Shang, Zhihua Huang, Haozhe Zhang, Pengyuan Zhang, Yonghong Yan

Investigating Contributions of Speech and Facial Landmarks for Talking Head Generation Ege Kesim, Engin Erzin

Speech2Video: Cross-Modal Distillation for Speech to Video Generation Shijing Si, Jianzong Wang, Xiaoyang Qu, Ning Cheng, Wenqi Wei, Xinghua Zhu, Jing Xiao

Speech Coding and Privacy

NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling Junhyeok Lee, Seungu Han

QISTA-Net-Audio: Audio Super-Resolution via Non-Convex ℓ_q-Norm Minimization Gang-Xuan Lin, Shih-Wei Hu, Yen-Ju Lu, Yu Tsao, Chun-Shien Lu

X-net: A Joint Scale Down and Scale Up Method for Voice Call Liang Wen, Lizhong Wang, Xue Wen, Yuxing Zheng, Youngo Park, Kwang Pyo Choi

WSRGlow: A Glow-Based Waveform Generative Model for Audio Super-Resolution Kexun Zhang, Yi Ren, Changliang Xu, Zhou Zhao

Half-Truth: A Partially Fake Audio Detection Dataset Jiangyan Yi, Ye Bai, Jianhua Tao, Haoxin Ma, Zhengkun Tian, Chenglong Wang, Tao Wang, Ruibo Fu

Data Quality as Predictor of Voice Anti-Spoofing Generalization Bhusan Chettri, Rosa González Hautamäki, Md. Sahidullah, Tomi Kinnunen

Coded Speech Enhancement Using Neural Network-Based Vector-Quantized Residual Features Youngju Cheon, Soojoong Hwang, Sangwook Han, Inseon Jang, Jong Won Shin

Multi-Channel Opus Compression for Far-Field Automatic Speech Recognition with a Fixed Bitrate Budget Lukas Drude, Jahn Heymann, Andreas Schwarz, Jean-Marc Valin

Effects of Prosodic Variations on Accidental Triggers of a Commercial Voice Assistant Ingo Siegert

Improving the Expressiveness of Neural Vocoding with Non-Affine Normalizing Flows Adam Gabryś, Yunlong Jiao, Viacheslav Klimkov, Daniel Korzekwa, Roberto Barra-Chicote

Voice Privacy Through x-Vector and CycleGAN-Based Anonymization Gauri P. Prajapati, Dipesh K. Singh, Preet P. Amin, Hemant A. Patil

A Two-Stage Approach to Speech Bandwidth Extension Ju Lin, Yun Wang, Kaustubh Kalgaonkar, Gil Keren, Didi Zhang, Christian Fuegen

Development of a Psychoacoustic Loss Function for the Deep Neural Network (DNN)-Based Speech Coder Joon Byun, Seungmin Shin, Youngcheol Park, Jongmo Sung, Seungkwon Beack

Protecting Gender and Identity with Disentangled Speech Representations Dimitrios Stoidis, Andrea Cavallaro

Speech Perception II

Perception of Standard Arabic Synthetic Speech Rate Yahya Aldholmi, Rawan Aldhafyan, Asma Alqahtani

The Influence of Parallel Processing on Illusory Vowels Takeshi Kishiyama

Exploring the Potential of Lexical Paraphrases for Mitigating Noise-Induced Comprehension Errors Anupama Chingacham, Vera Demberg, Dietrich Klakow

SpeechAdjuster: A Tool for Investigating Listener Preferences and Speech Intelligibility Olympia Simantiraki, Martin Cooke

VocalTurk: Exploring Feasibility of Crowdsourced Speaker Identification Susumu Saito, Yuta Ide, Teppei Nakano, Tetsuji Ogawa

Effects of Aging and Age-Related Hearing Loss on Talker Discrimination Min Xu, Jing Shao, Lan Wang

Relationships Between Perceptual Distinctiveness, Articulatory Complexity and Functional Load in Speech Communication Yuqing Zhang, Zhu Li, Bin Wu, Yanlu Xie, Binghuai Lin, Jinsong Zhang

Human Spoofing Detection Performance on Degraded Speech Camryn Terblanche, Philip Harrison, Amelia J. Gully

Reliable Estimates of Interpretable Cue Effects with Active Learning in Psycholinguistic Research Marieke Einfeldt, Rita Sevastjanova, Katharina Zahner-Ritter, Ekaterina Kazak, Bettina Braun

Towards the Explainability of Multimodal Speech Emotion Recognition Puneet Kumar, Vishesh Kaushik, Balasubramanian Raman

Primacy of Mouth over Eyes: Eye Movement Evidence from Audiovisual Mandarin Lexical Tones and Vowels Biao Zeng, Rui Wang, Guoxing Yu, Christian Dobel

Investigating the Impact of Spectral and Temporal Degradation on End-to-End Automatic Speech Recognition Performance Takanori Ashihara, Takafumi Moriya, Makio Kashino

Streaming for ASR/RNN Transducers

Super-Human Performance in Online Low-Latency Recognition of Conversational Speech Thai-Son Nguyen, Sebastian Stüker, Alex Waibel

Multiple Softmax Architecture for Streaming Multilingual End-to-End ASR Systems Vikas Joshi, Amit Das, Eric Sun, Rupesh R. Mehta, Jinyu Li, Yifan Gong

Contextualized Streaming End-to-End Speech Recognition with Trie-Based Deep Biasing and Shallow Fusion Duc Le, Mahaveer Jain, Gil Keren, Suyoun Kim, Yangyang Shi, Jay Mahadeokar, Julian Chan, Yuan Shangguan, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Michael L. Seltzer

An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling Tara N. Sainath, Yanzhang He, Arun Narayanan, Rami Botros, Ruoming Pang, David Rybach, Cyril Allauzen, Ehsan Variani, James Qin, Quoc-Nam Le-The, Shuo-Yiin Chang, Bo Li, Anmol Gulati, Jiahui Yu, Chung-Cheng Chiu, Diamantino Caseiro, Wei Li, Qiao Liang, Pat Rondon

Streaming Multi-Talker Speech Recognition with Joint Speaker Identification Liang Lu, Naoyuki Kanda, Jinyu Li, Yifan Gong

Streaming End-to-End Speech Recognition for Hybrid RNN-T/Attention Architecture Takafumi Moriya, Tomohiro Tanaka, Takanori Ashihara, Tsubasa Ochiai, Hiroshi Sato, Atsushi Ando, Ryo Masumura, Marc Delcroix, Taichi Asami

Improving RNN-T ASR Accuracy Using Context Audio Andreas Schwarz, Ilya Sklyar, Simon Wiesler

HMM-Free Encoder Pre-Training for Streaming RNN Transducer Lu Huang, Jingyu Sun, Yufeng Tang, Junfeng Hou, Jinkun Chen, Jun Zhang, Zejun Ma

Reducing Exposure Bias in Training Recurrent Neural Network Transducers Xiaodong Cui, Brian Kingsbury, George Saon, David Haws, Zoltán Tüske

Bridging the Gap Between Streaming and Non-Streaming ASR Systems by Distilling Ensembles of CTC and RNN-T Models Thibault Doutre, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Olivier Siohan, Liangliang Cao

Mixture Model Attention: Flexible Streaming and Non-Streaming Automatic Speech Recognition Kartik Audhkhasi, Tongzhou Chen, Bhuvana Ramabhadran, Pedro J. Moreno

StableEmit: Selection Probability Discount for Reducing Emission Latency of Streaming Monotonic Attention ASR Hirofumi Inaguma, Tatsuya Kawahara

Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition Niko Moritz, Takaaki Hori, Jonathan Le Roux

Multi-Mode Transformer Transducer with Stochastic Future Context Kwangyoun Kim, Felix Wu, Prashant Sridhar, Kyu J. Han, Shinji Watanabe

ConferencingSpeech 2021 Challenge: Far-Field Multi-Channel Speech Enhancement for Video Conferencing

A Causal U-Net Based Neural Beamforming Network for Real-Time Multi-Channel Speech Enhancement Xinlei Ren, Xu Zhang, Lianwu Chen, Xiguang Zheng, Chen Zhang, Liang Guo, Bing Yu

A Partitioned-Block Frequency-Domain Adaptive Kalman Filter for Stereophonic Acoustic Echo Cancellation Rui Zhu, Feiran Yang, Yuepeng Li, Shidong Shang

Real-Time Independent Vector Analysis Using Semi-Supervised Nonnegative Matrix Factorization as a Source Model Taihui Wang, Feiran Yang, Rui Zhu, Jun Yang

Improving Channel Decorrelation for Multi-Channel Target Speech Extraction Jiangyu Han, Wei Rao, Yannan Wang, Yanhua Long

Inplace Gated Convolutional Recurrent Neural Network for Dual-Channel Speech Enhancement Jinjiang Liu, Xueliang Zhang

SRIB-LEAP Submission to Far-Field Multi-Channel Speech Enhancement Challenge for Video Conferencing R.G. Prithvi Raj, Rohit Kumar, M.K. Jayesh, Anurenjan Purushothaman, Sriram Ganapathy, M.A. Basha Shaik

Real-Time Multi-Channel Speech Enhancement Based on Neural Network Masking with Attention Model Cheng Xue, Weilong Huang, Weiguang Chen, Jinwei Feng

Survey Talk 2: Sriram Ganapathy

Uncovering the Acoustic Cues of COVID-19 Infection Sriram Ganapathy

Keynote 2: Pascale Fung

Ethical and Technological Challenges of Conversational AI Pascale Fung

Language Modeling and Text-Based Innovations for ASR

BERT-Based Semantic Model for Rescoring N-Best Speech Recognition List Dominique Fohr, Irina Illina

Text Augmentation for Language Models in High Error Recognition Scenario Karel Beneš, Lukáš Burget

On Sampling-Based Training Criteria for Neural Language Modeling Yingbo Gao, David Thulke, Alexander Gerstenberger, Khoa Viet Tran, Ralf Schlüter, Hermann Ney

Fast Text-Only Domain Adaptation of RNN-Transducer Prediction Network Janne Pylkkönen, Antti Ukkonen, Juho Kilpikoski, Samu Tamminen, Hannes Heikinheimo

Speaker, Language, and Privacy

Using Games to Augment Corpora for Language Recognition and Confusability Christopher Cieri, James Fiumara, Jonathan Wright

Fair Voice Biometrics: Impact of Demographic Imbalance on Group Fairness in Speaker Recognition Gianni Fenu, Mirko Marras, Giacomo Medda, Giacomo Meloni

Knowledge Distillation from Multi-Modality to Single-Modality for Person Verification Leying Zhang, Zhengyang Chen, Yanmin Qian

Adversarial Disentanglement of Speaker Representation for Attribute-Driven Privacy Preservation Paul-Gauthier Noé, Mohammad Mohammadamini, Driss Matrouf, Titouan Parcollet, Andreas Nautsch, Jean-François Bonastre

Assessment of Pathological Speech and Language I

Automatically Detecting Errors and Disfluencies in Read Speech to Predict Cognitive Impairment in People with Parkinson’s Disease Amrit Romana, John Bandon, Matthew Perez, Stephanie Gutierrez, Richard Richter, Angela Roberts, Emily Mower Provost

Automatic Extraction of Speech Rhythm Descriptors for Speech Intelligibility Assessment in the Context of Head and Neck Cancers Robin Vaysse, Jérôme Farinas, Corine Astésano, Régine André-Obrecht

Speech Disorder Classification Using Extended Factorized Hierarchical Variational Auto-Encoders Jinzi Qi, Hugo Van hamme

The Impact of Forced-Alignment Errors on Automatic Pronunciation Evaluation Vikram C. Mathad, Tristan J. Mahr, Nancy Scherer, Kathy Chapman, Katherine C. Hustad, Julie Liss, Visar Berisha

Late Fusion of the Available Lexicon and Raw Waveform-Based Acoustic Modeling for Depression and Dementia Recognition Esaú Villatoro-Tello, S. Pavankumar Dubagunta, Julian Fritsch, Gabriela Ramírez-de-la-Rosa, Petr Motlicek, Mathew Magimai-Doss

Neural Speaker Embeddings for Ultrasound-Based Silent Speech Interfaces Amin Honarmandi Shandiz, László Tóth, Gábor Gosztolya, Alexandra Markó, Tamás Gábor Csapó

Communication and Interaction, Multimodality

Cross-Modal Learning for Audio-Visual Video Parsing Jatin Lamba, Abhishek, Jayaprakash Akula, Rishabh Dabral, Preethi Jyothi, Ganesh Ramakrishnan

A Psychology-Driven Computational Analysis of Political Interviews Darren Cook, Miri Zilka, Simon Maskell, Laurence Alison

Speech Emotion Recognition Based on Attention Weight Correction Using Word-Level Confidence Measure Jennifer Santoso, Takeshi Yamada, Shoji Makino, Kenkichi Ishizuka, Takekatsu Hiramura

Effects of Voice Type and Task on L2 Learners’ Awareness of Pronunciation Errors Alif Silpachai, Ivana Rehman, Taylor Anne Barriuso, John Levis, Evgeny Chukharev-Hudilainen, Guanlong Zhao, Ricardo Gutierrez-Osuna

Lexical Entrainment and Intra-Speaker Variability in Cooperative Dialogues Alla Menshikova, Daniil Kocharov, Tatiana Kachkovskaia

Detecting Alzheimer’s Disease Using Interactional and Acoustic Features from Spontaneous Speech Shamila Nasreen, Julian Hough, Matthew Purver

Investigating the Interplay Between Affective, Phonatory and Motoric Subsystems in Autism Spectrum Disorder Using a Multimodal Dialogue Agent Hardik Kothare, Vikram Ramanarayanan, Oliver Roesler, Michael Neumann, Jackson Liscombe, William Burke, Andrew Cornish, Doug Habberstad, Alaa Sakallah, Sara Markuson, Seemran Kansara, Afik Faerman, Yasmine Bensidi-Slimane, Laura Fry, Saige Portera, David Suendermann-Oeft, David Pautler, Carly Demopoulos

Analysis of Eye Gaze Reasons and Gaze Aversions During Three-Party Conversations Carlos Toshinori Ishi, Taiken Shintani

Language and Lexical Modeling for ASR

Semantic Distance: A New Metric for ASR Performance Analysis Towards Spoken Language Understanding Suyoun Kim, Abhinav Arora, Duc Le, Ching-Feng Yeh, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer

A Light-Weight Contextual Spelling Correction Model for Customizing Transducer-Based Speech Recognition Systems Xiaoqiang Wang, Yanqing Liu, Sheng Zhao, Jinyu Li

Incorporating External POS Tagger for Punctuation Restoration Ning Shi, Wei Wang, Boxin Wang, Jinfeng Li, Xiangyu Liu, Zhouhan Lin

Phonetically Induced Subwords for End-to-End Speech Recognition Vasileios Papadourakis, Markus Müller, Jing Liu, Athanasios Mouchtaris, Maurizio Omologo

Revisiting Parity of Human vs. Machine Conversational Speech Transcription Courtney Mansfield, Sara Ng, Gina-Anne Levow, Richard A. Wright, Mari Ostendorf

Lookup-Table Recurrent Language Models for Long Tail Speech Recognition W. Ronny Huang, Tara N. Sainath, Cal Peyser, Shankar Kumar, David Rybach, Trevor Strohman

Contextual Density Ratio for Language Model Biasing of Sequence to Sequence ASR Systems Jesús Andrés-Ferrer, Dario Albesano, Puming Zhan, Paul Vozila

Token-Level Supervised Contrastive Learning for Punctuation Restoration Qiushi Huang, Tom Ko, H. Lilian Tang, Xubo Liu, Bo Wu

BART Based Semantic Correction for Mandarin Automatic Speech Recognition System Yun Zhao, Xuerui Yang, Jinchao Wang, Yongyu Gao, Chao Yan, Yuanfu Zhou

Class-Based Neural Network Language Model for Second-Pass Rescoring in ASR Lingfeng Dai, Qi Liu, Kai Yu

Improving Customization of Neural Transducers by Mitigating Acoustic Mismatch of Synthesized Audio Gakuto Kurata, George Saon, Brian Kingsbury, David Haws, Zoltán Tüske

A Discriminative Entity-Aware Language Model for Virtual Assistants Mandana Saebi, Ernest Pusateri, Aaksha Meghawat, Christophe Van Gysel

Correcting Automated and Manual Speech Transcription Errors Using Warped Language Models Mahdi Namazifar, John Malik, Li Erran Li, Gokhan Tur, Dilek Hakkani-Tür

Novel Neural Network Architectures for ASR

Dynamic Encoder Transducer: A Flexible Solution for Trading Off Accuracy for Latency Yangyang Shi, Varun Nagaraja, Chunyang Wu, Jay Mahadeokar, Duc Le, Rohit Prabhavalkar, Alex Xiao, Ching-Feng Yeh, Julian Chan, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer

Domain-Aware Self-Attention for Multi-Domain Neural Machine Translation Shiqi Zhang, Yan Liu, Deyi Xiong, Pei Zhang, Boxing Chen

Librispeech Transducer Model with Internal Language Model Prior Correction Albert Zeyer, André Merboldt, Wilfried Michel, Ralf Schlüter, Hermann Ney

A Deliberation-Based Joint Acoustic and Text Decoder Sepand Mavandadi, Tara N. Sainath, Ke Hu, Zelin Wu

On the Limit of English Conversational Speech Recognition Zoltán Tüske, George Saon, Brian Kingsbury

Deformable TDNN with Adaptive Receptive Fields for Speech Recognition Keyu An, Yi Zhang, Zhijian Ou

SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts Zhao You, Shulin Feng, Dan Su, Dong Yu

Online Compressive Transformer for End-to-End Speech Recognition Chi-Hang Leong, Yu-Han Huang, Jen-Tzung Chien

End-to-End Transformer-Based Contextual Speech Recognition Based on Pointer Network Binghuai Lin, Liyuan Wang

A Comparative Study on Neural Architectures and Training Methods for Japanese Speech Recognition Shigeki Karita, Yotaro Kubo, Michiel Adriaan Unico Bacchiani, Llion Jones

Advanced Long-Context End-to-End Speech Recognition Using Context-Expanded Transformers Takaaki Hori, Niko Moritz, Chiori Hori, Jonathan Le Roux

Transformer-Based ASR Incorporating Time-Reduction Layer and Fine-Tuning with Self-Knowledge Distillation Md. Akmal Haidar, Chao Xing, Mehdi Rezagholizadeh

Flexi-Transducer: Optimizing Latency, Accuracy and Compute for Multi-Domain On-Device Scenarios Jay Mahadeokar, Yangyang Shi, Yuan Shangguan, Chunyang Wu, Alex Xiao, Hang Su, Duc Le, Ozlem Kalinli, Christian Fuegen, Michael L. Seltzer

Speech Localization, Enhancement, and Quality Assessment

Difference in Perceived Speech Signal Quality Assessment Among Monolingual and Bilingual Teenage Students Przemyslaw Falkowski-Gilski

PILOT: Introducing Transformers for Probabilistic Sound Event Localization Christopher Schymura, Benedikt Bönninghoff, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Dorothea Kolossa

Sound Source Localization with Majorization Minimization Masahito Togami, Robin Scheibler

NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets Gabriel Mittag, Babak Naderi, Assmaa Chehadi, Sebastian Möller

Subjective Evaluation of Noise Suppression Algorithms in Crowdsourcing Babak Naderi, Ross Cutler

Reliable Intensity Vector Selection for Multi-Source Direction-of-Arrival Estimation Using a Single Acoustic Vector Sensor Jianhua Geng, Sifan Wang, Juan Li, JingWei Li, Xin Lou

MetricNet: Towards Improved Modeling For Non-Intrusive Speech Quality Assessment Meng Yu, Chunlei Zhang, Yong Xu, Shi-Xiong Zhang, Dong Yu

CNN-Based Processing of Acoustic and Radio Frequency Signals for Speaker Localization from MAVs Andrea Toma, Daniele Salvati, Carlo Drioli, Gian Luca Foresti

Assessment of von Mises-Bernoulli Deep Neural Network in Sound Source Localization Katsutoshi Itoyama, Yoshiya Morimoto, Shungo Masaki, Ryosuke Kojima, Kenji Nishida, Kazuhiro Nakadai

Feature Fusion by Attention Networks for Robust DOA Estimation Rongliang Liu, Nengheng Zheng, Xi Chen

Far-Field Speaker Localization and Adaptive GLMB Tracking Shoufeng Lin, Zhaojie Luo

On the Design of Deep Priors for Unsupervised Audio Restoration Vivek Sivaraman Narayanaswamy, Jayaraman J. Thiagarajan, Andreas Spanias

Cramér-Rao Lower Bound for DOA Estimation with an Array of Directional Microphones in Reverberant Environments Weiguang Chen, Cheng Xue, Xionghu Zhong

Speech Synthesis: Neural Waveform Generation

GAN Vocoder: Multi-Resolution Discriminator Is All You Need Jaeseong You, Dalhyun Kim, Gyuhyeon Nam, Geumbyeol Hwang, Gyeongsu Chae

Glow-WaveGAN: Learning Speech Representations from GAN-Based Variational Auto-Encoder for High Fidelity Flow-Based Speech Synthesis Jian Cong, Shan Yang, Lei Xie, Dan Su

Unified Source-Filter GAN: Unified Source-Filter Network Based On Factorization of Quasi-Periodic Parallel WaveGAN Reo Yoneyama, Yi-Chiao Wu, Tomoki Toda

Harmonic WaveGAN: GAN-Based Speech Waveform Generation Model with Harmonic Structure Discriminator Kazuki Mizuta, Tomoki Koriyama, Hiroshi Saruwatari

Fre-GAN: Adversarial Frequency-Consistent Audio Synthesis Ji-Hoon Kim, Sang-Hoon Lee, Ji-Hyun Lee, Seong-Whan Lee

GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis Jinhyeok Yang, Jae-Sung Bae, Taejun Bak, Young-Ik Kim, Hoon-Young Cho

UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, Juntae Kim

Continuous Wavelet Vocoder-Based Decomposition of Parametric Speech Waveform Synthesis Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Csaba Zainkó, Géza Németh

High-Fidelity and Low-Latency Universal Neural Vocoder Based on Multiband WaveRNN with Data-Driven Linear Prediction for Discrete Waveform Modeling Patrick Lumban Tobing, Tomoki Toda

Basis-MelGAN: Efficient Neural Vocoder Based on Audio Decomposition Zhengxi Liu, Yanmin Qian

High-Fidelity Parallel WaveGAN with Multi-Band Harmonic-Plus-Noise Model Min-Jae Hwang, Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim

Spoken Machine Translation

SpecRec: An Alternative Solution for Improving End-to-End Speech-to-Text Translation via Spectrogram Reconstruction Junkun Chen, Mingbo Ma, Renjie Zheng, Liang Huang

Subtitle Translation as Markup Translation Colin Cherry, Naveen Arivazhagan, Dirk Padfield, Maxim Krikun

Large-Scale Self- and Semi-Supervised Learning for Speech Translation Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau

CoVoST 2 and Massively Multilingual Speech Translation Changhan Wang, Anne Wu, Jiatao Gu, Juan Pino

AlloST: Low-Resource Speech Translation Without Source Transcription Yao-Fei Cheng, Hung-Shin Lee, Hsin-Min Wang

Weakly-Supervised Speech-to-Text Mapping with Visually Connected Non-Parallel Speech-Text Data Using Cyclic Partially-Aligned Transformer Johanes Effendi, Sakriani Sakti, Satoshi Nakamura

Transcribing Paralinguistic Acoustic Cues to Target Language Text in Transformer-Based Speech-to-Text Translation Hirotaka Tokuyama, Sakriani Sakti, Katsuhito Sudoh, Satoshi Nakamura

End-to-End Speech Translation via Cross-Modal Progressive Training Rong Ye, Mingxuan Wang, Lei Li

ASR Posterior-Based Loss for Multi-Task End-to-End Speech Translation Yuka Ko, Katsuhito Sudoh, Sakriani Sakti, Satoshi Nakamura

Towards Simultaneous Machine Interpretation Alejandro Pérez-González-de-Martos, Javier Iranzo-Sánchez, Adrià Giménez Pastor, Javier Jorge, Joan-Albert Silvestre-Cerdà, Jorge Civera, Albert Sanchis, Alfons Juan

Lexical Modeling of ASR Errors for Robust Speech Translation Giuseppe Martucci, Mauro Cettolo, Matteo Negri, Marco Turchi

Optimally Encoding Inductive Biases into the Transformer Improves End-to-End Speech Translation Piyush Vyas, Anastasia Kuznetsova, Donald S. Williamson

Effects of Feature Scaling and Fusion on Sign Language Translation Tejaswini Ananthanarayana, Lipisha Chaudhary, Ifeoma Nwogu

SdSV Challenge 2021: Analysis and Exploration of New Ideas on Short-Duration Speaker Verification

The ID R&D System Description for Short-Duration Speaker Verification Challenge 2021 Alexander Alenin, Anton Okhotnikov, Rostislav Makarov, Nikita Torgashov, Ilya Shigabeev, Konstantin Simonchik

Integrating Frequency Translational Invariance in TDNNs and Frequency Positional Information in 2D ResNets to Enhance Speaker Verification Jenthe Thienpondt, Brecht Desplanques, Kris Demuynck

SdSVC Challenge 2021: Tips and Tricks to Boost the Short-Duration Speaker Verification System Performance Aleksei Gusev, Alisa Vinogradova, Sergey Novoselov, Sergei Astapov

Team02 Text-Independent Speaker Verification System for SdSV Challenge 2021 Woo Hyun Kang, Nam Soo Kim

Our Learned Lessons from Cross-Lingual Speaker Verification: The CRMI-DKU System Description for the Short-Duration Speaker Verification Challenge 2021 Xiaoyi Qin, Chao Wang, Yong Ma, Min Liu, Shilei Zhang, Ming Li

Investigation of IMU&Elevoc Submission for the Short-Duration Speaker Verification Challenge 2021 Peng Zhang, Peng Hu, Xueliang Zhang

The Sogou System for Short-Duration Speaker Verification Challenge 2021 Jie Yan, Shengyu Yao, Yiqian Pan, Wei Chen

The SJTU System for Short-Duration Speaker Verification Challenge 2021 Bing Han, Zhengyang Chen, Zhikai Zhou, Yanmin Qian

Show and Tell 2

Multi-Speaker Emotional Text-to-Speech Synthesizer Sungjae Cho, Soo-Young Lee

Live TV Subtitling Through Respeaking Aleš Pražák, Zdeněk Loose, Josef V. Psutka, Vlasta Radová, Josef Psutka, Jan Švec

Autonomous Robot for Measuring Room Impulse Responses Stefan Fragner, Tobias Topar, Maximilian Giller, Lukas Pfeifenberger, Franz Pernkopf

Expressive Robot Performance Based on Facial Motion Capture Jonas Beskow, Charlie Caper, Johan Ehrenfors, Nils Hagberg, Anne Jansen, Chris Wood

ThemePro 2.0: Showcasing the Role of Thematic Progression in Engaging Human-Computer Interaction Mónica Domínguez, Juan Soler-Company, Leo Wanner

Addressing Compliance in Call Centers with Entity Extraction Sai Guruju, Jithendra Vepa

Audio Segmentation Based Conversational Silence Detection for Contact Center Calls Krishnachaitanya Gogineni, Tarun Reddy Yadama, Jithendra Vepa

Graph and End-to-End Learning for Speaker Recognition

Reformulating DOVER-Lap Label Mapping as a Graph Partitioning Problem Desh Raj, Sanjeev Khudanpur

Graph Attention Networks for Anti-Spoofing Hemlata Tak, Jee-weon Jung, Jose Patino, Massimiliano Todisco, Nicholas Evans

Log-Likelihood-Ratio Cost Function as Objective Loss for Speaker Verification Systems Victoria Mingote, Antonio Miguel, Alfonso Ortega, Eduardo Lleida

Effective Phase Encoding for End-To-End Speaker Verification Junyi Peng, Xiaoyang Qu, Rongzhi Gu, Jianzong Wang, Jing Xiao, Lukáš Burget, Jan Černocký

Spoken Language Processing II

Impact of Encoding and Segmentation Strategies on End-to-End Simultaneous Speech Translation Ha Nguyen, Yannick Estève, Laurent Besacier

Lost in Interpreting: Speech Translation from Source or Interpreter? Dominik Macháček, Matúš Žilinec, Ondřej Bojar

Active Speaker Detection as a Multi-Objective Optimization with Uncertainty-Based Multimodal Fusion Baptiste Pouthier, Laurent Pilati, Leela K. Gudupudi, Charles Bouveyron, Frederic Precioso

It’s Not What You Said, it’s How You Said it: Discriminative Perception of Speech as a Multichannel Communication System Sarenne Wallbridge, Peter Bell, Catherine Lai

Speech and Audio Analysis

Extending the Fullband E-Model Towards Background Noise, Bursty Packet Loss, and Conversational Degradations Thilo Michael, Gabriel Mittag, Andreas Bütow, Sebastian Möller

ORCA-SLANG: An Automatic Multi-Stage Semi-Supervised Deep Learning Framework for Large-Scale Killer Whale Call Type Identification Christian Bergler, Manuel Schmitt, Andreas Maier, Helena Symonds, Paul Spong, Steven R. Ness, George Tzanetakis, Elmar Nöth

Audiovisual Transfer Learning for Audio Tagging and Sound Event Detection Wim Boes, Hugo Van hamme

Non-Intrusive Speech Quality Assessment with Transfer Learning and Subject-Specific Scaling Natalia Nessler, Milos Cernak, Paolo Prandoni, Pablo Mainar

Audio Retrieval with Natural Language Queries Andreea-Maria Oncescu, A. Sophia Koepke, João F. Henriques, Zeynep Akata, Samuel Albanie

Cross/Multi-Lingual and Code-Switched ASR

Bootstrap an End-to-End ASR System by Multilingual Training, Transfer Learning, Text-to-Text Mapping and Synthetic Audio Manuel Giollo, Deniz Gunceler, Yulan Liu, Daniel Willett

Efficient Weight Factorization for Multilingual Speech Recognition Ngoc-Quan Pham, Tuan-Nam Nguyen, Sebastian Stüker, Alex Waibel

Unsupervised Cross-Lingual Representation Learning for Speech Recognition Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli

Language and Speaker-Independent Feature Transformation for End-to-End Multilingual Speech Recognition Tomoaki Hayakawa, Chee Siang Leow, Akio Kobayashi, Takehito Utsuro, Hiromitsu Nishizaki

Using Large Self-Supervised Models for Low-Resource Speech Recognition Krishna D. N, Pinyi Wang, Bruno Bozza

Dual Script E2E Framework for Multilingual and Code-Switching ASR Mari Ganesh Kumar, Jom Kuriakose, Anand Thyagachandran, Arun Kumar A, Ashish Seth, Lodagala V.S.V. Durga Prasad, Saish Jaiswal, Anusha Prakash, Hema A. Murthy

MUCS 2021: Multilingual and Code-Switching ASR Challenges for Low Resource Indian Languages Anuj Diwan, Rakesh Vaideeswaran, Sanket Shah, Ankita Singh, Srinivasa Raghavan, Shreya Khare, Vinit Unni, Saurabh Vyas, Akash Rajpuria, Chiranjeevi Yarra, Ashish Mittal, Prasanta Kumar Ghosh, Preethi Jyothi, Kalika Bali, Vivek Seshadri, Sunayana Sitaram, Samarth Bharadwaj, Jai Nanavati, Raoul Nanavati, Karthik Sankaranarayanan

Adapt-and-Adjust: Overcoming the Long-Tail Problem of Multilingual Speech Recognition Genta Indra Winata, Guangsen Wang, Caiming Xiong, Steven Hoi

SRI-B End-to-End System for Multilingual and Code-Switching ASR Challenges for Low Resource Indian Languages Hardik Sailor, Kiran Praveen T, Vikas Agrawal, Abhinav Jain, Abhishek Pandey

Hierarchical Phone Recognition with Compositional Phonetics Xinjian Li, Juncheng Li, Florian Metze, Alan W. Black

Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR Shammur Absar Chowdhury, Amir Hussein, Ahmed Abdelali, Ahmed Ali

Differentiable Allophone Graphs for Language-Universal Speech Recognition Brian Yan, Siddharth Dalmia, David R. Mortensen, Florian Metze, Shinji Watanabe

Health and Affect II

Automatic Speech Recognition Systems Errors for Objective Sleepiness Detection Through Voice Vincent P. Martin, Jean-Luc Rouas, Florian Boyer, Pierre Philip

Robust Laughter Detection in Noisy Environments Jon Gillick, Wesley Deng, Kimiko Ryokai, David Bamman

Impact of Emotional State on Estimation of Willingness to Buy from Advertising Speech Mizuki Nagano, Yusuke Ijima, Sadao Hiroya

Stacked Recurrent Neural Networks for Speech-Based Inference of Attachment Condition in School Age Children Huda Alsofyani, Alessandro Vinciarelli

Language or Paralanguage, This is the Problem: Comparing Depressed and Non-Depressed Speakers Through the Analysis of Gated Multimodal Units Nujud Aloshban, Anna Esposito, Alessandro Vinciarelli

Emotion Carrier Recognition from Personal Narratives Aniruddha Tammewar, Alessandra Cervone, Giuseppe Riccardi

Non-Verbal Vocalisation and Laughter Detection Using Sequence-to-Sequence Models and Multi-Label Training Scott Condron, Georgia Clarke, Anita Klementiev, Daniela Morse-Kopp, Jack Parry, Dimitri Palaz

TDCA-Net: Time-Domain Channel Attention Network for Depression Detection Cong Cai, Mingyue Niu, Bin Liu, Jianhua Tao, Xuefei Liu

Visual Speech for Obstructive Sleep Apnea Detection Catarina Botelho, Alberto Abad, Tanja Schultz, Isabel Trancoso

Analysis of Contextual Voice Changes in Remote Meetings Hector A. Cordourier Maruri, Sinem Aslan, Georg Stemmer, Nese Alyuz, Lama Nachman

Speech Based Depression Severity Level Classification Using a Multi-Stage Dilated CNN-LSTM Model Nadee Seneviratne, Carol Espy-Wilson

Neural Network Training Methods for ASR

Multi-Domain Knowledge Distillation via Uncertainty-Matching for End-to-End ASR Models Ho-Gyeong Kim, Min-Joong Lee, Hoshik Lee, Tae Gyoon Kang, Jihyun Lee, Eunho Yang, Sung Ju Hwang

Learning a Neural Diff for Speech Models Jonathan Macoskey, Grant P. Strimel, Ariya Rastrow

Stochastic Attention Head Removal: A Simple and Effective Method for Improving Transformer Based ASR Models Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals

Model-Agnostic Fast Adaptive Multi-Objective Balancing Algorithm for Multilingual Automatic Speech Recognition Model Training Jiabin Xue, Tieran Zheng, Jiqing Han

Towards Lifelong Learning of End-to-End ASR Heng-Jui Chang, Hung-yi Lee, Lin-shan Lee

Self-Adaptive Distillation for Multilingual Speech Recognition: Leveraging Student Independence Isabel Leal, Neeraj Gaur, Parisa Haghani, Brian Farris, Pedro J. Moreno, Manasa Prasad, Bhuvana Ramabhadran, Yun Zhu

Regularizing Word Segmentation by Creating Misspellings Hainan Xu, Kartik Audhkhasi, Yinghui Huang, Jesse Emond, Bhuvana Ramabhadran

Multitask Training with Text Data for End-to-End Speech Recognition Peidong Wang, Tara N. Sainath, Ron J. Weiss

Emitting Word Timings with HMM-Free End-to-End System in Automatic Speech Recognition Xianzhao Chen, Hao Ni, Yi He, Kang Wang, Zejun Ma, Zongxia Xie

Scaling Laws for Acoustic Models Jasha Droppo, Oguz Elibol

Leveraging Non-Target Language Resources to Improve ASR Performance in a Target Language Jayadev Billa

4-Bit Quantization of LSTM-Based Speech Recognition Models Andrea Fasoli, Chia-Yu Chen, Mauricio Serrano, Xiao Sun, Naigang Wang, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Wei Zhang, Zoltán Tüske, Kailash Gopalakrishnan

Unified Autoregressive Modeling for Joint End-to-End Multi-Talker Overlapped Speech Recognition and Speaker Attribute Estimation Ryo Masumura, Daiki Okamura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi

Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition Zhong Meng, Yu Wu, Naoyuki Kanda, Liang Lu, Xie Chen, Guoli Ye, Eric Sun, Jinyu Li, Yifan Gong

Variable Frame Rate Acoustic Models Using Minimum Error Reinforcement Learning Dongcheng Jiang, Chao Zhang, Philip C. Woodland

Prosodic Features and Structure

How f0 and Phrase Position Affect Papuan Malay Word Identification Constantijn Kaland, Matthew Gordon

On the Feasibility of the Danish Model of Intonational Transcription: Phonetic Evidence from Jutlandic Danish Anna Bothe Jespersen, Pavel Šturm, Míša Hejná

An Experiment in Paratone Detection in a Prosodically Annotated EAP Spoken Corpus Adrien Méli, Nicolas Ballier, Achille Falaise, Alice Henderson

ProsoBeast Prosody Annotation Tool Branislav Gerazov, Michael Wagner

Assessing the Use of Prosody in Constituency Parsing of Imperfect Transcripts Trang Tran, Mari Ostendorf

Targeted and Targetless Neutral Tones in Taiwanese Southern Min Roger Cheng-yen Liu, Feng-fan Hsieh, Yueh-chin Chang

The Interaction of Word Complexity and Word Duration in an Agglutinative Language Mária Gósy, Kálmán Abari

Taiwan Min Nan (Taiwanese) Checked Tones Sound Change Ho-hsien Pan, Shao-ren Lyu

In-Group Advantage in the Perception of Emotions: Evidence from Three Varieties of German Moritz Jakob, Bettina Braun, Katharina Zahner-Ritter

The LF Model in the Frequency Domain for Glottal Airflow Modelling Without Aliasing Distortion Christer Gobl

Parsing Speech for Grouping and Prominence, and the Typology of Rhythm Michael Wagner, Alvaro Iturralde Zurita, Sijia Zhang

Prosody of Case Markers in Urdu Benazir Mumtaz, Massimiliano Canzi, Miriam Butt

Articulatory Characteristics of Icelandic Voiced Fricative Lenition: Gradience, Categoricity, and Speaker/Gesture-Specific Effects Brynhildur Stefansdottir, Francesco Burroni, Sam Tilsen

Leveraging the Uniformity Framework to Examine Crosslinguistic Similarity for Long-Lag Stops in Spontaneous Cantonese-English Bilingual Speech Khia A. Johnson

Single-Channel Speech Enhancement

Personalized Speech Enhancement Through Self-Supervised Data Augmentation and Purification Aswin Sivaraman, Sunwoo Kim, Minje Kim

Speech Denoising with Auditory Models Mark R. Saddler, Andrew Francl, Jenelle Feather, Kaizhi Qian, Yang Zhang, Josh H. McDermott

Human Listening and Live Captioning: Multi-Task Training for Speech Enhancement Sefik Emre Eskimez, Xiaofei Wang, Min Tang, Hemin Yang, Zirun Zhu, Zhuo Chen, Huaming Wang, Takuya Yoshioka

Multi-Stage Progressive Speech Enhancement Network Xinmeng Xu, Yang Wang, Dongxiang Xu, Yiyuan Peng, Cong Zhang, Jie Jia, Binbin Chen

Single-Channel Speech Enhancement Using Learnable Loss Mixup Oscar Chang, Dung N. Tran, Kazuhito Koishida

A Maximum Likelihood Approach to SNR-Progressive Learning Using Generalized Gaussian Distribution for LSTM-Based Speech Enhancement Xiao-Qi Zhang, Jun Du, Li Chai, Chin-Hui Lee

Whisper Speech Enhancement Using Joint Variational Autoencoder for Improved Speech Recognition Vikas Agrawal, Shashi Kumar, Shakti P. Rath

DEMUCS-Mobile: On-Device Lightweight Speech Enhancement Lukas Lee, Youna Ji, Minjae Lee, Min-Seok Choi

Speech Denoising Without Clean Training Data: A Noise2Noise Approach Madhav Mahesh Kashyap, Anuj Tambwekar, Krishnamoorthy Manohara, S. Natarajan

Improved Speech Enhancement Using a Complex-Domain GAN with Fused Time-Domain and Time-Frequency Domain Constraints Feng Dang, Pengyuan Zhang, Hangting Chen

Speech Enhancement with Topology-Enhanced Generative Adversarial Networks (GANs) Xudong Zhang, Liang Zhao, Feng Gu

Learning Speech Structure to Improve Time-Frequency Masks Suliang Bu, Yunxin Zhao, Shaojun Wang, Mei Han

SE-Conformer: Time-Domain Speech Enhancement Using Conformer Eesung Kim, Hyeji Seo

Speech Synthesis: Tools, Data, Evaluation

Spectral and Latent Speech Representation Distortion for TTS Evaluation Thananchai Kongthaworn, Burin Naowarat, Ekapol Chuangsuwanich

Detection and Analysis of Attention Errors in Sequence-to-Sequence Text-to-Speech Cassia Valentini-Botinhao, Simon King

RyanSpeech: A Corpus for Conversational Text-to-Speech Synthesis Rohola Zandie, Mohammad H. Mahoor, Julia Madsen, Eshrat S. Emamian

AISHELL-3: A Multi-Speaker Mandarin TTS Corpus Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, Ming Li

Comparing Speech Enhancement Techniques for Voice Adaptation-Based Speech Synthesis Nicholas Eng, C.T. Justine Hui, Yusuke Hioka, Catherine I. Watson

EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model Chenye Cui, Yi Ren, Jinglin Liu, Feiyang Chen, Rongjie Huang, Ming Lei, Zhou Zhao

Perception of Social Speaker Characteristics in Synthetic Speech Sai Sirisha Rallabandi, Abhinav Bharadwaj, Babak Naderi, Sebastian Möller

Hi-Fi Multi-Speaker English TTS Dataset Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg, Yang Zhang

Utilizing Self-Supervised Representations for MOS Prediction Wei-Cheng Tseng, Chien-yu Huang, Wei-Tsung Kao, Yist Y. Lin, Hung-yi Lee

KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset Saida Mussakhojayeva, Aigerim Janaliyeva, Almas Mirzakhmetov, Yerbolat Khassanov, Huseyin Atakan Varol

Confidence Intervals for ASR-Based TTS Evaluation Jason Taylor, Korin Richmond

INTERSPEECH 2021 Deep Noise Suppression Challenge

INTERSPEECH 2021 Deep Noise Suppression Challenge Chandan K.A. Reddy, Harishchandra Dubey, Kazuhito Koishida, Arun Nair, Vishak Gopal, Ross Cutler, Sebastian Braun, Hannes Gamper, Robert Aichner, Sriram Srinivasan

A Simultaneous Denoising and Dereverberation Framework with Target Decoupling Andong Li, Wenzhe Liu, Xiaoxue Luo, Guochen Yu, Chengshi Zheng, Xiaodong Li

Deep Noise Suppression with Non-Intrusive PESQNet Supervision Enabling the Use of Real Training Data Ziyi Xu, Maximilian Strake, Tim Fingscheidt

DPCRN: Dual-Path Convolution Recurrent Network for Single Channel Speech Enhancement Xiaohuai Le, Hongsheng Chen, Kai Chen, Jing Lu

DCCRN+: Channel-Wise Subband DCCRN with SNR Estimation for Speech Enhancement Shubo Lv, Yanxin Hu, Shimin Zhang, Lei Xie

DBNet: A Dual-Branch Network Architecture Processing on Spectrum and Waveform for Single-Channel Speech Enhancement Kanghao Zhang, Shulin He, Hao Li, Xueliang Zhang

Low-Delay Speech Enhancement Using Perceptually Motivated Target and Loss Xu Zhang, Xinlei Ren, Xiguang Zheng, Lianwu Chen, Chen Zhang, Liang Guo, Bing Yu

Lightweight Causal Transformer with Local Self-Attention for Real-Time Speech Enhancement Koen Oostermeijer, Qing Wang, Jun Du

Neural Network Training Methods and Architectures for ASR

Self-Paced Ensemble Learning for Speech and Audio Classification Nicolae-Cătălin Ristea, Radu Tudor Ionescu

Knowledge Distillation for Streaming Transformer–Transducer Atsushi Kojima

Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition Timo Lohrenz, Zhengyang Li, Tim Fingscheidt

Conditional Independence for Pretext Task Selection in Self-Supervised Speech Representation Learning Salah Zaiem, Titouan Parcollet, Slim Essid

Investigating Methods to Improve Language Model Integration for Attention-Based Encoder-Decoder ASR Models Mohammad Zeineldeen, Aleksandr Glushko, Wilfried Michel, Albert Zeyer, Ralf Schlüter, Hermann Ney

Comparing CTC and LFMMI for Out-of-Domain Adaptation of wav2vec 2.0 Acoustic Model Apoorv Vyas, Srikanth Madikeri, Hervé Bourlard

Emotion and Sentiment Analysis I

Speaker Attentive Speech Emotion Recognition Clément Le Moine, Nicolas Obin, Axel Roebel

Separation of Emotional and Reconstruction Embeddings on Ladder Network to Improve Speech Emotion Recognition Robustness in Noisy Conditions Seong-Gyun Leem, Daniel Fulford, Jukka-Pekka Onnela, David Gard, Carlos Busso

M3: MultiModal Masking Applied to Sentiment Analysis Efthymios Georgiou, Georgios Paraskevopoulos, Alexandros Potamianos

Linguistic Components in End-to-End ASR

The CSTR System for Multilingual and Code-Switching ASR Challenges for Low Resource Indian Languages Ondřej Klejch, Electra Wallington, Peter Bell

Acoustic Data-Driven Subword Modeling for End-to-End Speech Recognition Wei Zhou, Mohammad Zeineldeen, Zuoyun Zheng, Ralf Schlüter, Hermann Ney

Equivalence of Segmental and Neural Transducer Modeling: A Proof of Concept Wei Zhou, Albert Zeyer, André Merboldt, Ralf Schlüter, Hermann Ney

Modeling Dialectal Variation for Swiss German Automatic Speech Recognition Abbas Khosravani, Philip N. Garner, Alexandros Lazaridis

Out-of-Vocabulary Words Detection with Attention and CTC Alignments in an End-to-End ASR System Ekaterina Egorova, Hari Krishna Vydana, Lukáš Burget, Jan Černocký

Training Hybrid Models on Noisy Transliterated Transcripts for Code-Switched Speech Recognition Matthew Wiesner, Mousmita Sarma, Ashish Arora, Desh Raj, Dongji Gao, Ruizhe Huang, Supreet Preet, Moris Johnson, Zikra Iqbal, Nagendra Goel, Jan Trmal, Leibny Paola García Perera, Sanjeev Khudanpur

Assessment of Pathological Speech and Language II

Speech Intelligibility of Dysarthric Speech: Human Scores and Acoustic-Phonetic Features Wei Xue, Roeland van Hout, Fleur Boogmans, Mario Ganzeboom, Catia Cucchiarini, Helmer Strik

Analyzing Short Term Dynamic Speech Features for Understanding Behavioral Traits of Children with Autism Spectrum Disorder Young-Kyung Kim, Rimita Lahiri, Md. Nasir, So Hyun Kim, Somer Bishop, Catherine Lord, Shrikanth S. Narayanan

Vocalization Recognition of People with Profound Intellectual and Multiple Disabilities (PIMD) Using Machine Learning Algorithms Waldemar Jęśko

Phonetic Complexity, Speech Accuracy and Intelligibility Assessment of Italian Dysarthric Speech Barbara Gili Fivela, Vincenzo Sallustio, Silvia Pede, Danilo Patrocinio

Detection of Consonant Errors in Disordered Speech Based on Consonant-Vowel Segment Embedding Si-Ioi Ng, Cymie Wing-Yee Ng, Jingyu Li, Tan Lee

Assessing Posterior-Based Mispronunciation Detection on Field-Collected Recordings from Child Speech Therapy Sessions Adam Hair, Guanlong Zhao, Beena Ahmed, Kirrie J. Ballard, Ricardo Gutierrez-Osuna

Identifying Cognitive Impairment Using Sentence Representation Vectors Bahman Mirheidari, Yilin Pan, Daniel Blackburn, Ronan O’Malley, Heidi Christensen

Parental Spoken Scaffolding and Narrative Skills in Crowd-Sourced Storytelling Samples of Young Children Zhengjun Yue, Jon Barker, Heidi Christensen, Cristina McKean, Elaine Ashton, Yvonne Wren, Swapnil Gadgil, Rebecca Bright

Uncertainty-Aware COVID-19 Detection from Imbalanced Sound Data Tong Xia, Jing Han, Lorena Qendro, Ting Dang, Cecilia Mascolo

Unsupervised Domain Adaptation for Dysarthric Speech Detection via Domain Adversarial Training and Mutual Information Minimization Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, Helen Meng

Source and Vocal Tract Cues for Speech-Based Classification of Patients with Parkinson’s Disease and Healthy Subjects Tanuka Bhattacharjee, Jhansi Mallela, Yamini Belur, Nalini Atchayaram, Ravi Yadav, Pradeep Reddy, Dipanjan Gope, Prasanta Kumar Ghosh

CLAC: A Speech Corpus of Healthy English Speakers R’mani Haulcy, James Glass

Multimodal Systems

Direct Multimodal Few-Shot Learning of Speech and Images Leanne Nortje, Herman Kamper

Talk, Don’t Write: A Study of Direct Speech-Based Image Retrieval Ramon Sanabria, Austin Waters, Jason Baldridge

A Fast Discrete Two-Step Learning Hashing for Scalable Cross-Modal Retrieval Huan Zhao, Kaili Ma

Cross-Modal Knowledge Distillation Method for Automatic Cued Speech Recognition Jianrong Wang, Ziyue Tang, Xuewei Li, Mei Yu, Qiang Fang, Li Liu

Attention-Based Keyword Localisation in Speech Using Visual Grounding Kayode Olaleye, Herman Kamper

Evaluation of Audio-Visual Alignments in Visually Grounded Speech Models Khazar Khorrami, Okko Räsänen

Automatic Lip-Reading with Hierarchical Pyramidal Convolution and Self-Attention for Image Sequences with No Word Boundaries Hang Chen, Jun Du, Yu Hu, Li-Rong Dai, Bao-Cai Yin, Chin-Hui Lee

Cascaded Multilingual Audio-Visual Learning from Videos Andrew Rouditchenko, Angie Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, James Glass

LiRA: Learning Visual Speech Representations from Audio Through Self-Supervision Pingchuan Ma, Rodrigo Mira, Stavros Petridis, Björn W. Schuller, Maja Pantic

End-to-End Audio-Visual Speech Recognition for Overlapping Speech Richard Rose, Olivier Siohan, Anshuman Tripathi, Otavio Braga

Audio-Visual Multi-Talker Speech Recognition in a Cocktail Party Yifei Wu, Chenda Li, Song Yang, Zhongqin Wu, Yanmin Qian

Source Separation I

Ultra Fast Speech Separation Model with Teacher Student Learning Sanyuan Chen, Yu Wu, Zhuo Chen, Jian Wu, Takuya Yoshioka, Shujie Liu, Jinyu Li, Xiangzhan Yu

Group Delay Based Re-Weighted Sparse Recovery Algorithms for Robust and High-Resolution Source Separation in DOA Framework Murtiza Ali, Ashwani Koul, Karan Nathwani

Continuous Speech Separation Using Speaker Inventory for Long Recording Cong Han, Yi Luo, Chenda Li, Tianyan Zhou, Keisuke Kinoshita, Shinji Watanabe, Marc Delcroix, Hakan Erdogan, John R. Hershey, Nima Mesgarani, Zhuo Chen

Crossfire Conditional Generative Adversarial Networks for Singing Voice Extraction Weitao Yuan, Shengbei Wang, Xiangrui Li, Masashi Unoki, Wenwu Wang

End-to-End Speech Separation Using Orthogonal Representation in Complex and Real Time-Frequency Domain Kai Wang, Hao Huang, Ying Hu, Zhihua Huang, Sheng Li

Efficient and Stable Adversarial Learning Using Unpaired Data for Unsupervised Multichannel Speech Separation Yu Nakagome, Masahito Togami, Tetsuji Ogawa, Tetsunori Kobayashi

Stabilizing Label Assignment for Speech Separation by Self-Supervised Pre-Training Sung-Feng Huang, Shun-Po Chuang, Da-Rong Liu, Yi-Chen Chen, Gene-Ping Yang, Hung-yi Lee

Dual-Path Filter Network: Speaker-Aware Modeling for Speech Separation Fan-Lin Wang, Yu-Huai Peng, Hung-Shin Lee, Hsin-Min Wang

Investigation of Practical Aspects of Single Channel Speech Separation for ASR Jian Wu, Zhuo Chen, Sanyuan Chen, Yu Wu, Takuya Yoshioka, Naoyuki Kanda, Shujie Liu, Jinyu Li

Implicit Filter-and-Sum Network for End-to-End Multi-Channel Speech Separation Yi Luo, Nima Mesgarani

Generalized Spatio-Temporal RNN Beamformer for Target Speech Separation Yong Xu, Zhuohuang Zhang, Meng Yu, Shi-Xiong Zhang, Dong Yu

Speaker Diarization I

End-to-End Neural Diarization: From Transformer to Conformer Yi Chieh Liu, Eunjung Han, Chul Lee, Andreas Stolcke

Three-Class Overlapped Speech Detection Using a Convolutional Recurrent Neural Network Jee-weon Jung, Hee-Soo Heo, Youngki Kwon, Joon Son Chung, Bong-Jin Lee

Online Speaker Diarization Equipped with Discriminative Modeling and Guided Inference Xucheng Wan, Kai Liu, Huan Zhou

Semi-Supervised Training with Pseudo-Labeling for End-To-End Neural Diarization Yuki Takashima, Yusuke Fujita, Shota Horiguchi, Shinji Watanabe, Leibny Paola García Perera, Kenji Nagamatsu

Adapting Speaker Embeddings for Speaker Diarisation Youngki Kwon, Jee-weon Jung, Hee-Soo Heo, You Jin Kim, Bong-Jin Lee, Joon Son Chung

Scenario-Dependent Speaker Diarization for DIHARD-III Challenge Yu-Xuan Wang, Jun Du, Maokui He, Shu-Tong Niu, Lei Sun, Chin-Hui Lee

End-To-End Speaker Segmentation for Overlap-Aware Resegmentation Hervé Bredin, Antoine Laurent

Online Streaming End-to-End Neural Diarization Handling Overlapping Speech and Flexible Numbers of Speakers Yawen Xue, Shota Horiguchi, Yusuke Fujita, Yuki Takashima, Shinji Watanabe, Leibny Paola García Perera, Kenji Nagamatsu

A Thousand Words are Worth More Than One Recording: Word-Embedding Based Speaker Change Detection Or Haim Anidjar, Itshak Lapidot, Chen Hajaj, Amit Dvir

Speech Synthesis: Prosody Modeling I

Phrase Break Prediction with Bidirectional Encoder Representations in Japanese Text-to-Speech Synthesis Kosuke Futamata, Byeongseon Park, Ryuichi Yamamoto, Kentaro Tachibana

Improving Multi-Speaker TTS Prosody Variance with a Residual Encoder and Normalizing Flows Iván Vallés-Pérez, Julian Roth, Grzegorz Beringer, Roberto Barra-Chicote, Jasha Droppo

Rich Prosody Diversity Modelling with Phone-Level Mixture Density Network Chenpeng Du, Kai Yu

Phoneme Duration Modeling Using Speech Rhythm-Based Speaker Embeddings for Multi-Speaker Speech Synthesis Kenichi Fujita, Atsushi Ando, Yusuke Ijima

Fine-Grained Prosody Modeling in Neural Speech Synthesis Using ToBI Representation Yuxiang Zou, Shichao Liu, Xiang Yin, Haopeng Lin, Chunfeng Wang, Haoyu Zhang, Zejun Ma

Intra-Sentential Speaking Rate Control in Neural Text-To-Speech for Automatic Dubbing Mayank Sharma, Yogesh Virkar, Marcello Federico, Roberto Barra-Chicote, Robert Enyedi

Applying the Information Bottleneck Principle to Prosodic Representation Learning Guangyan Zhang, Ying Qin, Daxin Tan, Tan Lee

A Prototypical Network Approach for Evaluating Generated Emotional Speech Alice Baird, Silvan Mertes, Manuel Milling, Lukas Stappen, Thomas Wiest, Elisabeth André, Björn W. Schuller

Speech Production II

A Simplified Model for the Vocal Tract of [s] with Inclined Incisors Tsukasa Yoshinaga, Kohei Tada, Kazunori Nozaki, Akiyoshi Iida

Vocal-Tract Models to Visualize the Airstream of Human Breath and Droplets While Producing Speech Takayuki Arai

Using Transposed Convolution for Articulatory-to-Acoustic Conversion from Real-Time MRI Data Ryo Tanji, Hidefumi Ohmura, Kouichi Katsurada

Comparison Between Lumped-Mass Modeling and Flow Simulation of the Reed-Type Artificial Vocal Fold Rafia Inaam, Tsukasa Yoshinaga, Takayuki Arai, Hiroshi Yokoyama, Akiyoshi Iida

Inhalations in Speech: Acoustic and Physiological Characteristics Raphael Werner, Susanne Fuchs, Jürgen Trouvain, Bernd Möbius

Model-Based Exploration of Linking Between Vowel Articulatory Space and Acoustic Space Anqi Xu, Daniel van Niekerk, Branislav Gerazov, Paul Konstantin Krug, Santitham Prom-on, Peter Birkholz, Yi Xu

Take a Breath: Respiratory Sounds Improve Recollection in Synthetic Speech Mikey Elmers, Raphael Werner, Beeke Muhlack, Bernd Möbius, Jürgen Trouvain

Modeling Sensorimotor Adaptation in Speech Through Alterations to Forward and Inverse Models Taijing Chen, Adam Lammert, Benjamin Parrell

Mixture of Orthogonal Sequences Made from Extended Time-Stretched Pulses Enables Measurement of Involuntary Voice Fundamental Frequency Response to Pitch Perturbation Hideki Kawahara, Toshie Matsui, Kohei Yatabe, Ken-Ichi Sakakibara, Minoru Tsuzaki, Masanori Morise, Toshio Irino

Spoken Dialogue Systems II

Contextualized Attention-Based Knowledge Transfer for Spoken Conversational Question Answering Chenyu You, Nuo Chen, Yuexian Zou

Injecting Descriptive Meta-Information into Pre-Trained Language Models with Hypernetworks Wenying Duan, Xiaoxi He, Zimu Zhou, Hong Rao, Lothar Thiele

Causal Confusion Reduction for Robust Multi-Domain Dialogue Policy Mahdin Rohmatillah, Jen-Tzung Chien

Timing Generating Networks: Neural Network Based Precise Turn-Taking Timing Prediction in Multiparty Conversation Shinya Fujie, Hayato Katayama, Jin Sakuma, Tetsunori Kobayashi

Human-to-Human Conversation Dataset for Learning Fine-Grained Turn-Taking Action Kehan Chen, Zezhong Li, Suyang Dai, Wei Zhou, Haiqing Chen

PhonemeBERT: Joint Language Modelling of Phoneme Sequence and ASR Transcript Mukuntha Narayanan Sundararaman, Ayush Kumar, Jithendra Vepa

Joint Retrieval-Extraction Training for Evidence-Aware Dialog Response Selection Hongyin Luo, James Glass, Garima Lalwani, Yi Zhang, Shang-Wen Li

Adapting Long Context NLM for ASR Rescoring in Conversational Agents Ashish Shenoy, Sravan Bodapati, Monica Sunkara, Srikanth Ronanki, Katrin Kirchhoff

Oriental Language Recognition

Oriental Language Recognition (OLR) 2020: Summary and Analysis Jing Li, Binling Wang, Yiming Zhi, Zheng Li, Lin Li, Qingyang Hong, Dong Wang

Language Recognition on Unknown Conditions: The LORIA-Inria-MULTISPEECH System for AP20-OLR Challenge Raphaël Duroselle, Md. Sahidullah, Denis Jouvet, Irina Illina

Dynamic Multi-Scale Convolution for Dialect Identification Tianlong Kong, Shouyi Yin, Dawei Zhang, Wang Geng, Xin Wang, Dandan Song, Jinwen Huang, Huiyu Shi, Xiaorui Wang

An End-to-End Dialect Identification System with Transfer Learning from a Multilingual Automatic Speech Recognition Model Ding Wang, Shuaishuai Ye, Xinhui Hu, Sheng Li, Xinkang Xu

Language Recognition Based on Unsupervised Pretrained Models Haibin Yu, Jing Zhao, Song Yang, Zhongqin Wu, Yuting Nie, Wei-Qiang Zhang

Additive Phoneme-Aware Margin Softmax Loss for Language Recognition Zheng Li, Yan Liu, Lin Li, Qingyang Hong

Automatic Speech Recognition in Air Traffic Management

Towards an Accent-Robust Approach for ATC Communications Transcription Nataly Jahchan, Florentin Barbier, Ariyanidevi Dharma Gita, Khaled Khelif, Estelle Delpech

Detecting English Speech in the Air Traffic Control Voice Communication Igor Szöke, Santosh Kesiraju, Ondřej Novotný, Martin Kocour, Karel Veselý, Jan Černocký

Robust Command Recognition for Lithuanian Air Traffic Control Tower Utterances Oliver Ohneiser, Seyyed Saeed Sarfjoo, Hartmut Helmke, Shruthi Shetty, Petr Motlicek, Matthias Kleinert, Heiko Ehr, Šarūnas Murauskas

Contextual Semi-Supervised Learning: An Approach to Leverage Air-Surveillance and Untranscribed ATC Data in ASR Systems Juan Zuluaga-Gomez, Iuliia Nigmatulina, Amrutha Prasad, Petr Motlicek, Karel Veselý, Martin Kocour, Igor Szöke

Boosting of Contextual Information in ASR for Air-Traffic Call-Sign Recognition Martin Kocour, Karel Veselý, Alexander Blatt, Juan Zuluaga Gomez, Igor Szöke, Jan Černocký, Dietrich Klakow, Petr Motlicek

Modeling the Effect of Military Oxygen Masks on Speech Characteristics Benjamin Elie, Jodie Gauvain, Jean-Luc Gauvain, Lori Lamel

Show and Tell 3

MoM: Minutes of Meeting Bot Benjamin Milde, Tim Fischer, Steffen Remus, Chris Biemann

Articulatory Data Recorder: A Framework for Real-Time Articulatory Data Recording Alexander Wilbrandt, Simon Stone, Peter Birkholz

The INGENIOUS Multilingual Operations App Joan Codina-Filbà, Guillermo Cámbara, Alex Peiró-Lilja, Jens Grivolla, Roberto Carlini, Mireia Farrús

Digital Einstein Experience: Fast Text-to-Speech for Conversational AI Joanna Rownicka, Kilian Sprenkamp, Antonio Tripiana, Volodymyr Gromoglasov, Timo P. Kunz

Live Subtitling for BigBlueButton with Open-Source Software Robert Geislinger, Benjamin Milde, Timo Baumann, Chris Biemann

Expressive Latvian Speech Synthesis for Dialog Systems Dāvis Nicmanis, Askars Salimbajevs

ViSTAFAE: A Visual Speech-Training Aid with Feedback of Articulatory Efforts Pramod H. Kachare, Prem C. Pandey, Vishal Mane, Hirak Dasgupta, K.S. Nataraj, Akshada Rathod, Sheetal K. Pathak

Survey Talk 3: Karen Livescu

Learning Speech Models from Multi-Modal Data Karen Livescu

Keynote 3: Mounya Elhilali

Adaptive Listening to Everyday Soundscapes Mounya Elhilali

Speech Production I

Towards the Prediction of the Vocal Tract Shape from the Sequence of Phonemes to be Articulated Vinicius Ribeiro, Karyna Isaieva, Justine Leclere, Pierre-André Vuissoz, Yves Laprie

Comparison of the Finite Element Method, the Multimodal Method and the Transmission-Line Model for the Computation of Vocal Tract Transfer Functions Rémi Blandin, Marc Arnela, Simon Félix, Jean-Baptiste Doc, Peter Birkholz

Effects of Time Pressure and Spontaneity on Phonotactic Innovations in German Dialogues Petra Wagner, Sina Zarrieß, Joana Cholin

Importance of Parasagittal Sensor Information in Tongue Motion Capture Through a Diphonic Analysis Salvador Medina, Sarah Taylor, Mark Tiede, Alexander Hauptmann, Iain Matthews

Learning Robust Speech Representation with an Articulatory-Regularized Variational Autoencoder Marc-Antoine Georges, Laurent Girin, Jean-Luc Schwartz, Thomas Hueber

Changes in Glottal Source Parameter Values with Light to Moderate Physical Load Heather Weston, Laura L. Koenig, Susanne Fuchs

Speech Enhancement and Coding

End-to-End Optimized Multi-Stage Vector Quantization of Spectral Envelopes for Speech and Audio Coding Mohammad Hassan Vali, Tom Bäckström

Fusion-Net: Time-Frequency Information Fusion Y-Network for Speech Enhancement Santhan Kumar Reddy Nareddula, Subrahmanyam Gorthi, Rama Krishna Sai S. Gorthi

N-MTTL SI Model: Non-Intrusive Multi-Task Transfer Learning-Based Speech Intelligibility Prediction Model with Scenery Classification Ľuboš Marcinek, Michael Stone, Rebecca Millman, Patrick Gaydecki

Emotion and Sentiment Analysis II

Temporal Context in Speech Emotion Recognition Yangyang Xia, Li-Wei Chen, Alexander Rudnicky, Richard M. Stern

Learning Fine-Grained Cross Modality Excitement for Speech Emotion Recognition Hang Li, Wenbiao Ding, Zhongqin Wu, Zitao Liu

Automatic Analysis of the Emotional Content of Speech in Daylong Child-Centered Recordings from a Neonatal Intensive Care Unit Einari Vaaras, Sari Ahlqvist-Björkroth, Konstantinos Drossos, Okko Räsänen

Multimodal Sentiment Analysis with Temporal Modality Attention Fan Qian, Jiqing Han

Stochastic Process Regression for Cross-Cultural Speech Emotion Recognition Mani Kumar T, Enrique Sanchez, Georgios Tzimiropoulos, Timo Giesbrecht, Michel Valstar

Acted vs. Improvised: Domain Adaptation for Elicitation Approaches in Audio-Visual Emotion Recognition Haoqi Li, Yelin Kim, Cheng-Hao Kuo, Shrikanth S. Narayanan

Emotion Recognition from Speech Using wav2vec 2.0 Embeddings Leonardo Pepino, Pablo Riera, Luciana Ferrer

Graph Isomorphism Network for Speech Emotion Recognition Jiawang Liu, Haoxiang Wang

Applying TDNN Architectures for Analyzing Duration Dependencies on Speech Emotion Recognition Pooja Kumawat, Aurobinda Routray

Acoustic Features and Neural Representations for Categorical Emotion Recognition from Speech Aaron Keesing, Yun Sing Koh, Michael Witbrock

Leveraging Pre-Trained Language Model for Speech Sentiment Analysis Suwon Shon, Pablo Brusco, Jing Pan, Kyu J. Han, Shinji Watanabe

Multi- and Cross-Lingual ASR, Other Topics in ASR

Cross-Domain Speech Recognition with Unsupervised Character-Level Distribution Matching Wenxin Hou, Jindong Wang, Xu Tan, Tao Qin, Takahiro Shinozaki

Large-Scale Pre-Training of End-to-End Multi-Talker ASR for Meeting Transcription with Single Distant Microphone Naoyuki Kanda, Guoli Ye, Yu Wu, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka

On Minimum Word Error Rate Training of the Hybrid Autoregressive Transducer Liang Lu, Zhong Meng, Naoyuki Kanda, Jinyu Li, Yifan Gong

Reducing Streaming ASR Model Delay with Self Alignment Jaeyoung Kim, Han Lu, Anshuman Tripathi, Qian Zhang, Hasim Sak

Reduce and Reconstruct: ASR for Low-Resource Phonetic Languages Anuj Diwan, Preethi Jyothi

Knowledge Distillation Based Training of Universal ASR Source Models for Cross-Lingual Transfer Takashi Fukuda, Samuel Thomas

Listen with Intent: Improving Speech Recognition with Audio-to-Intent Front-End Swayambhu Nath Ray, Minhua Wu, Anirudh Raju, Pegah Ghahremani, Raghavendra Bilgi, Milind Rao, Harish Arsikere, Ariya Rastrow, Andreas Stolcke, Jasha Droppo

Exploring Targeted Universal Adversarial Perturbations to End-to-End ASR Models Zhiyun Lu, Wei Han, Yu Zhang, Liangliang Cao

Earnings-21: A Practical Benchmark for ASR in the Wild Miguel Del Rio, Natalie Delworth, Ryan Westerman, Michelle Huang, Nishchal Bhandari, Joseph Palakapilly, Quinten McNamara, Joshua Dong, Piotr Żelasko, Miguel Jetté

Improving Multilingual Transformer Transducer Models by Reducing Language Confusions Eric Sun, Jinyu Li, Zhong Meng, Yu Wu, Jian Xue, Shujie Liu, Yifan Gong

Arabic Code-Switching Speech Recognition Using Monolingual Data Ahmed Ali, Shammur Absar Chowdhury, Amir Hussein, Yasser Hifny

Source Separation II

Online Blind Audio Source Separation Using Recursive Expectation-Maximization Aviad Eisenberg, Boaz Schwartz, Sharon Gannot

Empirical Analysis of Generalized Iterative Speech Separation Networks Yi Luo, Cong Han, Nima Mesgarani

Graph-PIT: Generalized Permutation Invariant Training for Continuous Separation of Arbitrary Numbers of Speakers Thilo von Neumann, Keisuke Kinoshita, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb-Umbach

Teacher-Student MixIT for Unsupervised and Semi-Supervised Speech Separation Jisi Zhang, Cătălin Zorilă, Rama Doddipatla, Jon Barker

Few-Shot Learning of New Sound Classes for Target Sound Extraction Marc Delcroix, Jorge Bennasar Vázquez, Tsubasa Ochiai, Keisuke Kinoshita, Shoko Araki

Binaural Speech Separation of Moving Speakers With Preserved Spatial Cues Cong Han, Yi Luo, Nima Mesgarani

AvaTr: One-Shot Speaker Extraction with Transformers Shell Xu Hu, Md. Rifat Arefin, Viet-Nhat Nguyen, Alish Dipani, Xaq Pitkow, Andreas Savas Tolias

Vocal Harmony Separation Using Time-Domain Neural Networks Saurjya Sarkar, Emmanouil Benetos, Mark Sandler

Speaker Verification-Based Evaluation of Single-Channel Speech Separation Matthew Maciejewski, Shinji Watanabe, Sanjeev Khudanpur

Improved Speech Separation with Time-and-Frequency Cross-Domain Feature Selection Tian Lan, Yuxin Qian, Yilan Lyu, Refuoe Mokhosi, Wenxin Tai, Qiao Liu

Robust Speaker Extraction Network Based on Iterative Refined Adaptation Chengyun Deng, Shiqian Ma, Yongtao Sha, Yi Zhang, Hui Zhang, Hui Song, Fei Wang

Neural Speaker Extraction with Speaker-Speech Cross-Attention Network Wupeng Wang, Chenglin Xu, Meng Ge, Haizhou Li

Deep Audio-Visual Speech Separation Based on Facial Motion Rémi Rigal, Jacques Chodorowski, Benoît Zerr

Speaker Diarization II

LEAP Submission for the Third DIHARD Diarization Challenge Prachi Singh, Rajat Varma, Venkat Krishnamohan, Srikanth Raj Chetupalli, Sriram Ganapathy

Investigation of Spatial-Acoustic Features for Overlapping Speech Detection in Multiparty Meetings Shiliang Zhang, Siqi Zheng, Weilong Huang, Ming Lei, Hongbin Suo, Jinwei Feng, Zhijie Yan

Target-Speaker Voice Activity Detection with Improved i-Vector Estimation for Unknown Number of Speaker Maokui He, Desh Raj, Zili Huang, Jun Du, Zhuo Chen, Shinji Watanabe

ECAPA-TDNN Embeddings for Speaker Diarization Nauman Dawalatabad, Mirco Ravanelli, François Grondin, Jenthe Thienpondt, Brecht Desplanques, Hwidong Na

Advances in Integration of End-to-End Neural and Clustering-Based Diarization for Real Conversational Speech Keisuke Kinoshita, Marc Delcroix, Naohiro Tawara

The Third DIHARD Diarization Challenge Neville Ryant, Prachi Singh, Venkat Krishnamohan, Rajat Varma, Kenneth Church, Christopher Cieri, Jun Du, Sriram Ganapathy, Mark Liberman

Robust End-to-End Speaker Diarization with Conformer and Additive Margin Penalty Tsun-Yat Leung, Lahiru Samarakoon

Anonymous Speaker Clusters: Making Distinctions Between Anonymised Speech Recordings with Clustering Interface Benjamin O’Brien, Natalia Tomashenko, Anaïs Chanclu, Jean-François Bonastre

Speaker Diarization Using Two-Pass Leave-One-Out Gaussian PLDA Clustering of DNN Embeddings Kiran Karra, Alan McCree

Speech Synthesis: Toward End-to-End Synthesis I

Federated Learning with Dynamic Transformer for Text to Speech Zhenhou Hong, Jianzong Wang, Xiaoyang Qu, Jie Liu, Chendong Zhao, Jing Xiao

LiteTTS: A Lightweight Mel-Spectrogram-Free Text-to-Wave Synthesizer Based on Generative Adversarial Networks Huu-Kim Nguyen, Kihyuk Jeong, Seyun Um, Min-Jae Hwang, Eunwoo Song, Hong-Goo Kang

Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration Chuanxin Tang, Chong Luo, Zhiyuan Zhao, Dacheng Yin, Yucheng Zhao, Wenjun Zeng

Diff-TTS: A Denoising Diffusion Model for Text-to-Speech Myeonghun Jeong, Hyeongju Kim, Sung Jun Cheon, Byoung Jin Choi, Nam Soo Kim

Hierarchical Context-Aware Transformers for Non-Autoregressive Text to Speech Jae-Sung Bae, Taejun Bak, Young-Sun Joo, Hoon-Young Cho

Speech Resynthesis from Discrete Disentangled Self-Supervised Representations Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux

A Learned Conditional Prior for the VAE Acoustic Space of a TTS System Penny Karanasou, Sri Karlapati, Alexis Moinet, Arnaud Joly, Ammar Abbas, Simon Slangen, Jaime Lorenzo-Trueba, Thomas Drugman

A Universal Multi-Speaker Multi-Style Text-to-Speech via Disentangled Representation Learning Based on Rényi Divergence Minimization Dipjyoti Paul, Sankar Mukherjee, Yannis Pantazis, Yannis Stylianou

Relational Data Selection for Data Augmentation of Speaker-Dependent Multi-Band MelGAN Vocoder Yi-Chiao Wu, Cheng-Hung Hu, Hung-Shin Lee, Yu-Huai Peng, Wen-Chin Huang, Yu Tsao, Hsin-Min Wang, Tomoki Toda

Reinforce-Aligner: Reinforcement Alignment Search for Robust End-to-End Text-to-Speech Hyunseung Chung, Sang-Hoon Lee, Seong-Whan Lee

Triple M: A Practical Text-to-Speech Synthesis System with Multi-Guidance Attention and Multi-Band Multi-Time LPCNet Shilun Lin, Fenglong Xie, Li Meng, Xinhui Li, Li Lu

SC-GlowTTS: An Efficient Zero-Shot Multi-Speaker Text-To-Speech Model Edresson Casanova, Christopher Shulby, Eren Gölge, Nicolas Michael Müller, Frederico Santos de Oliveira, Arnaldo Candido Jr., Anderson da Silva Soares, Sandra Maria Aluisio, Moacir Antonelli Ponti

Tools, Corpora and Resources

Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset Ian Palmer, Andrew Rouditchenko, Andrei Barbu, Boris Katz, James Glass

The Multilingual TEDx Corpus for Speech Recognition and Translation Elizabeth Salesky, Matthew Wiesner, Jacob Bremerman, Roldano Cattoni, Matteo Negri, Marco Turchi, Douglas W. Oard, Matt Post

Tusom2021: A Phonetically Transcribed Speech Dataset from an Endangered Language for Universal Phone Recognition Experiments David R. Mortensen, Jordan Picone, Xinjian Li, Kathleen Siminyu

AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario Yihui Fu, Luyao Cheng, Shubo Lv, Yukai Jv, Yuxiang Kong, Zhuo Chen, Yanxin Hu, Lei Xie, Jian Wu, Hui Bu, Xin Xu, Jun Du, Jingdong Chen

GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio Guoguo Chen, Shuzhou Chai, Guan-Bo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Zhao You, Zhiyong Yan

Look Who’s Talking: Active Speaker Detection in the Wild You Jin Kim, Hee-Soo Heo, Soyeon Choe, Soo-Whan Chung, Yoohwan Kwon, Bong-Jin Lee, Youngki Kwon, Joon Son Chung

AusKidTalk: An Auditory-Visual Corpus of 3- to 12-Year-Old Australian Children’s Speech Beena Ahmed, Kirrie J. Ballard, Denis Burnham, Tharmakulasingam Sirojan, Hadi Mehmood, Dominique Estival, Elise Baker, Felicity Cox, Joanne Arciuli, Titia Benders, Katherine Demuth, Barbara Kelly, Chloé Diskin-Holdaway, Mostafa Shahin, Vidhyasaharan Sethu, Julien Epps, Chwee Beng Lee, Eliathamby Ambikairajah

Human-in-the-Loop Efficiency Analysis for Binary Classification in Edyson Per Fallgren, Jens Edlund

Annotation Confidence vs. Training Sample Size: Trade-Off Solution for Partially-Continuous Categorical Emotion Recognition Elena Ryumina, Oxana Verkholyak, Alexey Karpov

Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization Gonçal V. Garcés Díaz-Munío, Joan-Albert Silvestre-Cerdà, Javier Jorge, Adrià Giménez Pastor, Javier Iranzo-Sánchez, Pau Baquero-Arnal, Nahuel Roselló, Alejandro Pérez-González-de-Martos, Jorge Civera, Albert Sanchis, Alfons Juan

Towards Automatic Speech to Sign Language Generation Parul Kapoor, Rudrabha Mukhopadhyay, Sindhu B. Hegde, Vinay Namboodiri, C.V. Jawahar

kosp2e: Korean Speech to English Translation Corpus Won Ik Cho, Seok Min Kim, Hyunchang Cho, Nam Soo Kim

speechocean762: An Open-Source Non-Native English Speech Corpus for Pronunciation Assessment Junbo Zhang, Zhiwen Zhang, Yongqing Wang, Zhiyong Yan, Qiong Song, Yukai Huang, Ke Li, Daniel Povey, Yujun Wang

Non-Autoregressive Sequential Modeling for Speech Processing

An Improved Single Step Non-Autoregressive Transformer for Automatic Speech Recognition Ruchao Fan, Wei Chu, Peng Chang, Jing Xiao, Abeer Alwan

Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain Pengcheng Guo, Xuankai Chang, Shinji Watanabe, Lei Xie

Pushing the Limits of Non-Autoregressive Speech Recognition Edwin G. Ng, Chung-Cheng Chiu, Yu Zhang, William Chan

Non-Autoregressive Predictive Coding for Learning Speech Representations from Local Dependencies Alexander H. Liu, Yu-An Chung, James Glass

Relaxing the Conditional Independence Assumption of CTC-Based ASR by Conditioning on Intermediate Predictions Jumon Nozaki, Tatsuya Komatsu

Toward Streaming ASR with Non-Autoregressive Insertion-Based Model Yuya Fujita, Tianzi Wang, Shinji Watanabe, Motoi Omachi

Layer Pruning on Demand with Intermediate CTC Jaesong Lee, Jingu Kang, Shinji Watanabe

Real-Time End-to-End Monaural Multi-Speaker Speech Recognition Song Li, Beibei Ouyang, Fuchuan Tong, Dexin Liao, Lin Li, Qingyang Hong

Streaming End-to-End ASR Based on Blockwise Non-Autoregressive Models Tianzi Wang, Yuya Fujita, Xuankai Chang, Shinji Watanabe

TalkNet: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis Stanislav Beliaev, Boris Ginsburg

WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, Najim Dehak, William Chan

Align-Denoise: Single-Pass Non-Autoregressive Speech Recognition Nanxin Chen, Piotr Żelasko, Laureano Moro-Velázquez, Jesús Villalba, Najim Dehak

VAENAR-TTS: Variational Auto-Encoder Based Non-AutoRegressive Text-to-Speech Synthesis Hui Lu, Zhiyong Wu, Xixin Wu, Xu Li, Shiyin Kang, Xunying Liu, Helen Meng

The ADReSSo Challenge: Detecting Cognitive Decline Using Speech Only

Detecting Cognitive Decline Using Speech Only: The ADReSSo Challenge Saturnino Luz, Fasih Haider, Sofia de la Fuente, Davida Fromm, Brian MacWhinney

Influence of the Interviewer on the Automatic Assessment of Alzheimer’s Disease in the Context of the ADReSSo Challenge P.A. Pérez-Toro, S.P. Bayerl, T. Arias-Vergara, J.C. Vásquez-Correa, P. Klumpp, M. Schuster, Elmar Nöth, J.R. Orozco-Arroyave, K. Riedhammer

WavBERT: Exploiting Semantic and Non-Semantic Speech Using Wav2vec and BERT for Dementia Detection Youxiang Zhu, Abdelrahman Obyat, Xiaohui Liang, John A. Batsis, Robert M. Roth

Alzheimer Disease Recognition Using Speech-Based Embeddings From Pre-Trained Models Lara Gauder, Leonardo Pepino, Luciana Ferrer, Pablo Riera

Comparing Acoustic-Based Approaches for Alzheimer’s Disease Detection Aparna Balagopalan, Jekaterina Novikova

Alzheimer’s Disease Detection from Spontaneous Speech Through Combining Linguistic Complexity and (Dis)Fluency Features with Pretrained Language Models Yu Qiao, Xuefeng Yin, Daniel Wiechmann, Elma Kerz

Using the Outputs of Different Automatic Speech Recognition Paradigms for Acoustic- and BERT-Based Alzheimer’s Dementia Detection Through Spontaneous Speech Yilin Pan, Bahman Mirheidari, Jennifer M. Harris, Jennifer C. Thompson, Matthew Jones, Julie S. Snowden, Daniel Blackburn, Heidi Christensen

Tackling the ADRESSO Challenge 2021: The MUET-RMIT System for Alzheimer’s Dementia Recognition from Spontaneous Speech Zafi Sherhan Syed, Muhammad Shehram Shah Syed, Margaret Lech, Elena Pirogova

Alzheimer’s Dementia Recognition Using Acoustic, Lexical, Disfluency and Speech Pause Features Robust to Noisy Inputs Morteza Rohanian, Julian Hough, Matthew Purver

Automatic Detection and Assessment of Alzheimer Disease Using Speech and Language Technologies in Low-Resource Scenarios Raghavendra Pappagari, Jaejin Cho, Sonal Joshi, Laureano Moro-Velázquez, Piotr Żelasko, Jesús Villalba, Najim Dehak

Automatic Detection of Alzheimer’s Disease Using Spontaneous Speech Only Jun Chen, Jieping Ye, Fengyi Tang, Jiayu Zhou

Modular Multi-Modal Attention Network for Alzheimer’s Disease Detection Using Patient Audio and Language Data Ning Wang, Yupeng Cao, Shuai Hao, Zongru Shao, K.P. Subbalakshmi

Robust and Far-Field ASR

Self-Attention Channel Combinator Frontend for End-to-End Multichannel Far-Field Speech Recognition Rong Gong, Carl Quillen, Dushyant Sharma, Andrew Goderre, José Laínez, Ljubomir Milanović

ETLT 2021: Shared Task on Automatic Speech Recognition for Non-Native Children’s Speech R. Gretter, Marco Matassoni, D. Falavigna, A. Misra, C.W. Leong, K. Knill, L. Wang

Age-Invariant Training for End-to-End Child Speech Recognition Using Adversarial Multi-Task Learning Lars Rumberg, Hanna Ehlert, Ulrike Lüdtke, Jörn Ostermann

Learning to Rank Microphones for Distant Speech Recognition Samuele Cornell, Alessio Brutti, Marco Matassoni, Stefano Squartini

Simulating Reading Mistakes for Child Speech Transformer-Based Phone Recognition Lucile Gelin, Thomas Pellegrini, Julien Pinquier, Morgane Daniel

Speech Synthesis: Prosody Modeling II

Alternate Endings: Improving Prosody for Incremental Neural TTS with Predicted Future Text Input Brooke Stephenson, Thomas Hueber, Laurent Girin, Laurent Besacier

Exploring Emotional Prototypes in a High Dimensional TTS Latent Space Pol van Rijn, Silvan Mertes, Dominik Schiller, Peter M.C. Harrison, Pauline Larrouy-Maestri, Elisabeth André, Nori Jacoby

Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis Devang S. Ram Mohan, Vivian Hu, Tian Huey Teh, Alexandra Torresquintero, Christopher G.R. Wallis, Marlene Staib, Lorenzo Foglianti, Jiameng Gao, Simon King

ADEPT: A Dataset for Evaluating Prosody Transfer Alexandra Torresquintero, Tian Huey Teh, Christopher G.R. Wallis, Marlene Staib, Devang S. Ram Mohan, Vivian Hu, Lorenzo Foglianti, Jiameng Gao, Simon King

Prosodic Boundary Prediction Model for Vietnamese Text-To-Speech Nguyen Thi Thu Trang, Nguyen Hoang Ky, Albert Rilliard, Christophe d'Alessandro

Source Separation III

Many-Speakers Single Channel Speech Separation with Optimal Permutation Training Shaked Dovrat, Eliya Nachmani, Lior Wolf

Combating Reverberation in NTF-Based Speech Separation Using a Sub-Source Weighted Multichannel Wiener Filter and Linear Prediction Mieszko Fraś, Marcin Witkowski, Konrad Kowalczyk

A Hands-On Comparison of DNNs for Dialog Separation Using Transfer Learning from Music Source Separation Martin Strauss, Jouni Paulus, Matteo Torcoli, Bernd Edler

GlobalPhone Mix-To-Separate Out of 2: A Multilingual 2000 Speakers Mixtures Database for Speech Separation Marvin Borsdorf, Chenglin Xu, Haizhou Li, Tanja Schultz

Non-Native Speech

Cross-Linguistic Perception of the Japanese Singleton/Geminate Contrast: Korean, Mandarin and Mongolian Compared Kimiko Tsukada, Yurong, Joo-Yeon Kim, Jeong-Im Han, John Hajek

Detection of Lexical Stress Errors in Non-Native (L2) English with Data Augmentation and Attention Daniel Korzekwa, Roberto Barra-Chicote, Szymon Zaporowski, Grzegorz Beringer, Jaime Lorenzo-Trueba, Alicja Serafinowicz, Jasha Droppo, Thomas Drugman, Bozena Kostek

Testing Acoustic Voice Quality Classification Across Languages and Speech Styles Bettina Braun, Nicole Dehé, Marieke Einfeldt, Daniela Wochner, Katharina Zahner-Ritter

Acquisition of Prosodic Focus Marking by Three- to Six-Year-Old Children Learning Mandarin Chinese Qianyutong Zhang, Kexin Lyu, Zening Chen, Ping Tang

Adaptive Listening Difficulty Detection for L2 Learners Through Moderating ASR Resources Maryam Sadat Mirzaei, Kourosh Meshgi

F0 Patterns of L2 English Speech by Mandarin Chinese Learners Hongwei Ding, Binghuai Lin, Liyuan Wang

A Neural Network-Based Noise Compensation Method for Pronunciation Assessment Binghuai Lin, Liyuan Wang

Phonetic Distance and Surprisal in Multilingual Priming: Evidence from Slavic Jacek Kudera, Philip Georgis, Bernd Möbius, Tania Avgustinova, Dietrich Klakow

A Preliminary Study on Discourse Prosody Encoding in L1 and L2 English Spontaneous Narratives Yuqing Zhang, Zhu Li, Binghuai Lin, Jinsong Zhang

Transformer Based End-to-End Mispronunciation Detection and Diagnosis Minglin Wu, Kun Li, Wai-Kim Leung, Helen Meng

L1 Identification from L2 Speech Using Neural Spectrogram Analysis Calbert Graham

Phonetics II

Leveraging Real-Time MRI for Illuminating Linguistic Velum Action Miran Oh, Dani Byrd, Shrikanth S. Narayanan

Segmental Alignment of English Syllables with Singleton and Cluster Onsets Zirui Liu, Yi Xu

Exploration of Welsh English Pre-Aspiration: How Wide-Spread is it? Míša Hejná

Revisiting Recall Effects of Filler Particles in German and English Beeke Muhlack, Mikey Elmers, Heiner Drenhaus, Jürgen Trouvain, Marjolein van Os, Raphael Werner, Margarita Ryzhova, Bernd Möbius

How Reliable Are Phonetic Data Collected Remotely? Comparison of Recording Devices and Environments on Acoustic Measurements Chunyu Ge, Yixuan Xiong, Peggy Mok

A Cross-Dialectal Comparison of Apical Vowels in Beijing Mandarin, Northeastern Mandarin and Southwestern Mandarin: An EMA and Ultrasound Study Jing Huang, Feng-fan Hsieh, Yueh-chin Chang

Dissecting the Aero-Acoustic Parameters of Open Articulatory Transitions Mark Gibson, Oihane Muxika, Marianne Pouplier

Quantifying Vocal Tract Shape Variation and its Acoustic Impact: A Geometric Morphometric Approach Amelia J. Gully

Speech Perception and Loanword Adaptations: The Case of Copy-Vowel Epenthesis Adriana Guevara-Rukoz, Shi Yu, Sharon Peperkamp

Speakers Coarticulate Less When Facing Real and Imagined Communicative Difficulties: An Analysis of Read and Spontaneous Speech from the LUCID Corpus Zhe-chen Guo, Rajka Smiljanic

Developmental Changes of Vowel Acoustics in Adolescents Einar Meister, Lya Meister

Context and Co-Text Influence on the Accuracy Production of Italian L2 Non-Native Sounds Sonia d'Apolito, Barbara Gili Fivela

A New Vowel Normalization for Sociophonetics Wilbert Heeringa, Hans Van de Velde

The Pacific Expansion: Optimizing Phonetic Transcription of Archival Corpora Rosey Billington, Hywel Stoakes, Nick Thieberger

Search/Decoding Techniques and Confidence Measures for ASR

FSR: Accelerating the Inference Process of Transducer-Based Models by Applying Fast-Skip Regularization Zhengkun Tian, Jiangyan Yi, Ye Bai, Jianhua Tao, Shuai Zhang, Zhengqi Wen

LT-LM: A Novel Non-Autoregressive Language Model for Single-Shot Lattice Rescoring Anton Mitrofanov, Mariya Korenevskaya, Ivan Podluzhny, Yuri Khokhlov, Aleksandr Laptev, Andrei Andrusenko, Aleksei Ilin, Maxim Korenevsky, Ivan Medennikov, Aleksei Romanenko

A Hybrid Seq-2-Seq ASR Design for On-Device and Server Applications Cyril Allauzen, Ehsan Variani, Michael Riley, David Rybach, Hao Zhang

VAD-Free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording Hirofumi Inaguma, Tatsuya Kawahara

WeNet: Production Oriented Streaming and Non-Streaming End-to-End Speech Recognition Toolkit Zhuoyuan Yao, Di Wu, Xiong Wang, Binbin Zhang, Fan Yu, Chao Yang, Zhendong Peng, Xiaoyu Chen, Lei Xie, Xin Lei

Cross-Modal Transformer-Based Neural Correction Models for Automatic Speech Recognition Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Takafumi Moriya, Takanori Ashihara, Shota Orihashi, Naoki Makishima

Deep Neural Network Calibration for E2E Speech Recognition System Mun-Hak Lee, Joon-Hyuk Chang

Residual Energy-Based Models for End-to-End Speech Recognition Qiujia Li, Yu Zhang, Bo Li, Liangliang Cao, Philip C. Woodland

Multi-Task Learning for End-to-End ASR Word and Utterance Confidence with Deletion Prediction David Qiu, Yanzhang He, Qiujia Li, Yu Zhang, Liangliang Cao, Ian McGraw

Insights on Neural Representations for End-to-End Speech Recognition Anna Ollerenshaw, Md. Asif Jalal, Thomas Hain

Sequence-Level Confidence Classifier for ASR Utterance Accuracy and Application to Acoustic Models Amber Afshan, Kshitiz Kumar, Jian Wu

Speech Synthesis: Linguistic Processing, Paradigms and Other Topics

Unsupervised Learning of Disentangled Speech Content and Style Representation Andros Tjandra, Ruoming Pang, Yu Zhang, Shigeki Karita

Label Embedding for Chinese Grapheme-to-Phoneme Conversion Eunbi Choi, Hwa-Yeon Kim, Jong-Hwan Kim, Jae-Min Kim

PDF: Polyphone Disambiguation in Chinese by Using FLAT Haiteng Zhang

Improving Polyphone Disambiguation for Mandarin Chinese by Combining Mix-Pooling Strategy and Window-Based Attention Junjie Li, Zhiyu Zhang, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao

Polyphone Disambiguation in Mandarin Chinese with Semi-Supervised Learning Yi Shi, Congyi Wang, Yu Chen, Bin Wang

A Neural-Network-Based Approach to Identifying Speakers in Novels Yue Chen, Zhen-Hua Ling, Qing-Feng Liu

UnitNet-Based Hybrid Speech Synthesis Xiao Zhou, Zhen-Hua Ling, Li-Rong Dai

Dynamically Adaptive Machine Speech Chain Inference for TTS in Noisy Environment: Listen and Speak Louder Sashi Novitasari, Sakriani Sakti, Satoshi Nakamura

LinearSpeech: Parallel Text-to-Speech with Linear Complexity Haozhe Zhang, Zhihua Huang, Zengqiang Shang, Pengyuan Zhang, Yonghong Yan

Speech Type Classification and Diagnosis

An Agent for Competing with Humans in a Deceptive Game Based on Vocal Cues Noa Mansbach, Evgeny Hershkovitch Neiterman, Amos Azaria

A Multi-Branch Deep Learning Network for Automated Detection of COVID-19 Ahmed Fakhry, Xinyi Jiang, Jaclyn Xiao, Gunvant Chaudhari, Asriel Han

RW-Resnet: A Novel Speech Anti-Spoofing Model Using Raw Waveform Youxuan Ma, Zongze Ren, Shugong Xu

Fake Audio Detection in Resource-Constrained Settings Using Microfeatures Hira Dhamyal, Ayesha Ali, Ihsan Ayyub Qazi, Agha Ali Raza

Coughing-Based Recognition of Covid-19 with Spatial Attentive ConvLSTM Recurrent Neural Networks Tianhao Yan, Hao Meng, Emilia Parada-Cabaleiro, Shuo Liu, Meishu Song, Björn W. Schuller

Knowledge Distillation for Singing Voice Detection Soumava Paul, Gurunath Reddy M, K. Sreenivasa Rao, Partha Pratim Das

Age Estimation with Speech-Age Model for Heterogeneous Speech Datasets Ryu Takeda, Kazunori Komatani

Open-Set Audio Classification with Limited Training Resources Based on Augmentation Enhanced Variational Auto-Encoder GAN with Detection-Classification Joint Training Kah Kuan Teh, Huy Dat Tran

Deep Spectral-Cepstral Fusion for Shouted and Normal Speech Classification Takahiro Fukumori

Automatic Detection of Shouted Speech Segments in Indian News Debates Shikha Baghel, Mrinmoy Bhattacharjee, S.R. Mahadeva Prasanna, Prithwijit Guha

Generalized Spoofing Detection Inspired from Audio Generation Artifacts Yang Gao, Tyler Vuong, Mahsa Elyasi, Gaurav Bharaj, Rita Singh

Overlapped Speech Detection Based on Spectral and Spatial Feature Fusion Weiguang Chen, Van Tung Pham, Eng Siong Chng, Xionghu Zhong

Spoken Term Detection & Voice Search

Do Acoustic Word Embeddings Capture Phonological Similarity? An Empirical Study Badr M. Abdullah, Marius Mosbach, Iuliia Zaitova, Bernd Möbius, Dietrich Klakow

Paraphrase Label Alignment for Voice Application Retrieval in Spoken Language Understanding Zheng Gao, Radhika Arava, Qian Hu, Xibin Gao, Thahir Mohamed, Wei Xiao, Mohamed AbdelHady

Personalized Keyphrase Detection Using Speaker and Environment Information Rajeev Rikhye, Quan Wang, Qiao Liang, Yanzhang He, Ding Zhao, Yiteng Huang, Arun Narayanan, Ian McGraw

Streaming Transformer for Hardware Efficient Voice Trigger Detection and False Trigger Mitigation Vineet Garg, Wonil Chang, Siddharth Sigtia, Saurabh Adya, Pramod Simha, Pranay Dighe, Chandra Dhir

Few-Shot Keyword Spotting in Any Language Mark Mazumder, Colby Banbury, Josh Meyer, Pete Warden, Vijay Janapa Reddi

Text Anchor Based Metric Learning for Small-Footprint Keyword Spotting Li Wang, Rongzhi Gu, Nuo Chen, Yuexian Zou

A Meta-Learning Approach for User-Defined Spoken Term Classification with Varying Classes and Examples Yangbin Chen, Tom Ko, Jianping Wang

Auxiliary Sequence Labeling Tasks for Disfluency Detection Dongyub Lee, Byeongil Ko, Myeong Cheol Shin, Taesun Whang, Daniel Lee, Eunhwa Kim, Eunggyun Kim, Jaechoon Jo

Energy-Friendly Keyword Spotting System Using Add-Based Convolution Hang Zhou, Wenchao Hu, Yu Ting Yeung, Xiao Chen

The 2020 Personalized Voice Trigger Challenge: Open Datasets, Evaluation Metrics, Baseline System and Results Yan Jia, Xingming Wang, Xiaoyi Qin, Yinping Zhang, Xuyang Wang, Junjie Wang, Dong Zhang, Ming Li

Auto-KWS 2021 Challenge: Task, Datasets, and Baselines Jingsong Wang, Yuxuan He, Chunyu Zhao, Qijie Shao, Wei-Wei Tu, Tom Ko, Hung-yi Lee, Lei Xie

Keyword Transformer: A Self-Attention Model for Keyword Spotting Axel Berg, Mark O’Connor, Miguel Tairum Cruz

Teaching Keyword Spotters to Spot New Keywords with Limited Examples Abhijeet Awasthi, Kevin Kilgour, Hassan Rom

Voice Anti-Spoofing and Countermeasure

A Comparative Study on Recent Neural Spoofing Countermeasures for Synthetic Speech Detection Xin Wang, Junichi Yamagishi

An Initial Investigation for Detecting Partially Spoofed Audio Lin Zhang, Xin Wang, Erica Cooper, Junichi Yamagishi, Jose Patino, Nicholas Evans

Siamese Network with wav2vec Feature for Spoofing Speech Detection Yang Xie, Zhenchuan Zhang, Yingchun Yang

Cross-Database Replay Detection in Terminal-Dependent Speaker Verification Xingliang Cheng, Mingxing Xu, Thomas Fang Zheng

The Effect of Silence and Dual-Band Fusion in Anti-Spoofing System Yuxiang Zhang, Wenchao Wang, Pengyuan Zhang

Pairing Weak with Strong: Twin Models for Defending Against Adversarial Attack on Speaker Verification Zhiyuan Peng, Xu Li, Tan Lee

Attention-Based Convolutional Neural Network for ASV Spoofing Detection Hefei Ling, Leichao Huang, Junrui Huang, Baiyan Zhang, Ping Li

Voting for the Right Answer: Adversarial Defense for Speaker Verification Haibin Wu, Yang Zhang, Zhiyong Wu, Dong Wang, Hung-yi Lee

Visualizing Classifier Adjacency Relations: A Case Study in Speaker Verification and Voice Anti-Spoofing Tomi Kinnunen, Andreas Nautsch, Md. Sahidullah, Nicholas Evans, Xin Wang, Massimiliano Todisco, Héctor Delgado, Junichi Yamagishi, Kong Aik Lee

Representation Learning to Classify and Detect Adversarial Attacks Against Speaker and Speech Recognition Systems Jesús Villalba, Sonal Joshi, Piotr Żelasko, Najim Dehak

An Empirical Study on Channel Effects for Synthetic Voice Spoofing Countermeasure Systems You Zhang, Ge Zhu, Fei Jiang, Zhiyao Duan

Channel-Wise Gated Res2Net: Towards Robust Detection of Synthetic Speech Attacks Xu Li, Xixin Wu, Hui Lu, Xunying Liu, Helen Meng

Partially-Connected Differentiable Architecture Search for Deepfake and Spoofing Detection Wanying Ge, Michele Panariello, Jose Patino, Massimiliano Todisco, Nicholas Evans

OpenASR20 and Low Resource ASR Development

OpenASR20: An Open Challenge for Automatic Speech Recognition of Conversational Telephone Speech in Low-Resource Languages Kay Peterson, Audrey Tong, Yan Yu

Multitask Adaptation with Lattice-Free MMI for Multi-Genre Speech Recognition of Low Resource Languages Srikanth Madikeri, Petr Motlicek, Hervé Bourlard

An Improved Wav2Vec 2.0 Pre-Training Approach Using Enhanced Local Dependency Modeling for Speech Recognition Qiu-shi Zhu, Jie Zhang, Ming-hui Wu, Xin Fang, Li-Rong Dai

Systems for Low-Resource Speech Recognition Tasks in Open Automatic Speech Recognition and Formosa Speech Recognition Challenges Hung-Pang Lin, Yu-Jia Zhang, Chia-Ping Chen

The TNT Team System Descriptions of Cantonese and Mongolian for IARPA OpenASR20 Jing Zhao, Zhiqiang Lv, Ambyera Han, Guan-Bo Wang, Guixin Shi, Jian Kang, Jinghao Yan, Pengfei Hu, Shen Huang, Wei-Qiang Zhang

Combining Hybrid and End-to-End Approaches for the OpenASR20 Challenge Tanel Alumäe, Jiaming Kong

One Size Does Not Fit All in Resource-Constrained ASR Ethan Morris, Robbie Jimerson, Emily Prud’hommeaux

Survey Talk 4: Alejandrina Cristia

Child Language Acquisition Studied with Wearables Alejandrina Cristia

Keynote 4: Tomáš Mikolov

Language Modeling and Artificial Intelligence Tomáš Mikolov

Voice Activity Detection

Unsupervised Representation Learning for Speech Activity Detection in the Fearless Steps Challenge 2021 Pablo Gimeno, Alfonso Ortega, Antonio Miguel, Eduardo Lleida

The Application of Learnable STRF Kernels to the 2021 Fearless Steps Phase-03 SAD Challenge Tyler Vuong, Yangyang Xia, Richard M. Stern

Speech Activity Detection Based on Multilingual Speech Recognition System Seyyed Saeed Sarfjoo, Srikanth Madikeri, Petr Motlicek

Voice Activity Detection with Teacher-Student Domain Emulation Jarrod Luckenbaugh, Samuel Abplanalp, Rachel Gonzalez, Daniel Fulford, David Gard, Carlos Busso

EML Online Speech Activity Detection for the Fearless Steps Challenge Phase-III Omid Ghahabi, Volker Fischer

Keyword Search and Spoken Language Processing

Device Playback Augmentation with Echo Cancellation for Keyword Spotting Kuba Łopatka, Katarzyna Kaszuba-Miotke, Piotr Klinke, Paweł Trella

End-to-End Open Vocabulary Keyword Search Bolaji Yusuf, Alican Gok, Batuhan Gundogdu, Murat Saraclar

Semantic Sentence Similarity: Size does not Always Matter Danny Merkx, Stefan L. Frank, Mirjam Ernestus

Spoken Term Detection and Relevance Score Estimation Using Dot-Product of Pronunciation Embeddings Jan Švec, Luboš Šmídl, Josef V. Psutka, Aleš Pražák

Toward Genre Adapted Closed Captioning François Buet, François Yvon

Applications in Transcription, Education and Learning

Weakly-Supervised Word-Level Pronunciation Error Detection in Non-Native English Speech Daniel Korzekwa, Jaime Lorenzo-Trueba, Thomas Drugman, Shira Calamaro, Bozena Kostek

End-to-End Speaker-Attributed ASR with Transformer Naoyuki Kanda, Guoli Ye, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka

Understanding Medical Conversations: Rich Transcription, Confidence Scores & Information Extraction Hagen Soltau, Mingqiu Wang, Izhak Shafran, Laurent El Shafey

Phone-Level Pronunciation Scoring for Spanish Speakers Learning English Using a GOP-DNN System Jazmín Vidal, Cyntia Bonomi, Marcelo Sancinetti, Luciana Ferrer

Explore wav2vec 2.0 for Mispronunciation Detection Xiaoshuo Xu, Yueteng Kang, Songjun Cao, Binghuai Lin, Long Ma

Lexical Density Analysis of Word Productions in Japanese English Using Acoustic Word Embeddings Shintaro Ando, Nobuaki Minematsu, Daisuke Saito

Deep Feature Transfer Learning for Automatic Pronunciation Assessment Binghuai Lin, Liyuan Wang

Multilingual Speech Evaluation: Case Studies on English, Malay and Tamil Huayun Zhang, Ke Shi, Nancy F. Chen

A Study on Fine-Tuning wav2vec2.0 Model for the Task of Mispronunciation Detection and Diagnosis Linkai Peng, Kaiqi Fu, Binghuai Lin, Dengfeng Ke, Jinsong Zhang

The Impact of ASR on the Automatic Analysis of Linguistic Complexity and Sophistication in Spontaneous L2 Speech Yu Qiao, Wei Zhou, Elma Kerz, Ralf Schlüter

End-to-End Rich Transcription-Style Automatic Speech Recognition with Semi-Supervised Learning Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Shota Orihashi, Naoki Makishima

“You don’t understand me!”: Comparing ASR Results for L1 and L2 Speakers of Swedish Ronald Cumbal, Birger Moell, José Lopes, Olov Engwall

NeMo Inverse Text Normalization: From Development to Production Yang Zhang, Evelina Bakhturina, Kyle Gorman, Boris Ginsburg

Improvement of Automatic English Pronunciation Assessment with Small Number of Utterances Using Sentence Speakability Satsuki Naijo, Akinori Ito, Takashi Nose

Emotion and Sentiment Analysis III

Affect Recognition Through Scalogram and Multi-Resolution Cochleagram Features Fasih Haider, Saturnino Luz

A Speech Emotion Recognition Framework for Better Discrimination of Confusions Jiawang Liu, Haoxiang Wang

Speech Emotion Recognition via Multi-Level Cross-Modal Distillation Ruichen Li, Jinming Zhao, Qin Jin

Audio-Visual Speech Emotion Recognition by Disentangling Emotion and Identity Attributes Koichiro Ito, Takuya Fujioka, Qinghua Sun, Kenji Nagamatsu

Parametric Distributions to Model Numerical Emotion Labels Deboshree Bose, Vidhyasaharan Sethu, Eliathamby Ambikairajah

Metric Learning Based Feature Representation with Gated Fusion Model for Speech Emotion Recognition Yuan Gao, Jiaxing Liu, Longbiao Wang, Jianwu Dang

Speech Emotion Recognition with Multi-Task Learning Xingyu Cai, Jiahong Yuan, Renjie Zheng, Liang Huang, Kenneth Church

Generalized Dilated CNN Models for Depression Detection Using Inverted Vocal Tract Variables Nadee Seneviratne, Carol Espy-Wilson

Learning Mutual Correlation in Multimodal Transformer for Speech Emotion Recognition Yuhua Wang, Guang Shen, Yuezhu Xu, Jiahang Li, Zhengdao Zhao

Time-Frequency Representation Learning with Graph Convolutional Network for Dialogue-Level Speech Emotion Recognition Jiaxing Liu, Yaodong Song, Longbiao Wang, Jianwu Dang, Ruiguo Yu

Resource-Constrained ASR

Compressing 1D Time-Channel Separable Convolutions Using Sparse Random Ternary Matrices Gonçalo Mordido, Matthijs Van keirsbilck, Alexander Keller

Weakly Supervised Construction of ASR Systems from Massive Video Data Mengli Cheng, Chengyu Wang, Jun Huang, Xiaobo Wang

Broadcasted Residual Learning for Efficient Keyword Spotting Byeonggeun Kim, Simyung Chang, Jinkyu Lee, Dooyong Sung

CoDERT: Distilling Encoder Representations with Co-Learning for Transducer-Based Speech Recognition Rupak Vignesh Swaminathan, Brian King, Grant P. Strimel, Jasha Droppo, Athanasios Mouchtaris

Extremely Low Footprint End-to-End ASR System for Smart Device Zhifu Gao, Yiwu Yao, Shiliang Zhang, Jun Yang, Ming Lei, Ian McLoughlin

Dissecting User-Perceived Latency of On-Device E2E Speech Recognition Yuan Shangguan, Rohit Prabhavalkar, Hang Su, Jay Mahadeokar, Yangyang Shi, Jiatong Zhou, Chunyang Wu, Duc Le, Ozlem Kalinli, Christian Fuegen, Michael L. Seltzer

Amortized Neural Networks for Low-Latency Speech Recognition Jonathan Macoskey, Grant P. Strimel, Jinru Su, Ariya Rastrow

Tied & Reduced RNN-T Decoder Rami Botros, Tara N. Sainath, Robert David, Emmanuel Guzman, Wei Li, Yanzhang He

PQK: Model Compression via Pruning, Quantization, and Knowledge Distillation Jangho Kim, Simyung Chang, Nojun Kwak

Collaborative Training of Acoustic Encoders for Speech Recognition Varun Nagaraja, Yangyang Shi, Ganesh Venkatesh, Ozlem Kalinli, Michael L. Seltzer, Vikas Chandra

Efficient Conformer with Prob-Sparse Attention Mechanism for End-to-End Speech Recognition Xiong Wang, Sining Sun, Lei Xie, Long Ma

The Energy and Carbon Footprint of Training End-to-End Speech Recognizers Titouan Parcollet, Mirco Ravanelli

Speaker Recognition: Applications

Graph-Based Label Propagation for Semi-Supervised Speaker Identification Long Chen, Venkatesh Ravichandran, Andreas Stolcke

Fusion of Embeddings Networks for Robust Combination of Text Dependent and Independent Speaker Recognition Ruirui Li, Chelsea J.-T. Ju, Zeya Chen, Hongda Mao, Oguz Elibol, Andreas Stolcke

A Generative Model for Duration-Dependent Score Calibration Sandro Cumani, Salvatore Sarni

Dr-Vectors: Decision Residual Networks and an Improved Loss for Speaker Recognition Jason Pelecanos, Quan Wang, Ignacio Lopez Moreno

Multi-Channel Speaker Verification for Single and Multi-Talker Speech Saurabh Kataria, Shi-Xiong Zhang, Dong Yu

Chronological Self-Training for Real-Time Speaker Diarization Dirk Padfield, Daniel J. Liebling

Adaptive Margin Circle Loss for Speaker Verification Runqiu Xiao, Xiaoxiao Miao, Wenchao Wang, Pengyuan Zhang, Bin Cai, Liuping Luo

Presentation Matters: Evaluating Speaker Identification Tasks Benjamin O’Brien, Christine Meunier, Alain Ghio

Automatic Error Correction for Speaker Embedding Learning with Noisy Labels Fuchuan Tong, Yan Liu, Song Li, Jie Wang, Lin Li, Qingyang Hong

An Integrated Framework for Two-Pass Personalized Voice Trigger Dexin Liao, Jing Li, Yiming Zhi, Song Li, Qingyang Hong, Lin Li

Masked Proxy Loss for Text-Independent Speaker Verification Jiachen Lian, Aiswarya Vinod Kumar, Hira Dhamyal, Bhiksha Raj, Rita Singh

Speech Synthesis: Speaking Style and Emotion

STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech Keon Lee, Kyumin Park, Daeyoung Kim

Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability Rui Liu, Berrak Sisman, Haizhou Li

Emotional Prosody Control for Speech Generation Sarath Sivaprasad, Saiteja Kosgi, Vineet Gandhi

Controllable Context-Aware Conversational Speech Synthesis Jian Cong, Shan Yang, Na Hu, Guangzhi Li, Lei Xie, Dan Su

Expressive Text-to-Speech Using Style Tag Minchan Kim, Sung Jun Cheon, Byoung Jin Choi, Jong Jin Kim, Nam Soo Kim

Adaptive Text to Speech for Spontaneous Style Yuzi Yan, Xu Tan, Bohan Li, Guangyan Zhang, Tao Qin, Sheng Zhao, Yuan Shen, Wei-Qiang Zhang, Tie-Yan Liu

Towards Multi-Scale Style Control for Expressive Speech Synthesis Xiang Li, Changhe Song, Jingbei Li, Zhiyong Wu, Jia Jia, Helen Meng

Cross-Speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis Shifeng Pan, Lei He

Fine-Grained Style Modeling, Transfer and Prediction in Text-to-Speech Synthesis via Phone-Level Content-Style Disentanglement Daxin Tan, Tan Lee

Improving Performance of Seen and Unseen Speech Style Transfer in End-to-End Neural TTS Xiaochun An, Frank K. Soong, Lei Xie

Synthesis of Expressive Speaking Styles with Limited Training Data in a Multi-Speaker, Prosody-Controllable Sequence-to-Sequence Architecture Slava Shechtman, Raul Fernandez, Alexander Sorin, David Haws

Spoken Language Understanding II

Intent Detection and Slot Filling for Vietnamese Mai Hoang Dao, Thinh Hung Truong, Dat Quoc Nguyen

Augmenting Slot Values and Contexts for Spoken Language Understanding with Pretrained Models Haitao Lin, Lu Xiang, Yu Zhou, Jiajun Zhang, Chengqing Zong

The Impact of Intent Distribution Mismatch on Semi-Supervised Spoken Language Understanding Judith Gaspers, Quynh Do, Daniil Sorokin, Patrick Lehnen

Knowledge Distillation from BERT Transformer to Speech Transformer for Intent Classification Yidi Jiang, Bidisha Sharma, Maulik Madhavi, Haizhou Li

Three-Module Modeling For End-to-End Spoken Language Understanding Using Pre-Trained DNN-HMM-Based Acoustic-Phonetic Model Nick J.C. Wang, Lu Wang, Yandan Sun, Haimei Kang, Dejun Zhang

Speak or Chat with Me: End-to-End Spoken Language Understanding System with Flexible Inputs Sujeong Cha, Wangrui Hou, Hyun Jung, My Phung, Michael Picheny, Hong-Kwang J. Kuo, Samuel Thomas, Edmilson Morais

End-to-End Cross-Lingual Spoken Language Understanding Model with Multilingual Pretraining Xianwei Zhang, Liang He

Factorization-Aware Training of Transformers for Natural Language Understanding on the Edge Hamidreza Saghir, Samridhi Choudhary, Sepehr Eghbali, Clement Chung

End-to-End Spoken Language Understanding for Generalized Voice Assistants Michael Saxon, Samridhi Choudhary, Joseph P. McKenna, Athanasios Mouchtaris

Bi-Directional Joint Neural Networks for Intent Classification and Slot Filling Soyeon Caren Han, Siqu Long, Huichun Li, Henry Weld, Josiah Poon

INTERSPEECH 2021 Acoustic Echo Cancellation Challenge

INTERSPEECH 2021 Acoustic Echo Cancellation Challenge Ross Cutler, Ando Saabas, Tanel Parnamaa, Markus Loide, Sten Sootla, Marju Purin, Hannes Gamper, Sebastian Braun, Karsten Sorensen, Robert Aichner, Sriram Srinivasan

Acoustic Echo Cancellation with Cross-Domain Learning Lukas Pfeifenberger, Matthias Zoehrer, Franz Pernkopf

F-T-LSTM Based Complex Network for Joint Acoustic Echo Cancellation and Speech Enhancement Shimin Zhang, Yuxiang Kong, Shubo Lv, Yanxin Hu, Lei Xie

Y²-Net FCRN for Acoustic Echo and Noise Suppression Ernst Seidel, Jan Franzen, Maximilian Strake, Tim Fingscheidt

Acoustic Echo Cancellation Using Deep Complex Neural Network with Nonlinear Magnitude Compression and Phase Information Renhua Peng, Linjuan Cheng, Chengshi Zheng, Xiaodong Li

Nonlinear Acoustic Echo Cancellation with Deep Learning Amir Ivry, Israel Cohen, Baruch Berdugo

Speech Recognition of Atypical Speech

Automatic Speech Recognition of Disordered Speech: Personalized Models Outperforming Human Listeners on Short Phrases Jordan R. Green, Robert L. MacDonald, Pan-Pan Jiang, Julie Cattiau, Rus Heywood, Richard Cave, Katie Seaver, Marilyn A. Ladewig, Jimmy Tobin, Michael P. Brenner, Philip C. Nelson, Katrin Tomanek

Investigating the Utility of Multimodal Conversational Technology and Audiovisual Analytic Measures for the Assessment and Monitoring of Amyotrophic Lateral Sclerosis at Scale Michael Neumann, Oliver Roesler, Jackson Liscombe, Hardik Kothare, David Suendermann-Oeft, David Pautler, Indu Navar, Aria Anvar, Jochen Kumm, Raquel Norel, Ernest Fraenkel, Alexander V. Sherman, James D. Berry, Gary L. Pattee, Jun Wang, Jordan R. Green, Vikram Ramanarayanan

Handling Acoustic Variation in Dysarthric Speech Recognition Systems Through Model Combination Enno Hermann, Mathew Magimai-Doss

Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition Mengzhe Geng, Shansong Liu, Jianwei Yu, Xurong Xie, Shoukang Hu, Zi Ye, Zengrui Jin, Xunying Liu, Helen Meng

Speaking with a KN95 Face Mask: ASR Performance and Speaker Compensation Sarah E. Gutz, Hannah P. Rowe, Jordan R. Green

Adversarial Data Augmentation for Disordered Speech Recognition Zengrui Jin, Mengzhe Geng, Xurong Xie, Jianwei Yu, Shansong Liu, Xunying Liu, Helen Meng

Variational Auto-Encoder Based Variability Encoding for Dysarthric Speech Recognition Xurong Xie, Rukiye Ruzi, Xunying Liu, Lan Wang

Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion Disong Wang, Songxiang Liu, Lifa Sun, Xixin Wu, Xunying Liu, Helen Meng

Bayesian Parametric and Architectural Domain Adaptation of LF-MMI Trained TDNNs for Elderly and Dysarthric Speech Recognition Jiajun Deng, Fabian Ritter Gutierrez, Shoukang Hu, Mengzhe Geng, Xurong Xie, Zi Ye, Shansong Liu, Jianwei Yu, Xunying Liu, Helen Meng

A Voice-Activated Switch for Persons with Motor and Speech Impairments: Isolated-Vowel Spotting Using Neural Networks Shanqing Cai, Lisie Lillianfeld, Katie Seaver, Jordan R. Green, Michael P. Brenner, Philip C. Nelson, D. Sculley

Conformer Parrotron: A Faster and Stronger End-to-End Speech Conversion and Recognition Model for Atypical Speech Zhehuai Chen, Bhuvana Ramabhadran, Fadi Biadsy, Xia Zhang, Youzheng Chen, Liyang Jiang, Fang Chu, Rohan Doshi, Pedro J. Moreno

Disordered Speech Data Collection: Lessons Learned at 1 Million Utterances from Project Euphonia Robert L. MacDonald, Pan-Pan Jiang, Julie Cattiau, Rus Heywood, Richard Cave, Katie Seaver, Marilyn A. Ladewig, Jimmy Tobin, Michael P. Brenner, Philip C. Nelson, Jordan R. Green, Katrin Tomanek

Automatic Severity Classification of Korean Dysarthric Speech Using Phoneme-Level Pronunciation Features Eun Jung Yeo, Sunhee Kim, Minhwa Chung

Comparing Supervised Models and Learned Speech Representations for Classifying Intelligibility of Disordered Speech on Selected Phrases Subhashini Venugopalan, Joel Shor, Manoj Plakal, Jimmy Tobin, Katrin Tomanek, Jordan R. Green, Michael P. Brenner

Analysis and Tuning of a Voice Assistant System for Dysfluent Speech Vikramjit Mitra, Zifang Huang, Colin Lea, Lauren Tooley, Sarah Wu, Darren Botten, Ashwini Palekar, Shrinath Thelapurath, Panayiotis Georgiou, Sachin Kajarekar, Jefferey Bigham

Show and Tell 4

Interactive and Real-Time Acoustic Measurement Tools for Speech Data Acquisition and Presentation: Application of an Extended Member of Time Stretched Pulses Hideki Kawahara, Kohei Yatabe, Ken-Ichi Sakakibara, Mitsunori Mizumachi, Masanori Morise, Hideki Banno, Toshio Irino

Save Your Voice: Voice Banking and TTS for Anyone Daniel Tihelka, Markéta Řezáčková, Martin Grůber, Zdeněk Hanzlíček, Jakub Vít, Jindřich Matoušek

NeMo (Inverse) Text Normalization: From Development to Production Yang Zhang, Evelina Bakhturina, Boris Ginsburg

Lalilo: A Reading Assistant for Children Featuring Speech Recognition-Based Reading Mistake Detection Corentin Hembise, Lucile Gelin, Morgane Daniel

Automatic Radiology Report Editing Through Voice Manh Hung Nguyen, Vu Hoang, Tu Anh Nguyen, Trung H. Bui

WittyKiddy: Multilingual Spoken Language Learning for Kids Ke Shi, Kye Min Tan, Huayun Zhang, Siti Umairah Md. Salleh, Shikang Ni, Nancy F. Chen

Duplex Conversation in Outbound Agent System Chunxiang Jin, Minghui Yang, Zujie Wen

Web Interface for Estimating Articulatory Movements in Speech Production from Acoustics and Text Sathvik Udupa, Anwesha Roy, Abhayjeet Singh, Aravind Illa, Prasanta Kumar Ghosh

Conference of the International Speech Communication Association (INTERSPEECH)

Venue Information

  • has part: International Workshop on the History of Speech Communication Research (HSCR)
  • has part: International Workshop on Multimodal Analyses enabling Artificial Agents in Human-Machine Interaction (MA3HMI)
  • has part: Workshop on Statistical and Perceptual Audition (SAPA)
  • has part: International Workshop on Speech, Language and Audio in Multimedia (SLAM)
  • related: European Conference on Speech Communication and Technology (EUROSPEECH)
  • AVSP - Auditory-Visual Speech Processing: 1997, 1998, 1999, 2001, 2003, 2005, 2007, 2008, 2009, 2010, 2011, 2013, 2015, 2017, 2019
  • DiSS - Disfluency in Spontaneous Speech: 2001, 2003, 2005, 2010, 2013, 2023
  • ExLing - Experimental Linguistics: 2006, 2008, 2010, 2011
  • HSCR - History of Speech Communication Research: 2015, 2017, 2019, 2021, 2022
  • IWSLT - Spoken Language Translation: 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012
  • MA3HMI - Multimodal Analyses enabling Artificial Agents in Human-Machine Interaction: 2014
  • MAVEBA - Models and Analysis of Vocal Emissions for Biomedical Applications: 1999, 2001, 2003, 2005, 2007, 2009, 2011
  • MLSLP - Machine Learning in Speech and Language Processing: 2011, 2012
  • Odyssey - The Speaker and Language Recognition Workshop: 2001, 2004, 2008, 2010, 2012, 2014, 2016, 2018, 2020, 2022, 2024
  • SAPA - Statistical and Perceptual Audio Processing: 2004, 2006, 2008, 2010, 2012
  • SLAM - Speech, Language and Audio in Multimedia: 2013, 2014
  • SLaTE - Speech and Language Technology in Education: 2007, 2009, 2011, 2013, 2015
  • SLTU - Spoken Language Technologies for Under-resourced Languages: 2008, 2010, 2012, 2014
  • SSW - Speech Synthesis: 1990, 1994, 1998, 2001, 2004, 2007, 2010, 2013, 2016, 2019, 2021
  • WOCCI - Child, Computer and Interaction: 2008, 2009, 2012, 2014

24th INTERSPEECH 2023: Dublin, Ireland
interspeech2023.org

23rd INTERSPEECH 2022: Incheon, Korea
interspeech2022.org

22nd INTERSPEECH 2021: Brno, Czechia
interspeech2021.org

21st INTERSPEECH 2020: Shanghai, China
interspeech2020.org

20th INTERSPEECH 2019: Graz, Austria
www.interspeech2019.org

19th INTERSPEECH 2018: Hyderabad, India
www.interspeech2018.org

18th INTERSPEECH 2017: Stockholm, Sweden
www.interspeech2017.org

17th INTERSPEECH 2016: San Francisco, CA, USA

16th INTERSPEECH 2015: Dresden, Germany
www.interspeech2015.org

15th INTERSPEECH 2014: Singapore

14th INTERSPEECH 2013: Lyon, France

13th INTERSPEECH 2012: Portland, Oregon, USA

12th INTERSPEECH 2011: Florence, Italy

11th INTERSPEECH 2010: Makuhari, Japan

10th INTERSPEECH 2009: Brighton, UK

9th INTERSPEECH 2008: Brisbane, Australia

8th INTERSPEECH 2007: Antwerp, Belgium

9th ICSLP 2006: Pittsburgh, PA, USA

8th ICSLP 2004: Jeju Island, Korea

7th ICSLP 2002: Denver, Colorado, USA

6th ICSLP 2000: Beijing, China

5th ICSLP 1998: Sydney, Australia

4th ICSLP 1996: Philadelphia, PA, USA

3rd ICSLP 1994: Yokohama, Japan

2nd ICSLP 1992: Banff, Alberta, Canada

1st ICSLP 1990: Kobe, Japan


NVIDIA at INTERSPEECH 2021

August 30 – September 3, 2021

Join us at INTERSPEECH, a technical conference focused on the latest research and technologies in speech processing. NVIDIA will present accepted papers on our latest research in speech recognition and speech synthesis.

Explore NVIDIA’s work in conversational AI research across automatic speech recognition, natural language processing, and text-to-speech. This chapter of I AM AI reveals how NVIDIA developers and creators deploy state-of-the-art models for expressive speech synthesis capabilities.

Conference Schedule at a Glance

Come check out NVIDIA’s papers at this year’s hybrid INTERSPEECH event. They cover a wide range of groundbreaking research in the field of conversational AI, including datasets, pre-trained models, and real-world applications for speech recognition and text-to-speech.

TUESDAY 8/31 · THURSDAY 9/2 · FRIDAY 9/3

  • Patrick K. O'Neill, Vitaly Lavrukhin, Somshubra Majumdar, Vahid Noroozi, Yuekai Zhang, Oleksii Kuchaiev, Jagadeesh Balam, Yuliya Dovzhenko, Keenan Freyberg, Michael D. Shulman, Boris Ginsburg, Shinji Watanabe, Georg Kucsko (11:00 a.m. - 01:00 p.m. CET)
  • Stanislav Beliaev, Boris Ginsburg (04:00 - 06:00 p.m. CET)
  • Gonçalo Mordido, Matthijs Van Keirsbilck, Alexander Keller (04:00 - 06:00 p.m. CET)
  • Yang Zhang, Evelina Bakhturina, Kyle Gorman, Boris Ginsburg (04:00 - 06:00 p.m. CET)
  • Yochai Yemini, Ethan Fetaya, Haggai Maron, Sharon Gannot (07:00 - 09:00 p.m. CET)
  • Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg, Yang Zhang (07:00 - 09:00 p.m. CET)

Get Started With Pre-trained Models

NVIDIA offers pre-trained models for speech recognition, language understanding, and speech synthesis through the NGC catalog. These models are highly accurate and have been trained on a variety of open and proprietary datasets for thousands of hours using GPUs. The NGC models are seamlessly integrated with SDKs such as NVIDIA NeMo for building, training, and fine-tuning conversational AI models.


Create Cutting-Edge Conversational AI Models

Explore NVIDIA NeMo, an open-source toolkit for researchers developing new state-of-the-art conversational AI models. It provides a collection of modules and models for automatic speech recognition, natural language processing, and text-to-speech. NeMo modules and models are highly interoperable with popular PyTorch and PyTorch Lightning frameworks, giving researchers exceptional flexibility.
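
As a concrete illustration, here is a minimal usage sketch, assuming a NeMo 1.x-era Python environment; the checkpoint name "QuartzNet15x5Base-En" and the audio path below are illustrative examples, not taken from this page:

    # Hedged sketch: load a pretrained NeMo ASR model and transcribe a file.
    import nemo.collections.asr as nemo_asr

    # Fetch a pretrained English CTC model (downloaded from the NGC catalog).
    asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
        model_name="QuartzNet15x5Base-En")

    # Transcribe a local 16 kHz mono WAV file (hypothetical path).
    print(asr_model.transcribe(["sample.wav"]))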

Develop Conversational AI Apps For Enterprise

NVIDIA offers Riva, a GPU-accelerated SDK to help enterprises develop multimodal conversational AI applications. It includes highly accurate pre-trained models in NGC, tools for fine-tuning these models on custom datasets, and optimized real-time speech and language skills for tasks like transcription and natural language understanding.

NVIDIA Developer Program

Get the advanced tools and training you need to successfully build applications on all NVIDIA technology platforms.

NVIDIA Deep Learning Institute (DLI)

With the NVIDIA Deep Learning Institute (DLI), developers, data scientists, researchers, and students can access hands-on training in AI, accelerated computing, and accelerated data science to advance their knowledge in topics like AI for speech processing.

Use code INTERSPEECH25 to receive 25% off the upcoming workshops:

Building Transformer-Based Natural Language Processing Applications September 23, 2021 at 9:00am-5:00pm PDT.

Building Conversational AI Applications November 24, 2021 at 9:00am-5:00pm CET

Unlock Your Startup’s Potential

NVIDIA Inception nurtures cutting-edge startups that are revolutionizing industries with artificial intelligence. Our acceleration platform offers go-to-market support, expertise, and technology—all tailored to a new business’s evolution.

LIKE NO PLACE YOU’VE EVER WORKED

At NVIDIA, you’ll solve some of the world’s hardest problems and discover never-before-seen ways to improve the quality of life for people everywhere. From healthcare to robots, self-driving cars to blockbuster movies—and a growing list of new opportunities every single day. Explore all of our open roles, including internships and new college graduate positions.

Learn more about our career opportunities by exploring current job openings as well as university jobs.

  • Product Security


Title: SUPERB: Speech Processing Universal PERformance Benchmark

Abstract: Self-supervised learning (SSL) has proven vital for advancing research in natural language processing (NLP) and computer vision (CV). The paradigm pretrains a shared model on large volumes of unlabeled data and achieves state-of-the-art (SOTA) for various tasks with minimal adaptation. However, the speech processing community lacks a similar setup to systematically explore the paradigm. To bridge this gap, we introduce Speech processing Universal PERformance Benchmark (SUPERB). SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks with minimal architecture changes and labeled data. Among multiple usages of the shared model, we especially focus on extracting the representation learned from SSL due to its preferable re-usability. We present a simple framework to solve SUPERB tasks by learning task-specialized lightweight prediction heads on top of the frozen shared model. Our results demonstrate that the framework is promising as SSL representations show competitive generalizability and accessibility across SUPERB tasks. We release SUPERB as a challenge with a leaderboard and a benchmark toolkit to fuel the research in representation learning and general speech processing.
Comments: To appear in Interspeech 2021
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as: arXiv:2105.01051 [cs.CL]
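As an illustration of the framework described in the abstract, here is a minimal sketch of the probing setup: a frozen shared encoder with a lightweight, task-specialized prediction head trained on top. The FrozenEncoder below is a toy stand-in (in practice it would be a pretrained SSL model such as wav2vec 2.0 or HuBERT); all names and dimensions are illustrative assumptions, not the benchmark's actual API.

```python
# Minimal sketch of a SUPERB-style setup: frozen SSL encoder + lightweight head.
import torch
import torch.nn as nn

class FrozenEncoder(nn.Module):
    """Toy stand-in for a pretrained SSL speech encoder (hypothetical)."""
    def __init__(self, dim=768):
        super().__init__()
        self.proj = nn.Linear(1, dim)          # toy featurizer over raw samples

    def forward(self, wav):                    # wav: (batch, time)
        return self.proj(wav.unsqueeze(-1))    # (batch, time, dim)

encoder = FrozenEncoder()
for p in encoder.parameters():                 # freeze the shared model
    p.requires_grad = False

num_classes = 10                               # e.g. keyword-spotting labels
head = nn.Linear(768, num_classes)             # lightweight prediction head
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

wav = torch.randn(4, 16000)                    # dummy 1-second batch at 16 kHz
labels = torch.randint(0, num_classes, (4,))

feats = encoder(wav).mean(dim=1)               # mean-pool frames to one vector
loss = nn.functional.cross_entropy(head(feats), labels)
loss.backward()                                # gradients flow only into the head
opt.step()
```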


Microsoft at INTERSPEECH 2021

Location: Brno, Czech Republic & Virtual

Website: INTERSPEECH 2021

Microsoft is proud to be a diamond sponsor of INTERSPEECH 2021, the world’s largest and most comprehensive conference on the science and technology of spoken language processing. Microsoft attendees will be presenting 32 papers, one workshop, one special session, and two challenges during this event.



Dr. Tatiana Likhomanenko

Research scientist and software developer. Semi-supervised and unsupervised learning, speech recognition. Gravitating to core ML, video processing, and private federated learning.


Industry and Research Experience

  • Apple , Staff Research Scientist (Oct 2023 - present)
  • Apple , Senior Research Scientist (Sep 2021 - Oct 2023)
  • Fundamental AI Research , Postdoctoral Researcher (Aug 2019 - Aug 2021). Speech recognition and natural language processing for speech. Advisors : Ronan Collobert , Gabriel Synnaeve
  • Fundamental AI Research , AI Resident (Sep 2018 - Aug 2019). Speech recognition and natural language processing for speech. Advisors : Ronan Collobert , Gabriel Synnaeve
  • NTechLab , Machine Learning Expert (Aug 2017 - Sep 2018). Face recognition and facial attribute prediction with deep learning on a top-1 face recognition team
  • Yandex & CERN , Researcher (Apr 2013 - May 2017). Machine learning for High Energy Physics studies at the Large Hadron Collider: particle identification system, trigger system (online identification of which collisions are worth storing), searches for specific rare decays (high-level data analysis), and B meson oscillations (the main subject of the LHCb studies)
  • Membership at Large Hadron Collider beauty (LHCb) collaboration, CERN (2013 - 2018)
Education

  • Ph.D. in Computer Science , Lomonosov Moscow State University (2017). Faculty of Computational Mathematics and Cybernetics. Advisor : Eugene Moiseev. Thesis : Research on solutions of non-classical boundary-value problems for mixed type equations
  • M.S. in Computer Science , Yandex School of Data Analysis , 5.0/5.0 (2014)
  • M.S. in Computer Science , Lomonosov Moscow State University, 5.0/5.0 (2013) Faculty of Computational Mathematics and Cybernetics
  • Summer School on Bayesian Methods in Deep Learning (2017)
  • Rome-Moscow School of Matrix Methods and Applied Linear Algebra (2012, 2013)
Open Source Software

  • mlx-data : framework-agnostic data loading library brought to you by Apple machine learning research; it works with PyTorch, Jax, or MLX
  • Flashlight : a fast, flexible machine learning library written entirely in C++ blog post
  • Wav2letter++ : speech recognition toolkit and recipes for papers
  • BDT reweighter tutorial
  • HepML : specific machine learning tools for purposes of high energy physics
  • REP : ipython-based environment for conducting data-driven research in a consistent and reproducible way
Invited Talks

  • Private Federated Learning for Speech Recognition , Apple Workshop on Privacy-Preserving Machine Learning , Cupertino (2024)
  • Simple and Efficient Self-Training Approaches for Speech Recognition , Third Workshop on Efficient Natural Language and Speech Processing (ENLSP-III), NeurIPS, New Orleans (2023)
  • Simple and Efficient Pseudo-Labeling for Speech Recognition , On-Device Workshop MLSys, Miami (2023)
  • Machine Learning at Apple , WiML@ICML, Baltimore (2022)
  • CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings , ReWork Deep Learning Summit, San Francisco (2022)
  • Positional Embedding in Transformer-based Models , Higher School of Economics (2021)
  • slimIPL: Language-Model-Free Iterative Pseudo-Labeling , NTR Lab and Tomsk University (2021, in Russian)
  • Pseudo-labeling for speech recognition , NTR Lab and Tomsk University (2021, in Russian)
  • Machine learning in Science and Industry , Heidelberg University (2017)
  • LHCb topological trigger optimization , Data&Science: Large Hadron Collider, public series, Yandex, Moscow (2016)
  • Classifier output calibration to probability , Heavy Flavour Data Mining workshop, Zurich University (2016)
  • Machine Learning and Optimization of LHC Real-Time Event Stream Filter for New Physics Discoveries , Machine Learning: Prospects and Applications Conference, Berlin (2015)

Private Federated Learning

  • Pelikan*, M., Azam, S.S., Feldman, V., Silovsky, J., Talwar, K. and Likhomanenko*, T. Federated Learning with Differential Privacy for End-to-End Speech Recognition, 2023. arXiv preprint arXiv:2310.00098.
  • Azam*, S.S., Pelikan*, M., Feldman, V., Talwar, K., Silovsky, J. and Likhomanenko*, T. Federated Learning for Speech Recognition: Revisiting Current Trends Towards Large-Scale ASR. In International Workshop on Federated Learning in the Age of Foundation Models in Conjunction with NeurIPS 2023. Oral overview , video , slides , poster
  • Azam, S.S., Likhomanenko, T., Pelikan, M. and Silovsky, J. Importance of Smoothness Induced by Optimizers in FL4ASR: Towards Understanding Federated Learning for End-to-End ASR , ASRU 2023.

Machine Learning

  • Busbridge*, D., Ramapuram*, J., Ablin*, P., Likhomanenko*, T., Dhekane, E.G., Suau, X. and Webb, R. How to Scale Your EMA . Thirty-Seventh Conference on Neural Information Processing Systems (NeurIPS), 2023. Spotlight . overview , video , slides , poster
  • Zhai*, S., Likhomanenko*, T., Littwin*, E., Busbridge*, D., Ramapuram*, J., Zhang, Y., Gu, J. and Susskind, J. Stabilizing Transformer Training by Preventing Attention Entropy Collapse. In International Conference on Machine Learning (ICML), 2023. overview , video , poster , code
  • Gheini, M., Likhomanenko, T., Sperber, M. and Setiawan, H. Joint Speech Transcription and Translation: Pseudo-Labeling with Out-of-Distribution Data. ACL Findings, 2023. overview
  • Zhai, S., Jaitly, N., Ramapuram, J., Busbridge, D., Likhomanenko, T., Cheng, J.Y., Talbott, W., Huang, C., Goh, H. and Susskind, J.M. Position Prediction as an Effective Pretraining Strategy . In International Conference on Machine Learning (ICML), 2022, pp. 26010-26027. PMLR. (Spotlight) overview , video , poster
  • Kahn, J.D., Pratap, V., Likhomanenko, T., Xu, Q., Hannun, A., Cai, J., Tomasello, P., Lee, A., Grave, E., Avidov, G., Steiner, B., Liptchinsky, V., Synnaeve, G., Collobert, R. Flashlight: Enabling Innovation in Tools for Machine Learning . In International Conference on Machine Learning (ICML), 2022, pp. 10557-10574. PMLR. (Spotlight) video , presentation , poster , code
  • Likhomanenko, T., Xu, Q., Synnaeve, G., Collobert, R. and Rogozhnikov, A. CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings . Thirty-Fifth Conference on Neural Information Processing Systems (NeurIPS), 2021. openreview , video , presentation , code
  • Rogozhnikov, A., Likhomanenko, T. InfiniteBoost: building infinite ensembles with gradient descent . arXiv preprint arXiv:1706.01109. 2017.
  • Garg, S., Gheini, M., Emmanuel, C., Likhomanenko, T., Gao, Q. and Paulik, M. Generating Gender Alternatives in Machine Translation. 5th Workshop on Gender Bias in Natural Language Processing at ACL 2024.

Speech Processing

  • Bai, H., Likhomanenko, T., Zhang, R., Gu, Z., Aldeneh, Z. and Jaitly, N., 2024. dMel: Speech Tokenization made Simple. arXiv preprint arXiv:2407.15835. Under review.
  • Gu, Z., Likhomanenko, T., Bai, H., McDermott, E., Collobert, R. and Jaitly, N., 2024. Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition. arXiv preprint arXiv:2405.15216. Under review.
  • Aldeneh, Z., Higuchi, T., Jung, J.W., Seto, S., Likhomanenko, T., Shum, S., Abdelaziz, A.H., Watanabe, S. and Theobald, B.J. Can you Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features? Interspeech 2024.
  • Rouditchenko, A., Collobert, R. and Likhomanenko, T., AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition . AVGenL: Audio-Visual Generation and Learning Workshop at ECCV 2024.
  • Likhomanenko, T., Lugosch, L. and Collobert, R. Unsupervised ASR via Cross-Lingual Pseudo-Labeling , 2023. arXiv preprint arXiv:2305.13330.
  • Berrebbi, D., Collobert, R., Jaitly, N., Likhomanenko, T. More Speaking or More Speakers? ICASSP 2023. overview
  • Berrebbi, D., Collobert, R., Bengio, S., Jaitly, N., Likhomanenko, T. Continuous Pseudo-Labeling from the Start . ICLR 2023. overview , video , slides , poster
  • Likhomanenko, T., Collobert, R., Jaitly, N., Bengio, S. Continuous Soft Pseudo-Labeling in ASR . I Can’t Believe It’s Not Better Workshop at NeurIPS 2022. video , poster
  • Lugosch, L., Likhomanenko, T., Synnaeve, G. and Collobert, R. Pseudo-Labeling for Massively Multilingual Speech Recognition . ICASSP 2022. blog post , code
  • Pratap, V., Xu, Q., Likhomanenko, T., Synnaeve, G. and Collobert, R. Word Order Does Not Matter For Speech Recognition . ICASSP 2022.
  • Manohar, V., Likhomanenko, T., Xu, Q., Hsu, W.N., Collobert, R., Saraf, Y., Zweig, G. and Mohamed, A., 2021. Kaizen: Continuously improving teacher using Exponential Moving Average for semi-supervised speech recognition . ASRU 2021.
  • Likhomanenko, T., Xu, Q., Kahn, J., Synnaeve, G. and Collobert, R. slimIPL: Language-model-free iterative pseudo-labeling . Interspeech 2021. video , poster , code
  • Likhomanenko*, T., Xu*, Q., Pratap*, V., Tomasello, P., Kahn, J., Avidov, G., Collobert, R. and Synnaeve, G. Rethinking evaluation in asr: Are our models robust enough? Interspeech 2021. video , poster , code
  • Hsu, W.N., Sriram, A., Baevski, A., Likhomanenko, T., Xu, Q., Pratap, V., Kahn, J., Lee, A., Collobert, R., Synnaeve, G. and Auli, M., 2021. Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training . Interspeech 2021.
  • Xu, Q., Baevski, A., Likhomanenko, T., Tomasello, P., Conneau, A., Collobert, R., Synnaeve, G. and Auli, M., 2021, June. Self-training and pre-training are complementary for speech recognition . In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 3030-3034). IEEE. video
  • Talnikar, C., Likhomanenko, T., Collobert, R. and Synnaeve, G., 2021, June. Joint masked cpc and ctc training for asr . In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 3045-3049). IEEE. video , poster , presentation
  • Xu, Q., Likhomanenko, T., Kahn, J., Hannun, A., Synnaeve, G. and Collobert, R., 2020. Iterative Pseudo-Labeling for Speech Recognition . Proc. Interspeech 2020, pp.1006-1010. video , code
  • Pratap, V., Xu, Q., Kahn, J., Avidov, G., Likhomanenko, T., Hannun, A., Liptchinsky, V., Synnaeve, G., Collobert, R. (2020) Scaling Up Online Speech Recognition Using ConvNets . Proc. Interspeech 2020, 3376-3380. video , blog post , news
  • Kahn, J., Rivière, M., Zheng, W., Kharitonov, E., Xu, Q., Mazaré, P.E., Karadayi, J., Liptchinsky, V., Collobert, R., Fuegen, C. and Likhomanenko, T., 2020, May. Libri-light: A benchmark for asr with limited or no supervision . In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7669-7673). IEEE. presentation , blog post , code
  • Synnaeve*, G., Xu*, Q., Kahn*, J., Likhomanenko*, T., Grave*, E., Pratap, V., Sriram, A., Liptchinsky, V. and Collobert, R. End-to-end asr: from supervised to semi-supervised learning with modern architectures . SAS Workshop ICML 2020. video , code
  • Likhomanenko, T., Synnaeve, G. and Collobert, R., 2019. Who Needs Words? Lexicon-Free Speech Recognition . Proc. Interspeech 2019, pp.3915-3919. presentation , blog post , code

Machine Learning in High Energy Physics

  • Derkach, D., Hushchyn, M., Likhomanenko, T., Rogozhnikov, A., Kazeev, N., Chekalina, V., Neychev, R., Kirillov, S., Ratnikov, F. and LHCb collaboration. Machine-Learning-based global particle identification algorithms at the LHCb experiment . Journal of Physics: Conference Series. 2018. Vol. 1085. No. 4. P. 1-5. ACAT 2017 , poster
  • Likhomanenko, T., Derkach, D., Rogozhnikov, A. Inclusive Flavour Tagging Algorithm. Journal of Physics: Conference Series, 2016. ACAT 2016 , poster , code
  • LHCb collaboration (2016). Search for decays of neutral beauty mesons into four muons, JHEP 03 (2017) 001.
  • Likhomanenko, T., Ilten, P., Khairullin, E., Rogozhnikov, A., Ustyuzhanin, A., Williams, M. LHCb Topological Trigger Reoptimization . Journal of Physics: Conference Series, 2015. CHEP 2015 , presentation , code
  • CMS collaboration, LHCb collaboration. Observation of the rare Bs0→ μ+ μ− decay from the combined analysis of CMS and LHCb data . Nature, 2015.
  • Likhomanenko, T., Rogozhnikov, A., Baranov, A., Khairullin, E., & Ustyuzhanin, A. Reproducible Experiment Platform . Journal of Physics: Conference Series (Vol. 664, No. 5, p. 052022). CHEP 2015 , poster
  • LHCb collaboration. Search for the lepton flavour violating decay τ−→ μ− μ+ μ− . Journal of High Energy Physics, 2015.
  • Likhomanenko, T., Rogozhnikov, A., Baranov, A., Khairullin, E., Ustyuzhanin, A. Improving reproducibility of data science experiments , ICML 2015 AutoML Workshop, 2015 poster spotlight

Partial Differential Equations (Ph.D.)

  • Moiseev, E.I., Likhomanenko, T.N. Eigenfunctions of the Gellerstedt problem with an inclined-type change line . Integral Transforms and Special Functions, 2017, pp. 1–8.
  • Moiseev E. I., Likhomanenko T. N. On the basis property of a two-part trigonometric series . Doklady Mathematics, 2016, Vol. 94, No. 1, pp. 1–4. oral talk, International scientific conference Actual Problems in Theory of Partial Differential Equations, dedicated to the centenary of Andrey V. Bitsadze, 2016
  • Moiseev, E.I., Likhomanenko, T.N. Eigenfunctions of the Tricomi problem with an inclined type change line . Differential Equations, 2016, Vol. 52, No. 10, pp 1323– 1330. oral talk, International scientific conference Actual Problems in Theory of Partial Differential Equations, dedicated to the centenary of Andrey V. Bitsadze, 2016
  • Moiseev, E.I., Likhomanenko, T.N. On the basis property of a trigonometric system arising in the Frankl problem . Differential Equations, 2013, Vol. 49, No. 3, pp. 325–331. oral talk, AMEE-2013 and Lomonosov-2013
  • Moiseev E.I., Likhomanenko T.N. A nonlocal boundary value problem for the Lavrent’ev-Bitsadze equation . Doklady Mathematics, 2012, Vol. 86, No. 2, pp. 635–637. oral talk, AMEE-2012 and Lomonosov-2012
Teaching

  • DeepLearn Autumn School , Self-, Weakly-, Semi-Supervised Learning in Speech Recognition (Oct 2022)
  • Heidelberg University, Grad Days , Machine learning in Science and Industry , invited lecturer (2017) lectures
  • Imperial College London, Introduction to Machine Learning , TA (2016, 2017) lectures/seminars 2016 , lectures/seminars 2017
  • Yandex School of Data Analysis, Machine learning in High Energy Physics , lecturer (2016)
  • Lund University, Summer School on Machine Learning in High Energy Physics (MLHEP) , program committee & lecturer (2016) lectures/seminars
  • Saint Petersburg Academic University, Summer School on Machine Learning in High Energy Physics (MLHEP) , organizing committee & lecturer (2015) lectures/seminars

Serving as Reviewer

  • Transactions on Machine Learning Research (TMLR) ( Expert Reviewer )
  • Journal of Artificial Intelligence Research
  • NeurIPS 2021, 2022 ( top-8% reviewer ), 2023 ( top-8% reviewer )
  • ICLR 2021, 2022 ( highlighted reviewer ), 2023, 2024
  • ICLR Blogposts 2023, 2024
  • ICML 2022, 2023
  • Interspeech 2020, 2021, 2022, 2023 (top-2% reviewer), 2024
  • ICASSP 2021, 2022, 2023 ( outstanding reviewer ), 2024
  • Machine Learning and the Physical Sciences workshop NeurIPS 2019, 2020, 2022, 2023, 2024
  • SynS and ML Workshop ICML 2023
  • Vision-based InduStrial InspectiON (VISION) Workshop CVPR 2023
  • CHIME 2023, 2024
  • BayLearn 2022, 2023, 2024
  • An advisor in the LHCb statistics and machine learning working group (2016-2017)

Serving as Area Chair

  • NeurIPS 2024
  • NeurIPS Datasets and Benchmarks 2023, 2024
  • Vision-based InduStrial InspectiON (VISION) Workshop ECCV 2024
Mentorship & Workshop Organization

  • WiML, Research Mentorship, NeurIPS, New Orleans (2023)
  • LatinX in AI, Mentorship Hour (Panel), ICML, Honolulu (2023)
  • LatinX in AI, CV Research workshop, CVPR, New Orleans (2022)
  • Failure Modes in the Age of Foundation Models , workshop "I Can’t Believe It’s Not Better (ICBINB): Failure Modes in the Age of Foundation Models", NeurIPS, New Orleans (2023)
  • Mentorship Hour, LatinX in AI, ICML, Honolulu (2023)
  • On-Device Workshop MLSys, Miami (2023)
  • 1st workshop and challenge on Vision-based InduStrial InspectiON , CVPR 2023
  • 2nd workshop on Vision-based InduStrial InspectiON , ECCV 2024

Kaggle Competition "Flavours of Physics"

  • research/technical support
  • award committee member
  • co-organizer of ALEPH workshop at NeurIPS 2015
  • starter-kit for competition
Interns and Residents

  • Akshita Gupta, summer internship (co-advising with Navdeep Jaitly, Richard Bai, ...), Apple, 2024
  • Zijin Gu , AI/ML Resident, Apple 2023-2024 (co-advising with Navdeep Jaitly)
  • Andrew Rouditchenko , summer internship, Apple, 2023
  • Lingxiao Zhao , summer internship, Apple, 2023 (co-advising)
  • Chun-wei Ho , summer internship, Apple, 2023 (co-advising with Navdeep Jaitly and Ronan Collobert)
  • Sheikh Shams Azam , AI/ML Resident, Apple 2022-2023 (co-advising with Honza Silovsky)
  • Dan Berrebbi , summer internship, Apple, 2022
  • Mozhdeh Gheini , summer internship, Apple, 2022 (co-advising with Matthias Sperber and Hendra Setiawan); Apple, 2023
  • Colby Bunbary , summer internship, Apple, 2022 (co-advising)
  • Loren Lugosch , summer internship, Facebook AI Research, 2021 (co-advising with Ronan Collobert and Gabriel Synnaeve); summer internship, Apple (co-advising with Ronan Collobert), 2022
  • Chaitanya Talnikar , AI Residency 2019-2020 (co-advising with Ronan Collobert and Gabriel Synnaeve)
Press & Interviews

  • Interview for Republic (in Russian)
  • Q&A with AI Residents
  • About paper "Rethinking Evaluation in ASR: Are Our Models Robust Enough?"
  • About kaggle challenge "Flavours of physics"
  • About paper "LHCb Topological Trigger Reoptimization"
Awards

  • Winner of the Accelerate Your Code international competition, Intel (2012)
  • Best student of Computer Science faculty, Lomonosov Moscow State University (2012)
  • Winner (regional stage) of the All-Russian Programming Contest (2007, 2008)

Special Sessions & Challenges

The Organizing Committee of INTERSPEECH 2021 proudly announces the following special sessions and challenges.

Special sessions and challenges focus on relevant ‘special’ topics which may not be covered in regular conference sessions.

Papers have to be submitted following the same schedule and procedure as regular papers; the papers undergo the same review process by anonymous and independent reviewers.

Speech Recognition of Atypical Speech

While speech recognition systems generally work well on the average population with typical speech characteristics, performance on subgroups with unique speaking patterns is usually significantly worse.

Speech that contains non-standard speech patterns (acoustic-phonetic, phonotactic, lexical, and prosodic patterns) is particularly challenging, both because of the small population with these speech patterns and because of the generally higher variance of speech patterns. In the case of dysarthric speech, which is often correlated with mobility or other accessibility limitations, the accuracy of existing speech recognition systems is often particularly poor, rendering the technology unusable for many speakers who could benefit the most.

In this oral session, we seek to promote interdisciplinary collaborations between researchers and practitioners addressing this problem, to build community and stimulate research. We invite papers analyzing and improving systems dealing with atypical speech.

Topics of interest include, but are not limited to:

  • Automatic Speech Recognition (ASR) of atypical speech
  • Speech-to-Speech conversion/normalization (e.g. from atypical to typical)
  • Voice enhancement and convergence to improve intelligibility of spoken content of atypical speech
  • Automated classification of atypical speech conditions
  • Robustness of speech processing systems for atypical speech in common application scenarios
  • Data augmentation techniques to deal with data sparsity
  • Aspects of creating, managing data quality, and sharing of data sets of atypical speech
  • Multi-modal integration (e.g. video and voice) and its application
  • https://sites.google.com/view/atypicalspeech-interspeech2021
  • Jordan R. Green, MGH Institute of Health Professions, Harvard University
  • Michael P. Brenner, Harvard University, Google
  • Fadi Biadsy, Google
  • Bob MacDonald, Google
  • Katrin Tomanek, Google

Oriental Language Recognition

Oriental languages are rich and complex. With great diversity in terms of both acoustics and linguistics, oriental languages are a treasure for multilingual research. The Oriental Language Recognition (OLR) challenge has been conducted for 5 years with great success and has demonstrated many novel and interesting techniques devised by the participants.

The main goal of this special session is to summarize the technical advances of OLR 2020, but it welcomes all submissions related to language recognition and multilingual speech processing.

  • http://cslt.riit.tsinghua.edu.cn/mediawiki/index.php/Interspeech_2021_Special_Session
  • Dong Wang (Tsinghua University)
  • Qingyang Hong (Xiamen University)
  • Xiaolei Zhang (Northwestern Polytechnical University)
  • Ming Li (Duke Kunshan University)
  • Yufeng Hao (Speechocean)

Far-field Multi-Channel Speech Enhancement Challenge for Video Conferencing (ConferencingSpeech 2021)

The ConferencingSpeech 2021 challenge is proposed to stimulate research in multi-channel speech enhancement, aiming to process far-field speech captured by microphone arrays in video conferencing rooms. Targeting real video conferencing room applications, the ConferencingSpeech 2021 challenge database is recorded from real speakers. The number of speakers and the distances between speakers and microphone arrays vary with the sizes of the meeting rooms. Multiple microphone arrays with three different types of geometric topology are placed in each recording environment.

The challenge will have two tasks:

  • Task 1 is multi-channel speech enhancement with a single microphone array, focusing on practical applications with a real-time requirement.
  • Task 2 is multi-channel speech enhancement with multiple distributed microphone arrays; this non-real-time track has no constraints, so participants can explore any algorithms to obtain high speech quality.

To focus on the development of algorithms, the challenge requires a closed training condition: only the provided lists of open-source clean speech datasets and the noise dataset may be used for training. In addition, the challenge provides a development set, scripts for simulating the training data, and baseline systems for participants to develop their systems. The final ranking of the challenge will be decided by subjective evaluation, performed using Absolute Category Ratings (ACR) to estimate a Mean Opinion Score (MOS) through the Tencent Online Media Subjective Evaluation platform.
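As a toy illustration of the ACR-to-MOS step (illustrative numbers only, not challenge data): each listener rates a clip on a 1-5 scale, and the MOS is simply the mean rating.

```python
# Toy ACR -> MOS computation: MOS is the mean of listener ratings (1..5).
ratings = [4, 5, 3, 4, 4, 5, 3]        # Absolute Category Ratings from listeners
mos = sum(ratings) / len(ratings)
print(f"MOS = {mos:.2f}")              # MOS = 4.00
```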

More details about the data and challenge can be found from the evaluation plan of ConferencingSpeech 2021 challenge.

Besides papers related to the ConferencingSpeech 2021 challenge, all papers on multi-channel speech enhancement are encouraged for submission to this special session.

  • https://tea-lab.qq.com/conferencingspeech-2021
  • Wei Rao, Tencent Ethereal Audio Lab, China
  • Lei Xie, Northwestern Polytechnical University, China
  • Yannan Wang, Tencent Ethereal Audio Lab, China
  • Tao Yu, Tencent Ethereal Audio Lab, USA
  • Shinji Watanabe, Associate Professor, Carnegie Mellon University / Johns Hopkins University, USA
  • Zheng-Hua Tan, Aalborg University, Denmark
  • Hui Bu, AISHELL foundation, China
  • Shidong Shang, Tencent Ethereal Audio Lab, China

Voice quality characterization for clinical voice assessment: Voice production, acoustics, and auditory perception

The appraisal of voice quality is relevant to the clinical care of disordered voices. It contributes to the selection and optimization of clinical treatment as well as to the assessment of the outcome of the treatment. Levels of description of voice quality include the biomechanics of the vocal folds and their kinematics, temporal and spectral acoustic features, as well as the auditory scoring of hoarseness, hyper- and hypo-functionality, creakiness, diplophonia, harshness, etc. Broad and fuzzy definitions of terms regarding voice quality are in use, which impede scientific and clinical communication.

The aim of the special session is to contribute to the improvement of the clinical assessment of voice quality via a translational approach that focuses on quantifying and explaining relationships between several levels of description. The objective is to gather new insights, advances in knowledge, and practical tools to assist researchers and clinicians in obtaining effective descriptions of voice quality and reliable measures of its acoustic correlates. Topics of interest include, but are not limited to, (i) the statistical analysis and automatic classification, possibly relying on state-of-the-art machine learning approaches, of distinct types of voice quality via non-obtrusively recorded features, (ii) the analysis and simulation of vocal fold vibrations by means of analytical, kinematic, or mechanical modelling, (iii) the interpretation and modeling of acoustic emission and/or high-speed video recordings such as videolaryngoscopy and videokymography, and (iv) the synthesis of disordered voices jointly with auditory experimentation involving synthetic and natural disordered voice stimuli.

Automatic Speech Recognition in Air Traffic Management (ASR-ATM)

Air-traffic management (ATM) is a dedicated domain where, in addition to the voice signal, other contextual information (e.g., air traffic surveillance data, meteorological data) plays an important role. Automatic speech recognition is the first challenge in the whole chain: further processing usually requires transforming the recognized word sequence into a conceptual form, which is the more important application in ATM. This also means that the usual metrics for evaluating ASR systems (e.g., word error rate) are less important, and other performance criteria are employed, either objective (command recognition error rate, callsign detection accuracy, overall algorithmic delay, real-time factor, or reduced flight times) or subjective (decrease of the users' workload).
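As a toy illustration of one such objective criterion, callsign detection accuracy can be computed as the fraction of reference callsigns that are correctly detected. The callsigns and outputs below are hypothetical, not from any real ATM system.

```python
# Toy callsign detection accuracy: fraction of correctly detected callsigns.
ref = ["DLH123", "AFR456", "RYR789", "BAW001"]   # ground-truth callsigns
hyp = ["DLH123", "AFR456", "RYR788", "BAW001"]   # system-detected callsigns

accuracy = sum(r == h for r, h in zip(ref, hyp)) / len(ref)
print(f"callsign detection accuracy = {accuracy:.2f}")  # 0.75
```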

The main objective of the special session is to bring together ATM players (both academic and industrial) interested in ASR and ASR researchers looking for new challenges. This can accelerate near-future R&D plans to enable the integration of speech technologies into the challenging but highly safety-oriented air-traffic management domain.

  • https://www.haawaii.de/wp/interspeech-2021-agenda-for-special-session-on-automatic-speech-recognition-in-air-traffic-management-is-now-online/
  • Hartmut Helmke (DLR)
  • Pavel Kolcarek (Honeywell)
  • Petr Motlicek (Idiap Research Institute)

Alzheimer's Dementia Recognition through Spontaneous Speech: The ADReSS Challenge

Dementia is a category of neurodegenerative diseases that entails a long-term and usually gradual decrease of cognitive functioning. The main risk factor for dementia is age, and therefore its greatest incidence is amongst the elderly. Due to the severity of the situation worldwide, institutions and researchers are investing considerably in dementia prevention and early detection, focusing on disease progression. There is a need for cost-effective and scalable methods for detection of dementia from its most subtle forms, such as the preclinical stage of Subjective Memory Loss (SML), to more severe conditions like Mild Cognitive Impairment (MCI) and Alzheimer's Dementia (AD) itself.

The ADReSSo Challenge (ADReSS, speech only) targets a difficult automatic prediction problem of societal and medical relevance, namely the detection of Alzheimer's Dementia (AD). The challenge builds on the success of the ADReSS Challenge (Luz et al., 2020), the first such shared-task event focused on AD, which attracted 34 teams from across the world. While a number of researchers have proposed speech processing and natural language processing approaches to AD recognition through speech, their studies have used different, often unbalanced and acoustically varied data sets, consequently hindering reproducibility and comparability of approaches. The ADReSSo Challenge will provide a forum for those different research groups to test their existing methods (or develop novel approaches) on a new shared standardized dataset. The approaches that performed best on the original ADReSS dataset employed features extracted from manual transcripts, which were provided. The ADReSSo challenge provides a more challenging and improved spontaneous speech dataset and requires the creation of models straight from speech, without manual transcription. In keeping with the objectives of AD prediction evaluation, the ADReSSo challenge's dataset will be statistically balanced so as to mitigate common biases often overlooked in evaluations of AD detection methods, including repeated occurrences of speech from the same participant (common in longitudinal datasets), variations in audio quality, and imbalances of gender and age distribution. This task focuses on AD recognition using spontaneous speech, which marks a departure from neuropsychological and clinical evaluation approaches. Spontaneous speech analysis has the potential to enable novel applications for speech technology in longitudinal, unobtrusive monitoring of cognitive health, in line with the theme of this year's INTERSPEECH, "Speech Everywhere!".

Important Dates

  • January 18, 2021 : ADReSSo Challenge announced.
  • March 20, 2021 : Model submission deadline.
  • March 26, 2021 : Paper submission deadline.
  • April 2, 2021 : Paper update deadline.
  • June 2, 2021 : Paper acceptance/rejection notification.
  • August 31 - September 3, 2021 : INTERSPEECH 2021.
  • https://edin.ac/3p1cyaI
  • Saturnino Luz, Usher Institute, University of Edinburgh
  • Fasih Haider, University of Edinburgh
  • Sofia de la Fuente, University of Edinburgh
  • Davida Fromm, Carnegie Mellon University
  • Brian MacWhinney, Carnegie Mellon University

SdSV Challenge 2021: Analysis and Exploration of New Ideas on Short-Duration Speaker Verification

Are you searching for new challenges in speaker recognition? Join the SdSV Challenge 2021, which focuses on the analysis and exploration of new ideas for short-duration speaker verification.

Following the success of the SdSV Challenge 2020, the SdSV Challenge 2021 focuses on systematic benchmarking and analysis of varying degrees of phonetic variability in short-duration speaker recognition. The challenge consists of two tasks.

  • Task 1 is defined as speaker verification in text-dependent mode where the lexical content (in both English and Persian) of the test utterances is also taken into consideration.
  • Task 2 is defined as speaker verification in text-independent mode with same- and cross-language trials.

The main purpose of this challenge is to encourage participants to build single but competitive systems, to perform analysis, and to explore new ideas, such as multi-task learning, unsupervised/self-supervised learning, single-shot learning, disentangled representation learning, and so on, for short-duration speaker verification. The participating teams will get access to a train set and a test set drawn from the DeepMine corpus, the largest public corpus designed for short-duration speaker verification, with voice recordings of 1800 speakers. The challenge leaderboard is hosted at CodaLab.
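For illustration, a verification trial is often scored as the cosine similarity between enrollment and test speaker embeddings, compared against a threshold. The sketch below uses random vectors in place of embeddings from a trained speaker encoder and is not the challenge's prescribed recipe.

```python
# Toy verification-trial scoring: cosine similarity against a threshold.
import torch
import torch.nn.functional as F

dim = 256
enroll = F.normalize(torch.randn(dim), dim=0)   # enrollment-speaker embedding
test = F.normalize(torch.randn(dim), dim=0)     # test-utterance embedding

score = torch.dot(enroll, test).item()          # cosine similarity in [-1, 1]
threshold = 0.5                                 # would be tuned on a dev set
print("accept" if score >= threshold else "reject", round(score, 3))
```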

  • For more information visit: https://sdsvc.github.io/
  • Evaluation plan:
  • Contact: [email protected]
  • Hossein Zeinali (Amirkabir University of Technology, Iran)
  • Kong Aik Lee (I2R, A*STAR, Singapore)
  • Jahangir Alam (CRIM, Canada)
  • Lukáš Burget (Brno University of Technology, Czech Republic)

Acoustic Echo Cancellation (AEC) Challenge

The INTERSPEECH 2021 Acoustic Echo Cancellation (AEC) challenge is designed to stimulate research in the AEC domain by open sourcing a large training dataset, a test set, and a subjective evaluation framework. We provide two new open source datasets for training AEC models. The first is a real dataset captured using a large-scale crowdsourcing effort; it consists of real recordings collected from over 5,000 diverse audio devices and environments. The second is a synthetic dataset with added room impulse responses and background noise derived from the INTERSPEECH 2020 DNS Challenge. An initial test set will be released for researchers to use during development, and a blind test set near the end will be used to decide the final competition winners. We believe these datasets are large enough to facilitate deep learning and representative enough for practical usage in shipping telecommunication products.
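For intuition, a synthetic AEC training sample is commonly constructed by convolving the far-end signal with a room impulse response to form the echo, then mixing in near-end speech and background noise at the microphone. The sketch below is an assumed generic recipe with toy signals, not the challenge's exact simulation pipeline.

```python
# Toy synthetic AEC sample: mic = near-end speech + (far-end * RIR) + noise.
import numpy as np

sr = 16000
far_end = np.random.randn(sr)            # loudspeaker (far-end) signal
near_end = np.random.randn(sr) * 0.5     # local talker
noise = np.random.randn(sr) * 0.05       # background noise
rir = np.exp(-np.linspace(0, 8, 2048)) * np.random.randn(2048)  # toy RIR

echo = np.convolve(far_end, rir)[: len(near_end)]
mic = near_end + echo + noise            # what the AEC model observes
# Training pairs: inputs (mic, far_end) -> target near_end.
```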

The dataset and rules are available here.

Please feel free to reach out to us, if you have any questions or need clarification about any aspect of the challenge.

  • https://aka.ms/aec-challenge
  • Ross Cutler, Microsoft Corp, USA
  • Ando Saabas, Microsoft Corp, Tallinn
  • Tanel Parnamaa, Microsoft Corp, Tallinn
  • Markus Loide, Microsoft Corp, Tallinn
  • Sten Sootla, Microsoft Corp, Tallinn
  • Hannes Gamper, Microsoft Corp, USA
  • Sebastian Braun, Microsoft Corp, USA
  • Karsten Sorensen, Microsoft Corp, USA
  • Robert Aichner, Microsoft Corp, USA
  • Sriram Srinivasan, Microsoft Corp, USA

Non-Autoregressive Sequential Modeling for Speech Processing

Non-autoregressive modeling is a recently emerged direction in speech processing research. One advantage of non-autoregressive models is their decoding speed: decoding consists only of forward propagation through a neural network, so complicated left-to-right beam search is not necessary. In addition, they do not assume a left-to-right generation order and thus represent a paradigm shift in speech processing, where left-to-right, autoregressive models have long been taken for granted. This special session aims to facilitate knowledge sharing between researchers involved in non-autoregressive modeling across various speech processing fields, including, but not limited to, automatic speech recognition, speech translation, and text-to-speech, via panel discussions with leading researchers followed by a poster session.
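To make the decoding-speed point concrete, here is a minimal sketch of greedy CTC decoding, one common non-autoregressive scheme: a single forward pass yields per-frame logits, and decoding just collapses repeats and removes blanks, with no beam search. The logits here are random stand-ins for a network's output.

```python
# Greedy CTC decoding: one parallel forward pass, then collapse + de-blank.
import torch

BLANK = 0
vocab = ["<b>", "a", "b", "c"]

logits = torch.randn(1, 12, len(vocab))       # (batch, frames, vocab), one pass
frame_ids = logits.argmax(dim=-1)[0].tolist() # best symbol per frame, in parallel

decoded, prev = [], None
for i in frame_ids:
    if i != prev and i != BLANK:              # collapse repeats, drop blanks
        decoded.append(vocab[i])
    prev = i
print("".join(decoded))
```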

  • https://sw005320.github.io/INTERSPEECH21_SS_NAR_SP/
  • Katrin Kirchhoff (Amazon)
  • Shinji Watanabe (Carnegie Mellon University)
  • Yuya Fujita (Yahoo Japan Corporation)

DiCOVA: Diagnosis of COVID-19 using Acoustics

The COVID-19 pandemic has resulted in more than 93 million infections and more than 2 million casualties. Large-scale testing, social distancing, and face masks have been critical measures to help contain the spread of the infection. While the list of symptoms is regularly updated, it is established that in symptomatic cases COVID-19 seriously impairs normal functioning of the respiratory system. Does this alter the acoustic characteristics of breath, cough, and speech sounds produced through the respiratory system? This is an open question waiting for answers. A COVID-19 diagnosis methodology based on acoustic signal analysis, if successful, can provide a remote, scalable, and economical means for testing individuals. This can supplement existing nucleotide-based COVID-19 testing methods, such as RT-PCR and RAT.

The DiCOVA Challenge is designed to find answers to this question by enabling participants to analyze an acoustic dataset gathered from COVID-19-positive and non-COVID-19 individuals. The findings will be presented in a special session at Interspeech 2021. The timeliness and global societal importance of the challenge warrant focused effort from researchers across the globe, including the fields of medical and respiratory sciences, mathematical sciences, and machine learning. We look forward to your participation!

  • http://dicova2021.github.io/
  • Neeraj Sharma (Indian Institute of Science, Bangalore, India)
  • Prasanta Kumar Ghosh (Indian Institute of Science, Bangalore, India)
  • Srikanth Raj Chetupalli (Indian Institute of Science, Bangalore, India)
  • Sriram Ganapathy (Indian Institute of Science, Bangalore, India)

Deep Noise Suppression Challenge – INTERSPEECH 2021

The Deep Noise Suppression (DNS) challenge is designed to foster innovation in the area of noise suppression to achieve superior perceptual speech quality. We recently organized DNS challenge special sessions at INTERSPEECH 2020 and ICASSP 2020. We open-sourced training and test datasets for the wideband scenario, as well as a subjective evaluation framework based on ITU-T standard P.808, which was used to evaluate challenge submissions. Many researchers from academia and industry made significant contributions to push the field forward, yet even the best noise suppressor was far from achieving superior speech quality in challenging scenarios. In this version of the challenge, organized at INTERSPEECH 2021, we are expanding both our training and test datasets to accommodate full band scenarios. The two tracks in this challenge focus on real-time denoising for (i) wide band and (ii) full band scenarios. We are also making available DNSMOS, a reliable non-intrusive objective speech quality metric for wide band, for participants to use during their development phase. The final evaluation will be based on the ITU-T P.835 subjective evaluation framework, which rates the quality of speech and noise in addition to the overall quality of the speech.

We will have two tracks in this challenge:

  • Track 1 : Real-Time Denoising track for the wide band scenario. The noise suppressor must take less than the stride time Ts (in ms) to process a frame of size T (in ms) on an Intel Core i5 quad-core machine clocked at 2.4 GHz or an equivalent processor; for example, Ts = T/2 for 50% overlap between frames. The total algorithmic latency, including the frame size T, stride time Ts, and any look-ahead, must be at most 40 ms. For example, for a real-time system that receives 20 ms audio chunks, a frame length of 20 ms with a stride of 10 ms gives an algorithmic latency of 30 ms and satisfies the requirement, whereas a frame of 32 ms with a stride of 16 ms gives 48 ms and does not. If the frame size plus stride T1 = T + Ts is less than 40 ms, up to (40 - T1) ms of future information may be used (see the sketch after this list).
  • Track 2 : Real-Time Denoising track for the full band scenario. Satisfy the Track 1 requirements, but at 48 kHz.
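A small sketch of the latency bookkeeping above (our own illustrative helper, not official challenge tooling); it reproduces the two worked examples from Track 1.

```python
# Track 1 latency check: frame + stride + look-ahead must fit the 40 ms budget.
def algorithmic_latency_ms(frame_ms, stride_ms, lookahead_ms=0.0):
    return frame_ms + stride_ms + lookahead_ms

def satisfies_budget(frame_ms, stride_ms, lookahead_ms=0.0, budget_ms=40.0):
    return algorithmic_latency_ms(frame_ms, stride_ms, lookahead_ms) <= budget_ms

print(satisfies_budget(20, 10))                   # True:  20 + 10 = 30 ms <= 40 ms
print(satisfies_budget(32, 16))                   # False: 32 + 16 = 48 ms >  40 ms
# With T + Ts = 30 ms, up to 40 - 30 = 10 ms of future context is allowed:
print(satisfies_budget(20, 10, lookahead_ms=10))  # True: exactly 40 ms
```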

More details about the datasets and the challenge are available in the paper and the challenge github page. Participants must adhere to the rules of the challenge.

  • https://www.microsoft.com/en-us/research/academic-program/deep-noise-suppression-challenge-interspeech-2021/
  • Chandan K A Reddy (Microsoft Corp, USA)
  • Hari Dubey (Microsoft Corp, USA)
  • Kazuhito Koishida (Microsoft Corp, USA)
  • Arun Nair (Johns Hopkins University, USA)
  • Vishak Gopal (Microsoft Corp, USA)
  • Ross Cutler (Microsoft Corp, USA)
  • Robert Aichner (Microsoft Corp, USA)
  • Sebastian Braun (Microsoft Research, USA)
  • Hannes Gamper (Microsoft Research, USA)
  • Sriram Srinivasan (Microsoft Corp, USA)

Privacy-preserving Machine Learning for Audio, Speech and Language Processing

This special session focuses on privacy-preserving machine learning (PPML) techniques in speech, language, and audio processing, including centralized, distributed, and on-device processing approaches. Novel contributions and overviews on the theory and applications of PPML in speech, language, and audio are invited. We also encourage submissions related to ethical and regulatory aspects of PPML in this context.

Sending speech, language, or audio data to a cloud server exposes private information. One approach, called anonymization, is to preprocess the data so as to hide information that could identify the user, by disentangling it from other useful attributes. PPML is a different approach, which solves this problem by moving computation near the clients. Thanks to recent advances in edge computing and Neural Processing Units on mobile devices, PPML is now a feasible technology for most speech, language, and audio applications, enabling companies to train on customer data without requiring them to share the data.

With PPML, data can sit on a customer's device, where it is used for model training. During the training process, models from several clients are shared with aggregator nodes that perform model averaging and sync the new model back to each client. The new averaged model is then used for further training on each client. This process continues and lets each client benefit from the training data on all other clients, which was not possible in conventional audio/speech ML. On top of that, high-quality synthetic data can also be used for training, thanks to advances in speech, text, and audio synthesis.
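For illustration, here is a minimal sketch of one such training round in the FedAvg style, assuming equal client weighting and toy models; it is not a production PPML system and omits privacy mechanisms such as secure aggregation or differential privacy.

```python
# FedAvg-style round: local training per client, parameter averaging, sync back.
import copy
import torch
import torch.nn as nn

def make_model():
    return nn.Linear(40, 10)                  # toy acoustic-frame classifier

def local_step(model, x, y):
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()

def federated_average(models):
    avg = copy.deepcopy(models[0])
    with torch.no_grad():
        for name, param in avg.named_parameters():
            stacked = torch.stack([dict(m.named_parameters())[name] for m in models])
            param.copy_(stacked.mean(dim=0))  # equal-weight parameter average
    return avg

clients = [make_model() for _ in range(3)]
for round_ in range(2):                       # two communication rounds
    for m in clients:                         # data never leaves the client
        local_step(m, torch.randn(8, 40), torch.randint(0, 10, (8,)))
    global_model = federated_average(clients) # aggregator node
    clients = [copy.deepcopy(global_model) for _ in clients]  # sync back
```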

  • https://sites.google.com/view/ppmlforaudio
  • Harishchandra Dubey (Microsoft)
  • Amin Fazel (Amazon, Alexa)
  • Mirco Ravanelli (MILA,Université de Montréal)
  • Emmanuel Vincent (Inria)

Computational Paralinguistics ChallengE (ComParE) - COVID-19 Cough, COVID-19 Speech, Escalation & Primates

Interspeech ComParE is an open Challenge dealing with states and traits of speakers as manifested in their speech signal’s properties. In this 13th edition, we introduce four new tasks and Sub-Challenges:

  • COVID-19 Cough-based recognition,
  • COVID-19 Speech-based recognition,
  • Escalation level assessment in spoken dialogues,
  • Primate classification based on their vocalisations.

Sub-Challenges allow contributors to find their own features with their own machine learning algorithm. However, a standard feature set and tools, including recent deep learning approaches, are provided and may be used. Participants have five trials on the test set per Sub-Challenge. Participation has to be accompanied by a paper presenting the results, which undergoes Interspeech peer review.

Contributions using the provided or equivalent data are sought, including (but not limited to):

  • Participation in a Sub-Challenge
  • Contributions around the Challenge topics

Results of the Challenge and Prizes will be presented at Interspeech 2021 in Brno, Czechia.

  • http://www.compare.openaudio.eu/now/
  • Björn Schuller (University of Augsburg, Germany / Imperial College, UK)
  • Anton Batliner (University of Augsburg, Germany)
  • Christian Bergler (FAU, Germany)
  • Cecilia Mascolo (University of Cambridge, UK)
  • Jing Han (University of Cambridge, UK)
  • Iulia Lefter (Delft University of Technology, The Netherlands)
  • Heysem Kaya (Utrecht University, The Netherlands)

OpenASR20 and Low Resource ASR Development

The goal of the OpenASR (Open Automatic Speech Recognition) Challenge is to assess the state of the art of ASR technologies for low-resource languages.

The OpenASR Challenge is an open challenge created out of the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program, which encompasses more tasks, including CLIR (cross-language information retrieval), domain classification, and summarization. For every year of MATERIAL, NIST supports a simplified, smaller-scale evaluation open to all, focusing on a particular technology aspect of MATERIAL. The capabilities tested in the open challenges are expected to ultimately support the MATERIAL task of effective triage and analysis of large volumes of data in a variety of less-studied languages.

The special session aims to bring together researchers from all sectors working on ASR for low-resource languages to discuss the state of the art and future directions. It will allow for fruitful exchanges between OpenASR20 Challenge participants and other researchers working on low-resource ASR. We invite contributions from OpenASR20 participants, MATERIAL performers, as well as any other researchers with relevant work in the low-resource ASR problem space.

Topics of interest include:

  • Cross-lingual training techniques to compensate for the ten-hour training condition
  • Factors influencing ASR performance on low resource languages by gender and dialect
  • Resource conditions used for unconstrained development condition
  • Low Resource ASR tailored to MATERIAL’s Cross Language Information Retrieval Evaluation
  • Genre mismatch between the speech training data and the evaluation data
  • Other topics focused on low-resource ASR challenges and solutions
  • https://www.nist.gov/itl/iad/mig/openasr-challenge
  • Peter Bell, University of Edinburgh
  • Jayadev Billa, University of Southern California Information Sciences Institute
  • William Hartmann, Raytheon BBN Technologies
  • Kay Peterson, National Institute of Standards and Technology

