Around 80 million people worldwide, including some 800,000 in Germany, are affected by chronic stuttering. This neurological disorder is currently incurable. Although more than 70 percent of those affected undergo speech therapy, their speech fluency usually remains impaired for life. Stuttering often leads to severe psychosocial stress and significantly reduces quality of life.
The project team is developing an inconspicuous in-ear headphone system that can immediately and effortlessly improve the fluency of people who stutter. The system uses artificial intelligence (AI) in speech synthesis to generate audio feedback while the wearer speaks. This feedback specifically activates a neurocognitive mechanism in the brain that bypasses the pathological component of stuttering. By aligning the system with the neural principles of human hearing, it becomes possible for the first time to improve speech fluency even with long-term use. In addition, laboratory and field studies are being conducted to comprehensively evaluate the system’s effectiveness and suitability for everyday use. The result promises to be a pioneering solution that could represent a breakthrough in the technology-assisted treatment of stuttering.
Detecting and segmenting dysfluencies is crucial for effective speech therapy and real-time feedback. However, most methods only classify dysfluencies at the utterance level. We introduce StutterCut, a semi-supervised framework that formulates dysfluency segmentation as a graph partitioning problem, where speech embeddings from overlapping windows are represented as graph nodes. We refine the connections between nodes using a pseudo-oracle classifier trained on weak (utterance-level) labels, with its influence controlled by an uncertainty measure from Monte Carlo dropout. Additionally, we extend the weakly labelled FluencyBank dataset by incorporating frame-level dysfluency boundaries for four dysfluency types. This provides a more realistic benchmark compared to synthetic datasets. Experiments on real and synthetic datasets show that StutterCut outperforms existing methods, achieving higher F1 scores and more precise stuttering onset detection.
@inproceedings{Ghosh2025,author={Ghosh, Suhita and Jouaiti, Melanie and Perschewski, Jan-Ole and Stober, Sebastian},booktitle={Interspeech 2025},title={StutterCut: Uncertainty-Guided Normalised Cut for Dysfluency Segmentation},year={2025},month=aug,pages={808--812},publisher={ISCA},series={interspeech_2025},collection={interspeech_2025},doi={10.21437/interspeech.2025-151}}
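The normalised-cut formulation behind StutterCut can be pictured with a minimal spectral-partitioning sketch. Everything below is an illustrative stand-in, not the paper's actual pipeline: the 2-D embeddings, the RBF affinity kernel, and the two-cluster setup replace the real speech-window embeddings and the pseudo-oracle/uncertainty refinement of the graph.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in embeddings for overlapping speech windows: the first 10 windows
# mimic a fluent region, the last 10 a dysfluent region (two 2-D clusters).
X = np.vstack([rng.normal(0.0, 0.3, (10, 2)),
               rng.normal(3.0, 0.3, (10, 2))])

# Affinity between window nodes: RBF kernel on embedding distance.
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / 2.0)
np.fill_diagonal(W, 0.0)

# Symmetric normalised Laplacian: L = I - D^{-1/2} W D^{-1/2}.
deg = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt

# The eigenvector of the second-smallest eigenvalue (Fiedler vector) is the
# relaxed solution of the two-way normalised cut; threshold at zero to partition.
eigvals, eigvecs = np.linalg.eigh(L)      # eigh returns ascending eigenvalues
fiedler = D_inv_sqrt @ eigvecs[:, 1]
labels = (fiedler > 0).astype(int)

print(labels)  # one label per window; the two regions fall into different groups
```

In the paper's framework, the classifier trained on utterance-level labels and its Monte Carlo dropout uncertainty would adjust the affinity matrix `W` before this eigendecomposition step.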
Investigating Inclusivity of Whisper for Dysfluent Speech
Evelyn Starzew, Suhita Ghosh, and Valerie Krug
In 12th edition of the Disfluency in Spontaneous Speech Workshop (DiSS 2025), Sep 2025
Speech recognition models have gained popularity in recent years and are able to achieve remarkable performance. However, the under-representation of pathological speech in the training data leads to significant performance drops for many state-of-the-art models on pathological speech. In our work, we investigate the inclusivity of the pre-trained Whisper model in its base variant using dysarthric speech as a use case. We aim to identify potential inequalities and whether they can be reduced through fine-tuning. For this, we compare embedding-based and attention-based representations of healthy and dysarthric samples and analyze the development of the layers’ representational capacities. Our key findings are that there are clear inequalities in the performance and computation of representations, which can be reduced significantly: fine-tuning on dysarthric speech lowers the word error rate in automatic speech recognition by 73.44%.
@inproceedings{Starzew2025,author={Starzew, Evelyn and Ghosh, Suhita and Krug, Valerie},booktitle={12th edition of the Disfluency in Spontaneous Speech Workshop (DiSS 2025)},title={Investigating Inclusivity of Whisper for Dysfluent Speech},year={2025},month=sep,pages={77--81},publisher={ISCA},series={diss_2025},collection={diss_2025},doi={10.21437/diss.2025-16}}
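The 73.44% reduction above is measured in word error rate (WER). As a minimal, self-contained illustration of the metric (not the paper's evaluation code), WER is the word-level Levenshtein distance normalised by the number of reference words:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

# One substitution ("quik") and one insertion ("jumps") over 4 reference words.
print(wer("the quick brown fox", "the quik brown fox jumps"))  # 0.5
```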
2024
Anonymising Elderly and Pathological Speech: Voice Conversion Using DDSP and Query-by-Example
Suhita Ghosh, Melanie Jouaiti, Arnab Das, Yamini Sinha, Tim Polzehl, Ingo Siegert, and Sebastian Stober
In Interspeech 2024, Sep 2024
Speech anonymisation aims to protect speaker identity by changing personal identifiers in speech while retaining linguistic content. Current methods fail to retain prosody and unique speech patterns found in elderly and pathological speech domains, which is essential for remote health monitoring. To address this gap, we propose a voice conversion-based method (DDSP-QbE) using differentiable digital signal processing and query-by-example. The proposed method, trained with novel losses, aids in disentangling linguistic, prosodic, and domain representations, enabling the model to adapt to uncommon speech patterns. Objective and subjective evaluations show that DDSP-QbE significantly outperforms the voice conversion state-of-the-art concerning intelligibility, prosody, and domain preservation across diverse datasets, pathologies, and speakers while maintaining quality and speaker anonymity. Experts validate domain preservation by analysing twelve clinically pertinent domain attributes.
@inproceedings{Ghosh2024,author={Ghosh, Suhita and Jouaiti, Melanie and Das, Arnab and Sinha, Yamini and Polzehl, Tim and Siegert, Ingo and Stober, Sebastian},booktitle={Interspeech 2024},title={Anonymising Elderly and Pathological Speech: Voice Conversion Using DDSP and Query-by-Example},year={2024},month=sep,pages={4438--4442},publisher={ISCA},doi={10.21437/interspeech.2024-328}}
Improving Voice Quality in Speech Anonymization With Just Perception-Informed Losses
Suhita Ghosh, Tim Thiele, Frederic Lorbeer, and Sebastian Stober
In Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation, 2024
The increasing use of cloud-based speech assistants has heightened the need for effective speech anonymization, which aims to obscure a speaker’s identity while retaining critical information for subsequent tasks. One approach to achieving this is through voice conversion. While existing methods often emphasize complex architectures and training techniques, our research underscores the importance of loss functions inspired by the human auditory system. Our proposed loss functions are model-agnostic, incorporating handcrafted and deep learning-based features to effectively capture quality representations. Through objective and subjective evaluations, we demonstrate that a VQVAE-based model, enhanced with our perception-driven losses, surpasses the vanilla model in terms of naturalness, intelligibility, and prosody while maintaining speaker anonymity. These improvements are consistently observed across various datasets, languages, target speakers, and genders.
@inproceedings{ghosh24_NeuRIPS_VQVAE,author={Ghosh, Suhita and Thiele, Tim and Lorbeer, Frederic and Stober, Sebastian},booktitle={Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation},title={{Improving Voice Quality in Speech Anonymization With Just Perception-Informed Losses}},year={2024},url={https://openreview.net/forum?id=oitQwlUFYr}}
T-DVAE: A Transformer-Based Dynamical Variational Autoencoder for Speech
Jan-Ole Perschewski and Sebastian Stober
In Artificial Neural Networks and Machine Learning – ICANN 2024, 2024
In contrast to Variational Autoencoders, Dynamical Variational Autoencoders (DVAEs) learn a sequence of latent states for a time series. Initially, they were implemented using recurrent neural networks (RNNs), which are known for challenging training dynamics and problems with long-term dependencies. This led to the recent adoption of Transformer-based implementations that stay close to the RNN-based design. These implementations still use RNNs as part of the architecture, even though the Transformer can solve the task as the sole building block. Hence, we improve the LigHT-DVAE architecture by removing the dependence on RNNs and cross-attention. Furthermore, we show that a trained LigHT-DVAE ignores output-to-hidden connections, which allows us to simplify the overall architecture by removing them. We demonstrate the capability of the resulting T-DVAE on LibriSpeech and VoiceBank, with improvements in training time, memory consumption, and generative performance.
@inbook{Perschewski2024,author={Perschewski, Jan-Ole and Stober, Sebastian},pages={33--46},publisher={Springer Nature Switzerland},title={T-DVAE: A Transformer-Based Dynamical Variational Autoencoder for Speech},year={2024},isbn={9783031723506},booktitle={Artificial Neural Networks and Machine Learning – ICANN 2024},doi={10.1007/978-3-031-72350-6_3},issn={1611-3349}}
2023
Improving voice conversion for dissimilar speakers using perceptual losses
Suhita Ghosh, Yamini Sinha, Ingo Siegert, and Sebastian Stober
In 49. Jahrestagung für Akustik DAGA 2023, Hamburg, Mar 2023
In this paper, we analyze and incorporate acoustic features in a deep-learning-based speaker anonymization model. Speaker anonymization aims at suppressing personally identifiable information while preserving the prosody and linguistic content. In this work, a StarGAN-based voice conversion model is used for anonymization, where a source speaker’s voice is transformed into that of a target speaker. It has typically been observed that the quality of the converted voice varies across target speakers, especially when certain acoustic properties such as pitch differ strongly between the source and target speakers. Choosing a target speaker dissimilar to the source speaker may lead to successful anonymization. However, choosing a very dissimilar target speaker often leads to low-quality voice conversion. Therefore, we aim to improve the overall quality of the converted voice by introducing perceptual losses based on stress- and intonation-related acoustic features such as the power envelope and F0. This facilitates improved anonymization and voice quality for all target speakers.
@inproceedings{Ghosh2023daga,author={Ghosh, Suhita and Sinha, Yamini and Siegert, Ingo and Stober, Sebastian},booktitle={49. Jahrestagung f\"{u}r Akustik DAGA 2023, Hamburg},title={Improving voice conversion for dissimilar speakers using perceptual losses},year={2023},address={Hamburg, Germany},month=mar,organization={German Acoustical Society (DEGA)},pages={1358--1361},publisher={German Acoustical Society (DEGA)},url={https://pub.dega-akustik.de/DAGA_2023/data/articles/000469.pdf}}
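As a toy illustration of a loss on stress-related acoustic features, the sketch below compares frame-wise power envelopes of two signals with an L1 distance. The frame size, hop, and test signals are arbitrary stand-ins; the actual training losses in the paper also cover F0 and other intonation features:

```python
import numpy as np

def power_envelope(x: np.ndarray, frame: int = 256, hop: int = 128) -> np.ndarray:
    """Frame-wise RMS power of a waveform."""
    n = 1 + max(0, len(x) - frame) // hop
    frames = np.stack([x[i * hop: i * hop + frame] for i in range(n)])
    return np.sqrt((frames ** 2).mean(axis=1))

def envelope_loss(source: np.ndarray, converted: np.ndarray) -> float:
    """L1 distance between the power envelopes of two equal-length signals."""
    return float(np.abs(power_envelope(source) - power_envelope(converted)).mean())

# Toy example: an amplitude-modulated tone (a crude "stress pattern")
# vs. the same tone with the modulation removed.
t = np.linspace(0, 1, 16000, endpoint=False)
src = np.sin(2 * np.pi * 220 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 3 * t))
flat = np.sin(2 * np.pi * 220 * t)

print(envelope_loss(src, src))   # identical signals -> 0.0
print(envelope_loss(src, flat))  # stress pattern removed -> clearly larger loss
```

In a voice conversion model, a term like `envelope_loss` would be added to the training objective to penalise conversions that flatten the source speaker's stress pattern.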
Anonymization of Stuttered Speech – Removing Speaker Information while Preserving the Utterance
Jan Hintz, Sebastian Bayerl, Yamini Sinha, Suhita Ghosh, Martha Schubert, Sebastian Stober, Korbinian Riedhammer, and Ingo Siegert
In 3rd Symposium on Security and Privacy in Speech Communication, Aug 2023
Concealing the identity through speaker anonymization is essential in various situations. This study focuses on investigating how stuttering affects the anonymization process. Two scenarios are considered: preserving the pathology in the diagnostic/remote treatment context and obfuscating the pathology. The paper examines the effectiveness of three state-of-the-art approaches in achieving high anonymization, as well as the preservation of dysfluencies. The findings indicate that while a speaker conversion method may not achieve perfect anonymization (Baseline 27.25% EER and F0 Delta 32.63% EER), it does preserve the pathology. This effect was objectively evaluated by performing a stuttering classification. Although this solution may be useful in a remote treatment scenario for speech pathologies, it presents a vulnerability in anonymization. To address this issue, we propose an alternative approach that uses automatic speech recognition and text-based speech synthesis to avoid re-identification (48.27% EER).
@inproceedings{Hintz2023,author={Hintz, Jan and Bayerl, Sebastian and Sinha, Yamini and Ghosh, Suhita and Schubert, Martha and Stober, Sebastian and Riedhammer, Korbinian and Siegert, Ingo},booktitle={3rd Symposium on Security and Privacy in Speech Communication},title={Anonymization of Stuttered Speech -- Removing Speaker Information while Preserving the Utterance},year={2023},month=aug,publisher={ISCA},series={spsc_2023},collection={spsc_2023},doi={10.21437/spsc.2023-7}}
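The EER figures quoted above come from speaker-verification trial scores: the equal error rate is the operating point where the false-accept rate (impostors accepted) equals the false-reject rate (genuine speakers rejected), so a higher EER means better anonymisation. A minimal computation over toy scores (not the study's evaluation pipeline) might look like this:

```python
import numpy as np

def eer(scores, labels):
    """Equal error rate: sweep thresholds over the observed scores and
    return the rate where false-accept and false-reject rates cross."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = (labels == 1).sum()             # genuine trials
    neg = (labels == 0).sum()             # impostor trials
    best_gap, best_eer = np.inf, 1.0
    for thr in np.unique(scores):
        far = ((scores >= thr) & (labels == 0)).sum() / neg  # impostors accepted
        frr = ((scores < thr) & (labels == 1)).sum() / pos   # genuine rejected
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer

# Toy trials: higher score means "same speaker"; 1 = genuine, 0 = impostor.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,   0]
print(eer(scores, labels))  # overlapping score ranges -> EER of 1/3
```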
StarGAN-VC++: Towards Emotion Preserving Voice Conversion Using Deep Embeddings
Arnab Das, Suhita Ghosh, Tim Polzehl, Ingo Siegert, and Sebastian Stober
In 12th ISCA Speech Synthesis Workshop (SSW2023), Aug 2023
Voice conversion (VC) transforms an utterance to sound like another person without changing the linguistic content. A recently proposed generative adversarial network-based VC method, StarGANv2-VC, is very successful in generating natural-sounding conversions. However, the method fails to preserve the emotion of the source speaker in the converted samples. Emotion preservation is necessary for natural human-computer interaction. In this paper, we show that StarGANv2-VC fails to disentangle the speaker and emotion representations, pertinent to preserve emotion. Specifically, there is an emotion leakage from the reference audio used to capture the speaker embeddings while training. To counter the problem, we propose novel emotion-aware losses and an unsupervised method which exploits emotion supervision through latent emotion representations. The objective and subjective evaluations prove the efficacy of the proposed strategy over diverse datasets, emotions, gender, etc.
@inproceedings{Das2023,author={Das, Arnab and Ghosh, Suhita and Polzehl, Tim and Siegert, Ingo and Stober, Sebastian},booktitle={12th ISCA Speech Synthesis Workshop (SSW2023)},title={StarGAN-VC++: Towards Emotion Preserving Voice Conversion Using Deep Embeddings},year={2023},month=aug,publisher={ISCA},series={ssw_2023},collection={ssw_2023},doi={10.21437/ssw.2023-13}}
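One way to picture an emotion-aware loss of this kind is a cosine-distance penalty between latent emotion representations of the source and converted audio. The random embeddings below are stand-ins; the paper's actual emotion encoder and loss formulation differ:

```python
import numpy as np

def emotion_consistency_loss(emb_src: np.ndarray, emb_conv: np.ndarray) -> float:
    """Penalise changes in the latent emotion representation:
    1 - cosine similarity between source and converted embeddings."""
    cos = emb_src @ emb_conv / (np.linalg.norm(emb_src) * np.linalg.norm(emb_conv))
    return float(1.0 - cos)

rng = np.random.default_rng(1)
src = rng.normal(size=64)  # stand-in 64-dim emotion embedding

print(emotion_consistency_loss(src, src))   # same emotion direction -> ~0.0
print(emotion_consistency_loss(src, -src))  # opposite direction -> 2.0
```

Added to a voice conversion objective, such a term pushes the generator to keep the converted sample's emotion embedding aligned with the source's, counteracting the emotion leakage described in the abstract.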
Emo-StarGAN: A Semi-Supervised Any-to-Many Non-Parallel Emotion-Preserving Voice Conversion
Suhita Ghosh, Arnab Das, Yamini Sinha, Ingo Siegert, Tim Polzehl, and Sebastian Stober
In INTERSPEECH 2023, Aug 2023
Speech anonymisation prevents misuse of spoken data by removing any personal identifier while preserving at least linguistic content. However, emotion preservation is crucial for natural human-computer interaction. The well-known voice conversion technique StarGANv2-VC achieves anonymisation but fails to preserve emotion. This work presents an any-to-many semi-supervised StarGANv2-VC variant trained on partially emotion-labelled non-parallel data. We propose emotion-aware losses computed on the emotion embeddings and acoustic features correlated to emotion. Additionally, we use an emotion classifier to provide direct emotion supervision. Objective and subjective evaluations show that the proposed approach significantly improves emotion preservation over the vanilla StarGANv2-VC. This considerable improvement is seen over diverse datasets, emotions, target speakers, and inter-group conversions without compromising intelligibility and anonymisation.
@inproceedings{Ghosh2023,author={Ghosh, Suhita and Das, Arnab and Sinha, Yamini and Siegert, Ingo and Polzehl, Tim and Stober, Sebastian},booktitle={INTERSPEECH 2023},title={Emo-StarGAN: A Semi-Supervised Any-to-Many Non-Parallel Emotion-Preserving Voice Conversion},year={2023},month=aug,publisher={ISCA},series={interspeech_2023},collection={interspeech_2023},doi={10.21437/interspeech.2023-191}}
2022
Voice Privacy - leveraging multi-scale blocks with ECAPA-TDNN SE-Res2NeXt extension for speaker anonymization
Razieh Khamsehashari, Yamini Sinha, Jan Hintz, Suhita Ghosh, Tim Polzehl, Carlos Franzreb, Sebastian Stober, and Ingo Siegert
In 2nd Symposium on Security and Privacy in Speech Communication, Sep 2022
This paper presents ongoing efforts on voice anonymization with the purpose of securely anonymizing a speaker’s identity in a hotline call scenario. Our hotline seeks to provide help by remote assessment, treatment and prevention against child sexual abuse in Germany. The presented work originates from the joint contribution to the VoicePrivacy Challenge 2022 and the Symposium on Security and Privacy in Speech Communication in 2022. Having analyzed in depth the results of the first instantiation of the VoicePrivacy Challenge in 2020, the current experiments aim to improve the robustness of two distinct components of the challenge baseline. First, we analyze ASR embeddings in order to obtain a more precise and resistant representation of the source speech that is used in the challenge baseline GAN. First experiments using wav2vec show promising results. Second, to alleviate modeling and matching of source and target speaker characteristics, we propose to exchange the baseline x-vector speaker identity features with the more robust ECAPA-TDNN embedding, in order to leverage its higher-resolution multi-scale architecture. We further propose to extend the ECAPA-TDNN architecture by integrating SE-Res2NeXt units, with the expectation that this building block, which represents features at various scales, will outperform the SE-Res2Net block, which creates hierarchical residual-like connections within a single residual block. This expands the range of receptive fields for each network layer and depicts multi-scale features at a finer level. Ultimately, by including a more precise speaker identity embedding, we expect improvements for future anonymization in various application cases.
@inproceedings{Khamsehashari2022,author={Khamsehashari, Razieh and Sinha, Yamini and Hintz, Jan and Ghosh, Suhita and Polzehl, Tim and Franzreb, Carlos and Stober, Sebastian and Siegert, Ingo},booktitle={2nd Symposium on Security and Privacy in Speech Communication},title={Voice Privacy - leveraging multi-scale blocks with ECAPA-TDNN SE-Res2NeXt extension for speaker anonymization},year={2022},month=sep,publisher={ISCA},series={spsc_2022},collection={spsc_2022},doi={10.21437/spsc.2022-8}}