Analysis and Synthesis of Audio with AI: from Neurological Disease to Accented Speech and Music
| Title | Analysis and Synthesis of Audio with AI: from Neurological Disease to Accented Speech and Music |
| --- | --- |
| Publication Type | Thesis |
| Year of Publication | 2025 |
| Authors | Melechovsky J. |
| University | Singapore University of Technology and Design |
| City | Singapore |
| Thesis Type | PhD |
| Abstract | In the modern era, new technology is opening opportunities to help various groups of people around the world. In this thesis, deep learning and audio processing are utilized to target the needs of, and develop specific applications for, patients with progressive neurological diseases, speakers of non-native English accents, and amateur and hobbyist musicians and music enthusiasts. Throughout the thesis, we deal with datasets of limited size, with the controllability of deep generative audio models, and with the creation of new datasets that target both of these aspects. First, we propose a pipeline for the automated assessment of oral diadochokinesis in neurological patients and analyze acoustic features across disease type, dysarthria type, and dysarthria severity. The results confirm several hypotheses about how different dysarthria and disease types manifest in speech, while showing that dysarthria severity has the dominant effect on oral diadochokinesis performance. Following the investigation of dysarthric speech, we focus on another form of "non-standard" speech -- foreign accents of English. Specifically, in Text-to-Speech models, we address converting a speaker's accent into a different target accent while preserving their original speaker identity. We pioneer the development of accent-converting Text-to-Speech with a family of models that aim to achieve full control over accent and speaker identity in synthesized speech by disentangling the two attributes. This application could benefit minorities with non-native English accents by allowing them to customize the system's speech output for better intelligibility. Immersed in controllable generative models, we continue our journey in the music generation domain with a focus on high controllability. To counter the limited availability and size of public datasets, and to enable fine-grained controllability of the proposed models, we propose and demonstrate methods for enhancing and augmenting music datasets, introducing MidiCaps, a large-scale captioned MIDI dataset, and MusicBench, a music audio dataset with enhanced text captions. We utilize MusicBench to build Mustango, a controllable Text-to-Music generation system with a focus on music-specific commands for chords, beats, key, and tempo. Finally, we introduce SonicMaster, a text-controllable all-in-one music restoration and mastering model that we train on our proposed SonicMaster dataset. |
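
The abstract does not describe the diadochokinesis pipeline itself, so purely as an illustration of the kind of measurement such an assessment involves, the sketch below estimates a diadochokinesis (DDK) rate, i.e., syllable repetitions per second, by peak-picking a short-time energy envelope. The function name, all parameters (frame and hop sizes, prominence threshold, minimum peak spacing), and the synthetic test signal are hypothetical illustrative choices, not the method proposed in the thesis.

```python
import numpy as np
from scipy.signal import find_peaks

def ddk_rate(signal: np.ndarray, sr: int, frame_ms: float = 25.0, hop_ms: float = 10.0) -> float:
    """Estimate syllable repetitions per second from a mono DDK recording.

    Hypothetical sketch: computes a short-time energy envelope and counts its
    prominent peaks, taking each peak as one syllable burst (e.g. /pa/, /ta/, /ka/).
    """
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    # Short-time energy envelope.
    energy = np.array([
        np.sum(signal[i:i + frame] ** 2)
        for i in range(0, len(signal) - frame, hop)
    ])
    energy /= energy.max() + 1e-12  # normalize so the prominence threshold is stable
    # Require peaks at least 80 ms apart (assumed upper bound of ~12 syllables/s).
    min_gap = int(0.08 * 1000 / hop_ms)
    peaks, _ = find_peaks(energy, prominence=0.1, distance=min_gap)
    duration_s = len(signal) / sr
    return len(peaks) / duration_s

# Synthetic sanity check: noise modulated by 6 energy bursts per second for 3 s.
sr = 16000
t = np.arange(3 * sr) / sr
bursts = np.sin(2 * np.pi * 6 * t).clip(min=0) ** 4
audio = bursts * np.random.randn(len(t)) * 0.5
print(f"estimated DDK rate: {ddk_rate(audio, sr):.1f} syllables/s")  # approx. 6.0
```

In a real assessment pipeline the rate would be computed from recorded /pa-ta-ka/ repetitions and combined with other acoustic features; the envelope-peak heuristic here only conveys the general idea.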