Speech Technologies | Faculty of Technical Sciences

Subject: Speech Technologies (17.EK550)

Native organizational unit: Department of Power, Electronic and Telecommunication Engineering

General information:

Category	Scientific-professional
Scientific or art field	Telecommunications and Signal Processing
ECTS	5

Based on artificial intelligence and machine learning, speech technologies enable the development of new interfaces between humans and smart environments: phones, computers, smart-home devices, etc. Building on the knowledge acquired in several undergraduate courses, this course aims to widen students' multidisciplinary knowledge in the area of human-machine speech communication. To understand the algorithms for automatic speech recognition and synthesis, speaker recognition, and emotional speech recognition, students need to become more familiar with the features of human speech and with its acoustic and linguistic models. Beyond understanding the algorithms, the course aims to familiarize students with software tools for speech signal processing and with applications of speech technology.

Students become familiar with the basic machine learning algorithms used in automatic speech recognition (ASR) and text-to-speech synthesis (TTS). In this way, they acquire the fundamental knowledge needed for developing and applying ASR and TTS. They acquire the knowledge necessary for recording and processing speech signal databases and for understanding the algorithms for automatic speech recognition and synthesis, as well as for speaker and emotion recognition, language modules, and dialogue systems. At the end of the course, students are familiar with the possibilities of speech technologies and with the tools for developing applications based on them, and are ready to contribute professionally to this scientific and technical field.

• Introduction to ASR and TTS: history, terminology, perspectives
• Speech: production and perception, nature and characteristics (time-frequency display + labelling (AlfaNum))
• Speech signal: analysis and types of display on a computer (LPC, MFCC, PLP + visualisation (Matlab))
• Natural language processing: language modelling (n-grams) + HMM (HTK)
• Approaches to ASR (DTW, HMM, DNN); acoustic, lexical and linguistic models
• Procedures of ASR training: GMM, k-means, VQ, Baum-Welch, ML, MMI, MWE, MPE (HTK)
• Algorithms for ASR decoding: Viterbi, token passing, N-best (HTK)
• Robust ASR methods: VTN, CMN, noise suppression
• Text-to-speech synthesis (TTS): language processing, synthesis (concatenative, HMM and DNN)
• Recognition of speakers (automatic and forensic)
• Recognition of emotions in speech
• Dialogue modelling, spoken language understanding (SLU), dialogue systems

Lectures are delivered as PowerPoint presentations accompanied by numerous audio and video attachments and animations. They are followed by practical exercises in the Laboratory of Acoustics and Speech Technologies and in a sound studio at FTS. Visits to companies are arranged, where students can learn more about speech technologies. The exam prerequisites are a seminar paper and a project; to be admitted to the exam, a student must earn at least 25 of the 50 available points. Seminar papers are written individually and can serve as a basis for a master's thesis. Independent student work on the project task is supported through the web portal of the Chair of Communications and Signal Processing, www.telekom.ftn.uns.ac.rs.

Authors	Title	Year	Publisher	Language
T. Dutoit	An Introduction to Text-to-Speech Synthesis	1997	Kluwer	English
L. Rabiner and B-H. Juang	Fundamentals of Speech Recognition	1993	Prentice Hall	English