COMPARISON OF EXPRESSIVE SPEECH SYNTHESIS SYSTEMS WITH CONTROLLABLE EMOTION STRENGTH

Mia Vujović
Keywords: expressive speech synthesis, emotion modeling, embedding vectors, deep neural networks

Abstract

In expressive speech synthesis, it is important to generate emotionally colored speech that reflects the complexity of emotional states. Many TTS systems model emotions in synthesized speech as discrete sets, but the generated speech can resemble human speech only when the variations that exist within emotional states are also taken into account. This paper presents a theoretical analysis and comparison of two innovative expressive speech synthesis systems that model the complexity of emotions as continuous vectors that can be manipulated. The results show that the approach based on t-SNE embedding vectors is applicable only to specific databases, whereas the second approach, based on interpolating points in the embedding space of a multi-speaker, multi-style model, is more general but requires further analysis.
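
The abstract does not describe how the embedding-space interpolation is carried out, so the following is only a minimal sketch of the general idea, assuming that emotion strength is controlled by linear interpolation between a neutral style embedding and an emotional style embedding. The function name `interpolate_embedding`, the 3-dimensional vectors, and the `strength` parameter are hypothetical and chosen purely for illustration; they are not taken from either of the compared systems.

```python
from typing import List, Sequence


def interpolate_embedding(
    neutral: Sequence[float],
    emotional: Sequence[float],
    strength: float,
) -> List[float]:
    """Linearly interpolate between two style embeddings.

    strength = 0.0 returns the neutral embedding, strength = 1.0 returns
    the emotional embedding, and values in between give intermediate points.
    This is an illustrative assumption, not the method of either system.
    """
    if len(neutral) != len(emotional):
        raise ValueError("Embeddings must have the same dimensionality")
    return [n + strength * (e - n) for n, e in zip(neutral, emotional)]


# Hypothetical 3-dimensional style embeddings (real ones would be learned).
neutral_style = [0.0, 0.0, 0.0]
happy_style = [1.0, -2.0, 0.5]

# A point halfway between the neutral and the "happy" style.
half_happy = interpolate_embedding(neutral_style, happy_style, 0.5)
```

In such a scheme the interpolated vector would be passed to the synthesis model in place of a fixed style embedding. Whether intermediate points actually correspond to perceptually intermediate emotion strength is the kind of question the abstract flags as requiring further analysis.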

Published
2020-12-26
Section
Electrical and Computer Engineering