CNR Institutional Research Information System

Control of Voice Quality for Emotional Speech Synthesis

Drioli C; Tesser F; Tisato G; Cosi P; Marchetto E

2005

Abstract

Speech production in general, and emotional speech in particular, is characterized by a wide variety of phonation modalities. Voice quality, the term commonly used in the field, plays an important role in the communication of emotions through speech, and nonmodal phonation modalities (for example soft, breathy, whispery, and creaky) are commonly found in emotional speech corpora. In this paper, we describe a voice synthesis framework that allows control of a set of acoustic parameters relevant to the simulation of nonmodal voice qualities. The synthesizer's controls include standard controls for the duration and pitch of the phonemes, and additional controls for intensity, spectral emphasis, fast and slow variations of the duration and amplitude of the waveform periods (for voiced frames), frequency axis warping for changing the formant positions, and aspiration noise level. Some guidelines are given for combining these signal transformations with the aim of reproducing some nonmodal voice qualities, including soft, loud, breathy, whispery, hoarse, and tremulous voice. We also discuss how these voice qualities characterize emotional speech. The system described here is based on the FESTIVAL speech synthesis framework and on the MBROLA diphone concatenation acoustic back-end. We also address the possibility of including affective tags in the input text to be converted.
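
As an illustration only, one of the controls listed in the abstract, fast and slow variations of the duration and amplitude of the waveform periods, can be sketched as follows. This is not the authors' implementation: the function name `perturb_periods` and all of its parameters are hypothetical, and the choice of uniform random noise for the fast variation (often called jitter and shimmer) and a sinusoid for the slow variation is an assumption made for the sketch.

```python
import math
import random


def perturb_periods(n_periods, f0_hz, jitter=0.0, shimmer=0.0,
                    tremor_depth=0.0, tremor_rate_hz=5.0, seed=0):
    """Return a list of (duration_s, amplitude) pairs, one per pitch period.

    All parameters are hypothetical names chosen for this sketch:
      jitter, shimmer -- relative size of fast, random period-to-period
                         changes in duration and amplitude
      tremor_depth    -- relative size of a slow sinusoidal modulation
      tremor_rate_hz  -- rate of that slow modulation
    """
    rng = random.Random(seed)      # seeded so that runs are repeatable
    base = 1.0 / f0_hz             # unperturbed period duration in seconds
    t = 0.0                        # running time, drives the slow modulation
    periods = []
    for _ in range(n_periods):
        slow = tremor_depth * math.sin(2.0 * math.pi * tremor_rate_hz * t)
        duration = base * (1.0 + slow + jitter * rng.uniform(-1.0, 1.0))
        amplitude = 1.0 + slow + shimmer * rng.uniform(-1.0, 1.0)
        periods.append((duration, amplitude))
        t += duration
    return periods
```

With every perturbation set to zero, the function returns identical periods of 1/f0 seconds at unit amplitude. Nonzero `jitter` and `shimmer` make consecutive periods differ at random, while `tremor_depth` adds a slow, regular drift.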

Year: 2005
Organizational units: Istituto di Scienze e Tecnologie della Cognizione - ISTC
ISBN: 88-88974-69-5
Keywords: Voice Quality; Emotions; Speech Synthesis; TTS
Appears in types: 02.01 Contribution in volume (Chapter or Essay)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/140071
