Speech Synthesis

Technology Text-To-Speech (Speech Synthesis)

The VitalVoice™ Text-To-Speech (TTS) system is based on Unit Selection, a state-of-the-art technology used in TTS applications worldwide. The technology extracts sound elements, or units, from a large speech database and concatenates them into continuous speech, producing natural-sounding synthesized speech that retains the individuality of the original speaker’s voice. The system currently includes six Russian voices (four female and two male).
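
The selection step can be illustrated with a minimal sketch of the textbook formulation of Unit Selection: each candidate unit is scored by a target cost (how well it fits the desired prosody) and a join cost (how audible the seam with its neighbour would be), and a Viterbi search picks the cheapest sequence. The source does not describe the costs VitalVoice actually uses, so the `Unit` fields, both cost functions, and the pitch-only features below are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One recorded candidate unit (hypothetical, heavily simplified)."""
    name: str        # which unit this is, e.g. the diphone "a-t"
    pitch_hz: float  # average pitch of the recording
    start_hz: float  # pitch at the left edge
    end_hz: float    # pitch at the right edge


def target_cost(unit, wanted_pitch_hz):
    # Assumed cost: distance from the prosody wanted for this slot.
    return abs(unit.pitch_hz - wanted_pitch_hz)


def join_cost(left, right):
    # Assumed cost: pitch jump at the seam between consecutive units.
    return abs(left.end_hz - right.start_hz)


def select_units(candidates, wanted_pitches):
    """Viterbi search for the cheapest sequence, one unit per slot.

    candidates[i] lists the database units available for slot i.
    Returns (chosen_units, total_cost).
    """
    # best[i][j] = (cost of cheapest path ending in candidates[i][j], backpointer)
    best = [[(target_cost(u, wanted_pitches[0]), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, u), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(u, wanted_pitches[i]), back))
        best.append(row)

    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1], total
```

In this sketch a candidate with a slightly worse target cost can still win if it joins more smoothly with its neighbour, which is the trade-off that lets concatenated speech sound continuous.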

The TTS speech database now comprises over 80 hours of speech recorded at STC. The database was automatically segmented and labeled at several levels (allophones, diphones, transcription, spelling, etc.); in some cases experts checked the labeling manually to catch possible errors and ensure the highest TTS quality.

Voice recording for TTS is performed with special software (Voice Constructor™) developed at STC, which makes it possible to produce high-quality sound material for speech synthesis quickly and easily.

The TTS system also includes automatic text processing based on original linguistic algorithms developed at STC for text normalization, homograph resolution, phrase break detection, intonation modeling, etc. This helps the system deal with all types of Russian texts, including online news, literary works, call center dialogs, and so on.
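
Two of these stages, text normalization and phrase break detection, can be sketched in a few lines. This is a toy illustration only: STC's algorithms are not described in the source, the example uses English rather than Russian for readability, and the abbreviation table, the digit-by-digit reading of numbers, and the punctuation-only break rule are all simplifying assumptions.

```python
import re

# Hypothetical abbreviation table; a real system needs a far larger,
# context-aware one.
ABBREVIATIONS = {"Dr.": "Doctor", "No.": "number"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def normalize(text):
    """Expand known abbreviations and spell out digit strings digit by digit."""
    words = []
    for token in text.split():
        number = re.fullmatch(r"([0-9]+)([,.;:!?]*)", token)
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif number:
            digits, tail = number.groups()
            spelled = " ".join(DIGITS[int(d)] for d in digits)
            words.append(spelled + tail)  # keep trailing punctuation
        else:
            words.append(token)
    return " ".join(words)


def split_phrases(text):
    """Place a phrase break after every punctuation mark followed by a space."""
    return [p for p in re.split(r"(?<=[,.;:!?])\s+", text.strip()) if p]


normalized = normalize("Dr. Ivanov arrives at 9, room 42.")
phrases = split_phrases(normalized)
```

Normalizing before splitting matters here: the period in "Dr." would otherwise be mistaken for a phrase boundary. Real text adds many cases this sketch ignores, such as numbers that must agree grammatically with the surrounding words.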

Research directions

TTS research and development at STC started out with an allophone-segmented database and has since moved on to a diphone-segmented database. Diphone segmentation divides speech into elements that each combine two halves of adjacent phones: the second half of one phone and the first half of the next. By contrast, a single allophone is virtually a triphone, since its left and right contexts are also taken into account. Diphone segmentation is widely accepted to be a superior technology that substantially improves TTS quality.
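
As a minimal sketch of what diphone segmentation means in practice, the function below converts a phone-level labeling into diphone units that run from the middle of one phone to the middle of the next. Cutting exactly at the temporal midpoint is an assumption made for simplicity (real systems may choose the most stable point inside each phone), and the phone labels and times in milliseconds are invented for the example.

```python
def phones_to_diphones(phones):
    """Convert phone labels with boundaries into diphone units.

    phones: list of (label, start_ms, end_ms), contiguous in time.
    Returns a list of (diphone_label, start_ms, end_ms), where each diphone
    spans the second half of one phone and the first half of the next.
    """
    diphones = []
    for (left, l_start, l_end), (right, r_start, r_end) in zip(phones, phones[1:]):
        left_mid = (l_start + l_end) / 2    # assumed cut point: phone midpoint
        right_mid = (r_start + r_end) / 2
        diphones.append((f"{left}-{right}", left_mid, right_mid))
    return diphones


# Hypothetical labeling of the syllable "da" surrounded by silence.
phones = [("sil", 0, 200), ("d", 200, 300), ("a", 300, 500), ("sil", 500, 700)]
diphones = phones_to_diphones(phones)
```

Because every cut falls inside a phone rather than at a phone boundary, the transitions between sounds, where coarticulation is strongest, stay intact inside each unit.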

A new line of research currently carried out at STC is TTS based on Hidden Markov Models (HMMs), which involves training a statistical model on a speech database and using the model’s predictions to synthesize speech sounds. While this technology is highly flexible and does not require a large database to be stored on the user’s computer or mobile device, the resulting speech does not achieve the same level of naturalness as Unit Selection-based TTS. However, HMM technology can be successfully combined with Unit Selection to improve some aspects of synthesized speech. In keeping with the latest developments in the field, STC is developing a hybrid Unit Selection/HMM TTS system with HMM-based prosody modeling, including spectral envelope modeling. This approach helps to bring synthesized speech closer to the speaker’s natural spontaneous intonation. The HMM is trained using a CART (Classification and Regression Trees) classifier with a feature set developed specifically for Russian. This technology is used for five female and two male voices.
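
The core step of CART, choosing the yes/no question about the linguistic context that best separates the training data, can be sketched as follows. This shows a single regression split on a prosodic target, not a full tree and not HMM training; the feature names (`stressed`, `phrase_final`) and the pitch values are hypothetical and are not taken from STC's Russian feature set.

```python
def best_split(rows, targets):
    """Pick the boolean feature whose split minimizes squared error.

    rows: list of dicts mapping feature name -> bool (linguistic context).
    targets: list of floats aligned with rows (e.g. pitch in Hz).
    Returns (feature, mean_if_true, mean_if_false), or None if no feature
    separates the data into two non-empty groups.
    """
    def sse(values):
        # Sum of squared deviations from the group mean.
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best = None
    for feature in rows[0]:
        yes = [t for r, t in zip(rows, targets) if r[feature]]
        no = [t for r, t in zip(rows, targets) if not r[feature]]
        if not yes or not no:
            continue  # this question does not divide the data
        error = sse(yes) + sse(no)
        if best is None or error < best[0]:
            best = (error, feature, sum(yes) / len(yes), sum(no) / len(no))
    return None if best is None else best[1:]


# Hypothetical training data: context of four vowels and their pitch.
rows = [
    {"stressed": True, "phrase_final": False},
    {"stressed": True, "phrase_final": True},
    {"stressed": False, "phrase_final": False},
    {"stressed": False, "phrase_final": True},
]
targets = [220.0, 200.0, 180.0, 160.0]
split = best_split(rows, targets)
```

A full tree repeats this step recursively on each resulting group, so that contexts never seen in the recordings still receive a prediction from the leaf they fall into.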

Further research directions include:

• Further development of the hybrid Unit Selection/HMM technology with the aim of better imitating the quality of a speaker’s individual speech;

• Automatic parameter tuning for hybrid TTS algorithms and for speech quality assessment;

• Producing new TTS voices in a range of languages;

• Improving text processing methods to increase the accuracy of non-standard word normalization, homonymy resolution, syntactic ambiguity resolution, etc.