TTS - Cube

Trainable end-to-end speech synthesis system using recurrent neural networks

Language independent

TTS-Cube can be trained on raw text, which makes it somewhat language-independent. We hope to provide as many pretrained models as possible, but we are limited by hardware availability.

Minimal effort

TTS-Cube does not require pre-aligned data or any text processing. It learns prosody and acoustic models just by looking at text. The only recommendation is that the input data be normalized, with numbers, abbreviations and acronyms expanded into their spoken form.
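As a rough illustration of the kind of normalization recommended above, here is a minimal sketch. It is not part of TTS-Cube: the `normalize` function, the tiny abbreviation table and the digit-by-digit reading of numbers are all assumptions made for the example, and a real normalizer would need language-specific rules.

```python
import re

# Hypothetical lookup table; a real system needs a much larger, language-specific one.
ABBREVIATIONS = {"dr.": "doctor", "etc.": "et cetera"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def normalize(text):
    """Expand known abbreviations and spell out plain integers digit by digit."""
    words = []
    for token in text.split():
        lowered = token.lower()
        if lowered in ABBREVIATIONS:
            words.append(ABBREVIATIONS[lowered])
        elif re.fullmatch(r"[0-9]+", token):
            # Simplification: read each digit separately instead of the full number.
            words.extend(DIGITS[int(ch)] for ch in token)
        else:
            words.append(token)
    return " ".join(words)


example = normalize("Dr. Smith lives at 42 etc.")
```

The point is only that the text handed to the system should already contain the words as they are spoken.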

Open source

TTS-Cube is open source and we are actively looking for contributors and developers. If you want to join, just contact us on GitHub and we will gladly add you to the repository.

Encoder outputs:

"Arată că interesul utilizatorilor de internet față de acțiuni ecologiste de genul Earth Hour este unul extrem de ridicat."

"Pentru a contracara proiectul, Rusia a demarat un proiect concurent, South Stream, în care a încercat să atragă inclusiv o parte dintre partenerii Nabucco."

Vocoder output (conditioned on gold-standard data)

Note: The mel-spectrum is computed with a frame-shift of 12.5 ms. This means that Griffin-Lim reconstruction produces sloppy results at best (regardless of the number of iterations).
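To make the frame-shift figure concrete, the short sketch below converts 12.5 ms into a hop length in samples and a frame rate. The sample rates used are illustrative assumptions, since this page does not state the one used for training, and the helper names are hypothetical.

```python
def frame_shift_to_hop(sample_rate_hz, frame_shift_ms=12.5):
    """Number of audio samples between the starts of consecutive frames."""
    return int(round(sample_rate_hz * frame_shift_ms / 1000.0))


def frames_per_second(frame_shift_ms=12.5):
    """How many spectrogram frames describe one second of audio."""
    return 1000.0 / frame_shift_ms


# Assumed sample rates, for illustration only.
hop_16k = frame_shift_to_hop(16000)
hop_24k = frame_shift_to_hop(24000)
rate = frames_per_second()
```

Under these assumptions each second of audio is summarized by 80 spectrogram frames, and each frame advances by a few hundred samples.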

Original	Vocoder

End to end decoding

The Encoder model is still converging, so the examples are currently of low quality. We will update the files as soon as we have a stable Encoder model.

Original	Synthesized (TTS-Cube)	HTS

		FAILED

Technical details

TTS-Cube is based on concepts described in Tacotron (1 and 2), Char2Wav and WaveRNN, but its architecture does not stick to the exact recipes:

It has a dual architecture, composed of (a) a module (Encoder) that converts sequences of characters or phonemes into a mel-log spectrogram and (b) an RNN-based Vocoder that is conditioned on the spectrogram to produce audio
The Encoder is similar to those proposed in Tacotron (Wang et al., 2017) and Char2Wav (Sotelo et al., 2017), but

has a lightweight architecture with just a two-layer BDLSTM encoder and a two-layer LSTM decoder
uses the guided attention trick (Tachibana et al., 2017), which makes the attention module converge very quickly (in our experiments we were unable to reach an acceptable model without this trick)
does not employ any CNN/pre-net or post-net
uses a simple highway connection from the attention to the output of the decoder (we observed that this forces the encoder to actually learn how to produce the mean values of the mel-log spectrum for particular phones/characters)
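
The guided attention trick mentioned in the list above can be sketched as follows. This is a minimal pure-Python illustration of the penalty described by Tachibana et al. (2017), not TTS-Cube's actual code; the function names are hypothetical and the value g = 0.2 is the one suggested in that paper, assumed here.

```python
import math


def guided_attention_weights(n_chars, n_frames, g=0.2):
    """Penalty matrix W[n][t] = 1 - exp(-(n/N - t/T)^2 / (2 * g^2)).

    Entries near the diagonal are close to 0, entries far from it approach 1.
    """
    weights = []
    for n in range(n_chars):
        row = []
        for t in range(n_frames):
            diff = n / n_chars - t / n_frames
            row.append(1.0 - math.exp(-(diff * diff) / (2.0 * g * g)))
        weights.append(row)
    return weights


def guided_attention_loss(attention, g=0.2):
    """Mean of attention[n][t] * W[n][t] over all character/frame pairs."""
    n_chars = len(attention)
    n_frames = len(attention[0])
    w = guided_attention_weights(n_chars, n_frames, g)
    total = sum(attention[n][t] * w[n][t]
                for n in range(n_chars) for t in range(n_frames))
    return total / (n_chars * n_frames)


# Toy alignments: one follows the diagonal, the other runs against it.
diagonal = [[1.0 if n == t else 0.0 for t in range(4)] for n in range(4)]
reversed_alignment = [[1.0 if n == 3 - t else 0.0 for t in range(4)]
                      for n in range(4)]
loss_diagonal = guided_attention_loss(diagonal)
loss_reversed = guided_attention_loss(reversed_alignment)
```

A diagonal alignment incurs no penalty, while attention mass far from the diagonal is penalized. Adding this term to the training loss pushes the attention module toward roughly monotonic alignments early on.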

The Vocoder is similar to WaveRNN (Kalchbrenner et al., 2018), but instead of modifying the RNN cells (as proposed in their paper), we used two coupled neural networks.