TTSCube can be trained on raw text, making it somewhat language-independent. We hope to provide as many pretrained models as possible, but we are limited by hardware availability.
TTSCube does not require pre-aligned data or any text processing. It learns prosody and acoustic models just by looking at text. We only recommend that the input data be normalized, with numbers, abbreviations and acronyms expanded into their spoken form.
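As an illustration of the recommended normalization, here is a minimal sketch that expands numbers, abbreviations and acronyms into spoken form. It is not part of TTSCube: the `normalize` function, the `ABBREVIATIONS` table and the English number speller are all hypothetical, and a real normalizer would be language-specific and far more complete.

```python
import re

# Hypothetical abbreviation table (illustration only, not shipped with TTSCube).
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "etc.": "et cetera"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..999 in English."""
    if n < 0 or n > 999:
        raise ValueError("this sketch only handles 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
    hundreds, rest = divmod(n, 100)
    tail = "" if rest == 0 else " " + number_to_words(rest)
    return ONES[hundreds] + " hundred" + tail


def normalize(text):
    """Expand abbreviations, digits and all-caps acronyms into spoken form."""
    # 1. Abbreviations: whole-token lookup, case-insensitive.
    tokens = [ABBREVIATIONS.get(tok.lower(), tok) for tok in text.split()]
    text = " ".join(tokens)
    # 2. Numbers: replace each run of digits with its spelled-out form.
    text = re.sub(r"\d+", lambda m: number_to_words(int(m.group())), text)
    # 3. Acronyms: spell out 2-5 letter all-caps words letter by letter.
    text = re.sub(r"\b[A-Z]{2,5}\b", lambda m: " ".join(m.group()), text)
    return text


print(normalize("Dr. Smith paid 42 euros to the EU"))
```

The three passes run in a fixed order so that abbreviations are resolved before the acronym rule can touch them. Numbers above 999 raise an error in this sketch rather than being silently mispronounced.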
TTSCube is open source and we are actively looking for contributors and developers. If you want to join, just contact us on GitHub and we will gladly add you to the repository.
"Arată că interesul utilizatorilor de internet față de acțiuni ecologiste de genul Earth Hour este unul extrem de ridicat."
"Pentru a contracara proiectul, Rusia a demarat un proiect concurent, South Stream, în care a încercat să atragă inclusiv o parte dintre partenerii Nabucco."
Note: The mel-spectrum is computed with a frame shift of 12.5 ms. This means that Griffin-Lim reconstruction produces sloppy results at best, regardless of the number of iterations.
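To make the frame shift concrete, the sketch below converts 12.5 ms into a hop size in samples and a frame rate. The helper name `frame_shift_samples` and the sample rates used are assumptions for illustration; the sampling rate actually used by TTSCube is not stated here.

```python
def frame_shift_samples(sample_rate_hz, frame_shift_ms=12.5):
    """Number of audio samples between the starts of consecutive frames."""
    return int(round(sample_rate_hz * frame_shift_ms / 1000.0))


def frames_per_second(frame_shift_ms=12.5):
    """Number of spectrogram frames produced per second of audio."""
    return 1000.0 / frame_shift_ms


# Illustrative sample rates only (assumed, not taken from TTSCube).
for rate in (16000, 24000):
    print(rate, "Hz ->", frame_shift_samples(rate), "samples per frame shift")
print(frames_per_second(), "frames per second")
```

At 80 frames per second, each mel frame stands in for a few hundred waveform samples at these rates.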
The encoder model is still converging, so the examples are currently of low quality. We will update the files as soon as we have a stable encoder model.
TTS-Cube is based on concepts described in Tacotron (1 and 2), Char2Wav and WaveRNN, but its architecture does not stick to the exact recipes: