Is there a way to configure it so that, rather than differentiating voices by speaker, it creates a composite voice trained on a few different speakers?
I've spent the last week or so having a ton of fun with this project, and I've noticed that if you continuously train the encoder on one speaker, any future attempt at voice cloning tends to sound closer to the voice it was trained on.
I'm a brainless monkey though and have no idea what I'm doing, so I'm wondering if anyone else here has attempted this.
Answer (blue-fish):
The encoder is deterministic and doesn't learn anything when multiple samples are loaded. Set the random seed to a fixed number and your results will be consistent.
You can combine embeddings (256-element array) to make a composite voice. A similar idea is explored in the SV2TTS paper, for audio samples see "fictitious speakers" section of: https://google.github.io/tacotron/publications/speaker_adaptation/
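A minimal sketch of how combining embeddings might look, assuming each embedding is a 256-element unit-norm NumPy array as produced by this project's encoder (the `combine_embeddings` helper and the weights are hypothetical, not part of the repo):

```python
import numpy as np

def combine_embeddings(embeddings, weights=None):
    """Blend several speaker embeddings into one composite voice.

    Takes an optional per-speaker weighting, averages the embeddings,
    and re-normalizes to unit length so the result has the same scale
    the synthesizer expects from a single-speaker embedding.
    """
    embeddings = np.asarray(embeddings, dtype=np.float32)  # shape (n, 256)
    composite = np.average(embeddings, axis=0, weights=weights)
    return composite / np.linalg.norm(composite)

# Example usage (speaker embeddings would come from the encoder, e.g.
# embed_a = encoder.embed_utterance(wav_a)):
# composite = combine_embeddings([embed_a, embed_b], weights=[0.7, 0.3])
# Then pass `composite` to the synthesizer in place of a single speaker's
# embedding to get the blended, fictitious voice.
```

Weighted averaging lets you bias the composite toward one speaker; equal weights give a straight mean of the voices.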