Since inference needs an input sample, how does the model infer during the evaluation step while training? Since it's a multi-speaker dataset, it must be generating according to the speaker that's being tested. How does that happen? Or does it not happen at all?