In the original CSM inference code, the decoder is an autoregressive transformer that predicts codebooks 1 through 31. In this repo, however, as far as I understand, the c0 codebook is used to predict the remaining codebooks in parallel through the audio heads rather than autoregressively. Am I wrong? Could you clarify?
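To make the distinction concrete, here is a toy sketch of the two decoding schemes being contrasted. It is not code from either repository: `toy_predict`, `VOCAB`, and both `decode_*` functions are hypothetical stand-ins, and the "model" is just a deterministic formula in place of a real network forward pass.

```python
NUM_CODEBOOKS = 32  # c0 .. c31, matching the codebook range in the question
VOCAB = 8           # toy codebook size (assumption, for illustration only)


def toy_predict(context, codebook_idx):
    """Hypothetical stand-in for 'run the network and take the argmax'.

    Returns a code that depends on everything in `context`.
    """
    return (31 * sum(context) + 7 * codebook_idx + len(context)) % VOCAB


def decode_autoregressive(c0):
    """Each codebook i is predicted from c0 .. c_{i-1} (sequential)."""
    codes = [c0]
    for i in range(1, NUM_CODEBOOKS):
        codes.append(toy_predict(codes, i))  # context grows at every step
    return codes


def decode_parallel(c0):
    """Each head i sees only the shared context [c0] (no dependence on c1 .. c_{i-1})."""
    return [c0] + [toy_predict([c0], i) for i in range(1, NUM_CODEBOOKS)]


ar_codes = decode_autoregressive(3)
par_codes = decode_parallel(3)
```

In this toy setup both schemes agree on c1, because both see only c0 at that point. From c2 onward the autoregressive path conditions on codes that the parallel heads never see, which is the difference the question is about.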