tensor([18, 37, 20, 87]) tensor([[ 0, 81, 157, 102, 61, 16, 102, 68, 16, 70, 16, 62, 156, 86,
61, 62, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0],
[ 0, 50, 156, 24, 3, 16, 65, 157, 138, 62, 16, 102, 68, 16,
92, 156, 31, 102, 112, 16, 157, 76, 56, 16, 102, 56, 16, 81,
102, 61, 16, 58, 54, 156, 24, 6, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0],
[ 0, 64, 156, 86, 123, 51, 16, 61, 55, 156, 63, 81, 16, 138,
64, 16, 52, 63, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0],
[ 0, 157, 25, 16, 65, 156, 138, 56, 46, 83, 123, 16, 65, 157,
25, 16, 81, 102, 61, 16, 102, 68, 16, 50, 156, 72, 58, 42,
56, 102, 112, 4, 16, 53, 54, 156, 102, 123, 54, 51, 3, 16,
81, 102, 61, 16, 61, 156, 72, 55, 58, 42, 54, 16, 102, 68,
16, 53, 54, 156, 102, 123, 16, 44, 83, 53, 156, 138, 68, 16,
102, 62, 16, 102, 68, 16, 157, 138, 56, 55, 156, 72, 61, 53,
62, 4, 0]])
The first 3 outputs are highly garbled while the last is totally clear.
The garbling also seems to get smoothly worse as the number of masked (padding) tokens grows, so something is probably being computed without regard for the mask.
This happens on both CPU and CUDA.
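To make the masking hypothesis concrete, here is a toy sketch in plain Python (no PyTorch, and not the model's actual code). It takes the four lengths printed above and assumes, purely for illustration, a softmax over uniform scores across the padded width of 87. The helpers `softmax` and `length_to_mask` are hypothetical names. Without a mask, the share of weight landing on real tokens shrinks as padding grows, which would match the pattern of shorter sequences sounding worse. Whether a softmax is actually the unmasked operation here is not established.

```python
import math

def softmax(scores, mask=None):
    """Softmax over a list of scores; positions where mask is False get zero weight."""
    if mask is None:
        mask = [True] * len(scores)
    kept = [s for s, m in zip(scores, mask) if m]
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]

def length_to_mask(length, max_len):
    """True for real tokens, False for padding positions."""
    return [i < length for i in range(max_len)]

lengths = [18, 37, 20, 87]  # the lengths from the printed tensor
max_len = max(lengths)      # padded width of the batch

real_mass_unmasked = []
real_mass_masked = []
for n in lengths:
    scores = [0.0] * max_len  # assumption: uniform scores, for illustration only
    mask = length_to_mask(n, max_len)
    # Weight assigned to the first n (real) positions, without and with the mask.
    real_mass_unmasked.append(sum(softmax(scores)[:n]))
    real_mass_masked.append(sum(softmax(scores, mask)[:n]))

for n, u, m in zip(lengths, real_mass_unmasked, real_mass_masked):
    print(f"length={n:2d} padding={max_len - n:2d} unmasked={u:.3f} masked={m:.3f}")
```

If the real bug has this shape, comparing each padded sample against the same input run alone with batch size 1 should show the difference disappearing when there is no padding.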
Sample audio generated by main.py: samples.zip
It was generated with the input ids shown above.