Generations from masked input sequences are garbled

The garbling appears to get smoothly worse as the number of masked (padding) tokens in a sequence increases. Presumably something is being computed without regard for the mask.
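To illustrate the suspected failure mode, here is a toy sketch and not code from this repo. The `attend` and `run` helpers, the identity projections, and the all-ones pad embedding are assumptions made purely for demonstration. It shows how self-attention that ignores the padding mask drifts further from the unpadded result as more padding is appended, while the masked version stays put.

```python
import torch

torch.manual_seed(0)

def attend(x, mask=None):
    # Toy single-head self-attention with identity projections (hypothetical).
    scores = x @ x.transpose(-1, -2) / x.shape[-1] ** 0.5
    if mask is not None:
        # Block attention to padded key positions.
        scores = scores.masked_fill(~mask[:, None, :], float("-inf"))
    return torch.softmax(scores, dim=-1) @ x

real_len, dim = 5, 8
x = torch.randn(1, real_len, dim)  # stand-in for real token embeddings

def run(pad, use_mask):
    # Assumed nonzero pad embedding, as a learned embedding table would produce.
    pad_emb = torch.ones(1, pad, dim)
    padded = torch.cat([x, pad_emb], dim=1)
    mask = torch.arange(real_len + pad)[None, :] < real_len  # True = real token
    out = attend(padded, mask if use_mask else None)
    return out[:, :real_len]  # compare only the real positions

reference = run(0, use_mask=False)
results = []
for pad in (0, 10, 50):
    err_unmasked = (run(pad, False) - reference).abs().max().item()
    err_masked = (run(pad, True) - reference).abs().max().item()
    results.append((pad, err_unmasked, err_masked))
    print(f"pad={pad:3d}  unmasked err={err_unmasked:.4f}  masked err={err_masked:.2e}")
```

If the real model behaves like the unmasked column, that would be consistent with the garbling scaling with the amount of padding. The same comparison (batched and padded output versus each sequence run alone) could be applied layer by layer to locate where the mask is dropped.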

Happens on both CPU and CUDA.

Sample audio generated by `main.py` from the following texts:

```
text = [
    "This is a test!",
    "Hey, what is going on in this play?",
    "Very smooth of you.",
    "I wonder why this is happening. Clearly, this sample is clear because it is unmasked.",
]
```

[samples.zip](https://github.com/user-attachments/files/20498394/samples.zip)

With the following input lengths and padded input ids:

```
tensor([18, 37, 20, 87]) tensor([[  0,  81, 157, 102,  61,  16, 102,  68,  16,  70,  16,  62, 156,  86,
          61,  62,   5,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0],
        [  0,  50, 156,  24,   3,  16,  65, 157, 138,  62,  16, 102,  68,  16,
          92, 156,  31, 102, 112,  16, 157,  76,  56,  16, 102,  56,  16,  81,
         102,  61,  16,  58,  54, 156,  24,   6,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0],
        [  0,  64, 156,  86, 123,  51,  16,  61,  55, 156,  63,  81,  16, 138,
          64,  16,  52,  63,   4,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0],
        [  0, 157,  25,  16,  65, 156, 138,  56,  46,  83, 123,  16,  65, 157,
          25,  16,  81, 102,  61,  16, 102,  68,  16,  50, 156,  72,  58,  42,
          56, 102, 112,   4,  16,  53,  54, 156, 102, 123,  54,  51,   3,  16,
          81, 102,  61,  16,  61, 156,  72,  55,  58,  42,  54,  16, 102,  68,
          16,  53,  54, 156, 102, 123,  16,  44,  83,  53, 156, 138,  68,  16,
         102,  62,  16, 102,  68,  16, 157, 138,  56,  55, 156,  72,  61,  53,
          62,   4,   0]])
```
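For reference, a minimal sketch of the padding mask these lengths imply, using only the lengths printed above (the variable names are mine, not from the repo). Note that id 0 also appears at the start of every row and at the end of the unpadded row, so deriving the mask from `input_ids != 0` would presumably hide those positions as well. Building it from the lengths avoids that.

```python
import torch

# Lengths as printed in the dump above.
lengths = torch.tensor([18, 37, 20, 87])
max_len = int(lengths.max())

# (batch, max_len) boolean mask, True = real token, False = padding.
mask = torch.arange(max_len)[None, :] < lengths[:, None]

# Number of padded positions per sequence.
pad_counts = (~mask).sum(dim=1)
print(pad_counts.tolist())  # [69, 50, 67, 0]
```

Only the last sequence has zero padded positions, which matches it being the only clean output.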

The first 3 outputs, which are padded, are highly garbled, while the last, which has no padding, is completely clear.
