Output structure for frame-level predictions

The output structure of waveform-to-label models seems to be geared towards interval-level predictions, where one set of labels covers a whole clip or segment, as in music tagging or instrument recognition. It would be nice to have a more straightforward way to output frame-level predictions, with one prediction per time frame, as needed for frame-wise music transcription.
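
To make the request concrete, here is a minimal sketch of what a frame-level output could look like next to the clip-level one. Everything in it is an assumption for illustration: `FramePredictions`, `frame_times`, and `pool_to_clip` are hypothetical names and not part of any existing API, and the fixed hop size and max pooling are just one possible choice.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FramePredictions:
    """Hypothetical container for frame-level output (not an existing API)."""
    times: List[float]         # start time of each frame, in seconds
    scores: List[List[float]]  # scores[t][c]: score of class c at frame t


def frame_times(num_frames: int, hop_length: int, sample_rate: int) -> List[float]:
    """Start time in seconds of each frame, assuming a fixed hop size."""
    return [t * hop_length / sample_rate for t in range(num_frames)]


def pool_to_clip(frames: FramePredictions) -> List[float]:
    """Collapse frame-level scores into one clip-level vector by max pooling."""
    num_classes = len(frames.scores[0])
    return [max(row[c] for row in frames.scores) for c in range(num_classes)]


# Toy data: 4 frames, 3 classes (e.g. three pitches),
# hop of 512 samples at a 16 kHz sample rate (both values assumed).
scores = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.7, 0.1],
    [0.0, 0.6, 0.3],
]
frames = FramePredictions(times=frame_times(4, 512, 16000), scores=scores)

# Frame-level output keeps the time axis; clip-level output discards it.
clip = pool_to_clip(frames)
```

With a structure like this, the clip-level output can always be derived from the frame-level one, but not the other way around, which is why exposing the frame-level form directly would help transcription use cases.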