Extract audio representations for future use #21
Comments
May I ask if you wish to pretrain SSAST yourself or use our checkpoints?
I want to use the provided checkpoints. I don't want to fine-tune or pre-train.
Got it. Just a reminder that the self-supervised checkpoints are not comparable to fine-tuned checkpoints for almost all downstream tasks. I guess what you need to do is return before the mean pooling (or taking the [CLS] token): ssast/src/models/ast_models.py Line 262 in a1a3eec
or ssast/src/models/ast_models.py Line 284 in a1a3eec
But as I said, if you take the representation for your downstream task without any fine-tuning, it won't work very well. You can also consider fine-tuned models that fit your application, e.g., for speech tasks, Whisper; for audio tasks, Audio-MAE or the audio branch of the fine-tuned https://github.com/YuanGongND/cav-mae. -Yuan
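For reference, here is a minimal sketch of grabbing the pre-pooling token sequence with a forward hook, so the model code itself does not need to be edited. The constructor arguments, checkpoint path, task string, and the `model.v.norm` attribute are assumptions based on the repo and should be checked against `ast_models.py` at a1a3eec.

```python
import torch
from models.ast_models import ASTModel  # adjust the import path to your setup

# Constructor arguments, the checkpoint path, and the task string below are
# illustrative assumptions -- verify them against ast_models.py at a1a3eec.
model = ASTModel(label_dim=35, fshape=16, tshape=16, fstride=16, tstride=16,
                 input_fdim=128, input_tdim=128, model_size='base',
                 pretrain_stage=False,
                 load_pretrained_mdl_path='./SSAST-Base-Patch-400.pth')
model.eval()

# Capture the token sequence right before the mean pooling / [CLS] selection
# by hooking the final LayerNorm of the underlying transformer (the attribute
# name `model.v.norm` is assumed from the timm-based implementation).
captured = {}
def save_tokens(module, inputs, output):
    captured['tokens'] = output.detach()   # (batch, prefix_tokens + num_patches, embed_dim)

handle = model.v.norm.register_forward_hook(save_tokens)

fbank = torch.randn(1, 128, 128)           # (batch, time_frames, mel_bins), dummy input
with torch.no_grad():
    _ = model(fbank, task='ft_avgtok')     # discard the head output, keep the hooked tokens
tokens = captured['tokens'][:, 2:, :]      # drop the two prefix tokens, keep patch tokens
handle.remove()
```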
For spectrogram generation, we used the code at Line 126 in a1a3eec
But there are other ways, e.g., librosa. Note: these packages generate different outputs, so you have to stick to one.
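A sketch of the torchaudio (Kaldi-compatible) fbank path along the lines of the referenced dataloader code; the exact parameter values may differ from the repo, so copy them from the source rather than from this illustration:

```python
import torchaudio

# Kaldi-style log-mel filterbank features via torchaudio; parameter values are
# illustrative and should be taken from the repo's dataloader for real use.
waveform, sr = torchaudio.load('sample.wav')   # hypothetical file
waveform = waveform - waveform.mean()
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
    window_type='hanning', num_mel_bins=128, dither=0.0, frame_shift=10)
# fbank: (num_frames, 128). A librosa mel spectrogram would not match these
# values exactly, which is why mixing the two packages should be avoided.
```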
Thank you so much for this information. I'll fine-tune the "head only" on top of the extracted representations; I want to get an idea of the quality of the extracted representations. Just for comparison, is there any checkpoint that is comparable to the first one in this list: https://github.com/YuanGongND/ast#pretrained-models? If I understood correctly, they are pre-trained (not fine-tuned) on AudioSet, right?
This is usually referred to as "linear probing", which is a common way to evaluate a model's representations, but in most cases it is (a lot) worse than end-to-end fine-tuning (all parameters trainable). Specifically for SSAST, all results shown in the paper are from end-to-end fine-tuning.
No, AST is fine-tuned with supervision on AudioSet (it has seen labels during training), while all checkpoints in this repo are self-supervisedly pretrained (they haven't seen labels). This is a significant difference. I would expect a much better linear probing result from AST. -Yuan
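For context, a minimal linear-probing sketch in PyTorch: the backbone stays frozen and only a linear head is trained on the extracted embeddings. The layer sizes are illustrative, and the embeddings are assumed to have been extracted already (e.g., as in the hook sketch above).

```python
import torch
import torch.nn as nn

# Linear probing: train only a linear classifier on frozen features.
embed_dim, num_classes = 768, 35            # illustrative values
probe = nn.Linear(embed_dim, num_classes)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def probe_step(features, labels):
    """features: (batch, embed_dim) frozen embeddings; labels: (batch,) ints."""
    optimizer.zero_grad()
    loss = criterion(probe(features), labels)
    loss.backward()                         # gradients only reach the linear head
    optimizer.step()
    return loss.item()
```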
That's the case indeed. I want to evaluate audio representations using probing (mostly linear). So, from what I understand, there is no way to do it with AST, only with SSAST? Just to set the context: this is a fair evaluation if you want to compare against this kind of model, right?
Sorry I didn't make it clear. I actually meant that AST would be better than SSAST for linear probing. Please check my previous reply.
Another major difference between wav2vec and AST/SSAST is the task: wav2vec focuses on speech and should be better for speech tasks, while AST/SSAST is stronger on general audio event recognition. Please see Table 5 of the paper. -Yuan
Sorry for the misunderstanding, but I actually meant that they are not comparable (not that we cannot use AST for linear probing). Thanks for all the clarification. Sure, I know that they are trained with different pre-training objectives and different data collections (AudioSet vs. speech-related datasets). What I meant in my previous reply was that there is no way to do a "fair" comparison between AST and SSAST, i.e., there is no pre-trained AST without fine-tuning, am I right?
Yes, that is what I meant. AST is pretrained on ImageNet, a vision dataset; it sounds weird but actually works quite well. Closing the gap between self-supervised and supervised models is a goal of the research community, but I think we are not there yet.
@MorenoLaQuatra and @YuanGongND Thanks for the discussion. Regards,
@YuanGongND Mr. Gong, I encountered a problem as well when extracting audio representations. My model is
Hi there, your input [24, 128, 128] means batch size 24, an input sequence length of 128 time frames, and 128 features (mel fbanks) per frame. I.e., your input spectrogram dim is 128 x 128. The output of the model is also a sequence, but of flattened patches (in your case 16 x 16), so 128 x 128 / (16 x 16) = 64 is the actual patch sequence length. There are two prefix tokens ([CLS]), so in total 66. Therefore the input length is the time-frame length and the output length is the flattened-patch length; they are totally different. To make them similar, you would need a different patch size. -Yuan
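A quick check of the sequence-length arithmetic above (16 x 16 patches over a 128-frame x 128-bin spectrogram, plus two prefix tokens):

```python
# Token count for a 128 x 128 spectrogram split into 16 x 16 patches.
input_tdim, input_fdim = 128, 128
tshape, fshape = 16, 16
num_patches = (input_tdim // tshape) * (input_fdim // fshape)  # 8 * 8 = 64
output_len = num_patches + 2                                   # + 2 prefix tokens = 66
print(num_patches, output_len)
```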
Hi,
First of all, thank you all for the impressive work and for making the code and models available to the community. I would like to use the SSAST models to extract audio embeddings. Specifically, I'm interested in writing a script that accomplishes the following:
In previous issues, you pointed out some mean/variance normalization and a way to extract average-pooled tokens (e.g., commenting out the mlp head from the ASTModel forward). Do you have any suggestions about how to do points (1) and (2)?
Thanks
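Tying the answers in this thread together, here is a minimal sketch of the mean/variance normalization and the average pooling of the token sequence. The normalization constants are the AudioSet statistics used in the AST/SSAST recipes, an assumption here; recompute them for your own data and check the exact formula in the repo's dataloader.

```python
import torch

# AudioSet normalization statistics from the AST/SSAST recipes (assumed;
# compute your own dataset's mean/std for other data).
NORM_MEAN, NORM_STD = -4.2677393, 4.5689974

def normalize(fbank):
    """Mean/variance normalization as done in the repo's dataloader
    (check dataloader.py for the exact form)."""
    return (fbank - NORM_MEAN) / (NORM_STD * 2)

def average_pool(tokens, num_prefix_tokens=2):
    """tokens: (batch, prefix + num_patches, embed_dim), e.g., captured with
    the forward hook shown earlier; drop the prefix tokens and average the
    rest into one clip-level embedding."""
    return tokens[:, num_prefix_tokens:, :].mean(dim=1)
```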