
how to get the video level "weak" label #3

Open
xiaoyiming opened this issue Nov 13, 2018 · 5 comments

Comments

@xiaoyiming

Dear Mr. Gao,
Thank you so much for the great work. However, I ran into some problems when implementing this code.
As described in your paper, "For the visual frames, we use an ImageNet pre-trained ResNet-152 network [34] to make object category predictions, and we max-pool over predictions of all frames to obtain a video-level prediction. The top labels (with class probability larger than a threshold = 0.3) are used as weak 'labels' for the unlabeled video."
However, when I use the pre-trained ResNet-152 network, I only get one category prediction larger than the threshold. How can I obtain multiple labels from the pre-trained ResNet-152 network?
Should I train an object detection network, a multi-class multi-label network, or use some other solution? Thank you for your assistance.
Best regards!

@rhgao
Owner

rhgao commented Nov 13, 2018

Hi,

We didn't use all 1000 ImageNet classes, but ~20 selected audio-related classes. Then we normalize the class probabilities over these classes, so you can get multiple labels with class probability larger than the threshold. Also, 0.3 is just an empirical choice.

Thanks for your interest!
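
For readers landing here later, a minimal sketch of the procedure described above, assuming a standard torchvision ResNet-152 and a placeholder list of audio-related class indices (the actual ~20 classes used in the paper are not listed in this thread, so the indices below are hypothetical):

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Hypothetical ImageNet class indices for audio-related categories; the real
# list of ~20 classes used by the authors is not given in this thread.
AUDIO_RELATED_CLASSES = [402, 486, 546, 579, 889]
THRESHOLD = 0.3  # empirical, per the comment above

model = models.resnet152(pretrained=True).eval()

def video_weak_labels(frames):
    """frames: tensor of shape (num_frames, 3, 224, 224), ImageNet-normalized."""
    with torch.no_grad():
        logits = model(frames)                    # (num_frames, 1000)
        probs = F.softmax(logits, dim=1)          # per-frame class probabilities
    video_probs, _ = probs.max(dim=0)             # max-pool over frames -> (1000,)
    selected = video_probs[AUDIO_RELATED_CLASSES]  # keep audio-related classes only
    selected = selected / selected.sum()           # renormalize over the subset
    return [cls for cls, p in zip(AUDIO_RELATED_CLASSES, selected.tolist())
            if p > THRESHOLD]
```

Because the probabilities are renormalized over the small subset of classes, several of them can exceed 0.3 for a single video, which yields the multiple weak labels asked about above.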

@rhgao rhgao closed this as completed Nov 13, 2018
@rhgao rhgao reopened this Nov 13, 2018
@xiaoyiming
Author

@rhgao
Thanks for your reply! I will try it.

@xiaoyiming
Author

Dear Mr. Gao,
Thank you so much for the great work. However, I met some further problems when implementing this code.
As described in your paper, "we collect a maximum of 3,000 basis vectors for each object category" and "In other words, we concatenate the basis vectors learnt for each detected object to construct the basis dictionary W(q). Next, in the NMF algorithm, we hold W(q) fixed, and only estimate the activation H(q) with multiplicative update rules."
However, what is the shape of the selected W(q)(j)? Is it also M x K (K = 25)? And how do you select K basis vectors from the 3,000 stored basis vectors?

@rhgao
Owner

rhgao commented Dec 3, 2018

Hi, we use all the collected basis vectors to initialize W, namely M x K with M = 3000 and K = 25. 3,000 is just a hyperparameter, and a larger number of basis vectors could potentially lead to better results.
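
For reference, a minimal NumPy sketch of this test-time step, under my own assumptions rather than the released code: the dictionary W(q) is formed by concatenating the stored per-object bases and held fixed, and only the activations H(q) are estimated with multiplicative updates (here the standard KL-divergence update; the exact cost function may differ from the paper's):

```python
import numpy as np

def estimate_activations(V, W, n_iter=200, eps=1e-10):
    """V: magnitude spectrogram (F x T); W: fixed basis dictionary (F x K).
    Returns H (K x T) such that V is approximated by W @ H."""
    K, T = W.shape[1], V.shape[1]
    H = np.abs(np.random.rand(K, T))          # non-negative random init
    for _ in range(n_iter):
        WH = W @ H + eps
        # Multiplicative update for H with W held fixed (KL divergence).
        H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)
    return H

# Hypothetical usage: concatenate per-object bases to form W(q), then estimate H(q).
# W_q = np.hstack([bases_for_object_j for j in detected_objects])
# H_q = estimate_activations(spectrogram, W_q)
```

Holding W fixed means the update for W is simply skipped, so the multiplicative rule only rescales H and non-negativity is preserved automatically.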

@xiaoyiming
Author

Thanks. Could you please share your train loss/mAP and val loss/mAP? My train loss is about 0.0001 and train mAP is about 0.72; my val loss is about 0.1 and val mAP is 0.65 after 300 iterations, with the same batch size and validation size as yours. Is that normal?
