Hi @thomasverelst, congrats, nice work! I have two questions out of curiosity:
Forward pass: Why did you choose to sample from the Bernoulli distribution instead of the Gumbel-Softmax? To my knowledge, sampling from the Bernoulli distribution introduces a bias into the gradient estimate, which could make optimization trickier. I understand that you would not be able to use sparse convolutions during training, but I wonder if there is another reason.
Have you tried annealing the temperature parameter to less than 1?
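(For context, here is a minimal sketch of the two sampling options the question contrasts, for a binary execution mask; the names and shapes are illustrative, not from this repo:)

```python
import torch

logits = torch.randn(2, 1, 8, 8, requires_grad=True)  # hypothetical mask logits

# Option A: hard Bernoulli sample with a straight-through gradient.
# The forward pass is exactly 0/1, but the backward pass pretends the
# sample was the probability itself, which biases the gradient.
probs = torch.sigmoid(logits)
hard = torch.bernoulli(probs.detach())
mask_st = hard + probs - probs.detach()  # straight-through estimator

# Option B: soft Gumbel-Sigmoid relaxation with temperature tau.
# Fully differentiable, but the mask is never exactly 0/1, so sparse
# convolutions cannot be exploited during training.
tau = 1.0
u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
noise = torch.log(u) - torch.log1p(-u)  # Logistic(0, 1) noise
mask_soft = torch.sigmoid((logits + noise) / tau)
```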
Hi!
I think you mean that it uses the straight-through (hard) version of the Gumbel-Softmax trick. I did not ablate this thoroughly, but my initial results indicated slightly better performance for the hard straight-through version. The straight-through estimator is indeed biased, but it means the network's weights are optimized directly for the sparse convolutions used at inference. I agree, though, that the soft Gumbel-Softmax with the temperature annealed towards 0 might improve training stability.
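A minimal sketch of what a straight-through (hard) binary Gumbel-Softmax and a temperature-annealing schedule could look like; this is my own illustration under those assumptions, not the repo's actual code:

```python
import torch

def gumbel_sigmoid(logits, tau=1.0, hard=True):
    # Binary Gumbel-Softmax: Logistic(0, 1) noise is the difference
    # of two independent Gumbel(0, 1) samples, so sample it directly.
    u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
    noise = torch.log(u) - torch.log1p(-u)
    soft = torch.sigmoid((logits + noise) / tau)
    if not hard:
        return soft  # soft relaxation: differentiable, never exactly 0/1
    hard_mask = (soft > 0.5).float()
    # Straight-through: hard 0/1 mask forward, soft gradient backward (biased)
    return hard_mask + soft - soft.detach()

def tau_schedule(step, total_steps, tau_start=1.0, tau_end=0.1):
    # Illustrative exponential decay of the temperature towards a small floor
    frac = min(step / total_steps, 1.0)
    return tau_start * (tau_end / tau_start) ** frac
```

With `hard=True`, the forward pass sees an exact 0/1 mask (so sparse execution is possible), while gradients still flow through the soft sigmoid.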
The best solution, though, might be to weight spatial positions by their probabilities (i.e. soft attention), e.g. by using the soft Gumbel-Softmax and multiplying executed positions (where `prob_exec > 0.5`) by `(prob_exec - 0.5) * 2`, both at training and inference time.
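A rough sketch of that weighting idea, assuming `prob_exec` holds per-position execution probabilities (all names here are illustrative):

```python
import torch

def soft_weighted_mask(prob_exec):
    # Positions with prob_exec > 0.5 are executed; the rest are skipped.
    executed = (prob_exec > 0.5).float()
    # Rescale probabilities in (0.5, 1] to soft weights in (0, 1].
    weight = (prob_exec - 0.5) * 2.0
    return executed * weight  # exactly 0 where skipped

prob_exec = torch.sigmoid(torch.randn(2, 1, 8, 8))   # hypothetical mask probabilities
features = torch.randn(2, 16, 8, 8)                  # hypothetical layer output
out = features * soft_weighted_mask(prob_exec)       # dense stand-in for sparse execution
```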
As I have more compute available nowadays, I might explore this over the summer while writing my PhD thesis.