Commit c0b7b68 ("fixed error")

1 parent: 5d46ebd

6 files changed: +3 -3 lines changed

README.md (+3 -3)
@@ -102,7 +102,7 @@ I have chosen to use the **101 layered Residual Network trained on the ImageNet

 ![ResNet Encoder](./img/encoder.png)

-These models progressively create smaller and smaller representations of the original image, and each subsequent representation is more "learned", with a greater number of channels. The final encoding produced by our ResNet-101 encoder has a size of 14x14 with 4096 channels, i.e., a `4096, 14, 14` size tensor.
+These models progressively create smaller and smaller representations of the original image, and each subsequent representation is more "learned", with a greater number of channels. The final encoding produced by our ResNet-101 encoder has a size of 14x14 with 2048 channels, i.e., a `2048, 14, 14` size tensor.

 I encourage you to experiment with other pre-trained architectures. The paper uses a VGGnet, also pretrained on ImageNet, but without fine-tuning. Either way, modifications are necessary. Since the last layer or two of these models are linear layers coupled with softmax activation for classification, we strip them away.
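The corrected `2048, 14, 14` shape in the hunk above is easy to verify empirically. The following is a minimal sketch, not code from the repository: it assumes torchvision's ImageNet-pretrained ResNet-101, strips the final average-pool and classifier layers as the README describes, and assumes an adaptive pool to fix the 14x14 spatial size; all variable names are illustrative.

```python
import torch
import torch.nn as nn
import torchvision

# Sketch: load ResNet-101 pretrained on ImageNet (newer torchvision versions
# prefer the `weights=` argument over `pretrained=True`).
resnet = torchvision.models.resnet101(pretrained=True)

# Strip the last two modules: the average pool and the linear classifier.
backbone = nn.Sequential(*list(resnet.children())[:-2])

# Assumption: an adaptive pool pins the spatial size at 14x14 regardless of
# the input resolution.
pool = nn.AdaptiveAvgPool2d((14, 14))

images = torch.randn(2, 3, 256, 256)   # dummy batch: N, 3, H, W
features = pool(backbone(images))      # -> (2, 2048, 14, 14)

# Flatten to (N, 14 * 14, 2048), the shape the Attention and Decoder sections expect.
flat = features.permute(0, 2, 3, 1).reshape(features.size(0), 14 * 14, 2048)
print(features.shape, flat.shape)      # (2, 2048, 14, 14), (2, 196, 2048)
```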

@@ -273,13 +273,13 @@ See `Attention` in [`models.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-

 The Attention network is simple – it's composed of only linear layers and a couple of activations.

-Separate linear layers **transform both the encoded image (flattened to `N, 14 * 14, 4096`) and the hidden state (output) from the Decoder to the same dimension**, viz. the Attention size. They are then added and ReLU activated. A third linear layer **transforms this result to a dimension of 1**, whereupon we **apply the softmax to generate the weights** `alpha`.
+Separate linear layers **transform both the encoded image (flattened to `N, 14 * 14, 2048`) and the hidden state (output) from the Decoder to the same dimension**, viz. the Attention size. They are then added and ReLU activated. A third linear layer **transforms this result to a dimension of 1**, whereupon we **apply the softmax to generate the weights** `alpha`.

 ### Decoder

 See `DecoderWithAttention` in [`models.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning/blob/master/models.py).

-The output of the Encoder is received here and flattened to dimensions `N, 14 * 14, 4096`. This is just convenient and prevents having to reshape the tensor multiple times.
+The output of the Encoder is received here and flattened to dimensions `N, 14 * 14, 2048`. This is just convenient and prevents having to reshape the tensor multiple times.

 We **initialize the hidden and cell state of the LSTM** using the encoded image with the `init_hidden_state()` method, which uses two separate linear layers.
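The attention computation described in the changed lines above maps directly onto a few layers. Here is a minimal sketch of that mechanism; the class name, the 512-dimensional decoder and attention sizes, and all variable names are my assumptions, not necessarily what `models.py` uses.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Sketch of the attention described above; names and sizes are illustrative."""

    def __init__(self, encoder_dim=2048, decoder_dim=512, attention_dim=512):
        super().__init__()
        self.encoder_att = nn.Linear(encoder_dim, attention_dim)  # transforms the encoded image
        self.decoder_att = nn.Linear(decoder_dim, attention_dim)  # transforms the decoder hidden state
        self.full_att = nn.Linear(attention_dim, 1)               # collapses to one score per pixel

    def forward(self, encoder_out, decoder_hidden):
        # encoder_out: (N, 14 * 14, encoder_dim); decoder_hidden: (N, decoder_dim)
        att1 = self.encoder_att(encoder_out)                   # (N, 196, attention_dim)
        att2 = self.decoder_att(decoder_hidden)                # (N, attention_dim)
        # Add, ReLU, reduce to one score per pixel, then softmax over pixels.
        scores = self.full_att(torch.relu(att1 + att2.unsqueeze(1))).squeeze(2)  # (N, 196)
        alpha = torch.softmax(scores, dim=1)                   # the weights `alpha`
        # The alpha-weighted sum of pixel encodings is how the weights are
        # typically applied to produce the attended context vector.
        context = (encoder_out * alpha.unsqueeze(2)).sum(dim=1)  # (N, encoder_dim)
        return context, alpha

attention = SoftAttention()
context, alpha = attention(torch.randn(4, 196, 2048), torch.randn(4, 512))
print(context.shape, alpha.shape)  # torch.Size([4, 2048]) torch.Size([4, 196])
```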
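Likewise, the `init_hidden_state()` step can be sketched in a few lines. Averaging over the 196 pixel encodings is my assumption (a common choice); the text above says only that two separate linear layers produce the initial hidden and cell states.

```python
import torch
import torch.nn as nn

encoder_dim, decoder_dim = 2048, 512          # decoder size is an assumed value
init_h = nn.Linear(encoder_dim, decoder_dim)  # one linear layer for the hidden state
init_c = nn.Linear(encoder_dim, decoder_dim)  # a separate one for the cell state

encoder_out = torch.randn(4, 14 * 14, encoder_dim)  # flattened encoding: N, 196, 2048
mean_encoding = encoder_out.mean(dim=1)             # assumption: average over pixels
h = init_h(mean_encoding)                           # (N, decoder_dim)
c = init_c(mean_encoding)                           # (N, decoder_dim)
print(h.shape, c.shape)
```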

img/att.png (-1.04 KB)
img/decoder_att.png (-843 Bytes)
img/decoder_no_att.png (+406 Bytes)
img/encoder.png (+2.04 KB)
img/model.png (-1.48 KB)
