
Regarding the selection of Transformer decoder #10

Open
yihp opened this issue Aug 13, 2024 · 4 comments

Comments


yihp commented Aug 13, 2024

Hi! Thanks for your contribution. It is an excellent piece of work!

I would like to ask why you chose a randomly initialised Transformer decoder with six layers. Do you have any relevant literature references? I'm very curious about it.

Thank you very much for your time and consideration. I eagerly look forward to your response.

anicolson (Member) commented Aug 13, 2024

Hi @yihp,

We were previously using DistilGPT2 (see https://arxiv.org/abs/2201.09405), which has six layers. We wanted to try a domain-specific vocabulary, i.e., training a BPE tokenizer on the radiology reports, as this was one of the observations of that paper (PubMedBERT seemed to have an advantage over the other similar checkpoints due to its domain-specific vocabulary). This, combined with a randomly initialised decoder the same size as DistilGPT2, performed better than the DistilGPT2 checkpoint. We used the BERT model from Hugging Face (configured to be the same size as DistilGPT2) because you can use token_type_embeddings with it (later we realised we could also easily have used token_type_embeddings with Hugging Face's GPT2 implementation 😄). We didn't show this in the CXRMate paper, as the paper is very overloaded as it is. Maybe that is a shortcoming; it should probably live in an appendix of some paper.
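
For anyone curious, a minimal sketch of that setup (a domain-specific byte-level BPE tokenizer plus a randomly initialised BERT decoder with DistilGPT2-sized hyperparameters). The report file list and vocabulary size below are placeholders, not the exact values from the repository:

```python
from tokenizers import ByteLevelBPETokenizer
from transformers import BertConfig, BertLMHeadModel

# Train a byte-level BPE tokenizer on the radiology reports
# (report_files is a hypothetical list of paths to plain-text report files).
report_files = ["mimic_cxr_reports.txt"]
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=report_files, vocab_size=30_000, special_tokens=["[PAD]", "[BOS]", "[EOS]"])

# Randomly initialised BERT decoder with DistilGPT2-sized hyperparameters:
# 6 layers, 768 hidden size, 12 attention heads, with cross-attention to the image encoder.
config = BertConfig(
    vocab_size=tokenizer.get_vocab_size(),
    hidden_size=768,
    num_hidden_layers=6,
    num_attention_heads=12,
    is_decoder=True,
    add_cross_attention=True,
)
decoder = BertLMHeadModel(config)  # no pre-trained checkpoint is loaded
```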

Now we are using the Llama architecture from Hugging Face, still randomly initialised with the same hyperparameters as DistilGPT2 and the domain-specific vocabulary, to modernise our decoder so that we can use things like 4D attention masks (https://arxiv.org/abs/2406.13181). We are struggling to outperform this setup (a randomly initialised Transformer decoder with DistilGPT2's hyperparameters plus a domain-specific vocabulary). We have tried fine-tuning many different LLMs in many different ways, but have yet to outperform this simple setup.
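
As a rough illustration (not the exact configuration in the repository), the Llama decoder can be sized to match DistilGPT2 like this; the vocabulary size is again a placeholder:

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Randomly initialised Llama decoder with DistilGPT2-sized hyperparameters;
# vocab_size stands in for the domain-specific BPE vocabulary.
config = LlamaConfig(
    vocab_size=30_000,
    hidden_size=768,
    intermediate_size=3072,
    num_hidden_layers=6,
    num_attention_heads=12,
    num_key_value_heads=12,
)
decoder = LlamaForCausalLM(config)  # random weights, no pre-trained checkpoint

# Recent transformers versions accept a custom 4D attention mask
# (shape [batch, 1, query_len, key_len]) passed directly via attention_mask.
```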

I hope that helps.

yihp (Author) commented Aug 21, 2024


Thank you very much for your reply, it is very helpful to me!
I have a question about word embeddings. I asked you before about my language being Chinese, and you suggested that I use a SentencePiece tokenizer, which I did. If I start training from scratch, will that have any impact? My understanding is that training from scratch will also retrain the word embeddings; is that correct?
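
For reference, training a SentencePiece tokenizer on the Chinese reports looks roughly like this (the file name, model prefix, and vocabulary size are placeholders):

```python
import sentencepiece as spm

# Train a SentencePiece tokenizer on the Chinese radiology reports
# (reports_zh.txt is a hypothetical file with one report per line).
spm.SentencePieceTrainer.train(
    input="reports_zh.txt",
    model_prefix="cxr_zh",
    vocab_size=16_000,
    character_coverage=0.9995,  # recommended for languages with large character sets
)

sp = spm.SentencePieceProcessor(model_file="cxr_zh.model")
print(sp.encode("双肺纹理清晰，未见明显实变。", out_type=str))
```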

anicolson (Member) commented

Hi @yihp,

You could train everything from scratch.

Another thing you could try with the already trained model (which I am not sure will work) is to freeze everything except the word embeddings and the language modelling head to adapt them to the Chinese tokenizer. You could then unfreeze everything and do an additional stage of fine-tuning.
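
A minimal sketch of that second option, assuming a standard Hugging Face causal-LM decoder (model and chinese_tokenizer are placeholders for the trained CXRMate decoder and the new SentencePiece tokenizer):

```python
# Resize the embedding matrix to the new Chinese SentencePiece vocabulary.
model.resize_token_embeddings(len(chinese_tokenizer))

# Stage 1: freeze everything except the word embeddings and the language modelling head.
for param in model.parameters():
    param.requires_grad = False
for param in model.get_input_embeddings().parameters():
    param.requires_grad = True
for param in model.get_output_embeddings().parameters():
    param.requires_grad = True

# ... fine-tune until the new embeddings have adapted ...

# Stage 2: unfreeze everything and do an additional stage of fine-tuning.
for param in model.parameters():
    param.requires_grad = True
```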

yihp (Author) commented Aug 21, 2024

OK, thanks. I will try it following your ideas!
