
Regarding the selection of Transformer decoder #10

Open
yihp opened this issue Aug 13, 2024 · 4 comments

Comments


yihp commented Aug 13, 2024

Hi! Thanks for your contribution. It is an excellent piece of work!

I would like to ask why you chose a randomly initialised Transformer decoder with six layers. Do you have any relevant literature references? I'm very curious about it.

Thank you very much for your time and consideration. I eagerly look forward to your response.

anicolson (Member) commented Aug 13, 2024

Hi @yihp,

We were previously using DistilGPT2 (see https://arxiv.org/abs/2201.09405), which has six layers. We wanted to try a domain-specific vocabulary, i.e., training a BPE tokenizer on the radiology reports, as this was one of the observations of that paper (PubMedBERT seemed to have an advantage over the other similar checkpoints due to its domain-specific vocabulary). This, combined with a randomly initialised decoder the same size as DistilGPT2, performed better than the DistilGPT2 checkpoint. We used the BERT model from Hugging Face (configured to be the same size as DistilGPT2) because you can use token_type_embeddings with it (later we realised we could also easily have used token_type_embeddings with Hugging Face's GPT2 implementation 😄). We didn't show this in the CXRMate paper, as the paper is very overloaded as it is. Maybe that is a shortcoming; it should probably live in an appendix of some paper.
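
For anyone curious, a minimal sketch of that setup (a domain-specific byte-level BPE tokenizer plus a randomly initialised BERT decoder with DistilGPT2-sized hyperparameters). The report file list and vocabulary size below are placeholders, not the exact values from the repository:

```python
from tokenizers import ByteLevelBPETokenizer
from transformers import BertConfig, BertLMHeadModel

# Train a byte-level BPE tokenizer on the radiology reports
# (report_files is a hypothetical list of paths to plain-text report files).
report_files = ["mimic_cxr_reports.txt"]
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=report_files, vocab_size=30_000, special_tokens=["[PAD]", "[BOS]", "[EOS]"])

# Randomly initialised BERT decoder with DistilGPT2-sized hyperparameters:
# 6 layers, 768 hidden size, 12 attention heads, with cross-attention to the image encoder.
config = BertConfig(
    vocab_size=tokenizer.get_vocab_size(),
    hidden_size=768,
    num_hidden_layers=6,
    num_attention_heads=12,
    is_decoder=True,
    add_cross_attention=True,
)
decoder = BertLMHeadModel(config)  # no pre-trained checkpoint is loaded
```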

Now we are using the Llama architecture from Hugging Face, still randomly initialised with the same hyperparameters as DistilGPT2 and the domain-specific vocabulary, to modernise our decoder so that we can use things like 4D attention masks (https://arxiv.org/abs/2406.13181). We are struggling to outperform this setup (a randomly initialised Transformer decoder with DistilGPT2's hyperparameters plus a domain-specific vocabulary). We have tried fine-tuning many different LLMs in many different ways, but have yet to outperform this simple setup.
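
As a rough illustration (not the exact configuration in the repository), the Llama decoder can be sized to match DistilGPT2 like this; the vocabulary size is again a placeholder:

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Randomly initialised Llama decoder with DistilGPT2-sized hyperparameters;
# vocab_size stands in for the domain-specific BPE vocabulary.
config = LlamaConfig(
    vocab_size=30_000,
    hidden_size=768,
    intermediate_size=3072,
    num_hidden_layers=6,
    num_attention_heads=12,
    num_key_value_heads=12,
)
decoder = LlamaForCausalLM(config)  # random weights, no pre-trained checkpoint

# Recent transformers versions accept a custom 4D attention mask
# (shape [batch, 1, query_len, key_len]) passed directly via attention_mask.
```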

I hope that helps.

yihp (Author) commented Aug 21, 2024


Thank you very much for your reply, it is very helpful to me!
I have a question about word embeddings. I asked you before about my language being Chinese, and you suggested that I use a SentencePiece tokenizer, which I did. If I start training from scratch, will that have any impact? My understanding is that training from scratch will also retrain the word embeddings; is that correct?
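
For reference, training a SentencePiece tokenizer on the Chinese reports looks roughly like this (the file name, model prefix, and vocabulary size are placeholders):

```python
import sentencepiece as spm

# Train a SentencePiece tokenizer on the Chinese radiology reports
# (reports_zh.txt is a hypothetical file with one report per line).
spm.SentencePieceTrainer.train(
    input="reports_zh.txt",
    model_prefix="cxr_zh",
    vocab_size=16_000,
    character_coverage=0.9995,  # recommended for languages with large character sets
)

sp = spm.SentencePieceProcessor(model_file="cxr_zh.model")
print(sp.encode("双肺纹理清晰，未见明显实变。", out_type=str))
```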

anicolson (Member) commented

Hi @yihp,

You could train everything from scratch.

Another thing you could try with the already trained model (which I am not sure will work) is to freeze everything except the word embeddings and the language modelling head to adapt them to the Chinese tokenizer. You could then unfreeze everything and do an additional stage of fine-tuning.
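
A minimal sketch of that second option, assuming a standard Hugging Face causal-LM decoder (model and chinese_tokenizer are placeholders for the trained CXRMate decoder and the new SentencePiece tokenizer):

```python
# Resize the embedding matrix to the new Chinese SentencePiece vocabulary.
model.resize_token_embeddings(len(chinese_tokenizer))

# Stage 1: freeze everything except the word embeddings and the language modelling head.
for param in model.parameters():
    param.requires_grad = False
for param in model.get_input_embeddings().parameters():
    param.requires_grad = True
for param in model.get_output_embeddings().parameters():
    param.requires_grad = True

# ... fine-tune until the new embeddings have adapted ...

# Stage 2: unfreeze everything and do an additional stage of fine-tuning.
for param in model.parameters():
    param.requires_grad = True
```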

yihp (Author) commented Aug 21, 2024

OK, thanks. I will try it following your ideas!
