Regarding the selection of Transformer decoder #10
Hi! Thanks for your contribution. It is an excellent piece of work!

I would like to ask why you chose a randomly-initialised Transformer decoder with six layers. Do you have any relevant literature references? I'm very curious about it.

Thank you very much for your time and consideration. I eagerly look forward to your response.

Comments
Hi @yihp,

We were previously using DistilGPT2 (see https://arxiv.org/abs/2201.09405), which has six layers. We wanted to try a domain-specific vocabulary, i.e., training a BPE tokenizer on the radiology reports, as this was one of the observations of that paper (PubMedBERT seemed to have an advantage over the other similar checkpoints in that paper due to its domain-specific vocabulary). This, plus a randomly initialised decoder the same size as DistilGPT2, performed better than the DistilGPT2 checkpoint.

We used the BERT model (configured to be the same size as DistilGPT2) from Hugging Face, as you can use token_type_embeddings with it (later we realised we could also easily have used token_type_embeddings with Hugging Face's GPT2 implementation 😄). We didn't show this in the CXRMate paper, as the paper is very overloaded as it is. Maybe that is a shortcoming; it should probably live in an appendix of some paper.

Now we are using the Llama architecture from Hugging Face, still randomly initialised with the same hyperparameters as DistilGPT2 and the domain-specific vocabulary, to modernise our decoder so that we can use things like 4D attention masks (https://arxiv.org/abs/2406.13181).

We are struggling to outperform this setup (a randomly initialised Transformer decoder with DistilGPT2's hyperparameters plus a domain-specific vocabulary). We have tried fine-tuning many different LLMs in many different ways, but have yet to outperform this simple setup. I hope that helps.
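If it is useful, a minimal sketch of that setup with the tokenizers and transformers libraries looks something like the following. This is not our exact training code: the corpus path, vocabulary size, and special tokens are illustrative assumptions.

```python
# Sketch only: domain-specific BPE tokenizer trained on radiology report text,
# plus a randomly initialised Llama-style decoder with DistilGPT2-sized
# hyperparameters (6 layers, 12 heads, hidden size 768).
from tokenizers import ByteLevelBPETokenizer
from transformers import LlamaConfig, LlamaForCausalLM

# 1) Domain-specific vocabulary: byte-level BPE trained on the reports.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["radiology_reports.txt"],  # hypothetical path to the report corpus
    vocab_size=30_000,                # assumed size; tune to your corpus
    special_tokens=["[BOS]", "[EOS]", "[PAD]"],
)

# 2) Randomly initialised decoder with DistilGPT2-like dimensions.
config = LlamaConfig(
    vocab_size=tokenizer.get_vocab_size(),
    hidden_size=768,
    num_hidden_layers=6,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=1024,
)
decoder = LlamaForCausalLM(config)    # random weights; no pretrained checkpoint loaded
```

The key point is that `LlamaForCausalLM(config)` gives you random weights; only the hyperparameters, not the pretrained checkpoint, come from DistilGPT2.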
Thank you very much for your reply; it is very helpful to me!
Hi @yihp,

You could train everything from scratch. Another thing you could try with the already trained model (which I am not sure will work) is to freeze everything except the word embeddings and the language modelling head to adapt them to the Chinese tokenizer. You could then unfreeze everything and do an additional stage of fine-tuning.
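As a rough sketch of that first stage (assuming a Hugging Face causal-LM decoder and an already trained Chinese tokenizer; the helper name below is just illustrative):

```python
# Sketch only: stage 1 adapts just the word embeddings and the LM head
# to the new vocabulary; stage 2 unfreezes everything and fine-tunes further.

def freeze_all_but_embeddings_and_lm_head(model, new_vocab_size):
    # Resize the input embeddings (and tied LM head, if any) to the new tokenizer.
    model.resize_token_embeddings(new_vocab_size)

    # Freeze every parameter first...
    for param in model.parameters():
        param.requires_grad = False

    # ...then unfreeze only the word embeddings and the language-modelling head.
    for param in model.get_input_embeddings().parameters():
        param.requires_grad = True
    if model.get_output_embeddings() is not None:  # untied LM head
        for param in model.get_output_embeddings().parameters():
            param.requires_grad = True
    return model

# Stage 2: once the embeddings have adapted, set requires_grad = True on all
# parameters and run a further round of fine-tuning.
```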
Ok, thanks. I will try it following your ideas!