By saving the activations from just 400 samples passed through Llama 7B, a Hyena operator can be trained that swaps in for an attention layer with only a minimal increase in perplexity.
This minimally trained Hyena operator raises perplexity from 1.55 to 1.58. For comparison, replacing the attention output with a matrix of ones raises perplexity to 10.11, and skipping the attention layer entirely raises it to 1.78.
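
A minimal sketch of what that activation-matching step might look like, assuming a PyTorch setup: `GatedLongConv` is a simplified stand-in for the Hyena operator (the real operator uses implicit long-convolution filters), `collect_activations` hooks a frozen attention block to record input/output pairs, and `distill` fits the replacement to those pairs with an MSE loss. The module names, hyperparameters, and toy teacher below are illustrative, not the actual training code.

```python
# Hypothetical sketch: distill an attention block into a long-convolution operator
# by matching its saved activations. Names and shapes are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedLongConv(nn.Module):
    """Simplified stand-in for a Hyena-style operator: gated depthwise long convolution."""

    def __init__(self, d_model: int, kernel_size: int = 128):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size - 1, groups=d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        # Depthwise causal convolution; trim the right-side padding back to seq length.
        u = self.conv(u.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return self.out_proj(torch.sigmoid(gate) * u)


def collect_activations(teacher: nn.Module, batches):
    """Run the frozen teacher (e.g. one attention block) and save (input, output) pairs."""
    pairs = []

    def hook(_module, inputs, output):
        out = output if torch.is_tensor(output) else output[0]
        pairs.append((inputs[0].detach(), out.detach()))

    handle = teacher.register_forward_hook(hook)
    with torch.no_grad():
        for x in batches:
            teacher(x)
    handle.remove()
    return pairs


def distill(student: nn.Module, pairs, epochs: int = 10, lr: float = 1e-3):
    """Fit the student to reproduce the teacher's outputs on the saved activations."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in pairs:
            loss = F.mse_loss(student(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student


if __name__ == "__main__":
    # Toy teacher standing in for a Llama attention block; in practice the hook would
    # be registered on the real model's attention module and fed real token batches.
    d_model = 64

    class SelfAttn(nn.Module):
        def __init__(self, attn):
            super().__init__()
            self.attn = attn

        def forward(self, x):
            return self.attn(x, x, x, need_weights=False)[0]

    teacher = SelfAttn(nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)).eval()

    batches = [torch.randn(4, 32, d_model) for _ in range(100)]  # ~400 samples total
    pairs = collect_activations(teacher, batches)
    student = distill(GatedLongConv(d_model), pairs)
    # `student` can now be swapped in for the attention module, e.g. block.self_attn = student
```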