Implement .generate (greedy decoding only) #217
Comments
Just a quick update here: @bigximik already confirmed that the stock `GenerationMixin.generate()` works here. The goal was never to reimplement it; we're using the stock implementation. Thanks for checking in.
To clarify, I'm not against this kind of large-scale integration test, but adding them to our current testing workflow would severely hurt our ability to run it. I run the full tests on a daily basis, including the ones marked as "slow"; the run is already 10 minutes long, and anything more will prevent me from debugging efficiently. So if we do want long and resource-hungry integration tests, we need to exclude them from the normal testing scheme (e.g. with a separate "integration" marker, as sketched below) and find an alternate workflow for running them (automated CI?). That would obviously leave gaps in the "normal" testing suite, which could be filled by a small and fast version of the test (we can -- and should -- have both).
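For concreteness, here is a minimal sketch of what such a separate marker could look like, assuming pytest is the test runner; the marker name `integration` and the `--integration` flag are illustrative choices, not the project's actual configuration:

```python
# conftest.py (illustrative) -- register an "integration" marker and skip those
# tests by default, so the regular daily test run stays fast.
import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--integration",
        action="store_true",
        default=False,
        help="Also run large, resource-hungry integration tests.",
    )


def pytest_configure(config):
    config.addinivalue_line(
        "markers", "integration: large-scale integration test, excluded by default"
    )


def pytest_collection_modifyitems(config, items):
    if config.getoption("--integration"):
        return  # explicitly requested: run everything, including integration tests
    skip_integration = pytest.mark.skip(reason="integration tests need --integration")
    for item in items:
        if "integration" in item.keywords:
            item.add_marker(skip_integration)
```

A scheduled CI job could then run `pytest --integration -m integration`, while the normal `pytest` invocation never touches those tests.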
How should we implement distributed `generate()`?
Thanks for the detailed write-up, @bigximik. Here's a plan I would endorse: Variant (c). This keeps the HF `generate()` interface intact.
🎯 Goal (What & Why)
Implement `generate()` for `HuggingfaceGPTModelForCausalLM` using `GenerationMixin`, supporting greedy decoding only.

This makes the Fast-LLM model behave like a HuggingFace model for generation. The goal is to enable validation-time text generation directly from the sharded Fast-LLM model in memory, without converting to HF format (which would require extra memory and lead to model eviction).
We use batched greedy decoding and support FlashAttention by padding and attention masking.
No beam search, sampling, or KV caching is needed.
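To make the padding and masking requirement concrete, here is a small sketch of how a batch of prompts could be prepared for batched greedy decoding; left padding and reusing EOS as PAD are illustrative assumptions, not decisions made in this issue:

```python
# Illustrative batch preparation for batched greedy decoding.
# Left padding keeps the most recent token of every prompt in the last position,
# which is the position the greedy loop reads logits from.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # assumption: reuse EOS as PAD

prompts = ["The capital of France is", "Write a haiku about GPUs:"]
batch = tokenizer(prompts, return_tensors="pt", padding=True)

# batch["input_ids"]      -> (batch, seq), PAD tokens on the left of short prompts
# batch["attention_mask"] -> 1 for real tokens, 0 for PAD; forwarded to the model so
#                            padded positions are masked out (what makes FlashAttention
#                            usable with ragged prompt lengths).
```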
🚀 Execution Plan
Develop a minimal, batched, greedy generation loop using Fast-LLM's `.forward()` and `GenerationMixin` integration.

Step 1: What is the smallest working version?
- Implement the `GenerationMixin` interface in `HuggingfaceGPTModelForCausalLM`, i.e. this interface:
  - Implement `generate()` to:
    - Accept `inputs` and `max_new_tokens`, `eos_token_id`, as well as `pad_token_id` via `GenerationConfig` or `kwargs`.
    - Stop on `eos_token_id` and `max_new_tokens`.
    - Call `.forward()` on the Fast-LLM model in each step (see the sketch after this list).
  - Implement `prepare_inputs_for_generation()` minimally to satisfy HF's expectations.
- Load `HuggingFaceTB/SmolLM2-135M-Instruct` in both HF and Fast-LLM formats.
- Run `.generate()` from both and compare outputs.
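As referenced in the list above, here is a minimal sketch of the two pieces: a bare `prepare_inputs_for_generation()` and a hand-rolled batched greedy loop built on `.forward()`, useful as a reference to compare against `GenerationMixin.generate()`. The `.forward()` signature and the `.logits` attribute on its output are assumptions; the real Fast-LLM interfaces may differ.

```python
import torch


def prepare_inputs_for_generation(self, input_ids, attention_mask=None, **kwargs):
    # Minimal hook for GenerationMixin (meant to live on HuggingfaceGPTModelForCausalLM):
    # no KV cache, so the full sequence is re-fed to forward() at every step.
    return {"input_ids": input_ids, "attention_mask": attention_mask}


@torch.no_grad()
def greedy_generate(model, input_ids, attention_mask, max_new_tokens, eos_token_id, pad_token_id):
    # Reference batched greedy loop: take the argmax of the last position's logits at
    # each step, stopping on eos_token_id or after max_new_tokens.
    finished = torch.zeros(input_ids.shape[0], dtype=torch.bool, device=input_ids.device)
    for _ in range(max_new_tokens):
        outputs = model.forward(input_ids=input_ids, attention_mask=attention_mask)
        next_token = outputs.logits[:, -1, :].argmax(dim=-1)
        # Rows that already finished keep emitting PAD and are masked out.
        active = ~finished
        next_token = torch.where(active, next_token, torch.full_like(next_token, pad_token_id))
        input_ids = torch.cat([input_ids, next_token[:, None]], dim=-1)
        attention_mask = torch.cat([attention_mask, active.long()[:, None]], dim=-1)
        finished |= next_token == eos_token_id
        if finished.all():
            break
    return input_ids
```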
Step 2: What additional optimizations are possible (but optional)?
📌 Acceptance Criteria (Must-Haves for Completion)
- Greedy decoding works through the standard `GenerationMixin` interface, with no custom generation API (a parity-check sketch follows).
- Batched prompts are supported via padding and attention masking with `pad_token_id`.
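A sketch of the parity check implied above; the Fast-LLM import path and loader are hypothetical placeholders based on the file linked below, while the HF side is standard `transformers` usage:

```python
# Illustrative acceptance check: greedy generation from the HF reference and from the
# Fast-LLM wrapper should produce the same token ids for the same padded batch.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM2-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(["Paris is", "2 + 2 ="], return_tensors="pt", padding=True)
generate_kwargs = dict(
    max_new_tokens=32,
    do_sample=False,  # greedy decoding
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

hf_model = AutoModelForCausalLM.from_pretrained(checkpoint)
hf_out = hf_model.generate(**batch, **generate_kwargs)

# Hypothetical Fast-LLM side -- exact import and loader to be defined by this issue:
# from fast_llm.models.gpt.huggingface import HuggingfaceGPTModelForCausalLM
# fast_llm_model = HuggingfaceGPTModelForCausalLM.from_pretrained(...)
# fast_llm_out = fast_llm_model.generate(**batch, **generate_kwargs)
# assert (hf_out == fast_llm_out).all()
```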
📎 Relevant Links
https://github.com/huggingface/transformers/blob/3249c5dc1560dace3c31cdbe4797b6c878ab47de/src/transformers/generation/utils.py#L2018
https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct
https://github.com/ServiceNow/Fast-LLM/blob/main/fast_llm/models/gpt/huggingface.py
🛠️ Project Management
- Fill in the `Estimate` field (in days) in the GitHub project.
- Set the `Size` field to categorize the PR size (Small/Medium/Large).
field to categorize the PR size (Small/Medium/Large).The text was updated successfully, but these errors were encountered: