sunildkumar
Member

No description provided.

sunildkumar and others added 25 commits April 29, 2025 22:33
…d of open vocabulary (reducing degrees of freedom of the tool)

Explicitly tell the model that it can call a tool but it does not have to.
Explicitly tell the model it needs to consider all 4 options in the user prompt. Failures often look like torpedoes, so maybe this helps prevent that? Doing this in the bootstrap prompt didn’t help, but I think the IFT model “listens” to the user more strongly.
Reward schedule for the tool-use reward. The model gets 200 gradient updates with a tool-use reward. The reward decays linearly from step 0 to step 200, then stays at 0.
Point tool at b0
save more frequently
Implement the new combined correctness-and-tool-use reward
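
For reference, the tool-use reward schedule described in the commits above (linear decay over the first 200 gradient updates, then 0) could be sketched roughly as follows. This is only an illustration, not the code from this PR: the function name `tool_use_reward_weight`, the `initial_weight` of 1.0, and the idea of using the result as a multiplier on the tool-use reward are all assumptions.

```python
def tool_use_reward_weight(
    step: int,
    decay_steps: int = 200,       # from the commit message: decay ends at step 200
    initial_weight: float = 1.0,  # assumed starting value; not stated in the PR
) -> float:
    """Return a hypothetical multiplier for the tool-use reward at a given step.

    The weight falls linearly from `initial_weight` at step 0 to 0.0 at
    `decay_steps`, and stays at 0.0 for every later step.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step >= decay_steps:
        return 0.0
    return initial_weight * (1.0 - step / decay_steps)
```

Under these assumptions, the tool-use term would be scaled by `tool_use_reward_weight(step)` at each gradient update, so it contributes fully at the start, half as much at step 100, and nothing from step 200 onward.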

2 participants