Add VLM support #220
base: main
Conversation
@aymeric-roucher we can keep images in the action step under a separate key. Normally models do not produce images, so if we put images under an "images" key it will break the chat template. If we keep images for record-keeping, we can store them under a different key like "observation_images" (just like we do in the function).
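The idea could be sketched like this; the dict layout and placeholder values are illustrative, not the actual smolagents structures:

```python
# Sketch of the suggestion above: keep step images under a dedicated
# key ("observation_images") so the "role"/"content" fields fed to the
# chat template stay text-only and the template does not break.
# The dict layout is illustrative, not the real smolagents structure.
action_step = {
    "role": "assistant",
    "content": "Observation: the screenshot shows a login form.",
    "observation_images": ["<PIL image 1>", "<PIL image 2>"],  # strings stand in for images
}

# When building chat-template input, strip the image key:
template_input = {k: v for k, v in action_step.items() if k != "observation_images"}
print(sorted(template_input))  # ['content', 'role']
```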
We need to unify image handling for both OpenAI & transformers, I think. I saw you overwrote the templates, which could break transformers. Will handle tomorrow.
@aymeric-roucher can I write the tests now? Will there be changes to the API?
```diff
@@ -40,7 +40,7 @@ def reset(self):
         self.total_input_token_count = 0
         self.total_output_token_count = 0

-    def update_metrics(self, step_log):
+    def update_metrics(self, step_log, agent):
```
@aymeric-roucher why did you add `agent` here? I don't see any changes where it's used (just icymi).
@merve this is a general change in the callback logic: the idea is to let callback functions access the whole agent, for instance to read token counts from `agent.monitor`. But I'm not sure this is the most ergonomic solution.
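A minimal sketch of that callback design, assuming illustrative names throughout (`StepLog`, `Monitor`, `Agent` are stand-ins, not the actual smolagents API):

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StepLog:
    input_tokens: int
    output_tokens: int


@dataclass
class Monitor:
    total_input_token_count: int = 0
    total_output_token_count: int = 0

    def update_metrics(self, step_log, agent):
        # New signature: the callback also receives the whole agent,
        # so it can inspect any agent state, not just the step log.
        self.total_input_token_count += step_log.input_tokens
        self.total_output_token_count += step_log.output_tokens


@dataclass
class Agent:
    monitor: Monitor = field(default_factory=Monitor)
    step_callbacks: List[Callable] = field(default_factory=list)

    def run_step(self, step_log):
        for callback in self.step_callbacks:
            callback(step_log, self)  # pass the agent itself to every callback


agent = Agent()
agent.step_callbacks.append(agent.monitor.update_metrics)
agent.run_step(StepLog(input_tokens=10, output_tokens=5))
print(agent.monitor.total_input_token_count)  # 10
```

The trade-off being debated: passing the whole agent gives callbacks maximal access, at the cost of a wider, less explicit interface.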
Is this change in the callback logic necessary for the VLM support? Or could we address it in a separate PR?
@albertvillanova in cases like taking screenshots, callbacks are necessary at every step; other than that, I think not.
Image handling is similar: it matters when an image needs to be kept at every step, or when you need to retrieve images dynamically (e.g. from a knowledge base).
Force-pushed from de7e376 to bdd8847
@merveenoyan do
> [!TIP]
> Read the [Open-source LLMs as LangChain Agents](https://huggingface.co/blog/open-source-llms-as-agents) blog post to learn more about multi-step agents.

1. **Thought:** This is the first step, initializing the system: prompting it on how it should behave (`SystemPromptStep`), establishing the facts about the task at hand (`PlanningStep`), and providing the task itself (`TaskStep`). The system prompt, facts, and task prompt are appended to the memory. Facts are updated at each step until the agent produces the final response. If there are any images in the prompt, they are fed to `TaskStep`.
2. **Action:** This is where all the action is taken, including LLM inference and callback execution. After inference takes place, the output of the LLM/VLM (the "observations") is fed to `ActionStep`. Callbacks are functions executed at the end of every step. A good callback example is taking screenshots and adding them to the agent's state in an agentic web browser.
@merveenoyan in the ReAct framework, both Thought and Action are inside the while loop.
So I'd really make a distinction here between 1. Initialization and 2. While loop, with 2.1 Thought (basically the LLM generation + parsing) and 2.2 Action (execution of the action).
I've reworded the blog post as well to convey this.
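That split could be sketched in toy pseudocode like this (all names and the loop body are illustrative, not the smolagents implementation):

```python
# Sketch of the structure described above: a one-time initialization,
# then a loop alternating Thought (LLM generation + parsing) and
# Action (execution of the parsed action).
def run_agent(task, max_steps=3):
    # 1. Initialization: system prompt, planning facts, task go to memory
    memory = ["system_prompt", "planning_facts", f"task: {task}"]
    # 2. While loop
    for _ in range(max_steps):
        thought = f"thinking about {memory[-1]}"    # 2.1 Thought: generate + parse
        observation = f"executed: {thought}"        # 2.2 Action: run the action
        memory.append(observation)
        if "final" in observation:                  # stop once a final answer appears
            return observation
    return memory[-1]


print(run_agent("find the answer").startswith("executed:"))  # True
```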
sounds good!
taking a look now
Force-pushed from e6fbdca to 6c46065
Force-pushed from 6c46065 to adfce1c
src/smolagents/agents.py
Outdated
```python
for image in step_log.task_images:
    task_message = {
        "role": MessageRole.USER,
        "content": [
            {"type": "text", "text": f"New task:\n{step_log.task}"},
            {
                "type": "image",
                "image": image,
            },
        ],
    }
```
I think this loop does not work as expected: only the last image is kept, since `task_message` is rebuilt on every iteration.
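A possible fix, sketched with placeholder values (strings stand in for PIL images): accumulate all images into a single message's content list instead of overwriting `task_message` per image.

```python
# Sketch of a fix for the loop above: build one content list holding
# the task text plus every image, rather than rebuilding task_message
# on each iteration (which keeps only the last image).
task = "Describe the scene"
task_images = ["<image 1>", "<image 2>"]  # placeholders for PIL images

content = [{"type": "text", "text": f"New task:\n{task}"}]
for image in task_images:
    content.append({"type": "image", "image": image})
task_message = {"role": "user", "content": content}

print(len(task_message["content"]))  # 3: one text part + two image parts
```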
```diff
@@ -410,6 +456,7 @@ def __init__(
         device_map: Optional[str] = None,
         torch_dtype: Optional[str] = None,
         trust_remote_code: bool = False,
+        flatten_messages_as_text: bool = True,
```
This variable is never used.
I have added it as an instance attribute. See: 221e678
```python
def __call__(
    self,
    messages: List[Dict[str, str]],
    stop_sequences: Optional[List[str]] = None,
    grammar: Optional[str] = None,
    tools_to_call_from: Optional[List[Tool]] = None,
    images: Optional[List[Image.Image]] = None,
```
Shouldn't this variable's type be `Optional[List[str]]` instead?
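The two candidate annotations differ in what the caller passes: pre-loaded image objects (`List[Image.Image]`) versus file paths (`List[str]`). A toy sketch of the distinction, not the real `Model.__call__`:

```python
from typing import Dict, List, Optional, Union

# Toy stand-in contrasting the two annotations discussed above.
# With images: Optional[List[str]], the caller passes file paths and
# the model loads them itself; with Optional[List[Image.Image]], the
# caller loads the images (e.g. via PIL) and passes the objects.
def call_model(
    messages: List[Dict[str, str]],
    images: Optional[List[Union[str, object]]] = None,
) -> int:
    # Just count the image parts that would be attached to the request.
    return 0 if images is None else len(images)


print(call_model([{"role": "user", "content": "hi"}], images=["a.png", "b.png"]))  # 2
```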
This PR adds VLM support (closing the other one for the sake of collaboration) @aymeric-roucher.
With VLMs the message format differs, so I modified it a bit. But you access `content` and modify it here and there, so it is broken (fixing). Will open for review once I fix these.