
Add VLM support #220

Draft · wants to merge 45 commits into main

Conversation

@merveenoyan commented Jan 16, 2025

This PR adds VLM support (closing the other one for the sake of collaboration). @aymeric-roucher

  • This PR, at creation, is probably broken (since you wanted to see it), primarily because as of now, when writing memory, you adopt chat templates like the following:
messages = [
  {"role": "user", "content": "I'm doing great. How can I help you today?"},
]

whereas with VLMs we do it like the following, so I modified it a bit:

messages = [
  {"role": "user", "content": [
    {"type": "text", "text": "I'm doing great. How can I help you today?"},
    {"type": "image"},
  ]},
]

but you access and modify content here and there, so it is broken (fixing).

  • Secondly, I need to check that I'm adding each image only once, because passing the same image more than once will break inference; I currently add it in multiple steps, so I will check.

Will open for review once I fix these.

@merveenoyan (Author)

@aymeric-roucher we can keep images in the action step under a separate key. Normally models do not produce images, so if we put images under an "images" key it will break the chat template. If we keep images for the sake of keeping images, we can store them under a different key like "observation_images" (just like how we do in the function).
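For illustration, roughly what that could look like (the key name and the surrounding structure here are assumptions for this sketch, not final code):

from PIL import Image

# Illustrative only: keep images out of the chat-template messages and carry
# them under a separate "observation_images" key instead.
observation_image = Image.new("RGB", (64, 64))  # stand-in for a real screenshot

action_step = {
    # Only text flows through the chat template, so templates that reject an
    # "images" key on assistant messages keep working.
    "messages": [
        {"role": "assistant", "content": [{"type": "text", "text": "Observation: page loaded."}]},
    ],
    # Images live next to the messages, not inside them.
    "observation_images": [observation_image],
}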

@merveenoyan mentioned this pull request Jan 16, 2025
@merveenoyan (Author)

We need to unify the image handling for both OpenAI & transformers, I think. I saw you overwrote templates, which could break transformers. Will handle tomorrow.
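One possible shape for that unification, purely as a sketch (the helper name and behavior are assumptions, not code from this PR): keep messages in the structured VLM-style form internally, and flatten them to plain strings only for text-only backends.

def flatten_message_content(message: dict) -> dict:
    # Collapse structured content into a plain string for text-only chat templates.
    content = message["content"]
    if isinstance(content, str):
        return message
    text = "\n".join(part["text"] for part in content if part.get("type") == "text")
    return {"role": message["role"], "content": text}

# A structured (VLM-style) message flattened for a text-only backend:
flattened = flatten_message_content(
    {"role": "user", "content": [{"type": "text", "text": "Describe the image."}, {"type": "image"}]}
)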

@merveenoyan (Author)

@aymeric-roucher can I write the tests now? Will there be changes to the API?

@@ -40,7 +40,7 @@ def reset(self):
         self.total_input_token_count = 0
         self.total_output_token_count = 0

-    def update_metrics(self, step_log):
+    def update_metrics(self, step_log, agent):
@merveenoyan (Author) commented Jan 19, 2025

@aymeric-roucher why did you add agent here? I don't see any changes where it's used (just ICYMI).

Collaborator

@merve this is a general change in the callback logic: the idea is to let callback functions access the whole agent, for instance to read token counts from agent.monitor. But I'm not sure this is the most ergonomic solution.
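Something like this, as a rough sketch (the callback name is made up; the counter names come from the monitor code in the diff above):

def log_token_usage(step_log, agent):
    # With the extra `agent` argument, a callback can read global state such as
    # the monitor's running token counts, not just the current step log.
    print(
        f"input tokens so far: {agent.monitor.total_input_token_count}, "
        f"output tokens so far: {agent.monitor.total_output_token_count}"
    )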

Member

Is this change in the callback logic necessary for the VLM support? Or could we address it in a separate PR?

@merveenoyan (Author)

@albertvillanova in cases like taking screenshots, callbacks are necessary across every step; other than that, I don't think so.

Image handling is a bit like that too, when an image needs to be kept at every step or when you need to dynamically retrieve images (e.g. from a knowledge base).
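For the screenshot case, a rough sketch of such a per-step callback (take_screenshot and the observations_images attribute are illustrative assumptions, not this PR's API):

from PIL import Image

def take_screenshot() -> Image.Image:
    # Stand-in for a real browser screenshot (e.g. captured via selenium).
    return Image.new("RGB", (1280, 720))

def save_screenshot(step_log, agent):
    # Attach the captured image to the current step so later model calls can see it.
    step_log.observations_images = [take_screenshot()]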

@aymeric-roucher (Collaborator)

@merveenoyan do TransformersModel VLMs work in the current state? Also one image-related test is failing.

> [!TIP]
> Read [Open-source LLMs as LangChain Agents](https://huggingface.co/blog/open-source-llms-as-agents) blog post to learn more about multi-step agents.
1. **Thought:** This is the first step, initializing the system: prompting it on how it should behave (`SystemPromptStep`), providing the facts about the task at hand (`PlanningStep`), and providing the task itself (`TaskStep`). The system prompt, facts, and task prompt are appended to the memory. The facts are updated at each step until the agent produces the final response. If there are any images in the prompt, they are fed to `TaskStep`.
2. **Action:** This is where all the action is taken, including LLM inference and callback-function execution. After inference takes place, the output of the LLM/VLM (called "observations") is fed to `ActionStep`. Callbacks are functions executed at the end of every step. A good callback example is taking a screenshot and adding it to the agent's state in an agentic web browser.
Collaborator

@merveenoyan in the ReAct framework, both Thought and Action are in the while loop.
So I'd really make a distinction here between 1. Initialization and 2. the while loop, with 2.1 Thought (basically the LLM generation + parsing) and 2.2 Action (execution of the action).

I've reworded the blog post as well to convey this.

@merveenoyan (Author)

sounds good!

@merveenoyan (Author)

taking a look now

Comment on lines 263 to 273
for image in step_log.task_images:
    task_message = {
        "role": MessageRole.USER,
        "content": [
            {"type": "text", "text": f"New task:\n{step_log.task}"},
            {
                "type": "image",
                "image": image,
            },
        ],
    }
Member

I think this loop does not work as expected: only the last image is kept.
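One possible fix, as a sketch that reuses the names from the snippet above (not necessarily what should land): build the content list once and append every image to it.

content = [{"type": "text", "text": f"New task:\n{step_log.task}"}]
for image in step_log.task_images:
    content.append({"type": "image", "image": image})
task_message = {
    "role": MessageRole.USER,
    "content": content,
}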

@@ -410,6 +456,7 @@ def __init__(
         device_map: Optional[str] = None,
         torch_dtype: Optional[str] = None,
         trust_remote_code: bool = False,
+        flatten_messages_as_text: bool = True,
Member

This variable is never used.

Member

I have added it as an instance attribute. See: 221e678
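Roughly along these lines, as a sketch (assuming the constructor shown above belongs to TransformersModel; see commit 221e678 for the actual change):

from typing import Optional

class TransformersModel:
    def __init__(
        self,
        device_map: Optional[str] = None,
        torch_dtype: Optional[str] = None,
        trust_remote_code: bool = False,
        flatten_messages_as_text: bool = True,
    ):
        # Keep the flag around so message preparation can consult it later.
        self.flatten_messages_as_text = flatten_messages_as_text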


def __call__(
    self,
    messages: List[Dict[str, str]],
    stop_sequences: Optional[List[str]] = None,
    grammar: Optional[str] = None,
    tools_to_call_from: Optional[List[Tool]] = None,
    images: Optional[List[Image.Image]] = None,
Member

Shouldn't this variable type be Optional[List[str]] instead?
