
Add VLM support #220

Draft · wants to merge 45 commits into main

Conversation

@merveenoyan commented Jan 16, 2025

This PR adds VLM support (closing the other one for the sake of collaboration). @aymeric-roucher

  • This PR, at creation, is probably broken (since you wanted to see it), primarily because as of now, when writing memory, you adopt chat templates like the following:
messages = [
  {"role": "user", "content": "I'm doing great. How can I help you today?"},
]

whereas with VLMs we do it like the following, so I modified it a bit:

messages = [
  {"role": "user", "content": [
    {"type": "text", "text": "I'm doing great. How can I help you today?"},
    {"type": "image"},
  ]},
]

but you access and modify content here and there, so it is broken (fixing).

  • Secondly, I need to check that I'm adding each image only once, because passing the same image more than once will break inference; I currently add it in multiple steps, so I will check.

Will open for review once I fix these.

@merveenoyan (Author)

@aymeric-roucher we can keep images in the action step under a separate key. Normally models do not produce images, so if we put images under an "images" key it will break the chat template. If we keep images for the sake of keeping images, we can store them under a different key like "observation_images" (just like how we do in the function).
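For illustration, roughly what that could look like (the key name and the surrounding structure here are assumptions for this sketch, not final code):

from PIL import Image

# Illustrative only: keep images out of the chat-template messages and carry
# them under a separate "observation_images" key instead.
observation_image = Image.new("RGB", (64, 64))  # stand-in for a real screenshot

action_step = {
    # Only text flows through the chat template, so templates that reject an
    # "images" key on assistant messages keep working.
    "messages": [
        {"role": "assistant", "content": [{"type": "text", "text": "Observation: page loaded."}]},
    ],
    # Images live next to the messages, not inside them.
    "observation_images": [observation_image],
}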

@merveenoyan mentioned this pull request Jan 16, 2025
@merveenoyan (Author)

We need to unify the image handling for both OpenAI & transformers, I think. I saw you overwrote templates, which could break transformers. Will handle tomorrow.
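One possible shape for that unification, purely as a sketch (the helper name and behavior are assumptions, not code from this PR): keep messages in the structured VLM-style form internally, and flatten them to plain strings only for text-only backends.

def flatten_message_content(message: dict) -> dict:
    # Collapse structured content into a plain string for text-only chat templates.
    content = message["content"]
    if isinstance(content, str):
        return message
    text = "\n".join(part["text"] for part in content if part.get("type") == "text")
    return {"role": message["role"], "content": text}

# A structured (VLM-style) message flattened for a text-only backend:
flattened = flatten_message_content(
    {"role": "user", "content": [{"type": "text", "text": "Describe the image."}, {"type": "image"}]}
)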

@merveenoyan (Author)

@aymeric-roucher can I write the tests now? Will there be changes to the API?

@@ -40,7 +40,7 @@ def reset(self):
         self.total_input_token_count = 0
         self.total_output_token_count = 0

-    def update_metrics(self, step_log):
+    def update_metrics(self, step_log, agent):
@merveenoyan (Author) commented Jan 19, 2025

@aymeric-roucher why did you add agent here? I don't see any changes where it's used (just ICYMI).

Collaborator

@merve this is a general change in the callback logic: the idea is to let callback functions access the whole agent, for instance to read token counts from agent.monitor. But I'm not sure this is the most ergonomic solution.
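Something like this, as a rough sketch (the callback name is made up; the counter names come from the monitor code in the diff above):

def log_token_usage(step_log, agent):
    # With the extra `agent` argument, a callback can read global state such as
    # the monitor's running token counts, not just the current step log.
    print(
        f"input tokens so far: {agent.monitor.total_input_token_count}, "
        f"output tokens so far: {agent.monitor.total_output_token_count}"
    )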

Member

Is this change in the callback logic necessary for the VLM support? Or could we address it in a separate PR?

@merveenoyan (Author)

@albertvillanova in cases like taking screenshots, callbacks are necessary across every step; other than that, I don't think so.

Image handling is a bit like that too, when an image needs to be kept at every step or when you need to dynamically retrieve images (e.g. from a knowledge base).
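For the screenshot case, a rough sketch of such a per-step callback (take_screenshot and the observations_images attribute are illustrative assumptions, not this PR's API):

from PIL import Image

def take_screenshot() -> Image.Image:
    # Stand-in for a real browser screenshot (e.g. captured via selenium).
    return Image.new("RGB", (1280, 720))

def save_screenshot(step_log, agent):
    # Attach the captured image to the current step so later model calls can see it.
    step_log.observations_images = [take_screenshot()]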

@aymeric-roucher (Collaborator)

@merveenoyan do TransformersModel VLMs work in the current state? Also one image-related test is failing.

> [!TIP]
> Read [Open-source LLMs as LangChain Agents](https://huggingface.co/blog/open-source-llms-as-agents) blog post to learn more about multi-step agents.
1. **Thought:** This is the first step, initializing the system: prompting it on how it should behave (`SystemPromptStep`), providing the facts about the task at hand (`PlanningStep`), and providing the task itself (`TaskStep`). The system prompt, facts, and task prompt are appended to the memory. The facts are updated at each step until the agent produces the final response. If there are any images in the prompt, they are fed to `TaskStep`.
2. **Action:** This is where all the action is taken, including LLM inference and callback-function execution. After inference takes place, the output of the LLM/VLM (called "observations") is fed to `ActionStep`. Callbacks are functions executed at the end of every step. A good callback example is taking a screenshot and adding it to the agent's state in an agentic web browser.
Collaborator

@merveenoyan in the ReAct framework, both Thought and Action are in the while loop.
So I'd really make a distinction here between 1. Initialization and 2. the while loop, with 2.1 Thought (basically the LLM generation + parsing) and 2.2 Action (execution of the action).

I've reworded the blog post as well to convey this.

@merveenoyan (Author)

sounds good!

@merveenoyan (Author)

taking a look now

Comment on lines 263 to 273
for image in step_log.task_images:
    task_message = {
        "role": MessageRole.USER,
        "content": [
            {"type": "text", "text": f"New task:\n{step_log.task}"},
            {
                "type": "image",
                "image": image,
            },
        ],
    }
Member

I think this loop does not work as expected: only the last image is kept.
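One possible fix, as a sketch that reuses the names from the snippet above (not necessarily what should land): build the content list once and append every image to it.

content = [{"type": "text", "text": f"New task:\n{step_log.task}"}]
for image in step_log.task_images:
    content.append({"type": "image", "image": image})
task_message = {
    "role": MessageRole.USER,
    "content": content,
}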

@@ -410,6 +456,7 @@ def __init__(
         device_map: Optional[str] = None,
         torch_dtype: Optional[str] = None,
         trust_remote_code: bool = False,
+        flatten_messages_as_text: bool = True,
Member

This variable is never used.

Member

I have added it as an instance attribute. See: 221e678
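Roughly along these lines, as a sketch (assuming the constructor shown above belongs to TransformersModel; see commit 221e678 for the actual change):

from typing import Optional

class TransformersModel:
    def __init__(
        self,
        device_map: Optional[str] = None,
        torch_dtype: Optional[str] = None,
        trust_remote_code: bool = False,
        flatten_messages_as_text: bool = True,
    ):
        # Keep the flag around so message preparation can consult it later.
        self.flatten_messages_as_text = flatten_messages_as_text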


def __call__(
    self,
    messages: List[Dict[str, str]],
    stop_sequences: Optional[List[str]] = None,
    grammar: Optional[str] = None,
    tools_to_call_from: Optional[List[Tool]] = None,
    images: Optional[List[Image.Image]] = None,
Member

Shouldn't this variable type be Optional[List[str]] instead?
