I ran inference with the provided aria_ui_vllm.py and aria_ui_hf.py separately, using the same image and instruction, and the two scripts return inconsistent coordinates.
- Running aria_ui_vllm.py:
llm = LLM(
    model=model_path,
    tokenizer_mode="slow",
    dtype="bfloat16",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    model_path, trust_remote_code=True, use_fast=False
)
instruction = "Try Aria."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {
                "type": "text",
                "text": "Given a GUI image, what are the relative (0-1000) pixel point coordinates for the element corresponding to the following instruction or description: " + instruction,
            },
        ],
    }
]
outputs:
```(684, 786)```<|im_end|>
After running draw_coord, the predicted point does not land on the 'Try Aria' element, so the coordinates are wrong.
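For reference, the raw output has to be mapped from the relative 0-1000 range back to image pixels before it can be checked against the screenshot. The sketch below is an assumption about what that step looks like, not the repository's draw_coord implementation: `parse_point` and `to_pixels` are hypothetical helper names, and the 1920x1080 image size is made up for illustration.

```python
import re


def parse_point(output: str) -> tuple[int, int]:
    """Extract the first "(x, y)" pair from the raw model output."""
    match = re.search(r"\((\d+),\s*(\d+)\)", output)
    if match is None:
        raise ValueError(f"no coordinate pair found in: {output!r}")
    return int(match.group(1)), int(match.group(2))


def to_pixels(point: tuple[int, int], width: int, height: int, scale: int = 1000) -> tuple[int, int]:
    """Convert relative (0-scale) coordinates to absolute pixel coordinates."""
    x, y = point
    return round(x * width / scale), round(y * height / scale)


# Raw output copied from the vLLM run above.
raw = "```(684, 786)```<|im_end|>"
point = parse_point(raw)
# Hypothetical image size, used only to illustrate the scaling.
print(point, to_pixels(point, width=1920, height=1080))
```

Because the coordinates are relative, resizing the image (as the HF script does) should not change the expected output, which makes the two scripts directly comparable.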

- Running aria_ui_hf.py:
instruction = "Try Aria."
image = Image.open(image_file).convert("RGB")
# NOTE: using huggingface on a single 80GB GPU, we resize the image to 1920px on the long side to prevent OOM. This is unnecessary with vllm.
image = resize_image(image, long_size=1920)
messages = [
    {
        "role": "user",
        "content": [
            {"text": None, "type": "image"},
            {"text": instruction, "type": "text"},
        ],
    }
]
outputs:
```(767, 782)```<|im_end|>
After running draw_coord:

- I then changed the prompt in aria_ui_vllm.py by removing the sentence 'Given a GUI image, what are the relative (0-1000) pixel point coordinates for the element corresponding to the following instruction or description:', so that only the instruction is sent:
instruction = "Try Aria."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {
                "type": "text",
                "text": instruction,
            },
        ],
    }
]
outputs:
```(760, 786)```<|im_end|>

With this change, the vLLM output is correct and close to the HF result.
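To put a number on the gap, the three reported points can be compared directly in the 0-1000 coordinate space. This is a minimal sketch using only the outputs quoted above; the dictionary keys are illustrative labels, not names from the repository.

```python
import math

# Points copied from the three runs reported above (relative 0-1000 coordinates).
points = {
    "vllm_with_prefix": (684, 786),
    "hf": (767, 782),
    "vllm_without_prefix": (760, 786),
}


def distance(a: str, b: str) -> float:
    """Euclidean distance between two labelled points."""
    return math.dist(points[a], points[b])


# vLLM with the long prompt prefix vs. HF: about 83.1 units apart.
print(f"vllm_with_prefix vs hf: {distance('vllm_with_prefix', 'hf'):.1f}")
# vLLM without the prefix vs. HF: about 8.1 units apart.
print(f"vllm_without_prefix vs hf: {distance('vllm_without_prefix', 'hf'):.1f}")
```

Almost all of the discrepancy is along the x axis (684 vs. 767), while the y values agree to within a few units in all three runs.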