
Why does the predicted scroll direction always come out opposite to the instruction when testing on android_control? #75

Open
manmushanhe opened this issue Mar 3, 2025 · 2 comments

Comments


manmushanhe commented Mar 3, 2025

instructions

["Click on the Romanticism art", "Swipe up and learn more about Romanticism art", "Swipe up and learn more about Romanticism art", "Swipe up and learn more about Romanticism art", "Swipe up and learn more about Romanticism art"]

result

[["Action: click(start_box='(259,314)')"], ["Action: scroll(direction='down')"], ["Action: scroll(direction='down')"], ["Action: scroll(direction='down')"], ["Action: scroll(direction='down')"]]

prompts

"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
## Output Format

Action: ...

## Action Space
{action_space}

## User Instruction
{instruction}
"""

action_space

"""
click(start_box='<|box_start|>(x1,y1)<|box_end|>')
long_press(start_box='<|box_start|>(x1,y1)<|box_end|>', time='')
type(content='')
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left')
press_back()
wait() #Sleep for 5s and take a screenshot to check for any changes.
"""

JjjFangg (Collaborator) commented Mar 4, 2025

We recommend trying the following prompt format. When providing the Thought, use the format mentioned in the prompt to guide the model in predicting the Action (e.g. Thought: Click on the Romanticism art\nAction: ...). Additionally, we suggest running mobile-scenario experiments on the SFT version of the model.

You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.

Output Format

Thought: ...
Action: ...

Action Space

click(start_box='[x1, y1, x2, y2]')
long_press(start_box='[x1, y1, x2, y2]', time='')
type(content='')
scroll(direction='down or up or right or left')
open_app(app_name='')
press_back()
press_home()
wait()
finished() # Submit the task regardless of whether it succeeds or fails.

Note

  • Use English in Thought part.
  • Summarize your next action (with its target element) in one sentence in Thought part.

User Instruction

Make the Copy of Office Pic in the Drive app

By structuring the prompt in this way, the model can better understand the formatting requirements and predict actions more effectively. Let us know if you have any further questions! 🚀
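
As a concrete (hypothetical) sketch of that structure, the prompt can be assembled and the Thought/Action parts recovered roughly as below, in Python. The constant and helper names, the heading layout, and the example response are illustrative assumptions, not code from the repository.

import re

ACTION_SPACE = """\
click(start_box='[x1, y1, x2, y2]')
long_press(start_box='[x1, y1, x2, y2]', time='')
type(content='')
scroll(direction='down or up or right or left')
open_app(app_name='')
press_back()
press_home()
wait()
finished()  # Submit the task regardless of whether it succeeds or fails.
"""

def build_prompt(instruction: str) -> str:
    # Assemble the recommended Thought/Action prompt for one step.
    return (
        "You are a GUI agent. You are given a task and your action history, "
        "with screenshots. You need to perform the next action to complete the task.\n\n"
        "## Output Format\n\nThought: ...\nAction: ...\n\n"
        f"## Action Space\n\n{ACTION_SPACE}\n"
        "## Note\n\n"
        "- Use English in Thought part.\n"
        "- Summarize your next action (with its target element) in one sentence in Thought part.\n\n"
        f"## User Instruction\n\n{instruction}\n"
    )

def split_response(response: str) -> tuple[str, str]:
    # Recover the Thought and Action lines from a model response.
    thought = re.search(r"Thought:\s*(.*)", response)
    action = re.search(r"Action:\s*(.*)", response)
    return (thought.group(1).strip() if thought else "",
            action.group(1).strip() if action else "")

prompt = build_prompt("Make the Copy of Office Pic in the Drive app")
thought, action = split_response(
    "Thought: Long press the Office Pic file to open its options menu.\n"
    "Action: long_press(start_box='[100, 200, 300, 250]')"
)
print(thought)
print(action)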

manmushanhe (Author) commented

I have a question: during training, are the direction in the instruction and the direction in the label opposite? That is, if the instruction says to scroll up, is the expected model output the opposite direction?
