-
Notifications
You must be signed in to change notification settings - Fork 213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug in qwen2-vl ? #552
Comments
Because this is a dummy image token in LLaVA, so I don't think it would make any effect for Qwen2-VL. To truly interleave, the inputs are packed with messages format |
Thank you for your answer. Does this meet the requirements of the benchmark? |
I think there are 2 splits, one is interleave one is v right? Does this correspond to with and without subtitle? |
if lmms_eval_specific_kwargs.get("insert_interleave_subtitles", False):
with open(Path(__file__).parent / "longvideobench_val_i.yaml", "r") as f:
raw_data = f.readlines()
safe_data = []
for i, line in enumerate(raw_data):
# remove function definition since yaml load cannot handle it
if "!function" not in line:
safe_data.append(line)
cache_name = yaml.safe_load("".join(safe_data))["dataset_kwargs"]["cache_dir"]
subtitle_subdir_name = yaml.safe_load("".join(safe_data))["dataset_kwargs"].get("subtitle_subdir", "subtitles")
cache_dir = os.path.join(base_cache_dir, cache_name, subtitle_subdir_name)
with open(os.path.join(cache_dir, doc["subtitle_path"])) as f:
subtitles = json.load(f)
max_num_frames = yaml.safe_load("".join(safe_data))["dataset_kwargs"].get("max_num_frames", 16)
frame_timestamps = compute_frame_timestamps(doc["duration"], max_num_frames)
interleaved_prefix = insert_subtitles_into_frames(frame_timestamps, subtitles, doc["starting_timestamp_for_subtitles"], doc["duration"])
return f"{pre_prompt}{interleaved_prefix}\n{question}\n{post_prompt}"
else:
return f"{pre_prompt}{question}\n{post_prompt}" This is the code segment to construct the prompt of longvideobench. |
Code above is from the Qwen2_VL's generate_until function,. It replaces all the "<image>" token to empty string. When I evaluate the longvideobench with subtitles, it interleaves
and subtitle text. In my opinion, it seems to evaluate in this format, but the implement of qwen2vl makes it only include text. Is this a bug?
Additionally, I see that the Qwen2_5_VL delete these code.
The text was updated successfully, but these errors were encountered: