6 changes: 3 additions & 3 deletions lmms_eval/models/chat/openai_compatible.py
@@ -75,12 +75,12 @@ def generate_until(self, requests) -> List[str]:
payload["max_tokens"] = gen_kwargs["max_new_tokens"]
payload["temperature"] = gen_kwargs["temperature"]

if "o1" in self.model_version or "o3" in self.model_version or "o4" in self.model_version:
if "o1" in self.model_version or "o3" in self.model_version or "o4" in self.model_version or "gpt-5" in self.model_version:
del payload["temperature"]
payload.pop("max_tokens")
payload["reasoning_effort"] = "medium"
#payload["reasoning_effort"] = "medium"
Collaborator

I think we should probably add a control argument for reasoning effort in the OpenAI-compatible model instead of commenting it out directly.
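A minimal sketch of what such a control could look like (hypothetical; a `reasoning_effort` key in `gen_kwargs` is an assumption, not an existing option):

```python
# Hypothetical sketch: take reasoning effort from gen_kwargs, defaulting to "medium",
# instead of hardcoding the value or commenting the field out.
if any(m in self.model_version for m in ("o1", "o3", "o4", "gpt-5")):
    payload.pop("temperature", None)
    payload.pop("max_tokens", None)
    payload["reasoning_effort"] = gen_kwargs.get("reasoning_effort", "medium")
```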

Author

I apologize; this part is a temporary modification I made for testing purposes. You can ignore it and focus only on the newly added task.

payload["response_format"] = {"type": "text"}
payload["max_completion_tokens"] = gen_kwargs["max_new_tokens"]
payload["max_completion_tokens"] = 5000
Collaborator

This is a bit hardcoded.
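For example, a sketch that keeps the caller-supplied limit and only falls back to a constant (assuming `max_new_tokens` should still take precedence):

```python
# Hypothetical: prefer the caller-supplied limit; fall back to 5000 only if unset.
payload["max_completion_tokens"] = gen_kwargs.get("max_new_tokens", 5000)
```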


batch_payloads.append(payload)
batch_responses.append(None)
32 changes: 32 additions & 0 deletions lmms_eval/tasks/seephys/seephys.yaml
@@ -0,0 +1,32 @@
dataset_path: SeePhys/SeePhys
dataset_kwargs:
  token: True
task: "seephys"
test_split: train
output_type: generate_until

doc_to_visual: !function seephys_utils.seephys_doc_to_visual
doc_to_text: !function seephys_utils.seephys_doc_to_text
doc_to_target: "answer"

process_results: !function seephys_utils.seephys_process_results

generation_kwargs:
  until:
    - "</answer>"
    - "\n\n"
  do_sample: false
  temperature: 1

metric_list:
  - metric: eval_results
    aggregation: !function seephys_utils.seephys_aggregate_results
    higher_is_better: true

metadata:
  version: 0.0
  # Evaluation model used for LLM-as-a-judge
  eval_model_name: "gpt-5-mini"
  # Set to false to enable LLM-as-a-judge (recommended)
  # Set to true to use only regex extraction for fast (but less accurate) evaluation
  quick_extract: false
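The YAML only wires up hooks from `seephys_utils`, which is not shown in this hunk. Below is a hypothetical sketch of how the `quick_extract` flag could gate between fast regex extraction and LLM-as-a-judge scoring; the function signature, fields, and judge prompt are assumptions, not the PR's actual implementation:

```python
import re

from openai import OpenAI


def seephys_process_results(doc, results, quick_extract=False, eval_model_name="gpt-5-mini"):
    """Sketch only: score one example by regex extraction or by a judge model."""
    prediction = results[0]
    gold = str(doc["answer"]).strip()

    if quick_extract:
        # Fast path: compare the last <answer>...</answer> span against the gold answer.
        spans = re.findall(r"<answer>(.*?)</answer>", prediction, flags=re.DOTALL)
        extracted = spans[-1].strip() if spans else prediction.strip()
        correct = extracted == gold
    else:
        # Judge path: ask the configured eval model whether the two answers agree.
        client = OpenAI()
        reply = client.chat.completions.create(
            model=eval_model_name,
            messages=[{
                "role": "user",
                "content": (
                    f"Gold answer: {gold}\nModel answer: {prediction}\n"
                    "Do these agree? Reply with exactly 'yes' or 'no'."
                ),
            }],
        )
        correct = reply.choices[0].message.content.strip().lower().startswith("yes")

    # The metric_list entry aggregates these dicts via seephys_aggregate_results.
    return {"eval_results": {"correct": correct}}
```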