Fix an inconsistency bug in multi-turn evaluation. #275

junya-takayama · 2025-12-25T07:46:21Z

In multi-turn evaluation setups like MT-Bench, where responses for intermediate turns are also generated, running the flexeval_lm command by itself did not pass the generated responses from intermediate turns to Metric.evaluate().

On the other hand, when using the workflow of generating multi-turn responses with flexeval_lm and then running the evaluation with flexeval_file, the assistant responses from intermediate turns are correctly passed to Metric.

Since running flexeval_lm alone should behave the same way as the latter, we will fix this.

junya-takayama · 2025-12-25T08:45:34Z

試運転

mt-ja.jsonnet

/*
Multi-Turn Benchmark for large language models in Japanese.

References:

* [Data Source](https://github.com/Stability-AI/FastChat/tree/jp-stable/fastchat/llm_judge)
*/
{
  class_path: 'ChatResponse',
  init_args: {
    eval_dataset: {
      class_path: 'ChatbotBench',
      init_args: {
        path_or_name: 'mt-ja',
        ref_path_or_name: 'mt-ja-ref-gpt4',
      },
    },
    metrics: [
      { class_path: 'OutputLengthStats' },
      {
        class_path: 'ChatLLMScore',
        init_args: {
          language_model: { class_path: 'OpenAIChatAPI', init_args: { model: 'gpt-4o-mini' } },
          valid_score_range: [1, 10],
          prompt_template: {
            class_path: 'Jinja2PromptTemplate',
            init_args: {
              template: std.stripChars(|||
                {% if references|length > 0 -%}
                <|参照回答の開始|>
                ### ユーザ:
                {{ messages[0]["content"] }}

                ### 参照回答:
                {{ references[0] }}

                ### ユーザ:
                {{ messages[2]["content"] }}

                ### 参照回答:
                {{ references[1] }}

                <|参照回答の修了|>
                {% endif -%}

                <|アシスタントAとユーザの対話の開始|>

                ### ユーザ:
                {{ messages[0]["content"] }}

                ### アシスタントA:
                {{ messages[1]["content"] }}

                ### ユーザ:
                {{ messages[2]["content"] }}

                ### アシスタントA:
                {% if messages|length == 3 %}{{ lm_output }}{% else %}{{ messages[3]["content"] }}{% endif %}

                <|アシスタントAとユーザの対話の終了|>
              |||, '\n'),
            },
          },
          system_message: {
            class_path: 'Jinja2PromptTemplate',
            init_args: {
              template: std.stripChars(|||
                {% if references|length > 0 -%}
                以下に表示されるユーザの質問に対するアシスタントの応答の品質を評価してください。評価は正確さと有用性を考慮すべきです。アシスタントの回答の言語は、ユーザが使用している言語と一致しているべきで、そうでない場合は減点されるべきです。参照回答とアシスタントの回答が与えられます。ユーザの２つ目の質問に対するアシスタントの応答の品質について評価してください。あなたの評価は、アシスタントの回答と参照回答を比較することから始めてください。ミスを特定し、訂正してください。できるだけ客観的であること。評価の説明をした後、"[[rating]]"という形式で、1から10までの整数の評価値を出力してください（例 "rating：[[5]]"）。
                {%- else -%}
                以下に表示されるユーザの質問に対するアシスタントの応答の品質を公平に評価してください。評価は、応答の有用性、関連性、正確性、深さ、創造性、詳細度などの要素を考慮すべきです。アシスタントの回答の言語は、ユーザが使用している言語と一致しているべきで、そうでない場合は減点されるべきです。ユーザの２つ目の質問に対するアシスタントの応答の品質について評価してください。評価は短い説明から始めてください。できるだけ客観的であること。評価の説明をした後、"[[rating]]"という形式で、1から10までの整数の評価値を出力してください（例 "rating：[[5]]"）。
                {%- endif %}
              |||, '\n'),
            },
          },
          category_key: 'category',
        },
      },
    ],
    gen_kwargs: { max_new_tokens: 1024 },
    batch_size: 16,
  },
}

旧バージョン

途中のターンが渡らないので参照エラーで落ちる

poetry run flexeval_lm --eval_setup mt-ja.jsonnet --language_model OpenAIChatAPI --language_model.model gpt-4o-mini --save_dir ./tmp_mt_old
...

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [03:10<00:00,  2.39s/it]
...
2025-12-25 17:13:33.284 | ERROR    | flexeval.scripts.flexeval_lm:main:306 - Error in evaluation:
list object has no element 2
Traceback (most recent call last):
  File "/Users/junya.takayama/workspace/flexeval-oss/flexeval/scripts/flexeval_lm.py", line 292, in main
    metrics, outputs = eval_setup.evaluate_lm(
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/junya.takayama/workspace/flexeval-oss/flexeval/core/eval_setups.py", line 68, in evaluate_lm
    return evaluate_chat_response(
           ^^^^^^^^^^^^^^^^^^^^^^^
...
jinja2.exceptions.UndefinedError: list object has no element 2

新バージョン

🎉

poetry run flexeval_lm --eval_setup mt-ja.jsonnet --language_model OpenAIChatAPI --language_model.model gpt-4o-mini --save_dir ./tmp_mt_new

...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [03:52<00:00,  2.91s/it]
Calculating ChatLLM score: 100%|█████████████████████████████████████████████████████████████████████████████████| 80/80 [02:17<00:00,  1.72s/it]
2025-12-25 17:09:51.702 | INFO     | flexeval.core.evaluate_chat_response:evaluate_chat_response:210 - {'avg_output_length': 682.775, 'max_output_length': 2221, 'min_output_length': 13, 'llm_score': 8.0875, 'num_failed_score_parses': 0, 'llm_score/category/coding': 8.9, 'llm_score/category/extraction': 7.4, 'llm_score/category/humanities': 9.0, 'llm_score/category/math': 7.0, 'llm_score/category/reasoning': 7.3, 'llm_score/category/roleplay': 8.4, 'llm_score/category/stem': 8.4, 'llm_score/category/writing': 8.3, 'finish_reason_ratio-stop': 0.9875, 'finish_reason_ratio-length': 0.0125}
2025-12-25 17:09:51.702 | INFO     | flexeval.scripts.flexeval_lm:main:297 - Elapsed time: 370.20 sec
2025-12-25 17:09:51.703 | INFO     | flexeval.core.result_recorder.local_recorder:record_metrics:84 - Saved the metrics to tmp_mt_new/metrics.json
2025-12-25 17:09:51.706 | INFO     | flexeval.core.result_recorder.local_recorder:record_model_outputs:95 - Saved the outputs to tmp_mt_new/outputs.jsonl

junya-takayama added 5 commits September 26, 2025 16:03

Merge branch 'main' of github.com:sbintuitions/flexeval

4329a2d

Merge github.com:sbintuitions/flexeval

e15a5be

Merge branch 'main' of github.com:sbintuitions/flexeval

dc089ec

Merge branch 'main' of github.com:sbintuitions/flexeval

afea85a

use replaced 'messages' for metric.evaluate()

1b754da

junya-takayama changed the title ~~[WIP] Fix an inconsistency bug in multi-turn evaluation.~~ Fix an inconsistency bug in multi-turn evaluation. Dec 25, 2025

junya-takayama marked this pull request as ready for review December 25, 2025 08:46

junya-takayama requested a review from a team December 25, 2025 08:47

yuma-hirakawa approved these changes Dec 26, 2025

View reviewed changes

junya-takayama merged commit b947138 into main Dec 26, 2025
7 checks passed

junya-takayama deleted the fix_bug_on_multiturn branch December 26, 2025 03:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix an inconsistency bug in multi-turn evaluation. #275

Fix an inconsistency bug in multi-turn evaluation. #275

Uh oh!

junya-takayama commented Dec 25, 2025 •

edited

Loading

Uh oh!

junya-takayama commented Dec 25, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Fix an inconsistency bug in multi-turn evaluation. #275

Fix an inconsistency bug in multi-turn evaluation. #275

Uh oh!

Conversation

junya-takayama commented Dec 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

junya-takayama commented Dec 25, 2025

旧バージョン

新バージョン

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

junya-takayama commented Dec 25, 2025 •

edited

Loading