
[BUG] Using o1 models with high concurrency for classification causing timeouts and recursion error #6198

Open · manibatra opened this issue Jan 29, 2025 · 1 comment · May be fixed by #6206
Labels: bug (Something isn't working)

Comments

@manibatra

Describe the bug
When using the “o1” models (which may take > 120 s to respond), tasks exceed the hardcoded timeout in the consumer method of the AsyncExecutor. This leads to repeated timeouts, re-queuing, and ultimately recursion-depth exhaustion under heavy concurrency.

To Reproduce
1. Set up an experiment with high concurrency (> 20). In my test I was running multiple experiments concurrently using asyncio.
2. Use o1 for classification with high concurrency (>20).
3. Observe that tasks eventually start timing out (exceeding the hardcoded 120s limit).
4. The tasks are repeatedly re-queued until a recursion depth error occurs.
5. Use a patched consumer function with an increased timeout to mitigate the error (see the sketch below).
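
For context, the failure mode follows roughly the pattern sketched below. This is a minimal illustration, not the actual phoenix code: the names WORKER_TIMEOUT, consumer, and process are assumptions, and the real implementation lives in phoenix/evals/executors.py.

import asyncio
from typing import Awaitable, Callable

WORKER_TIMEOUT = 120  # stands in for the hardcoded limit in the shipped consumer

async def consumer(
    queue: asyncio.Queue,
    process: Callable[[object], Awaitable[object]],
) -> None:
    # Simplified worker loop that requeues an item whenever it times out.
    while True:
        item = await queue.get()
        try:
            await asyncio.wait_for(process(item), timeout=WORKER_TIMEOUT)
        except asyncio.TimeoutError:
            # An o1 call that consistently needs > 120 s can never complete,
            # so it is requeued indefinitely. Combined with nest_asyncio's
            # re-entrant event loop, the retries eventually exhaust Python's
            # recursion limit (see the traceback below).
            print("Worker timeout, requeuing")
            await queue.put(item)
        finally:
            queue.task_done()

Raising the timeout (or making it configurable) lets long o1 calls finish instead of looping forever.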

Setup:

import nest_asyncio
nest_asyncio.apply()

...
run_experiment(
    dataset=dataset,
    task=task,
    evaluators=evaluators_to_run,
    experiment_name=experiment_name,
    experiment_description="Evaluate hallucinations with custom templates",
    dry_run=dry_run,
    concurrency=30,
)

....


def get_eval_model_rails(rails_map: OrderedDict[bool, str]):
    rails = list(rails_map.values())
    eval_model = OpenAIModel(
        model="o1",
        temperature=0.0,
        model_kwargs={
            "response_format": {
                "type": "json_schema",
                "json_schema": get_response_function(rails, with_explanation=True),
            },
            "reasoning_effort": "high",
        },
        request_timeout=30000,
    )
    return (eval_model, rails)

@create_evaluator(name="qa_correctness_evaluator", kind="LLM")
async def qa_correctness_evaluator(
    output: Optional[TaskOutput] = None,
    input: ExampleInput = {},
) -> EvaluationResult:
    df = create_queries_df(output, input)
    eval_model, rails = get_eval_model_rails(QA_PROMPT_RAILS_MAP)

    eval_result = llm_classify(
        dataframe=df,
        system_instruction="You are a helpful evaluator that classifies whether a response is correct or not based on the criteria and data provided. You only reply in valid JSON according to the schema provided",
        template=qa_prompt_template(),
        model=eval_model,
        rails=rails,
        provide_explanation=True,
        use_function_calling_if_available=False,
        include_response=True,
        concurrency=30,
    )

...
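
The patched consumer from step 5 was swapped in before running the experiments. The sketch below shows only the shape of such a monkey-patch; the method name consume and its signature are assumptions, so copy the real body from phoenix/evals/executors.py in your installed version and raise the hardcoded 120 s value there.

import phoenix.evals.executors as executors

async def patched_consume(self, *args, **kwargs):
    # Hypothetical: reproduce the original consumer body here with the
    # hardcoded timeout raised (e.g. 120 -> 600 seconds).
    ...

# Method name is an assumption; verify it against the installed source.
executors.AsyncExecutor.consume = patched_consume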

Error:

Worker timeout, requeuing
Worker timeout, requeuing
Worker timeout, requeuing
("Worker timeout, requeuing" repeated 14 times in total)
llm_classify | 0/1 (0.0%) | ⏳ 00:18<? | ?it/s
running experiment evaluations | 8/70 (11.4%) | ⏳ 04:00<28:34 | 27.66s/it
llm_classify | 0/1 (0.0%) | ⏳ 00:00<? | ?it/s
llm_classify | 1/1 (100.0%) | ⏳ 00:18<00:00 | 18.18s/it
llm_classify | 0/1 (0.0%) | ⏳ 00:00<? | ?it/s
llm_classify | 0/1 (0.0%) | ⏳ 00:00<? | ?it/s
llm_classify | 0/1 (0.0%) | ⏳ 00:00<? | ?it/s
llm_classify | 1/1 (100.0%) | ⏳ 01:59<00:00 | 30.93s/it

Traceback (most recent call last):
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/phoenix/experiments/functions.py", line 555, in async_evaluate_run
    result = await evaluator.async_evaluate(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/phoenix/experiments/evaluators/utils.py", line 187, in async_evaluate
    result = await func(*bound_signature.args, **bound_signature.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/manibatra/code/evals/app/evaluator.py", line 70, in hallucination_evaluator
    eval_result = llm_classify(
                  ^^^^^^^^^^^^^
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/phoenix/evals/classify.py", line 90, in wrapper
    return func(*bound_args.args, **bound_args.kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/phoenix/evals/classify.py", line 335, in llm_classify
    results, execution_details = executor.run(list_of_inputs)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/phoenix/evals/executors.py", line 282, in run
    return asyncio.run(self.execute(inputs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/nest_asyncio.py", line 30, in run
    return loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/nest_asyncio.py", line 98, in run_until_complete
    return f.result()
           ^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.12.7_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/futures.py", line 203, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/opt/homebrew/Cellar/[email protected]/3.12.7_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/tasks.py", line 314, in __step_run_and_handle_result
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/phoenix/evals/executors.py", line 228, in execute
    progress_bar = tqdm(
                   ^^^^^
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/tqdm/asyncio.py", line 24, in __init__
    super().__init__(iterable, *args, **kwargs)
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/tqdm/std.py", line 1098, in __init__
    self.refresh(lock_args=self.lock_args)
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/tqdm/std.py", line 1347, in refresh
    self.display()
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/tqdm/std.py", line 1495, in display
    self.sp(self.__str__() if msg is None else msg)
            ^^^^^^^^^^^^^^
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/tqdm/std.py", line 1151, in __str__
    return self.format_meter(**self.format_dict)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/tqdm/std.py", line 643, in format_meter
    return disp_trim(res, ncols) if ncols else res
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/tqdm/utils.py", line 390, in disp_trim
    if len(data) == disp_len(data):
                    ^^^^^^^^^^^^^^
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/tqdm/utils.py", line 383, in disp_len
    return _text_width(RE_ANSI.sub('', data))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/tqdm/utils.py", line 375, in _text_width
    return sum(2 if east_asian_width(ch) in 'FW' else 1 for ch in str(s))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RecursionError: maximum recursion depth exceeded

The above exception was the direct cause of the following exception:

RuntimeError: evaluator failed for example id 'RGF0YXNldEV4YW1wbGU6MzM4', repetition 1

Expected behavior
Long-running concurrent tasks (especially “o1” model calls) should be allowed to complete even when a single call takes longer than the 120 s worker timeout.
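
One possible shape for a fix, sketched under the assumption that the executor's worker loop looks like the consumer above (the class below is illustrative, not phoenix's actual AsyncExecutor, and not necessarily what #6206 implements), is to expose the timeout as a parameter instead of hardcoding it:

import asyncio
from typing import Awaitable, Callable

class ConfigurableTimeoutExecutor:
    # Same requeue loop as the earlier sketch, but the timeout is a
    # constructor argument, so callers running o1 can pass e.g. 600.0.
    def __init__(
        self,
        process: Callable[[object], Awaitable[object]],
        timeout: float = 120.0,
    ) -> None:
        self.process = process
        self.timeout = timeout

    async def consume(self, queue: asyncio.Queue) -> None:
        while True:
            item = await queue.get()
            try:
                await asyncio.wait_for(self.process(item), timeout=self.timeout)
            except asyncio.TimeoutError:
                await queue.put(item)  # requeue, but now the deadline is tunable
            finally:
                queue.task_done()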

Screenshots
Error log above.

Environment:
• OS: macOS 15.2 (Apple Silicon)
• Notebook Runtime: Python script
• Browser: N/A
• Version: arize-phoenix=7.9.1, python=3.12.7

@manibatra manibatra added bug Something isn't working triage issues that need triage labels Jan 29, 2025
@github-project-automation github-project-automation bot moved this to 📘 Todo in phoenix Jan 29, 2025
@anticorrelator
Contributor

@manibatra thanks for the thorough report and investigation! I'll look into this shortly :)

@anticorrelator anticorrelator self-assigned this Jan 30, 2025
@anticorrelator anticorrelator moved this from 📘 Todo to 🔍. Needs Review in phoenix Jan 30, 2025
@mikeldking mikeldking removed the triage issues that need triage label Jan 30, 2025