Describe the bug
When using the “o1” models (which may take > 120s to respond), tasks exceed the hardcoded timeout in the consumer method of the AsyncExecutor. This leads to repeated timeouts, re-queuing, and ultimately a recursion depth exhaustion error under heavy concurrency.
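For context, the failure pattern reduces to something like the following. This is a minimal standalone sketch, not Phoenix's actual executor code; the timeout is shortened from 120s to 1s and the model call is simulated so the behavior is easy to see:

import asyncio

async def slow_model_call() -> str:
    # Stands in for an o1 request that takes longer than the consumer timeout.
    await asyncio.sleep(5)
    return "done"

async def consumer(queue: asyncio.Queue, max_attempts: int = 3) -> None:
    for _ in range(max_attempts):
        item = await queue.get()
        try:
            # A fixed wait_for timeout (120s in the real executor, 1s here)
            # cancels the call before the model can respond...
            print(await asyncio.wait_for(slow_model_call(), timeout=1))
            return
        except asyncio.TimeoutError:
            print("Worker timeout, requeuing")
            # ...so the item goes back on the queue and times out again.
            await queue.put(item)

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    await queue.put("example input")
    await consumer(queue)

asyncio.run(main())

Because the call is cancelled and re-queued on every attempt, a response that always takes longer than the fixed timeout can never complete.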
To Reproduce
1. Set up an experiment with high concurrency (> 20). In my test I was running multiple experiments concurrently using asyncio.
2. Use o1 for classification with high concurrency (>20).
3. Observe that tasks eventually start timing out (exceeding the hardcoded 120s limit).
4. The tasks are repeatedly re-queued until a recursion depth error occurs.
5. Use a patched consumer function with an increased timeout to mitigate the error (sketched after this list).
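The patch in step 5 was a monkey-patch along these lines. This is only a rough sketch: the timeout value is my own choice, and the patched consumer body is just a copy of the consumer coroutine from phoenix/evals/executors.py (the module in the traceback below) with the hardcoded 120s raised, so it is elided here:

from phoenix.evals.executors import AsyncExecutor

PATCHED_TIMEOUT = 600  # seconds; long enough for o1 responses

async def patched_consumer(self, *args, **kwargs):
    # ... verbatim copy of AsyncExecutor.consumer, with the hardcoded
    # 120 second timeout passed to asyncio.wait_for replaced by
    # PATCHED_TIMEOUT ...
    ...

# Rebind before calling run_experiment / llm_classify so the executor uses it.
AsyncExecutor.consumer = patched_consumer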
Setup:
import nest_asyncio
nest_asyncio.apply()
...
run_experiment(
    dataset=dataset,
    task=task,
    evaluators=evaluators_to_run,
    experiment_name=experiment_name,
    experiment_description="Evaluate hallucinations with custom templates",
    dry_run=dry_run,
    concurrency=30,
)
....
def get_eval_model_rails(rails_map: OrderedDict[bool, str]):
    rails = list(rails_map.values())
    eval_model = OpenAIModel(
        model="o1",
        temperature=0.0,
        model_kwargs={
            "response_format": {
                "type": "json_schema",
                "json_schema": get_response_function(rails, with_explanation=True),
            },
            "reasoning_effort": "high",
        },
        request_timeout=30000,
    )
    return (eval_model, rails)
@create_evaluator(name="qa_correctness_evaluator", kind="LLM")
async def qa_correctness_evaluator(
    output: Optional[TaskOutput] = None,
    input: ExampleInput = {},
) -> EvaluationResult:
    df = create_queries_df(output, input)
    eval_model, rails = get_eval_model_rails(QA_PROMPT_RAILS_MAP)
    eval_result = llm_classify(
        dataframe=df,
        system_instruction="You are a helpful evaluator that classifies whether a response is correct or not based on the criteria and data provided. You only reply in valid JSON according to the schema provided",
        template=qa_prompt_template(),
        model=eval_model,
        rails=rails,
        provide_explanation=True,
        use_function_calling_if_available=False,
        include_response=True,
        concurrency=30,
    )
...
Error:
Worker timeout, requeuing
Worker timeout, requeuing
Worker timeout, requeuing
Worker timeout, requeuing
Worker timeout, requeuing
Worker timeout, requeuing
Worker timeout, requeuing
Worker timeout, requeuing
Worker timeout, requeuing
Worker timeout, requeuing
Worker timeout, requeuing
Worker timeout, requeuing
Worker timeout, requeuing
Worker timeout, requeuing
llm_classify || 0/1 (0.0%) | ⏳ 00:18<?|?it/s
running experiment evaluations |█████████████████▉ | 8/70 (11.4%) | ⏳ 04:00<28:34 | 27.66s/it
llm_classify || 0/1 (0.0%) | ⏳ 00:00<?|?it/s
llm_classify |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 (100.0%) | ⏳ 00:18<00:00 | 18.18s/it
llm_classify || 0/1 (0.0%) | ⏳ 00:00<?|?it/s
llm_classify || 0/1 (0.0%) | ⏳ 00:00<?|?it/s
llm_classify || 0/1 (0.0%) | ⏳ 00:00<?|?it/s
llm_classify |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 (100.0%) | ⏳ 01:59<00:00 | 30.93s/it
Traceback (most recent call last):
File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/phoenix/experiments/functions.py", line 555, in async_evaluate_run
result = await evaluator.async_evaluate(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/phoenix/experiments/evaluators/utils.py", line 187, in async_evaluate
result = await func(*bound_signature.args, **bound_signature.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/manibatra/code/evals/app/evaluator.py", line 70, in hallucination_evaluator
eval_result = llm_classify(
^^^^^^^^^^^^^
File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/phoenix/evals/classify.py", line 90, in wrapper
return func(*bound_args.args, **bound_args.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/phoenix/evals/classify.py", line 335, in llm_classify
results, execution_details = executor.run(list_of_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/phoenix/evals/executors.py", line 282, in run
return asyncio.run(self.execute(inputs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/nest_asyncio.py", line 30, in run
return loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/nest_asyncio.py", line 98, in run_until_complete
return f.result()
^^^^^^^^^^
File "/opt/homebrew/Cellar/[email protected]/3.12.7_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/futures.py", line 203, in result
raise self._exception.with_traceback(self._exception_tb)
File "/opt/homebrew/Cellar/[email protected]/3.12.7_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/tasks.py", line 314, in __step_run_and_handle_result
result = coro.send(None)
^^^^^^^^^^^^^^^
File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/phoenix/evals/executors.py", line 228, in execute
progress_bar = tqdm(
^^^^^
File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/tqdm/asyncio.py", line 24, in __init__
super().__init__(iterable, *args, **kwargs)
File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/tqdm/std.py", line 1098, in __init__
self.refresh(lock_args=self.lock_args)
File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/tqdm/std.py", line 1347, in refresh
self.display()
File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/tqdm/std.py", line 1495, in display
self.sp(self.__str__() if msg is None else msg)
^^^^^^^^^^^^^^
File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/tqdm/std.py", line 1151, in __str__
return self.format_meter(**self.format_dict)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/tqdm/std.py", line 643, in format_meter
return disp_trim(res, ncols) if ncols else res
^^^^^^^^^^^^^^^^^^^^^
File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/tqdm/utils.py", line 390, in disp_trim
if len(data) == disp_len(data):
^^^^^^^^^^^^^^
File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/tqdm/utils.py", line 383, in disp_len
return _text_width(RE_ANSI.sub('', data))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/tqdm/utils.py", line 375, in _text_width
return sum(2 if east_asian_width(ch) in 'FW' else 1 for ch in str(s))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RecursionError: maximum recursion depth exceeded
The above exception was the direct cause of the following exception:
RuntimeError: evaluator failed for example id 'RGF0YXNldEV4YW1wbGU6MzM4', repetition 1
Expected behavior
Longer-running concurrent tasks (especially “o1” model calls) should finish even when they take longer than 120s to respond.
Screenshots
Error log above.
Environment:
• OS: macOS 15.2 (Apple Silicon)
• Notebook Runtime: Python script
• Browser: N/A
• Version: arize-phoenix=7.9.1, python=3.12.7