
[BUG] Using o1 models with high concurrency for classification causing timeouts and recursion error #6198

Open · manibatra opened this issue Jan 29, 2025 · 1 comment · May be fixed by #6206
Labels: bug (Something isn't working)

Comments

@manibatra

Describe the bug
When using the “o1” models (which may take > 120 s to respond), tasks exceed the hardcoded timeout in the consumer method of the AsyncExecutor. This leads to repeated timeouts, re-queuing, and ultimately recursion-depth exhaustion under heavy concurrency.

To Reproduce
1. Set up an experiment with high concurrency (> 20). In my test I was running multiple experiments concurrently using asyncio.
2. Use o1 for classification with high concurrency (>20).
3. Observe that tasks eventually start timing out (exceeding the hardcoded 120s limit).
4. The tasks are repeatedly re-queued until a recursion depth error occurs.
5. Use a patched consumer function with an increased timeout to mitigate the error (see the sketch below).
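
For context, the failure mode follows roughly the pattern sketched below. This is a minimal illustration, not the actual phoenix code: the names WORKER_TIMEOUT, consumer, and process are assumptions, and the real implementation lives in phoenix/evals/executors.py.

import asyncio
from typing import Awaitable, Callable

WORKER_TIMEOUT = 120  # stands in for the hardcoded limit in the shipped consumer

async def consumer(
    queue: asyncio.Queue,
    process: Callable[[object], Awaitable[object]],
) -> None:
    # Simplified worker loop that requeues an item whenever it times out.
    while True:
        item = await queue.get()
        try:
            await asyncio.wait_for(process(item), timeout=WORKER_TIMEOUT)
        except asyncio.TimeoutError:
            # An o1 call that consistently needs > 120 s can never complete,
            # so it is requeued indefinitely. Combined with nest_asyncio's
            # re-entrant event loop, the retries eventually exhaust Python's
            # recursion limit (see the traceback below).
            print("Worker timeout, requeuing")
            await queue.put(item)
        finally:
            queue.task_done()

Raising the timeout (or making it configurable) lets long o1 calls finish instead of looping forever.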

Setup:

import nest_asyncio
nest_asyncio.apply()

...
run_experiment(
    dataset=dataset,
    task=task,
    evaluators=evaluators_to_run,
    experiment_name=experiment_name,
    experiment_description="Evaluate hallucinations with custom templates",
    dry_run=dry_run,
    concurrency=30,
)

....


def get_eval_model_rails(rails_map: OrderedDict[bool, str]):
    rails = list(rails_map.values())
    eval_model = OpenAIModel(
        model="o1",
        temperature=0.0,
        model_kwargs={
            "response_format": {
                "type": "json_schema",
                "json_schema": get_response_function(rails, with_explanation=True),
            },
            "reasoning_effort": "high",
        },
        request_timeout=30000,
    )
    return (eval_model, rails)

@create_evaluator(name="qa_correctness_evaluator", kind="LLM")
async def qa_correctness_evaluator(
    output: Optional[TaskOutput] = None,
    input: ExampleInput = {},
) -> EvaluationResult:
    df = create_queries_df(output, input)
    eval_model, rails = get_eval_model_rails(QA_PROMPT_RAILS_MAP)

    eval_result = llm_classify(
        dataframe=df,
        system_instruction="You are a helpful evaluator that classifies whether a response is correct or not based on the criteria and data provided. You only reply in valid JSON according to the schema provided",
        template=qa_prompt_template(),
        model=eval_model,
        rails=rails,
        provide_explanation=True,
        use_function_calling_if_available=False,
        include_response=True,
        concurrency=30,
    )

...
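
The patched consumer from step 5 was swapped in before running the experiments. The sketch below shows only the shape of such a monkey-patch; the method name consume and its signature are assumptions, so copy the real body from phoenix/evals/executors.py in your installed version and raise the hardcoded 120 s value there.

import phoenix.evals.executors as executors

async def patched_consume(self, *args, **kwargs):
    # Hypothetical: reproduce the original consumer body here with the
    # hardcoded timeout raised (e.g. 120 -> 600 seconds).
    ...

# Method name is an assumption; verify it against the installed source.
executors.AsyncExecutor.consume = patched_consume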

Error:

Worker timeout, requeuing
Worker timeout, requeuing
Worker timeout, requeuing
("Worker timeout, requeuing" repeated 14 times in total)
llm_classify | 0/1 (0.0%) | ⏳ 00:18<? | ?it/s
running experiment evaluations | 8/70 (11.4%) | ⏳ 04:00<28:34 | 27.66s/it
llm_classify | 0/1 (0.0%) | ⏳ 00:00<? | ?it/s
llm_classify | 1/1 (100.0%) | ⏳ 00:18<00:00 | 18.18s/it
llm_classify | 0/1 (0.0%) | ⏳ 00:00<? | ?it/s
llm_classify | 0/1 (0.0%) | ⏳ 00:00<? | ?it/s
llm_classify | 0/1 (0.0%) | ⏳ 00:00<? | ?it/s
llm_classify | 1/1 (100.0%) | ⏳ 01:59<00:00 | 30.93s/it

Traceback (most recent call last):
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/phoenix/experiments/functions.py", line 555, in async_evaluate_run
    result = await evaluator.async_evaluate(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/phoenix/experiments/evaluators/utils.py", line 187, in async_evaluate
    result = await func(*bound_signature.args, **bound_signature.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/manibatra/code/evals/app/evaluator.py", line 70, in hallucination_evaluator
    eval_result = llm_classify(
                  ^^^^^^^^^^^^^
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/phoenix/evals/classify.py", line 90, in wrapper
    return func(*bound_args.args, **bound_args.kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/phoenix/evals/classify.py", line 335, in llm_classify
    results, execution_details = executor.run(list_of_inputs)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/phoenix/evals/executors.py", line 282, in run
    return asyncio.run(self.execute(inputs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/nest_asyncio.py", line 30, in run
    return loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/nest_asyncio.py", line 98, in run_until_complete
    return f.result()
           ^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.12.7_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/futures.py", line 203, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/opt/homebrew/Cellar/[email protected]/3.12.7_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/tasks.py", line 314, in __step_run_and_handle_result
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/phoenix/evals/executors.py", line 228, in execute
    progress_bar = tqdm(
                   ^^^^^
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/tqdm/asyncio.py", line 24, in __init__
    super().__init__(iterable, *args, **kwargs)
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/tqdm/std.py", line 1098, in __init__
    self.refresh(lock_args=self.lock_args)
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/tqdm/std.py", line 1347, in refresh
    self.display()
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/tqdm/std.py", line 1495, in display
    self.sp(self.__str__() if msg is None else msg)
            ^^^^^^^^^^^^^^
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/tqdm/std.py", line 1151, in __str__
    return self.format_meter(**self.format_dict)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/tqdm/std.py", line 643, in format_meter
    return disp_trim(res, ncols) if ncols else res
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/tqdm/utils.py", line 390, in disp_trim
    if len(data) == disp_len(data):
                    ^^^^^^^^^^^^^^
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/tqdm/utils.py", line 383, in disp_len
    return _text_width(RE_ANSI.sub('', data))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/manibatra/code/evals/.venv/lib/python3.12/site-packages/tqdm/utils.py", line 375, in _text_width
    return sum(2 if east_asian_width(ch) in 'FW' else 1 for ch in str(s))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RecursionError: maximum recursion depth exceeded

The above exception was the direct cause of the following exception:

RuntimeError: evaluator failed for example id 'RGF0YXNldEV4YW1wbGU6MzM4', repetition 1

Expected behavior
Long-running concurrent tasks (especially “o1” model calls) should be allowed to complete even when a single call takes longer than the 120 s worker timeout.
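
One possible shape for a fix, sketched under the assumption that the executor's worker loop looks like the consumer above (the class below is illustrative, not phoenix's actual AsyncExecutor, and not necessarily what #6206 implements), is to expose the timeout as a parameter instead of hardcoding it:

import asyncio
from typing import Awaitable, Callable

class ConfigurableTimeoutExecutor:
    # Same requeue loop as the earlier sketch, but the timeout is a
    # constructor argument, so callers running o1 can pass e.g. 600.0.
    def __init__(
        self,
        process: Callable[[object], Awaitable[object]],
        timeout: float = 120.0,
    ) -> None:
        self.process = process
        self.timeout = timeout

    async def consume(self, queue: asyncio.Queue) -> None:
        while True:
            item = await queue.get()
            try:
                await asyncio.wait_for(self.process(item), timeout=self.timeout)
            except asyncio.TimeoutError:
                await queue.put(item)  # requeue, but now the deadline is tunable
            finally:
                queue.task_done()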

Screenshots
Error log above.

Environment:
• OS: macOS 15.2 (Apple Silicon)
• Notebook Runtime: Python script
• Browser: N/A
• Version: arize-phoenix=7.9.1, python=3.12.7

@manibatra manibatra added bug Something isn't working triage issues that need triage labels Jan 29, 2025
@github-project-automation github-project-automation bot moved this to 📘 Todo in phoenix Jan 29, 2025
@anticorrelator
Contributor

@manibatra thanks for the thorough report and investigation! I'll look into this shortly :)

@anticorrelator anticorrelator self-assigned this Jan 30, 2025
@anticorrelator anticorrelator moved this from 📘 Todo to 🔍. Needs Review in phoenix Jan 30, 2025
@mikeldking mikeldking removed the triage issues that need triage label Jan 30, 2025