Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(llmobs): replace trace processor with event listener #11781

Merged
merged 13 commits into from
Jan 9, 2025

Conversation

Kyle-Verhoog
Copy link
Member

@Kyle-Verhoog Kyle-Verhoog commented Dec 18, 2024

The LLMObs service formerly depended on the TraceProcessor interface in the tracer. This was problematic due to sharing a dependency with the public API. As such, users could configure a trace filter (under the hood is a trace processor) and overwrite the LLMObs TraceProcessor.

Instead, the tracer can emit span start and finish events which the LLMObs service listens to and acts on, as proposed here.

The gotcha is that the LLMObs service no longer has a way to drop traces when run in agentless mode, which only LLMObs supports. Instead, we encourage users to explicitly turn off APM which carries the benefit of clarity since this was implicit before.

Checklist

  • PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy

Copy link
Contributor

github-actions bot commented Dec 18, 2024

CODEOWNERS have been resolved as:

.riot/requirements/16562eb.txt                                          @DataDog/apm-python
releasenotes/notes/fix-llmobs-processor-4afd715a84323d32.yaml           @DataDog/apm-python
ddtrace/_trace/tracer.py                                                @DataDog/apm-sdk-api-python
ddtrace/llmobs/_llmobs.py                                               @DataDog/ml-observability
ddtrace/llmobs/_utils.py                                                @DataDog/ml-observability
riotfile.py                                                             @DataDog/apm-python
tests/llmobs/conftest.py                                                @DataDog/ml-observability
tests/llmobs/test_llmobs.py                                             @DataDog/ml-observability
tests/llmobs/test_llmobs_ragas_faithfulness_evaluator.py                @DataDog/ml-observability
tests/llmobs/test_llmobs_service.py                                     @DataDog/ml-observability
tests/llmobs/test_llmobs_span_agent_writer.py                           @DataDog/ml-observability
tests/llmobs/test_llmobs_span_agentless_writer.py                       @DataDog/ml-observability
tests/llmobs/test_propagation.py                                        @DataDog/ml-observability

@Kyle-Verhoog
Copy link
Member Author

the existing tests for the trace processor are coupled to the implementation so i'm gonna open up a refactor on those first instead of trying to migrate them at the same time increasing the risk of a regression

@pr-commenter
Copy link

pr-commenter bot commented Dec 18, 2024

Benchmarks

Benchmark execution time: 2025-01-08 23:44:58

Comparing candidate commit 047e147 in PR branch llmobs/span-processor with baseline commit 6c61f40 in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 148 metrics, 2 unstable metrics.

@Kyle-Verhoog
Copy link
Member Author

done in #11784

The LLMObs service formerly depended on the TraceProcessor interface in the
tracer. This was problematic due to sharing a dependency with the public API.
As such, users could configure a trace filter (under the hood is a trace
processor) and overwrite the LLMObs TraceProcessor.

Instead, the tracer can emit span start and finish events which the LLMObs
service listens to and acts on, as proposed here.

The gotcha is that the LLMObs service no longer has a way to drop traces when
run in agentless mode, which only LLMObs supports. Instead, we encourage users
to explicitly turn off APM which carries the benefit of clarity since this was
implicit before.
@datadog-dd-trace-py-rkomorn
Copy link

datadog-dd-trace-py-rkomorn bot commented Jan 2, 2025

Datadog Report

Branch report: llmobs/span-processor
Commit report: 047e147
Test service: dd-trace-py

✅ 0 Failed, 1555 Passed, 0 Skipped, 24m 49.5s Total duration (13m 53.33s time saved)

@Yun-Kim Yun-Kim force-pushed the llmobs/span-processor branch from 93dda16 to 85b3bae Compare January 2, 2025 19:33
@Yun-Kim Yun-Kim force-pushed the llmobs/span-processor branch 2 times, most recently from bd714af to d0cc865 Compare January 6, 2025 18:38
@Yun-Kim Yun-Kim marked this pull request as ready for review January 6, 2025 18:38
@Yun-Kim Yun-Kim requested review from a team as code owners January 6, 2025 18:38
@Yun-Kim
Copy link
Contributor

Yun-Kim commented Jan 6, 2025

Instead, we encourage users to explicitly turn off APM which carries the benefit of clarity since this was implicit before.

@Kyle-Verhoog should we add a release note specifying that DD_TRACE_ENABLED should be set to false for agentless users? Or would you rather we do this implicitly for agentless customers?

@Yun-Kim Yun-Kim force-pushed the llmobs/span-processor branch from d0cc865 to 24b8c92 Compare January 6, 2025 18:50
@Kyle-Verhoog
Copy link
Member Author

yes let's add a release note for the change

@Yun-Kim Yun-Kim changed the title chore(llmobs): refactor to use span events fix(llmobs): replace trace processor with event listener Jan 6, 2025
@Yun-Kim Yun-Kim force-pushed the llmobs/span-processor branch from 4790838 to b20211e Compare January 6, 2025 21:10
@Kyle-Verhoog Kyle-Verhoog enabled auto-merge (squash) January 8, 2025 21:14
Copy link
Contributor

@mabdinur mabdinur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some questions but overall looks good to me

@Yun-Kim Yun-Kim force-pushed the llmobs/span-processor branch from f39ac5c to 047e147 Compare January 8, 2025 23:07
@Kyle-Verhoog Kyle-Verhoog merged commit d676233 into main Jan 9, 2025
608 checks passed
@Kyle-Verhoog Kyle-Verhoog deleted the llmobs/span-processor branch January 9, 2025 17:45
Yun-Kim added a commit that referenced this pull request Jan 9, 2025
Follow up of #11781 to further clean up LLMObs tests, specifically
replacing potentially flaky LLMObs span writer mocks and assertions with
the test LLMObsSpanWriter dummy class. Also clean up the
`tests/llmobs/conftest.py` file which previously contained a ton of
rarely used and sometimes redundant fixtures.

## Checklist
- [x] PR author has checked that all the criteria below are met
- The PR description includes an overview of the change
- The PR description articulates the motivation for the change
- The change includes tests OR the PR description describes a testing
strategy
- The PR description notes risks associated with the change, if any
- Newly-added code is easy to change
- The change follows the [library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
- The change includes or references documentation updates if necessary
- Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist
- [x] Reviewer has checked that all the criteria below are met 
- Title is accurate
- All changes are related to the pull request's stated goal
- Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes
- Testing strategy adequately addresses listed risks
- Newly-added code is easy to change
- Release note makes sense to a user of the library
- If necessary, author has acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment
- Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)
Yun-Kim added a commit that referenced this pull request Jan 10, 2025
Follow up on #11781 to fix a weird duplicate span writing issue with the
new listener hook logic.

Since we were registering these hooks on `LLMObs.__init__()` which also
happens at startup (as we create a default LLMObs() instance) as well as
on `LLMObs.enable()`, we were double registering these hooks, and the
default LLMObsSpanWriter was still saved and called each time the tracer
finished a span. A symptom of this issue is that if a user was to
manually enable agentless mode, they would see noisy logs indicating a
failure to send spans to the agent proxy endpoint (which is the default
writer mode) even though they also submitted spans to the agentless
endpoint succesfully.

This fix resolves the issue by moving the hook registering to
`LLMObs.enable()`, and adding corresponding logic to deregister the
hooks on `_stop_service()`. This way we should only ever have one set of
hooks registered per process.

## Checklist
- [x] PR author has checked that all the criteria below are met
- The PR description includes an overview of the change
- The PR description articulates the motivation for the change
- The change includes tests OR the PR description describes a testing
strategy
- The PR description notes risks associated with the change, if any
- Newly-added code is easy to change
- The change follows the [library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
- The change includes or references documentation updates if necessary
- Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist
- [x] Reviewer has checked that all the criteria below are met 
- Title is accurate
- All changes are related to the pull request's stated goal
- Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes
- Testing strategy adequately addresses listed risks
- Newly-added code is easy to change
- Release note makes sense to a user of the library
- If necessary, author has acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment
- Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants