[Feature] Enable Tracing Mechanism (Phase 1)#1068

Open
Mustafa974 wants to merge 20 commits into LazyAGI:main from Mustafa974:hlt/tracing

Collaborator

@Mustafa974 Mustafa974 commented Mar 24, 2026

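📌 PR Description

This PR introduces Phase 1 of the LazyLLM tracing mechanism. It delivers a minimal working end-to-end loop, plus a round of stability and maintainability fixes.
Core goals:

  • Connect the tracing data path across flows and modules
  • Automatically create, update, and end tracing spans through the hook lifecycle
  • Support exporting tracing data to Langfuse
  • Harden the hook exception-handling paths so that tracing logic does not interfere with propagation of the main business exception
  • Add tracing-related documentation and configuration options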

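✅ Main Changes

1. New tracing infrastructure

  • Add the lazyllm/tracing/ package, which provides the tracing runtime, configuration, and backend abstraction
  • Add the TracingBackend abstract interface and a LangfuseBackend implementation
  • Add tracing-related configuration options:
    • trace_enabled
    • trace_backend
    • trace_content_enabled
  • Add globals['trace'] as the carrier of tracing state for the current request / execution context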

2. Hook system enhancements

  • Add an on_error lifecycle method to LazyLLMHook
  • Add LazyTracingHook, which maintains spans across the pre_hook / post_hook / on_error / report lifecycle
  • Extract and unify the hook helper logic:
    • prepare_hooks
    • register_hooks
    • resolve_default_hooks
    • run_hooks
  • Introduce HookPhaseError as a single way to express a failure in a hook phase

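3. Flow / Module tracing integration

  • Rework LazyLLMFlowsBase.__call__
  • Rework ModuleBase.__call__
  • Inject the tracing hook into the flow / module call chain automatically by default
  • Capture both normal outputs and exceptions and write them to the tracing span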

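4. Stability fixes

  • Avoid invalid access when the tracing span handle is empty
  • Fix the mutable-object risk in the tracing context default value
  • Prevent a post_hook exception from being mistaken for a main business exception
  • Prevent on_error / report hooks from overriding the original business exception
  • Add duplicate-registration protection and earlier type validation to hook registration
  • Rename the default hook error mode from ignore to warn, which states the semantics more clearly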

5. Documentation and dependencies

  • Add tracing documentation in Chinese and English
  • Add documentation for the LazyTracingHook lifecycle methods:
    • pre_hook
    • post_hook
    • on_error
    • report
  • Add the tracing extra dependencies:
    • langfuse
    • opentelemetry-api
    • opentelemetry-sdk
    • opentelemetry-exporter-otlp-proto-http

✅ Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Refactor (no functionality change, code structure optimized)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update (changes to docs only)
  • Performance optimization

⚡ Usage After Update

# Example
def build_ppl():
    with pipeline() as ppl:
        ppl.retriever = FakeRetriever("sparse")

        ppl.reranker = FakeReranker() | bind(query=ppl.input["query"])
        ppl.formatter = (
            lambda result, query: lazyllm.kwargs(
                context_str="\n".join(doc["text"] for doc in result["docs"]),
                query=query,
            )
        ) | bind(query=ppl.input["query"])
        ppl.llm = FakeLLM()
    return ppl

lazyllm.globals["trace"] = {
    "enabled": True,
    "session_id": "trace-simple-poc-session",
    "user_id": "trace-simple-poc-user",
    "request_tags": ["poc", "simple-rag"],
    "debug_capture_payload": True,
}

ppl = build_ppl()
request = {
    "query": "测试问题33333333",
    "filters": {"category": "rag-demo"},
}
result = ppl(request)

Signed-off-by: Mustafa974 <[email protected]>
@Mustafa974 Mustafa974 requested review from a team as code owners March 24, 2026 09:53
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a foundational tracing mechanism into the LazyLLM framework, enabling developers to gain deeper insights into the execution flow of their LLM applications. By integrating with tracing backends like Langfuse, it provides a proof-of-concept for monitoring and debugging modules and pipelines. The changes include adding global configuration options for tracing, defining a new LazyTracingHook to manage span lifecycles, and enhancing the core call methods of ModuleBase and FlowBase to support these hooks and improve error reporting. This work lays the groundwork for comprehensive observability within LazyLLM.

Highlights

  • Tracing Mechanism Introduction: Implemented a proof-of-concept (POC) tracing mechanism to observe the execution of modules and pipelines, with initial integration for Langfuse.
  • Enhanced Error Handling: Modified ModuleBase.__call__() and LazyLLMFlowsBase.__call__() to incorporate try-except-finally blocks, allowing for robust error capturing and reporting via hooks.
  • Flexible Hook Management: Refactored hook registration and execution to use lists for _hooks and introduced _builtin_hooks and run_pre_hooks for more structured and extensible hook processing.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a tracing mechanism, a significant feature for observability. It integrates with Langfuse using OpenTelemetry. The core changes involve refactoring ModuleBase.__call__ and LazyLLMFlowsBase.__call__ to support hooks with robust error handling via try...except...finally blocks. A new LazyTracingHook is added to capture execution spans, inputs, outputs, and errors. The implementation is well-structured within a new lazyllm/tracing package. My review includes a suggestion to improve thread safety in the new tracing configuration module to prevent potential race conditions.

@Mustafa974 Mustafa974 requested a review from a team as a code owner March 24, 2026 10:23
@wzh1994
Contributor

wzh1994 commented Mar 30, 2026

You could wrap

lazyllm.globals["trace"] = {
    "enabled": True,
    "session_id": "trace-simple-poc-session",
    "user_id": "trace-simple-poc-user",
    "request_tags": ["poc", "simple-rag"],
    "debug_capture_payload": True,
}

in a function that calls lazyllm.globals internally. Users would then just call that function, which is clearer than manipulating globals directly.

@wzh1994
Contributor

wzh1994 commented Mar 30, 2026

What are these two values in "request_tags": ["poc", "simple-rag"]?

@Mustafa974
Collaborator Author

> What are these two values in "request_tags": ["poc", "simple-rag"]?

They are stored in the langfuse.trace.tags field, which makes it easy to filter traces / sessions by tag in the Langfuse UI.

@Mustafa974 Mustafa974 changed the title from "[Feature] Enable Tracing Mechanism (POC Phase)" to "[Feature] Enable Tracing Mechanism (Phase 1)" Apr 3, 2026
Contributor

@wzh1994 wzh1994 left a comment

Test review body (debug)

Contributor

@wzh1994 wzh1994 left a comment

PR Summary:
Purpose:
This PR introduces Phase 1 of a tracing mechanism for LazyLLM, adding infrastructure to trace execution of flows and modules via hooks and spans, along with global state support and documentation.

Files/Modules Changed:

  • lazyllm/common/globals.py: Adds a new trace={} entry to the global thread-safe dictionary, providing per-thread storage for tracing state.
  • lazyllm/configs.py: Minor reformatting — wraps the config chain in parentheses for cleaner multi-line formatting. No new config keys added in the visible diff.
  • lazyllm/docs/__init__.py: Registers the new tracing docs module so documentation is initialized alongside other modules.
  • lazyllm/docs/hook.py: Adds documentation (Chinese and English) for several new APIs: LazyLLMHook.on_error (error-handling lifecycle hook), HookPhaseError (exception raised when strict hooks fail during a phase), LazyTracingHook (the concrete tracing hook that creates/updates/finishes spans), and presumably more (diff is truncated).

The actual implementation files for LazyTracingHook, HookPhaseError, and the span/tracing infrastructure are not visible in the truncated diff but are referenced by the documentation.

Key Design Decisions:

  1. Hook lifecycle extension: The existing LazyLLMHook base class gains an on_error callback, enabling hooks to react to exceptions — important for marking spans as errored.
  2. Strict vs. non-strict hooks: HookPhaseError aggregates failures from multiple strict-mode hooks in a single phase, allowing lenient hooks to fail silently while strict ones propagate errors. This is a pragmatic trade-off between observability reliability and application resilience.
  3. Global thread-local trace state: Using the existing ThreadSafeDict globals mechanism for trace context keeps the design consistent with other cross-cutting concerns (user_id, usage, etc.) and avoids introducing a separate context-propagation mechanism.
  4. Phase 1 scope: This appears to be foundational infrastructure (hooks, spans, global state, docs) without yet wiring tracing into all flows/modules, suggesting an incremental rollout.
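
The strict versus non-strict behaviour in item 2 can be illustrated with a minimal sketch. The call shape `run_hooks(hooks, phase, *args)` matches the calls visible in the review snippets, but the function body, the `error_mode` attribute, and the `HookPhaseError` constructor below are assumptions made for illustration, not the PR's implementation.

```python
import logging
from typing import Any, List, Tuple

LOG = logging.getLogger("hook-sketch")


class HookPhaseError(Exception):
    """Sketch: collects every strict-hook failure raised during one phase."""

    def __init__(self, phase: str, failures: List[Tuple[str, BaseException]]):
        self.phase = phase
        self.failures = failures
        names = ", ".join(name for name, _ in failures)
        super().__init__(f"{len(failures)} hook(s) failed in phase '{phase}': {names}")


def run_hooks(hooks: List[Any], phase: str, *args: Any) -> None:
    """Call `phase` on each hook; log warn-mode failures, collect strict ones."""
    failures: List[Tuple[str, BaseException]] = []
    for hook in hooks:
        try:
            getattr(hook, phase)(*args)
        except Exception as exc:
            # `error_mode` is an assumed attribute: 'warn' (default) or 'strict'.
            if getattr(hook, "error_mode", "warn") == "strict":
                failures.append((type(hook).__name__, exc))
            else:
                LOG.warning("hook %s failed in %s", type(hook).__name__, phase, exc_info=True)
    if failures:
        # Raised only after every hook has had its turn.
        raise HookPhaseError(phase, failures)


class LenientHook:
    error_mode = "warn"

    def pre_hook(self, *args: Any) -> None:
        raise RuntimeError("tracing backend unreachable")


class StrictHook:
    error_mode = "strict"

    def pre_hook(self, *args: Any) -> None:
        raise ValueError("bad input")


run_hooks([LenientHook()], "pre_hook", "x")  # logged as a warning, nothing raised

caught = None
try:
    run_hooks([LenientHook(), StrictHook()], "pre_hook", "x")
except HookPhaseError as err:
    caught = err
```

Collecting failures and raising once at the end means one broken hook does not stop the remaining hooks in the same phase from running.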

Potential Risk Areas:

  1. The truncated diff hides the core tracing implementation — the actual span creation, context propagation, and hook registration logic needs careful review for thread safety, async compatibility, and performance overhead.
  2. The trace={} global is a mutable dict default; need to verify ThreadSafeDict properly deep-copies defaults per thread to avoid cross-thread contamination.
  3. HookPhaseError aggregating multiple exceptions — callers must handle this composite error type correctly, especially in existing error-handling paths that may not expect it.
  4. Performance impact of tracing hooks on hot paths (every flow/module call) should be benchmarked, even when tracing is disabled.

Findings:

  • total_issues: 22
  • exception: 6
  • logic: 6
  • type: 4
  • safety: 2
  • concurrency: 1
  • design: 1
  • performance: 1
  • style: 1

auto reviewed by BOT (claude-opus-4-6)

Contributor

@wzh1994 wzh1994 left a comment

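PR Summary:
Purpose: This PR introduces Phase 1 of a tracing mechanism for the LazyLLM framework. Through the hook system, it automatically creates, updates, and ends tracing spans during the flow/module execution lifecycle, and stores trace data in the global thread-safe dictionary.

Files changed and why:

  • lazyllm/common/globals.py: Adds a trace={} field to the global thread-safe dictionary __global_attrs__ to store tracing data for the current thread/request.
  • lazyllm/configs.py: Wraps the chained config calls in parentheses (the whole expression is enclosed in ()). This is a code-style fix with no functional change.
  • lazyllm/docs/__init__.py: Adds the tracing module to the documentation initialization import list.
  • lazyllm/docs/hook.py: Adds three sets of Chinese and English documentation for the hook system: LazyLLMHook.on_error (error-handling hook), HookPhaseError (exception class for hook phase errors), and LazyTracingHook (tracing hook).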

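Key design decisions:

  1. Reusing the hook mechanism: Tracing is not a standalone subsystem. It plugs into the existing pre_hook / post_hook lifecycle as a hook, LazyTracingHook, which keeps the design extensible and decoupled.
  2. Adding an on_error hook: An error-handling phase is added alongside the existing pre_hook/post_hook/report, so tracing can capture the status of spans whose execution failed.
  3. Introducing HookPhaseError: It distinguishes strict from non-strict hooks. In strict mode a hook failure raises an aggregated exception that carries information about every failed hook. This is an important choice of error-propagation strategy.
  4. Global trace dictionary: Trace data is stored through the existing ThreadSafeDict mechanism, with the same access pattern as usage, chat_history, and similar entries.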

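Potential risks:

  1. Truncated diff: The actual implementation of LazyTracingHook is not fully shown in the diff. Span creation/termination, parent-child span linking, and the interaction with call_stack still need to be verified.
  2. Strict mode in HookPhaseError: Check which hooks are strict by default. If the tracing hook is strict, its own failure could interrupt the business flow.
  3. Lifecycle of the trace dictionary: Confirm that trace data is cleaned up correctly after a request ends, to avoid memory leaks.
  4. Parenthesis refactor in configs.py: Although this is purely a formatting change, the call chain is long, so confirm that the closing parentheses match.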

Suggestion:

- config = _NamespaceConfig().add('mode', ...
+ config = (_NamespaceConfig().add('mode', ...
+         ))

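For the parenthesis-wrapping change in configs.py, confirm in CI that every attribute access on the config object still works, so that no subtle problem slips in through a change in operator precedence.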

Findings:

  • total_issues: 13
  • style: 4
  • maintainability: 3
  • logic: 2
  • concurrency: 1
  • design: 1
  • safety: 1
  • type: 1

auto reviewed by BOT (claude-opus-4-6)

'LazyTracingHook',

# tracing
'TracingSetupError',
Contributor

The tracing capabilities aren't in tools, are they? Are you sure they can be imported this way?

Collaborator Author

Changed to explicit imports and removed the other parts that weren't needed.

lazyllm/hook.py Outdated


def resolve_default_hooks(obj):
    trace_cfg = globals.get('trace', {})
Contributor

The hook base class should not be aware of the "trace" subclass.

Collaborator Author

Fixed.

  self._sync = False
- self._hooks = set()
+ self._hooks = []
+ register_hooks(self, resolve_default_hooks(self))
Contributor

Making it a hook means that module sits at the "top layer". The flag should be checked inside "trace", which then registers itself with flow and module. Alternatively, put the corresponding cls in a shared space, and have each flow and module fetch the cls from that space and register it on instantiation, rather than the other way round, with flow deciding which hooks to register.

Collaborator Author

Fixed.

  • Added a global provider registry in hook.py that manages all hook providers in one place (which makes it easier to add other kinds of hooks later)
  • The tracing provider is registered in tracing/hook.py (it decides dynamically, per object, whether to register the tracing hook, returning [] or [LazyTracingHook])
  • flow.py/module.py call the resolve_builtin_hooks() function from hook.py instead of deciding themselves whether to register the tracing hook

            LOG.warning('Flow on_error hook failed', exc_info=True)
        raise
    else:
        run_hooks(hook_objs, 'post_hook', r)
Contributor

Is post_hook in else or in finally? If it is in else, then when resources need to be released and an exception occurs, they can never be released.

Collaborator Author

post_hook should stay in else, where it handles post-processing on the success path (for example, finishing the span record).
Resource release, cleanup, and similar operations should go in report, which is forced to run in finally (whether or not execution succeeded).
Semantically, post_hook and report should be kept distinct within a hook. Any hooks added later should also release their resources in report.

To prevent ambiguity later on, I renamed the report function to finalize.
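
The ordering described in this reply maps onto Python's try / except / else / finally. The sketch below is a simplified illustration, not the PR's `__call__` code. `call_with_hooks` and `RecordingHook` are hypothetical names, while the phase names follow the thread (with `finalize` as the renamed `report`).

```python
from typing import Any, Callable, List

events: List[str] = []


class RecordingHook:
    """Records which lifecycle phases ran, in order."""

    def pre_hook(self, *args: Any) -> None:
        events.append("pre_hook")

    def post_hook(self, output: Any) -> None:
        events.append("post_hook")

    def on_error(self, exc: BaseException) -> None:
        events.append("on_error")

    def finalize(self) -> None:
        events.append("finalize")


def call_with_hooks(fn: Callable[..., Any], hooks: List[Any], *args: Any) -> Any:
    for h in hooks:
        h.pre_hook(*args)
    try:
        result = fn(*args)
    except Exception as exc:
        for h in hooks:
            h.on_error(exc)   # failure path only
        raise                 # the original exception keeps propagating
    else:
        for h in hooks:
            h.post_hook(result)  # success path only
        return result
    finally:
        for h in hooks:
            h.finalize()      # always runs: release resources here


ok = call_with_hooks(lambda x: x * 2, [RecordingHook()], 21)
success_events = list(events)
events.clear()


def boom(x: int) -> int:
    raise ValueError("business failure")


failure_events: List[str] = []
try:
    call_with_hooks(boom, [RecordingHook()], 21)
except ValueError:
    failure_events = list(events)
```

On success the recorded order is pre_hook, post_hook, finalize. On failure it is pre_hook, on_error, finalize, which is why cleanup belongs in the phase that runs from `finally`.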

  self._use_cache: Union[bool, str] = False
- self._hooks = set()
+ self._hooks = []
+ register_hooks(self, resolve_default_hooks(self))
Contributor

Same as flow.

Collaborator Author

fixed

            LOG.warning('Module on_error hook failed', exc_info=True)
        raise err from None
    else:
        run_hooks(hook_objs, 'post_hook', r)
Contributor

Same as flow.

lazyllm/hook.py Outdated
except StopIteration: pass


class LazyTracingHook(LazyLLMHook):
Contributor

This hook should be defined under the tracing directory.

Collaborator Author

Moved.

Labels

None yet

Projects

None yet

2 participants