[RFC]: Lumilake 2026 Q2 Roadmap #7

@timzsu

Motivation

This RFC outlines the development roadmap for Lumilake in Q2 2026. We are gathering ideas and feedback.

Proposed change

Code health & security

  • CI improvement (tracked in #8, "ci(unit-tests): collect coverage and stop skipping tests/server")

    • Add pytest-cov to CI to collect and report test coverage.
    • Do not ignore tests/server/ in Pytest CI.
  • Harden untrusted-input and failure paths

    • Tighten the try-except blocks (runtime/server.py:303,314,322,648,693, parser/n8n.py:457) so hook failures are surfaced instead of being silently swallowed.
    • Replace regex-based SQL table extraction in runtime_graph.py:1529–1533 and 909–911 with a parser or explicit table_name param (injection risk).
    • Sanitize request_id / batch_id before joining into archive paths in runtime_manager/flowmesh.py:113 (path traversal).
    • Strip directory components in _artifact_name_from_uri() (routes/jobs.py:673–685) using Path(name).name.
    • Add max-body-size and YAML/JSON depth limits before parsing in routes/jobs.py:140–154 (DoS via a 100 MB or deeply nested payload).
    • Fail fast in parser/n8n.py:120–124 when a node has no type instead of silently passing the filter.
    • Add an explicit iteration cap to the n8n topo-sort loop (parser/n8n.py:165–268) so a progress-tracking bug can't hang the parser.
  • Tighten runtime code quality

    • Remove getattr for known-optional fields in halo_dp.py; replace with typed Optional fields.
    • Unify duplicated parameter resolution for _resolve_data_retrieval_params() in runtime_graph.py:848–889 vs 980–1018.
    • Unify YAML and n8n parsers behind one IR; require explicit FormatOp instead of invisible auto-wrap.
    • Guard _build_candidate_pool against the item_map[workflow_id] race when items are dequeued between selection and access (priority_queue.py:248).
    • Assert / log when finalize_workflows() pops a missing workflow id (priority_queue.py:161–163).
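
The two path-handling items above (sanitizing request_id / batch_id and stripping directory components with Path(name).name) could look roughly like the sketch below. The helper names safe_id, archive_path, and artifact_name, the allowed-character pattern, and the 128-character limit are assumptions for illustration, not the current code in runtime_manager/flowmesh.py or routes/jobs.py.

```python
import re
from pathlib import Path

# Assumed policy: IDs start with an alphanumeric character and contain only
# letters, digits, '.', '_' and '-', up to 128 characters in total.
_SAFE_ID = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]{0,127}")


def safe_id(value: str) -> str:
    """Return value unchanged if it is a safe single path component."""
    if not _SAFE_ID.fullmatch(value):
        raise ValueError(f"unsafe identifier: {value!r}")
    return value


def archive_path(root: Path, request_id: str, batch_id: str) -> Path:
    """Join validated IDs under root and confirm the result stays inside it."""
    root = root.resolve()
    candidate = (root / safe_id(request_id) / safe_id(batch_id)).resolve()
    # Defense in depth: reject anything that resolves outside root
    # (Path.is_relative_to needs Python 3.9+).
    if not candidate.is_relative_to(root):
        raise ValueError(f"path escapes archive root: {candidate}")
    return candidate


def artifact_name(uri_path: str) -> str:
    """Keep only the final path component of an artifact URI path."""
    name = Path(uri_path).name
    if name in ("", ".", ".."):
        raise ValueError(f"no usable artifact name in {uri_path!r}")
    return name
```

Validating against an allow-list first and then checking the resolved path keeps the two concerns separate: the pattern rejects separators and "..", while the containment check guards against anything the pattern misses.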

Usability — SDK, CLI, docs, errors

Dependencies

Generalizability

  • Decouple the runtime graph builder from vllm / transformers / diffusers / omni + HF assumptions
    • Introduce a ModelRegistry / backend-strategy seam in runtime/runtime_graph.py (≈L478–519).
    • Replace free-form data_spec / model_spec / inference_spec dicts (≈L1148–1160) with Pydantic discriminated unions per (backend, task_type).
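
One possible shape for the ModelRegistry / backend-strategy seam is sketched below using only the standard library. Everything here is hypothetical: the ModelSpec fields, the register decorator, the string returned by the builder, and the "vllm" / "text-generation" key are placeholders, and a real version would likely carry the Pydantic specs proposed above rather than a plain dataclass.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical typed stand-in for the free-form model_spec dict."""
    backend: str
    task_type: str
    model_id: str


# A builder turns a spec into whatever the graph builder needs; a string
# stands in for a real graph node in this sketch.
BuilderFn = Callable[[ModelSpec], str]


class ModelRegistry:
    """Maps (backend, task_type) to a builder so the graph code stays generic."""

    def __init__(self) -> None:
        self._builders: Dict[Tuple[str, str], BuilderFn] = {}

    def register(self, backend: str, task_type: str) -> Callable[[BuilderFn], BuilderFn]:
        def decorator(fn: BuilderFn) -> BuilderFn:
            key = (backend, task_type)
            if key in self._builders:
                raise ValueError(f"builder already registered for {key}")
            self._builders[key] = fn
            return fn
        return decorator

    def build(self, spec: ModelSpec) -> str:
        key = (spec.backend, spec.task_type)
        if key not in self._builders:
            raise KeyError(f"no builder registered for {key}")
        return self._builders[key](spec)


registry = ModelRegistry()


@registry.register("vllm", "text-generation")
def build_vllm_text(spec: ModelSpec) -> str:
    # A backend-specific import (e.g. vllm) would live here, not in the graph builder.
    return f"vllm-node:{spec.model_id}"
```

With a seam like this, the graph builder would call registry.build(spec) and never import a backend directly, so adding a backend means registering one more function.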

Performance

  • Job manager - Priority queue fairness

    • Bring _apply_user_fairness() down from O(N²) by precomputing a user_to_ids map (priority_queue.py:390–441).
    • Track oldest enqueue timestamp incrementally instead of scanning every queue on get_pending_stats() (priority_queue.py:143–156).
  • Query Optimizer

    • Skip the redundant topo sort in graph rewriting and memoize remap() on graph prefixing (runtime_graph.py:167–237).
    • Canonicalize + intern state tuples in Halo-DP to avoid blow-up on deep graphs (runtime/optimizer/schedule/halo_dp.py).
  • Storage and scheduling I/O

    • Batch S3 artifact writes (tar/zip + multipart) instead of one stat + get/put per artifact (utils/job_storage.py:35–72).
    • Replace the LUMILAKE_POLL_INTERVAL_SECONDS sleep loop with an event / condition variable for worker availability (runtime/server.py:673–747).

Alternatives considered

No response

Migration / compatibility

No response

Feedback period

No response

CC list

No response

Before submitting

  • I have searched existing issues and confirmed this is not a duplicate.
