Refactor API layer and add real async job progress for training requests #1

Description

@ISEEKYAN

Background

While reading the current server implementation, I noticed two areas that may become increasingly hard to maintain as the API surface grows.

1. app.py is becoming too large and repetitive

src/verl_mint/app.py currently contains a large amount of route registration, request/response adaptation, compatibility handling, in-memory state management, checkpoint bookkeeping, future wrapping, and helper logic in one file.
A lot of the route code is also boilerplate-like:

  • duplicated /api/v1/* and legacy route bindings
  • repeated conversion between Pydantic schemas and backend/service payloads
  • repeated future wrapping
  • repeated checkpoint/session/sampler response shaping
  • version-specific behavior based on route path checks

This makes the API layer harder to review and harder to extend safely.
Suggested direction:

  • Split app.py into smaller route modules by domain, such as sessions, models, training, sampling, checkpoints, rollouts, futures, etc.
  • Move compatibility-specific response shaping into presenter/adapter helpers.
  • Consider defining repetitive endpoint mappings in a declarative config, then auto-registering or generating simple route handlers where possible.
  • Keep route handlers thin: validate request, call service layer, format response.
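
As a rough illustration of the declarative-mapping idea above, here is a minimal, framework-free sketch. Everything in it is hypothetical: `EndpointSpec`, `Router`, `present`, `register`, and the `create_session` service function are made-up names, and the "versioned routes return an envelope, legacy routes return a flat body" rule is an assumed stand-in for whatever compatibility shaping app.py actually does today.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Payload = Dict[str, Any]
ServiceCall = Callable[[Payload], Payload]


@dataclass(frozen=True)
class EndpointSpec:
    name: str                  # path suffix shared by both bindings
    service_call: ServiceCall  # service-layer function to invoke
    legacy: bool = True        # also expose the unversioned legacy path


class Router:
    """Minimal stand-in for a web framework router (hypothetical)."""

    def __init__(self) -> None:
        self.routes: Dict[str, ServiceCall] = {}

    def add(self, path: str, handler: ServiceCall) -> None:
        if path in self.routes:
            raise ValueError(f"duplicate route: {path}")
        self.routes[path] = handler

    def dispatch(self, path: str, payload: Payload) -> Payload:
        return self.routes[path](payload)


def present(result: Payload, versioned: bool) -> Payload:
    # Assumed compatibility rule, for illustration only: versioned routes
    # wrap the result in an envelope, legacy routes return it flat.
    return {"data": result} if versioned else result


def make_handler(spec: EndpointSpec, versioned: bool) -> ServiceCall:
    def handler(payload: Payload) -> Payload:
        # Thin handler: call the service layer, then format the response.
        return present(spec.service_call(payload), versioned)
    return handler


def register(router: Router, specs: List[EndpointSpec]) -> None:
    # One table entry yields both the /api/v1/* and the legacy binding.
    for spec in specs:
        router.add(f"/api/v1/{spec.name}", make_handler(spec, versioned=True))
        if spec.legacy:
            router.add(f"/{spec.name}", make_handler(spec, versioned=False))


# Hypothetical service-layer function.
def create_session(payload: Payload) -> Payload:
    return {"session_id": f"sess-{payload['user']}"}


router = Router()
register(router, [EndpointSpec("create_session", create_session)])
```

With this shape, adding an endpoint is one new `EndpointSpec` entry, and the version-specific behavior lives in `present` instead of in per-route path checks. Request validation is left out of the sketch; in the real server it would presumably stay with the existing Pydantic schemas.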

2. Core training APIs appear synchronous and lack progress visibility

For core training operations such as forward_backward_ppo, the current HTTP handler appears to execute the backend training call synchronously, then only after completion wraps the completed result into a future-like request_id.
That means the client experience is roughly:

  1. POST /forward_backward_ppo
  2. server blocks until training finishes
  3. server returns request_id
  4. client calls retrieve_future
  5. result is usually already complete

This preserves a future-shaped API, but it is not a true async job model. For long-running training operations, the first request may hang for a long time with no progress information, which can feel risky or confusing to users.
Suggested direction:

  • Change long-running training endpoints to submit a background job and return a token/request ID immediately.
  • Let clients poll by token to retrieve job status.
  • Expose useful status fields, for example:
    • queued / running / succeeded / failed / cancelled
    • progress percentage or completed steps / total steps
    • current phase
    • latest metrics
    • error message if failed
    • timestamps
  • Keep final result retrieval compatible with the current future semantics where possible.
  • Optionally support cancellation and server-side timeout handling.
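
To make the suggested job model concrete, here is a minimal in-memory sketch using only the standard library. It is not based on the current verl_mint code: `JobManager`, `JobState`, the `report` callback, and `fake_forward_backward` are hypothetical names, and a real implementation would also need persistence, cancellation, and timeout handling, which are omitted here.

```python
import threading
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Callable, Dict, Optional

# Signature of the progress callback: (completed_steps, phase, metrics).
Report = Callable[[int, str, Dict[str, float]], None]


@dataclass
class JobState:
    request_id: str
    status: str = "queued"  # queued / running / succeeded / failed
    completed_steps: int = 0
    total_steps: int = 0
    phase: str = ""
    metrics: Dict[str, float] = field(default_factory=dict)
    error: Optional[str] = None
    result: Any = None
    created_at: float = field(default_factory=time.time)
    finished_at: Optional[float] = None


class JobManager:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._jobs: Dict[str, JobState] = {}
        self._threads: Dict[str, threading.Thread] = {}

    def submit(self, fn: Callable[[Report], Any], total_steps: int) -> str:
        """Start fn in a background thread and return its ID immediately."""
        job = JobState(request_id=uuid.uuid4().hex, total_steps=total_steps)
        thread = threading.Thread(target=self._run, args=(job, fn), daemon=True)
        with self._lock:
            self._jobs[job.request_id] = job
            self._threads[job.request_id] = thread
        thread.start()
        return job.request_id

    def _run(self, job: JobState, fn: Callable[[Report], Any]) -> None:
        def report(step: int, phase: str, metrics: Dict[str, float]) -> None:
            with self._lock:
                job.completed_steps = step
                job.phase = phase
                job.metrics = dict(metrics)

        with self._lock:
            job.status = "running"
        try:
            result = fn(report)
            with self._lock:
                job.result = result
                job.status = "succeeded"
        except Exception as exc:  # surface the failure to polling clients
            with self._lock:
                job.error = str(exc)
                job.status = "failed"
        finally:
            with self._lock:
                job.finished_at = time.time()

    def status(self, request_id: str) -> Dict[str, Any]:
        """Return a snapshot that a polling endpoint could serialize."""
        with self._lock:
            return asdict(self._jobs[request_id])

    def wait(self, request_id: str, timeout: Optional[float] = None) -> Dict[str, Any]:
        """Block until the job thread ends or the timeout elapses."""
        self._threads[request_id].join(timeout)
        return self.status(request_id)


# Hypothetical stand-in for a backend call such as forward_backward_ppo.
def fake_forward_backward(report: Report) -> Dict[str, float]:
    loss = 1.0
    for step in range(1, 4):
        loss = 1.0 / step
        report(step, "forward_backward", {"loss": loss})
    return {"loss": loss}
```

In this sketch the POST handler would call `submit` and return the `request_id` right away, a status endpoint would return `status(request_id)`, and the existing `retrieve_future` path could map onto `wait` or onto polling until `status` is `succeeded` or `failed`. A thread is used only to keep the example self-contained; a process pool or an external queue may be a better fit for real training workloads.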

Why this matters

These two issues are connected: as more training operations become long-running and stateful, keeping all route/state/future logic inside app.py will make it harder to provide robust async behavior, progress reporting, retries, cancellation, and persistence.
A cleaner split between API routing, job management, progress state, and backend execution would make the system easier to evolve.
