Refactor API layer and add real async job progress for training requests


## Background
While reading the current server implementation, I noticed two areas that may become increasingly hard to maintain as the API surface grows.
## 1. `app.py` is becoming too large and repetitive
`src/verl_mint/app.py` currently combines route registration, request/response adaptation, compatibility handling, in-memory state management, checkpoint bookkeeping, future wrapping, and helper logic in a single file.
Much of the route code is also boilerplate:
- duplicated `/api/v1/*` and legacy route bindings
- repeated conversion between Pydantic schemas and backend/service payloads
- repeated future wrapping
- repeated checkpoint/session/sampler response shaping
- version-specific behavior based on route path checks
This makes the API layer harder to review and harder to extend safely.
Suggested direction:
- Split `app.py` into smaller route modules by domain, such as sessions, models, training, sampling, checkpoints, rollouts, futures, etc.
- Move compatibility-specific response shaping into presenter/adapter helpers.
- Consider defining repetitive endpoint mappings in a declarative config, then auto-registering or generating simple route handlers where possible.
- Keep route handlers thin: validate request, call service layer, format response.
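To make the declarative idea concrete, here is a minimal, framework-agnostic sketch. Everything in it is an assumption for illustration: `EndpointSpec`, `build_routes`, the presenter callables, and the shape of the legacy path (the bare endpoint name without the `/api/v1` prefix) are hypothetical names, not existing code in `app.py`. A plain dict stands in for the real router so the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable

Payload = Dict[str, Any]
Handler = Callable[[Payload], Payload]


@dataclass(frozen=True)
class EndpointSpec:
    """One row of the declarative endpoint table (hypothetical)."""
    name: str              # path segment, e.g. "create_session"
    service_call: Handler  # service-layer function that does the real work
    legacy: bool = True    # also bind the unversioned legacy path


def _make_handler(service_call: Handler, presenter: Handler) -> Handler:
    # Thin handler: call the service layer, then shape the response.
    def handler(payload: Payload) -> Payload:
        return presenter(service_call(payload))
    return handler


def build_routes(
    specs: Iterable[EndpointSpec],
    present_v1: Handler,
    present_legacy: Handler,
) -> Dict[str, Handler]:
    """Register each spec under /api/v1/<name> and, optionally, /<name>.

    Version-specific response shaping lives in the presenter, so handlers
    never need to inspect the route path.
    """
    routes: Dict[str, Handler] = {}
    for spec in specs:
        routes[f"/api/v1/{spec.name}"] = _make_handler(spec.service_call, present_v1)
        if spec.legacy:
            # Assumed legacy shape: same name without the version prefix.
            routes[f"/{spec.name}"] = _make_handler(spec.service_call, present_legacy)
    return routes
```

With a table like this, adding an endpoint becomes a one-line `EndpointSpec(...)` entry, and the v1 versus legacy difference is confined to the two presenter functions. In the real server the dict would be replaced by whatever registration call the web framework provides.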
## 2. Core training APIs appear synchronous and lack progress visibility
For core training operations such as `forward_backward_ppo`, the current HTTP handler appears to run the backend training call synchronously and only wraps the result in a future-like `request_id` after the call has completed.
That means the client experience is roughly:
1. `POST /forward_backward_ppo`
2. server blocks until training finishes
3. server returns `request_id`
4. client calls `retrieve_future`
5. result is usually already complete
This preserves a future-shaped API, but it is not a true async job model. For long-running training operations, the initial request can block for a long time with no progress information, so users cannot tell a slow job from a stalled one.
Suggested direction:
- Change long-running training endpoints to submit a background job and return a token/request ID immediately.
- Let clients poll by token to retrieve job status.
- Expose useful status fields, for example:
  - queued / running / succeeded / failed / cancelled
  - progress percentage or completed steps / total steps
  - current phase
  - latest metrics
  - error message if failed
  - timestamps
- Keep final result retrieval compatible with the current future semantics where possible.
- Optionally support cancellation and server-side timeout handling.
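The job model suggested above could look roughly like the following sketch. It is not the existing implementation: `JobRegistry`, `Job`, `JobState`, and the status field names are hypothetical, state is kept in memory only, and a plain thread stands in for whatever executor the backend would actually use. Cancellation is cooperative, meaning the work function is expected to check `job.cancel_requested` between steps.

```python
import threading
import time
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Callable, Dict, Optional


class JobState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"


@dataclass
class Job:
    request_id: str
    state: JobState = JobState.QUEUED
    completed_steps: int = 0
    total_steps: int = 0
    phase: str = ""
    metrics: Dict[str, float] = field(default_factory=dict)
    error: Optional[str] = None
    result: Any = None
    created_at: float = field(default_factory=time.time)
    finished_at: Optional[float] = None
    cancel_requested: threading.Event = field(default_factory=threading.Event)


class JobRegistry:
    """In-memory job table; submit() returns immediately with a request ID."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Job] = {}
        self._lock = threading.Lock()

    def submit(self, work: Callable[[Job], Any]) -> str:
        job = Job(request_id=uuid.uuid4().hex)
        with self._lock:
            self._jobs[job.request_id] = job
        threading.Thread(target=self._run, args=(job, work), daemon=True).start()
        return job.request_id

    def _run(self, job: Job, work: Callable[[Job], Any]) -> None:
        job.state = JobState.RUNNING
        try:
            # The work function updates progress fields on the job as it goes.
            job.result = work(job)
            final = (JobState.CANCELLED if job.cancel_requested.is_set()
                     else JobState.SUCCEEDED)
        except Exception as exc:
            job.error = str(exc)
            final = JobState.FAILED
        job.finished_at = time.time()
        job.state = final  # set last so pollers see a fully populated job

    def cancel(self, request_id: str) -> None:
        with self._lock:
            self._jobs[request_id].cancel_requested.set()

    def status(self, request_id: str) -> Dict[str, Any]:
        with self._lock:
            job = self._jobs[request_id]
        return {
            "request_id": job.request_id,
            "state": job.state.value,
            "completed_steps": job.completed_steps,
            "total_steps": job.total_steps,
            "phase": job.phase,
            "metrics": dict(job.metrics),
            "error": job.error,
            "created_at": job.created_at,
            "finished_at": job.finished_at,
            "result": job.result if job.state is JobState.SUCCEEDED else None,
        }
```

Under this sketch, the `POST` handler would call `submit(...)` and return the `request_id` right away, and `retrieve_future` would map onto `status(...)`: clients poll until `state` is terminal and then read `result`, which keeps the final retrieval close to the current future semantics. Persistence, retries, and server-side timeouts are deliberately left out.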
## Why this matters
These two issues are connected: as more training operations become long-running and stateful, keeping all route/state/future logic inside `app.py` will make it harder to provide robust async behavior, progress reporting, retries, cancellation, and persistence.
A cleaner split between API routing, job management, progress state, and backend execution would make the system easier to evolve.

