Background
While reading the current server implementation, I noticed two areas that may become increasingly hard to maintain as the API surface grows.
1. app.py is becoming too large and repetitive
src/verl_mint/app.py currently contains a large amount of route registration, request/response adaptation, compatibility handling, in-memory state management, checkpoint bookkeeping, future wrapping, and helper logic in one file.
A lot of the route code is also boilerplate-like:
- duplicated /api/v1/* and legacy route bindings
- repeated conversion between Pydantic schemas and backend/service payloads
- repeated future wrapping
- repeated checkpoint/session/sampler response shaping
- version-specific behavior based on route path checks
This makes the API layer harder to review and harder to extend safely.
Suggested direction:
- Split app.py into smaller route modules by domain, such as sessions, models, training, sampling, checkpoints, rollouts, and futures.
- Move compatibility-specific response shaping into presenter/adapter helpers.
- Consider defining repetitive endpoint mappings in a declarative config, then auto-registering or generating simple route handlers where possible.
- Keep route handlers thin: validate request, call service layer, format response.
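To make the declarative idea concrete, here is a minimal, framework-agnostic sketch. Everything in it is hypothetical and not taken from app.py: `RouteSpec`, `present`, `register_routes`, the `create_session` service function, and the `{"data": ...}` envelope for versioned responses are illustrative names and shapes only. A plain dict stands in for whatever router the web framework provides.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Payload = Dict[str, Any]
Handler = Callable[[Payload], Payload]


@dataclass(frozen=True)
class RouteSpec:
    """One declarative entry per endpoint (hypothetical structure)."""
    name: str             # endpoint name, e.g. "create_session"
    service: Handler      # service-layer callable that does the real work
    legacy: bool = True   # also bind the un-prefixed legacy path


def present(payload: Payload, *, versioned: bool) -> Payload:
    """Compatibility-specific response shaping lives in one place.

    The envelope used for versioned routes is an assumption made for
    illustration, not the project's actual response format.
    """
    return {"data": payload} if versioned else payload


def make_handler(spec: RouteSpec, *, versioned: bool) -> Handler:
    # Thin handler: call the service layer, then format the response.
    def handler(request: Payload) -> Payload:
        return present(spec.service(request), versioned=versioned)
    return handler


def register_routes(specs: List[RouteSpec]) -> Dict[str, Handler]:
    """Bind each spec under /api/v1/* and, optionally, its legacy path."""
    routes: Dict[str, Handler] = {}
    for spec in specs:
        routes[f"/api/v1/{spec.name}"] = make_handler(spec, versioned=True)
        if spec.legacy:
            routes[f"/{spec.name}"] = make_handler(spec, versioned=False)
    return routes


def create_session(request: Payload) -> Payload:
    # Stand-in for a real service-layer function.
    return {"session_id": f"session-{request['user']}"}


ROUTES = register_routes([RouteSpec("create_session", create_session)])
```

With this shape, version-specific behavior is decided once at registration time rather than by checking the route path inside each handler.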
2. Core training APIs appear synchronous and lack progress visibility
For core training operations such as forward_backward_ppo, the current HTTP handler appears to execute the backend training call synchronously, then only after completion wraps the completed result into a future-like request_id.
That means the client experience is roughly:
- client sends POST /forward_backward_ppo
- server blocks until training finishes
- server returns request_id
- client calls retrieve_future
- result is usually already complete
This preserves a future-shaped API, but it is not a true async job model. For long-running training operations, the first request may hang for a long time with no progress information, which can feel risky or confusing to users.
Suggested direction:
- Change long-running training endpoints to submit a background job and return a token/request ID immediately.
- Let clients poll by token to retrieve job status.
- Expose useful status fields, for example:
  - queued / running / succeeded / failed / cancelled
  - progress percentage or completed steps / total steps
  - current phase
  - latest metrics
  - error message if failed
  - timestamps
- Keep final result retrieval compatible with the current future semantics where possible.
- Optionally support cancellation and server-side timeout handling.
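A rough sketch of the submit-then-poll model described above, using only the standard library. `JobManager`, `JobStatus`, and the `report(step)` progress callback are hypothetical names, and `fake_training` is a stand-in for a real backend call such as forward_backward_ppo. Cancellation, timeouts, and persistence are deliberately left out, and state is held in memory only.

```python
import threading
import time
import uuid
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional

ProgressFn = Callable[[int], None]


@dataclass
class JobStatus:
    state: str = "queued"            # queued / running / succeeded / failed
    completed_steps: int = 0
    total_steps: int = 0
    result: Any = None
    error: Optional[str] = None
    created_at: float = field(default_factory=time.time)
    finished_at: Optional[float] = None


class JobManager:
    """In-memory job registry backed by a worker thread pool."""

    def __init__(self, max_workers: int = 1) -> None:
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs: Dict[str, JobStatus] = {}
        self._lock = threading.Lock()

    def submit(self, fn: Callable[[ProgressFn], Any], total_steps: int) -> str:
        """Queue the job and return a request ID without waiting for it."""
        request_id = uuid.uuid4().hex
        with self._lock:
            self._jobs[request_id] = JobStatus(total_steps=total_steps)
        self._executor.submit(self._run, request_id, fn)
        return request_id

    def _update(self, request_id: str, **changes: Any) -> None:
        with self._lock:
            job = self._jobs[request_id]
            for key, value in changes.items():
                setattr(job, key, value)

    def _run(self, request_id: str, fn: Callable[[ProgressFn], Any]) -> None:
        self._update(request_id, state="running")

        def report(step: int) -> None:
            # Called by the job to publish progress for pollers.
            self._update(request_id, completed_steps=step)

        try:
            result = fn(report)
        except Exception as exc:  # surface the failure to the poller
            self._update(request_id, state="failed", error=str(exc),
                         finished_at=time.time())
        else:
            self._update(request_id, state="succeeded", result=result,
                         finished_at=time.time())

    def status(self, request_id: str) -> Dict[str, Any]:
        """Return a snapshot of the job's status fields."""
        with self._lock:
            return dict(vars(self._jobs[request_id]))


def fake_training(report: ProgressFn) -> Dict[str, float]:
    # Stand-in for a long-running backend training call.
    for step in range(1, 4):
        report(step)
    return {"loss": 0.5}
```

In an HTTP layer, the POST handler would call `submit(...)` and return the request ID right away, while retrieve_future (or a new status endpoint) would call `status(request_id)`. The final result could still be returned in the existing future shape once `state` is `succeeded`.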
Why this matters
These two issues are connected: as more training operations become long-running and stateful, keeping all route/state/future logic inside app.py will make it harder to provide robust async behavior, progress reporting, retries, cancellation, and persistence.
A cleaner split between API routing, job management, progress state, and backend execution would make the system easier to evolve.