Fix: drop reconstructed conditioning frame in action-conditioned autoregressive concat by stonesstones · Pull Request #151 · nvidia-cosmos/cosmos-predict2.5

stonesstones · 2026-05-13T02:15:25Z

Summary

In the autoregressive loops of three action-conditioning entry points, each
generated chunk has length chunk_size + 1. Index 0 is the reconstructed
conditioning frame (≈ previous chunk's last frame, preserved by
denoise_replace_gt_frames=True), and the last frame is also used as
next_img_array for the next chunk's conditioning input.

The current concat uses chunk_video[i][:chunk_size] for i ≥ 1, which keeps
the reconstructed conditioning frame (index 0) and drops the last freshly
generated frame (index chunk_size). This produces:

- A 1-frame visual duplicate at the chunk[0]/chunk[1] boundary only.
  chunk[0] is appended in full, so its last frame F_{chunk_size} ends up
  in the concat, and chunk[1]'s first frame (≈ the same F_{chunk_size},
  reconstructed) is also kept. Subsequent boundaries are not visibly
  duplicated because the [:chunk_size] slice on chunk[i] (i ≥ 1) trims
  away its own last frame before the next chunk's reconstructed conditioning
  frame is appended.
- A 1-frame loss only at the very end (chunk[-1]). For intermediate
  chunks i ∈ [1, N-2], the dropped last frame F_{(i+1)·chunk_size} is
  effectively replaced in the timeline by chunk[i+1]'s reconstructed
  conditioning frame ≈ the same frame (slight quality loss from a VAE
  round-trip, but no positional gap). For the last chunk chunk[-1], there
  is no successor, so its dropped final frame is permanently missing from
  the output.

The fix switches the slice to chunk_video[i][1:] for i ≥ 1, dropping the
duplicated reconstructed conditioning frame at the front and keeping every
freshly generated frame.
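
The difference between the two slices can be sketched on labeled stand-in frames. This is a minimal illustration, not the repository's code: make_chunks, concat_before, and concat_after are hypothetical helpers, and the frames are plain strings rather than image arrays.

```python
def make_chunks(num_chunks, chunk_size):
    """Build labeled stand-ins for chunk_video (hypothetical helper).

    Each chunk has chunk_size + 1 entries. Index 0 is the reconstructed
    conditioning frame, marked with a trailing apostrophe.
    """
    chunks = []
    for i in range(num_chunks):
        start = i * chunk_size
        cond = "F_init'" if i == 0 else f"F{start}'"
        fresh = [f"F{start + k}" for k in range(1, chunk_size + 1)]
        chunks.append([cond] + fresh)
    return chunks


def concat_before(chunks, chunk_size):
    """Old behavior: chunk 0 in full, later chunks sliced [:chunk_size]."""
    out = list(chunks[0])
    for chunk in chunks[1:]:
        out.extend(chunk[:chunk_size])  # keeps index 0, drops the last frame
    return out


def concat_after(chunks):
    """Fixed behavior: chunk 0 in full, later chunks sliced [1:]."""
    out = list(chunks[0])
    for chunk in chunks[1:]:
        out.extend(chunk[1:])  # drops index 0, keeps every fresh frame
    return out


chunks = make_chunks(num_chunks=3, chunk_size=3)
before = concat_before(chunks, chunk_size=3)
after = concat_after(chunks)
print(before)
print(after)
```

Both results have the same length, so a frame count alone cannot tell the two behaviors apart; the difference is in which frames occupy the boundary positions and the final position.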

Affected files

cosmos_predict2/action_conditioned.py
cosmos_predict2/_src/predict2/action/inference/inference.py
cosmos_predict2/_src/predict2/action/inference/inference_gr00t.py

All three scripts construct chunk input as [img, 0, 0, ...] (one valid
frame + zeros) and call step_inference / generate_vid2world with
num_latent_conditional_frames=1, so the conditioning region is exactly one
pixel frame. The [1:] slice is therefore correct for these scripts as
written.
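
To make the dependency on the one-frame conditioning region explicit, the slice could be written in terms of the number of conditioning pixel frames. This is only a sketch under that assumption: drop_conditioning_frames and num_cond_pixel_frames are hypothetical names that do not appear in the affected scripts, and a caller that conditions on more frames would likely need other changes as well (for example in how next_img_array is chosen).

```python
def drop_conditioning_frames(chunk, num_cond_pixel_frames=1):
    """Return only the freshly generated frames of a chunk (hypothetical helper).

    Assumes the first num_cond_pixel_frames entries of the chunk are
    reconstructed conditioning frames. With the default of 1 this is
    equivalent to chunk[1:].
    """
    if not 0 <= num_cond_pixel_frames < len(chunk):
        raise ValueError("conditioning region must be shorter than the chunk")
    return chunk[num_cond_pixel_frames:]


fresh = drop_conditioning_frames(["F12'", "F13", "F14"])
print(fresh)
```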

Verification

Trace example with chunk_size=12 and 3 chunks:

chunk_video[0] = [F_init', F1, F2, ..., F12]   # 13 frames, next_img = F12
chunk_video[1] = [F12',    F13,F14,..., F24]   # 13 frames, F12' ≈ F12, next_img = F24
chunk_video[2] = [F24',    F25,F26,..., F36]   # 13 frames, F24' ≈ F24

# Before this PR (chunk[0] full, chunk[i>=1] sliced [:12])
concat = [F_init', F1..F12]  +  [F12', F13..F23]  +  [F24', F25..F35]
                       ^^^       ^^^^                 ^^^^
                 chunk0 last  duplicate of F12        F24' (≈ F24, reconstruction)
                                                      F36 silently lost (no chunk 3)
  → duplicate F12/F12' visible at chunk[0]/chunk[1] only
  → F24 (chunk 1's fresh last) replaced by F24' (chunk 2's reconstruction): position filled, slight quality drop
  → F36 (chunk 2's fresh last) lost entirely: 1 frame missing from output

# After this PR (chunk[i>=1] sliced [1:])
concat = [F_init', F1..F12]  +  [F13..F24]  +  [F25..F36]
  → no duplicate, no missing frame, all fresh generations kept

Test plan

- just lint passes (verified locally with ruff check and ruff format --check, ruff v0.12.7)
- Run action-conditioned inference with ≥3 chunks before/after and:
  - confirm the F12/F12' duplicate at the first chunk boundary disappears
  - confirm the very last frame (F_{N·chunk_size}) is now present in the output
  - check that reconstruction-frame artifacts at intermediate boundaries (F_{2·chunk_size}, F_{3·chunk_size}, ...) are gone
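
One way to check the duplicate at the first boundary is to flag consecutive output frames that are nearly identical. The following is a rough sketch, not part of this PR: mean_abs_diff, near_duplicate_indices, and the tol threshold are assumptions, and frames are represented here as flat lists of pixel values rather than the arrays the scripts produce. A genuinely static scene would also be flagged, so any hit still needs a visual check.

```python
def mean_abs_diff(a, b):
    """Mean absolute per-pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def near_duplicate_indices(frames, tol=1.0):
    """Return every index i where frames[i] is within tol of frames[i - 1]."""
    return [
        i
        for i in range(1, len(frames))
        if mean_abs_diff(frames[i - 1], frames[i]) <= tol
    ]


# Synthetic two-pixel frames: index 3 is a near-copy of index 2,
# mimicking a reconstructed conditioning frame kept next to its source.
frames = [[0.0, 0.0], [10.0, 10.0], [20.0, 20.0], [20.5, 19.5], [30.0, 30.0]]
hits = near_duplicate_indices(frames)
print(hits)
```

With chunk_size=12, a hit at index 13 in the pre-fix output would correspond to the F12/F12' pair described above.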

…regressive concat

The autoregressive loops in three action-conditioning entry points generate
chunks of length `chunk_size + 1` where index 0 is the reconstructed
conditioning frame (= previous chunk's last frame, preserved by
`denoise_replace_gt_frames=True`). The previous concat used
`chunk_video[i][:chunk_size]`, which kept that reconstructed conditioning
frame and dropped the last newly generated frame. This caused a 1-frame
duplicate at every chunk boundary and lost one fresh frame per chunk.

Switch to `chunk_video[i][1:]` so each subsequent chunk contributes only
its `chunk_size` newly generated frames.

Affected files:
- cosmos_predict2/action_conditioned.py
- cosmos_predict2/_src/predict2/action/inference/inference.py
- cosmos_predict2/_src/predict2/action/inference/inference_gr00t.py

Signed-off-by: stonesstones <ku292stones@gmail.com>

stonesstones wants to merge 1 commit into nvidia-cosmos:main from stonesstones:fix/action-autoregressive-concat
