Fix: drop reconstructed conditioning frame in action-conditioned autoregressive concat #151

Draft
stonesstones wants to merge 1 commit into nvidia-cosmos:main from stonesstones:fix/action-autoregressive-concat

stonesstones commented on May 13, 2026

Summary

In the autoregressive loops of three action-conditioning entry points, each
generated chunk has length chunk_size + 1. Index 0 is the reconstructed
conditioning frame (≈ the previous chunk's last frame, preserved by
denoise_replace_gt_frames=True), and the last frame (index chunk_size) is
reused as next_img_array, the conditioning input for the next chunk.

The current concat uses chunk_video[i][:chunk_size] for i ≥ 1, which keeps
the reconstructed conditioning frame (index 0) and drops the last freshly
generated frame (index chunk_size). This produces:

  • A 1-frame visual duplicate at the chunk[0]/chunk[1] boundary only.
    chunk[0] is appended in full, so its last frame F_{chunk_size} ends up
    in the concat, and chunk[1]'s first frame (≈ the same F_{chunk_size},
    reconstructed) is also kept. Subsequent boundaries are not visibly
    duplicated because the [:chunk_size] slice on chunk[i] (i ≥ 1) trims
    away its own last frame before the next chunk's reconstructed conditioning
    frame is appended.
  • A 1-frame loss only at the very end (chunk[-1]). For intermediate
    chunks i ∈ [1, N-2], the dropped last frame F_{(i+1)·chunk_size} is
    effectively replaced in the timeline by chunk[i+1]'s reconstructed
    conditioning frame ≈ the same frame (slight quality loss from a VAE
    round-trip, but no positional gap). For the last chunk chunk[-1], there
    is no successor, so its dropped final frame is permanently missing from
    the output.

The fix switches the slice to chunk_video[i][1:] for i ≥ 1, dropping the
duplicated reconstructed conditioning frame at the front and keeping every
freshly generated frame.
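
The difference between the two slices can be reproduced on symbolic frame labels. The following is a minimal sketch, not the repository code: make_chunks, concat_old, and concat_new are hypothetical helper names, and frames are plain strings rather than arrays.

```python
def make_chunks(num_chunks, chunk_size):
    """Build symbolic chunks of chunk_size + 1 labels each.

    Index 0 is the reconstructed conditioning frame (marked with a
    trailing '), followed by chunk_size freshly generated frames.
    """
    chunks = []
    for i in range(num_chunks):
        start = i * chunk_size
        cond = "F_init'" if i == 0 else f"F{start}'"
        fresh = [f"F{start + k}" for k in range(1, chunk_size + 1)]
        chunks.append([cond] + fresh)
    return chunks


def concat_old(chunks, chunk_size):
    """Behavior before this PR: chunk 0 in full, then chunk[i][:chunk_size]."""
    out = list(chunks[0])
    for chunk in chunks[1:]:
        out.extend(chunk[:chunk_size])
    return out


def concat_new(chunks):
    """Behavior after this PR: chunk 0 in full, then chunk[i][1:]."""
    out = list(chunks[0])
    for chunk in chunks[1:]:
        out.extend(chunk[1:])
    return out


chunks = make_chunks(num_chunks=3, chunk_size=12)
old = concat_old(chunks, chunk_size=12)
new = concat_new(chunks)
```

With chunk_size=12 and 3 chunks, both variants yield 37 frames, so the output length alone does not reveal the bug. Only the contents differ: old contains F12' and F24' and lacks F24 and F36, while new holds F_init' followed by F1 through F36.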

Affected files

  • cosmos_predict2/action_conditioned.py
  • cosmos_predict2/_src/predict2/action/inference/inference.py
  • cosmos_predict2/_src/predict2/action/inference/inference_gr00t.py

All three scripts construct chunk input as [img, 0, 0, ...] (one valid
frame + zeros) and call step_inference / generate_vid2world with
num_latent_conditional_frames=1, so the conditioning region is exactly one
pixel frame. The [1:] slice is therefore correct for these scripts as
written.
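
For illustration, the [img, 0, 0, ...] input layout could be built as follows. This is a sketch under stated assumptions: build_chunk_input is a hypothetical name, the (T, H, W, C) layout and the input length of chunk_size + 1 frames are assumed here, and the actual scripts may arrange tensors differently.

```python
import numpy as np


def build_chunk_input(img, chunk_size):
    """Return img followed by chunk_size all-zero frames.

    Assumed layout: (T, H, W, C) with T = chunk_size + 1. Only frame 0
    carries real content, which matches a conditioning region of exactly
    one pixel frame.
    """
    frames = np.zeros((chunk_size + 1, *img.shape), dtype=img.dtype)
    frames[0] = img
    return frames
```

Because only index 0 of the input is a real frame, only index 0 of the corresponding output chunk is a reconstruction, which is why dropping exactly one leading frame is sufficient.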

Verification

Trace example with chunk_size=12 and 3 chunks:

chunk_video[0] = [F_init', F1, F2, ..., F12]   # 13 frames, next_img = F12
chunk_video[1] = [F12',    F13,F14,..., F24]   # 13 frames, F12' ≈ F12, next_img = F24
chunk_video[2] = [F24',    F25,F26,..., F36]   # 13 frames, F24' ≈ F24

# Before this PR (chunk[0] full, chunk[i>=1] sliced [:12])
concat = [F_init', F1..F12]  +  [F12', F13..F23]  +  [F24', F25..F35]
                       ^^^       ^^^^                 ^^^^
                       (a)       (b)                  (c)
  (a) chunk 0's last frame
  (b) duplicate of F12
  (c) F24' (≈ F24, reconstruction); F36 silently lost (no chunk 3)
  → duplicate F12/F12' visible at chunk[0]/chunk[1] only
  → F24 (chunk 1's fresh last) replaced by F24' (chunk 2's reconstruction): position filled, slight quality drop
  → F36 (chunk 2's fresh last) lost entirely: 1 frame missing from output

# After this PR (chunk[i>=1] sliced [1:])
concat = [F_init', F1..F12]  +  [F13..F24]  +  [F25..F36]
  → no duplicate, no missing frame, all fresh generations kept

Test plan

  • just lint passes (verified locally with ruff check and ruff format --check, ruff v0.12.7)
  • Run action-conditioned inference with ≥3 chunks before/after and:
    • confirm the F12/F12' duplicate at the first chunk boundary disappears
    • confirm the very last frame (F_{N·chunk_size}) is now present in the output
    • check that reconstruction-frame artifacts at intermediate boundaries (F_{2·chunk_size}, F_{3·chunk_size}, ...) are gone
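
One possible way to make the first check less subjective is to compare consecutive output frames numerically. The sketch below assumes the output video is available as a NumPy array of shape (T, H, W, C); the helper names are hypothetical and not part of the repository. On real model output the reconstructed frame is only approximately equal to its source, so the boundary difference would be small rather than exactly zero, and any threshold is a judgment call.

```python
import numpy as np


def consecutive_frame_diffs(video):
    """Mean absolute difference between each pair of consecutive frames.

    Entry k compares frame k with frame k + 1. The cast to float32
    avoids wrap-around when subtracting uint8 frames.
    """
    v = video.astype(np.float32)
    return np.abs(v[1:] - v[:-1]).mean(axis=tuple(range(1, v.ndim)))


def first_boundary_diff(video, chunk_size):
    """Difference between output frames chunk_size and chunk_size + 1,
    where the old concat places F_{chunk_size} next to its reconstruction."""
    return float(consecutive_frame_diffs(video)[chunk_size])


def synthetic_video(values):
    """Tiny test video: frame k is a constant 2x2x3 image with value values[k]."""
    return np.stack([np.full((2, 2, 3), val, dtype=np.uint8) for val in values])


# Synthetic stand-ins for chunk_size=12 and 3 chunks: frame value = frame index.
old_video = synthetic_video(list(range(13)) + [12] + list(range(13, 24)) + [24] + list(range(25, 36)))
new_video = synthetic_video(list(range(37)))
```

On this synthetic data, first_boundary_diff(old_video, 12) is 0.0 because the duplicated frame is identical, while first_boundary_diff(new_video, 12) matches the step between ordinary neighboring frames.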

…regressive concat

The autoregressive loops in three action-conditioning entry points generate
chunks of length `chunk_size + 1` where index 0 is the reconstructed
conditioning frame (= previous chunk's last frame, preserved by
`denoise_replace_gt_frames=True`).

The previous concat used `chunk_video[i][:chunk_size]`, which kept that
reconstructed conditioning frame and dropped the last newly generated frame.
This caused a 1-frame duplicate at the first chunk boundary and replaced each
later chunk's last fresh frame with a reconstruction (the final chunk's last
frame was lost outright).

Switch to `chunk_video[i][1:]` so each subsequent chunk contributes only
its `chunk_size` newly generated frames.

Affected files:
- cosmos_predict2/action_conditioned.py
- cosmos_predict2/_src/predict2/action/inference/inference.py
- cosmos_predict2/_src/predict2/action/inference/inference_gr00t.py

Signed-off-by: stonesstones <ku292stones@gmail.com>