Fix: drop reconstructed conditioning frame in action-conditioned autoregressive concat #151
Draft
stonesstones wants to merge 1 commit into
…regressive concat

The autoregressive loops in three action-conditioning entry points generate chunks of length `chunk_size + 1` where index 0 is the reconstructed conditioning frame (= previous chunk's last frame, preserved by `denoise_replace_gt_frames=True`). The previous concat used `chunk_video[i][:chunk_size]`, which kept that reconstructed conditioning frame and dropped the last newly generated frame. This caused a 1-frame duplicate at every chunk boundary and lost one fresh frame per chunk.

Switch to `chunk_video[i][1:]` so each subsequent chunk contributes only its `chunk_size` newly generated frames.

Affected files:
- cosmos_predict2/action_conditioned.py
- cosmos_predict2/_src/predict2/action/inference/inference.py
- cosmos_predict2/_src/predict2/action/inference/inference_gr00t.py

Signed-off-by: stonesstones <ku292stones@gmail.com>
Summary
In the autoregressive loops of three action-conditioning entry points, each
generated chunk has length `chunk_size + 1`. Index `0` is the reconstructed
conditioning frame (≈ previous chunk's last frame, preserved by
`denoise_replace_gt_frames=True`), and the last frame is also used as
`next_img_array` for the next chunk's conditioning input.

The current concat uses `chunk_video[i][:chunk_size]` for `i ≥ 1`, which keeps
the reconstructed conditioning frame (index 0) and drops the last freshly
generated frame (index `chunk_size`). This produces:

- A 1-frame duplicate at the `chunk[0]`/`chunk[1]` boundary only. `chunk[0]`
  is appended in full, so its last frame `F_{chunk_size}` ends up in the
  concat, and `chunk[1]`'s first frame (≈ the same `F_{chunk_size}`,
  reconstructed) is also kept. Subsequent boundaries are not visibly
  duplicated because the `[:chunk_size]` slice on `chunk[i]` (`i ≥ 1`) trims
  away its own last frame before the next chunk's reconstructed conditioning
  frame is appended.
- A 1-frame loss at the end of the output (the last frame of `chunk[-1]`).
  For intermediate chunks `i ∈ [1, N-2]`, the dropped last frame
  `F_{(i+1)·chunk_size}` is effectively replaced in the timeline by
  `chunk[i+1]`'s reconstructed conditioning frame ≈ the same frame (slight
  quality loss from a VAE round-trip, but no positional gap). For the last
  chunk `chunk[-1]`, there is no successor, so its dropped final frame is
  permanently missing from the output.

The fix switches the slice to `chunk_video[i][1:]` for `i ≥ 1`, dropping the
duplicated reconstructed conditioning frame at the front and keeping every
freshly generated frame.
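The effect of the two slices can be reproduced with a small simulation. This is only an illustrative sketch, not code from the affected scripts: frames are plain integer labels instead of tensors, `("R", k)` is a made-up marker for a reconstructed copy of frame `k`, and `make_chunks`, `concat_old`, and `concat_new` are hypothetical helper names.

```python
# Illustrative simulation only: frames are labels, not video tensors.
# ("R", k) marks a reconstructed copy of frame k (a conditioning frame).

def make_chunks(chunk_size, num_chunks):
    """Build labeled chunks of length chunk_size + 1, one per autoregressive step."""
    chunks = []
    for i in range(num_chunks):
        start = i * chunk_size
        # Chunk 0 starts from the input frame; later chunks start from the
        # reconstructed last frame of the previous chunk.
        first = start if i == 0 else ("R", start)
        fresh = list(range(start + 1, start + chunk_size + 1))
        chunks.append([first] + fresh)
    return chunks


def concat_old(chunks, chunk_size):
    """Previous behavior: keep index 0, drop the last frame of chunks i >= 1."""
    out = list(chunks[0])
    for chunk in chunks[1:]:
        out.extend(chunk[:chunk_size])
    return out


def concat_new(chunks):
    """Fixed behavior: drop index 0, keep every fresh frame of chunks i >= 1."""
    out = list(chunks[0])
    for chunk in chunks[1:]:
        out.extend(chunk[1:])
    return out


chunk_size, num_chunks = 12, 3
chunks = make_chunks(chunk_size, num_chunks)
old = concat_old(chunks, chunk_size)
new = concat_new(chunks)

print("old:", old)
print("new:", new)
```

Under these assumptions both outputs have the same length, but `old` contains frame 12 followed by its reconstructed copy, uses a reconstructed copy in place of frame 24, and never includes frame 36, whereas `new` is exactly frames 0 through 36 in order.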
Affected files
- `cosmos_predict2/action_conditioned.py`
- `cosmos_predict2/_src/predict2/action/inference/inference.py`
- `cosmos_predict2/_src/predict2/action/inference/inference_gr00t.py`

All three scripts construct chunk input as `[img, 0, 0, ...]` (one valid
frame + zeros) and call `step_inference` / `generate_vid2world` with
`num_latent_conditional_frames=1`, so the conditioning region is exactly one
pixel frame. The `[1:]` slice is therefore correct for these scripts as
written.
Verification
Trace example with `chunk_size=12` and 3 chunks:

Test plan
- `just lint` passes (verified locally with `ruff check` and `ruff format --check` v0.12.7)
- The final frame (`F_{N·chunk_size}`) is now present in the output
- The reconstructed conditioning frames (`F_{2·chunk_size}`, `F_{3·chunk_size}`, ...) are gone