fix: derive checkpoint_format from resolved local path (LoRA base weights silently not loaded)#161

Open: Rouhi-Amirreza wants to merge 1 commit into nvidia-cosmos:main from Rouhi-Amirreza:fix/lora-checkpoint-format-pt

What

checkpoint_format is computed from the configured checkpoint.load_path string, but get_checkpoint_path() may resolve a DCP-style URI to a consolidated .pt file. The LoRA key-mapping branch is gated on checkpoint_format == "pt", so in that case the branch is skipped: PEFT-wrapped models (use_lora=True) load zero parameters under the non-strict load and train from random init, with no error raised. This PR derives the format from the resolved file instead.

Fixes #160. Also matches the long-standing symptom reported in nvidia-cosmos/cosmos-predict2#176 (LoRA model produces identical/degenerate results vs. base; checkpoint-loading warnings).
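
To illustrate the failure mode described above, here is a minimal pure-Python sketch of a non-strict load against a wrapped model. It is not the repository's loader: the `base_model.model.` prefix, the example key names, and both helper functions are assumptions made for illustration. It only shows why skipping the key-mapping step produces zero loaded parameters and no exception.

```python
from typing import Dict, List, Tuple

# Hypothetical prefix that a LoRA/PEFT wrapper adds to base parameter names.
WRAP_PREFIX = "base_model.model."


def non_strict_load(
    model_keys: List[str], checkpoint: Dict[str, float]
) -> Tuple[List[str], List[str], List[str]]:
    """Mimic a non-strict state-dict load: report mismatches, never raise."""
    loaded = [k for k in model_keys if k in checkpoint]
    missing = [k for k in model_keys if k not in checkpoint]
    unexpected = [k for k in checkpoint if k not in model_keys]
    return loaded, missing, unexpected


def map_keys_for_wrapped_model(checkpoint: Dict[str, float]) -> Dict[str, float]:
    """Hypothetical stand-in for the LoRA key-mapping branch."""
    return {WRAP_PREFIX + k: v for k, v in checkpoint.items()}


# Base checkpoint uses unwrapped names; the wrapped model expects prefixed ones.
base_checkpoint = {
    "blocks.0.attn.q_proj.weight": 0.1,
    "blocks.0.attn.k_proj.weight": 0.2,
}
wrapped_model_keys = [WRAP_PREFIX + k for k in base_checkpoint]

# Mapping skipped (format misdetected): nothing matches, nothing is raised.
loaded_raw, missing_raw, _ = non_strict_load(wrapped_model_keys, base_checkpoint)

# Mapping applied (format detected as "pt"): every base parameter is loaded.
loaded_ok, missing_ok, _ = non_strict_load(
    wrapped_model_keys, map_keys_for_wrapped_model(base_checkpoint)
)

print(len(loaded_raw), len(missing_raw))  # 0 2
print(len(loaded_ok), len(missing_ok))    # 2 0
```

In the unmapped case the only signal is the list of missing keys, which matches the startup-log symptom reported in the issue.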

Change

One conditional in cosmos_predict2/_src/predict2/utils/model_loader.py, immediately after the get_checkpoint_path(...) resolution:

    if str(local_s3_ckpt_fp).endswith(".pt"):
        checkpoint_format = "pt"
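
For context, the effect of this conditional can be sketched as a standalone function. This is a simplified illustration, not the code in model_loader.py: the function name, the rule used for the initial format (".pt" suffix on the configured string, otherwise "dcp"), and the example paths are all assumptions.

```python
def derive_checkpoint_format(load_path: str, resolved_path: str) -> str:
    """Return "pt" or "dcp" for a checkpoint (hypothetical helper).

    load_path:     the configured checkpoint.load_path string.
    resolved_path: the local file or directory the resolver returned.
    """
    # Assumed prior behaviour: infer the format from the configured string only.
    checkpoint_format = "pt" if load_path.endswith(".pt") else "dcp"

    # The conditional from this PR: trust the resolved file when it is a .pt.
    if str(resolved_path).endswith(".pt"):
        checkpoint_format = "pt"
    return checkpoint_format


# Hypothetical paths, for illustration only.
# A DCP-style URI that resolves to a consolidated .pt file is promoted to "pt".
fmt_promoted = derive_checkpoint_format("s3://bucket/ckpt/iter_1000", "/cache/model.pt")
# A load_path that already ends in .pt is unchanged.
fmt_unchanged = derive_checkpoint_format("/data/model.pt", "/data/model.pt")
# A DCP directory that stays a directory remains "dcp".
fmt_dcp = derive_checkpoint_format("s3://bucket/ckpt/iter_1000", "/cache/iter_1000")

print(fmt_promoted, fmt_unchanged, fmt_dcp)  # pt pt dcp
```

The check only ever moves the format toward "pt", so a run that was already detected as "pt" cannot be demoted by it.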

Verification

Cosmos-Predict2.5-2B LoRA post-training (rank 32, single H100), per the issue's reproduction:

| | Before | After |
|---|---|---|
| Startup log | _IncompatibleKeys(missing_keys=[<all base params>]), no LoRA-mapping line | Mapped 689 LoRA keys from checkpoint to model |
| First-iter loss | ~3.04 (random base) | ~0.03 (loaded base) |
| Generation | noise | coherent video |

No behavior change for runs whose load_path already ends in .pt (the condition only promotes dcp to pt when the resolved file is a .pt).

…ghts silently not loaded)

Signed-off-by: Amirreza Rouhi <ar3755@drexel.edu>

Development

Successfully merging this pull request may close these issues.

LoRA post-training silently trains from random base weights when load_path is a checkpoint-DB URI (root cause of #176)