Feature/tokenizer pipeline #2
Conversation
Pull Request Overview
This PR introduces a tokenizer pipeline for text-to-tokenized sign language video translation using NVIDIA Cosmos Tokenizers. The pipeline handles video tokenization of the PHOENIX-2014-T dataset and includes infrastructure for logging and metrics tracking.
Key changes include:
- Implementation of sample tokenization scripts for testing individual video sequences
- Full dataset tokenization pipeline for processing the PHOENIX-2014-T dataset
- Project configuration and CI/CD setup for the tokenizer pipeline module
Reviewed Changes
Copilot reviewed 20 out of 22 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tokenize_sample.py | Sample tokenization script that processes single PHOENIX sequences with both CV8x8x8 and DV8x16x16 models |
| tokenize_dataset.py | Full dataset tokenization pipeline that processes all PHOENIX-2014-T splits and saves discrete tokens |
| text_to_tokenized_video/tokenizer_pipeline/ | Tokenizer pipeline module containing duplicate copies of the sample and dataset scripts |
| text_to_tokenized_video/tokenizer_pipeline/pyproject.toml | Project configuration for the tokenizer pipeline package |
| Various requirements/metadata files | Package dependencies and metadata for NVIDIA Cosmos integration |
```python
for model_name in model_names:
    encoder_ckpt = f"checkpoints/{model_name}/encoder.jit"
    decoder_ckpt = f"checkpoints/{model_name}/decoder.jit"

    tokenizer = CausalVideoTokenizer(
        checkpoint_enc=encoder_ckpt,
        checkpoint_dec=decoder_ckpt,
        device="cuda",
        dtype="bfloat16",
    )
```
Copilot AI (Oct 8, 2025):
Creating a new tokenizer instance for each model inside the sequence loop is inefficient. The tokenizer should be created once per model outside the sequence loop and reused for all sequences.
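A minimal sketch of that restructuring, assuming `model_names` and a `sequences` iterable are defined as in the rest of the script (both names here are illustrative):

```python
# Sketch only: build each tokenizer once per model, then reuse it for every sequence.
tokenizers = {
    model_name: CausalVideoTokenizer(
        checkpoint_enc=f"checkpoints/{model_name}/encoder.jit",
        checkpoint_dec=f"checkpoints/{model_name}/decoder.jit",
        device="cuda",
        dtype="bfloat16",
    )
    for model_name in model_names
}

for sequence in sequences:  # `sequences`: however the script iterates PHOENIX sequences
    for model_name, tokenizer in tokenizers.items():
        # tokenize `sequence` with the pre-built tokenizer here
        ...
```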
```python
for model_name in model_names:
    encoder_ckpt = f"checkpoints/{model_name}/encoder.jit"
    decoder_ckpt = f"checkpoints/{model_name}/decoder.jit"

    print(f"\n=== Running model {model_name} ===")
    t0 = time.time()

    tokenizer = CausalVideoTokenizer(
        checkpoint_enc=encoder_ckpt,
        checkpoint_dec=decoder_ckpt,
        device="cuda",
        dtype="bfloat16",  # change to float32 if GPU complains
    )
```
Copilot AI (Oct 8, 2025):
The tokenizer is recreated for each model iteration, which is inefficient. Consider creating tokenizers once and reusing them, or moving the initialization outside the timing measurement if you need fresh instances.
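If fresh instances per model are needed, one option is to move construction above the timer so only tokenization is measured. A rough sketch, with the tokenization call itself elided:

```python
for model_name in model_names:
    # Construct the tokenizer before starting the timer so JIT checkpoint
    # loading is not included in the measurement.
    tokenizer = CausalVideoTokenizer(
        checkpoint_enc=f"checkpoints/{model_name}/encoder.jit",
        checkpoint_dec=f"checkpoints/{model_name}/decoder.jit",
        device="cuda",
        dtype="bfloat16",
    )

    print(f"\n=== Running model {model_name} ===")
    t0 = time.time()
    # ... run tokenization for this model ...
    print(f"Elapsed: {time.time() - t0:.2f}s")
```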
You are pushing a directory that wraps this repo, not the repo itself. Notice that all your files sit under text_to_tokenized_video/tokenizer_pipeline, so there are duplicates: tokenize_sample and tokenize_dataset each appear twice.
My recommendation (see the command sketch after this list):
- git clone the repo into a fresh directory.
- Copy only the files you need into that directory.
- Push and open a new pull request.
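Roughly, with the repository URL, branch name, and source paths as placeholders:

```bash
git clone <repo-url> clean-checkout
cd clean-checkout
# copy only the scripts you need from the old checkout (paths are placeholders)
cp ../old-checkout/text_to_tokenized_video/tokenizer_pipeline/tokenize_sample.py .
cp ../old-checkout/text_to_tokenized_video/tokenizer_pipeline/tokenize_dataset.py .
git checkout -b feature/tokenizer-pipeline
git add tokenize_sample.py tokenize_dataset.py
git commit -m "Add tokenizer pipeline scripts"
git push -u origin feature/tokenizer-pipeline
```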
```
output/
runs/
wandb/
cosmos_output/
```
Add *.egg-info to this list, and remove the egg-info files that are already tracked in git.
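For example (the exact egg-info directory name is a placeholder):

```bash
echo "*.egg-info/" >> .gitignore
# stop tracking the already-committed metadata; adjust the path to the actual directory
git rm -r --cached tokenizer_pipeline.egg-info/
git commit -m "Untrack egg-info metadata"
```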