Conversation

@marpng (Collaborator) commented Oct 8, 2025

No description provided.

@AmitMY requested a review from Copilot on October 8, 2025 at 13:02

@Copilot (Copilot AI) left a comment

Pull Request Overview

This PR introduces a tokenizer pipeline for text-to-tokenized sign language video translation using NVIDIA Cosmos Tokenizers. The pipeline tokenizes videos from the PHOENIX-2014-T dataset and includes infrastructure for logging and metrics tracking.
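For orientation, here is a minimal sketch of the Cosmos Tokenizer encode/decode round trip this pipeline builds on. The tensor shape and the discrete-model return signature follow the public Cosmos-Tokenizer README and are assumptions, not code from this PR:

```python
import torch
from cosmos_tokenizer.video_lib import CausalVideoTokenizer

model_name = "Cosmos-Tokenizer-DV8x16x16"
tokenizer = CausalVideoTokenizer(
    checkpoint_enc=f"checkpoints/{model_name}/encoder.jit",
    checkpoint_dec=f"checkpoints/{model_name}/decoder.jit",
    device="cuda",
    dtype="bfloat16",
)

# Dummy clip standing in for one PHOENIX-2014-T sequence: [B, C, T, H, W].
video = torch.randn(1, 3, 9, 256, 256, device="cuda", dtype=torch.bfloat16)

# Discrete (DV) tokenizers return token indices plus codebook embeddings;
# continuous (CV) tokenizers return a latent tensor instead.
indices, codes = tokenizer.encode(video)
reconstruction = tokenizer.decode(indices)
```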

Key changes include:

  • Implementation of sample tokenization scripts for testing individual video sequences
  • Full dataset tokenization pipeline for processing the PHOENIX-2014-T dataset
  • Project configuration and CI/CD setup for the tokenizer pipeline module

Reviewed Changes

Copilot reviewed 20 out of 22 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tokenize_sample.py | Sample tokenization script that processes individual PHOENIX sequences with both the CV8x8x8 and DV8x16x16 models |
| tokenize_dataset.py | Full dataset tokenization pipeline that processes all PHOENIX-2014-T splits and saves discrete tokens |
| text_to_tokenized_video/tokenizer_pipeline/ | Organized tokenizer pipeline module containing duplicate sample and dataset scripts |
| text_to_tokenized_video/tokenizer_pipeline/pyproject.toml | Project configuration for the tokenizer pipeline package |
| Various requirements/metadata files | Package dependencies and metadata for the NVIDIA Cosmos integration |


Comment on lines +94 to +103
```python
for model_name in model_names:
    encoder_ckpt = f"checkpoints/{model_name}/encoder.jit"
    decoder_ckpt = f"checkpoints/{model_name}/decoder.jit"

    tokenizer = CausalVideoTokenizer(
        checkpoint_enc=encoder_ckpt,
        checkpoint_dec=decoder_ckpt,
        device="cuda",
        dtype="bfloat16",
    )
```

Copilot AI Oct 8, 2025

Creating a new tokenizer instance for each model inside the sequence loop is inefficient. The tokenizer should be created once per model outside the sequence loop and reused for all sequences.
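A hedged sketch of that restructuring; `sequences`, `load_video_tensor`, and `save_tokens` are hypothetical stand-ins for the script's per-sequence plumbing:

```python
for model_name in model_names:
    # Build each tokenizer once, outside the per-sequence loop.
    tokenizer = CausalVideoTokenizer(
        checkpoint_enc=f"checkpoints/{model_name}/encoder.jit",
        checkpoint_dec=f"checkpoints/{model_name}/decoder.jit",
        device="cuda",
        dtype="bfloat16",
    )
    for sequence in sequences:                      # hypothetical iterable
        video = load_video_tensor(sequence)         # hypothetical loader
        outputs = tokenizer.encode(video)           # reuse the same instance
        save_tokens(model_name, sequence, outputs)  # hypothetical writer
```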

Comment on lines +113 to +125
```python
for model_name in model_names:
    encoder_ckpt = f"checkpoints/{model_name}/encoder.jit"
    decoder_ckpt = f"checkpoints/{model_name}/decoder.jit"

    print(f"\n=== Running model {model_name} ===")
    t0 = time.time()

    tokenizer = CausalVideoTokenizer(
        checkpoint_enc=encoder_ckpt,
        checkpoint_dec=decoder_ckpt,
        device="cuda",
        dtype="bfloat16",  # change to float32 if GPU complains
    )
```

Copilot AI Oct 8, 2025

The tokenizer is recreated for each model iteration, which is inefficient. Consider creating tokenizers once and reusing them, or moving the initialization outside the timing measurement if you need fresh instances.
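If a fresh instance per model is intentional, here is a sketch of the second option, constructing the tokenizer before starting the clock (the trailing loop body is elided, as in the diff):

```python
for model_name in model_names:
    # Initialization (JIT checkpoint loading) happens before the timer,
    # so t0 measures only the actual tokenization work.
    tokenizer = CausalVideoTokenizer(
        checkpoint_enc=f"checkpoints/{model_name}/encoder.jit",
        checkpoint_dec=f"checkpoints/{model_name}/decoder.jit",
        device="cuda",
        dtype="bfloat16",  # change to float32 if GPU complains
    )

    print(f"\n=== Running model {model_name} ===")
    t0 = time.time()  # timing now excludes tokenizer construction
    # ... per-sequence tokenization as before ...
```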

@AmitMY (Contributor) left a comment

You are pushing a directory that contains this repo, not the repo itself. Notice that all your files are under text_to_tokenized_video/tokenizer_pipeline, so there are duplicates.

You also have tokenize_sample and tokenize_dataset twice.

My recommendation is:

  1. git clone into a directory.
  2. Copy the files you need into this directory.
  3. Push and open a new pull request.

```
output/
runs/
wandb/
cosmos_output/
```
Contributor left a comment

Add *.egg-info to .gitignore and remove the already-tracked egg-info files from git.
