Feature/tokenizer pipeline #2
Conversation
Pull Request Overview
This PR introduces a tokenizer pipeline for text-to-tokenized sign language video translation using NVIDIA Cosmos Tokenizers. The pipeline handles video tokenization of the PHOENIX-2014-T dataset and includes infrastructure for logging and metrics tracking.
Key changes include:
- Implementation of sample tokenization scripts for testing individual video sequences
- Full dataset tokenization pipeline for processing the PHOENIX-2014-T dataset
- Project configuration and CI/CD setup for the tokenizer pipeline module
Reviewed Changes
Copilot reviewed 20 out of 22 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tokenize_sample.py | Sample tokenization script that processes single PHOENIX sequences with both CV8x8x8 and DV8x16x16 models |
| tokenize_dataset.py | Full dataset tokenization pipeline that processes all PHOENIX-2014-T splits and saves discrete tokens |
| text_to_tokenized_video/tokenizer_pipeline/ | Tokenizer pipeline module containing duplicate copies of the sample and dataset scripts |
| text_to_tokenized_video/tokenizer_pipeline/pyproject.toml | Project configuration for the tokenizer pipeline package |
| Various requirements/metadata files | Package dependencies and metadata for NVIDIA Cosmos integration |
```python
for model_name in model_names:
    encoder_ckpt = f"checkpoints/{model_name}/encoder.jit"
    decoder_ckpt = f"checkpoints/{model_name}/decoder.jit"

    tokenizer = CausalVideoTokenizer(
        checkpoint_enc=encoder_ckpt,
        checkpoint_dec=decoder_ckpt,
        device="cuda",
        dtype="bfloat16",
    )
```
Copilot AI (Oct 8, 2025):
Creating a new tokenizer instance for each model inside the sequence loop is inefficient. The tokenizer should be created once per model outside the sequence loop and reused for all sequences.
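A minimal sketch of that restructuring, assuming `model_names` and a `sequences` iterable are defined as in the rest of the script (both names here are illustrative):

```python
# Sketch only: build each tokenizer once per model, then reuse it for every sequence.
tokenizers = {
    model_name: CausalVideoTokenizer(
        checkpoint_enc=f"checkpoints/{model_name}/encoder.jit",
        checkpoint_dec=f"checkpoints/{model_name}/decoder.jit",
        device="cuda",
        dtype="bfloat16",
    )
    for model_name in model_names
}

for sequence in sequences:  # `sequences`: however the script iterates PHOENIX sequences
    for model_name, tokenizer in tokenizers.items():
        # tokenize `sequence` with the pre-built tokenizer here
        ...
```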
```python
for model_name in model_names:
    encoder_ckpt = f"checkpoints/{model_name}/encoder.jit"
    decoder_ckpt = f"checkpoints/{model_name}/decoder.jit"

    print(f"\n=== Running model {model_name} ===")
    t0 = time.time()

    tokenizer = CausalVideoTokenizer(
        checkpoint_enc=encoder_ckpt,
        checkpoint_dec=decoder_ckpt,
        device="cuda",
        dtype="bfloat16",  # change to float32 if GPU complains
    )
```
Copilot AI (Oct 8, 2025):
The tokenizer is recreated for each model iteration, which is inefficient. Consider creating tokenizers once and reusing them, or moving the initialization outside the timing measurement if you need fresh instances.
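If fresh instances per model are needed, one option is to move construction above the timer so only tokenization is measured. A rough sketch, with the tokenization call itself elided:

```python
for model_name in model_names:
    # Construct the tokenizer before starting the timer so JIT checkpoint
    # loading is not included in the measurement.
    tokenizer = CausalVideoTokenizer(
        checkpoint_enc=f"checkpoints/{model_name}/encoder.jit",
        checkpoint_dec=f"checkpoints/{model_name}/decoder.jit",
        device="cuda",
        dtype="bfloat16",
    )

    print(f"\n=== Running model {model_name} ===")
    t0 = time.time()
    # ... run tokenization for this model ...
    print(f"Elapsed: {time.time() - t0:.2f}s")
```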
You are pushing a directory that wraps this repo, not the repo itself. Notice that all your files sit under text_to_tokenized_video/tokenizer_pipeline, so there are duplicates: tokenize_sample and tokenize_dataset each appear twice.
My recommendation (see the command sketch after this list):
- git clone the repo into a fresh directory.
- Copy only the files you need into that directory.
- Push and open a new pull request.
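Roughly, with the repository URL, branch name, and source paths as placeholders:

```bash
git clone <repo-url> clean-checkout
cd clean-checkout
# copy only the scripts you need from the old checkout (paths are placeholders)
cp ../old-checkout/text_to_tokenized_video/tokenizer_pipeline/tokenize_sample.py .
cp ../old-checkout/text_to_tokenized_video/tokenizer_pipeline/tokenize_dataset.py .
git checkout -b feature/tokenizer-pipeline
git add tokenize_sample.py tokenize_dataset.py
git commit -m "Add tokenizer pipeline scripts"
git push -u origin feature/tokenizer-pipeline
```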
```
output/
runs/
wandb/
cosmos_output/
```
Add *.egg-info to this list, and remove the egg-info files that are already tracked in git.
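For example (the exact egg-info directory name is a placeholder):

```bash
echo "*.egg-info/" >> .gitignore
# stop tracking the already-committed metadata; adjust the path to the actual directory
git rm -r --cached tokenizer_pipeline.egg-info/
git commit -m "Untrack egg-info metadata"
```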