Fix bug: Update requirements and notebooks for DL examples by YanxuanLiu · Pull Request #633 · NVIDIA/cudf-spark-examples

YanxuanLiu · 2026-06-16T02:56:00Z

Fix bugs in the requirements and notebooks for the DL inference examples.

Updated versions in the requirements files.
Updated notebooks to align with the latest environment and dependencies.

greptile-apps · 2026-06-24T01:59:40Z

Greptile Summary

This PR fixes several bugs in TensorFlow deep-learning inference notebooks and tightens dependency pins across the shared requirements files. The core fixes address a [UNK]-in-vocabulary corruption in text_classification_tf, a target-column data-leakage bug in keras_preprocessing_tf, and a non-deterministic vocabulary deduplication via list(set()) that could silently shuffle token indices between runs.

Dependency updates (requirements.txt, torch_requirements.txt): datasets is pinned to ==3.*, huggingface-hub<1.0 is added to guard against breaking API changes, and torch/torchvision/torch-tensorrt are locked to exact versions (2.8.0/0.23.0/2.8.0).
TF model loading (conditional_generation_tf, pipelines_tf): use_safetensors=False is added to all TFAutoModel* and TFT5* loads to force the PyTorch .bin path, fixing an incompatibility in newer transformers+TF combinations; explicit model/tokenizer objects are now passed to pipeline() instead of letting it auto-download.
Vocabulary normalization (text_classification_tf): normalize_vocabulary is moved before export_model.save() so the saved .keras file already contains the cleaned vocab, and the function now correctly removes both "" and "[UNK]" (special tokens that set_vocabulary prepends automatically) while preserving insertion order.
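The vocabulary fix described above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual code: only the name `normalize_vocabulary` and the two filtered tokens come from the summary, while the function body and the sample vocabulary are assumptions.

```python
def normalize_vocabulary(vocab):
    """Drop reserved tokens and duplicates while keeping first-seen order.

    Assumed behavior, reconstructed from the review summary: "" and "[UNK]"
    are removed because set_vocabulary adds them back on its own.
    """
    reserved = {"", "[UNK]"}
    seen = set()
    cleaned = []
    for token in vocab:
        if token in reserved or token in seen:
            continue
        seen.add(token)
        cleaned.append(token)
    return cleaned


# Hypothetical vocabulary shaped like the output of get_vocabulary().
raw_vocab = ["", "[UNK]", "the", "movie", "the", "great"]
cleaned = normalize_vocabulary(raw_vocab)
print(cleaned)  # ['the', 'movie', 'great']
```

Unlike `list(set(vocab))`, whose string ordering can differ between interpreter runs because of hash randomization, the loop keeps token indices stable. Running the function a second time on its own output returns the same list, which matches the "idempotent on already-clean vocab" note in the flowchart.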

Confidence Score: 5/5

Safe to merge — all changes are targeted bug fixes and dependency stabilization with no new logic paths that could regress existing behavior.

The changes correct real bugs (data leakage from the wrong DataFrame reference in df_to_dataset, non-deterministic vocabulary ordering from list(set()), missing [UNK] filter before set_vocabulary) and add defensive version pins. The refactoring in the notebooks is straightforward and backed by executed cell outputs in the notebook itself. The only rough edge is a leftover unused device variable in predict_batch_fn, which is cosmetic and has no effect on correctness or GPU placement.
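The wrong-reference bug in df_to_dataset can be reproduced without pandas or TensorFlow. In the sketch below, a plain dict of columns stands in for the DataFrame, and both function names are hypothetical; the point is only that iterating over the original object after popping the target from a copy puts the label back into the features.

```python
def split_with_leak(table, target="target"):
    """Buggy variant: pops the label from a copy, then iterates the original."""
    columns = dict(table)            # shallow copy, standing in for dataframe.copy()
    labels = columns.pop(target)
    features = {name: list(values) for name, values in table.items()}  # wrong reference
    return features, labels


def split_features_and_labels(table, target="target"):
    """Fixed variant: iterates the copy that no longer holds the label."""
    columns = dict(table)
    labels = columns.pop(target)
    features = {name: list(values) for name, values in columns.items()}
    return features, labels


# Hypothetical two-row table with a binary label column.
pets = {"age": [3, 5], "fee": [100, 0], "target": [1, 0]}

leaky_features, _ = split_with_leak(pets)
clean_features, labels = split_features_and_labels(pets)

print(sorted(leaky_features))  # ['age', 'fee', 'target']
print(sorted(clean_features))  # ['age', 'fee']
```

A model trained on the leaky features sees the label as an input column, so its validation accuracy says little about real predictive quality.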

The two requirements files (requirements.txt, torch_requirements.txt) carry the dependency constraints that have been discussed in prior review threads; they are worth a second look if there are concerns about the numpy <2 upper bound or the tensorrt index URL CUDA alignment.
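To make the effect of these bounds concrete, here is a deliberately simplified checker for the specifier forms mentioned in this summary (`==X.*`, `==`, `>=`, `<`). It is a hand-rolled sketch, not a PEP 440 implementation: it ignores pre-releases, local versions, and every operator not listed, and the helper names are hypothetical.

```python
def parse_version(text):
    """Turn '1.26.4' into (1, 26, 4). Pre-release tags are not handled."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version, spec):
    """Check a version against comma-separated clauses such as '>=1.26.4,<2'."""
    v = parse_version(version)
    for clause in spec.split(","):
        clause = clause.strip()
        if clause.startswith("==") and clause.endswith(".*"):
            prefix = parse_version(clause[2:-2])
            ok = v[:len(prefix)] == prefix      # prefix match, e.g. ==3.*
        elif clause.startswith("=="):
            ok = v == parse_version(clause[2:])
        elif clause.startswith(">="):
            ok = v >= parse_version(clause[2:])
        elif clause.startswith("<") and not clause.startswith("<="):
            ok = v < parse_version(clause[1:])
        else:
            raise ValueError(f"unsupported clause: {clause}")
        if not ok:
            return False
    return True


print(satisfies("1.26.4", ">=1.26.4,<2"))  # True
print(satisfies("2.1.0", ">=1.26.4,<2"))   # False: NumPy 2.x is excluded
print(satisfies("3.6.0", "==3.*"))         # True
print(satisfies("2.8.0", "==2.8.0"))       # True
```

For real dependency checks, `packaging.specifiers.SpecifierSet` implements the full specification and is the safer choice; the version strings above are illustrative inputs, not versions confirmed by this PR apart from the 2.8.0 torch pin.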

Important Files Changed

| Filename | Overview |
| --- | --- |
| examples/ML+DL-Examples/Spark-DL/dl_inference/requirements.txt | Pins datasets to `==3.*`, adds `huggingface-hub<1.0`, widens numpy to `>=1.26.4,<2`; the `<2` upper bound still excludes NumPy 2.x, which may conflict with modern tensorflow. |
| examples/ML+DL-Examples/Spark-DL/dl_inference/torch_requirements.txt | Pins torch/torchvision/torch-tensorrt to 2.8.0/0.23.0/2.8.0; the tensorrt index URL still points to cu121, which may not carry a matching wheel for torch 2.8.0. |
| examples/ML+DL-Examples/Spark-DL/dl_inference/huggingface/conditional_generation_tf.ipynb | Adds `use_safetensors=False` to all three TFT5ForConditionalGeneration loads to force the PyTorch .bin format, fixing a model-loading incompatibility with newer transformers/TF combinations. |
| examples/ML+DL-Examples/Spark-DL/dl_inference/huggingface/pipelines_tf.ipynb | Model and tokenizer are now loaded explicitly with `use_safetensors=False`; `device=device` is replaced with `dtype=None` across all inference paths. The leftover `device` variable in `predict_batch_fn` is dead code after the refactoring. |
| examples/ML+DL-Examples/Spark-DL/dl_inference/tensorflow/keras_preprocessing_tf.ipynb | Fixes a data-leakage bug in `df_to_dataset` where `dataframe.items()` was used after `df.pop('target')`; it now correctly uses `df.items()`. Also refactors the train/val/test split. |
| examples/ML+DL-Examples/Spark-DL/dl_inference/tensorflow/text_classification_tf.ipynb | Moves `normalize_vocabulary` earlier (before `export_model.save`); fixes non-deterministic `list(set())` deduplication and correctly filters `[UNK]`, which `set_vocabulary` expects to be absent. |
| examples/ML+DL-Examples/Spark-DL/dl_inference/vllm/qwen-2.5-7b_vllm.ipynb | Removes `task="generate"` (now auto-inferred by vllm) and increases `wait_retries` from 60 to 180. |
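The explicit-loading pattern described for pipelines_tf might look roughly like the sketch below. The function name and the default checkpoint are assumptions (the summary does not name the model the notebook uses), and the code presumes a transformers release that still ships the TF model classes alongside a working TensorFlow install. Imports sit inside the function so that defining it does not require either library.

```python
def load_tf_sentiment_pipeline(
    model_name="distilbert-base-uncased-finetuned-sst-2-english",  # assumed checkpoint
):
    """Build a pipeline from explicitly loaded objects instead of a bare task name."""
    from transformers import (
        AutoTokenizer,
        TFAutoModelForSequenceClassification,
        pipeline,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # use_safetensors=False asks from_pretrained to skip .safetensors weights,
    # which is the workaround the summary describes.
    model = TFAutoModelForSequenceClassification.from_pretrained(
        model_name, use_safetensors=False
    )
    # Passing the objects keeps pipeline() from resolving and downloading
    # a default model on its own.
    return pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
```

Calling `load_tf_sentiment_pipeline()` downloads weights on first use, so it needs network access or a populated local cache.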

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[vectorize_layer.adapt on train data] --> B[Train + evaluate model]
    B --> C["normalize_vocabulary(get_vocabulary())\nFilters '' and '[UNK]', preserves order"]
    C --> D[vectorize_layer.set_vocabulary]
    D --> E[export_model.save → text_model.keras]
    E --> F[Spark predict_batch_udf loads model]
    D --> G["normalize_vocabulary again\n(idempotent on already-clean vocab)"]
    G --> H[vectorize_layer.set_vocabulary]
    H --> I[export_model.save → text_model_cleaned.keras]
    I --> J[Triton server loads cleaned model]
    style C fill:#d4edda,stroke:#28a745
    style D fill:#d4edda,stroke:#28a745
    style G fill:#fff3cd,stroke:#ffc107
    style H fill:#fff3cd,stroke:#ffc107

Reviews (2): Last reviewed commit: "Update DL inference requirement headers"

rishic3

Looks good, thanks @YanxuanLiu. On a broader note, almost no one is writing new TensorFlow code, so we should consider deprecating the _tf notebooks.

YanxuanLiu · 2026-06-26T02:01:22Z

> Looks good, thanks @YanxuanLiu. On a broader note, almost no one is writing new TensorFlow code, so we should consider deprecating the _tf notebooks.

Sure, I will remove them from our default notebook list.

YanxuanLiu self-assigned this Jun 16, 2026

YanxuanLiu changed the title ~~Fix bug: Update requirements and notebooks for DL examples~~ [DO NOT REVIEW] Fix bug: Update requirements and notebooks for DL examples Jun 16, 2026

greptile-apps Bot reviewed Jun 24, 2026

Comment thread examples/ML+DL-Examples/Spark-DL/dl_inference/vllm_requirements.txt Outdated

Comment thread examples/ML+DL-Examples/Spark-DL/dl_inference/requirements.txt Outdated

Comment thread examples/ML+DL-Examples/Spark-DL/dl_inference/torch_requirements.txt

YanxuanLiu changed the title ~~[DO NOT REVIEW] Fix bug: Update requirements and notebooks for DL examples~~ Fix bug: Update requirements and notebooks for DL examples Jun 25, 2026

YanxuanLiu and others added 7 commits June 25, 2026 18:35

Stabilize DL inference environments

6dcf057

Signed-off-by: YanxuanLiu <yanxuanl@nvidia.com>

Pin HuggingFace dataset dependencies

20dbf47

Signed-off-by: YanxuanLiu <yanxuanl@nvidia.com>

Fix text classification vocabulary serialization

8691078

Signed-off-by: YanxuanLiu <yanxuanl@nvidia.com>

Fix TensorFlow DL inference notebooks

a5e0c77

Signed-off-by: YanxuanLiu <yanxuanl@nvidia.com>

Apply suggestions from code review

69b5faf

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: YanxuanLiu <yanxuanl@nvidia.com>

Relax DL inference public requirements

4d53c44

Signed-off-by: YanxuanLiu <yanxuanl@nvidia.com>

Update DL inference requirement headers

493c550

Signed-off-by: YanxuanLiu <yanxuanl@nvidia.com>

YanxuanLiu force-pushed the dl-inf-env branch from add41e4 to 493c550 Compare June 25, 2026 10:36

YanxuanLiu marked this pull request as ready for review June 25, 2026 10:38

YanxuanLiu requested review from NvTimLiu, rishic3 and yinqingh June 25, 2026 10:39

rishic3 approved these changes Jun 25, 2026

YanxuanLiu merged commit 1151c92 into NVIDIA:main Jun 26, 2026
4 checks passed
