test: add unit tests for utils, ops, format, config, download and pipeline_dag modules #990

Open
cmgzn wants to merge 13 commits into main from test/add-unit-tests


@cmgzn cmgzn commented Jun 9, 2026

Collaborator

What

Add unit tests for 19 modules: 18 new test files containing 409 test cases. The focus is on modules with low coverage in the CI full regression (based on coverage_report_all from unit-test.yml run #26550645873).

Test Files Added

| Directory | Files | Test Cases |
|---|---|---|
| tests/utils/ | 8 | 183 |
| tests/ops/ | 5 | 66 |
| tests/format/ | 3 | 59 |
| tests/config/ | 1 | 59 |
| tests/download/ | 1 | 14 |
| tests/core/executor/ | 1 | 30 |
| Total | 18 | 409 |

Coverage Impact (vs CI full regression baseline)

| Module | CI Baseline | Gap Filled |
|---|---|---|
| format/json_formatter.py | 42% | +56% (→ 98%) |
| utils/agent_output_locale.py | 69% | +31% (→ 100%) |
| utils/file_utils.py | 64% | +22% (→ 86%) |
| utils/sample.py | 81% | +19% (→ 100%) |
| utils/fingerprint_utils.py | 78% | +15% (→ 93%) |
| utils/jsonl_lenient_loader.py | 82% | +13% (→ 95%) |
| format/formatter.py | 73% | +9% (→ 82%) |
| utils/ckpt_utils.py | 90% | +8% (→ 98%) |
| utils/common_utils.py | 90% | +8% (→ 98%) |
| ops/mixins.py | 17% | new dedicated tests |
| core/executor/pipeline_dag.py | 76% | new dedicated tests |

Notes

  • No source code changes — tests only
  • All tests inherit from DataJuicerTestCaseBase and are compatible with the CI unittest.TestLoader discovery
  • External services (SMTP, Slack, DingTalk) are mocked; all other tests use real execution
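For readers unfamiliar with the pattern, the mocking approach for external services can be sketched roughly as follows. This is a minimal illustration, not code from the PR: `send_alert` is a hypothetical stand-in for the notification code, and plain `unittest.TestCase` is used instead of `DataJuicerTestCaseBase` so the snippet stays self-contained.

```python
import io
import smtplib
import unittest
from unittest import mock


def send_alert(host: str, sender: str, recipient: str, body: str) -> None:
    """Hypothetical notifier; stands in for the real notification code."""
    with smtplib.SMTP(host) as client:
        client.sendmail(sender, recipient, body)


class SendAlertTest(unittest.TestCase):
    # The PR's tests inherit DataJuicerTestCaseBase; TestCase is used here
    # only so the sketch runs without the data_juicer package.

    @mock.patch("smtplib.SMTP")
    def test_sendmail_called_without_network(self, smtp_cls):
        send_alert("smtp.invalid", "a@example.com", "b@example.com", "hi")
        # The class was instantiated once with the host, no socket opened.
        smtp_cls.assert_called_once_with("smtp.invalid")
        # The `with` statement yields the mocked client instance.
        client = smtp_cls.return_value.__enter__.return_value
        client.sendmail.assert_called_once_with(
            "a@example.com", "b@example.com", "hi"
        )


def run_suite() -> unittest.TestResult:
    """Load the case the way a TestLoader-based runner would and run it."""
    suite = unittest.TestLoader().loadTestsFromTestCase(SendAlertTest)
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```

Patching `smtplib.SMTP` at the module attribute means the test never touches the network, while the assertions still check that the client was used as intended.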

cmgzn added 10 commits June 9, 2026 10:34
…ig functions

- file_utils: byte_size_to_size_str, is_remote_path, get_all_files_paths_under,
  single_partition_write_with_filename, read_single_partition, expand_outdir_and_mkdir
- empty_formatter: multiple feature_keys, string-to-list conversion, null_value, zero length
- formatter: audio/video relative-to-absolute path conversion, mixed media keys
- config: resolve_job_id, validate_work_dir_config, resolve_job_directories
- common_utils: check_op_method_param, deprecated decorator (bare/with-reason/with-version/invalid-args)
- sample: random_sample (weight/number/upsample/seed/rounding)
- jsonl_lenient_loader: zstd decompression, missing file handling, empty lines
- HasherBasicTest: hash_bytes, update/hexdigest, dispatch fallback
- UpdateFingerprintTest: deterministic behavior, unhashable transform/args
  with caching enabled/disabled, empty args
- normalize_preferred_output_lang: zh variants, en variants, empty, unknown
- rubric_reason_language_clause: zh/en branches
- llm_filter_free_text_language_appendix: zh/en/empty
- agent_insight_system_prompt: zh/en
- dialog_detection_output_language_note: intent/topic/sentiment/intensity modes
- replace_content_mapper: None pattern, multi-pattern/repl, mismatched length, raw string strip
- clean_email_mapper: custom pattern, no-match, batched process path
- fix_unicode_mapper: custom NFKC normalization, invalid raises, empty defaults
- load_ops: single/multiple ops, args passing, _op_cfg stored, order preserved, empty list
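As an illustration of the kind of behavior the fix_unicode_mapper cases above pin down (custom NFKC normalization, invalid form raises), here is a small sketch using only the standard library. `normalize_text` is a hypothetical helper, not the mapper itself, and the real operator may rely on a different library and different defaults.

```python
import unicodedata

VALID_FORMS = {"NFC", "NFKC", "NFD", "NFKD"}


def normalize_text(text: str, form: str = "NFKC") -> str:
    """Hypothetical helper: apply a Unicode normalization form to text."""
    if form not in VALID_FORMS:
        # Mirrors the "invalid raises" test case in spirit only.
        raise ValueError(f"unsupported normalization form: {form!r}")
    return unicodedata.normalize(form, text)


# NFKC folds compatibility characters: the "fi" ligature (U+FB01) becomes
# two ASCII letters, and fullwidth "A1" (U+FF21, U+FF11) becomes "A1".
ligature = normalize_text("\ufb01le")
fullwidth = normalize_text("\uff21\uff11")
# NFC does not apply compatibility mappings, so the ligature survives.
unchanged = normalize_text("\ufb01le", form="NFC")
```

A test for a custom normalization form is only meaningful when the input contains characters that the chosen form actually changes, which is why compatibility characters are used here.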
…vior

The original test used an input ('user@example.org') that matched both
the default and custom patterns identically, so the test would pass even
without the custom pattern feature. Now uses '{admin}@srv.co' which
only matches the custom pattern (with curly braces in char class),
with a precondition assertion proving the default pattern does NOT match.
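The idea behind this fix can be sketched as below. The two patterns are assumptions chosen for illustration (the mapper's real default may differ), and `clean_email` is a hypothetical stand-in for the operator.

```python
import re

# Assumed default pattern, for illustration only.
DEFAULT_PATTERN = r"[A-Za-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+"
# Custom pattern that additionally allows curly braces in the local part.
CUSTOM_PATTERN = r"[{}A-Za-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+"


def clean_email(text: str, pattern: str = DEFAULT_PATTERN) -> str:
    """Hypothetical stand-in for the mapper: drop every pattern match."""
    return re.sub(pattern, "", text)


OLD_INPUT = "contact user@example.org now"
NEW_INPUT = "contact {admin}@srv.co now"

# Weak test input: both patterns remove the address, so the test cannot
# tell whether the custom pattern was honored.
old_default = clean_email(OLD_INPUT)
old_custom = clean_email(OLD_INPUT, CUSTOM_PATTERN)

# Precondition: the default pattern must NOT match the new input, because
# "}" sits directly before "@" and is outside the default character class.
default_misses = re.search(DEFAULT_PATTERN, NEW_INPUT) is None

# Only the custom pattern removes the address from the new input.
new_default = clean_email(NEW_INPUT)
new_custom = clean_email(NEW_INPUT, CUSTOM_PATTERN)
```

With the precondition in place, the test fails if the custom pattern is silently ignored, which the original input could not detect.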
…ists, empty dataset save, unknown strategy, malformed filenames, ray save/load with event_logger tests
…rints, redirect_sys_output, make_log_summarization, setup_logger branches

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request significantly expands the test suite of the data_juicer repository, adding comprehensive unit tests for configuration functions, pipeline DAG execution, dataset formatters, downloaders, mappers, mixins, and various utility modules. The review feedback correctly identifies two key issues in the newly added logger utility tests: a flawed assertion in test_buffer_truncate_on_write that fails to verify actual buffer truncation, and a non-portable shell command (os.system) used for cleanup that should be replaced with shutil.rmtree for cross-platform compatibility.
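A portable cleanup along the lines the review suggests might look like this. It is a sketch under assumptions: `remove_tree` and `demo` are hypothetical names, and the actual test's directory layout is not shown in this conversation.

```python
import shutil
import tempfile
from pathlib import Path


def remove_tree(path: Path) -> None:
    """Delete a directory tree without shelling out via os.system."""
    # ignore_errors=True keeps cleanup from failing if the tree is already gone.
    shutil.rmtree(path, ignore_errors=True)


def demo() -> tuple[bool, bool]:
    """Create a throwaway log directory, then clean it up portably."""
    tmp_dir = Path(tempfile.mkdtemp(prefix="logger_test_"))
    (tmp_dir / "nested").mkdir()
    (tmp_dir / "nested" / "run.log").write_text("hello\n", encoding="utf-8")
    existed_before = tmp_dir.exists()
    remove_tree(tmp_dir)
    return existed_before, tmp_dir.exists()
```

Unlike a shell command passed to `os.system`, `shutil.rmtree` behaves the same on Linux, macOS, and Windows and does not depend on which shell is installed.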


Comment thread tests/utils/test_logger_utils.py Outdated
Comment thread tests/utils/test_logger_utils.py
@cmgzn cmgzn changed the title from "Test/add unit tests" to "test: add unit tests for utils, ops, format, config, download and pipeline_dag modules" Jun 9, 2026
…plicator

- New: tests/utils/test_availability_utils.py covering _is_package_available
  and _torch_check_and_set (fixes typo in old filename too)
- character_repetition_filter: add direct compute_stats_batched/process_batched
  tests (57% -> 100%)
- alphanumeric_filter: add batched API tests for non-tokenization path (50% -> 82%)
- suffix_filter: cover None/str suffixes and reversed_range (87% -> 100%)
- word_repetition_filter: add batched API direct call tests (87% -> 91%)
- document_line_deduplicator: fix skip_brackets test to actually hit the branch,
  add compute_hash existing-hash early-return test (84% -> 86%)
- Remove misspelled test_availablility_utils.py
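For context, a package-availability check of the kind these tests target can be sketched with the standard library as below. `is_package_available` is a hypothetical stand-in; the real `_is_package_available` in the repository may take different arguments and return a different shape.

```python
import importlib.util


def is_package_available(name: str) -> bool:
    """Hypothetical stand-in: report whether a module can be located."""
    try:
        # find_spec returns None for a missing top-level module and does
        # not import the module itself.
        return importlib.util.find_spec(name) is not None
    except (ModuleNotFoundError, ValueError):
        # Raised for dotted names with a missing parent, or empty names.
        return False


has_json = is_package_available("json")
has_missing = is_package_available("surely_missing_pkg_0xdeadbeef")
has_missing_child = is_package_available("surely_missing_pkg_0xdeadbeef.sub")
```

Tests for such a helper usually cover one known-present module and one known-absent name, so both branches run regardless of what is installed in CI.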