test: add unit tests for utils, ops, format, config, download and pipeline_dag modules #990
cmgzn wants to merge 13 commits
…ig functions
- file_utils: byte_size_to_size_str, is_remote_path, get_all_files_paths_under, single_partition_write_with_filename, read_single_partition, expand_outdir_and_mkdir
- empty_formatter: multiple feature_keys, string-to-list conversion, null_value, zero length
- formatter: audio/video relative-to-absolute path conversion, mixed media keys
- config: resolve_job_id, validate_work_dir_config, resolve_job_directories

- common_utils: check_op_method_param, deprecated decorator (bare/with-reason/with-version/invalid-args)
- sample: random_sample (weight/number/upsample/seed/rounding)
- jsonl_lenient_loader: zstd decompression, missing file handling, empty lines

- HasherBasicTest: hash_bytes, update/hexdigest, dispatch fallback
- UpdateFingerprintTest: deterministic behavior, unhashable transform/args with caching enabled/disabled, empty args

- normalize_preferred_output_lang: zh variants, en variants, empty, unknown
- rubric_reason_language_clause: zh/en branches
- llm_filter_free_text_language_appendix: zh/en/empty
- agent_insight_system_prompt: zh/en
- dialog_detection_output_language_note: intent/topic/sentiment/intensity modes

- replace_content_mapper: None pattern, multi-pattern/repl, mismatched length, raw string strip
- clean_email_mapper: custom pattern, no-match, batched process path
- fix_unicode_mapper: custom NFKC normalization, invalid raises, empty defaults

- load_ops: single/multiple ops, args passing, _op_cfg stored, order preserved, empty list
…vior
The original test used an input ('user@example.org') that matched both
the default and custom patterns identically, so the test would pass even
without the custom pattern feature. Now uses '{admin}@srv.co' which
only matches the custom pattern (with curly braces in char class),
with a precondition assertion proving the default pattern does NOT match.
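The fix described above can be sketched as follows. This is a minimal, self-contained illustration and not the PR's actual test: `clean_email`, `DEFAULT_EMAIL_PATTERN` and `CUSTOM_EMAIL_PATTERN` are hypothetical stand-ins for the mapper and its patterns. The point is the precondition: the input must not match the default pattern, so the test can only pass if the custom pattern is really applied.

```python
import re

# Assumed stand-in for the mapper's default pattern (not taken from the PR).
DEFAULT_EMAIL_PATTERN = r"[A-Za-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+"
# Hypothetical custom pattern that also accepts curly braces in the local part.
CUSTOM_EMAIL_PATTERN = r"[{}A-Za-z0-9]+@[a-z]+\.[a-z]+"


def clean_email(text, pattern=None, repl=""):
    """Remove e-mail addresses, using the default pattern unless one is given."""
    return re.sub(pattern or DEFAULT_EMAIL_PATTERN, repl, text)


def check_custom_pattern():
    text = "contact {admin}@srv.co now"
    # Precondition: the default pattern must NOT match this input,
    # otherwise the test could pass without the custom pattern feature.
    assert re.search(DEFAULT_EMAIL_PATTERN, text) is None
    assert clean_email(text) == text
    # Only the custom pattern removes the address.
    assert clean_email(text, pattern=CUSTOM_EMAIL_PATTERN) == "contact  now"
    return True
```

An input such as 'user@example.org' would satisfy the last assertion under either pattern, which is why it could not distinguish the two code paths.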
… print, refactor test_download
…ists, empty dataset save, unknown strategy, malformed filenames, ray save/load with event_logger tests
…rints, redirect_sys_output, make_log_summarization, setup_logger branches
Code Review
This pull request significantly expands the test suite of the data_juicer repository, adding comprehensive unit tests for configuration functions, pipeline DAG execution, dataset formatters, downloaders, mappers, mixins, and various utility modules. The review feedback correctly identifies two key issues in the newly added logger utility tests: a flawed assertion in test_buffer_truncate_on_write that fails to verify actual buffer truncation, and a non-portable shell command (os.system) used for cleanup that should be replaced with shutil.rmtree for cross-platform compatibility.
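A rough sketch of what the two review suggestions could look like is shown here. It does not reproduce the PR's test code: the class name is hypothetical and `io.StringIO` stands in for whatever buffer the logger tests actually use. The cleanup uses `shutil.rmtree` instead of a shell command, and the truncation test asserts on the buffer's content after truncation rather than only on the write call succeeding.

```python
import io
import os
import shutil
import tempfile
import unittest


class LoggerTestSketch(unittest.TestCase):
    """Hypothetical test case illustrating the two review points."""

    def setUp(self):
        # Each test gets its own scratch directory.
        self.tmp_dir = tempfile.mkdtemp()

    def tearDown(self):
        # Portable cleanup: no dependency on a shell or on `rm -rf`.
        shutil.rmtree(self.tmp_dir, ignore_errors=True)

    def test_log_file_written(self):
        log_path = os.path.join(self.tmp_dir, "run.log")
        with open(log_path, "w") as f:
            f.write("hello")
        self.assertTrue(os.path.exists(log_path))

    def test_buffer_truncated_before_write(self):
        buf = io.StringIO()  # stand-in for the real buffer (assumption)
        buf.write("old content")
        # Rewind, then truncate at the current position (0).
        buf.seek(0)
        buf.truncate()
        buf.write("new")
        # Assert on the content, which proves the old data is gone.
        self.assertEqual(buf.getvalue(), "new")
```

`ignore_errors=True` keeps teardown from failing when a test has already removed the directory itself.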
…plicator
- New: tests/utils/test_availability_utils.py covering _is_package_available and _torch_check_and_set (fixes typo in old filename too)
- character_repetition_filter: add direct compute_stats_batched/process_batched tests (57% -> 100%)
- alphanumeric_filter: add batched API tests for non-tokenization path (50% -> 82%)
- suffix_filter: cover None/str suffixes and reversed_range (87% -> 100%)
- word_repetition_filter: add batched API direct call tests (87% -> 91%)
- document_line_deduplicator: fix skip_brackets test to actually hit the branch, add compute_hash existing-hash early-return test (84% -> 86%)
- Remove misspelled test_availablility_utils.py
What
Add comprehensive unit tests for 19 modules, with 18 new test files containing 409 test cases. Focus on modules with low coverage in CI full regression (based on coverage_report_all from unit-test.yml run #26550645873).
Test Files Added
- tests/utils/
- tests/ops/
- tests/format/
- tests/config/
- tests/download/
- tests/core/executor/
Coverage Impact (vs CI full regression baseline)
- format/json_formatter.py
- utils/agent_output_locale.py
- utils/file_utils.py
- utils/sample.py
- utils/fingerprint_utils.py
- utils/jsonl_lenient_loader.py
- format/formatter.py
- utils/ckpt_utils.py
- utils/common_utils.py
- ops/mixins.py
- core/executor/pipeline_dag.py
Notes
DataJuicerTestCaseBase and are compatible with the CI unittest.TestLoader discovery
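As an illustration of the compatibility note, here is a minimal sketch of a test case built on a shared base class and collected through `unittest.TestLoader`. `SketchTestCaseBase` is a hypothetical stand-in for DataJuicerTestCaseBase, whose real interface is not shown in this PR, and the test class and fixture are likewise invented for the example.

```python
import unittest


class SketchTestCaseBase(unittest.TestCase):
    """Hypothetical stand-in for DataJuicerTestCaseBase."""

    def setUp(self):
        # Shared fixture that subclasses can rely on (assumption).
        self.samples = [{"text": "hello"}]


class SampleFixtureTest(SketchTestCaseBase):
    def test_samples_available(self):
        self.assertEqual(self.samples[0]["text"], "hello")


def run_loaded_tests():
    # TestLoader collects every method whose name starts with "test".
    loader = unittest.TestLoader()
    suite = loader.loadTestsFromTestCase(SampleFixtureTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the loader finds tests by method-name prefix on `unittest.TestCase` subclasses, a test class only needs to inherit from the base class and follow the `test_*` naming to be picked up.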