feat: add human-centric video understanding operators for HumanVBench #938
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,30 @@ | ||
| # Data-Juicer-HumanVbench-ops | ||
|
|
||
| This is the operator contribution page for the paper: **HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks (CVPR'26)**. | ||
|
|
||
| ## Related Operator Documentation Locations | ||
|
|
||
| * **Example Recipe:** `demos/video_humanvbench_simple/analyzer.yaml` | ||
| * **Operator Definition:** `data_juicer/config/config_all.yaml` | ||
|
|
||
| ## Quick Start | ||
|
|
||
| As HumanVBench operators involve modifications to external repositories, these adjusted repositories are currently stored in: | ||
| `thirdparty/humanvbench_models` | ||
|
|
||
| To use these operators, you can choose: | ||
|
|
||
| 1. **Manual Mode:** Follow the instructions in `thirdparty/humanvbench_models/README.md` to manually complete the `git clone` and `.diff` patch merging, then run: | ||
|
|
||
| ```shell | ||
| dj-process --config demos/video_humanvbench_simple/analyzer.yaml | ||
|
|
||
| ``` | ||
|
|
||
| 2. **Automatic Mode (Recommended):** Start running directly: | ||
|
|
||
| ```shell | ||
| dj-process --config demos/video_humanvbench_simple/analyzer.yaml | ||
|
|
||
| ``` | ||
| The relevant operators already cover the logic for automatic `git clone` and `merge diff`, making manual intervention non-essential. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,29 @@ | ||
| # Data-Juicer-HumanVbench-ops | ||
|
|
||
| This is the operator contribution page for the paper **HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks (CVPR'26)**. | ||
|
|
||
| ## Related Operator Documentation Locations | ||
|
|
||
| * **Example Recipe:** `demos/video_humanvbench_simple/analyzer.yaml` | ||
| * **Operator Definition:** `data_juicer/config/config_all.yaml` | ||
|
|
||
| ## Quick Start | ||
|
|
||
| Because the HumanVBench operators involve modifications to external repositories, the adjusted repositories are currently stored in: | ||
| `thirdparty/humanvbench_models` | ||
|
|
||
| To use these operators, you can choose: | ||
|
|
||
| 1. **Manual Mode:** Follow the instructions in `thirdparty/humanvbench_models/README.md` to manually complete the `git clone` and `.diff` patch merging, then run: | ||
| ```shell | ||
| dj-process --config demos/video_humanvbench_simple/analyzer.yaml | ||
|
|
||
| ``` | ||
|
|
||
|
|
||
| 2. **Automatic Mode (Recommended):** Start running directly: | ||
| ```shell | ||
| dj-process --config demos/video_humanvbench_simple/analyzer.yaml | ||
|
|
||
| ``` | ||
| The relevant operators already cover the logic for automatic `git clone` and `merge diff`, so manual intervention is not required. |
| Original file line number | Diff line number | Diff line change | ||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -67,8 +67,10 @@ def __init__( | |||||||||||||||||
| self.min_face_count = min_face_count | ||||||||||||||||||
| self.max_face_count = max_face_count | ||||||||||||||||||
|
|
||||||||||||||||||
| - self.extra_kwargs = self._default_kwargs.copy() | ||||||||||||||||||
| - self.extra_kwargs.update((k, v) for k, v in kwargs.items() if k in self.extra_kwargs) | ||||||||||||||||||
| + self.extra_kwargs = self._default_kwargs | ||||||||||||||||||
| + for key in kwargs: | ||||||||||||||||||
| + if key in self.extra_kwargs: | ||||||||||||||||||
| + self.extra_kwargs[key] = kwargs[key] | ||||||||||||||||||
|
Contributor commented on lines +70 to +73:
Assigning [...] This issue is also present in other files in this PR, including: [...]
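The following is a minimal sketch of the aliasing concern this comment appears to raise about lines +70 to +73, not the reviewer's actual suggestion. It assumes `_default_kwargs` is a class-level dict (the diff does not show its definition); the class name `FaceFilterSketch` and the key `scale` are hypothetical placeholders.

```python
class FaceFilterSketch:
    # Hypothetical stand-in for the operator; only the kwargs handling is modeled.
    # Assumption: _default_kwargs is a dict shared at class level.
    _default_kwargs = {"scale": 1.0}

    def __init__(self, share_defaults, **kwargs):
        if share_defaults:
            # Mirrors the added lines: the instance attribute aliases the class dict,
            # so every write below also changes the defaults for later instances.
            self.extra_kwargs = self._default_kwargs
            for key in kwargs:
                if key in self.extra_kwargs:
                    self.extra_kwargs[key] = kwargs[key]
        else:
            # Mirrors the removed lines: each instance works on its own copy.
            self.extra_kwargs = self._default_kwargs.copy()
            self.extra_kwargs.update(
                (k, v) for k, v in kwargs.items() if k in self.extra_kwargs
            )


def demo():
    """Return the class defaults after each style of construction."""
    FaceFilterSketch._default_kwargs = {"scale": 1.0}
    FaceFilterSketch(share_defaults=False, scale=2.0)
    after_copy = dict(FaceFilterSketch._default_kwargs)
    FaceFilterSketch(share_defaults=True, scale=2.0)
    after_alias = dict(FaceFilterSketch._default_kwargs)
    return after_copy, after_alias
```

Under these assumptions, `demo()` leaves the defaults untouched after the copying constructor and overwritten after the aliasing one, which is why keeping `.copy()` is the safer pattern.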
|
||||||||||||||||||
|
|
||||||||||||||||||
| if any_or_all not in ["any", "all"]: | ||||||||||||||||||
| raise ValueError(f"Keep strategy [{any_or_all}] is not supported. " f'Can only be one of ["any", "all"].') | ||||||||||||||||||
|
|
@@ -96,10 +98,13 @@ def compute_stats_single(self, sample, context=False): | |||||||||||||||||
|
|
||||||||||||||||||
| # count the number of detected faces in each image | ||||||||||||||||||
| face_counts = {} | ||||||||||||||||||
| - for key, image in images.items(): | ||||||||||||||||||
| - dets = detect_faces(image, model, **self.extra_kwargs) | ||||||||||||||||||
| - face_counts[key] = len(dets) | ||||||||||||||||||
| - logger.debug(f"face counts: {face_counts}") | ||||||||||||||||||
| + try: | ||||||||||||||||||
| + for key, image in images.items(): | ||||||||||||||||||
| + dets = detect_faces(image, model, **self.extra_kwargs) | ||||||||||||||||||
| + face_counts[key] = len(dets) | ||||||||||||||||||
| + logger.debug(f"face counts: {face_counts}") | ||||||||||||||||||
| + except Exception as e: | ||||||||||||||||||
| + logger.exception(e) | ||||||||||||||||||
|
|
||||||||||||||||||
| sample[Fields.stats][StatsKeys.face_counts] = [face_counts[key] for key in loaded_image_keys] | ||||||||||||||||||
| return sample | ||||||||||||||||||
|
|
||||||||||||||||||
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,143 @@ | ||||||
| import av | ||||||
| import numpy as np | ||||||
| from data_juicer.utils.constant import Fields, StatsKeys | ||||||
| from data_juicer.utils.mm_utils import (load_data_with_context, load_video, | ||||||
| pil_to_opencv, pil_to_opencv, process_each_frame) | ||||||
| from ..base_op import OPERATORS, Filter | ||||||
| from ..op_fusion import LOADED_VIDEOS | ||||||
| from ..op_fusion import INTER_SAMPLED_FRAMES | ||||||
|
|
||||||
| import psutil | ||||||
| import gc,os | ||||||
|
|
||||||
|
|
||||||
| import cv2,dlib | ||||||
| from PIL import ImageFilter | ||||||
|
|
||||||
| OP_NAME = 'video_face_ratio_filter' | ||||||
| @OPERATORS.register_module(OP_NAME) | ||||||
| @LOADED_VIDEOS.register_module(OP_NAME) | ||||||
|
|
||||||
| class VideoFaceRatioFilter(Filter): | ||||||
| """ | ||||||
| Keep data samples whose videos' durations are within a specified range. | ||||||
|
Contributor commented:
The docstring for this filter seems to be a copy-paste from another operator. It states "Keep data samples whose videos' durations are within a specified range," but this filter operates on face ratios, not durations. Please update the docstring to accurately describe the filter's functionality.
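One possible wording for the corrected docstring is sketched below. It is an assumption based on how `process_single` in this diff compares each video's face ratio against `threshold`, not the reviewer's actual suggested change; the class name `VideoFaceRatioFilterDocSketch` is a hypothetical stand-in so the snippet is self-contained.

```python
class VideoFaceRatioFilterDocSketch:
    """
    Keep data samples whose videos have a ratio of sampled frames containing
    at least one detected face that is greater than or equal to a threshold.

    Source: This operator is a part of HumanVBench (CVPR 2026).
    """
```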
|
||||||
|
|
||||||
| Source: This operator is a part of HumanVBench (CVPR 2026). | ||||||
| """ | ||||||
|
|
||||||
| def __init__(self, | ||||||
| threshold: float = 0.8, | ||||||
| detect_interval: int = 1, | ||||||
| any_or_all: str = 'all', | ||||||
| *args, | ||||||
| **kwargs): | ||||||
| """ | ||||||
| Initialization method. | ||||||
|
|
||||||
| :param any_or_all: keep this sample with 'any' or 'all' strategy of | ||||||
| all videos. 'any': keep this sample if any videos meet the | ||||||
| condition. 'all': keep this sample only if all videos meet the | ||||||
| condition. | ||||||
| :param args: extra args | ||||||
| :param kwargs: extra args | ||||||
| """ | ||||||
| super().__init__(*args, **kwargs) | ||||||
| self.threshold = threshold | ||||||
|
|
||||||
| if any_or_all not in ['any', 'all']: | ||||||
| raise ValueError(f'Keep strategy [{any_or_all}] is not supported. ' | ||||||
| f'Can only be one of ["any", "all"].') | ||||||
| self.any = (any_or_all == 'any') | ||||||
|
|
||||||
| # Initialize face detector | ||||||
| self.detector = dlib.get_frontal_face_detector() | ||||||
|
|
||||||
|
|
||||||
| self.detect_interval = detect_interval | ||||||
|
|
||||||
|
|
||||||
| def compute_stats_single(self, sample, rank=None, context=False): | ||||||
| # check if it's computed already | ||||||
| if StatsKeys.video_face_exist in sample[Fields.stats]: | ||||||
| return sample | ||||||
|
|
||||||
| # load videos | ||||||
| loaded_video_keys = sample[self.video_key] | ||||||
| video_faces_ratio = {} | ||||||
|
|
||||||
| # face_detect_S3FD = get_model(self.detector_key, rank=rank) | ||||||
|
|
||||||
| process = psutil.Process(os.getpid()) | ||||||
| # memory_before = process.memory_info().rss / 1024 ** 2 # MB | ||||||
|
|
||||||
|
|
||||||
| for video_key in loaded_video_keys: | ||||||
| try: | ||||||
| with av.open(video_key) as container: | ||||||
| # getting video stream | ||||||
| video_stream = next(s for s in container.streams if s.type == 'video') | ||||||
| # iterate over the video frame and detect faces | ||||||
| frame_counter = 0 | ||||||
| total_frames = 0 | ||||||
| frames_with_face = 0 | ||||||
| detect_num = 0 | ||||||
| for packet in container.demux(video_stream): | ||||||
| try: | ||||||
| for frame in packet.decode(): | ||||||
| total_frames += 1 | ||||||
| frame_counter += 1 | ||||||
|
|
||||||
| if frame_counter % self.detect_interval == 0: | ||||||
| detect_num = detect_num + 1 | ||||||
| img = frame.to_image() | ||||||
| image = pil_to_opencv(img) | ||||||
| # imageNumpy = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) | ||||||
| # faces = face_detect_S3FD.detect_faces(imageNumpy, conf_th=0.9, scales=[0.25]) | ||||||
| faces = self.detector(image) | ||||||
| if len(faces) > 0: | ||||||
| frames_with_face += 1 | ||||||
| except Exception as e: | ||||||
| print(f"Frame decoding error in video {video_key}: {e}") | ||||||
| frames_with_face = 0 | ||||||
| detect_num = 0 | ||||||
|
|
||||||
| # calculate the proportion of the number of face frames | ||||||
| if detect_num > 0: | ||||||
| face_ratio = frames_with_face / detect_num | ||||||
| else: | ||||||
| face_ratio = 0.0 | ||||||
| video_faces_ratio[video_key] = face_ratio | ||||||
| except av.AVError as e: | ||||||
| print(f"Error opening video {video_key}: {e}") | ||||||
| video_faces_ratio[video_key] = 0.0 | ||||||
| finally: | ||||||
| container.close() | ||||||
|
|
||||||
| video_faces_ratio[video_key] = face_ratio | ||||||
|
|
||||||
| # get video faces ratio | ||||||
| sample[Fields.stats][StatsKeys.video_face_exist] = [ | ||||||
| video_faces_ratio[video_key] for video_key in sample[self.video_key] | ||||||
| ] | ||||||
|
|
||||||
| memory_after = process.memory_info().rss / 1024 ** 2 # MB | ||||||
| print(f"Memory Usage: {memory_after:.2f} MB") | ||||||
|
|
||||||
| gc.collect() | ||||||
|
|
||||||
| return sample | ||||||
|
|
||||||
| def process_single(self, sample): | ||||||
| video_faces_ratio = sample[Fields.stats][StatsKeys.video_face_exist] | ||||||
| keep_bools = np.array([ | ||||||
| duration >= self.threshold | ||||||
| for duration in video_faces_ratio | ||||||
| ]) | ||||||
| if len(keep_bools) <= 0: | ||||||
| return True | ||||||
|
|
||||||
| # different strategies | ||||||
| if self.any: | ||||||
| return keep_bools.any() | ||||||
| else: | ||||||
| return keep_bools.all() | ||||||
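The keep decision in this new file can be summarized with a small, dependency-free sketch. It assumes the semantics visible in the diff (ratio = frames with a face / frames checked, 0.0 when nothing was checked, kept when ratio >= `threshold`); the function names `face_ratio` and `keep_sample` are hypothetical and not part of the PR.

```python
from typing import Sequence


def face_ratio(frames_with_face: int, detected_frames: int) -> float:
    """Share of checked frames that contained at least one face."""
    # As in the diff, a video with no checked frames gets a ratio of 0.0.
    if detected_frames > 0:
        return frames_with_face / detected_frames
    return 0.0


def keep_sample(ratios: Sequence[float],
                threshold: float = 0.8,
                any_or_all: str = "all") -> bool:
    """Decide whether a sample is kept, given one ratio per video."""
    if any_or_all not in ("any", "all"):
        raise ValueError(f"Keep strategy [{any_or_all}] is not supported.")
    # A sample without videos is kept, as in process_single.
    if len(ratios) == 0:
        return True
    flags = [ratio >= threshold for ratio in ratios]
    return any(flags) if any_or_all == "any" else all(flags)
```

For example, a sample with ratios `[0.9, 0.5]` is dropped under the default `'all'` strategy but kept under `'any'`.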
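This PR seems to remove support for compressed JSON files (`.gz`, `.zst`) in several places, which is a significant breaking change and seems unrelated to the main goal of adding video operators. The changes are also inconsistent across the codebase. For example, `load_strategy.py` and `ray_dataset.py` remove support for `.jsonl.zst`, but `data_juicer/format/json_formatter.py` retains it. Could you clarify if removing compressed file support is intended? If so, the implementation should be consistent across the codebase.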
.gz,.zst) in several places, which is a significant breaking change and seems unrelated to the main goal of adding video operators. The changes are also inconsistent across the codebase. For example,load_strategy.pyandray_dataset.pyremove support for.jsonl.zst, butdata_juicer/format/json_formatter.pyretains it. Could you clarify if removing compressed file support is intended? If so, the implementation should be consistent across the codebase.