Data

Canonical source

All datasets used in the paper are published on HuggingFace Hub at holi-lab/ReCAP_datatset. The Hub release includes only the raw splits — the record-level persuasion utility scores and QG candidates used for training are produced locally with the scripts in scripts/ (see docs/reproducing_results.md).

holi-lab/ReCAP_datatset/
├── CMV/        {train, validation, test}     # 1,341 / 167 / 168 (Table 8)
├── OpinionQA/  {train, validation, test}     # 1,198 / 150 / 150
└── PRISM/      {train, validation, test}     #   992 / 124 / 124

Programmatic loading:

from recap.data import load_raw_split, extract_passages, extract_comments

cmv_test = load_raw_split("cmv", "test")        # HF Dataset (168 rows)
row = cmv_test[0]
passages = extract_passages(row)                 # list[str] — user history bodies
comments = extract_comments(row)                 # list[{"text": str, "label": 0|1}]

recap.data.load_raw_split accepts both val and validation for the validation split.
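One way this alias handling could work is a small lookup applied before the split name is used. The sketch below is illustrative only: `normalize_split` and `_SPLIT_ALIASES` are hypothetical names and are not taken from `recap.data`.

```python
# Hypothetical sketch of split-name normalization; not the recap.data implementation.
_SPLIT_ALIASES = {
    "train": "train",
    "val": "validation",         # short alias
    "validation": "validation",  # canonical Hub name
    "test": "test",
}


def normalize_split(split: str) -> str:
    """Map a user-facing split name onto the canonical split name."""
    key = split.strip().lower()
    if key not in _SPLIT_ALIASES:
        raise ValueError(
            f"unknown split {split!r}; expected one of {sorted(_SPLIT_ALIASES)}"
        )
    return _SPLIT_ALIASES[key]
```

With a helper like this, `"val"` and `"validation"` resolve to the same split, and an unrecognized name fails early instead of producing an empty load.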

Raw row schema (CMV / PRISM)

Field	Type	Description
`post_id`	`str`	Post identifier
`user_id`	`str`	Author identifier — original Reddit username for CMV; PRISM-anonymized `userNNN` for PRISM
`post`	`str`	Post body (CMV submission or PRISM prompt)
`user_history`	`list[HistoryEntry]`	Prior posts/messages for this user
`preferred_responses`	`list[str]`	Delta-awarded / user-preferred candidate responses
`dispreferred_responses`	`list[str]`	Non-delta / user-dispreferred candidate responses

where HistoryEntry = {body: str, created_utc: int|None, type: str} for both CMV and PRISM. PRISM rows additionally carry preferred_responses_metadata and dispreferred_responses_metadata, each a list of {model_name, score} dicts.
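The schema above can be written down as typed dictionaries, which makes the relationship between the two response lists and the labeled comments explicit. This is a minimal sketch: `RawRow` and `labeled_comments` are hypothetical names, the toy row is invented, and placing preferred responses first is an assumption rather than documented behavior of `extract_comments`.

```python
from typing import Optional, TypedDict


class HistoryEntry(TypedDict):
    body: str
    created_utc: Optional[int]
    type: str


class RawRow(TypedDict):
    post_id: str
    user_id: str
    post: str
    user_history: list[HistoryEntry]
    preferred_responses: list[str]
    dispreferred_responses: list[str]


def labeled_comments(row: RawRow) -> list[dict]:
    """Flatten both response lists into [{"text", "label"}], with label 1 = preferred."""
    preferred = [{"text": t, "label": 1} for t in row["preferred_responses"]]
    dispreferred = [{"text": t, "label": 0} for t in row["dispreferred_responses"]]
    return preferred + dispreferred  # ordering is an assumption of this sketch


# Invented toy row, for illustration only.
toy_row: RawRow = {
    "post_id": "p1",
    "user_id": "u1",
    "post": "CMV: example view",
    "user_history": [{"body": "an earlier comment", "created_utc": 1, "type": "comment"}],
    "preferred_responses": ["a reply that earned a delta"],
    "dispreferred_responses": ["a reply that did not"],
}
```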

Raw row schema (OpinionQA)

Field	Type	Description
`sample_id`	`str`	`qid<question>_user<id>`
`user_id`	`str`	Pew ATP respondent identifier
`target`	`dict`	`{qid, wave, key, question, options, answer}`
`user_history`	`list[dict]`	Prior survey responses `{wave, key, question, answer}`

recap.data.schema.extract_post renders the target question + numbered answer options into the post string consumed by the MCQ predictor prompt. For the user history, extract_passages produces one block per past survey response in the form "[wave=X, key=Y]\nQuestion: <question>\nAnswer: <answer>", mirroring origin/mcq_pokpo:bootstrapping/prepare_opinionqa_for_profiler.py so the profiler sees the same context the paper's experiments used.
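The rendering described above can be sketched as two small formatting functions. The history block format follows the string quoted in the text; the option numbering style (`1.`, `2.`, ...) and the function names `render_history_entry` and `render_target` are assumptions made for this example, so check `recap.data.schema` for the actual layout.

```python
def render_history_entry(entry: dict) -> str:
    """Render one past survey response as a profiler context block."""
    return (
        f"[wave={entry['wave']}, key={entry['key']}]\n"
        f"Question: {entry['question']}\n"
        f"Answer: {entry['answer']}"
    )


def render_target(target: dict) -> str:
    """Render the target question followed by numbered answer options.

    The "1. option" numbering style is an assumption of this sketch.
    """
    options = "\n".join(
        f"{i}. {option}" for i, option in enumerate(target["options"], start=1)
    )
    return f"{target['question']}\n{options}"
```

A row's history would then be `[render_history_entry(e) for e in row["user_history"]]`, one block per prior response.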

Schema helpers

recap.data.schema exposes the normalization layer that the rest of the codebase uses. Read through it before plugging in a different dataset:

extract_post_id(row) / extract_user_id(row) / extract_post(row)
extract_passages(row) — ordered history bodies
extract_passage_entries(row) — with synthesized integer passage_idx
extract_comments(row) — [{"text", "label"}] with label 1 = preferred
extract_passages_and_scores(row) — aligned (bodies, utility_scores) pair for utility-scored rows produced by scripts/score_utility.py
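To make the last two helpers concrete, here is a rough sketch of how indexed entries and aligned scores could be derived from a utility-scored row. It assumes that `passage_idx` is the 0-based position in `user_history`, that `passage_f1_scores` is keyed by that position as a string, and that a missing key falls back to a default score. The function names and the fallback are hypothetical, not the library's behavior.

```python
def passage_entries(row: dict) -> list[dict]:
    """Attach a synthesized integer index to each history body (0-based here)."""
    return [
        {"passage_idx": idx, "body": entry["body"]}
        for idx, entry in enumerate(row["user_history"])
    ]


def passages_and_scores(row: dict, default: float = 0.0) -> tuple[list[str], list[float]]:
    """Return (bodies, utility_scores) aligned by position in user_history."""
    scores = row.get("passage_f1_scores", {})
    bodies = [entry["body"] for entry in row["user_history"]]
    # Keys are str(idx); a missing key gets `default` (an assumption of this sketch).
    utilities = [float(scores.get(str(idx), default)) for idx in range(len(bodies))]
    return bodies, utilities
```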

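Dataset statistics (Table 8)

The figures below cover the CMV splits; the post counts match the CMV row of the layout under Canonical source.

Split	# Posts	Unique users	User history count (min / mean / median / max)	Preferred (min / mean / median)	Dispreferred (min / mean / median)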
Train	1,341	1,257	15 / 252.40 / 57 / 11,965	1 / 1.77 / 1	1 / 33.06 / 20
Val	167	69	16 / 956.35 / 65 / 19,583	1 / 2.24 / 1	1 / 31.81 / 19
Test	168	69	15 / 613.36 / 71 / 19,583	1 / 2.54 / 1.5	2 / 35.30 / 19
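If you want to sanity-check a downloaded split against this table, the post, user, and history-count columns can be recomputed from the raw rows. The sketch below assumes that "user history count" means `len(row["user_history"])` per post; `history_count_summary` is a hypothetical helper, not part of the repository.

```python
from statistics import mean, median


def history_count_summary(rows: list[dict]) -> dict:
    """Summarize post, user, and per-post history counts for one split."""
    counts = [len(row["user_history"]) for row in rows]
    return {
        "posts": len(rows),
        "unique_users": len({row["user_id"] for row in rows}),
        "history_min": min(counts),
        "history_mean": round(mean(counts), 2),
        "history_median": median(counts),
        "history_max": max(counts),
    }
```

Passing the rows returned by `load_raw_split("cmv", "test")` should let you compare the result against the Test row above, provided the assumption about how history is counted holds.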

Locally-produced training artifacts

These are not on the Hub — you generate them with the scripts in scripts/:

data/*/train_utility.jsonl — raw rows + passage_f1_scores: {str(idx): float} keyed by position in user_history. Output of scripts/score_utility.py.
data/*/profiler_dpo_train.jsonl — one row per grouping × positive profile with grouping_passages, best_profile, negative_profiles. Output of scripts/build_profile_preferences.py.
data/*/train_qg_candidates.jsonl — utility-scored rows plus expansion_base (the stage-1 question), augmented_generated (16 stage-2 candidates), and augmented_ndcg5 (NDCG@5 per candidate). Output of scripts/generate_qg_candidates.py. When invoked with --pick-best, the script also writes retrieval_query, the candidate with the highest NDCG@5, which is what scripts/train_retriever.py and scripts/eval_retrieval.py consume.
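The candidate selection described for the QG artifact can be sketched as follows. This is an illustration under stated assumptions, not the logic of scripts/generate_qg_candidates.py: it uses the standard NDCG@k definition with linear gains taken from the utility scores, treats `augmented_generated` and `augmented_ndcg5` as parallel lists, and breaks ties by taking the first maximum. The function names are hypothetical.

```python
import math


def dcg_at_k(gains: list[float], k: int) -> float:
    """Discounted cumulative gain over the top-k gains, with linear gain."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains: list[float], all_gains: list[float], k: int = 5) -> float:
    """NDCG@k: DCG of the retrieved order divided by the ideal DCG."""
    ideal = dcg_at_k(sorted(all_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


def pick_best_query(row: dict) -> str:
    """Return the candidate with the highest NDCG@5 (first one wins on ties)."""
    candidates = row["augmented_generated"]
    scores = row["augmented_ndcg5"]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]
```

Here `ranked_gains` holds the utility scores of the retrieved passages in retrieval order, and `all_gains` holds the scores of every passage in the user's history.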

Re-building from raw Reddit dumps

The paper's filtering pipeline (Appendix A) applies two sequential filters:

Interaction completeness — keep only posts with ≥ 1 delta comment and ≥ 1 non-delta comment.
History availability — keep only OPs with ≥ 15 historical records (posts or comments) published prior to the post's timestamp.
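The two filters above could be ported along these lines. This is a rough sketch, not the Appendix A implementation: the field names (`comments`, `is_delta`, `op_history`, `created_utc`) are hypothetical placeholders for whatever your parsed Pushshift records expose, and "prior to the post's timestamp" is read as a strict comparison.

```python
MIN_HISTORY_RECORDS = 15


def has_complete_interactions(post: dict) -> bool:
    """Filter 1: at least one delta comment and at least one non-delta comment."""
    flags = [bool(comment["is_delta"]) for comment in post["comments"]]
    return any(flags) and not all(flags)


def has_enough_history(post: dict, min_records: int = MIN_HISTORY_RECORDS) -> bool:
    """Filter 2: the OP has >= min_records posts or comments before this post."""
    prior = [
        record for record in post["op_history"]
        if record["created_utc"] < post["created_utc"]
    ]
    return len(prior) >= min_records


def filter_posts(posts: list[dict]) -> list[dict]:
    """Apply the two filters in sequence."""
    complete = [p for p in posts if has_complete_interactions(p)]
    return [p for p in complete if has_enough_history(p)]
```

Counting only records dated before the post keeps later activity by the same user out of the history used for that post.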

After filtering, the posts are split train/val/test 8:1:1 at the OP (original poster) level, giving a final set of 1,676 posts from 1,326 unique users (the CMV layout above).

The reference implementation lives in scripts/prepare_cmv.py, but it currently delegates to the HuggingFace release. To reconstruct the dataset from scratch, pull the CMV data from Pushshift and port the filtering logic documented in Appendix A.
