All datasets used in the paper are published on the HuggingFace Hub at
`holi-lab/ReCAP_datatset`.
The Hub release includes only the raw splits; the record-level persuasion
utility scores and QG candidates used for training are produced locally with
the scripts in `scripts/` (see `docs/reproducing_results.md`).

```text
holi-lab/ReCAP_datatset/
├── CMV/       {train, validation, test}   # 1,341 / 167 / 168 (Table 8)
├── OpinionQA/ {train, validation, test}   # 1,198 / 150 / 150
└── PRISM/     {train, validation, test}   # 992 / 124 / 124
```

Programmatic loading:

```python
from recap.data import load_raw_split, extract_passages, extract_comments

cmv_test = load_raw_split("cmv", "test")  # HF Dataset (168 rows)
row = cmv_test[0]
passages = extract_passages(row)  # list[str]: user history bodies
comments = extract_comments(row)  # list[{"text": str, "label": 0|1}]
```

`recap.data.load_raw_split` accepts both `val` and `validation` for the
validation split.
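
Accepting both spellings usually comes down to a small alias table. The sketch below is only an illustration of that idea: `normalize_split` and `_SPLIT_ALIASES` are hypothetical names, and nothing here is taken from the actual `recap.data` implementation.

```python
# Hypothetical helper; the real handling lives inside recap.data.load_raw_split.
_SPLIT_ALIASES = {
    "train": "train",
    "val": "validation",
    "validation": "validation",
    "test": "test",
}


def normalize_split(name: str) -> str:
    """Map a user-supplied split name onto the canonical Hub split name."""
    key = name.strip().lower()
    if key not in _SPLIT_ALIASES:
        valid = ", ".join(sorted(_SPLIT_ALIASES))
        raise ValueError(f"unknown split {name!r}; expected one of: {valid}")
    return _SPLIT_ALIASES[key]
```

Normalizing once at the entry point means downstream code only ever sees `train`, `validation`, or `test`.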
| Field | Type | Description |
|---|---|---|
| `post_id` | `str` | Post identifier |
| `user_id` | `str` | Author identifier: original Reddit username for CMV; PRISM-anonymized `userNNN` for PRISM |
| `post` | `str` | Post body (CMV submission or PRISM prompt) |
| `user_history` | `list[HistoryEntry]` | Prior posts/messages for this user |
| `preferred_responses` | `list[str]` | Delta-awarded / user-preferred candidate responses |
| `dispreferred_responses` | `list[str]` | Non-delta / user-dispreferred candidate responses |

where `HistoryEntry = {body: str, created_utc: int|None, type: str}` for CMV;
PRISM uses the same shape. PRISM additionally carries
`preferred_responses_metadata` / `dispreferred_responses_metadata` lists of
`{model_name, score}` dicts.
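
To make the row shape concrete, here is a minimal sketch of how passages and labeled comments could be derived from a CMV/PRISM row. The helper names `passages_from_row` and `comments_from_row` are hypothetical stand-ins (the library's own extractors are `extract_passages` and `extract_comments`), the example row is invented, and the ordering of preferred before dispreferred comments is an assumption.

```python
from typing import Optional, TypedDict


class HistoryEntry(TypedDict):
    body: str
    created_utc: Optional[int]
    type: str


def passages_from_row(row: dict) -> list[str]:
    """Return the history bodies in their stored order."""
    return [entry["body"] for entry in row["user_history"]]


def comments_from_row(row: dict) -> list[dict]:
    """Flatten both response lists into labeled comments (1 = preferred)."""
    preferred = [{"text": t, "label": 1} for t in row["preferred_responses"]]
    dispreferred = [{"text": t, "label": 0} for t in row["dispreferred_responses"]]
    # Assumption: preferred responses come first; the real order may differ.
    return preferred + dispreferred


# Invented row for illustration only; values are not taken from the dataset.
example_row = {
    "post_id": "p1",
    "user_id": "user001",
    "post": "Example post body",
    "user_history": [
        HistoryEntry(body="first message", created_utc=None, type="comment"),
        HistoryEntry(body="second message", created_utc=1600000000, type="post"),
    ],
    "preferred_responses": ["a persuasive reply"],
    "dispreferred_responses": ["a weaker reply", "another weak reply"],
}

passages = passages_from_row(example_row)
comments = comments_from_row(example_row)
```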
OpinionQA rows use a different schema:

| Field | Type | Description |
|---|---|---|
| `sample_id` | `str` | `qid<question>_user<id>` |
| `user_id` | `str` | Pew ATP respondent identifier |
| `target` | `dict` | `{qid, wave, key, question, options, answer}` |
| `user_history` | `list[dict]` | Prior survey responses `{wave, key, question, answer}` |

`recap.data.schema.extract_post` renders the target question and its numbered
answer options into the `post` string consumed by the MCQ predictor prompt. For
the user history, `extract_passages` produces one block per past survey response
in the form `"[wave=X, key=Y]\nQuestion: <question>\nAnswer: <answer>"`, mirroring
`origin/mcq_pokpo:bootstrapping/prepare_opinionqa_for_profiler.py` so the
profiler sees the same context the paper's experiments used.
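
The two renderings described above can be sketched as follows. The history block format is copied from the description. The function names and sample values are hypothetical, and the `1. option` numbering style is an assumption, since the exact layout used by `extract_post` is not shown here.

```python
def render_history_block(entry: dict) -> str:
    """Format one past survey response as a profiler passage."""
    return (
        f"[wave={entry['wave']}, key={entry['key']}]\n"
        f"Question: {entry['question']}\n"
        f"Answer: {entry['answer']}"
    )


def render_target_post(target: dict) -> str:
    """Render the target question followed by numbered answer options."""
    # Assumption: options are numbered from 1 as "1. <option>".
    option_lines = [
        f"{i}. {option}" for i, option in enumerate(target["options"], start=1)
    ]
    return "\n".join([target["question"], *option_lines])


# Invented values for illustration only.
history_entry = {"wave": "W1", "key": "Q1", "question": "Example question?", "answer": "Yes"}
target = {"question": "Another question?", "options": ["Yes", "No"]}

block = render_history_block(history_entry)
post = render_target_post(target)
```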
`recap.data.schema` exposes the normalization layer the rest of the codebase
uses; read through it before slotting in a different dataset:

- `extract_post_id(row)` / `extract_user_id(row)` / `extract_post(row)`
- `extract_passages(row)`: ordered history bodies
- `extract_passage_entries(row)`: with synthesized integer `passage_idx`
- `extract_comments(row)`: `[{"text", "label"}]` with label 1 = preferred
- `extract_passages_and_scores(row)`: aligned `(bodies, utility_scores)` pair for utility-scored rows produced by `scripts/score_utility.py`

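
As a rough illustration of what "synthesized integer `passage_idx`" could mean, the sketch below attaches each history entry's position to a copy of the entry. `passage_entries_from_row` is a hypothetical name, and zero-based indexing is an assumption rather than something confirmed by the library.

```python
def passage_entries_from_row(row: dict) -> list[dict]:
    """Copy each history entry and attach its position as passage_idx."""
    # Assumption: passage_idx is the zero-based position in user_history.
    return [
        {**entry, "passage_idx": idx}
        for idx, entry in enumerate(row["user_history"])
    ]


# Invented row for illustration only.
row = {"user_history": [{"body": "first"}, {"body": "second"}]}
entries = passage_entries_from_row(row)
```

Because the entries are copied, the original row is left unmodified.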
| Split | # Posts | Unique users | User history count (min / mean / median / max) | Preferred (min / mean / med) | Dispreferred (min / mean / med) |
|---|---|---|---|---|---|
| Train | 1,341 | 1,257 | 15 / 252.40 / 57 / 11,965 | 1 / 1.77 / 1 | 1 / 33.06 / 20 |
| Val | 167 | 69 | 16 / 956.35 / 65 / 19,583 | 1 / 2.24 / 1 | 1 / 31.81 / 19 |
| Test | 168 | 69 | 15 / 613.36 / 71 / 19,583 | 1 / 2.54 / 1.5 | 2 / 35.30 / 19 |
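
Statistics of this kind can be recomputed from any loaded split. The following is a minimal sketch, assuming rows are plain dicts with the CMV/PRISM fields described earlier. `summarize_counts` and `split_summary` are hypothetical helpers, not part of `recap.data`.

```python
from statistics import mean, median


def summarize_counts(counts: list[int]) -> dict:
    """Return min / mean / median / max for a list of per-row counts."""
    return {
        "min": min(counts),
        "mean": round(mean(counts), 2),
        "median": median(counts),
        "max": max(counts),
    }


def split_summary(rows: list[dict]) -> dict:
    """Summarize one split: post count, unique users, and list-length stats."""
    return {
        "posts": len(rows),
        "unique_users": len({r["user_id"] for r in rows}),
        "history": summarize_counts([len(r["user_history"]) for r in rows]),
        "preferred": summarize_counts([len(r["preferred_responses"]) for r in rows]),
        "dispreferred": summarize_counts([len(r["dispreferred_responses"]) for r in rows]),
    }


# Invented toy rows for illustration only.
toy_rows = [
    {"user_id": "a", "user_history": [{}], "preferred_responses": ["x"], "dispreferred_responses": ["y"]},
    {"user_id": "a", "user_history": [{}, {}], "preferred_responses": ["x"], "dispreferred_responses": ["y", "z"]},
    {"user_id": "b", "user_history": [{}, {}, {}], "preferred_responses": ["x", "w"], "dispreferred_responses": ["y"]},
]
summary = split_summary(toy_rows)
```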
The derived training files are not on the Hub; you generate them with the
scripts in `scripts/`:

- `data/*/train_utility.jsonl`: raw rows + `passage_f1_scores: {str(idx): float}` keyed by position in `user_history`. Output of `scripts/score_utility.py`.
- `data/*/profiler_dpo_train.jsonl`: one row per grouping × positive profile with `grouping_passages`, `best_profile`, `negative_profiles`. Output of `scripts/build_profile_preferences.py`.
- `data/*/train_qg_candidates.jsonl`: utility-scored rows + `expansion_base` (stage-1 question), `augmented_generated` (16 stage-2 candidates), and `augmented_ndcg5` (NDCG@5 per candidate). Output of `scripts/generate_qg_candidates.py`. When invoked with `--pick-best`, it also writes `retrieval_query` = argmax-NDCG candidate, which is what `scripts/train_retriever.py` and `scripts/eval_retrieval.py` consume.

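
Two of the fields above lend themselves to a short sketch: aligning `passage_f1_scores` with `user_history`, and the argmax-NDCG selection behind `retrieval_query`. The function names are hypothetical. Treating missing score keys as `0.0`, assuming `augmented_ndcg5` is a list aligned with `augmented_generated`, and breaking ties by first occurrence are all assumptions, not documented behavior of the scripts.

```python
def passages_and_scores(row: dict) -> tuple[list[str], list[float]]:
    """Align history bodies with utility scores keyed by str(position)."""
    bodies = [entry["body"] for entry in row["user_history"]]
    raw_scores = row["passage_f1_scores"]
    # Assumption: a passage with no recorded score gets 0.0.
    scores = [float(raw_scores.get(str(i), 0.0)) for i in range(len(bodies))]
    return bodies, scores


def pick_best_query(row: dict) -> str:
    """Return the stage-2 candidate with the highest NDCG@5."""
    candidates = row["augmented_generated"]
    ndcg = row["augmented_ndcg5"]  # assumed aligned with candidates
    # max() returns the first maximal index, so ties go to the earliest candidate.
    best_idx = max(range(len(candidates)), key=lambda i: ndcg[i])
    return candidates[best_idx]


# Invented row for illustration only.
row = {
    "user_history": [{"body": "p0"}, {"body": "p1"}, {"body": "p2"}],
    "passage_f1_scores": {"0": 0.25, "2": 0.5},
    "augmented_generated": ["query a", "query b", "query c"],
    "augmented_ndcg5": [0.1, 0.7, 0.7],
}
bodies, scores = passages_and_scores(row)
best = pick_best_query(row)
```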
The paper's filtering pipeline (Appendix A) applies two sequential filters:
- Interaction completeness — keep only posts with ≥ 1 delta comment and ≥ 1 non-delta comment.
- History availability — keep only OPs with ≥ 15 historical records (posts or comments) published prior to the post's timestamp.
The final filtered set after splitting train/val/test 8:1:1 at the OP level is 1,676 posts / 1,326 unique users (the layout above).
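
The two filters can be sketched as a single predicate over a candidate post. This is an illustration under stated assumptions, not the paper's code: the post-level `created_utc` field is hypothetical (the released rows document a timestamp only on history entries), history entries with no timestamp are treated as not provably prior, and `passes_filters` is an invented name.

```python
MIN_HISTORY = 15  # threshold stated in the history-availability filter


def passes_filters(post: dict) -> bool:
    """Apply interaction completeness, then history availability."""
    # Filter 1: at least one delta and one non-delta comment.
    if not post["preferred_responses"] or not post["dispreferred_responses"]:
        return False
    # Filter 2: at least MIN_HISTORY records published before the post.
    # Assumption: post["created_utc"] exists; entries without a timestamp
    # cannot be shown to precede the post and are not counted.
    prior = [
        entry
        for entry in post["user_history"]
        if entry["created_utc"] is not None
        and entry["created_utc"] < post["created_utc"]
    ]
    return len(prior) >= MIN_HISTORY
```

Only the strictly earlier records count, so a user with many later posts but few earlier ones is still excluded.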
A reference implementation lives in `scripts/prepare_cmv.py`; it currently
delegates to the HuggingFace release. To reconstruct the dataset from scratch,
pull CMV from Pushshift and port the filtering logic documented in Appendix A.