Data

Canonical source

All datasets used in the paper are published on HuggingFace Hub at holi-lab/ReCAP_datatset. The Hub release includes only the raw splits — the record-level persuasion utility scores and QG candidates used for training are produced locally with the scripts in scripts/ (see docs/reproducing_results.md).

holi-lab/ReCAP_datatset/
├── CMV/        {train, validation, test}     # 1,341 / 167 / 168 (Table 8)
├── OpinionQA/  {train, validation, test}     # 1,198 / 150 / 150
└── PRISM/      {train, validation, test}     #   992 / 124 / 124

Programmatic loading:

from recap.data import load_raw_split, extract_passages, extract_comments

cmv_test = load_raw_split("cmv", "test")        # HF Dataset (168 rows)
row = cmv_test[0]
passages = extract_passages(row)                 # list[str] — user history bodies
comments = extract_comments(row)                 # list[{"text": str, "label": 0|1}]

recap.data.load_raw_split accepts both val and validation for the validation split.
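The alias handling described above can be pictured with a minimal sketch. The helper name `normalize_split` and the alias table are hypothetical and are not taken from `recap.data`; the only behavior assumed from the text is that `val` and `validation` resolve to the same split.

```python
# Hypothetical helper, not part of recap.data: resolves split aliases.
_SPLIT_ALIASES = {"val": "validation"}
_VALID_SPLITS = ("train", "validation", "test")


def normalize_split(split: str) -> str:
    """Map a user-facing split name onto one of train/validation/test."""
    name = _SPLIT_ALIASES.get(split, split)
    if name not in _VALID_SPLITS:
        raise ValueError(
            f"unknown split {split!r}; expected one of {_VALID_SPLITS} or 'val'"
        )
    return name
```

With this sketch, `normalize_split("val")` and `normalize_split("validation")` both return `"validation"`, and any other unknown name raises a `ValueError`.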

Raw row schema (CMV / PRISM)

| Field | Type | Description |
| --- | --- | --- |
| post_id | str | Post identifier |
| user_id | str | Author identifier: original Reddit username for CMV; PRISM-anonymized userNNN for PRISM |
| post | str | Post body (CMV submission or PRISM prompt) |
| user_history | list[HistoryEntry] | Prior posts/messages for this user |
| preferred_responses | list[str] | Delta-awarded / user-preferred candidate responses |
| dispreferred_responses | list[str] | Non-delta / user-dispreferred candidate responses |

where HistoryEntry = {body: str, created_utc: int|None, type: str} for CMV; PRISM uses the same shape. PRISM additionally carries preferred_responses_metadata / dispreferred_responses_metadata lists of {model_name, score} dicts.
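For readers who prefer types over tables, the CMV / PRISM row shape can be written down as `TypedDict`s. This is only a sketch of the schema as documented here: the class names, the `missing_fields` helper, and the assumption that `score` is numeric are illustrative and do not come from the codebase.

```python
from typing import List, Optional, TypedDict


class HistoryEntry(TypedDict):
    body: str
    created_utc: Optional[int]
    type: str


class ResponseMetadata(TypedDict):
    model_name: str
    score: float  # assumption: the documented "score" is numeric


class RawRow(TypedDict):
    post_id: str
    user_id: str
    post: str
    user_history: List[HistoryEntry]
    preferred_responses: List[str]
    dispreferred_responses: List[str]


class PrismRow(RawRow, total=False):
    # PRISM-only extras; marked optional in this sketch.
    preferred_responses_metadata: List[ResponseMetadata]
    dispreferred_responses_metadata: List[ResponseMetadata]


REQUIRED_FIELDS = (
    "post_id",
    "user_id",
    "post",
    "user_history",
    "preferred_responses",
    "dispreferred_responses",
)


def missing_fields(row: dict) -> set:
    """Return the documented required fields that are absent from a row."""
    return set(REQUIRED_FIELDS) - set(row)
```

A check like `missing_fields(row)` can serve as a quick sanity test when slotting in rows from another source.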

Raw row schema (OpinionQA)

| Field | Type | Description |
| --- | --- | --- |
| sample_id | str | qid<question>_user<id> |
| user_id | str | Pew ATP respondent identifier |
| target | dict | {qid, wave, key, question, options, answer} |
| user_history | list[dict] | Prior survey responses {wave, key, question, answer} |

recap.data.schema.extract_post renders the target question + numbered answer options into the post string consumed by the MCQ predictor prompt. For the user history, extract_passages produces one block per past survey response in the form "[wave=X, key=Y]\nQuestion: <question>\nAnswer: <answer>", mirroring origin/mcq_pokpo:bootstrapping/prepare_opinionqa_for_profiler.py so the profiler sees the same context the paper's experiments used.
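The history block format quoted above can be reproduced with a short sketch. The function names are hypothetical and this is not the code in `recap.data.schema`; it only follows the documented template, and the sample entry in the usage note is invented for illustration.

```python
from typing import Dict, List


def render_history_entry(entry: Dict[str, str]) -> str:
    """Render one past survey response in the documented block format."""
    return (
        f"[wave={entry['wave']}, key={entry['key']}]\n"
        f"Question: {entry['question']}\n"
        f"Answer: {entry['answer']}"
    )


def render_history(user_history: List[Dict[str, str]]) -> List[str]:
    """Render every history entry, preserving the original order."""
    return [render_history_entry(entry) for entry in user_history]
```

For an invented entry `{"wave": "W26", "key": "Q1", "question": "Example question?", "answer": "Yes"}`, the rendered block starts with `[wave=W26, key=Q1]`, followed by the `Question:` and `Answer:` lines.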

Schema helpers

recap.data.schema exposes the normalization layer the rest of the codebase uses. Read through it before adapting the code to a different dataset:

  • extract_post_id(row) / extract_user_id(row) / extract_post(row)
  • extract_passages(row) — ordered history bodies
  • extract_passage_entries(row) — with synthesized integer passage_idx
  • extract_comments(row) returns [{"text", "label"}] entries, with label 1 = preferred
  • extract_passages_and_scores(row) — aligned (bodies, utility_scores) pair for utility-scored rows produced by scripts/score_utility.py
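To make the contracts of two of these helpers concrete, here is a rough re-implementation under stated assumptions. The `sketch_` functions are hypothetical stand-ins, not the library code. The ordering of comments (preferred first) and the 0.0 default for a passage without a score are assumptions; the `passage_f1_scores` field, keyed by `str(idx)`, is the one carried by utility-scored rows.

```python
from typing import Dict, List, Tuple


def sketch_extract_comments(row: dict) -> List[Dict[str, object]]:
    """Merge both response lists into labeled comments (1 = preferred)."""
    comments = [{"text": t, "label": 1} for t in row.get("preferred_responses", [])]
    comments += [{"text": t, "label": 0} for t in row.get("dispreferred_responses", [])]
    return comments


def sketch_extract_passages_and_scores(row: dict) -> Tuple[List[str], List[float]]:
    """Return history bodies and utility scores aligned by history position."""
    bodies = [entry["body"] for entry in row["user_history"]]
    scores_by_idx = row.get("passage_f1_scores", {})
    # Scores are keyed by the stringified position in user_history.
    # Assumption: a passage with no recorded score counts as 0.0.
    scores = [float(scores_by_idx.get(str(i), 0.0)) for i in range(len(bodies))]
    return bodies, scores
```

Both functions read only the documented row fields, so they can be tried on a row from `load_raw_split` or on a hand-built dict.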

Dataset statistics (CMV, Table 8)

| Split | # Posts | Unique users | User history count (min / mean / median / max) | Preferred (min / mean / median) | Dispreferred (min / mean / median) |
| --- | --- | --- | --- | --- | --- |
| Train | 1,341 | 1,257 | 15 / 252.40 / 57 / 11,965 | 1 / 1.77 / 1 | 1 / 33.06 / 20 |
| Val | 167 | 69 | 16 / 956.35 / 65 / 19,583 | 1 / 2.24 / 1 | 1 / 31.81 / 19 |
| Test | 168 | 69 | 15 / 613.36 / 71 / 19,583 | 1 / 2.54 / 1.5 | 2 / 35.30 / 19 |

Locally-produced training artifacts

These are not on the Hub — you generate them with the scripts in scripts/:

  • data/*/train_utility.jsonl — raw rows + passage_f1_scores: {str(idx): float} keyed by position in user_history. Output of scripts/score_utility.py.
  • data/*/profiler_dpo_train.jsonl — one row per grouping × positive profile with grouping_passages, best_profile, negative_profiles. Output of scripts/build_profile_preferences.py.
  • data/*/train_qg_candidates.jsonl: utility-scored rows plus expansion_base (stage-1 question), augmented_generated (16 stage-2 candidates), and augmented_ndcg5 (NDCG@5 per candidate). Output of scripts/generate_qg_candidates.py. When invoked with --pick-best, the script also writes retrieval_query, the candidate with the highest NDCG@5, which is the field scripts/train_retriever.py and scripts/eval_retrieval.py consume.
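The NDCG@5 scoring and the --pick-best selection can be illustrated with a small sketch. This is not the code in scripts/generate_qg_candidates.py: the function names are hypothetical, and the use of linear gains with a log2 discount, with passage utility scores as relevance, is an assumption about how NDCG is computed there. It also assumes augmented_ndcg5 is a list aligned with augmented_generated.

```python
import math
from typing import Dict, List, Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k gains (log2 discount)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: Sequence[int], utility: Dict[int, float], k: int = 5) -> float:
    """NDCG@k of a ranking of passage indices against per-passage utility."""
    gains = [utility.get(i, 0.0) for i in ranked_ids]
    ideal_dcg = dcg_at_k(sorted(utility.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def pick_best(candidates: List[str], ndcg_scores: List[float]) -> str:
    """Return the candidate with the highest score (first one wins on ties)."""
    best = max(range(len(candidates)), key=lambda i: ndcg_scores[i])
    return candidates[best]
```

Under these assumptions, the retrieval query for a row would be `pick_best(row["augmented_generated"], row["augmented_ndcg5"])`.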

Re-building from raw Reddit dumps

The paper's filtering pipeline (Appendix A) applies two sequential filters:

  1. Interaction completeness — keep only posts with ≥ 1 delta comment and ≥ 1 non-delta comment.
  2. History availability — keep only OPs with ≥ 15 historical records (posts or comments) published prior to the post's timestamp.

After filtering, the set contains 1,676 posts from 1,326 unique users. It is split 8:1:1 into train/val/test at the OP level, which yields the layout above.
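The two filters above can be sketched as follows. This is an illustration under assumptions, not the logic in scripts/prepare_cmv.py: the input field names (`comments`, `is_delta`, `author`, `created_utc`) and the function names are hypothetical, and "prior to" is read as a strictly earlier timestamp.

```python
from typing import Dict, List

MIN_HISTORY = 15  # threshold from the history availability filter


def passes_interaction_completeness(post: dict) -> bool:
    """Keep posts with at least one delta and one non-delta comment."""
    labels = [bool(c["is_delta"]) for c in post["comments"]]
    return any(labels) and not all(labels)


def prior_history(post: dict, author_records: List[dict]) -> List[dict]:
    """Records by the OP published strictly before the post's timestamp."""
    return [r for r in author_records if r["created_utc"] < post["created_utc"]]


def filter_posts(posts: List[dict], records_by_author: Dict[str, List[dict]]) -> List[dict]:
    """Apply both filters in sequence and attach the surviving history."""
    kept = []
    for post in posts:
        if not passes_interaction_completeness(post):
            continue
        history = prior_history(post, records_by_author.get(post["author"], []))
        if len(history) >= MIN_HISTORY:
            kept.append({**post, "user_history": history})
    return kept
```

Running the completeness check first avoids collecting history for posts that would be dropped anyway, which matters when the author records come from large dumps.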

A reference implementation lives in scripts/prepare_cmv.py; it currently delegates to the HuggingFace release. If you need to reconstruct from scratch, pull CMV from Pushshift and port the filtering logic documented in Appendix A.