This repository implements the KeyChain data creation pipeline from LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts. KeyChain synthesizes high-quality, verifiable long-context QA training data by embedding questions within long contexts using UUID key-value chain linking and multi-level distractors. It is designed specifically for reinforcement learning over long contexts, with fine-grained difficulty control.
RL Training: For the reinforcement learning training framework, see LoongRL.
KeyChain constructs long-context QA instances through a three-stage pipeline:
- Data Filtering — Filter source multi-hop questions (HotpotQA, MuSiQue, 2WikiMQA) using Qwen2.5-32B, retaining only questions of appropriate difficulty
- Long Context Filling — Compose long contexts (4K–128K tokens) by shuffling and inserting documents around the question-relevant passages
- KeyChain Insertion — Generate UUID key-value chains and insert the pairs at random positions throughout the context. One chain leads to the real question; the other chains lead to distractor questions. The model must follow the correct chain from a given starting UUID to locate the question, then reason over the surrounding documents to answer it (see the sketch after this list)
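In essence, the insertion step builds one UUID chain per question and scatters the resulting pairs across the documents. The following is a minimal sketch; the function names, default hop count, and placement strategy are illustrative assumptions, not the repository's actual API:

```python
import random
import uuid

def build_chain(question: str, hops: int = 3) -> tuple[str, list[str]]:
    """Build a UUID chain of `hops` links whose final value is `question`.

    Returns the starting UUID plus the key-value pairs, formatted like
    the example instance below, to scatter through the long context.
    """
    keys = [str(uuid.uuid4()) for _ in range(hops)]
    pairs = []
    for i, key in enumerate(keys):
        # Each key points to the next key; the last key points to the question.
        value = keys[i + 1] if i + 1 < len(keys) else question
        pairs.append(f'{{"{key}": "{value}"}}.')
    return keys[0], pairs

def insert_pairs(documents: list[str], pairs: list[str]) -> list[str]:
    """Append each pair to a randomly chosen document in the context."""
    docs = list(documents)
    for pair in pairs:
        docs[random.randrange(len(docs))] += "\n" + pair
    return docs

# One chain leads to the real question; additional chains end in
# distractor questions drawn from other filtered instances.
start_uuid, all_pairs = build_chain("Who founded the record label that ...?")
for distractor in ["Distractor question A?", "Distractor question B?"]:
    _, extra = build_chain(distractor)
    all_pairs += extra
```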
Each instance presents the model with a long context containing documents interleaved with UUID key-value pairs. The model receives a starting UUID and must follow the chain to find the hidden question:
```
Please read the following text.
Document 0: ...
Document 3:
Who's Who? is a studio album by American jazz musician John Scofield. ...
{"bdd640fb-0667-4ad1-9c80-317fa3b1799d": "23b8c1e9-3924-46de-beb1-3b9046685257"}.
...
Document 10:
... Sonoma State offers 92 Bachelor's degrees, 19 Master's degrees ...
{"972a8469-1641-4f82-8b9d-2434e465e150": "Musician and satirist Allie Goertz
wrote a song about the "The Simpsons" character Milhouse, who Matt Groening
named after who?"}.
...
Document 47:
Neil Affleck
{"bd9c66b3-ad3c-4d6d-9a3d-1fa7bc8960a9": "972a8469-1641-4f82-8b9d-2434e465e150"}.
...
In the context above, there is one correct question to answer. The correct
question can only be found by following the correct consecutive chain of
key:value pairs encoded with UUID strings, starting from
"bdd640fb-0667-4ad1-9c80-317fa3b1799d".
Find the correct question first, then answer it.
```
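Solving an instance therefore means resolving the chain before any answering can happen. Below is a minimal sketch of the lookup step; the regex and pair format are inferred from the example above rather than taken from the repository:

```python
import re

# Matches the {"key": "value"} pairs embedded in the context; DOTALL lets
# a question value span multiple lines.
PAIR_RE = re.compile(r'\{"([0-9a-f\-]{36})":\s*"(.*?)"\}\.', re.DOTALL)

def find_hidden_question(context: str, start_uuid: str) -> str:
    """Follow the UUID chain from start_uuid; the first value that is
    not itself a key is the hidden question."""
    pairs = dict(PAIR_RE.findall(context))
    value = pairs[start_uuid]
    while value in pairs:  # the value is another key: keep following
        value = pairs[value]
    return value
```

Locating the question is only half the task: the model must still reason over the surrounding documents to answer it.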
Source dataset statistics:

| Dataset | Training QA Pairs | Unique Documents |
|---|---|---|
| HotpotQA | 90,447 | 483,696 |
| MuSiQue | 19,938 | 797,413 |
| 2WikiMQA | 167,454 | 369,378 |
Install the dependencies:

```bash
pip install transformers tenacity openai azure-identity tqdm gdown
sudo apt update -y && sudo apt install unzip
```

Run the pipeline via the provided scripts:

```bash
# Standard long-context QA synthesis
bash scripts/synth.sh

# Relevant-documents-only (no fillers)
bash scripts/synth_relevant_only.sh

# Synthesize across all datasets x all context lengths with
# context filling and KeyChain insertion
bash scripts/synth_qwen_filter_core_gaussian_add_distractor.sh

# Extract GPT-4o reasoning steps for training data
bash scripts/synth_gpt_call_reasoning.sh
```

For standalone context composition, invoke qa.py directly:

```bash
python qa.py \
    --save_dir=./ \
    --save_name=hotpotqa \
    --dataset=hotpotqa/hotpot_train_v1.1.json \
    --tokenizer_path=Qwen/Qwen2.5-7B-Instruct \
    --tokenizer_type=hf \
    --max_seq_length=32768 \
    --tokens_to_generate=128 \
    --num_samples=100 \
    --template="{context}"
```

Repository layout:

```
KeyChain/
├── qa.py # Basic context composition
├── qa_filter_data.py # Context generation with pre-filtering
├── qa_filter_data_core_gaussian_middle.py # Gaussian question placement
├── qa_qwen_filtered_core_gaussian_add_distractor.py # Full pipeline: context filling + KeyChain insertion
├── qa_add_distractor.py # Distractor injection module
├── qa_relevant_only.py # Supporting-facts-only generation
├── gpt_call.py # GPT-4o reasoning extraction
├── tokenizer.py # Multi-backend tokenizer (HF, NeMo, OpenAI)
├── filter_question/ # Stage 1: model-based QA quality filtering
│ ├── filter_infer.py # vLLM distributed inference
│ ├── merge_output.py # Prediction merging
│ └── convert_*.py # Dataset format converters
├── filter_again/ # Stage 2: secondary quality control
│ ├── judge_filters.py # Answer matching & filtering
│ └── judge_utils.py # Evaluation metrics (EM, F1, CEM)
├── scripts/ # Shell scripts for pipeline execution
│ ├── synth.sh # Basic synthesis
│ ├── synth_qwen_filter_core_gaussian_add_distractor.sh # Full pipeline
│ └── synth_gpt_call_reasoning.sh # Reasoning extraction
└── difficulty_analysis/ # Dataset analysis notebooks
```
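For the secondary quality-control stage, judge_filters.py compares model predictions against the gold answers using the metrics in judge_utils.py. Below is a minimal sketch of those metrics under common definitions, assuming EM and F1 use SQuAD-style answer normalization and that CEM ("contains/cover exact match") checks whether the normalized gold answer appears inside the prediction; the repository's exact implementation may differ:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, strip punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)

def cover_exact_match(pred: str, gold: str) -> bool:
    # CEM: the gold answer appears somewhere inside the prediction.
    return normalize(gold) in normalize(pred)

def f1(pred: str, gold: str) -> float:
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)  # per-token min counts
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```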
If you use KeyChain in your research, please cite our paper:
```bibtex
@misc{wang2025loongrlreinforcementlearningadvanced,
      title={LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts},
      author={Siyuan Wang and Gaokai Zhang and Li Lyna Zhang and Ning Shang and Fan Yang and Dongyao Chen and Mao Yang},
      year={2025},
      eprint={2510.19363},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.19363},
}
```