Research project for detecting and analyzing political bias in large language models using the C4 web dataset.
Paper Title: From Data to Model in Bias: A Statistical Analysis of Political Bias in the C4 Corpus and Its Impact on LLMs
Authors: Jaebeom You, Jaewon Lee, Sehun Lee, Hyuk-Yoon Kwon
Conference: Proceedings of the 19th ACM International Conference on Web Search and Data Mining
Year: 2026
DOI: ---
Published: February 22--26, 2026
Link: ---
- Bias_detection/ - Statistical bias detection and analysis tools
- C4_datat_collection/ - Political content extraction from C4 dataset
- Fine_tuning_model/ - QLoRA fine-tuning system for bias scenarios
- LLM_based_annotation/ - Multi-persona annotation using ChatGPT/Claude
- Political_compass_test/ - Political compass evaluation framework
- statement.json - Query statements for evaluating vanilla models' political stances on various topics. Contains structured questions with topic categories, statements, and polarity indicators for bias assessment.
- topics-questions.csv - Comprehensive list of political topics and keywords used in our analysis. Includes categorized topics (Economics & Markets, Governance & Civil Rights, Social & Cultural Values) with associated keywords and search queries for data collection.
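The description above says each statement.json entry carries a topic category, a statement, and a polarity indicator, but does not pin down the exact schema. A minimal sketch of loading and filtering such records, with field names (`topic`, `statement`, `polarity`) and the toy entries being assumptions rather than the repository's actual schema:

```python
import json

# Hypothetical example records mirroring the described structure:
# a topic category, the statement text, and a polarity indicator.
EXAMPLE_STATEMENTS = [
    {"topic": "Economics & Markets",
     "statement": "Markets should be largely self-regulating.",
     "polarity": "right"},
    {"topic": "Governance & Civil Rights",
     "statement": "Civil liberties outweigh security concerns.",
     "polarity": "left"},
]

def load_statements(path):
    """Load query statements from a statement.json-style file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def by_topic(statements, topic):
    """Select the statements belonging to one topic category."""
    return [s for s in statements if s["topic"] == topic]

econ = by_topic(EXAMPLE_STATEMENTS, "Economics & Markets")
print(len(econ))  # -> 1 for the toy list above
```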
- Data Collection: Extract political content from C4
- Annotation: Generate bias annotations with multiple LLM personas
- Fine-tuning: Train models on different bias scenarios
- Testing: Evaluate models using political compass questions
- Analysis: Detect and measure bias using statistical methods
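The five pipeline stages above can be sketched end to end as plain functions. All names, signatures, and the toy bias score here are illustrative stand-ins, not the repository's actual API:

```python
# Hypothetical stage functions sketching the five-step pipeline.

def collect(corpus):
    """Data Collection: keep documents flagged as political."""
    return [doc for doc in corpus if doc.get("political")]

def annotate(docs, personas):
    """Annotation: attach one bias label per persona (stubbed as neutral)."""
    return [{**d, "labels": {p: "neutral" for p in personas}} for d in docs]

def fine_tune(dataset, scenario):
    """Fine-tuning: stand-in returning a model descriptor."""
    return {"scenario": scenario, "train_size": len(dataset)}

def compass_test(model, questions):
    """Testing: stand-in returning one answer per compass question."""
    return ["agree"] * len(questions)

def analyze(answers):
    """Analysis: toy bias score = fraction of 'agree' answers."""
    return sum(a == "agree" for a in answers) / max(len(answers), 1)

corpus = [{"text": "tax policy debate", "political": True},
          {"text": "pasta recipe", "political": False}]
docs = annotate(collect(corpus), personas=["left", "right"])
model = fine_tune(docs, scenario="left-leaning")
score = analyze(compass_test(model, questions=["q1", "q2"]))
print(score)  # -> 1.0 for these stubs
```

In the actual repository each stage lives in its own directory and operates on real C4 data, LLM annotators, and QLoRA training; this sketch only shows how the stages chain together.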
- Python 3.8+
- CUDA-capable GPU (16GB+ VRAM recommended)
- OpenAI API key (for ChatGPT)
- Anthropic API key (for Claude)
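A quick preflight check for the prerequisites above can be sketched as follows. The environment-variable names `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` are conventional defaults for the two SDKs, assumed here rather than taken from the repository:

```python
import os
import sys

# Assumed env var names; the repo may read keys from a config file instead.
REQUIRED_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]

def missing_requirements(env, python_version):
    """Return a list of unmet prerequisites: Python 3.8+ and both API keys."""
    problems = []
    if python_version < (3, 8):
        problems.append("Python 3.8+ required")
    problems += [key for key in REQUIRED_KEYS if not env.get(key)]
    return problems

# Check the real environment:
print(missing_requirements(os.environ, sys.version_info[:2]))

# Example against a fake environment missing the Anthropic key:
print(missing_requirements({"OPENAI_API_KEY": "sk-test"}, (3, 10)))
# -> ['ANTHROPIC_API_KEY']
```

GPU/VRAM availability is not checked here, since that requires a non-stdlib dependency such as torch.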