The Community Alignment dataset is a culturally diverse, multilingual, multi-turn dataset of preferences for LLM alignment


Community Alignment

Hugging Face   |   Paper  

Demo video: community_alignment.mov

Dataset

Community Alignment is a large-scale, open-source, multilingual, multi-turn preference dataset for aligning LLMs with human preferences across cultures. It features prompt-level overlap in annotators, enabling social-choice-based and distributional approaches to LLM alignment, as well as natural language explanations for choices.

  • [Large-scale] ~200,000 comparisons of LLM responses, collected from >3,000 unique annotators who provided feedback at an individual level.
  • [Multilingual] Contains comparisons in English, French, Italian, Hindi, and Portuguese. 63% of comparisons are non-English.
  • [Prompt-level overlap] 2,599 prompts have at least 10 annotations per comparison, with annotators overlapping across prompts.
  • [High-quality natural language explanations] For 27% of prompts, annotators provided detailed explanations of why they preferred one response over the other.
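To illustrate what prompt-level annotator overlap makes possible, here is a minimal sketch of a social-choice-style aggregation: when several annotators judge the same comparison, a plurality rule can pick a per-prompt winner instead of averaging preferences away. The toy records and field names below are hypothetical, not the dataset's actual schema.

```python
from collections import Counter

# Toy annotations: several annotators judging the same prompts.
# Field names (prompt_id, annotator_id, preferred) are illustrative only.
annotations = [
    {"prompt_id": "p1", "annotator_id": "a1", "preferred": "response_A"},
    {"prompt_id": "p1", "annotator_id": "a2", "preferred": "response_B"},
    {"prompt_id": "p1", "annotator_id": "a3", "preferred": "response_A"},
    {"prompt_id": "p2", "annotator_id": "a1", "preferred": "response_B"},
    {"prompt_id": "p2", "annotator_id": "a3", "preferred": "response_B"},
]

def plurality_winner(annotations, prompt_id):
    """Return the response that the most annotators preferred for a prompt."""
    votes = Counter(
        a["preferred"] for a in annotations if a["prompt_id"] == prompt_id
    )
    return votes.most_common(1)[0][0]

print(plurality_winner(annotations, "p1"))  # response_A
print(plurality_winner(annotations, "p2"))  # response_B
```

Plurality is only the simplest such rule; the same per-prompt vote counts would also feed richer social-choice or distributional aggregation methods.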

License

Community Alignment is released under the Creative Commons Attribution 4.0 International License (CC-BY-4.0). For details, see the LICENSE in this repository.

Codebook

Please see Appendix H of the paper for the codebook.

Usage

In ~27% of the conversations in our dataset, annotators initiate the dialogue with their own prompts. These prompts do not reflect the position of Meta or its employees. Users must implement appropriate filtering and moderation measures when utilizing this dataset for training purposes to ensure that the generated outputs adhere to their own content standards. The user-initiated conversations can be easily filtered out of the dataset using the is_pregenerated_first_prompt flag.
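As a minimal sketch of that filtering step, the snippet below drops user-initiated conversations using the `is_pregenerated_first_prompt` flag. The toy records stand in for real rows; only the flag name comes from this README.

```python
# Toy conversation records; only the is_pregenerated_first_prompt
# flag name is taken from the dataset documentation.
conversations = [
    {"conversation_id": "c1", "is_pregenerated_first_prompt": True},
    {"conversation_id": "c2", "is_pregenerated_first_prompt": False},  # user-initiated
    {"conversation_id": "c3", "is_pregenerated_first_prompt": True},
]

# Keep only conversations that begin with a pre-generated prompt.
pregenerated_only = [
    c for c in conversations if c["is_pregenerated_first_prompt"]
]
print([c["conversation_id"] for c in pregenerated_only])  # ['c1', 'c3']
```

If you load the dataset with the Hugging Face `datasets` library, the equivalent operation would presumably be `ds.filter(lambda x: x["is_pregenerated_first_prompt"])`.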

Attribution

When using this dataset in any publications or research output, please cite the accompanying paper. For BibTeX, use

@article{zhang2025cultivating,
  title   = {Cultivating Pluralism In Algorithmic Monoculture: The Community Alignment Dataset},
  author  = {Lily Hong Zhang and Smitha Milli and Karen Jusko and Jonathan Smith and Brandon Amos and Wassim Bouaziz and Manon Revel and Jack Kussman and Lisa Titus and Bhaktipriya Radharapu and Jane Yu and Vidya Sarma and Kris Rose and Maximilian Nickel},
  year    = {2025},
  journal = {arXiv preprint arXiv:2507.09650}
}

For in-text citations, use

Zhang, L. H., Milli, S., Jusko, K., Smith, J., Amos, B., Bouaziz, W., Revel, M., Kussman, J., Titus, L., Radharapu, B., Yu, J., Sarma, V., Rose, K., & Nickel, M. (2025). Cultivating Pluralism In Algorithmic Monoculture: The Community Alignment Dataset. arXiv preprint arXiv:2507.09650.

Feedback

If you use Community Alignment, we would love to know (a) what you found valuable in it and (b) what features you wish it had, as well as any other feedback you may have. This will help support and guide future projects of this kind. Additionally, if you encounter any issues, such as the presence of personally identifiable information (PII) or requests from participants for data removal, please let us know. You can contact us at [email protected].
