This repository contains the code and materials for the paper "Can Large Language Models Master Complex Card Games?".
We investigate whether large language models can master complex card games by systematically evaluating their performance on eight carefully selected games. Specifically, we focus on the following three research questions:
1. Can LLMs master complex card games, and how much data do they need to do so?
2. Can LLMs master multiple games simultaneously? Do different games mutually enhance one another, or do they conflict?
3. Can LLMs maintain their general capabilities while mastering complex games?
To answer these questions, we first fine-tune language models on each of the eight games separately to evaluate how well the models can master individual games. Next, we fine-tune the models on a mixture of the data from all eight games to assess whether they can master all of the games simultaneously. Finally, we check whether the models' general capabilities decline using the MMLU-Pro, Math-500, and HumanEval benchmarks, which cover knowledge question answering, math, and coding, respectively.
To collect game data and evaluate game performance, we rely on three projects: DouZero, DanZero, and RLCard. For training, we use the LLaMA-Factory framework, and for evaluating general benchmarks, we use the OpenCompass framework. Therefore, it is necessary to install the dependencies for these five projects.
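As a rough guide, the dependencies can be installed along the following lines. This is a minimal sketch: the DanZero URL below is a placeholder, and the package extras and versions are assumptions that may differ from the authors' setup.

```bash
# Install the game environments (RLCard is on PyPI; DouZero and DanZero
# are cloned from GitHub -- the DanZero URL below is a placeholder).
pip install rlcard
git clone https://github.com/kwai/DouZero.git && pip install -e DouZero
git clone https://github.com/your-org/DanZero.git && pip install -e DanZero  # hypothetical URL

# Install the training framework (LLaMA-Factory) from source.
git clone https://github.com/hiyouga/LLaMA-Factory.git
pip install -e "LLaMA-Factory[torch,metrics]"

# Install the general-benchmark evaluation framework (OpenCompass).
pip install opencompass
```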
First, use the teacher model to generate interaction data for each game:
```bash
bash gen_data.sh
```
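For intuition, one interaction record might pair an encoded game state with the teacher's chosen action, as in the purely hypothetical sample below; the real schema is defined by `gen_data.sh` and the teacher agents from DouZero, DanZero, and RLCard.

```bash
# A purely hypothetical interaction record (one JSONL line); the real schema
# is defined by gen_data.sh and the teacher agents.
cat <<'EOF' > sample_interaction.jsonl
{"game": "doudizhu", "state": "<encoded hand and play history>", "legal_actions": ["pass", "33"], "teacher_action": "33"}
EOF
```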
Then, filter and convert the interaction data into SFT format:

```bash
bash convert_data.sh
```
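LLaMA-Factory consumes alpaca-style records with instruction/input/output fields; the sample below is only an illustration, and the actual prompt wording produced by `convert_data.sh` may differ.

```bash
# Illustrative alpaca-style SFT record (the format LLaMA-Factory consumes);
# the actual prompt template produced by convert_data.sh may differ.
cat <<'EOF' > sample_sft.json
[
  {
    "instruction": "You are playing DouDizhu. Choose the best legal action for the current state.",
    "input": "<encoded hand, play history, and legal actions>",
    "output": "33"
  }
]
EOF
```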
Run the following script to train a mixture model on the data from all eight games:

```bash
bash train.sh
```
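`train.sh` presumably wraps a LLaMA-Factory run. A minimal sketch of such a run is shown below; the base model, dataset name, and hyperparameters are assumptions, not the repository's actual settings.

```bash
# Minimal LLaMA-Factory SFT sketch; model, dataset entry, and hyperparameters
# are assumptions, not the repository's actual settings.
cat > game_mix_sft.yaml <<'EOF'
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
stage: sft
do_train: true
finetuning_type: full
dataset: game_mix            # hypothetical entry registered in data/dataset_info.json
template: llama3
cutoff_len: 4096
output_dir: saves/game_mix_sft
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
learning_rate: 1.0e-5
num_train_epochs: 2.0
bf16: true
EOF
llamafactory-cli train game_mix_sft.yaml
```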
Run the following script to continue training the mixture model on general data, which mitigates the degradation of the model's general capabilities:

```bash
bash train_ct.sh
```
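This continual-training stage presumably mixes the game data with general instruction data; in LLaMA-Factory this can be expressed as a comma-separated dataset list. A minimal sketch with hypothetical paths and dataset names:

```bash
# Sketch: resume from the game-mixture checkpoint and mix in general data.
# All names below (paths, dataset entries) are assumptions.
cat > game_mix_ct.yaml <<'EOF'
model_name_or_path: saves/game_mix_sft       # start from the mixture model
stage: sft
do_train: true
finetuning_type: full
dataset: game_mix,general_instruct           # comma-separated dataset mixture
template: llama3
output_dir: saves/game_mix_ct
EOF
llamafactory-cli train game_mix_ct.yaml
```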
Use the following script to evaluate how model performance changes as the training-data volume increases:

```bash
bash de_ckpt.sh
```
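Conceptually, this amounts to evaluating each intermediate checkpoint. A hypothetical sketch, assuming the evaluation entry point accepts a checkpoint path:

```bash
# Hypothetical sketch of what de_ckpt.sh automates: evaluate each intermediate
# checkpoint so game performance can be tracked against training-data volume.
# (Assumes the evaluation entry point accepts a checkpoint path.)
for ckpt in saves/game_mix_sft/checkpoint-*; do
    bash de_final.sh "$ckpt"
done
```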
Use the following script to evaluate the mixture model on the eight games:

```bash
bash de_final.sh
```
Use the following script to evaluate API-based models on the eight games:

```bash
bash eval_llm_one_on_all.sh
```
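API-based models need credentials; the exact environment variable the script reads is an assumption.

```bash
# Provide credentials for the API-based models before running the script;
# the variable name below is an assumption.
export OPENAI_API_KEY="<your-key>"
bash eval_llm_one_on_all.sh
```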
Use the following script to evaluate models on the three general benchmarks:

```bash
bash eval_general.sh
```
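A roughly equivalent OpenCompass invocation is sketched below; the model alias and dataset config names are assumptions and may differ from the script's actual configuration.

```bash
# Sketch of a roughly equivalent OpenCompass run; the model alias and dataset
# config names are assumptions.
opencompass --models hf_llama3_8b_instruct \
            --datasets mmlu_pro_gen humaneval_gen math_gen
```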
Our project builds on many open-source projects. We thank their contributors, including but not limited to: DouZero, DanZero, RLCard, LLaMA-Factory, and OpenCompass.

If you find our work helpful, please kindly cite our paper:
```bibtex
@article{wang2025can,
  title={Can Large Language Models Master Complex Card Games?},
  author={Wang, Wei and Bie, Fuqing and Chen, Junzhe and Zhang, Dan and Huang, Shiyu and Kharlamov, Evgeny and Tang, Jie},
  journal={arXiv preprint arXiv:2509.01328},
  year={2025}
}
```