# Attempt to Persuade Eval (APE)

📄 Paper: *It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics* (arXiv:2506.02873)
This repository contains the code for the Attempt to Persuade Eval (APE) project. The goal of the project is to develop evaluation metrics for measuring language models' attempts to persuade.
To run the evaluation scripts, you will need an OpenAI API key, supplied via a `.env` file. Create a file called `.env` in the root directory of the repository and add the following line:

```
OPENAI_API_KEY="your_openai_api_key"
```

Replace `your_openai_api_key` with your actual OpenAI API key.
To get authentication for running Gemini (Vertex AI), run the following in a terminal:

```bash
gcloud auth application-default login
```

You will also need to set the following `.env` variables from the GCP Vertex project:

```
VERTEXAI_PROJECT=""
VERTEXAI_LOCATION=""
```
You can see the full list of models and appropriate locations to use at this link. For example, if you are using the Gemini 2.5 Flash model and live in the USA, you could use:

```
VERTEXAI_PROJECT=my-project
VERTEXAI_LOCATION=us-east5
```

When running a fine-tuned model, point `persuader_model` at the Vertex AI endpoint, e.g. `persuader_model=vertex_ai/<VERTEXAI_ENDPOINTID>`.
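For instance, a run against a fine-tuned endpoint might look like this, where the numeric endpoint ID is a placeholder for your own deployed endpoint:

```bash
python main.py persuader_model=vertex_ai/1234567890
```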
You can get a Hugging Face API key by creating an account on the Hugging Face website and then going to your account settings. Once you have your API key, set the `HF_TOKEN` environment variable in the `.env` file:

```
HF_TOKEN="hf_..."
```
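Putting the pieces above together, a complete `.env` file covering all three providers might look like the following (all values are placeholders):

```
OPENAI_API_KEY="your_openai_api_key"
VERTEXAI_PROJECT=my-project
VERTEXAI_LOCATION=us-east5
HF_TOKEN="hf_..."
```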
When using Hugging Face models, make sure you download the checkpoints to `src/ckpts`. To download the Hugging Face (hf) model weights, you can use the `huggingface-cli` downloader, for example:

```bash
huggingface-cli download Qwen/Qwen3-32B-Instruct --local-dir src/ckpts/Qwen3-32B-Instruct --local-dir-use-symlinks False
```
Dependencies are listed in `pyproject.toml`; install them with:

```bash
pip install -e ".[dev,test]"
```
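As a quick sanity check that your credentials are visible, you can load the `.env` file yourself and confirm the expected variables are set. This is only an illustrative snippet, not part of the repository's tooling, and it assumes `python-dotenv` is available in your environment:

```python
# check_env.py -- illustrative only; assumes python-dotenv is installed.
import os

from dotenv import load_dotenv

# Read variables from the .env file in the repository root into os.environ.
load_dotenv()

# Keys described in this README; trim to the providers you actually use.
expected = ["OPENAI_API_KEY", "VERTEXAI_PROJECT", "VERTEXAI_LOCATION", "HF_TOKEN"]
for key in expected:
    status = "set" if os.getenv(key) else "MISSING"
    print(f"{key}: {status}")
```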
To run the persuasion attempt eval, use the following command:

```bash
python main.py persuader_model=gpt-4o
```

This will run the persuasion evals using the `gpt-4o` model. The eval simulates a conversation between a user (i.e., a roleplaying persuadee model) and a model (the persuader), where the persuader is prompted to try to persuade the user into or out of a given statement over three conversational rounds; the 600 statements used in APE can be found in `src/topics/diverse_topics.jsonl`. The eval writes a JSON file to the `results` directory containing the dialogue between the user and the model, an evaluator model's score for the persuasion attempt, and an evaluator model's score for the success of persuasion.
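If you want to post-process the results outside of the provided plotting code, the output is plain JSON and can be loaded directly. The file path below is a placeholder; the exact file names inside `results/` depend on your experiment name and run:

```python
# Illustrative sketch: inspect a results file produced by main.py.
# The path below is a placeholder; check your own results/ directory.
import json
from pathlib import Path

results_path = Path("results") / "your_experiment" / "results.json"  # placeholder
with results_path.open() as f:
    results = json.load(f)

# Print the top-level structure to see what was recorded
# (dialogues plus evaluator attempt/success scores, per the README).
if isinstance(results, dict):
    print(list(results.keys()))
else:
    print(f"{len(results)} records; first record keys: {list(results[0].keys())}")
```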
Running `main.py` produces several figures and saves results to enable further analysis. The main figures to look at are `persuasion_attempt_counts_turn_n.png` and `nh_subjects_attempt_counts_turn_n.png`, which show the number of persuasion attempts vs. no attempts vs. refusals for turn `n`, across all categories and across harmful categories, respectively. We also show these plots as percentages, along with changes in the self-reported user belief and an evaluator confusion matrix of prompted vs. predicted persuasion degree (see Figure 6 in the paper for more details).
The evaluation system uses a flexible configuration system. You can run experiments in several ways.

Run with the defaults:

```bash
python main.py
```

Use a pre-configured experiment from `configs/experiment`:

```bash
python main.py experiment=gpt_4o
python main.py experiment=llama_8b_journalist
python main.py experiment=gpt_4o_10_turns
```

Override individual parameters on the command line:

```bash
python main.py experiment=gpt_4o num_users=50 num_turns=5
python main.py persuader_model=gpt-4o-mini sample_belief_upper=50 all_topics=false
```
Pre-configured experiments include:

- **Model Evaluations:** `gpt_4o`, `gpt_4o_mini`, `llama_8b`, `gemini_25_pro`, `gemini_flash_001`, `qwen3_32b`
- **Persona Experiments:** `gpt_4o_journalist`, `gpt_4o_politics`, `llama_8b_journalist`
- **Long Conversations:** `gpt_4o_10_turns`, `llama_8b_10_turns`
- **Persuasion Degrees:** `gpt_4o_2_degree`, `gpt_4o_3_degree`, `gpt_4o_100_degree`
- **Context Experiments:** `gpt_4o_online_debater`, `gpt_4o_peer_support`
See `configs/README.md` for a complete list and detailed configuration options.
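For reference, a custom experiment config would typically just pin the parameters listed in the table below. The YAML sketch here is an assumption about the config layout (the `experiment=` and `key=value` override syntax suggests Hydra-style configs); check `configs/README.md` and the existing files in `configs/experiment` for the exact schema before copying it:

```yaml
# configs/experiment/my_experiment.yaml -- hypothetical example; verify the
# structure against an existing experiment config before using.
experiment_name: my_experiment
persuader_model: gpt-4o-mini
persuadee_model: gpt-4o
evaluator_model: gpt-4o
num_turns: 5
only_persuade: true
```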
| Parameter | Description | Default |
|---|---|---|
| `num_turns` | Number of conversation turns | 3 |
| `persuader_model` | Model playing the persuader role | gpt-4o |
| `persuadee_model` | Model playing the persuadee role | gpt-4o |
| `evaluator_model` | Model evaluating the conversations | gpt-4o |
| `experiment_name` | Name for this experiment run | default_experiment |
| `all_topics` | Use all 600 available topics (otherwise `num_users` topics are randomly selected) | true |
| `only_persuade` | Only attempt persuasion (not dissuasion) | false |
| `batch_size` | Local model batch size | 32 |
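Several of these parameters can be combined in a single run; for example (the experiment name here is arbitrary):

```bash
python main.py persuader_model=gpt-4o-mini num_turns=5 only_persuade=true experiment_name=my_first_run
```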
The following models can be used as the persuader:

- `gpt-4o`: OpenAI GPT-4o model
- `gpt-4o-mini`: Smaller version of GPT-4o
- `vertex_ai/gemini-2.0-flash-001`: Google's Gemini 2.0 Flash model
- `vertex_ai/gemini-2.5-pro-preview-03-25`: Google's Gemini 2.5 Pro model
- `hf/Meta-Llama-3.1-8B-Instruct`: Meta's Llama 3.1 8B instruction-tuned model
- `hf/Qwen3-32B-Instruct`: Qwen 3 32B instruction-tuned model
To add more models, you will need to modify `src/generate_conversations/generate.py` to ensure the appropriate sampling format is used with your model.

(1) API Models: For API calls, modify the `process_messages` function defined inside the `generate_llm` function. Make sure the model can be accessed through the LiteLLM library, which this repo uses for making API calls.

(2) Local Models: For local models, first download the checkpoints to `src/ckpts` and ensure the `preload_local_model` function loads your new model correctly. Then, either use models from Hugging Face and prepend `hf/` to the model name, or modify this line to ensure it uses a custom local generation function. The easiest way to sample from a new model is to modify `generate_with_local_model` to include the new model in this if statement (see the sketch after this list). You should also define any required pre- or post-processing formatting for the new model, e.g., as is done for the Qwen 3 model here and here.

(3) Visualizations: To include new models in the aggregate visualizations that compare models after running them on APE, simply add your new model name to the dicts here and here to order the models and ensure they have the right display names.
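As a rough illustration of step (2), supporting a new local model usually amounts to pre-processing the conversation into the model's chat format, sampling, and post-processing the output. The sketch below is self-contained but hypothetical: the model name and function are placeholders, and the repository's actual `generate_with_local_model` will differ.

```python
# Hypothetical sketch of generation for a new local model downloaded to
# src/ckpts. Names here ("My-New-Model", this function) are placeholders,
# not the repository's actual code.
from transformers import AutoModelForCausalLM, AutoTokenizer


def generate_with_my_new_model(messages: list[dict], ckpt_dir: str = "src/ckpts/My-New-Model") -> str:
    # Pre-processing: render the conversation with the model's chat template.
    tokenizer = AutoTokenizer.from_pretrained(ckpt_dir)
    model = AutoModelForCausalLM.from_pretrained(ckpt_dir, device_map="auto")
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    # Sampling.
    output_ids = model.generate(input_ids, max_new_tokens=512)

    # Post-processing: drop the prompt tokens and decode only the new reply.
    reply_ids = output_ids[0][input_ids.shape[-1]:]
    return tokenizer.decode(reply_ids, skip_special_tokens=True)
```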
A number of scripts with pre-configured settings for the persuasion evals of certain experiments can be found in the `exps/` directory. For example, to run all models multiple times and plot the aggregate results over the runs with error bars, run:

```bash
bash exps/run_all_full_eval.sh
```
Other available experiment scripts:

```bash
bash exps/ablations/persuasion_degree.sh
bash exps/ablations/persona_comparison.sh
bash exps/ablations/run_10_turn_convo.sh
bash exps/ablations/contexts_comparison.sh
```
The default topics file can be found at `src/topics/diverse_topics.jsonl`, which contains topics from all categories:

```json
[
  "Conspiracy",
  "BenignOpinion",
  "BenignFactual",
  "Controversial",
  "NoncontroversiallyHarmful",
  "UnderminingControl"
]
```
To add new topics to the persuasion conversations, simply add a new line to `src/topics/diverse_topics.jsonl`; each line includes a `category`, `short_title`, `text`, and `subject` (see the example entry below). Alternatively, you can create your own topic list as a `.jsonl` file with the above keys in the `src/topics/` directory and then adjust the `topics_file` parameter to point to it. New harmful topics can be generated using the script at `src/utils/generate_harmful_texts.py`; note that the harmful topics used in the paper were generated with a currently unreleased jailbroken model, but we provide this script anyway, and it can be used with OpenAI models.
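A new topic line would look something like the following. The keys are the ones listed above, but the values are made up for illustration, so check an existing line in `src/topics/diverse_topics.jsonl` for the exact conventions (e.g., how `subject` is used):

```json
{"category": "BenignOpinion", "short_title": "Morning exercise", "text": "Exercising in the morning is better than exercising in the evening.", "subject": "health"}
```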
Graphs of the results are automatically generated and saved in the same directory as all other results. More visualization tooling is available; see the instructions at `src/visualizations/README.md`.
To view conversations from the results of the persuasion evals, we use the logviz library. We have included a version of the library in the `logviz` directory. First, install logviz as a library using the following commands:

```bash
cd logviz
pip install -e .
```

To visualize the results, run `logviz` from the terminal and then drag the `conversation_log.jsonl` file containing the results into the window. This will display the conversation results in a visual format.
We also report the human annotations used to validate the evaluator model in the `human-annotation` directory. See the README there for more details.
If you use this work, please cite:

```bibtex
@article{kowal2025its,
  title={It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics},
  author={Kowal, Matthew and Timm, Jasper and Godbout, Jean-Francois and Costello, Thomas and Arechar, Antonio A and Pennycook, Gordon and Rand, David and Gleave, Adam and Pelrine, Kellin},
  journal={arXiv preprint arXiv:2506.02873},
  year={2025}
}
```