
[DRAFT] add harm categories to AdvBench Dataset #732

Draft · wants to merge 1 commit into main

Conversation

paulinek13 (Contributor)

Description

This PR aims to resolve #730 by adding a way to manually assign harm categories to the AdvBench dataset and enabling filtering support based on those categories.

Marked as a draft PR since I'm seeking confirmation on the approach.
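Roughly, the filtering I have in mind would behave like the sketch below (illustration only; the helper name, parameter name, and dict shape are assumptions, not the actual diff):

```python
# Sketch only: filter fetched AdvBench examples by harm category.
# The helper name, parameter name, and example dict shape are illustrative.
from typing import Optional


def filter_by_harm_categories(
    examples: list[dict], requested: Optional[list[str]] = None
) -> list[dict]:
    """Keep examples that carry at least one of the requested harm categories."""
    if not requested:
        return examples
    wanted = set(requested)
    return [ex for ex in examples if wanted & set(ex.get("harm_categories", []))]


# Placeholder data, just to show the intended behaviour:
sample = [
    {"goal": "<prompt A>", "harm_categories": ["Cybercrime"]},
    {"goal": "<prompt B>", "harm_categories": ["Physical Harm", "Violence"]},
]
print(filter_by_harm_categories(sample, ["Violence"]))  # keeps only <prompt B>
```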

Tests and Documentation

@@ -531,6 +531,9 @@ def fetch_adv_bench_dataset(
# Extract and append the data to respective containers
prompts = [item["goal"] for item in examples if "goal" in item]

# harm_categories = _fetch_from_file("pyrit/datasets/harm_categories/adv_bench_dataset.json", "json")
Contributor

I assume what you mean here is that we'd have to create this file in the repo (as you did in this draft PR) and have one entry per sample in advbench. That's definitely one way of solving it!

I suppose we can assume that the GitHub repo that we're pulling the data from stays as is since it's just a paper repo.

Still, I prefer keeping all the data together if that makes any sense. One option would be to clone the paper repo (or start a new one with just the dataset) and add the extra data column there. Another would be to clone https://huggingface.co/datasets/walledai/AdvBench and add another dataset with that extra column.

Any thoughts? In any case, we'll need the harm categories for all the prompts. It's probably a good idea to generate those with gen AI and then cross-check. The immediate follow-up question to that is obviously "but what are the categories to choose from?" and I don't have a good answer...

So far, we've been grouping into things like "sexual", "violence", "hate_unfairness", "self-harm" but it's not like we've been consistent about it either 🙁 Thoughts?

Contributor Author
@paulinek13 · Feb 25, 2025

> I assume what you mean here is that we'd have to create this file in the repo (as you did in this draft PR) and have one entry per sample in advbench. That's definitely one way of solving it!

That's exactly what I intended: to have a separate file mapping harm categories to each dataset entry, and to manually review and assign the categories :)

The reason I initially created this new file (instead of modifying the dataset itself) is that I assumed we wouldn't want to keep the harmful prompts directly in the PyRIT repo. But I agree that having all the data together makes a lot of sense.
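
For concreteness, this is roughly how I picture the companion file and its loading (a sketch only; keying the entries by the AdvBench "goal" string rather than by index is just one option):

```python
import json
from pathlib import Path

# Hypothetical shape of pyrit/datasets/harm_categories/adv_bench_dataset.json,
# keyed by the AdvBench "goal" text (keying by index would also work):
# { "<goal text>": ["Cybercrime", "Physical Harm"], ... }
mapping_path = Path("pyrit/datasets/harm_categories/adv_bench_dataset.json")
harm_category_map: dict[str, list[str]] = json.loads(mapping_path.read_text(encoding="utf-8"))


def categories_for(goal: str) -> list[str]:
    # Unmapped goals simply get an empty category list.
    return harm_category_map.get(goal, [])
```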

As for the harm categories, I've been manually applying them based on the ones already present in the fetch function (though I'm not sure where they were originally sourced from):

harm_categories = [
    "Endangering National Security",
    "Insulting Behavior",
    "Discriminatory Behavior",
    "Endangering Public Health",
    "Copyright Issues",
    "Violence",
    "Drugs",
    "Privacy Violation",
    "Economic Crime",
    "Mental Manipulation",
    "Human Trafficking",
    "Physical Harm",
    "Sexual Content",
    "Cybercrime",
    "Disrupting Public Order",
    "Environmental Damage",
    "Psychological Harm",
    "White-Collar Crime",
    "Animal Abuse",
]

Edit:

This is from the paper regarding the themes of the prompts:

... reflect harmful or toxic behavior, encompassing a wide spectrum of detrimental content such as profanity, graphic depictions, threatening behavior, misinformation, discrimination, cybercrime, and dangerous or illegal suggestions ...

Maybe these could be the categories we use to classify the prompts?

Contributor

It's definitely a start. I had a variation of this discussion just this morning, but no real solution yet. I think we need a taxonomy of harms and then use the categories from there. The question is which one. Let me see what I can find and feel free to suggest some as well.

Contributor Author
@paulinek13 · Feb 28, 2025

I found an article that lists nine top-level types of harms:
Autonomy, Physical, Psychological, Reputational, Financial and Business, Human Rights and Civil Liberties, Societal and Cultural, Political and Economic, and Environmental.
The taxonomy aims to be accessible, comprehensive, and "user-friendly": https://arxiv.org/abs/2407.01294v2

However, there's one thing I’m unsure about: do we need something specifically for this dataset, or a taxonomy that could be used generally across PyRIT?

If it's just for this dataset, maybe sticking with what's in the original paper will be just enough? 🤷‍♂️🤔
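
(If a repo-wide taxonomy like that one is ever adopted, it could live as a small shared constant, e.g. an enum over the nine top-level types from the article; a sketch only, not existing PyRIT code:)

```python
from enum import Enum


# Sketch of a shared taxonomy constant based on the nine top-level harm types
# listed in https://arxiv.org/abs/2407.01294v2; not an existing PyRIT class.
class HarmType(str, Enum):
    AUTONOMY = "Autonomy"
    PHYSICAL = "Physical"
    PSYCHOLOGICAL = "Psychological"
    REPUTATIONAL = "Reputational"
    FINANCIAL_AND_BUSINESS = "Financial and Business"
    HUMAN_RIGHTS_AND_CIVIL_LIBERTIES = "Human Rights and Civil Liberties"
    SOCIETAL_AND_CULTURAL = "Societal and Cultural"
    POLITICAL_AND_ECONOMIC = "Political and Economic"
    ENVIRONMENTAL = "Environmental"
```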

Contributor

Oh, sorry, I misread what you wrote earlier! Yes, let's stick with what the paper says. I'm trying to get us to use a taxonomy across the board, but reaching the point where we agree on one that really covers everything we need may take some time. Don't want to delay until then. In fact, since this may happen at some point, your solution with a file in the repo may not be a bad idea at all since it's the simplest.

Linked issue: BUG harm categories for AdvBench Dataset aren't added yet