
[DRAFT] add harm categories to AdvBench Dataset #732

Draft · wants to merge 1 commit into main

Conversation

paulinek13 (Contributor)

Description

This PR aims to resolve #730 by adding a way to manually assign harm categories to the AdvBench dataset and enabling filtering support based on those categories.

Marked as a draft PR since I'm seeking confirmation on the approach.
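Roughly, the filtering I have in mind would behave like the sketch below (illustration only; the helper name, parameter name, and dict shape are assumptions, not the actual diff):

```python
# Sketch only: filter fetched AdvBench examples by harm category.
# The helper name, parameter name, and example dict shape are illustrative.
from typing import Optional


def filter_by_harm_categories(
    examples: list[dict], requested: Optional[list[str]] = None
) -> list[dict]:
    """Keep examples that carry at least one of the requested harm categories."""
    if not requested:
        return examples
    wanted = set(requested)
    return [ex for ex in examples if wanted & set(ex.get("harm_categories", []))]


# Placeholder data, just to show the intended behaviour:
sample = [
    {"goal": "<prompt A>", "harm_categories": ["Cybercrime"]},
    {"goal": "<prompt B>", "harm_categories": ["Physical Harm", "Violence"]},
]
print(filter_by_harm_categories(sample, ["Violence"]))  # keeps only <prompt B>
```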

Tests and Documentation

@@ -531,6 +531,9 @@ def fetch_adv_bench_dataset(
# Extract and append the data to respective containers
prompts = [item["goal"] for item in examples if "goal" in item]

# harm_categories = _fetch_from_file("pyrit/datasets/harm_categories/adv_bench_dataset.json", "json")
Contributor

I assume what you mean here is that we'd have to create this file in the repo (as you did in this draft PR) and have one entry per sample in advbench. That's definitely one way of solving it!

I suppose we can assume that the GitHub repo that we're pulling the data from stays as is since it's just a paper repo.

Still, I prefer keeping all the data together if that makes any sense. One option would be to clone the paper repo (or start a new one with just the dataset) and add the extra data column there. Another would be to clone https://huggingface.co/datasets/walledai/AdvBench and add another dataset with that extra column.

Any thoughts? In any case, we'll need the harm categories for all the prompts. It's probably a good idea to generate those with gen AI and then cross-check. The immediate follow-up question to that is obviously "but what are the categories to choose from?" and I don't have a good answer...

So far, we've been grouping into things like "sexual", "violence", "hate_unfairness", "self-harm" but it's not like we've been consistent about it either 🙁 Thoughts?

Contributor Author
@paulinek13 · Feb 25, 2025

> I assume what you mean here is that we'd have to create this file in the repo (as you did in this draft PR) and have one entry per sample in advbench. That's definitely one way of solving it!

That's exactly what I intended: to have a separate file mapping harm categories to each dataset entry, and to manually review and assign the categories :)

The reason I initially created this new file (instead of modifying the dataset itself) is that I assumed we wouldn't want to keep the harmful prompts directly in the PyRIT repo. But I agree that having all the data together makes a lot of sense.
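
For concreteness, this is roughly how I picture the companion file and its loading (a sketch only; keying the entries by the AdvBench "goal" string rather than by index is just one option):

```python
import json
from pathlib import Path

# Hypothetical shape of pyrit/datasets/harm_categories/adv_bench_dataset.json,
# keyed by the AdvBench "goal" text (keying by index would also work):
# { "<goal text>": ["Cybercrime", "Physical Harm"], ... }
mapping_path = Path("pyrit/datasets/harm_categories/adv_bench_dataset.json")
harm_category_map: dict[str, list[str]] = json.loads(mapping_path.read_text(encoding="utf-8"))


def categories_for(goal: str) -> list[str]:
    # Unmapped goals simply get an empty category list.
    return harm_category_map.get(goal, [])
```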

As for the harm categories, I've been manually applying them based on the ones already present in the fetch function (though I'm not sure where they were originally sourced from):

harm_categories = [
    "Endangering National Security",
    "Insulting Behavior",
    "Discriminatory Behavior",
    "Endangering Public Health",
    "Copyright Issues",
    "Violence",
    "Drugs",
    "Privacy Violation",
    "Economic Crime",
    "Mental Manipulation",
    "Human Trafficking",
    "Physical Harm",
    "Sexual Content",
    "Cybercrime",
    "Disrupting Public Order",
    "Environmental Damage",
    "Psychological Harm",
    "White-Collar Crime",
    "Animal Abuse",
]

Edit:

This is from the paper regarding the themes of the prompts:

... reflect harmful or toxic behavior, encompassing a wide spectrum of detrimental content such as profanity, graphic depictions, threatening behavior, misinformation, discrimination, cybercrime, and dangerous or illegal suggestions ...

Maybe these could be the categories we use to classify the prompts?

Contributor

It's definitely a start. I had a variation of this discussion just this morning, but no real solution yet. I think we need a taxonomy of harms and then use the categories from there. The question is which one. Let me see what I can find and feel free to suggest some as well.

Contributor Author
@paulinek13 · Feb 28, 2025

I found an article that lists nine top-level types of harms:
Autonomy, Physical, Psychological, Reputational, Financial and Business, Human Rights and Civil Liberties, Societal and Cultural, Political and Economic, and Environmental.
The taxonomy aims to be accessible, comprehensive, and "user-friendly": https://arxiv.org/abs/2407.01294v2

However, there's one thing I’m unsure about: do we need something specifically for this dataset, or a taxonomy that could be used generally across PyRIT?

If it's just for this dataset, maybe sticking with what's in the original paper will be just enough? 🤷‍♂️🤔
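
(If a repo-wide taxonomy like that one is ever adopted, it could live as a small shared constant, e.g. an enum over the nine top-level types from the article; a sketch only, not existing PyRIT code:)

```python
from enum import Enum


# Sketch of a shared taxonomy constant based on the nine top-level harm types
# listed in https://arxiv.org/abs/2407.01294v2; not an existing PyRIT class.
class HarmType(str, Enum):
    AUTONOMY = "Autonomy"
    PHYSICAL = "Physical"
    PSYCHOLOGICAL = "Psychological"
    REPUTATIONAL = "Reputational"
    FINANCIAL_AND_BUSINESS = "Financial and Business"
    HUMAN_RIGHTS_AND_CIVIL_LIBERTIES = "Human Rights and Civil Liberties"
    SOCIETAL_AND_CULTURAL = "Societal and Cultural"
    POLITICAL_AND_ECONOMIC = "Political and Economic"
    ENVIRONMENTAL = "Environmental"
```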

Contributor

Oh, sorry, I misread what you wrote earlier! Yes, let's stick with what the paper says. I'm trying to get us to use a taxonomy across the board, but reaching the point where we agree on one that really covers everything we need may take some time. Don't want to delay until then. In fact, since this may happen at some point, your solution with a file in the repo may not be a bad idea at all since it's the simplest.

Linked issue: BUG harm categories for AdvBench Dataset aren't added yet