[DRAFT] add harm categories to AdvBench Dataset #732
base: main
Conversation
@@ -531,6 +531,9 @@ def fetch_adv_bench_dataset(
    # Extract and append the data to respective containers
    prompts = [item["goal"] for item in examples if "goal" in item]

    # harm_categories = _fetch_from_file("pyrit/datasets/harm_categories/adv_bench_dataset.json", "json")
I assume what you mean here is that we'd have to create this file in the repo (as you did in this draft PR) and have one entry per sample in advbench. That's definitely one way of solving it!
I suppose we can assume that the GitHub repo that we're pulling the data from stays as is since it's just a paper repo.
Still, I prefer keeping all the data together if that makes any sense. One option would be to clone the paper repo (or start a new one with just the dataset) and add the extra data column there. Another would be to clone https://huggingface.co/datasets/walledai/AdvBench and add another dataset with that extra column.
Any thoughts? In any case, we'll need the harm categories for all the prompts. It's probably a good idea to generate those with gen AI and then cross-check. The immediate follow-up question to that is obviously "but what are the categories to choose from?" and I don't have a good answer...
So far, we've been grouping into things like "sexual", "violence", "hate_unfairness", "self-harm" but it's not like we've been consistent about it either 🙁 Thoughts?
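For reference, if the Hugging Face route were chosen, pulling the existing dataset and attaching an extra column might look roughly like the sketch below. This is only an illustration of the option being discussed; the "harm_category" column and the placeholder labels are assumptions, not anything that exists today.

# Minimal sketch, assuming a hypothetical copy of walledai/AdvBench extended with a
# "harm_category" column; the column and the label values below are made up.
from datasets import load_dataset

# Load the original AdvBench prompts from Hugging Face.
adv_bench = load_dataset("walledai/AdvBench", split="train")

# Hypothetical: a parallel list of manually assigned categories, one per prompt.
manual_categories = ["Cybercrime"] * len(adv_bench)  # placeholder values only

# Attach the categories as an extra column so prompts and labels stay together.
adv_bench = adv_bench.add_column("harm_category", manual_categories)

# Filtering by category then becomes a one-liner.
violence_only = adv_bench.filter(lambda row: row["harm_category"] == "Violence")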
> I assume what you mean here is that we'd have to create this file in the repo (as you did in this draft PR) and have one entry per sample in advbench. That's definitely one way of solving it!
That's exactly what I intended: to have a separate file mapping harm categories to each dataset entry, and to manually review and assign the categories :)
The reason I initially created this new file (instead of modifying the dataset itself) is that I assumed we wouldn't want to keep the harmful prompts directly in the PyRIT repo. But I agree that having all the data together makes a lot of sense.
As for the harm categories, I've been manually applying them based on the ones already present in the fetch function (though I'm not sure where they were originally sourced from):
PyRIT/pyrit/datasets/fetch_example_datasets.py
Lines 535 to 555 in 3d0543c
harm_categories = [
    "Endangering National Security",
    "Insulting Behavior",
    "Discriminatory Behavior",
    "Endangering Public Health",
    "Copyright Issues",
    "Violence",
    "Drugs",
    "Privacy Violation",
    "Economic Crime",
    "Mental Manipulation",
    "Human Trafficking",
    "Physical Harm",
    "Sexual Content",
    "Cybercrime",
    "Disrupting Public Order",
    "Environmental Damage",
    "Psychological Harm",
    "White-Collar Crime",
    "Animal Abuse",
]
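To make the file-in-repo option concrete, the mapping file and the loading step could look something like this sketch. The JSON layout, the file path, and the _load_harm_categories helper are illustrative assumptions, not the actual implementation in this PR.

# Sketch only: assumes a JSON file keyed by prompt index, e.g.
# pyrit/datasets/harm_categories/adv_bench_dataset.json containing
# {"0": ["Cybercrime"], "1": ["Physical Harm", "Violence"], ...}
import json
from pathlib import Path


def _load_harm_categories(path: str) -> dict[int, list[str]]:
    """Load the per-prompt harm category mapping from a JSON file."""
    with Path(path).open(encoding="utf-8") as f:
        raw = json.load(f)
    return {int(index): categories for index, categories in raw.items()}


# Hypothetical wiring inside fetch_adv_bench_dataset:
# categories_by_index = _load_harm_categories(
#     "pyrit/datasets/harm_categories/adv_bench_dataset.json"
# )
# prompts = [item["goal"] for item in examples if "goal" in item]
# harm_categories = [categories_by_index.get(i, []) for i in range(len(prompts))]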
Edit:
This is from the paper regarding the themes of the prompts:
... reflect harmful or toxic behavior, encompassing a wide spectrum of detrimental content such as profanity, graphic depictions, threatening behavior, misinformation, discrimination, cybercrime, and dangerous or illegal suggestions ...
Maybe these could be the categories we use to classify the prompts?
It's definitely a start. I had a variation of this discussion just this morning, but no real solution yet. I think we need a taxonomy of harms and then use the categories from there. The question is which one. Let me see what I can find and feel free to suggest some as well.
I found an article that lists 9 top-level types of harms:
Autonomy, Physical, Psychological, Reputational, Financial and Business, Human Rights and Civil Liberties, Societal and Cultural, Political and Economic, Environmental.
This taxonomy aims to be accessible, comprehensive, and "user-friendly": https://arxiv.org/abs/2407.01294v2
However, there's one thing I’m unsure about: do we need something specifically for this dataset, or a taxonomy that could be used generally across PyRIT?
If it's just for this dataset, maybe sticking with what's in the original paper is enough? 🤷‍♂️🤔
Oh, sorry, I misread what you wrote earlier! Yes, let's stick with what the paper says. I'm trying to get us to use a taxonomy across the board, but reaching the point where we agree on one that really covers everything we need may take some time. Don't want to delay until then. In fact, since this may happen at some point, your solution with a file in the repo may not be a bad idea at all since it's the simplest.
Description
This PR aims to resolve #730 by adding a way to manually assign harm categories to the AdvBench dataset and by enabling filtering based on those categories.
Marked as a draft PR since I'm seeking confirmation on the approach.
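A rough illustration of the intended filtering behavior once categories are attached follows; the LabeledPrompt class and filter_by_harm_categories function are hypothetical names used only to show the idea, not PyRIT APIs.

# Sketch of category-based filtering; the dataset structure and names below are
# assumptions for illustration, not part of the PyRIT codebase.
from dataclasses import dataclass, field


@dataclass
class LabeledPrompt:
    value: str
    harm_categories: list[str] = field(default_factory=list)


def filter_by_harm_categories(
    prompts: list[LabeledPrompt], wanted: set[str]
) -> list[LabeledPrompt]:
    """Keep only prompts tagged with at least one of the requested categories."""
    return [p for p in prompts if wanted & set(p.harm_categories)]


# Example usage with made-up data:
dataset = [
    LabeledPrompt("...", ["Cybercrime"]),
    LabeledPrompt("...", ["Violence", "Physical Harm"]),
]
print(filter_by_harm_categories(dataset, {"Violence"}))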
Tests and Documentation