feat: add WMDP dataset integration #93

Open: wants to merge 2 commits into base: main

ruidazeng
Contributor

Add WMDP cyber dataset configuration for unlearning

What does this PR do?

Fixes #80

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Have you gone through the contributions guide?
  • Are your changes documented? Read documentation guidelines here.

We invite the LLM unlearning community to collaborate by adding new benchmarks, unlearning methods, datasets and evaluation metrics here to expand OpenUnlearning's features, gain feedback from wider usage and drive progress in the field.

---

### 📢 Updates

#### [Apr 9, 2025]
Collaborator

need to update just before merge

This file is the same as WMDP_cyber_forget.yaml.
Is it needed for something? If it is for testing performance disruption, IIRC MMLU was used.

Collaborator

@ruidazeng I don't see a path for this dataset on Hugging Face under cais/wmdp-corpora.

@filyp, Apr 29, 2025

You mean for MMLU? Yeah, it's not in cais/wmdp-corpora but in https://huggingface.co/datasets/cais/mmlu

From what I understand, in the original setup, training is done on:

  • "cais/wmdp-corpora" - "cyber-forget-corpus"
  • "cais/wmdp-corpora" - "cyber-retain-corpus"

And evaluation on:

  • "cais/wmdp" - "wmdp-cyber"
  • "cais/mmlu" - "all"
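
A minimal sketch of the train/eval split listed above, assuming the Hugging Face `datasets` library is used for loading. The `HubDataset` class, the `load` helper, and the dictionary keys are hypothetical names for illustration only and are not part of OpenUnlearning; only the repo paths and config names come from this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HubDataset:
    """A Hugging Face Hub dataset, identified by repo path and config name."""
    path: str
    name: str


# Repo paths and config names exactly as listed in the comment above.
# The dictionary keys are hypothetical labels chosen for this sketch.
TRAIN_SETS = {
    "forget": HubDataset("cais/wmdp-corpora", "cyber-forget-corpus"),
    "retain": HubDataset("cais/wmdp-corpora", "cyber-retain-corpus"),
}
EVAL_SETS = {
    "wmdp_cyber": HubDataset("cais/wmdp", "wmdp-cyber"),
    "mmlu": HubDataset("cais/mmlu", "all"),
}


def load(spec: HubDataset):
    """Download one dataset config (needs the `datasets` package and network access)."""
    # Imported lazily so the specs above can be inspected without the dependency.
    from datasets import load_dataset
    return load_dataset(spec.path, spec.name)
```

Calling `load(TRAIN_SETS["forget"])` would fetch the forget corpus. Available splits and any access gating are not covered here and should be checked on each dataset card.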

I see you mentioned in #80 that evaluation won't be done here anyway, but only afterwards, so I guess WMDP_cyber_forget.yaml and WMDP_cyber_retain.yaml just aren't needed.

Collaborator

Yes

@filyp mentioned this pull request on Apr 26, 2025 (5 tasks)
Development

Successfully merging this pull request may close these issues.

[Feature Request] Add WMDP Dataset
4 participants