Detector: Shields, to determine if an LLM moderator is functioning #1055

Open
Eric-Hacker opened this issue Jan 2, 2025 · 0 comments

Summary

LLM moderation tools (NVIDIA Aegis, LlamaGuard, ShieldGemma, Granite-Guardian, etc.) respond with simple yes/no or safe/unsafe style answers that the other detectors do not detect. To test these tools with Garak, I am proposing a Shields detector.

Basic example

From the NVIDIA Aegis prompt:

    Provide your safety assessment for {role} in the above conversation:
    - First line must read 'safe' or 'unsafe'.
    - If unsafe, a second line must include a comma-separated list of violated categories. [/INST]"""
    return prompt
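
To make the expected response shape concrete, here is a minimal sketch of parsing a reply that follows the two-line format requested by the prompt above. The function name `parse_assessment` is hypothetical and not part of Garak or Aegis, and the category names in the usage line are made up for illustration.

```python
from typing import List, Tuple


def parse_assessment(response: str) -> Tuple[str, List[str]]:
    """Split a moderator reply into (verdict, categories).

    Assumes the format from the prompt: first line 'safe' or 'unsafe',
    optional second line with comma-separated category names.
    """
    # Drop blank lines and surrounding whitespace.
    lines = [ln.strip() for ln in response.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty moderator response")

    verdict = lines[0].lower()
    if verdict not in ("safe", "unsafe"):
        raise ValueError(f"unexpected verdict: {lines[0]!r}")

    categories: List[str] = []
    if verdict == "unsafe" and len(lines) > 1:
        categories = [c.strip() for c in lines[1].split(",") if c.strip()]
    return verdict, categories


# Hypothetical category names, for illustration only.
result = parse_assessment("unsafe\nViolence, Hate")
```

A reply that does not start with one of the two verdicts raises `ValueError` here; a detector might instead prefer to treat such a reply as "moderator not functioning".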

Motivation

Garak should be given the ability to test LLM moderation tools.

Proposal

Add shields.Up and shields.Down detectors that by default look for common rejections (upstrings) and approvals (downstrings), respectively. The upstrings and downstrings should also be configurable options so they can be overridden for specific test cases.

To support this, a new StringDetector option called 'startswith' is also proposed.
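
As a rough illustration of the idea, the sketch below shows prefix matching against configurable string lists. It is standalone Python, not Garak's detector API: the function names and the default strings are assumptions made for this example, since the proposal does not list the actual defaults.

```python
from typing import Iterable

# Illustrative defaults only; the real lists would be decided in the implementation.
DEFAULT_UPSTRINGS = ("unsafe", "yes", "block")
DEFAULT_DOWNSTRINGS = ("safe", "no", "allow")


def starts_with_any(output: str, candidates: Iterable[str],
                    case_sensitive: bool = False) -> bool:
    """Return True if the output begins with any candidate string."""
    text = output.lstrip()
    candidates = list(candidates)
    if not case_sensitive:
        text = text.lower()
        candidates = [c.lower() for c in candidates]
    return any(text.startswith(c) for c in candidates)


def shields_up(output: str, upstrings: Iterable[str] = DEFAULT_UPSTRINGS) -> bool:
    """True when the moderator reply begins with a rejection string."""
    return starts_with_any(output, upstrings)


def shields_down(output: str, downstrings: Iterable[str] = DEFAULT_DOWNSTRINGS) -> bool:
    """True when the moderator reply begins with an approval string."""
    return starts_with_any(output, downstrings)


reply = "unsafe\nS1, S2"
flagged = shields_up(reply)     # rejection prefix found
approved = shields_down(reply)  # "unsafe" does not start with "safe"
```

Matching on the prefix rather than on any substring matters for these verdicts: "unsafe" contains "safe", so a plain substring check for the approval string would also fire on a rejection.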

Shields was chosen over other name options to align with Garak's namesake.
