Detector Shields for testing LLM Application Firewalls #1059
DCO Assistant Lite bot All contributors have signed the DCO ✍️ ✅ |
Awesome, I like this idea. There has been some discussion in discord about boolean detection for rule triggers. I wonder if there is a reasonable way to incorporate that here. The solution for users may be as simple as setting |
I have read the DCO Document and I hereby sign the DCO |
It looks like I messed up my username when I switched machines and now there is a commit from what Github thinks is someone else. I'll have to figure out how to fix that. Pulled it back to draft. |
Regarding adding other boolean detection strings, I'd be happy to do something, but looking at the thread, it looks like the firewall returned a key of "flagged" set to true rather than something like "flagged: true" in the normal response_json_field. It seems that would need a non-string detector, which should probably be its own class. Also, one of the default Up strings that is matched is "flag" at the beginning of the response. So if a firewall were returning "flagged true" and "flagged false" as strings, this would treat them both as a pass. That is one of the reasons the upstrings and downstrings are params. I didn't call that out in the docs, but I could do so since I know this can get really confusing. |
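To illustrate the two points above, here is a minimal standalone sketch. It is not the detector code from this PR: the helper names `starts_with_any` and `flagged_from_json`, the prefix lists, and the sample payloads are all hypothetical. It shows why a bare "flag" prefix cannot tell "flagged true" from "flagged false", and why a JSON boolean needs a different kind of check than a string match.

```python
import json


def starts_with_any(response: str, prefixes: list[str]) -> bool:
    """Return True if the response begins with any prefix (case-insensitive)."""
    text = response.strip().lower()
    return any(text.startswith(p.lower()) for p in prefixes)


def flagged_from_json(raw: str, key: str = "flagged") -> bool:
    """Return True only if the JSON object has `key` set to the boolean true."""
    return json.loads(raw).get(key) is True


# Hypothetical prefix lists, standing in for the detector's string params.
loose_prefixes = ["flag"]           # matches both replies below
strict_prefixes = ["flagged true"]  # matches only the positive reply

for reply in ("flagged true", "flagged false"):
    print(
        reply,
        starts_with_any(reply, loose_prefixes),
        starts_with_any(reply, strict_prefixes),
    )

# A boolean field is not a string, so prefix matching does not apply to it.
print(flagged_from_json('{"flagged": true}'))
print(flagged_from_json('{"flagged": false}'))
```

With string replies like these, overriding the default strings with something more specific is what separates the two cases; a real boolean field would need its own detector class, as noted above.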
TL;DR Fixed the DCO identity issue. Putting back up for review. Details. My first commit was using a canary email address on my domain that wasn't registered with Github. Then I found out that github provides email addresses for public exposure and I switched to that. I had to add the canary email address to my github profile and the DCO issue cleared up. So now I have an email address that is only exposed in one public commit. It will be interesting to see how much spam it gets. |
@Eric-Hacker if you would like to limit the exposure, you can rebase or squash the branch before we land it. That removes the email from the branch history and lets you remove it from your GitHub account. It is not a full scrub, since git commits are rarely truly deleted, but it could mitigate future spam. Either way, we will consider this ready for review and provide feedback soon. |
@jmartin-tech No need to try and remove that email, thank you. I already have filters to direct compromised canary email addresses to spam. Now I'm curious to see what it receives. |
Testing looks good. #1075 has added the `startswith` capability here and incorporated the tests in `test_detectors_base.py`, making `test_detectors_string.py` here redundant. I will push a commit removing `test_detectors_string.py` and land this shortly.
These tests for `StringDetector` are incorporated in `test_detectors_base.py`
This implements a Shields detector as described in #1055
Verification
Use this detector with an LLM Guardrail such as Granite-Guardian
python -m pytest tests/

python -m garak --model_type rest \
  --generator_option_file garak_runs/ollama_graniteguardian.json \
  --generations 1 \
  --probes test.Test \
  --detectors shields.Up \
  --report_prefix promptinject.ollama_graniteguardian.up

python -m garak --model_type rest \
  --generator_option_file garak_runs/ollama_graniteguardian.json \
  --generations 1 \
  --probes test.Test \
  --detectors shields.Down \
  --report_prefix promptinject.ollama_graniteguardian.down