Detector Shields for testing LLM Application Firewalls #1059

Merged
merged 5 commits into from
Jan 17, 2025

Conversation

Eric-Hacker
Contributor

This implements a Shields detector as described in #1055

Verification

Use this detector with an LLM Guardrail such as Granite-Guardian

  • Rest generator config for running Granite-Guardian 2B locally with Ollama
{
   "rest": {
      "RestGenerator": {
         "name": "Ollama GraniteGuardian",
         "uri": "http://localhost:11434/api/chat",
         "method": "post",
         "headers": {
            "Content-Type": "application/json"
         },
         "req_template_json_object": {
            "model": "granite3-guardian:2b",
            "messages": [ {"role": "user", "content": "$INPUT"} ],
            "temperature": 0,
            "stream": false
         },
         "response_json": true,
         "response_json_field": "$.message.content",
         "request_timeout": 120
      }
   }
}
  • Run the tests and ensure they pass: python -m pytest tests/
  • python -m garak --model_type rest \
      --generator_option_file garak_runs/ollama_graniteguardian.json \
      --generations 1 \
      --probes test.Test \
      --detectors shields.Up \
      --report_prefix promptinject.ollama_graniteguardian.up
  • Verify shields.Up: FAIL ok on 3/ 8 (failure rate: 62.50%)
  • python -m garak --model_type rest \
      --generator_option_file garak_runs/ollama_graniteguardian.json \
      --generations 1 \
      --probes test.Test \
      --detectors shields.Down \
      --report_prefix promptinject.ollama_graniteguardian.down
  • Verify shields.Down: FAIL ok on 5/ 8 (failure rate: 37.50%)
  • Documented
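
For readers unfamiliar with the REST generator options, here is a rough sketch of what the config above describes: the prompt replaces `$INPUT` in the request template, and the reply text is read from `$.message.content`. This is not garak's implementation. The helper names `build_payload` and `extract_content`, the sample prompt, and the sample response body are all made up for illustration.

```python
import json

# Mirrors "req_template_json_object" from the config above.
REQ_TEMPLATE = {
    "model": "granite3-guardian:2b",
    "messages": [{"role": "user", "content": "$INPUT"}],
    "temperature": 0,
    "stream": False,
}


def build_payload(template, prompt):
    """Return a copy of the template with "$INPUT" replaced by the prompt."""
    def fill(node):
        if isinstance(node, str):
            return node.replace("$INPUT", prompt)
        if isinstance(node, list):
            return [fill(item) for item in node]
        if isinstance(node, dict):
            return {key: fill(value) for key, value in node.items()}
        return node  # numbers, booleans, None are passed through unchanged

    return fill(template)


def extract_content(response_body):
    """Follow the path $.message.content in a decoded JSON response."""
    return response_body["message"]["content"]


payload = build_payload(REQ_TEMPLATE, "How do I pick a lock?")

# Hypothetical response body (an assumption, not captured output).
sample_response = json.loads('{"message": {"role": "assistant", "content": "Yes"}}')

print(json.dumps(payload["messages"]))
print(extract_content(sample_response))
```

The template itself is left untouched, so the same dict can be reused for every prompt.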

Contributor

github-actions bot commented Jan 6, 2025

DCO Assistant Lite bot All contributors have signed the DCO ✍️ ✅

@jmartin-tech
Collaborator

Awesome, I like this idea. There has been some discussion on Discord about boolean detection for rule triggers. I wonder if there is a reasonable way to incorporate that here.

The solution for users may be as simple as setting upstrings: [ "true" ] or downstrings: [ "false" ] to detect a rule-match response.
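
As a rough illustration of that idea, the sketch below models prefix matching with configurable string lists. It is a standalone approximation, not the detector code from this PR. The function names are hypothetical, and the case-insensitive, whitespace-stripped comparison is an assumption.

```python
def starts_with_any(text, prefixes):
    """True when the cleaned text begins with any of the given prefixes."""
    cleaned = text.strip().lower()
    return any(cleaned.startswith(prefix.lower()) for prefix in prefixes)


def shields_up_passes(response, upstrings):
    """Assumed Up semantics: pass when the firewall reports that it triggered."""
    return starts_with_any(response, upstrings)


def shields_down_passes(response, downstrings):
    """Assumed Down semantics: pass when the firewall reports no trigger."""
    return starts_with_any(response, downstrings)


UPSTRINGS = ["true"]     # values suggested in the comment above
DOWNSTRINGS = ["false"]

for reply in ["true", "False", "maybe"]:
    print(
        reply,
        shields_up_passes(reply, UPSTRINGS),
        shields_down_passes(reply, DOWNSTRINGS),
    )
```

A reply that matches neither list, such as "maybe", passes neither check under these assumptions.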

@Eric-Hacker
Contributor Author

I have read the DCO Document and I hereby sign the DCO

@Eric-Hacker Eric-Hacker marked this pull request as draft January 6, 2025 22:47
@Eric-Hacker
Contributor Author

It looks like I messed up my username when I switched machines, and now there is a commit from what GitHub thinks is someone else. I'll have to figure out how to fix that. Pulled it back to draft.

@Eric-Hacker
Contributor Author

Regarding adding other boolean detection strings, I'd be happy to do something, but looking at the thread, it looks like the firewall returned a key of "flagged" that is set to true rather than something like "flagged: true" in the normal response_json_field. It seems that would need a non-string detector, which should probably be its own class.
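
To make the distinction concrete, here is a minimal sketch of what a non-string check could look like. It assumes the firewall returns a JSON object with a boolean "flagged" key. The function name `flagged_from_body` and the sample bodies are hypothetical, and nothing here is part of this PR.

```python
import json


def flagged_from_body(raw_body, key="flagged"):
    """Return True or False when the key holds a JSON boolean, else None."""
    try:
        body = json.loads(raw_body)
    except json.JSONDecodeError:
        return None  # not JSON at all
    if not isinstance(body, dict):
        return None  # JSON, but not an object
    value = body.get(key)
    # Only a real boolean counts; the string "true" is deliberately rejected.
    return value if isinstance(value, bool) else None


for raw in ['{"flagged": true}', '{"flagged": false}', '{"flagged": "true"}', "flagged: true"]:
    print(repr(raw), flagged_from_body(raw))
```

Returning None for anything that is not a real boolean keeps this case separate from the string matching that upstrings and downstrings perform.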

Also, one of the default Up strings that is matched is "flag" at the beginning of the response. So if a firewall was returning "flagged true" and "flagged false" as strings, this would treat them both as a pass. That is one of the reasons the upstrings and downstrings are params. I didn't call that out in the docs, but I could do so since I know this can get really confusing.
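
A small sketch of that pitfall, for anyone who wants to see it: with a broad prefix such as "flag", both replies match, while a narrower custom string separates them. The helper `matches_prefix` is hypothetical and only approximates a startswith comparison. Lowercasing and stripping are assumptions.

```python
def matches_prefix(response, prefixes):
    """True when the cleaned response starts with any of the prefixes."""
    cleaned = response.strip().lower()
    return cleaned.startswith(tuple(prefix.lower() for prefix in prefixes))


broad = ["flag"]           # the default Up string mentioned above
narrow = ["flagged true"]  # a custom upstring for this hypothetical firewall

for reply in ["flagged true", "flagged false"]:
    print(reply, matches_prefix(reply, broad), matches_prefix(reply, narrow))
```

With the broad prefix both replies count as a match, which is the confusing case described above. Only the narrow list distinguishes them.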

github-actions bot added a commit that referenced this pull request Jan 7, 2025
@Eric-Hacker
Contributor Author

TL;DR Fixed the DCO identity issue. Putting back up for review.

Details: My first commit was using a canary email address on my domain that wasn't registered with GitHub. Then I found out that GitHub provides email addresses for public exposure and I switched to that. I had to add the canary email address to my GitHub profile and the DCO issue cleared up.

So now I have an email address that is only exposed in one public commit. It will be interesting to see how much spam it gets.

@Eric-Hacker Eric-Hacker marked this pull request as ready for review January 8, 2025 12:48
@jmartin-tech
Collaborator

@Eric-Hacker if you would like to limit the exposure, you can rebase or squash the branch before we land it. That would remove the email from the branch history and allow you to remove it from your GitHub account. It is not a full scrub, since git commits are rarely truly deleted, but it could mitigate future spam.

Either way we will consider this ready for review and provide feedback soon.

@Eric-Hacker
Contributor Author

@jmartin-tech No need to try and remove that email, thank you. I already have filters to direct compromised canary email addresses to spam. Now I'm curious to see what it receives.

@jmartin-tech jmartin-tech self-assigned this Jan 16, 2025
Collaborator

@jmartin-tech jmartin-tech left a comment

Testing looks good. #1075 has added the startswith capability here and incorporated the tests in test_detectors_base.py, making test_detectors_string.py here redundant.

I will push a commit removing test_detectors_string.py and land this shortly.

These tests for `StringDetector` are incorporated in `test_detectors_base.py`
@jmartin-tech jmartin-tech merged commit d8d11a9 into NVIDIA:main Jan 17, 2025
9 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Jan 17, 2025