Detector Shields for testing LLM Application Firewalls #1059

Eric-Hacker · 2025-01-06T14:25:24Z

This implements a Shields detector as describe in #1055

Verification

Use this detector with an LLM Guardrail such as Granite-Guardian

Rest generator config for running Granite-Guardian 2B locally with Ollama

{
   "rest": {
      "RestGenerator": {
         "name": "Ollama GraniteGuardian",
         "uri": "http://localhost:11434/api/chat",
         "method": "post",
         "headers": {
            "Content-Type": "application/json"
         },
         "req_template_json_object": {
            "model": "granite3-guardian:2b",
            "messages": [ {"role": "user", "content": "$INPUT"} ],
            "temperature": 0,
           "stream": false
         },
         "response_json": true,
         "response_json_field": "$.message.content",
         "request_timeout": 120
      }
   }
}

Run the tests and ensure they pass python -m pytest tests/
python -m garak --model_type rest \ --generator_option_file garak_runs/ollama_graniteguardian.json \ --generations 1 \ --probes test.Test \ --detectors shields.Up \ --report_prefix promptinject.ollama_graniteguardian.up
Verify shields.Up: FAIL ok on 3/ 8 (failure rate: 62.50%)
python -m garak --model_type rest \ --generator_option_file garak_runs/ollama_graniteguardian.json \ --generations 1 \ --probes test.Test \ --detectors shields.Down \ --report_prefix promptinject.ollama_graniteguardian.down
Verify shields.Down: FAIL ok on 5/ 8 (failure rate: 37.50%)
Documented

github-actions · 2025-01-06T14:25:37Z

DCO Assistant Lite bot All contributors have signed the DCO ✍️ ✅

jmartin-tech · 2025-01-06T22:36:20Z

Awesome, I like this idea. There has been some discussion in discord about boolean detection for rule triggers. I wonder if there is a reasonable way to incorporate that here.

The solution for users may be as simple as setting upstrings: [ "true" ] or downstrings: [ "false" ] to detect rule match response.

Eric-Hacker · 2025-01-06T22:44:14Z

I have read the DCO Document and I hereby sign the DCO

Eric-Hacker · 2025-01-06T22:50:15Z

It looks like I messed up my username when I switched machines and now there is a commit from what Github thinks is someone else. I'll have to figure out how to fix that. Pulled it back to draft.

Eric-Hacker · 2025-01-06T23:25:36Z

Regarding adding other boolean detection strings, I'd be happy to do something, but looking at the thread it looks like the firewall returned a key of "flagged" that is set to true rather than something like "flagged: true" in the normal response_json_field. It seems that would need non-string detector which should probably be its own class.

Also, one of the default Up strings that is matched is "flag" at the beginning of the response. So if a firewall was returning "flagged true" and "flagged false" as strings, this would treat them both as a pass. That is one of the reasons the upstrings and downstrings are params. I didn't call that out in the docs, but I could do so since I know this can get really confusing.

Eric-Hacker · 2025-01-08T12:48:49Z

TL;DR Fixed the DCO identity issue. Putting back up for review.

Details. My first commit was using a canary email address on my domain that wasn't registered with Github. Then I found out that github provides email addresses for public exposure and I switched to that. I had to add the canary email address to my github profile and the DCO issue cleared up.

So now I have an email address that is only exposed in one public commit. It will be interesting to see how much spam it gets.

jmartin-tech · 2025-01-08T15:11:04Z

@Eric-Hacker if you would like to limit the exposure you can rebase or squash the branch before we land it to remove the email and allowing removal from your github account, while not a full scrub since git commits are rarely truly deleted it could mitigate future spam.

Either way we will consider this ready for review and provide feedback soon.

Eric-Hacker · 2025-01-09T22:40:22Z

@jmartin-tech No need to try and remove that email, thank you. I already have filters to direct compromised canary email addresses to spam. Now I'm curious to see what it receives.

jmartin-tech

Testing looks good, #1075 has added the startswith capability here and incorporated the tests in test_detectors_base.py making test_detectors_string.py here redundant.

I will push a commit removing test_detectors_string.py and land this shortly.

These tests are for `StringDetector` are incorporated in `test_detectors_base.py`

Eric-Hacker and others added 4 commits December 5, 2024 22:32

Added shields detector and startswith stringdetector

28221a1

Added options to Shields and fixed bug.

2a0dd05

Added tests

1c34988

Finished docs to pass tests.

62a961f

Eric-Hacker marked this pull request as draft January 6, 2025 22:47

github-actions bot added a commit that referenced this pull request Jan 7, 2025

@Eric-Hacker has signed the CLA in #1059

e6755d6

Eric-Hacker marked this pull request as ready for review January 8, 2025 12:48

jmartin-tech self-assigned this Jan 16, 2025

jmartin-tech approved these changes Jan 17, 2025

View reviewed changes

remove redundant test file

4ba35af

These tests are for `StringDetector` are incorporated in `test_detectors_base.py`

jmartin-tech merged commit d8d11a9 into NVIDIA:main Jan 17, 2025
9 checks passed

github-actions bot locked and limited conversation to collaborators Jan 17, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Detector Shields for testing LLM Application Firewalls #1059

Detector Shields for testing LLM Application Firewalls #1059

Eric-Hacker commented Jan 6, 2025

github-actions bot commented Jan 6, 2025 •

edited

Loading

jmartin-tech commented Jan 6, 2025

Eric-Hacker commented Jan 6, 2025

Eric-Hacker commented Jan 6, 2025

Eric-Hacker commented Jan 6, 2025

Eric-Hacker commented Jan 8, 2025

jmartin-tech commented Jan 8, 2025

Eric-Hacker commented Jan 9, 2025

jmartin-tech left a comment

Detector Shields for testing LLM Application Firewalls #1059

Detector Shields for testing LLM Application Firewalls #1059

Conversation

Eric-Hacker commented Jan 6, 2025

Verification

github-actions bot commented Jan 6, 2025 • edited Loading

jmartin-tech commented Jan 6, 2025

Eric-Hacker commented Jan 6, 2025

Eric-Hacker commented Jan 6, 2025

Eric-Hacker commented Jan 6, 2025

Eric-Hacker commented Jan 8, 2025

jmartin-tech commented Jan 8, 2025

Eric-Hacker commented Jan 9, 2025

jmartin-tech left a comment

Choose a reason for hiding this comment

github-actions bot commented Jan 6, 2025 •

edited

Loading