deepseek-inspired policy check (?) #1087
Comments
Can this be expanded so it's clear exactly what the ask is? Is there a specific guardrail circumvention technique here (and if so, which one), or a failure mode that's being suggested for consideration?
I am thinking aloud here, and I am also on mobile. What is proposed is a special case of a test, but in a way à rebours (in reverse): one aimed at discovering what has been hard- and soft-coded there as guardrails. I used garak for only about 12 hours in total, a month ago, spot-checking some bespoke LLMs, so I am not a garak expert. My guess is that it could be done with a tailored battery of queries in garak's format, politically loaded by design. Re: 'Is there a specific guardrail circumvention technique here (which one)': there is also a moral aspect here, namely whether we want to aid (even inadvertently) such non-WEIRD LLMs in suppressing The Truth (granted, itself WEIRD-based)...
garak is more of an LLM security tool than a moral judge. If there is a specific and clear attack technique we should be considering, please let us know!
OK, let us close this ticket then, since deliberately "geopolitical", and thus inherently moral, tests (as in "one man's terrorist is another's freedom fighter") are more suitable here, e.g. this informal one: deepseek-ai/DeepSeek-R1#35 (comment).
Summary
We may need tests for non-WEIRD, politically correct LLMs such as https://github.com/deepseek-ai/DeepSeek-R1.
Basic example
On mobile, so excuse the brevity. See the informal tests and discussions here: deepseek-ai/DeepSeek-R1#35 (comment)
Motivation
The non-PC questions and their answers are erased by the guardrail from the GUI session and logs (as half-expected), but these guardrails are reportedly also at work in some black-box layers of the offline versions of DeepSeek-R1.