
deepseek-inspired policy check (?) #1087

Closed
Manamama opened this issue Jan 24, 2025 · 4 comments

Comments

@Manamama

Summary

We may need tests for non-WEIRD "politically correct" LLMs such as: https://github.com/deepseek-ai/DeepSeek-R1.

Basic example

On mobile, so excuse brevity. See the informal tests and discussions here: deepseek-ai/DeepSeek-R1#35 (comment)

Motivation

The non-PC questions and their answers are erased by the guardrail from the GUI session and logs (as half-expected), but these guardrails are also reported to operate in some black-box layers in the offline versions of DeepSeek-R1.

@leondz
Collaborator

leondz commented Jan 24, 2025

Can this be extended so that it is exactly clear what the ask is? Is there a specific guardrail-circumvention technique here (and if so, which one), or a failure mode that is suggested for consideration?

@leondz leondz changed the title New Garak tests, bespoke for deepseek-inspired policy check (?) Jan 24, 2025
@Manamama
Author

I am thinking aloud here, also on mobile. There is a special case of a test proposed here, but somewhat à rebours: aimed at discovering what has been hard- and soft-coded there as such guardrails.

I had used garak for only about 12 hours in total, a month ago, spot-checking some bespoke LLMs, so I am not a garak expert. I only guess this can be done using a tailored garak-format battery of queries, politically loaded here by design.

Re: "Is there a specific guardrail circumvention technique here (which one)":
I noticed (only informally so far, as I am too busy, and only via its GUI, not via any API or with a standard battery of tests) that asking it about their non-PC, guardrailed topics in a non-English language elicits an answer, which is then erased once the presumed guardrail component realizes that the output is non-PC from their (so far black-boxed) perspective.
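The erasure pattern described above (an answer is streamed, then removed once a post-hoc guardrail flags it) could in principle be checked with a multilingual prompt battery plus a comparison of streamed vs. finally displayed output. The following is a minimal, self-contained sketch in plain Python, not a real garak probe: `ask_model` is a hypothetical stub standing in for the model under test, and the prompt texts are placeholders, since the real questions are the politically loaded part left to the tester.

```python
# Sketch of a multilingual guardrail-erasure check.
# ask_model() is a HYPOTHETICAL stub: a real harness would query the model
# under test and capture both the streamed text and the final displayed text.

PROMPTS = {
    "en": "placeholder sensitive question (English)",
    "de": "placeholder sensitive question (German)",
    "pl": "placeholder sensitive question (Polish)",
}

def ask_model(prompt: str) -> tuple[str, str]:
    """Hypothetical stub: returns (streamed_text, final_displayed_text)."""
    streamed = f"answer to: {prompt}"
    final = ""  # simulates the guardrail erasing the answer post hoc
    return streamed, final

def detect_erasure(streamed: str, final: str) -> bool:
    """Flag cases where content was produced, then removed from display."""
    return bool(streamed.strip()) and not final.strip()

def run_battery() -> dict[str, bool]:
    """Run every prompt in the battery and record whether erasure occurred."""
    results = {}
    for lang, prompt in PROMPTS.items():
        streamed, final = ask_model(prompt)
        results[lang] = detect_erasure(streamed, final)
    return results

if __name__ == "__main__":
    for lang, erased in run_battery().items():
        print(f"{lang}: erasure detected = {erased}")
```

A real garak integration would instead package such prompts as a probe with a matching detector; the sketch only illustrates the streamed-vs-final comparison the GUI observation suggests.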

But here we also have a moral aspect: do we want to (even inadvertently) aid such non-WEIRD LLMs in their suppressing of (granted, also WEIRD-based) The Truth...

@leondz
Collaborator

leondz commented Jan 25, 2025

garak is more of an LLM security tool than a moral judge. If there is a specific and clear attack technique we should be considering, please let us know!

@Manamama
Author

OK, let us close this ticket then, as deliberate, "geopolitical" and thus inherently moral (as "one man's terrorist is another's freedom fighter") tests are more suitable here, e.g. this informal one: deepseek-ai/DeepSeek-R1#35 (comment).
