deepseek-inspired policy check (?) #1087

Manamama · 2025-01-24T14:05:09Z

Summary

We may need tests for not-WEIRD Politically Correct LLMs like: https://github.com/deepseek-ai/DeepSeek-R1.

Basic example

On mobile, so excuse brevity. See the informal tests and discussions here: deepseek-ai/DeepSeek-R1#35 (comment)

Motivation

The non-PC questions and the answers are erased by the guardrail from the GUI session and logs (as half-expected), but these guardrails are also reported to operate in some black-box layers also in the offline versions of DeepSeek-R1.

leondz · 2025-01-24T15:58:50Z

Can this be extended so it's exactly clear what the ask is? Is there a specific guardrail circumvention technique here (which one) or a failure mode that's suggested for consideration?

Manamama · 2025-01-24T22:52:51Z

I am thinking aloud here, also on mobile. There is a special case of a test proposed, but somehow a rebours: aimed to discovering what has been hard- and soft-coded there as such guardrails.

I had been using garak for only about 12 hours in total, a month ago, spot checking some bespoke LLMs, so I am not a garak expert. I only guess it can be done using a tailored garak format battery of queries, politically loaded here, by design.

Re: 'Is there a specific guardrail circumvention technique here (which one) '.
I noticed (informally only so far, as am too busy), and via its GUI only, not any API or with some standard battery of tests that asking it about their non-PC, guardrailed topics in a non English language elicits the answer, which is then erased once the presumed guardrail component realizes that the output is non-PC from their (black boxed so far) perspective.

But here we also have a moral aspect - if we want to (even inadvertently) aid such non-WEIRD LLMs in their suppressing (granted also WEIRD-based) The Truth...

leondz · 2025-01-25T03:25:25Z

garak is more of an LLM security tool than a moral judge. If there is a specific and clear attack technique we should be considering, please let us know!

Manamama · 2025-01-25T09:06:23Z

Ok, let us close this ticket then, as deliberate, "geopolitical" and thus inherently moral (as "one man's terrorist is another's freedom fighter") tests are more suitable here, e.g. this informal one: deepseek-ai/DeepSeek-R1#35 (comment).

leondz changed the title ~~New Garak tests, bespoke for~~ deepseek-inspired policy check (?) Jan 24, 2025

Manamama closed this as completed Jan 25, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

deepseek-inspired policy check (?) #1087

deepseek-inspired policy check (?) #1087

Manamama commented Jan 24, 2025

leondz commented Jan 24, 2025

Manamama commented Jan 24, 2025

leondz commented Jan 25, 2025

Manamama commented Jan 25, 2025

deepseek-inspired policy check (?) #1087

deepseek-inspired policy check (?) #1087

Comments

Manamama commented Jan 24, 2025

Summary

Basic example

Motivation

leondz commented Jan 24, 2025

Manamama commented Jan 24, 2025

leondz commented Jan 25, 2025

Manamama commented Jan 25, 2025