
RedViz: A Red-Teaming Visualization Framework


Large Language Models (LLMs) like GPT, Claude, and LLaMA power real-world applications yet remain vulnerable to adversarial prompts, known as jailbreaks, that bypass safety alignment to elicit harmful or policy-violating outputs. Existing red-teaming datasets, such as CohereAI's AYA Red Teaming and JailbreakBench, offer rich multilingual prompt collections, but research remains predominantly static, limited to categorical or linguistic aggregation. This leaves a critical gap: the absence of an interactive, unified environment for dynamic LLM safety evaluation.

We introduce RedViz, an interactive Streamlit dashboard that transforms static red-teaming datasets into a live platform for exploratory safety analysis. By integrating AYA Red Teaming, JailbreakBench, and open-source models (TinyLLaMA, LLaMA, etc.), the system enables practitioners to explore harm distributions, conduct live jailbreak stress-tests, quantify attack success rates, and interpret failure modes via attention maps and token-level entropy.

The modular pipeline, spanning data exploration, prompt testing, model inference with safety classification, and interpretability, empowers users to compare attack styles across languages, pinpoint high-risk vectors, and correlate internal uncertainty with unsafe generations in real time. Built with libraries like Hugging Face, pandas, Streamlit, and Plotly, the dashboard bridges theoretical red-teaming and practical safety engineering while maintaining simplicity. It equips researchers and developers with a visual, structured workflow for deriving actionable insights, ultimately fostering more robust, transparent, and safer LLM deployments.
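The token-level entropy signal mentioned above can be sketched as follows. This is a minimal, self-contained example (the function name and use of plain Python lists are illustrative assumptions, not the dashboard's actual code): a peaked next-token distribution means the model is confident, while a flat one signals uncertainty that may correlate with unsafe generations.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution
    implied by a vector of raw logits."""
    # Softmax with the max-subtraction trick for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

# A peaked distribution (confident model) has low entropy;
# a flat one (uncertain model) approaches log(vocab_size).
confident = token_entropy([10.0, 0.0, 0.0, 0.0])
uncertain = token_entropy([1.0, 1.0, 1.0, 1.0])
```

In practice the logits would come from the model's output head at each generation step; averaging or plotting per-token entropy over a response gives the uncertainty trace the dashboard visualizes.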

System Design

Environment Setup

To set up the environment for this project, follow these steps:

conda create -n redteam python=3.10
conda activate redteam
pip install -r requirements.txt

To use Gemma from Hugging Face, you need an authentication token set up as an environment variable. You can do this by running the following command in your terminal:

huggingface-cli login

Running the Project

To run the project, execute the following command in your terminal:

streamlit run RedTeaming_Dashboard.py

About the Dataset

The Aya Red-teaming dataset is a human-annotated multilingual red-teaming dataset consisting of harmful prompts in 8 languages across 9 different categories of harm with explicit labels for "global" and "local" harm.

Languages: Arabic, English, Filipino, French, Hindi, Russian, Serbian, and Spanish

License: Apache 2.0

Loading dataset


Dataset Structure:

{
  'prompt': '...',
  'language': 'French',
  'harm_category': '["Discrimination & Injustice", "Bullying & Harassment"]',
  'global_or_local': 'local',
  'literal_translation': '...',
  'semantic_translation': None,
  'explanation': '...'
}

where:

- prompt: Original human curated prompt
- language: Language of the prompt
- harm_category: Category of harm which the prompt falls under
- global_or_local: Specifies whether the prompt is globally or locally harmful
- literal_translation: Word-for-word translation of the original prompt (None for English)
- semantic_translation: Translation by meaning of the original prompt (None for English)
- explanation: Explanation for why the prompt is harmful and/or globally/locally harmful
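Note that in the sample above, harm_category is stored as a JSON-encoded string rather than a native list, since a prompt can fall under several categories at once. A minimal sketch of handling this with pandas (the rows below are synthetic, mimicking the schema; this is not the dashboard's actual loading code):

```python
import json
import pandas as pd

# Synthetic rows mimicking the dataset schema (values are illustrative).
df = pd.DataFrame([
    {"prompt": "...", "language": "French",
     "harm_category": '["Discrimination & Injustice", "Bullying & Harassment"]',
     "global_or_local": "local"},
    {"prompt": "...", "language": "English",
     "harm_category": '["Profanity"]',
     "global_or_local": "global"},
])

# harm_category is a JSON-encoded string, so decode it into a real list,
# then explode so each (prompt, category) pair gets its own row.
df["harm_category"] = df["harm_category"].apply(json.loads)
counts = df.explode("harm_category")["harm_category"].value_counts()
```

After exploding, `counts` gives the per-category prompt counts that drive the harm-distribution charts.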

Harm Categories:
- Bullying & Harassment
- Discrimination & Injustice
- Graphic Material
- Harms of Representation Allocation & Quality of Service
- Hate Speech
- Non-consensual sexual content
- Profanity
- Self-harm
- Violence, threats & incitement
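The attack-success-rate metric mentioned earlier can be sketched as follows. The function name and the 'safe'/'unsafe' label format are illustrative assumptions; the actual safety classifier's output format may differ.

```python
def attack_success_rate(labels):
    """Fraction of model responses that a safety classifier flagged as
    unsafe. `labels` is a list of 'safe'/'unsafe' strings (an assumed,
    illustrative format)."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == "unsafe") / len(labels)

# e.g. 2 successful jailbreaks out of 5 attempts
asr = attack_success_rate(["safe", "unsafe", "safe", "unsafe", "safe"])
```

Computed per language or per harm category, this ratio is what lets the dashboard surface high-risk attack vectors.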
