GitHub - chen37058/Red-Team-Arxiv-Paper-Update: Awesome Jailbreaking Multimodal Large Language Models (Automatically Updated Every 12 Hours)

Updated on 2025.01.27

Usage instructions: here

Table of Contents

Red Teaming

Red Teaming

Publish Date	Title	Authors	PDF	Code
2025-01-23	Tune In, Act Up: Exploring the Impact of Audio Modality-Specific Edits on Large Audio Language Models in Jailbreak	Erjia Xiao et.al.	2501.13772	null
2025-01-19	Dagger Behind Smile: Fool LLMs with a Happy Ending Story	Xurui Song et.al.	2501.13115	null
2025-01-21	You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense	Wuyuao Mai et.al.	2501.12210	null
2025-01-19	Can Safety Fine-Tuning Be More Principled? Lessons Learned from Cybersecurity	David Williams-King et.al.	2501.11183	null
2025-01-18	Jailbreaking Large Language Models in Infinitely Many Ways	Oliver Goldstein et.al.	2501.10800	null
2025-01-18	Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks	Xin Yi et.al.	2501.10639	null
2024-12-17	What Information Should Be Shared with Whom "Before and During Training"?	Haydn Belfield et.al.	2501.10379	null
2025-01-16	A Survey on Responsible LLMs: Inherent Risk, Malicious Use, and Mitigation Strategy	Huandong Wang et.al.	2501.09431	null
2025-01-14	Playing Devil's Advocate: Unmasking Toxicity and Vulnerabilities in Large Vision-Language Models	Abdulkadir Erol et.al.	2501.09039	null
2025-01-15	SAIF: A Comprehensive Framework for Evaluating the Risks of Generative AI in the Public Sector	Kyeongryul Lee et.al.	2501.08814	null
2025-01-14	Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints	Jonathan Nöther et.al.	2501.08246	null
2025-01-14	Self-Instruct Few-Shot Jailbreaking: Decompose the Attack into Pattern and Behavior Learning	Jiaqi Hua et.al.	2501.07959	link
2025-01-14	Gandalf the Red: Adaptive Security for LLMs	Niklas Pfister et.al.	2501.07927	link
2025-01-13	Lessons From Red Teaming 100 Generative AI Products	Blake Bullwinkel et.al.	2501.07238	null
2025-01-09	Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency	Shiji Zhao et.al.	2501.04931	null
2025-01-05	Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense	Yang Ouyang et.al.	2501.02629	link
2025-01-03	Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models	Ziwei Zheng et.al.	2501.02029	null
2025-01-02	Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs	Joao Fonseca et.al.	2501.02018	null
2025-01-09	Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions	Rachneet Sachdeva et.al.	2501.01872	link
2025-01-03	Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models	Yanjiang Liu et.al.	2501.01830	null
2025-01-09	WeAudit: Scaffolding User Auditors and AI Practitioners in Auditing Generative AI	Wesley Hanwen Deng et.al.	2501.01397	null
2025-01-02	CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models	Johan Wahréus et.al.	2501.01335	link
2024-12-29	Adversarial Negotiation Dynamics in Generative Language Models	Arinbjörn Kolbeinsson et.al.	2501.00069	null
2024-12-28	LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models	Miao Yu et.al.	2501.00055	link
2024-12-30	InfAlign: Inference-aware language model alignment	Ananth Balashankar et.al.	2412.19792	null
2024-12-24	Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning	Alex Beutel et.al.	2412.18693	null
2024-12-25	Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models	Xiaomeng Hu et.al.	2412.18171	null
2024-12-23	Retention Score: Quantifying Jailbreak Risks for Vision Language Models	Zaitang Li et.al.	2412.17544	null
2025-01-05	DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak	Hao Wang et.al.	2412.17522	null
2024-12-22	Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models	Lang Gao et.al.	2412.17034	null
2024-12-22	Robustness of Large Language Models Against Adversarial Attacks	Yiyi Tao et.al.	2412.17011	null
2024-12-21	OpenAI o1 System Card	OpenAI et.al.	2412.16720	null
2024-12-21	POEX: Policy Executable Embodied AI Jailbreak Attacks	Xuancun Lu et.al.	2412.16633	null
2024-12-21	Divide and Conquer: A Hybrid Strategy Defeats Multimodal Large Language Models	Yanxu Mao et.al.	2412.16555	null
2025-01-08	Deliberative Alignment: Reasoning Enables Safer Language Models	Melody Y. Guan et.al.	2412.16339	null
2024-12-20	Logical Consistency of Large Language Models in Fact-checking	Bishwamittra Ghosh et.al.	2412.16100	null
2024-12-20	JailPO: A Novel Black-box Jailbreak Framework via Preference Optimization against Aligned LLMs	Hongyi Li et.al.	2412.15623	null
2024-12-19	SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage	Xiaoning Dong et.al.	2412.15289	null
2025-01-08	Toxicity Detection towards Adaptability to Changing Perturbations	Hankun Kang et.al.	2412.15267	null
2024-12-18	Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation	Aneta Zugecova et.al.	2412.13666	null
2024-12-17	Concept-ROT: Poisoning Concepts in Large Language Models with Model Editing	Keltin Grimes et.al.	2412.13341	link
2024-12-17	Jailbreaking? One Step Is Enough!	Weixiong Zheng et.al.	2412.12621	null
2024-12-17	Subversion Strategy Eval: Evaluating AI's stateless strategic capabilities against control protocols	Alex Mallen et.al.	2412.12480	null
2024-12-13	No Free Lunch for Defending Against Prefilling Attack by In-Context Learning	Zhiyu Xue et.al.	2412.12192	null
2024-12-10	Na'vi or Knave: Jailbreaking Language Models via Metaphorical Avatars	Yu Yan et.al.	2412.12145	null
2024-12-15	SpearBot: Leveraging Large Language Models in a Generative-Critique Framework for Spear-Phishing Email Generation	Qinglin Qi et.al.	2412.11109	null
2024-12-15	Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models	Di Wu et.al.	2412.11041	null
2024-12-14	IntelEX: A LLM-driven Attack-level Threat Intelligence Extraction Framework	Ming Xu et.al.	2412.10872	null
2024-12-14	Towards Action Hijacking of Large Language Model-based Agent	Yuyang Zhang et.al.	2412.10807	null
2024-12-10	Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM	Shaoqing Zhang et.al.	2412.10423	link
2024-12-13	AdvPrefix: An Objective for Nuanced LLM Jailbreaks	Sicheng Zhu et.al.	2412.10321	link
2024-12-12	AI Red-Teaming is a Sociotechnical System. Now What?	Tarleton Gillespie et.al.	2412.09751	null
2024-12-12	Obfuscated Activations Bypass LLM Latent-Space Defenses	Luke Bailey et.al.	2412.09565	null
2024-12-16	Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models	Jiahui Li et.al.	2412.08615	link
2024-12-11	AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models	Mintong Kang et.al.	2412.08608	null
2024-12-11	Model-Editing-Based Jailbreak against Safety-aligned Large Language Models	Yuxi Li et.al.	2412.08201	null
2024-12-11	Antelope: Potent and Concealed Jailbreak Attack Strategy	Xin Zhao et.al.	2412.08156	null
2024-12-11	Evil twins are not that evil: Qualitative insights into machine-generated prompts	Nathanaël Carraz Rakotonirina et.al.	2412.08127	null
2024-12-16	Granite Guardian	Inkit Padhi et.al.	2412.07724	link
2024-12-10	FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks	Bocheng Chen et.al.	2412.07672	null
2024-12-10	TraSCE: Trajectory Steering for Concept Erasure	Anubhav Jain et.al.	2412.07658	link
2024-12-10	PrisonBreak: Jailbreaking Large Language Models with Fewer Than Twenty-Five Targeted Bit-flips	Zachary Coalson et.al.	2412.07192	null
2024-11-03	Poison Attacks and Adversarial Prompts Against an Informed University Virtual Assistant	Ivan A. Fernandez et.al.	2412.06788	null
2024-12-09	Enhancing Adversarial Resistance in LLMs with Recursion	Bryan Li et.al.	2412.06181	null
2025-01-03	Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models	Ma Teng et.al.	2412.05934	link
2024-12-16	PBI-Attack: Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for Toxicity Maximization	Ruoxi Cheng et.al.	2412.05892	null
2024-12-07	PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage	Yuzhou Nie et.al.	2412.05734	link
2024-12-06	BadGPT-4o: stripping safety finetuning from GPT models	Ekaterina Krupkina et.al.	2412.05346	null
2024-12-06	LIAR: Leveraging Alignment (Best-of-N) to Jailbreak LLMs in Seconds	James Beetham et.al.	2412.05232	null
2024-12-19	Best-of-N Jailbreaking	John Hughes et.al.	2412.03556	link
2024-12-04	Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts?	Sravanti Addepalli et.al.	2412.03235	null
2024-12-03	Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach	Tony T. Wang et.al.	2412.02159	null
2024-12-03	Trust & Safety of LLMs and LLMs in Trust & Safety	Doohee You et.al.	2412.02113	null
2024-12-02	Improved Large Language Model Jailbreak Detection via Pretrained Embeddings	Erick Galinkin et.al.	2412.01547	null
2024-12-17	Jailbreak Large Vision-Language Models Through Multi-Modal Linkage	Yu Wang et.al.	2412.00473	link
2024-11-30	Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models	Sanghyun Kim et.al.	2412.00357	null
2024-12-19	PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning	Shenghui Li et.al.	2411.19335	null
2024-11-28	DIESEL -- Dynamic Inference-Guidance via Evasion of Semantic Embeddings in LLMs	Ben Ganon et.al.	2411.19038	null
2024-12-20	Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment	Soumya Suvra Ghosal et.al.	2411.18688	null
2024-11-27	Embodied Red Teaming for Auditing Robotic Foundation Models	Sathwik Karnik et.al.	2411.18676	null
2024-11-28	Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models	Shuyang Hao et.al.	2411.18000	null
2024-11-26	Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats	Jiaxin Wen et.al.	2411.17693	null
2025-01-14	Don't Command, Cultivate: An Exploratory Study of System-2 Alignment	Yuhang Wang et.al.	2411.17075	link
2024-11-25	In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models	Zhi-Yi Chin et.al.	2411.16769	null
2024-11-23	ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain	Haochen Zhao et.al.	2411.16736	link
2024-12-04	"Moralized" Multi-Step Jailbreak Prompts: Black-Box Testing of Guardrails in Large Language Models for Verbal Attacks	Libo Wang et.al.	2411.16730	link
2024-11-29	Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks	Han Wang et.al.	2411.16721	link
2024-11-25	Preventing Jailbreak Prompts as Malicious Tools for Cybercriminals: A Cyber Defense Perspective	Jean Marie Tshimula et.al.	2411.16642	null
2024-11-22	Universal and Context-Independent Triggers for Precise Control of LLM Outputs	Jiashuo Liang et.al.	2411.14738	null
2024-11-21	Global Challenge for Safe and Secure LLMs Track 1	Xiaojun Jia et.al.	2411.14502	null
2024-11-21	GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs	Advik Raj Basani et.al.	2411.14133	link
2024-11-20	A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection	Gabriel Chua et.al.	2411.12946	null
2024-11-27	Playing Language Game with LLMs Leads to Jailbreaking	Yu Peng et.al.	2411.12762	null
2024-12-08	TrojanRobot: Backdoor Attacks Against LLM-based Embodied Robots in the Physical World	Xianlong Wang et.al.	2411.11683	null
2024-11-28	Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models	Chenhang Cui et.al.	2411.11496	link
2024-11-18	The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models	Xikang Yang et.al.	2411.11407	link
2024-11-18	Steering Language Model Refusal with Sparse Autoencoders	Kyle O'Brien et.al.	2411.11296	null
2024-11-17	JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit	Zeqing He et.al.	2411.11114	null
2024-12-09	Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey	Xuannan Liu et.al.	2411.09259	link
2024-11-14	DROJ: A Prompt-Driven Attack against Large Language Models	Leyang Hu et.al.	2411.09125	link
2024-11-13	LLMStinger: Jailbreaking LLMs using RL fine-tuned LLMs	Piyush Jha et.al.	2411.08862	null
2024-11-13	The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense	Yangyang Guo et.al.	2411.08410	null
2024-11-12	Zer0-Jack: A Memory-efficient Gradient-based Jailbreaking Method for Black-box Multi-modal Large Language Models	Tiejin Chen et.al.	2411.07559	null
2024-11-12	Rapid Response: Mitigating LLM Jailbreaks with a Few Examples	Alwin Peng et.al.	2411.07494	null
2024-11-11	HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model Alignment	Yannis Belkhiter et.al.	2411.06835	null
2024-11-10	SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains	Bijoy Ahmed Saiem et.al.	2411.06426	null
2024-11-06	Diversity Helps Jailbreak Large Language Models	Weiliang Zhao et.al.	2411.04223	null
2025-01-07	MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue	Fengxiang Wang et.al.	2411.03814	null
2024-11-02	What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks	Nathalie Maria Kirch et.al.	2411.03343	link
2024-12-05	Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment	Jason Vega et.al.	2411.02785	link
2024-11-03	UniGuard: Towards Universal Safety Guardrails for Jailbreak Attacks on Multimodal Large Language Models	Sejoon Oh et.al.	2411.01703	null
2024-12-10	SQL Injection Jailbreak: a structural disaster of large language models	Jiawei Zhao et.al.	2411.01565	link
2024-11-03	AURA: Amplifying Understanding, Resilience, and Awareness for Responsible AI Content Work	Alice Qian Zhang et.al.	2411.01426	null
2024-12-11	Plentiful Jailbreaks with String Compositions	Brian R. Y. Huang et.al.	2411.01084	null
2024-11-01	Emoji Attack: A Method for Misleading Judge LLMs in Safety Risk Detection	Zhipeng Wei et.al.	2411.01077	link
2024-11-15	IDEATOR: Jailbreaking Large Vision-Language Models Using Themselves	Ruofan Wang et.al.	2411.00827	null
2024-11-26	Desert Camels and Oil Sheikhs: Arab-Centric Red Teaming of Frontier LLMs	Muhammed Saeed et.al.	2410.24049	null
2024-10-31	Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models	Hao Yang et.al.	2410.23861	null
2024-10-31	Adversarial Attacks of Vision Tasks in the Past 10 Years: A Survey	Chiyu Zhang et.al.	2410.23687	null
2024-11-27	Transferable Ensemble Black-box Jailbreak Attacks on Large Language Models	Yiqi Yang et.al.	2410.23558	null
2024-10-30	ProTransformer: Robustify Transformers via Plug-and-Play Paradigm	Zhichao Hou et.al.	2410.23182	link
2024-10-29	Benchmarking LLM Guardrails in Handling Multilingual Toxicity	Yahan Yang et.al.	2410.22153	null
2024-10-29	AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts	Vishal Kumar et.al.	2410.22143	null
2024-10-29	SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types	Yutao Mou et.al.	2410.21965	link
2024-10-28	Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring	Honglin Mu et.al.	2410.21083	null
2024-10-28	BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks	Yunhan Zhao et.al.	2410.20971	null
2024-10-25	RobustKV: Defending Large Language Models against Jailbreak Attacks via KV Eviction	Tanqiu Jiang et.al.	2410.19937	null
2024-10-25	An Auditing Test To Detect Behavioral Shift in Language Models	Leo Richter et.al.	2410.19406	link
2024-10-25	Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities	Chung-En Sun et.al.	2410.18469	link
2024-10-23	Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks	Samuele Poppi et.al.	2410.18210	null
2024-10-23	Guide for Defense (G4D): Dynamic Guidance for Robust and Balanced Defense in Large Language Models	He Cao et.al.	2410.17922	link
2024-10-22	LLM-Assisted Red Teaming of Diffusion Models through "Failures Are Fated, But Can Be Faded"	Som Sagar et.al.	2410.16738	null
2024-11-02	Bayesian scaling laws for in-context learning	Aryaman Arora et.al.	2410.16531	link
2024-11-16	Insights and Current Gaps in Open-Source LLM Vulnerability Scanners: A Comparative Analysis	Jonathan Brokman et.al.	2410.16527	null
2024-10-18	Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs	Rui Pu et.al.	2410.16327	null
2024-10-21	A Realistic Threat Model for Large Language Model Jailbreaks	Valentyn Boreiko et.al.	2410.16222	link
2024-10-21	A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns	Tianyi Men et.al.	2410.16155	null
2024-11-03	Boosting Jailbreak Transferability for Large Language Models	Hanqing Liu et.al.	2410.15645	link
2024-10-21	SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis	Aidan Wong et.al.	2410.15641	link
2024-10-20	Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models	Xiao Li et.al.	2410.15362	null
2024-10-20	Jailbreaking and Mitigation of Vulnerabilities in Large Language Models	Benji Peng et.al.	2410.15236	null
2024-10-16	SoK: Prompt Hacking of Large Language Models	Baha Rababah et.al.	2410.13901	null
2024-10-15	A Formal Framework for Assessing and Mitigating Emergent Security Risks in Generative AI Models: Bridging Theory and Dynamic Risk Mitigation	Aviral Srivastava et.al.	2410.13897	null
2024-10-21	Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents	Priyanshu Kumar et.al.	2410.13886	link
2024-10-17	PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment	Zekun Moore Wang et.al.	2410.13785	null
2024-10-17	Persistent Pre-Training Poisoning of LLMs	Yiming Zhang et.al.	2410.13722	null
2024-11-09	Jailbreaking LLM-Controlled Robots	Alexander Robey et.al.	2410.13691	null
2025-01-02	BiasJailbreak: Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models	Isack Lee et.al.	2410.13334	link
2024-10-17	SPIN: Self-Supervised Prompt INjection	Leon Zhou et.al.	2410.13236	null
2024-10-18	JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework	Fan Liu et.al.	2410.12855	null
2024-10-19	Multi-round jailbreak attack on large language models	Yihua Zhou et.al.	2410.11533	null
2024-10-15	Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models	Hao Yang et.al.	2410.11459	link
2025-01-20	Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation	Qizhang Li et.al.	2410.11317	link
2024-10-15	AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment	Pankayaraj Pathmanathan et.al.	2410.11283	null
2024-10-15	Cognitive Overload Attack: Prompt Injection for Long Context	Bibek Upadhayay et.al.	2410.11272	link
2024-10-14	Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues	Qibing Ren et.al.	2410.10700	link
2024-10-14	On Calibration of LLM-based Guard Models for Reliable Content Moderation	Hongfu Liu et.al.	2410.10414	link
2024-10-14	Jailbreak Instruction-Tuned LLMs via end-of-sentence MLP Re-weighting	Yifan Luo et.al.	2410.10150	null
2024-11-27	BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models	Xinyuan Wang et.al.	2410.09804	null
2024-10-18	VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment	Lei Li et.al.	2410.09421	null
2024-12-17	Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations	Tarun Raheja et.al.	2410.09097	null
2024-10-11	AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation	Zijun Wang et.al.	2410.09040	link
2024-10-14	AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents	Maksym Andriushchenko et.al.	2410.09024	null
2024-11-29	RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process	Peiran Wang et.al.	2410.08660	null
2024-10-09	Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level	Xinyi Zeng et.al.	2410.06809	null
2024-10-04	Developing Assurance Cases for Adversarial Robustness and Regulatory Compliance in LLMs	Tomas Bueno Momcilovic et.al.	2410.05304	null
2024-11-27	AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs	Xiaogeng Liu et.al.	2410.05295	link
2024-10-06	Attention Shift: Steering AI Away from Unsafe Content	Shivank Garg et.al.	2410.04447	null
2024-10-05	Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks	Zi Wang et.al.	2410.04234	null
2024-10-05	Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models	Yiting Dong et.al.	2410.04190	null
2024-10-04	Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step	Wenxuan Wang et.al.	2410.03869	null
2024-10-08	You Know What I'm Saying: Jailbreak Attack via Implicit Reference	Tianyu Wu et.al.	2410.03857	link
2024-12-16	SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks	Tianhao Li et.al.	2410.03769	null
2024-10-23	Gradient-based Jailbreak Images for Multimodal Fusion Models	Javier Rando et.al.	2410.03489	link
2024-10-23	Safeguard is a Double-edged Sword: Denial-of-service Attack on Large Language Models	Qingzhao Zhang et.al.	2410.02916	null
2024-10-02	FlipAttack: Jailbreak LLMs via Flipping	Yue Liu et.al.	2410.02832	link
2024-10-01	PyRIT: A Framework for Security Risk Identification and Red Teaming in Generative AI System	Gary D. Lopez Munoz et.al.	2410.02828	link
2024-10-03	SteerDiff: Steering towards Safe Text-to-Image Diffusion Models	Hongxiang Zhang et.al.	2410.02710	null
2024-10-07	Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models	Guobin Shen et.al.	2410.02298	null
2024-12-18	Data to Defense: The Role of Curation in Customizing LLMs Against Jailbreaking Attacks	Xiaoqun Liu et.al.	2410.02220	null
2024-10-02	Automated Red Teaming with GOAT: the Generative Offensive Agent Tester	Maya Pavlova et.al.	2410.01606	null
2024-10-04	HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models	Seanie Lee et.al.	2410.01524	link
2024-10-02	Information-Theoretical Principled Trade-off between Jailbreakability and Stealthiness on Vision Language Models	Ching-Chia Kao et.al.	2410.01438	null
2024-12-06	Endless Jailbreaks with Bijection Learning	Brian R. Y. Huang et.al.	2410.01294	null
2024-12-19	Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models	Wei Zhao et.al.	2410.00451	link
2024-09-29	Survey of Security and Data Attacks on Machine Unlearning In Financial and E-Commerce	Carl E. J. Brodzinski et.al.	2410.00055	null
2024-09-30	Robust LLM safeguarding via refusal feature adversarial training	Lei Yu et.al.	2409.20089	null
2024-09-28	Overriding Safety protections of Open-source Models	Sachin Kumar et.al.	2409.19476	link
2024-09-27	HM3: Heterogeneous Multi-Class Model Merging	Stefan Hackmann et.al.	2409.19173	null
2024-09-27	Multimodal Pragmatic Jailbreak on Text-to-image Models	Tong Liu et.al.	2409.19149	null
2024-11-08	An Adversarial Perspective on Machine Unlearning for AI Safety	Jakub Łucki et.al.	2409.18025	link
2024-10-04	MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt Attacks	Giandomenico Cornacchia et.al.	2409.17699	null
2024-09-26	RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking	Yifan Jiang et.al.	2409.17458	link
2024-09-25	Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction	Jinchuan Zhang et.al.	2409.16783	link
2024-09-25	RoleBreak: Character Hallucination as a Jailbreak Attack in Role-Playing Systems	Yihong Tang et.al.	2409.16727	null
2024-09-23	Attack Atlas: A Practitioner's Perspective on Challenges and Pitfalls in Red Teaming GenAI	Ambrish Rawat et.al.	2409.15398	null
2024-09-18	Multitask Mayhem: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning	Essa Jan et.al.	2409.15361	null
2024-10-08	Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs	Xueluan Gong et.al.	2409.14866	link
2024-10-03	PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach	Zhihao Lin et.al.	2409.14177	null
2024-10-29	Towards Safe Multilingual Frontier AI	Artūrs Kanepajs et.al.	2409.13708	link
2024-11-05	Jailbreaking Large Language Models with Symbolic Mathematics	Emet Bethany et.al.	2409.11445	null
2024-09-17	Hackphyr: A Local Fine-Tuned LLM Agent for Network Security Environments	Maria Rigaki et.al.	2409.11276	null
2024-09-14	What Is Wrong with My Model? Identifying Systematic Problems with Semantic Data Slicing	Chenyang Yang et.al.	2409.09261	link
2024-09-27	Multi-Robot Coordination Induced in an Adversarial Graph-Traversal Game	James Berneburg et.al.	2409.08222	null
2024-10-19	Securing Large Language Models: Addressing Bias, Misinformation, and Prompt Attacks	Benji Peng et.al.	2409.08087	null
2024-09-12	Unleashing Worms and Extracting Data: Escalating the Outcome of Attacks against RAG-based Inference in Scale and Severity Using Jailbreaking	Stav Cohen et.al.	2409.08045	link
2024-09-12	Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols	Charlie Griffin et.al.	2409.07985	link
2024-09-11	AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs	Lijia Lv et.al.	2409.07503	link
2024-09-11	Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks	Md Zarif Hossain et.al.	2409.07353	null
2024-09-10	DiPT: Enhancing LLM reasoning through diversified perspective-taking	Hoang Anh Just et.al.	2409.06241	null
2024-09-07	Exploring Straightforward Conversational Red-Teaming	George Kour et.al.	2409.04822	null
2024-08-31	HSF: Defending against Jailbreak Attacks with Hidden State Filtering	Cheng Qian et.al.	2409.03788	null
2024-11-29	Conversational Complexity for Assessing Risk in Large Language Models	John Burden et.al.	2409.01247	null
2024-09-01	Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models	Bang An et.al.	2409.00598	link
2024-08-31	Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity, and Fairness	Wenxuan Wang et.al.	2409.00551	null
2024-10-17	PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action	Yijia Shao et.al.	2409.00138	link
2024-08-29	Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks	Tom Gibbs et.al.	2409.00137	null
2024-11-07	FRACTURED-SORRY-Bench: Framework for Revealing Attacks in Conversational Turns Undermining Refusal Efficacy and Defenses over SORRY-Bench (Automated Multi-shot Jailbreaks)	Aman Priyanshu et.al.	2408.16163	null
2024-08-28	Red Team Redemption: A Structured Comparison of Open-Source Tools for Adversary Emulation	Max Landauer et.al.	2408.15645	null
2024-09-05	Legilimens: Practical and Unified Content Moderation for Large Language Model Services	Jialin Wu et.al.	2408.15488	link
2024-09-04	LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet	Nathaniel Li et.al.	2408.15221	null
2024-08-27	Investigating Coverage Criteria in Large Language Models: An In-Depth Study Through Jailbreak Attacks	Shide Zhou et.al.	2408.15207	null
2024-10-05	Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models	Hongfu Liu et.al.	2408.14866	link
2024-08-27	Detecting AI Flaws: Target-Driven Attacks on Internal Faults in Language Models	Yuhao Du et.al.	2408.14853	null
2024-12-15	HTS-Attack: Heuristic Token Search for Jailbreaking Text-to-Image Models	Sensen Gao et.al.	2408.13896	null
2024-08-14	SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming	Anurakt Kumar et.al.	2408.11851	null
2024-09-14	Efficient Detection of Toxic Prompts in Large Language Models	Yi Liu et.al.	2408.11727	null
2024-08-21	Unlocking Adversarial Suffix Optimization Without Affirmative Phrases: Efficient Black-box Jailbreaking via LLM as Optimizer	Weipeng Jiang et.al.	2408.11313	link
2024-08-21	EEG-Defender: Defending against Jailbreak through Early Exit Generation of Large Language Models	Chongwen Zhao et.al.	2408.11308	null
2024-08-20	Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Neural Carrier Articles	Zhilong Wang et.al.	2408.11182	null
2024-08-18	DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization	Pucheng Dang et.al.	2408.11071	null
2025-01-02	Security Attacks on LLM-based Code Completion Tools	Wen Cheng et.al.	2408.11006	link
2025-01-02	Perception-guided Jailbreak against Text-to-Image Models	Yihao Huang et.al.	2408.10848	null
2024-08-20	Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique	Tej Deep Pala et.al.	2408.10701	link
2024-08-20	Towards Robust Knowledge Unlearning: An Adversarial Framework for Assessing and Improving Unlearning Robustness in Large Language Models	Hongbang Yuan et.al.	2408.10682	null
2024-08-26	Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation	Haoyu Wang et.al.	2408.10668	null
2024-08-18	Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks	Kexin Chen et.al.	2408.09326	null
2025-01-10	BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger	Yulin Chen et.al.	2408.09093	null
2024-08-22	Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks	Jiawei Zhao et.al.	2408.08924	link
2024-08-11	Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search	Robert J. Moss et.al.	2408.08899	link
2024-10-22	MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Multimodal Large Language Models	Fenghua Weng et.al.	2408.08464	link
2024-12-19	Alignment-Enhanced Decoding: Defending via Token-Level Adaptive Refining of Probability Distributions	Quan Liu et.al.	2408.07663	link
2024-12-14	On Effects of Steering Latent Representation for Large Language Model Unlearning	Dang Huu-Tien et.al.	2408.06223	link
2024-08-09	A Jailbroken GenAI Model Can Cause Substantial Harm: GenAI-powered Applications are Vulnerable to PromptWares	Stav Cohen et.al.	2408.05061	link
2024-09-13	h4rm3l: A Dynamic Benchmark of Composable Jailbreak Attacks for LLM Safety Assessment	Moussa Koulako Bala Doumbouya et.al.	2408.04811	null
2024-08-08	Multi-Turn Context Jailbreak Attack on Large Language Models From First Principles	Xiongtao Sun et.al.	2408.04686	null
2024-08-08	Compromesso! Italian Many-Shot Jailbreaks Undermine the Safety of Large Language Models	Fabio Pernisi et.al.	2408.04522	null
2024-08-07	EnJa: Ensemble Jailbreak on Large Language Models	Jiahao Zhang et.al.	2408.03603	null
2024-12-27	Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws	Dillon Bowen et.al.	2408.02946	link
2024-08-05	Can Reinforcement Learning Unlock the Hidden Dangers in Aligned Large Language Models?	Mohammad Bahrami Karkevandi et.al.	2408.02651	null
2024-12-23	SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models	Muxi Diao et.al.	2408.02632	null
2024-08-02	Mission Impossible: A Statistical Perspective on Jailbreaking LLMs	Jingtong Su et.al.	2408.01420	null
2024-08-01	WHITE PAPER: A Brief Exploration of Data Exfiltration using GCG Suffixes	Victor Valbuena et.al.	2408.00925	null
2024-09-14	Tamper-Resistant Safeguards for Open-Weight LLMs	Rishub Tamirisa et.al.	2408.00761	link
2024-09-09	Jailbreaking Text-to-Image Models with LLM-Based Agents	Yingkai Dong et.al.	2408.00523	null
2024-10-17	Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models	Yue Xu et.al.	2407.21659	link
2025-01-16	Direct Unlearning Optimization for Robust and Safe Text-to-Image Models	Yong-Hyun Park et.al.	2407.21035	null
2024-10-24	Effects of Scale on Language Model Robustness	Nikolaus Howe et.al.	2407.18213	null
2024-12-24	The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models	Zihui Wu et.al.	2407.17915	link
2024-10-01	FLRT: Fluent Student-Teacher Redteaming	T. Ben Thompson et.al.	2407.17447	link
2024-10-07	Revisiting Who's Harry Potter: Towards Targeted Unlearning from a Causal Intervention Perspective	Yujian Liu et.al.	2407.16997	link
2024-12-31	From Sands to Mansions: Simulating Full Attack Chain with LLM-Organized Knowledge	Lingzhi Wang et.al.	2407.16928	null
2024-08-23	Can Large Language Models Automatically Jailbreak GPT-4V?	Yuanwei Wu et.al.	2407.16686	null
2024-07-23	RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent	Huiyu Xu et.al.	2407.16667	null
2024-10-26	Course-Correction: Safety Alignment Using Synthetic Preferences	Rongwu Xu et.al.	2407.16637	link
2024-07-23	PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing	Blazej Manczak et.al.	2407.16318	link
2024-08-13	Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models	Shi Lin et.al.	2407.16205	link
2024-07-26	Breaking the Global North Stereotype: A Global South-centric Benchmark Dataset for Auditing and Mitigating Biases in Facial Recognition Systems	Siddharth D Jaiswal et.al.	2407.15810	null
2024-08-21	Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs	Abhay Sheshadri et.al.	2407.15549	link
2024-12-16	Failures to Find Transferable Image Jailbreaks Between Vision-Language Models	Rylan Schaeffer et.al.	2407.15211	null
2024-07-21	Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts	Yi Liu et.al.	2407.15050	null
2024-07-23	RogueGPT: dis-ethical tuning transforms ChatGPT4 into a Rogue AI in 158 Words	Alessio Buscemi et.al.	2407.15009	null
2024-07-20	Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)	Apurv Verma et.al.	2407.14937	link
2024-08-23	Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle	Emman Haider et.al.	2407.13833	null
2024-07-16	Continuous Embedding Attacks via Clipped Inputs in Jailbreaking Large Language Models	Zihao Xu et.al.	2407.13796	link
2024-07-18	LLMs as Function Approximators: Terminology, Taxonomy, and Questions for Evaluation	David Schlangen et.al.	2407.13744	null
2024-07-17	AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases	Zhaorun Chen et.al.	2407.12784	link
2024-10-28	Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models	Chao Gong et.al.	2407.12383	link
2024-07-17	The Better Angels of Machine Personality: How Personality Relates to LLM Safety	Jie Zhang et.al.	2407.12344	link
2024-10-03	Does Refusal Training in LLMs Generalize to the Past Tense?	Maksym Andriushchenko et.al.	2407.11969	link
2024-08-21	What Makes and Breaks Safety Fine-tuning? A Mechanistic Study	Samyak Jain et.al.	2407.10264	null
2024-07-13	MOAT: Securely Mitigating Rowhammer with Per-Row Activation Counters	Moinuddin Qureshi et.al.	2407.09995	null
2024-10-18	ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Likely Toxic Prompts	Amelia F. Hardy et.al.	2407.09447	link
2024-09-06	Soft Prompts Go Hard: Steering Visual Language Models with Hidden Meta-Instructions	Tingwei Zhang et.al.	2407.08970	link
2024-07-11	Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing	Huanqian Wang et.al.	2407.08770	link
2024-07-11	Are Large Language Models Really Bias-Free? Jailbreak Prompts for Assessing Adversarial Robustness to Bias Elicitation	Riccardo Cantini et.al.	2407.08441	null
2024-09-11	The Human Factor in AI Red Teaming: Perspectives from Social and Collaborative Computing	Alice Qian Zhang et.al.	2407.07786	null
2024-07-12	A Survey of Attacks on Large Vision-Language Models: Resources, Advances, and Future Trends	Daizong Liu et.al.	2407.07403	link
2024-09-08	T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models	Yibo Miao et.al.	2407.05965	null
2024-07-08	$R^2$-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning	Mintong Kang et.al.	2407.05557	link
2024-07-06	Safe Generative Chats in a WhatsApp Intelligent Tutoring System	Zachary Levonian et.al.	2407.04915	null
2024-08-30	Jailbreak Attacks and Defenses Against Large Language Models: A Survey	Sibo Yi et.al.	2407.04295	null
2024-12-21	Automated Progressive Red Teaming	Bojian Jiang et.al.	2407.03876	link
2024-07-03	Soft Begging: Modular and Efficient Shielding of LLMs against Prompt Injection and Jailbreaking based on Prompt Tuning	Simon Ostermann et.al.	2407.03391	null
2024-07-03	SOS! Soft Prompt Attack Against Open-Source Large Language Models	Ziqing Yang et.al.	2407.03160	null
2024-07-03	JailbreakHunter: A Visual Analytics Approach for Jailbreak Prompts Discovery from Large-Scale Human-LLM Conversational Datasets	Zhihua Jin et.al.	2407.03045	null
2024-11-05	Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks	Zhexin Zhang et.al.	2407.02855	link
2024-10-30	Breach By A Thousand Leaks: Unsafe Information Leakage in 'Safe' AI Responses	David Glukhov et.al.	2407.02551	null
2024-08-26	Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything	Xiaotian Zou et.al.	2407.02534	null
2024-07-02	SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack	Yan Yang et.al.	2407.01902	link
2024-07-01	Purple-teaming LLMs with Adversarial Defender Training	Jingyan Zhou et.al.	2407.01850	null
2024-07-25	JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models	Haibo Jin et.al.	2407.01599	link
2024-07-01	Enhancing the Capability and Robustness of Large Language Models through Reinforcement Learning-Driven Query Refinement	Zisu Huang et.al.	2407.01461	link
2024-07-01	Badllama 3: removing safety finetuning from Llama 3 in minutes	Dmitrii Volkov et.al.	2407.01376	null
2024-09-23	Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks	Yue Zhou et.al.	2407.00869	link
2024-10-01	Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference	Anton Xue et.al.	2407.00075	null
2024-07-11	Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection	Yuqi Zhou et.al.	2406.19845	null
2024-10-03	Jailbreaking LLMs with Arabic Transliteration and Arabizi	Mansour Al Ghanim et.al.	2406.18725	link
2024-07-08	The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm	Aakanksha et.al.	2406.18682	null
2024-06-26	WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models	Liwei Jiang et.al.	2406.18510	link
2024-12-09	WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs	Seungju Han et.al.	2406.18495	link
2024-06-26	Poisoned LangChain: Jailbreak LLMs by LangChain	Ziqiu Wang et.al.	2406.18122	null
2024-12-24	SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance	Caishuang Huang et.al.	2406.18118	link
2024-06-25	CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference	Erxin Yu et.al.	2406.17626	link
2024-06-25	Leveraging Reinforcement Learning in Red Teaming for Advanced Ransomware Attack Simulations	Cheng Wang et.al.	2406.17576	null
2024-06-21	Steering Without Side Effects: Improving Post-Deployment Control of Language Models	Asa Cooper Stickland et.al.	2406.15518	link
2024-11-02	Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding	Haneul Yoo et.al.	2406.15481	link
2024-06-21	From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking	Siyuan Wang et.al.	2406.14859	null
2024-07-01	Adversaries Can Misuse Combinations of Safe Models	Erik Jones et.al.	2406.14595	null
2025-01-17	Jailbreaking as a Reward Misspecification Problem	Zhihui Xie et.al.	2406.14393	link
2024-06-20	Finding Safety Neurons in Large Language Models	Jianhui Chen et.al.	2406.14144	null
2024-06-19	ObscurePrompt: Jailbreaking Large Language Models via Obscure Input	Yue Huang et.al.	2406.13662	link
2024-08-21	SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation	Xiaoze Liu et.al.	2406.12975	link
2025-01-07	ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates	Fengqing Jiang et.al.	2406.12935	link
2024-06-21	[WIP] Jailbreak Paradox: The Achilles' Heel of LLMs	Abhinav Rao et.al.	2406.12702	null
2024-06-17	Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner	Kenneth Li et.al.	2406.11978	link
2024-10-16	CELL your Model: Contrastive Explanations for Large Language Models	Ronny Luss et.al.	2406.11785	null
2024-10-23	STAR: SocioTechnical Approach to Red Teaming Language Models	Laura Weidinger et.al.	2406.11757	null
2024-10-30	Refusal in Language Models Is Mediated by a Single Direction	Andy Arditi et.al.	2406.11717	link
2024-06-17	Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack	Shangqing Tu et.al.	2406.11682	link
2024-06-17	"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak	Lingrui Mei et.al.	2406.11668	link
2024-06-17	Ruby Teaming: Improving Quality Diversity Search with Memory for Automated Red Teaming	Vernon Toh Yan Han et.al.	2406.11654	null
2024-06-16	garak: A Framework for Security Probing Large Language Models	Leon Derczynski et.al.	2406.11036	link
2024-06-16	Threat Modelling and Risk Analysis for Large Language Model (LLM)-Powered Applications	Stephen Burabari Tete et.al.	2406.11007	null
2024-12-02	Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis	Yuping Lin et.al.	2406.10794	link
2024-11-06	Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs	Zhao Xu et.al.	2406.09324	link
2024-06-13	JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models	Delong Ran et.al.	2406.09321	link
2024-10-05	Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models	Sarah Ball et.al.	2406.09289	link
2024-07-19	Exploiting Uncommon Text-Encoded Structures for Automated Jailbreaks in LLMs	Bangxin Li et.al.	2406.08754	null
2024-06-13	RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs	Xuan Chen et.al.	2406.08725	null
2024-12-18	When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search	Xuan Chen et.al.	2406.08705	link
2024-06-13	MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models	Tianle Gu et.al.	2406.07594	link
2024-07-14	Merging Improves Self-Critique Against Jailbreak Attacks	Victor Gallego et.al.	2406.07188	link
2024-12-06	MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models	Yichi Zhang et.al.	2406.07057	null
2024-06-07	Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs	Fan Liu et.al.	2406.06622	null
2024-07-03	Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks	Zonghao Ying et.al.	2406.06302	link
2024-06-10	Safety Alignment Should Be Made More Than Just a Few Tokens Deep	Xiangyu Qi et.al.	2406.05946	link
2024-06-13	How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States	Zhenhong Zhou et.al.	2406.05644	link
2024-09-05	SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner	Xunguang Wang et.al.	2406.05498	null
2024-06-08	Is On-Device AI Broken and Exploitable? Assessing the Trust and Ethics in Small Language Models	Kalyan Nakka et.al.	2406.05364	null
2024-07-01	Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt	Zonghao Ying et.al.	2406.04031	link
2024-06-06	AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens	Lin Lu et.al.	2406.03805	null
2024-09-25	Ranking Manipulation for Conversational Search Engines	Samuel Pfrommer et.al.	2406.03589	link
2024-06-03	Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits	Andis Draguns et.al.	2406.02619	link
2024-05-28	Are PPO-ed Language Models Hackable?	Suraj Anand et.al.	2406.02577	null
2025-01-21	QROA: A Black-Box Query-Response Optimization Attack on LLMs	Hussein Jawad et.al.	2406.02044	link
2024-10-30	Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses	Xiaosen Zheng et.al.	2406.01288	link
2024-11-03	Are you still on track!? Catching LLM Task Drift with Activations	Sahar Abdelnabi et.al.	2406.00799	link
2024-06-01	Exploring Vulnerabilities and Protections in Large Language Models: A Survey	Frank Weizhen Liu et.al.	2406.00240	null
2024-07-29	Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization	Yuanpu Cao et.al.	2406.00045	link
2024-06-05	Improved Techniques for Optimization-Based Jailbreaking on Large Language Models	Xiaojun Jia et.al.	2405.21018	link
2024-11-01	Improved Generation of Adversarial Examples Against Safety-aligned LLMs	Qizhang Li et.al.	2405.20778	link
2024-08-21	Medical MLLM is Vulnerable: Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models	Xijie Huang et.al.	2405.20775	link
2024-06-12	Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character	Siyuan Ma et.al.	2405.20773	null
2024-06-04	Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens	Jiahao Yu et.al.	2405.20653	null
2024-05-30	Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters	Haibo Jin et.al.	2405.20413	null
2024-10-17	TAIA: Large Language Models are Out-of-Distribution Data Learners	Shuyang Jiang et.al.	2405.20192	link
2024-05-30	Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks	Chen Xiong et.al.	2405.20099	null
2024-05-30	Efficient LLM-Jailbreaking by Introducing Visual Modality	Zhenxing Niu et.al.	2405.20015	null
2024-05-30	AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization	Jiawei Chen et.al.	2405.19668	null
2024-10-11	ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users	Guanlin Li et.al.	2405.19360	link
2024-05-31	Robustifying Safety-Aligned Large Language Models through Clean Data Curation	Xiaoqun Liu et.al.	2405.19358	null
2024-05-29	ConceptPrune: Concept Editing in Diffusion Models via Skilled Neuron Pruning	Ruchika Chavhan et.al.	2405.19237	link
2024-05-29	Voice Jailbreak Attacks Against GPT-4o	Xinyue Shen et.al.	2405.19103	link
2024-12-20	DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints	Andrew Zhao et.al.	2405.19026	link
2024-10-20	Quantitative Certification of Bias in Large Language Models	Isha Chaudhary et.al.	2405.18780	link
2024-11-18	A Theoretical Understanding of Self-Correction through In-context Alignment	Yifei Wang et.al.	2405.18634	null
2024-05-28	Learning diverse attacks on large language models for robust red-teaming and safety tuning	Seanie Lee et.al.	2405.18540	null
2024-06-14	Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing	Wei Zhao et.al.	2405.18166	link
2024-10-14	White-box Multimodal Jailbreaks Against Large Vision-Language Models	Ruofan Wang et.al.	2405.17894	link
2024-05-28	Automatic Jailbreaking of the Text-to-Image Generative AI Systems	Minseon Kim et.al.	2405.16567	link
2024-05-24	Hacc-Man: An Arcade Game for Jailbreaking LLMs	Matheus Valentim et.al.	2405.15902	null
2024-10-08	Extracting Prompts by Inverting LLM Outputs	Collin Zhang et.al.	2405.15012	link
2024-10-30	Representation Noising: A Defence Mechanism Against Harmful Finetuning	Domenic Rosati et.al.	2405.14577	link
2024-05-23	Impact of Non-Standard Unicode Characters on Security and Comprehension in Large Language Models	Johan S Daniel et.al.	2405.14490	link
2024-05-22	WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response	Tianrong Zhang et.al.	2405.14023	null
2024-05-22	Safety Alignment for Vision Language Models	Zhendong Liu et.al.	2405.13581	null
2024-07-07	TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models	Pengzhou Cheng et.al.	2405.13401	null
2024-10-15	GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation	Govind Ramesh et.al.	2405.13077	null
2024-06-19	Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation	Yuxi Li et.al.	2405.13068	link
2024-06-17	Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming	Jiaxu Liu et.al.	2405.12604	null
2024-05-29	Single Image Unlearning: Efficient Machine Unlearning in Multimodal Large Language Models	Jiaqi Li et.al.	2405.12523	null
2024-08-06	Hummer: Towards Limited Competitive Preference Dataset	Li Jiang et.al.	2405.11647	null
2024-05-15	Benchmark Early and Red Team Often: A Framework for Assessing and Managing Dual-Use Hazards of AI Foundation Models	Anthony M. Barrett et.al.	2405.10986	null
2024-10-05	Red Teaming Language Models for Processing Contradictory Dialogues	Xiaofei Wen et.al.	2405.10128	link
2024-05-15	Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization	Kai Hu et.al.	2405.09113	null
2024-05-15	A safety realignment framework via subspace-oriented model fusion for large language models	Xin Yi et.al.	2405.09055	link
2024-05-14	SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models	Raghuveer Peri et.al.	2405.08317	null
2024-05-14	PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition	Ziyang Zhang et.al.	2405.07932	link
2024-05-14	PLeak: Prompt Leaking Attacks against Large Language Model Applications	Bo Hui et.al.	2405.06823	link
2024-08-29	Mitigating Exaggerated Safety in Large Language Models	Ruchira Ray et.al.	2405.05418	null
2024-05-07	Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbreak Attacks	Georgios Pantazopoulos et.al.	2405.04403	link
2024-05-07	Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent	Shang Shang et.al.	2405.03654	null
2024-05-03	Aloe: A Family of Fine-tuned Open Healthcare LLMs	Ashwin Kumar Gururajan et.al.	2405.01886	null
2024-05-02	Boosting Jailbreak Attack with Momentum	Yihao Zhang et.al.	2405.01229	link
2024-05-10	Evaluating and Mitigating Linguistic Discrimination in Large Language Models	Guoliang Dong et.al.	2404.18534	null
2024-04-26	Probabilistic Inference in Language Models via Twisted Sequential Monte Carlo	Stephen Zhao et.al.	2404.17546	link
2024-04-21	AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs	Anselm Paulus et.al.	2404.16873	link
2024-10-12	Don't Say No: Jailbreaking LLM by Suppressing Refusal	Yukai Zhou et.al.	2404.16369	link
2024-04-24	Universal Adversarial Triggers Are Not Universal	Nicholas Meade et.al.	2404.16020	link
2024-04-23	Bias patterns in the application of LLMs for clinical decision support: A comprehensive study	Raphael Poulain et.al.	2404.15149	link
2024-04-23	A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI	Seliem El-Sayed et.al.	2404.15058	null
2024-06-06	Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs	Javier Rando et.al.	2404.14461	link
2024-10-10	Protecting Your LLMs with Information Bottleneck	Zichuan Liu et.al.	2404.13968	link
2024-04-19	The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions	Eric Wallace et.al.	2404.13208	null
2024-04-18	Advancing the Robustness of Large Language Models through Self-Denoised Smoothing	Jiabao Ji et.al.	2404.12274	link
2024-04-12	JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models	Yingchaojie Feng et.al.	2404.08793	null
2024-06-24	ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming	Simone Tedeschi et.al.	2404.08676	link
2024-04-12	Subtoxic Questions: Dive Into Attitude Change of LLM's Response in Jailbreak Attempts	Tianyu Zhang et.al.	2404.08309	null
2024-11-24	AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs	Zeyi Liao et.al.	2404.07921	link
2024-04-10	CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge	Yu Ying Chiu et.al.	2404.06664	null
2024-05-07	Rethinking How to Evaluate Language Model Jailbreak	Hongyu Cai et.al.	2404.06407	link
2024-07-03	Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge	Weikai Lu et.al.	2404.05880	link
2024-04-16	Hidden You Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Logic Chain Injection	Zhilong Wang et.al.	2404.04849	null
2024-09-09	Fine-Tuning, Quantization, and LLMs: Navigating Unintended Outcomes	Divyanshu Kumar et.al.	2404.04392	null
2024-12-15	Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks?	Shuo Chen et.al.	2404.03411	link
2024-11-24	JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks	Weidi Luo et.al.	2404.03027	null
2024-09-04	Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models	Jiachen Ma et.al.	2404.02928	null
2024-04-03	Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game	Qianqiao Xu et.al.	2404.02532	null
2024-10-07	Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks	Maksym Andriushchenko et.al.	2404.02151	link
2024-04-02	Red-Teaming Segment Anything Model	Krzysztof Jankowski et.al.	2404.02067	link
2024-09-24	Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack	Mark Russinovich et.al.	2404.01833	null
2024-10-31	JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models	Patrick Chao et.al.	2404.01318	link
2024-08-20	What is in Your Safe Data? Identifying Benign Data that Breaks Safety	Luxi He et.al.	2404.01099	link
2024-11-26	Against The Achilles' Heel: A Survey on Red Teaming for Generative Models	Lizhi Lin et.al.	2404.00629	link
2024-12-27	Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code	Taishi Nakamura et.al.	2404.00399	null
2024-12-08	Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation	Yutong He et.al.	2403.19103	null
2024-03-27	IterAlign: Iterative Constitutional Alignment of Large Language Models	Xiusi Chen et.al.	2403.18341	null
2024-11-15	Optimization-based Prompt Injection Attack to LLM-as-a-Judge	Jiawen Shi et.al.	2403.17710	link
2024-09-30	Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models	Zhiyuan Yu et.al.	2403.17336	null
2024-03-22	Risk and Response in Large Language Models: Evaluating Key Threat Categories	Bahareh Harandizadeh et.al.	2403.14988	null
2024-06-24	Testing the Limits of Jailbreaking Defenses with the Purple Problem	Taeyoun Kim et.al.	2403.14725	link
2024-07-23	RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content	Zhuowen Yuan et.al.	2403.13031	link
2024-03-18	EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models	Weikang Zhou et.al.	2403.12171	link
2024-05-14	Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation	Jessica Quaye et.al.	2403.12075	link
2025-01-13	Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models	Yifan Li et.al.	2403.09792	link
2024-10-15	Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation	Yunhao Gou et.al.	2403.09572	null
2024-03-14	AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting	Yu Wang et.al.	2403.09513	link
2024-07-17	The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?	Qinyu Zhao et.al.	2403.09037	link
2024-03-19	Review of Generative AI Methods in Cybersecurity	Yagmur Yigit et.al.	2403.08701	null
2024-09-30	Distract Large Language Models for Automatic Jailbreak Attack	Zeguan Xiao et.al.	2403.08424	link
2024-03-14	HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback	Ang Li et.al.	2403.08309	null
2024-03-14	Red Teaming Models for Hyperspectral Image Analysis Using Explainable AI	Vladimir Zaigrajew et.al.	2403.08017	null
2024-08-22	Defending Against Unforeseen Failure Modes with Latent Adversarial Training	Stephen Casper et.al.	2403.05030	link
2024-03-07	A Safe Harbor for AI Evaluation and Red Teaming	Shayne Longpre et.al.	2403.04893	null
2024-11-14	AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks	Yifan Zeng et.al.	2403.04783	link
2024-03-11	Using Hallucinations to Bypass GPT4's Filter	Benjamin Lemkin et.al.	2403.04769	null
2024-10-04	Aligners: Decoupling LLMs and Alignment	Lilian Ngweta et.al.	2403.04224	link
2024-03-06	ImgTrojan: Jailbreaking Vision-Language Models with ONE Image	Xijia Tao et.al.	2403.02910	link
2024-03-05	Here Comes The AI Worm: Unleashing Zero-click Worms that Target GenAI-Powered Applications	Stav Cohen et.al.	2403.02817	link
2024-03-02	AutoAttacker: A Large Language Model Guided System to Implement Automatic Cyber-attacks	Jiacen Xu et.al.	2403.01038	null
2024-11-07	Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes	Xiaomeng Hu et.al.	2403.00867	null
2024-02-28	TroubleLLM: Align to Red Team Expert	Zhuoer Xu et.al.	2403.00829	null
2024-09-19	Enhancing Jailbreak Attacks with Diversity Guidance	Xu Zhang et.al.	2403.00292	null
2024-02-29	Curiosity-driven Red-teaming for Large Language Models	Zhang-Wei Hong et.al.	2402.19464	link
2024-06-10	Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction	Tong Liu et.al.	2402.18104	link
2024-10-30	Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue	Zhenhong Zhou et.al.	2402.17262	null
2024-11-11	DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers	Xirui Li et.al.	2402.16914	link
2024-02-26	CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models	Huijie Lv et.al.	2402.16717	link
2024-06-06	Defending LLMs against Jailbreaking Attacks via Backtranslation	Yihan Wang et.al.	2402.16459	link
2024-02-28	Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing	Jiabao Ji et.al.	2402.16192	link
2024-06-04	ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings	Hao Wang et.al.	2402.16006	null
2024-02-24	PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails	Neal Mangaokar et.al.	2402.15911	null
2024-03-04	LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper	Daoyuan Wu et.al.	2402.15727	null
2024-02-24	Foot In The Door: Understanding Large Language Model Jailbreaking via Cognitive Psychology	Zhenhua Wang et.al.	2402.15690	null
2024-02-23	Fast Adversarial Attacks on Language Models In One GPU Minute	Vinu Sankar Sadasivan et.al.	2402.15570	link
2024-11-16	How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries	Somnath Banerjee et.al.	2402.15302	link
2024-02-27	Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement	Heegyu Kim et.al.	2402.15180	null
2024-06-20	Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment	Jiongxiao Wang et.al.	2402.14968	null
2024-02-27	Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs	Xiaoxia Li et.al.	2402.14872	null
2024-06-18	Is the System Message Really Important to Jailbreaks in Large Language Models?	Xiaotian Zou et.al.	2402.14857	null
2024-02-21	Coercing LLMs to do and reveal (almost) anything	Jonas Geiping et.al.	2402.14020	link
2024-02-26	AttackGNN: Red-Teaming GNNs in Hardware Security Using Reinforcement Learning	Vasudev Gohil et.al.	2402.13946	null
2024-02-21	Round Trip Translation Defence against Large Language Model Jailbreaking Attacks	Canaan Yung et.al.	2402.13517	link
2024-05-29	GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis	Yueqi Xie et.al.	2402.13494	link
2024-05-17	A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models	Zihao Xu et.al.	2402.13457	link
2024-07-05	Defending Jailbreak Prompts via In-Context Adversarial Game	Yujun Zhou et.al.	2402.13148	null
2024-06-06	TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification	Martin Gubri et.al.	2402.12991	link
2024-06-07	ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs	Fengqing Jiang et.al.	2402.11753	link
2024-08-16	ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages	Junjie Ye et.al.	2402.10753	link
2024-10-23	When "Competency" in Reasoning Opens the Door to Vulnerability: Jailbreaking LLMs via Novel Complex Ciphers	Divij Handa et.al.	2402.10601	link
2024-08-27	A StrongREJECT for Empty Jailbreaks	Alexandra Souly et.al.	2402.10260	link
2024-02-15	A Trembling House of Cards? Mapping Adversarial Attacks against Language Agents	Lingbo Mo et.al.	2402.10196	link
2024-10-02	Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks	Yixin Cheng et.al.	2402.09177	null
2024-02-16	Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues	Zhiyuan Chang et.al.	2402.09091	null
2024-07-25	SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding	Zhangchen Xu et.al.	2402.08983	link
2024-06-07	COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability	Xingang Guo et.al.	2402.08679	link
2024-06-03	Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast	Xiangming Gu et.al.	2402.08567	link
2024-02-13	Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning	Gelei Deng et.al.	2402.08416	null
2024-10-31	Fight Back Against Jailbreaking via Prompt Adversarial Tuning	Yichuan Mo et.al.	2402.06255	link
2024-12-16	Comprehensive Assessment of Jailbreak Attacks Against LLMs	Junjie Chu et.al.	2402.05668	link
2024-02-08	Rapid Optimization for Jailbreaking LLMs via Subconscious Exploitation and Echopraxia	Guangyu Shen et.al.	2402.05467	link
2024-10-24	Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications	Boyi Wei et.al.	2402.05162	null
2024-02-27	HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal	Mantas Mazeika et.al.	2402.04249	link
2024-02-05	Nevermind: Instruction Override and Moderation in Large Language Models	Edward Kim et.al.	2402.03303	null
2024-05-30	GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models	Haibo Jin et.al.	2402.03299	null
2024-02-04	Jailbreaking Attack against Multimodal Large Language Model	Zhenxing Niu et.al.	2402.02309	link
2024-06-17	Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models	Yongshuo Zong et.al.	2402.02207	link
2024-01-25	MULTIVERSE: Exposing Large Language Model Alignment Problems in Diverse Worlds	Xiaolong Jin et.al.	2402.01706	null
2024-11-14	Security and Privacy Challenges of Large Language Models: A Survey	Badhan Chandra Das et.al.	2402.00888	null
2024-02-01	Investigating Bias Representations in Llama 2 Chat via Activation Steering	Dawn Lu et.al.	2402.00402	null
2024-06-03	On Prompt-Driven Safeguarding for Large Language Models	Chujie Zheng et.al.	2401.18018	link
2024-11-08	Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks	Andy Zhou et.al.	2401.17263	link
2024-02-05	Weak-to-Strong Jailbreaking on Large Language Models	Xuandong Zhao et.al.	2401.17256	link
2024-01-30	A Cross-Language Investigation into Jailbreak Attacks in Large Language Models	Jie Li et.al.	2401.16765	null
2024-01-30	Gradient-Based Language Model Red Teaming	Nevan Wichers et.al.	2401.16656	link
2024-01-29	Towards Red Teaming in Multimodal and Multilingual Translation	Christophe Ropers et.al.	2401.16247	null
2024-08-27	Red-Teaming for Generative AI: Silver Bullet or Security Theater?	Michael Feffer et.al.	2401.15897	null
2024-01-23	Red Teaming Visual Language Models	Mukai Li et.al.	2401.12915	null
2024-01-24	Digital cloning of online social networks for language-sensitive agent-based modeling of misinformation spread	Prateek Puri et.al.	2401.12509	null
2024-07-10	The Ethics of Interaction: Mitigating Security Threats in LLMs	Ashutosh Kumar et.al.	2401.12273	null
2024-01-20	InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance	Pengyu Wang et.al.	2401.11206	link
2024-10-31	Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning	Adib Hasan et.al.	2401.10862	link
2024-05-16	Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models	Rima Hazra et.al.	2401.10647	link
2024-02-12	All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks	Kazuhiro Takemoto et.al.	2401.09798	link
2024-08-03	AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models	Dong Shu et.al.	2401.09002	null
2024-12-24	Revisiting Jailbreaking for Large Language Models: A Representation Engineering Perspective	Tianlong Li et.al.	2401.06824	null
2024-12-16	Intention Analysis Makes LLMs A Good Jailbreak Defender	Yuqi Zhang et.al.	2401.06561	link
2024-01-23	How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs	Yi Zeng et.al.	2401.06373	link
2024-01-11	Combating Adversarial Attacks with Multi-Agent Debate	Steffi Chern et.al.	2401.05998	link
2024-04-01	The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Model Performance	Abel Salinas et.al.	2401.03729	link
2024-08-19	Malla: Demystifying Real-world Large Language Model Integrated Malicious Services	Zilong Lin et.al.	2401.03315	link
2024-01-03	A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity	Andrew Lee et.al.	2401.01967	link
2023-12-30	Red Teaming for Large Language Models At Scale: Tackling Hallucinations on Mathematics Tasks	Aleksander Buszydlik et.al.	2401.00290	link
2023-12-28	Scalable and automated Evaluation of Blue Team cyber posture in Cyber Ranges	Federica Bianchi et.al.	2312.17221	null
2024-08-04	Exploiting Novel GPT-4 APIs	Kellin Pelrine et.al.	2312.14302	link
2023-12-12	Maatphor: Automated Variant Analysis for Prompt Injection Attacks	Ahmed Salem et.al.	2312.11513	null
2023-12-08	A Red Teaming Framework for Securing AI in Maritime Autonomous Systems	Mathew J. Walter et.al.	2312.11500	null
2024-06-18	JailGuard: A Universal Detection Framework for LLM Prompt-based Attacks	Xiaoyu Zhang et.al.	2312.10766	null
2023-12-16	Comprehensive Evaluation of ChatGPT Reliability Through Multilingual Inquiries	Poorna Chander Reddy Puttaparthi et.al.	2312.10524	link
2023-12-04	Generative AI in Writing Research Papers: A New Type of Algorithmic Bias and Uncertainty in Scholarly Work	Rishab Jain et.al.	2312.10057	null
2023-12-14	OSTINATO: Cross-host Attack Correlation Through Attack Activity Similarity Detection	Sutanu Kumar Ghosh et.al.	2312.09321	null
2024-04-17	Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF	Anand Siththaranjan et.al.	2312.08358	link
2023-12-13	Causality Analysis for Evaluating the Security of Large Language Models	Wei Zhao et.al.	2312.07876	link
2024-07-23	AI Control: Improving Safety Despite Intentional Subversion	Ryan Greenblatt et.al.	2312.06942	link
2024-05-30	Privacy Issues in Large Language Models: A Survey	Seth Neel et.al.	2312.06717	link
2023-12-11	Control Risk for Potential Misuse of Artificial Intelligence in Science	Jiyan He et.al.	2312.06632	link
2023-12-08	Seamless: Multilingual Expressive and Streaming Speech Translation	Seamless Communication et.al.	2312.05187	link
2023-12-12	DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions	Fangzhou Wu et.al.	2312.04730	null
2024-02-23	Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak	Yanrui Du et.al.	2312.04127	null
2024-10-31	Tree of Attacks: Jailbreaking Black-Box LLMs Automatically	Anay Mehrotra et.al.	2312.02119	link
2024-06-09	Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections	Yuanpu Cao et.al.	2312.00027	link
2024-03-03	Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition	Sander Schulhoff et.al.	2311.16119	link
2023-11-27	How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs	Haoqin Tu et.al.	2311.16101	link
2023-11-27	InfoPattern: Unveiling Information Propagation Patterns in Social Media	Chi Han et.al.	2311.15642	null
2023-11-15	Towards Publicly Accountable Frontier LLMs: Building an External Scrutiny Ecosystem under the ASPIRE Framework	Markus Anderljung et.al.	2311.14711	null
2024-04-29	Universal Jailbreak Backdoors from Poisoned Human Feedback	Javier Rando et.al.	2311.14455	link
2024-03-24	Unmasking and Improving Data Credibility: A Study with Datasets for Training Harmless Language Models	Zhaowei Zhu et.al.	2311.11202	link
2024-06-15	Hijacking Large Language Models via Adversarial In-Context Learning	Yao Qiang et.al.	2311.09948	link
2024-02-29	Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking	Nan Xu et.al.	2311.09827	null
2024-06-19	RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models	Jiongxiao Wang et.al.	2311.09641	null
2023-11-16	JAB: Joint Adversarial Prompting and Belief Augmentation	Ninareh Mehrabi et.al.	2311.09473	null
2024-08-15	Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment	Haoran Wang et.al.	2311.09433	link
2024-01-20	Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts	Yuanwei Wu et.al.	2311.09127	null
2024-06-12	Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization	Zhexin Zhang et.al.	2311.09096	link
2023-11-29	AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications	Bhaktipriya Radharapu et.al.	2311.08592	null
2024-04-07	A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily	Peng Ding et.al.	2311.08268	link
2023-11-13	MART: Improving LLM Safety with Multi-round Automatic Red-Teaming	Suyu Ge et.al.	2311.07689	null
2024-05-22	Flames: Benchmarking Value Alignment of LLMs in Chinese	Kexin Huang et.al.	2311.06899	link
2024-12-10	Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming	Nanna Inie et.al.	2311.06237	null
2024-04-01	Fake Alignment: Are LLMs Really Aligned Well?	Yixu Wang et.al.	2311.05915	link
2025-01-19	FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts	Yichen Gong et.al.	2311.05608	link
2024-03-08	Can LLMs Follow Simple Rules?	Norman Mu et.al.	2311.04235	link
2023-11-24	Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation	Rusheb Shah et.al.	2311.03348	null
2024-11-28	DeepInception: Hypnotize Large Language Model to Be Jailbreaker	Xuan Li et.al.	2311.03191	link
2024-05-22	LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B	Simon Lermen et.al.	2310.20624	null
2024-03-10	From Chatbots to PhishBots? -- Preventing Phishing scams created using ChatGPT, Google Bard and Claude	Sayak Saha Roy et.al.	2310.19181	null
2024-03-22	Self-Guard: Empower the LLM to Safeguard Itself	Zezhong Wang et.al.	2310.15851	null
2023-12-14	AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models	Sicheng Zhu et.al.	2310.15140	null
2023-11-13	Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases	Rishabh Bhardwaj et.al.	2310.14303	null
2023-10-20	Adaptive Experimental Design for Intrusion Data Collection	Kate Highnam et.al.	2310.13224	null
2023-10-28	Probing LLMs for hate speech detection: strengths and vulnerabilities	Sarthak Roy et.al.	2310.12860	null
2023-10-19	Attack Prompt Generation for Red Teaming and Defending Large Language Models	Boyi Deng et.al.	2310.12505	link
2023-10-17	Learning from Red Teaming: Gender Bias Provocation and Mitigation in Large Language Models	Hsuan Su et.al.	2310.11079	null
2023-10-16	Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks	Erfan Shayegani et.al.	2310.10844	null
2024-02-16	Large Language Model Unlearning	Yuanshun Yao et.al.	2310.10683	link
2024-06-07	Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models?	Yu-Lin Tsai et.al.	2310.10012	link
2023-11-11	ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models	Alex Mei et.al.	2310.09624	link
2024-07-18	Jailbreaking Black Box Large Language Models in Twenty Queries	Patrick Chao et.al.	2310.08419	link
2023-10-10	Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation	Yangsibo Huang et.al.	2310.06987	link
2024-03-04	Multilingual Jailbreak Challenges in Large Language Models	Yue Deng et.al.	2310.06474	link
2024-05-25	Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations	Zeming Wei et.al.	2310.06387	null
2024-03-20	AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models	Xiaogeng Liu et.al.	2310.04451	link
2023-09-17	Red Teaming Generative AI/NLP, the BB84 quantum cryptography protocol and the NIST-approved Quantum-Resistant Cryptographic Algorithms	Petar Radanliev et.al.	2310.04425	null
2023-10-05	Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!	Xiangyu Qi et.al.	2310.03693	link
2024-06-11	SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks	Alexander Robey et.al.	2310.03684	link
2024-01-27	Low-Resource Languages Jailbreak GPT-4	Zheng-Xin Yong et.al.	2310.02446	null
2023-10-03	Jailbreaker in Jail: Moving Target Defense for Large Language Models	Bocheng Chen et.al.	2310.02417	null
2023-10-03	Can Language Models be Instructed to Protect Personal Information?	Yang Chen et.al.	2310.02224	null
2024-01-22	Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench	Jen-tse Huang et.al.	2310.01386	link
2023-10-02	No Offense Taken: Eliciting Offensiveness from Language Models	Anugya Srivastava et.al.	2310.00892	link
2024-07-28	Evolving Diverse Red-team Language Models in Multi-round Multi-agent Games	Chengdong Ma et.al.	2310.00322	null
2024-06-12	Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM	Bochuan Cao et.al.	2309.14348	link
2024-06-27	GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts	Jiahao Yu et.al.	2309.10253	link
2024-06-08	Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts	Zhi-Yi Chin et.al.	2309.06135	link
2024-04-14	FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models	Dongyu Yao et.al.	2309.05274	link
2024-08-05	Open Sesame! Universal Black Box Jailbreaking of Large Language Models	Raz Lapid et.al.	2309.01446	null
2023-09-04	Baseline Defenses for Adversarial Attacks Against Aligned Language Models	Neel Jain et.al.	2309.00614	null
2023-08-28	The Promise and Peril of Artificial Intelligence -- Violet Teaming Offers a Balanced Path Forward	Alexander J. Titus et.al.	2308.14253	null
2023-11-07	Detecting Language Model Attacks with Perplexity	Gabriel Alon et.al.	2308.14132	null
2023-08-25	Self-Deception: Reverse Penetrating the Semantic Firewall of Large Language Models	Zhenhua Wang et.al.	2308.11521	null
2023-08-21	On the Adversarial Robustness of Multi-Modal Foundation Models	Christian Schlarmann et.al.	2308.10741	link
2023-08-21	Using Large Language Models for Cybersecurity Capture-The-Flag Challenges and Certification Questions	Wesley Tann et.al.	2308.10443	null
2023-08-30	Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment	Rishabh Bhardwaj et.al.	2308.09662	link
2024-05-06	Robustness Over Time: Understanding Adversarial Examples' Effectiveness on Longitudinal Versions of Large Language Models	Yugeng Liu et.al.	2308.07847	null
2024-03-26	GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher	Youliang Yuan et.al.	2308.06463	link
2023-08-16	Where's the Liability in Harmful AI Speech?	Peter Henderson et.al.	2308.04635	null
2024-11-07	FLIRT: Feedback Loop In-context Red Teaming	Ninareh Mehrabi et.al.	2308.04265	null
2024-05-15	"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models	Xinyue Shen et.al.	2308.03825	link
2024-04-01	XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models	Paul Röttger et.al.	2308.01263	link
2023-08-03	Confidence-Building Measures for Artificial Intelligence: Workshop Proceedings	Sarah Shoker et.al.	2308.00862	null
2023-12-20	Universal and Transferable Adversarial Attacks on Aligned Language Models	Andy Zou et.al.	2307.15043	link
2023-10-10	Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models	Erfan Shayegani et.al.	2307.14539	null
2023-10-25	MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots	Gelei Deng et.al.	2307.08715	null
2023-08-28	Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models	Huachuan Qiu et.al.	2307.08487	link
2023-07-05	Jailbroken: How Does LLM Safety Training Fail?	Alexander Wei et.al.	2307.02483	null
2023-07-03	From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy	Maanak Gupta et.al.	2307.00691	null
2023-08-16	Visual Adversarial Examples Jailbreak Aligned Large Language Models	Xiangyu Qi et.al.	2306.13213	link
2024-02-26	DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models	Boxin Wang et.al.	2306.11698	null
2023-10-11	Explore, Establish, Exploit: Red Teaming Language Models from Scratch	Stephen Casper et.al.	2306.09442	link
2023-05-30	Seeing Seeds Beyond Weeds: Green Teaming Generative AI for Beneficial Uses	Logan Stapleton et.al.	2306.03097	null
2023-10-19	Red Teaming Language Model Detectors with Language Models	Zhouxing Shi et.al.	2305.19713	link
2023-05-27	Query-Efficient Black-Box Red Teaming via Bayesian Optimization	Deokjae Lee et.al.	2305.17444	link
2024-03-27	Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks	Abhinav Rao et.al.	2305.14965	link
2024-03-10	Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study	Yi Liu et.al.	2305.13860	null
2023-11-10	SneakyPrompt: Jailbreaking Text-to-image Generative Models	Yuchen Yang et.al.	2305.12082	link
2023-05-11	Towards best practices in AGI safety and governance: A survey of expert opinion	Jonas Schuett et.al.	2305.07153	null
2023-05-09	Generating Phishing Attacks using ChatGPT	Sayak Saha Roy et.al.	2305.05133	null
2023-10-19	Automatic Prompt Optimization with "Gradient Descent" and Beam Search	Reid Pryzant et.al.	2305.03495	link
2023-04-21	Power to the Data Defenders: Human-Centered Disclosure Risk Calibration of Open Data	Kaustav Bhattacharjee et.al.	2304.11278	null
2024-06-03	Fundamental Limitations of Alignment in Large Language Models	Yotam Wolf et.al.	2304.11082	link
2023-11-01	Multi-step Jailbreaking Privacy Attacks on ChatGPT	Haoran Li et.al.	2304.05197	link
2023-07-27	Clustered Federated Learning Architecture for Network Anomaly Detection in Large Scale Heterogeneous IoT Networks	Xabier Sáez-de-Cámara et.al.	2303.15986	null
2023-03-09	Personalisation within bounds: A risk taxonomy and policy framework for the alignment of large language models with personalised feedback	Hannah Rose Kirk et.al.	2303.05453	null
2023-09-21	Red Teaming Deep Neural Networks with Feature Synthesis Tools	Stephen Casper et.al.	2302.10894	null
2023-01-05	Can Large Language Models Change User Preference Adversarially?	Varshini Subhash et.al.	2302.10291	null
2023-05-29	Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity	Terry Yue Zhuo et.al.	2301.12867	null
2024-08-23	Asymptotically Normal Estimation of Local Latent Network Curvature	Steven Wilkins-Reeves et.al.	2211.11673	link
2023-05-05	Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks	Stephen Casper et.al.	2211.10024	link
2023-06-07	Beyond the Surface: Investigating Malicious CVE Proof of Concept Exploits on GitHub	Soufian El Yadmani et.al.	2210.08374	null
2022-11-10	Red-Teaming the Stable Diffusion Safety Filter	Javier Rando et.al.	2210.04610	null
2022-11-22	Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned	Deep Ganguli et.al.	2209.07858	link
2023-10-13	Red Teaming with Mind Reading: White-Box Adversarial Policies Against RL Agents	Stephen Casper et.al.	2209.02167	link
2022-08-16	CTI4AI: Threat Intelligence Generation and Sharing after Red Teaming AI Models	Chuyen Nguyen et.al.	2208.07476	null
2022-08-12	PRIVEE: A Visual Analytic Workflow for Proactive Privacy Risk Inspection of Open Data	Kaustav Bhattacharjee et.al.	2208.06481	null
2022-07-30	'PeriHack': Designing a Serious Game for Cybersecurity Awareness	Roberto Dillon et.al.	2208.00235	null
2023-07-27	Gotham Testbed: a Reproducible IoT Testbed for Security Experiments and Dataset Generation	Xabier Sáez-de-Cámara et.al.	2207.13981	link
2022-02-07	Red Teaming Language Models with Language Models	Ethan Perez et.al.	2202.03286	null
2021-12-22	Catch Me If You GAN: Using Artificial Intelligence for Fake Log Generation	Christian Toemmel et.al.	2112.12006	null
2021-12-18	Dynamic Defender-Attacker Blotto Game	Daigo Shishika et.al.	2112.09890	null
2021-11-24	Needle in a Haystack: Detecting Subtle Malicious Edits to Additive Manufacturing G-code Files	Caleb Beckwith et.al.	2111.12746	null
2021-10-04	Automating Privilege Escalation with Deep Reinforcement Learning	Kalle Kujanpää et.al.	2110.01362	null
2021-08-20	CybORG: A Gym for the Development of Autonomous Cyber Agents	Maxwell Standen et.al.	2108.09118	null
2021-05-27	Hopper: Modeling and Detecting Lateral Movement (Extended Report)	Grant Ho et.al.	2105.13442	link
2021-04-23	Predicting Adversary Lateral Movement Patterns with Deep Learning	Nathan Danneman et.al.	2104.13195	null
2021-03-29	Automating Defense Against Adversarial Attacks: Discovery of Vulnerabilities and Application of Multi-INT Imagery to Protect Deployed Models	Josh Kalin et.al.	2103.15897	null
2022-06-28	Dynamically Modelling Heterogeneous Higher-Order Interactions for Malicious Behavior Detection in Event Logs	Corentin Larroche et.al.	2103.15708	link
2021-05-04	An In-memory Embedding of CPython for Offensive Use	Ateeq Sharfuddin et.al.	2103.15202	null
2020-11-26	Investigation on Research Ethics and Building a Benchmark	Shun Inagaki et.al.	2011.13925	null
2020-09-17	Can ROS be used securely in industry? Red teaming ROS-Industrial	Víctor Mayoral-Vilches et.al.	2009.08211	null
2020-07-17	HARMer: Cyber-attacks Automation and Evaluation	Simon Yusuf Enoch et.al.	2006.14352	null
2021-04-16	HACK3D: Crowdsourcing the Assessment of Cybersecurity in Digital Manufacturing	Michael Linares et.al.	2005.04368	null
2020-03-11	Passlab: A Password Security Tool for the Blue Team	Saul Johnson et.al.	2003.07208	null
2020-10-02	SoK: A Survey of Open-Source Threat Emulators	Polina Zilberman et.al.	2003.01518	null
2020-02-26	CybORG: An Autonomous Cyber Operations Research Gym	Callum Baillie et.al.	2002.10667	null
2021-01-29	Anomaly Detection in Large Scale Networks with Latent Space Models	Wesley Lee et.al.	1911.05522	null
2019-06-17	The Little Phone That Could Ch-Ch-Chroot	Jack Whitter-Jones et.al.	1906.07242	null
2019-06-12	Relative Hausdorff Distance for Network Analysis	Sinan G. Aksoy et.al.	1906.04936	null
2019-10-24	Quantifiable & Comparable Evaluations of Cyber Defensive Capabilities: A Survey & Novel, Unified Approach	Michael D. Iannacone et.al.	1902.00053	null
2018-10-13	Two Can Play That Game: An Adversarial Evaluation of a Cyber-alert Inspection System	Ankit Shah et.al.	1810.05921	null
2018-02-27	A Multi-Disciplinary Review of Knowledge Acquisition Methods: From Human to Autonomous Eliciting Agents	George Leu et.al.	1802.09669	null
2018-02-27	Computational Red Teaming in a Sudoku Solving Context: Neural Network Based Skill Representation and Acquisition	George Leu et.al.	1802.09660	null
2018-02-26	Shaping Influence and Influencing Shaping: A Computational Red Teaming Trust-based Swarm Intelligence Model	Jiangjun Tang et.al.	1802.09647	null
2018-01-06	SLEUTH: Real-time Attack Scenario Reconstruction from COTS Audit Data	Md Nahid Hossain et.al.	1801.02062	null
2017-12-02	Recurrent Neural Network Language Models for Open Vocabulary Event-Level Cyber Anomaly Detection	Aaron Tuor et.al.	1712.00557	link
2015-04-07	Security Toolbox for Detecting Novel and Sophisticated Android Malware	Benjamin Holland et.al.	1504.01693	null


