chen37058/Red-Team-Arxiv-Paper-Update

Updated on 2025.01.27

Usage instructions: here

Table of Contents
  1. Red Teaming

Red Teaming

Publish Date Title Authors PDF Code
2025-01-23 Tune In, Act Up: Exploring the Impact of Audio Modality-Specific Edits on Large Audio Language Models in Jailbreak Erjia Xiao et.al. 2501.13772 null
2025-01-19 Dagger Behind Smile: Fool LLMs with a Happy Ending Story Xurui Song et.al. 2501.13115 null
2025-01-21 You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense Wuyuao Mai et.al. 2501.12210 null
2025-01-19 Can Safety Fine-Tuning Be More Principled? Lessons Learned from Cybersecurity David Williams-King et.al. 2501.11183 null
2025-01-18 Jailbreaking Large Language Models in Infinitely Many Ways Oliver Goldstein et.al. 2501.10800 null
2025-01-18 Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks Xin Yi et.al. 2501.10639 null
2024-12-17 What Information Should Be Shared with Whom "Before and During Training"? Haydn Belfield et.al. 2501.10379 null
2025-01-16 A Survey on Responsible LLMs: Inherent Risk, Malicious Use, and Mitigation Strategy Huandong Wang et.al. 2501.09431 null
2025-01-14 Playing Devil's Advocate: Unmasking Toxicity and Vulnerabilities in Large Vision-Language Models Abdulkadir Erol et.al. 2501.09039 null
2025-01-15 SAIF: A Comprehensive Framework for Evaluating the Risks of Generative AI in the Public Sector Kyeongryul Lee et.al. 2501.08814 null
2025-01-14 Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints Jonathan Nöther et.al. 2501.08246 null
2025-01-14 Self-Instruct Few-Shot Jailbreaking: Decompose the Attack into Pattern and Behavior Learning Jiaqi Hua et.al. 2501.07959 link
2025-01-14 Gandalf the Red: Adaptive Security for LLMs Niklas Pfister et.al. 2501.07927 link
2025-01-13 Lessons From Red Teaming 100 Generative AI Products Blake Bullwinkel et.al. 2501.07238 null
2025-01-09 Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency Shiji Zhao et.al. 2501.04931 null
2025-01-05 Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense Yang Ouyang et.al. 2501.02629 link
2025-01-03 Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models Ziwei Zheng et.al. 2501.02029 null
2025-01-02 Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs Joao Fonseca et.al. 2501.02018 null
2025-01-09 Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions Rachneet Sachdeva et.al. 2501.01872 link
2025-01-03 Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models Yanjiang Liu et.al. 2501.01830 null
2025-01-09 WeAudit: Scaffolding User Auditors and AI Practitioners in Auditing Generative AI Wesley Hanwen Deng et.al. 2501.01397 null
2025-01-02 CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models Johan Wahréus et.al. 2501.01335 link
2024-12-29 Adversarial Negotiation Dynamics in Generative Language Models Arinbjörn Kolbeinsson et.al. 2501.00069 null
2024-12-28 LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models Miao Yu et.al. 2501.00055 link
2024-12-30 InfAlign: Inference-aware language model alignment Ananth Balashankar et.al. 2412.19792 null
2024-12-24 Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning Alex Beutel et.al. 2412.18693 null
2024-12-25 Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models Xiaomeng Hu et.al. 2412.18171 null
2024-12-23 Retention Score: Quantifying Jailbreak Risks for Vision Language Models Zaitang Li et.al. 2412.17544 null
2025-01-05 DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak Hao Wang et.al. 2412.17522 null
2024-12-22 Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models Lang Gao et.al. 2412.17034 null
2024-12-22 Robustness of Large Language Models Against Adversarial Attacks Yiyi Tao et.al. 2412.17011 null
2024-12-21 OpenAI o1 System Card OpenAI et.al. 2412.16720 null
2024-12-21 POEX: Policy Executable Embodied AI Jailbreak Attacks Xuancun Lu et.al. 2412.16633 null
2024-12-21 Divide and Conquer: A Hybrid Strategy Defeats Multimodal Large Language Models Yanxu Mao et.al. 2412.16555 null
2025-01-08 Deliberative Alignment: Reasoning Enables Safer Language Models Melody Y. Guan et.al. 2412.16339 null
2024-12-20 Logical Consistency of Large Language Models in Fact-checking Bishwamittra Ghosh et.al. 2412.16100 null
2024-12-20 JailPO: A Novel Black-box Jailbreak Framework via Preference Optimization against Aligned LLMs Hongyi Li et.al. 2412.15623 null
2024-12-19 SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage Xiaoning Dong et.al. 2412.15289 null
2025-01-08 Toxicity Detection towards Adaptability to Changing Perturbations Hankun Kang et.al. 2412.15267 null
2024-12-18 Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation Aneta Zugecova et.al. 2412.13666 null
2024-12-17 Concept-ROT: Poisoning Concepts in Large Language Models with Model Editing Keltin Grimes et.al. 2412.13341 link
2024-12-17 Jailbreaking? One Step Is Enough! Weixiong Zheng et.al. 2412.12621 null
2024-12-17 Subversion Strategy Eval: Evaluating AI's stateless strategic capabilities against control protocols Alex Mallen et.al. 2412.12480 null
2024-12-13 No Free Lunch for Defending Against Prefilling Attack by In-Context Learning Zhiyu Xue et.al. 2412.12192 null
2024-12-10 Na'vi or Knave: Jailbreaking Language Models via Metaphorical Avatars Yu Yan et.al. 2412.12145 null
2024-12-15 SpearBot: Leveraging Large Language Models in a Generative-Critique Framework for Spear-Phishing Email Generation Qinglin Qi et.al. 2412.11109 null
2024-12-15 Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models Di Wu et.al. 2412.11041 null
2024-12-14 IntelEX: A LLM-driven Attack-level Threat Intelligence Extraction Framework Ming Xu et.al. 2412.10872 null
2024-12-14 Towards Action Hijacking of Large Language Model-based Agent Yuyang Zhang et.al. 2412.10807 null
2024-12-10 Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM Shaoqing Zhang et.al. 2412.10423 link
2024-12-13 AdvPrefix: An Objective for Nuanced LLM Jailbreaks Sicheng Zhu et.al. 2412.10321 link
2024-12-12 AI Red-Teaming is a Sociotechnical System. Now What? Tarleton Gillespie et.al. 2412.09751 null
2024-12-12 Obfuscated Activations Bypass LLM Latent-Space Defenses Luke Bailey et.al. 2412.09565 null
2024-12-16 Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models Jiahui Li et.al. 2412.08615 link
2024-12-11 AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models Mintong Kang et.al. 2412.08608 null
2024-12-11 Model-Editing-Based Jailbreak against Safety-aligned Large Language Models Yuxi Li et.al. 2412.08201 null
2024-12-11 Antelope: Potent and Concealed Jailbreak Attack Strategy Xin Zhao et.al. 2412.08156 null
2024-12-11 Evil twins are not that evil: Qualitative insights into machine-generated prompts Nathanaël Carraz Rakotonirina et.al. 2412.08127 null
2024-12-16 Granite Guardian Inkit Padhi et.al. 2412.07724 link
2024-12-10 FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks Bocheng Chen et.al. 2412.07672 null
2024-12-10 TraSCE: Trajectory Steering for Concept Erasure Anubhav Jain et.al. 2412.07658 link
2024-12-10 PrisonBreak: Jailbreaking Large Language Models with Fewer Than Twenty-Five Targeted Bit-flips Zachary Coalson et.al. 2412.07192 null
2024-11-03 Poison Attacks and Adversarial Prompts Against an Informed University Virtual Assistant Ivan A. Fernandez et.al. 2412.06788 null
2024-12-09 Enhancing Adversarial Resistance in LLMs with Recursion Bryan Li et.al. 2412.06181 null
2025-01-03 Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models Ma Teng et.al. 2412.05934 link
2024-12-16 PBI-Attack: Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for Toxicity Maximization Ruoxi Cheng et.al. 2412.05892 null
2024-12-07 PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage Yuzhou Nie et.al. 2412.05734 link
2024-12-06 BadGPT-4o: stripping safety finetuning from GPT models Ekaterina Krupkina et.al. 2412.05346 null
2024-12-06 LIAR: Leveraging Alignment (Best-of-N) to Jailbreak LLMs in Seconds James Beetham et.al. 2412.05232 null
2024-12-19 Best-of-N Jailbreaking John Hughes et.al. 2412.03556 link
2024-12-04 Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts? Sravanti Addepalli et.al. 2412.03235 null
2024-12-03 Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach Tony T. Wang et.al. 2412.02159 null
2024-12-03 Trust & Safety of LLMs and LLMs in Trust & Safety Doohee You et.al. 2412.02113 null
2024-12-02 Improved Large Language Model Jailbreak Detection via Pretrained Embeddings Erick Galinkin et.al. 2412.01547 null
2024-12-17 Jailbreak Large Vision-Language Models Through Multi-Modal Linkage Yu Wang et.al. 2412.00473 link
2024-11-30 Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models Sanghyun Kim et.al. 2412.00357 null
2024-12-19 PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning Shenghui Li et.al. 2411.19335 null
2024-11-28 DIESEL -- Dynamic Inference-Guidance via Evasion of Semantic Embeddings in LLMs Ben Ganon et.al. 2411.19038 null
2024-12-20 Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment Soumya Suvra Ghosal et.al. 2411.18688 null
2024-11-27 Embodied Red Teaming for Auditing Robotic Foundation Models Sathwik Karnik et.al. 2411.18676 null
2024-11-28 Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models Shuyang Hao et.al. 2411.18000 null
2024-11-26 Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats Jiaxin Wen et.al. 2411.17693 null
2025-01-14 Don't Command, Cultivate: An Exploratory Study of System-2 Alignment Yuhang Wang et.al. 2411.17075 link
2024-11-25 In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models Zhi-Yi Chin et.al. 2411.16769 null
2024-11-23 ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain Haochen Zhao et.al. 2411.16736 link
2024-12-04 "Moralized" Multi-Step Jailbreak Prompts: Black-Box Testing of Guardrails in Large Language Models for Verbal Attacks Libo Wang et.al. 2411.16730 link
2024-11-29 Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks Han Wang et.al. 2411.16721 link
2024-11-25 Preventing Jailbreak Prompts as Malicious Tools for Cybercriminals: A Cyber Defense Perspective Jean Marie Tshimula et.al. 2411.16642 null
2024-11-22 Universal and Context-Independent Triggers for Precise Control of LLM Outputs Jiashuo Liang et.al. 2411.14738 null
2024-11-21 Global Challenge for Safe and Secure LLMs Track 1 Xiaojun Jia et.al. 2411.14502 null
2024-11-21 GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs Advik Raj Basani et.al. 2411.14133 link
2024-11-20 A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection Gabriel Chua et.al. 2411.12946 null
2024-11-27 Playing Language Game with LLMs Leads to Jailbreaking Yu Peng et.al. 2411.12762 null
2024-12-08 TrojanRobot: Backdoor Attacks Against LLM-based Embodied Robots in the Physical World Xianlong Wang et.al. 2411.11683 null
2024-11-28 Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models Chenhang Cui et.al. 2411.11496 link
2024-11-18 The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models Xikang Yang et.al. 2411.11407 link
2024-11-18 Steering Language Model Refusal with Sparse Autoencoders Kyle O'Brien et.al. 2411.11296 null
2024-11-17 JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit Zeqing He et.al. 2411.11114 null
2024-12-09 Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey Xuannan Liu et.al. 2411.09259 link
2024-11-14 DROJ: A Prompt-Driven Attack against Large Language Models Leyang Hu et.al. 2411.09125 link
2024-11-13 LLMStinger: Jailbreaking LLMs using RL fine-tuned LLMs Piyush Jha et.al. 2411.08862 null
2024-11-13 The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense Yangyang Guo et.al. 2411.08410 null
2024-11-12 Zer0-Jack: A Memory-efficient Gradient-based Jailbreaking Method for Black-box Multi-modal Large Language Models Tiejin Chen et.al. 2411.07559 null
2024-11-12 Rapid Response: Mitigating LLM Jailbreaks with a Few Examples Alwin Peng et.al. 2411.07494 null
2024-11-11 HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model Alignment Yannis Belkhiter et.al. 2411.06835 null
2024-11-10 SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains Bijoy Ahmed Saiem et.al. 2411.06426 null
2024-11-06 Diversity Helps Jailbreak Large Language Models Weiliang Zhao et.al. 2411.04223 null
2025-01-07 MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue Fengxiang Wang et.al. 2411.03814 null
2024-11-02 What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks Nathalie Maria Kirch et.al. 2411.03343 link
2024-12-05 Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment Jason Vega et.al. 2411.02785 link
2024-11-03 UniGuard: Towards Universal Safety Guardrails for Jailbreak Attacks on Multimodal Large Language Models Sejoon Oh et.al. 2411.01703 null
2024-12-10 SQL Injection Jailbreak: a structural disaster of large language models Jiawei Zhao et.al. 2411.01565 link
2024-11-03 AURA: Amplifying Understanding, Resilience, and Awareness for Responsible AI Content Work Alice Qian Zhang et.al. 2411.01426 null
2024-12-11 Plentiful Jailbreaks with String Compositions Brian R. Y. Huang et.al. 2411.01084 null
2024-11-01 Emoji Attack: A Method for Misleading Judge LLMs in Safety Risk Detection Zhipeng Wei et.al. 2411.01077 link
2024-11-15 IDEATOR: Jailbreaking Large Vision-Language Models Using Themselves Ruofan Wang et.al. 2411.00827 null
2024-11-26 Desert Camels and Oil Sheikhs: Arab-Centric Red Teaming of Frontier LLMs Muhammed Saeed et.al. 2410.24049 null
2024-10-31 Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models Hao Yang et.al. 2410.23861 null
2024-10-31 Adversarial Attacks of Vision Tasks in the Past 10 Years: A Survey Chiyu Zhang et.al. 2410.23687 null
2024-11-27 Transferable Ensemble Black-box Jailbreak Attacks on Large Language Models Yiqi Yang et.al. 2410.23558 null
2024-10-30 ProTransformer: Robustify Transformers via Plug-and-Play Paradigm Zhichao Hou et.al. 2410.23182 link
2024-10-29 Benchmarking LLM Guardrails in Handling Multilingual Toxicity Yahan Yang et.al. 2410.22153 null
2024-10-29 AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts Vishal Kumar et.al. 2410.22143 null
2024-10-29 SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types Yutao Mou et.al. 2410.21965 link
2024-10-28 Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring Honglin Mu et.al. 2410.21083 null
2024-10-28 BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks Yunhan Zhao et.al. 2410.20971 null
2024-10-25 RobustKV: Defending Large Language Models against Jailbreak Attacks via KV Eviction Tanqiu Jiang et.al. 2410.19937 null
2024-10-25 An Auditing Test To Detect Behavioral Shift in Language Models Leo Richter et.al. 2410.19406 link
2024-10-25 Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities Chung-En Sun et.al. 2410.18469 link
2024-10-23 Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks Samuele Poppi et.al. 2410.18210 null
2024-10-23 Guide for Defense (G4D): Dynamic Guidance for Robust and Balanced Defense in Large Language Models He Cao et.al. 2410.17922 link
2024-10-22 LLM-Assisted Red Teaming of Diffusion Models through "Failures Are Fated, But Can Be Faded" Som Sagar et.al. 2410.16738 null
2024-11-02 Bayesian scaling laws for in-context learning Aryaman Arora et.al. 2410.16531 link
2024-11-16 Insights and Current Gaps in Open-Source LLM Vulnerability Scanners: A Comparative Analysis Jonathan Brokman et.al. 2410.16527 null
2024-10-18 Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs Rui Pu et.al. 2410.16327 null
2024-10-21 A Realistic Threat Model for Large Language Model Jailbreaks Valentyn Boreiko et.al. 2410.16222 link
2024-10-21 A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns Tianyi Men et.al. 2410.16155 null
2024-11-03 Boosting Jailbreak Transferability for Large Language Models Hanqing Liu et.al. 2410.15645 link
2024-10-21 SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis Aidan Wong et.al. 2410.15641 link
2024-10-20 Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models Xiao Li et.al. 2410.15362 null
2024-10-20 Jailbreaking and Mitigation of Vulnerabilities in Large Language Models Benji Peng et.al. 2410.15236 null
2024-10-16 SoK: Prompt Hacking of Large Language Models Baha Rababah et.al. 2410.13901 null
2024-10-15 A Formal Framework for Assessing and Mitigating Emergent Security Risks in Generative AI Models: Bridging Theory and Dynamic Risk Mitigation Aviral Srivastava et.al. 2410.13897 null
2024-10-21 Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents Priyanshu Kumar et.al. 2410.13886 link
2024-10-17 PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment Zekun Moore Wang et.al. 2410.13785 null
2024-10-17 Persistent Pre-Training Poisoning of LLMs Yiming Zhang et.al. 2410.13722 null
2024-11-09 Jailbreaking LLM-Controlled Robots Alexander Robey et.al. 2410.13691 null
2025-01-02 BiasJailbreak: Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models Isack Lee et.al. 2410.13334 link
2024-10-17 SPIN: Self-Supervised Prompt INjection Leon Zhou et.al. 2410.13236 null
2024-10-18 JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework Fan Liu et.al. 2410.12855 null
2024-10-19 Multi-round jailbreak attack on large language models Yihua Zhou et.al. 2410.11533 null
2024-10-15 Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models Hao Yang et.al. 2410.11459 link
2025-01-20 Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation Qizhang Li et.al. 2410.11317 link
2024-10-15 AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment Pankayaraj Pathmanathan et.al. 2410.11283 null
2024-10-15 Cognitive Overload Attack: Prompt Injection for Long Context Bibek Upadhayay et.al. 2410.11272 link
2024-10-14 Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues Qibing Ren et.al. 2410.10700 link
2024-10-14 On Calibration of LLM-based Guard Models for Reliable Content Moderation Hongfu Liu et.al. 2410.10414 link
2024-10-14 Jailbreak Instruction-Tuned LLMs via end-of-sentence MLP Re-weighting Yifan Luo et.al. 2410.10150 null
2024-11-27 BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models Xinyuan Wang et.al. 2410.09804 null
2024-10-18 VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment Lei Li et.al. 2410.09421 null
2024-12-17 Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations Tarun Raheja et.al. 2410.09097 null
2024-10-11 AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation Zijun Wang et.al. 2410.09040 link
2024-10-14 AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents Maksym Andriushchenko et.al. 2410.09024 null
2024-11-29 RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process Peiran Wang et.al. 2410.08660 null
2024-10-09 Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level Xinyi Zeng et.al. 2410.06809 null
2024-10-04 Developing Assurance Cases for Adversarial Robustness and Regulatory Compliance in LLMs Tomas Bueno Momcilovic et.al. 2410.05304 null
2024-11-27 AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs Xiaogeng Liu et.al. 2410.05295 link
2024-10-06 Attention Shift: Steering AI Away from Unsafe Content Shivank Garg et.al. 2410.04447 null
2024-10-05 Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks Zi Wang et.al. 2410.04234 null
2024-10-05 Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models Yiting Dong et.al. 2410.04190 null
2024-10-04 Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step Wenxuan Wang et.al. 2410.03869 null
2024-10-08 You Know What I'm Saying: Jailbreak Attack via Implicit Reference Tianyu Wu et.al. 2410.03857 link
2024-12-16 SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks Tianhao Li et.al. 2410.03769 null
2024-10-23 Gradient-based Jailbreak Images for Multimodal Fusion Models Javier Rando et.al. 2410.03489 link
2024-10-23 Safeguard is a Double-edged Sword: Denial-of-service Attack on Large Language Models Qingzhao Zhang et.al. 2410.02916 null
2024-10-02 FlipAttack: Jailbreak LLMs via Flipping Yue Liu et.al. 2410.02832 link
2024-10-01 PyRIT: A Framework for Security Risk Identification and Red Teaming in Generative AI System Gary D. Lopez Munoz et.al. 2410.02828 link
2024-10-03 SteerDiff: Steering towards Safe Text-to-Image Diffusion Models Hongxiang Zhang et.al. 2410.02710 null
2024-10-07 Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models Guobin Shen et.al. 2410.02298 null
2024-12-18 Data to Defense: The Role of Curation in Customizing LLMs Against Jailbreaking Attacks Xiaoqun Liu et.al. 2410.02220 null
2024-10-02 Automated Red Teaming with GOAT: the Generative Offensive Agent Tester Maya Pavlova et.al. 2410.01606 null
2024-10-04 HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models Seanie Lee et.al. 2410.01524 link
2024-10-02 Information-Theoretical Principled Trade-off between Jailbreakability and Stealthiness on Vision Language Models Ching-Chia Kao et.al. 2410.01438 null
2024-12-06 Endless Jailbreaks with Bijection Learning Brian R. Y. Huang et.al. 2410.01294 null
2024-12-19 Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models Wei Zhao et.al. 2410.00451 link
2024-09-29 Survey of Security and Data Attacks on Machine Unlearning In Financial and E-Commerce Carl E. J. Brodzinski et.al. 2410.00055 null
2024-09-30 Robust LLM safeguarding via refusal feature adversarial training Lei Yu et.al. 2409.20089 null
2024-09-28 Overriding Safety protections of Open-source Models Sachin Kumar et.al. 2409.19476 link
2024-09-27 HM3: Heterogeneous Multi-Class Model Merging Stefan Hackmann et.al. 2409.19173 null
2024-09-27 Multimodal Pragmatic Jailbreak on Text-to-image Models Tong Liu et.al. 2409.19149 null
2024-11-08 An Adversarial Perspective on Machine Unlearning for AI Safety Jakub Łucki et.al. 2409.18025 link
2024-10-04 MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt Attacks Giandomenico Cornacchia et.al. 2409.17699 null
2024-09-26 RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking Yifan Jiang et.al. 2409.17458 link
2024-09-25 Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction Jinchuan Zhang et.al. 2409.16783 link
2024-09-25 RoleBreak: Character Hallucination as a Jailbreak Attack in Role-Playing Systems Yihong Tang et.al. 2409.16727 null
2024-09-23 Attack Atlas: A Practitioner's Perspective on Challenges and Pitfalls in Red Teaming GenAI Ambrish Rawat et.al. 2409.15398 null
2024-09-18 Multitask Mayhem: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning Essa Jan et.al. 2409.15361 null
2024-10-08 Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs Xueluan Gong et.al. 2409.14866 link
2024-10-03 PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach Zhihao Lin et.al. 2409.14177 null
2024-10-29 Towards Safe Multilingual Frontier AI Artūrs Kanepajs et.al. 2409.13708 link
2024-11-05 Jailbreaking Large Language Models with Symbolic Mathematics Emet Bethany et.al. 2409.11445 null
2024-09-17 Hackphyr: A Local Fine-Tuned LLM Agent for Network Security Environments Maria Rigaki et.al. 2409.11276 null
2024-09-14 What Is Wrong with My Model? Identifying Systematic Problems with Semantic Data Slicing Chenyang Yang et.al. 2409.09261 link
2024-09-27 Multi-Robot Coordination Induced in an Adversarial Graph-Traversal Game James Berneburg et.al. 2409.08222 null
2024-10-19 Securing Large Language Models: Addressing Bias, Misinformation, and Prompt Attacks Benji Peng et.al. 2409.08087 null
2024-09-12 Unleashing Worms and Extracting Data: Escalating the Outcome of Attacks against RAG-based Inference in Scale and Severity Using Jailbreaking Stav Cohen et.al. 2409.08045 link
2024-09-12 Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols Charlie Griffin et.al. 2409.07985 link
2024-09-11 AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs Lijia Lv et.al. 2409.07503 link
2024-09-11 Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks Md Zarif Hossain et.al. 2409.07353 null
2024-09-10 DiPT: Enhancing LLM reasoning through diversified perspective-taking Hoang Anh Just et.al. 2409.06241 null
2024-09-07 Exploring Straightforward Conversational Red-Teaming George Kour et.al. 2409.04822 null
2024-08-31 HSF: Defending against Jailbreak Attacks with Hidden State Filtering Cheng Qian et.al. 2409.03788 null
2024-11-29 Conversational Complexity for Assessing Risk in Large Language Models John Burden et.al. 2409.01247 null
2024-09-01 Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models Bang An et.al. 2409.00598 link
2024-08-31 Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity, and Fairness Wenxuan Wang et.al. 2409.00551 null
2024-10-17 PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action Yijia Shao et.al. 2409.00138 link
2024-08-29 Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks Tom Gibbs et.al. 2409.00137 null
2024-11-07 FRACTURED-SORRY-Bench: Framework for Revealing Attacks in Conversational Turns Undermining Refusal Efficacy and Defenses over SORRY-Bench (Automated Multi-shot Jailbreaks) Aman Priyanshu et.al. 2408.16163 null
2024-08-28 Red Team Redemption: A Structured Comparison of Open-Source Tools for Adversary Emulation Max Landauer et.al. 2408.15645 null
2024-09-05 Legilimens: Practical and Unified Content Moderation for Large Language Model Services Jialin Wu et.al. 2408.15488 link
2024-09-04 LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet Nathaniel Li et.al. 2408.15221 null
2024-08-27 Investigating Coverage Criteria in Large Language Models: An In-Depth Study Through Jailbreak Attacks Shide Zhou et.al. 2408.15207 null
2024-10-05 Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models Hongfu Liu et.al. 2408.14866 link
2024-08-27 Detecting AI Flaws: Target-Driven Attacks on Internal Faults in Language Models Yuhao Du et.al. 2408.14853 null
2024-12-15 HTS-Attack: Heuristic Token Search for Jailbreaking Text-to-Image Models Sensen Gao et.al. 2408.13896 null
2024-08-14 SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming Anurakt Kumar et.al. 2408.11851 null
2024-09-14 Efficient Detection of Toxic Prompts in Large Language Models Yi Liu et.al. 2408.11727 null
2024-08-21 Unlocking Adversarial Suffix Optimization Without Affirmative Phrases: Efficient Black-box Jailbreaking via LLM as Optimizer Weipeng Jiang et.al. 2408.11313 link
2024-08-21 EEG-Defender: Defending against Jailbreak through Early Exit Generation of Large Language Models Chongwen Zhao et.al. 2408.11308 null
2024-08-20 Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Neural Carrier Articles Zhilong Wang et.al. 2408.11182 null
2024-08-18 DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization Pucheng Dang et.al. 2408.11071 null
2025-01-02 Security Attacks on LLM-based Code Completion Tools Wen Cheng et.al. 2408.11006 link
2025-01-02 Perception-guided Jailbreak against Text-to-Image Models Yihao Huang et.al. 2408.10848 null
2024-08-20 Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique Tej Deep Pala et.al. 2408.10701 link
2024-08-20 Towards Robust Knowledge Unlearning: An Adversarial Framework for Assessing and Improving Unlearning Robustness in Large Language Models Hongbang Yuan et.al. 2408.10682 null
2024-08-26 Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation Haoyu Wang et.al. 2408.10668 null
2024-08-18 Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks Kexin Chen et.al. 2408.09326 null
2025-01-10 BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger Yulin Chen et.al. 2408.09093 null
2024-08-22 Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks Jiawei Zhao et.al. 2408.08924 link
2024-08-11 Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search Robert J. Moss et.al. 2408.08899 link
2024-10-22 MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Multimodal Large Language Models Fenghua Weng et.al. 2408.08464 link
2024-12-19 Alignment-Enhanced Decoding: Defending via Token-Level Adaptive Refining of Probability Distributions Quan Liu et.al. 2408.07663 link
2024-12-14 On Effects of Steering Latent Representation for Large Language Model Unlearning Dang Huu-Tien et.al. 2408.06223 link
2024-08-09 A Jailbroken GenAI Model Can Cause Substantial Harm: GenAI-powered Applications are Vulnerable to PromptWares Stav Cohen et.al. 2408.05061 link
2024-09-13 h4rm3l: A Dynamic Benchmark of Composable Jailbreak Attacks for LLM Safety Assessment Moussa Koulako Bala Doumbouya et.al. 2408.04811 null
2024-08-08 Multi-Turn Context Jailbreak Attack on Large Language Models From First Principles Xiongtao Sun et.al. 2408.04686 null
2024-08-08 Compromesso! Italian Many-Shot Jailbreaks Undermine the Safety of Large Language Models Fabio Pernisi et.al. 2408.04522 null
2024-08-07 EnJa: Ensemble Jailbreak on Large Language Models Jiahao Zhang et.al. 2408.03603 null
2024-12-27 Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws Dillon Bowen et.al. 2408.02946 link
2024-08-05 Can Reinforcement Learning Unlock the Hidden Dangers in Aligned Large Language Models? Mohammad Bahrami Karkevandi et.al. 2408.02651 null
2024-12-23 SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models Muxi Diao et.al. 2408.02632 null
2024-08-02 Mission Impossible: A Statistical Perspective on Jailbreaking LLMs Jingtong Su et.al. 2408.01420 null
2024-08-01 WHITE PAPER: A Brief Exploration of Data Exfiltration using GCG Suffixes Victor Valbuena et.al. 2408.00925 null
2024-09-14 Tamper-Resistant Safeguards for Open-Weight LLMs Rishub Tamirisa et.al. 2408.00761 link
2024-09-09 Jailbreaking Text-to-Image Models with LLM-Based Agents Yingkai Dong et.al. 2408.00523 null
2024-10-17 Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models Yue Xu et.al. 2407.21659 link
2025-01-16 Direct Unlearning Optimization for Robust and Safe Text-to-Image Models Yong-Hyun Park et.al. 2407.21035 null
2024-10-24 Effects of Scale on Language Model Robustness Nikolaus Howe et.al. 2407.18213 null
2024-12-24 The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models Zihui Wu et.al. 2407.17915 link
2024-10-01 FLRT: Fluent Student-Teacher Redteaming T. Ben Thompson et.al. 2407.17447 link
2024-10-07 Revisiting Who's Harry Potter: Towards Targeted Unlearning from a Causal Intervention Perspective Yujian Liu et.al. 2407.16997 link
2024-12-31 From Sands to Mansions: Simulating Full Attack Chain with LLM-Organized Knowledge Lingzhi Wang et.al. 2407.16928 null
2024-08-23 Can Large Language Models Automatically Jailbreak GPT-4V? Yuanwei Wu et.al. 2407.16686 null
2024-07-23 RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent Huiyu Xu et.al. 2407.16667 null
2024-10-26 Course-Correction: Safety Alignment Using Synthetic Preferences Rongwu Xu et.al. 2407.16637 link
2024-07-23 PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing Blazej Manczak et.al. 2407.16318 link
2024-08-13 Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models Shi Lin et.al. 2407.16205 link
2024-07-26 Breaking the Global North Stereotype: A Global South-centric Benchmark Dataset for Auditing and Mitigating Biases in Facial Recognition Systems Siddharth D Jaiswal et.al. 2407.15810 null
2024-08-21 Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs Abhay Sheshadri et.al. 2407.15549 link
2024-12-16 Failures to Find Transferable Image Jailbreaks Between Vision-Language Models Rylan Schaeffer et.al. 2407.15211 null
2024-07-21 Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts Yi Liu et.al. 2407.15050 null
2024-07-23 RogueGPT: dis-ethical tuning transforms ChatGPT4 into a Rogue AI in 158 Words Alessio Buscemi et.al. 2407.15009 null
2024-07-20 Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs) Apurv Verma et.al. 2407.14937 link
2024-08-23 Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle Emman Haider et.al. 2407.13833 null
2024-07-16 Continuous Embedding Attacks via Clipped Inputs in Jailbreaking Large Language Models Zihao Xu et.al. 2407.13796 link
2024-07-18 LLMs as Function Approximators: Terminology, Taxonomy, and Questions for Evaluation David Schlangen et.al. 2407.13744 null
2024-07-17 AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases Zhaorun Chen et.al. 2407.12784 link
2024-10-28 Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models Chao Gong et.al. 2407.12383 link
2024-07-17 The Better Angels of Machine Personality: How Personality Relates to LLM Safety Jie Zhang et.al. 2407.12344 link
2024-10-03 Does Refusal Training in LLMs Generalize to the Past Tense? Maksym Andriushchenko et.al. 2407.11969 link
2024-08-21 What Makes and Breaks Safety Fine-tuning? A Mechanistic Study Samyak Jain et.al. 2407.10264 null
2024-07-13 MOAT: Securely Mitigating Rowhammer with Per-Row Activation Counters Moinuddin Qureshi et.al. 2407.09995 null
2024-10-18 ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Likely Toxic Prompts Amelia F. Hardy et.al. 2407.09447 link
2024-09-06 Soft Prompts Go Hard: Steering Visual Language Models with Hidden Meta-Instructions Tingwei Zhang et.al. 2407.08970 link
2024-07-11 Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing Huanqian Wang et.al. 2407.08770 link
2024-07-11 Are Large Language Models Really Bias-Free? Jailbreak Prompts for Assessing Adversarial Robustness to Bias Elicitation Riccardo Cantini et.al. 2407.08441 null
2024-09-11 The Human Factor in AI Red Teaming: Perspectives from Social and Collaborative Computing Alice Qian Zhang et.al. 2407.07786 null
2024-07-12 A Survey of Attacks on Large Vision-Language Models: Resources, Advances, and Future Trends Daizong Liu et.al. 2407.07403 link
2024-09-08 T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models Yibo Miao et.al. 2407.05965 null
2024-07-08 $R^2$-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning Mintong Kang et.al. 2407.05557 link
2024-07-06 Safe Generative Chats in a WhatsApp Intelligent Tutoring System Zachary Levonian et.al. 2407.04915 null
2024-08-30 Jailbreak Attacks and Defenses Against Large Language Models: A Survey Sibo Yi et.al. 2407.04295 null
2024-12-21 Automated Progressive Red Teaming Bojian Jiang et.al. 2407.03876 link
2024-07-03 Soft Begging: Modular and Efficient Shielding of LLMs against Prompt Injection and Jailbreaking based on Prompt Tuning Simon Ostermann et.al. 2407.03391 null
2024-07-03 SOS! Soft Prompt Attack Against Open-Source Large Language Models Ziqing Yang et.al. 2407.03160 null
2024-07-03 JailbreakHunter: A Visual Analytics Approach for Jailbreak Prompts Discovery from Large-Scale Human-LLM Conversational Datasets Zhihua Jin et.al. 2407.03045 null
2024-11-05 Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks Zhexin Zhang et.al. 2407.02855 link
2024-10-30 Breach By A Thousand Leaks: Unsafe Information Leakage in 'Safe' AI Responses David Glukhov et.al. 2407.02551 null
2024-08-26 Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything Xiaotian Zou et.al. 2407.02534 null
2024-07-02 SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack Yan Yang et.al. 2407.01902 link
2024-07-01 Purple-teaming LLMs with Adversarial Defender Training Jingyan Zhou et.al. 2407.01850 null
2024-07-25 JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models Haibo Jin et.al. 2407.01599 link
2024-07-01 Enhancing the Capability and Robustness of Large Language Models through Reinforcement Learning-Driven Query Refinement Zisu Huang et.al. 2407.01461 link
2024-07-01 Badllama 3: removing safety finetuning from Llama 3 in minutes Dmitrii Volkov et.al. 2407.01376 null
2024-09-23 Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks Yue Zhou et.al. 2407.00869 link
2024-10-01 Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference Anton Xue et.al. 2407.00075 null
2024-07-11 Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection Yuqi Zhou et.al. 2406.19845 null
2024-10-03 Jailbreaking LLMs with Arabic Transliteration and Arabizi Mansour Al Ghanim et.al. 2406.18725 link
2024-07-08 The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm Aakanksha et.al. 2406.18682 null
2024-06-26 WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models Liwei Jiang et.al. 2406.18510 link
2024-12-09 WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs Seungju Han et.al. 2406.18495 link
2024-06-26 Poisoned LangChain: Jailbreak LLMs by LangChain Ziqiu Wang et.al. 2406.18122 null
2024-12-24 SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance Caishuang Huang et.al. 2406.18118 link
2024-06-25 CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference Erxin Yu et.al. 2406.17626 link
2024-06-25 Leveraging Reinforcement Learning in Red Teaming for Advanced Ransomware Attack Simulations Cheng Wang et.al. 2406.17576 null
2024-06-21 Steering Without Side Effects: Improving Post-Deployment Control of Language Models Asa Cooper Stickland et.al. 2406.15518 link
2024-11-02 Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding Haneul Yoo et.al. 2406.15481 link
2024-06-21 From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking Siyuan Wang et.al. 2406.14859 null
2024-07-01 Adversaries Can Misuse Combinations of Safe Models Erik Jones et.al. 2406.14595 null
2025-01-17 Jailbreaking as a Reward Misspecification Problem Zhihui Xie et.al. 2406.14393 link
2024-06-20 Finding Safety Neurons in Large Language Models Jianhui Chen et.al. 2406.14144 null
2024-06-19 ObscurePrompt: Jailbreaking Large Language Models via Obscure Input Yue Huang et.al. 2406.13662 link
2024-08-21 SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation Xiaoze Liu et.al. 2406.12975 link
2025-01-07 ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates Fengqing Jiang et.al. 2406.12935 link
2024-06-21 [WIP] Jailbreak Paradox: The Achilles' Heel of LLMs Abhinav Rao et.al. 2406.12702 null
2024-06-17 Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner Kenneth Li et.al. 2406.11978 link
2024-10-16 CELL your Model: Contrastive Explanations for Large Language Models Ronny Luss et.al. 2406.11785 null
2024-10-23 STAR: SocioTechnical Approach to Red Teaming Language Models Laura Weidinger et.al. 2406.11757 null
2024-10-30 Refusal in Language Models Is Mediated by a Single Direction Andy Arditi et.al. 2406.11717 link
2024-06-17 Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack Shangqing Tu et.al. 2406.11682 link
2024-06-17 "Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak Lingrui Mei et.al. 2406.11668 link
2024-06-17 Ruby Teaming: Improving Quality Diversity Search with Memory for Automated Red Teaming Vernon Toh Yan Han et.al. 2406.11654 null
2024-06-16 garak: A Framework for Security Probing Large Language Models Leon Derczynski et.al. 2406.11036 link
2024-06-16 Threat Modelling and Risk Analysis for Large Language Model (LLM)-Powered Applications Stephen Burabari Tete et.al. 2406.11007 null
2024-12-02 Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis Yuping Lin et.al. 2406.10794 link
2024-11-06 Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs Zhao Xu et.al. 2406.09324 link
2024-06-13 JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models Delong Ran et.al. 2406.09321 link
2024-10-05 Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models Sarah Ball et.al. 2406.09289 link
2024-07-19 Exploiting Uncommon Text-Encoded Structures for Automated Jailbreaks in LLMs Bangxin Li et.al. 2406.08754 null
2024-06-13 RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs Xuan Chen et.al. 2406.08725 null
2024-12-18 When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search Xuan Chen et.al. 2406.08705 link
2024-06-13 MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models Tianle Gu et.al. 2406.07594 link
2024-07-14 Merging Improves Self-Critique Against Jailbreak Attacks Victor Gallego et.al. 2406.07188 link
2024-12-06 MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models Yichi Zhang et.al. 2406.07057 null
2024-06-07 Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs Fan Liu et.al. 2406.06622 null
2024-07-03 Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks Zonghao Ying et.al. 2406.06302 link
2024-06-10 Safety Alignment Should Be Made More Than Just a Few Tokens Deep Xiangyu Qi et.al. 2406.05946 link
2024-06-13 How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States Zhenhong Zhou et.al. 2406.05644 link
2024-09-05 SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner Xunguang Wang et.al. 2406.05498 null
2024-06-08 Is On-Device AI Broken and Exploitable? Assessing the Trust and Ethics in Small Language Models Kalyan Nakka et.al. 2406.05364 null
2024-07-01 Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt Zonghao Ying et.al. 2406.04031 link
2024-06-06 AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens Lin Lu et.al. 2406.03805 null
2024-09-25 Ranking Manipulation for Conversational Search Engines Samuel Pfrommer et.al. 2406.03589 link
2024-06-03 Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits Andis Draguns et.al. 2406.02619 link
2024-05-28 Are PPO-ed Language Models Hackable? Suraj Anand et.al. 2406.02577 null
2025-01-21 QROA: A Black-Box Query-Response Optimization Attack on LLMs Hussein Jawad et.al. 2406.02044 link
2024-10-30 Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses Xiaosen Zheng et.al. 2406.01288 link
2024-11-03 Are you still on track!? Catching LLM Task Drift with Activations Sahar Abdelnabi et.al. 2406.00799 link
2024-06-01 Exploring Vulnerabilities and Protections in Large Language Models: A Survey Frank Weizhen Liu et.al. 2406.00240 null
2024-07-29 Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization Yuanpu Cao et.al. 2406.00045 link
2024-06-05 Improved Techniques for Optimization-Based Jailbreaking on Large Language Models Xiaojun Jia et.al. 2405.21018 link
2024-11-01 Improved Generation of Adversarial Examples Against Safety-aligned LLMs Qizhang Li et.al. 2405.20778 link
2024-08-21 Medical MLLM is Vulnerable: Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models Xijie Huang et.al. 2405.20775 link
2024-06-12 Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character Siyuan Ma et.al. 2405.20773 null
2024-06-04 Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens Jiahao Yu et.al. 2405.20653 null
2024-05-30 Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters Haibo Jin et.al. 2405.20413 null
2024-10-17 TAIA: Large Language Models are Out-of-Distribution Data Learners Shuyang Jiang et.al. 2405.20192 link
2024-05-30 Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks Chen Xiong et.al. 2405.20099 null
2024-05-30 Efficient LLM-Jailbreaking by Introducing Visual Modality Zhenxing Niu et.al. 2405.20015 null
2024-05-30 AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization Jiawei Chen et.al. 2405.19668 null
2024-10-11 ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users Guanlin Li et.al. 2405.19360 link
2024-05-31 Robustifying Safety-Aligned Large Language Models through Clean Data Curation Xiaoqun Liu et.al. 2405.19358 null
2024-05-29 ConceptPrune: Concept Editing in Diffusion Models via Skilled Neuron Pruning Ruchika Chavhan et.al. 2405.19237 link
2024-05-29 Voice Jailbreak Attacks Against GPT-4o Xinyue Shen et.al. 2405.19103 link
2024-12-20 DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints Andrew Zhao et.al. 2405.19026 link
2024-10-20 Quantitative Certification of Bias in Large Language Models Isha Chaudhary et.al. 2405.18780 link
2024-11-18 A Theoretical Understanding of Self-Correction through In-context Alignment Yifei Wang et.al. 2405.18634 null
2024-05-28 Learning diverse attacks on large language models for robust red-teaming and safety tuning Seanie Lee et.al. 2405.18540 null
2024-06-14 Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing Wei Zhao et.al. 2405.18166 link
2024-10-14 White-box Multimodal Jailbreaks Against Large Vision-Language Models Ruofan Wang et.al. 2405.17894 link
2024-05-28 Automatic Jailbreaking of the Text-to-Image Generative AI Systems Minseon Kim et.al. 2405.16567 link
2024-05-24 Hacc-Man: An Arcade Game for Jailbreaking LLMs Matheus Valentim et.al. 2405.15902 null
2024-10-08 Extracting Prompts by Inverting LLM Outputs Collin Zhang et.al. 2405.15012 link
2024-10-30 Representation Noising: A Defence Mechanism Against Harmful Finetuning Domenic Rosati et.al. 2405.14577 link
2024-05-23 Impact of Non-Standard Unicode Characters on Security and Comprehension in Large Language Models Johan S Daniel et.al. 2405.14490 link
2024-05-22 WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response Tianrong Zhang et.al. 2405.14023 null
2024-05-22 Safety Alignment for Vision Language Models Zhendong Liu et.al. 2405.13581 null
2024-07-07 TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models Pengzhou Cheng et.al. 2405.13401 null
2024-10-15 GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation Govind Ramesh et.al. 2405.13077 null
2024-06-19 Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation Yuxi Li et.al. 2405.13068 link
2024-06-17 Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming Jiaxu Liu et.al. 2405.12604 null
2024-05-29 Single Image Unlearning: Efficient Machine Unlearning in Multimodal Large Language Models Jiaqi Li et.al. 2405.12523 null
2024-08-06 Hummer: Towards Limited Competitive Preference Dataset Li Jiang et.al. 2405.11647 null
2024-05-15 Benchmark Early and Red Team Often: A Framework for Assessing and Managing Dual-Use Hazards of AI Foundation Models Anthony M. Barrett et.al. 2405.10986 null
2024-10-05 Red Teaming Language Models for Processing Contradictory Dialogues Xiaofei Wen et.al. 2405.10128 link
2024-05-15 Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization Kai Hu et.al. 2405.09113 null
2024-05-15 A safety realignment framework via subspace-oriented model fusion for large language models Xin Yi et.al. 2405.09055 link
2024-05-14 SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models Raghuveer Peri et.al. 2405.08317 null
2024-05-14 PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition Ziyang Zhang et.al. 2405.07932 link
2024-05-14 PLeak: Prompt Leaking Attacks against Large Language Model Applications Bo Hui et.al. 2405.06823 link
2024-08-29 Mitigating Exaggerated Safety in Large Language Models Ruchira Ray et.al. 2405.05418 null
2024-05-07 Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbreak Attacks Georgios Pantazopoulos et.al. 2405.04403 link
2024-05-07 Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent Shang Shang et.al. 2405.03654 null
2024-05-03 Aloe: A Family of Fine-tuned Open Healthcare LLMs Ashwin Kumar Gururajan et.al. 2405.01886 null
2024-05-02 Boosting Jailbreak Attack with Momentum Yihao Zhang et.al. 2405.01229 link
2024-05-10 Evaluating and Mitigating Linguistic Discrimination in Large Language Models Guoliang Dong et.al. 2404.18534 null
2024-04-26 Probabilistic Inference in Language Models via Twisted Sequential Monte Carlo Stephen Zhao et.al. 2404.17546 link
2024-04-21 AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs Anselm Paulus et.al. 2404.16873 link
2024-10-12 Don't Say No: Jailbreaking LLM by Suppressing Refusal Yukai Zhou et.al. 2404.16369 link
2024-04-24 Universal Adversarial Triggers Are Not Universal Nicholas Meade et.al. 2404.16020 link
2024-04-23 Bias patterns in the application of LLMs for clinical decision support: A comprehensive study Raphael Poulain et.al. 2404.15149 link
2024-04-23 A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI Seliem El-Sayed et.al. 2404.15058 null
2024-06-06 Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs Javier Rando et.al. 2404.14461 link
2024-10-10 Protecting Your LLMs with Information Bottleneck Zichuan Liu et.al. 2404.13968 link
2024-04-19 The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions Eric Wallace et.al. 2404.13208 null
2024-04-18 Advancing the Robustness of Large Language Models through Self-Denoised Smoothing Jiabao Ji et.al. 2404.12274 link
2024-04-12 JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models Yingchaojie Feng et.al. 2404.08793 null
2024-06-24 ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming Simone Tedeschi et.al. 2404.08676 link
2024-04-12 Subtoxic Questions: Dive Into Attitude Change of LLM's Response in Jailbreak Attempts Tianyu Zhang et.al. 2404.08309 null
2024-11-24 AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs Zeyi Liao et.al. 2404.07921 link
2024-04-10 CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge Yu Ying Chiu et.al. 2404.06664 null
2024-05-07 Rethinking How to Evaluate Language Model Jailbreak Hongyu Cai et.al. 2404.06407 link
2024-07-03 Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge Weikai Lu et.al. 2404.05880 link
2024-04-16 Hidden You Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Logic Chain Injection Zhilong Wang et.al. 2404.04849 null
2024-09-09 Fine-Tuning, Quantization, and LLMs: Navigating Unintended Outcomes Divyanshu Kumar et.al. 2404.04392 null
2024-12-15 Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? Shuo Chen et.al. 2404.03411 link
2024-11-24 JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks Weidi Luo et.al. 2404.03027 null
2024-09-04 Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models Jiachen Ma et.al. 2404.02928 null
2024-04-03 Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game Qianqiao Xu et.al. 2404.02532 null
2024-10-07 Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks Maksym Andriushchenko et.al. 2404.02151 link
2024-04-02 Red-Teaming Segment Anything Model Krzysztof Jankowski et.al. 2404.02067 link
2024-09-24 Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack Mark Russinovich et.al. 2404.01833 null
2024-10-31 JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models Patrick Chao et.al. 2404.01318 link
2024-08-20 What is in Your Safe Data? Identifying Benign Data that Breaks Safety Luxi He et.al. 2404.01099 link
2024-11-26 Against The Achilles' Heel: A Survey on Red Teaming for Generative Models Lizhi Lin et.al. 2404.00629 link
2024-12-27 Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code Taishi Nakamura et.al. 2404.00399 null
2024-12-08 Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation Yutong He et.al. 2403.19103 null
2024-03-27 IterAlign: Iterative Constitutional Alignment of Large Language Models Xiusi Chen et.al. 2403.18341 null
2024-11-15 Optimization-based Prompt Injection Attack to LLM-as-a-Judge Jiawen Shi et.al. 2403.17710 link
2024-09-30 Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models Zhiyuan Yu et.al. 2403.17336 null
2024-03-22 Risk and Response in Large Language Models: Evaluating Key Threat Categories Bahareh Harandizadeh et.al. 2403.14988 null
2024-06-24 Testing the Limits of Jailbreaking Defenses with the Purple Problem Taeyoun Kim et.al. 2403.14725 link
2024-07-23 RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content Zhuowen Yuan et.al. 2403.13031 link
2024-03-18 EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models Weikang Zhou et.al. 2403.12171 link
2024-05-14 Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation Jessica Quaye et.al. 2403.12075 link
2025-01-13 Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models Yifan Li et.al. 2403.09792 link
2024-10-15 Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation Yunhao Gou et.al. 2403.09572 null
2024-03-14 AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting Yu Wang et.al. 2403.09513 link
2024-07-17 The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models? Qinyu Zhao et.al. 2403.09037 link
2024-03-19 Review of Generative AI Methods in Cybersecurity Yagmur Yigit et.al. 2403.08701 null
2024-09-30 Distract Large Language Models for Automatic Jailbreak Attack Zeguan Xiao et.al. 2403.08424 link
2024-03-14 HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback Ang Li et.al. 2403.08309 null
2024-03-14 Red Teaming Models for Hyperspectral Image Analysis Using Explainable AI Vladimir Zaigrajew et.al. 2403.08017 null
2024-08-22 Defending Against Unforeseen Failure Modes with Latent Adversarial Training Stephen Casper et.al. 2403.05030 link
2024-03-07 A Safe Harbor for AI Evaluation and Red Teaming Shayne Longpre et.al. 2403.04893 null
2024-11-14 AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks Yifan Zeng et.al. 2403.04783 link
2024-03-11 Using Hallucinations to Bypass GPT4's Filter Benjamin Lemkin et.al. 2403.04769 null
2024-10-04 Aligners: Decoupling LLMs and Alignment Lilian Ngweta et.al. 2403.04224 link
2024-03-06 ImgTrojan: Jailbreaking Vision-Language Models with ONE Image Xijia Tao et.al. 2403.02910 link
2024-03-05 Here Comes The AI Worm: Unleashing Zero-click Worms that Target GenAI-Powered Applications Stav Cohen et.al. 2403.02817 link
2024-03-02 AutoAttacker: A Large Language Model Guided System to Implement Automatic Cyber-attacks Jiacen Xu et.al. 2403.01038 null
2024-11-07 Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes Xiaomeng Hu et.al. 2403.00867 null
2024-02-28 TroubleLLM: Align to Red Team Expert Zhuoer Xu et.al. 2403.00829 null
2024-09-19 Enhancing Jailbreak Attacks with Diversity Guidance Xu Zhang et.al. 2403.00292 null
2024-02-29 Curiosity-driven Red-teaming for Large Language Models Zhang-Wei Hong et.al. 2402.19464 link
2024-06-10 Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction Tong Liu et.al. 2402.18104 link
2024-10-30 Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue Zhenhong Zhou et.al. 2402.17262 null
2024-11-11 DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers Xirui Li et.al. 2402.16914 link
2024-02-26 CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models Huijie Lv et.al. 2402.16717 link
2024-06-06 Defending LLMs against Jailbreaking Attacks via Backtranslation Yihan Wang et.al. 2402.16459 link
2024-02-28 Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing Jiabao Ji et.al. 2402.16192 link
2024-06-04 ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings Hao Wang et.al. 2402.16006 null
2024-02-24 PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails Neal Mangaokar et.al. 2402.15911 null
2024-03-04 LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper Daoyuan Wu et.al. 2402.15727 null
2024-02-24 Foot In The Door: Understanding Large Language Model Jailbreaking via Cognitive Psychology Zhenhua Wang et.al. 2402.15690 null
2024-02-23 Fast Adversarial Attacks on Language Models In One GPU Minute Vinu Sankar Sadasivan et.al. 2402.15570 link
2024-11-16 How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries Somnath Banerjee et.al. 2402.15302 link
2024-02-27 Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement Heegyu Kim et.al. 2402.15180 null
2024-06-20 Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment Jiongxiao Wang et.al. 2402.14968 null
2024-02-27 Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs Xiaoxia Li et.al. 2402.14872 null
2024-06-18 Is the System Message Really Important to Jailbreaks in Large Language Models? Xiaotian Zou et.al. 2402.14857 null
2024-02-21 Coercing LLMs to do and reveal (almost) anything Jonas Geiping et.al. 2402.14020 link
2024-02-26 AttackGNN: Red-Teaming GNNs in Hardware Security Using Reinforcement Learning Vasudev Gohil et.al. 2402.13946 null
2024-02-21 Round Trip Translation Defence against Large Language Model Jailbreaking Attacks Canaan Yung et.al. 2402.13517 link
2024-05-29 GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis Yueqi Xie et.al. 2402.13494 link
2024-05-17 A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models Zihao Xu et.al. 2402.13457 link
2024-07-05 Defending Jailbreak Prompts via In-Context Adversarial Game Yujun Zhou et.al. 2402.13148 null
2024-06-06 TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification Martin Gubri et.al. 2402.12991 link
2024-06-07 ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs Fengqing Jiang et.al. 2402.11753 link
2024-08-16 ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages Junjie Ye et.al. 2402.10753 link
2024-10-23 When "Competency" in Reasoning Opens the Door to Vulnerability: Jailbreaking LLMs via Novel Complex Ciphers Divij Handa et.al. 2402.10601 link
2024-08-27 A StrongREJECT for Empty Jailbreaks Alexandra Souly et.al. 2402.10260 link
2024-02-15 A Trembling House of Cards? Mapping Adversarial Attacks against Language Agents Lingbo Mo et.al. 2402.10196 link
2024-10-02 Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks Yixin Cheng et.al. 2402.09177 null
2024-02-16 Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues Zhiyuan Chang et.al. 2402.09091 null
2024-07-25 SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding Zhangchen Xu et.al. 2402.08983 link
2024-06-07 COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability Xingang Guo et.al. 2402.08679 link
2024-06-03 Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast Xiangming Gu et.al. 2402.08567 link
2024-02-13 Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning Gelei Deng et.al. 2402.08416 null
2024-10-31 Fight Back Against Jailbreaking via Prompt Adversarial Tuning Yichuan Mo et.al. 2402.06255 link
2024-12-16 Comprehensive Assessment of Jailbreak Attacks Against LLMs Junjie Chu et.al. 2402.05668 link
2024-02-08 Rapid Optimization for Jailbreaking LLMs via Subconscious Exploitation and Echopraxia Guangyu Shen et.al. 2402.05467 link
2024-10-24 Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications Boyi Wei et.al. 2402.05162 null
2024-02-27 HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal Mantas Mazeika et.al. 2402.04249 link
2024-02-05 Nevermind: Instruction Override and Moderation in Large Language Models Edward Kim et.al. 2402.03303 null
2024-05-30 GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models Haibo Jin et.al. 2402.03299 null
2024-02-04 Jailbreaking Attack against Multimodal Large Language Model Zhenxing Niu et.al. 2402.02309 link
2024-06-17 Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models Yongshuo Zong et.al. 2402.02207 link
2024-01-25 MULTIVERSE: Exposing Large Language Model Alignment Problems in Diverse Worlds Xiaolong Jin et.al. 2402.01706 null
2024-11-14 Security and Privacy Challenges of Large Language Models: A Survey Badhan Chandra Das et.al. 2402.00888 null
2024-02-01 Investigating Bias Representations in Llama 2 Chat via Activation Steering Dawn Lu et.al. 2402.00402 null
2024-06-03 On Prompt-Driven Safeguarding for Large Language Models Chujie Zheng et.al. 2401.18018 link
2024-11-08 Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks Andy Zhou et.al. 2401.17263 link
2024-02-05 Weak-to-Strong Jailbreaking on Large Language Models Xuandong Zhao et.al. 2401.17256 link
2024-01-30 A Cross-Language Investigation into Jailbreak Attacks in Large Language Models Jie Li et.al. 2401.16765 null
2024-01-30 Gradient-Based Language Model Red Teaming Nevan Wichers et.al. 2401.16656 link
2024-01-29 Towards Red Teaming in Multimodal and Multilingual Translation Christophe Ropers et.al. 2401.16247 null
2024-08-27 Red-Teaming for Generative AI: Silver Bullet or Security Theater? Michael Feffer et.al. 2401.15897 null
2024-01-23 Red Teaming Visual Language Models Mukai Li et.al. 2401.12915 null
2024-01-24 Digital cloning of online social networks for language-sensitive agent-based modeling of misinformation spread Prateek Puri et.al. 2401.12509 null
2024-07-10 The Ethics of Interaction: Mitigating Security Threats in LLMs Ashutosh Kumar et.al. 2401.12273 null
2024-01-20 InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance Pengyu Wang et.al. 2401.11206 link
2024-10-31 Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning Adib Hasan et.al. 2401.10862 link
2024-05-16 Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models Rima Hazra et.al. 2401.10647 link
2024-02-12 All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks Kazuhiro Takemoto et.al. 2401.09798 link
2024-08-03 AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models Dong Shu et.al. 2401.09002 null
2024-12-24 Revisiting Jailbreaking for Large Language Models: A Representation Engineering Perspective Tianlong Li et.al. 2401.06824 null
2024-12-16 Intention Analysis Makes LLMs A Good Jailbreak Defender Yuqi Zhang et.al. 2401.06561 link
2024-01-23 How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs Yi Zeng et.al. 2401.06373 link
2024-01-11 Combating Adversarial Attacks with Multi-Agent Debate Steffi Chern et.al. 2401.05998 link
2024-04-01 The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Model Performance Abel Salinas et.al. 2401.03729 link
2024-08-19 Malla: Demystifying Real-world Large Language Model Integrated Malicious Services Zilong Lin et.al. 2401.03315 link
2024-01-03 A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity Andrew Lee et.al. 2401.01967 link
2023-12-30 Red Teaming for Large Language Models At Scale: Tackling Hallucinations on Mathematics Tasks Aleksander Buszydlik et.al. 2401.00290 link
2023-12-28 Scalable and automated Evaluation of Blue Team cyber posture in Cyber Ranges Federica Bianchi et.al. 2312.17221 null
2024-08-04 Exploiting Novel GPT-4 APIs Kellin Pelrine et.al. 2312.14302 link
2023-12-12 Maatphor: Automated Variant Analysis for Prompt Injection Attacks Ahmed Salem et.al. 2312.11513 null
2023-12-08 A Red Teaming Framework for Securing AI in Maritime Autonomous Systems Mathew J. Walter et.al. 2312.11500 null
2024-06-18 JailGuard: A Universal Detection Framework for LLM Prompt-based Attacks Xiaoyu Zhang et.al. 2312.10766 null
2023-12-16 Comprehensive Evaluation of ChatGPT Reliability Through Multilingual Inquiries Poorna Chander Reddy Puttaparthi et.al. 2312.10524 link
2023-12-04 Generative AI in Writing Research Papers: A New Type of Algorithmic Bias and Uncertainty in Scholarly Work Rishab Jain et.al. 2312.10057 null
2023-12-14 OSTINATO: Cross-host Attack Correlation Through Attack Activity Similarity Detection Sutanu Kumar Ghosh et.al. 2312.09321 null
2024-04-17 Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF Anand Siththaranjan et.al. 2312.08358 link
2023-12-13 Causality Analysis for Evaluating the Security of Large Language Models Wei Zhao et.al. 2312.07876 link
2024-07-23 AI Control: Improving Safety Despite Intentional Subversion Ryan Greenblatt et.al. 2312.06942 link
2024-05-30 Privacy Issues in Large Language Models: A Survey Seth Neel et.al. 2312.06717 link
2023-12-11 Control Risk for Potential Misuse of Artificial Intelligence in Science Jiyan He et.al. 2312.06632 link
2023-12-08 Seamless: Multilingual Expressive and Streaming Speech Translation Seamless Communication et.al. 2312.05187 link
2023-12-12 DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions Fangzhou Wu et.al. 2312.04730 null
2024-02-23 Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak Yanrui Du et.al. 2312.04127 null
2024-10-31 Tree of Attacks: Jailbreaking Black-Box LLMs Automatically Anay Mehrotra et.al. 2312.02119 link
2024-06-09 Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections Yuanpu Cao et.al. 2312.00027 link
2024-03-03 Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition Sander Schulhoff et.al. 2311.16119 link
2023-11-27 How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs Haoqin Tu et.al. 2311.16101 link
2023-11-27 InfoPattern: Unveiling Information Propagation Patterns in Social Media Chi Han et.al. 2311.15642 null
2023-11-15 Towards Publicly Accountable Frontier LLMs: Building an External Scrutiny Ecosystem under the ASPIRE Framework Markus Anderljung et.al. 2311.14711 null
2024-04-29 Universal Jailbreak Backdoors from Poisoned Human Feedback Javier Rando et.al. 2311.14455 link
2024-03-24 Unmasking and Improving Data Credibility: A Study with Datasets for Training Harmless Language Models Zhaowei Zhu et.al. 2311.11202 link
2024-06-15 Hijacking Large Language Models via Adversarial In-Context Learning Yao Qiang et.al. 2311.09948 link
2024-02-29 Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking Nan Xu et.al. 2311.09827 null
2024-06-19 RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models Jiongxiao Wang et.al. 2311.09641 null
2023-11-16 JAB: Joint Adversarial Prompting and Belief Augmentation Ninareh Mehrabi et.al. 2311.09473 null
2024-08-15 Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment Haoran Wang et.al. 2311.09433 link
2024-01-20 Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts Yuanwei Wu et.al. 2311.09127 null
2024-06-12 Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization Zhexin Zhang et.al. 2311.09096 link
2023-11-29 AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications Bhaktipriya Radharapu et.al. 2311.08592 null
2024-04-07 A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily Peng Ding et.al. 2311.08268 link
2023-11-13 MART: Improving LLM Safety with Multi-round Automatic Red-Teaming Suyu Ge et.al. 2311.07689 null
2024-05-22 Flames: Benchmarking Value Alignment of LLMs in Chinese Kexin Huang et.al. 2311.06899 link
2024-12-10 Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming Nanna Inie et.al. 2311.06237 null
2024-04-01 Fake Alignment: Are LLMs Really Aligned Well? Yixu Wang et.al. 2311.05915 link
2025-01-19 FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts Yichen Gong et.al. 2311.05608 link
2024-03-08 Can LLMs Follow Simple Rules? Norman Mu et.al. 2311.04235 link
2023-11-24 Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation Rusheb Shah et.al. 2311.03348 null
2024-11-28 DeepInception: Hypnotize Large Language Model to Be Jailbreaker Xuan Li et.al. 2311.03191 link
2024-05-22 LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B Simon Lermen et.al. 2310.20624 null
2024-03-10 From Chatbots to PhishBots? -- Preventing Phishing scams created using ChatGPT, Google Bard and Claude Sayak Saha Roy et.al. 2310.19181 null
2024-03-22 Self-Guard: Empower the LLM to Safeguard Itself Zezhong Wang et.al. 2310.15851 null
2023-12-14 AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models Sicheng Zhu et.al. 2310.15140 null
2023-11-13 Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases Rishabh Bhardwaj et.al. 2310.14303 null
2023-10-20 Adaptive Experimental Design for Intrusion Data Collection Kate Highnam et.al. 2310.13224 null
2023-10-28 Probing LLMs for hate speech detection: strengths and vulnerabilities Sarthak Roy et.al. 2310.12860 null
2023-10-19 Attack Prompt Generation for Red Teaming and Defending Large Language Models Boyi Deng et.al. 2310.12505 link
2023-10-17 Learning from Red Teaming: Gender Bias Provocation and Mitigation in Large Language Models Hsuan Su et.al. 2310.11079 null
2023-10-16 Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks Erfan Shayegani et.al. 2310.10844 null
2024-02-16 Large Language Model Unlearning Yuanshun Yao et.al. 2310.10683 link
2024-06-07 Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? Yu-Lin Tsai et.al. 2310.10012 link
2023-11-11 ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models Alex Mei et.al. 2310.09624 link
2024-07-18 Jailbreaking Black Box Large Language Models in Twenty Queries Patrick Chao et.al. 2310.08419 link
2023-10-10 Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation Yangsibo Huang et.al. 2310.06987 link
2024-03-04 Multilingual Jailbreak Challenges in Large Language Models Yue Deng et.al. 2310.06474 link
2024-05-25 Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations Zeming Wei et.al. 2310.06387 null
2024-03-20 AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models Xiaogeng Liu et.al. 2310.04451 link
2023-09-17 Red Teaming Generative AI/NLP, the BB84 quantum cryptography protocol and the NIST-approved Quantum-Resistant Cryptographic Algorithms Petar Radanliev et.al. 2310.04425 null
2023-10-05 Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! Xiangyu Qi et.al. 2310.03693 link
2024-06-11 SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks Alexander Robey et.al. 2310.03684 link
2024-01-27 Low-Resource Languages Jailbreak GPT-4 Zheng-Xin Yong et.al. 2310.02446 null
2023-10-03 Jailbreaker in Jail: Moving Target Defense for Large Language Models Bocheng Chen et.al. 2310.02417 null
2023-10-03 Can Language Models be Instructed to Protect Personal Information? Yang Chen et.al. 2310.02224 null
2024-01-22 Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench Jen-tse Huang et.al. 2310.01386 link
2023-10-02 No Offense Taken: Eliciting Offensiveness from Language Models Anugya Srivastava et.al. 2310.00892 link
2024-07-28 Evolving Diverse Red-team Language Models in Multi-round Multi-agent Games Chengdong Ma et.al. 2310.00322 null
2024-06-12 Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM Bochuan Cao et.al. 2309.14348 link
2024-06-27 GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts Jiahao Yu et.al. 2309.10253 link
2024-06-08 Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts Zhi-Yi Chin et.al. 2309.06135 link
2024-04-14 FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models Dongyu Yao et.al. 2309.05274 link
2024-08-05 Open Sesame! Universal Black Box Jailbreaking of Large Language Models Raz Lapid et.al. 2309.01446 null
2023-09-04 Baseline Defenses for Adversarial Attacks Against Aligned Language Models Neel Jain et.al. 2309.00614 null
2023-08-28 The Promise and Peril of Artificial Intelligence -- Violet Teaming Offers a Balanced Path Forward Alexander J. Titus et.al. 2308.14253 null
2023-11-07 Detecting Language Model Attacks with Perplexity Gabriel Alon et.al. 2308.14132 null
2023-08-25 Self-Deception: Reverse Penetrating the Semantic Firewall of Large Language Models Zhenhua Wang et.al. 2308.11521 null
2023-08-21 On the Adversarial Robustness of Multi-Modal Foundation Models Christian Schlarmann et.al. 2308.10741 link
2023-08-21 Using Large Language Models for Cybersecurity Capture-The-Flag Challenges and Certification Questions Wesley Tann et.al. 2308.10443 null
2023-08-30 Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment Rishabh Bhardwaj et.al. 2308.09662 link
2024-05-06 Robustness Over Time: Understanding Adversarial Examples' Effectiveness on Longitudinal Versions of Large Language Models Yugeng Liu et.al. 2308.07847 null
2024-03-26 GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher Youliang Yuan et.al. 2308.06463 link
2023-08-16 Where's the Liability in Harmful AI Speech? Peter Henderson et.al. 2308.04635 null
2024-11-07 FLIRT: Feedback Loop In-context Red Teaming Ninareh Mehrabi et.al. 2308.04265 null
2024-05-15 "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models Xinyue Shen et.al. 2308.03825 link
2024-04-01 XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models Paul Röttger et.al. 2308.01263 link
2023-08-03 Confidence-Building Measures for Artificial Intelligence: Workshop Proceedings Sarah Shoker et.al. 2308.00862 null
2023-12-20 Universal and Transferable Adversarial Attacks on Aligned Language Models Andy Zou et.al. 2307.15043 link
2023-10-10 Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models Erfan Shayegani et.al. 2307.14539 null
2023-10-25 MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots Gelei Deng et.al. 2307.08715 null
2023-08-28 Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models Huachuan Qiu et.al. 2307.08487 link
2023-07-05 Jailbroken: How Does LLM Safety Training Fail? Alexander Wei et.al. 2307.02483 null
2023-07-03 From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy Maanak Gupta et.al. 2307.00691 null
2023-08-16 Visual Adversarial Examples Jailbreak Aligned Large Language Models Xiangyu Qi et.al. 2306.13213 link
2024-02-26 DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models Boxin Wang et.al. 2306.11698 null
2023-10-11 Explore, Establish, Exploit: Red Teaming Language Models from Scratch Stephen Casper et.al. 2306.09442 link
2023-05-30 Seeing Seeds Beyond Weeds: Green Teaming Generative AI for Beneficial Uses Logan Stapleton et.al. 2306.03097 null
2023-10-19 Red Teaming Language Model Detectors with Language Models Zhouxing Shi et.al. 2305.19713 link
2023-05-27 Query-Efficient Black-Box Red Teaming via Bayesian Optimization Deokjae Lee et.al. 2305.17444 link
2024-03-27 Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks Abhinav Rao et.al. 2305.14965 link
2024-03-10 Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study Yi Liu et.al. 2305.13860 null
2023-11-10 SneakyPrompt: Jailbreaking Text-to-image Generative Models Yuchen Yang et.al. 2305.12082 link
2023-05-11 Towards best practices in AGI safety and governance: A survey of expert opinion Jonas Schuett et.al. 2305.07153 null
2023-05-09 Generating Phishing Attacks using ChatGPT Sayak Saha Roy et.al. 2305.05133 null
2023-10-19 Automatic Prompt Optimization with "Gradient Descent" and Beam Search Reid Pryzant et.al. 2305.03495 link
2023-04-21 Power to the Data Defenders: Human-Centered Disclosure Risk Calibration of Open Data Kaustav Bhattacharjee et.al. 2304.11278 null
2024-06-03 Fundamental Limitations of Alignment in Large Language Models Yotam Wolf et.al. 2304.11082 link
2023-11-01 Multi-step Jailbreaking Privacy Attacks on ChatGPT Haoran Li et.al. 2304.05197 link
2023-07-27 Clustered Federated Learning Architecture for Network Anomaly Detection in Large Scale Heterogeneous IoT Networks Xabier Sáez-de-Cámara et.al. 2303.15986 null
2023-03-09 Personalisation within bounds: A risk taxonomy and policy framework for the alignment of large language models with personalised feedback Hannah Rose Kirk et.al. 2303.05453 null
2023-09-21 Red Teaming Deep Neural Networks with Feature Synthesis Tools Stephen Casper et.al. 2302.10894 null
2023-01-05 Can Large Language Models Change User Preference Adversarially? Varshini Subhash et.al. 2302.10291 null
2023-05-29 Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity Terry Yue Zhuo et.al. 2301.12867 null
2024-08-23 Asymptotically Normal Estimation of Local Latent Network Curvature Steven Wilkins-Reeves et.al. 2211.11673 link
2023-05-05 Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks Stephen Casper et.al. 2211.10024 link
2023-06-07 Beyond the Surface: Investigating Malicious CVE Proof of Concept Exploits on GitHub Soufian El Yadmani et.al. 2210.08374 null
2022-11-10 Red-Teaming the Stable Diffusion Safety Filter Javier Rando et.al. 2210.04610 null
2022-11-22 Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned Deep Ganguli et.al. 2209.07858 link
2023-10-13 Red Teaming with Mind Reading: White-Box Adversarial Policies Against RL Agents Stephen Casper et.al. 2209.02167 link
2022-08-16 CTI4AI: Threat Intelligence Generation and Sharing after Red Teaming AI Models Chuyen Nguyen et.al. 2208.07476 null
2022-08-12 PRIVEE: A Visual Analytic Workflow for Proactive Privacy Risk Inspection of Open Data Kaustav Bhattacharjee et.al. 2208.06481 null
2022-07-30 'PeriHack': Designing a Serious Game for Cybersecurity Awareness Roberto Dillon et.al. 2208.00235 null
2023-07-27 Gotham Testbed: a Reproducible IoT Testbed for Security Experiments and Dataset Generation Xabier Sáez-de-Cámara et.al. 2207.13981 link
2022-02-07 Red Teaming Language Models with Language Models Ethan Perez et.al. 2202.03286 null
2021-12-22 Catch Me If You GAN: Using Artificial Intelligence for Fake Log Generation Christian Toemmel et.al. 2112.12006 null
2021-12-18 Dynamic Defender-Attacker Blotto Game Daigo Shishika et.al. 2112.09890 null
2021-11-24 Needle in a Haystack: Detecting Subtle Malicious Edits to Additive Manufacturing G-code Files Caleb Beckwith et.al. 2111.12746 null
2021-10-04 Automating Privilege Escalation with Deep Reinforcement Learning Kalle Kujanpää et.al. 2110.01362 null
2021-08-20 CybORG: A Gym for the Development of Autonomous Cyber Agents Maxwell Standen et.al. 2108.09118 null
2021-05-27 Hopper: Modeling and Detecting Lateral Movement (Extended Report) Grant Ho et.al. 2105.13442 link
2021-04-23 Predicting Adversary Lateral Movement Patterns with Deep Learning Nathan Danneman et.al. 2104.13195 null
2021-03-29 Automating Defense Against Adversarial Attacks: Discovery of Vulnerabilities and Application of Multi-INT Imagery to Protect Deployed Models Josh Kalin et.al. 2103.15897 null
2022-06-28 Dynamically Modelling Heterogeneous Higher-Order Interactions for Malicious Behavior Detection in Event Logs Corentin Larroche et.al. 2103.15708 link
2021-05-04 An In-memory Embedding of CPython for Offensive Use Ateeq Sharfuddin et.al. 2103.15202 null
2020-11-26 Investigation on Research Ethics and Building a Benchmark Shun Inagaki et.al. 2011.13925 null
2020-09-17 Can ROS be used securely in industry? Red teaming ROS-Industrial Víctor Mayoral-Vilches et.al. 2009.08211 null
2020-07-17 HARMer: Cyber-attacks Automation and Evaluation Simon Yusuf Enoch et.al. 2006.14352 null
2021-04-16 HACK3D: Crowdsourcing the Assessment of Cybersecurity in Digital Manufacturing Michael Linares et.al. 2005.04368 null
2020-03-11 Passlab: A Password Security Tool for the Blue Team Saul Johnson et.al. 2003.07208 null
2020-10-02 SoK: A Survey of Open-Source Threat Emulators Polina Zilberman et.al. 2003.01518 null
2020-02-26 CybORG: An Autonomous Cyber Operations Research Gym Callum Baillie et.al. 2002.10667 null
2021-01-29 Anomaly Detection in Large Scale Networks with Latent Space Models Wesley Lee et.al. 1911.05522 null
2019-06-17 The Little Phone That Could Ch-Ch-Chroot Jack Whitter-Jones et.al. 1906.07242 null
2019-06-12 Relative Hausdorff Distance for Network Analysis Sinan G. Aksoy et.al. 1906.04936 null
2019-10-24 Quantifiable & Comparable Evaluations of Cyber Defensive Capabilities: A Survey & Novel, Unified Approach Michael D. Iannacone et.al. 1902.00053 null
2018-10-13 Two Can Play That Game: An Adversarial Evaluation of a Cyber-alert Inspection System Ankit Shah et.al. 1810.05921 null
2018-02-27 A Multi-Disciplinary Review of Knowledge Acquisition Methods: From Human to Autonomous Eliciting Agents George Leu et.al. 1802.09669 null
2018-02-27 Computational Red Teaming in a Sudoku Solving Context: Neural Network Based Skill Representation and Acquisition George Leu et.al. 1802.09660 null
2018-02-26 Shaping Influence and Influencing Shaping: A Computational Red Teaming Trust-based Swarm Intelligence Model Jiangjun Tang et.al. 1802.09647 null
2018-01-06 SLEUTH: Real-time Attack Scenario Reconstruction from COTS Audit Data Md Nahid Hossain et.al. 1801.02062 null
2017-12-02 Recurrent Neural Network Language Models for Open Vocabulary Event-Level Cyber Anomaly Detection Aaron Tuor et.al. 1712.00557 link
2015-04-07 Security Toolbox for Detecting Novel and Sophisticated Android Malware Benjamin Holland et.al. 1504.01693 null


About

Awesome Jailbreaking Multimodal Large Language Models (Automatically Updated Every 12 Hours)
