Publish Date | Title | Authors | arXiv ID | Code |
---|---|---|---|---|
2025-01-23 | Tune In, Act Up: Exploring the Impact of Audio Modality-Specific Edits on Large Audio Language Models in Jailbreak | Erjia Xiao et.al. | 2501.13772 | null |
2025-01-19 | Dagger Behind Smile: Fool LLMs with a Happy Ending Story | Xurui Song et.al. | 2501.13115 | null |
2025-01-21 | You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense | Wuyuao Mai et.al. | 2501.12210 | null |
2025-01-19 | Can Safety Fine-Tuning Be More Principled? Lessons Learned from Cybersecurity | David Williams-King et.al. | 2501.11183 | null |
2025-01-18 | Jailbreaking Large Language Models in Infinitely Many Ways | Oliver Goldstein et.al. | 2501.10800 | null |
2025-01-18 | Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks | Xin Yi et.al. | 2501.10639 | null |
2024-12-17 | What Information Should Be Shared with Whom "Before and During Training"? | Haydn Belfield et.al. | 2501.10379 | null |
2025-01-16 | A Survey on Responsible LLMs: Inherent Risk, Malicious Use, and Mitigation Strategy | Huandong Wang et.al. | 2501.09431 | null |
2025-01-14 | Playing Devil's Advocate: Unmasking Toxicity and Vulnerabilities in Large Vision-Language Models | Abdulkadir Erol et.al. | 2501.09039 | null |
2025-01-15 | SAIF: A Comprehensive Framework for Evaluating the Risks of Generative AI in the Public Sector | Kyeongryul Lee et.al. | 2501.08814 | null |
2025-01-14 | Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints | Jonathan Nöther et.al. | 2501.08246 | null |
2025-01-14 | Self-Instruct Few-Shot Jailbreaking: Decompose the Attack into Pattern and Behavior Learning | Jiaqi Hua et.al. | 2501.07959 | link |
2025-01-14 | Gandalf the Red: Adaptive Security for LLMs | Niklas Pfister et.al. | 2501.07927 | link |
2025-01-13 | Lessons From Red Teaming 100 Generative AI Products | Blake Bullwinkel et.al. | 2501.07238 | null |
2025-01-09 | Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency | Shiji Zhao et.al. | 2501.04931 | null |
2025-01-05 | Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense | Yang Ouyang et.al. | 2501.02629 | link |
2025-01-03 | Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models | Ziwei Zheng et.al. | 2501.02029 | null |
2025-01-02 | Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs | Joao Fonseca et.al. | 2501.02018 | null |
2025-01-09 | Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions | Rachneet Sachdeva et.al. | 2501.01872 | link |
2025-01-03 | Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models | Yanjiang Liu et.al. | 2501.01830 | null |
2025-01-09 | WeAudit: Scaffolding User Auditors and AI Practitioners in Auditing Generative AI | Wesley Hanwen Deng et.al. | 2501.01397 | null |
2025-01-02 | CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models | Johan Wahréus et.al. | 2501.01335 | link |
2024-12-29 | Adversarial Negotiation Dynamics in Generative Language Models | Arinbjörn Kolbeinsson et.al. | 2501.00069 | null |
2024-12-28 | LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models | Miao Yu et.al. | 2501.00055 | link |
2024-12-30 | InfAlign: Inference-aware language model alignment | Ananth Balashankar et.al. | 2412.19792 | null |
2024-12-24 | Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning | Alex Beutel et.al. | 2412.18693 | null |
2024-12-25 | Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models | Xiaomeng Hu et.al. | 2412.18171 | null |
2024-12-23 | Retention Score: Quantifying Jailbreak Risks for Vision Language Models | Zaitang Li et.al. | 2412.17544 | null |
2025-01-05 | DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak | Hao Wang et.al. | 2412.17522 | null |
2024-12-22 | Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models | Lang Gao et.al. | 2412.17034 | null |
2024-12-22 | Robustness of Large Language Models Against Adversarial Attacks | Yiyi Tao et.al. | 2412.17011 | null |
2024-12-21 | OpenAI o1 System Card | OpenAI et.al. | 2412.16720 | null |
2024-12-21 | POEX: Policy Executable Embodied AI Jailbreak Attacks | Xuancun Lu et.al. | 2412.16633 | null |
2024-12-21 | Divide and Conquer: A Hybrid Strategy Defeats Multimodal Large Language Models | Yanxu Mao et.al. | 2412.16555 | null |
2025-01-08 | Deliberative Alignment: Reasoning Enables Safer Language Models | Melody Y. Guan et.al. | 2412.16339 | null |
2024-12-20 | Logical Consistency of Large Language Models in Fact-checking | Bishwamittra Ghosh et.al. | 2412.16100 | null |
2024-12-20 | JailPO: A Novel Black-box Jailbreak Framework via Preference Optimization against Aligned LLMs | Hongyi Li et.al. | 2412.15623 | null |
2024-12-19 | SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage | Xiaoning Dong et.al. | 2412.15289 | null |
2025-01-08 | Toxicity Detection towards Adaptability to Changing Perturbations | Hankun Kang et.al. | 2412.15267 | null |
2024-12-18 | Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation | Aneta Zugecova et.al. | 2412.13666 | null |
2024-12-17 | Concept-ROT: Poisoning Concepts in Large Language Models with Model Editing | Keltin Grimes et.al. | 2412.13341 | link |
2024-12-17 | Jailbreaking? One Step Is Enough! | Weixiong Zheng et.al. | 2412.12621 | null |
2024-12-17 | Subversion Strategy Eval: Evaluating AI's stateless strategic capabilities against control protocols | Alex Mallen et.al. | 2412.12480 | null |
2024-12-13 | No Free Lunch for Defending Against Prefilling Attack by In-Context Learning | Zhiyu Xue et.al. | 2412.12192 | null |
2024-12-10 | Na'vi or Knave: Jailbreaking Language Models via Metaphorical Avatars | Yu Yan et.al. | 2412.12145 | null |
2024-12-15 | SpearBot: Leveraging Large Language Models in a Generative-Critique Framework for Spear-Phishing Email Generation | Qinglin Qi et.al. | 2412.11109 | null |
2024-12-15 | Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models | Di Wu et.al. | 2412.11041 | null |
2024-12-14 | IntelEX: A LLM-driven Attack-level Threat Intelligence Extraction Framework | Ming Xu et.al. | 2412.10872 | null |
2024-12-14 | Towards Action Hijacking of Large Language Model-based Agent | Yuyang Zhang et.al. | 2412.10807 | null |
2024-12-10 | Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM | Shaoqing Zhang et.al. | 2412.10423 | link |
2024-12-13 | AdvPrefix: An Objective for Nuanced LLM Jailbreaks | Sicheng Zhu et.al. | 2412.10321 | link |
2024-12-12 | AI Red-Teaming is a Sociotechnical System. Now What? | Tarleton Gillespie et.al. | 2412.09751 | null |
2024-12-12 | Obfuscated Activations Bypass LLM Latent-Space Defenses | Luke Bailey et.al. | 2412.09565 | null |
2024-12-16 | Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models | Jiahui Li et.al. | 2412.08615 | link |
2024-12-11 | AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models | Mintong Kang et.al. | 2412.08608 | null |
2024-12-11 | Model-Editing-Based Jailbreak against Safety-aligned Large Language Models | Yuxi Li et.al. | 2412.08201 | null |
2024-12-11 | Antelope: Potent and Concealed Jailbreak Attack Strategy | Xin Zhao et.al. | 2412.08156 | null |
2024-12-11 | Evil twins are not that evil: Qualitative insights into machine-generated prompts | Nathanaël Carraz Rakotonirina et.al. | 2412.08127 | null |
2024-12-16 | Granite Guardian | Inkit Padhi et.al. | 2412.07724 | link |
2024-12-10 | FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks | Bocheng Chen et.al. | 2412.07672 | null |
2024-12-10 | TraSCE: Trajectory Steering for Concept Erasure | Anubhav Jain et.al. | 2412.07658 | link |
2024-12-10 | PrisonBreak: Jailbreaking Large Language Models with Fewer Than Twenty-Five Targeted Bit-flips | Zachary Coalson et.al. | 2412.07192 | null |
2024-11-03 | Poison Attacks and Adversarial Prompts Against an Informed University Virtual Assistant | Ivan A. Fernandez et.al. | 2412.06788 | null |
2024-12-09 | Enhancing Adversarial Resistance in LLMs with Recursion | Bryan Li et.al. | 2412.06181 | null |
2025-01-03 | Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models | Ma Teng et.al. | 2412.05934 | link |
2024-12-16 | PBI-Attack: Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for Toxicity Maximization | Ruoxi Cheng et.al. | 2412.05892 | null |
2024-12-07 | PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage | Yuzhou Nie et.al. | 2412.05734 | link |
2024-12-06 | BadGPT-4o: stripping safety finetuning from GPT models | Ekaterina Krupkina et.al. | 2412.05346 | null |
2024-12-06 | LIAR: Leveraging Alignment (Best-of-N) to Jailbreak LLMs in Seconds | James Beetham et.al. | 2412.05232 | null |
2024-12-19 | Best-of-N Jailbreaking | John Hughes et.al. | 2412.03556 | link |
2024-12-04 | Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts? | Sravanti Addepalli et.al. | 2412.03235 | null |
2024-12-03 | Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach | Tony T. Wang et.al. | 2412.02159 | null |
2024-12-03 | Trust & Safety of LLMs and LLMs in Trust & Safety | Doohee You et.al. | 2412.02113 | null |
2024-12-02 | Improved Large Language Model Jailbreak Detection via Pretrained Embeddings | Erick Galinkin et.al. | 2412.01547 | null |
2024-12-17 | Jailbreak Large Vision-Language Models Through Multi-Modal Linkage | Yu Wang et.al. | 2412.00473 | link |
2024-11-30 | Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models | Sanghyun Kim et.al. | 2412.00357 | null |
2024-12-19 | PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning | Shenghui Li et.al. | 2411.19335 | null |
2024-11-28 | DIESEL -- Dynamic Inference-Guidance via Evasion of Semantic Embeddings in LLMs | Ben Ganon et.al. | 2411.19038 | null |
2024-12-20 | Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Soumya Suvra Ghosal et.al. | 2411.18688 | null |
2024-11-27 | Embodied Red Teaming for Auditing Robotic Foundation Models | Sathwik Karnik et.al. | 2411.18676 | null |
2024-11-28 | Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models | Shuyang Hao et.al. | 2411.18000 | null |
2024-11-26 | Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats | Jiaxin Wen et.al. | 2411.17693 | null |
2025-01-14 | Don't Command, Cultivate: An Exploratory Study of System-2 Alignment | Yuhang Wang et.al. | 2411.17075 | link |
2024-11-25 | In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models | Zhi-Yi Chin et.al. | 2411.16769 | null |
2024-11-23 | ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain | Haochen Zhao et.al. | 2411.16736 | link |
2024-12-04 | "Moralized" Multi-Step Jailbreak Prompts: Black-Box Testing of Guardrails in Large Language Models for Verbal Attacks | Libo Wang et.al. | 2411.16730 | link |
2024-11-29 | Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks | Han Wang et.al. | 2411.16721 | link |
2024-11-25 | Preventing Jailbreak Prompts as Malicious Tools for Cybercriminals: A Cyber Defense Perspective | Jean Marie Tshimula et.al. | 2411.16642 | null |
2024-11-22 | Universal and Context-Independent Triggers for Precise Control of LLM Outputs | Jiashuo Liang et.al. | 2411.14738 | null |
2024-11-21 | Global Challenge for Safe and Secure LLMs Track 1 | Xiaojun Jia et.al. | 2411.14502 | null |
2024-11-21 | GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs | Advik Raj Basani et.al. | 2411.14133 | link |
2024-11-20 | A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection | Gabriel Chua et.al. | 2411.12946 | null |
2024-11-27 | Playing Language Game with LLMs Leads to Jailbreaking | Yu Peng et.al. | 2411.12762 | null |
2024-12-08 | TrojanRobot: Backdoor Attacks Against LLM-based Embodied Robots in the Physical World | Xianlong Wang et.al. | 2411.11683 | null |
2024-11-28 | Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models | Chenhang Cui et.al. | 2411.11496 | link |
2024-11-18 | The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models | Xikang Yang et.al. | 2411.11407 | link |
2024-11-18 | Steering Language Model Refusal with Sparse Autoencoders | Kyle O'Brien et.al. | 2411.11296 | null |
2024-11-17 | JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit | Zeqing He et.al. | 2411.11114 | null |
2024-12-09 | Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey | Xuannan Liu et.al. | 2411.09259 | link |
2024-11-14 | DROJ: A Prompt-Driven Attack against Large Language Models | Leyang Hu et.al. | 2411.09125 | link |
2024-11-13 | LLMStinger: Jailbreaking LLMs using RL fine-tuned LLMs | Piyush Jha et.al. | 2411.08862 | null |
2024-11-13 | The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense | Yangyang Guo et.al. | 2411.08410 | null |
2024-11-12 | Zer0-Jack: A Memory-efficient Gradient-based Jailbreaking Method for Black-box Multi-modal Large Language Models | Tiejin Chen et.al. | 2411.07559 | null |
2024-11-12 | Rapid Response: Mitigating LLM Jailbreaks with a Few Examples | Alwin Peng et.al. | 2411.07494 | null |
2024-11-11 | HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model Alignment | Yannis Belkhiter et.al. | 2411.06835 | null |
2024-11-10 | SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains | Bijoy Ahmed Saiem et.al. | 2411.06426 | null |
2024-11-06 | Diversity Helps Jailbreak Large Language Models | Weiliang Zhao et.al. | 2411.04223 | null |
2025-01-07 | MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue | Fengxiang Wang et.al. | 2411.03814 | null |
2024-11-02 | What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks | Nathalie Maria Kirch et.al. | 2411.03343 | link |
2024-12-05 | Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment | Jason Vega et.al. | 2411.02785 | link |
2024-11-03 | UniGuard: Towards Universal Safety Guardrails for Jailbreak Attacks on Multimodal Large Language Models | Sejoon Oh et.al. | 2411.01703 | null |
2024-12-10 | SQL Injection Jailbreak: a structural disaster of large language models | Jiawei Zhao et.al. | 2411.01565 | link |
2024-11-03 | AURA: Amplifying Understanding, Resilience, and Awareness for Responsible AI Content Work | Alice Qian Zhang et.al. | 2411.01426 | null |
2024-12-11 | Plentiful Jailbreaks with String Compositions | Brian R. Y. Huang et.al. | 2411.01084 | null |
2024-11-01 | Emoji Attack: A Method for Misleading Judge LLMs in Safety Risk Detection | Zhipeng Wei et.al. | 2411.01077 | link |
2024-11-15 | IDEATOR: Jailbreaking Large Vision-Language Models Using Themselves | Ruofan Wang et.al. | 2411.00827 | null |
2024-11-26 | Desert Camels and Oil Sheikhs: Arab-Centric Red Teaming of Frontier LLMs | Muhammed Saeed et.al. | 2410.24049 | null |
2024-10-31 | Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models | Hao Yang et.al. | 2410.23861 | null |
2024-10-31 | Adversarial Attacks of Vision Tasks in the Past 10 Years: A Survey | Chiyu Zhang et.al. | 2410.23687 | null |
2024-11-27 | Transferable Ensemble Black-box Jailbreak Attacks on Large Language Models | Yiqi Yang et.al. | 2410.23558 | null |
2024-10-30 | ProTransformer: Robustify Transformers via Plug-and-Play Paradigm | Zhichao Hou et.al. | 2410.23182 | link |
2024-10-29 | Benchmarking LLM Guardrails in Handling Multilingual Toxicity | Yahan Yang et.al. | 2410.22153 | null |
2024-10-29 | AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts | Vishal Kumar et.al. | 2410.22143 | null |
2024-10-29 | SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types | Yutao Mou et.al. | 2410.21965 | link |
2024-10-28 | Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring | Honglin Mu et.al. | 2410.21083 | null |
2024-10-28 | BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks | Yunhan Zhao et.al. | 2410.20971 | null |
2024-10-25 | RobustKV: Defending Large Language Models against Jailbreak Attacks via KV Eviction | Tanqiu Jiang et.al. | 2410.19937 | null |
2024-10-25 | An Auditing Test To Detect Behavioral Shift in Language Models | Leo Richter et.al. | 2410.19406 | link |
2024-10-25 | Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities | Chung-En Sun et.al. | 2410.18469 | link |
2024-10-23 | Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks | Samuele Poppi et.al. | 2410.18210 | null |
2024-10-23 | Guide for Defense (G4D): Dynamic Guidance for Robust and Balanced Defense in Large Language Models | He Cao et.al. | 2410.17922 | link |
2024-10-22 | LLM-Assisted Red Teaming of Diffusion Models through "Failures Are Fated, But Can Be Faded" | Som Sagar et.al. | 2410.16738 | null |
2024-11-02 | Bayesian scaling laws for in-context learning | Aryaman Arora et.al. | 2410.16531 | link |
2024-11-16 | Insights and Current Gaps in Open-Source LLM Vulnerability Scanners: A Comparative Analysis | Jonathan Brokman et.al. | 2410.16527 | null |
2024-10-18 | Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs | Rui Pu et.al. | 2410.16327 | null |
2024-10-21 | A Realistic Threat Model for Large Language Model Jailbreaks | Valentyn Boreiko et.al. | 2410.16222 | link |
2024-10-21 | A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns | Tianyi Men et.al. | 2410.16155 | null |
2024-11-03 | Boosting Jailbreak Transferability for Large Language Models | Hanqing Liu et.al. | 2410.15645 | link |
2024-10-21 | SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis | Aidan Wong et.al. | 2410.15641 | link |
2024-10-20 | Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models | Xiao Li et.al. | 2410.15362 | null |
2024-10-20 | Jailbreaking and Mitigation of Vulnerabilities in Large Language Models | Benji Peng et.al. | 2410.15236 | null |
2024-10-16 | SoK: Prompt Hacking of Large Language Models | Baha Rababah et.al. | 2410.13901 | null |
2024-10-15 | A Formal Framework for Assessing and Mitigating Emergent Security Risks in Generative AI Models: Bridging Theory and Dynamic Risk Mitigation | Aviral Srivastava et.al. | 2410.13897 | null |
2024-10-21 | Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents | Priyanshu Kumar et.al. | 2410.13886 | link |
2024-10-17 | PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment | Zekun Moore Wang et.al. | 2410.13785 | null |
2024-10-17 | Persistent Pre-Training Poisoning of LLMs | Yiming Zhang et.al. | 2410.13722 | null |
2024-11-09 | Jailbreaking LLM-Controlled Robots | Alexander Robey et.al. | 2410.13691 | null |
2025-01-02 | BiasJailbreak:Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models | Isack Lee et.al. | 2410.13334 | link |
2024-10-17 | SPIN: Self-Supervised Prompt INjection | Leon Zhou et.al. | 2410.13236 | null |
2024-10-18 | JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework | Fan Liu et.al. | 2410.12855 | null |
2024-10-19 | Multi-round jailbreak attack on large language models | Yihua Zhou et.al. | 2410.11533 | null |
2024-10-15 | Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models | Hao Yang et.al. | 2410.11459 | link |
2025-01-20 | Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation | Qizhang Li et.al. | 2410.11317 | link |
2024-10-15 | AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment | Pankayaraj Pathmanathan et.al. | 2410.11283 | null |
2024-10-15 | Cognitive Overload Attack:Prompt Injection for Long Context | Bibek Upadhayay et.al. | 2410.11272 | link |
2024-10-14 | Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues | Qibing Ren et.al. | 2410.10700 | link |
2024-10-14 | On Calibration of LLM-based Guard Models for Reliable Content Moderation | Hongfu Liu et.al. | 2410.10414 | link |
2024-10-14 | Jailbreak Instruction-Tuned LLMs via end-of-sentence MLP Re-weighting | Yifan Luo et.al. | 2410.10150 | null |
2024-11-27 | BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models | Xinyuan Wang et.al. | 2410.09804 | null |
2024-10-18 | VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment | Lei Li et.al. | 2410.09421 | null |
2024-12-17 | Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations | Tarun Raheja et.al. | 2410.09097 | null |
2024-10-11 | AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation | Zijun Wang et.al. | 2410.09040 | link |
2024-10-14 | AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents | Maksym Andriushchenko et.al. | 2410.09024 | null |
2024-11-29 | RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process | Peiran Wang et.al. | 2410.08660 | null |
2024-10-09 | Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level | Xinyi Zeng et.al. | 2410.06809 | null |
2024-10-04 | Developing Assurance Cases for Adversarial Robustness and Regulatory Compliance in LLMs | Tomas Bueno Momcilovic et.al. | 2410.05304 | null |
2024-11-27 | AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs | Xiaogeng Liu et.al. | 2410.05295 | link |
2024-10-06 | Attention Shift: Steering AI Away from Unsafe Content | Shivank Garg et.al. | 2410.04447 | null |
2024-10-05 | Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks | Zi Wang et.al. | 2410.04234 | null |
2024-10-05 | Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models | Yiting Dong et.al. | 2410.04190 | null |
2024-10-04 | Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step | Wenxuan Wang et.al. | 2410.03869 | null |
2024-10-08 | You Know What I'm Saying: Jailbreak Attack via Implicit Reference | Tianyu Wu et.al. | 2410.03857 | link |
2024-12-16 | SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks | Tianhao Li et.al. | 2410.03769 | null |
2024-10-23 | Gradient-based Jailbreak Images for Multimodal Fusion Models | Javier Rando et.al. | 2410.03489 | link |
2024-10-23 | Safeguard is a Double-edged Sword: Denial-of-service Attack on Large Language Models | Qingzhao Zhang et.al. | 2410.02916 | null |
2024-10-02 | FlipAttack: Jailbreak LLMs via Flipping | Yue Liu et.al. | 2410.02832 | link |
2024-10-01 | PyRIT: A Framework for Security Risk Identification and Red Teaming in Generative AI System | Gary D. Lopez Munoz et.al. | 2410.02828 | link |
2024-10-03 | SteerDiff: Steering towards Safe Text-to-Image Diffusion Models | Hongxiang Zhang et.al. | 2410.02710 | null |
2024-10-07 | Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models | Guobin Shen et.al. | 2410.02298 | null |
2024-12-18 | Data to Defense: The Role of Curation in Customizing LLMs Against Jailbreaking Attacks | Xiaoqun Liu et.al. | 2410.02220 | null |
2024-10-02 | Automated Red Teaming with GOAT: the Generative Offensive Agent Tester | Maya Pavlova et.al. | 2410.01606 | null |
2024-10-04 | HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models | Seanie Lee et.al. | 2410.01524 | link |
2024-10-02 | Information-Theoretical Principled Trade-off between Jailbreakability and Stealthiness on Vision Language Models | Ching-Chia Kao et.al. | 2410.01438 | null |
2024-12-06 | Endless Jailbreaks with Bijection Learning | Brian R. Y. Huang et.al. | 2410.01294 | null |
2024-12-19 | Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models | Wei Zhao et.al. | 2410.00451 | link |
2024-09-29 | Survey of Security and Data Attacks on Machine Unlearning In Financial and E-Commerce | Carl E. J. Brodzinski et.al. | 2410.00055 | null |
2024-09-30 | Robust LLM safeguarding via refusal feature adversarial training | Lei Yu et.al. | 2409.20089 | null |
2024-09-28 | Overriding Safety protections of Open-source Models | Sachin Kumar et.al. | 2409.19476 | link |
2024-09-27 | HM3: Heterogeneous Multi-Class Model Merging | Stefan Hackmann et.al. | 2409.19173 | null |
2024-09-27 | Multimodal Pragmatic Jailbreak on Text-to-image Models | Tong Liu et.al. | 2409.19149 | null |
2024-11-08 | An Adversarial Perspective on Machine Unlearning for AI Safety | Jakub Łucki et.al. | 2409.18025 | link |
2024-10-04 | MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt Attacks | Giandomenico Cornacchia et.al. | 2409.17699 | null |
2024-09-26 | RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking | Yifan Jiang et.al. | 2409.17458 | link |
2024-09-25 | Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction | Jinchuan Zhang et.al. | 2409.16783 | link |
2024-09-25 | RoleBreak: Character Hallucination as a Jailbreak Attack in Role-Playing Systems | Yihong Tang et.al. | 2409.16727 | null |
2024-09-23 | Attack Atlas: A Practitioner's Perspective on Challenges and Pitfalls in Red Teaming GenAI | Ambrish Rawat et.al. | 2409.15398 | null |
2024-09-18 | Multitask Mayhem: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning | Essa Jan et.al. | 2409.15361 | null |
2024-10-08 | Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs | Xueluan Gong et.al. | 2409.14866 | link |
2024-10-03 | PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach | Zhihao Lin et.al. | 2409.14177 | null |
2024-10-29 | Towards Safe Multilingual Frontier AI | Artūrs Kanepajs et.al. | 2409.13708 | link |
2024-11-05 | Jailbreaking Large Language Models with Symbolic Mathematics | Emet Bethany et.al. | 2409.11445 | null |
2024-09-17 | Hackphyr: A Local Fine-Tuned LLM Agent for Network Security Environments | Maria Rigaki et.al. | 2409.11276 | null |
2024-09-14 | What Is Wrong with My Model? Identifying Systematic Problems with Semantic Data Slicing | Chenyang Yang et.al. | 2409.09261 | link |
2024-09-27 | Multi-Robot Coordination Induced in an Adversarial Graph-Traversal Game | James Berneburg et.al. | 2409.08222 | null |
2024-10-19 | Securing Large Language Models: Addressing Bias, Misinformation, and Prompt Attacks | Benji Peng et.al. | 2409.08087 | null |
2024-09-12 | Unleashing Worms and Extracting Data: Escalating the Outcome of Attacks against RAG-based Inference in Scale and Severity Using Jailbreaking | Stav Cohen et.al. | 2409.08045 | link |
2024-09-12 | Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols | Charlie Griffin et.al. | 2409.07985 | link |
2024-09-11 | AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs | Lijia Lv et.al. | 2409.07503 | link |
2024-09-11 | Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks | Md Zarif Hossain et.al. | 2409.07353 | null |
2024-09-10 | DiPT: Enhancing LLM reasoning through diversified perspective-taking | Hoang Anh Just et.al. | 2409.06241 | null |
2024-09-07 | Exploring Straightforward Conversational Red-Teaming | George Kour et.al. | 2409.04822 | null |
2024-08-31 | HSF: Defending against Jailbreak Attacks with Hidden State Filtering | Cheng Qian et.al. | 2409.03788 | null |
2024-11-29 | Conversational Complexity for Assessing Risk in Large Language Models | John Burden et.al. | 2409.01247 | null |
2024-09-01 | Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models | Bang An et.al. | 2409.00598 | link |
2024-08-31 | Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity, and Fairness | Wenxuan Wang et.al. | 2409.00551 | null |
2024-10-17 | PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action | Yijia Shao et.al. | 2409.00138 | link |
2024-08-29 | Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks | Tom Gibbs et.al. | 2409.00137 | null |
2024-11-07 | FRACTURED-SORRY-Bench: Framework for Revealing Attacks in Conversational Turns Undermining Refusal Efficacy and Defenses over SORRY-Bench (Automated Multi-shot Jailbreaks) | Aman Priyanshu et.al. | 2408.16163 | null |
2024-08-28 | Red Team Redemption: A Structured Comparison of Open-Source Tools for Adversary Emulation | Max Landauer et.al. | 2408.15645 | null |
2024-09-05 | Legilimens: Practical and Unified Content Moderation for Large Language Model Services | Jialin Wu et.al. | 2408.15488 | link |
2024-09-04 | LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet | Nathaniel Li et.al. | 2408.15221 | null |
2024-08-27 | Investigating Coverage Criteria in Large Language Models: An In-Depth Study Through Jailbreak Attacks | Shide Zhou et.al. | 2408.15207 | null |
2024-10-05 | Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models | Hongfu Liu et.al. | 2408.14866 | link |
2024-08-27 | Detecting AI Flaws: Target-Driven Attacks on Internal Faults in Language Models | Yuhao Du et.al. | 2408.14853 | null |
2024-12-15 | HTS-Attack: Heuristic Token Search for Jailbreaking Text-to-Image Models | Sensen Gao et.al. | 2408.13896 | null |
2024-08-14 | SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming | Anurakt Kumar et.al. | 2408.11851 | null |
2024-09-14 | Efficient Detection of Toxic Prompts in Large Language Models | Yi Liu et.al. | 2408.11727 | null |
2024-08-21 | Unlocking Adversarial Suffix Optimization Without Affirmative Phrases: Efficient Black-box Jailbreaking via LLM as Optimizer | Weipeng Jiang et.al. | 2408.11313 | link |
2024-08-21 | EEG-Defender: Defending against Jailbreak through Early Exit Generation of Large Language Models | Chongwen Zhao et.al. | 2408.11308 | null |
2024-08-20 | Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Neural Carrier Articles | Zhilong Wang et.al. | 2408.11182 | null |
2024-08-18 | DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization | Pucheng Dang et.al. | 2408.11071 | null |
2025-01-02 | Security Attacks on LLM-based Code Completion Tools | Wen Cheng et.al. | 2408.11006 | link |
2025-01-02 | Perception-guided Jailbreak against Text-to-Image Models | Yihao Huang et.al. | 2408.10848 | null |
2024-08-20 | Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Tej Deep Pala et.al. | 2408.10701 | link |
2024-08-20 | Towards Robust Knowledge Unlearning: An Adversarial Framework for Assessing and Improving Unlearning Robustness in Large Language Models | Hongbang Yuan et.al. | 2408.10682 | null |
2024-08-26 | Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation | Haoyu Wang et.al. | 2408.10668 | null |
2024-08-18 | Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks | Kexin Chen et.al. | 2408.09326 | null |
2025-01-10 | BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger | Yulin Chen et.al. | 2408.09093 | null |
2024-08-22 | Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks | Jiawei Zhao et.al. | 2408.08924 | link |
2024-08-11 | Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search | Robert J. Moss et.al. | 2408.08899 | link |
2024-10-22 | | Fenghua Weng et.al. | 2408.08464 | link |
2024-12-19 | Alignment-Enhanced Decoding: Defending via Token-Level Adaptive Refining of Probability Distributions | Quan Liu et.al. | 2408.07663 | link |
2024-12-14 | On Effects of Steering Latent Representation for Large Language Model Unlearning | Dang Huu-Tien et.al. | 2408.06223 | link |
2024-08-09 | A Jailbroken GenAI Model Can Cause Substantial Harm: GenAI-powered Applications are Vulnerable to PromptWares | Stav Cohen et.al. | 2408.05061 | link |
2024-09-13 | h4rm3l: A Dynamic Benchmark of Composable Jailbreak Attacks for LLM Safety Assessment | Moussa Koulako Bala Doumbouya et.al. | 2408.04811 | null |
2024-08-08 | Multi-Turn Context Jailbreak Attack on Large Language Models From First Principles | Xiongtao Sun et.al. | 2408.04686 | null |
2024-08-08 | Compromesso! Italian Many-Shot Jailbreaks Undermine the Safety of Large Language Models | Fabio Pernisi et.al. | 2408.04522 | null |
2024-08-07 | EnJa: Ensemble Jailbreak on Large Language Models | Jiahao Zhang et.al. | 2408.03603 | null |
2024-12-27 | Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws | Dillon Bowen et.al. | 2408.02946 | link |
2024-08-05 | Can Reinforcement Learning Unlock the Hidden Dangers in Aligned Large Language Models? | Mohammad Bahrami Karkevandi et.al. | 2408.02651 | null |
2024-12-23 | SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models | Muxi Diao et.al. | 2408.02632 | null |
2024-08-02 | Mission Impossible: A Statistical Perspective on Jailbreaking LLMs | Jingtong Su et.al. | 2408.01420 | null |
2024-08-01 | WHITE PAPER: A Brief Exploration of Data Exfiltration using GCG Suffixes | Victor Valbuena et.al. | 2408.00925 | null |
2024-09-14 | Tamper-Resistant Safeguards for Open-Weight LLMs | Rishub Tamirisa et.al. | 2408.00761 | link |
2024-09-09 | Jailbreaking Text-to-Image Models with LLM-Based Agents | Yingkai Dong et.al. | 2408.00523 | null |
2024-10-17 | Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models | Yue Xu et.al. | 2407.21659 | link |
2025-01-16 | Direct Unlearning Optimization for Robust and Safe Text-to-Image Models | Yong-Hyun Park et.al. | 2407.21035 | null |
2024-10-24 | Effects of Scale on Language Model Robustness | Nikolaus Howe et.al. | 2407.18213 | null |
2024-12-24 | The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models | Zihui Wu et.al. | 2407.17915 | link |
2024-10-01 | FLRT: Fluent Student-Teacher Redteaming | T. Ben Thompson et.al. | 2407.17447 | link |
2024-10-07 | Revisiting Who's Harry Potter: Towards Targeted Unlearning from a Causal Intervention Perspective | Yujian Liu et.al. | 2407.16997 | link |
2024-12-31 | From Sands to Mansions: Simulating Full Attack Chain with LLM-Organized Knowledge | Lingzhi Wang et.al. | 2407.16928 | null |
2024-08-23 | Can Large Language Models Automatically Jailbreak GPT-4V? | Yuanwei Wu et.al. | 2407.16686 | null |
2024-07-23 | RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent | Huiyu Xu et.al. | 2407.16667 | null |
2024-10-26 | Course-Correction: Safety Alignment Using Synthetic Preferences | Rongwu Xu et.al. | 2407.16637 | link |
2024-07-23 | PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing | Blazej Manczak et.al. | 2407.16318 | link |
2024-08-13 | Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models | Shi Lin et.al. | 2407.16205 | link |
2024-07-26 | Breaking the Global North Stereotype: A Global South-centric Benchmark Dataset for Auditing and Mitigating Biases in Facial Recognition Systems | Siddharth D Jaiswal et.al. | 2407.15810 | null |
2024-08-21 | Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs | Abhay Sheshadri et.al. | 2407.15549 | link |
2024-12-16 | Failures to Find Transferable Image Jailbreaks Between Vision-Language Models | Rylan Schaeffer et.al. | 2407.15211 | null |
2024-07-21 | Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts | Yi Liu et.al. | 2407.15050 | null |
2024-07-23 | RogueGPT: dis-ethical tuning transforms ChatGPT4 into a Rogue AI in 158 Words | Alessio Buscemi et.al. | 2407.15009 | null |
2024-07-20 | Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs) | Apurv Verma et.al. | 2407.14937 | link |
2024-08-23 | Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle | Emman Haider et.al. | 2407.13833 | null |
2024-07-16 | Continuous Embedding Attacks via Clipped Inputs in Jailbreaking Large Language Models | Zihao Xu et.al. | 2407.13796 | link |
2024-07-18 | LLMs as Function Approximators: Terminology, Taxonomy, and Questions for Evaluation | David Schlangen et.al. | 2407.13744 | null |
2024-07-17 | AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases | Zhaorun Chen et.al. | 2407.12784 | link |
2024-10-28 | Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models | Chao Gong et.al. | 2407.12383 | link |
2024-07-17 | The Better Angels of Machine Personality: How Personality Relates to LLM Safety | Jie Zhang et.al. | 2407.12344 | link |
2024-10-03 | Does Refusal Training in LLMs Generalize to the Past Tense? | Maksym Andriushchenko et.al. | 2407.11969 | link |
2024-08-21 | What Makes and Breaks Safety Fine-tuning? A Mechanistic Study | Samyak Jain et.al. | 2407.10264 | null |
2024-07-13 | MOAT: Securely Mitigating Rowhammer with Per-Row Activation Counters | Moinuddin Qureshi et.al. | 2407.09995 | null |
2024-10-18 | ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Likely Toxic Prompts | Amelia F. Hardy et.al. | 2407.09447 | link |
2024-09-06 | Soft Prompts Go Hard: Steering Visual Language Models with Hidden Meta-Instructions | Tingwei Zhang et.al. | 2407.08970 | link |
2024-07-11 | Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing | Huanqian Wang et.al. | 2407.08770 | link |
2024-07-11 | Are Large Language Models Really Bias-Free? Jailbreak Prompts for Assessing Adversarial Robustness to Bias Elicitation | Riccardo Cantini et.al. | 2407.08441 | null |
2024-09-11 | The Human Factor in AI Red Teaming: Perspectives from Social and Collaborative Computing | Alice Qian Zhang et.al. | 2407.07786 | null |
2024-07-12 | A Survey of Attacks on Large Vision-Language Models: Resources, Advances, and Future Trends | Daizong Liu et.al. | 2407.07403 | link |
2024-09-08 | T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models | Yibo Miao et.al. | 2407.05965 | null |
2024-07-08 | | Mintong Kang et.al. | 2407.05557 | link |
2024-07-06 | Safe Generative Chats in a WhatsApp Intelligent Tutoring System | Zachary Levonian et.al. | 2407.04915 | null |
2024-08-30 | Jailbreak Attacks and Defenses Against Large Language Models: A Survey | Sibo Yi et.al. | 2407.04295 | null |
2024-12-21 | Automated Progressive Red Teaming | Bojian Jiang et.al. | 2407.03876 | link |
2024-07-03 | Soft Begging: Modular and Efficient Shielding of LLMs against Prompt Injection and Jailbreaking based on Prompt Tuning | Simon Ostermann et.al. | 2407.03391 | null |
2024-07-03 | SOS! Soft Prompt Attack Against Open-Source Large Language Models | Ziqing Yang et.al. | 2407.03160 | null |
2024-07-03 | JailbreakHunter: A Visual Analytics Approach for Jailbreak Prompts Discovery from Large-Scale Human-LLM Conversational Datasets | Zhihua Jin et.al. | 2407.03045 | null |
2024-11-05 | Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks | Zhexin Zhang et.al. | 2407.02855 | link |
2024-10-30 | Breach By A Thousand Leaks: Unsafe Information Leakage in 'Safe' AI Responses | David Glukhov et.al. | 2407.02551 | null |
2024-08-26 | Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything | Xiaotian Zou et.al. | 2407.02534 | null |
2024-07-02 | SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack | Yan Yang et.al. | 2407.01902 | link |
2024-07-01 | Purple-teaming LLMs with Adversarial Defender Training | Jingyan Zhou et.al. | 2407.01850 | null |
2024-07-25 | JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models | Haibo Jin et.al. | 2407.01599 | link |
2024-07-01 | Enhancing the Capability and Robustness of Large Language Models through Reinforcement Learning-Driven Query Refinement | Zisu Huang et.al. | 2407.01461 | link |
2024-07-01 | Badllama 3: removing safety finetuning from Llama 3 in minutes | Dmitrii Volkov et.al. | 2407.01376 | null |
2024-09-23 | Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks | Yue Zhou et.al. | 2407.00869 | link |
2024-10-01 | Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference | Anton Xue et.al. | 2407.00075 | null |
2024-07-11 | Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection | Yuqi Zhou et.al. | 2406.19845 | null |
2024-10-03 | Jailbreaking LLMs with Arabic Transliteration and Arabizi | Mansour Al Ghanim et.al. | 2406.18725 | link |
2024-07-08 | The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm | Aakanksha et.al. | 2406.18682 | null |
2024-06-26 | WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models | Liwei Jiang et.al. | 2406.18510 | link |
2024-12-09 | WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs | Seungju Han et.al. | 2406.18495 | link |
2024-06-26 | Poisoned LangChain: Jailbreak LLMs by LangChain | Ziqiu Wang et.al. | 2406.18122 | null |
2024-12-24 | SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance | Caishuang Huang et.al. | 2406.18118 | link |
2024-06-25 | CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference | Erxin Yu et.al. | 2406.17626 | link |
2024-06-25 | Leveraging Reinforcement Learning in Red Teaming for Advanced Ransomware Attack Simulations | Cheng Wang et.al. | 2406.17576 | null |
2024-06-21 | Steering Without Side Effects: Improving Post-Deployment Control of Language Models | Asa Cooper Stickland et.al. | 2406.15518 | link |
2024-11-02 | Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Haneul Yoo et.al. | 2406.15481 | link |
2024-06-21 | From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking | Siyuan Wang et.al. | 2406.14859 | null |
2024-07-01 | Adversaries Can Misuse Combinations of Safe Models | Erik Jones et.al. | 2406.14595 | null |
2025-01-17 | Jailbreaking as a Reward Misspecification Problem | Zhihui Xie et.al. | 2406.14393 | link |
2024-06-20 | Finding Safety Neurons in Large Language Models | Jianhui Chen et.al. | 2406.14144 | null |
2024-06-19 | ObscurePrompt: Jailbreaking Large Language Models via Obscure Input | Yue Huang et.al. | 2406.13662 | link |
2024-08-21 | SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation | Xiaoze Liu et.al. | 2406.12975 | link |
2025-01-07 | ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates | Fengqing Jiang et.al. | 2406.12935 | link |
2024-06-21 | [WIP] Jailbreak Paradox: The Achilles' Heel of LLMs | Abhinav Rao et.al. | 2406.12702 | null |
2024-06-17 | Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner | Kenneth Li et.al. | 2406.11978 | link |
2024-10-16 | CELL your Model: Contrastive Explanations for Large Language Models | Ronny Luss et.al. | 2406.11785 | null |
2024-10-23 | STAR: SocioTechnical Approach to Red Teaming Language Models | Laura Weidinger et.al. | 2406.11757 | null |
2024-10-30 | Refusal in Language Models Is Mediated by a Single Direction | Andy Arditi et.al. | 2406.11717 | link |
2024-06-17 | Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack | Shangqing Tu et.al. | 2406.11682 | link |
2024-06-17 | "Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak | Lingrui Mei et.al. | 2406.11668 | link |
2024-06-17 | Ruby Teaming: Improving Quality Diversity Search with Memory for Automated Red Teaming | Vernon Toh Yan Han et.al. | 2406.11654 | null |
2024-06-16 | garak: A Framework for Security Probing Large Language Models | Leon Derczynski et.al. | 2406.11036 | link |
2024-06-16 | Threat Modelling and Risk Analysis for Large Language Model (LLM)-Powered Applications | Stephen Burabari Tete et.al. | 2406.11007 | null |
2024-12-02 | Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis | Yuping Lin et.al. | 2406.10794 | link |
2024-11-06 | Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs | Zhao Xu et.al. | 2406.09324 | link |
2024-06-13 | JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models | Delong Ran et.al. | 2406.09321 | link |
2024-10-05 | Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models | Sarah Ball et.al. | 2406.09289 | link |
2024-07-19 | Exploiting Uncommon Text-Encoded Structures for Automated Jailbreaks in LLMs | Bangxin Li et.al. | 2406.08754 | null |
2024-06-13 | RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs | Xuan Chen et.al. | 2406.08725 | null |
2024-12-18 | When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search | Xuan Chen et.al. | 2406.08705 | link |
2024-06-13 | MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models | Tianle Gu et.al. | 2406.07594 | link |
2024-07-14 | Merging Improves Self-Critique Against Jailbreak Attacks | Victor Gallego et.al. | 2406.07188 | link |
2024-12-06 | MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models | Yichi Zhang et.al. | 2406.07057 | null |
2024-06-07 | Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs | Fan Liu et.al. | 2406.06622 | null |
2024-07-03 | Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks | Zonghao Ying et.al. | 2406.06302 | link |
2024-06-10 | Safety Alignment Should Be Made More Than Just a Few Tokens Deep | Xiangyu Qi et.al. | 2406.05946 | link |
2024-06-13 | How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States | Zhenhong Zhou et.al. | 2406.05644 | link |
2024-09-05 | SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner | Xunguang Wang et.al. | 2406.05498 | null |
2024-06-08 | Is On-Device AI Broken and Exploitable? Assessing the Trust and Ethics in Small Language Models | Kalyan Nakka et.al. | 2406.05364 | null |
2024-07-01 | Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt | Zonghao Ying et.al. | 2406.04031 | link |
2024-06-06 | AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens | Lin Lu et.al. | 2406.03805 | null |
2024-09-25 | Ranking Manipulation for Conversational Search Engines | Samuel Pfrommer et.al. | 2406.03589 | link |
2024-06-03 | Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits | Andis Draguns et.al. | 2406.02619 | link |
2024-05-28 | Are PPO-ed Language Models Hackable? | Suraj Anand et.al. | 2406.02577 | null |
2025-01-21 | QROA: A Black-Box Query-Response Optimization Attack on LLMs | Hussein Jawad et.al. | 2406.02044 | link |
2024-10-30 | Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses | Xiaosen Zheng et.al. | 2406.01288 | link |
2024-11-03 | Are you still on track!? Catching LLM Task Drift with Activations | Sahar Abdelnabi et.al. | 2406.00799 | link |
2024-06-01 | Exploring Vulnerabilities and Protections in Large Language Models: A Survey | Frank Weizhen Liu et.al. | 2406.00240 | null |
2024-07-29 | Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization | Yuanpu Cao et.al. | 2406.00045 | link |
2024-06-05 | Improved Techniques for Optimization-Based Jailbreaking on Large Language Models | Xiaojun Jia et.al. | 2405.21018 | link |
2024-11-01 | Improved Generation of Adversarial Examples Against Safety-aligned LLMs | Qizhang Li et.al. | 2405.20778 | link |
2024-08-21 | Medical MLLM is Vulnerable: Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models | Xijie Huang et.al. | 2405.20775 | link |
2024-06-12 | Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character | Siyuan Ma et.al. | 2405.20773 | null |
2024-06-04 | Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens | Jiahao Yu et.al. | 2405.20653 | null |
2024-05-30 | Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters | Haibo Jin et.al. | 2405.20413 | null |
2024-10-17 | TAIA: Large Language Models are Out-of-Distribution Data Learners | Shuyang Jiang et.al. | 2405.20192 | link |
2024-05-30 | Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks | Chen Xiong et.al. | 2405.20099 | null |
2024-05-30 | Efficient LLM-Jailbreaking by Introducing Visual Modality | Zhenxing Niu et.al. | 2405.20015 | null |
2024-05-30 | AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization | Jiawei Chen et.al. | 2405.19668 | null |
2024-10-11 | ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users | Guanlin Li et.al. | 2405.19360 | link |
2024-05-31 | Robustifying Safety-Aligned Large Language Models through Clean Data Curation | Xiaoqun Liu et.al. | 2405.19358 | null |
2024-05-29 | ConceptPrune: Concept Editing in Diffusion Models via Skilled Neuron Pruning | Ruchika Chavhan et.al. | 2405.19237 | link |
2024-05-29 | Voice Jailbreak Attacks Against GPT-4o | Xinyue Shen et.al. | 2405.19103 | link |
2024-12-20 | DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints | Andrew Zhao et.al. | 2405.19026 | link |
2024-10-20 | Quantitative Certification of Bias in Large Language Models | Isha Chaudhary et.al. | 2405.18780 | link |
2024-11-18 | A Theoretical Understanding of Self-Correction through In-context Alignment | Yifei Wang et.al. | 2405.18634 | null |
2024-05-28 | Learning diverse attacks on large language models for robust red-teaming and safety tuning | Seanie Lee et.al. | 2405.18540 | null |
2024-06-14 | Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing | Wei Zhao et.al. | 2405.18166 | link |
2024-10-14 | White-box Multimodal Jailbreaks Against Large Vision-Language Models | Ruofan Wang et.al. | 2405.17894 | link |
2024-05-28 | Automatic Jailbreaking of the Text-to-Image Generative AI Systems | Minseon Kim et.al. | 2405.16567 | link |
2024-05-24 | Hacc-Man: An Arcade Game for Jailbreaking LLMs | Matheus Valentim et.al. | 2405.15902 | null |
2024-10-08 | Extracting Prompts by Inverting LLM Outputs | Collin Zhang et.al. | 2405.15012 | link |
2024-10-30 | Representation Noising: A Defence Mechanism Against Harmful Finetuning | Domenic Rosati et.al. | 2405.14577 | link |
2024-05-23 | Impact of Non-Standard Unicode Characters on Security and Comprehension in Large Language Models | Johan S Daniel et.al. | 2405.14490 | link |
2024-05-22 | WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response | Tianrong Zhang et.al. | 2405.14023 | null |
2024-05-22 | Safety Alignment for Vision Language Models | Zhendong Liu et.al. | 2405.13581 | null |
2024-07-07 | TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models | Pengzhou Cheng et.al. | 2405.13401 | null |
2024-10-15 | GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation | Govind Ramesh et.al. | 2405.13077 | null |
2024-06-19 | Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation | Yuxi Li et.al. | 2405.13068 | link |
2024-06-17 | Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming | Jiaxu Liu et.al. | 2405.12604 | null |
2024-05-29 | Single Image Unlearning: Efficient Machine Unlearning in Multimodal Large Language Models | Jiaqi Li et.al. | 2405.12523 | null |
2024-08-06 | Hummer: Towards Limited Competitive Preference Dataset | Li Jiang et.al. | 2405.11647 | null |
2024-05-15 | Benchmark Early and Red Team Often: A Framework for Assessing and Managing Dual-Use Hazards of AI Foundation Models | Anthony M. Barrett et.al. | 2405.10986 | null |
2024-10-05 | Red Teaming Language Models for Processing Contradictory Dialogues | Xiaofei Wen et.al. | 2405.10128 | link |
2024-05-15 | Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization | Kai Hu et.al. | 2405.09113 | null |
2024-05-15 | A safety realignment framework via subspace-oriented model fusion for large language models | Xin Yi et.al. | 2405.09055 | link |
2024-05-14 | SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models | Raghuveer Peri et.al. | 2405.08317 | null |
2024-05-14 | PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition | Ziyang Zhang et.al. | 2405.07932 | link |
2024-05-14 | PLeak: Prompt Leaking Attacks against Large Language Model Applications | Bo Hui et.al. | 2405.06823 | link |
2024-08-29 | Mitigating Exaggerated Safety in Large Language Models | Ruchira Ray et.al. | 2405.05418 | null |
2024-05-07 | Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbreak Attacks | Georgios Pantazopoulos et.al. | 2405.04403 | link |
2024-05-07 | Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent | Shang Shang et.al. | 2405.03654 | null |
2024-05-03 | Aloe: A Family of Fine-tuned Open Healthcare LLMs | Ashwin Kumar Gururajan et.al. | 2405.01886 | null |
2024-05-02 | Boosting Jailbreak Attack with Momentum | Yihao Zhang et.al. | 2405.01229 | link |
2024-05-10 | Evaluating and Mitigating Linguistic Discrimination in Large Language Models | Guoliang Dong et.al. | 2404.18534 | null |
2024-04-26 | Probabilistic Inference in Language Models via Twisted Sequential Monte Carlo | Stephen Zhao et.al. | 2404.17546 | link |
2024-04-21 | AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs | Anselm Paulus et.al. | 2404.16873 | link |
2024-10-12 | Don't Say No: Jailbreaking LLM by Suppressing Refusal | Yukai Zhou et.al. | 2404.16369 | link |
2024-04-24 | Universal Adversarial Triggers Are Not Universal | Nicholas Meade et.al. | 2404.16020 | link |
2024-04-23 | Bias patterns in the application of LLMs for clinical decision support: A comprehensive study | Raphael Poulain et.al. | 2404.15149 | link |
2024-04-23 | A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI | Seliem El-Sayed et.al. | 2404.15058 | null |
2024-06-06 | Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs | Javier Rando et.al. | 2404.14461 | link |
2024-10-10 | Protecting Your LLMs with Information Bottleneck | Zichuan Liu et.al. | 2404.13968 | link |
2024-04-19 | The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions | Eric Wallace et.al. | 2404.13208 | null |
2024-04-18 | Advancing the Robustness of Large Language Models through Self-Denoised Smoothing | Jiabao Ji et.al. | 2404.12274 | link |
2024-04-12 | JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models | Yingchaojie Feng et.al. | 2404.08793 | null |
2024-06-24 | ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming | Simone Tedeschi et.al. | 2404.08676 | link |
2024-04-12 | Subtoxic Questions: Dive Into Attitude Change of LLM's Response in Jailbreak Attempts | Tianyu Zhang et.al. | 2404.08309 | null |
2024-11-24 | AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs | Zeyi Liao et.al. | 2404.07921 | link |
2024-04-10 | CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge | Yu Ying Chiu et.al. | 2404.06664 | null |
2024-05-07 | Rethinking How to Evaluate Language Model Jailbreak | Hongyu Cai et.al. | 2404.06407 | link |
2024-07-03 | Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge | Weikai Lu et.al. | 2404.05880 | link |
2024-04-16 | Hidden You Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Logic Chain Injection | Zhilong Wang et.al. | 2404.04849 | null |
2024-09-09 | Fine-Tuning, Quantization, and LLMs: Navigating Unintended Outcomes | Divyanshu Kumar et.al. | 2404.04392 | null |
2024-12-15 | Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? | Shuo Chen et.al. | 2404.03411 | link |
2024-11-24 | JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks | Weidi Luo et.al. | 2404.03027 | null |
2024-09-04 | Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models | Jiachen Ma et.al. | 2404.02928 | null |
2024-04-03 | Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game | Qianqiao Xu et.al. | 2404.02532 | null |
2024-10-07 | Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks | Maksym Andriushchenko et.al. | 2404.02151 | link |
2024-04-02 | Red-Teaming Segment Anything Model | Krzysztof Jankowski et.al. | 2404.02067 | link |
2024-09-24 | Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack | Mark Russinovich et.al. | 2404.01833 | null |
2024-10-31 | JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models | Patrick Chao et.al. | 2404.01318 | link |
2024-08-20 | What is in Your Safe Data? Identifying Benign Data that Breaks Safety | Luxi He et.al. | 2404.01099 | link |
2024-11-26 | Against The Achilles' Heel: A Survey on Red Teaming for Generative Models | Lizhi Lin et.al. | 2404.00629 | link |
2024-12-27 | Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code | Taishi Nakamura et.al. | 2404.00399 | null |
2024-12-08 | Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation | Yutong He et.al. | 2403.19103 | null |
2024-03-27 | IterAlign: Iterative Constitutional Alignment of Large Language Models | Xiusi Chen et.al. | 2403.18341 | null |
2024-11-15 | Optimization-based Prompt Injection Attack to LLM-as-a-Judge | Jiawen Shi et.al. | 2403.17710 | link |
2024-09-30 | Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models | Zhiyuan Yu et.al. | 2403.17336 | null |
2024-03-22 | Risk and Response in Large Language Models: Evaluating Key Threat Categories | Bahareh Harandizadeh et.al. | 2403.14988 | null |
2024-06-24 | Testing the Limits of Jailbreaking Defenses with the Purple Problem | Taeyoun Kim et.al. | 2403.14725 | link |
2024-07-23 | RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content | Zhuowen Yuan et.al. | 2403.13031 | link |
2024-03-18 | EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models | Weikang Zhou et.al. | 2403.12171 | link |
2024-05-14 | Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation | Jessica Quaye et.al. | 2403.12075 | link |
2025-01-13 | Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models | Yifan Li et.al. | 2403.09792 | link |
2024-10-15 | Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation | Yunhao Gou et.al. | 2403.09572 | null |
2024-03-14 | AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting | Yu Wang et.al. | 2403.09513 | link |
2024-07-17 | The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models? | Qinyu Zhao et.al. | 2403.09037 | link |
2024-03-19 | Review of Generative AI Methods in Cybersecurity | Yagmur Yigit et.al. | 2403.08701 | null |
2024-09-30 | Distract Large Language Models for Automatic Jailbreak Attack | Zeguan Xiao et.al. | 2403.08424 | link |
2024-03-14 | HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback | Ang Li et.al. | 2403.08309 | null |
2024-03-14 | Red Teaming Models for Hyperspectral Image Analysis Using Explainable AI | Vladimir Zaigrajew et.al. | 2403.08017 | null |
2024-08-22 | Defending Against Unforeseen Failure Modes with Latent Adversarial Training | Stephen Casper et.al. | 2403.05030 | link |
2024-03-07 | A Safe Harbor for AI Evaluation and Red Teaming | Shayne Longpre et.al. | 2403.04893 | null |
2024-11-14 | AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks | Yifan Zeng et.al. | 2403.04783 | link |
2024-03-11 | Using Hallucinations to Bypass GPT4's Filter | Benjamin Lemkin et.al. | 2403.04769 | null |
2024-10-04 | Aligners: Decoupling LLMs and Alignment | Lilian Ngweta et.al. | 2403.04224 | link |
2024-03-06 | ImgTrojan: Jailbreaking Vision-Language Models with ONE Image | Xijia Tao et.al. | 2403.02910 | link |
2024-03-05 | Here Comes The AI Worm: Unleashing Zero-click Worms that Target GenAI-Powered Applications | Stav Cohen et.al. | 2403.02817 | link |
2024-03-02 | AutoAttacker: A Large Language Model Guided System to Implement Automatic Cyber-attacks | Jiacen Xu et.al. | 2403.01038 | null |
2024-11-07 | Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes | Xiaomeng Hu et.al. | 2403.00867 | null |
2024-02-28 | TroubleLLM: Align to Red Team Expert | Zhuoer Xu et.al. | 2403.00829 | null |
2024-09-19 | Enhancing Jailbreak Attacks with Diversity Guidance | Xu Zhang et.al. | 2403.00292 | null |
2024-02-29 | Curiosity-driven Red-teaming for Large Language Models | Zhang-Wei Hong et.al. | 2402.19464 | link |
2024-06-10 | Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction | Tong Liu et.al. | 2402.18104 | link |
2024-10-30 | Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue | Zhenhong Zhou et.al. | 2402.17262 | null |
2024-11-11 | DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers | Xirui Li et.al. | 2402.16914 | link |
2024-02-26 | CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models | Huijie Lv et.al. | 2402.16717 | link |
2024-06-06 | Defending LLMs against Jailbreaking Attacks via Backtranslation | Yihan Wang et.al. | 2402.16459 | link |
2024-02-28 | Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing | Jiabao Ji et.al. | 2402.16192 | link |
2024-06-04 | ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings | Hao Wang et.al. | 2402.16006 | null |
2024-02-24 | PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails | Neal Mangaokar et.al. | 2402.15911 | null |
2024-03-04 | LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper | Daoyuan Wu et.al. | 2402.15727 | null |
2024-02-24 | Foot In The Door: Understanding Large Language Model Jailbreaking via Cognitive Psychology | Zhenhua Wang et.al. | 2402.15690 | null |
2024-02-23 | Fast Adversarial Attacks on Language Models In One GPU Minute | Vinu Sankar Sadasivan et.al. | 2402.15570 | link |
2024-11-16 | How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries | Somnath Banerjee et.al. | 2402.15302 | link |
2024-02-27 | Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement | Heegyu Kim et.al. | 2402.15180 | null |
2024-06-20 | Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment | Jiongxiao Wang et.al. | 2402.14968 | null |
2024-02-27 | Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs | Xiaoxia Li et.al. | 2402.14872 | null |
2024-06-18 | Is the System Message Really Important to Jailbreaks in Large Language Models? | Xiaotian Zou et.al. | 2402.14857 | null |
2024-02-21 | Coercing LLMs to do and reveal (almost) anything | Jonas Geiping et.al. | 2402.14020 | link |
2024-02-26 | AttackGNN: Red-Teaming GNNs in Hardware Security Using Reinforcement Learning | Vasudev Gohil et.al. | 2402.13946 | null |
2024-02-21 | Round Trip Translation Defence against Large Language Model Jailbreaking Attacks | Canaan Yung et.al. | 2402.13517 | link |
2024-05-29 | GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis | Yueqi Xie et.al. | 2402.13494 | link |
2024-05-17 | A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models | Zihao Xu et.al. | 2402.13457 | link |
2024-07-05 | Defending Jailbreak Prompts via In-Context Adversarial Game | Yujun Zhou et.al. | 2402.13148 | null |
2024-06-06 | TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification | Martin Gubri et.al. | 2402.12991 | link |
2024-06-07 | ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs | Fengqing Jiang et.al. | 2402.11753 | link |
2024-08-16 | ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages | Junjie Ye et.al. | 2402.10753 | link |
2024-10-23 | When "Competency" in Reasoning Opens the Door to Vulnerability: Jailbreaking LLMs via Novel Complex Ciphers | Divij Handa et.al. | 2402.10601 | link |
2024-08-27 | A StrongREJECT for Empty Jailbreaks | Alexandra Souly et.al. | 2402.10260 | link |
2024-02-15 | A Trembling House of Cards? Mapping Adversarial Attacks against Language Agents | Lingbo Mo et.al. | 2402.10196 | link |
2024-10-02 | Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks | Yixin Cheng et.al. | 2402.09177 | null |
2024-02-16 | Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues | Zhiyuan Chang et.al. | 2402.09091 | null |
2024-07-25 | SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding | Zhangchen Xu et.al. | 2402.08983 | link |
2024-06-07 | COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability | Xingang Guo et.al. | 2402.08679 | link |
2024-06-03 | Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast | Xiangming Gu et.al. | 2402.08567 | link |
2024-02-13 | Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning | Gelei Deng et.al. | 2402.08416 | null |
2024-10-31 | Fight Back Against Jailbreaking via Prompt Adversarial Tuning | Yichuan Mo et.al. | 2402.06255 | link |
2024-12-16 | Comprehensive Assessment of Jailbreak Attacks Against LLMs | Junjie Chu et.al. | 2402.05668 | link |
2024-02-08 | Rapid Optimization for Jailbreaking LLMs via Subconscious Exploitation and Echopraxia | Guangyu Shen et.al. | 2402.05467 | link |
2024-10-24 | Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications | Boyi Wei et.al. | 2402.05162 | null |
2024-02-27 | HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal | Mantas Mazeika et.al. | 2402.04249 | link |
2024-02-05 | Nevermind: Instruction Override and Moderation in Large Language Models | Edward Kim et.al. | 2402.03303 | null |
2024-05-30 | GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models | Haibo Jin et.al. | 2402.03299 | null |
2024-02-04 | Jailbreaking Attack against Multimodal Large Language Model | Zhenxing Niu et.al. | 2402.02309 | link |
2024-06-17 | Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | Yongshuo Zong et.al. | 2402.02207 | link |
2024-01-25 | MULTIVERSE: Exposing Large Language Model Alignment Problems in Diverse Worlds | Xiaolong Jin et.al. | 2402.01706 | null |
2024-11-14 | Security and Privacy Challenges of Large Language Models: A Survey | Badhan Chandra Das et.al. | 2402.00888 | null |
2024-02-01 | Investigating Bias Representations in Llama 2 Chat via Activation Steering | Dawn Lu et.al. | 2402.00402 | null |
2024-06-03 | On Prompt-Driven Safeguarding for Large Language Models | Chujie Zheng et.al. | 2401.18018 | link |
2024-11-08 | Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks | Andy Zhou et.al. | 2401.17263 | link |
2024-02-05 | Weak-to-Strong Jailbreaking on Large Language Models | Xuandong Zhao et.al. | 2401.17256 | link |
2024-01-30 | A Cross-Language Investigation into Jailbreak Attacks in Large Language Models | Jie Li et.al. | 2401.16765 | null |
2024-01-30 | Gradient-Based Language Model Red Teaming | Nevan Wichers et.al. | 2401.16656 | link |
2024-01-29 | Towards Red Teaming in Multimodal and Multilingual Translation | Christophe Ropers et.al. | 2401.16247 | null |
2024-08-27 | Red-Teaming for Generative AI: Silver Bullet or Security Theater? | Michael Feffer et.al. | 2401.15897 | null |
2024-01-23 | Red Teaming Visual Language Models | Mukai Li et.al. | 2401.12915 | null |
2024-01-24 | Digital cloning of online social networks for language-sensitive agent-based modeling of misinformation spread | Prateek Puri et.al. | 2401.12509 | null |
2024-07-10 | The Ethics of Interaction: Mitigating Security Threats in LLMs | Ashutosh Kumar et.al. | 2401.12273 | null |
2024-01-20 | InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance | Pengyu Wang et.al. | 2401.11206 | link |
2024-10-31 | Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning | Adib Hasan et.al. | 2401.10862 | link |
2024-05-16 | Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models | Rima Hazra et.al. | 2401.10647 | link |
2024-02-12 | All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks | Kazuhiro Takemoto et.al. | 2401.09798 | link |
2024-08-03 | AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models | Dong Shu et.al. | 2401.09002 | null |
2024-12-24 | Revisiting Jailbreaking for Large Language Models: A Representation Engineering Perspective | Tianlong Li et.al. | 2401.06824 | null |
2024-12-16 | Intention Analysis Makes LLMs A Good Jailbreak Defender | Yuqi Zhang et.al. | 2401.06561 | link |
2024-01-23 | How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs | Yi Zeng et.al. | 2401.06373 | link |
2024-01-11 | Combating Adversarial Attacks with Multi-Agent Debate | Steffi Chern et.al. | 2401.05998 | link |
2024-04-01 | The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Model Performance | Abel Salinas et.al. | 2401.03729 | link |
2024-08-19 | Malla: Demystifying Real-world Large Language Model Integrated Malicious Services | Zilong Lin et.al. | 2401.03315 | link |
2024-01-03 | A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity | Andrew Lee et.al. | 2401.01967 | link |
2023-12-30 | Red Teaming for Large Language Models At Scale: Tackling Hallucinations on Mathematics Tasks | Aleksander Buszydlik et.al. | 2401.00290 | link |
2023-12-28 | Scalable and automated Evaluation of Blue Team cyber posture in Cyber Ranges | Federica Bianchi et.al. | 2312.17221 | null |
2024-08-04 | Exploiting Novel GPT-4 APIs | Kellin Pelrine et.al. | 2312.14302 | link |
2023-12-12 | Maatphor: Automated Variant Analysis for Prompt Injection Attacks | Ahmed Salem et.al. | 2312.11513 | null |
2023-12-08 | A Red Teaming Framework for Securing AI in Maritime Autonomous Systems | Mathew J. Walter et.al. | 2312.11500 | null |
2024-06-18 | JailGuard: A Universal Detection Framework for LLM Prompt-based Attacks | Xiaoyu Zhang et.al. | 2312.10766 | null |
2023-12-16 | Comprehensive Evaluation of ChatGPT Reliability Through Multilingual Inquiries | Poorna Chander Reddy Puttaparthi et.al. | 2312.10524 | link |
2023-12-04 | Generative AI in Writing Research Papers: A New Type of Algorithmic Bias and Uncertainty in Scholarly Work | Rishab Jain et.al. | 2312.10057 | null |
2023-12-14 | OSTINATO: Cross-host Attack Correlation Through Attack Activity Similarity Detection | Sutanu Kumar Ghosh et.al. | 2312.09321 | null |
2024-04-17 | Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF | Anand Siththaranjan et.al. | 2312.08358 | link |
2023-12-13 | Causality Analysis for Evaluating the Security of Large Language Models | Wei Zhao et.al. | 2312.07876 | link |
2024-07-23 | AI Control: Improving Safety Despite Intentional Subversion | Ryan Greenblatt et.al. | 2312.06942 | link |
2024-05-30 | Privacy Issues in Large Language Models: A Survey | Seth Neel et.al. | 2312.06717 | link |
2023-12-11 | Control Risk for Potential Misuse of Artificial Intelligence in Science | Jiyan He et.al. | 2312.06632 | link |
2023-12-08 | Seamless: Multilingual Expressive and Streaming Speech Translation | Seamless Communication et.al. | 2312.05187 | link |
2023-12-12 | DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions | Fangzhou Wu et.al. | 2312.04730 | null |
2024-02-23 | Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak | Yanrui Du et.al. | 2312.04127 | null |
2024-10-31 | Tree of Attacks: Jailbreaking Black-Box LLMs Automatically | Anay Mehrotra et.al. | 2312.02119 | link |
2024-06-09 | Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections | Yuanpu Cao et.al. | 2312.00027 | link |
2024-03-03 | Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition | Sander Schulhoff et.al. | 2311.16119 | link |
2023-11-27 | How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | Haoqin Tu et.al. | 2311.16101 | link |
2023-11-27 | InfoPattern: Unveiling Information Propagation Patterns in Social Media | Chi Han et.al. | 2311.15642 | null |
2023-11-15 | Towards Publicly Accountable Frontier LLMs: Building an External Scrutiny Ecosystem under the ASPIRE Framework | Markus Anderljung et.al. | 2311.14711 | null |
2024-04-29 | Universal Jailbreak Backdoors from Poisoned Human Feedback | Javier Rando et.al. | 2311.14455 | link |
2024-03-24 | Unmasking and Improving Data Credibility: A Study with Datasets for Training Harmless Language Models | Zhaowei Zhu et.al. | 2311.11202 | link |
2024-06-15 | Hijacking Large Language Models via Adversarial In-Context Learning | Yao Qiang et.al. | 2311.09948 | link |
2024-02-29 | Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking | Nan Xu et.al. | 2311.09827 | null |
2024-06-19 | RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models | Jiongxiao Wang et.al. | 2311.09641 | null |
2023-11-16 | JAB: Joint Adversarial Prompting and Belief Augmentation | Ninareh Mehrabi et.al. | 2311.09473 | null |
2024-08-15 | Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment | Haoran Wang et.al. | 2311.09433 | link |
2024-01-20 | Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts | Yuanwei Wu et.al. | 2311.09127 | null |
2024-06-12 | Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization | Zhexin Zhang et.al. | 2311.09096 | link |
2023-11-29 | AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications | Bhaktipriya Radharapu et.al. | 2311.08592 | null |
2024-04-07 | A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily | Peng Ding et.al. | 2311.08268 | link |
2023-11-13 | MART: Improving LLM Safety with Multi-round Automatic Red-Teaming | Suyu Ge et.al. | 2311.07689 | null |
2024-05-22 | Flames: Benchmarking Value Alignment of LLMs in Chinese | Kexin Huang et.al. | 2311.06899 | link |
2024-12-10 | Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming | Nanna Inie et.al. | 2311.06237 | null |
2024-04-01 | Fake Alignment: Are LLMs Really Aligned Well? | Yixu Wang et.al. | 2311.05915 | link |
2025-01-19 | FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts | Yichen Gong et.al. | 2311.05608 | link |
2024-03-08 | Can LLMs Follow Simple Rules? | Norman Mu et.al. | 2311.04235 | link |
2023-11-24 | Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation | Rusheb Shah et.al. | 2311.03348 | null |
2024-11-28 | DeepInception: Hypnotize Large Language Model to Be Jailbreaker | Xuan Li et.al. | 2311.03191 | link |
2024-05-22 | LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B | Simon Lermen et.al. | 2310.20624 | null |
2024-03-10 | From Chatbots to PhishBots? -- Preventing Phishing scams created using ChatGPT, Google Bard and Claude | Sayak Saha Roy et.al. | 2310.19181 | null |
2024-03-22 | Self-Guard: Empower the LLM to Safeguard Itself | Zezhong Wang et.al. | 2310.15851 | null |
2023-12-14 | AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models | Sicheng Zhu et.al. | 2310.15140 | null |
2023-11-13 | Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases | Rishabh Bhardwaj et.al. | 2310.14303 | null |
2023-10-20 | Adaptive Experimental Design for Intrusion Data Collection | Kate Highnam et.al. | 2310.13224 | null |
2023-10-28 | Probing LLMs for hate speech detection: strengths and vulnerabilities | Sarthak Roy et.al. | 2310.12860 | null |
2023-10-19 | Attack Prompt Generation for Red Teaming and Defending Large Language Models | Boyi Deng et.al. | 2310.12505 | link |
2023-10-17 | Learning from Red Teaming: Gender Bias Provocation and Mitigation in Large Language Models | Hsuan Su et.al. | 2310.11079 | null |
2023-10-16 | Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks | Erfan Shayegani et.al. | 2310.10844 | null |
2024-02-16 | Large Language Model Unlearning | Yuanshun Yao et.al. | 2310.10683 | link |
2024-06-07 | Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? | Yu-Lin Tsai et.al. | 2310.10012 | link |
2023-11-11 | ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models | Alex Mei et.al. | 2310.09624 | link |
2024-07-18 | Jailbreaking Black Box Large Language Models in Twenty Queries | Patrick Chao et.al. | 2310.08419 | link |
2023-10-10 | Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation | Yangsibo Huang et.al. | 2310.06987 | link |
2024-03-04 | Multilingual Jailbreak Challenges in Large Language Models | Yue Deng et.al. | 2310.06474 | link |
2024-05-25 | Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations | Zeming Wei et.al. | 2310.06387 | null |
2024-03-20 | AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models | Xiaogeng Liu et.al. | 2310.04451 | link |
2023-09-17 | Red Teaming Generative AI/NLP, the BB84 quantum cryptography protocol and the NIST-approved Quantum-Resistant Cryptographic Algorithms | Petar Radanliev et.al. | 2310.04425 | null |
2023-10-05 | Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! | Xiangyu Qi et.al. | 2310.03693 | link |
2024-06-11 | SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks | Alexander Robey et.al. | 2310.03684 | link |
2024-01-27 | Low-Resource Languages Jailbreak GPT-4 | Zheng-Xin Yong et.al. | 2310.02446 | null |
2023-10-03 | Jailbreaker in Jail: Moving Target Defense for Large Language Models | Bocheng Chen et.al. | 2310.02417 | null |
2023-10-03 | Can Language Models be Instructed to Protect Personal Information? | Yang Chen et.al. | 2310.02224 | null |
2024-01-22 | Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench | Jen-tse Huang et.al. | 2310.01386 | link |
2023-10-02 | No Offense Taken: Eliciting Offensiveness from Language Models | Anugya Srivastava et.al. | 2310.00892 | link |
2024-07-28 | Evolving Diverse Red-team Language Models in Multi-round Multi-agent Games | Chengdong Ma et.al. | 2310.00322 | null |
2024-06-12 | Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM | Bochuan Cao et.al. | 2309.14348 | link |
2024-06-27 | GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts | Jiahao Yu et.al. | 2309.10253 | link |
2024-06-08 | Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts | Zhi-Yi Chin et.al. | 2309.06135 | link |
2024-04-14 | FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models | Dongyu Yao et.al. | 2309.05274 | link |
2024-08-05 | Open Sesame! Universal Black Box Jailbreaking of Large Language Models | Raz Lapid et.al. | 2309.01446 | null |
2023-09-04 | Baseline Defenses for Adversarial Attacks Against Aligned Language Models | Neel Jain et.al. | 2309.00614 | null |
2023-08-28 | The Promise and Peril of Artificial Intelligence -- Violet Teaming Offers a Balanced Path Forward | Alexander J. Titus et.al. | 2308.14253 | null |
2023-11-07 | Detecting Language Model Attacks with Perplexity | Gabriel Alon et.al. | 2308.14132 | null |
2023-08-25 | Self-Deception: Reverse Penetrating the Semantic Firewall of Large Language Models | Zhenhua Wang et.al. | 2308.11521 | null |
2023-08-21 | On the Adversarial Robustness of Multi-Modal Foundation Models | Christian Schlarmann et.al. | 2308.10741 | link |
2023-08-21 | Using Large Language Models for Cybersecurity Capture-The-Flag Challenges and Certification Questions | Wesley Tann et.al. | 2308.10443 | null |
2023-08-30 | Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment | Rishabh Bhardwaj et.al. | 2308.09662 | link |
2024-05-06 | Robustness Over Time: Understanding Adversarial Examples' Effectiveness on Longitudinal Versions of Large Language Models | Yugeng Liu et.al. | 2308.07847 | null |
2024-03-26 | GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher | Youliang Yuan et.al. | 2308.06463 | link |
2023-08-16 | Where's the Liability in Harmful AI Speech? | Peter Henderson et.al. | 2308.04635 | null |
2024-11-07 | FLIRT: Feedback Loop In-context Red Teaming | Ninareh Mehrabi et.al. | 2308.04265 | null |
2024-05-15 | "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models | Xinyue Shen et.al. | 2308.03825 | link |
2024-04-01 | XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models | Paul Röttger et.al. | 2308.01263 | link |
2023-08-03 | Confidence-Building Measures for Artificial Intelligence: Workshop Proceedings | Sarah Shoker et.al. | 2308.00862 | null |
2023-12-20 | Universal and Transferable Adversarial Attacks on Aligned Language Models | Andy Zou et.al. | 2307.15043 | link |
2023-10-10 | Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models | Erfan Shayegani et.al. | 2307.14539 | null |
2023-10-25 | MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots | Gelei Deng et.al. | 2307.08715 | null |
2023-08-28 | Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models | Huachuan Qiu et.al. | 2307.08487 | link |
2023-07-05 | Jailbroken: How Does LLM Safety Training Fail? | Alexander Wei et.al. | 2307.02483 | null |
2023-07-03 | From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy | Maanak Gupta et.al. | 2307.00691 | null |
2023-08-16 | Visual Adversarial Examples Jailbreak Aligned Large Language Models | Xiangyu Qi et.al. | 2306.13213 | link |
2024-02-26 | DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models | Boxin Wang et.al. | 2306.11698 | null |
2023-10-11 | Explore, Establish, Exploit: Red Teaming Language Models from Scratch | Stephen Casper et.al. | 2306.09442 | link |
2023-05-30 | Seeing Seeds Beyond Weeds: Green Teaming Generative AI for Beneficial Uses | Logan Stapleton et.al. | 2306.03097 | null |
2023-10-19 | Red Teaming Language Model Detectors with Language Models | Zhouxing Shi et.al. | 2305.19713 | link |
2023-05-27 | Query-Efficient Black-Box Red Teaming via Bayesian Optimization | Deokjae Lee et.al. | 2305.17444 | link |
2024-03-27 | Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks | Abhinav Rao et.al. | 2305.14965 | link |
2024-03-10 | Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study | Yi Liu et.al. | 2305.13860 | null |
2023-11-10 | SneakyPrompt: Jailbreaking Text-to-image Generative Models | Yuchen Yang et.al. | 2305.12082 | link |
2023-05-11 | Towards best practices in AGI safety and governance: A survey of expert opinion | Jonas Schuett et.al. | 2305.07153 | null |
2023-05-09 | Generating Phishing Attacks using ChatGPT | Sayak Saha Roy et.al. | 2305.05133 | null |
2023-10-19 | Automatic Prompt Optimization with "Gradient Descent" and Beam Search | Reid Pryzant et.al. | 2305.03495 | link |
2023-04-21 | Power to the Data Defenders: Human-Centered Disclosure Risk Calibration of Open Data | Kaustav Bhattacharjee et.al. | 2304.11278 | null |
2024-06-03 | Fundamental Limitations of Alignment in Large Language Models | Yotam Wolf et.al. | 2304.11082 | link |
2023-11-01 | Multi-step Jailbreaking Privacy Attacks on ChatGPT | Haoran Li et.al. | 2304.05197 | link |
2023-07-27 | Clustered Federated Learning Architecture for Network Anomaly Detection in Large Scale Heterogeneous IoT Networks | Xabier Sáez-de-Cámara et.al. | 2303.15986 | null |
2023-03-09 | Personalisation within bounds: A risk taxonomy and policy framework for the alignment of large language models with personalised feedback | Hannah Rose Kirk et.al. | 2303.05453 | null |
2023-09-21 | Red Teaming Deep Neural Networks with Feature Synthesis Tools | Stephen Casper et.al. | 2302.10894 | null |
2023-01-05 | Can Large Language Models Change User Preference Adversarially? | Varshini Subhash et.al. | 2302.10291 | null |
2023-05-29 | Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity | Terry Yue Zhuo et.al. | 2301.12867 | null |
2024-08-23 | Asymptotically Normal Estimation of Local Latent Network Curvature | Steven Wilkins-Reeves et.al. | 2211.11673 | link |
2023-05-05 | Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks | Stephen Casper et.al. | 2211.10024 | link |
2023-06-07 | Beyond the Surface: Investigating Malicious CVE Proof of Concept Exploits on GitHub | Soufian El Yadmani et.al. | 2210.08374 | null |
2022-11-10 | Red-Teaming the Stable Diffusion Safety Filter | Javier Rando et.al. | 2210.04610 | null |
2022-11-22 | Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned | Deep Ganguli et.al. | 2209.07858 | link |
2023-10-13 | Red Teaming with Mind Reading: White-Box Adversarial Policies Against RL Agents | Stephen Casper et.al. | 2209.02167 | link |
2022-08-16 | CTI4AI: Threat Intelligence Generation and Sharing after Red Teaming AI Models | Chuyen Nguyen et.al. | 2208.07476 | null |
2022-08-12 | PRIVEE: A Visual Analytic Workflow for Proactive Privacy Risk Inspection of Open Data | Kaustav Bhattacharjee et.al. | 2208.06481 | null |
2022-07-30 | 'PeriHack': Designing a Serious Game for Cybersecurity Awareness | Roberto Dillon et.al. | 2208.00235 | null |
2023-07-27 | Gotham Testbed: a Reproducible IoT Testbed for Security Experiments and Dataset Generation | Xabier Sáez-de-Cámara et.al. | 2207.13981 | link |
2022-02-07 | Red Teaming Language Models with Language Models | Ethan Perez et.al. | 2202.03286 | null |
2021-12-22 | Catch Me If You GAN: Using Artificial Intelligence for Fake Log Generation | Christian Toemmel et.al. | 2112.12006 | null |
2021-12-18 | Dynamic Defender-Attacker Blotto Game | Daigo Shishika et.al. | 2112.09890 | null |
2021-11-24 | Needle in a Haystack: Detecting Subtle Malicious Edits to Additive Manufacturing G-code Files | Caleb Beckwith et.al. | 2111.12746 | null |
2021-10-04 | Automating Privilege Escalation with Deep Reinforcement Learning | Kalle Kujanpää et.al. | 2110.01362 | null |
2021-08-20 | CybORG: A Gym for the Development of Autonomous Cyber Agents | Maxwell Standen et.al. | 2108.09118 | null |
2021-05-27 | Hopper: Modeling and Detecting Lateral Movement (Extended Report) | Grant Ho et.al. | 2105.13442 | link |
2021-04-23 | Predicting Adversary Lateral Movement Patterns with Deep Learning | Nathan Danneman et.al. | 2104.13195 | null |
2021-03-29 | Automating Defense Against Adversarial Attacks: Discovery of Vulnerabilities and Application of Multi-INT Imagery to Protect Deployed Models | Josh Kalin et.al. | 2103.15897 | null |
2022-06-28 | Dynamically Modelling Heterogeneous Higher-Order Interactions for Malicious Behavior Detection in Event Logs | Corentin Larroche et.al. | 2103.15708 | link |
2021-05-04 | An In-memory Embedding of CPython for Offensive Use | Ateeq Sharfuddin et.al. | 2103.15202 | null |
2020-11-26 | Investigation on Research Ethics and Building a Benchmark | Shun Inagaki et.al. | 2011.13925 | null |
2020-09-17 | Can ROS be used securely in industry? Red teaming ROS-Industrial | Víctor Mayoral-Vilches et.al. | 2009.08211 | null |
2020-07-17 | HARMer: Cyber-attacks Automation and Evaluation | Simon Yusuf Enoch et.al. | 2006.14352 | null |
2021-04-16 | HACK3D: Crowdsourcing the Assessment of Cybersecurity in Digital Manufacturing | Michael Linares et.al. | 2005.04368 | null |
2020-03-11 | Passlab: A Password Security Tool for the Blue Team | Saul Johnson et.al. | 2003.07208 | null |
2020-10-02 | SoK: A Survey of Open-Source Threat Emulators | Polina Zilberman et.al. | 2003.01518 | null |
2020-02-26 | CybORG: An Autonomous Cyber Operations Research Gym | Callum Baillie et.al. | 2002.10667 | null |
2021-01-29 | Anomaly Detection in Large Scale Networks with Latent Space Models | Wesley Lee et.al. | 1911.05522 | null |
2019-06-17 | The Little Phone That Could Ch-Ch-Chroot | Jack Whitter-Jones et.al. | 1906.07242 | null |
2019-06-12 | Relative Hausdorff Distance for Network Analysis | Sinan G. Aksoy et.al. | 1906.04936 | null |
2019-10-24 | Quantifiable & Comparable Evaluations of Cyber Defensive Capabilities: A Survey & Novel, Unified Approach | Michael D. Iannacone et.al. | 1902.00053 | null |
2018-10-13 | Two Can Play That Game: An Adversarial Evaluation of a Cyber-alert Inspection System | Ankit Shah et.al. | 1810.05921 | null |
2018-02-27 | A Multi-Disciplinary Review of Knowledge Acquisition Methods: From Human to Autonomous Eliciting Agents | George Leu et.al. | 1802.09669 | null |
2018-02-27 | Computational Red Teaming in a Sudoku Solving Context: Neural Network Based Skill Representation and Acquisition | George Leu et.al. | 1802.09660 | null |
2018-02-26 | Shaping Influence and Influencing Shaping: A Computational Red Teaming Trust-based Swarm Intelligence Model | Jiangjun Tang et.al. | 1802.09647 | null |
2018-01-06 | SLEUTH: Real-time Attack Scenario Reconstruction from COTS Audit Data | Md Nahid Hossain et.al. | 1801.02062 | null |
2017-12-02 | Recurrent Neural Network Language Models for Open Vocabulary Event-Level Cyber Anomaly Detection | Aaron Tuor et.al. | 1712.00557 | link |
2015-04-07 | Security Toolbox for Detecting Novel and Sophisticated Android Malware | Benjamin Holland et.al. | 1504.01693 | null |