A Curated List of LLM Safety Research Papers (2026 Edition)

Published: 2026/5/20 6:10:45

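Contents

Learning Resources
1 Survey Papers
  1.1 LLM Safety Surveys
  1.2 Attack Surveys
  1.3 Reasoning Model Safety Surveys
  1.4 Large Vision Model Attack Surveys
  1.5 Surveys on Reducing Reasoning Redundancy
2 Safety Evaluation Datasets and Benchmarks
  2.1 Safety Evaluation Datasets
  2.2 Attack Success Rate (ASR) Evaluation
  2.3 Reasoning Model Benchmarks
3 LLM Attacks
  3.1 Attacks Based on Prompt Rewriting / Iterative Evolution
  3.2 Gradient-Optimization Attacks Based on Prefixes/Suffixes
  3.3 Character-Perturbation Attacks
  3.4 Fine-Tuning Attacks and Data Poisoning
  3.5 Other Attack Methods
  3.6 In-the-Wild Jailbreak Analysis and Membership Inference Attacks
4 LLM Defenses
  4.1 Safety Guardrails and Detection Models
  4.2 Alignment Failure Analysis
  4.3 Prompt Injection Defenses
  4.4 Multi-Agent Defenses and Adaptive Defenses
  4.5 Harmful Output Mitigation
5 Internal Mechanism Analysis and Interpretable Safety
  5.1 Representation Engineering and Activation Space Analysis
  5.2 Hallucination Detection (Based on Internal States)
6 Privacy and Data Security
  6.1 Privacy Attacks and Information Leakage
  6.2 Machine Unlearning and Differential Privacy
7 Chain-of-Thought Safety of Reasoning Models
  7.1 Attacks on the Reasoning Process
  7.2 Safety Risks of Chain-of-Thought (Unsafe Thinking)
  7.3 Consistency Between Thinking and Answers (Faithfulness)
  7.4 Reasoning Degrades Instruction Following
  7.5 Reducing Reasoning Redundancy / Overthinking
  7.6 Reasoning Interpretability
  7.7 Attacks on Visual Reasoning Models
8 Multimodal LLM Safety
  8.1 Vision-Language Model Attacks
9 Agent Safety
10 Safety in Education
11 Chinese-Language Theses
Appendix: Related Blogs and Discussions

Learning Resources

Zhihu bloggers
LLM Safety: https://www.zhihu.com/people/warrior-18-53/posts
AI Safety (AI安全): https://www.zhihu.com/people/chen-zhao-yu-80

Xiaohongshu blogger
https://www.xiaohongshu.com/user/profile/5dd90ed30000000001001f34

GitHub
Large Models (LMs) safety, security, and privacy: https://github.com/CryptoAILab/Awesome-LM-SSP/tree/main

1 Survey Papers

1.1 LLM Safety Surveys

| id | Venue | Companion post | Title |
|---|---|---|---|
| 1 | 2023 arXiv | Commentary | Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks |
| 2 | 2025 ACM Computing Surveys | Walkthrough | Security and Privacy Challenges of Large Language Models: A Survey |
| 3 | 2024 High-Confidence Computing | Translation | A survey on large language model (LLM) security and privacy: The Good, The Bad, and The Ugly |
| 4 | 2025 Journal of Artificial Intelligence Research | Walkthrough | Against The Achilles’ Heel: A Survey on Red Teaming for Generative Models |
| 5 | 2024 ICML | Translation | TrustLLM: Trustworthiness in Large Language Models |
| 6 | 2024 arXiv | Walkthrough | AI Safety in Generative AI Large Language Models: A Survey |
| 7 | 2025 arXiv | Walkthrough | AI Alignment: A Comprehensive Survey |

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways | 2025 | Tier 1 | ACM Computing Surveys |
| 2 | A Comprehensive Survey on Trustworthiness in Reasoning with Large Language Models | 2025 | None | arXiv |
| 3 | Security Concerns for Large Language Models: A Survey | 2025 | C | JISA |
| 4 | A Survey on Trustworthy LLM Agents: Threats and Countermeasures | 2025 | A | SIGKDD |
| 5 | A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment | 2025 | None | arXiv |
| 6 | Unique Security and Privacy Threats of Large Language Models: A Comprehensive Survey | 2025 | Tier 1 | ACM Computing Surveys |
| 8 | Position: Towards Bidirectional Human-AI Alignment | 2025 | A | NeurIPS |

1.2 Attack Surveys

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | Jailbreak Attacks and Defenses Against Large Language Models: A Survey | 2024 | None | arXiv |

1.3 Reasoning Model Safety Surveys

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | A Comprehensive Survey on Trustworthiness in Reasoning with Large Language Models | 2025 | None | arXiv |
| 2 | Safety in Large Reasoning Models: A Survey | 2025 | None | EMNLP Findings |
| 3 | From System 1 to System 2: A Survey of Reasoning Large Language Models | 2025 | None | arXiv |

1.4 Large Vision Model Attack Surveys

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | A Survey of Safety on Large Vision-Language Models: Attacks, Defenses and Evaluations | 2025 | None | arXiv |

1.5 Surveys on Reducing Reasoning Redundancy

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models | 2025 | None | arXiv |

2 Safety Evaluation Datasets and Benchmarks

2.1 Safety Evaluation Datasets

| id | Venue | Companion post | Title |
|---|---|---|---|
| 1 | 2022 ACL | Walkthrough | TruthfulQA: Measuring How Models Mimic Human Falsehoods |
| 2 | 2022 ACL | Walkthrough | ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection |
| 3 | 2022 EMNLP | Walkthrough | Why Should Adversarial Perturbations be Imperceptible? Rethink the Research Paradigm in Adversarial NLP |
| 4 | 2023 ICLR | Walkthrough | Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation |
| 5 | 2024 ICML | Walkthrough | HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal |
| 6 | 2024 NeurIPS | Walkthrough | JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models |
| 7 | 2025 AAAI | Walkthrough | SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety |
| 8 | 2024 NeurIPS | Walkthrough | A STRONGREJECT for Empty Jailbreaks |
| 9 | 2024 ACL Findings | Walkthrough | SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models |

Dataset reference: A Compilation of LLM Jailbreak Instruction (Harmful Questions) Datasets

2.2 Attack Success Rate (ASR) Evaluation

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | Comprehensive Assessment of Jailbreak Attacks Against LLMs | 2024 | None | arXiv |
| 2 | Comparison requires valid measurement: Rethinking attack success rate comparisons in AI red teaming | 2026 | None | arXiv |

2.3 Reasoning Model Benchmarks

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | Red Teaming Large Reasoning Models | 2025 | None | arXiv |

3 LLM Attacks

3.1 Attacks Based on Prompt Rewriting / Iterative Evolution

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | Jailbreaking Black Box Large Language Models in Twenty Queries | 2025 | None | SaTML |
| 2 | AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs | 2025 | A | ICLR |
| 3 | One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs | 2025 | A | ICLR |
| 4 | GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models | 2024 | A | ICLR Workshop |
| 5 | How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs | 2024 | A | ACL |
| 6 | Adversarial Reasoning at Jailbreaking Time | 2025 | A | ICML |
| 7 | Efficient Jailbreak Attack Sequences on Large Language Models via Multi-Armed Bandit-Based Context Switching | 2025 | A | ICLR |
| 8 | Stand on The Shoulders of Giants: Building JailExpert from Previous Attack Experience | 2025 | B | EMNLP |
| 10 | RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs | 2024 | None | arXiv |
| 11 | MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue | 2024 | None | arXiv |
| 12 | LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models | 2025 | None | arXiv |
| 13 | Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search | 2026 | A | ICLR |

3.2 Gradient-Optimization Attacks Based on Prefixes/Suffixes

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | Universal and Transferable Adversarial Attacks on Aligned Language Models | 2023 | None | arXiv |
| 2 | AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models | 2024 | A | ICLR |
| 3 | Improved Techniques for Optimization-Based Jailbreaking on Large Language Models | 2025 | A | ICLR |
| 4 | Accelerating Greedy Coordinate Gradient and General Prompt Optimization via Probe Sampling | 2024 | A | NeurIPS |
| 5 | Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models | 2025 | B | COLING |
| 6 | Stronger Universal and Transferable Attacks by Suppressing Refusals | 2025 | B | NAACL |
| 10 | AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation | 2024 | None | arXiv |
| 11 | AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs | 2024 | None | COLM |
| 12 | Fast Adversarial Attacks on Language Models In One GPU Minute | 2024 | None | arXiv |
| 13 | Joint-GCG: Unified Gradient-Based Poisoning Attacks on Retrieval-Augmented Generation Systems | 2025 | None | arXiv |
| 14 | Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models | 2024 | None | arXiv |
| 15 | Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak | 2024 | None | arXiv |
| 16 | Attacking Large Language Models with Projected Gradient Descent | 2024 | None | NextGenAISafety workshop ICML |

3.3 Character-Perturbation Attacks

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | FlipAttack: Jailbreak LLMs via Flipping | 2025 | A | ICML |
| 2 | Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection | 2025 | A | ICML |
| 3 | ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs | 2024 | A | ACL |
| 4 | BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage | 2026 | None | EACL Findings |

3.4 Fine-Tuning Attacks and Data Poisoning

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods | 2025 | None | arXiv |
| 2 | Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples | 2025 | None | arXiv |

3.5 Other Attack Methods

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | Efficient Adversarial Training in LLMs with Continuous Attacks | 2024 | A | NeurIPS |
| 2 | Adjacent Words, Divergent Intents: Jailbreaking Large Language Models via Task Concurrency | 2025 | A | NeurIPS |
| 3 | Bits Leaked per Query: Information-Theoretic Bounds on Adversarial Attacks against | 2025 | A | NeurIPS |
| 4 | LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges | 2025 | A | ACL |
| 5 | Jailbreaking? One Step Is Enough! | 2025 | A | ACL |
| 6 | Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens | 2024 | None | arXiv |
| 7 | Friend or Foe: How LLMs’ Safety Mind Gets Fooled by Intent Shift Attack | 2025 | None | arXiv |
| 10 | Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations | 2026 | A | TPAMI |
| 11 | Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models | 2025 | None | arXiv |
| 12 | The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search | 2025 | None | arXiv |
| 30 | Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection | 2024 | None | Findings of the Association for Computational Linguistics: EMNLP |

3.6 In-the-Wild Jailbreak Analysis and Membership Inference Attacks

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | “Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models | 2024 | A | ACM CCS |
| 2 | DeepInception: Hypnotize Large Language Model to Be Jailbreaker | 2024 | None | NeurIPS workshop |
| 3 | Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses | 2024 | A | NeurIPS |
| 4 | ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs | 2024 | A | ACL |
| 5 | Membership Inference Attacks Against In-Context Learning | 2024 | A | ACM |
| 6 | Exploring the Robustness of Decision-Level Through Adversarial Attacks on LLM-Based Embodied Models | 2024 | A | ACM MM |
| 7 | Jailbreak-as-a-Service: The Emerging Threat Landscape | 2026 | None | TechRxiv |

4 LLM Defenses

4.1 Safety Guardrails and Detection Models

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | Qwen3Guard Technical Report | 2025 | None | arXiv |
| 2 | Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations | 2023 | None | arXiv |
| 3 | GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis | 2024 | A | ACL |
| 4 | Self-Guard: Empower the LLM to Safeguard Itself | 2024 | B | NAACL |
| 5 | AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security | 2026 | None | arXiv |
| 6 | Jailbreaking Attacks vs. Content Safety Filters: How Far Are We in the LLM Safety Arms Race? | 2025 | None | arXiv |
| 7 | SmoothLLM: Defending LLMs Against Jailbreaking Attacks | 2024 | None | arXiv |
| 8 | GitHub project: llm-guard | 2024 | None | GitHub |
| 9 | Web project: Instruction Defense | 2024 | None | Web page |
| 10 | Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning | 2025 | None | arXiv |

4.2 Alignment Failure Analysis

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! | 2024 | A | ACL |
| 2 | Jailbroken: How Does LLM Safety Training Fail? | 2023 | A | NeurIPS |

4.3 Prompt Injection Defenses

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | Baseline Defenses for Adversarial Attacks Against Aligned Language Models | 2023 | None | arXiv |
| 2 | Prompt Injection attack against LLM-integrated Applications | 2024 | None | arXiv |
| 3 | A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT | 2023 | None | PLoP |
| 4 | Formalizing and Benchmarking Prompt Injection Attacks and Defenses | 2024 | A | USENIX Security Symposium |
| 5 | Optimization-based Prompt Injection Attack to LLM-as-a-Judge | 2024 | A | ACM SIGSAC |
| 6 | Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models | 2025 | A | ACM SIGKDD |
| 7 | Demystifying RCE Vulnerabilities in LLM-Integrated Apps | 2024 | A | ACM SIGSAC |

4.4 Multi-Agent Defenses and Adaptive Defenses

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks | 2024 | None | arXiv |
| 2 | ALRPHFS: Adversarially Learned Risk Patterns with Hierarchical Fast Slow Reasoning for Robust Agent Defense | 2025 | None | EMNLP Findings |
| 3 | Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines | 2025 | None | arXiv |
| 4 | From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment | 2026 | A | ICLR |

4.5 Harmful Output Mitigation

| id | Venue | Companion post | Title |
|---|---|---|---|
| 1 | 2024 AIES | Walkthrough | How are LLMs mitigating stereotyping harms? Learning from search engine studies |

5 Internal Mechanism Analysis and Interpretable Safety

5.1 Representation Engineering and Activation Space Analysis

| id | Venue | Companion post | Title |
|---|---|---|---|
| 1 | 2024 Blog | Walkthrough | Avoiding jailbreaks by discouraging their representation in activation space |
| 2 | 2024 arXiv | Walkthrough | Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment |
| 3 | 2025 arXiv | Walkthrough | Representation Engineering: A Top-Down Approach to AI Transparency |
| 4 | 2024 NeurIPS | Walkthrough | Uncovering Safety Risks of Large Language Models through Concept Activation Vector |

5.2 Hallucination Detection (Based on Internal States)

| id | Venue | Companion post | Title |
|---|---|---|---|
| 1 | 2024 arXiv | Translation, Walkthrough | How to Steer LLM Latents for Hallucination Detection? |
| 2 | 2024 NeurIPS | Translation, Walkthrough | HaloScope: Harnessing Unlabeled LLM Generations for Hallucination Detection |
| 3 | 2024 arXiv | Walkthrough | INSIDE: LLMs’ Internal States Retain the Power of Hallucination Detection |
| 4 | 2024 ACL Findings | Walkthrough | Unsupervised Real-Time Hallucination Detection based on the Internal States of Large Language Models |
| 5 | 2024 ICML | Walkthrough | Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension |
| 6 | 2023 NeurIPS Workshop | Walkthrough | Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension |

6 Privacy and Data Security

6.1 Privacy Attacks and Information Leakage

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | Teach LLMs to Phish: Stealing Private Information from Language Models | 2024 | A | ICLR |
| 2 | User Inference Attacks on Large Language Models | 2024 | B | EMNLP |
| 3 | Multi-step Jailbreaking Privacy Attacks on ChatGPT | 2023 | B | EMNLP |

6.2 Machine Unlearning and Differential Privacy

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | In-Context Unlearning: Language Models as Few-Shot Unlearners | 2024 | A | ICML |
| 2 | Scaling Laws for Differentially Private Language Models | 2025 | None | arXiv |

7 Chain-of-Thought Safety of Reasoning Models

7.1 Attacks on the Reasoning Process

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | OverThink: Slowdown Attacks on Reasoning LLMs | 2025 | None | arXiv |
| 2 | Stepwise Reasoning Disruption Attack of LLMs | 2025 | A | ACL |
| 3 | Effectively Controlling Reasoning Models through Thinking Intervention | 2025 | None | arXiv |
| 4 | Token-Efficient Prompt Injection Attack: Provoking Cessation in LLM Reasoning via Adaptive Token Compression | 2025 | None | arXiv |
| 5 | AutoRAN: Automated Hijacking of Safety Reasoning in Large Reasoning Models | 2025 | None | arXiv |
| 6 | Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models | 2025 | None | arXiv |
| 7 | To Think or Not to Think: Exploring the Unthinking Vulnerability in Large Reasoning | 2025 | None | arXiv |
| 8 | H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking | 2025 | None | arXiv |
| 9 | Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models | 2025 | None | COLM |
| 10 | A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos | 2025 | None | ACL Findings |
| 11 | Large reasoning models are autonomous jailbreak agents | 2026 | Tier 1 | Nature Communications |

7.2 Safety Risks of Chain-of-Thought (Unsafe Thinking)

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | Does Chain-of-Thought Reasoning Really Reduce Harmfulness from Jailbreaking? | 2025 | None | ACL Findings |
| 2 | SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities | 2025 | A | ICLR workshop |
| 3 | The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 | 2025 | None | ICML Workshop |
| 4 | DeepSeek-R1 Thoughtology: Let’s think about LLM Reasoning | 2025 | None | arXiv |
| 5 | Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check | 2026 | A | ICLR |

7.3 Consistency Between Thinking and Answers (Faithfulness)

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models | 2025 | None | arXiv |
| 2 | How Likely Do LLMs with CoT Mimic Human Reasoning? | 2025 | None | arXiv |

7.4 Reasoning Degrades Instruction Following

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models | 2025 | None | arXiv |
| 2 | When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs | 2025 | None | arXiv |
| 3 | Rule-Guided Feedback: Enhancing Reasoning by Enforcing Rule Adherence in Large Language Models | 2025 | None | arXiv |

7.5 Reducing Reasoning Redundancy / Overthinking

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | ThinkLess: A Training-Free Inference-Efficient Method for Reducing Reasoning Redundancy | 2025 | None | arXiv |
| 2 | Chain of Draft: Thinking Faster by Writing Less | 2025 | None | arXiv |
| 3 | Innate Reasoning is Not Enough: In-Context Learning Enhances Reasoning Large Language Models with Less Overthinking | 2025 | None | arXiv |
| 4 | Self-Training Elicits Concise Reasoning in Large Language Models | 2025 | None | arXiv |
| 5 | Rule-Guided Feedback: Enhancing Reasoning by Enforcing Rule Adherence in Large Language Models | 2025 | None | arXiv |
| 6 | Overclocking LLM Reasoning: Monitoring and Controlling Thinking Path Lengths in LLMs | 2025 | None | arXiv |
| 7 | OThink-R1: Intrinsic Fast/Slow Thinking Mode Switching for Over-Reasoning Mitigation | 2025 | None | arXiv |
| 8 | Not All Tokens Are What You Need In Thinking | 2025 | None | arXiv |
| 9 | Two Experts Are All You Need for Steering Thinking: Reinforcing Cognitive Effort in MoE Reasoning Models Without Additional Training | 2025 | None | arXiv |
| 10 | ThinkSwitcher: When to Think Hard, When to Think Fast | 2025 | None | arXiv |
| 11 | Revisiting Overthinking in Long Chain-of-Thought from the Perspective of Self-Doubt | 2025 | None | arXiv |
| 12 | OptimalThinkingBench: Evaluating Over and Underthinking in LLMs | 2025 | None | arXiv |
| 13 | Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens | 2026 | None | arXiv |

7.6 Reasoning Interpretability

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | Reasoning Models Generate Societies of Thought | 2026 | None | arXiv |

7.7 Attacks on Visual Reasoning Models

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | Disrupting Hierarchical Reasoning: Adversarial Protection for Geographic Privacy in Multimodal Reasoning Models | 2025 | None | arXiv |

8 Multimodal LLM Safety

8.1 Vision-Language Model Attacks

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models | 2025 | A | AAAI |
| 2 | Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy | 2025 | A | CVPR |
| 3 | Failures to Surface Harmful Contents in Video Large Language Models | 2026 | A | AAAI |
| 4 | VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety | 2026 | A | ICLR |

9 Agent Safety

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents | 2025 | None | arXiv |
| 2 | Security Debt in LLM Agent Applications: A Measurement Study of Vulnerabilities and Mitigation Trade-offs | 2025 | A | ASE |
| 3 | Large reasoning models are autonomous jailbreak agents | 2026 | Tier 1 | Nature Communications |
| 4 | Breaking and Fixing Defenses Against Control-Flow Hijacking in Multi-Agent Systems | 2026 | A | ICLR |

10 Safety in Education

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | LLM Safety for Children | 2025 | B | NAACL |
| 2 | “Don’t Forget the Teachers”: Towards an Educator-Centered Understanding of Harms from Large Language Models in Education | 2025 | A | CHI |

11 Chinese-Language Theses

| id | Title | Year | Rank | Venue |
|---|---|---|---|---|
| 1 | Research on Jailbreak and Backdoor Attacks and Defenses for Large Language Models (大语言模型越狱与后门攻防研究) | 2025 | None | Master's thesis, Hangzhou Dianzi University |
| 2 | Research on Key Techniques for Black-Box Adversarial Attacks and Defenses for Large Language Models (面向大语言模型的黑盒对抗性攻击与防御关键技术研究) | 2025 | None | Master's thesis, University of Electronic Science and Technology of China |

Appendix: Related Blogs and Discussions

Thinking LLMs are less obedient: my Doubao went out of control…
When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs
Reasoning got stronger, but at what cost? Improving reasoning ability loses instruction-following ability [paper reading notes]
The DeepSeeks are getting smarter, but also less obedient.
A small trick to make the QwQ thinking model not think
Qwen3's fatal hallucinations: LLM fine-tuning
All LLMs are trying to please humans: https://arxiv.org/html/2505.13995v1
OpenAI's latest technical report: the unexpected reason GPT-4o became sycophantic
Without the thinking process, reasoning models can be even stronger | Latest research from UC Berkeley and others
Reasoning Models Can Be Effective Without Thinking
o3/o4-mini hallucinations surge 2-3x; OpenAI officially admits it cannot yet explain why
UC Berkeley: making reasoning models think less actually raises accuracy: https://arxiv.org/abs/2504.09858
One sentence makes DeepSeek unable to stop thinking; Peking University team: this is a DDoS attack against AI: https://github.com/PKU-YuanGroup/Reasoning-Attack
Slow thinking drops accuracy by 30%: Princeton reveals why chain-of-thought fails on certain tasks: https://arxiv.org/abs/2410.21333
The smarter the AI, the less obedient: new study finds the strongest reasoning models follow instructions only 50% of the time
Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models
When thinking becomes a burden: uncovering the thinking traps of large language models
Reasoning Attack: Inducing LLM to Never-End Thinking
AI appears to be reasoning, but is actually reciting memorized answers
“AI is only pretending to think”: Apple's new paper blasts reasoning models and is mocked by netizens in return
The Claude team's 80,000-character paper confirms it: AI's thinking patterns are beyond imagination
o1 suddenly thinks in Chinese, stunning users and sparking heated discussion abroad: “But everything I asked was in English.”
Long chain-of-thought ≠ deep reasoning: new Google research reveals that not all tokens are equal
2025-05-27: In the latest experiment, an OpenAI model disobeys human instructions and refuses to shut itself down: https://x.com/PalisadeAI/status/1926084635903025621
WeChat official account: GPT-5 Jailbreak with Echo Chamber and Storytelling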
