Combining Domain and Alignment Vectors Provides Better Knowledge-Safety Trade-offs in LLMs. Megh Thakkar, Quentin Fournier, et al. 2025. ACL 2025.
Defensive Prompt Patch: A Robust and Generalizable Defense of Large Language Models against Jailbreak Attacks. Chen Xiong, Xiangyu Qi, et al. 2025. ACL 2025.
SEAL: Safety-enhanced Aligned LLM Fine-tuning via Bilevel Data Selection. Han Shen, Pin-Yu Chen, et al. 2025. ICLR 2025.
When is Task Vector Provably Effective for Model Editing? A Generalization Analysis of Nonlinear Transformers. Hongkang Li, Yihua Zhang, et al. 2025. ICLR 2025.