Large language models (LLMs) are typically pre-trained on general-purpose corpora and later refined through instruction tuning and domain-specific fine-tuning to adapt them to specialized domains or tasks (domain adaptation). While LLM benchmarking often focuses on downstream task performance, how domain adaptation reshapes internal model representations remains underexplored. In this work, we analyze shifts in activation space across base, instruction-tuned, and domain-aligned models from multiple LLM families in the legal and medical domains. We quantify the impact of domain alignment on internal representations by measuring Euclidean distances between the convex hull centroids of the first two principal components of the activations and by evaluating activation-space distribution shifts with DeepScan. Our findings reveal consistent activation shifts from base to instruction-tuned to domain-aligned models, although the degree and significance of the change vary by model. In most cases, domain-adapted models exhibit larger shifts in activation space than other models from the same family. Separability in activation space is generally high, especially in the Legal domain and, to a lesser extent, in the Health domain. Our preliminary results highlight activation-space analysis as a complementary perspective to traditional evaluation methods that focus on model outputs.
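For concreteness, the centroid-distance measurement could be implemented roughly as sketched below. This is a minimal illustration, not the paper's code: it assumes a shared two-component PCA fit on the pooled activations of both models, and it approximates each convex hull centroid as the mean of the hull's vertices; the function names (`hull_centroid`, `centroid_shift`) and the synthetic activation arrays are hypothetical. The DeepScan-based distribution-shift analysis is not reproduced here.

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial import ConvexHull


def hull_centroid(points_2d: np.ndarray) -> np.ndarray:
    """Centroid of the convex hull of 2-D points, approximated as the
    mean of the hull vertices (a simplification of the exact polygon
    centroid)."""
    hull = ConvexHull(points_2d)
    return points_2d[hull.vertices].mean(axis=0)


def centroid_shift(acts_a: np.ndarray, acts_b: np.ndarray) -> float:
    """Euclidean distance between the hull centroids of two activation
    sets, projected into a shared 2-D PCA space fit on the pooled data.

    Each input is an (n_samples, hidden_dim) array of hidden states
    collected from one model on the same prompt set.
    """
    pca = PCA(n_components=2).fit(np.vstack([acts_a, acts_b]))
    return float(np.linalg.norm(
        hull_centroid(pca.transform(acts_a)) - hull_centroid(pca.transform(acts_b))
    ))


# Synthetic stand-ins for real hidden states (e.g. last-layer activations
# of a base model and its domain-adapted counterpart).
rng = np.random.default_rng(0)
base_acts = rng.normal(size=(500, 1024))
adapted_acts = base_acts + rng.normal(loc=0.3, scale=0.05, size=(500, 1024))
print(f"centroid shift: {centroid_shift(base_acts, adapted_acts):.3f}")
```

Fitting the PCA jointly on both models' activations keeps the two point clouds in a common coordinate frame, so the centroid distance reflects a genuine representational shift rather than a difference between two independently chosen projections; whether the paper fits the projection jointly or per model is an assumption of this sketch.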