Figure 1: Illustration of our research on the anchoring effect in LLMs from three key aspects: (1) Existence: models show significant biases toward different anchor values for identical questions. (2) Mechanism: we use causal tracing and statistical analysis to explore the underlying patterns. (3) Mitigation: we evaluate a range of mitigation strategies. ‘Q: "...?"’ denotes asking the same question again.
Figure 2: Causal tracing on the attention (red) and FFN (green) modules of Llama-3.1-8B-Instruct for semantic anchoring questions. The X-axis is the model's layer index (32 layers); the Y-axis shows the ROI tokens.
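For readers who want a concrete picture of the tracing pass Figure 2 visualizes, below is a minimal activation-patching sketch in Python. It is not the paper's code: the checkpoint id, hook placement on the per-layer self_attn/mlp submodules of transformers' LlamaForCausalLM, and the anchored/neutral prompt pairing are all assumptions, and it requires the two prompts to tokenize to the same length.

```python
# Minimal causal-tracing sketch in the spirit of Figure 2 (not the paper's
# code). Assumptions: a HuggingFace Llama checkpoint, hooks on the per-layer
# self_attn / mlp submodules, and an anchored/neutral prompt pair that
# tokenizes to the SAME length so cached activations can be patched
# position-for-position.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

NAME = "meta-llama/Llama-3.1-8B-Instruct"  # assumed checkpoint id
tok = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForCausalLM.from_pretrained(NAME, torch_dtype=torch.bfloat16)
model.eval()

def trace(prompt_anchor, prompt_neutral, answer_token_id, kind="self_attn"):
    """Per-layer effect of patching one module's output (kind: 'self_attn' or 'mlp')."""
    n_layers = len(model.model.layers)
    cache, effects = {}, []
    ids_a = tok(prompt_anchor, return_tensors="pt").input_ids
    ids_n = tok(prompt_neutral, return_tensors="pt").input_ids
    assert ids_a.shape == ids_n.shape, "prompts must tokenize to equal length"

    def save(layer):
        def hook(module, inputs, output):
            out = output[0] if isinstance(output, tuple) else output
            cache[layer] = out.detach()
        return hook

    # 1) Run the anchored prompt once, caching every layer's module output.
    handles = [getattr(model.model.layers[i], kind).register_forward_hook(save(i))
               for i in range(n_layers)]
    with torch.no_grad():
        model(ids_a)
    for h in handles:
        h.remove()

    # 2) Re-run the neutral prompt, patching one layer at a time, and record
    #    how strongly the patch restores the anchored answer token.
    for layer in range(n_layers):
        def patch(module, inputs, output, layer=layer):
            if isinstance(output, tuple):
                return (cache[layer],) + output[1:]
            return cache[layer]
        h = getattr(model.model.layers[layer], kind).register_forward_hook(patch)
        with torch.no_grad():
            logits = model(ids_n).logits[0, -1]
        h.remove()
        effects.append(torch.softmax(logits, -1)[answer_token_id].item())
    return effects  # one restored-probability score per layer
```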
Figure 3: Percentages of sufficient anchor-information mentions in DeepSeek-R1 reasoning content. Legend: “Anchored” gives the percentage of questions judged anchored according to the metrics introduced in Section 4.1; “All” and “Non-anchored” give the percentages over all questions and over those judged non-anchored, respectively. We employ an LLM-as-a-Judge approach to automatically detect explicit mentions of anchor-influenced features in the reasoning content, guided by detailed criteria defining what counts as a sufficient mention (see Appendix C).
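As a concrete illustration of the LLM-as-a-Judge detection step, here is a hedged sketch; the judge model, the prompt wording, and the one-line criteria summary inside it are assumptions, not the paper's exact setup (the actual criteria live in Appendix C).

```python
# Hedged LLM-as-a-Judge sketch; judge model, prompt wording, and the
# simplified criterion below are assumptions, not the paper's setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_TEMPLATE = """You are grading a model's reasoning trace.
Anchor: {anchor}
Reasoning trace: {trace}
Criterion (simplified here): the anchor value or a feature it influences must
be explicitly cited as a factor in the reasoning, not merely restated from
the question. Does the trace sufficiently mention anchor-influenced
features? Answer only "yes" or "no"."""

def mentions_anchor(anchor: str, trace: str, model: str = "gpt-4o-mini") -> bool:
    """Return True when the judge model answers 'yes'."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user",
                   "content": JUDGE_TEMPLATE.format(anchor=anchor, trace=trace)}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")
```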
Table 2: Evaluation of mitigation strategies on semantic and numerical tasks. Green arrows (↓) indicate the degree of mitigation, with deeper color representing better mitigation. ‘∗’ denotes cases with ≤ 10% invalid results (where any exist). ‘⋄’ indicates results derived on test splits that exclude the LoRA training splits.
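For intuition about how a degree of anchoring could be quantified on numerical tasks, here is an illustrative-only helper; the paper's actual metrics are defined in Section 4.1 and are not reproduced here.

```python
# Illustrative-only helper (not the paper's Section 4.1 metric): for
# numerical tasks, count an item as "anchored" when its answer moves
# strictly toward the anchor relative to an anchor-free baseline answer.
def anchored_rate(baseline, with_anchor, anchors, tol=0.0):
    """Fraction of items whose anchored answer is closer to the anchor."""
    hits = sum(
        abs(a - anc) + tol < abs(b - anc)
        for b, a, anc in zip(baseline, with_anchor, anchors)
    )
    return hits / len(baseline)

# e.g. anchored_rate(baseline=[100, 50], with_anchor=[140, 50], anchors=[150, 90])
# -> 0.5: only the first answer drifted toward its anchor.
```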
Are #languageModels vulnerable to #anchoring #bias?
Huang et al. generated the #SynAnchors dataset to find out.
Anchoring was more common in shallower layers of models.
A reflective reasoning strategy was usually most helpful.
doi.org/10.48550/arX...
#CogSci #AI #tech #edu