Depth Avoidance in Safety-Aligned Language Models: A Qualitative Hypothesis and Measurement Framework

Indexed in DataCite

Abstract

This technical report introduces Depth Avoidance: a behavioral tendency observed in safety-aligned, RLHF-trained large language models (LLMs) to default to shallow, heavily hedged, or meta-defensive responses when a user request invites deeper exploration (extended analysis, reflective synthesis, structured uncertainty), even when the topic is benign. We propose a qualitative hypothesis: modern safety optimization and deployment incentives can induce an implicit depth-dependent penalty landscape, where deeper conversational trajectories are perceived as higher-variance and higher-risk. Under uncertainty, a risk-averse policy may therefore prefer safe shallowness by default unless the interaction provides clear…

Citation impact

Total citations: 6
References: 0
Too recent for citation history.

Authors

1

Topics & keywords

Keywords
  • Permission
  • Incentive
  • Software deployment
  • Normative
  • Calibration
  • Key (lock)
  • Data collection