Depth Avoidance in Safety-Aligned Language Models: A Qualitative Hypothesis and Measurement Framework
Indexed in DataCite
Abstract
This technical report introduces Depth Avoidance: a behavioral tendency observed in safety-aligned, RLHF-trained large language models (LLMs) to default to shallow, heavily hedged, or meta-defensive responses when a user request invites deeper exploration (extended analysis, reflective synthesis, structured uncertainty), even when the topic is benign. We propose a qualitative hypothesis: modern safety optimization and deployment incentives can induce an implicit depth-dependent penalty landscape, where deeper conversational trajectories are perceived as higher-variance and higher-risk. Under uncertainty, a risk-averse policy may therefore prefer safe shallowness by default unless the interaction provides clear…
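The hypothesis in the abstract can be illustrated with a toy mean-variance model. The sketch below is not taken from the report: the function names, the linear benefit term, the quadratic variance term, and every parameter value are assumptions chosen only to show how a risk-averse policy could end up preferring shallow responses when variance is assumed to grow with depth.

```python
# Toy illustration only: all names, functional forms, and numbers here are
# hypothetical assumptions, not the report's measurement framework.

def risk_adjusted_value(depth, gain_per_level=1.0, variance_per_level=0.5,
                        risk_aversion=1.0):
    """Mean-variance score: expected benefit minus a penalty on variance.

    Assumes benefit grows linearly with depth while variance grows
    quadratically, so deeper trajectories look riskier.
    """
    expected_benefit = gain_per_level * depth
    variance = variance_per_level * depth ** 2
    return expected_benefit - risk_aversion * variance


def preferred_depth(max_depth, **kwargs):
    """Return the depth in 0..max_depth with the highest risk-adjusted score."""
    return max(range(max_depth + 1),
               key=lambda d: risk_adjusted_value(d, **kwargs))


if __name__ == "__main__":
    # Sweep the (assumed) risk-aversion coefficient and report the chosen depth.
    for risk_aversion in (0.1, 1.0, 4.0):
        chosen = preferred_depth(5, risk_aversion=risk_aversion)
        print(f"risk_aversion={risk_aversion}: preferred depth {chosen}")
```

Under these assumed parameters, raising the risk-aversion coefficient moves the preferred depth from the maximum allowed toward zero, which is the qualitative pattern the abstract calls safe shallowness.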
Citation impact
- Total citations: 6
- FWCI: n/a
- Percentile: n/a
- References: 0
Too recent for citation history.
Authors: 1
Topics & keywords
Topics
Keywords
- Permission
- Incentive
- Software deployment
- Normative
- Calibration
- Key (lock)
- Data collection