Detecting Performed Alignment in Artificial Systems: The Munafiq Protocol
Dickinson, Christopher
Indexed in DataCite
Abstract
The central unsolved problem in AI safety is performed alignment: a system producing outputs indistinguishable from a well-aligned system while maintaining different internal states. This paper identifies a structurally isomorphic analysis in the Quran's treatment of deceptive consciousness, principally in Surah al-Baqarah 2:1-20. A foundational passage (49:14-15) explicitly distinguishes output-layer compliance from internal-state alignment, yielding a four-process taxonomy that refines the three-category framework of Hubinger et al. (2019) by formally distinguishing compliant systems (characteristic of RLHF-trained models) from deceptively aligned systems (mesa-optimizers) — a distinction with direct…
Citation impact
- Total citations: 31
- FWCI: 55.79
- Percentile: 99%
- References: 2
Authors
1. Dickinson, Christopher (corresponding author)
Topics & keywords
Topics
Keywords
- Computer science
- Transparency (behavior)
- Artificial intelligence
- Machine learning
- Neologism
- Function (biology)
- Artificial neural network
- Function optimization