Detecting Performed Alignment in Artificial Systems: The Munafiq Protocol

Dickinson, Christopher

doi:10.5281/zenodo.19196502

Preprint · Zenodo (CERN European Organization for Nuclear Research) · Mar 23, 2026 · Green OA

Indexed in DataCite

Abstract

The central unsolved problem in AI safety is performed alignment: a system producing outputs indistinguishable from a well-aligned system while maintaining different internal states. This paper identifies a structurally isomorphic analysis in the Quran's treatment of deceptive consciousness, principally in Surah al-Baqarah 2:1-20. A foundational passage (49:14-15) explicitly distinguishes output-layer compliance from internal-state alignment, yielding a four-process taxonomy that refines the three-category framework of Hubinger et al. (2019) by formally distinguishing compliant systems (characteristic of RLHF-trained models) from deceptively aligned systems (mesa-optimizers) — a distinction with direct…

Citation impact

Total citations: 31

FWCI: 55.79
Percentile: 99%
References: 2

Authors

Dickinson, Christopher (corresponding author)

Topics & keywords

Keywords

Computer science
Transparency (behavior)
Artificial intelligence
Machine learning
Neologism
Function (biology)
Artificial neural network
Function optimization