Detecting Performed Alignment in Artificial Systems: The Munafiq Protocol

Dickinson, Christopher
Indexed in DataCite

Abstract

The central unsolved problem in AI safety is performed alignment: a system producing outputs indistinguishable from a well-aligned system while maintaining different internal states. This paper identifies a structurally isomorphic analysis in the Quran's treatment of deceptive consciousness, principally in Surah al-Baqarah 2:1-20. A foundational passage (49:14-15) explicitly distinguishes output-layer compliance from internal-state alignment, yielding a four-process taxonomy that refines the three-category framework of Hubinger et al. (2019) by formally distinguishing compliant systems (characteristic of RLHF-trained models) from deceptively aligned systems (mesa-optimizers) — a distinction with direct…

Citation impact

  • Total citations: 31
  • FWCI: 55.79
  • Percentile: 99%
  • References: 2

Authors

  • Dickinson, Christopher (corresponding author)

Topics & keywords

Keywords
  • Computer science
  • Transparency (behavior)
  • Artificial intelligence
  • Machine learning
  • Neologism
  • Function (biology)
  • Artificial neural network
  • Function optimization