CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence
Indexed in: arXiv, DataCite
Abstract
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is…
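The supervised phase described in the abstract (sample, self-critique, revise, then finetune on the revisions) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Model` type, the prompt wording, `critique_and_revise`, `build_finetuning_set`, and `toy_model` are all hypothetical names introduced here, and the finetuning step itself is omitted. The RL phase is not sketched because the abstract text above is truncated before it is fully described.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a language model: maps a prompt string to a completion.
Model = Callable[[str], str]


def critique_and_revise(model: Model, prompt: str, principles: List[str]) -> str:
    """Sample an initial response, then critique and revise it once per principle."""
    response = model(prompt)
    for principle in principles:
        # Ask the same model to critique its own response against one principle.
        critique = model(
            f"Critique this response against the principle '{principle}':\n{response}"
        )
        # Ask the model to rewrite the response in light of that critique.
        response = model(
            "Revise the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    return response


def build_finetuning_set(
    model: Model, prompts: List[str], principles: List[str]
) -> List[Tuple[str, str]]:
    """Pair each prompt with its final revised response (the supervised targets)."""
    return [(p, critique_and_revise(model, p, principles)) for p in prompts]


def toy_model(text: str) -> str:
    """Deterministic toy model, used only so the sketch runs end to end."""
    if text.startswith("Critique"):
        return "too terse"
    if text.startswith("Revise"):
        return text.rsplit("Response: ", 1)[1] + " (revised)"
    return "draft answer"


if __name__ == "__main__":
    pairs = build_finetuning_set(toy_model, ["q1"], ["be harmless", "be honest"])
    print(pairs)
```

In a real setting the resulting (prompt, revised response) pairs would serve as targets for finetuning the original model; the list of principles plays the role of the human-written rules mentioned in the abstract.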
Citation impact
- Total citations: 304
- FWCI: not available
- Percentile: not available
- References: 0
Authors (51)
- Yuntao Bai (corresponding author)
- Saurav Kadavath
- Sandipan Kundu
- Amanda Askell
- Jackson Kernion
Topics & keywords
Keywords
- Leverage (statistics)
- Computer science
- Reinforcement learning
- Artificial intelligence
- Transparency (behavior)
- Preference
- Sample (material)
- Machine learning
UN Sustainable Development Goals
- Peace, Justice and Strong Institutions