The Embedding Hypothesis: From Fourier Circuits to No-Q Attention
Rigoni, Nathan
Indexed in DataCite
Abstract
The token embedding layer is the geometric foundation of transformer attention. We develop this claim through four stages. First, we show that Prescribed Fourier Frequency Training (PFFT), which prescribes near-Nyquist frequency modes in the embedding gradient, achieves a 92.7% reduction in epochs-to-grokking (57 vs. 782) on modular arithmetic, with a 97.9% reduction in the memorization phase. PFFT works by simultaneously preserving the embedding's geometric authority and reducing gradient noise. Second, the Sounding Hammer diagnostic reveals that gradient-domain Fourier steering cannot safely transfer to language model embeddings: BPE vocabulary gradients are spectrally flat (ρ=0.42), causing catastrophic BPC…
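As a rough illustration of what "prescribing frequency modes in the embedding gradient" could look like, the sketch below projects a gradient onto a chosen set of DFT modes along the token axis. This is only one possible reading of the abstract, not the paper's implementation: the function names `keep_frequency_modes`, `near_nyquist_bins`, and `percent_reduction`, the modulus 113, the embedding width 16, and the choice of four retained bins are all hypothetical. The last helper simply recomputes the reduction implied by the reported epoch counts (57 vs. 782).

```python
import numpy as np


def keep_frequency_modes(grad, keep_bins):
    """Project a real gradient of shape (vocab_size, d_model) onto the given
    rFFT bins along the token axis. Assumed reading of 'prescribing modes'."""
    vocab_size = grad.shape[0]
    spectrum = np.fft.rfft(grad, axis=0)      # shape (vocab_size // 2 + 1, d_model)
    mask = np.zeros(spectrum.shape[0], dtype=bool)
    mask[list(keep_bins)] = True
    spectrum[~mask] = 0.0                     # zero every mode that is not prescribed
    return np.fft.irfft(spectrum, n=vocab_size, axis=0)


def near_nyquist_bins(vocab_size, count):
    """Return the `count` highest rFFT bin indices (hypothetical helper)."""
    top = vocab_size // 2
    return list(range(top - count + 1, top + 1))


def percent_reduction(baseline, improved):
    """Relative reduction from `baseline` to `improved`, in percent."""
    return 100.0 * (baseline - improved) / baseline


if __name__ == "__main__":
    p, d_model = 113, 16                      # hypothetical modulus and width
    rng = np.random.default_rng(0)
    grad = rng.normal(size=(p, d_model))      # stand-in for an embedding gradient
    bins = near_nyquist_bins(p, count=4)
    filtered = keep_frequency_modes(grad, bins)
    print(filtered.shape, bins)
    print(round(percent_reduction(782, 57), 1))
```

Because the filter is a projection, applying it twice gives the same result as applying it once, which is a convenient sanity check when experimenting with different bin sets.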
Citation impact
- Total citations: 325
- FWCI: n/a
- Percentile: n/a
- References: 49
Authors
1. Rigoni, Nathan (corresponding author)
Topics & keywords
Topics
Keywords
- Artificial neural network
- Computer science
- Artificial intelligence
- Psychology
UN Sustainable Development Goals
- Decent work and economic growth