The Embedding Hypothesis: From Fourier Circuits to No-Q Attention

Rigoni, Nathan
Indexed in DataCite

Abstract

The token embedding layer is the geometric foundation of transformer attention. We develop this claim through four stages. First, we show that by prescribing near-Nyquist frequency modes in the embedding gradient, Prescribed Fourier Frequency Training (PFFT) achieves a 92.7% reduction in epochs-to-grokking (57 vs. 782) on modular arithmetic, with a 97.9% reduction in the memorization phase. PFFT works by simultaneously preserving the embedding's geometric authority and reducing gradient noise. Second, the Sounding Hammer diagnostic reveals that gradient-domain Fourier steering cannot safely transfer to language model embeddings: BPE vocabulary gradients are spectrally flat (ρ=0.42), causing catastrophic BPC…

Citation impact

Total citations: 325
References: 49

Authors

  • Rigoni, Nathan (corresponding author)

Topics & keywords

Keywords
  • Artificial neural network
  • Computer science
  • Artificial intelligence
  • Psychology
UN Sustainable Development Goals
  • Decent work and economic growth