Generative Adversarial Networks
Indexed in: arXiv, DataCite
Abstract
Large Language Models (LLMs) rely on Key-Value (KV) caches to store attention context during autoregressive decoding. In long-sequence settings, the KV cache can consume large amounts of VRAM and become a practical bottleneck for throughput. We introduce KVHALO, an auxiliary reconstruction model that restores higher-fidelity KV tensors from a compressed cache state when required, reducing persistent memory footprint during inference. In our evaluation, KVHALO achieves up to 91.85% directional cosine alignment at convergence and reduces long-context degradation relative to a low-bit baseline under our stress-test workloads. We used HRM instead of other architectures, which allowed for higher-quality results in…
Citation impact
4,550
total citations
- FWCI: —
- Percentile: —
- References: 0
Authors
8
Topics & keywords
Topics
Keywords
- Discriminative model
- Minimax
- Computer science
- Inference
- Artificial intelligence
- Perceptron
- Generative grammar
- Machine learning
UN Sustainable Development Goals
- Reduced inequalities