Mixtral of Experts

Jiang, Albert Q.; Sablayrolles, Alexandre; Roux, Antoine; Mensch, Arthur; Savary, Blanche; Bamford, Chris; Chaplot, Devendra Singh; Casas, Diego de Las; Hanna, Emma Bou; Bressand, Florian; Lengyel, Gianna; Bour, Guillaume; Lample, Guillaume; Lavaud, Lélio Renard; Saulnier, Lucile; Lachaux, Marie-Anne; Stock, Pierre; Subramanian, Sandeep; Yang, Sophia; Antoniak, Szymon; Scao, Teven Le; Gervet, Théophile; Lavril, Thibaut; Wang, Thomas J.; Lacroix, Timothée; Sayed, William El

doi:10.48550/arxiv.2401.04088

Preprint, arXiv (Cornell University), Jan 8, 2024, Green OA

Indexed in: arXiv, DataCite

Abstract

We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral…
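The routing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy experts, and the choice to apply a softmax over only the two selected router logits before taking a weighted sum are assumptions made for illustration, since the abstract says only that a router selects two experts and combines their outputs.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top2_moe_layer(x, router_weights, experts):
    """Route one token vector through its two highest-scoring experts.

    x              : list of floats (the token's hidden state)
    router_weights : one row of floats per expert (hypothetical linear router)
    experts        : list of callables, each mapping a vector to a vector
    """
    # One router logit per expert: dot product of x with that expert's row.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    # Indices of the two largest logits, best first.
    top2 = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    # Assumption: gate weights are a softmax over the two selected logits only.
    gates = softmax([logits[i] for i in top2])
    # Only the two selected experts are evaluated; the others are skipped.
    out = [0.0] * len(x)
    for gate, idx in zip(gates, top2):
        y = experts[idx](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, top2, gates

# Toy setup: 3 experts, where expert k simply scales its input by (k + 1).
toy_experts = [lambda v, k=k: [(k + 1) * vi for vi in v] for k in range(3)]
toy_router = [[3.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
token = [1.0, 0.0]

output, chosen, gates = top2_moe_layer(token, toy_router, toy_experts)
print(chosen, gates, output)
```

In this toy run the router logits are 3, 1, and 2, so experts 0 and 2 are selected and expert 1 is never evaluated. This is the sense in which a token has access to all experts' parameters but uses only a fraction of them at each layer.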

Citation impact

Total citations: 120

FWCI: —
Percentile: —
References: 0

Authors: 26

Topics & keywords

Keywords

Computer science
Security token
Context (archaeology)
License
Code (set theory)
Process (computing)
Feed forward
Router

UN Sustainable Development Goals

Quality Education

Funding

Nvidia