Tag: moe

Blog Post·2024-06-19·3 min read

The Architecture Playground: What a Transformer Config Actually Buys You

An interactive research post. Adjust the config of a decoder-only transformer (hidden size, head counts, FFN type) and watch the parameter count, KV cache size, and mixture-of-experts routing recompute live.

architecture transformers interactive kv-cache moe

Blog Post·2024-06-19·9 min read

Comparing Large Model Architectures: Attention, Normalization, and Scale

GPT-4, Gemini, LLaMA, Mistral, DeepSeek, Qwen — they all build on the same transformer skeleton. But the architectural choices diverge sharply. Here's a systematic comparison across model families.

architecture transformers llm comparison moe

Blog Post·2024-06-19·6 min read

Inside the FFN: MoE, SwiGLU, and the Architectural Details That Scale

The FFN block holds most of a transformer's parameters. The choices made there (activation function, gating, expert routing) help explain much of the quality gap between model families.

transformers moe swiglu architecture llm-training