Tag: architecture

Blog Post·2024-06-19·3 min read

The Architecture Playground: What a Transformer Config Actually Buys You

An interactive research blog. Drag the config of a decoder-only transformer — hidden size, head counts, FFN type — and watch the parameter count, KV cache, and mixture-of-experts routing recompute live.

architecture transformers interactive kv-cache moe

Blog Post·2024-06-19·9 min read

Comparing Large Model Architectures: Attention, Normalization, and Scale

GPT-4, Gemini, LLaMA, Mistral, DeepSeek, Qwen — they all build on the same transformer skeleton. But the architectural choices diverge sharply. Here's a systematic comparison across model families.

architecture transformers llm comparison moe

Blog Post·2024-06-19·8 min read

How Decoder-Only Transformers Evolved Since GPT-2

GPT-2 established the decoder-only transformer as the dominant paradigm. What followed was more than five years of systematic improvements in scale, efficiency, alignment, and reasoning. Here's the arc.

transformers gpt architecture history llm

Blog Post·2024-06-19·6 min read

Inside the FFN: MoE, SwiGLU, and the Architectural Details That Scale

The FFN block holds most of a dense transformer's parameters, roughly two-thirds in a standard layer. The choices made there (activation function, gating, expert routing) account for much of the quality gap between model families.

transformers moe swiglu architecture llm-training

Blog Post·2024-06-19·6 min read

Attention Variants: MHA, MQA, GQA, and the Memory Math Behind Them

Multi-head attention was the original. Multi-query attention was the efficient approximation. Grouped-query attention is the synthesis that modern LLMs converged on, and the reason is memory bandwidth during decoding, not FLOPs.

transformers attention gqa architecture inference

Blog Post·2024-06-19·6 min read

Positional Encodings: From Sinusoids to RoPE

Self-attention on its own is permutation-equivariant: it has no notion of token order. Positional encodings break that symmetry. The choice of encoding method largely determines whether your model can generalize to sequences longer than those it trained on.

transformers rope positional-encoding architecture