Tag: adam

Blog Post·2024-06-19·6 min read

Optimization for LLMs: Gradient Descent to Adam

Training a neural network is an optimization problem: minimize a loss function over billions of parameters. The journey from vanilla gradient descent to Adam reveals why each step was necessary.

math optimization gradient-descent adam llm-training

Blog Post·2024-06-19·4 min read

Optimizers for LLMs: Adam, Weight Decay, and Why Learning Rate Matters More Than You Think

Adam is the default optimizer for language model training, but using it correctly (the right β values, weight decay, and learning rate schedule) makes a larger difference than most people expect.

transformers optimizer llm-training adam