S. Roy

Blog Post

Activation Functions: The Nonlinearity That Makes Neural Networks Work

Without nonlinearity, stacking layers collapses to a single matrix multiplication. Activation functions break that linearity — and the choice of which one determines expressivity, gradient flow, and training efficiency.


Part 1 made a point that becomes a problem here: a matrix is a linear map, and a stack of linear maps is still just a linear map. Multiply $W_n \cdots W_2 W_1$ together and you get a single matrix, so a network of nothing but linear layers, however deep, can only ever represent linear functions of its input. The activation function is what breaks that collapse: inserting a nonlinearity between the layers makes the composition genuinely nonlinear and unlocks the network's ability to approximate complicated functions.
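The collapse is easy to check numerically. The sketch below uses two hypothetical 2x2 weight matrices (the values are made up for illustration) and plain Python lists rather than any particular library; it compares applying the layers one after another with applying their product once, and then shows that a ReLU in between breaks the equivalence.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(a, x):
    """Apply matrix a to column vector x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]

def relu_vec(v):
    """Element-wise max(0, v)."""
    return [max(0.0, t) for t in v]

# Hypothetical weights and input, chosen only for illustration.
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[-1.0, 1.0], [2.0, 0.0]]
x = [1.0, 1.0]

# Two linear layers applied in sequence ...
stacked = matvec(W2, matvec(W1, x))
# ... give the same result as one layer whose matrix is the product W2 W1.
collapsed = matvec(matmul(W2, W1), x)

# With a ReLU between the layers, the single-matrix shortcut no longer matches.
nonlinear = matvec(W2, relu_vec(matvec(W1, x)))

print(stacked, collapsed, nonlinear)
```

Here `stacked` and `collapsed` agree, while `nonlinear` differs because the ReLU zeroes the negative component before the second layer sees it.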

The earliest choice was the sigmoid, which squashes any real number into the open interval $(0, 1)$.

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\big(1 - \sigma(x)\big)$$

Its derivative is the source of its downfall: it peaks at only $0.25$ (at $x = 0$), and by $x = \pm 5$ it has already fallen to about $0.0066$. In a deep network these small factors multiply together through the chain rule, shrinking the gradient exponentially as it propagates backward. This is the vanishing gradient problem that made deep sigmoid networks nearly impossible to train, which is why sigmoid survives today mainly inside gating components and the occasional binary output, and rarely as a hidden activation.
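To put numbers on the shrinkage, here is a minimal sketch that evaluates the sigmoid derivative at a few points and multiplies the best-case factor across ten layers. It deliberately ignores the weight matrices that also appear in the chain rule, so it illustrates only the activation's contribution, not a full backward pass.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

peak = sigmoid_grad(0.0)   # 0.25, the largest value the derivative ever takes
tail = sigmoid_grad(5.0)   # about 0.0066

# Best case for ten stacked sigmoids: every pre-activation sits exactly at 0.
# (Assumption: weights are left out, so this is only the activation's factor.)
best_case_10 = peak ** 10  # about 9.5e-07

print(peak, tail, best_case_10)
```

Even in the most favorable case the activation alone scales the gradient by less than one millionth after ten layers, and saturated units make it far worse.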

Tanh is the same shape shifted and stretched to be zero-centered, mapping inputs to $(-1, 1)$ instead of $(0, 1)$.

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1, \qquad \tanh'(x) = 1 - \tanh^2(x)$$

Being centered at zero helps gradient flow, because the activations no longer carry a constant positive bias into the next layer, but tanh saturates just as flatly as sigmoid at its extremes and so inherits the same vanishing-gradient disease in deep stacks.
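Both claims can be checked in a few lines: that tanh really is a rescaled sigmoid, and that its derivative collapses toward zero away from the origin. This is a small numerical sketch using only the standard library; the sample points are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    """tanh written as a shifted, stretched sigmoid: 2*sigma(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

# Arbitrary sample points for the identity check.
xs = [-3.0, -0.5, 0.0, 0.5, 3.0]
max_identity_error = max(abs(math.tanh(x) - tanh_via_sigmoid(x)) for x in xs)

grad_at_zero = tanh_grad(0.0)  # 1.0, four times the sigmoid's peak of 0.25
grad_at_five = tanh_grad(5.0)  # below 1e-3: saturated, like the sigmoid

print(max_identity_error, grad_at_zero, grad_at_five)
```

The peak gradient of one is a real improvement over the sigmoid's 0.25, but the tails are just as flat, which is the saturation problem described above.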

The function that broke the field open was the rectified linear unit, which is almost embarrassingly simple: pass positive inputs straight through and zero out the rest.

$$\mathrm{ReLU}(x) = \max(0, x)$$

Its derivative is exactly one for positive inputs and zero for negative ones (the kink at zero is assigned a value by convention), so it never saturates on the positive side and gradients flow through unattenuated. That property made very deep networks trainable and let ReLU dominate from roughly 2012 to 2020. Its one failure mode is the dead neuron: if a unit's pre-activation is negative for every input, perhaps because of a large negative bias, its gradient is always zero and it never updates again. With sensible initialization and learning rates this usually afflicts only a small fraction of units, and training proceeds fine.
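A dead neuron is simple to reproduce by hand. The sketch below uses a hypothetical single unit with weight `w` and a large negative bias `b` (both values invented for illustration) and shows that its gradient is zero on every input in a small batch, so no gradient step could ever revive it.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Convention used here: the derivative at exactly 0 is taken to be 0.
    return 1.0 if x > 0 else 0.0

# Hypothetical unit: weight 1.0 and a large negative bias.
w, b = 1.0, -10.0
inputs = [-2.0, 0.0, 1.5, 3.0]

pre_activations = [w * x + b for x in inputs]
outputs = [relu(z) for z in pre_activations]
grads = [relu_grad(z) for z in pre_activations]

# Every pre-activation is negative, so every local gradient is zero.
is_dead = all(g == 0.0 for g in grads)

print(outputs, grads, is_dead)
```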

The dead-neuron problem invites obvious patches, and Leaky ReLU and ELU are the two common ones. Leaky ReLU gives the negative side a small slope so gradient never fully vanishes there, while ELU uses a smooth exponential curve for negative inputs that also produces negative outputs.

$$\mathrm{LeakyReLU}(x) = \max(\alpha x, x), \qquad \mathrm{ELU}(x) = \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & x \le 0 \end{cases}$$

Both keep negative-region gradients alive and ELU's negative outputs help re-center activations the way tanh does, but in practice the improvement over plain ReLU is modest, which is why neither displaced it.
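The two patches translate directly into code. In this sketch the default slopes (`alpha=0.01` for Leaky ReLU, `alpha=1.0` for ELU) are assumptions chosen as typical values, not something the formulas above fix; note that the `max` form of Leaky ReLU is only correct when `alpha` lies between 0 and 1.

```python
import math

def leaky_relu(x, alpha=0.01):
    # max(alpha * x, x) picks x for x > 0 and alpha * x otherwise,
    # provided 0 < alpha < 1 (assumed default slope: 0.01).
    return max(alpha * x, x)

def leaky_relu_grad(x, alpha=0.01):
    return 1.0 if x > 0 else alpha

def elu(x, alpha=1.0):
    # Smooth exponential branch for x <= 0; output approaches -alpha.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def elu_grad(x, alpha=1.0):
    return 1.0 if x > 0 else alpha * math.exp(x)

# Unlike ReLU, both keep a nonzero gradient on the negative side.
print(leaky_relu(-2.0), leaky_relu_grad(-2.0))
print(elu(-2.0), elu_grad(-2.0))
```

ELU's gradient still decays toward zero for very negative inputs, so it softens the dead-neuron problem rather than removing it outright, whereas Leaky ReLU keeps a constant slope.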

The activation that took over language models is GeLU, which weights each input by the probability that a standard Gaussian falls below it — a soft, probabilistic version of ReLU's hard gate.

$$\mathrm{GeLU}(x) = x \, \Phi(x) \approx x \, \sigma(1.702 \, x)$$

Here $\Phi$ is the standard Gaussian CDF, so instead of ReLU's abrupt cutoff at zero, GeLU smoothly scales each input by how likely it is to be "on," giving a function that is differentiable everywhere including at the origin; this smoothness consistently helps on language tasks, and GeLU became the activation of BERT, GPT-2, and GPT-3.
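The exact form and the sigmoid approximation can be compared directly. This sketch writes the Gaussian CDF with the standard library's `math.erf` and measures the largest gap between the two versions on a grid over $[-5, 5]$; the grid itself is an arbitrary choice for illustration.

```python
import math

def gelu(x):
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) is the standard normal CDF.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_sigmoid_approx(x):
    # x * sigma(1.702 * x), the cheaper approximation.
    return x / (1.0 + math.exp(-1.702 * x))

# Arbitrary evaluation grid: -5.0, -4.9, ..., 5.0
xs = [i / 10.0 for i in range(-50, 51)]
max_gap = max(abs(gelu(x) - gelu_sigmoid_approx(x)) for x in xs)

print(gelu(1.0), gelu_sigmoid_approx(1.0), max_gap)
```

The two curves stay within a few hundredths of each other everywhere on the grid, which is why the approximation is usable when evaluating `erf` is inconvenient.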

The current frontier default goes a step further by making the nonlinearity itself input-dependent. SwiGLU splits the input through two separate linear projections, runs one through a Swish nonlinearity, and multiplies them element-wise.

$$\mathrm{SwiGLU}(x, W, V) = \mathrm{Swish}(xW) \otimes (xV), \qquad \mathrm{Swish}(x) = x \, \sigma(\beta x)$$
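Read literally, the formula is two matrix products followed by an element-wise multiply. The following is a minimal sketch in plain Python, with $\beta = 1$ as an assumed default and toy weights (a hypothetical `d_model` of 2 and hidden width of 3) picked so the result is easy to verify by hand; biases and the output projection are left out.

```python
import math

def swish(z, beta=1.0):
    """Swish(z) = z * sigmoid(beta * z)."""
    return z / (1.0 + math.exp(-beta * z))

def vecmat(x, M):
    """Row vector x (length d) times matrix M (d rows, h columns)."""
    return [sum(x[i] * M[i][j] for i in range(len(x)))
            for j in range(len(M[0]))]

def swiglu(x, W, V, beta=1.0):
    a = vecmat(x, W)  # branch that passes through Swish
    g = vecmat(x, V)  # linear branch
    return [swish(a_j, beta) * g_j for a_j, g_j in zip(a, g)]

# Hypothetical toy weights: d_model = 2, hidden width = 3.
x = [1.0, -1.0]
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
V = [[1.0, 1.0, 0.0],
     [1.0, 0.0, 0.0]]

h = swiglu(x, W, V)
print(h)
```

For this input $xV = (0, 1, 0)$, so the first and third hidden units are multiplied by zero and only the second one passes its Swish value through.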

The second projection acts as a gate that decides, per component, how much of the first projection passes through. Because this gating spends parameters on a third matrix, the feed-forward hidden dimension is set to $\tfrac{8}{3} d_\text{model}$ rather than the usual $4 d_\text{model}$ to keep the parameter count matched. SwiGLU is now used in LLaMA, PaLM, Gemini, and essentially every frontier model since 2022, and the practical reasons sit alongside the other architecture internals that determine how transformers train.
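The $\tfrac{8}{3}$ factor falls out of simple counting: a standard feed-forward block has two weight matrices and a gated one has three, so $2 \cdot d \cdot 4d = 3 \cdot d \cdot \tfrac{8}{3}d = 8d^2$. The sketch below checks this for an illustrative `d_model` of 768, chosen because it is divisible by 3; biases are ignored and the function names are made up for this example.

```python
def ffn_params(d_model, d_hidden):
    # Standard block: up-projection plus down-projection, biases ignored.
    return 2 * d_model * d_hidden

def swiglu_ffn_params(d_model, d_hidden):
    # Gated block: W and V going up, plus the down-projection.
    return 3 * d_model * d_hidden

d_model = 768  # illustrative; divisible by 3 so (8/3) * d_model is an integer

baseline = ffn_params(d_model, 4 * d_model)
gated = swiglu_ffn_params(d_model, 8 * d_model // 3)

print(baseline, gated)
```

When `d_model` is not divisible by 3 the hidden width has to be rounded, so the match is approximate rather than exact.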

It is worth being explicit about why a gate helps, because it is a different idea from a plain nonlinearity. In the SwiGLU formula the gate is the linear factor $xV$: a learned, input-dependent multiplier on each hidden unit rather than a sigmoid confined to $(0, 1)$. The feed-forward block can therefore amplify, suppress, or switch off different subsets of its neurons depending on the input, a form of conditional computation where the effective function the block computes changes from token to token rather than being fixed by the weights alone. That extra flexibility buys roughly a point of perplexity over GeLU at matched compute, which at scale is a large margin for so small a change.

Standing behind all of these choices is a guarantee that explains why any of it is worth doing: the universal approximation theorem says a single hidden layer with a suitable nonlinear activation (any continuous non-polynomial function works) and enough width can approximate any continuous function on a compact domain to arbitrary precision. The theorem is about existence, not construction: it promises nothing about how many neurons you need, which could be astronomically many, or whether gradient descent will ever find the weights. What it does establish is that the function class neural networks can represent is rich enough to be worth searching, and the activation function is the ingredient that makes the class rich in the first place.
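For ReLU in one dimension the theorem can be made concrete without any training, because a sum of shifted ReLUs is a piecewise-linear function. The sketch below builds such a hidden layer by hand so that it interpolates $\sin$ on $[0, 2\pi]$, then measures how the worst-case error falls as the layer gets wider. It is a hand-built illustration of the idea under these assumptions (one input, evenly spaced knots, hypothetical helper names), not a proof and not how networks are actually fit.

```python
import math

def relu(z):
    return max(0.0, z)

def build_relu_interpolant(f, a, b, n_units):
    """One hidden layer of n_units ReLUs that linearly interpolates f on [a, b]."""
    knots = [a + (b - a) * i / n_units for i in range(n_units + 1)]
    slopes = [(f(knots[i + 1]) - f(knots[i])) / (knots[i + 1] - knots[i])
              for i in range(n_units)]
    # Output weight of unit i is the change in slope at knot i.
    coeffs = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n_units)]
    bias = f(a)

    def net(x):
        # bias + sum_i coeff_i * ReLU(x - knot_i)
        return bias + sum(c * relu(x - k) for c, k in zip(coeffs, knots[:-1]))

    return net

def max_error(f, net, a, b, samples=1000):
    """Largest absolute error on an evenly spaced grid over [a, b]."""
    xs = [a + (b - a) * i / samples for i in range(samples + 1)]
    return max(abs(f(x) - net(x)) for x in xs)

a, b = 0.0, 2.0 * math.pi
err_8 = max_error(math.sin, build_relu_interpolant(math.sin, a, b, 8), a, b)
err_64 = max_error(math.sin, build_relu_interpolant(math.sin, a, b, 64), a, b)

print(err_8, err_64)
```

Widening the layer from 8 to 64 units shrinks the error by well over an order of magnitude, which is the "enough width" clause of the theorem in miniature.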

These five tools — the geometry of vectors and matrices, the probability of distributions, the optimizer that minimizes a loss, the loss that specifies the goal, and the nonlinearity that gives the network its expressive power — are the load-bearing mathematics under every transformer paper. The series continues from here into the tools for analyzing, compressing, and evaluating what these networks learn: the SVD and low-rank structure that make LoRA work, the entropy and information measures that quantify what a model knows, and the evaluation metrics that tell us whether it knows the right things.

Cite this work


Roy, Swastik. (2024). Activation Functions: The Nonlinearity That Makes Neural Networks Work. S. Roy. https://swastikroy.me/blog/math-llm-activation-functions
