aaron brooks

Hello, world

· meta

This post exists to exercise the rendering pipeline. Replace it with real content before launch.

Code

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # Scale by 1/sqrt(d_k) to keep dot-product magnitudes in a stable range.
    scale = q.size(-1) ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale
    if mask is not None:
        # Masked positions get -inf so their softmax weight is exactly zero.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
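
A quick shape check, as a minimal sketch: the batch size, head count, sequence length, and head dimension below are arbitrary, and the causal mask is one common choice rather than something the function requires.

q = torch.randn(2, 4, 8, 16)  # (batch, heads, seq_len, head_dim)
k = torch.randn(2, 4, 8, 16)
v = torch.randn(2, 4, 8, 16)

# Causal mask: position i may attend only to positions j <= i.
causal = torch.tril(torch.ones(8, 8)).bool()

out = scaled_dot_product_attention(q, k, v, mask=causal)
print(out.shape)  # torch.Size([2, 4, 8, 16])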

Math

The softmax over a vector $\mathbf{z} \in \mathbb{R}^K$ is:

$$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

Inline math also works: $\nabla_\theta \mathcal{L}(\theta) = \mathbb{E}\left[\nabla_\theta \log p_\theta(x)\right]$.
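
The display formula transcribes directly into code. A minimal sketch: the max-subtraction is the standard numerical-stability trick, and the example vector is arbitrary.

import torch

def softmax(z):
    # Subtracting max(z) leaves σ(z) unchanged but prevents overflow in exp.
    e = torch.exp(z - z.max())
    return e / e.sum()

z = torch.tensor([1.0, 2.0, 3.0])
print(softmax(z))  # tensor([0.0900, 0.2447, 0.6652])
print(torch.allclose(softmax(z), torch.softmax(z, dim=0)))  # True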

MDX

MDX lets us drop components inline later: interactive demos, charts, tensor visualizers.