1. Foundation

Tokenization

3 types: char, word, subwords
- word: 1) OOV 2) long-tail effect 3) won’t learn the shared meaning between swim and swimming.
- char: sequence length too long → expensive compute.
- subword units.
Subwords Methodologies
- 1. Start with 26 chars + punctuation, a training corpus, and a required vocab size \(V\).
  2. Split all tokens into sub-pieces, e.g. cat → [c, a, t, ca, at, cat].
  3. Question: how to merge units into \(|V|\) important subwords so that you best represent the corpus? You need an importance score over candidate subwords.
- 3 methods
  - BPE / Byte-level BPE.
    Merge on frequency. Count frequencies of all units, merge subwords that reduce the description length / token count, and update the vocab. Start from the most frequent units (usually chars).
    [code]
  - BBPE: Byte-level BPE.
    - Instead of using strings, use bytes.
    - Pros:
      - BPE enables multilingual learning.
      - Compression.
      - Not just for text; can also work on images etc.
  - WordPiece: find \(|V|\) subwords that give the best log-likelihood of the training corpus \(S\). text token x, y --> token z \[ \log P(z) - (\log P(x) + \log P(y)) = \log \frac{P(z)}{P(x)P(y)} \]
  - Unigram Language Model
    - How to score if a token is important? A token is important if deleting it (and only using its sub-units) causes larger loss for the unigram model.
    - Keep single-char tokens to avoid OOV. \[ p(x) = \prod_i p(x_i), \qquad \hat{x} = \arg\max_x p(x) \]
Comparisons
- BPE and WordPiece: how to merge? From small vocab to big.
- Unigram LM: how to use the LM to shrink vocab from big to small?
Tokenization is the root of a lot of issues: because LLMs are next-token predictors, they struggle with exact arithmetic.

Embedding

Why? Word embeddings are dense representations that capture semantic meanings. It’s basically training an MLP to map one-hot vectors into dense vectors that represent word meaning.
Lookup table nn.Embedding(vocab_size, embedding_dim) turns a \(|V|\)-dim one-hot vector into a 512/768‑dim dense vector.
Everything is a mapping: map a word into a vector, a sentence into a vector, or tokens into a high‑dimensional embedding space. The question is: how do you train that mapping?
Embedding is injective and (approximately) structure‑preserving.
Methods
- one‑hot
  - Problems: dimensionality explosion; every token is orthogonal, so you can’t directly learn semantic similarity.
  - In some cases, you can do one‑hot encoding + PCA.
  - Why use one‑hot? It turns discrete IDs into a structured continuous space where each feature is orthogonal.
- distributed representation: you need to turn one‑hot features into Euclidean space so you can measure \(L_p\) distances. Dense representation is (1) easy to measure similarity and (2) more compact.
- word2vec: predict target word given context words.
  - CBOW (Continuous Bag of Words)
    - Given the words around the blank, what word should be in the blank?
    - Average over context words’ embeddings.
    - Ignores order.
    1. Context Words
      
      Context words: \(w_1, w_2, \dots, w_C\)
      
      Target word: \(w_t\)
    2. One‑Hot Encoding
      
      Each word is a one‑hot vector of size \(V\) (vocabulary size): \[ x_i \in \mathbb{R}^V \]
    3. Embedding Lookup
      
      Embedding matrix: \[ W \in \mathbb{R}^{V \times D} \] Word embedding for context word \(w_i\): \[ v_i = W^T x_i \] (Just selects the word’s embedding row.)
    4. Hidden layer: average the context vectors \[ h = \frac{1}{C} \sum_{i=1}^{C} v_i \]
    5. Output Layer
      
      Second matrix: \[ W' \in \mathbb{R}^{D \times V} \] Scores for all words: \[ u = {W'}^T h \]
    6. Softmax Prediction \[ p(w_t \mid \text{context}) = \frac{\exp(u_t)}{\sum_{j=1}^{V} \exp(u_j)} \] Maximize the probability of the target word: \[ \max \log p(w_t \mid \text{context}) \] (Usually approximated with negative sampling)
- skip‑gram: predict context words FROM the target word.
- FastText: still uses CBOW or Skip‑Gram, but replaces the word embedding with the sum of its character n‑gram vectors.
  
  Given word playing, n‑grams with \(n = 3\): <pl, pla, lay, ayi, yin, ing, ng> + <playing> (sliding window size = 3)
  
  Embedding of the word playing is now \[ v_{\text{playing}} = \sum_{g \in G} z_g \] Where:
  - \(G\) = set of character n‑grams of the word
  - \(z_g\) = embedding vector of each n‑gram
  Then apply CBOW or skip‑gram on top: \[ \max \sum_{c \in \text{context}} \log p\bigl(c \mid \text{ngrams}(w_t)\bigr) \] This works because granularity is better: (1) fewer OOVs, (2) more efficient since vocab size drops.
word2vec: how to accelerate?
- hierarchical softmax
  - Binary tree. Complexity: \(|V| \to \log V\).
  - Higher‑freq words are located closer to the root.
  - “Greedy optimization”.
  - Cons: if the word is rare, you have to go deep into the tree.
- Negative sampling
  - Definition: a training trick used in word2vec (CBOW & Skip‑Gram) to avoid computing a slow full softmax over the entire vocabulary.
  - Why? Faster and more efficient.
  - How?
    - Full softmax requires a sum over all \(V\) words, which is extremely slow when \(V\) is large.
    - Negative sampling replaces this expensive softmax with a binary classification task.
      
      1 positive example (the true target/context word): text (target = "cat", context = "cute") \(K\) negative examples (random wrong words): text ("cat", "banana") ("cat", "engine") ("cat", "chair")
    - Math
      
      Positive example: \[ \log \sigma \bigl(v_{\text{context}}^\top v_{\text{target}}\bigr) \] Negative examples: \[ \sum_{i=1}^{K} \log \sigma \bigl(-v_{\text{neg}_i}^\top v_{\text{target}}\bigr) \] Total loss: \[ L = \log \sigma(v_{+}^\top v_t) + \sum_{i=1}^{K} \log \sigma(-v_{\text{neg}_i}^\top v_t) \] where \(\sigma\) is sigmoid, \(v_+\) is the real context embedding, and \(v_{\text{neg}_i}\) are negative embeddings.

Attention

Why?
- Core idea: assign importance scores to different tokens and focus on the important ones.
- Handles long sequences by keeping order information.
- Parallelizable and efficient to compute.
self‑attention
- modules
  - vocab embedding: what am I? what word meaning do I have?
  - positional embedding: where am I?
  - query: what information are you looking for? captured by the query matrix.
  - key
  - value: combined with key to construct importance scores. The better the match, the larger the score.
  - importance score becomes \[ \mathrm{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V \]
  - why softmax? make sure the scores are normalized to \([0,1]\) and sum to 1 (a probability distribution).
  - why scale? control variance. If \(d_k\) is large, the unscaled dot products blow up, softmax saturates, and gradients vanish.
self‑attention / encoder attention: full attention; a token can see all other tokens.
decoder attention: masked / causal attention; auto‑regressive; prevents data leakage.
cross attention: queries come from sequence 1, keys and values from sequence 2 (e.g., English vs. Chinese in translation).
multi‑head self‑attention: multiple attention “heads” increase model capacity.
Complexity: \(O(N^2)\) where \(N\) is the number of tokens per sequence.
Sparse attention
- How? Only study relationships within the closest \(k\) tokens (sliding window). This adds a locality prior.
- By controlling the attention mask, you control locality patterns.
- Hand‑engineered masks can be suboptimal; ideally you would learn the sparsity pattern.
- [code]

Linear attention

How? Swap softmax into linear kernels.
Why? Drops complexity from \(O(N^2)\) to \(O(N)\).

Math

Softmax prevents factorization: \[ \operatorname{softmax}(QK^\top) \] cannot be written as a simple product of smaller matrices.

If we use a feature map \(\phi\) and write \[ \phi(Q)\phi(K)^\top \] we can reorder operations. Instead of: \[ \operatorname{softmax}(QK^\top)V \;\to\; O(N^2) \] we compute: \[ \phi(Q)\bigl(\phi(K)^\top V\bigr) \;\to\; O(N) \]

where \(\phi\) is a kernel feature map:

\(\phi(x)\)	Range	Positive?
ReLU(x) + 1	\([1, \infty)\)	Yes
ReLU(x)\(^2\)	\([0, \infty)\)	Yes
ELU(x) + 1	\((0, \infty)\)	Yes ✓ (best)
LeakyReLU(x) + 1	\((-\infty, \infty)\)	No
\(\exp(x)\)	\((0, \infty)\)	Yes ✓
\(\cos / \sin\)	\([-1, 1]\)	No
\(x\)	\((-\infty, \infty)\)	No
\(x^2\)	\([0, \infty)\)	Yes

As long as the kernel feature map is factorized, positive, and smooth, this works and is optimization‑friendly.

Now there is no explicit \(N \times N\) matrix → no quadratic cost → linear time.

downside:
- We lose the normalization effect of softmax.
- Attention distribution becomes less sharp.
- Sometimes lower quality on tasks requiring exact token interactions.
- Softmax enforces a probability distribution over attention weights—linear attention does not.
implementation of linear attention in transformers

KV cache
- Mainstream attention in decoders: masked self‑attention, auto‑regressive. It uses all \(K, V\) of previous tokens. To avoid recomputing them at each step, we cache previous \(K, V\) for efficiency.
inference framework lmdeploy
Flash‑attention
- Why? For long sequences: you split into chunks, compute attention in tiles, then aggregate.
- How? Map‑reduce style:
  1. Chunk the sequence into blocks (tiles).
  2. For each block: load \(Q, K, V\) tiles into high‑speed SRAM. Compute attention scores and partial outputs.
  3. Reduce the partial results to final output.
- This avoids materializing the full attention matrix.
- Flash‑attention leads to flash-decoding.
- Compute attention incrementally, tile by tile.
- Keep intermediary data on‑chip (GPU shared memory).
- Use numerically stable softmax with running maxima.
Streaming LLM
- Masked self‑attention + tiling to enable efficient auto‑regressive generation over streams.
MHA (multi‑head attention)
- Why multiple heads? Similar to group convolution. Many parallel channels, each focusing on different details / subspaces of the sequence. Each head can learn different semantics: text head 1: positional patterns head 2: syntax head 3: long-range links head 4: entities / names
MQA (multi‑query attention)
- How it works?
  - Multiple queries, shared keys, shared values.
  - \(Q\) still has \(H\) heads. But \(K\) and \(V\) are single‑headed (shared).
- Why?
  - Efficient in large‑scale inference with only a small performance drop.
Grouped‑query attention
- Why? MQA is too extreme (1 group), MHA is expensive (\(H\) groups). In between, you have \(G\) groups of queries. Each group has multiple heads, and shares \(K, V\).
- You keep good quality while maintaining speed.
Multi‑head latent attention
- Combines multi‑head attention and linear attention.
DCA (Dual Chunk Attention)
- How?
  1. Split long‑form sequence into chunks.
  2. Calculate attention within a chunk.
  3. Calculate cross‑block attention: attention between chunks.
  4. Concat results.
- Why?
  - Efficient compute.
  - Less RAM cost.
S2 attention:
- How? (various sparse + sliding schemes for long sequences).

FFN + LN

Why FFN?

After MHA, tokens exchange information. The FFN is for local computation. #### Why add residuals? 1. Avoid gradient explosion / vanishing in deep networks. Residuals add shortcut connections. 2. Standard residual: an express lane for information.

Method	Formula	Notes
Standard Residual	\[x_{l+1} = x_l + f(x_l)\]	Original ResNet‑style skip connection
Post‑LN Transformer	\[x_{l+1} = \text{LN}(x_l + f(x_l))\]	Used in early Transformers (BERT), less stable for deep models
Pre‑LN Transformer	\[x_{l+1} = x_l + f(\text{LN}(x_l))\]	Used in modern LLMs (GPT, LLaMA), very stable
DeepNorm	\[x_{l+1} = x_l + \alpha \cdot f(\text{LN}(x_l))\] Encoder: \[\alpha = \frac{1}{\sqrt{2N}}\] Decoder: \[\alpha = \frac{1}{\sqrt{4N}}\]	Adds residual scaling so very deep models stay stable; \(N\) is ## of layers.
ReZero	\[x_{l+1} = x_l + \alpha \cdot f(x_l)\] (α learnable, init = 0)	Starts with zero residual; stabilizes early training

Normalizations

Why? Stable convergence; avoid gradients exploding or vanishing.

Types

LayerNorm
- Normalize over feature dimension = hidden size = embedding size = token vector size (e.g. 512/768).
- Mainly for RNNs, Transformers. LayerNorm works per token, across its features.
- LayerNorm does NOT look across tokens. You want each token embedding on a similar scale.
BatchNorm
- Normalize over batch_size.
- Mainly for CNNs, where you want different images (e.g. faces) on the same scale.
How to do norm correctly?

Method	Formula	What It Normalizes / Changes	Notes
Vanilla LayerNorm	\[x_{\text{norm}} = \frac{x - \mathbb{E}[x]}{\sqrt{\operatorname{Var}(x)+\epsilon}} \cdot \gamma + \beta\]	Normalizes per token across features (mean + variance)	Standard LN used in early Transformers
RMSNorm	\[\text{RMS}(x)=\sqrt{\frac{1}{H}\sum_{i=1}^{H} x_i^2 + \epsilon}\] \[x_{\text{norm}} = \frac{x}{\text{RMS}(x)} \cdot \gamma\]	Normalizes only by root mean square (no mean subtraction, no β)	Faster, more stable; used by LLaMA, T5 (no need to shift, only scale).
DeepNorm	\[x_{l+1} = x_l + \alpha \cdot f(\text{LN}(x_l))\] \[\text{Encoder: } \alpha=\frac{1}{\sqrt{2N}}, \quad \text{Decoder: } \alpha=\frac{1}{\sqrt{4N}}\]	Scales residual branches, not normalization itself	Enables very deep transformers (hundreds–thousands of layers)

FFN + Activations

Types of Activation

Vanilla FFN activation \[ \mathrm{FFN}(x) = \operatorname{ReLU}(xW_1 + b_1)W_2 + b_2 \]
GLU‑style FFN activation \[ \mathrm{GLU}(x) = (xV) \odot (xW + b) \]

\[ \mathrm{FFN}_{\mathrm{GLU}}(x) = (xV) \odot \sigma(x W_1 + b) W_2 \]

Why this works: the activation is like a gate. \(\sigma\) is the gate that controls which affine transformation passes through.

If you swap \(\sigma\) with Swish or GELU, you get the newer FFN designs used in current LLMs.

Activation functions

Activation	Formula	Notes
Sigmoid	\[\sigma(x) = \frac{1}{1 + e^{-x}}\]	Causes vanishing/exploding gradients; saturates for large \(\|x\|\); expensive due to exponential
Tanh	\[\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\]	Still suffers gradient vanishing; exponential compute; zero‑centered
ReLU	\[\text{ReLU}(x) = \max(0, x)\]	No gradient vanishing for \(x>0\); not smooth; gradients = 0 for negative values; not zero‑centered
Leaky ReLU	\[\text{LeakyReLU}(x) = \begin{cases}x, & x>0 \\ \alpha x, & x<0 \end{cases}\]	Prevents dying ReLU; still quite linear; simple compute
ELU	\[\text{ELU}(x) = \begin{cases}x, & x>0 \\ \alpha(e^x - 1), & x<0 \end{cases}\]	Smooth; negative values help shift mean; exponential = costly
Swish	\[\text{Swish}(x) = x \cdot \sigma(x)\]	Smooth, non‑monotonic; excellent performance in deep nets
GELU	\[\text{GELU}(x) = x \cdot \Phi(x)\] \(\Phi\) = Gaussian CDF	Used in Transformers; smooth, probabilistic gating behavior
SwiGLU	\[\text{SwiGLU}(x) = \text{Swish}(x_1) \odot x_2\]	Most effective MLP activation in modern LLMs
Softmax	\[\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}\]	Converts logits to a probability distribution

Positional Encoding

Why? Transformers don’t learn order by default.

Types

Absolute positional encoding \[ PE(t) = [\sin(\omega_0 t), \sin(\omega_1 t), \dots, \sin(\omega_n t)] \] \[ \omega_i = 10000^{-i/d_{\text{model}}} \]
- WordEmbedding + PositionalEmbedding: use ADD, not CONCAT.
- When you visualize it: visual
- For very long sequences, fixed sin/cos frequencies can become problematic.
- Attention requires pairwise dot‑products; positional encodings can be baked into those dot‑products.
- How to solve long‑context issues?
- BERT: learn an MLP / embedding table for positions. Initialize a \([512, 768]\) matrix, concat with word embeddings, then learn the positional matrix. Downside: cannot process longer than the max length (e.g., 512 tokens).
- RNN positional encoding: add an RNN on top of word embeddings so RNN learns positions. Downside: cannot fully parallelize.
Relative positional encoding
- Intra‑token distances: Construct a relative distance measure between token \(i\) and \(j\): \(f(|i-j|)\).
- 2 Major variants:
  - RoPE: if the encoding is embedded into attention.
  - ALiBi: if the encoding is added as a soft bias in attention; you penalize weights of tokens \(j\) farther from token \(i\).

Fancy Positional Encoding Methods

Method	Formula	What It Encodes	Intuition
Sinusoidal (Absolute)	\[PE(t)_i = \sin(t \cdot \omega_i),\quad \omega_i = 10000^{-i/d}\]	Absolute position index	Each position gets a unique “wave signature.” Different frequencies let the model infer order (like reading a timestamp).
Learned Absolute	\[PE(t) = E_t\]	Absolute position index	The model learns a table of positional vectors, like a “positional vocabulary.”
Relative (Shaw et al.)	\[\text{score}_{ij} = Q_i K_j^\top + a_{i-j}\]	Relative distance \((i-j)\)	Instead of “where are they,” it learns “how far apart are they.” This matches how language works (“this word modifies the next one”).
Intra-token Distance (General)	\[PE(i,j) = f(\lvert i-j \rvert)\]	Arbitrary distance functions	Can define any mapping from distance → bias. ALiBi and T5 are specific cases of this general framework.
ALiBi (Attention Linear Bias)	\[\text{score}_{ij} += w_h \cdot (i - j)\]	Linear distance penalty	Far‑away tokens get a soft negative penalty → similar to a soft mask that encourages attending to recent tokens. Generalizes to very long sequences.
RoPE (Rotary Position Embedding)	\[Q_t^{\text{rot}}=R_tQ_t,\quad K_t^{\text{rot}}=R_tK_t\] \[R_t=\begin{bmatrix}\cos(\theta t)&-\sin(\theta t)\\\sin(\theta t)&\cos(\theta t)\end{bmatrix}\]	Encodes relative shift through geometric rotation	Positions rotate Q/K vectors in a circular space. Relative distances become differences in rotation. Elegant, smooth, natural for attention.
T5 / DeBERTa (Relative Bias)	\[\text{score}_{ij} = Q_i K_j^\top + b_{\lvert i-j \rvert}\]	Bucketed distances	Distances are grouped into buckets. “Far but not too far” gets the same bucket. Stable and easy for NLP tasks.

My intuition is that

Sinusoidal / sin wave → every position = unique wave pattern
Learned → positions learned like embeddings
Shaw Relative → model learns “how far apart”
ALiBi → soft distance mask (recent > far)
RoPE → rotate queries/keys; relative emerges naturally
T5/DeBERTa → distance buckets (coarse relative)
Intra-token → any distance function you want

Length Extrapolation

def: when input sequence is too long. Training seq length is, say, 20k; inference seq 128k. Model collapses because it never saw tokens at such large positions.
Goal: encode positional info so the model can extrapolate to very long lengths.

Method	Math (Core Formula)	Intuition (Why it works)	Pros	Cons
RPE (Relative Position Embedding)	\[\text{score}_{ij} = Q_i K_j^\top + a(i-j)\]	Encode relative distance rather than absolute position → natural for long sequences	Good extrapolation; matches linguistic structure; stable	Needs learned embeddings; bucketization often required
RoPE (Rotary Positional Encoding)	\[Q_t^{\text{rot}}=R_tQ_t,\; K_t^{\text{rot}}=R_tK_t\] \[R_t=\begin{bmatrix}\cos(\theta t)&-\sin(\theta t) \\ \sin(\theta t)&\cos(\theta t)\end{bmatrix}\]	Encode positions as rotations; relative position emerges from angle differences	Smooth, elegant; widely used (LLaMA/Qwen); supports relative shifts	High‑frequency rotation grows too fast → breaks at very long context
ALiBi (Attention Linear Bias)	\[\text{score}_{ij} += w_h (i-j)\]	Apply linear distance penalty, like a soft mask → model naturally prefers recent tokens	Perfect long‑context extrapolation; no computation overhead	No rotational/periodic structure; may weaken modeling of long‑distance patterns
Position Insertion (PI)	\[t' = \frac{L}{L_{\text{new}}}\cdot t\]	Compress long sequence positions back into training length → model sees familiar ranges	Extremely simple; works decently; improved by quick fine‑tuning	Pure interpolation → may distort long‑range structure
Base‑N Positional Encoding	\[t = \sum_k d_k B^k,\quad PE_k(t) = d_k\]	Treat position as digits in multiple bases; digit count grows naturally	Infinite extrapolation; multi‑scale; simple math	Discontinuous jumps; may need smoothing / fine‑tuning
NTK‑aware RoPE Scaling	\[\theta' = \theta / s\]	Slow down all RoPE frequencies → prevent angle explosion	Easy, effective; improves long‑context stability	Uniform scaling may under/over‑correct different frequencies
NTK by Parts (Frequency‑segmented NTK)	\[\theta'_i = \theta_i / s\cdot{\text{part}(i)}\]	Low freq → keep; mid freq → mild scale; high freq → strong scale	Better balance than global NTK; reduces distortion	More parameters; requires careful design
Dynamic NTK	\[\theta'_i = \theta_i \cdot \frac{m}{\beta^{d/2-1}}\]	Per‑dimension adaptive scaling; distributes “frequency stress” smarter	Most adaptive NTK variant; very stable long‑range	Harder to derive; more implementation steps
YaRN (Yet Another RoPE Extension)	\[\text{softmax}\left(\frac{q_m^\top k_n}{t \sqrt{d_k}}\right)\] with \(q, k\) scaled by \(\sqrt{1/t}\), \[\sqrt{1/t}=0.1 \ln(s)+1\]	Combine frequency scaling + controlled growth → avoids over‑shrink & under‑rotation	SOTA long‑context RoPE scaling; robust and smooth	Implementation slightly more complex

Intuitively

Method	Key Idea
RPE	Learn distances
RoPE	Encode angle rotation
NTK	Slow down rotation globally
NTK parts	Slow down rotation by frequency segment
Dynamic NTK	Smart per‑dimension scaling
YaRN	Best RoPE scaling (balanced + stable)
ALiBi	Linear recency bias = soft mask
PI	Compress indices back to training max
Base‑N	Interpret positions as digits
Intra-token	Arbitrary distance mapping

Architecture

Encoder‑Only Models (Bi‑Directional)
- Definition: encoder‑only models compute attention over both the left and right context: \[ \mathrm{Attention}(x_i) = f(x_{<i}, x_i, x_{>i}) \]
- Characteristics
  - Naturally bi‑directional → good for feature extraction, embeddings, classification.
  - Pretraining objective usually MLM (Masked LM).
  - Not suitable for generation by themselves.
- Examples: BERT, RoBERTa, DeBERTa
Decoder‑Only Models (Auto‑Regressive)
- Definition: next‑token prediction. Masked causal attention ensures only left context is accessible: \[ p(x_t \mid x_{<t}) \]
- Characteristics
  - Ideal for generation.
  - Naturally supports zero‑shot, few‑shot prompting.
  - Efficient training (causal LM).
  - Unified pretraining → directly used for downstream tasks.
- Examples: GPT, GPT‑2/3/4, LLaMA, BLOOM, OPT, Mistral
Encoder‑Decoder Models (Seq2Seq)
- Encoder \[ h = \text{Enc}(x_{1:n}), \quad \text{bi‑directional} \]
- Decoder \[ p(y_t \mid y_{<t}, h), \quad \text{auto‑regressive} \]
- Characteristics
  - Best for machine translation, summarization, instruction models.
  - Decoder attends to both the encoder output and its own past.
- Examples: T5, Flan‑T5, BART
PrefixLM (Partial Causal Masking)
- A hybrid: the prefix is encoded bi‑directionally; the suffix is decoded causally. This gives encoder‑decoder benefits with decoder‑only efficiency.
- Mask \[ M_{ij} = \begin{cases} 0, & j \leq P \\ (\text{prefix bi‑dir})\\ 0, & j < i \\ (\text{causal})\\ -\infty, & \text{otherwise} \end{cases} \]
- Examples: GLM, U‑PaLM
Mixture of Experts (MoE)

Replaces a single feedforward block with multiple experts \[ \mathrm{FFN}(x) = \sum_{k=1}^{N} g_k(x)\, E_k(x) \] Where
1. \(E_k\) = expert networks
2. \(g_k\) = gating weights (softmax or top‑\(k\))
- Gating Functions
  - Linear gate
  - GLU gate
  - Domain routing
  - Expert‑choice gating
  - Attention router
- SoftMoE
  - Experts are merged into a single FNN via soft combination: \[ \mathrm{SoftMoE}(x)=W\Big(\sum_k g_k(x) E_k(x)\Big) \]
  - Examples: DeepSeek V2/V3, Mistral MoE, Switch Transformer

Summary

Architecture	Attention Style	Objective	Strengths	Examples
Encoder‑Only	Bi‑directional	MLM	Feature extraction	BERT
Decoder‑Only	Causal	NLL	Generation, zero‑shot	GPT, LLaMA
Encoder‑Decoder	Bi‑dir + Causal	Seq2Seq	Translation, summarization	T5, BART
PrefixLM	Hybrid mask	Prefix	Reasoning, dialogue	GLM
MoE	Routed Experts	Sparse	Efficient scaling	DeepSeek, Mistral

Decoding

Decoding = sampling the next token from the model’s distribution: \[ p(x_t \mid x_{<t}) \] Different methods trade off diversity, coherence, and accuracy.

Greedy Search [code]

\[ x_t = \arg\max_i p(i \mid x_{<t}) \] * Always pick the highest‑probability token. * Fast but deterministic → monotonic / repetitive.

Beam Search [code]

Keeps top‑\(B\) candidates: \[ \mathrm{Beam}_t = \text{TopB}\left( \prod_{\tau=1}^t p(x_\tau) \right) \] * More global optimality. * Often too conservative for creative tasks.

Top‑k Sampling [code]

Draw from the top‑\(k\) tokens: \[ P_k = \text{TopK}(p),\quad x_t \sim P_k \] * Works well when distribution is peaked. * Poor when distribution is flat.

Top‑p Sampling (Nucleus Sampling) [code]

Choose smallest set \(S\) such that: \[ \sum_{i\in S} p(i) \ge p \] Then sample: \[ x_t \sim S \] * Adaptive; avoids fixed \(k\). * Very stable for creative generation.

Temperature Sampling

Scaled logits: \[ p_i' = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} \] * \(T>1\): makes distribution flatter → more random. * \(T<1\): makes distribution sharper → more deterministic.

KPT (k → p → T Pipeline)

Apply:

Top‑k
Top‑p
Temperature

This filters noise, preserves diversity, and controls randomness.

Best‑of‑N (Self‑Scoring)

Generate \(N\) samples: \[ \{y^{(1)}, y^{(2)}, \ldots, y^{(N)}\} \] Then score them using either:

the LLM itself, or
a reward model

Pick the best.

Majority Voting / Self‑Consistency

Generate multiple chains: \[ y^{(1)}, y^{(2)}, \ldots, y^{(N)} \] Choose the majority or cluster center.
Used in reasoning tasks:

Reduces hallucination
Amplifies correct reasoning paths

Summary

Method	Deterministic?	Diversity	Notes
Greedy	Yes	Low	Simple, repetitive
Beam Search	Semi	Low	Conservative
Top‑k	No	Medium	Fails if distribution is flat
Top‑p	No	High	Adaptive nucleus
Temp	No	Adjustable	Directly controls randomness
KPT	No	High	Robust combined method
Best‑of‑N	Yes	High	Needs a scoring model
Majority / Self‑Consistency	Yes	Very high	Great for reasoning