LLM Inference Lifecycle

Understanding the two distinct phases of large language model (LLM) inference.

The "Prefill" Phase

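When you hit enter, the model processes your entire prompt in a single forward pass. This is a parallel operation: the GPU computes attention over all input tokens simultaneously, builds the initial KV cache (the stored key and value tensors for every prompt token), and produces the first output token.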

Nature

Parallelized computation across all input tokens.

Bottleneck

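Compute-bound (FLOPs). Each weight loaded from memory is reused across every prompt token, so for all but very short prompts the arithmetic units, not memory, set the pace.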
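A rough roofline-style estimate shows why prefill tends to be compute-bound. The sketch below is a back-of-the-envelope model, not a profiler: it assumes about 2 FLOPs per parameter per token, assumes each weight is read from memory once per forward pass, and ignores attention FLOPs and KV-cache traffic. The function names and the hardware figures in the usage note are hypothetical, chosen only for illustration.

```python
def arithmetic_intensity(num_tokens, bytes_per_param=2.0):
    """FLOPs performed per byte of weights read, for one forward pass.

    Assumption: ~2 FLOPs per parameter per token, each weight read once.
    The parameter count cancels out, so only the token count matters.
    """
    return 2.0 * num_tokens / bytes_per_param


def is_compute_bound(num_tokens, peak_flops, mem_bandwidth, bytes_per_param=2.0):
    """Compare arithmetic intensity with the hardware 'ridge point'.

    peak_flops is in FLOP/s and mem_bandwidth in bytes/s, so their ratio
    is the FLOPs-per-byte level above which compute is the limit.
    """
    ridge_point = peak_flops / mem_bandwidth
    return arithmetic_intensity(num_tokens, bytes_per_param) > ridge_point


# Hypothetical accelerator: 300 TFLOP/s peak, 2 TB/s memory bandwidth.
PEAK_FLOPS = 300e12
MEM_BANDWIDTH = 2e12

prefill_is_compute_bound = is_compute_bound(1024, PEAK_FLOPS, MEM_BANDWIDTH)
single_token_is_compute_bound = is_compute_bound(1, PEAK_FLOPS, MEM_BANDWIDTH)
```

With 16-bit weights, the intensity works out to one FLOP per byte per token processed, so under these assumed figures a 1024-token prompt sits well above the ridge point of 150 FLOPs per byte, while a single token sits far below it.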

The "Decoding" Phase

Once the prompt has been processed, the model starts writing. This is a sequential operation: each new token depends on the prompt and on every token generated before it, so one step cannot begin until the previous step has finished.

Nature

One token at a time, in an autoregressive loop: each generated token is fed back in as input for the next step.
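The two phases can be sketched as a toy single-head attention loop in plain Python. This is an illustration of the control flow only, not a real model: `embed`, `attend`, `next_token`, `prefill`, `decode`, `VOCAB`, and `DIM` are all hypothetical names, the sine-based embedding stands in for learned weights, and keys and values are simply the embeddings (no learned projections).

```python
import math

VOCAB = 8  # hypothetical toy vocabulary size
DIM = 4    # hypothetical embedding width


def embed(token):
    """Deterministic stand-in for a learned embedding (illustrative only)."""
    return [math.sin((token + 1) * (d + 1)) for d in range(DIM)]


def attend(query, keys, values):
    """Softmax attention of one query vector over all cached keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def next_token(context):
    """Greedy pick: the token whose embedding best matches the context."""
    return max(range(VOCAB),
               key=lambda t: sum(a * b for a, b in zip(context, embed(t))))


def prefill(prompt):
    """Process the whole (non-empty) prompt at once and build the KV cache.

    Each prompt position is computed independently of the others here,
    which is what lets real hardware handle them in parallel.
    """
    cache = {"k": [embed(t) for t in prompt], "v": [embed(t) for t in prompt]}
    first = next_token(attend(embed(prompt[-1]), cache["k"], cache["v"]))
    return cache, first


def decode(cache, token, steps):
    """Generate `steps` tokens in total, starting from the prefill output."""
    generated = [token]
    for _ in range(steps - 1):
        vec = embed(token)
        cache["k"].append(vec)  # the cache grows by one entry per step
        cache["v"].append(vec)
        token = next_token(attend(vec, cache["k"], cache["v"]))
        generated.append(token)  # this step's output feeds the next step
    return generated


cache, first = prefill([1, 2, 3])
tokens = decode(cache, first, steps=5)
```

The loop in `decode` cannot be parallelized across steps, because each iteration needs the token chosen by the previous one. The cache lets every step reuse earlier keys and values instead of recomputing them.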

Bottleneck

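Memory-bandwidth-bound. Every step must stream the model weights and the growing KV cache from GPU memory to the compute units just to produce a single token, so data movement, not arithmetic, is the slow part, especially at small batch sizes.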
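The memory-bandwidth limit gives a simple upper bound on decode speed for a single sequence. The sketch below assumes every weight (and, optionally, the KV cache) is read from memory exactly once per generated token and that nothing else limits throughput. The function name and all numbers are hypothetical and describe no particular model or GPU.

```python
def max_decode_tokens_per_second(num_params, bytes_per_param, mem_bandwidth,
                                 kv_cache_bytes=0.0):
    """Upper bound on tokens/s for one sequence when decode is memory-bound.

    Assumption: each step streams all weights plus the KV cache once, so
    step time is at least bytes_moved / mem_bandwidth (bytes per second).
    """
    bytes_per_step = num_params * bytes_per_param + kv_cache_bytes
    return mem_bandwidth / bytes_per_step


# Hypothetical setup: 7 billion parameters, 16-bit weights, 1 TB/s bandwidth.
bound = max_decode_tokens_per_second(
    num_params=7e9, bytes_per_param=2.0, mem_bandwidth=1e12
)
```

Under these assumed figures each step moves 14 GB of weights, so the bound is roughly 71 tokens per second no matter how many FLOP/s the chip can deliver. Passing a non-zero `kv_cache_bytes` lowers the bound further as the context grows.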