Understanding the two distinct phases of Large Language Model inference: prefill and decode.
When you hit enter, the model reads your entire input at once. This is the prefill phase, and it is a parallel operation. The GPU processes all prompt tokens in a single forward pass, computing attention over the whole input and storing each token's keys and values in the initial KV Cache.
Parallelized computation across all input tokens.
Compute Bound (FLOPs). Large matrix multiplications over many tokens keep the GPU's arithmetic units busy, so raw math throughput is the limit.
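A minimal sketch of the parallel prompt pass, assuming a single attention head, NumPy, and random toy weights. The names prefill, w_q, w_k, w_v, and kv_cache are illustrative and not taken from any particular library. It shows every prompt token being projected and attended in one batch of matrix multiplications, with the keys and values kept as the initial KV cache.

```python
import numpy as np

def prefill(x, w_q, w_k, w_v):
    """Process all prompt tokens in one pass; return outputs and the KV cache.

    x: (n_tokens, d_model) prompt embeddings (hypothetical toy input).
    """
    # One matrix multiply per projection covers every prompt token at once.
    q = x @ w_q
    k = x @ w_k
    v = x @ w_v

    # Causal mask: token i may only attend to tokens 0..i.
    n = x.shape[0]
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Row-wise softmax (subtract the row max for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    out = weights @ v
    kv_cache = {"k": k, "v": v}  # reused later instead of being recomputed
    return out, kv_cache

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
d_model = 8
prompt = rng.standard_normal((5, d_model))
w_q, w_k, w_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

out, kv_cache = prefill(prompt, w_q, w_k, w_v)
print(out.shape, kv_cache["k"].shape)
```

A real model repeats this across many layers and heads, but the shape of the work is the same: a few large matrix multiplications over the whole prompt.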
Once the prompt is processed, the model starts writing. This is a sequential operation. Each new token depends on the prompt and on every token generated before it, so the next token cannot be computed until the current one is known.
One token at a time. Auto-regressive loop.
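Memory Bandwidth Bound. Each step reads the model weights and the KV Cache from GPU memory to produce just one token, so moving data from memory to chip, not the math, is the slow part.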
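The loop and its bottleneck can be sketched as follows, assuming a single attention head, NumPy, and random toy weights. decode_step and matmul_intensity are hypothetical helper names, and feeding the attention output back in as the next input is a toy stand-in for sampling a token and embedding it. The intensity estimate counts only weight traffic for one square weight matrix at 2 bytes per weight and ignores activations and the KV cache, so treat it as a rough illustration rather than a performance model.

```python
import numpy as np

def decode_step(x_new, w_q, w_k, w_v, kv_cache):
    """Attend from one new token to everything cached, then extend the cache."""
    q = x_new @ w_q                      # (1, d_model): a single query row
    k_new = x_new @ w_k
    v_new = x_new @ w_v

    # Append this token's key and value; earlier ones are reused as-is.
    k = np.concatenate([kv_cache["k"], k_new], axis=0)
    v = np.concatenate([kv_cache["v"], v_new], axis=0)

    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, {"k": k, "v": v}

def matmul_intensity(n_tokens, d_model, bytes_per_weight=2):
    """Rough FLOPs per byte of weights read for one d_model x d_model matmul."""
    flops = 2 * n_tokens * d_model * d_model          # multiply + add
    weight_bytes = bytes_per_weight * d_model * d_model
    return flops / weight_bytes

rng = np.random.default_rng(0)
d_model = 8
w_q, w_k, w_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

# Stand-in for the cache built while reading a 5-token prompt.
prompt = rng.standard_normal((5, d_model))
kv_cache = {"k": prompt @ w_k, "v": prompt @ w_v}

# Auto-regressive loop: each step needs the previous step's result.
x = rng.standard_normal((1, d_model))
for _ in range(3):
    out, kv_cache = decode_step(x, w_q, w_k, w_v, kv_cache)
    x = out  # toy stand-in for "sample a token, then embed it"

prefill_intensity = matmul_intensity(n_tokens=1024, d_model=4096)
decode_intensity = matmul_intensity(n_tokens=1, d_model=4096)
print(kv_cache["k"].shape, prefill_intensity, decode_intensity)
```

Under these assumptions, a 1024-token prompt does about 1024 FLOPs for every byte of weights read, while a single generated token does about 1. The weights read per step are the same, but far less math is done with them, which is why generation tends to wait on memory rather than on compute.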