LLM inference, from request to token

May 31, 2026

Inference is the part of an LLM system that runs after training.

A user sends text. The server turns that text into token ids, runs the model, chooses the next token id, appends it to the request, and repeats. The answer arrives as a stream because the server is producing one token after another.

That is the core loop.

The hard part is everything around the loop. A real inference server has to keep many requests alive at once. It has to batch work without making people wait too long. It has to store the model’s memory of earlier tokens. It has to reuse shared prompt prefixes when it can. It has to split work across GPUs when one GPU is not enough. It has to stream clean text back to the client.

I read nano-vllm and Mini-SGLang because they make that machinery small enough to study. I use them as source material, but the article is about the inference stack itself: the loop, the model, the cache, the scheduler, the serving boundaries, and the tricks that keep the GPU busy.

Animated architecture map showing client, API, tokenizer, scheduler, engine, KV cache, detokenizer, and stream.
LLM inference is a serving pipeline wrapped around a repeated next-token loop. The model predicts tokens; the server moves requests, memory, batches, and text.

Vocabulary First

The rest of the article uses a handful of terms. I want them on the table before the architecture gets busy.

| Term | Plain meaning |
| --- | --- |
| Token | A number the model reads or writes. A token can be a word, part of a word, a space plus a word, or punctuation. |
| Tokenizer | The code that turns text into token ids and token ids back into text. |
| Prompt | The input tokens the user sends to the model. |
| Completion | The output tokens the model generates. |
| Logits | One score for each possible next token. |
| Sampler | The code that turns logits into the next token id. It may use greedy, temperature, top-k, or top-p sampling. |
| Request | One user’s live generation job: prompt, output tokens, cache state, sampling settings, and finish state. |
| Batch | A group of requests run together on the GPU. |
| Prefill | The first pass over the known prompt tokens. |
| Decode | The repeated step that produces one new token per request. |
| KV cache | GPU memory holding key and value vectors for earlier tokens, so decode does not recompute them. |
| Scheduler | The code that decides which requests run next and how much cache they get. |
| Page or block | A fixed-size piece of KV cache memory. |
| Prefix cache | A cache that reuses prompt work when two requests start with the same tokens. |
| Tensor parallelism | Splitting model weights across multiple GPUs. |
| CUDA graph | A captured GPU execution pattern that can be replayed to reduce CPU launch overhead. |

One more term matters: attention.

Attention is a lookup step inside the transformer. The current token is trying to update its vector. To do that, it asks which earlier tokens are useful.

That lookup has three parts:

| Part | Plain meaning | What it does |
| --- | --- | --- |
| Query | The current token’s question. | It asks what kind of past information would help now. |
| Key | A label on an earlier token. | It is compared with the query to produce a match score. |
| Value | The information stored for that earlier token. | It is read when the key matches well. |

The model compares the query with all the old keys. Good matches get higher scores. Those scores become weights. The model then mixes the matching values back into the current token’s vector.

The KV cache stores those old keys and values so the model does not rebuild them on every decode step.

Animated diagram showing a current token query matching earlier token keys, weighting values, and mixing them back into the current vector.
The query asks. Keys match. Values carry the information. Attention scores decide how much of each value is mixed back into the current token.

In the diagram, the current token is “mat”. Its query is not English text; it is a vector computed from the token by learned weights. The earlier tokens have cached keys and values. The key for “cat” matches best in this toy example, so the value from “cat” contributes the most to the updated vector.
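The lookup can be written in a few lines. This is a minimal sketch of standard scaled dot-product attention for one query and one head, in plain Python. The 2-d vectors and the token labels in the comments are invented for illustration; real models use learned vectors with many more dimensions and many heads.

```python
import math

def attend(query, keys, values):
    """Single-query attention over cached keys and values (one head, no batching)."""
    scale = 1.0 / math.sqrt(len(query))
    # Dot-product match score between the query and every cached key.
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    # Softmax turns the scores into weights that sum to 1.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted mix of the cached values.
    dim = len(values[0])
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, mixed

# Toy cached keys and values for three earlier tokens (numbers are made up).
keys = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.5]]    # "the", "sat", "cat"
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [2.0, 0.0]                              # query for "mat"

weights, mixed = attend(query, keys, values)
print(weights.index(max(weights)))  # index of the best-matching key
```

The third key gets the largest weight here, so its value dominates the mixed vector. That is the “cat” case from the diagram.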

The Whole Request Path

Here is the server path without code names:

  1. The client sends a prompt and sampling settings.
  2. The API layer accepts the request.
  3. The tokenizer turns text into token ids.
  4. The scheduler decides when the request can run.
  5. The engine runs the model on the GPU.
  6. The sampler chooses the next token id.
  7. The scheduler appends that token and updates request state.
  8. The detokenizer turns token ids into text.
  9. The API layer streams the text chunk to the client.
  10. Steps 4 through 9 repeat until the request stops.

Animated diagram showing a request becoming token ids, prefill, then a repeated decode loop.
The prompt is processed once. The answer is produced by a loop: run the model, sample one token, append it, and repeat.

This path explains why an inference server is more than a model wrapper. The model forward pass is one piece. The server also owns request state, memory allocation, batching, cancellation, and streaming.

One Decode Step

Decode is the loop the user feels as streaming text.

At each decode step, every active request contributes its latest token. The forward pass reads that token, its position, the model weights, and the KV cache. Along the way it writes the key and value vectors for that token into the KV cache. It produces logits. The sampler picks one next token id. The server appends that token to the request, where it becomes the input to the next step.

Animated diagram showing last token, model, logits, sampler, append, and KV cache update.
One decode step starts with the last token and ends with one appended token. The KV cache grows at the same time.

This is why LLM output is naturally sequential. Token 53 depends on token 52. Token 52 depends on token 51. The server can batch many users’ token-52 work together, but it cannot know one user’s token 53 before token 52 exists.
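Here is the sequential dependency as a toy loop for a single request. `toy_model` is a hypothetical stand-in for a real forward pass, with an invented rule and an invented vocabulary size. It has no KV cache and no batching. The point is only that each step needs the token the previous step produced.

```python
def toy_model(token_ids):
    """Stand-in for a forward pass: returns one score per vocabulary id.

    Invented rule: the favored next id is (last id + 1) modulo the vocabulary size.
    """
    vocab_size = 8
    logits = [0.0] * vocab_size
    logits[(token_ids[-1] + 1) % vocab_size] = 1.0
    return logits

def generate(prompt_ids, max_new_tokens, eos_id):
    token_ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_model(token_ids)                               # run the model
        next_id = max(range(len(logits)), key=logits.__getitem__)   # greedy pick
        token_ids.append(next_id)                                   # append
        if next_id == eos_id:                                       # stop check
            break
    return token_ids[len(prompt_ids):]

print(generate([2, 3], max_new_tokens=10, eos_id=7))  # [4, 5, 6, 7]
```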

Prefill And Decode

Inference has two phases because the prompt and the answer behave differently.

Prefill handles the prompt. If the prompt has 700 tokens, all 700 are already known. The server can process them in parallel, in one forward pass or in a few large slices, and fill the KV cache for them.

Decode handles the completion. Future output tokens are unknown. The server produces one token, appends it, and runs another decode step.

Animated diagram showing known prompt tokens filling the KV cache and generated tokens appending new KV entries.
Prefill writes cache entries for known prompt tokens. Decode reads those entries, emits one token, and appends that token's new cache entries.

This split shows up everywhere in inference systems:

| Phase | Shape of work | Main pressure |
| --- | --- | --- |
| Prefill | Many known prompt tokens at once. | Large token batches and cache allocation. |
| Decode | One new token per active request. | Repeated GPU launches, cache reads, and request scheduling. |

Many optimizations make sense only after this split is clear. Chunked prefill is about admitting long prompts without blowing up memory. CUDA graphs are mostly about making repeated decode steps cheaper. Prefix caching is about avoiding prefill work that another request already paid for.

What The Model Does

The model is a stack of transformer layers. A layer takes token vectors in and returns updated token vectors.

Animated diagram showing token ids becoming vectors, passing through transformer layers, and turning into logits.
The model is the hot path on the GPU: token ids become vectors, the layer stack updates those vectors, and the final head produces logits.

In a simplified transformer layer, the token vector goes through two main blocks.

First comes attention. The layer normalizes the vector, builds the query, key, and value views, adds position information to the query and key, then lets the current token read earlier values through attention.

Then comes the MLP, a small feed-forward network applied to the token vector. Both blocks have a residual connection: the old vector is copied around the block and added back to the block’s result. That copy path keeps information from being overwritten too easily as the vector moves through many layers.

Animated diagram of one transformer layer showing attention, residual add, MLP, residual add, and the next token vector.
The attention block reads useful past values. The first add combines that result with a copied hidden vector. The MLP transforms the result. The second add combines the MLP result with another copied input.

Every label in that diagram is doing a specific job:

| Label | Meaning |
| --- | --- |
| Hidden | The token vector entering this layer. |
| Attention | The part that uses query, key, and value to read earlier tokens. |
| Add | A residual add: block result plus the copied old vector. |
| MLP | A feed-forward network that transforms the token vector at this position. |
| Next | The updated vector passed to the next transformer layer. |
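The two residual adds are easier to see in code. This sketch assumes a pre-norm layer with RMS normalization before each block, which is a common layout but an assumption here. The `attention` and `mlp` arguments are placeholders passed in by the caller, not real blocks.

```python
import math

def rms_norm(x, eps=1e-6):
    """Scale a vector to roughly unit root-mean-square (learned gain omitted)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def transformer_layer(hidden, attention, mlp):
    """Each block sees a normalized copy; the residual path adds the old vector back."""
    hidden = add(hidden, attention(rms_norm(hidden)))  # attention, then first add
    hidden = add(hidden, mlp(rms_norm(hidden)))        # MLP, then second add
    return hidden

# Placeholder blocks for illustration only.
zero_block = lambda x: [0.0] * len(x)
double_block = lambda x: [2.0 * v for v in x]

print(transformer_layer([1.0, -1.0], zero_block, zero_block))  # [1.0, -1.0]
```

When both blocks return zeros, the vector passes through unchanged. That is the copy path doing its job: a block can only add to the vector, it cannot erase it.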

At the end of the stack, the language-model head turns hidden vectors into logits. A logit is just a score. The sampler converts those scores into one token id.

This is the model-side story. The serving-side story is how to feed this model efficiently for many live requests.

Sampling And Stopping

The model does not directly write text. It produces logits: one score for each token in the vocabulary. The sampler turns those scores into the next token id.

Sampling can be simple or controlled:

| Setting | Plain meaning |
| --- | --- |
| Greedy | Pick the highest-scoring token. |
| Temperature | Make the score distribution sharper or softer before sampling. |
| Top-k | Sample only from the k best-scoring tokens. |
| Top-p | Sample from the smallest group of top-scoring tokens whose probabilities add up to at least p. |

Animated diagram showing logits filtered by sampling rules, converted into a next token id, then checked for stopping.
The sampler chooses one token id from the model's scores. A stop check then decides whether the request should continue.

The request stops when it hits a stop condition: an end-of-sequence token, a maximum output length, a user-provided stop rule, or cancellation. Until then, the sampled token is appended and the scheduler brings the request back for another decode step.
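A small sampler makes the settings concrete. This is a sketch in plain Python, not how any particular engine implements it. Real samplers run on the GPU over a whole batch. The parameter names and the convention that `temperature=0` means greedy are my assumptions.

```python
import math
import random

def sample(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Pick one token id from logits. temperature=0 is treated as greedy."""
    if temperature == 0:
        return max(range(len(logits)), key=logits.__getitem__)
    # Temperature: divide scores before softmax; below 1 sharpens, above 1 softens.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    # Top-p: keep the smallest leading group whose probability mass reaches p.
    kept, mass = [], 0.0
    for prob, token_id in ranked:
        kept.append((prob, token_id))
        mass += prob
        if mass >= top_p:
            break
    ids = [token_id for _, token_id in kept]
    probs = [prob for prob, _ in kept]
    # random.choices renormalizes the weights, so they need not sum to 1.
    return rng.choices(ids, weights=probs, k=1)[0]

def should_stop(new_id, num_generated, eos_id, max_new_tokens):
    """End-of-sequence and length checks; stop strings and cancellation are omitted."""
    return new_id == eos_id or num_generated >= max_new_tokens

print(sample([0.1, 2.0, 0.3], temperature=0))  # 1
```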

The KV Cache

Without a KV cache, every decode step would recompute the key and value vectors for the whole prompt and all previous output tokens. The model would rerun every layer over tokens that have not changed, just to produce one new token.

The cache stores key and value vectors for earlier tokens. Decode can then say: “I have one new query token. Read the old keys and values from cache. Store the new token’s keys and values for next time.”

Animated diagram showing logical request tokens mapped into a layered KV cache buffer without overlapping labels.
The KV cache is layered GPU memory. Each token gets key and value vectors in every transformer layer.

The cache grows with live tokens. If 100 requests are active, each with a long prompt and a long answer, the server is holding a lot of key/value vectors. That is why inference systems spend so much code on memory management.
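A quick calculation shows how fast this adds up. The model shape below is hypothetical (32 layers, 8 key/value heads, head dimension 128, 2-byte elements), and the workload of 100 requests with 4,096 live tokens each is also invented. Substitute the numbers for your own model.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    # Factor 2: one key vector and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_element=2)
live_tokens = 100 * 4096
total_gib = per_token * live_tokens / 2**30

print(per_token, total_gib)  # 131072 50.0
```

Under these assumptions each token costs 128 KiB of cache, and the 100 requests together need 50 GiB before counting the model weights.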

The cache is also why prefill and decode are linked. Prefill writes the cache. Decode depends on it.

Scheduling And Batching

The scheduler decides what runs next.

It is balancing several goals:

| Goal | What can go wrong |
| --- | --- |
| High throughput | The GPU waits if batches are too small or CPU scheduling is slow. |
| Low latency | Users wait if the server delays a request too long to make a better batch. |
| Memory safety | The server fails if KV cache pages run out. |
| Fairness | A long prompt should not block all smaller requests forever. |
| Reuse | Shared prefixes should reuse cached work when possible. |

A simple scheduler keeps two groups of requests in mind: requests that still need prompt work, and requests that are ready to decode.

| Queue | Meaning |
| --- | --- |
| Waiting | Requests that still need prefill work. |
| Running | Requests that have finished prefill and can decode. |

The scheduler tries prefill first. It pulls requests from the waiting queue until it hits a request limit or a token budget. If no prefill batch can run, it schedules decode. Decode gives one token step to each running request that fits.

Animated diagram showing waiting prompts, a prefill token budget, and running requests decoding one token each.
Prefill spends a token budget. Decode gives each running sequence one next-token step, as long as cache space is available.
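The prefill-first policy fits in one function. This is a sketch under simplifying assumptions: the `Request` type and the `schedule` signature are hypothetical, cache-space checks are left out, and a prompt longer than the whole budget would wait forever because there is no chunking here.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    request_id: str
    prompt_len: int

def schedule(waiting, running, max_requests, token_budget):
    """Return ("prefill", batch) if any waiting request fits, else ("decode", batch)."""
    batch, tokens = [], 0
    while waiting and len(running) + len(batch) < max_requests:
        if tokens + waiting[0].prompt_len > token_budget:
            break  # the request at the head of the queue does not fit this step
        request = waiting.popleft()
        tokens += request.prompt_len
        batch.append(request)
    if batch:
        running.extend(batch)  # prefilled requests can decode from the next step on
        return "prefill", batch
    return "decode", list(running)

waiting = deque([Request("a", 300), Request("b", 500), Request("c", 400)])
running = []
for _ in range(3):
    kind, batch = schedule(waiting, running, max_requests=8, token_budget=1024)
    print(kind, [r.request_id for r in batch])
```

With a budget of 1,024 tokens, the first step prefills "a" and "b", the second prefills "c", and the third decodes all three.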

Batching is the reason one GPU can serve many users. Every forward pass has to read the model’s weight matrices from GPU memory, so it is better to share that read across many requests at once. The cost is waiting: a request may pause briefly while the scheduler forms a batch.

Blocks, Pages, And Prefix Reuse

The KV cache is too large and too dynamic to manage as one growing list per request. Inference engines split it into fixed-size pieces.

One implementation calls them blocks. Another calls them pages. The idea is the same: a request has a logical token sequence, and the server maps those token positions to physical cache pieces.

Animated diagram showing request token positions mapped through a page table into physical KV cache pages.
A block table separates the logical request from physical KV memory. That makes allocation, freeing, and reuse manageable.

A page table is the bridge. The logical request says, “I have tokens 0 through N.” The page table says, “The KV vectors for those token ranges live in physical pages 17, 4, and 31.”
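The lookup itself is one division. In this sketch the page size of 16 tokens is a made-up value, and the page ids are the ones from the example above.

```python
PAGE_SIZE = 16  # tokens per page; a hypothetical value

def physical_slot(page_table, position):
    """Map a logical token position to (physical page id, offset inside that page)."""
    page_index, offset = divmod(position, PAGE_SIZE)
    return page_table[page_index], offset

page_table = [17, 4, 31]  # the request's logical pages 0, 1, and 2

print(physical_slot(page_table, 0))   # (17, 0)
print(physical_slot(page_table, 20))  # (4, 4)
print(physical_slot(page_table, 47))  # (31, 15)
```

The request sees positions 0 through 47 as one sequence. The cache sees three unrelated pages.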

This also enables prefix caching.

Suppose two requests start with the same system prompt. The first request fills cache blocks for that prefix. The second request should not recompute those blocks. One compact engine hashes reusable full prefix blocks. A serving stack can also store shared prefixes in a radix tree. Common beginnings are shared. Divergent suffixes branch.

Animated diagram showing a radix tree with shared prompt prefixes branching into different continuations.
A radix cache stores shared prompt beginnings as tree paths. Repeated prefixes reuse old KV pages before the prompts diverge.

The important word is “managed.” Cached pages can be locked while live requests use them. Old unused leaves can be evicted. Prefix caching is not a vague memory of text. It is a memory-managed index over physical KV cache pages.
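Block hashing can be sketched with the standard library. Each full block is hashed together with the hash of the block before it, so a match on block N implies the whole prefix up to N matches. SHA-256, the block size of 4, and the dictionary cache are stand-ins I chose for illustration. Locking and eviction are left out.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; hypothetical and tiny for readability

def block_hashes(token_ids):
    """Hash each full block chained with the previous block's hash."""
    hashes, previous = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE  # ignore the partial tail
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        payload = previous + b"|" + ",".join(map(str, block)).encode()
        previous = hashlib.sha256(payload).digest()
        hashes.append(previous)
    return hashes

def shared_prefix_blocks(cache, token_ids):
    """Count the leading full blocks that are already in the cache."""
    count = 0
    for block_hash in block_hashes(token_ids):
        if block_hash not in cache:
            break
        count += 1
    return count

cache = {}
first = [1, 2, 3, 4, 5, 6, 7, 8, 9]
for block_id, block_hash in enumerate(block_hashes(first)):
    cache[block_hash] = block_id  # hash -> physical block id (invented ids)

second = [1, 2, 3, 4, 5, 6, 7, 99, 9]
print(shared_prefix_blocks(cache, second))  # 1
```

The second request reuses one block. Its second block differs in one token, so everything from there on has to be prefilled.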

The Engine Loop

The engine loop is the smallest useful version of the inference server.

It keeps live request state, asks the scheduler what can run, packs that work into GPU-friendly tensors, runs the model, samples token ids, then updates the requests. Finished requests leave the loop. Unfinished requests come back for another decode step.

Animated diagram showing request state, scheduler, GPU batch, sampler, and the loop back to request state.
The core engine loop is not an HTTP story. It is a state machine for live token streams.

The loop has four jobs:

| Job | Plain meaning |
| --- | --- |
| Hold request state | Keep token ids, cache locations, sampling settings, and finish status. |
| Choose work | Decide whether this step should prefill prompts or decode running requests. |
| Prepare GPU inputs | Turn request objects into dense tensors, positions, and cache-page tables. |
| Update after sampling | Append new token ids, write new K/V, free finished requests, and repeat. |

That is the loop to understand before worrying about process boundaries.
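The state machine can be sketched without a GPU. `LiveRequest`, `fake_forward`, and `engine_step` are hypothetical names, and the model is replaced by an invented rule. Cache allocation and tensor packing are skipped so the control flow stays visible.

```python
from dataclasses import dataclass, field

@dataclass
class LiveRequest:
    token_ids: list
    max_new_tokens: int
    output: list = field(default_factory=list)
    prefilled: bool = False

    @property
    def finished(self):
        return len(self.output) >= self.max_new_tokens

def fake_forward(batch):
    """Stand-in for model plus sampler: the next id is the last id + 1."""
    return [request.token_ids[-1] + 1 for request in batch]

def engine_step(requests):
    """One iteration: choose work, run it, update state, drop finished requests."""
    waiting = [r for r in requests if not r.prefilled]
    batch = waiting if waiting else requests        # prefill first, else decode
    for request, new_id in zip(batch, fake_forward(batch)):
        request.prefilled = True
        request.token_ids.append(new_id)
        request.output.append(new_id)
    return [r for r in requests if not r.finished]  # finished requests leave the loop

live = [LiveRequest([1, 2], max_new_tokens=2), LiveRequest([10], max_new_tokens=3)]
all_requests = list(live)
steps = 0
while live:
    live = engine_step(live)
    steps += 1

print(steps, [r.output for r in all_requests])  # 3 [[3, 4], [11, 12, 13]]
```

The first request finishes after two steps and leaves. The second keeps going alone for one more step.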

The Tensor Metadata Boundary

GPU kernels do not consume request objects. They consume tensors.

The engine turns scheduled requests into the metadata attention kernels need:

| Metadata | Meaning |
| --- | --- |
| Tokens now | The token ids to run in this step. |
| Positions | The position of each token in its request. |
| Write slots | Where newly computed K/V vectors should be stored. |
| Cache pages | Which cache pages hold old K/V vectors. |
| Query lengths | How many query tokens each request contributes. |
| Context lengths | How much old context each request can attend to. |

Animated diagram showing scheduled work transformed into tokens, positions, write slots, cache pages, and an attention kernel.
The engine packs request state into dense tensors and attention metadata. That is what the GPU kernels actually consume.

During prefill, a request may contribute many prompt tokens. During decode, it usually contributes one last token. The metadata tells the attention kernel which old cache entries belong to each request and where the new ones should go.
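Here is the packing step with plain lists standing in for tensors. `ScheduledWork` and `pack` are hypothetical names and the page size of 4 is invented. I count context length as the number of already cached tokens, matching the table above. Some kernels instead expect the total length including the new tokens.

```python
from dataclasses import dataclass

PAGE_SIZE = 4  # tokens per page; hypothetical

@dataclass
class ScheduledWork:
    token_ids: list   # all token ids of the request so far
    cached_len: int   # tokens that already have K/V in the cache
    page_table: list  # physical page ids for this request

def pack(batch):
    """Flatten per-request state into the flat arrays a kernel would read."""
    tokens, positions, write_slots = [], [], []
    query_lens, context_lens = [], []
    for work in batch:
        new_positions = range(work.cached_len, len(work.token_ids))
        for pos in new_positions:
            tokens.append(work.token_ids[pos])
            positions.append(pos)
            page, offset = divmod(pos, PAGE_SIZE)
            write_slots.append(work.page_table[page] * PAGE_SIZE + offset)
        query_lens.append(len(new_positions))
        context_lens.append(work.cached_len)
    return {
        "tokens": tokens,
        "positions": positions,
        "write_slots": write_slots,
        "cache_pages": [work.page_table for work in batch],
        "query_lens": query_lens,
        "context_lens": context_lens,
    }

batch = [
    ScheduledWork(token_ids=[10, 11, 12, 13, 14], cached_len=0, page_table=[7, 2]),   # prefill
    ScheduledWork(token_ids=list(range(50, 59)), cached_len=8, page_table=[3, 5, 1]),  # decode
]
meta = pack(batch)
print(meta["query_lens"], meta["write_slots"])  # [5, 1] [28, 29, 30, 31, 8, 4]
```

The prefill request contributes five query tokens and the decode request contributes one, but both end up in the same flat token list.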

CUDA Graphs

Decode repeats the same kind of GPU work many times. The CPU overhead of launching kernels can become visible.

A CUDA graph captures a fixed pattern of GPU work and replays it later. Engines usually capture decode graphs for selected batch sizes. If a real decode batch is smaller than a captured size, the engine can pad it with dummy work so the graph shape stays stable.

This is the point: CUDA graphs do not change what the model computes. They make the repeated decode step cheaper to launch.

Animated diagram contrasting ordinary per-kernel GPU launches with replaying a captured CUDA graph for decode.
A CUDA graph keeps the same compute but replays a captured launch pattern. Engines may pad small decode batches so the replay shape stays fixed.
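The padding decision is simple bookkeeping, and it can be shown without any GPU code. This sketch covers only the size selection, not graph capture or replay. The list of captured sizes and the pad token id are assumptions.

```python
import bisect

CAPTURED_BATCH_SIZES = [1, 2, 4, 8, 16]  # hypothetical capture list, kept sorted

def pick_graph_size(batch_size):
    """Smallest captured size that fits the batch, or None to fall back to normal launches."""
    index = bisect.bisect_left(CAPTURED_BATCH_SIZES, batch_size)
    if index == len(CAPTURED_BATCH_SIZES):
        return None
    return CAPTURED_BATCH_SIZES[index]

def pad_batch(last_tokens, pad_token=0):
    """Pad with dummy tokens up to the captured size; also return the real count."""
    size = pick_graph_size(len(last_tokens))
    if size is None:
        return list(last_tokens), len(last_tokens)
    padded = list(last_tokens) + [pad_token] * (size - len(last_tokens))
    return padded, len(last_tokens)  # the real count lets the caller drop dummy outputs

print(pad_batch([11, 12, 13]))  # ([11, 12, 13, 0], 3)
print(pick_graph_size(17))      # None
```

A batch of three runs in the graph captured for four, and the fourth output is thrown away.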

Tensor Parallelism

Large models may not fit on one GPU. Tensor parallelism splits model weights across GPU ranks.

Each rank computes a shard of the layer. Some layers produce different output features on each rank. Some layers produce partial sums that must be combined. The combine step is a collective operation, such as all-reduce or all-gather.

Animated diagram showing model weight shards on four tensor-parallel ranks flowing into a collective operation.
Tensor parallelism slices large matrices across GPUs, then uses collective communication to combine partial results.

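The serving consequence is simple: all ranks must run matching work for the same batch. If they do not, the collectives no longer line up, and ranks either hang waiting for each other or combine partial results that do not belong together.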
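Both kinds of split can be checked on a tiny matrix. This sketch simulates two ranks inside one process with plain lists. The concatenation plays the role of an all-gather and the elementwise sum plays the role of an all-reduce. No real communication library is involved.

```python
def matvec(x, w):
    """x has length n, w is an n-by-m matrix stored as a list of rows. Returns x @ w."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
full = matvec(x, w)  # single-GPU reference result

# Column split: each rank owns different output features.
w_cols = [[row[:2] for row in w], [row[2:] for row in w]]
gathered = [v for shard in w_cols for v in matvec(x, shard)]  # concatenate = all-gather

# Row split: each rank owns a slice of the input features and produces a partial sum.
w_rows = [w[:1], w[1:]]
x_parts = [x[:1], x[1:]]
partials = [matvec(xp, wr) for xp, wr in zip(x_parts, w_rows)]
reduced = [sum(vals) for vals in zip(*partials)]              # add = all-reduce

print(full, gathered, reduced)
```

All three results are `[11.0, 14.0, 17.0, 20.0]`. The split changes who computes what, not the answer.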

Serving Boundaries

Online serving wraps the engine loop in process boundaries.

The API server accepts requests and streams responses. The tokenizer turns text into ids before the GPU sees anything. The scheduler batches work and manages cache pages. The GPU engine runs the model and sampler. The detokenizer turns generated ids back into text chunks.

Animated diagram showing API, tokenizer, scheduler, GPU engine, detokenizer, stream, and multi-GPU coordination.
A serving stack separates text work, scheduling work, GPU work, and streaming work. That keeps each boundary simple.

One serving implementation uses ZeroMQ (ZMQ), a messaging library, for small control messages between these workers. Heavy GPU tensor communication still happens through distributed collectives.

The request path is:

  1. The API server receives a generation request.
  2. The tokenizer worker turns text into token ids.
  3. The scheduler receives token ids and sampling settings.
  4. If the model is split across GPUs, the lead rank shares the same request with the other ranks.
  5. The scheduler admits the request into prefill or decode.
  6. The engine runs the model and sampler.
  7. The scheduler sends new token ids to the detokenizer.
  8. The detokenizer returns printable text chunks.
  9. The API server streams those chunks to the client.

Request State And Page Tables

The scheduler can only make good decisions if it knows where each request is.

Useful request state includes:

| State | Why it matters |
| --- | --- |
| Token ids | The full prompt plus generated ids so far. |
| Cached length | How many tokens already have K/V entries. |
| Staged length | How many tokens are currently staged for GPU work. |
| Maximum length | Where generation must stop. |
| Page-table row | Where this request’s cache-page pointers live. |
| Prefix-cache handle | Which shared prefix, if any, the request is reusing. |

Cached length and staged length explain most prefill behavior. During prefill, the request is trying to make cached length catch up to the known prompt. During decode, the request usually extends by one token.

A token pool stores active token ids. A page table stores physical cache locations for each request position. The page table is the seating chart for cache memory: a request does not own one neat contiguous slab of GPU memory. It owns table entries that point to pages.

Chunked Prefill

A long prompt can be too large to prefill all at once. Chunked prefill splits that prompt into pieces.

A scheduler can enforce a maximum number of prompt tokens to extend in one step. If an uncached prompt is longer than the current budget, the scheduler admits only the next chunk. That request cannot decode yet. It comes back for more prefill until the prompt is caught up.

Animated diagram showing a long prompt being split into several prefill chunks under a fixed token budget.
Chunked prefill lets long prompts enter gradually instead of taking a whole step and a large cache allocation all at once.
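The chunking rule is a few lines once the request tracks its cached length. `PrefillState` and `next_chunk` are hypothetical names, and the budget of 256 tokens per step is an invented value. The 700-token prompt is the one from the prefill example earlier.

```python
from dataclasses import dataclass

@dataclass
class PrefillState:
    prompt_len: int
    cached_len: int = 0  # prompt tokens that already have K/V entries

    @property
    def ready_to_decode(self):
        return self.cached_len >= self.prompt_len

def next_chunk(state, token_budget):
    """Return the (start, end) prompt range to prefill this step and advance cached_len."""
    start = state.cached_len
    end = min(state.prompt_len, start + token_budget)
    state.cached_len = end
    return start, end

state = PrefillState(prompt_len=700)
chunks = []
while not state.ready_to_decode:
    chunks.append(next_chunk(state, token_budget=256))

print(chunks)  # [(0, 256), (256, 512), (512, 700)]
```

In a real scheduler other requests would run between those three chunks. That is the whole point of splitting.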

This is a serving feature. It keeps one long-context request from making the system less usable for everyone else.

Attention Backends

An attention backend is the implementation of attention used by the engine. It may be FlashAttention, FlashInfer, TensorRT-LLM FMHA, or a hybrid. A hybrid can use one backend for prefill and another for decode.

That choice matters because prefill and decode have different shapes. Prefill may process long variable-length prompt chunks. Decode often processes one new token per request while reading a large cached context. A kernel that is best for one shape may not be best for the other.

Animated diagram showing prefill and decode using different attention backends because their batch shapes differ.
Attention is the same operation, but prefill and decode present different shapes to the GPU. A serving engine can choose the backend that fits each shape best.

The attention path is:

  1. The scheduler forms a batch.
  2. The cache manager allocates pages.
  3. The attention backend prepares metadata.
  4. The attention layer stores new K/V vectors.
  5. The backend runs the attention kernel over the right cache pages.

That is what people mean when they say serving performance depends on the attention backend.

Overlap Scheduling

CPU work can leave the GPU waiting. The scheduler has to receive messages, allocate pages, prepare metadata, handle sampled tokens, free finished requests, and send replies.

Overlap scheduling rearranges that work. While the GPU runs the current batch, the CPU prepares the next batch and processes the previous result.

Animated timeline showing CPU scheduler preparation overlapping with GPU engine forward work.
Overlap scheduling does not remove CPU work. It hides more of that work under GPU computation.

The normal serving loop is easy to read:

  1. Receive messages.
  2. Schedule a batch.
  3. Run the batch.
  4. Process the result.

The overlap loop carries an in-flight result between iterations:

  1. Receive messages.
  2. Schedule and start the next batch.
  3. Process the previous result.
  4. Carry the current result into the next loop.

The goal is less idle GPU time.
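The overlap loop can be imitated with a one-worker thread pool standing in for the GPU. This is only a sketch of the control flow. `run_on_gpu` is a fake forward pass, and the batches here are independent of each other. In a real engine the next decode batch depends on the tokens just sampled, which takes extra care that this sketch does not show.

```python
from concurrent.futures import ThreadPoolExecutor

def run_on_gpu(batch):
    """Stand-in for a GPU forward pass; returns fake sampled ids."""
    return [token + 1 for token in batch]

def overlap_loop(batches):
    processed = []
    in_flight = None  # future for the batch most recently handed to the "GPU"
    with ThreadPoolExecutor(max_workers=1) as gpu:
        for batch in batches:
            current = gpu.submit(run_on_gpu, batch)   # schedule and start the next batch
            if in_flight is not None:
                processed.append(in_flight.result())  # process the previous result
            in_flight = current                       # carry the current result forward
        if in_flight is not None:
            processed.append(in_flight.result())      # drain the last batch
    return processed

print(overlap_loop([[1, 2], [3], [4, 5]]))  # [[2, 3], [4], [5, 6]]
```

Results are still handled in order. The difference is that the CPU handles batch N-1 after batch N has already been submitted.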

Detokenization And Streaming

The model emits token ids. The client wants text.

The detokenizer keeps decode state for each request. It decodes new ids, avoids sending broken replacement characters, and tries not to flush half a word when it can wait for printable text.

Animated diagram showing generated token ids passing through a detokenizer before reaching a client stream.
Streaming text has its own state. The backend emits token ids; the detokenizer decides what text is safe to send.

That last boundary matters. A good streaming server has to be fast and send chunks that clients can display cleanly.
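The broken-character problem is easy to reproduce at the byte level. This sketch assumes a toy vocabulary that maps token ids to raw bytes, and it uses Python's incremental UTF-8 decoder to hold back incomplete characters. Real detokenizers work through the tokenizer's own decode call and also handle spacing and word boundaries, which this example ignores.

```python
import codecs

class StreamDetokenizer:
    """Per-request streaming state: hold bytes back until they form whole characters."""

    def __init__(self, vocab):
        self.vocab = vocab  # token id -> bytes (hypothetical toy vocabulary)
        self.decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")

    def push(self, token_id):
        # Returns "" when the token ends in the middle of a multi-byte character.
        return self.decoder.decode(self.vocab[token_id])

smile = "🙂".encode("utf-8")  # four bytes, split across two tokens below
vocab = {0: b"Hi ", 1: smile[:2], 2: smile[2:]}

detok = StreamDetokenizer(vocab)
chunks = [detok.push(token_id) for token_id in (0, 1, 2)]
print(chunks)  # ['Hi ', '', '🙂']
```

The middle token produces no text. The client never sees half an emoji.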

The Mental Model

The model predicts the next token. The inference server makes that prediction usable for many live requests at once.

Animated mental model showing the serving loop: request state, scheduling, model step, cache storage, sampling, and streaming.
The whole system is one loop with guardrails: keep request state, schedule work, run the model, update the cache, sample a token, and stream clean text.

The whole system follows from a few constraints:

| Constraint | Consequence |
| --- | --- |
| Future tokens depend on past tokens. | Decode is a repeated one-token loop. |
| Attention needs old keys and values. | The KV cache is central. |
| GPU memory is finite. | Blocks, pages, eviction, and preemption matter. |
| Prompts repeat. | Prefix caching saves prefill work. |
| CPU overhead can stall the GPU. | CUDA graphs and overlap scheduling matter. |
| Models outgrow one GPU. | Tensor parallelism adds collectives. |
| Users expect text, not token ids. | Detokenization and streaming state matter. |

Everything else in the serving stack is an answer to those constraints.