Interactive simulator

Under the Hood: How an LLM Runs on a Server

Everyone talks to an LLM through a chat box — but what is actually running on the server behind it? This interactive simulator opens up the black box: watch a model load from disk into GPU memory, follow your prompt through the tokenize → prefill → decode loop that generates each word, and see the real memory maths that decides how much hardware it takes to run.

  1. Powered off: weights on disk
  2. Loading: disk → VRAM
  3. Ready: listening for prompts
  4. Inference: generating tokens

The server is powered off

Right now the 70B model is just 140 GB of numbers sitting on disk. Press “Boot server” to copy it into the GPU’s memory so it can run.

Start here

Where the model lives

A model is billions of numbers. They start on disk and must be copied into fast GPU memory (VRAM) before it can run.

Hard disk (SSD) 140 GB
140 GB weights · safetensors

Always holds the full model. When the server is off, this is the only place it exists.

GPU VRAM 0.0 GB / 160 GB
Empty — boot the server to load weights here
Weights 140 GB · KV cache 0 MB · Free

Fast enough for real-time matrix maths — but limited. This is why model size and precision matter.

Why it works this way

The cold state

The server is shut down. The LLM is just static safetensors files on the hard drive: billions of numbers that use no GPU memory and no compute. Pick a model size and precision on the left to see what it would need, then boot the server.

Bigger models and higher precision mean more weights to store. Switch FP16 → INT4 and watch the disk and VRAM figures shrink by 4×.
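
The disk and VRAM figures follow from simple arithmetic: number of parameters times bytes per parameter. The sketch below is a minimal illustration of that sum, not the simulator's own code; the function name `weight_size_gb` and the lookup table are assumptions made for this example, and 1 GB is taken as 10^9 bytes.

```python
# Bytes needed to store one weight at each precision (bits / 8).
BYTES_PER_WEIGHT = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_size_gb(params_billions: float, precision: str) -> float:
    """Approximate size of the weights in GB (1 GB = 10**9 bytes)."""
    params = params_billions * 1e9
    return params * BYTES_PER_WEIGHT[precision] / 1e9


for precision in ("FP16", "INT8", "INT4"):
    size = weight_size_gb(70, precision)
    print(f"70B at {precision}: {size:.0f} GB")
```

For a 70B model this gives 140 GB at FP16, 70 GB at INT8, and 35 GB at INT4. Real quantized files are usually slightly larger than this estimate, because they also store scaling factors alongside the weights.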

Key terms

Parameters / weights: The billions of numbers learned during training. They define the model and are what gets loaded into memory.
safetensors: The file format the weights are stored in on disk, a safe, fast way to save raw tensors of numbers.
VRAM: Video memory on the GPU. Fast enough for the matrix maths inference needs; this is where the weights must live to run.
Quantization: Storing each weight in fewer bits (FP16 → INT8 → INT4). Shrinks VRAM use, with a small quality trade-off.
Tokenization: Chopping text into tokens (sub-words) and mapping each to an integer ID the model can process.
Prefill: The first pass where the entire prompt is processed in parallel, building the KV cache. Drives time-to-first-token.
Decode: The generation loop: predict one token, append it, feed back in, repeat until a stop token.
KV cache: Stored keys & values for every token so far, so the model never re-computes past context. Grows with the conversation.
TTFT: Time To First Token, i.e. how long prefill takes before the first word appears.
Throughput: How many tokens per second the model streams during decode. Smaller / quantized models are faster.
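
The tokenize → prefill → decode loop described in these terms can be sketched with a toy stand-in. Everything here is hypothetical and chosen only for illustration: the six-word vocabulary, the `toy_next_token` rule (which replaces the real neural network), and a cache that stores token IDs where a real model would store key and value tensors.

```python
VOCAB = ["<stop>", "the", "model", "runs", "on", "gpu"]
TOKEN_ID = {word: i for i, word in enumerate(VOCAB)}
STOP_ID = TOKEN_ID["<stop>"]


def tokenize(text):
    """Toy tokenizer: split on whitespace and map each word to its integer ID."""
    return [TOKEN_ID[word] for word in text.lower().split()]


def toy_next_token(kv_cache):
    """Stand-in for a forward pass: returns the ID after the last cached one,
    wrapping round to the stop token at the end of the vocabulary."""
    return (kv_cache[-1] + 1) % len(VOCAB)


def generate(prompt, max_new_tokens=8):
    prompt_ids = tokenize(prompt)

    # Prefill: the whole prompt is processed in one pass and fills the cache.
    kv_cache = list(prompt_ids)

    # Decode: predict one token, append it to the cache, repeat until stop.
    output = []
    for _ in range(max_new_tokens):
        next_id = toy_next_token(kv_cache)
        if next_id == STOP_ID:
            break
        kv_cache.append(next_id)
        output.append(VOCAB[next_id])
    return output, len(kv_cache)


words, cache_len = generate("the model")
print(" ".join(words), "| cache entries:", cache_len)
```

Note how the cache holds one entry per token seen so far, prompt and generated alike, which is why it grows with the length of the conversation.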

Common questions about LLM inference

Inference takes billions of multiply-add operations per token, grouped into large matrix multiplications. GPUs run these massively in parallel, far faster than a CPU. The weights must sit in the GPU’s VRAM so the maths can read them at full speed.
Disk is huge but slow. Streaming 140 GB of weights from an SSD for every token would make generation crawl. So at boot the weights are copied once from disk into fast VRAM, where they stay for the whole session.
Before generating, the model must process the entire prompt in one pass (prefill) to build the KV cache. That work happens before the first token appears, which is why time-to-first-token is higher than the steady streaming speed.
Each weight is normally a 16-bit number (FP16, 2 bytes). Quantization stores it in 8 or 4 bits instead, roughly halving or quartering the VRAM the model needs — at the cost of a little precision. Try the FP16 / INT8 / INT4 toggle and watch the VRAM bar move.
At FP16 a 70B model is ~140 GB of weights, but a single H100 has only 80 GB of VRAM, so the model has to be split across multiple GPUs. Quantize it to INT4 (~35 GB) and it fits on one. Change the precision to see it happen.
Every token generated adds its keys and values to the cache so they never need recomputing. The longer the conversation, the bigger the KV cache — which is why very long contexts also cost VRAM, on top of the weights.
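
To put rough numbers on the last two answers, the sketch below estimates KV cache size and the number of 80 GB GPUs needed. The architecture values (80 layers, 8 key/value heads, head dimension 128, FP16 cache) are assumptions picked as plausible for a 70B-class model and do not come from this page; the function names are likewise hypothetical. The estimate ignores activations and runtime overhead, so treat it as a lower bound.

```python
import math


def kv_cache_bytes(tokens, n_layers=80, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    """Bytes of KV cache for a given number of tokens.

    Each token stores one key and one value vector (hence the factor 2)
    per layer and per key/value head. All defaults are assumed values.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return tokens * per_token


def gpus_needed(weights_gb, kv_gb, vram_per_gpu_gb=80):
    """Smallest number of GPUs whose combined VRAM holds weights plus cache."""
    return math.ceil((weights_gb + kv_gb) / vram_per_gpu_gb)


kv_gb = kv_cache_bytes(4096) / 1e9
print(f"KV cache for 4096 tokens: {kv_gb:.2f} GB")
print("GPUs needed at FP16:", gpus_needed(140, kv_gb))
print("GPUs needed at INT4:", gpus_needed(35, kv_gb))
```

Under these assumptions a 4096-token context adds about 1.34 GB on top of the weights, so the FP16 model needs two 80 GB GPUs while the INT4 model fits on one.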
