Everyone talks to an LLM through a chat box, but what is actually running on the server behind it? This interactive simulator opens up the black box: watch a model load from disk into GPU memory, follow your prompt through the tokenize → prefill → decode loop that generates each word, and see the real memory maths that decides how much hardware the model needs to run.
Right now the 70B model is just 140 GB of numbers sitting on disk. Press “Boot server” to copy it into the GPU’s memory so it can run.
A model is billions of numbers. They start on disk and must be copied into fast GPU memory (VRAM) before the model can run.
Disk: always holds the full model. When the server is off, this is the only place the model exists.
VRAM: fast enough for real-time matrix maths, but limited in capacity. This is why model size and precision matter.
The server is shut down. The LLM is just static safetensors files on the hard drive — billions of numbers consuming zero memory and zero compute. Pick a model size and precision on the left to see what it would need, then boot the server.
Bigger models have more weights to store, and higher precision uses more bytes per weight. Switch from FP16 (2 bytes per weight) to INT4 (half a byte per weight) and watch the disk and VRAM figures shrink by 4×.
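The arithmetic behind those figures can be sketched in a few lines. This is a rough illustration rather than the simulator's own code: the function name `weight_memory_gb` and the precision table are assumptions made for the example, 1 GB is taken as 10^9 bytes, and only the weights are counted (a running server also needs memory for things like the KV cache and activations).

```python
# Bytes needed to store one weight at each precision.
# The table and function name are illustrative, not taken from the simulator.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Estimate the size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * BYTES_PER_PARAM[precision]
    return total_bytes / 1e9


if __name__ == "__main__":
    for precision in ("FP16", "INT8", "INT4"):
        size = weight_memory_gb(70, precision)
        print(f"70B at {precision}: {size:.0f} GB")
```

For the 70B model this gives 140 GB at FP16 and 35 GB at INT4, which is the 4× reduction mentioned above.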