
You load a 70B LLM on 2× NVIDIA A100 80 GB GPUs and think:
"This has more memory than I need. My model should run fine."
The model loads successfully. But the moment you start generating text, VRAM usage spikes. Suddenly, you hit the Out of Memory (OOM) error.
Your batch size collapses to 1. Your context length has to be cut in half. Your inference throughput drops to a crawl.
So you start searching for fixes like KV-aware routing and cache offloading. Every solution sounds promising, but brings new dependencies, complex optimization tricks, and performance trade-offs. Before you know it, you're buried in GitHub issues and CUDA error logs. This is all because your GPU memory didn't behave the way the spec sheet said it would.
It doesn't have to be this hard.
In this post, we'll clear up the confusion. You'll learn:
When people talk about GPU memory (e.g., A100 80 GB) for inference, they often mean dedicated VRAM (video random access memory). It is high-speed memory physically attached to the GPU chip, such as HBM3 or GDDR6X.
Key facts:
Note that strictly speaking, GPU memory and VRAM are not exactly the same thing.
GPU memory is a broader term that means whatever memory the GPU is currently using. It usually means VRAM (and people often use the terms interchangeably), but can refer to other sources depending on the system architecture.
For example, integrated GPUs are the most common case where GPU memory ≠ VRAM. They carve out a portion of system RAM, and there's no dedicated VRAM hardware at all. We'll break that down in the next section.
Shared GPU memory refers to system RAM that the GPU can use when it runs out of its own dedicated VRAM.
If you're using data center GPUs like A100 or H200 for LLM inference, you don't need to care about shared GPU memory. They rely entirely on their own dedicated HBM, which is fast and optimized for large-scale AI workloads.
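If you want to check these numbers on your own machine, PyTorch's CUDA utilities give a quick read of dedicated VRAM and current usage. A minimal sketch, assuming a CUDA-capable GPU and an installed PyTorch build:

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_vram_gb = props.total_memory / 1024**3               # dedicated VRAM on GPU 0
    allocated_gb = torch.cuda.memory_allocated(0) / 1024**3    # memory currently held by tensors
    reserved_gb = torch.cuda.memory_reserved(0) / 1024**3      # memory reserved by the caching allocator
    print(f"{props.name}: {total_vram_gb:.1f} GB total, "
          f"{allocated_gb:.2f} GB allocated, {reserved_gb:.2f} GB reserved")
```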
GPU memory isn't just a container for model weights. It's actively consumed throughout the entire inference process. At a high level, VRAM is used for:
Let's break down each stage.
Before inference starts, all model weights must be loaded into GPU memory (or sharded across GPUs if using tensor parallelism).
You can calculate the baseline memory required just to load the model like this:
Model Memory ≈ num_parameters × bytes_per_parameter
For example, a 70B model in FP16 precision requires:
70B × 2 bytes = 140 GB
Note: FP32 = 4 bytes, FP16/BF16 = 2 bytes, INT8/FP8 = 1 byte, INT4/FP4 = 0.5 bytes
This is why a single A100 80 GB cannot hold a 70B FP16 model without quantization or multi-GPU sharding.
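As a sanity check, here is the same arithmetic as a small Python sketch; the parameter count and the bytes-per-precision table simply mirror the note above:

```python
# Approximate bytes per parameter for common precisions
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "fp8": 1, "int4": 0.5}

def model_memory_gb(num_params: float, precision: str) -> float:
    """Baseline memory needed just to hold the weights (using 1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for p in ("fp32", "fp16", "int8", "int4"):
    print(f"70B @ {p:>4}: ~{model_memory_gb(70e9, p):.0f} GB")
# 70B @ fp32: ~280 GB, fp16: ~140 GB, int8: ~70 GB, int4: ~35 GB
```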
Once inference begins, memory consumption continues to grow due to the KV cache.
During prefill, the model processes the full input prompt in one pass and writes the key/value vectors for every prompt token into the KV cache.
During decoding, the model generates tokens one by one autoregressively, and each new token appends to the KV cache. Even if the prompt is short, a long generation still expands the KV cache significantly. Learn more about LLM inference in our handbook.
For chat applications with multi-turn conversations, every new turn must include all previous messages. This means:

This is why models with long context windows require GPUs with large VRAM.
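To make the multi-turn effect concrete, here is a tiny sketch with made-up per-message token counts; the point is that the context held in the KV cache is cumulative across turns:

```python
# Hypothetical chat: each turn is (user_tokens, assistant_tokens)
turns = [(200, 400), (150, 500), (300, 700), (100, 600)]

context_tokens = 0
for i, (user_toks, assistant_toks) in enumerate(turns, start=1):
    # Every new turn re-sends the entire conversation so far
    context_tokens += user_toks + assistant_toks
    print(f"After turn {i}: ~{context_tokens} tokens live in the KV cache")
# The KV cache footprint tracks this running total, not the latest message alone.
```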
Even after accounting for weights and KV cache, a few more items consume VRAM:
The specific overhead size depends on the framework and GPU.
A rough estimate of the required GPU memory is the number of parameters multiplied by the bytes per parameter, plus an extra overhead percentage:
Memory (GB) = num_parameters × bytes_per_parameter × (1 + overhead)
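In code, with the overhead expressed as a fraction (the 20% used here is an assumed placeholder; measure your own stack):

```python
def required_memory_gb(num_params: float, bytes_per_param: float, overhead: float = 0.2) -> float:
    """Rule-of-thumb VRAM estimate: weights plus a fractional overhead for runtime buffers."""
    return num_params * bytes_per_param * (1 + overhead) / 1e9

print(f"~{required_memory_gb(70e9, 2):.0f} GB")  # 70B FP16 with 20% overhead -> ~168 GB
```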
For LLM inference, the real variable you must watch is the KV cache, not the model weights. This is what grows with context length, batch size, and multi-turn interactions.
The KV cache stores the key/value vectors for every token the model has processed. Its size grows linearly with sequence length, batch size, and number of layers.
A simplified formula:
KV Cache Size (GB) = 2 × batch_size × seq_len × num_layers × hidden_dim × bytes_per_parameter / 1024³
You can experiment with different batch sizes or context windows using the interactive KV cache calculator online.
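Here is the same formula as a function, evaluated with illustrative dimensions roughly in the range of a 70B-class model (80 layers, 8,192 hidden size). Note that this simplified formula ignores refinements such as grouped-query attention, which shrink the cache on many modern models:

```python
def kv_cache_gb(batch_size: int, seq_len: int, num_layers: int,
                hidden_dim: int, bytes_per_param: int = 2) -> float:
    """Simplified KV cache size: 2 (keys and values) x tokens x layers x hidden size."""
    return 2 * batch_size * seq_len * num_layers * hidden_dim * bytes_per_param / 1024**3

# Batch of 8 requests, 4K tokens each, FP16 cache
print(f"~{kv_cache_gb(8, 4096, 80, 8192):.0f} GB")  # ~80 GB of VRAM just for the KV cache
```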
Before LLMs became common, AI teams were used to serving small NLP or vision models. They were only a few hundred MBs to a few GBs, so a single GPU could easily host multiple models at once with techniques like MIG (Multi-Instance GPU).
But in such cases, each model required very little memory, and compute usage was low.
This led to the expectation that "If my LLM is 60 GB, an 80 GB GPU should be enough and I might be able to run another small model on the same GPU."
However, LLMs behave very differently. They don't just load weights; they grow at runtime.
As mentioned in previous sections, even if an LLMâs weights fit comfortably in VRAM, it still needs a large and growing amount of memory for the KV cache. It expands with every token in the prompt and every token generated during decoding.
This results in the problems people commonly hit:
Ultimately, once you actually run inference, you'll hit the classic OOM issue.
This is why an "LLM that needs 60 GB for weights" often cannot run reliably on an 80 GB GPU with long prompts, high concurrency, or multi-turn conversations.
The limiting factor isn't the weights; it's the runtime memory growth.
Once you understand why LLMs consume so much memory, the next question is: How do you reduce memory pressure and scale?
Below are some of the most effective strategies used in production. All of them work, but several require complex engineering to deploy correctly.
And that's exactly why we built the Bento inference platform: to give AI teams all of these optimizations out of the box, without hiring a dedicated infra team.
Quantization reduces the precision of model weights (e.g., from FP16 to INT8 or INT4), which lowers memory usage.
For example, a 70B model in FP16 requires 140 GB for model weights alone, plus additional memory for KV cache and overhead. After INT4 quantization, the weights need 35 GB of VRAM. This is a 4× reduction in memory footprint, enough to run a previously multi-GPU model on a single H200 (141 GB) with room left for KV cache and other overhead.
Note that quantization does introduce accuracy degradation. However, modern quantization methods like GPTQ and AWQ minimize this loss. For many workloads (chatbots, RAG, internal tools), the gains in memory savings and throughput outweigh the precision trade-off.
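For illustration, loading a pre-quantized checkpoint with an open-source engine such as vLLM looks roughly like this; the model ID and settings below are examples, not a recommendation:

```python
from vllm import LLM, SamplingParams

# Load an AWQ-quantized 70B checkpoint; the engine picks up the quantization
# config from the model files. The model ID below is illustrative.
llm = LLM(
    model="TheBloke/Llama-2-70B-Chat-AWQ",
    quantization="awq",
    gpu_memory_utilization=0.90,  # cap on the fraction of VRAM the engine may claim
)
outputs = llm.generate(
    ["Summarize why the KV cache grows during decoding."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```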
How Bento helps:
Bento supports any open-source, fine-tuned, or custom model, including quantized formats (INT8, FP8, INT4, GGUF, AWQ, GPTQ, etc.). You can load and serve quantized models directly without extra engineering work, container hacks, or custom inference wrappers.
Distributed inference lets you spread the workload across multiple GPUs when a model or context window exceeds the capacity of a single device.
Tensor parallelism splits the internal computations of each layer across multiple GPUs. Instead of placing entire layers on different devices, it slices the layer's tensors into smaller chunks and distributes them across multiple GPUs.
This is useful when:

However, tensor parallelism introduces significant communication overhead, which:
The overhead can become a bottleneck and affect LLM performance if not carefully optimized.
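As a rough sketch of what this looks like in practice (again using vLLM, with illustrative values), tensor parallelism itself is typically a single engine argument; the hard part is the surrounding infrastructure and communication tuning:

```python
from vllm import LLM

# Shard each layer's tensors across 2 GPUs (e.g., 2x A100 80 GB).
# The engine handles the inter-GPU all-reduce communication for you.
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # illustrative model ID
    tensor_parallel_size=2,
    dtype="float16",
)
```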
KV cache is often the real memory bottleneck. You can integrate multiple runtime strategies to control its growth and reduce VRAM usage.
Prefix caching: Reuses shared prompts across requests and users (e.g., system prompts). This cuts redundant KV computation and is especially useful for chat applications, RAG systems, and AI agents.
KV-aware routing: Ensures tokens for the same request are executed on the same GPU or worker, preventing fragmented KV allocations and improving the cache hit rate. In our internal tests with long context data (>20K), KV-aware routing delivered:
KV cache offloading: Moves less frequently used KV blocks to lower-cost storage like CPU memory or disk, freeing up VRAM without blocking token generation.
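Of the three, prefix caching is the easiest to try yourself; in vLLM it is a single flag (a hedged sketch with an illustrative model ID), while KV-aware routing and KV cache offloading are usually implemented at the serving-platform layer rather than as a single engine option:

```python
from vllm import LLM

# Enable automatic prefix caching: requests that share the same prompt prefix
# (for example, a long system prompt) reuse its KV blocks instead of recomputing them.
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # illustrative model ID
    enable_prefix_caching=True,
)
```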
These strategies work, but they are hard to implement yourself. To configure all of them manually, you'd need to handle:
Most AI teams don't have the engineering bandwidth to build and maintain this infrastructure.
How Bento helps:
Bento packages these optimizations into a production-ready inference platform so you don't have to reinvent the entire inference stack:
Instead of spending time building your own inference system, you get an optimized one that handles GPU memory intelligently. With it, your models run faster, cheaper, and with higher throughput, and your engineering team can focus on AI development.
VRAM is far more than a static storage specification. It is a dynamic resource that fluctuates with every token generated, every new user connected, and every millisecond of your inference runtime. Understanding the math behind model weights and KV cache growth is the first step to avoiding those deployment-day crashes.
The next step, actually solving it with quantization and the other optimization techniques covered above, is an engineering marathon.
That is why we built Bento. We believe your team should focus on building better AI products, not fighting with infrastructure issues.
If you want to run LLMs reliably, efficiently, and at scale: