GPU & Compute Fundamentals

Modern generative AI runs on an enormous amount of computation. Training a large model can involve more calculations than a human could do in many lifetimes, and the hardware that makes this possible is mainly the GPU. Understanding compute and GPUs explains why AI is so powerful, why it is so expensive, and how it scales.

💡 In one line: AI needs massive parallel math, and GPUs, with thousands of cores working at once, are the hardware that delivers it.

Why Does AI Need So Much Compute?

A neural network is, at its heart, a giant pile of matrix multiplications: billions of multiply-and-add operations. Training makes this far heavier: the model processes huge datasets and adjusts billions of parameters, repeating the calculation millions of times. The total amount of math is staggering, which is why specialised hardware is essential.
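
To make the "pile of multiply-and-add operations" concrete, here is a minimal sketch of a matrix multiplication in plain Python that also counts the multiply-adds it performs. The function name `matmul_count` is hypothetical and chosen for illustration; real frameworks use heavily optimised kernels rather than loops like these.

```python
def matmul_count(a, b):
    """Multiply two matrices (lists of rows) and count the multiply-adds."""
    rows, inner, cols = len(a), len(b), len(b[0])
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    out = [[0.0] * cols for _ in range(rows)]
    multiply_adds = 0
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                # One multiply and one add per innermost step.
                out[i][j] += a[i][k] * b[k][j]
                multiply_adds += 1
    return out, multiply_adds


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
product, ops = matmul_count(a, b)
print(product)  # [[19.0, 22.0], [43.0, 50.0]]
print(ops)      # 8, i.e. rows * inner * cols = 2 * 2 * 2
```

The count grows as rows × inner × cols, so multiplying two 4096 × 4096 matrices already takes about 69 billion multiply-adds, and a large model performs many such multiplications for every input.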

CPU vs. GPU

The two main types of processor are built for different jobs:

  • CPU (Central Processing Unit): has a few very powerful cores. Brilliant at complex, sequential tasks done one after another.
  • GPU (Graphics Processing Unit): has thousands of simpler cores. Brilliant at doing the same operation on lots of data at once, that is, parallel work.

Neural-network math is exactly the kind of work that splits into thousands of identical small operations, so GPUs can do it far faster than CPUs. (GPUs were originally built for rendering graphics, which is also massively parallel; that is why they turned out to be so well suited to AI.)
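
The sketch below illustrates why this kind of math parallelises well: every output of a matrix-vector product is an independent dot product, so the rows can be handed to separate workers in any order. The helper names are hypothetical, and the thread pool only stands in for a GPU's many cores. In standard CPython, threads will not make this arithmetic faster; the point is only that the work splits cleanly.

```python
from concurrent.futures import ThreadPoolExecutor


def dot(u, v):
    """Dot product: the same multiply-and-add applied along two vectors."""
    return sum(x * y for x, y in zip(u, v))


def matvec_sequential(matrix, vec):
    """One row after another, as a single core would do it."""
    return [dot(row, vec) for row in matrix]


def matvec_parallel(matrix, vec, workers=4):
    """Each row is an independent task, so rows can be farmed out to workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, whichever task finishes first.
        return list(pool.map(lambda row: dot(row, vec), matrix))


matrix = [[1, 2], [3, 4], [5, 6]]
vec = [10, 1]
print(matvec_sequential(matrix, vec))  # [12, 34, 56]
print(matvec_parallel(matrix, vec))    # [12, 34, 56]
```

Because no row needs the result of another, the same answer comes out whether the rows are processed one at a time or all at once, which is the property GPU hardware exploits.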

What is "Compute"?

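Compute simply means the amount of calculation needed. It is measured in FLOPs: floating-point operations, the basic math operations such as multiply and add. (FLOPs with a lowercase "s" counts operations; FLOPS with a capital "S" usually means operations per second, a measure of hardware speed.)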

  • Training a large model takes a colossal total number of FLOPs.
  • More compute means you can train bigger models on more data, which, per the scaling laws, tends to produce better models.

This is why compute is treated as a precious, strategic resource in AI.
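
As a rough illustration of how large these totals get, the sketch below uses a common rule of thumb that is not stated in this article: training costs roughly 6 FLOPs per parameter per training token. The model size, token count, GPU count, per-GPU speed, and utilisation are all hypothetical numbers chosen only to show the arithmetic.

```python
def training_flops(num_params, num_tokens):
    """Assumed rule of thumb: about 6 FLOPs per parameter per training token."""
    return 6 * num_params * num_tokens


def training_days(total_flops, num_gpus, flops_per_gpu_per_sec, utilization):
    """Days of wall-clock time if the cluster sustains the given utilisation."""
    effective_rate = num_gpus * flops_per_gpu_per_sec * utilization
    seconds = total_flops / effective_rate
    return seconds / 86400  # seconds per day


# Hypothetical run: 7B parameters, 1 trillion tokens,
# 1000 GPUs at 1e14 FLOPs per second each, 40% utilisation.
flops = training_flops(7e9, 1e12)
days = training_days(flops, num_gpus=1000, flops_per_gpu_per_sec=1e14, utilization=0.4)
print(f"{flops:.1e} FLOPs")  # 4.2e+22 FLOPs
print(f"{days:.1f} days")    # 12.2 days
```

Doubling either the parameter count or the amount of data doubles the estimate, which is one way to see why compute budgets grow so quickly.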

Key Hardware Terms

Term      Meaning
GPU       Chip with thousands of cores for parallel math; the AI workhorse
VRAM      The GPU's own memory; must hold the model and data
FLOPs     Floating-point operations; a measure of how much compute is used
TPU       A custom AI accelerator chip (made by Google)
Cluster   Many GPUs networked together for large-scale training

Training vs. Inference Compute

  • Training needs massive, sustained compute, often clusters of hundreds or thousands of GPUs running for weeks.
  • Inference needs less compute per request, but it runs constantly for every user, so it still requires capable GPUs, and at scale its total compute can be huge.
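
The sketch below shows how a small per-request cost adds up. It assumes a common approximation that is not from this article: generating one token costs about 2 FLOPs per parameter. The model size, tokens per request, and daily traffic are hypothetical.

```python
def inference_flops_per_token(num_params):
    """Assumed approximation: about 2 FLOPs per parameter per generated token."""
    return 2 * num_params


def daily_inference_flops(num_params, tokens_per_request, requests_per_day):
    """Total FLOPs per day for a service answering many requests."""
    per_request = inference_flops_per_token(num_params) * tokens_per_request
    return per_request * requests_per_day


# Hypothetical service: 7B-parameter model, 500 tokens per reply,
# 10 million requests per day.
per_request = inference_flops_per_token(7e9) * 500
per_day = daily_inference_flops(7e9, 500, 10_000_000)
print(f"{per_request:.1e} FLOPs per request")  # 7.0e+12 FLOPs per request
print(f"{per_day:.1e} FLOPs per day")          # 7.0e+19 FLOPs per day
```

A single request is cheap next to a training run, but the daily total scales directly with traffic and never stops accumulating while the service is live.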

Memory Matters (VRAM)

A model's parameters must fit in the GPU's memory (VRAM) to run. Roughly, at half precision a parameter takes ~2 bytes, so:

  • A 7-billion-parameter model needs about 14 GB of VRAM just to load.
  • Very large models don't fit on one GPU and must be split across many.
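
The arithmetic behind these figures can be sketched as follows. The function names and the 24 GB GPU size are hypothetical. The estimate covers weights only, using 1 GB = 10^9 bytes.

```python
import math


def weights_gb(num_params, bytes_per_param=2):
    """Memory for the weights alone, in GB (default: 2 bytes, half precision)."""
    return num_params * bytes_per_param / 1e9


def gpus_needed(num_params, gpu_vram_gb, bytes_per_param=2):
    """Minimum number of GPUs whose combined VRAM can hold the weights."""
    return math.ceil(weights_gb(num_params, bytes_per_param) / gpu_vram_gb)


print(weights_gb(7e9))        # 14.0
print(gpus_needed(7e9, 24))   # 1
print(gpus_needed(70e9, 24))  # 6, because 140 GB of weights exceeds one 24 GB card
```

This is a lower bound: activations and other working data also occupy VRAM, so real deployments need headroom beyond the weights.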

This is why quantization (storing parameters at lower precision) is so useful: it shrinks memory needs and lets bigger models run on smaller hardware.
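
To show the effect of precision on memory, this sketch compares weight storage for a 7-billion-parameter model at several common formats. The table of bytes per parameter reflects the nominal width of each format. It ignores the small extra metadata (such as scale factors) that real quantization schemes store, so treat the results as approximate.

```python
# Nominal storage per parameter for some common numeric formats.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb_at(num_params, precision):
    """Approximate weight memory in GB (1 GB = 1e9 bytes) at a given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weights_gb_at(7e9, precision):.1f} GB")
# fp32: 28.0 GB
# fp16: 14.0 GB
# int8: 7.0 GB
# int4: 3.5 GB
```

Each halving of precision halves the memory needed for the weights, usually at some cost in accuracy.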

Why It's Expensive

  • GPUs are costly and power-hungry.
  • Training frontier models can cost millions of dollars in compute and electricity.
  • Scaling laws mean better models generally demand even more compute.

Compute is often the single biggest bottleneck, and the single biggest cost, in building large AI systems.

Reducing Compute Needs

Common ways to do more with less:

  • Quantization: lower-precision parameters (less memory, faster).
  • Smaller / efficient models: right-size to the task.
  • Mixed precision: use lower precision where accuracy allows.
  • Batching: process many inputs together for efficiency.
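
As one concrete instance of the first technique, here is a minimal sketch of symmetric 8-bit quantization: each weight is mapped to an integer between -127 and 127 using a single shared scale factor. This is one simple scheme among many, chosen for illustration. The function names are hypothetical, and production libraries use more refined variants.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.031, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)  # [50, -127, 3, 100]
# Each restored value is within half a quantization step of the original.
print(max(abs(w - r) for w, r in zip(weights, restored)) <= scale / 2 + 1e-12)  # True
```

Each weight now needs one byte instead of two or four, and the rounding error per weight is bounded by half the scale factor.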

Summary

  • AI is built on massive parallel computation, mostly matrix multiplications.
  • CPUs have a few powerful cores (sequential); GPUs have thousands of small cores (parallel), which makes them ideal for AI.
  • Compute is measured in FLOPs; more compute enables bigger, better models.
  • A model must fit in GPU memory (VRAM); a 7B model needs ~14 GB at half precision, so big models span many GPUs.
  • Compute is expensive and power-hungry, and techniques like quantization and batching help reduce the load.