Introduction

Modern computing systems increasingly rely on:

  • Multicore processors

  • Multi-socket CPUs

  • Large memory systems

  • Parallel workloads

  • Cloud infrastructure

  • High-performance servers

As CPU core counts increased dramatically, traditional memory architectures began facing serious scalability problems:

  • Memory bottlenecks

  • Bus contention

  • Increased latency

  • Reduced throughput

Older systems used:

UMA (Uniform Memory Access)

where all processors accessed memory with roughly equal latency.

However, as systems scaled to:

  • Many CPUs

  • Large memory capacities

UMA architectures became inefficient.

To solve this problem, modern systems introduced:

NUMA (Non-Uniform Memory Access)

NUMA is one of the most important advanced memory architecture concepts in:

  • Operating systems

  • Multicore systems

  • Cloud computing

  • Databases

  • High-performance computing

  • Linux kernel design

because it directly affects:

  • Memory latency

  • CPU performance

  • Scalability

  • Scheduling efficiency

  • Parallel execution

What is NUMA?

NUMA stands for:

Non-Uniform Memory Access

NUMA is a memory architecture in which memory access time depends on:

  • Which CPU accesses the memory

  • Where the memory physically resides

Core Idea

Accessing nearby memory is faster than accessing distant memory

Important Insight

In NUMA systems, memory latency varies depending on processor-memory locality

Why NUMA is Necessary

Suppose a large server contains:

  • Multiple CPU sockets

  • Hundreds of cores

  • Large RAM capacity

If all CPUs share a single centralized memory:

  • Severe contention occurs

Problems:

  • Memory bottlenecks

  • Bus congestion

  • Poor scalability

NUMA solves this by:

  • Distributing memory closer to processors

UMA vs NUMA

UMA (Uniform Memory Access)

All processors access:

  • Same shared memory

with:

  • Equal latency

NUMA

Each CPU/node has:

  • Local memory

Access speed depends on:

  • Memory location

Comparison Table

Feature                 | UMA            | NUMA
Memory latency          | Uniform        | Non-uniform
Scalability             | Limited        | Better
Processor locality      | Less important | Critical
Large-system efficiency | Lower          | Higher

NUMA Node

Very important concept.

A NUMA system is divided into:

NUMA nodes

Each node contains:

  • CPU cores

  • Local memory

  • Memory controller

Example

A server may contain:

  • Node 0

  • Node 1

  • Node 2

  • Node 3

Each with:

  • Own RAM region
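On Linux, the CPUs that belong to each node are listed in /sys/devices/system/node/node<N>/cpulist in a compact range format such as 0-3,8-11. The following is a minimal sketch of parsing that format; the function name parse_cpulist and the sample two-node layout are illustrative assumptions, not values from a real machine.

```python
def parse_cpulist(text):
    """Expand a Linux cpulist string such as '0-3,8-11' into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty string (a node with no CPUs)
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


# Hypothetical 2-node layout; on a real Linux host the strings would be
# read from /sys/devices/system/node/node<N>/cpulist.
sample_nodes = {0: "0-3,8-11", 1: "4-7,12-15"}
node_cpus = {node: parse_cpulist(s) for node, s in sample_nodes.items()}
```

The resulting node_cpus mapping (node id to CPU ids) is the kind of table that later placement decisions can be built on.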

Local Memory Access

When a CPU accesses:

  • Its own node memory

Result:

  • Low latency

  • High bandwidth

Remote Memory Access

When a CPU accesses:

  • Another node’s memory

Result:

  • Higher latency

  • Lower effective bandwidth

Important Insight

NUMA performance depends heavily on maximizing local memory access

NUMA Memory Topology

Memory is physically distributed across:

  • Multiple nodes

Nodes are connected through:

  • High-speed interconnects

Examples:

  • Intel UPI

  • AMD Infinity Fabric
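Topology is commonly summarized as a matrix of relative node distances (Linux shows one in the output of numactl --hardware), where by convention a node's distance to itself is 10 and larger values mean farther away. The sketch below uses a hypothetical 4-node matrix; the remote values 16 and 21 and the helper name nearest_nodes are assumptions for illustration only.

```python
# Hypothetical distance matrix for a 4-node system. The local distance is 10
# by convention; the remote values (16, 21) are illustrative assumptions.
DISTANCE = [
    [10, 16, 21, 21],
    [16, 10, 21, 21],
    [21, 21, 10, 16],
    [21, 21, 16, 10],
]


def nearest_nodes(src, distance=DISTANCE):
    """Return node ids ordered from closest to farthest as seen from `src`."""
    return sorted(range(len(distance)), key=lambda n: (distance[src][n], n))


def is_local(cpu_node, mem_node):
    """An access is local when the CPU and the memory sit on the same node."""
    return cpu_node == mem_node
```

An allocator that cannot satisfy a request locally could walk the list returned by nearest_nodes to pick the least distant alternative.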

Memory Access Example

Suppose:

  • CPU in Node 0

Local Access

Accesses:

  • Node 0 RAM

Fast.

Remote Access

Accesses:

  • Node 2 RAM

Slower due to:

  • Inter-node communication

NUMA Latency Difference

Example:

Access Type   | Approx Latency
Local Memory  | Low
Remote Memory | Higher
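The effect of locality on average latency can be estimated as a weighted average of the local and remote cases. This is a minimal sketch; the 90 ns and 150 ns figures in the usage lines are placeholder assumptions, not measurements from any particular system.

```python
def average_latency_ns(local_fraction, local_ns, remote_ns):
    """Weighted average latency for a given fraction of local accesses."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Placeholder latencies (assumed, not measured): 90 ns local, 150 ns remote.
mostly_local = average_latency_ns(0.9, 90, 150)
mostly_remote = average_latency_ns(0.1, 90, 150)
```

Under these assumed numbers, shifting accesses from mostly remote to mostly local lowers the average, which is the quantity NUMA-aware placement tries to reduce.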

NUMA and Operating Systems

Operating systems must become:

NUMA-aware

OS responsibilities:

  • NUMA-aware scheduling

  • Memory allocation

  • Process placement

  • Load balancing

NUMA-Aware Scheduling

The scheduler tries to:

  • Keep threads near their memory

Advantages:

  • Reduced latency

  • Better cache locality

  • Higher throughput

Example

A thread using Node 1 memory:

  • Is preferably scheduled on a Node 1 CPU

Processor Affinity

Very important NUMA optimization.

CPU Affinity

Thread/process prefers:

  • Specific CPU/core

NUMA Affinity

Thread/process prefers:

  • Specific NUMA node
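NUMA affinity can be approximated from user space by restricting a process to the CPUs of one node. The sketch below uses Python's Linux-only os.sched_getaffinity and os.sched_setaffinity; the helper names and the caller-supplied node_cpus mapping are assumptions. It sets CPU affinity only and does not by itself change the memory policy.

```python
import os


def usable_cpus(node_cpus, node, allowed):
    """CPUs of `node` that the process is also allowed to run on."""
    return set(node_cpus[node]) & set(allowed)


def pin_to_node(node_cpus, node):
    """Pin the calling process to one node's CPUs (Linux only).

    `node_cpus` maps node id -> iterable of CPU ids and is supplied by the
    caller. Returns the CPU set applied, or None when pinning is unavailable
    or the node shares no CPUs with the current affinity mask.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # platform without the affinity API
    target = usable_cpus(node_cpus, node, os.sched_getaffinity(0))
    if not target:
        return None
    os.sched_setaffinity(0, target)  # pid 0 means the calling process
    return target
```

Intersecting with the current mask first avoids requesting CPUs that a container or cpuset has already made unavailable.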

Important Insight

NUMA-aware scheduling minimizes expensive remote memory accesses

NUMA Memory Allocation

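The operating system attempts to:

  • Allocate memory close to the executing CPU

This is called the:

Local allocation policy (on Linux, often described as "first touch": a page is placed on the node of the CPU that first touches it)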

Example

For a thread running on Node 0:

  • Memory is allocated from Node 0 RAM

NUMA Balancing

Modern Linux kernels perform:

Automatic NUMA balancing

Kernel monitors:

  • Memory access patterns

and may:

  • Migrate pages

  • Move threads

to improve locality.
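The idea of migrating pages toward the node that uses them most can be shown with a toy model. This is not the kernel's algorithm; the function name plan_migrations, the sampling format, and the min_samples threshold are all assumptions made for illustration.

```python
from collections import Counter


def plan_migrations(page_home, access_log, min_samples=4):
    """Suggest page moves toward the node that accesses each page most.

    page_home:  dict page_id -> node currently holding the page
    access_log: iterable of (page_id, accessing_node) samples
    Returns a dict page_id -> target node for pages worth migrating.
    """
    per_page = {}
    for page, node in access_log:
        per_page.setdefault(page, Counter())[node] += 1

    moves = {}
    for page, counts in per_page.items():
        total = sum(counts.values())
        top_node, top_hits = counts.most_common(1)[0]
        # Migrate only with enough samples and a clear majority on a remote node.
        if (total >= min_samples
                and top_node != page_home[page]
                and top_hits * 2 > total):
            moves[page] = top_node
    return moves
```

The sample threshold and majority test stand in for the real trade-off: migration itself costs time, so it should only happen when the access pattern is clear.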

Linux NUMA Support

Linux provides strong NUMA support.

Tools:

  • numactl

  • numastat

  • taskset (sets CPU affinity only; it does not control memory placement)

Example

numactl --cpunodebind=0 --membind=0 program

Purpose

Bind the process to:

  • Specific CPU node

  • Specific memory node
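When launching workloads from a script, the same numactl invocation can be assembled programmatically. This sketch only builds the argument list; the helper name numactl_command and the ./program placeholder are assumptions, and actually running the command (for example with subprocess.run) requires numactl to be installed.

```python
import shlex


def numactl_command(program_argv, cpu_node=None, mem_node=None):
    """Build an argv list that runs `program_argv` under numactl."""
    argv = ["numactl"]
    if cpu_node is not None:
        argv.append(f"--cpunodebind={cpu_node}")  # run only on this node's CPUs
    if mem_node is not None:
        argv.append(f"--membind={mem_node}")      # allocate only from this node
    return argv + list(program_argv)


# './program' is a placeholder for the real executable.
cmd = numactl_command(["./program"], cpu_node=0, mem_node=0)
command_line = shlex.join(cmd)
```

Building a list rather than a single string avoids shell quoting problems when the program takes arguments.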

NUMA Policies

Linux supports multiple NUMA allocation policies.

1. Local Allocation

Allocate from the node of the CPU running the thread (the Linux default).

2. Interleaving

Spread pages round-robin across a set of nodes (MPOL_INTERLEAVE).

3. Preferred Node

Prefer a specific node, falling back to other nodes when it has no free memory (MPOL_PREFERRED).

4. Strict Binding

Restrict allocations to the chosen nodes only (MPOL_BIND).
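The difference between the four policies is easiest to see in a toy allocator that hands out one page at a time. This is a simplified model under stated assumptions, not the kernel's implementation: the class name NodeAllocator, the policy strings, and the fallback order (lowest node id first) are all illustrative.

```python
class NodeAllocator:
    """Toy page allocator modelling four placement policies."""

    def __init__(self, free_pages):
        self.free = dict(free_pages)  # node id -> free page count
        self._next = 0                # round-robin cursor for interleaving

    def _take(self, node):
        if self.free.get(node, 0) > 0:
            self.free[node] -= 1
            return node
        return None

    def _fallback(self, skip):
        # Assumed fallback order: lowest node id first.
        for node in sorted(self.free):
            if node != skip and self._take(node) is not None:
                return node
        return None

    def alloc(self, policy, cpu_node, nodes=None):
        """Return the node that supplies one page, or None on failure."""
        if policy == "local":
            got = self._take(cpu_node)
            return got if got is not None else self._fallback(cpu_node)
        if policy == "preferred":
            got = self._take(nodes[0])
            return got if got is not None else self._fallback(nodes[0])
        if policy == "interleave":
            for _ in range(len(nodes)):
                node = nodes[self._next % len(nodes)]
                self._next += 1
                if self._take(node) is not None:
                    return node
            return None
        if policy == "bind":
            for node in nodes:
                if self._take(node) is not None:
                    return node
            return None  # strict: never spill outside `nodes`
        raise ValueError(f"unknown policy: {policy}")
```

The key contrast is in the failure path: local and preferred allocation fall back to another node, while strict binding fails rather than placing memory elsewhere.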

NUMA and Virtual Memory

The virtual memory system must track:

  • Physical page location

  • NUMA node ownership

Page tables still operate normally, but:

  • Physical page placement matters

Important Insight

Virtual memory abstraction remains unchanged, but physical page locality becomes critical in NUMA systems

NUMA and Cache Coherency

Modern NUMA systems maintain:

  • Cache coherence (such systems are often called ccNUMA, cache-coherent NUMA)

between:

  • Multiple processors

Challenges:

  • Synchronization overhead

  • Memory consistency traffic

NUMA and Databases

Databases are heavily affected by NUMA.

Examples:

  • MySQL

  • PostgreSQL

  • Oracle

Optimizations:

  • NUMA-aware memory pools

  • Local thread placement

NUMA and Virtualization

Hypervisors must manage:

  • NUMA-aware VM placement

Advantages:

  • Better VM performance

  • Reduced remote memory access

NUMA in Cloud Computing

Cloud servers often contain:

  • Large NUMA systems

Cloud schedulers optimize:

  • VM placement

  • Memory locality

NUMA and Containers

Containers may use:

  • CPU pinning

  • NUMA-aware allocation

for:

  • Better scalability

NUMA Challenges

1. Remote Access Penalty

Remote memory is slower than local memory.

2. Scheduling Complexity

The OS must optimize both thread and memory placement.

3. Load Balancing Difficulty

Balancing locality against CPU utilization is difficult.

4. Cache Coherency Traffic

Cross-node synchronization is expensive.

False Sharing in NUMA

Occurs when:

  • Multiple CPUs modify different variables that share the same cache line

Leads to:

  • Excessive coherency traffic
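Whether two variables can falsely share depends on whether their addresses fall in the same cache line, and padding each item to a full line avoids it. The sketch below assumes a 64-byte cache line, which is common but not universal; the helper names are illustrative.

```python
CACHE_LINE_BYTES = 64  # assumption: typical line size, check the target CPU


def same_cache_line(addr_a, addr_b, line=CACHE_LINE_BYTES):
    """True when two byte addresses fall in the same cache line."""
    return addr_a // line == addr_b // line


def padded_offsets(count, item_size, line=CACHE_LINE_BYTES):
    """Byte offsets that place each item at the start of its own cache line."""
    stride = -(-item_size // line) * line  # round item_size up to a line multiple
    return [i * stride for i in range(count)]
```

For example, eight 8-byte per-thread counters packed back to back occupy a single 64-byte line, whereas padded_offsets(8, 8) spaces them one line apart at the cost of extra memory.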

NUMA Optimization Techniques

1. Thread Pinning

Bind threads to CPUs.

2. Memory Locality Optimization

Allocate local memory.

3. Data Partitioning

Keep related data near processing cores.

4. NUMA-Aware Allocators

Optimize memory placement.
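Data partitioning, for instance, can be as simple as giving each node one contiguous slice of the work so that the threads on that node touch only their own slice. This is a minimal sketch; the function name partition_by_node is an assumption, and real systems would also allocate each slice on its owning node.

```python
def partition_by_node(items, num_nodes):
    """Split `items` into `num_nodes` contiguous, near-equal chunks."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    base, extra = divmod(len(items), num_nodes)
    chunks, start = [], 0
    for node in range(num_nodes):
        size = base + (1 if node < extra else 0)  # first `extra` nodes get one more
        chunks.append(items[start:start + size])
        start += size
    return chunks
```

Contiguous chunks are chosen over round-robin assignment here so that each node's data stays together in memory rather than being interleaved with other nodes' data.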

Advantages of NUMA

1. Better Scalability

Supports large multicore systems.

2. Reduced Memory Bottlenecks

Memory traffic is spread across per-node memory controllers.

3. Higher Throughput

Nodes can access their local memory in parallel.

4. Improved Performance

Local accesses are fast.

Disadvantages of NUMA

1. Complexity

Programming becomes harder.

2. Remote Memory Penalty

Performance varies with where data is placed.

3. OS Scheduling Challenges

Requires NUMA awareness.

4. Optimization Difficulty

Applications may require tuning.

Real-World Example

Suppose a database server contains:

  • 4 NUMA nodes

  • 256 CPU cores

  • 1 TB RAM

Without NUMA optimization:

  • Frequent remote memory access

  • High latency

  • Poor scalability

With NUMA-aware scheduling:

  1. Worker threads pinned locally

  2. Memory allocated near CPUs

  3. Reduced inter-node traffic

  4. Better throughput achieved
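The per-node share of this example server follows from simple division. The sketch assumes that cores and memory are split evenly across nodes and that 1 TB is counted as 1024 GiB; neither assumption is stated in the example itself.

```python
def per_node_resources(nodes, cores, ram_gib):
    """Return (cores per node, GiB per node), assuming an even split."""
    if cores % nodes or ram_gib % nodes:
        raise ValueError("resources do not divide evenly across nodes")
    return cores // nodes, ram_gib // nodes


# 4 nodes, 256 cores, 1 TB taken as 1024 GiB (assumption).
cores_per_node, gib_per_node = per_node_resources(4, 256, 1024)
```

Under these assumptions each node holds 64 cores and 256 GiB, so a worker pool pinned to one node should ideally keep its working set within that node's share.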

NUMA vs SMP

Students commonly confuse:

  • NUMA

  • SMP

SMP (Symmetric Multiprocessing)

All processors are treated equally: they run under a single OS and share one memory address space.

NUMA

Memory locality is explicitly important.

Modern Systems

Modern systems often combine:

  • SMP + NUMA characteristics (the OS treats all cores as peers, while the hardware memory layout is NUMA)