NUMA (Non-Uniform Memory Access) in Operating Systems

Last updated: May 21, 2026

Author: Christy Harshitha Dakarapu

Introduction

Modern computing systems increasingly rely on:

Multicore processors
Multi-socket CPUs
Large memory systems
Parallel workloads
Cloud infrastructure
High-performance servers

As CPU core counts increased dramatically, traditional memory architectures began facing serious scalability problems:

Memory bottlenecks
Bus contention
Increased latency
Reduced throughput

Older systems used UMA (Uniform Memory Access), where all processors accessed memory with roughly equal latency.

However, as systems scaled to many CPUs and large memory capacities, UMA architectures became inefficient.

To solve this problem, modern systems introduced NUMA (Non-Uniform Memory Access).

NUMA is one of the most important advanced memory architecture concepts in:

Operating systems
Multicore systems
Cloud computing
Databases
High-performance computing
Linux kernel design

because it directly affects:

Memory latency
CPU performance
Scalability
Scheduling efficiency
Parallel execution

What is NUMA?

NUMA stands for:

Non-Uniform Memory Access

NUMA is a memory architecture in which memory access time depends on:

Which CPU accesses the memory
Where the memory physically resides

Core Idea

Accessing nearby memory is faster than accessing distant memory

Important Insight

In NUMA systems, memory latency varies depending on processor-memory locality

Why NUMA is Necessary

Suppose a large server contains:

Multiple CPU sockets
Hundreds of cores
Large RAM capacity

If all CPUs share a single centralized memory:

Severe contention occurs

Problems:

Memory bottlenecks
Bus congestion
Poor scalability

NUMA solves this by:

Distributing memory closer to processors

UMA vs NUMA

UMA (Uniform Memory Access)

All processors access the same shared memory with equal latency.

NUMA

Each CPU socket (node) has its own local memory.

Access speed depends on where the memory is located relative to the accessing CPU.

Comparison Table

Feature	UMA	NUMA
Memory latency	Uniform	Non-uniform
Scalability	Limited	Better
Processor locality	Less important	Critical
Large-system efficiency	Lower	Higher

NUMA Node

The NUMA node is the basic unit of locality in a NUMA system.

A NUMA system is divided into:

NUMA nodes

Each node contains:

CPU cores
Local memory
Memory controller

Example

A server may contain:

Node 0
Node 1
Node 2
Node 3

Each with:

Its own RAM region
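On Linux, the CPUs that belong to each node are listed in /sys/devices/system/node/node<N>/cpulist, in a compact format such as "0-3,8-11". The following is a minimal sketch of reading that layout. The function names parse_cpulist and read_node_cpus are illustrative, and the sample string is a hypothetical layout, not output from a real machine.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a cpulist string such as '0-3,8-11' into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty list means the node has no CPUs
        if "-" in part:
            start, end = part.split("-")
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def read_node_cpus(base="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]} on Linux, or {} where sysfs has no NUMA info."""
    nodes = {}
    root = Path(base)
    if not root.is_dir():
        return nodes
    for entry in root.glob("node[0-9]*"):
        cpulist = entry / "cpulist"
        if cpulist.is_file():
            # Directory names look like 'node0', 'node1', ...
            nodes[int(entry.name[4:])] = parse_cpulist(cpulist.read_text())
    return nodes


if __name__ == "__main__":
    print(parse_cpulist("0-3,8-11"))  # hypothetical node: CPUs 0..3 and 8..11
    print(read_node_cpus())           # real layout, or {} on non-NUMA/non-Linux
```

On a machine without NUMA information in sysfs, read_node_cpus simply returns an empty dictionary.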

Local Memory Access

When a CPU accesses:

Its own node's memory

Result:

Low latency
High bandwidth

Remote Memory Access

When a CPU accesses:

Another node’s memory

Result:

Higher latency
Lower performance

Important Insight

NUMA performance depends heavily on maximizing local memory access

NUMA Memory Topology

Memory is physically distributed across:

Multiple nodes

Nodes are connected through:

High-speed interconnects

Examples:

Intel UPI
AMD Infinity Fabric
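Linux describes this topology as a matrix of relative node distances (shown by numactl --hardware), where 10 denotes local memory and larger values denote memory that is farther away. Here is a small sketch of using such a matrix to rank nodes by proximity. The matrix values and the helper name nodes_by_proximity are hypothetical and chosen only for illustration.

```python
# Hypothetical distance matrix for a 4-node machine, in the style printed by
# `numactl --hardware`: 10 means local, larger values mean farther away.
DISTANCE = [
    [10, 21, 31, 31],
    [21, 10, 31, 31],
    [31, 31, 10, 21],
    [31, 31, 21, 10],
]


def nodes_by_proximity(distance, node):
    """Return all node ids ordered from nearest to farthest as seen from `node`."""
    return sorted(
        range(len(distance)),
        key=lambda other: (distance[node][other], other),  # ties go to lower id
    )


if __name__ == "__main__":
    for n in range(len(DISTANCE)):
        print(n, nodes_by_proximity(DISTANCE, n))
```

In this made-up layout, nodes 0 and 1 are close to each other, as are nodes 2 and 3, so a fallback allocation from node 2 would try node 3 before nodes 0 and 1.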

Memory Access Example

Suppose a thread is running on:

A CPU in Node 0

Local Access

Accesses:

Node 0 RAM

Fast.

Remote Access

Accesses:

Node 2 RAM

Slower due to:

Inter-node communication

NUMA Latency Difference

Example:

Access Type	Relative Latency
Local Memory	Low
Remote Memory	Higher (the request crosses the inter-node interconnect)
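The effect of locality on average latency can be estimated as a weighted average of local and remote access times. The sketch below uses placeholder values of 100 ns (local) and 200 ns (remote). These are assumptions for the arithmetic only, not measurements of any real system.

```python
def average_latency_ns(local_fraction, local_ns, remote_ns):
    """Weighted average latency for a given share of local accesses."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


if __name__ == "__main__":
    LOCAL_NS = 100.0   # placeholder, not a measurement
    REMOTE_NS = 200.0  # placeholder, not a measurement
    for share in (1.0, 0.75, 0.5):
        print(share, average_latency_ns(share, LOCAL_NS, REMOTE_NS))
```

With these placeholder values, dropping from 100% to 50% local accesses raises the average from 100 ns to 150 ns, which is why the rest of this article focuses on keeping accesses local.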

NUMA and Operating Systems

Operating systems must be NUMA-aware.

OS responsibilities:

NUMA-aware scheduling
Memory allocation
Process placement
Load balancing

NUMA-Aware Scheduling

The scheduler tries to:

Keep threads near their memory

Advantages:

Reduced latency
Better cache locality
Higher throughput

Example

A thread that mostly uses Node 1 memory:

Is preferably scheduled on a Node 1 CPU
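The placement rule above can be sketched as a small function. This is a toy model, not the Linux scheduler: it assumes we already know how many of a thread's pages live on each node, and the name pick_cpu and its arguments are hypothetical.

```python
def pick_cpu(pages_per_node, node_cpus, busy_cpus):
    """Choose an idle CPU, preferring the node that holds most of the thread's pages.

    pages_per_node: {node: number of the thread's pages on that node}
    node_cpus:      {node: [cpu ids on that node]}
    busy_cpus:      set of CPU ids that are already occupied
    Returns (node, cpu), or None if every CPU is busy.
    """
    # Rank nodes by how much of the thread's memory they hold (most first).
    ranked = sorted(node_cpus, key=lambda n: (-pages_per_node.get(n, 0), n))
    for node in ranked:
        for cpu in node_cpus[node]:
            if cpu not in busy_cpus:
                return node, cpu
    return None


if __name__ == "__main__":
    cpus = {0: [0, 1], 1: [2, 3]}
    pages = {0: 10, 1: 500}              # most memory is on node 1
    print(pick_cpu(pages, cpus, set()))   # a node 1 CPU is chosen
    print(pick_cpu(pages, cpus, {2, 3}))  # node 1 is full, fall back to node 0
```

The fallback to another node shows the trade-off the scheduler faces: running remotely is usually better than not running at all.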

Processor Affinity

Processor affinity is a key NUMA optimization.

CPU Affinity

A thread or process is bound to:

A specific CPU/core (or set of CPUs)

NUMA Affinity

A thread or process is bound to:

A specific NUMA node (its CPUs and, ideally, its memory)
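CPU affinity can be set from a program as well as from the command line. Below is a minimal sketch using Python's os.sched_setaffinity, which is available on Linux only. The wrapper name pin_current_process is hypothetical, and it returns None on platforms that lack the call.

```python
import os


def pin_current_process(cpus):
    """Restrict the calling process to `cpus` and return the resulting CPU set.

    Returns None on platforms without os.sched_setaffinity (it is Linux-specific).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)  # pid 0 means "the calling process"
    target = set(cpus) & allowed       # drop CPUs we are not allowed to use
    if not target:
        raise ValueError("none of the requested CPUs are available")
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    print(pin_current_process([0]))
```

Passing every CPU of one node (for example, the list read from that node's cpulist file) turns this CPU affinity into a simple form of NUMA affinity.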

Important Insight

NUMA-aware scheduling minimizes expensive remote memory accesses

NUMA Memory Allocation

The operating system attempts to:

Allocate memory close to the executing CPU

This is called the:

Local allocation policy

Example

For a thread running on Node 0:

Memory is allocated from Node 0 RAM
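With local allocation, a page is typically placed on the node of the CPU that first touches it. The simulation below sketches that behavior. It is a simplified model with hypothetical names (first_touch_placement, cpu_to_node), not kernel code.

```python
def first_touch_placement(accesses, cpu_to_node):
    """Place each page on the node of the first CPU that touches it.

    accesses:    iterable of (cpu, page) pairs in time order
    cpu_to_node: {cpu id: node id}
    Returns {page: node}.
    """
    placement = {}
    for cpu, page in accesses:
        # Only the first touch decides the placement; later touches do not move it.
        placement.setdefault(page, cpu_to_node[cpu])
    return placement


if __name__ == "__main__":
    cpu_to_node = {0: 0, 1: 0, 2: 1, 3: 1}
    accesses = [(0, "A"), (2, "B"), (3, "A"), (1, "B")]
    print(first_touch_placement(accesses, cpu_to_node))
```

One practical consequence: if a single thread initializes a large array, every page lands on that thread's node, even if worker threads on other nodes do all the later processing.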

NUMA Balancing

Modern Linux kernels perform:

Automatic NUMA balancing

The kernel monitors:

Memory access patterns

and may:

Migrate pages
Move threads

to improve locality.
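The idea behind page migration can be illustrated with a toy model: count which node accesses each page, and move the page when another node clearly dominates. This is a deliberately simplified sketch and not the kernel's actual algorithm. The function rebalance and the min_samples threshold are hypothetical.

```python
from collections import Counter


def rebalance(page_node, access_log, min_samples=4):
    """Move each page to the node that accessed it most often.

    page_node:   {page: node currently holding the page} (updated in place)
    access_log:  iterable of (page, accessing_node) samples
    min_samples: hypothetical threshold that avoids reacting to noise
    Returns a list of (page, old_node, new_node) migrations.
    """
    counts = {}
    for page, node in access_log:
        counts.setdefault(page, Counter())[node] += 1

    migrations = []
    for page, per_node in counts.items():
        if page not in page_node:
            continue  # ignore samples for pages we do not track
        best_node, hits = per_node.most_common(1)[0]
        if hits >= min_samples and best_node != page_node[page]:
            migrations.append((page, page_node[page], best_node))
            page_node[page] = best_node
    return migrations


if __name__ == "__main__":
    pages = {"p0": 0, "p1": 1}
    log = [("p0", 1)] * 5 + [("p0", 0)] + [("p1", 1)] * 5 + [("p1", 0)] * 2
    print(rebalance(pages, log))  # p0 is mostly used from node 1, so it moves
    print(pages)
```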

Linux NUMA Support

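Linux provides strong NUMA support.

Tools:

numactl (run a program under a chosen NUMA policy; numactl --hardware shows the node layout)
numastat (show per-node memory allocation statistics)
taskset (set or inspect CPU affinity)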

Example

numactl --cpunodebind=0 --membind=0 program

Purpose

Bind the process to:

The CPUs of a specific node (node 0 here)
Memory from a specific node (node 0 here)
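The same binding can be launched from a script. The sketch below builds the numactl command shown above and runs it only if numactl is installed. The helper names numactl_command and run_bound are hypothetical, and "program" stands for whatever executable you want to run.

```python
import shutil
import subprocess


def numactl_command(cpu_node, mem_node, program_args):
    """Build the argument list for running a program bound to the given nodes."""
    return [
        "numactl",
        f"--cpunodebind={cpu_node}",
        f"--membind={mem_node}",
        *program_args,
    ]


def run_bound(cpu_node, mem_node, program_args):
    """Run the program under numactl; return its exit code, or None if numactl is missing."""
    if shutil.which("numactl") is None:
        return None
    cmd = numactl_command(cpu_node, mem_node, program_args)
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    print(numactl_command(0, 0, ["program"]))
```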

NUMA Policies

Linux supports multiple NUMA allocation policies.

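1. Local Allocation

Allocate from the node where the thread is currently running. This is the default behavior (numactl --localalloc).

2. Interleaving

Spread pages round-robin across a set of nodes (numactl --interleave=<nodes>).

3. Preferred Node

Try a specific node first and fall back to other nodes if it has no free memory (numactl --preferred=<node>).

4. Strict Binding

Allocate only from the chosen nodes, with no fallback to other nodes (numactl --membind=<nodes>).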
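The difference between the four policies is easiest to see in how each one picks a node for a single page. The following is a simplified model under stated assumptions: fallback goes in node-id order rather than by distance, interleaving ignores free space, and the function choose_node with its string policy names is hypothetical.

```python
def choose_node(policy, free_pages, cpu_node, nodes=None, page_index=0):
    """Pick the node for one page under a simplified model of four policies.

    free_pages: {node: number of free pages}
    cpu_node:   node of the CPU requesting the page
    nodes:      nodes named by the policy (interleave set, preferred node, bind set)
    page_index: index of the page being allocated (used for interleaving)
    Returns a node id, or None when the policy cannot be satisfied.
    """
    def first_with_space(candidates):
        return next((n for n in candidates if free_pages.get(n, 0) > 0), None)

    all_nodes = sorted(free_pages)
    if policy == "local":
        return first_with_space([cpu_node] + all_nodes)   # local first, then fallback
    if policy == "preferred":
        return first_with_space(list(nodes) + all_nodes)  # preferred first, then fallback
    if policy == "interleave":
        return nodes[page_index % len(nodes)]             # round-robin over the set
    if policy == "bind":
        return first_with_space(nodes)                    # no fallback outside the set
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    free = {0: 0, 1: 5, 2: 5}  # node 0 is full
    print(choose_node("local", free, cpu_node=0))
    print(choose_node("bind", free, cpu_node=0, nodes=[0]))
```

Note how a full node 0 makes the local policy fall back to another node, while strict binding to node 0 fails instead.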

NUMA and Virtual Memory

The virtual memory system must track:

The physical location of each page
Which NUMA node owns each page

Page tables still operate normally, but:

Physical page placement matters

Important Insight

Virtual memory abstraction remains unchanged, but physical page locality becomes critical in NUMA systems

NUMA and Cache Coherency

Modern NUMA systems maintain:

Cache coherence

between:

Multiple processors

Challenges:

Synchronization overhead
Memory consistency traffic

NUMA and Databases

Databases are heavily affected by NUMA.

Examples:

MySQL
PostgreSQL
Oracle

Optimizations:

NUMA-aware memory pools
Local thread placement

NUMA and Virtualization

Hypervisors must manage:

NUMA-aware VM placement

Advantages:

Better VM performance
Reduced remote memory access

NUMA in Cloud Computing

Cloud servers often contain:

Large NUMA systems

Cloud schedulers optimize:

VM placement
Memory locality

NUMA and Containers

Containers may use:

CPU pinning
NUMA-aware allocation

for:

Better scalability

NUMA Challenges

1. Remote Access Penalty

Remote memory is slower than local memory.

2. Scheduling Complexity

The OS must optimize thread and memory placement.

3. Load Balancing Difficulty

Balancing locality against CPU utilization is difficult.

4. Cache Coherency Traffic

Cross-node synchronization is expensive.

False Sharing in NUMA

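Occurs when:

Multiple CPUs modify different variables that sit on the same cache line

Leads to:

Excessive coherency traffic, which is especially costly when the CPUs are on different nodes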
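Whether two variables can falsely share depends on whether their addresses fall into the same cache line. The sketch below checks this for byte offsets and computes padded offsets that give each item its own line. The 64-byte line size is an assumption (common on x86-64 but not universal), and the helper names are hypothetical.

```python
CACHE_LINE_BYTES = 64  # assumption: typical for x86-64; check the target hardware


def shares_line(offset_a, offset_b, line_size=CACHE_LINE_BYTES):
    """True if two byte offsets fall into the same cache line."""
    return offset_a // line_size == offset_b // line_size


def padded_offsets(count, item_size, line_size=CACHE_LINE_BYTES):
    """Offsets that give each of `count` items its own cache line(s)."""
    stride = -(-item_size // line_size) * line_size  # round item_size up to whole lines
    return [i * stride for i in range(count)]


if __name__ == "__main__":
    # Two 8-byte counters stored back to back share one line.
    print(shares_line(0, 8))
    # Padding each counter to its own line removes the sharing.
    print(padded_offsets(3, 8))
```

Per-thread counters are a common case: packing them tightly saves memory but lets threads on different nodes fight over one line, while padding trades memory for less coherency traffic.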

NUMA Optimization Techniques

1. Thread Pinning

Bind threads to CPUs.

2. Memory Locality Optimization

Allocate local memory.

3. Data Partitioning

Keep related data near processing cores.

4. NUMA-Aware Allocators

Use allocators that place memory on the node of the requesting thread.
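Data partitioning can be sketched as splitting a dataset into one contiguous chunk per node, so that each node's threads work only on their own chunk. This is a minimal illustration. The function partition_by_node is hypothetical, and actually placing each chunk on its node would still rely on pinning and local allocation as described earlier.

```python
def partition_by_node(items, node_ids):
    """Split items into contiguous chunks, one per node, with sizes differing by at most one."""
    base, extra = divmod(len(items), len(node_ids))
    chunks = {}
    start = 0
    for i, node in enumerate(node_ids):
        size = base + (1 if i < extra else 0)  # the first `extra` nodes get one more item
        chunks[node] = items[start:start + size]
        start += size
    return chunks


if __name__ == "__main__":
    print(partition_by_node(list(range(10)), [0, 1, 2, 3]))
```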

Advantages of NUMA

1. Better Scalability

Supports large multicore systems.

2. Reduced Memory Bottlenecks

Memory traffic is distributed across nodes.

3. Higher Throughput

Nodes can access their local memory in parallel.

4. Improved Performance

Local accesses are fast.

Disadvantages of NUMA

1. Complexity

Programming becomes harder.

2. Remote Memory Penalty

Performance varies depending on whether memory is local or remote.

3. OS Scheduling Challenges

The scheduler must be NUMA-aware.

4. Optimization Difficulty

Applications may require tuning.

Real-World Example

Suppose a database server contains:

4 NUMA nodes
256 CPU cores
1 TB RAM

Without NUMA optimization:

Frequent remote memory access
High latency
Poor scalability

With NUMA-aware scheduling:

Worker threads pinned locally
Memory allocated near CPUs
Reduced inter-node traffic
Better throughput
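For the server in this example, the per-node share of resources follows from simple division. The sketch below assumes the cores and RAM are split evenly across nodes, that 1 TB is treated as 1024 GiB, and that cores are numbered contiguously per node. Real machines may differ on all three points, and the function names are hypothetical.

```python
def node_share(total_cores, total_ram_gib, num_nodes):
    """Cores and RAM (GiB) per node, assuming an even split."""
    return total_cores // num_nodes, total_ram_gib // num_nodes


def node_of_core(core, cores_per_node):
    """Node that owns a core, assuming contiguous core numbering per node."""
    return core // cores_per_node


if __name__ == "__main__":
    cores_per_node, ram_per_node = node_share(256, 1024, 4)
    print(cores_per_node, ram_per_node)  # 64 cores and 256 GiB per node
    print(node_of_core(200, cores_per_node))
```

Under these assumptions, a worker pinned to core 200 belongs to node 3 and should keep its working set within that node's 256 GiB.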

NUMA vs SMP

Students commonly confuse:

NUMA
SMP

SMP (Symmetric Multiprocessing)

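All processors are treated equally: they share memory and run under a single operating system instance.

NUMA

Memory locality is explicitly important: each processor has faster access to its own local memory.

Modern Systems

Often combine SMP and NUMA characteristics: the OS treats processors symmetrically, while memory access times are non-uniform.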