Introduction
Modern computing systems increasingly rely on:
Multicore processors
Multi-socket CPUs
Large memory systems
Parallel workloads
Cloud infrastructure
High-performance servers
As CPU core counts increased dramatically, traditional memory architectures began facing serious scalability problems:
Memory bottlenecks
Bus contention
Increased latency
Reduced throughput
Older systems used:
UMA (Uniform Memory Access)
where all processors accessed memory with roughly equal latency.
However, as systems scaled to:
Many CPUs
Large memory capacities
UMA architectures became inefficient.
To solve this problem, modern systems introduced:
NUMA (Non-Uniform Memory Access)
NUMA is one of the most important advanced memory architecture concepts in:
Operating systems
Multicore systems
Cloud computing
Databases
High-performance computing
Linux kernel design
because it directly affects:
Memory latency
CPU performance
Scalability
Scheduling efficiency
Parallel execution
What is NUMA?
NUMA stands for:
Non-Uniform Memory Access
NUMA is a memory architecture in which memory access time depends on:
Which CPU accesses the memory
Where the memory physically resides
Core Idea
Accessing nearby memory is faster than accessing distant memory
Important Insight
In NUMA systems, memory latency varies depending on processor-memory locality
Why NUMA is Necessary
Suppose a large server contains:
Multiple CPU sockets
Hundreds of cores
Large RAM capacity
If all CPUs share a single centralized memory:
Severe contention occurs
Problems:
Memory bottlenecks
Bus congestion
Poor scalability
NUMA solves this by:
Distributing memory closer to processors
UMA vs NUMA
UMA (Uniform Memory Access)
All processors access:
Same shared memory
with:
Equal latency
NUMA
Each CPU/node has:
Local memory
Access speed depends on:
Memory location
Comparison Table
| Feature | UMA | NUMA |
|---|---|---|
| Memory latency | Uniform | Non-uniform |
| Scalability | Limited | Better |
| Processor locality | Less important | Critical |
| Large-system efficiency | Lower | Higher |
NUMA Node
Very important concept.
A NUMA system is divided into:
NUMA nodes
Each node contains:
CPU cores
Local memory
Memory controller
Example
A server may contain:
Node 0
Node 1
Node 2
Node 3
Each with:
Own RAM region
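On Linux, the node layout described above is exposed under /sys/devices/system/node, where each nodeN directory has a cpulist file such as "0-3,8-11". The following is a minimal sketch that reads it. The function names parse_cpulist and list_numa_nodes are hypothetical helpers, and the root parameter exists only so the code can be pointed at a different directory.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a kernel CPU list such as '0-3,8-11' into a sorted list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means a node with memory but no CPUs
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return sorted(cpus)


def list_numa_nodes(root="/sys/devices/system/node"):
    """Return {node_id: [cpu, ...]} for every nodeN directory under root."""
    nodes = {}
    for path in glob.glob(os.path.join(root, "node[0-9]*")):
        match = re.fullmatch(r"node(\d+)", os.path.basename(path))
        if not match:
            continue
        with open(os.path.join(path, "cpulist")) as f:
            nodes[int(match.group(1))] = parse_cpulist(f.read())
    return nodes
```

Calling list_numa_nodes() on a NUMA machine should return one entry per node. On a system without that sysfs directory it returns an empty dict.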
Local Memory Access
When CPU accesses:
Its own node memory
Result:
Low latency
High bandwidth
Remote Memory Access
When CPU accesses:
Another node’s memory
Result:
Higher latency
Lower performance
Important Insight
NUMA performance depends heavily on maximizing local memory access
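One rough way to check locality on Linux is the per-node numastat file (/sys/devices/system/node/nodeN/numastat), which includes local_node and other_node counters. These count page allocations, not individual memory accesses, so the ratio below is only a proxy. The helper names are hypothetical.

```python
def parse_numastat(text):
    """Parse 'name value' lines from a per-node numastat file into a dict."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            counters[fields[0]] = int(fields[1])
    return counters


def local_allocation_ratio(counters):
    """Fraction of this node's page allocations requested by its own CPUs.

    Returns None when both counters are zero or missing.
    """
    local = counters.get("local_node", 0)
    other = counters.get("other_node", 0)
    total = local + other
    return local / total if total else None
```

A ratio close to 1.0 suggests most pages on the node were allocated by threads running there. A low ratio suggests other nodes are allocating from it.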
NUMA Memory Topology
Memory physically distributed across:
Multiple nodes
Nodes connected through:
High-speed interconnects
Examples:
Intel UPI
AMD Infinity Fabric
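Linux summarizes the topology as a node distance table (each /sys/devices/system/node/nodeN/distance file holds one row, with 10 meaning local). The sketch below parses such rows and finds the closest remote node. The function names are hypothetical, and it assumes node ids are contiguous starting at 0.

```python
def parse_distance_rows(rows):
    """Turn rows such as ['10 21', '21 10'] into [[10, 21], [21, 10]]."""
    return [[int(value) for value in row.split()] for row in rows]


def nearest_remote_node(matrix, node):
    """Return the remote node with the smallest distance from `node`.

    Ties go to the lowest node id. Returns None on a single-node system.
    """
    candidates = [
        (distance, other)
        for other, distance in enumerate(matrix[node])
        if other != node
    ]
    return min(candidates)[1] if candidates else None
```

The distances are relative cost hints reported by firmware, not measured latencies.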
Memory Access Example
Suppose:
CPU in Node 0
Local Access
Accesses:
Node 0 RAM
Fast.
Remote Access
Accesses:
Node 2 RAM
Slower due to:
Inter-node communication
NUMA Latency Difference
Example:
| Access Type | Approx Latency |
|---|---|
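| Local Memory | Low (served by the node's own memory controller) |
| Remote Memory | Higher (the request also crosses the inter-node interconnect) |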
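The effect of this latency gap can be estimated with a weighted average over local and remote accesses. This is a simplified model that ignores caches and contention, and any latency figures passed in are the caller's own measurements or assumptions.

```python
def average_latency_ns(local_fraction, local_ns, remote_ns):
    """Weighted average latency for a mix of local and remote accesses."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Hypothetical figures for illustration only: 100 ns local, 200 ns remote.
mostly_local = average_latency_ns(0.9, 100.0, 200.0)  # about 110 ns
half_remote = average_latency_ns(0.5, 100.0, 200.0)   # 150 ns
```

With these assumed numbers, moving from 90% to 50% local accesses raises the average latency by roughly a third.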
NUMA and Operating Systems
Operating systems must become:
NUMA-aware
OS responsibilities:
NUMA-aware scheduling
Memory allocation
Process placement
Load balancing
NUMA-Aware Scheduling
Scheduler tries to:
Keep threads near their memory
Advantages:
Reduced latency
Better cache locality
Higher throughput
Example
Thread using Node 1 memory:
Preferably scheduled on Node 1 CPU
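The placement rule in this example can be written as a toy function: given how many of a thread's pages sit on each node, prefer the node holding the most. This is only an illustration of the idea, not how the Linux scheduler is implemented, and preferred_node is a hypothetical name.

```python
def preferred_node(pages_per_node):
    """Pick the node holding most of a thread's pages (ties go to the lowest id).

    pages_per_node maps node id -> number of the thread's pages on that node.
    """
    if not pages_per_node:
        return None
    return min(pages_per_node, key=lambda node: (-pages_per_node[node], node))
```

For a thread with pages {0: 10, 1: 500, 2: 20}, the function picks node 1, matching the example above.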
Processor Affinity
Very important NUMA optimization.
CPU Affinity
Thread/process prefers:
Specific CPU/core
NUMA Affinity
Thread/process prefers:
Specific NUMA node
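CPU affinity can be set from Python on Linux with os.sched_setaffinity. The sketch below wraps it in a hypothetical helper, pin_current_process, which returns None on platforms that lack the call. Passing all CPUs of one node approximates NUMA affinity for the CPU side only. It does not set a memory policy.

```python
import os


def pin_current_process(cpus):
    """Restrict the calling process to `cpus` and return the resulting CPU set.

    Returns None where os.sched_setaffinity is unavailable (non-Linux systems).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return os.sched_getaffinity(0)
```

With the default local allocation policy, pages that the process touches for the first time after pinning will normally come from the node those CPUs belong to.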
Important Insight
NUMA-aware scheduling minimizes expensive remote memory accesses
NUMA Memory Allocation
Operating system attempts:
Allocate memory on the node of the CPU that first touches it
Called:
Local allocation policy
Example
Thread running on Node 0:
Memory allocated from Node 0 RAM
NUMA Balancing
Modern Linux kernels perform:
Automatic NUMA balancing
Kernel monitors:
Memory access patterns
and may:
Migrate pages
Move threads
to improve locality.
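On Linux, automatic NUMA balancing is controlled by the kernel.numa_balancing sysctl, readable at /proc/sys/kernel/numa_balancing (0 means disabled, a nonzero value means some balancing mode is on). A minimal sketch for checking it follows. The helper name and the path parameter are assumptions made for illustration.

```python
def numa_balancing_mode(path="/proc/sys/kernel/numa_balancing"):
    """Return the integer stored in the sysctl file, or None if unavailable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (FileNotFoundError, PermissionError, ValueError):
        return None
```

A return value of None usually means the kernel was built without NUMA balancing support or the file is not readable.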
Linux NUMA Support
Linux provides strong NUMA support.
Tools:
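numactl (run a program with a chosen CPU and memory placement policy)
numastat (show per-node memory allocation statistics)
taskset (set CPU affinity only; it does not control memory placement)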
Example
numactl --cpunodebind=0 --membind=0 program
Purpose
Bind process to:
Specific CPU node
Specific memory node
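When launching programs from a script, the same binding can be built as an argument list. The sketch below only constructs the command shown above. numactl_command is a hypothetical helper, and it assumes numactl is installed when the command is eventually run.

```python
def numactl_command(node, program_argv):
    """Build an argv list that binds both CPUs and memory to one NUMA node."""
    return [
        "numactl",
        f"--cpunodebind={node}",
        f"--membind={node}",
        *program_argv,
    ]
```

The resulting list can be passed to subprocess.run, for example subprocess.run(numactl_command(0, ["./program"]), check=True).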
NUMA Policies
Linux supports multiple NUMA allocation policies.
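1. Local Allocation
Allocate from the node of the CPU that first touches the page (the Linux default; numactl --localalloc).
2. Interleaving
Spread pages round-robin across a set of nodes (MPOL_INTERLEAVE; numactl --interleave).
3. Preferred Node
Try a specific node first and fall back to other nodes if it is full (MPOL_PREFERRED; numactl --preferred).
4. Strict Binding
Restrict allocations to the chosen nodes, with no fallback outside them (MPOL_BIND; numactl --membind).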
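The difference between the four policies is easiest to see in a toy page-placement model. The sketch below is not kernel code. The policy strings, the place_pages name, and the free-page bookkeeping are all assumptions made for illustration, and fallback order is simplified to ascending node id.

```python
def place_pages(policy, num_pages, cpu_node, policy_nodes, free):
    """Return the node chosen for each page under a simplified policy model.

    policy       -- "local", "interleave", "preferred", or "bind"
    cpu_node     -- node of the CPU doing the allocation
    policy_nodes -- list of nodes the policy names (unused for "local")
    free         -- {node: free pages}; copied, so the caller's dict is untouched
    """
    free = dict(free)
    placement = []
    for i in range(num_pages):
        if policy == "local":
            order = [cpu_node] + [n for n in sorted(free) if n != cpu_node]
        elif policy == "interleave":
            k = i % len(policy_nodes)  # rotate the starting node per page
            order = policy_nodes[k:] + policy_nodes[:k]
        elif policy == "preferred":
            first = policy_nodes[0]
            order = [first] + [n for n in sorted(free) if n != first]
        elif policy == "bind":
            order = list(policy_nodes)  # never leaves the bound set
        else:
            raise ValueError(f"unknown policy: {policy}")
        node = next((n for n in order if free.get(n, 0) > 0), None)
        if node is None:
            raise MemoryError("no free page on the allowed nodes")
        free[node] -= 1
        placement.append(node)
    return placement
```

In this model, "local" and "preferred" spill to other nodes when their first choice is full, while "bind" fails instead of leaving its node set.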
NUMA and Virtual Memory
Virtual memory system must track:
Physical page location
NUMA node ownership
Page tables still operate normally, but:
Physical page placement matters
Important Insight
Virtual memory abstraction remains unchanged, but physical page locality becomes critical in NUMA systems
NUMA and Cache Coherency
Modern NUMA systems maintain:
Cache coherence in hardware (ccNUMA)
between:
Multiple processors
Challenges:
Synchronization overhead
Memory consistency traffic
NUMA and Databases
Databases are heavily affected by NUMA.
Examples:
MySQL
PostgreSQL
Oracle
Optimizations:
NUMA-aware memory pools
Local thread placement
NUMA and Virtualization
Hypervisors must manage:
NUMA-aware VM placement
Advantages:
Better VM performance
Reduced remote memory access
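A simple form of NUMA-aware VM placement is to check whether a VM fits entirely inside one node. The sketch below is a toy check under that assumption. The function name and the (free CPUs, free memory) bookkeeping are hypothetical and do not reflect any particular hypervisor.

```python
def nodes_that_fit(vcpus, mem_gb, free_by_node):
    """Return the nodes with enough free CPUs and memory to hold the whole VM.

    free_by_node maps node id -> (free CPUs, free memory in GB).
    An empty result means the VM would have to span several nodes.
    """
    return sorted(
        node
        for node, (free_cpus, free_mem_gb) in free_by_node.items()
        if free_cpus >= vcpus and free_mem_gb >= mem_gb
    )
```

A VM that fits on one node can keep all of its memory accesses local. One that spans nodes will see some remote accesses however it is placed.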
NUMA in Cloud Computing
Cloud servers often contain:
Large NUMA systems
Cloud schedulers optimize:
VM placement
Memory locality
NUMA and Containers
Containers may use:
CPU pinning
NUMA-aware allocation
for:
Better scalability
NUMA Challenges
1. Remote Access Penalty
Remote memory is slower than local memory.
2. Scheduling Complexity
The OS must place threads and their memory together.
3. Load Balancing Difficulty
Keeping threads near their memory can conflict with keeping all CPUs busy.
4. Cache Coherency Traffic
Cross-node synchronization is expensive.
False Sharing in NUMA
Occurs when:
Multiple CPUs modify different variables that share the same cache line
Leads to:
Excessive coherency traffic, which is costlier when the CPUs sit on different nodes
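Whether two variables can falsely share depends on whether their addresses fall in the same cache line. The sketch below checks that and computes padded offsets that avoid it. The 64-byte line size is an assumption (common on x86-64, but not universal), and the helper names are hypothetical.

```python
CACHE_LINE_BYTES = 64  # assumption: typical line size, check your hardware


def same_cache_line(addr_a, addr_b, line_bytes=CACHE_LINE_BYTES):
    """True if two byte addresses fall within the same cache line."""
    return addr_a // line_bytes == addr_b // line_bytes


def padded_offsets(count, item_bytes, line_bytes=CACHE_LINE_BYTES):
    """Byte offsets that give each item its own cache line(s)."""
    stride = -(-item_bytes // line_bytes) * line_bytes  # round up to whole lines
    return [i * stride for i in range(count)]
```

Two 8-byte counters at offsets 0 and 8 share a line, while the padded layout for the same counters places them at offsets 0 and 64.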
NUMA Optimization Techniques
1. Thread Pinning
Bind threads to CPUs.
2. Memory Locality Optimization
Allocate memory on the node where it will be used.
3. Data Partitioning
Keep related data near processing cores.
4. NUMA-Aware Allocators
Use allocators that serve each thread from its own node's memory.
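Data partitioning can be as simple as giving each node one contiguous slice of the work, so that the threads on a node mostly touch their own slice. A minimal sketch, with partition_by_node as a hypothetical name:

```python
def partition_by_node(num_items, num_nodes):
    """Split range(num_items) into contiguous, near-equal chunks, one per node."""
    base, extra = divmod(num_items, num_nodes)
    chunks = []
    start = 0
    for node in range(num_nodes):
        size = base + (1 if node < extra else 0)  # first `extra` nodes get one more
        chunks.append(range(start, start + size))
        start += size
    return chunks
```

The locality benefit appears only if each chunk is also allocated and processed by threads pinned to the matching node.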
Advantages of NUMA
1. Better Scalability
Supports large multicore systems.
2. Reduced Memory Bottlenecks
Memory traffic is spread across per-node memory controllers.
3. Higher Throughput
Nodes can serve memory requests in parallel.
4. Improved Performance
Local accesses are fast because they avoid the interconnect.
Disadvantages of NUMA
1. Complexity
Programming becomes harder.
2. Remote Memory Penalty
Performance varies with where the data resides.
3. OS Scheduling Challenges
Requires NUMA awareness.
4. Optimization Difficulty
Applications may require tuning.
Real-World Example
Suppose a database server contains:
4 NUMA nodes
256 CPU cores
1 TB RAM
Without NUMA optimization:
Frequent remote memory access
High latency
Poor scalability
With NUMA-aware scheduling:
Worker threads pinned locally
Memory allocated near CPUs
Reduced inter-node traffic
Better throughput achieved
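The layout of this example server can be worked out with simple arithmetic, assuming the cores and RAM are split evenly across nodes and that 1 TB is treated as 1024 GiB. The helper names are hypothetical, and worker_node further assumes that CPU ids are numbered contiguously per node, which real machines do not guarantee.

```python
def per_node_share(total_cores, total_ram_gib, num_nodes):
    """Return (cores per node, RAM in GiB per node) for an evenly split system."""
    cores, remainder = divmod(total_cores, num_nodes)
    if remainder:
        raise ValueError("cores do not divide evenly across nodes")
    return cores, total_ram_gib / num_nodes


def worker_node(worker_id, cores_per_node):
    """Node for a worker when one worker is pinned per core, in id order."""
    return worker_id // cores_per_node


cores_per_node, ram_per_node = per_node_share(256, 1024, 4)  # 64 cores, 256 GiB
```

Under these assumptions each node holds 64 cores and 256 GiB, so workers 0 to 63 would be pinned to node 0, workers 64 to 127 to node 1, and so on.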
NUMA vs SMP
Students commonly confuse:
NUMA
SMP
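SMP (Symmetric Multiprocessing)
Processors treated equally: any processor can run any task under a single OS. The term describes how processors are treated, not how memory is laid out.
NUMA
Memory locality explicitly important: the term describes how memory is laid out and what each access costs.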
Modern Systems
Often combine:
SMP + NUMA characteristics