Linux large pages and non-uniform memory distribution
Posted by Martin Bach on May 17, 2013
In my last post about large pages I promised a little more background information on how large pages and NUMA are related.
Background and some history about processor architecture
For quite some time now the CPUs you get from AMD and Intel have both been NUMA, or more precisely: cache coherent NUMA CPUs. They all have their own “local” memory directly attached to them; in other words, the memory distribution is not uniform across all CPUs. This isn’t really new: Sequent pioneered the concept on x86 a long time ago, albeit in a different context. You really should read Scaling Oracle 8i by James Morle, which has a lot of excellent NUMA-related content in it, with contributions from Kevin Closson. It doesn’t matter that the title says “8i”; most of it is as relevant today as it was then.
So what is the big deal about the NUMA architecture anyway? To explain NUMA and why it is important to all of us, a little more background information is in order.
Some time ago, processor designers and architects of industry standard hardware could no longer ignore the fact that the front side bus (FSB) had become a bottleneck. There were two reasons for this: it was a) too slow and b) too much data had to travel over it. As one direct consequence, DRAM is now attached directly to the CPUs. AMD did this first with its Opteron processors in the AMD64 micro-architecture, followed by Intel’s Nehalem micro-architecture. By removing the requirement for data retrieved from DRAM to travel across a slow shared bus, latencies could be reduced considerably.
Now imagine that every processor has a number of memory channels to which DDR3 (DDR4 could arrive soon!) SDRAM is attached. In a dual socket system, each socket is responsible for half the memory of the system. To allow one socket to access the other socket’s half of memory, some kind of interconnect between processors is needed. Intel has opted for the Quick Path Interconnect, AMD (and IBM for the p-Series) use HyperTransport. This is (comparatively) simple when you have few sockets: with up to four, each socket can connect directly to every other one without any tricks. With eight sockets it becomes more difficult. If every socket can communicate directly with its peers, the system is said to be glue-less, which is beneficial. The last production glue-less system Intel released was based on the Westmere architecture. Sandy Bridge (current until approximately Q3/2013) didn’t have an eight-way glue-less variant, and this is exactly why you get Westmere-EX in the X3-8, and not Sandy Bridge as in the X3-2.
Anyway, your system will have local and remote memory. Most of us will not notice this at all, since there is little point in enabling NUMA on systems with only two sockets. Oracle still recommends that you only enable NUMA on eight-way systems, and this is probably the reason the oracle-validated and preinstall RPMs add “numa=off” to the kernel command line in your GRUB boot loader.
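It is easy to check whether your system currently boots this way. A quick sketch, assuming Oracle Linux/Red Hat Enterprise Linux 6 with the legacy GRUB boot loader (adjust the path if your distribution keeps the configuration elsewhere):

[root@ol62 ~]# grep numa=off /boot/grub/grub.conf
[root@ol62 ~]# cat /proc/cmdline

If numa=off shows up in the kernel line or in /proc/cmdline, the running kernel treats all memory as a single node regardless of the hardware topology.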
Booting with NUMA enabled
The easiest way to boot with NUMA enabled is to get to your ILOM and boot the server. As soon as the GRUB line (“booting … in x seconds”) appears, hit a key. You will be dropped into the GRUB menu. It should highlight the default boot entry (Oracle Linux Server (2.6.39-400.xxx x86-64)). Hit the “e” key to edit the directives. You should see something like this now:
root (hd0,0)
kernel /vmlinuz-2.6.39-400.xxx ....
initrd /initramfs-2.6.39-400.xxx
Move the cursor to the line starting with kernel, then hit “e” again. The cursor will move to the end of the line, where you will find the numa=off directive. Hit the backspace key to remove numa=off, then hit return (it will bring you back to the previous screen with the three directives), then “b” to boot this configuration.
This is useful because it doesn’t involve editing the GRUB menu file, and if something should break you can simply restart and be back in a known good configuration.
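Once you are happy with the result you can make the change permanent by removing the directive from the GRUB configuration. A minimal sketch, again assuming legacy GRUB on Oracle Linux 6; take a backup first and double-check that the kernel line you edit is the one you actually boot from:

[root@ol62 ~]# cp /boot/grub/grub.conf /boot/grub/grub.conf.bak
[root@ol62 ~]# sed -i 's/ numa=off//' /boot/grub/grub.conf
[root@ol62 ~]# grep numa /boot/grub/grub.conf

The grep at the end should come back empty; from the next reboot onwards the kernel will detect the NUMA topology on its own.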
Now when you log in as root you will notice that NUMA is turned on!
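A quick way to confirm is to look at the boot messages; the exact wording differs between kernel versions, so treat the pattern as an approximation:

[root@ol62 ~]# dmesg | grep -i numa

With NUMA enabled you should see messages about the nodes being initialised; when booted with numa=off the kernel typically reports that NUMA has been turned off.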
Signs of NUMA
My lab server is a dual socket workstation with AMD Opteron 6238 processors and 32 GB of RAM. To see the effect of NUMA, you can make use of the numactl tool:
[root@ol62 ~]# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 8190 MB
node 0 free: 1637 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 8192 MB
node 1 free: 1732 MB
node 2 cpus: 12 13 14 15 16 17
node 2 size: 8192 MB
node 2 free: 1800 MB
node 3 cpus: 18 19 20 21 22 23
node 3 size: 8176 MB
node 3 free: 1745 MB
node distances:
node   0   1   2   3
  0:  10  16  16  16
  1:  16  10  16  16
  2:  16  16  10  16
  3:  16  16  16  10
You need to know that, since the 6100 series, Opterons report twice as many NUMA nodes as there are sockets: these processors combine two dies in the same package. Each of my sockets has 12 cores, or more precisely: modules. AMD’s module cores are somewhere between HyperThreads and full cores; to which extent I can’t tell. The server reports 24 CPUs in any case.
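If you want to cross-check how the operating system sees the topology, lscpu (part of util-linux) summarises sockets, cores and NUMA nodes in one place. The exact field names depend on your util-linux version:

[root@ol62 ~]# lscpu | grep -i -e socket -e numa

On this box the output should agree with numactl: two physical sockets presenting four NUMA nodes.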
My configuration has allocated 12295 large pages at boot time, or roughly 24 GB out of the 32 GB available. The effect is visible in the first half of the numactl output: each node shows little free memory because most of it is taken up by large pages. Luckily the memory has been requested evenly across all NUMA nodes.
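You can also query the exact per-node figure from sysfs. A minimal sketch, assuming 2 MB large pages (kernels of this vintage expose a hugepages subdirectory per node):

[root@ol62 ~]# cat /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages

Ideally the four numbers are (nearly) identical, confirming that the allocation is balanced.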
The second part of the numactl output gives you the node distances in a matrix. The numbers are provided by the firmware at boot time in the form of the ACPI System Locality Information Table (SLIT) and cannot be changed. They indicate the relative cost of accessing memory on another node: 10 is the base value for local access, and higher values indicate more overhead.
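The practical effect of these distances is easy to demonstrate with numactl itself: you can run a process on one node’s CPUs while deliberately binding its memory to another node. A sketch, where ./some_benchmark is a placeholder for any memory-intensive test program of your choosing:

[root@ol62 ~]# numactl --cpunodebind=0 --membind=0 ./some_benchmark
[root@ol62 ~]# numactl --cpunodebind=0 --membind=1 ./some_benchmark

The first invocation uses only local memory, the second forces every access across the interconnect; the difference in runtime reflects the distance matrix shown above.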
NUMA in SYSFS
The sysfs pseudo file system is set to replace parts of the venerable /proc file system. sysfs exports more information than /proc does, which is apparent when it comes to memory allocation per NUMA node. The per-node NUMA statistics are in /sys/devices/system/node/node*
Two files are of interest: numastat and meminfo. I won’t go into detail on numastat (yet another post will follow), but meminfo is interesting.
[root@ol62 node0]# cat meminfo
Node 0 MemTotal:        8386572 kB
Node 0 MemFree:         1685988 kB
Node 0 MemUsed:         6700584 kB
Node 0 Active:            10516 kB
Node 0 Inactive:          12704 kB
Node 0 Active(anon):       2656 kB
Node 0 Inactive(anon):        0 kB
Node 0 Active(file):       7860 kB
Node 0 Inactive(file):    12704 kB
Node 0 Unevictable:        1172 kB
Node 0 Mlocked:            1172 kB
Node 0 Dirty:                 0 kB
Node 0 Writeback:             0 kB
Node 0 FilePages:         21276 kB
Node 0 Mapped:             2960 kB
Node 0 AnonPages:          3156 kB
Node 0 Shmem:               116 kB
Node 0 KernelStack:        1384 kB
Node 0 PageTables:          528 kB
Node 0 NFS_Unstable:          0 kB
Node 0 Bounce:                0 kB
Node 0 WritebackTmp:          0 kB
Node 0 Slab:              23788 kB
Node 0 SReclaimable:       5652 kB
Node 0 SUnreclaim:        18136 kB
Node 0 AnonHugePages:         0 kB
Node 0 HugePages_Total:    3074
Node 0 HugePages_Free:     3074
Node 0 HugePages_Surp:        0
This file is similar to /proc/meminfo, but only covers node0, i.e. the first 6 “cores” on my system. Here you can see the large page allocation on this node: 3074 pages, roughly a quarter of the 12295 pages requested in total.
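Instead of changing into every node directory you can check the distribution across all nodes in one go, using the same meminfo files:

[root@ol62 ~]# grep HugePages_Total /sys/devices/system/node/node*/meminfo

If the large pages were requested evenly, each node should report (roughly) the same total.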
Why does this matter?
When you are consolidating lots of environments onto a system with lots of sockets, you should try to maintain memory locality. Keep instances on a socket if possible; today’s servers can take a lot of memory, and you shouldn’t have to use remote memory, thus avoiding the additional latency. I personally would use control groups to ensure my instances stay where I want them to stay. There are other ways to control memory distribution (see some of the SLOB examples), but cgroups are by far the most elegant.
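To illustrate the idea, a cpuset cgroup can pin a set of processes to one node’s CPUs and memory. This is only a sketch: the mount point and group name (db1) are made up for the example, the cpuset controller may already be mounted by the cgconfig service on Oracle Linux 6, and a real database instance would be placed in the group through its startup environment rather than an interactive shell:

[root@ol62 ~]# mkdir -p /cgroup/cpuset
[root@ol62 ~]# mount -t cgroup -o cpuset none /cgroup/cpuset
[root@ol62 ~]# mkdir /cgroup/cpuset/db1
[root@ol62 ~]# echo 0-5 > /cgroup/cpuset/db1/cpuset.cpus
[root@ol62 ~]# echo 0 > /cgroup/cpuset/db1/cpuset.mems
[root@ol62 ~]# echo $$ > /cgroup/cpuset/db1/tasks

Every process started from this shell now runs on node 0’s CPUs and satisfies its memory allocations from node 0 only.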
Using NUMA on your system and leaving the memory distribution to chance will lead to difficult-to-predict performance. You might even run out of memory on a local node, causing unexpected problems. As with everything, understanding and tuning a configuration is the way to go! I will run a few benchmarks next to demonstrate the difference between local and remote memory access. Unfortunately I don’t have a 4-way system available for these tests; normally you wouldn’t really worry about NUMA settings on fewer than four sockets.
Don’t go and rush your systems into NUMA! Like I said, there is little to be gained for the roughly 80% of all servers out there that are dual-socket systems. Four-way servers might be candidates for NUMA; eight-way servers definitely are. By saying candidates I mean: if you understand NUMA and how it can affect your application, have really load tested it, and only if it proved to deliver predictable, stable performance, then I would think about enabling NUMA for a production workload. There is nothing like thorough testing to tell you how your application will perform. I guess all I want to say is that turning on NUMA can have a negative performance impact as well, or even crash your Oracle instance if the memory on a NUMA node is depleted. Search MOS for NUMA to get more information.