Linux large pages and non-uniform memory distribution

In my last post about large pages in 11.2.0.3 I promised a little more background information on how large pages and NUMA are related.

Background and some history about processor architecture

For quite some time now, the CPUs you get from both AMD and Intel are NUMA, or more precisely: cache-coherent NUMA CPUs. Each of them has its own “local” memory directly attached; in other words, the memory distribution is not uniform across all CPUs. This isn’t really new: Sequent pioneered the concept on x86 a long time ago, albeit in a different context. You really should read Scaling Oracle 8i by James Morle, which contains a lot of excellent NUMA-related material, with contributions from Kevin Closson. It doesn’t matter that the title says “8i”; most of it is as relevant today as it was then.

So what is the big deal about NUMA architecture anyway? To explain NUMA and why it is important to all of us, a little more background information is in order.

Some time ago, processor designers and architects of industry-standard hardware could no longer ignore the fact that the front side bus (FSB) had become a bottleneck. There were two reasons for this: it was a) too slow and b) too much data had to travel over it. As a direct consequence, DRAM was attached directly to the CPUs. AMD did this first with its Opteron processors in the AMD64 micro-architecture, followed by Intel’s Nehalem micro-architecture. By removing the requirement for data retrieved from DRAM to travel across a slow shared bus, latencies could be reduced.

Now imagine that every processor has a number of memory channels to which DDR3 (DDR4 could arrive soon!) SDRAM is attached. In a dual-socket system, each socket is responsible for half the memory of the system. To allow the other socket to access the corresponding other half of memory, some kind of interconnect between processors is needed. Intel has opted for the Quick Path Interconnect (QPI), AMD (and IBM for the p-Series) use HyperTransport. This is comparatively simple when you have few sockets: with up to four, each socket can connect directly to every other one without any tricks. With eight sockets it becomes more difficult. If every socket can still communicate directly with its peers, the system is said to be glue-less, which is beneficial. The last production glue-less system Intel released was based on the Westmere architecture. Sandy Bridge (current until approximately Q3/2013) didn’t have an eight-way glue-less variant, and this is exactly why you get Westmere-EX in the X3-8, and not Sandy Bridge as in the X3-2.

Anyway, your system will have local and remote memory. Most of us are not going to notice this at all since there is little point in enabling NUMA on systems with two sockets. Oracle still recommends that you only enable NUMA on 8-way systems, and this is probably the reason the oracle-validated and preinstall RPMs add “numa=off” to the kernel command line in your GRUB boot loader.
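
To check whether your system was booted this way, look at the running kernel’s command line and at the boot loader configuration. The path below assumes GRUB legacy as used by Oracle Linux 5 and 6; adjust it if your setup differs:

grep -o numa=off /proc/cmdline
grep numa=off /boot/grub/grub.conf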

Booting with NUMA enabled

The easiest way to boot with NUMA enabled is to get to your ILOM and boot the server. As soon as the GRUB line (“booting … in x seconds”) appears, hit a key. You will be dropped into the GRUB menu. It should highlight the default boot entry (Oracle Linux Server (2.6.39.400…x86-64)). Hit the “e” key to edit the directives. You should see something like this now:

root (hd0,0)
kernel /vmlinuz-2.6.39-400.xxx ....
initrd /initramfs-2.6.39-400.xxx

Move the cursor to the line starting with kernel, then hit “e” again. The cursor will move to the end of the line, where you will find the numa=off directive. Hit the backspace key to remove numa=off, then hit return (it will bring you back to the previous three directives), then “b” to boot this configuration.

This is useful because it doesn’t involve editing the GRUB menu file, and if something should break you can simply restart and are back in a known good configuration.
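
If after successful testing you want to make the change permanent, the directive can also be removed from the GRUB configuration file. A minimal sketch, again assuming GRUB legacy; take a backup first:

cp /boot/grub/grub.conf /boot/grub/grub.conf.bak
sed -i 's/ numa=off//' /boot/grub/grub.conf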

Now when you log in as root you will notice that NUMA is turned on!
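
A quick way to verify this is to check the kernel ring buffer and the node directories exported in sysfs; with NUMA enabled you should see more than just node0:

dmesg | grep -i numa
ls -d /sys/devices/system/node/node*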

Signs of NUMA

My lab server is a dual-socket workstation with AMD Opteron 6238 processors and 32 GB of RAM. To see the effect of NUMA, you can make use of the numactl tool:

[root@ol62 ~]# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 8190 MB
node 0 free: 1637 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 8192 MB
node 1 free: 1732 MB
node 2 cpus: 12 13 14 15 16 17
node 2 size: 8192 MB
node 2 free: 1800 MB
node 3 cpus: 18 19 20 21 22 23
node 3 size: 8176 MB
node 3 free: 1745 MB
node distances:
node   0   1   2   3
  0:  10  16  16  16
  1:  16  10  16  16
  2:  16  16  10  16
  3:  16  16  16  10

You need to know that since the 6100 series, Opterons report twice as many NUMA nodes as there are sockets. These processors are multi-chip modules: two dies in the same package, each acting as its own NUMA node. Each of the sockets has 12 cores or better: modules. AMD’s modules are somewhere between Hyper-Threads and full cores; to what extent I can’t tell. The server reports 24 CPUs in any case.
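
You can cross-check the socket-to-node ratio with lscpu (part of the util-linux-ng package); on a box like this it should report two sockets but four NUMA nodes:

lscpu | grep -iE 'socket|numa'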

My configuration has allocated 12295 large pages at boot time, or roughly 24 GB out of the 32 GB available. You can infer from the free values in the first half of the output how the memory has been allocated per NUMA node. Luckily the memory has been requested evenly across all NUMA nodes.
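
Assuming the pages were reserved via vm.nr_hugepages in /etc/sysctl.conf, which is the usual approach, you can compare the requested figure with what was actually allocated:

grep vm.nr_hugepages /etc/sysctl.conf
grep HugePages /proc/meminfo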

The second part of the numactl output gives you the node distances in a matrix. The numbers come from the ACPI System Locality Information Table (SLIT), which is provided to the operating system by the firmware at boot time and cannot be changed. They indicate the relative cost of accessing remote memory: 10 is the base value for local access, higher values indicate more overhead.
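
The same information is exported per node in sysfs; reading node 0’s distance vector, for example, returns the first row of the matrix shown above:

cat /sys/devices/system/node/node0/distance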

NUMA in SYSFS

The sys pseudo file system (sysfs) is set to replace the venerable /proc file system. Sysfs exports more information than /proc does, which is apparent when it comes to memory allocation per NUMA node. Per-node NUMA statistics can be found in /sys/devices/system/node/node*

Two files are of interest, numastat and meminfo. I won’t go into detail about numastat (yet another post will follow), but meminfo is interesting:

[root@ol62 node0]# cat meminfo
Node 0 MemTotal:        8386572 kB
Node 0 MemFree:         1685988 kB
Node 0 MemUsed:         6700584 kB
Node 0 Active:            10516 kB
Node 0 Inactive:          12704 kB
Node 0 Active(anon):       2656 kB
Node 0 Inactive(anon):        0 kB
Node 0 Active(file):       7860 kB
Node 0 Inactive(file):    12704 kB
Node 0 Unevictable:        1172 kB
Node 0 Mlocked:            1172 kB
Node 0 Dirty:                 0 kB
Node 0 Writeback:             0 kB
Node 0 FilePages:         21276 kB
Node 0 Mapped:             2960 kB
Node 0 AnonPages:          3156 kB
Node 0 Shmem:               116 kB
Node 0 KernelStack:        1384 kB
Node 0 PageTables:          528 kB
Node 0 NFS_Unstable:          0 kB
Node 0 Bounce:                0 kB
Node 0 WritebackTmp:          0 kB
Node 0 Slab:              23788 kB
Node 0 SReclaimable:       5652 kB
Node 0 SUnreclaim:        18136 kB
Node 0 AnonHugePages:         0 kB
Node 0 HugePages_Total:  3074
Node 0 HugePages_Free:   3074
Node 0 HugePages_Surp:      0

This file is similar to /proc/meminfo but only relevant for node0, i.e. the first 6 “cores” on my system. Here you can see the large page allocation on this node.
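
To check the large page distribution across all nodes in one go, you can grep the per-node meminfo files:

grep HugePages_Total /sys/devices/system/node/node*/meminfo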

Why does this matter

When you are consolidating lots of environments onto a system with lots of sockets, you should try to stick to memory locality. Keep instances on a socket if possible: today’s servers can take a lot of memory, and you shouldn’t have to use remote memory, thus avoiding the additional latency. I personally would use control groups to ensure my instances stay where I want them to stay. There are other ways to control memory distribution (see some of the SLOB examples), but cgroups are by far the most elegant.
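
A minimal sketch of the cpuset approach, using the legacy cgroup filesystem as found on Oracle Linux 6. The mount point, cgroup name and CPU/node numbers below are examples only and must match your topology; pinning an instance to NUMA node 0 could look like this:

mkdir -p /cgroup/cpuset
mount -t cgroup -o cpuset none /cgroup/cpuset        # skip if already mounted by cgconfig
mkdir /cgroup/cpuset/oracle_node0
echo 0-5 > /cgroup/cpuset/oracle_node0/cpuset.cpus   # CPUs belonging to node 0
echo 0 > /cgroup/cpuset/oracle_node0/cpuset.mems     # allow memory from node 0 only
echo $$ > /cgroup/cpuset/oracle_node0/tasks          # move this shell into the cgroup

Any process started from that shell afterwards, including an Oracle instance, inherits the cpuset and should therefore allocate its memory from node 0 only.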

Using NUMA on your system and leaving it to chance how memory is distributed will lead to difficult-to-predict performance. You might even run out of memory on a local node, causing unexpected problems. As with everything, understanding and tuning a configuration is the way to go! I will run a few benchmarks next to demonstrate the difference between local and remote memory access. Unfortunately I don’t have a 4-way system available for these tests; normally you wouldn’t really worry about NUMA settings on less than four sockets.

Warning

Don’t rush to enable NUMA on your systems! Like I said, there is little to be gained on about 80% of all servers out there, the dual-socket systems. Four-way servers might be candidates for NUMA; 8-way servers definitely are candidates. By saying candidates I mean: only if you understand NUMA and how it can affect your application, have really load tested it, and it has proved to deliver predictable, stable performance, would I think of enabling NUMA for a production workload. There is nothing like thorough testing to tell you how your application will perform. I guess all I want to say is that turning on NUMA can have a negative performance impact as well, or even crash your Oracle instance if the memory on a NUMA node is depleted. Search MOS for NUMA to get more information.

Responses

  1. > , and this is exactly why you get Westmere-EX in the X3-8, and not Sandy Bridge as in the X3-2.

    The laggard state of the X3-8 (remaining with Westmere EX) doesn’t really have to do with glue or no glue. It has to do with links. The reason is that there are no 3-QPI Sandy Bridge processors. All Sandy Bridge E5 SKUs have 2 QPI links, which can work either to double-up socket-to-socket as in the E5-2600 or to make a 4-socket system (with a hop of course). Westmere EX (aka E7) on the other hand has 3 QPI links, so one can make a 4-socket box with no hops or a glue/glueless 8-way box with hops. The “glue” affects the number of hops; no glue is 2 hops. This is why you’ll *never* see Oracle publish a 4-way Westmere EX TPC-C: it would show a massive drop in scalability from 4 sockets to 8 sockets due to the hops. This fact also shows up in the TPC-C per core when comparing Oracle on the X3-8 server (the X4800 really) to any E5-2600 Oracle TPC-C. Don’t get me wrong, 8-socket WSM-EX is a very tough thing to get right. Nobody has linear 8-socket WSM-EX with shared-data workloads.

    Ivy EX will put that crucial extra QPI link back in, and thus we’ll see 12TB 8-socket boxes with up to 120 cores… which begs the question, “Offload processing, why?” :-)

    1. Hi Kevin,

      I was silently hoping for you to correct me on the Westmere-EX vs Sandy Bridge comparison. Nevertheless I should have read my references properly. Westmere EX is indeed shown with 4 links in the diagram, I just happened not to see it. This is what I found:

      http://www.qdpma.com/systemarchitecture/systemarchitecture_qpi.html

      Intel seems to have published something similar at Hot Chips too, in HC22.24.610-Nagara-Intel-6-Westmere-EX.pdf, which is hosted at hotchips.org

      Thanks again for clarifying.

      1. Ah yes, the diagram you refer to reminds me of my bad habit of ignoring QPI links that cannot be used to link sockets. The E7 has 4 links but one is used to connect to the IOH, so I discount it, leaving the 3 for socket-to-socket linking. Starting with SNB we no longer sacrifice a QPI link to attach to an IOH since multi-lane DMI 2.0 is the path to I/O.

