In preparation for a research project and potential UKOUG conference papers, I am researching the effect of NUMA on x86 systems.
NUMA is one of the key concepts to understand in modern computer organisation, and I recommend reading “Computer Architecture: A Quantitative Approach” by Hennessy and Patterson (make sure you grab the fifth edition). Read the chapter about cache optimisation and also the appendix about the memory hierarchy!
Now why should you know about NUMA? First of all, there is an increasing number of multi-socket systems. AMD pioneered the move to many cores, and Intel is not far behind. Although AMD currently leads in the number of cores (“modules”) per die, Intel doesn’t need to: the Sandy Bridge-EP processors are far more powerful in a core-for-core comparison than anything AMD has at the moment.
In the example, I am using a blade system with Opteron 61xx processors. The processor has 12 cores according to the AMD hardware reference. The output of /proc/cpuinfo lists 48 “processors”, so it should be fair to say that there are 48/12 = 4 sockets in the system. An AWR report on the machine lists it as 4 sockets, 24 cores and 48 processors. I didn’t think the processor was using SMT; once I find out why AWR reports 24 cores/48 threads I’ll update the post.
Anyway, I ensured that the kernel command line (/proc/cmdline) didn’t include numa=off, which the oracle-validated RPM sets. After a reboot, here’s the result:
$ numactl --hardware
available: 8 nodes (0-7)
node 0 size: 4016 MB
node 0 free: 378 MB
node 1 size: 4040 MB
node 1 free: 213 MB
node 2 size: 4040 MB
node 2 free: 833 MB
node 3 size: 4040 MB
node 3 free: 819 MB
node 4 size: 4040 MB
node 4 free: 847 MB
node 5 size: 4040 MB
node 5 free: 834 MB
node 6 size: 4040 MB
node 6 free: 851 MB
node 7 size: 4040 MB
node 7 free: 749 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  20  20  20  20  20  20
  1:  20  10  20  20  20  20  20  20
  2:  20  20  10  20  20  20  20  20
  3:  20  20  20  10  20  20  20  20
  4:  20  20  20  20  10  20  20  20
  5:  20  20  20  20  20  10  20  20
  6:  20  20  20  20  20  20  10  20
  7:  20  20  20  20  20  20  20  10
Right, I have 8 NUMA nodes numbered 0-7, and total RAM on the machine is 32 GB. Huge pages are allocated for another database to allow for a 24 GB SGA. A lot of information about NUMA can be found in sysfs, which is mounted by default on RHEL and Oracle Linux. Check the path /sys/devices/system/node:
$ ls
node0  node1  node2  node3  node4  node5  node6  node7
$ ls node0
cpu0  cpu12  cpu16  cpu20  cpu4  cpu8  cpumap  distance  meminfo  numastat
For each NUMA node shown in the output of numactl --hardware there is a subdirectory noden, where you can also see the processors that form the node. Oracle Linux 6.x offers a file called cpulist; previous releases with the RHEL-compatible kernel instead have subdirectories cpux. Interestingly, you find memory information local to the NUMA node in the file meminfo, as well as the distance matrix you can query with numactl --hardware. So far I have only seen distances of 10 or 20 - if anyone knows where these numbers come from or has seen other figures, please let me know!
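The sysfs files mentioned above can be walked with a short shell loop. This is only a sketch and assumes the /sys/devices/system/node layout described here; the cpulist fallback covers older kernels that expose only the cpuN subdirectories:

```shell
# For each NUMA node, print its CPUs and the kernel's distance vector.
# Falls back gracefully when the machine exposes no NUMA topology.
for node in /sys/devices/system/node/node*; do
  if [ ! -d "$node" ]; then
    echo "no NUMA nodes exposed in sysfs"
    break
  fi
  # cpulist exists on newer kernels; older ones only have cpuN directories
  if [ -r "$node/cpulist" ]; then
    cpus=$(cat "$node/cpulist")
  else
    cpus=$(ls -d "$node"/cpu[0-9]* 2>/dev/null | sed 's|.*/cpu||' | sort -n | tr '\n' ' ')
  fi
  echo "$(basename "$node"): cpus=$cpus distance=[$(cat "$node/distance")]"
done
```

On the box above this prints one line per node, e.g. node0 with its six CPUs and the 10/20 distance row.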
Another useful tool to know is numastat, which presents per-node memory allocation statistics, including cross-node memory requests:
$ numastat
                         node0          node1          node2          node3
numa_hit               3048548       25344114       14523218       13498057
numa_miss                    0              0              0              0
numa_foreign                 0              0              0              0
interleave_hit            8196         390371         415719         458362
local_node             2415628       24965781       14059618       12907752
other_node              632920         378333         463600         590305

                         node4          node5          node6          node7
numa_hit               9295098        4072364        3730878        3659625
numa_miss                    0              0              0              0
numa_foreign                 0              0              0              0
interleave_hit          512399         451099         417627         390960
local_node             8637176        3483582        3152133        3159090
other_node              657922         588782         578745         500535
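Those counters can be folded into a quick “percent local” figure with awk. A sketch, fed with the node0/node1 columns from the output above as canned input (on a live system you would pipe numastat straight in instead of using the here-document):

```shell
# numa_hit counts allocations satisfied on the intended node; local_node
# counts those where the allocating CPU was itself on that node.
awk '
/^numa_hit/   { for (i = 2; i <= NF; i++) hit[i] = $i }
/^local_node/ { for (i = 2; i <= NF; i++) loc[i] = $i }
END { for (i = 2; i in hit; i++)
        printf "node%d: %.1f%% local\n", i - 2, 100 * loc[i] / hit[i] }
' <<'EOF'
numa_hit 3048548 25344114
local_node 2415628 24965781
EOF
```

For the sample figures this reports 79.2% local for node0 and 98.5% for node1.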
Oracle and NUMA
Oracle has an if-then-else approach to NUMA, as a post from Kevin Closson has explained already. I’m on 11.2.0.3 and need to use “_enable_numa_support” to enable NUMA support in the database. Before that, however, I thought I’d give the numactl command a chance and bind the instance to node 7 (for both processor and memory).
This is easily done:
[oracle@server1 ~]> numactl --membind=7 --cpunodebind=7 sqlplus / as sysdba <<EOF
startup
exit
EOF
Have a look at the numactl man page if you want to learn more about the options.
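Besides the man page, it is worth knowing that the kernel reports the per-mapping memory policy of any process in /proc/&lt;pid&gt;/numa_maps, which lets you verify the memory side of the binding. A sketch against the current shell; on a real system you would substitute the PID of an Oracle background process:

```shell
# Each numa_maps line shows the mapping's start address, its policy
# (default/bind/interleave/...) and per-node page counts - useful to
# confirm that --membind actually took effect.
if [ -r /proc/self/numa_maps ]; then
  head -3 /proc/self/numa_maps
else
  echo "no numa_maps - kernel not built with CONFIG_NUMA?"
fi
```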
Now how can you check whether it respected your settings? Simple enough: the tool is called “taskset”. Despite what the name suggests, it cannot only set a task’s CPU affinity but also retrieve it. A simple one-liner does that for my database SLOB:
$ for i in `ps -ef | awk '/SLOB/ {print $2}'`; do taskset -c -p $i; done
pid 1434's current affinity list: 3,7,11,15,19,23
pid 1436's current affinity list: 3,7,11,15,19,23
pid 1438's current affinity list: 3,7,11,15,19,23
pid 1442's current affinity list: 3,7,11,15,19,23
pid 1444's current affinity list: 3,7,11,15,19,23
pid 1446's current affinity list: 3,7,11,15,19,23
pid 1448's current affinity list: 3,7,11,15,19,23
pid 1450's current affinity list: 3,7,11,15,19,23
pid 1452's current affinity list: 3,7,11,15,19,23
pid 1454's current affinity list: 3,7,11,15,19,23
pid 1456's current affinity list: 3,7,11,15,19,23
pid 1458's current affinity list: 3,7,11,15,19,23
pid 1460's current affinity list: 3,7,11,15,19,23
pid 1462's current affinity list: 3,7,11,15,19,23
pid 1464's current affinity list: 3,7,11,15,19,23
pid 1466's current affinity list: 3,7,11,15,19,23
pid 1470's current affinity list: 3,7,11,15,19,23
pid 1472's current affinity list: 3,7,11,15,19,23
pid 1489's current affinity list: 3,7,11,15,19,23
pid 1694's current affinity list: 3,7,11,15,19,23
pid 1696's current affinity list: 3,7,11,15,19,23
pid 5041's current affinity list: 3,7,11,15,19,23
pid 13374's current affinity list: 3,7,11,15,19,23
Is that really node7? Checking the cpus in node7:
$ ls node7
cpu11  cpu15  cpu19  cpu23  cpu3  cpu7
That’s us! OK, that worked.
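As an aside, taskset is essentially a wrapper around sched_getaffinity(); the same information is exported in /proc/&lt;pid&gt;/status, which helps on minimal systems where the util-linux tools are not installed:

```shell
# Cpus_allowed_list mirrors what "taskset -c -p <pid>" prints;
# Mems_allowed_list shows the memory-node binding on NUMA-enabled kernels.
grep -E 'Cpus_allowed_list|Mems_allowed_list' /proc/self/status
```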
_enable_NUMA_support
The next test I did was to see how Oracle handles NUMA in the database. There was a bit of an enable/don’t enable/enable/don’t enable back and forth from 10.2 to 11.2. If the MOS notes are correct, NUMA support is now turned off by default; the underscore parameter _enable_NUMA_support turns it on again. At least on my 11.2.0.3.2 system on Linux, no relinking of the oracle binary was necessary.
But to my surprise I saw this after starting the database with NUMA support enabled:
$ for i in `ps -ef | awk '/SLOB/ {print $2}'`; do taskset -c -p $i; done
pid 17513's current affinity list: 26,30,34,38,42,46
pid 17515's current affinity list: 26,30,34,38,42,46
pid 17517's current affinity list: 26,30,34,38,42,46
pid 17521's current affinity list: 26,30,34,38,42,46
pid 17523's current affinity list: 26,30,34,38,42,46
pid 17525's current affinity list: 26,30,34,38,42,46
pid 17527's current affinity list: 26,30,34,38,42,46
pid 17529's current affinity list: 26,30,34,38,42,46
pid 17531's current affinity list: 0,4,8,12,16,20
pid 17533's current affinity list: 24,28,32,36,40,44
pid 17535's current affinity list: 1,5,9,13,17,21
pid 17537's current affinity list: 25,29,33,37,41,45
pid 17539's current affinity list: 2,6,10,14,18,22
pid 17541's current affinity list: 26,30,34,38,42,46
pid 17543's current affinity list: 27,31,35,39,43,47
pid 17545's current affinity list: 3,7,11,15,19,23
pid 17547's current affinity list: 24,28,32,36,40,44
pid 17549's current affinity list: 26,30,34,38,42,46
pid 17551's current affinity list: 26,30,34,38,42,46
pid 17553's current affinity list: 26,30,34,38,42,46
pid 17555's current affinity list: 26,30,34,38,42,46
pid 17557's current affinity list: 26,30,34,38,42,46
pid 17559's current affinity list: 26,30,34,38,42,46
pid 17563's current affinity list: 26,30,34,38,42,46
pid 17565's current affinity list: 26,30,34,38,42,46
pid 17568's current affinity list: 0,4,8,12,16,20
pid 17577's current affinity list: 0,4,8,12,16,20
pid 17584's current affinity list: 0,4,8,12,16,20
pid 17597's current affinity list: 0,4,8,12,16,20
pid 17599's current affinity list: 24,28,32,36,40,44
Interesting - so the database, with an otherwise identical pfile (and a SLOB PIO SGA of 270 MB), is now distributed across lots of NUMA nodes… watch out for those interleaved memory transfers!
Trying to use numactl to force the creation of processes on a node doesn’t help either - Oracle now seems to use NUMA API calls internally and overrides your command:
$ numactl --membind=7 --cpunodebind=7 sqlplus / as sysdba <<EOF
> startup
> EOF
...
$ for i in `ps -ef | awk '/SLOB/ {print $2}'`; do taskset -c -p $i; done
pid 20155's current affinity list: 3,7,11,15,19,23
pid 20157's current affinity list: 3,7,11,15,19,23
pid 20160's current affinity list: 3,7,11,15,19,23
pid 20164's current affinity list: 3,7,11,15,19,23
pid 20166's current affinity list: 3,7,11,15,19,23
pid 20168's current affinity list: 3,7,11,15,19,23
pid 20170's current affinity list: 3,7,11,15,19,23
pid 20172's current affinity list: 3,7,11,15,19,23
pid 20174's current affinity list: 0,4,8,12,16,20
pid 20176's current affinity list: 24,28,32,36,40,44
pid 20178's current affinity list: 1,5,9,13,17,21
pid 20180's current affinity list: 25,29,33,37,41,45
pid 20182's current affinity list: 2,6,10,14,18,22
pid 20184's current affinity list: 26,30,34,38,42,46
pid 20186's current affinity list: 27,31,35,39,43,47
pid 20188's current affinity list: 3,7,11,15,19,23
pid 20190's current affinity list: 24,28,32,36,40,44
pid 20192's current affinity list: 3,7,11,15,19,23
pid 20194's current affinity list: 3,7,11,15,19,23
pid 20196's current affinity list: 3,7,11,15,19,23
pid 20198's current affinity list: 3,7,11,15,19,23
pid 20200's current affinity list: 3,7,11,15,19,23
pid 20202's current affinity list: 3,7,11,15,19,23
pid 20206's current affinity list: 3,7,11,15,19,23
pid 20208's current affinity list: 3,7,11,15,19,23
pid 20211's current affinity list: 0,4,8,12,16,20
pid 20240's current affinity list: 0,4,8,12,16,20
pid 20363's current affinity list: 0,4,8,12,16,20
sched_getaffinity: No such process
failed to get pid 20403's affinity
Little things I didn’t know! Next time I benchmark I will keep that in mind.
_enable_NUMA_support is actually quite broken.
It’s broken in that it presumes you want NUMA awareness for the entire box. Chopping up a box into smaller sets of NUMA nodes for a particular instance requires *you* to specify in an init.ora parameter which nodes you want the instance to affinity itself to, and that requires properly NUMA-aware Oracle code. Proper NUMA awareness finally returns post-11.2.0.3.
The processes that bolted away from node 7 - are they DBWR processes?
The thing to remember about the Magny-Cours processor is that it is two Istanbul Opteron dies soldered together, if you will. There are 4 sockets in your box, but there are thus 8 “nodes”. With this arrangement you pick up an extra degree of non-uniformity too, by the way. This association of nodes to sockets is also what is likely throwing off AWR, because it is detecting more nodes than sockets, and AMD is unique in this regard. It’s a bug in Oracle.
I had one of these boxes in my lab at Oracle and I do recall seeing that AWR anomaly as well, but it didn’t really matter to me since I understood the topology, so I ignored it (it didn’t have anything to do with the specific task at hand at the time).
If you wish to test node scalability and the NUMA effect on this box, I’d first recommend you determine the association of nodes to sockets (again, since there are 2 nodes in each socket). It’s been so long ago that I can’t remember the mapping (e.g., are node 0 and node 1 in socket 0?). It seems I must have written about it somewhere - perhaps here: http://kevinclosson.wordpress.com/category/amd-6100-magny-cours/ If not, a simple approach is to iterate, for every CPU, an invocation of the following script targeting every other CPU. You’ll see clear, lumpy throughput: http://kevinclosson.files.wordpress.com/2010/10/px-sh.pdf
Once you discover which nodes are “close” to each other, I’d treat them as sets in --cpunodebind, keep _enable_NUMA_support disabled, and explore with SLOB LIOPS. That will certainly let you “feel” the NUMA effect.
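One way to approach that suggestion is to let the kernel’s own distance matrix group the nodes. A sketch, where the “distance below 20 means same socket” threshold is an assumption - on a BIOS reporting the flat 10/20 matrix shown earlier in the post, every node ends up close only to itself, and the script-based throughput test above remains the way to find the pairs:

```shell
# For each node, list the nodes whose reported distance is below the
# assumed same-socket threshold of 20.
for node in /sys/devices/system/node/node*; do
  [ -r "$node/distance" ] || continue
  i=${node##*node}          # node index from the directory name
  j=0
  close=""
  for d in $(cat "$node/distance"); do
    if [ "$d" -lt 20 ]; then close="$close $j"; fi
    j=$((j + 1))
  done
  echo "node$i is close to:$close"
done
```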
If you’ll allow, I’d like to post a link to some NUMA-related material for your readers: http://kevinclosson.wordpress.com/kevin-closson-index/oracle-on-opteron-k8l-numa-etc/
Hi Kevin,
as always, to the point! I’m keen to see which of the CGROUPS functionality will move into the next releases, and to what effect. I have made a (for me) interesting observation: when using _enable_NUMA_support = true and starting 100 sqlplus sessions, taskset said the affinity was 0-47 (i.e., none at all). Interesting. I can imagine the exact opposite effect when all new processes spawned by the listener are created on the same node, which wouldn’t be great either.
My DBWR processes are all over the place:
$ for i in `ps -ef | awk '/ora_dbw.*SLOB/ {print $2}'`; do taskset -c -p $i; done
pid 20174's current affinity list: 0,4,8,12,16,20
pid 20176's current affinity list: 24,28,32,36,40,44
pid 20178's current affinity list: 1,5,9,13,17,21
pid 20180's current affinity list: 25,29,33,37,41,45
pid 20182's current affinity list: 2,6,10,14,18,22
pid 20184's current affinity list: 26,30,34,38,42,46
pid 20186's current affinity list: 27,31,35,39,43,47
pid 20188's current affinity list: 3,7,11,15,19,23
$ ps -eo psr,args,pid | grep SLOB | sort -n | grep dbw
1 ora_dbw2_SLOB 20178
2 ora_dbw4_SLOB 20182
16 ora_dbw0_SLOB 20174
23 ora_dbw7_SLOB 20188
25 ora_dbw3_SLOB 20180
26 ora_dbw5_SLOB 20184
27 ora_dbw6_SLOB 20186
32 ora_dbw1_SLOB 20176
Good observation about the Magny-Cours - I had heard about it but have since forgotten. I’ll experiment a bit more, but whatever I find will go into another post. This one is too long already.
We have a DL580 G7 (4 x Xeon E7, 10 cores each).
We moved a large DB from an IBM x3850 (Xeon 7400s) - non-NUMA.
The performance increase was expected to be significant… but it has been dismal.
11.2.0.2 is the Oracle version. We are using ASM on block devices (68 wide - high-end XP12000).
I suggested to my DBA that we enable NUMA, as our kernel (RHEL 5.8) is NUMA-aware:
nc8181@flph033 # numactl --hardware
available: 4 nodes (0-3)
node 0 size: 64608 MB
node 0 free: 2839 MB
node 1 size: 64640 MB
node 1 free: 2315 MB
node 2 size: 64640 MB
node 2 free: 2438 MB
node 3 size: 64640 MB
node 3 free: 1628 MB
node distances:
node 0 1 2 3
0: 10 20 20 20
1: 20 10 20 20
2: 20 20 10 20
3: 20 20 20 10
The DBA obliged; _enable_NUMA_support is set to TRUE. We’re still awaiting performance results, but the alert log shows NUMA was detected and active.
numastat output after NUMA support was explicitly enabled:
numa_miss 179264614 83711094 164579323 88929187
numa_foreign 102399480 162459794 139369806 112255138
interleave_hit 90335 88532 89057 90751
Over 36 hours, numa_miss and numa_foreign have not incremented at all! So does this mean we’re good? And those values were likely from before NUMA support was enabled in the DB - right?
I am also aware of Oracle’s published X4800 (8 x E7) result, which surprisingly was not NUMA-enabled. Does this mean NUMA support is not really that helpful? Or was trickery involved, in that even with NUMA support not enabled, the instance detected Exadata or Exadata-like storage (COMSTAR?) and silently enabled NUMA support anyway?
Have a look at Kevin’s post here:
http://kevinclosson.wordpress.com/2010/12/02/_enable_numa_support-the-if-then-else-oracle-database-11g-release-2-initialization-parameter/
That should give you some more insights. Would you mind sharing the pfile for your database? I’m curious as to why there were no misses or foreign requests.
Martin
If you have a 4S QPI box, drop into the BIOS and disable NUMA (i.e., enable memory interleaving). Baseline that and *then* experiment with NUMA. These are 1-hop QPI systems, so remote references carry only about a 20% penalty, and that penalty is uniform up to the point where the on-die memory controller is saturated by remote reference requests.
You will be hard-pressed to find an Oracle workload on 4S QPI that performs better in NUMA mode than in SUMA. These are just really good systems, and SUMA is a good model.
Also, _enable_NUMA_support really does not work. It will in 11.2.0.4, since it will be integrated with CGROUPS. As it stands now, setting _enable_NUMA_support is a soft hint for foreground processes to prefer LRUs that govern SGA buffers in local memory. The problem is that there is *zero* intelligent process placement. So if your connections come through a listener, most are homed on the listener’s node (socket), and the scheduler will then execute them remotely (um, lots and lots of remote memory references and lopsided demand on the LLC). What’s worse is the fact that foregrounds detect which node they are executing on only at process birth. Um, the scheduler will run you wherever it wants, so the presumption about process locality is totally busted.
I recommend ignoring the NUMA stats about local and remote references. You’ll end up in the weeds. Not productive.
If you want to learn NUMA, get a copy of SLB: http://kevinclosson.wordpress.com/2010/11/17/reintroducing-slb-the-silly-little-benchmark/
Don’t confuse SLOB with SLB.
Pingback: Measure the impact of remote versus local NUMA node access thanks to processor_group_name | bdt's oracle blog
Pingback: The Oracle database, in-memory parallel execution and NUMA | Frits Hoogland Weblog