Martins Blog

Trying to explain complex things in simple terms

Little things I didn’t know: difference between _enable_NUMA_support and numactl

Posted by Martin Bach on April 27, 2012

In preparation for a research project and potential UKOUG conference papers I am researching the effect of NUMA on x86 systems.

NUMA is one of the key features to understand in modern computer organisation, and I recommend reading “Computer Architecture, Fifth Edition: A Quantitative Approach” by Hennessy and Patterson (make sure you grab the 5th edition). Read the chapter about cache optimisation and also the appendix about the memory hierarchy!

Now why should you care about NUMA? First of all, there is an increasing number of multi-socket systems. AMD has pioneered the move to many cores, but Intel is not far behind. Although AMD currently leads in the number of cores (“modules”) on a die, Intel doesn’t need to: the Sandy Bridge-EP processors are far more powerful in a one-to-one comparison than anything AMD has at the moment.

In this example I am using a blade system with Opteron 61xx processors. The processor has 12 cores according to the AMD hardware reference. The output of /proc/cpuinfo lists 48 “processors”, so it should be fair to say that there are 48/12 = 4 sockets in the system. An AWR report on the machine lists it as 4 sockets, 24 cores and 48 processors. I didn’t think the processor was using SMT; when I find out why AWR reports 24c48t I’ll update the post.
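
If you want to cross-check the socket count yourself, /proc/cpuinfo has the raw data; a quick sanity check from the shell (nothing Oracle-specific about it):

$ grep 'physical id' /proc/cpuinfo | sort -u | wc -l     # physical packages - 4 on this box
$ grep -c '^processor' /proc/cpuinfo                     # logical processors - 48 on this box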

Anyway, I ensured that the kernel command line (/proc/cmdline) didn’t include numa=off, which the oracle-validated RPM sets. Then after a reboot here’s the result:

$ numactl --hardware
available: 8 nodes (0-7)
node 0 size: 4016 MB
node 0 free: 378 MB
node 1 size: 4040 MB
node 1 free: 213 MB
node 2 size: 4040 MB
node 2 free: 833 MB
node 3 size: 4040 MB
node 3 free: 819 MB
node 4 size: 4040 MB
node 4 free: 847 MB
node 5 size: 4040 MB
node 5 free: 834 MB
node 6 size: 4040 MB
node 6 free: 851 MB
node 7 size: 4040 MB
node 7 free: 749 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  20  20  20  20  20  20
  1:  20  10  20  20  20  20  20  20
  2:  20  20  10  20  20  20  20  20
  3:  20  20  20  10  20  20  20  20
  4:  20  20  20  20  10  20  20  20
  5:  20  20  20  20  20  10  20  20
  6:  20  20  20  20  20  20  10  20
  7:  20  20  20  20  20  20  20  10

Right, I have 8 NUMA nodes numbered 0-7; total RAM on the machine is 32 GB. There are huge pages allocated for another database to allow for a 24 GB SGA. A lot of information about NUMA can be found in sysfs, which is now mounted by default on RHEL and Oracle Linux. Have a look at /sys/devices/system/node:

$ ls
node0  node1  node2  node3  node4  node5  node6  node7

$ ls node0
cpu0  cpu12  cpu16  cpu20  cpu4  cpu8  cpumap  distance  meminfo  numastat

For each NUMA node shown in the output of numactl --hardware there is a subdirectory nodeN, where you can also see the processors that form the node. Oracle Linux 6.x offers a file called cpulist; previous releases with the RHEL-compatible kernel should have subdirectories cpuX instead. Interestingly, you also find memory information local to the NUMA node in the file meminfo, as well as the distance matrix you can query with numactl --hardware. So far I have only seen distances of 10 or 20; if anyone knows where these numbers come from or has seen other figures, please let me know!
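
A couple of quick reads against node0 show the layout; here is roughly what node0 reports on this box (the distance file simply contains row 0 of the matrix above, and the free memory figure obviously fluctuates):

$ cat /sys/devices/system/node/node0/distance
10 20 20 20 20 20 20 20
$ grep MemFree /sys/devices/system/node/node0/meminfo
Node 0 MemFree:         387072 kB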

Another useful tool to know is numastat, which presents per-node memory statistics (including cross-node memory requests!).

$ numastat
                           node0           node1           node2           node3
numa_hit                 3048548        25344114        14523218        13498057
numa_miss                      0               0               0               0
numa_foreign                   0               0               0               0
interleave_hit              8196          390371          415719          458362
local_node               2415628        24965781        14059618        12907752
other_node                632920          378333          463600          590305

                           node4           node5           node6           node7
numa_hit                 9295098         4072364         3730878         3659625
numa_miss                      0               0               0               0
numa_foreign                   0               0               0               0
interleave_hit            512399          451099          417627          390960
local_node               8637176         3483582         3152133         3159090
other_node                657922          588782          578745          500535
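
The same counters are also exposed per node under /sys/devices/system/node/node*/numastat, which is handy for scripting; for example, to pull the miss counters for all nodes in one go:

$ grep numa_miss /sys/devices/system/node/node*/numastat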

Oracle and NUMA

Oracle has an if-then-else approach to NUMA, as a post from Kevin Closson has already explained. I’m on 11.2.0.3 and need to use “_enable_NUMA_support” to enable NUMA support in the database. Before that, however, I thought I’d give the numactl command a chance and bind the instance to node 7 (both for processor and memory).

This is easily done:

[oracle@server1 ~]> numactl --membind=7 --cpunodebind=7 sqlplus / as sysdba <<EOF
startup
exit
EOF

Have a look at the numactl man page if you want to learn more about the options.
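
Two more switches I find handy besides the bind options, at least with the numactl version shipped here: --show prints the NUMA policy of the current shell, and --interleave=all spreads allocations round-robin across all nodes.

$ numactl --show
$ numactl --interleave=all sqlplus / as sysdba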

Now how can you check whether it respected your settings? Simple enough: the tool is called “taskset”. Despite what the name may suggest, you can not only set a task’s CPU affinity but also query it. A simple one-liner does that for my database SLOB:

$ for i in `ps -ef | awk '/SLOB/ {print $2}'`; do taskset -c -p $i; done
pid 1434's current affinity list: 3,7,11,15,19,23
pid 1436's current affinity list: 3,7,11,15,19,23
pid 1438's current affinity list: 3,7,11,15,19,23
pid 1442's current affinity list: 3,7,11,15,19,23
pid 1444's current affinity list: 3,7,11,15,19,23
pid 1446's current affinity list: 3,7,11,15,19,23
pid 1448's current affinity list: 3,7,11,15,19,23
pid 1450's current affinity list: 3,7,11,15,19,23
pid 1452's current affinity list: 3,7,11,15,19,23
pid 1454's current affinity list: 3,7,11,15,19,23
pid 1456's current affinity list: 3,7,11,15,19,23
pid 1458's current affinity list: 3,7,11,15,19,23
pid 1460's current affinity list: 3,7,11,15,19,23
pid 1462's current affinity list: 3,7,11,15,19,23
pid 1464's current affinity list: 3,7,11,15,19,23
pid 1466's current affinity list: 3,7,11,15,19,23
pid 1470's current affinity list: 3,7,11,15,19,23
pid 1472's current affinity list: 3,7,11,15,19,23
pid 1489's current affinity list: 3,7,11,15,19,23
pid 1694's current affinity list: 3,7,11,15,19,23
pid 1696's current affinity list: 3,7,11,15,19,23
pid 5041's current affinity list: 3,7,11,15,19,23
pid 13374's current affinity list: 3,7,11,15,19,23

Is that really node 7? Let’s check the CPUs in node7:

$ ls node7
cpu11  cpu15  cpu19  cpu23  cpu3  cpu7

That’s us! OK, that worked.
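
One caveat: taskset only reports CPU affinity and says nothing about where the memory actually went. To double-check the memory side you can peek at a process’s NUMA policy in /proc, for example for the first PID in the list above (a quick sketch; with --membind=7 in effect the anonymous mappings should show up with a bind:7 policy):

$ grep bind /proc/1434/numa_maps | head -3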

_enable_NUMA_support

The next test was to see how Oracle handles NUMA in the database. There was a bit of an enable/don’t enable/enable/don’t enable back and forth from 10.2 to 11.2. If the MOS notes are correct, NUMA support is turned off by default now. The underscore parameter _enable_NUMA_support turns it on again; at least on my 11.2.0.3.2 system on Linux no relinking of the oracle binary was necessary.
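
For reference, turning it on is a single entry in the pfile, or the equivalent ALTER SYSTEM if you use an spfile (a sketch only; the parameter is undocumented, so test on a sandbox and check with Oracle Support first):

# pfile entry
*._enable_NUMA_support=true

# or, with an spfile (double quotes are needed because of the leading underscore)
$ sqlplus / as sysdba <<EOF
alter system set "_enable_NUMA_support"=true scope=spfile;
EOF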

But to my surprise I saw this after starting the database with NUMA support enabled:


$ for i in `ps -ef | awk '/SLOB/ {print $2}'`; do taskset -c -p $i; done
pid 17513's current affinity list: 26,30,34,38,42,46
pid 17515's current affinity list: 26,30,34,38,42,46
pid 17517's current affinity list: 26,30,34,38,42,46
pid 17521's current affinity list: 26,30,34,38,42,46
pid 17523's current affinity list: 26,30,34,38,42,46
pid 17525's current affinity list: 26,30,34,38,42,46
pid 17527's current affinity list: 26,30,34,38,42,46
pid 17529's current affinity list: 26,30,34,38,42,46
pid 17531's current affinity list: 0,4,8,12,16,20
pid 17533's current affinity list: 24,28,32,36,40,44
pid 17535's current affinity list: 1,5,9,13,17,21
pid 17537's current affinity list: 25,29,33,37,41,45
pid 17539's current affinity list: 2,6,10,14,18,22
pid 17541's current affinity list: 26,30,34,38,42,46
pid 17543's current affinity list: 27,31,35,39,43,47
pid 17545's current affinity list: 3,7,11,15,19,23
pid 17547's current affinity list: 24,28,32,36,40,44
pid 17549's current affinity list: 26,30,34,38,42,46
pid 17551's current affinity list: 26,30,34,38,42,46
pid 17553's current affinity list: 26,30,34,38,42,46
pid 17555's current affinity list: 26,30,34,38,42,46
pid 17557's current affinity list: 26,30,34,38,42,46
pid 17559's current affinity list: 26,30,34,38,42,46
pid 17563's current affinity list: 26,30,34,38,42,46
pid 17565's current affinity list: 26,30,34,38,42,46
pid 17568's current affinity list: 0,4,8,12,16,20
pid 17577's current affinity list: 0,4,8,12,16,20
pid 17584's current affinity list: 0,4,8,12,16,20
pid 17597's current affinity list: 0,4,8,12,16,20
pid 17599's current affinity list: 24,28,32,36,40,44

Interesting: so the database, with an otherwise identical pfile (and a SLOB PIO SGA of 270 MB), is now distributed across lots of NUMA nodes. Watch out for that interleaved memory transfer!

It doesn’t help to use numactl to force process creation on a single node either: Oracle now seems to use NUMA API calls internally and overrides your command:

$ numactl --membind=7 --cpunodebind=7 sqlplus / as sysdba <<EOF
> startup
> EOF
...
$ for i in `ps -ef | awk '/SLOB/ {print $2}'`; do taskset -c -p $i; done
pid 20155's current affinity list: 3,7,11,15,19,23
pid 20157's current affinity list: 3,7,11,15,19,23
pid 20160's current affinity list: 3,7,11,15,19,23
pid 20164's current affinity list: 3,7,11,15,19,23
pid 20166's current affinity list: 3,7,11,15,19,23
pid 20168's current affinity list: 3,7,11,15,19,23
pid 20170's current affinity list: 3,7,11,15,19,23
pid 20172's current affinity list: 3,7,11,15,19,23
pid 20174's current affinity list: 0,4,8,12,16,20
pid 20176's current affinity list: 24,28,32,36,40,44
pid 20178's current affinity list: 1,5,9,13,17,21
pid 20180's current affinity list: 25,29,33,37,41,45
pid 20182's current affinity list: 2,6,10,14,18,22
pid 20184's current affinity list: 26,30,34,38,42,46
pid 20186's current affinity list: 27,31,35,39,43,47
pid 20188's current affinity list: 3,7,11,15,19,23
pid 20190's current affinity list: 24,28,32,36,40,44
pid 20192's current affinity list: 3,7,11,15,19,23
pid 20194's current affinity list: 3,7,11,15,19,23
pid 20196's current affinity list: 3,7,11,15,19,23
pid 20198's current affinity list: 3,7,11,15,19,23
pid 20200's current affinity list: 3,7,11,15,19,23
pid 20202's current affinity list: 3,7,11,15,19,23
pid 20206's current affinity list: 3,7,11,15,19,23
pid 20208's current affinity list: 3,7,11,15,19,23
pid 20211's current affinity list: 0,4,8,12,16,20
pid 20240's current affinity list: 0,4,8,12,16,20
pid 20363's current affinity list: 0,4,8,12,16,20
sched_getaffinity: No such process
failed to get pid 20403's affinity

Little things I didn’t know! So next time I benchmark I will keep that in mind.

5 Responses to “Little things I didn’t know: difference between _enable_NUMA_support and numactl”

  1. _enable_NUMA_support is actually quite broken.

    It’s broken in that it presumes you want NUMA awareness for the entire box. Chopping up a box into smaller sets of NUMA nodes for a particular instance requires *you* to specify in an init.ora parameter which nodes you want the instance to affinity itself to, and that requires properly NUMA-aware Oracle code. Proper NUMA awareness finally returns post-11.2.0.3.

    The processes that bolted away from node7 are DBWR processes?

    The thing to remember about the Magny-Cours processor is that it is two Istanbul Opteron dies soldered together, if you will. There are 4 sockets in your box but there are thus 8 “nodes”. With this arrangement you pick up an extra degree of non-uniformity too, by the way. This association of nodes to sockets is also what is likely throwing off AWR, because it is detecting more nodes than sockets and AMD is unique in this regard. It’s a bug in Oracle.

    I had one of these boxes in my lab at Oracle and I do recall seeing that AWR anomaly as well but it didn’t really matter to me since I understood the topology so I ignored it (didn’t have to do with the specific task at hand at the time).

    If you wish to test node scalability and the NUMA effect on this box I’d first recommend you determine the association of nodes to sockets (again, since there are 2 nodes in each socket). It’s been so long ago I can’t remember the mapping (e.g., are node 0 and node 1 in socket 0?). It seems I must have written it up at some point. Perhaps here: http://kevinclosson.wordpress.com/category/amd-6100-magny-cours/ If not, a simple approach is to iterate, for every CPU, an invocation of the following script targeting every other CPU. You’ll see clear, lumpy throughput: http://kevinclosson.files.wordpress.com/2010/10/px-sh.pdf

    Once you discover what nodes are “close” to each other I’d treat them as sets in --cpunodebind, keep _enable_NUMA_support disabled and explore with SLOB LIOPS. That will certainly let you “feel” the NUMA effect.
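
    A rough sketch of that iteration might look like the loop below, with ./memhammer standing in for whichever memory-bound test you end up using (SLB or similar); the timings that stand out mark the remote node pairs:

    for c in 0 1 2 3 4 5 6 7; do
      for m in 0 1 2 3 4 5 6 7; do
        printf "cpu node %s, mem node %s: " $c $m
        /usr/bin/time -f "%e s" numactl --cpunodebind=$c --membind=$m ./memhammer > /dev/null
      done
    done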

    If you’ll allow, I’d like to post a link to some NUMA-related material for your readers: http://kevinclosson.wordpress.com/kevin-closson-index/oracle-on-opteron-k8l-numa-etc/

    • Martin Bach said

      Hi Kevin,

      As always, to the point! Keen to see which of the CGROUPS functionality will move into the next releases, and to what effect. I have made a (for me) interesting observation: when using _enable_NUMA_support = true and starting 100 sqlplus sessions, taskset said the affinity was 0-47 (i.e. none at all). Interesting. I can imagine the exact opposite effect when all new processes spawned by the listener are created on the same node, which wouldn’t be great either.

      My DBWR processes are all over the place:

      $ for i in `ps -ef | awk '/ora_dbw.*SLOB/ {print $2}'`; do taskset -c -p $i; done
      pid 20174's current affinity list: 0,4,8,12,16,20
      pid 20176's current affinity list: 24,28,32,36,40,44
      pid 20178's current affinity list: 1,5,9,13,17,21
      pid 20180's current affinity list: 25,29,33,37,41,45
      pid 20182's current affinity list: 2,6,10,14,18,22
      pid 20184's current affinity list: 26,30,34,38,42,46
      pid 20186's current affinity list: 27,31,35,39,43,47
      pid 20188's current affinity list: 3,7,11,15,19,23

      $ ps -eo psr,args,pid | grep SLOB | sort -n | grep dbw
      1 ora_dbw2_SLOB 20178
      2 ora_dbw4_SLOB 20182
      16 ora_dbw0_SLOB 20174
      23 ora_dbw7_SLOB 20188
      25 ora_dbw3_SLOB 20180
      26 ora_dbw5_SLOB 20184
      27 ora_dbw6_SLOB 20186
      32 ora_dbw1_SLOB 20176

      Good observation about the Magny-Cours; I had heard about it but have since forgotten. I’ll try a bit more, but whatever I find will go into another post. This one is too long already.

  2. We have a DL580 G7 (4 x Xeon E7 10core)
    We moved a large DB from an IBM x3850 (Xeon 7400s), which is non-NUMA.

    Performance increase was expected to be significant… but it’s been dismal.
    11.2.0.2 is the Oracle version. We are using ASM on block devices (68 wide, high-end XP12000).

    I suggested to my DBA to enable NUMA as our kernel (RHEL 5.8) is NUMA aware:

    nc8181@flph033 # numactl --hardware
    available: 4 nodes (0-3)
    node 0 size: 64608 MB
    node 0 free: 2839 MB
    node 1 size: 64640 MB
    node 1 free: 2315 MB
    node 2 size: 64640 MB
    node 2 free: 2438 MB
    node 3 size: 64640 MB
    node 3 free: 1628 MB
    node distances:
    node   0   1   2   3
      0:  10  20  20  20
      1:  20  10  20  20
      2:  20  20  10  20
      3:  20  20  20  10

    The DBA obliged; _enable_NUMA_support is set to TRUE. We’re still awaiting performance results, but the alert log shows NUMA was detected and active.

    NUMASTATS after NUMA support was explicitly enabled.

    numa_miss 179264614 83711094 164579323 88929187
    numa_foreign 102399480 162459794 139369806 112255138
    interleave_hit 90335 88532 89057 90751

    Over 36 hours, numa_miss and numa_foreign have not incremented at all! So does this mean we’re good? And those values were likely from before NUMA support was enabled in the DB, right?

    I am also aware of Oracle’s X4800 8x E7 published result, which surprisingly was not NUMA enabled. Does this mean NUMA support is not really that helpful, or was trickology involved: even if NUMA support was not enabled, did the instance detect Exadata or Exadata-like storage (COMSTAR?) and silently enable NUMA support anyway?

    • Martin Bach said

      Have a look at Kevin’s post here:

      http://kevinclosson.wordpress.com/2010/12/02/_enable_numa_support-the-if-then-else-oracle-database-11g-release-2-initialization-parameter/

      That should give you some more insights. Would you mind sharing the pfile for your database? I’m curious as to why there were no misses or foreign requests.

      Martin

      • If you have 4S QPI, drop to the BIOS and disable NUMA (i.e., enable memory interleaving). Baseline that and *then* experiment with NUMA. These are 1-hop QPI systems, so remote references carry only a 20% or so penalty, and that penalty is uniform up to the point where the on-die memory controller is saturated due to remote reference requests.

        You will be hard pressed to find a workload with Oracle on 4S QPI that outperforms in NUMA mode compared to SUMA. These are just really good systems and SUMA is a good model.

        Also, _enable_NUMA_support really does not work. It will in 11.2.0.4 since it will be integrated with CGROUPS. As it stands now, setting _enable_NUMA_support is a soft hint to foreground processes to prefer LRUs that govern SGA buffers from local memory. The problem is there is *zero* intelligent process placement. So if your connections come through a listener, most are homed on the listener’s node (socket) and then the scheduler will remotely execute them (um, lots and lots of remote memory references and lopsided demand on LLC). What’s worse is the fact that it is only at process birth that foregrounds detect what node they are executing on. Um, the scheduler will run you wherever it wants, so the presumption about process locality is totally busted.

        I recommend ignoring the numa stats about local and remote references. You’ll end up in the weeds. Not productive.

        If you want to learn NUMA get a copy of SLB http://kevinclosson.wordpress.com/2010/11/17/reintroducing-slb-the-silly-little-benchmark/

        Don’t confuse SLOB with SLB.
