Huge Pages and Linux: a Real-World Example

A common situation for many DBAs: all of a sudden you are tasked with looking at a database and told you have inherited it. Of course, it comes with a lot of problems, and you are supposed to fix them all. A few weeks ago this happened to me.

The system was a Sun 4660 x86-64 server running Red Hat 5.3, with 64 GB of memory and 8 dual-core Opteron 8218 processors. SGA_TARGET was set to 40G and PGA_AGGREGATE_TARGET to 5G. That sounds like plenty, yet the box was very busy trying to free up memory:

top - 12:16:11 up 23 days, 23:46, 19 users,  load average: 28.84, 25.69, 23.19
Tasks: 970 total,   3 running, 967 sleeping,   0 stopped,   0 zombie
Cpu(s):  9.3%us, 13.0%sy,  0.0%ni, 36.0%id, 40.6%wa,  0.1%hi,  1.0%si,  0.0%st
Mem:  66068664k total, 65992772k used,    75892k free,    43168k buffers
Swap:  2096472k total,  2096472k used,        0k free, 40782300k cached

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1068 root      20  -5     0    0    0 R 76.0  0.0 566:03.07 [kswapd5]
13284 oracle    18   0 40.3g 6.3g 6.2g D 31.5 10.0  28:30.12 oraclePROD (LOCAL=NO)
 1070 root      10  -5     0    0    0 D 21.6  0.0 925:57.03 [kswapd7]
 1065 root      10  -5     0    0    0 D 13.8  0.0 196:27.93 [kswapd2]
 7771 oracle    18   0 40.6g 4.5g 4.1g D 12.1  7.1  26:05.04 oraclePROD (LOCAL=NO)
 8073 oracle    16   0 40.2g 2.8g 2.8g D 12.1  4.4   5:29.79 oraclePROD (LOCAL=NO)
 1066 root      10  -5     0    0    0 S 11.8  0.0  84:57.85 [kswapd3]
 1067 root      10  -5     0    0    0 D 10.5  0.0 165:39.98 [kswapd4]
 1069 root      10  -5     0    0    0 S  7.9  0.0 277:15.84 [kswapd6]

That looked really bad: the load average was far too high, driven by the kswapd threads trying to free memory. It was actually quite difficult for me even to connect through ssh. So what was using all this memory? And why was the swap size only 2 GB? Extending swap was the first thing to fix. Adding swap space is not the real solution, but it prevents the box from crashing, which is good.
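
For reference, extending swap with a swap file works roughly along these lines (the path and size here are illustrative, not the exact values we used):

# create and enable a 16 GB swap file (example path and size)
dd if=/dev/zero of=/swapfile01 bs=1M count=16384
chmod 600 /swapfile01
mkswap /swapfile01
swapon /swapfile01
# to make it permanent, add a line like this to /etc/fstab:
# /swapfile01   swap   swap   defaults   0 0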

However, we initially couldn’t account for a lot of the used memory. In theory, the server’s 64 GB are taken up by 40 GB for the SGA and 5 GB for the PGA. We checked, and the PGA didn’t even exceed 3 GB. That makes for about 44 GB allocated, yet there was hardly any free memory:

             total       used       free     shared    buffers     cached
Mem:         64520      64443         76          0        106      43762
-/+ buffers/cache:      20574      43946
Swap:         2047       2047          0

Looking at /proc/meminfo, which I unfortunately no longer have a copy of, I could make out that the page tables alone used 22 GB:

cat /proc/meminfo | grep PageTables
PageTables:   23418712 kB

HugePages_Total of course returned 0. The question was: why is the page table figure so hugely inflated? Remember that the standard page size on Linux x86-64 is 4k, while huge pages are 2M in size. With huge pages in use, the data structures the kernel has to maintain for used and free pages are a lot shorter, and therefore more efficient and smaller in size. But that doesn’t really explain where the 22 GB went.
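
Both page sizes are easy to confirm on the box itself:

getconf PAGESIZE
4096
grep Hugepagesize /proc/meminfo
Hugepagesize:     2048 kB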

We checked the number of processes attached to the SGA and found around 700 in the nattch column of the ipcs output, which gave us the solution. Each process that maps the SGA into its own virtual address space requires its own set of page table entries, so the page table memory requirement grows with the SGA size multiplied by the number of attached processes. The back-of-the-envelope calculation below shows how quickly that adds up.
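
Assuming the usual 8 bytes per page table entry on x86-64, the numbers work out roughly like this:

40 GB SGA / 4 kB pages            = ~10.5 million page table entries per process
10.5 million entries x 8 bytes    = ~80 MB of page tables per process
~700 attached processes           = up to ~56 GB of page tables

Page table entries are only created for the parts of the SGA a process has actually touched, which is why we saw 22 GB rather than the theoretical maximum. With 2 MB huge pages the same mapping needs 512 times fewer entries, about 160 kB per process instead of 80 MB.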

A two-pronged approach was chosen here:

  1. We implement huge pages to reduce the memory pressure
  2. We try to find ways of reducing the number of Oracle client processes

The first is of course easier to achieve; for the second I need a lot of persuasion power and management buy-in (read: it’s the long-term solution).

Configuring Huge Pages

So the immediate need was to configure huge pages for the system. Using the hugepages calculation shell script from Metalink, we worked out the required number of huge pages. That number went into the vm.nr_hugepages parameter in /etc/sysctl.conf, and we set soft and hard memlock limits in /etc/security/limits.conf so the oracle user is allowed to lock the SGA in memory. The sketch below shows what such a configuration looks like.
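
For illustration, the relevant entries look roughly like this. The 20542 is the value for this box (it shows up again as HugePages_Total in the meminfo output further down); the memlock values are in kB and must be at least as large as the huge page allocation (20542 x 2048 kB = 42070016 kB):

# /etc/sysctl.conf -- number of 2 MB huge pages
vm.nr_hugepages = 20542

# /etc/security/limits.conf -- allow oracle to lock the SGA in memory (kB)
oracle soft memlock 42070016
oracle hard memlock 42070016

The sysctl value can in principle be loaded at runtime with sysctl -p, but on a long-running, fragmented system the kernel often cannot find enough contiguous memory for the full pool, which is why a reboot is the safer route.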

The Result

After restarting the server to pick up the huge pages settings, we got the following from /proc/meminfo:

[oracle@server ~]$ cat /proc/meminfo 
MemTotal:     66068668 kB
MemFree:      17265724 kB
Buffers:       1570656 kB
Cached:         766640 kB
SwapCached:          0 kB
Active:        5032060 kB
Inactive:       770412 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:     66068668 kB
LowFree:      17265724 kB
SwapTotal:    16776528 kB
SwapFree:     16776528 kB
Dirty:            4164 kB
Writeback:           0 kB
AnonPages:     3541624 kB
Mapped:          76484 kB
Slab:           562968 kB
PageTables:     216372 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:  28775852 kB
Committed_AS: 15617048 kB
VmallocTotal: 34359738367 kB
VmallocUsed:     86764 kB
VmallocChunk: 34359650291 kB
HugePages_Total: 20542
HugePages_Free:   1465
HugePages_Rsvd:   1404
Hugepagesize:     2048 kB
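
A quick sanity check on those numbers: HugePages_Total - HugePages_Free + HugePages_Rsvd = 20542 - 1465 + 1404 = 20481 pages of 2 MB each, or just about 40 GB, so the SGA is indeed backed by the huge page pool. Note also that PageTables has shrunk from 22 GB to around 211 MB.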

No swap in use, load average (not shown) down to normal levels and free memory!

[oracle@server ~]$ free -m
             total       used       free     shared    buffers     cached
Mem:         64520      47781      16738          0       1547        747
-/+ buffers/cache:      45487      19033
Swap:        16383          0      16383

I consider this a success :)

Responses

  1. Just a note that people on SUSE need to edit /etc/sysconfig/oracle instead of /etc/sysctl.conf. Nice blog post here: http://only4left.jpiwowar.com/2009/05/sles10_hugepages_x8664/

  2. Hi, can you show what your ‘free -k’ looks like right now? (Not your ‘free -m’.) Is it using 0 swap, or has the swap usage crept up, even if only by a small amount? free -m will not show any swap used unless at least a whole megabyte is in use, which is misleading.
    So please can you share.
    Regards,
    RD.

    1. It’s always good to go back to old posts occasionally. I can still remember this well, but I must somehow have missed that question.

      Getting back to that: unfortunately I had left that company by the time you asked, so the output wasn’t available, sorry. From what I recall, the swap usage had gone way down. We made sure that we didn’t allocate all of the SGA into large pages and left a healthy portion of memory available for the PGA and the operating system. Remember that large pages are not swappable; they can cause the problem you saw in the post if other processes require memory that cannot be satisfied because too much of it is locked up in large pages.
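
      In concrete numbers for this box: 20542 huge pages of 2 MB each come to roughly 40 GB locked for the SGA, which left about 24 GB of the 64 GB for the PGA, process memory, and the page cache.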

  3. Hi Martin,
    please take another look at this section.
    --------------------------------------------------------------------
    However, we initially couldn’t account for a lot of the used memory. In theory, the server’s 64 GB are taken up by 40 GB for the SGA and 5 GB for the PGA. We checked, and the PGA didn’t even exceed 3 GB. That makes for about 44 GB allocated, yet there was hardly any free memory:

                 total       used       free     shared    buffers     cached
    Mem:         64520      64443         76          0        106      43762
    -/+ buffers/cache:      20574      43946
    Swap:         2047       2047          0
    --------------------------------------------------------------------

    Unless I’m misunderstanding, the memory you think is missing is actually sitting in disk cache.

    This page should help clarify what I mean:

    http://www.linuxatemyram.com/

    VBR,
    ~Brandon
