More on use_large_pages in Linux and 11.2.0.3

Large pages in Linux are a really interesting topic for me, since I really like Linux and enjoy trying to understand how it works. Large pages can be very beneficial for systems with large SGAs, and even more so for systems with a large SGA and lots of user sessions connected.

I have previously written about the benefits and usage of large pages in Linux here:

So, as you may know by now, there is a change to the init.ora parameter “use_large_pages” in 11.2.0.3. The parameter can take these values:

SQL> select value,isdefault
  2  from V$PARAMETER_VALID_VALUES
  3* where name = 'use_large_pages'

VALUE		     ISDEFAULT
-------------------- --------------------
TRUE		     TRUE
AUTO		     FALSE
ONLY		     FALSE
FALSE		     FALSE

There is a new value named “auto” that didn’t exist prior to 11.2.0.3. (For reference: TRUE, the default, uses large pages where available; FALSE disables them; ONLY refuses to start the instance unless the entire SGA fits into large pages.) The intention behind “auto” is to create large pages at instance startup if possible, even if /etc/sysctl.conf doesn’t have an entry for vm.nr_hugepages at all. The risk, though, is the same as with the dynamic creation of large pages by echoing values into /proc/sys/vm/nr_hugepages: you may get fewer pages than you expect, maybe even 0. Now I’m interested to see if that works.
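To illustrate that dynamic mechanism outside of Oracle, here is a minimal sketch (run as root; the 128 is just an arbitrary example value). The kernel tries to carve the requested number of pages out of contiguous free memory and reports back how many it actually managed to allocate:

# ask the kernel for 128 huge pages; this may only partially succeed
echo 128 > /proc/sys/vm/nr_hugepages

# check how many pages were actually allocated (can be fewer than requested)
grep HugePages_Total /proc/meminfo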

So let’s have a look. My system is Oracle Linux 6.4, 64-bit, running virtualised. Before any database was started I checked /proc/meminfo:

[root@ol64 ~]# cat /proc/meminfo
MemTotal:        8192240 kB
MemFree:         5090124 kB
Buffers:           67408 kB
Cached:          2341504 kB
SwapCached:            0 kB
Active:           816116 kB
Inactive:        2055352 kB
Active(anon):     548760 kB
Inactive(anon):   284304 kB
Active(file):     267356 kB
Inactive(file):  1771048 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:        524284 kB
SwapFree:         524284 kB
Dirty:                60 kB
Writeback:             0 kB
AnonPages:        462560 kB
Mapped:           334424 kB
Shmem:            370516 kB
Slab:             103692 kB
SReclaimable:      47496 kB
SUnreclaim:        56196 kB
KernelStack:        2016 kB
PageTables:        26008 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     4620404 kB
Committed_AS:    3343896 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       26480 kB
VmallocChunk:   34359700348 kB
HardwareCorrupted:     0 kB
AnonHugePages:    247808 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        8128 kB
DirectMap2M:     8380416 kB

I am interested in the HugePages entries towards the end of the list. If you have ever looked at /proc/meminfo in the previous release’s kernels (2.6.18.x to be precise) you’ll notice it’s quite different now, with a lot more information; modern kernels are really a great step ahead. Have a look at the Outlook and Reference sections below: this is a somewhat superficial explanation, but good enough for the purpose of this article. A future post will go into more detail about SYSFS, which is slated to replace parts of the /proc file system, and about NUMA considerations.
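For reference, here is what the HugePages fields mean, summarised from the kernel’s hugetlbpage documentation (see the Reference section at the end):

# HugePages_Total: size of the huge page pool (what vm.nr_hugepages sets)
# HugePages_Free:  pages in the pool that are not allocated yet
# HugePages_Rsvd:  pages committed to an allocation but not yet faulted in
# HugePages_Surp:  surplus pages above vm.nr_hugepages (overcommit)
grep ^HugePages /proc/meminfo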

Back to this article … the database I have running in my VM doesn’t use large pages, as shown in the alert.log:

Starting ORACLE instance (normal)
****************** Large Pages Information *****************

Total Shared Global Region in Large Pages = 0 KB (0%)

Large Pages used by this instance: 0 (0 KB)
Large Pages unused system wide = 0 (0 KB) (alloc incr 16 MB)
Large Pages configured system wide = 0 (0 KB)
Large Page size = 2048 KB

RECOMMENDATION:
  Total Shared Global Region size is 2514 MB. For optimal performance,
  prior to the next instance restart increase the number
  of unused Large Pages by atleast 1257 2048 KB Large Pages (2514 MB)
  system wide to get 100% of the Shared
  Global Region allocated with Large pages
***********************************************************

So let’s change that, but dynamically and not manually. Again, a better (more predictable!) approach would be to manually add the recommended 1257 additional large pages to /etc/sysctl.conf and reboot, to ensure that they will be available when the database starts, and probably to set use_large_pages to “only” to enforce their usage; a sketch of that setup is in the summary below. But enough warnings about why you probably don’t want to use the “auto” feature: I want to see this in real life!

SQL> alter system set use_large_pages=auto;
alter system set use_large_pages=auto
                 *
ERROR at line 1:
ORA-02095: specified initialization parameter cannot be modified

SQL> a  scope=spfile;
  1* alter system set use_large_pages=auto scope=spfile
SQL> /

System altered.

As you can see the parameter is static and requires an instance restart, so that is what I did next. Here is an interesting side effect of setting the parameter to “auto”: it doesn’t have any effect if you haven’t prepared the system for the use of large pages in /etc/security/limits.conf. You might think that the oracle-preinstall RPM takes care of this, but it misses the settings for “memlock”. Here is proof that nothing happened:

[root@ol64 ~]# cat /proc/meminfo
MemTotal:        8192240 kB
MemFree:         5090124 kB
Buffers:           67408 kB
Cached:          2341504 kB
SwapCached:            0 kB
Active:           816116 kB
Inactive:        2055352 kB
Active(anon):     548760 kB
Inactive(anon):   284304 kB
Active(file):     267356 kB
Inactive(file):  1771048 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:        524284 kB
SwapFree:         524284 kB
Dirty:                60 kB
Writeback:             0 kB
AnonPages:        462560 kB
Mapped:           334424 kB
Shmem:            370516 kB
Slab:             103692 kB
SReclaimable:      47496 kB
SUnreclaim:        56196 kB
KernelStack:        2016 kB
PageTables:        26008 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     4620404 kB
Committed_AS:    3343896 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       26480 kB
VmallocChunk:   34359700348 kB
HardwareCorrupted:     0 kB
AnonHugePages:    247808 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        8128 kB
DirectMap2M:     8380416 kB

HugePages_Total is still 0, which didn’t surprise me. To allow oracle to lock memory you need to grant it the privilege. I edited /etc/security/limits.conf and set the memlock parameter to 5 GB, which is more than my 2.5 GB SGA needs, but setting the value a little too high doesn’t hurt. The value is in KB, by the way.

oracle    soft    memlock    5242880
oracle    hard    memlock    5242880
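Note that limits.conf is only evaluated at login time (through PAM), so the new limit applies to fresh sessions only. A quick way to check that it took effect, from a new login shell:

# show the maximum locked memory limit (in KB) for the current session
ulimit -l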

After logging out and back in as oracle I tried once more and, hey, success!

Starting ORACLE instance (normal)
DISM started, OS id=11969
****************** Large Pages Information *****************
Parameter use_large_pages = AUTO

Total Shared Global Region in Large Pages = 2514 MB (100%)

Large Pages used by this instance: 1257 (2514 MB)
Large Pages unused system wide = 0 (0 KB) (alloc incr 16 MB)
Large Pages configured system wide = 1257 (2514 MB)
Large Page size = 2048 KB
Time taken to allocate Large Pages = 0.130804 sec
***********************************************************
LICENSE_MAX_SESSION = 0

Also notice the DISM process here, which is responsible for creating the large pages on the fly. This is an interesting “background process”, and Tanel Poder has already mentioned it in one of his presentations:

[root@ol64 ~]# ps -ef | grep 11969
root     11969     1  0 12:50 ?        00:00:00 ora_dism_ora11
root     12026 11911  0 12:50 pts/3    00:00:00 grep 11969
[root@ol64 bin]# ls -l $ORACLE_HOME/bin/oradism
-rwsr-x---. 1 root oinstall 71758 Sep 17  2011 oradism

It is owned by root with the setuid flag set … easy to miss when cloning a home …
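As an aside: if you copy an Oracle home around as the oracle user (with tar, rsync and the like), the root ownership and setuid bit on oradism are lost, and with them the ability to dynamically allocate large pages. A sketch of how to restore the permissions shown above, run as root (adjust $ORACLE_HOME for your environment):

# restore root ownership and the setuid bit (-rwsr-x---) on oradism
chown root:oinstall $ORACLE_HOME/bin/oradism
chmod 4750 $ORACLE_HOME/bin/oradism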

Once the large pages are created the process is no longer needed: start the instance a second time and it doesn’t appear, nor is there any mention of it in the alert.log for that startup sequence. But it has done its work.

[root@ol64 ~]# grep -i page /proc/meminfo
AnonPages:        346808 kB
PageTables:        12852 kB
AnonHugePages:    208896 kB
HugePages_Total:    1257
HugePages_Free:     1045
HugePages_Rsvd:     1045
HugePages_Surp:        0
Hugepagesize:       2048 kB

Notice that not all pages are actually in use yet; I only just started the database. Don’t worry though: 100% of the SGA is allocated in large pages, as per the alert.log. The free pages simply haven’t been faulted in yet, which is why they also appear under HugePages_Rsvd (reserved). Over time you will notice more and more pages being used.

Now you can of course force the database to touch all these pages, but it’s another question whether that is a good idea. You probably don’t want to do so if you have a large SGA, since the startup time can become very long. For the sake of completeness I show it here anyway, together with the effect in /proc/meminfo: I set pre_page_sga = true and bounced the instance.
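In SQL*Plus this would look something like the following sketch (pre_page_sga is a static parameter in this release, hence scope=spfile and the restart):

SQL> alter system set pre_page_sga=true scope=spfile;
SQL> shutdown immediate
SQL> startup

After the restart, /proc/meminfo shows: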

[root@ol64 ~]# grep -i page /proc/meminfo
AnonPages:        370632 kB
PageTables:        15272 kB
AnonHugePages:    206848 kB
HugePages_Total:    1257
HugePages_Free:        3
HugePages_Rsvd:        3
HugePages_Surp:        0
Hugepagesize:       2048 kB
[root@ol64 ~]#

Now all pages are allocated straight after instance start. If you want to follow the example, I suggest you use the watch command as shown here:

watch grep -i page /proc/meminfo

Summary and a bit of a warning

I personally wouldn’t rely on use_large_pages = auto in an environment I care about. It’s simply too unpredictable whether you get the large pages requested, and you might fall back into 4k page mode. Planning is better than hoping: calculate the number of large pages beforehand, add them to /etc/sysctl.conf via vm.nr_hugepages, and you should be almost guaranteed to have them allocated. Large pages need enough contiguous memory, otherwise the allocation may (partially) fail.
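A minimal sketch of that more predictable setup, reusing the 1257 pages from the recommendation earlier in this post (together with the memlock entries shown above):

# /etc/sysctl.conf: reserve the pages permanently, best applied by a reboot
vm.nr_hugepages = 1257

# verify afterwards; sysctl -p applies it online but may only partially succeed
grep HugePages_Total /proc/meminfo

Combined with use_large_pages = only, the instance then refuses to start rather than silently falling back to 4k pages.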

Also remember that large pages cannot be swapped out under memory pressure. And don’t forget you still need enough space for the PGAs and the operating system! If the system starts swapping although “free” shows a lot of free memory, then most likely you have used up all the 4k pages in memory.
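One simple way to keep an eye on this is to watch the swap activity alongside the hugepage counters, for example:

# si/so columns greater than 0 mean the system is actively swapping
vmstat 1

# how much of the huge page pool is actually in use
grep -i huge /proc/meminfo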

Outlook

There is even more to be said about large pages on systems with more than 2 sockets, especially when it comes to allocating large pages per NUMA node. I’ll leave that for a future post.

Oh, and yes: the large page information in /proc is only kept as a legacy; it’s now all in SYSFS under /sys/kernel/mm/hugepages/hugepages-2048kB. Intel x86-64 supports three different page sizes: 4 KB, 2 MB (the 2048 KB shown above) and 1 GB.
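If you want to poke around there yourself, the per-size directory exposes the pool counters as individual files:

# the same counters /proc/meminfo reports, one file per value
ls /sys/kernel/mm/hugepages/hugepages-2048kB/

# for example the pool size, equivalent to vm.nr_hugepages
cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages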

Reference

https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt

Responses

  1. Martin, great post, thank you. For the sake of completeness in your blog, MOS note 401749.1 contains a script you could use to calculate vm.nr_hugepages; note 361468.1 provides all the basics on configuration.
    One question for you:
    How do you determine the performance delta, once you’ve implemented HugePages? I can see all the wonderful reasons why one should move (and, in fact, we moved to HugePages a year ago, which was also a great excuse to get away from AMM). But are there any stats you know of that can tell us how smart or dumb we were in making the change?

    1. Hi Dave,

      thanks for the link to the MOS note and your comments.

      Large pages don’t make applications go faster per se; they reduce overhead. Let me explain.

      The immediate benefit of implementing large pages is that the kernel doesn’t need to keep track of so many small (4k) pages compared to 2M pages. If you divide the total amount of memory by the page size you can see the scale of the benefit. Internally the kernel needs to keep track of all pages and their state (free, dirty, …).

      There are also implications around copying the page tables when a process issues a fork() call. In simple terms: the bigger your SGA and the higher the number of sessions, the greater the benefit of large pages.
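      To put rough numbers on it: an 8 GB SGA corresponds to 2,097,152 pages at 4 KB, but only 4,096 pages at 2 MB; that is 512 times fewer entries for the kernel to track, and correspondingly less page table overhead for every process that attaches to the SGA.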

      Let me know if you’d like to discuss further,

      Martin

      1. Thanks Martin… I think it was poor word choice when I said “performance delta”. What I’m really getting at is, how can we measure the effect on the system, with and without HugePages?
        I agree, it’s more efficient to use hugePages, overhead is reduced, etc. But is there a measurable impact? (If we were talking about improving a query’s performance, we could say, building an index consumes 1GB but saves 2 minutes every time the query is run. Is there something similar we can apply to a kernel setting like this?)
        By the way, if you do write about what happens when fork() is called and hugepages is implemented, I’d love to read it. Thanks.

      2. Well, as always, it depends. I have done some testing but haven’t had time to write it up yet. You should check this page for a performance comparison:

        http://www.csn.ul.ie/~mel/docs/stream-api/

        Hope this helps.

  2. Hi Martin

    Interesting. Did you manage to find the exact description of the ‘AUTO’ parameter value in the Oracle Documentation?

    Radu
