Monthly Archives: August 2014

A brief history of time^H^H Oracle session statistics

I didn’t intend to write another blog post yesterday evening at all, but I found something that was worth sharing and got me excited… And when I started writing I intended it to be a short post, too.

If you have been digging around Oracle session performance counters a little you undoubtedly noticed how their number has increased with every release, and even with every patch set. Unfortunately I don’t have an 11.1 system (or earlier) at my disposal to test with, but here is a comparison of how Oracle has instrumented the database across the releases I do have. I have already ditched my 12.1.0.1 system as well, so no comparison there either :( This is Oracle on Linux.

The script

In the following examples I am going to use a simple query to count the session statistics by their class. The decode statement is based on the official documentation set, where you find the definition of v$statname plus an explanation of the meaning of the class column. Here is the script:

with stats as (
        select name, decode(class,
                1, 'USER',
                2, 'REDO',
                4, 'ENQUEUE',
                8, 'CACHE',
                16, 'OS',
                32, 'RAC',
                64, 'SQL',
                128, 'DEBUG',
                'NA'
        ) as decoded_class from v$statname
)
select count(decoded_class), decoded_class
 from stats
 group by rollup(decoded_class)
 order by 1
/
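
By the way, the ‘NA’ bucket is not a catch-all for undocumented counters: the class column is a bit vector, and a statistic belonging to more than one class carries the sum of the individual values (40 = CACHE + RAC, for example). A quick sketch to list those combined-class statistics:

-- anything that is not exactly one of the eight documented
-- single-class values must be a combination of them
select name, class
  from v$statname
 where class not in (1, 2, 4, 8, 16, 32, 64, 128)
 order by class, name;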

Oracle 11.2.0.3

11.2.0.3 is probably the most common 11g Release 2 version currently out there in the field, or at least that’s my observation. According to MOS Doc ID 742060.1, 11.2.0.3 was released on 23 September 2011 (is that really that long ago?) and is already out of error correction support, by the way.

Executing the above-mentioned script gives me the following result:

COUNT(DECODED_CLASS) DECODED
-------------------- -------
                   9 ENQUEUE
                  16 OS
                  25 RAC
                  32 REDO
                  47 NA
                  93 SQL
                 107 USER
                 121 CACHE
                 188 DEBUG
                 638

So there are 638 of these counters. Let’s move on to 11.2.0.4.

Oracle 11.2.0.4

Oracle 11.2.0.4 is interesting as it was released after 12.1.0.1. It is the terminal release for Oracle 11.2, and you should consider migrating to it as it is still in error correction support. The patch set came out on 28 August 2013. What about the session statistics?

COUNT(DECODED_CLASS) DECODED
-------------------- -------
                   9 ENQUEUE
                  16 OS
                  25 RAC
                  34 REDO
                  48 NA
                  96 SQL
                 117 USER
                 127 CACHE
                 207 DEBUG
                 679

A few more than in 11.2.0.3 (679 instead of 638), all within what you would expect.

Oracle 12.1.0.2

Oracle 12.1.0.2 is fresh off the press, released just a few weeks ago. Unsurprisingly the number of session statistics has increased again. What did surprise me was the number of statistics now available for every session! Have a look at this:

COUNT(DECODED_CLASS) DECODED
-------------------- -------
                   9 ENQUEUE
                  16 OS
                  35 RAC
                  68 REDO
                  74 NA
                 130 SQL
                 130 USER
                 151 CACHE
                 565 DEBUG
                1178

That’s nearly double the count for 11.2.0.3. Incredible, and hence this post. Comparing 11.2.0.4 with 12.1.0.2 you will notice the following (a sketch for diffing the actual counter names follows the list):

  • same number of enqueue stats
  • same number of OS stats
  • 10 additional RAC stats
  • twice the number of REDO related statistics
  • quite a few more not classified (26)
  • 34 more sql related
  • 13 more in the user-class
  • 24 additional stats in the cache-class
  • and a whopping 358 (!) in the debug class
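
To see exactly which counters are new rather than just the class totals, you can diff the statistic names between the two systems. A minimal sketch, assuming SQL*Plus access to both databases; the file names are illustrative only:

# run on the 11.2.0.4 system (repeat on 12.1.0.2 into a second file)
sqlplus -s / as sysdba > statnames_11204.txt <<'EOF'
set pagesize 0 feedback off
select name from v$statname order by name;
EOF

# then compare; lines prefixed with > are new in 12.1.0.2
diff statnames_11204.txt statnames_12102.txt | grep '^>'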

The debug class (128) shows lots of statistics (including spare ones) for the in-memory option (IM):

SQL> select count(1), class from v$statname where name like 'IM%' group by class;

  COUNT(1)      CLASS
---------- ----------
       211        128
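
The statistic names themselves give a good idea of what the in-memory option instruments. A quick way to sample a few of them (the first ten, in no particular order):

-- sample ten of the in-memory statistic names
select name
  from v$statname
 where name like 'IM%'
   and rownum <= 10;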

Happy troubleshooting! This reminds me to look into the IM option in more detail.

A first look at RAC 12c (part I)

I have recently upgraded my RAC 12.1.0.1.3 system to RAC 12.1.0.2, including the RDBMS installation. Currently I am updating my skills with information relevant to what I would normally have called 12c Release 2 (which would also answer the question: when is 12c Release 2 coming out?). Then I realised I haven’t posted a first look at RAC 12c yet, so here it comes.

There are a few things that aren’t specifically mentioned in the new features guide that caught my eye. First of all, RAC 12c does a few really cool things. Have a look at the srvctl command output:

[oracle@rac12node1 ~]$ srvctl
Usage: srvctl <command> <object> [<options>]
    commands: enable|disable|export|import|start|stop|relocate|status|add|remove|modify|getenv|setenv|unsetenv|
       config|convert|update|upgrade|downgrade|predict
    objects: database|instance|service|nodeapps|vip|network|asm|diskgroup|listener|srvpool|server|scan|scan_listener|
        oc4j|home|filesystem|gns|cvu|havip|exportfs|rhpserver|rhpclient|mgmtdb|mgmtlsnr|volume|mountfs
For detailed help on each command and object and its options use:
  srvctl <command> -help [-compatible] or
  srvctl <command> <object> -help [-compatible]
[oracle@rac12node1 ~]$

Quite a few more than with 11.2.0.3:

[oracle@rac112node1 ~]$ srvctl
Usage: srvctl <command> <object> [<options>]
 commands: enable|disable|start|stop|relocate|status|add|remove|modify|getenv|setenv|unsetenv|config|convert|upgrade
 objects: database|instance|service|nodeapps|vip|network|asm|diskgroup|listener|srvpool|server|scan|scan_listener|oc4j|home|filesystem|gns|cvu
For detailed help on each command and object and its options use:
 srvctl <command> -h or
 srvctl <command> <object> -h

I will detail the meaning of some of these later in this post or another one to follow.

Evaluation and Prediction

When you are working with policy-managed databases, RAC 12c already gave you a “what if” option in the form of the -eval flag. If, for example, you wanted to grow your server pool from 2 to 3 nodes:

[oracle@rac12node1 ~]$ srvctl modify srvpool -serverpool pool1 -max 3 -eval -verbose
Database two will be started on node rac12node3
Server rac12node3 will be moved from pool Free to pool pool1
[oracle@rac12node1 ~]$

Now you can predict the consequences of a resource failure as well:

[oracle@rac12node1 ~]$ srvctl predict -h

The SRVCTL predict command evaluates the consequences of resource failure.

Usage: srvctl predict database -db <database_name> [-verbose]
Usage: srvctl predict service -db <database_name> -service <service_name> [-verbose]
Usage: srvctl predict asm [-node <node_name>] [-verbose]
Usage: srvctl predict diskgroup -diskgroup <diskgroup_name> [-verbose]
Usage: srvctl predict filesystem -device <volume_device> [-verbose]
Usage: srvctl predict vip -vip <vip_name> [-verbose]
Usage: srvctl predict network [-netnum <network_number>] [-verbose]
Usage: srvctl predict listener -listener <listener_name> [-verbose]
Usage: srvctl predict scan -scannumber <scan_ordinal_number> [-netnum <network_number>] [-verbose]
Usage: srvctl predict scan_listener -scannumber <scan_ordinal_number> [-netnum <network_number>] [-verbose]
Usage: srvctl predict oc4j [-verbose]

So what would happen if a disk group failed?

[oracle@rac12node1 ~]$ srvctl predict diskgroup -diskgroup DATA -verbose
Resource ora.DATA.dg will be stopped
Resource ora.DATA.ORAHOMEVOL.advm will be stopped
[oracle@rac12node1 ~]$

What it doesn’t do at this stage is assess cascading problems: if +DATA went down, it would pretty much drag the whole cluster down with it, too.
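
You can trace those dependencies yourself, though: crsctl displays a resource’s dependency tree. A sketch using the disk group resource from the predict output above; double-check the flag syntax on your version:

# show the dependency tree for the DATA disk group resource
crsctl stat res ora.DATA.dg -dependency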

Status

Interestingly you can see a lot more detail with 12.1.0.2 than previously. Here is an example of a policy-managed RAC One Node database:

[oracle@rac12node1 ~]$ srvctl config database -d RONNCDB
Database unique name: RONNCDB
Database name: RONNCDB
Oracle home: /u01/app/oracle/product/12.1.0.2/dbhome_1
Oracle user: oracle
Spfile: +DATA/RONNCDB/PARAMETERFILE/spfile.319.854718651
Password file: +DATA/RONNCDB/PASSWORD/pwdronncdb.290.854718263
Domain:
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Server pools: ronpool1
Disk Groups: RECO,DATA
Mount point paths:
Services: NCDB
Type: RACOneNode
Online relocation timeout: 30
Instance name prefix: RONNCDB
Candidate servers:
OSDBA group: dba
OSOPER group:
Database instances:
Database is policy managed

Did you spot the OSDBA and OSOPER group mappings in the output? Also note that since 12.1.0.1, DBCA by default creates the password file and server parameter file in ASM.
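
Verifying the ASM-resident password file is straightforward. A quick sketch, run with the grid infrastructure environment set; the path simply mirrors the srvctl output above:

# list the password file that srvctl config reported
asmcmd ls -l +DATA/RONNCDB/PASSWORD/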

You can get a lot more status information in 12.1.0.2 than previously, especially when compared to 11.2:

[oracle@rac12node1 ~]$ srvctl status -h

The SRVCTL status command displays the current state of the object.

Usage: srvctl status database {-db <db_unique_name> [-serverpool <serverpool_name>] | -serverpool <serverpool_name> | -thisversion | -thishome} [-force] [-verbose]
Usage: srvctl status instance -db <db_unique_name> {-node <node_name> | -instance <inst_name_list>} [-force] [-verbose]
Usage: srvctl status service {-db <db_unique_name> [-service  "<service_name_list>"] | -serverpool <serverpool_name> [-db <db_unique_name>]} [-force] [-verbose]
Usage: srvctl status nodeapps [-node <node_name>]
Usage: srvctl status vip {-node <node_name> | -vip <vip_name>} [-verbose]
Usage: srvctl status listener [-listener <lsnr_name>] [-node <node_name>] [-verbose]
Usage: srvctl status asm [-proxy] [-node <node_name>] [-detail] [-verbose]
Usage: srvctl status scan [[-netnum <network_number>] [-scannumber <scan_ordinal_number>] | -all] [-verbose]
Usage: srvctl status scan_listener [[-netnum <network_number>] [-scannumber <scan_ordinal_number>] | -all] [-verbose]
Usage: srvctl status srvpool [-serverpool <pool_name>] [-detail]
Usage: srvctl status server -servers "<server_list>" [-detail]
Usage: srvctl status oc4j [-node <node_name>] [-verbose]
Usage: srvctl status rhpserver
Usage: srvctl status rhpclient
Usage: srvctl status home -oraclehome <oracle_home> -statefile <state_file> -node <node_name>
Usage: srvctl status filesystem [-device <volume_device>] [-verbose]
Usage: srvctl status volume [-device <volume_device>] [-volume <volume_name>] [-diskgroup <group_name>] [-node <node_list> | -all]
Usage: srvctl status diskgroup -diskgroup <dg_name> [-node "<node_list>"] [-detail] [-verbose]
Usage: srvctl status cvu [-node <node_name>]
Usage: srvctl status gns [-node <node_name>] [-verbose]
Usage: srvctl status mgmtdb [-verbose]
Usage: srvctl status mgmtlsnr [-verbose]
Usage: srvctl status exportfs [-name <expfs_name> |-id <havip id>]
Usage: srvctl status havip [-id <id>]
Usage: srvctl status mountfs -name <mountfs_name>
For detailed help on each command and object and its options use:
  srvctl <command> <object> -help [-compatible]

RAC 12.1.0.2 adds a few nifty little flags to srvctl status database: -thisversion and -thishome. That works really well when you have multiple versions of Oracle on the same machine (think consolidation):

[oracle@rac12node1 ~]$ srvctl status database -thisversion
Database unique name: RONNCDB
Instance RONNCDB_1 is running on node rac12node4
Online relocation: INACTIVE

Database unique name: TWO
Instance TWO_1 is running on node rac12node1
Instance TWO_2 is running on node rac12node2

Verbose!

Some commands are actually more verbose when you specify the -verbose flag:

[oracle@rac12node1 ~]$ srvctl status database -d RONNCDB -verbose
Instance RONNCDB_1 is running on node rac12node4 with online services NCDB. Instance status: Open.
Online relocation: INACTIVE
[oracle@rac12node1 ~]$ srvctl status database -d RONNCDB
Instance RONNCDB_1 is running on node rac12node4
Online relocation: INACTIVE
[oracle@rac12node1 ~]$

But I believe that’s not new in 12.1.0.2.

Interesting changes for database logging

The database itself will also tell you more about memory allocation, right in the alert log:

**********************************************************************
Dump of system resources acquired for SHARED GLOBAL AREA (SGA)
 Per process system memlock (soft) limit = 64K
Thu Jul 31 13:34:58 2014
 Expected per process system memlock (soft) limit to lock
 SHARED GLOBAL AREA (SGA) into memory: 1538M
Thu Jul 31 13:34:58 2014
 Available system pagesizes:
  4K, 2048K
 Supported system pagesize(s):
  PAGESIZE  AVAILABLE_PAGES  EXPECTED_PAGES  ALLOCATED_PAGES  ERROR(s)
        4K       Configured               3          393219        NONE
     2048K                0             769               0        NONE

RECOMMENDATION:
 1. For optimal performance, configure system with expected number
 of pages for every supported system pagesize prior to the next
 instance restart operation.
 2. Increase per process memlock (soft) limit to at least 1538MB
 to lock 100% of SHARED GLOBAL AREA (SGA) pages into physical memory

As you can see I am not using large pages here at all, a configuration chosen for demonstration purposes only. I don’t see any reason not to use large pages on a 64-bit system these days. I’m curious to see whether the AIX port supports all the AIX page sizes here.
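
If you wanted to follow the RECOMMENDATION section, you would reserve the 769 2048K pages the instance expects and raise the memlock limit to at least 1538 MB. A minimal sketch for this particular instance, with values taken straight from the log excerpt (1538 MB = 1574912 KB) rather than general-purpose settings:

# /etc/sysctl.conf - reserve 769 x 2 MB huge pages for the SGA
vm.nr_hugepages = 769

# /etc/security/limits.conf - let oracle lock the whole SGA (values in KB)
oracle soft memlock 1574912
oracle hard memlock 1574912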

End of part I

This has already turned into a longer post than I expected it to be when I started writing. I think I’ll continue the series in a couple of weeks when I find the time.

How to get insights into the Linux Kernel

This is probably as much a note-to-self as anything else. Recently I have enjoyed some more in-depth research into how the Linux kernel works, and to that end I started fairly low-level. Theoretically speaking, you need to understand the hardware-software interface before you can understand the upper levels, although in practice you can get by with less. If you are truly interested in how computers work you might want to consider reading up on some background. Some very knowledgeable people I deeply respect have recommended books by David A. Patterson and John L. Hennessy. I have these two:

  • Computer Organization and Design, Fifth Edition: The Hardware/Software Interface
  • Computer Architecture, Fifth Edition: A Quantitative Approach

I think I found a few references to the above books in James Morle’s recent blog article about the true cost of licensing the in-memory database option, and he definitely refers to the second book in his Sane SAN paper. I complemented these books with The Linux Programming Interface: A Linux and UNIX System Programming Handbook to get an overview of the Linux API. Oh, and Linux from Scratch is a great resource too!

The Foundation is set

But now, what next? The Linux kernel evolves rather quickly, and don’t be fooled by version numbers. The “enterprise” kernels keep a rather conservative, static version numbering scheme. Remember 2.6.18? The kernel shipped with RHEL 5.10 has little in common with the one released years and years ago with RHEL 5.0. SuSE seems to be more aggressive, naming kernels differently. A good discussion of the pros and cons of that approach can be found on LWN: http://lwn.net/Articles/486304/ Long story short: the Linux kernel developers keep pushing the limits with the “upstream” or “vanilla” kernel. You can follow the development on the LKML, the Linux Kernel Mailing List, but that list is busy… The distribution vendors in turn take a stable version of the kernel and add the features they need. That includes back-porting as well, which is why it is so hard to see what is going on inside a given kernel. But there are exceptions, and a few tricks help, as shown below.
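
Short of reading the source, one practical way to see what a distributor has folded into its kernel is the package changelog, which lists the backported fixes. A sketch; substitute the kernel package name your distribution uses (kernel, kernel-uek, …):

# show the 20 most recent changelog entries for the installed kernel
rpm -q --changelog kernel-uek | head -20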

The inner workings

Apologies to all SuSE and Red Hat geeks: I haven’t been able to find a web repository for their kernel code! If you know of one and have the URL, let me know and I’ll add it here. I don’t want to sound biased, but it simply happens that I know Oracle Linux best.

Now, to really dive into the internals and implementation you need to look at the source code. When browsing the code it helps to understand the C programming language, and maybe some assembler. I would love to know more assembler than I do, but I don’t believe it is strictly necessary.

Oracle publishes the kernel code in Git repositories on oss.oracle.com.

Oracle also provides the patches to Red Hat kernels in the RedPatch project. If I understand things correctly, Red Hat now provides the kernel changes as one massive tarball with the patches already applied, whereas previously it shipped the base kernel plus the individual patches, a change which caused some controversy.

The Linux Cross Reference gives you insights into the upstream kernel.

NB: kernel documentation can be found in the Documentation subdirectory of the kernel source tree. This is very useful stuff!
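
On Red Hat-style distributions, including Oracle Linux, you do not even need a checked-out source tree to read these files: the kernel-doc package ships the Documentation directory. A hypothetical session:

# install and browse the packaged kernel documentation
yum install kernel-doc
less /usr/share/doc/kernel-doc-*/Documentation/filesystems/nfs/nfs-rdma.txt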

Now why would you want to do this?

My use case! I wanted to find out if and how I could do NFS over RDMA. When in doubt, use an Internet search engine and common sense; in this case, the kernel documentation. And sure enough, NFS-RDMA seems possible.

https://www.kernel.org/doc/Documentation/filesystems/nfs/nfs-rdma.txt

The link suggests a few module names and prerequisites for enabling NFS-RDMA. The nfs-utils package must be version 1.1.2 or later, and the kernel NFS server must be built with RDMA support. Using the kernel source RPM you can check the options used for compiling the kernel. Normally you’d use make menuconfig or an equivalent to enable or disable options or to build them as modules (refer to the excellent Linux From Scratch). Except that you don’t do that with the enterprise distributions, of course: building kernels for fun is off limits there. If you have a problem with the Linux kernel (like a buggy kernel module), your vendor provides the fix, not the Linux engineer. But I digress… Each subtree in the kernel has a Kconfig file that lists the configuration options and their meaning.

For NFS-RDMA, InfiniBand support must be enabled (no brainer), but also IPoIB, and then the RDMA support for NFS (the “sunrpc” subtree).

Back to the build options: the kernel-uek-devel package installs a file called .config in /usr/src/kernels/nameAndVersion/ listing all of them. Grepping for RDMA in that file shows the following for UEK 3:

[root@rac12node1 3.8.13-35.3.3.el6uek.x86_64]# grep -i rdma .config
CONFIG_RDS_RDMA=m
CONFIG_NET_9P_RDMA=m
CONFIG_CARDMAN_4000=m
CONFIG_CARDMAN_4040=m
# CONFIG_INFINIBAND_OCRDMA is not set
CONFIG_SUNRPC_XPRT_RDMA_CLIENT=m
# CONFIG_SUNRPC_XPRT_RDMA_CLIENT_ALLPHYSICAL is not set
CONFIG_SUNRPC_XPRT_RDMA_SERVER=m

And here is the same for UEK 2:

[root@server1 2.6.39-400.17.1.el6uek.x86_64]# grep -i rdma .config
CONFIG_RDS_RDMA=m
CONFIG_NET_9P_RDMA=m
CONFIG_CARDMAN_4000=m
CONFIG_CARDMAN_4040=m
CONFIG_SUNRPC_XPRT_RDMA=m

So that looks promising: the letter “m” stands for “module”. But what do these options mean? The Kconfig files to the rescue again, but first I have to find the correct one. This example is for UEK 2:

[root@server1 2.6.39-400.17.1.el6uek.x86_64]# for file in $(rpm -qil kernel-uek-devel | grep Kconfig );
> do grep -i SUNRPC_XPRT_RDMA $file /dev/null;
> done
/usr/src/kernels/2.6.39-400.17.1.el6uek.x86_64/net/sunrpc/Kconfig:config SUNRPC_XPRT_RDMA

Found you! Notice that I’m adding /dev/null to the grep command: grep only prefixes a match with the file name when it searches more than one file, so the extra argument forces it to print where the match was found. Looking at the file just found:

config SUNRPC_XPRT_RDMA
        tristate
        depends on SUNRPC && INFINIBAND && INFINIBAND_ADDR_TRANS && EXPERIMENTAL
        default SUNRPC && INFINIBAND
        help
          This option allows the NFS client and server to support
          an RDMA-enabled transport.

          To compile RPC client RDMA transport support as a module,
          choose M here: the module will be called xprtrdma.

          If unsure, say N.

All that remained to be done was to check that the other configuration variables (INFINIBAND, INFINIBAND_ADDR_TRANS etc.) were set in the top-level .config file, and they were.
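
That check is another grep away, and modinfo confirms that the module announced in the Kconfig help text actually shipped with the kernel. A short sketch for the UEK 2 kernel used above:

# confirm the dependencies are enabled in the build...
grep -E '^CONFIG_INFINIBAND(_ADDR_TRANS)?=' \
    /usr/src/kernels/2.6.39-400.17.1.el6uek.x86_64/.config

# ...and check that the module named by the Kconfig help text exists
modinfo xprtrdma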