This is probably as much a note-to-self as it can possibly be. Recently I have enjoyed some more in-depth research about how the Linux kernel works. To that end I started fairly low-level. Theoretically speaking, you need to understand the hardware-software interface before you can understand the upper levels, although in practice you get by with less knowledge. If you are truly interested in how computers work, though, you might want to consider reading up on some background. Some very knowledgeable people I deeply respect have recommended books by David A. Patterson and John L. Hennessy. I have these two:
- Computer Organization and Design, Fifth Edition: The Hardware/Software Interface
- Computer Architecture, Fifth Edition: A Quantitative Approach
I think I found a few references to the above books in James Morle’s recent blog article about the true cost of licensing the in-memory database option and he definitely refers to the second book in his Sane SAN paper. I complemented these books with The Linux Programming Interface: A Linux and UNIX System Programming Handbook to get an overview of the Linux API. Oh and Linux from Scratch is a great resource too!
The Foundation is set
But now, what next? The Linux kernel evolves rather quickly, and don’t be fooled by version numbers. The “enterprise” kernels keep a rather conservative, static version number scheme. Remember 2.6.18? The kernel with RHEL 5.10 has little in common with the one released years and years ago with RHEL 5.0. SuSE seems to be more aggressive, naming kernels differently. A good discussion of the pros and cons of that approach can be found on LWN: http://lwn.net/Articles/486304/ Long story short: the Linux kernel developers keep pushing the limits with the “upstream” or “vanilla” kernel. You can follow the development on the LKML, the Linux Kernel Mailing List. But that list is busy… The distribution vendors in turn take a stable version of the kernel and add the features they need. That includes back-porting as well, which is why it’s so hard to see what’s going on with a kernel internally. But there are exceptions.
The inner workings
Apologies to all SuSE and Red Hat geeks: I haven’t been able to find a web repository for the kernel code! If you know of one and have the URL, let me know and I’ll add it here. I don’t want to sound biased, but it simply happens that I know Oracle Linux best.
Now to really dive into the internals and implementation you need to look at the source code. When browsing the code it helps to understand the C programming language. And maybe some Assembler. I would love to know more about Assembler than I do, but I don’t believe it’s strictly necessary.
Oracle publishes the kernel code in the Git repositories on oss.oracle.com:
- UEK 2 can be found at https://oss.oracle.com/git/?p=linux-2.6-unbreakable.git;a=summary
- UEK 3 can be found at https://oss.oracle.com/git/?p=linux-uek3-3.8.git
Oracle also provides patches for Red Hat kernels in project Red Patch. If I understand things correctly then Red Hat provides changes to the kernel in a massive tarball with the patches already applied. Previously it appears to have shipped the kernel + patches, which caused some controversy.
The Linux Cross Reference gives you insights into the upstream kernel.
NB: Kernel documentation can be found in the Documentation subdirectory. This is very useful stuff!
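If you have a kernel source tree unpacked somewhere, a quick way to find the relevant documentation is to search that subdirectory for a keyword. The following is only a small sketch: the function name docs_mentioning is made up for this post, and it assumes a grep that supports the -r, -i and -l options (GNU grep does).

```shell
# docs_mentioning KERNEL_TREE KEYWORD
# List the files below KERNEL_TREE/Documentation that mention KEYWORD,
# ignoring case. Only file names are printed (-l), not the matching lines.
docs_mentioning() {
    grep -ril -- "$2" "$1/Documentation"
}

# Hypothetical usage, assuming the tree was unpacked to ~/src/linux:
#   docs_mentioning ~/src/linux rdma
```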
Now why would you want to do this?
My use case! I wanted to find out if/how I could do NFS over RDMA. When in doubt, use an Internet search engine and common sense. In this case: use the kernel documentation and sure enough, NFS-RDMA seems possible.
The link suggests a few module names and prerequisites for enabling NFS-RDMA. The nfs-utils package must be version 1.1.2 or later, and the kernel NFS server must be built with RDMA support. Using the kernel source RPM you can check the options being used for compiling the kernel. Normally you’d use make menuconfig or an equivalent to enable/disable options or to build them as modules (refer to the excellent Linux From Scratch). Except that you don’t do that with the enterprise distributions of course. Building kernels for fun is off limits on these. If you have a problem with the Linux kernel (like a buggy kernel module), your vendor provides the fix, not the Linux engineer. But I digress… Each subtree in the kernel has a Kconfig file that lists the configuration options and their meaning.
For the purpose of NFS-RDMA, InfiniBand support must be enabled (no-brainer), but also IPoIB and then the RDMA support for NFS (“sunrpc”).
Back to the kernel packages: the kernel-uek-devel package (the one queried with rpm further down) installs a file called .config in /usr/src/kernels/nameAndVersion/ listing all the build options. Grepping for RDMA in the file shows the following for UEK 3:
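The nfs-utils version requirement is easy to check by hand, but here is a rough sketch of how it could be scripted. It assumes GNU sort with its -V (version sort) option; the helper name version_ge is made up for this example and is not part of any package.

```shell
# version_ge A B
# Succeed if version string A is the same as or newer than version string B.
# Both versions are sorted with GNU sort -V; if B comes out first (or both
# are equal), A is at least as new as B.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical usage on an RPM-based system:
#   installed=$(rpm -q --qf '%{VERSION}' nfs-utils)
#   version_ge "$installed" 1.1.2 && echo "nfs-utils is recent enough"
```

A plain string comparison would get this wrong for versions such as 1.1.10 versus 1.1.2, which is why the sketch leans on version sort instead.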
[root@rac12node1 3.8.13-35.3.3.el6uek.x86_64]# grep -i rdma .config
CONFIG_RDS_RDMA=m
CONFIG_NET_9P_RDMA=m
CONFIG_CARDMAN_4000=m
CONFIG_CARDMAN_4040=m
# CONFIG_INFINIBAND_OCRDMA is not set
CONFIG_SUNRPC_XPRT_RDMA_CLIENT=m
# CONFIG_SUNRPC_XPRT_RDMA_CLIENT_ALLPHYSICAL is not set
CONFIG_SUNRPC_XPRT_RDMA_SERVER=m
And here is the same for UEK 2:
[root@server1 2.6.39-400.17.1.el6uek.x86_64]# grep -i rdma .config
CONFIG_RDS_RDMA=m
CONFIG_NET_9P_RDMA=m
CONFIG_CARDMAN_4000=m
CONFIG_CARDMAN_4040=m
CONFIG_SUNRPC_XPRT_RDMA=m
So that looks promising: the letter “m” stands for “module”. But what do these options mean? The Kconfig file to the rescue again, but I first have to find the correct one. This example is for UEK 2:
[root@server1 2.6.39-400.17.1.el6uek.x86_64]# for file in $(rpm -qil kernel-uek-devel | grep Kconfig );
> do grep -i SUNRPC_XPRT_RDMA $file /dev/null;
> done
/usr/src/kernels/2.6.39-400.17.1.el6uek.x86_64/net/sunrpc/Kconfig:config SUNRPC_XPRT_RDMA
Found you! Notice that I’m adding /dev/null to the grep command: when grep is given more than one file, it prefixes each match with the name of the file in which it was found. Looking at the file just found:
config SUNRPC_XPRT_RDMA
	tristate
	depends on SUNRPC && INFINIBAND && INFINIBAND_ADDR_TRANS && EXPERIMENTAL
	default SUNRPC && INFINIBAND
	help
	  This option allows the NFS client and server to support
	  an RDMA-enabled transport.

	  To compile RPC client RDMA transport support as a module,
	  choose M here: the module will be called xprtrdma.

	  If unsure, say N.
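The interesting part of that entry is the "depends on" line. As a sketch only, the symbols on it could be pulled out with a bit of awk. The function name kconfig_depends is my own invention, and the parsing is deliberately naive: it ignores the operators (&&, ||, !) and simply lists every upper-case symbol it sees, which is enough for a simple entry like this one.

```shell
# kconfig_depends KCONFIG_FILE SYMBOL
# Print the symbols named on the "depends on" lines of the entry
# "config SYMBOL" in KCONFIG_FILE, one symbol per line.
kconfig_depends() {
    awk -v sym="$2" '
        # A new "config" or "menuconfig" line starts a new entry.
        $1 == "config" || $1 == "menuconfig" { in_entry = ($2 == sym) }
        # Inside the wanted entry, emit every word after "depends on".
        in_entry && $1 == "depends" && $2 == "on" {
            for (i = 3; i <= NF; i++) print $i
        }
    ' "$1" | tr -d '()!' | grep -E '^[A-Z0-9_]+$'   # keep symbols, drop operators
}

# Usage with the file found above:
#   kconfig_depends /usr/src/kernels/2.6.39-400.17.1.el6uek.x86_64/net/sunrpc/Kconfig SUNRPC_XPRT_RDMA
```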
All that remained to be done was to check whether these other configuration variables (INFINIBAND, INFINIBAND_ADDR_TRANS etc.) were set in the top-level .config file, and they were.
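That last check can be scripted as well. The following is a minimal sketch, not a polished tool: the function name config_state is made up, and it only understands bool/tristate options (values y and m). Anything else, including string or numeric options, is reported as "n".

```shell
# config_state CONFIG_FILE SYMBOL
# Print "y" (built in), "m" (module) or "n" (not set, or not present at all)
# for CONFIG_SYMBOL in a kernel .config file.
config_state() {
    # sed prints the value only for an exact "CONFIG_SYMBOL=y|m" line;
    # grep . fails on empty input, in which case we fall back to "n".
    sed -n "s/^CONFIG_$2=\([ym]\)$/\1/p" "$1" | grep . || echo n
}

# Usage, run from the directory holding .config:
#   for sym in SUNRPC INFINIBAND INFINIBAND_ADDR_TRANS EXPERIMENTAL; do
#       printf '%-24s %s\n' "$sym" "$(config_state .config "$sym")"
#   done
```

Anchoring the pattern at both ends matters here: without it, a lookup for INFINIBAND would also match CONFIG_INFINIBAND_ADDR_TRANS and friends.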