Geeking out on Linux: file descriptors and their open modes & flags

While preparing my talk for the upcoming POUG 2021 conference I wanted to know which flags Oracle passes to open(2) when accessing a data file. A file needs to be opened before its contents can be accessed (no surprises here :). There are quite a few ways of opening a file, depending on whether you want to use the file system cache or not, and whether you want to perform synchronous or asynchronous I/O. There are more alternatives as well but I haven’t come across them with the Oracle database so let’s stick with the arguments/flags/modes available with open(2).

Based on what I understand of how Linux works, you can use the following methods to get the flags passed to open(2):

  1. Use a tracer to catch the call to open(2) when it happens
  2. Find a way to look up flags passed to open(2) after the call completed

This posts shows you both. I should point out though that tracing is an intrusive process and all steps you see in this post have been performed in my Oracle-on-vagrant build running on my laptop. The post is for educational purposes only, especially since you don’t need a tracer to get the answer.

WARNING: don’t use tracers in production environments! They can crash systems. If you are curious about Linux tracing and profiling, always experiment on lab VMs where you are on your own, and crashes don’t cause problems to others. Or worse.

Background

Consider the following example. I have created a session with sqlplus (any other client will do, too) and am about to issue my first query. This is an Oracle 19c database although I’m pretty sure this has been the standard behaviour for quite some time. Please note that the following steps don’t apply when you are using Automatic Storage Management (ASM). I also used filesystemio_options set to none. You see, this is really a post for educational purposes.

Tracing

At this stage, I haven’t touched a user-created segment. Oracle will have to open the data file containing my table next. Let’s see what happens when I run my select statement. Opening a second session on the host I can create a bpftrace probe and attach it to sys_enter_openat() on my server process ID (SPID, 21554). As soon as I start my query in session 1, Oracle opens the user tablespace’s only data file and my probe in session 2 fires:

[root@server1 ~]# bpftrace -e 'tracepoint:syscalls:sys_enter_openat /pid == 21554/ { 
    printf("%s opened file %s with flags %x mode %x\n", comm, str(args->filename), args->flags, args->mode); 
}'
Attaching 1 probe...
oracle_21554_or opened file /u02/oradata/ORCL/datafile/o1_mf_users_jdoz9om2_.dbf with flags 1002 mode 0

If you aren’t intimately familiar with Linux you might wonder how I knew which arguments to pass to printf()? That’s easy enough, the format of sys_enter_openat is available in /sys/kernel/debug/tracing/events/syscalls/sys_enter_openat/format

# cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_openat/format
name: sys_enter_openat
ID: 624
format:
	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
	field:int common_pid;	offset:4;	size:4;	signed:1;

	field:int __syscall_nr;	offset:8;	size:4;	signed:1;
	field:int dfd;	offset:16;	size:8;	signed:0;
	field:const char * filename;	offset:24;	size:8;	signed:0;
	field:int flags;	offset:32;	size:8;	signed:0;
	field:umode_t mode;	offset:40;	size:8;	signed:0;

print fmt: "dfd: 0x%08lx, filename: 0x%08lx, flags: 0x%08lx, mode: 0x%08lx", ((unsigned long)(REC->dfd)), ((unsigned long)(REC->filename)), ((unsigned long)(REC->flags)), ((unsigned long)(REC->mode))

Back to the output generated by bpftrace. Unfortunately this isn’t quite as straight-forward as I thought. First of all, I’m returning the arguments to openat() as a hexadecimal number. Actually the flags passed to openat() are in octal notation, which requires translating them. The obvious use of %o in printf() isn’t permitted. 1002 in base 16 equals 10002 in base 8.

But what does the flag really mean?

But that’s only half the job done. What does the flag actually mean? Researching a little bit I found a few header files with more details. Long story short I found what I needed in /usr/include/asm-generic/fcntl.h. But … Linux x86-64 doesn’t implement all of these. You can check whether a given flag is valid or not in /usr/src/kernels/$(uname -r)/include/linux/fcntl.h.

With all that research done I now can tell you that a flag of 10002 means the file was opened using O_RDWR and O_DSYNC.

[oracle@server1 ~]$ egrep 'define O.*[0-9]+' /usr/include/asm-generic/fcntl.h
...
#define O_RDONLY	00000000
#define O_WRONLY	00000001
#define O_RDWR		00000002
...
#define O_DSYNC		00010000	/* used to be O_SYNC, see below */
#define O_DIRECT	00040000	/* direct disk access hint */
#define O_LARGEFILE	00100000
#define O_DIRECTORY	00200000	/* must be a directory */
#define O_NOFOLLOW	00400000	/* don't follow links */
#define O_NOATIME	01000000
#define O_CLOEXEC	02000000	/* set close_on_exec */
...
[oracle@server1 ~]$ 

You test whether a flag is set by using a bitwise and operation, for O_RDWR that’s easy enough in bash:

$ echo $(( 10002 & 2 ))
2

If the result of the bitand operations matches the flag’s value (in this case 2 equals 2) the flag is set.

And what does it mean? Oracle opened the data file read/write for synchronous I/O. Which was to be expected given I’m using a file system with filesystemio_options set to none. Oracle will neither use direct I/O bypassing the file system cache, nor asynchronous I/O either.

Let’s recap for a minute. So far I have used a tracer to read the arguments to openat(2) and decoded the flag after translating its hexadecimal representation to an octal value. That’s quite a lot of work and unless you know the flags in open(2) by heart not very enjoyable. Furthermore, you won’t catch the arguments to open(2) if the process has issued the statement before you traced it.

Determining the flags retrospectively

There are many ways to do so, the easiest is to check /proc/pid/fdinfo/fd. The first step is to identify the file descriptor of the data file opened by the Oracle process. In this case the user tablespace’s data file was opened with a file descriptor of 259

[oracle@server1 ~]$ ls -l /proc/21554/fd
total 0
lr-x------. 1 oracle oinstall 64 Jul  9 19:49 0 -> /dev/null
l-wx------. 1 oracle oinstall 64 Jul  9 19:49 1 -> /dev/null
lr-x------. 1 oracle oinstall 64 Jul  9 19:49 10 -> 'pipe:[127168]'
l-wx------. 1 oracle oinstall 64 Jul  9 19:49 13 -> 'pipe:[127169]'
l-wx------. 1 oracle oinstall 64 Jul  9 19:49 2 -> /dev/null
lrwx------. 1 oracle oinstall 64 Jul  9 19:50 256 -> /u02/oradata/ORCL/datafile/o1_mf_system_jdoz8275_.dbf
lrwx------. 1 oracle oinstall 64 Jul  9 19:50 257 -> /u02/oradata/ORCL/datafile/o1_mf_sysaux_jdoz95dk_.dbf
lrwx------. 1 oracle oinstall 64 Jul  9 19:50 258 -> /u02/oradata/ORCL/datafile/o1_mf_undotbs1_jdoz9nj3_.dbf
lrwx------. 1 oracle oinstall 64 Jul  9 19:50 259 -> /u02/oradata/ORCL/datafile/o1_mf_users_jdoz9om2_.dbf
lr-x------. 1 oracle oinstall 64 Jul  9 19:49 3 -> /dev/null
lr-x------. 1 oracle oinstall 64 Jul  9 19:49 4 -> /u01/app/oracle/product/19.0.0/dbhome_1/rdbms/mesg/oraus.msb
lr-x------. 1 oracle oinstall 64 Jul  9 19:49 5 -> /proc/21554/fd
lr-x------. 1 oracle oinstall 64 Jul  9 19:49 6 -> /var/lib/sss/mc/passwd
lrwx------. 1 oracle oinstall 64 Jul  9 19:49 7 -> 'anon_inode:[eventpoll]'

I can grab the flags from fdinfo:

[oracle@server1 ~]$ grep flags /proc/21554/fdinfo/259
flags:	02110002

Huh? That’s 8 bit though, not 5 as in the previous example. Decoding the flag was a lot easier this time as I knew the flags and their bitmasks.

02110002
not setO_CLOEXECO_LARGEFILEO_DSYNCnot set not setnot setO_RDWR
Bitmask set in fdinfo decoded

Still very much the same picture as before with the addition of the O_LARGEFILE and O_CLOEXEC flags. Check man open(2) If you want to know what they imply.

Using lsof is cheating

If you are lucky enough to have lsof installed on your system you can use it as well. Although it’s not much of a challenge since lsof decodes the flags for you ;) Well most of them at least. The following output was generated on a different system hence the PID and fd don’t match the earlier outptut:

[oracle@server1 ~]$ lsof -p 10009 +fg | egrep 'COMMAND|users'
COMMAND     PID   USER   FD      TYPE         FILE-FLAG DEVICE   SIZE/OFF      NODE NAME
oracle_10 10009 oracle  257u      REG RW,SYN,LG,0x80000   8,33   15736832  16777348 /u02/oradata/ORCL/datafile/o1_mf_users_jdoz9om2_.dbf
[oracle@server1 ~]$ 

And you don’t even need to be root to get these details. Unfortunately lsof as packaged in Linux 8.4 (4.93.2) doesn’t recognise O_CLOEXEC as per a bugzilla entry I found.

Summary

I love Linux since there are so many ways to do things. My talk at POUG 2021 will contain lots of references to BCC and bpftrace so this was a good starting point to learn about these tools. On the other hand it turned out I didn’t need to concern myself with tracing tools at all as Linux often keeps information in many places. You just need to know what to do with it.

Advertisement