Martins Blog

Trying to explain complex things in simple terms

What happens in ASM if usable_file_mb is negative and you lose a failgroup

Posted by Martin Bach on February 16, 2015

Having read the excellent post “Demystifying ASM REQUIRED_MIRROR_FREE_MB and USABLE_FILE_MB” again by Harald von Breederode I wanted to see what happens if you create a setup where your usable_file_mb is negative and you actually have to rebalance after a fatal failgroup error. I am using 12.1.0.2.0 on Oracle Linux 6.6/UEK3 in a KVM in case anyone is interested. I/O times aren’t stellar on that environment. It’s Oracle Restart, not clustered ASM.

Note: this post is only applicable if you are using ASM for data protection, e.g. normal or high redundancy. External redundancy is a different thing: if you lose a failgroup in a disk group with external redundancy defined then you are toast. The disk group will dismount on all ASM instances. No disk group = nothing to write to = crash of dependent databases.

Setup

Here is my setup. I created the VM with 2 volume groups, one for the operating system, and another one for Oracle. After the installation of the database and Grid Infrastructure I have a disk group named DATA comprised of 4 disks, each in their own failgroup (which is both the default and intentional):

ASMCMD> lsdsk -k
Total_MB  Free_MB  OS_MB  Name       Failgroup  Failgroup_Type  Library  Label  UDID  Product  Redund   Path
    5119     4131   5119  DATA_0000  DATA_0000  REGULAR         System                         UNKNOWN  /dev/vdc1
    5119     4127   5119  DATA_0001  DATA_0001  REGULAR         System                         UNKNOWN  /dev/vdd1
    5119     4128   5119  DATA_0002  DATA_0002  REGULAR         System                         UNKNOWN  /dev/vde1
    5119     4139   5119  DATA_0003  DATA_0003  REGULAR         System                         UNKNOWN  /dev/vdf1
ASMCMD> lsdg
State    Type    Rebal  Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  NORMAL  N         512   4096  1048576     20476    16525             5119            5703              0             N  DATA/
ASMCMD>

My database uses the datafiles as shown here:

SQL> select name, bytes from v$datafile;

NAME                                                              BYTES
------------------------------------------------------------ ----------
+DATA/NCDB/DATAFILE/system.258.871538097                      817889280
+DATA/NCDB/DATAFILE/sysaux.257.871538033                      629145600
+DATA/NCDB/DATAFILE/undotbs1.260.871538187                     62914560
+DATA/NCDB/DATAFILE/users.259.871538183                         5242880

ASMCMD must take the file information from v$asm_file-the “space” column seems to take the normal redundancy into account:


SQL> select block_size, blocks, bytes, space, redundancy from v$asm_file where file_number = 259 and incarnation = 871538183;

BLOCK_SIZE     BLOCKS      BYTES      SPACE REDUND
---------- ---------- ---------- ---------- ------
      8192        641    5251072   12582912 MIRROR

As others have explained already the “space” column also takes into account ASM metadata that is of course mirrored too. So to me those numbers make sense. If you look up to the asmcmd output you will notice that the useable_file_mb is 5703MB.

This is where it goes wrong

Let’s cause trouble for the sake of this blog post and to trigger failure. The following is STRONGLY discouraged in your environment! I have put this here so you can see the effect of “negative free space” without having to endure this yourself.

I am deliberately oversubscribing space in the data disk group, which is definitely Not A Good Thing(tm). Please monitor usable_file_mb for each disk group carefully and add storage when needed (you might be unpleasantly surprised how long it can take to add storage so request it early). You will see why it shouldn’t be negative-ever-shortly:

SQL> create tablespace iamtoolargetofit datafile size 6g;

Tablespace created.

[oracle@server1 ~]$ asmcmd lsdg
State    Type    Rebal  Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  NORMAL  N         512   4096  1048576     20476     4212             5119            -453              0             N  DATA/

The new tablespace is bigger than usable_file_mb and turned that number into the negative. Now if I were to lose a failgroup (the term “disk” is a bit misleading here) I’d be in Big Trouble. Since this is my controlled lab environment and I wanted to see what happens I carried on.

On the host I use virsh detach-disk to remove an arbitrary of my ASM disks. Note from further up the repair timer was 0 for all disks. In other words, there won’t be a grace period to allow for transient failures, the disk will be dropped straight away.

# virsh detach-disk --domain server_12c_asm --target /var/lib/.../server1_asm_data02.qcow2 --current
Disk detached successfully

This removed disk /dev/vdd on the guest. This has not gone unnoticed. From the ASM alert.log:

NOTE: Standard client NCDB:NCDB:ASM registered, osid 7732, mbr 0x1, asmb 7728 (reg:3524888596)
2015-02-15 06:01:56.858000 -05:00
NOTE: client NCDB:NCDB:ASM mounted group 1 (DATA)
2015-02-15 08:23:29.188000 -05:00
NOTE: process _user9772_+asm (9772) initiating offline of disk 1.3916023153 (DATA_0001) with mask 0x7e in group 1 (DATA) with client assisting
NOTE: checking PST: grp = 1
GMON checking disk modes for group 1 at 10 for pid 23, osid 9772
NOTE: checking PST for grp 1 done.
NOTE: initiating PST update: grp 1 (DATA), dsk = 1/0xe969c571, mask = 0x6a, op = clear
GMON updating disk modes for group 1 at 11 for pid 23, osid 9772
WARNING: Write Failed. group:1 disk:1 AU:1 offset:1044480 size:4096
path:/dev/vdd1
 incarnation:0xe969c571 synchronous result:'I/O error'
 subsys:System krq:0x7fb991c5e9e8 bufp:0x7fb991c5f000 osderr1:0x0 osderr2:0x0
 IO elapsed time: 0 usec Time waited on I/O: 0 usec
WARNING: found another non-responsive disk 1.3916023153 (DATA_0001) that will be offlined
NOTE: group DATA: updated PST location: disk 0000 (PST copy 0)
NOTE: group DATA: updated PST location: disk 0002 (PST copy 1)
NOTE: group DATA: updated PST location: disk 0003 (PST copy 2)
NOTE: PST update grp = 1 completed successfully

...

NOTE: process _b000_+asm (11355) initiating offline of disk 1.3916023153 (DATA_0001) with mask 0x7e in group 1 (DATA) without client assisting
NOTE: sending set offline flag message (1976690176) to 1 disk(s) in group 1
Errors in file /u01/app/oracle/diag/asm/+asm/+ASM/trace/+ASM_gmon_5836.trc:
ORA-27072: File I/O error
Linux-x86_64 Error: 5: Input/output error
Additional information: 4
Additional information: 4088
Additional information: 4294967295
Errors in file /u01/app/oracle/diag/asm/+asm/+ASM/trace/+ASM_gmon_5836.trc:
ORA-27072: File I/O error
Linux-x86_64 Error: 5: Input/output error
Additional information: 4
Additional information: 4088
Additional information: 4294967295
Errors in file /u01/app/oracle/diag/asm/+asm/+ASM/trace/+ASM_gmon_5836.trc:
ORA-27072: File I/O error
Linux-x86_64 Error: 5: Input/output error
Additional information: 4
Additional information: 4088
Additional information: 4294967295
Errors in file /u01/app/oracle/diag/asm/+asm/+ASM/trace/+ASM_gmon_5836.trc:

...


WARNING: Disk 1 (DATA_0001) in group 1 mode 0x15 is now being offlined
NOTE: initiating PST update: grp 1 (DATA), dsk = 1/0xe969c571, mask = 0x6a, op = clear
GMON updating disk modes for group 1 at 12 for pid 25, osid 11355

...

SUCCESS: PST-initiated drop disk in group 1(3645453725))
2015-02-15 08:24:04.910000 -05:00
NOTE: Attempting voting file refresh on diskgroup DATA
NOTE: starting rebalance of group 1/0xd949359d (DATA) at power 1
Starting background process ARB0
ARB0 started with pid=26, OS id=11364
NOTE: assigning ARB0 to group 1/0xd949359d (DATA) with 1 parallel I/O
NOTE: F1X0 on disk 3 (fmt 2) relocated at fcn 0.26087: AU 0 -> AU 4059
NOTE: 02/15/15 08:24:04 DATA.F1X0 copy 2 relocating from 1:10 to 2:10 at FCN 0.26087
NOTE: 02/15/15 08:24:04 DATA.F1X0 copy 3 relocating from 2:10 to 3:4059 at FCN 0.26087
NOTE: F1B1 fcn on disk 0 synced at fcn 0.26087
NOTE: F1B1 fcn on disk 2 synced at fcn 0.26087
2015-02-15 08:24:28.853000 -05:00
NOTE: restored redundancy of control and online logs in DATA
2015-02-15 08:24:33.219000 -05:00

...

And that went on for a little while. As with any rebalance you can see the progress in v$asm_operation:

SQL> r
  1* select * from v$asm_operation

GROUP_NUMBER OPERA PASS      STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE                                       CON_ID
------------ ----- --------- ---- ---------- ---------- ---------- ---------- ---------- ----------- -------------------------------------------- ----------
           1 REBAL RESYNC    DONE         11         11          0          0          0           0                                                       0
           1 REBAL REBALANCE RUN          11         11        559       4529      12417           0                                                       0
           1 REBAL COMPACT   WAIT         11         11          0          0          0           0                                                       0

You can see detailed information about the disk dropping in asmcmd too:

ASMCMD> lsdsk -kpt
Total_MB  Free_MB  OS_MB  Name                Failgroup  Failgroup_Type  Library  Label  UDID  Product  Redund   Group_Num  Disk_Num      Incarn  Mount_Stat  Header_Stat  Mode_Stat  State    Create_Date  Mount_Date  Repair_Timer  Path
    5119     3237      0  _DROPPED_0001_DATA  DATA_0001  REGULAR         System                         UNKNOWN          1         1  3916023153  MISSING     UNKNOWN      OFFLINE    FORCING  15-FEB-15    15-FEB-15   0
    5119      323   5119  DATA_0000           DATA_0000  REGULAR         System                         UNKNOWN          1         0  3916023154  CACHED      MEMBER       ONLINE     NORMAL   15-FEB-15    15-FEB-15   0             /dev/vdc1
    5119      324   5119  DATA_0002           DATA_0002  REGULAR         System                         UNKNOWN          1         2  3916023152  CACHED      MEMBER       ONLINE     NORMAL   15-FEB-15    15-FEB-15   0             /dev/vde1
    5119      328   5119  DATA_0003           DATA_0003  REGULAR         System                         UNKNOWN          1         3  3916023156  CACHED      MEMBER       ONLINE     NORMAL   15-FEB-15    15-FEB-15   0             /dev/vdf1
ASMCMD>

Missing, forcing, offline all don’t sound too good… What is worse was this:

ASMCMD> lsdg
State    Type    Rebal  Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  NORMAL  Y         512   4096  1048576     15357        0             5119           -2559              1             N  DATA/

And then the inevitable happened:

2015-02-15 08:31:55.491000 -05:00
ERROR: ORA-15041 thrown in ARB0 for group number 1
Errors in file /u01/app/oracle/diag/asm/+asm/+ASM/trace/+ASM_arb0_11547.trc:
ORA-15041: diskgroup "DATA" space exhausted
WARNING: Resync encountered ORA-15041; continuing rebalance
NOTE: stopping process ARB0
NOTE: requesting all-instance membership refresh for group=1
WARNING: rebalance not completed for group 1/0xd949359d (DATA)
GMON updating for reconfiguration, group 1 at 20 for pid 25, osid 11773
NOTE: cache closing disk 1 of grp 1: (not open) _DROPPED_0001_DATA

ORA-15041 is the dreaded error: space exhausted. This was to be expected, but I had to see it with my own eyes.

The ASM rebalance is not complete:

SQL> r
  1* select * from v$asm_operation

GROUP_NUMBER OPERA PASS      STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE                                       CON_ID
------------ ----- --------- ---- ---------- ---------- ---------- ---------- ---------- ----------- -------------------------------------------- ----------
           1 REBAL RESYNC    DONE         11                                                                                                               0
           1 REBAL REBALANCE ERRS         11                                                         ORA-15041                                             0
           1 REBAL COMPACT   WAIT         11                                                                                                               0

Interestingly the database remained up-but I can’t do anything with it as these examples demonstrate:

SQL> alter tablespace IAMTOOLARGETOFIT add datafile size 100m;
alter tablespace IAMTOOLARGETOFIT add datafile size 100m
*
ERROR at line 1:
ORA-01119: error in creating database file '+DATA'
ORA-17502: ksfdcre:4 Failed to create file +DATA
ORA-15041: diskgroup "DATA" space exhausted



SQL> create table t as select * from dba_objects;
create table t as select * from dba_objects
*
ERROR at line 1:
ORA-01652: unable to extend temp segment by 128 in tablespace USERS

Errors in file /u01/app/oracle/diag/rdbms/ncdb/NCDB/trace/NCDB_arc3_12194.trc:
ORA-19504: failed to create file "+DATA"
ORA-17502: ksfdcre:4 Failed to create file +DATA
ORA-15041: diskgroup "DATA" space exhausted
ARC3: Error 19504 Creating archive log file to '+DATA'
2015-02-15 08:47:31.410000 -05:00
Unable to create archive log file '+DATA'

You are now in a bit of a situation … So be careful when you are on negative values for usable_file_mb-a loss of only one failgroup does not cause the disk group to be dismounted in ASM normal redundancy, but you are likewise ending up in a very undesirable place.

One Response to “What happens in ASM if usable_file_mb is negative and you lose a failgroup”

  1. […] My system is fully operational, and the rebalance did not run into any space problems. Space problems are the last thing you want to have when rebalancing. I did some research about this subject earlier and documented it in a blog post. […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: