
Adventures in RAC: gc buffer busy acquire and release

Posted by Martin Bach on December 16, 2014

It seems that I’m getting more and more drawn into the world of performance analysis, and since I sometimes tend to forget things I need to write them down. I almost enjoy the “getting there” more than ultimately solving the problem. You pick up quite a few things on the way.

This environment is Exadata 12.1.1.1.1/Oracle 12.1.0.2, but as with so many things the fact that the database is on Exadata shouldn't matter.

So here is one of these posts; this time I'm writing up what I saw related to gc buffer busy acquire.

gc buffer busy acquire?

Whenever I see a wait event I haven’t dealt with extensively in the past I try to provoke behaviour to study it more closely. But first you need to know the event’s meaning. One option is to check v$event_name:

SQL> select name, parameter1, parameter2, parameter3, wait_class
  2  from v$event_name where name = 'gc buffer busy acquire';

NAME                           PARAMETER1 PARAMETER2 PARAMETER3 WAIT_CLASS
------------------------------ ---------- ---------- ---------- -------------------
gc buffer busy acquire         file#      block#     class#     Cluster

So when you see this event in ASH, traces, v$session etc. you know it's a cluster wait that is potentially limiting your processing throughput. The event also gives you the file#, block# and class# of the buffer, which you can link to v$bh. That view allows you to find the data object ID for these input parameters.
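As a rough sketch (my query, not one from the original analysis), the lookup could be done along these lines; the bind variable names are pure placeholders for the values reported with the wait event:

-- map a wait event's file#/block# to the segment the buffer belongs to;
-- :file_no and :block_no stand for the values reported with the wait event
select bh.inst_id,
       bh.status,
       obj.owner,
       obj.object_name
  from gv$bh bh
  join dba_objects obj
    on obj.data_object_id = bh.objd
 where bh.file#  = :file_no
   and bh.block# = :block_no;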

Using https://orainternals.wordpress.com/2012/04/19/gc-buffer-busy-acquire-vs-release/ as a source I worked out that the event has to do with acquiring a buffer (= block) in RAC (gc = global cache) on the local instance. If the block you need is currently held by the remote instance, you wait for it to be released first; that wait event is gc buffer busy release.

Since Oracle will clone blocks in the buffer caches for consistent reads and take a shared lock on them for reading, I thought that waiting can only happen if someone requested a block in XCUR (exclusive current) mode. So with that working hypothesis I went to work.

How to test

I started off writing a small Java class that creates a connection pool against my RAC database. I initially used the default service name in the connect descriptor, only to find out that dbms_monitor.SERV_MOD_ACT_TRACE_ENABLE did not trace my sessions. In the end I created a true RAC service with CLB and RLB goals against both instances, and I ended up with traces in the diagnostic_dest.
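For reference, enabling SQL trace for such a service, restricted to the module used by the test, can be done with dbms_monitor. This is only a sketch: the service name below is illustrative, the module name is the one my test sessions report:

-- trace waits (but not binds) for sessions of the given service/module combination
-- 'bufferbusy_svc' is an illustrative service name, not necessarily the one used here
begin
  dbms_monitor.serv_mod_act_trace_enable(
    service_name => 'bufferbusy_svc',
    module_name  => 'BufferBusy',
    waits        => true,
    binds        => false);
end;
/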

After setting up the UCP connection pool the code creates a number of threads, each of which pulls a connection from the pool, does some work (*) and hands it back to the pool, as good citizens should.

(*) The "do some work" bit is this:

...
try {
        // select (and lock) one of the six rows that all live in the same block
        PreparedStatement pstmt = conn.prepareStatement(
          "select /* bufferbusy001 */ id, to_char(sysdate,'hh24:mi:ss') d " +
          "from t1 where id = ? for update");

        // pick a random ID in the range 1450765 to 1450770
        int randomID = new Random().nextInt((1450770 - 1450765) + 1) + 1450765;
        System.out.println("thread " + mThreadID + " now querying for ID " + randomID);
        pstmt.setInt(1, randomID);

        ResultSet rs = pstmt.executeQuery();

        while (rs.next()) {
                System.out.println("Thread " + mThreadID + " reporting an id of "
                  + rs.getInt("id") + ". Now it is " + rs.getString("d"));
        }

        rs.close();
        pstmt.close();

        // release the row lock and return the connection to the pool
        conn.rollback();
        conn.close();
        conn = null;

        Thread.sleep(2000);
} catch (Exception e) {
        e.printStackTrace();
}
...

I think that's how a Java developer would do it (with more error handling, of course), but then I'm not a Java developer. It did work though! What I considered most important was to generate contention on a single block. Using dbms_rowid I could find out which IDs belong to a (randomly chosen) block:

SQL> select * from (
  2    select id,DBMS_ROWID.ROWID_BLOCK_NUMBER (rowid, 'BIGFILE') as block
  3      from t1
  4 ) where block = 11981654;

        ID      BLOCK
---------- ----------
   1450765   11981654
   1450766   11981654
   1450767   11981654
   1450768   11981654
   1450769   11981654
   1450770   11981654

6 rows selected.

So if I manage to randomly select from the table where the ID is in the range …765 to …770, I should be OK and hit just that particular block.

It turned out that the SQL statement completed so quickly that I had to ramp up the number of sessions in the pool considerably to see anything. I went from 10 to 500 sessions before I could notice a change. Most of the statements were too quick even to be caught in ASH: Tanel's ashtop script showed pretty much nothing except the occasional ON CPU sample and the odd log file sync event. Snapper also reported the sessions as mostly idle.

SQL> r
  1  select count(*), inst_id, status, sql_id, event, state
  2  from gv$session where module = 'BufferBusy'
  3* group by inst_id, status, sql_id, event, state

  COUNT(*)    INST_ID STATUS   SQL_ID        EVENT                          STATE
---------- ---------- -------- ------------- ------------------------------ -------------------
       251          1 INACTIVE               SQL*Net message from client    WAITING
       248          2 INACTIVE               SQL*Net message from client    WAITING

2 rows selected.

That's what you see most of the time.

Let me trace this for you

So in order to get any meaningful idea about the occurrence (or absence) of the gc buffer busy acquire event I added a MODULE to my sessions, so that I could later run trcsess to combine the traces.
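I'm not showing how the module is set from the Java side; purely as an illustration, a session can tag itself with a single call to dbms_application_info, using the same module name my pooled sessions report:

-- tag the current session so its trace file can later be picked out with trcsess
begin
  dbms_application_info.set_module(
    module_name => 'BufferBusy',
    action_name => null);
end;
/

Here is the resulting raw trace, or rather an excerpt from it: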

=====================
PARSING IN CURSOR #140650659166120 len=96 dep=0 uid=65 oct=3 lid=65 tim=4170152514049 hv=1500360262 ad='5b58a4a10' sqlid='6a5jfvpcqvbk6'
select /* bufferbusy001 */ id, to_char(sysdate,'hh24:mi:ss') d from t1 where id = :1  for update
END OF STMT
PARSE #140650659166120:c=0,e=5598,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=2291732815,tim=4170152514046
WAIT #140650659166120: nam='gc buffer busy acquire' ela= 12250 file#=6 block#=11981654 class#=1 obj#=20520 tim=4170152526533
WAIT #140650659166120: nam='buffer busy waits' ela= 1890 file#=6 block#=11981654 class#=1 obj#=20520 tim=4170152528475
WAIT #140650659166120: nam='enq: TX - row lock contention' ela= 188606 name|mode=1415053318 usn<<16 | slot=1179674 sequence=1485 obj#=20520 tim=4170152717199
WAIT #140650659166120: nam='gc buffer busy acquire' ela= 1590 file#=6 block#=11981654 class#=1 obj#=20520 tim=4170152718839
WAIT #140650659166120: nam='enq: TX - row lock contention' ela= 36313 name|mode=1415053318 usn<<16 | slot=1245199 sequence=1894 obj#=20520 tim=4170152755340
WAIT #140650659166120: nam='gc buffer busy acquire' ela= 1268 file#=6 block#=11981654 class#=1 obj#=20520 tim=4170152756655
WAIT #140650659166120: nam='buffer busy waits' ela= 668 file#=6 block#=11981654 class#=1 obj#=20520 tim=4170152757363
WAIT #140650659166120: nam='KJC: Wait for msg sends to complete' ela= 11 msg=26941469232 dest|rcvr=65539 mtype=8 obj#=20520 tim=4170152757492
EXEC #140650659166120:c=1999,e=243530,p=0,cr=9,cu=4,mis=0,r=0,dep=0,og=1,plh=2291732815,tim=4170152757651
WAIT #140650659166120: nam='SQL*Net message to client' ela= 4 driver id=1413697536 #bytes=1 p3=0 obj#=20520 tim=4170152757709
FETCH #140650659166120:c=0,e=16,p=0,cr=0,cu=0,mis=0,r=1,dep=0,og=1,plh=2291732815,tim=4170152757749
STAT #140650659166120 id=1 cnt=1 pid=0 pos=1 obj=0 op='FOR UPDATE  (cr=9 pr=0 pw=0 time=243443 us)'
STAT #140650659166120 id=2 cnt=2 pid=1 pos=1 obj=0 op='BUFFER SORT (cr=3 pr=0 pw=0 time=60 us)'
STAT #140650659166120 id=3 cnt=1 pid=2 pos=1 obj=48863 op='INDEX RANGE SCAN I_T1$SEC1 (cr=3 pr=0 pw=0 time=37 us cost=3 size=6 card=1)'
WAIT #140650659166120: nam='SQL*Net message from client' ela= 260 driver id=1413697536 #bytes=1 p3=0 obj#=20520 tim=4170152758109
CLOSE #140650659166120:c=0,e=5,dep=0,type=1,tim=4170152758141
XCTEND rlbk=1, rd_only=0, tim=4170152758170
WAIT #0: nam='gc buffer busy acquire' ela= 3764 file#=6 block#=11981654 class#=1 obj#=20520 tim=4170152761976
WAIT #0: nam='buffer busy waits' ela= 1084 file#=6 block#=11981654 class#=1 obj#=20520 tim=4170152763104
WAIT #0: nam='log file sync' ela= 246 buffer#=119491 sync scn=19690898 p3=0 obj#=20520 tim=4170152763502
WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1413697536 #bytes=1 p3=0 obj#=20520 tim=4170152763536

Result! There are gc buffer busy acquire events recorded. I can't rule out enq: TX - row lock contention either: with all those threads and only six IDs to choose from, there was bound to be some locking on the same ID caused by the "for update" clause.
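As a quick sanity check (my own arithmetic, not part of the original trace), the name|mode value 1415053318 recorded with those waits decodes to the TX enqueue requested in mode 6 (exclusive), which is exactly what a "for update" row lock asks for:

-- decode the enqueue name and requested mode from the trace's name|mode value
select chr(bitand(1415053318, -16777216) / 16777216) ||
       chr(bitand(1415053318,  16711680) / 65536)    as enqueue_name,
       bitand(1415053318, 65535)                      as requested_mode
  from dual;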

Now I am reasonably confident that I have worked out at least one scenario that causes gc buffer busy acquire waits. You might also find the location of the block's copies in the buffer caches interesting:

SYS:dbm011> select count(*), inst_id, block#, status
  2  from gv$bh where block# = 11981654
  3  group by inst_id, block#, status order by inst_id, status;

   COUNT(*)     INST_ID      BLOCK# STATUS
----------- ----------- ----------- ----------
          9           1    11981654 cr
          1           1    11981654 xcur
          9           2    11981654 cr

There is a single buffer holding the block in xcur mode on instance 1, plus nine CR copies of it in each instance's buffer cache.

Making it worse

Now I didn't want to stop there; I was interested in what would happen under CPU load. During my career I have noticed that cluster waits appear primarily when you are CPU-bound (all other things being equal). This could be the infamous mismanaged middle-tier connection pool, or an execution plan gone wrong with hundreds of users performing nested loop joins when they should be hash-joining large data sets… This is usually the point where OEM users ask the DBAs to do something about that "sea of grey" in the performance pages.

As with every cluster technology, an overloaded CPU does not help. Well, I guess that's true for all computing. To increase the CPU load I created 10 dd sessions reading from /dev/zero and writing to /dev/null. Sounds silly, but each of these hogs one CPU core at 100%. With 10 out of 12 cores on node 1 occupied that way, I relaunched my test. The hypothesis that CPU overload has an effect was proven right: suddenly there were ASH samples of my sessions.

SQL> @ash/ashtop sql_id,session_state,event "sql_id='6a5jfvpcqvbk6'" sysdate-5/1440 sysdate

    Total
  Seconds     AAS %This   SQL_ID        SESSION EVENT
--------- ------- ------- ------------- ------- ----------------------------------------
      373     1.2   79% | 6a5jfvpcqvbk6 WAITING enq: TX - row lock contention
       54      .2   11% | 6a5jfvpcqvbk6 WAITING gc buffer busy release
       20      .1    4% | 6a5jfvpcqvbk6 ON CPU
       11      .0    2% | 6a5jfvpcqvbk6 WAITING gc buffer busy acquire
       11      .0    2% | 6a5jfvpcqvbk6 WAITING gc current block busy
        1      .0    0% | 6a5jfvpcqvbk6 WAITING gc current block 2-way

6 rows selected.

Using ASQLMON I can even see where time is spent:

SQL> @scripts/ash/asqlmon 6a5jfvpcqvbk6 % sysdate-1 sysdate

    SECONDS Activity Visual       Line ID Parent ASQLMON_OPERATION                   SESSION EVENT                                         AVG_P3 OBJ_ALIAS_ ASQLMON_PR
----------- -------- ------------ ------- ------ ----------------------------------- ------- ---------------------------------------- ----------- ---------- ----------
         38    2.4 % |          |       0        SELECT STATEMENT                    ON CPU                                                883065
          1     .1 % |          |       1      0  FOR UPDATE                         WAITING gc current block 2-way                      33554433  [SEL$1]
         25    1.6 % |          |       1      0                                     ON CPU                                            5369727.36  [SEL$1]
         17    1.1 % |          |       1      0                                     WAITING buffer busy waits                                  1  [SEL$1]
        109    6.8 % |#         |       1      0                                     WAITING gc buffer busy release                             1  [SEL$1]
         31    1.9 % |          |       1      0                                     WAITING gc buffer busy acquire                             1  [SEL$1]
         27    1.7 % |          |       1      0                                     WAITING gc current block busy                       33554433  [SEL$1]
        768   48.0 % |#####     |       1      0                                     WAITING enq: TX - row lock contention            6685.143229  [SEL$1]
          3     .2 % |          |       2      1   BUFFER SORT                       ON CPU                                                     0
          2     .1 % |          |       3      2    INDEX RANGE SCAN [I_T1$SEC1]     ON CPU                                                     0 T1@SEL$1 [ [A:]
                                                                                                                                                  SEL$1]     "ID"=:1


          2     .1 % |          |       0        SELECT STATEMENT                    ON CPU                                            16777216.5
          2     .1 % |          |       1      0  FOR UPDATE                         WAITING gc current block busy                       33554433  [SEL$1]
         24    1.5 % |          |       1      0                                     WAITING write complete waits                               0  [SEL$1]
          9     .6 % |          |       1      0                                     WAITING gc buffer busy acquire                             1  [SEL$1]
         30    1.9 % |          |       1      0                                     WAITING gc buffer busy release                             1  [SEL$1]
          9     .6 % |          |       1      0                                     WAITING buffer busy waits                                  1  [SEL$1]
          7     .4 % |          |       1      0                                     ON CPU                                           158.8571429  [SEL$1]
        496   31.0 % |###       |       1      0                                     WAITING enq: TX - row lock contention            6396.395161  [SEL$1]
                   % |          |       2      1   BUFFER SORT
                   % |          |       3      2    INDEX RANGE SCAN [I_T1$SEC1]                                                                  T1@SEL$1 [ [A:]
                                                                                                                                                  SEL$1]     "ID"=:1

Further Reading

I’m sure there is a wealth of resources available out there, in my case Riyaj’s blog helped me a lot. He even tagged posts with gc buffer busy: https://orainternals.wordpress.com/tag/gc-buffer-busy

Have a look at the Oaktable World 2014 agenda and watch Tanel Poder's session attentively. You'd be surprised how many scripts he has made publicly available for troubleshooting performance. Like snapper? It's only the tip of the iceberg. And if you can, you should really attend his advanced troubleshooting seminar.

2 Responses to “Adventures in RAC: gc buffer busy acquire and release”

  1. Hi Martin,
    nice blog post.

    I have usually seen extensive "gc buffer busy release" waits in case of LMS / LGWR (gcs log flush sync / log file sync) issues. The long duration of "gcs log flush sync / log file sync" was mostly caused by "broadcast on commit" bugs (in case of fast log file parallel write), but CPU load can be an influencing factor as well, as you have demonstrated nicely :-))

    Regards
    Stefan

    • Martin Bach said

      Hi Stefan!

      Always nice to get comments from the experts. If memory serves me right then Riyaj’s post had an example of an overloaded lgwr process for anyone who wants to see what happens in that case.

      Martin
