Martins Blog

Trying to explain complex things in simple terms

Collecting and analysing Exadata cell metrics

Posted by Martin Bach on September 15, 2011

Recently I was asked to write a paper about the Exadata Flash Cache and its impact on performance. This was a challenge to my liking! I won’t reproduce the paper I wrote, but I’d like to demonstrate the methods I used to get more information about what’s happening in the software.

Hardware

The Exadata Flash Cache is provided by four F20 PCIe cards in each cell. Currently the PCI Express bus is the best way to realise the potential of flash storage in terms of latency and bandwidth; SSDs attached to a standard storage array are slowed down by Fibre Channel as the transport medium.

Each of the F20 cards holds 96GB of raw space, totalling 384GB of capacity per storage cell; the usable capacity is slightly less. Each F20 card is subdivided into four so-called FMODs, or solid state flash modules, which are visible to the operating system through the standard SCSI sd driver.

Cellcli can also be used to view the FMODs via the “LIST PHYSICALDISK” command. The output is slightly different from the spinning disks, as the FMODs are reported in the SCSI driver’s [host:bus:target:lun] notation.
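
To look at them across all cells in one go, the dcli pattern used later in this post can be reused. This is only a sketch; I am assuming diskType is the appropriate attribute to filter on:

# dcli -l root -g /tmp/martin_group 'cellcli -e "list physicaldisk where diskType = \"FlashDisk\""'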

Now please don’t be tempted to take the FMODs and transform them into celldisks!

How it works

The flash cache is a write-through cache, and therefore won’t speed up write operations to disk. If an internal, undocumented algorithm decides that the data just written to disk is suitable for caching, it will put the data into the Exadata Smart Flash Cache (ESFC). It is quite clever in this respect and performs optimisations an “ordinary” cache cannot perform.

Smart scans are going to ignore the flash cache by default and read directly from disk, unless the segment’s storage clause specifically instructs Oracle to read from flash cache as well. Meddling with the storage clause is also referred to as object pinning, a practice which is not recommended. Pinning objects to the flash cache immediately reminded me of the bad practice in the old 8i days of pinning objects in the shared pool. It didn’t necessarily make things faster or help (except in some special cases).
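
For completeness, and not as a recommendation: pinning is done via the CELL_FLASH_CACHE storage attribute. A minimal sketch, using a hypothetical non-partitioned table as an example:

-- pin the segment to the flash cache (not recommended)
ALTER TABLE soe.customers STORAGE (CELL_FLASH_CACHE KEEP);

-- revert to the default behaviour
ALTER TABLE soe.customers STORAGE (CELL_FLASH_CACHE DEFAULT);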

The main statistics counter to query in relation to the flash cache is called “cell flash cache read hits”. The remainder of this post deals with the cell metrics, which cannot be queried directly from the database. They are more granular, though, which makes them well suited for analysis.
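
On the database side this counter can be checked before and after a test run; a minimal sketch for the current session, joining v$mystat to v$statname:

SELECT sn.name, ms.value
  FROM v$mystat ms, v$statname sn
 WHERE ms.statistic# = sn.statistic#
   AND sn.name = 'cell flash cache read hits';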

The test case

The Swingbench order entry benchmark provided the data for the following tests. The table ORDER_ITEMS contains the most rows in the schema, and it is partitioned to reflect a realistic environment. Indexes have been made invisible as they might have impacted the performance statistics (see the sketch below). Additionally, cellcli’s WHERE clause isn’t very flexible, and querying a large number of object IDs can be time consuming.
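
Making an index invisible is a one-liner; a quick sketch (the index name is just an example):

ALTER INDEX soe.item_order_ix INVISIBLE;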

SOE.ORDER_ITEMS is divided into partitions with object IDs 74506 to 74570. These object IDs are required for mapping the output of cellcli to the data dictionary later.
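
The object IDs themselves are easy to retrieve from the data dictionary, for example:

SELECT object_id, object_name, subobject_name
  FROM dba_objects
 WHERE owner = 'SOE'
   AND object_name = 'ORDER_ITEMS'
   AND object_type = 'TABLE PARTITION'
 ORDER BY object_id;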

A little bit of setup was needed to capture the cell metrics. For pre/post comparison it was necessary to capture statistics in a statspack-like way and to analyse them. Rather than using Excel, I decided to store the gathered information in a dedicated schema in the database.

I gathered two different sets of information from the cells: firstly the metrics for the objectType “FLASHCACHE”, and secondly the contents of the flash cache, using the fabulous “dcli” command. The distributed command line interface can be used to execute arbitrary commands on a number of hosts simultaneously and reports the results back to the caller. I redirect the output from STDOUT to a CSV file in my home directory and use external tables to copy this information into the so-called “RESULT” tables.

The table DDL is as follows:

create table cacheContents_csv (
cellname varchar2(50),
cachedKeepSize     number,
cachedSize         number,
hitCount           number,
missCount          number,
objectNumber             number
)
organization external (
type oracle_loader
default directory testdir
access parameters (
records delimited  by newline
fields  terminated by ';'
missing field values are null
)
location ('flashcontents.csv')
)
reject limit unlimited;

CREATE TABLE cacheContents_RESULT AS
SELECT 1 AS TESTRUN#,
CELLNAME,
CACHEDKEEPSIZE,
CACHEDSIZE,
HITCOUNT,
MISSCOUNT,
OBJECTNUMBER
FROM cacheContents_csv A;

CREATE TABLE "METRICCURRENT_CSV"
(
"CELLNAME" VARCHAR2(100),
"NAME" VARCHAR2(50),
"OBJECTTYPE" VARCHAR2(50),
"METRICVALUE" VARCHAR2(100)
)
ORGANIZATION external
(
TYPE oracle_loader
DEFAULT DIRECTORY TESTDIR
ACCESS PARAMETERS
(
RECORDS DELIMITED BY NEWLINE CHARACTERSET US7ASCII
FIELDS LDRTRIM
REJECT ROWS WITH ALL NULL FIELDS
(
"CELLNAME" (1:12) CHAR,
"NAME" (14:36) CHAR,
"OBJECTTYPE" (39:48) CHAR,
"METRICVALUE" (51:90) CHAR
)
)
location
(
'metrics.csv'
)
) REJECT LIMIT UNLIMITED;

CREATE TABLE "METRICCURRENT_RESULT"
(
"TESTRUN#"    NUMBER,
"CELLNAME"    VARCHAR2(100 BYTE),
"NAME"        VARCHAR2(50 BYTE),
"OBJECTTYPE"  VARCHAR2(50 BYTE),
"METRICVALUE" VARCHAR2(100 BYTE)
);

I additionally created a table called “metricdefinition” which allows me to match the short metric name from metriccurrent with a more human-readable description.

CREATE TABLE "METRICDEFINITION"
(
"NAME"        VARCHAR2(100 BYTE),
"DESCRIPTION" VARCHAR2(500 BYTE),
"METRICTYPE"  VARCHAR2(50 BYTE),
"OBJECTTYPE"  VARCHAR2(50 BYTE),
"UNIT"        VARCHAR2(20 BYTE)
);

The definitions, too, were loaded into the database using sqlldr.
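
The control file is not reproduced here; a minimal sketch of what it could look like, assuming a semicolon-separated export of the definitions into a file called metricdefinitions.csv:

load data
infile 'metricdefinitions.csv'
into table metricdefinition
fields terminated by ';'
trailing nullcols
(
  name,
  description,
  metrictype,
  objecttype,
  unit
)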

First the metrics were captured:

# dcli -l root -g /tmp/martin_group 'cellcli -e "list METRICCURRENT where objecttype = \"FLASHCACHE\""' | sed -e 's/://' > /home/oracle/martin/metrics.csv

I admit it’s probably not the most elegant way to do this, but it worked for me and I didn’t have weeks to prepare (as always…).

Now what does this command do? It executes a call to cellcli on all cells listed in /tmp/martin_group to list the current metrics for the Flash Cache. The result is piped through sed, which strips the colon after each cell name, and the output is redirected to metrics.csv in my TESTDIR directory on the first RDBMS node.

Similarly, I used the following command to get the contents of the flash cache, to see if anything changed after executing the query.

 # dcli -l root -g /tmp/martin_group 'cellcli -e "list flashcachecontent where dbUniqueName like 'TEST' and objectNumber > 74505 and objectNumber < 74571 attributes cachedKeepSize,cachedSize,hitCount,missCount,objectNumber"' | sed -e 's/://' | awk '{print $1";"$2";"$3";"$4";"$5";"$6}' > /home/oracle/martin/flashcontents.csv

This looks a little more complex, but it isn’t. The apparent complexity is required to limit the result set to the contents of the flash cache for database TEST and the object IDs of the table partitions. There is a little more transformation involved with awk to convert the output into “;”-separated lines.

With this done, it was possible to transfer the information from the staging tables into the RESULT tables. This is a simple “insert into table select testrun#, * from …CSV”.
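
Spelled out for both staging tables, with the test run number supplied manually (2 is just an example):

INSERT INTO metriccurrent_result
SELECT 2 AS testrun#, a.* FROM metriccurrent_csv a;

INSERT INTO cachecontents_result
SELECT 2 AS testrun#, a.* FROM cachecontents_csv a;

COMMIT;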

The analysis was very simple once the data had been loaded into the RESULT tables. For the metrics and the flash cache contents I used the following queries:

WITH prestats AS
( SELECT * FROM metriccurrent_result WHERE testrun# = &1
),
poststats AS
( SELECT * FROM metriccurrent_result WHERE testrun# = &2
)
SELECT pre.cellname,
--pre.name,
defn.description,
pre.metricvalue pre_metricvalue,
post.metricvalue post_metricvalue,
to_number(REGEXP_SUBSTR(post.metricvalue, '(\S*)'),'999G999G999D999999', 'NLS_NUMERIC_CHARACTERS = ''.,''') -
to_number(REGEXP_SUBSTR( pre.metricvalue, '(\S*)'),'999G999G999D999999', 'NLS_NUMERIC_CHARACTERS = ''.,''') delta
FROM prestats pre,
poststats post,
metricdefinition defn
WHERE pre.cellname = post.cellname
and pre.name = post.name
and pre.objecttype = post.objecttype
and pre.name = defn.name(+)
and pre.name = 'FC_IO_BYKEEP_R'
and (to_number(REGEXP_SUBSTR(post.metricvalue, '(\S*)'),'999G999G999D999999', 'NLS_NUMERIC_CHARACTERS = ''.,''') -
to_number(REGEXP_SUBSTR( pre.metricvalue, '(\S*)'),'999G999G999D999999', 'NLS_NUMERIC_CHARACTERS = ''.,''') ) <> 0
order by 1,2;

WITH prestats AS
( SELECT * FROM cachecontents_result WHERE testrun# = &1
),
poststats AS
( SELECT * FROM cachecontents_result WHERE testrun# = &2
)
SELECT pre.cellname,
pre.objectnumber,
pre.cachedkeepsize pre_cachedkeepsize,
pre.cachedsize pre_cachedsize,
pre.hitcount pre_hitcount,
pre.misscount pre_misscount,
post.cachedkeepsize post_cachedkeepsize,
post.cachedsize post_cachedsize,
post.hitcount post_hitcount,
post.misscount post_misscount,
(post.cachedkeepsize-pre.cachedkeepsize) delta_cachedkeepsize,
(post.cachedsize-pre.cachedsize) delta_cachedsize,
(post.hitcount-pre.hitcount) delta_hitcount,
(post.misscount-pre.misscount) delta_misscount
FROM prestats pre,
poststats post
WHERE pre.cellname   = post.cellname
AND pre.objectnumber = post.objectnumber
ORDER BY 1,2;

This revealed quite a lot about the underlying workings of the flash cache. The above queries are best executed in SQL Developer; the results can then be copied and pasted into a spreadsheet application to protect them from accidental overwriting.

Summary

The above certainly isn’t bullet-proof or very elegant, but it is a good starting point for your own experiments. Time permitting, I would like to automate the generation of the CSV files via a scheduled job, and also the population of the result tables. There really should be some sort of unique key on the result tables to avoid duplicates; a sketch follows below. Maybe one day I’ll write a statspack-like version of the lot…
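
The unique key mentioned above could be as simple as the following sketch (the constraint names are arbitrary):

ALTER TABLE metriccurrent_result
  ADD CONSTRAINT metriccurrent_result_uk UNIQUE (testrun#, cellname, name, objecttype);

ALTER TABLE cachecontents_result
  ADD CONSTRAINT cachecontents_result_uk UNIQUE (testrun#, cellname, objectnumber);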

6 Responses to “Collecting and analysing Exadata cell metrics”

  1. Very nice post, Martin. You are making good use of your Exadata. I had an Exadata in the lab between January and February of this year, but it was on loan from Oracle so it had to go back. I am getting delivery of another Exadata tomorrow on a permanent basis for our lab, so it should be fun working on it.

  2. >>> Now please don’t be tempted to take the FMODs and transform them into celldisks!

    You would be surprised that our Oracle account manager did suggest we use FlashDisks for storing user data. Obviously, we did not heed their advice, based on our experience of two flash cards getting (falsely) reported as failed every now and then. The workaround is to restart the affected cell node so that the affected flash card is removed from the cell and re-added. Now imagine, if you had your user data on these flash cards, you would be in a constant cycle of restoring it. Of course, if you had normal redundancy, then you might survive. But I have seen this problem affect two cells within the span of two days. This is due to Sun bug 6997590 (MOS Note 1351559.1), which is available as a one-off patch and included in Exadata Storage software 11.2.2.3.5.

    • Martin Bach said

      > which is available as one-off patch and included in Exadata Storage software 11.2.2.3.5.

      Wow, are you already on that version? You are way ahead of the game!

      • We are currently on 11.2.2.2.0 in production, but since we are getting our engineering rack soon, we will start upgrading. Based on past experience, our approach is to stay one step behind; we don’t want to solve the problems with the newest Exadata software patches the hard way. I hope Oracle sorts out the disk controller firmware issues once and for all, because they keep withdrawing the cell images.

  3. [...] Martin blogs about collecting and analysing Exadata cell metrics. [...]
