View NOTE:62161.1

WebIV:View NOTE:62161.1

Article-ID:         <Note:62161.1>
Circulation:        PUBLISHED (EXTERNAL)
Folder:             server.Rdbms.Performance.Database
Topic:              ** Tuning Oracle7 and Oracle8
Platform:           GENERIC  Generic issue
Subject:            Systemwide Tuning using UTLESTAT Reports in Oracle7/8
Author:             RPOWELL.UK
Creation-Date:      27-JUL-1998 14:26:12
Modified-Date:      09-FEB-1999 13:51:10
Document-Type:      BULLETIN
Impact:             MEDIUM
Skill-Level:        ACCOMPLISHED
Component:          RDBMS  07.01 - 08.00  Server 07.01 - 08.00
Content-Type:       TEXT/X-HTML
Attachments:        NONE

UTLBSTAT / UTLESTAT (commonly known as bstat/estat reports)

There are several articles about bstat/estat reports already in circulation. Here we will give an overview of what a bstat/estat is and then focus on the important parts. It is important that you have read <Note:61998.1> before continuing.

BSTAT / ESTAT

The bstat/estat scripts are designed to provide a summary of Oracle activity over a reasonable period of time (>10 minutes). UTLBSTAT creates a snapshot of some of the internal Oracle dynamic performance views (V$ views). UTLESTAT creates a second snapshot of these views and reports on the differences between the two snapshots to a file called 'report.txt'.

Some key things to remember about bstat/estat:

The BSTAT/ESTAT report is of limited use if TIMED_STATISTICS is not set to TRUE in the init.ora file. Without this parameter set you cannot determine how much time was spent running versus waiting.
Setting TIMED_STATISTICS to TRUE adds only a very small overhead to database operations - it is advisable to keep this set at all times.
You may hear concerns that bstat/estat will impact performance during the period being monitored. This is incorrect - the data which the bstat/estat scripts capture is always recorded by Oracle in the V$ views but note the following point.
As the UTLBSTAT/UTLESTAT scripts issue 'create table as select' commands they add to any recorded system activity hence they can distort the figures. Make sure that your sample interval is large enough not to be affected by this 'noise'.
Remember that when the database and application first starts the various caches will not be populated. This can distort the figures in the first few minutes. Normally you would not start collecting statistics until the database is operating in a 'stable' state.
Choose an interval of bstat/estat to suit the situation you are tuning. Eg: If there is a general slowdown between 10.00 and 12.00 then a 12 hour sample is likely to dilute any problem during these hours. It would be better to take several one hour samples to allow you to compare good and bad times of the day.
ESTAT reports covering more than a couple of hours are of limited use as they dilute any issues that occur at peak periods.
Never run the scripts in SQLPLUS (before Oracle 8.1) or the output may be very badly formatted and incomplete.

To capture a bstat/estat report:

    At the start of the sample period:
    In Server Manager:

            connect / as sysdba                 (or connect internal)
            @?/rdbms/admin/utlbstat             (Unix)
     or     @D:/orant/rdbms80/admin/utlbstat    (NT - use correct path)


    At the end of the sample period:
    Change to a directory where you have WRITE permission
    Ensure there is no local file called 'report.txt' as this is the name
     of the output file that UTLESTAT will try to create.
    In Server Manager:

            connect / as sysdba                 (or connect internal)
            @?/rdbms/admin/utlestat             (Unix)
     or     @D:/orant/rdbms80/admin/utlestat    (NT - use correct path)

    The report should be written to a file called 'report.txt' in the
    current directory.

If there is EVER a shutdown between UTLBSTAT and UTLESTAT the results will be nonsense. If the database gets restarted re-run UTLBSTAT when a stable operating point has been reached after startup.

UTLESTAT Output

The REPORT.TXT output file is quite large. Rather than aimlessly describing all the values and what they mean we first give an overview of the sections of the output and then describe where to start. Each section is then covered in isolation with comments as to when you can skip over it.

Note: The format of the REPORT.TXT file varies slightly between releases so
      sections may not be in exactly the same order as shown here. Also
      not all sections are reported in all Oracle versions.


<A> 
SVRMGR> Rem Select Library cache statistics. ... 
   LIBRARY      GETS       GETHITRATI PINS      ...

        The library cache is where Oracle stores object definitions, SQL 
        statements etc.. Each namespace (or library) contains different 
        types of object. The figures here give a quick summary of the
        usage of each namespace. The most useful indicator here is the
        RELOADS column.

<B> 
   Statistic                   Total        Per Transact ...

        There are hundreds of statistics gathered by Oracle. Only a few of 
        these are needed for general tuning. In particular we get information 
        about CPU usage, buffer cache hit ratio, table scanning and sorts 
        from this section.

<C> 
Average Write Queue Length 

        The average write queue length is a good indication of how busy 
        the buffer cache is.

<D> 
SVRMGR> Rem System wide wait events for non-background processes ...
   Event Name                       Count         Total Time    Avg Time     

        This is arguably the most important section in the report as it shows 
        how long Oracle is waiting for resources. This will be the starting 
        point for looking at tuning Oracle.

<E>
SVRMGR> Rem System wide wait events for background processes ...
   Event Name                       Count         Total Time    Avg Time     

        In earlier Oracle7 releases this section and section <D> are 
        combined.  In Oracle 7.3 and Oracle8 this section shows time spent 
        waiting by background processes.

<F>
SVRMGR> Rem Latch statistics. ...
   LATCH_NAME         GETS        MISSES      HIT_RATIO   SLEEPS    ...

        Latches are points of serialisation within Oracle so are critical 
        to good performance. This section is important if there are waits 
        for latches occurring.

<G>
   LATCH_NAME         NOWAIT_GETS      NOWAIT_MISSES    NOWAIT_HIT_RATIO

        This section shows no-wait latch misses - the cost of a "miss"
        is difficult to assess - it is likely to be less significant than
        a "sleep" in the previous section.

<H>
SVRMGR> Rem Buffer busy wait statistics. 
   CLASS              COUNT            TIME            

        This section indicates any waiting for block access and is only 
        really important if there is evidence of delays being introduced 
        due to block contention.

<I>
SVRMGR> Rem Waits_for_trans_tbl high implies you should add rollback ...
   UNDO_SEGMENT        TRANS_TBL_GETS      TRANS_TBL_WAITS    ... 

        This sections indicates rollback segment activity.

<J>
SVRMGR> Rem The init.ora parameters currently in effect:
   NAME                                    VALUE             

        A list of non-default init.ora parameter settings. 

<K>
   NAME            GET_REQS GET_MISS SCAN_REQ SCAN_MIS MOD_REQS ...

        This indications how well the dictionary cache is performing.  This 
        is self managed by Oracle so this section can generally be ignored.
        This section was important in Oracle6.

<L>
SVRMGR> Rem Sum IO operations over tablespaces.
   TABLE_SPACE                                                                  
   READS      BLKS_READ  READ_TIME  WRITES     BLKS_WRT   WRITE_TIME MEGABYTES 
 
        Both this and the next section indicate where read and write IO is
        occurring, and show which tablespaces are servicing full table scans.

<M>
SVRMGR> Rem I/O should be spread evenly across drives. 
   TABLE_SPACE                    FILE_NAME                                     
   READS      BLKS_READ  READ_TIME  WRITES     BLKS_WRT   WRITE_TIME MEGABYTES 

        Per file read and write IO.

<N>
SVRMGR> Rem The times that bstat and estat were run.
   START_TIME                END_TIME                 
   ------------------------- -------------------------
   14-may-98 10:30:22        14-may-98 11:30:06       

        It is important to know the duration of any report. Longer reports 
        tend to average out "peaks" and may hide problem situations.

<O>
   BANNER                                                          
   ----------------------------------------------------------------
   Oracle7 Server Release 7.3.3.5.0 - Production Release           

        Which version is being used - only the first 3 digits should be
        treated as reliable.

UTLESTAT - What to look at

1. Skim over the report and look for negative figures. Usually these mean the database was stopped / started between the bstat and estat runs. Prior to Oracle 7.2 it was common for a few of the larger figures to "wrap" to give negative results (especially the 'memory' statistics). This is less likely to occur in Oracle 7.2 onwards.

2. From section <B> note down the CPU used by this session statistic. This is the cumulative amount of CPU that all sessions have used in the sample period - the name is rather misleading ! Also note down the parse time cpu figure. ('parse time cpu' is included within 'CPU used by this session'). These statistics will be used elsewhere for comparisons.

3. Sections <D> Wait Events and <E> Background Wait Events
In 7.1 and 7.2 there is only one wait-event list for all processes. This, along with your CPU figures from '2', gives you the information on where most time is going. Whenever an Oracle process waits for something it records it as a 'wait' using one of a set of predefined waits (see V$EVENT_NAME for a list of all wait events). Some of these events can be considered as 'idle' events - ie: the process is waiting for work. Others indicate time spent waiting for a resource or action to complete. By comparing the relative time spent waiting on each wait event and the CPU used by this session (from 2 above) we can see where the Oracle instance is spending most of its time.

In order to get an indication of where time is going:

- Ignore any 'idle' wait events. Common IDLE wait events include:

     client message               } Process waiting for data from the
     SQL*Net message from client  } client application
     SQL*Net more data from client}
     rdbms ipc message            - Usually background process waiting for work
     pipe get                     - DBMS_PIPE read waiting for data
     Null event                   - Miscellaneous
     pmon timer                   - PMON waiting for work
     smon timer                   - SMON waiting for work
     parallel query dequeue       - Waiting for input (Discussed later)

- Ignore any wait events which represent a very small percentage of the total time waited.
- Add up the remaining wait event times and show each one as a percentage of total time waited.
- Compare the total time waited with the CPU used by this session figure.

This summary gives a quick indication of where most of the time is going. Eg: if 50% of the time is spent waiting for 'enqueue' there is probably (but not necessarily) a serious locking conflict in the application. If CPU used is above 25% of the total time waited the application may be CPU heavy (there are no hard rules as to what is a high value for CPU used). The top wait events and CPU figures help determine which other sections of the bstat/estat report are worth looking at.

WARNING: The CPU statistics are only updated by Oracle when a user call completes. This means that long running PL/SQL jobs which are CPU intensive will NOT show up in the CPU figures until the outermost PL/SQL block completes. This also means that long running statements that started before UTLBSTAT was run and complete during the sampled period will add their entire CPU usage to the CPU used by this session statistic potentially distorting the figures.

4. If everything in the report looks normal then look at the IO and CPU related sections of this article. If Oracle is behaving well for the workload presented to it there may still be scope to reduce the workload (tuning the SQL etc..)

The aim of the above steps is to identify areas that account for a significant portion of the overall time. Once a potential problem area has been identified one can then look at what options are available to improve the timings. In some cases there are suggestions within this text but you may be referred to separate articles.

>From this point onwards you can either look at the sections appropriate to you or read the entire article. The table below suggests where you should go next for specific scenarios - the wait event names are hyper-links to the relevant section. If using a hard copy of this article then the sections follow the table in the same order as presented here.

Using CPU

High CPU usage You need to know where the CPU is going to

Common Wait Events (alphabetical order)

buffer busy waits Wait to access a block buffer

db file scattered read Multi-block read

db file sequential read OS block read - usually single block read

enqueue Waiting for an enqueue (lock)

free buffer waits Waiting for a free buffer to use.

latch free Waiting for a latch

library cache pin Waiting for a PIN on a library cache object.

log buffer xxx Waiting for log buffer space events

log file switch xxx Waiting for log file switch events

log file sync Waiting on LGWR to sync redo to disk

SQL*Net message from dblink Waiting for response from remote DB

write complete waits Wait for a write of a block buffer to complete.

Special wait events

Parallel query dequeue PQ slave or QC is waiting for a message. Ignore this in calculations

If the top wait events are not listed in the table above and do not appear as idle wait-event list then see Systemwide Waits for Other Wait Events.

High CPU Usage

There can be many reasons that the CPU is a large portion of the overall time spent within Oracle. In a well tuned system the waits will have been reduced as far as possible and all time spent in Oracle uses CPU to accomplish the requested task. However, it is unlikely you would be reading this if that were the case so we need to consider what the CPU cycles are being used for in order to decide if Oracle is doing useful work or not.

Get the parse time cpu and CPU used by this session figures from the "Statistics" section of the estat report. If parse time cpu represents a large percentage of the CPU time then time is being spent parsing rather than executing statements. If this is the case then it is likely that the application is using literal SQL and not sharing it or the shared pool is very badly configured. See <Note:62143.1> for information on the shared pool and the impact of literal SQL.

If the parse CPU is only a small percentage of the total CPU used then the next task is to determine where the CPU is going. There are several things we can do to help with this:

Find statements with large numbers of buffer gets as these are typically heavy on CPU
Find which sessions are responsible for most CPU usage
Trace typical user sessions using the SQL_TRACE option to see how CPU is apportioned amongst the main application statements. See <Note:62160.1>

We cover each of these below:

Locating CPU-Heavy SQL

The statement below finds SQL statements which access database buffers a lot. Such statements are typically heavy on CPU usage because they are probably looking at a lot of rows of data:

  SELECT address, hash_value, 
         buffer_gets, executions, buffer_gets/executions "Gets/Exec", 
         sql_text
    FROM v$sqlarea 
   WHERE buffer_gets > 50000 and executions>0
   ORDER BY 3
   ;

The cut-off value of 50000 is an arbitrary starting point and should be increased/decreased gradually until the top 10 - 20 statements are listed. This shows the SQL statements which have the most BUFFER_GETS which typically relates to using most CPU. The statements of interest are those with a large number of gets per execution especially if executions is high. One off statements with high buffer gets may be due to batch jobs so it helps if you have an understanding of the application components to know which statements are expected to be expensive.

NB: The above statement will not highlight CPU intensive PL/SQL blocks

Once candidate statements have been isolated the full statement text can be obtained using following query substituting relevant values for ADDRESS and HASH_VALUE pairs:

  SELECT sql_text 
    FROM v$sqltext
   WHERE address='&ADDRESS_WANTED'
     and hash_value=&HASH_VALUE
   ORDER BY piece
   ;

The statement can then be explain (using EXPLAIN PLAN) or isolated for further testing to see how CPU intensive it really is. Note that if the statement uses bind variables and your data is highly skewed then the statement may only be CPU intensive for certain bind values.

Locating CPU-Heavy Sessions

The statement below can help locate sessions which have used the most CPU:

  SELECT v.sid, substr(s.name,1,30) "Statistic", v.value
    FROM v$statname s , v$sesstat v
   WHERE s.name = 'CPU used by this session'
     and v.statistic#=s.statistic#
     and v.value>0
   ORDER BY 3
   ;

Note that the CPU time is cumulative so a session which has been connected for several days may appear to be heavier on CPU than one which has only been connected for a short period of time. Thus it is better to write a script to sample the CHANGE in the statistic between 2 known points in time allowing one to see how much CPU was used by each session in a known timeframe.

Once any CPU intensive sessions have been identified <View:V$SESSION> can be used to get more information. At this stage it is generally best to revert to user session tracing (SQL_TRACE) to determine where the CPU is being used. See <Note:62160.1>

Wait Events

The following few sections describe the meanings of the most common wait events from sections C and D of the Estat report. For each wait event we attempt to drill down to locate more information on what is causing the waits. The wait events are included in alphabetical order.

System wide waits for "buffer busy waits"

This wait happens when a session wants to access a database block in the buffer cache but it cannot as the buffer is "busy". The two main cases where this can occur are:

Another session is reading the block into the buffer
Another session holds the buffer in an incompatible mode to our request

If the TIME spent waiting for buffers is significant then we need to determine which segment/s is/are suffering from contention. The "Buffer busy wait statistics" section of the Bstat/estat shows which block type/s are seeing the most contention. This information is derived from the <View:V$WAITSTAT> which can be queried in isolation:

  SELECT time, count, class 
    FROM V$WAITSTAT 
   ORDER BY time,count
  ;

This shows the class of block with the most waits at the BOTTOM of the list. Additional information can be obtained from the internal view X$KCBFWAIT thus:

  SELECT count, file#, name
    FROM x$kcbfwait, v$datafile
   WHERE indx + 1 = file#
   ORDER BY count
  ;

This shows the file/s with the most waits (at the BOTTOM of the list) so by combining the two sets of information we know what block type/s in which file/s are causing waits. The segments in each file can be seen using a query like:

  SELECT distinct owner, segment_name, segment_type
    FROM dba_extents
   WHERE file_id= &FILE_ID 
  ;

If there are a large number of segments of the type listed then monitoring V$SESSION_WAIT may help isolate which object is causing the waits. Eg: Repeatedly run the following statement and collect the output. After a period of time sort the results to see which file & blocks are showing contention:

  SELECT p1 "File", p2 "Block", p3 "Reason"
    FROM v$session_wait 
   WHERE event='buffer busy waits'
  ;

Note:
In the above query there is no reference to WAIT_TIME as you are not interested in whether a session is currently waiting or not, just what buffers are causing waits.

Alternatively capturing session trace may help - See <Note:62160.1>. If a particular block or range of blocks keep showing waits you can try to isolate the object using:

  SELECT distinct owner, segment_name, segment_type
    FROM dba_extents
   WHERE file_id= &FILE_ID 
     and &BLOCK_NUMBER between block_id and block_id+blocks-1
  ;

Actions

As buffer busy waits are due to contention for particular blocks then you cannot take any action until you know which blocks are being competed for and why. Eliminating the cause of the contention is the best option. Note that "buffer busy waits" for data blocks are often due to several processes repeatedly reading the same blocks (eg: if lots of people scan the same index) - the first session processes the blocks that are in the buffer cache quickly but then a block has to be read from disk - the other sessions (scanning the same index) quickly 'catch up' and want the block which is currently being read from disk - they wait for the buffer as someone is already reading the block in.

The following hints may be useful for particular types of contention - these are things that MAY reduce contention for particular situations:

Block Type Possible Actions

data blocks Eliminate HOT blocks from the application. Check for repeatedly scanned / unselective indexes. Change PCTFREE and/or PCTUSED. Check for 'right- hand-indexes' (indexes that get inserted into at the same point by many processes). Increase INITRANS. Reduce the number of rows per block.

segment header Increase of number of FREELISTs. Use FREELIST GROUPs (even in single instance this can make a difference).

freelist blocks Add more FREELISTS. In case of Parallel Server make sure that each instance has its own FREELIST GROUP(s).

undo header Add more rollback segments.

System wide waits for "db file sequential read"

This wait happens when a session is waiting for an IO to complete. Typically this indicates single block reads, although reads from a disk sort area may use this wait event when reading several contiguous blocks. Remember IO is a normal activity so you are really interested in unnecessary or slow IO activity.

If the TIME spent waiting for IOs is significant then we can determine which segment/s Oracle has to go to disk for. See the "Tablespace IO" and "File IO" sections of the ESTAT report to get information on which tablespaces / files are servicing the most IO requests, and to get an indication of the speed of the IO subsystem.

Actions

Block reads are fairly inevitable so the aim should be to minimise un-necessary IO. This is best achieved by good application design and efficient execution plans. Changes to execution plans can yield orders of magnitude changes in performance. Tweaking at system level usually only achieves percentage gains.

A larger buffer cache may help - test this by actually increasing DB_BLOCK_BUFFERS and not by using DB_BLOCK_LRU_EXTENDED_STATISTICS. Never increase the SGA size if it may induce additional paging or swapping on the system.

A less obvious issue which can affect the IO rates is how well data is clustered physically. Eg: Assume that you frequently fetch rows from a table where a column is between two values via an index scan. If there are 100 rows in each index block then the two extremes are:

Each of the table rows is in a different physical block (100 blocks need to be read for each index block)
The table rows are all located in the few adjacent blocks (a handful of blocks need to be read for each index block)

Pre-sorting or re-organising data can help to tackle this in severe situations.

System wide waits for "db file scattered read"

This wait happens when a session is waiting for a multiblock IO to complete. This typically occurs during full table scans or index fast full scans. Oracle reads up to DB_FILE_MULTIBLOCK_READ_COUNT consecutive blocks at a time and scatters them into buffers in the buffer cache. How this is done depends on the value of USE_READV.

If the TIME spent waiting for multiblock reads is significant then we need to determine which segment/s we are performing the reads against. See the "Tablespace IO" and "File IO" sections of the ESTAT report to get information on which tablespaces / files are servicing multiblock reads (BLKS_READ/READS>1).

It is probably best at this stage to find which sessions are performing scans and trace them to see if the scans are expected or not. This statement can be used to see which sessions may be worth tracing:

  SELECT sid,total_waits, time_waited
    FROM v$session_event
   WHERE event='db file scattered read'
     and total_waits>0
   ORDER BY 3,2
  ;

Actions

Ideally you do not want to repeated perform full table scans in online portions of an application when there is a faster more selective way to get at the data. In non online portions of an application table scanning is much more likely to be required so the following points may help:

USE_READV can have a dramatic effect on the performance of table scans. On many platforms USE_READV=FALSE performs better than TRUE but this should be tested.
DB_FILE_MULTIBLOCK_READ_COUNT should generally be made as large as possible. The value is usually capped by Oracle and so it cannot be set too high. The 'capped' value differs between platforms and versions and usually depends on the settings of DB_BLOCK_SIZE and USE_READV.
The table CACHE feature can be used to help cache tables that are frequently scanned. Additionally use of multiple buffer pools and the CACHE option can help in Oracle 8. Take care with the CACHE option as this can cause the buffers in the cache to age more quickly possibly creating problems with the Average Write Queue length.
On some releases / platforms a ROWID range scan will perform better than a normal full table scan, even if the ROWID range covers the whole table as the range scan may cause read-ahead to be used. Eg:
```
        Oracle7:
          SELECT /*+ ROWID(T)*/ * FROM tab T where ROWID>='0.0.0';

        Oracle8:
          In Oracle8 read-ahead is only enabled for Parallel Query Slave
	  processes.
```
WARNING: The ROWID hint causes a range checkpoint and so can have an adverse performance impact, especially in Parallel Server environments.

System wide waits for "enqueue"

A wait for an enqueue is a wait for a local 'lock'. The count and average wait times for this wait-event can be misleading as "enqueue" waits re-arm every few seconds. To qualify how many waits have really occurred you need the enqueue waits statistic from the statistics section of the estat report.

Eg: Assume
        Wait Events show enqueue time_waited=3000 total_waits=10
        Statistics  show enqueue waits has a count of 2
    This means there were 2 actual waits whose individual wait times totalled 
    to 3000 hundredths of a second (ie: 30 seconds).

To determine which enqueues are causing the most waits system-wide look at <View:X$KSQST> thus:

  SELECT  ksqsttyp "Lock", 
          ksqstget "Gets",
          ksqstwat "Waits"
    FROM X$KSQST where KSQSTWAT>0;

This gives the system wide number of waits for each lock type. Remember that it only takes one long wait to distort the average wait time figures.

Actions

The action to take depends on the lock type which is causing the most problems. It is beyond the scope of this article to look at the reasons for waits on each lock type - the most common lock waits are generally for:

     TX      Transaction Lock 
                Generally due to application or table setup issues
                See <Note:62354.1> for example scenarios which can cause
                TX lock waits.

     TM      DML enqueue
                Generally due to application issues, particularly if 
                foreign key constraints have not been indexed.
                See <OLS:106754.289> for details of TM locks and 
                referential integrity.

     ST      Space management enqueue 
                Usually caused by too much space management occurring 
                (Eg: small extent sizes, lots of sorting etc..) 
                See <Note:33567.1> for more information about the ST
                enqueue.

System wide waits for "free buffer waits"

This wait happens when a session has to wait for a free buffer in the buffer cache. Eg: A buffer is needed for a session to read a database block from disk into the buffer cache. If the TIME spent waiting for free buffers is significant then it is likely that DBWR is not keeping up with the cache throughput rate.

Actions

See <Note:62172.1> on issues affecting the buffer cache and DBWR.

System wide waits for "latch free"

Latches are like short duration locks that protect critical bits of code. As a latch wait is typically quite short it is possible to see a large number of latch waits which only account for a small percentage of time.

If the TIME spent waiting for latches is significant then we need to determine which latches are suffering from contention. The Bstat/estat section on latches shows latch activity in the period sampled. This section of the estat report is based on <View:V$LATCH> gives a summary of latch activity since instance startup and can give an indication of which latches are responsible for the most time spent waiting for "latch free" thus:

  SELECT latch#, name, gets, misses, sleeps
    FROM v$latch
   WHERE sleeps>0
   ORDER BY sleeps 
  ;

  Note that this select gives the worst latches at the BOTTOM of the list.

Some lines in this report are actually for multiple latches all of the same type. To determine if the latch activity is concentrated on one particular latch in the set one can query <View:V$LATCH_CHILDREN> (only available from 7.3 onwards)

Oracle 7.3 and Oracle 8.0

<View:V$LATCH_CHILDREN>

  SELECT addr, latch#, gets, misses, sleeps
    FROM v$latch_children
   WHERE sleeps>0
     and latch# = &LATCH_NUMBER_WANTED
   ORDER BY sleeps 
  ;

This gives the system wide number of waits for each child latch of the type LATCH#. If there are no rows returned then there is only a single latch of the type you are looking at.

If there are multiple rows the important thing to note is whether the SLEEPS are reasonably distributed or if there are one or two child latches responsible for 80% of the SLEEPS. If the contention is focused on one or two child latches make a note of which children are seeing a problem - note the ADDR column.

Actions

The latches most likely to show high sleeps are listed below along with some possible actions:

  shared pool latch
      In Oracle 7.1/7.2 there are known issues involving the shared
      pool and library cache latches. See <Note:32871.1> for details.

      Heavy use of literal SQL will stress this latch significantly.
      If your online application makes heavy use of literal SQL statements
      then converting these to use bind variables will give significant
      improvements in latch contention in this area.
      See <Note:62143.1> for issues affecting the shared pool.

      MTS places extra load on this latch - the use of the LARGE_POOL
      in Oracle8 may help this. 

  library cache pin (7.1) and
  library cache latches
      From Oracle 7.2 onwards the library cache latch has child latches .
      Problems with these latches are typically due to heavy use of literal 
      SQL or very poor shared pool configuration. 
      If your online application makes heavy use of literal SQL statements
      then converting these to use bind variables will give significant
      improvements in latch contention in this area.
      See <Note:62143.1> for issues affecting the shared pool.
      
        
  cache buffers lru chain latch
      Setting DB_BLOCK_LRU_STATISTICS to TRUE can adversely affect this 
      latch - always ensure DB_BLOCK_LRU_STATISTICS is set to FALSE.
      From Oracle 7.3 it is possible to have multiple of these latches 
      by specifying DB_BLOCK_LRU_LATCHES although this really needs multiple
      CPU's to be of most benefit.
      Heavy contention for this latch is generally due to heavy buffer cache 
      activity which can be caused, for example, by:

                Sorting in buffer cache and not using SORT_DIRECT_WRITES

                Repeatedly scanning large unselective indexes

      See <Note:62172.1>


  cache buffers chains latches
      Individual block contention can show up as contention for one of these
      latches.  Each cache buffers chains latch covers a list of buffers in 
      the buffer cache. If one or two child latches stand out from 
      V$LATCH_CHILDREN then:

                SELECT dbafil,dbablk,class, state 
                  FROM x$bh
                 WHERE hladdr='&ADDR_OF_CHILD_LATCH'
                ;

      If this list contains lots of entries for the same dbafil, dbablk
      in 7.1/7.2 then see <Note:38350.1> and <Bug:256032>.

      If this list is short (3 to 10 buffers) then one of the buffers in
      this list is probably very 'hot' - ie: suffers from lots of concurrent
      access attempts.

Additionally Oracle 7.3 introduces a new view which may be of help to Oracle Support in more obscure cases:

  SELECT sleep_count, "WHERE"
    FROM v$latch_misses
   WHERE parent_name='&PROBLEM_LATCH_NAME'
   ORDER BY 1
  ;

This shows circumstances where latches were requested but not obtained immediately.

System wide waits for "library cache pin"

Waits on the "library cache pin" wait event occur if a session wants to use an item in the shared pool but someone else has the item pinned in an incompatible mode. This is normally only seen if there is heavy contention for a particular object or statement in the shared pool.

Addressing this problem is likely to need the help of Oracle support but the following few statements should help obtain useful information.

Repeatedly run the following statement and collect the output. After a period of time sort the results to see if the waits are always for a single "handle" or for several different handles.

  SELECT p1 "Handle"
    FROM v$session_wait 
   WHERE event='library cache pin'
  ;

Also isolate any statements which have high version counts:

  SELECT address, to_char(hash_value,'9999999999999'), version_count, sql_text
    FROM v$sqlarea
   WHERE version_count>10
    ORDER BY version_count
  ;

Actions

If a particular statement has a high version count it is possible that there is a problem. Contact Oracle support if this wait event is showing as a significant chunk of the total time.

System wide waits for "log file space/switch" and "log buffer ..." events

There are several wait events which may indicate problems with the redo buffer and redo log throughput:

7.0 to 7.2:
              log buffer space/switch

7.3 onwards:
              log buffer space 
              log file switch (checkpoint incomplete) 
              log file switch (archiving needed) 
              log file switch/archive
              log file switch (clearing log file) 
              log file switch completion 
              switch logfile command

Actions

For the "log buffer space" events:

Increase the LOG_BUFFER size until there is no incremental benefit (sizes > 1Mb are unlikely to add any benefit)

For all other waits:

Ensure redo logs are on fast disks (NOT RAID5)
Ensure redo logs are large enough to give a sensible gap between log switches. A 'rule-of-thumb' is to have one log switch every 15 to 30 minutes.
Ensure the ARCHiver process is running and keeping up.

For further possible actions see "Tuning LGWR".

System wide waits for "log file sync"

This wait happens when a commit (or rollback) is issued and the session has to wait for the redo entry to be flushed to disk to guarantee that instance failure will not roll back the transaction.

Actions

The 2 ways to reduce waits for this are:

Where possible reduce the commit frequency. Eg: Commit at batch intervals and not per row.
Speed up redo writing (Eg: Do NOT use RAID 5, use fast disks etc..)

See "Tuning LGWR" for other actions.

Actions

This can be tackled from two angles:

Decrease the cache aging rate
Increase the throughput of DBWR

These are described in <Note:62172.1>

System wide waits for "Parallel Query Dequeue"

Parallel query dequeue waits are problematic in that the same wait event is used in both 'idle' and 'busy' situations. This article does not cover the details of parallel query so we assume parallel query dequeue waits are idle waits and ignore them in calculations. This is not a bad assumption as any query slave waiting on a message is dependent on the supplier/s of that message - the supplying process/es should either be working (using CPU) or waiting for some other wait event (Eg: IO) so the overview seen in the bstat/estat output should still give an indication of the source of any wasted time.

Actions

If your site makes significant use of parallel query for the periods being looked at then it is best to tune the queries independently. This is covered in other articles.

System wide waits for other wait events

There are a large number of wait events, many of which occur infrequently. If any other wait event represents a significant portion of the time waited then the first thing you need to know is whether the wait represents:

Idle time - Ie: A process was just waiting for work
A real wait which is hindering performance

Contact Oracle Support and/or use Metalink to help determine whether the wait is important and if so how to find more information about why and where the waits are occurring.

Actions

Have all information gathered to date available before contacting support.

More details on each BSTAT / ESTAT Section

The following notes give more detail on each section of the BSTAT/ESTAT report. The order of the sections here match the order give in the "UTLESTAT Output" summary earlier in this article.

<A> Library Cache Statistics

Based on <View:V$LIBRARY_CACHE>
The columns are:

   LIBRARY              Different object types are stored in different
                        'libraries'.
   GETS                 Number of times an item in this 'library' was 
                        requested.
   GETHITRATIO          Percentage of times the item was requested and found
                        to already be present in the cache.
   PINS                 Number if times an item was 'pinned'. An item has to be
                        'pinned' in order to be used. Items may be looked up
                        once and then pinned/unpinned many times.
   PINHITRATIO          Percentage of times an item was requested to be pinned
                        and was successfully pinned.
   RELOADS              Number of times an item had to be re-loaded because
                         part of it had been flushed from the cache and was 
                         needed.
   INVALIDATIONS        Number of times objects were invalidated in the sample
                         period.

If there are a significant number of RELOADS then re-usable information is being flushed from the SGA and thus having to be reloaded / rebuilt. See <Note:62143.1> for details about setting up the shared pool.

If the GetHitRatio or PinHitRatio is low (<.9) then it is likely that the application is using unsharable SQL or infrequently referencing objects. If there is a poor hit ratio see <Note:62143.1> Note that due to the way Oracle works it is unusual for the hit ratios in this section to very low.

<B> System Statistics

Based on <View:V$SYSSTAT>
This is a very long section full of lots of statistical information. The columns are:

   Statistic            Statistic name 
   Total                Change in this statistic between sample times
   Per Transact         "Total" divided by number of COMMITs in 
                         the sample period.
   Per Logon            "Total" divided by the number LOGONS that
                         occurred in the sample period.
   Per Second           "Total" divided by time interval between
                         the samples.

Most of the statistics can be ignored as they are only useful in specific circumstances. Below are some of the more commonly referenced statistics:

    CPU used by this session    Actually the CPU used across all sessions
                                 in the sample period
    consistent gets             Gets of a block in consistent mode
    db block gets               Gets of a block in current mode
    enqueue waits               Actual number of waits for a lock
    execute count               Number of execute calls
    parse count                 Number of parse calls
    parse count (total) V8        "             "
    parse count (hard) V8        resulting in a hard parse
    parse time cpu              CPU spent parsing
    physical reads              Number of blocks physically read . Note this 
                                 is blocks read and NOT the number of IO read 
                                 requests.
    recursive calls             Number of recursive calls (either for
                                 dictionary information or PLSQL)
                                        
    sorts (disk)                Number of disk based sorts
    table scans (long tables)   Number of full table scans
    user calls                  Number of user calls

There is no intention to describe all the statistics here especially as many have self describing names. Some of the statistics may be useful in particular situations so if the text refers to "statistic XXX" then this is the section of the report to find the value. A few notes on some of the more widely used statistics are included below.

WARNING: CPU statistics are only updated by Oracle when a user call completes. This means that long running PL/SQL jobs which are CPU intensive will NOT show up in the CPU figures until the outermost PL/SQL block completes.

The HIT RATIO

The hit ratio for the buffer cache is calculated as

        hit ratio =     1   -   (       physical reads            )
                                ( ------------------------------- )
                                ( consistent gets + db block gets )

Converting this to a percentage (multiply by 100) gives the percentage of times an Oracle process asked for a data block and found it in the buffer cache. A good hit ratio is expected for OLTP type systems but decision support type systems may have much lower hit ratios.

Note:In Oracle 7.3.4 and Oracle8 the "physical reads" statistic includes direct block reads as well as reads to get data into the buffer cache. Hence the above formula only gives an lower bound for the hit ratio on these releases. See <Note:33883.1> for more information.

A good hit ratio does not mean the application is good. It is quite possible to get an excellent hit ratio by using a very unselective index in a heavily used SQL statement.
Eg: Consider a statement like:

        select * from employee where empid=1023 and gender='MALE';

If EMPLOYEE is a large table and this statement always uses the GENDER index rather than the EMPID index then you scan LOTS of blocks (from the GENDER index) and find nearly all of them in the cache as everyone is scanning this same index over and over again. The hit ratio is excellent but performance is very bad. Conversely repeated full table scans of large tables do not cause the blocks to be cached so make the hit ratio worse.

In Oracle8 it is possible to use multiple buffer pools (by setting BUFFER_POOL_KEEP or BUFFER_POOL_RECYCLE init.ora parameters). Multiple buffer pools are not discussed here but the hit ratios can be seen using the V$BUFFER_POOL_STATISTICS view which is created by the CATPERF.SQL script.

Calls

parse count, execute count, user calls and recursive calls give an indication of the number of requests the server processed in the sample period and whether a significant portion of these were parse calls .

Misconceptions

This short section describes a few of the statistics which have been seen incorrectly documented in the past:

Per Logon figures in this section are really 'per fresh logon in the sampled period' so may be misleading. Eg: 400 users logged on but only 5 new logons occurred in the sample period.
cluster key scan statistics cannot be used to indicate problems with clusters. Typically these statistics will always contain figures due to dictionary operations.
enqueue waits are the number of times a process waited for an enqueue (lock). This differs from the figure in the wait events section as enqueue waits time out (usually every 3 seconds). Hence a single session waiting for 1 minute shows up as 20 waits for 'enqueue' in the wait event section but only ONE enqueue wait in the statistics.
enqueue timeouts are the number of times a process waited for an enqueue (lock) and then gave up totally timing out. Eg: You may see these due to DISTRIBUTED_LOCK_TIMEOUT being exceeded in a distributed transaction. Comments about changing ENQUEUE_RESOURCES are ill-founded.
recursive calls can be due to dictionary access or normal PLSQL so the ratio of recursive calls to user calls has no meaning.
session pga/uga memory the session memory figures may wrap at system level (especially in Oracle 7.1) so should be considered as useless. Better information on memory usage can be seen in V$SESSTAT.

<C> Average Write Queue

Based on figures from <View:V$SYSSTAT>
The average write queue length gives an indication of whether the cache is being cleared of dirty blocks fast enough or not. Unless the system is very high throughput the write queue should never be very large. As a general guideline if the average write queue is in double figures it is certainly worth finding where the activity is coming from. If the figure is in single digits there may still be un-necessary activity in the cache.

The 2 main ways to improve the write queue length are:

Eliminate any unnecessary write IO from going via the cache.

Improve DBWR throughput. Eg: Use asynchronous writes, multiple database writer processes, larger block write batch.

See <Note:62172.1> for information about the Tuning the buffer cache and DBWR.

<D> Wait Events and <E> Background Wait Events

Based on <View:V$SYSTEM_EVENT>
In 7.1 and 7.2 there is only one wait-event list for all processes. In 7.3 and Oracle8 the waits are split out for background and user processes. This section , along with information about CPU usage gives a picture of what Oracle is really doing.

Columns in this section:

   Event Name           Name of the waitevent
   Count                Number of times a wait occurred in the sample
                         period
   Total Time           Total time spent waiting (1/100ths of a second)
   Avg Time             Average time per wait (1/100ths of a second)

See "What to look at" for details of this section of the report.

<F> Latch Waits

Based on <View:V$LATCH>
This section is important if there are significant waits for latch free in the list of wait events.

Columns in this section:

   LATCH_NAME           Name of the latch. Note that some latches have a 
                        parent with children so this may be a summary line.   
   GETS                 Number of times latch was requested
   MISSES               Number of times it could not be acquired on the
                        first attempt
   HIT_RATIO            Ratio indicating percentage of times we acquire a 
                        latch as soon as it is requested. Many articles 
                        indicate this is important but it can be misleading. 
                        Eg: GETS=10, MISSES=8 gives a very poor hit ratio 
                            but is insignificant if the total latch misses 
                            are 200000.
   SLEEPS               This is much more important - it indicates the number 
                        of times we had to sleep because we could not get a 
                        latch. The latch/es with the highest SLEEPS figures 
                        are the ones to concentrate on.
   SLEEPS/MISS          Average sleeps per miss gives an idea how long
                        each get is having to wait to get the latch.

Time spent waiting for a latch is time wasted. Isolate the latches with the top SLEEPS figures and refer to 'latch free' for what to do with this information.

<G> NoWait Latch Misses

Based on <View:V$LATCH>
If an Oracle process asks for a latch but is not prepared to wait for that latch it can request the latch NOWAIT. If the latch is not available the process may take some other course of action. It is very difficult to assess the impact of NOWAIT MISSES as the cost of a miss can be very small.

<H> Buffer Busy waits

Based on <View:V$WAITSTAT>
This section is useful if the wait event buffer busy waits is a significant portion of the time waited. It shows which class/es of block are responsible for the waits. Such waits can be due to heavy contention for data blocks, segment headers, undo blocks etc.. Isolating the true cause of the waits needs more information than is contained in the bstat/estat report.

The columns are:

   CLASS                Block class waited for.
   COUNT                Number of waits
   TIME                 Total time waited (1/100ths of a second)

See buffer busy waits for more details.

<I> Rollback Segment statistics

Based on <View:V$ROLLSTAT>
This section is useful to see the size and usage of the rollback segments.

Columns in this section are

   UNDO_SEGMENT         Undo segment number. Get the name from:
                         SELECT * form DBA_ROLLBACK_SEGS 
                          WHERE segment_id=&UNDO_SEGMENT;
   TRANS_TBL_GETS       Number of gets for the rollback segment header
                         in the sample period
   TRANS_TBL_WAITS      Number of waits for the rollback segment header
                         in the sample period
   UNDO_BYTES_WRITTEN   Number of bytes written to the rollback segment
                         in the sample period
   SEGMENT_SIZE_BYTES   Size of the rollback segment at end of sample
                         period
   XACTS                Meaningless: active Xacts at end (estat) MINUS 
                                     active transactions at start
   SHRINKS              Number of rollback segment shrinks during the
                         sample period.
   WRAPS                Number of wraps from one extent to another
                         during the sample period.

TRANS_TBL_WAITS indicates which rollback segment headers had waits for them. Typically you would want to reduce such contention by adding rollback segments. How much this is affecting performance can be gauged from the time spent waiting for buffer busy waits for undo segment header blocks.

If SHRINKS is non zero then one or more rollback segments has OPTIMAL set. While OPTIMAL can be useful there have been problems in certain Oracle versions related to rollback segments shrinking back to their optimal size.

<J> Parameter Settings

Based on <View:V$PARAMETER>
This shows non-default parameter settings. Note that the actual value of some of the parameters may differ from the settings in the init.ora file - Eg: DB_FILE_MULTIBLOCK_READ_COUNT generally has an upper bound enforced.

It is difficult to give general tips for init.ora parameters but check for the following:

Generic:
   CURSOR_SPACE_FOR_TIME        Should generally be FALSE
   DB_BLOCK_LRU_STATISTICS      TRUE may show latch contention in Oracle 7.2 
                                 onwards 
   EVENT                        Should only ever be set under guidance from
                                 Oracle Support.
   TIMED_STATISTICS             Should be TRUE

Operating System Specific:
   USE_READV                    TRUE may impact table scan speed
   REDUCE_ALARM                 TRUE may severely impact performance
   USE_ISM                      TRUE may cause problems on some platforms/
                                 versions 
   CPU_COUNT                    This should normally default correctly but
                                see platform specific information as on some
                                platforms/versions it may need setting
                                explicitly. 
   PRE_PAGE_SGA                 On some platforms TRUE may impact performance

<K> Dictionary Cache

Based on <View:V$ROWCACHE>
In Oracle6 days users had to tune the dictionary cache but it is now self tuning.

It is unlikely there is any problem in this section of the report. If there are any values with large GETS and a poor hit ratio check the figures with Oracle support.

<L> Tablespace IO and <M> File IO

Based on <View:V$FILESTAT>
These 2 sections can be very long as they list all tablespaces and files together with the number of IOs and timings for the IOs. Both of these sections report the same information, one summarising per tablespace and one per file.

The columns in these sections are:

   TABLE_SPACE      Tablespace name
   FILE_NAME        File name in this tablespace   
   READS            Number of read calls to the OS
   BLKS_READ        Number of blocks read.
                       The above two differ only when multi-block
                       reads are being performed. Eg: On full table
                       scans we read up to DB_FILE_MULTIBLOCK_READ_COUNT
                       blocks per read.
   READ_TIME        Total time for the reads in 1/100ths of a second
   WRITES           Number of OS write calls
   BLKS_WRT         Number of blocks written
   WRITE_TIME       Write time in 1/100ths of a second. NB: this figure
                    may be incorrect.

Take care with information presented in this section as it is now very common for files to be striped or managed by a volume manager so there is not any obvious relationship between each Oracle file and the physical disks. Note too that an Oracle IO is a call to the OS to perform an IO operation and may not result in a physical disk IO. For this reason OS statistics should be used when looking at tuning IO throughput.

The 4 important things you can get from this section are:

Which tablespace/s have most read activity
If tablespace/s are involved in multiblock reads
IO subsystem read times. >20mS per read is typically slow for single block reads. Very fast read times may indicate that reads are being satisfied from an OS cache.
Which tablespace/s have the most write activity

Read IO

Isolate the files (and tablespace names) where 80% of the read IO is occurring. Typically there will be many files with little IO and a few with heavy IO. If the read times are large (>40mS per block read) then look at the IO sub-system. If the read times are very small then it is possible that the blocks being read are cached below Oracle (Eg: in the OS buffer cache) which may indicate repeated reads from the same disk blocks are occurring.

Remember that in some applications random access of large amounts of data mean that you do expect to be using disk access to some information so do not expect to be able to eliminate read IO completely.

If there are an excessive number of reads occurring the source of these can often be isolated using the following two statements:

Locating IO heavy SQL:

  SELECT address, to_char(hash_value,'999999999999') "Hash Value", 
         disk_reads, executions, disk_reads/executions "Reads/Exec", 
         sql_text
    FROM v$sqlarea 
   WHERE disk_reads > 500 and executions>0
   ORDER BY 3
   ;

Locating IO heavy sessions to trace:

  SELECT v.sid, substr(s.name,1,30) "Statistic", v.value
    FROM v$statname s , v$sesstat v
   WHERE s.name =  'physical reads'
     and v.statistic#=s.statistic#
     and v.value>0
   ORDER BY 3
   ;

  Note: This shows reads since the session connected and does not 
        represent the IO rate of each session.

Once the statement and/or sessions have been isolated SQL_TRACE can be used.

Multi-Block reads

If any of these shows BLKS_READ/READS > 1 then there is some mutliblock reading occurring. If this is not expected look for full table scans occurring on non-cached tables. Repeated scans of the same tables may also accompanied by very fast read times when blocks are cached below Oracle (Eg: In the OS buffer cache)

See the db file scattered read wait event for further information.

Write IO

Isolate the tablespaces and files responsible for 80% of the write IO. In particular one should note if most IO is to a tablespace called TEMP or similar as this implies disk sorts are occurring. Use SORT_DIRECT_WRITES where possible to ensure such writes to not impact buffer cache performance (See <Note:62172.1>).

Write times are often incorrect so if the figures look unbelievable across all files then they probably are (unbelievable). If some files seem to have sensible write times and others do not then provided all files are the same type (RAW files or OS files) then it is worth checking if there are problems with particular files.

Using CPU
High CPU usage	You need to know where the CPU is going to
Common Wait Events (alphabetical order)
buffer busy waits	Wait to access a block buffer
db file scattered read	Multi-block read
db file sequential read	OS block read - usually single block read
enqueue	Waiting for an enqueue (lock)
free buffer waits	Waiting for a free buffer to use.
latch free	Waiting for a latch
library cache pin	Waiting for a PIN on a library cache object.
log buffer xxx	Waiting for log buffer space events
log file switch xxx	Waiting for log file switch events
log file sync	Waiting on LGWR to sync redo to disk
SQL*Net message from dblink	Waiting for response from remote DB
write complete waits	Wait for a write of a block buffer to complete.
Special wait events
Parallel query dequeue	PQ slave or QC is waiting for a message. Ignore this in calculations

Block Type	Possible Actions
data blocks	Eliminate HOT blocks from the application. Check for repeatedly scanned / unselective indexes. Change PCTFREE and/or PCTUSED. Check for 'right- hand-indexes' (indexes that get inserted into at the same point by many processes). Increase INITRANS. Reduce the number of rows per block.
segment header	Increase of number of FREELISTs. Use FREELIST GROUPs (even in single instance this can make a difference).
freelist blocks	Add more FREELISTS. In case of Parallel Server make sure that each instance has its own FREELIST GROUP(s).
undo header	Add more rollback segments.

UTLBSTAT / UTLESTAT (commonly known as bstat/estat reports)

BSTAT / ESTAT

UTLESTAT Output

UTLESTAT - What to look at

High CPU Usage

Locating CPU-Heavy SQL

Wait Events

System wide waits for "buffer busy waits"

Actions

System wide waits for "db file sequential read"

Actions

System wide waits for "db file scattered read"

Actions

System wide waits for "enqueue"

Actions

System wide waits for "free buffer waits"

Actions

System wide waits for "latch free"

Oracle 7.3 and Oracle 8.0

Actions

System wide waits for "library cache pin"

Actions

System wide waits for "log file space/switch" and "log buffer ..." events

Actions

System wide waits for "log file sync"

Actions

System wide waits for "SQL*Net message from dblink"

Actions

System wide waits for "write complete waits"

Actions

System wide waits for "Parallel Query Dequeue"

Actions

System wide waits for other wait events

Actions

More details on each BSTAT / ESTAT Section

The HIT RATIO

Calls

Misconceptions

<D> Wait Events and <E> Background Wait Events

<L> Tablespace IO and <M> File IO

Read IO

Multi-Block reads

Write IO