Zurück

The Power of Analytic Functions


Overview

One of the deficiencies of SQL is the lack of support for analytic calculations like moving averages, cumulative sums, ranking, percentile and lead and lag which are critical for OLAP applications. These functions work on ordered sets of data. Expressing these functions in current SQL is not elegant, requires self-joins and is difficult to optimize. To address this issue, Oracle has introduced SQL extensions called Analytic functions.

Analytic functions compute an aggregate value based on a group of rows. They differ from aggregate functions in that they return multiple rows for each group. The group of rows is called a window and is defined by the analytic clause.

Analytic functions are the last set of operations performed in a query except for the final ORDER BY clause. All joins and all WHERE, GROUP BY, and HAVING clauses are completed before the analytic functions are processed. Therefore, analytic functions can appear only in the select list or ORDER BY clause. More Information about Analytic Functions can be found here.

In this article we show the power of analytic functions on an example: Removing Duplicate Rows.

Example - Removing Duplicates

For the example we constructed a large table based on the ALL_OBJECTS table. Then we want to remove all Rows with duplicate OBJECT_NAME's.

Setup the large Table:

create table t as select * from all_objects;
insert into t (select * from t);    /* Many Times */

Count the unique Rows

select count(*), count(distinct OBJECT_NAME) from t;

  COUNT(*) COUNT(DISTINCT OBJECT_NAME)
---------- ---------------------------
   1697024                        5017

Here you can see that we have more than 1.6 million rows but only 5'017 unique ones. We need to delete 1'697'024 - 5'017 = 1'692'007 of my rows (lots). Let's see how speedy this can be done using analytic functions.

The normal (non Analytic Function) Approach:

delete from t
 where rowid not in (select min(rowid)
                       from t group by OBJECT_NAME);

This statements runs hours and hours, using nearly 100% CPU Power and even after increasing the rollback segment tablespace to 7GB, we are not getting the desired results. This is not the way to go!

For the sake of clarity we first demonstrate the deletion on a small table - we duplicate the EMP table: insert into emp (select * from emp), then we want to delete duplicate ENAME's.

The Analytic Function Approach:

The approach to removing duplicates is a little different. We would generate the set of ROWID's to delete by using analytics and then delete them, like this:

select count(*), count(distinct ENAME) from emp;

  COUNT(*) COUNT(DISTINCT ENAME)
---------- ---------------------
       28                     14

delete from emp
 where rowid in
(select rid
   from
   (select rowid rid,
           row_number()
           over (
              partition by ENAME
              order by rowid
           )
rn
      from emp
   )
   where rn <> 1
);

14 rows deleted.

Let's look closer to the solution - the inner query shows the groups:

SELECT ename Ename, deptno Deptno, rowid Rid,
  ROW_NUMBER()
  OVER (
     PARTITION BY ename
     ORDER BY rowid
  ) RNo

FROM emp
/

Ename          Deptno Rid                       RNo
---------- ---------- ------------------ ----------
ADAMS              20 AAABrzAAEAAAAASAAK          1
ADAMS              20 AAABrzAAEAAAAASAAY          2
ALLEN              30 AAABrzAAEAAAAASAAB          1
ALLEN              30 AAABrzAAEAAAAASAAP          2
BLAKE              30 AAABrzAAEAAAAASAAF          1
BLAKE              30 AAABrzAAEAAAAASAAT          2
CLARK              10 AAABrzAAEAAAAASAAG          1
CLARK              10 AAABrzAAEAAAAASAAU          2
FORD               20 AAABrzAAEAAAAASAAM          1
FORD               20 AAABrzAAEAAAAASAAa          2
JAMES              30 AAABrzAAEAAAAASAAL          1
JAMES              30 AAABrzAAEAAAAASAAZ          2
JONES              20 AAABrzAAEAAAAASAAD          1
JONES              20 AAABrzAAEAAAAASAAR          2
KING               10 AAABrzAAEAAAAASAAI          1
KING               10 AAABrzAAEAAAAASAAW          2
MARTIN             30 AAABrzAAEAAAAASAAE          1
MARTIN             30 AAABrzAAEAAAAASAAS          2
MILLER             10 AAABrzAAEAAAAASAAN          1
MILLER             10 AAABrzAAEAAAAASAAb          2
SCOTT              20 AAABrzAAEAAAAASAAH          1
SCOTT              20 AAABrzAAEAAAAASAAV          2
SMITH              20 AAABrzAAEAAAAASAAA          1
SMITH              20 AAABrzAAEAAAAASAAO          2
TURNER             30 AAABrzAAEAAAAASAAJ          1
TURNER             30 AAABrzAAEAAAAASAAX          2
WARD               30 AAABrzAAEAAAAASAAC          1
WARD               30 AAABrzAAEAAAAASAAQ          2

Now show only the Rows which must be deleted, these are all Rows with RNo <> 1.

SELECT Ename, Deptno, Rid, RNo
  FROM
    (SELECT ename Ename, deptno Deptno, rowid Rid,
      ROW_NUMBER()
      OVER (
         PARTITION BY ename
         ORDER BY rowid
      ) RNo
    FROM emp
    )
    WHERE RNo <> 1
/

ENAME          DEPTNO RID                       RNO
---------- ---------- ------------------ ----------
ADAMS              20 AAABrzAAEAAAAASAAY          2
ALLEN              30 AAABrzAAEAAAAASAAP          2
BLAKE              30 AAABrzAAEAAAAASAAT          2
CLARK              10 AAABrzAAEAAAAASAAU          2
FORD               20 AAABrzAAEAAAAASAAa          2
JAMES              30 AAABrzAAEAAAAASAAZ          2
JONES              20 AAABrzAAEAAAAASAAR          2
KING               10 AAABrzAAEAAAAASAAW          2
MARTIN             30 AAABrzAAEAAAAASAAS          2
MILLER             10 AAABrzAAEAAAAASAAb          2
SCOTT              20 AAABrzAAEAAAAASAAV          2
SMITH              20 AAABrzAAEAAAAASAAO          2
TURNER             30 AAABrzAAEAAAAASAAX          2
WARD               30 AAABrzAAEAAAAASAAQ          2

Now these Rows can easily be deleted:

DELETE FROM emp
 WHERE rowid IN
    (SELECT Rid
      FROM
        (SELECT rowid Rid,
          ROW_NUMBER()
          OVER (
             PARTITION BY ename
             ORDER BY rowid
          ) RNo
        FROM emp
        )
        where RNo <> 1
    );

14 rows deleted.

Conclusion

Now, we go back to the huge Table: t

DELETE FROM t NOLOGGING
 WHERE rowid IN
    (SELECT Rid
      FROM
        (SELECT rowid Rid,
          ROW_NUMBER()
          OVER (
             PARTITION BY object_name
             ORDER BY rowid
          ) RNo
        FROM t
        )
        where RNo <> 1
    );


1692007 rows deleted.

Elapsed: 00:08:57.15

Execution Plan
----------------------------------------------------------
   0      DELETE STATEMENT Optimizer=CHOOSE
   1    0   DELETE OF 'T'
   2    1     NESTED LOOPS
   3    2       VIEW OF 'VW_NSO_1'
   4    3         SORT (UNIQUE)
   5    4           VIEW
   6    5             WINDOW (SORT)
   7    6               TABLE ACCESS (FULL) OF 'T'
   8    2       TABLE ACCESS (BY USER ROWID) OF 'T'

The huge Deletion was complete after 9 Minutes instead of hours - Great!

select count(*), count(distinct OBJECT_NAME) from t;

  COUNT(*) COUNT(DISTINCT OBJECT_NAME)
---------- --------------------------
      5017                       5017

No more duplicate Rows on OBJECT_NAME.