In this article we show the power of analytic functions on an
example: Removing Duplicate Rows.
Example - Removing Duplicates
For the example we constructed a large table based on the
ALL_OBJECTS table. Then we want to remove all Rows with duplicate
OBJECT_NAME's.
Setup the large Table:
create table t as select * from all_objects;
insert into t (select * from t); /* Many
Times */
Count the unique Rows
select count(*), count(distinct OBJECT_NAME) from t;
COUNT(*) COUNT(DISTINCT
OBJECT_NAME)
---------- ---------------------------
1697024
5017
Here you can see that we have more than 1.6 million rows but only
5'017 unique ones. We need to delete 1'697'024 - 5'017 = 1'692'007 of my rows (lots).
Let's see how speedy this can be done using analytic functions.
The normal (non Analytic Function) Approach:
delete from t
where rowid not in (select min(rowid)
from t group by OBJECT_NAME);
This statements runs hours and hours, using nearly 100% CPU Power
and even after increasing the rollback segment tablespace to 7GB, we are not getting
the desired results. This is not the way to go!
For the sake of clarity we first demonstrate the deletion on a
small table - we duplicate the EMP table: insert into emp
(select * from emp), then we want to delete duplicate ENAME's.
The Analytic Function Approach:
The approach to removing duplicates is a little different. We
would generate the set of ROWID's to delete by using analytics and then delete them,
like this:
select count(*), count(distinct ENAME) from emp;
COUNT(*) COUNT(DISTINCT ENAME)
---------- ---------------------
28
14
delete from emp
where rowid in
(select rid
from
(select rowid rid,
row_number()
over (
partition by ENAME
order
by rowid
) rn
from emp
)
where rn <> 1
);
14 rows deleted.
Let's look closer to the solution - the inner query shows the
groups:
SELECT ename Ename, deptno Deptno, rowid Rid,
ROW_NUMBER()
OVER (
PARTITION BY ename
ORDER BY rowid
) RNo
FROM emp
/
Ename Deptno
Rid
RNo
---------- ---------- ------------------ ----------
ADAMS
20 AAABrzAAEAAAAASAAK 1
ADAMS
20 AAABrzAAEAAAAASAAY 2
ALLEN
30 AAABrzAAEAAAAASAAB 1
ALLEN
30 AAABrzAAEAAAAASAAP 2
BLAKE
30 AAABrzAAEAAAAASAAF 1
BLAKE
30 AAABrzAAEAAAAASAAT 2
CLARK
10 AAABrzAAEAAAAASAAG 1
CLARK
10 AAABrzAAEAAAAASAAU 2
FORD
20 AAABrzAAEAAAAASAAM 1
FORD
20 AAABrzAAEAAAAASAAa 2
JAMES
30 AAABrzAAEAAAAASAAL 1
JAMES
30 AAABrzAAEAAAAASAAZ 2
JONES
20 AAABrzAAEAAAAASAAD 1
JONES
20 AAABrzAAEAAAAASAAR 2
KING
10 AAABrzAAEAAAAASAAI 1
KING
10 AAABrzAAEAAAAASAAW 2
MARTIN 30
AAABrzAAEAAAAASAAE 1
MARTIN 30
AAABrzAAEAAAAASAAS 2
MILLER 10
AAABrzAAEAAAAASAAN 1
MILLER 10
AAABrzAAEAAAAASAAb 2
SCOTT
20 AAABrzAAEAAAAASAAH 1
SCOTT
20 AAABrzAAEAAAAASAAV 2
SMITH
20 AAABrzAAEAAAAASAAA 1
SMITH
20 AAABrzAAEAAAAASAAO 2
TURNER 30
AAABrzAAEAAAAASAAJ 1
TURNER 30
AAABrzAAEAAAAASAAX 2
WARD
30 AAABrzAAEAAAAASAAC 1
WARD
30 AAABrzAAEAAAAASAAQ
2
Now show only the Rows which must be deleted, these are all Rows
with RNo <> 1.
SELECT Ename, Deptno, Rid, RNo
FROM
(SELECT ename Ename, deptno Deptno, rowid Rid,
ROW_NUMBER()
OVER (
PARTITION BY ename
ORDER BY rowid
) RNo
FROM emp
)
WHERE RNo <> 1
/
ENAME DEPTNO RID RNO
---------- ---------- ------------------ ----------
ADAMS 20 AAABrzAAEAAAAASAAY 2
ALLEN 30 AAABrzAAEAAAAASAAP 2
BLAKE 30 AAABrzAAEAAAAASAAT 2
CLARK 10 AAABrzAAEAAAAASAAU 2
FORD 20 AAABrzAAEAAAAASAAa 2
JAMES 30 AAABrzAAEAAAAASAAZ 2
JONES 20 AAABrzAAEAAAAASAAR 2
KING 10 AAABrzAAEAAAAASAAW 2
MARTIN 30 AAABrzAAEAAAAASAAS 2
MILLER 10 AAABrzAAEAAAAASAAb 2
SCOTT 20 AAABrzAAEAAAAASAAV 2
SMITH 20 AAABrzAAEAAAAASAAO 2
TURNER 30 AAABrzAAEAAAAASAAX 2
WARD 30 AAABrzAAEAAAAASAAQ 2
Now these Rows can easily be deleted:
DELETE FROM emp
WHERE rowid IN
(SELECT Rid
FROM
(SELECT rowid Rid,
ROW_NUMBER()
OVER (
PARTITION
BY ename
ORDER BY
rowid
) RNo
FROM emp
)
where RNo <> 1
);
14 rows deleted.
Conclusion
Now, we go back to the huge Table: t
DELETE FROM t NOLOGGING
WHERE rowid IN
(SELECT Rid
FROM
(SELECT rowid Rid,
ROW_NUMBER()
OVER (
PARTITION
BY object_name
ORDER BY
rowid
) RNo
FROM t
)
where RNo <> 1
);
1692007 rows deleted.
Elapsed: 00:08:57.15
Execution Plan
----------------------------------------------------------
0 DELETE STATEMENT Optimizer=CHOOSE
1 0 DELETE OF 'T'
2 1 NESTED LOOPS
3 2 VIEW OF
'VW_NSO_1'
4 3
SORT (UNIQUE)
5
4 VIEW
6
5 WINDOW
(SORT)
7
6
TABLE ACCESS (FULL) OF 'T'
8 2 TABLE ACCESS
(BY USER ROWID) OF 'T'
The huge Deletion was complete after 9 Minutes instead of hours -
Great!
select count(*), count(distinct OBJECT_NAME) from t;
COUNT(*) COUNT(DISTINCT OBJECT_NAME)
---------- --------------------------
5017
5017
No more duplicate Rows on OBJECT_NAME.
|