The Jeff Plumb Blog

Wednesday, April 26, 2006

Practical Examples of Ranking Analytic Functions

Introduction
Many people have heard about analytic functions which have been around since Oracle 8.1 but have never used them. This article demonstrates practical uses of some of the ranking analytic functions and explains how they work.

A Table Tennis Tournament System
Being a keen Table Tennis player, I was recently wondering how to implement a tournament system in Oracle. Basically I wanted to create a handicap tournament system, where each player was given a rating. For each match in the tournament, the player’s rating would be used to determine the starting score for each player. Tournament entries would be taken in advance and when the entries closed, all players would be split into groups based on the ratings. All players with similar ratings would be placed in the same group.

Test Data
Listing 1 contains the code to set up the tables and create some test data. It creates a table JP_PLAYER that holds player information and creates 2000 test players. The script also creates a table JP_TOURNAMENT_ENTRY that holds an entry of a player into a tournament. This table is populated with entries into 2 distinct tournaments. Two extra tables are then created, JP_TOURNAMENT_GROUP and JP_TOURNAMENT_GROUP2. These tables are used to hold the group each player is assigned to for the tournament. The first table will be populated using analytic functions and the second will be populated using PL/SQL as a comparison.

Listing 1
create table jp_player
(player_id number primary key
,rating number(4) not null
);

create table jp_tournament_entry
(tournament_id number
,player_id number
,constraint jp_tournament_entry_pk
primary key (tournament_id, player_id)
);

insert into jp_player
(player_id, rating)
select rownum as player_id, mod(rownum, 200) as rating
from all_objects
where rownum <= 2000;

insert into jp_tournament_entry
(tournament_id, player_id)
select 1, player_id
from jp_player
where mod(player_id, 99) = 1;

insert into jp_tournament_entry
(tournament_id, player_id)
select 2, player_id
from jp_player
where mod(player_id, 101) = 1;

create table jp_tournament_group
(tournament_id number
,player_id number
,group_id number(2)
,constraint jp_tournament_group
primary key (tournament_id, player_id) );

create table jp_tournament_group2
(tournament_id number
,player_id number
,group_id number(2)
,constraint jp_tournament_group2
primary key (tournament_id, player_id) );
NTILE Function
NTILE (expr) OVER ([query_partition_clause] order_by_clause) 
The NTILE function immediately came to mind to split the entries of each tournament into groups. It is a member of the ranking analytic functions and allows data to be split into a specified number of groups. For example, if there are 20 rows to be split into 4 groups, then each group will have 5 rows. If the number of rows does not divide evenly by the number of groups, then the number of rows in each group will only differ by at most 1. For example, if there are 21 rows to be split into 4 groups, then 3 groups will have 5 rows and 1 group will have 6 rows.

Query Partition Clause

The query partition clause of an analytic function allows the data to be partitioned. For our example of tournament entries, the data can be partitioned by TOURNAMENT_ID. This allows each tournament to be grouped separately so that each tournament will be split into 4 groups rather than the whole data set being split. So if we have two tournaments as is the case with our test data, we will end up with 4 groups for each tournament which gives us a total of 8 groups. Data can be partitioned on any columns and into either a single partition or as many partitions as the data dictates.

Listing 2 shows the SQL used to insert data from the tournament entries into the JP_TOURNAMENT_GROUP table using the NTILE analytic function. It partitions the data by tournament id and then orders the data by the player rating and then the player id to split it into 4 groups. To ensure reproducible results, it is important that the order by clause is on a unique key. The reason for this is that equal values can be split across groups. This is why the query orders by rating and then player id.

At first the syntax of these queries seemed quite unnatural to me and I believe this puts off many people from learning how to use them. However once I played around with these functions for only a few minutes, I began to understand what each element of the function represented and just how powerful these analytic functions are. So if you have not yet used these functions due to the syntax seeming quite foreign and not fully understanding how they work, then copy down the examples in this article and have a play around with them on your own Oracle database. I guarantee that you’ll be loving analytic functions and finding a use for them in your own code all the time.

Listing 2
insert into jp_tournament_group
select te.tournament_id, p.player_id
, ntile(4) over
(partition by te.tournament_id
order by p.rating, p.player_id)
from jp_tournament_entry te
join jp_player p on p.player_id = te.player_id;
Life Before Analytic Functions

The query in Listing 2 is very elegant. It is short, simple and easy to read. I wondered how easy it would be to achieve this functionality without using analytic functions. The code I came up with is in Listing 3. I find this code significantly harder to understand compared with the analytic version. And not only that, the analytic function version will perform better as well. The procedural code needs to process each tournament individually to break up the players into groups.

Listing 3
declare
cnt number;
groups number := 4;
groups_with_extra number;
players_per_group number;
current_group number;
current_cnt number;
begin
for x in (select tournament_id, count(*) cnt
from jp_tournament_entry
group by tournament_id)
loop
current_group := 1;
current_cnt := 1;
players_per_group := floor(x.cnt / groups);
groups_with_extra := mod(x.cnt, groups);
for y in (select te.tournament_id, p.player_id, p.rating
from jp_tournament_entry te
join jp_player p on p.player_id = te.player_id
where te.tournament_id = x.tournament_id
order by te.tournament_id, p.rating, p.player_id)
loop
insert into jp_tournament_group2
values (y.tournament_id, y.player_id, current_group);
current_cnt := current_cnt + 1;
if (current_group <= groups_with_extra
and current_cnt > players_per_group + 1)
or (current_group > groups_with_extra
and current_cnt > players_per_group)
then
current_cnt := 1;
current_group := current_group + 1;
end if;
end loop;
end loop;
end;
/
ROW_NUMBER Function
ROW_NUMBER ( ) OVER ( [query_partition_clause] order_by_clause )
The NTILE function worked beautifully to assign players into groups. Once the matches are actually played, I would like to take the top 2 players from each group as finalists to play a knockout competition to find the overall winner of the tournament. The ROW_NUMBER function will come in perfectly for this as it allows a unique sequence number to be assigned to each row. Again we will use the partition clause to break our data up for each group in each tournament. Once unique numbers have been assigned by the ROW_NUMBER function, the top 2 rows can be selected and we have our finalists. This function is also a member of the ranking group of analytic functions.

Listing 4 shows the code to create some test data. A new table JP_PLAYER_STATS is created to hold the number of wins and number of points scored for each player in a group. Then the table is populated with some test data from the JP_TOURNAMENT_GROUP table which is populated in Listing 2. Listing 5 shows the code to select the finalists. Again to ensure deterministic results, the order by clause is made unique by adding the player id. The data is partitioned by tournament id and group id, to get a ranking for each group. Then within each partition the data is ordered by the number of wins and then the number of points and finally by the player id to ensure reproducible results. Note that the default sort order is ascending but in our case we are looking for the top ranked player to have the highest number of wins and points, so the keyword DESC is added to both the WINS and POINTS columns. The outer query then selects the top 2 positions for each group. This technique can be used to produce any TOP N type report. Some other ranking functions worth looking at for this TOP N type reports are RANK and DENSE_RANK.

RANK and DENSE_RANK Functions

RANK ( ) OVER ( [query_partition_clause] order_by_clause )
DENSE_RANK ( ) OVER ( [query_partition_clause] order_by_clause )
These two functions are very similar to the ROW_NUMBER function and are both members of the ranking group of analytic functions. Whereas ROW_NUMBER generates a unique sequence number for each row, the RANK and DENSE_RANK functions generate a ranking based on the order by clause within each partition. If two rows tie, then they will be assigned the same number as opposed to a unique number. The difference between the two functions is that DENSE_RANK will not leave any gaps in between rows. For example if we are ranking 10 employees by salary and the fifth and sixth place salaries are equal, then RANK and DENSE_RANK would assign them both the number 5, but RANK would assign the next highest salary the number 7 and DENSE_RANK would assign the number 6.

Listing 4
create table jp_player_stats
(tournament_id number
,group_id number
,player_id number
,wins number(2)
,points number(3)
,constraint jp_player_stats_pk
primary key (tournament_id, player_id)
);

insert into jp_player_stats
(tournament_id, group_id, player_id, wins, points)
select tournament_id, group_id, player_id
, mod(rownum, 3) as wins, mod(player_id, 80) as points
from jp_tournament_group;
Listing 5
select *
from
(
select tournament_id, group_id, player_id, wins, points
, row_number() over
(partition by tournament_id, group_id
order by wins desc, points desc, player_id) as pos
from jp_player_stats
)
where pos <=2 order by tournament_id, group_id, pos;
Extra Information

Processing Order

It is important to know the processing order for SQL statements that include analytic functions. It happens in the following order:
  1. All joins, WHERE, GROUP BY, and HAVING clauses are performed
  2. The analytic functions are performed on the result set from step 1
  3. The ORDER BY clause is performed if provided.
This means that any rows excluded by the WHERE clause will not be part of the analytic function. Also if you use GROUP BY, this is applied first and so the aggregates calculated are available to the analytic function.

Other Groups of Analytic Functions

So far we have looked at only the ranking group of analytic functions. However there is much more to analytic functions. Other groups of analytic functions include windowing functions. Windowing functions can operate on more than just the current row without the need to perform a self join. This can be great for improving performance as only one scan of the data is required. You can calculate cumulative totals or moving averages. These functions are very powerful. There are also reporting aggregate functions, lag/lead functions, first/last functions, linear regression functions, inverse percentile functions, hypothetical rank and distribution functions.

Summary

I find many people have not used analytic functions due to the fact that the syntax seems complex and difficult to understand. But in reality, the different elements of the functions are fairly intuitive once you understand how they work. So take a few minutes to run your own queries and become comfortable with their syntax. Once you do, you’ll continuously be on the lookout for opportunities to use analytic functions in your code. To find out more about analytic functions, read the Chapter “SQL for Analysis and Reporting” of the Oracle® Database Data Warehousing Guide 10g Release 2 (10.2) Manual. Although analytic functions are described in the Data Warehouse Guide, they can be extremely useful in transactional systems as well.

0 Comments:

Post a Comment

<< Home