The Jeff Plumb Blog

Wednesday, April 26, 2006

Practical Examples of Ranking Analytic Functions

Introduction
Many people have heard about analytic functions which have been around since Oracle 8.1 but have never used them. This article demonstrates practical uses of some of the ranking analytic functions and explains how they work.

A Table Tennis Tournament System
Being a keen Table Tennis player, I was recently wondering how to implement a tournament system in Oracle. Basically I wanted to create a handicap tournament system, where each player was given a rating. For each match in the tournament, the player’s rating would be used to determine the starting score for each player. Tournament entries would be taken in advance and when the entries closed, all players would be split into groups based on the ratings. All players with similar ratings would be placed in the same group.

Test Data
Listing 1 contains the code to set up the tables and create some test data. It creates a table JP_PLAYER that holds player information and creates 2000 test players. The script also creates a table JP_TOURNAMENT_ENTRY that holds an entry of a player into a tournament. This table is populated with entries into 2 distinct tournaments. Two extra tables are then created, JP_TOURNAMENT_GROUP and JP_TOURNAMENT_GROUP2. These tables are used to hold the group each player is assigned to for the tournament. The first table will be populated using analytic functions and the second will be populated using PL/SQL as a comparison.

Listing 1
create table jp_player
(player_id number primary key
,rating number(4) not null
);

create table jp_tournament_entry
(tournament_id number
,player_id number
,constraint jp_tournament_entry_pk
primary key (tournament_id, player_id)
);

insert into jp_player
(player_id, rating)
select rownum as player_id, mod(rownum, 200) as rating
from all_objects
where rownum <= 2000;

insert into jp_tournament_entry
(tournament_id, player_id)
select 1, player_id
from jp_player
where mod(player_id, 99) = 1;

insert into jp_tournament_entry
(tournament_id, player_id)
select 2, player_id
from jp_player
where mod(player_id, 101) = 1;

create table jp_tournament_group
(tournament_id number
,player_id number
,group_id number(2)
,constraint jp_tournament_group
primary key (tournament_id, player_id) );

create table jp_tournament_group2
(tournament_id number
,player_id number
,group_id number(2)
,constraint jp_tournament_group2
primary key (tournament_id, player_id) );
NTILE Function
NTILE (expr) OVER ([query_partition_clause] order_by_clause) 
The NTILE function immediately came to mind to split the entries of each tournament into groups. It is a member of the ranking analytic functions and allows data to be split into a specified number of groups. For example, if there are 20 rows to be split into 4 groups, then each group will have 5 rows. If the number of rows does not divide evenly by the number of groups, then the number of rows in each group will only differ by at most 1. For example, if there are 21 rows to be split into 4 groups, then 3 groups will have 5 rows and 1 group will have 6 rows.

Query Partition Clause

The query partition clause of an analytic function allows the data to be partitioned. For our example of tournament entries, the data can be partitioned by TOURNAMENT_ID. This allows each tournament to be grouped separately so that each tournament will be split into 4 groups rather than the whole data set being split. So if we have two tournaments as is the case with our test data, we will end up with 4 groups for each tournament which gives us a total of 8 groups. Data can be partitioned on any columns and into either a single partition or as many partitions as the data dictates.

Listing 2 shows the SQL used to insert data from the tournament entries into the JP_TOURNAMENT_GROUP table using the NTILE analytic function. It partitions the data by tournament id and then orders the data by the player rating and then the player id to split it into 4 groups. To ensure reproducible results, it is important that the order by clause is on a unique key. The reason for this is that equal values can be split across groups. This is why the query orders by rating and then player id.

At first the syntax of these queries seemed quite unnatural to me and I believe this puts off many people from learning how to use them. However once I played around with these functions for only a few minutes, I began to understand what each element of the function represented and just how powerful these analytic functions are. So if you have not yet used these functions due to the syntax seeming quite foreign and not fully understanding how they work, then copy down the examples in this article and have a play around with them on your own Oracle database. I guarantee that you’ll be loving analytic functions and finding a use for them in your own code all the time.

Listing 2
insert into jp_tournament_group
select te.tournament_id, p.player_id
, ntile(4) over
(partition by te.tournament_id
order by p.rating, p.player_id)
from jp_tournament_entry te
join jp_player p on p.player_id = te.player_id;
Life Before Analytic Functions

The query in Listing 2 is very elegant. It is short, simple and easy to read. I wondered how easy it would be to achieve this functionality without using analytic functions. The code I came up with is in Listing 3. I find this code significantly harder to understand compared with the analytic version. And not only that, the analytic function version will perform better as well. The procedural code needs to process each tournament individually to break up the players into groups.

Listing 3
declare
cnt number;
groups number := 4;
groups_with_extra number;
players_per_group number;
current_group number;
current_cnt number;
begin
for x in (select tournament_id, count(*) cnt
from jp_tournament_entry
group by tournament_id)
loop
current_group := 1;
current_cnt := 1;
players_per_group := floor(x.cnt / groups);
groups_with_extra := mod(x.cnt, groups);
for y in (select te.tournament_id, p.player_id, p.rating
from jp_tournament_entry te
join jp_player p on p.player_id = te.player_id
where te.tournament_id = x.tournament_id
order by te.tournament_id, p.rating, p.player_id)
loop
insert into jp_tournament_group2
values (y.tournament_id, y.player_id, current_group);
current_cnt := current_cnt + 1;
if (current_group <= groups_with_extra
and current_cnt > players_per_group + 1)
or (current_group > groups_with_extra
and current_cnt > players_per_group)
then
current_cnt := 1;
current_group := current_group + 1;
end if;
end loop;
end loop;
end;
/
ROW_NUMBER Function
ROW_NUMBER ( ) OVER ( [query_partition_clause] order_by_clause )
The NTILE function worked beautifully to assign players into groups. Once the matches are actually played, I would like to take the top 2 players from each group as finalists to play a knockout competition to find the overall winner of the tournament. The ROW_NUMBER function will come in perfectly for this as it allows a unique sequence number to be assigned to each row. Again we will use the partition clause to break our data up for each group in each tournament. Once unique numbers have been assigned by the ROW_NUMBER function, the top 2 rows can be selected and we have our finalists. This function is also a member of the ranking group of analytic functions.

Listing 4 shows the code to create some test data. A new table JP_PLAYER_STATS is created to hold the number of wins and number of points scored for each player in a group. Then the table is populated with some test data from the JP_TOURNAMENT_GROUP table which is populated in Listing 2. Listing 5 shows the code to select the finalists. Again to ensure deterministic results, the order by clause is made unique by adding the player id. The data is partitioned by tournament id and group id, to get a ranking for each group. Then within each partition the data is ordered by the number of wins and then the number of points and finally by the player id to ensure reproducible results. Note that the default sort order is ascending but in our case we are looking for the top ranked player to have the highest number of wins and points, so the keyword DESC is added to both the WINS and POINTS columns. The outer query then selects the top 2 positions for each group. This technique can be used to produce any TOP N type report. Some other ranking functions worth looking at for this TOP N type reports are RANK and DENSE_RANK.

RANK and DENSE_RANK Functions

RANK ( ) OVER ( [query_partition_clause] order_by_clause )
DENSE_RANK ( ) OVER ( [query_partition_clause] order_by_clause )
These two functions are very similar to the ROW_NUMBER function and are both members of the ranking group of analytic functions. Whereas ROW_NUMBER generates a unique sequence number for each row, the RANK and DENSE_RANK functions generate a ranking based on the order by clause within each partition. If two rows tie, then they will be assigned the same number as opposed to a unique number. The difference between the two functions is that DENSE_RANK will not leave any gaps in between rows. For example if we are ranking 10 employees by salary and the fifth and sixth place salaries are equal, then RANK and DENSE_RANK would assign them both the number 5, but RANK would assign the next highest salary the number 7 and DENSE_RANK would assign the number 6.

Listing 4
create table jp_player_stats
(tournament_id number
,group_id number
,player_id number
,wins number(2)
,points number(3)
,constraint jp_player_stats_pk
primary key (tournament_id, player_id)
);

insert into jp_player_stats
(tournament_id, group_id, player_id, wins, points)
select tournament_id, group_id, player_id
, mod(rownum, 3) as wins, mod(player_id, 80) as points
from jp_tournament_group;
Listing 5
select *
from
(
select tournament_id, group_id, player_id, wins, points
, row_number() over
(partition by tournament_id, group_id
order by wins desc, points desc, player_id) as pos
from jp_player_stats
)
where pos <=2 order by tournament_id, group_id, pos;
Extra Information

Processing Order

It is important to know the processing order for SQL statements that include analytic functions. It happens in the following order:
  1. All joins, WHERE, GROUP BY, and HAVING clauses are performed
  2. The analytic functions are performed on the result set from step 1
  3. The ORDER BY clause is performed if provided.
This means that any rows excluded by the WHERE clause will not be part of the analytic function. Also if you use GROUP BY, this is applied first and so the aggregates calculated are available to the analytic function.

Other Groups of Analytic Functions

So far we have looked at only the ranking group of analytic functions. However there is much more to analytic functions. Other groups of analytic functions include windowing functions. Windowing functions can operate on more than just the current row without the need to perform a self join. This can be great for improving performance as only one scan of the data is required. You can calculate cumulative totals or moving averages. These functions are very powerful. There are also reporting aggregate functions, lag/lead functions, first/last functions, linear regression functions, inverse percentile functions, hypothetical rank and distribution functions.

Summary

I find many people have not used analytic functions due to the fact that the syntax seems complex and difficult to understand. But in reality, the different elements of the functions are fairly intuitive once you understand how they work. So take a few minutes to run your own queries and become comfortable with their syntax. Once you do, you’ll continuously be on the lookout for opportunities to use analytic functions in your code. To find out more about analytic functions, read the Chapter “SQL for Analysis and Reporting” of the Oracle® Database Data Warehousing Guide 10g Release 2 (10.2) Manual. Although analytic functions are described in the Data Warehouse Guide, they can be extremely useful in transactional systems as well.

Saturday, April 22, 2006

Is Spelling Important?

I cdnuolt blveiee taht I cluod aulaclty uesdnatnrd waht I was rdanieg. The phaonmneal pweor of the hmuan mnid! Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoatnt tihng is taht the frist and lsat ltteer be in the rghit pclae. The rset can be a taotl mses and you can sitll raed it wouthit a porbelm. Tihs is bcusea the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe. Amzanig huh? Yaeh and I awlyas thguoht slpeling was ipmorantt!

Wednesday, April 19, 2006

Magical Time

Did you know that at 2 minutes and 3 seconds past 1 on May 4th this year the time will be:

01:02:03 04/05/06 ?

Monday, April 17, 2006

Falls Creek Eggstraordinary Egg Toss

For Easter, I took the family on a holiday up to Falls Creek. We met up with our Friends Charles and Bridget from Canberra. The highlight of the trip was the "Eggstraordinary Egg Toss". The idea is to throw a raw egg to your partner and let him catch it without breaking it. Then he throws it back to you. This continues with the distance increases between each throw until there is only one pair with an unbroken egg. Now to make things interesting the winners get a fully paid holiday to Falls Creek including accommodation, ski lessons, ski passes and clothes hire. Not a bad prize at all.

So Charles and I were persuaded by our wives to enter. So we paid the $10 entry fee and received our garbage bags for protection from the flying yolk and lined up to compete. All the under 15's competed first until there was only one pair left. And then it was our turn to compete with all of the oldies (over 15). I was feeling a little nervous but Charles and his steady soft hands managed to see us as the only pair remaining with an unbroken egg after 5 throws, with the last one being flung around 25 metres by Charles right into my safe hands.


So this meant that it was down to me and Charles versus the under 15 twins (who as we found out later, were runners up last year) for the holiday. Now the judge of the competition thought that it was fair to impose a handicap on myself and Charles to even things up. So after each throw, the young guys would have to move back 2 metres and we would have to go back 4 metres. Now as Charles and I are both over 30 and well past our former sporting best, and these young athletes were right in their prime, I think that the handicap should have worked the other way. But none the less, Charles and I pushed on with the competition bravely, without complaint.

For the final, both teams were allowed to select a brand new egg. Charles carefully selected a strong looking egg with no cracks, and the under 15's chose their egg too.


The first few throws went smoothly for both teams but then disaster struck. I hurled the egg 16 metres to Charles who took the catch cleanly only to look down at the egg and notice a hairline crack.


At this stage, I did not know the perilous state of our egg and moved back the 4 metres to attempt the next catch. Charles was left with no option but to give a mighty throw of the damaged egg. He tossed it the 20 metres right into my hands. I took the catch softly but the egg exploded covering my hands in yolk. At the same time, the young athletes took the catch cleanly and the dream was squashed along with egg. The holiday was gone. I did manage to get one last laugh when I shook the hands of the winners covering them in yolk.

However all was not lost, and due to our overwhelming popularity with the crowd, the judge introduced a second place prize - a box of 300 Cadbury Easter Eggs each.

So our wives are now planning a rigorous 12 month training schedule in the hope of winning the holiday next year.

Thursday, April 13, 2006

Dual Booting Madness

Hooray, I finally have a system that Dual boots Windows XP and Linux. I don't really understand how I got it to work but by 12:30am, I decided to just be happy that it was working.

Now I can try and install Oracle Database 10G Release 2. I'll let you know how it goes.

Tuesday, April 11, 2006

Centos 4.2 Troubles

I am still having trouble trying to install CentOS 4.2 onto my PC. I bought a second hard drive so I wouldn't have to worry about partitioning and destroying my Windows installation. However, the installer won't recognize either of the drives when they are both plugged in. I read somewhere that this may be because I was using cable select to determine the primary and slave drives, so I changed them not to use this method but still the disk drives could not be written to.

In a desperate attempt to install the operating system, I unplugged my Windows disk and then tried to run the installer. This time it worked beautifully. So I plugged back in the Windows drive as the primary and configured the NT bootloader to give me a choice to boot up Linux. However now when I choose that option all I get is:

GRUB

If I remove the windows disk, Linux starts up without a problem. So the only other option I can think of is to try and make the Linux disk the primary and somehow configure GRUB to boot up Windows. I don't know if this is possible but I'll let you know how I go.

Friday, April 07, 2006

How to make your wife happy - Install Linux

Last night I tried to install Linux on our home PC. I bought a second hard drive so that I could leave the Windows XP installation untouched. I inserted the Linux CentOS DVD and rebooted the machine. All was going smoothly to start with until after I had chosen my keyboard settings. Then the CentOS installer tried to access the hard drives, but it could not find them. My installation would not go any further. I also had some Fedora Core 3 CD's lying around so I started up that installer to see if the same problem would occur but Fedora could see the drives without any issues.

Now this is where the problems really started to begin. I took out all of the CD's and went to reboot back to Windows so I could get on the internet and search for a resolution to my CentOS installation issue but when the computer restarted all I got was:

GRUB

Somehow I had overwritten the master book record with GRUB. I pulled out my Windows XP CD and rebooted the computer with this in the drive. When it restarted it said "Press any key to boot from the CD...", so I did. From here I followed the prompts to repair the existing Windows installation. I think now that I should not have done anything but without being able to connect to the Internet to find a solution I simply pressed on. So I reinstalled Windows and everything seemed to go OK and I logged back into my Windows system. But my CD was Service pack 1 and it had reset the system back to this level. Thinking my problem was solved, I rebooted only to be greeted with:

GRUB

Uh Oh! By this time my wife had been in bed sleeping for some time and came out to tell me to get to bed as well as I had to work the next day. This is when I confessed that I had a small problem with the computer. Around about 6 months ago, our hard disk crashed and we lost a lot of information that I didn't have backed up so she wasn't very happy with this and still blames me for the disk crashing. So she said that if I had lost some of our photos that I would be in trouble. So now I had no choice but to ensure that I got the photos backed up. I rebooted Windows from the CD and without pressing any key as instructed, my Windows booted up normally. So now I got out some trusty CD's and backed up all of our photos. I suggested that the problem was fixed and whenever we wanted to use the computer we simply had to put in the Windows CD first. However she didn't think this was a suitable solution. So I went to bed to dream about how to repair the master boot record.

The next day at work, I took the chance to search on the Internet to try and find out how to fix the master boot record. I found some sites detailing how to boot from the CD into recovery mode and then issue the fixmbr command. So later that night when I arrived home I tried it. However at first it didn't work and this is when I thought I was in big trouble. The last thing I tried was to install a second version of Windows onto my second hard drive and hope that somehow the installation would fix the problem. So I did this only to find that when I rebooted I had the following message:

GRUB

So just for fun, one last time I tried the fixmbr command. Now when I rebooted everything worked. Yes!!!! So then I downloaded all of the updates to Windows to restore my system to the state it was before hand.

So in summary, I still don't really know what happened. I don't know how I destroyed the master boot record in the first place and I don't really know how I got it fixed. But searching around on the Internet I found that sometimes Linux systems don't like it when the disks are configured with "cable select". So I might set these things manually and then try to install CentOS and see what happens. I'll just make sure that any new photos are backed up first :)

Tuesday, April 04, 2006

Working with dates in Oracle

Recently I had an issue where a stored procedure was tested by a
developer and worked perfectly. When the stored procedure got migrated
to test, a user tried to execute it but it did not work. The problem
took a while to track down but in the end it was because the stored
procedure did not explicitly define the date format when converting a
character string into a date. The developer and the tester had different
settings for nls_date_format and this was causing the problem. Below is
an example of how this can happen.

Create the table

control@DWDEV> create table visit
2 (visit_date date
3 ,visit_place varchar2(20));

Table created.

Insert in some values

control@DWDEV> insert into visit values (to_date('20060403','YYYYMMDD'), 'Zoo');

1 row created.

control@DWDEV> insert into visit values (to_date('20060403','YYYYMMDD'), 'Park');

1 row created.

control@DWDEV> insert into visit values (to_date('20060402','YYYYMMDD'), 'Bank');

1 row created.

Select all records

control@DWDEV> select * from visit;

VISIT_DATE VISIT_PLACE
----------- --------------------
03-APR-2006 Zoo
03-APR-2006 Park
02-APR-2006 Bank

Select using to_date function with format mask

control@DWDEV> select * from visit where visit_date = to_date('03-Apr-2006','DD-Mon-YYYY');

VISIT_DATE VISIT_PLACE
----------- --------------------
03-APR-2006 Zoo
03-APR-2006 Park

Select using to_date without a format mask, lucky it works but....

control@DWDEV> select * from visit where visit_date = to_date('03-Apr-2006');

VISIT_DATE VISIT_PLACE
----------- --------------------
03-APR-2006 Zoo
03-APR-2006 Park

Alter the nls_date_format

control@DWDEV>
control@DWDEV> alter session set nls_date_format = 'YYYYMMDDHH24MISS';

Session altered.

Select using to_date without a format mask, this time we are out of luck

control@DWDEV> select * from visit where visit_date = to_date('03-Apr-2006');
select * from visit where visit_date = to_date('03-Apr-2006')
*
ERROR at line 1:
ORA-01858: a non-numeric character was found where a numeric was
expected

However our original query still works
Select using to_date function with a format mask

control@DWDEV> select * from visit where visit_date = to_date('03-Apr-2006','DD-Mon-YYYY');

VISIT_DATE VISIT_PLACE
-------------- --------------------
20060403000000 Zoo
20060403000000 Park

So the moral of the story is to always explicity specify the format when
converting a character string to a date. Any functions that rely on a
client setting should be avoided because just by having a different
person run the code, it may break.