2008-11-21

Ordering Dimensions: Recursion vs Loops in T-SQL

Recently I had to build a function, which returned dimension members from a parent-child relational table. I needed to be able to order the members hierarchically and I decided to use a recursive CTE call.

Usually, we have a parent-child dimension table with structure similar to this:

Dim_Business

Business_Skey int
Parent_Business_Skey int

with a root node with Skey of -3 and Parent_Business_Skey of -3.

Let's assume we have a hierarchy like this:

All (Id = -3, P_ID = -3)
-GLOBAL (Id = 1, P_ID = -3)
--Europe (Id = 2, P_ID = 1)
---UK (Id = 3, P_ID = 2)
---France (Id = 4, P_ID = 2)
---Spain (Id = 5, P_ID = 2)
--North America (Id = 6, P_ID = 1)
---USA (Id = 7, P_ID = 6)
---Canada (Id = 8, P_ID = 6)
---Mexico (Id = 9, P_ID = 6)


If we do something like:

SELECT Business_Skey
FROM Dim_Business
ORDER BY Parent_Business_Skey ASC, Business_Skey ASC

We will get:

-3 (All)
1 (GLOBAL)
2 (Europe)
6 (North America)
3 (UK)
4 (France)
5 (Spain)
7 (USA)
8 (Canada)
9 (Mexico)

Obviously, this hierarchy is incorrect, because we want to see the leaf nodes under their respective parents.

We can recursively create order, which concatenates the parent ids:

WITH b_ord(bid, bord)
AS
(
SELECT  Business_Skey AS bid
, CONVERT(nvarchar(1000), Business_Skey) AS bord
FROM Dim_Business
WHERE Business_Skey = -3

UNION ALL

SELECT  Business_Skey AS bid
, CONVERT(nvarchar(1000), bord + '|' + CONVERT(nvarchar, Business_Skey))
FROM Dim_Business db
INNER JOIN b_ord bo
ON db.Parent_Business_Skey = bo.bid
WHERE db.Business_Skey <> bo.bid
)
SELECT *
FROM b_ord
ORDER BY bord ASC

The result of the cte query is:
-3 -3 (All)
1 -3|1 (GLOBAL)
2 -3|1|2 (Europe)
3 -3|1|2|3 (UK)
4 -3|1|2|4 (France)
5 -3|1|2|5 (Spain)
6 -3|1|6 (North America)
7 -3|1|6|7 (USA)
8 -3|1|6|8 (Canada)
9 -3|1|6|9 (Mexico)

and the order is correct.


Because the code needed to go in a function, invoked by a number of stored procedures, .NET application and various reports, I needed the code to be quick and 

light. As some dimensions had a large number of members (50000+), which could grow with time, the code needed to implemented in a careful way. So, I decided 

to compare the recursive CTE function to a WHILE loop and a temporary table implementation:

DECLARE @c int
DECLARE @num_of_nodes int

SET @num_of_nodes = (SELECT  COUNT(*) FROM Dim_Business)

CREATE TABLE #order(
 skey int
, ord nvarchar(1000)
, lvl int
)

INSERT INTO #order
SELECT  Business_Skey
, CONVERT(nvarchar(1000), Business_Skey)
, 1
FROM Dim_Business
WHERE Business_Skey = -3

SET @c = 2

WHILE @c > 0
BEGIN
INSERT INTO #order
SELECT  Business_Skey
, CONVERT(nvarchar(1000), ord + '|' + CONVERT(nvarchar, Business_Skey))
, @c
FROM Dim_Business db
INNER JOIN #order o
ON db.Parent_Business_Skey = o.skey
AND o.lvl = @c - 1
WHERE db.Business_Skey <> o.skey

SET @c = @c + 1
IF (SELECT COUNT(*) FROM #order) = @num_of_nodes
SET @c = 0
END

SELECT  skey AS bid
, ord AS bord
FROM #order
ORDER BY ord ASC

DROP TABLE #order

After comparing the results in Client Statistics and the IO reads, the 1st recursive query performs more than 20% worse than the WHILE loop query. It also trails the non-recursive query in the count of logical reads,read-ahead reads and scan counts.

It seems like in SQL Server 2005 calling WITH recursively does not work as good as coding set-based operations through WHILE loops.

0 comments: