Archive

Archive for the ‘SSIS’ Category

Is 0.5 = 0.5 in SSIS?

April 29th, 2010
Comments Off

Today I had an issue which was causing some peculiar results in my ETL. After approximately 1 hour of trying to find the exact problem, I managed to narrow it down to a Sort task. The Data Viewer which I put immediately before that task showed me:

ColumnA              ColumnB
BBHI                     0.5
LLHR                     0.5

The Sort did:

  1. Sort by ColumnB in Descending order
  2. Sort by ColumnA in Ascending order

I was surprised to see that the output was:

ColumnA              ColumnB
LLHR                     0.5
BBHI                     0.5

After trying to reverse the sort order on the two columns (first ColumnA and then ColumnB), I noticed that ColumnB changes the order of the two rows. The type of that column was Double Precision Float in SSIS and float in SQL Server. Since SQL Server also showed 0.5 for each column, I did not expect the reason for my wrong results to be the floating point number. And I was wrong. Clearly the 0.5 was not exactly 0.5 in one of the two columns. Luckily, I got advised by my client that if I multiply the number by 10,000,000,000 and then round to whole number, I would not have issues with losing any information. My fix was simple – a derived column – ColumnB_Int which did exactly that:

 (DT_I8)Round((ColumnB*10000000000.0),0).

Then I used that column in the sort instead of the original one. This did the trick.

SSIS , , ,

 

Custom Dates for an SSIS SCD Task

April 12th, 2010
Comments Off

Just last weekend I implemented a number of Slowly Changing Dimensions in a SQL Server 2005 based project. For the large ones I wrote some SQL code, but for the smaller dimensions, I just decided to use the SSIS SCD task. Since the wizard does most of the work, there is not much beyond it I have done in the past with that component. This time, though we decided to have custom default EffectiveTo dates for the dimensions – 9999-12-31 instead of the default for the SCD task NULL. The wizard, however is not customisable and some manual teaks need to be done before it can handle custom dates. So, I decided to share these since there is not much around on this topic (or at least I could not find any particular references). There is a customisable component on Codeplex – Kimball Method SCD Component, however I could not use it as no custom tools could be used for an unknown reason.

I created a quick mock up of a dimension table for demonstration purposes:

 

Then, I created a SCD task in SSIS with one historical and one changing attribute. For Start and End dates I used my EffectiveFromDateId and EffectiveToDateId and got them populated with [System::StartTime]. Unfortunately, the SSIS task does not allow specifying custom values for the default To date, and uses NULL by default. To change it, we have to modify the following three dataflow components:

We can modify these through the Advanced Editor (right-click). For the actual Slowly Changing Dimension task we have make the following change:

Then we also have to modify the SQL script  for the two OLE DB commands (again through the advanced editor). For the Changing Attributes Updates Output:

And a similar change to the Historical Attributes Inserts Output:

After applying these three changes, we are ready to run the task:

As we have three new rows, they get inserted in the target dimension table. As they are all active, their EffectiveToDates are the default values of 9999-12-31:

Of course, if we decide to change anything through the SSIS SCD wizard, all of these will be lost and we have to redo these changes once again…

SSIS , ,

 

Inferred Members Implementation Best Practise

August 30th, 2009

Just last week a colleague of mine and I did a very thorough research on the best implementation practises in regards to Inferred Members. We went through a few web sites and we also had a look at previous implementations through SSIS. There are quite a few resources on the subject and the best two methods we found were based on Project REAL and a SQL CAT recommendation. So, in the end we agreed that we can recommend three approaches, which have their advantages and disadvantages and could be all successfully used when we need to have Inferred Member (late arriving dimensions = early arriving facts) support in a solution.

Our findings can be summarised in the following table:

Method Pros Cons
Stored procedures No need to use SSIS SlowComplicated developmentNeed to link source and target data server instances.
Script Component (SSIS) FastHigh reusability .NET skills required
Double Lookup (SSIS) FastHigh reusability Performance issues with multiple dimensions

I will go through each of these methods and provide some more details about them.

1. Stored Procedures

If SSIS data flow tasks are not an option for our ETL we can implement inferred member support through SQL code. The basic idea behind this method is:

  1. Find new dimension members in the fact data and add them to the dimension
  2. Move the fact data from source to target (i.e. staging to datamart) with surrogate keys from the dimension table

If we have our source and target data on separate instances of SQL Server it becomes quite inconvenient to utilise this approach. We need to link the two instances to use tables from both tables in the same stored procedure. This is a major issue and it is easily avoided by using SSIS.

2. SQL Server Integration Services

A general design of SSIS solutions can be represented with the following diagram:

Inferred Members Small

We can implement this in the following ways:

2.1. Script Component (Project REAL)

The Data Flow task for this approach is:

image

After looking up missing states, we pass them to a Script Component, which hashes the misses and hits the database only when a genuinely new miss occurs. We have a stored procedure in place which simply adds the new dim member in the dimension table in the database and returns the new surrogate key, which then gets sent to the fact table.

Using .NET we can efficiently hash all values in a object collection and we can also handle both character and numerical data.

Because we are using a fully cached lookup, this data flow item is case-sensitive, therefore we should make sure we equalise the case in both fact and reference table data before we compare the two. We should also make sure that in this case the Script Component is case-insensitive, because if it is we will end up with multiple different rows in our dimension table for each case variation of our inferred members.

Additionally, the Script Component task should be built either accepting a parameter for the dimension name, or it can read its name (in our implementation) and find the first word in it to determine the dimension it is used for. In the above diagram, the Script Component task is handling the State dimension, therefore its name starts with State – . This makes the implementation of multiple Script Components for multiple dimensions very easy – all we need to do is change its name and it just works. There can also be some .NET code for auditing purposes. This is also fairly easy to implement and an entry level of .NET should be sufficient for development and maintenance. A sample of this approach can be found in Project REAL and it is thoroughly discussed in the SSIS paper produced with it.

2.2. Double Lookup (SQL CAT)

image

Here, we are doing the same as before – matching fact rows against a State dimension. If a mismatch is found in the first (fully cached) lookup, we pass the row to the second one. The New StateSK is a partially cached lookup. Its purpose is similar to a hash table – it caches new rows, and when there is a brand new mismatch it adds it to the database by executing a stored procedure. Then we Union All our inferred members with the rest of the fact data.

Because the second lookup task is utilising partial lookup, it is case-insensitive and case variations of the same character term will not lead to multiple dimension rows for the same member. However the first fully-cached lookup is case-sensitive, so we should make sure that both source and reference data is in the same case because that would be more efficient. Furthermore, In SQL Server 2005 partial caching must have a specified memory limit and if we have a large number of late arriving dimension members, we may run out of memory for them. In that case SQL Server 2005 will start discarding the least used values from its cache, which may have performance implications. The latter problem is overcome in SQL Server 2008.

Another problem with this approach is the use of Union All tasks. These are semi-blocking and may impact performance when used multiple times in our ETL package.

For much more detailed description, including samples you can go to:

SQL CAT – Assigning surrogate keys to early arriving facts using Integration Services

From our testing and research we reached the conclusion that using Script Component is the best approach, closely followed by Double Lookups. The stored procedure approach is slow, hard to maintain and may be impossible to implement in a production environment. Using SSIS with .NET proves to be efficient, convenient and fast.If avoiding .NET is preferable, handling inferred members in SSIS is a very good alternative.

SSIS ,

 

Switch to our mobile site