SSIS Balanced Data Distributor – Comparison

May 21st, 2011

Microsoft just released a new Data Flow transformation for SSIS – Balanced Data Distributor. Reading the installation pages, we can see that there are a few interesting things about it, among which – it is a multithreaded transform, which uniformly splits a data stream in multiple outputs. I decided to give it a quick test and see how it performs in comparison to a straight insert and a Script Component which splits the input stream in two, as well.

First the tests show performance with input of a SQL Server table (OLE DB Input) and output to a raw file. The input is a TOP 20000000 * select from a large table. The results are as follow:

1. Straight insert: 32 seconds

2. Balanced Data Distributor: 36 seconds

3. Script Component: 57 seconds

4. Conditional Split: 42 seconds



Note that the Conditional Split divides the stream based on the remainder of a division of an integer field, while the Script Component does it based on the row number. The Conditional Split may or may not be useful in a number of cases – when we have non-numeric data only, or when the range of the numeric data is not wide enough to split in the number of streams we would like to (e.g. Gender Key can be 1 or 2, while we may want to split in 10 parallel data streams). Therefore, its functionality is not equivalent to the BDD transform (thanks to Vidas Matelis for asking me to include it in the case study).

The second tests show how fast it is with reversed input and output (raw file to SQL Server table) and everything else identical.

  1. Straight insert: 56 seconds
  2. Balanced Data Distributor: 1 minute and 47 seconds
  3. Script Component: 1 minute and 57 seconds
  4. Conditional Split: 1 minute and 54 seconds

Over 40M rows the difference between the BDD transform and the Script Component – 1:16 vs 2:13 (vs 1:24 for the equivalent Conditional Split), or 57 seconds difference when inserting in a raw file. In general, the overall performance improvement seems to be around 35-45% in that direction. Since I am inserting in my local SQL Server in the same table the parallel split does not seem to be beneficial even though the destination is slower than the source. In fact the straight insert outperforms the parallel data flows in both cases. If we were to insert into a partitioned table over different filegroups hosted on separate drives the results could be quite different. Either way, in my opinion it is a nice and welcome addition to our arsenal of SSIS data flow components.

Edit: Len Wyatt from the SQL Performance Team at Microsoft has provided a link to his post with great bit of detail about the BDD transform in the comments below. Please take a minute and have a look if interested.

SSIS , ,

 

Tip: Deployment Options in BIDS

May 19th, 2011
Comments Off

The default options in BIDS do not always make perfect sense and one of the annoying defaults for me is the configuration of deployment options for a SSAS project. If we create a new project and then check its properties we see the following under the deployment tab:

By default the Processing Option is Default (surprise!). We also have two other choices:

In most cases we do not want to process the deployed database after a change has been made but by default BIDS will send a process command to SSAS whenever a process is required to bring the objects which are modified up. If we change the Default to Do Not Process we avoid this and the project gets deployed only.

The relevant MSDN page states:

By default, Analysis Services will determine the type of processing required when changes to objects are deployed. This generally results in the fastest deployment time. However, you can also choose to have either full processing or no processing performed with each deployment.”

I beg to disagree – I do not believe that it results in fastest deployment time. In fact it is just between Do Not Process and Full Process when it comes to deployment time only.

The default can be particularly annoying when we really want to only deploy and we right-click on the project and click Deploy while developing but we almost always get a partial process after a structural change. I have seen developers clicking Process instead, waiting for the deployment to finish (no automatic process if we do this) and then quickly pressing <Esc> to cancel the processing afterwards. Not a major breakthrough, but may optimise a bit the way we work, so I thought it may be a good idea to put this little issue in the spotlight for a moment.

SSAS ,

 

SSAS to BISM – Recent Developments

May 18th, 2011
Comments Off

There was a fair bit of FUD around the future of SSAS and just now it got officially dispelled by both TK Anand’s and Chris Webb’s posts on the roadmap ahead of SSAS. If you haven’t read these I would definitely recommend a thorough read through both (yes, TK’s is a bit long, but we did need all of it).

After the confusion has been more or less sorted out, a brief summary could go along the lines of: “We just got an enhanced SSAS“. I have been asked a number of times about the future of SSAS and UDM. So far I seem to have gotten it right – we don’t get a tremendously successful product replaced – instead we get a new and exciting addition to it. Rumours have it that the SSAS team will soon get down and dirty with more work on all of the components of SSAS – both multidimensional and tabular. What can come out of this? – a unique mix of on-disk and in-memory analytics. Edges may have to be smoothened, and in the end of the day we get more, not less.

What caused the confusion – well, in my opinion the SSAS team may be the greatest in analytical tools but in this particular case I think that the communication from Microsoft to us was not up to par. In all of their excitement about the new toys they are building for us they did not accurately draw a roadmap, which lead to the rumours. I hope that in the future this won’t happen and the recent posts by the SSAS team show a lot of improvement in this regard.

All in all we get a BI Semantic Model – encompassing both multidimensional (previously known as UDM) and the new tabular (in-memory) modelling. These two are integrated in one BISM, which allows us to pick and choose the tools we need to deliver the best possible results. All other tools in the stack will eventually work equally well with both models and the two models will integrate well together. Of course, this is a big job for the team and I hope that they succeed in their vision since the end result will be the best platform out there by leaps and bounds.

As of today – the future looks bright and I am very pleased with the news.

PowerPivot, SSAS , , ,

 

Dynamic Groups in SSRS Reports with MDX

May 7th, 2011

Providing a dynamic group in SSRS with SQL is not all that hard. It is also a common business requirement when combining reports. In example, I see very often an old system providing multiple similar reports to its users – that is reports with exactly the same layout but with different items on rows or columns. I also see quite often SSRS requirements asking me to mock the old functionality and build multiple almost identical reports. My personal preference is to combine them in one if possible, thus reducing the maintenance and even development effort and cost. In SQL we have the convenience to choose how to name the output columns. As long as we have a stable dataset – that is a constant number of columns with the same data types called the same way, SSRS does not differentiate between them and treats them the same way. This allows us to flexibly change their actual contents. The standard approach in MDX would be to create all possible “groups” in a large dataset and then use one or another in the report as required through expressions. This, however, leads to slower execution times, larger chunks of data going around the network and the need to aggregate within SSRS at almost every data cell.

Instead, we can build the same functionality by constructing MDX dynamically through the use of parameters. Let’s examine a practical case to illustrate the concept a bit more clearly. If we assume that the Adventure Works executives asked me to create two simple reports showing years across the columns, their Internet Sales Amount in the data area and on one of them product categories on rows, while on the other one – the countries from which the sales originate from. The layout would be the same – a tablix, which has one column group – Year (calendar), Internet Sales Amount in the data cell and one of the two dimension hierarchies (Product Category or Country) on rows. The MDX for the datasets would also be fairly similar:

Year

Country

As we can see, the only difference here is the set on rows. Instead of providing two reports we can combine them in one. The standard approach I see all developers implementing is crossjoining all sets together and then aggregating in SSRS. To switch between the two grouping fields there would be a parameter “Group By”. The sample MDX for this report would be:

And the layout in SSRS:

The row group (and the expression within the row cells) here is based on a GroupBy parameter with two values – “product” or “country”. In SSRS the expression for them looks like this:


When the user selects Product as a value for the Group By parameter, our report renders grouped with the Product categories on rows. In this particular case the dataset is relatively small even when we do a full crossjoin between years, product categories and countries. However, in many cases the number of items on rows can be prohibitive. Therefore, it is better if we can avoid doing this by dynamically constructing the MDX expression. We can add another (Internal) parameter – GroupMDX to our report with the following expression as its default value:

Then, we can replace the report dataset with the following MDX:

To make it execute, I added as a sample query parameter GroupMDX with a value of [Product].[Product Categories].[Category]*[Customer].[Customer Geography].[Australia]. Note that you should use the same hierarchies as sample parameters as what you are actually using in your MDX parameter – otherwise you will get empty results (e.g. if I were to use [Customer].[Country].[Australia] as a sample in this case the actual dataset would not be able to figure out that the hierarchy has been changed and will return an empty set on rows for country).

Unlike with SQL, though, we have a problem – every time the query executes with a different set as a parameter value it returns different result set and its fields collection does not link well to them:

Luckily, there is a workaround. We can add a bit of custom code to our report (Report Properties -> Code). For more details you can read the following MSDN article. In a nutshell, we can check in the custom code for missing fields, thus guarding against the error we are receiving. The slightly modified code snippet from the MSDN page is:

Now, we can wrap every possibly disappearing fields reference within our report with a call to the function. In our case, we can replace the row group expression, as well as the row cell expression with the following expression:

Now we can run our report and get the results we expect. Every time we change the Group By parameter and run the report, SSRS constructs a different MDX query and sends it to SSAS getting only the results it needs. It is a fiddly technique, but in many cases can be a lifesaver – in particular when the large crossjoin method is just too slow.

And the report rdl can be downloaded here: 

SSRS , ,

 

SSRS Cascading Parameters Refresh: Solved

April 23rd, 2011

There is a long-standing issue with cascading parameters in SSRS – when changing the selection of the “parent” parameter the default selection of the dependent parameter does not always get refreshed. This is closed as “By Design” by Microsoft in the corresponding Connect item:

https://connect.microsoft.com/SQLServer/feedback/details/268032/default-does-not-get-refreshed-for-cascading-parameters

The reason given is that since the users may have changed the selection in the dependent parameter drop-down we do not want to overwrite their changed with the default values every time they select something else in the parent parameter. This is a valid reason, but in some cases it is actually desirable and expected for the defaults to change. Since the refresh behaviour cannot be controlled in SSRS the only way to deal with a requirement specifying a complete refresh was to tell the users it was not possible.

However, after playing a little bit around the various options I found a hack/workaround/solution which actually works and allows us to do a complete refresh – hence this post. The actual trick comes from the fact that the dependent parameter gets refreshed only when its values are invalidated by the selection in the first parameter. In example, if a user de-selects a value in the parent parameter and selects another value which prompts a complete change in the second (dependent) parameter, SSRS will actually apply the default selection by default. This got me thinking that if we can invalidate the dependent parameter every time a parameter changes we will enforce a complete refresh. An easy way to do this is to attach a value such as a GUID obtained with the NEWID() T-SQL function, or the output from GETDATE(). An old workaround I have provided in the Connect item was an idea based on GETDATE(). However, in this post I will show how we can use NEWID() with the same outcome.

Firstly, let’s assume we have the following tables in SQL Server:

Parameter (e.g. dimension) table pTable


Fact table fTable


p1k is for parameter 1 key, p1l for parameter 1 label

In preparation for creating our report we can create the following three stored procedures (script is attached in the end of the post):

There are a few things that may need clarification.

  • The first stored procedure is a standard procedure to retrieve the distinct values for the parent parameter from a typical dimension table;
  • The TvFMVParamSplit function splits strings in the format ‘a,b,c’ to a table containing a, b and c on each row. We need to use such a function to split multi-select parameter value strings which SSRS returns when dealing with multi-selectable parameters;
  • In the second stored procedure we attach a GUID to all parameter keys. Every time this stored procedure gets executed the GUIDs are different and the output from the stored procedure is different, as well. We use this to invalidate the available and default values for the dependent parameter;
  • When we use the key+guid strings in the third stored procedure we need to strip the GUID out of the string. We know that the output from NEWID() has LENgth of 36, so we REPLACE the last 36 characters in each selected value with an empty string, thus eliminating the redundant manually attached part and we are left with the key only;
  • The string which SSRS constructs for the multiple selections gets quite much longer than when passing ints – we get 36 extra characters per value. Therefore, this is less efficient and cumbersome than passing straight keys. It could be better to have a flag in SQL Server which we can toggle every execution to 0 or 1 instead of using a GUID as it would be shorter but would introduce a slight extra bit of complexity I avoided for the purposes of this post.

The overall goal is to cause a complete refresh of all values in the second parameter whenever its stored procedure is called by SSRS (after change in the parent parameter).

Now we are ready to use these datasets in SSRS:

  1. Create a new Data Source pointing to the database containing out tables and stored procedures;
  2. Create a Data Set p1 for the parent parameter;
  3. Create a @p1 report parameter (multi-select, text) with available values obtained from the usp_p1 stored procedure (p1k as Value, p1l as Label);
  4. Create a Data Set p2 for the dependent parameter, which accepts @p1 as a parameter;
  5. Create a @p2 report parameter (multi-select, text) with available and default values obtained from the usp_p2 procedure (p2k as Value, p2l as Label).

An interesting thing to note is that this setup will actually not quite work. If we just perform these 5 steps and test we will notice that the values for @p2 do not always get refreshed on change of @p1. Half by chance I made @p2 Internal. Then I added an extra @p3, which is then used on the report and did this the trick. Therefore, the actual steps required to complete the creation of our report are:

  1. Set the @p2 report parameter to be Internal;
  2. Create a @p3 report parameter (multi-select, text), with Available Values Value expression of =Parameters!p2.Value and Label of =Parameters!p2.Label. The Default Values should be =Parameters!p2.Value;

  1. Create a Data Set main from the usp_main stored procedure and make sure that its parameter is populated by @p3, not @p2;
  2. Add a table in the report with two columns, which show p2l and amt from the main data set.

Now if we deploy (also in the Preview window) we can test the outcome by changing selections of p1. Since there are quite a few fiddly steps involved in getting this to work I am attaching the working rdl (SQL Server 2008 R2) and a T-SQL script (SQL 2008) which creates the tables, populates them with data and creates the stored procedures (in Adventure Works) required for this workaround to work as described here. Please note that the same should work in SSRS 2005 but I have not tested it in that environment. Please add a comment to this post if you test it and it indeed works.

SSRS ,