Archive

Posts Tagged ‘dimensional modelling’

The Single Table Model in PowerPivot

February 11th, 2011

In my last post I examined a normalised vs a denormalised model in PowerPivot. In some cases, though, users will simply avoid this de/normalisation “stuff” and import a single table into PowerPivot. After all, PowerPivot targets Excel users, and Excel users are used to working with large workbooks – in most cases a large extract provided by their friendly DBA or database developer. This is why the scenario where PowerPivot becomes a tool for overcoming the one-million-row limit in Excel will be quite common. But how does it perform? Is it wise to do this, and when? I will try to answer some of these questions in this post.

Performance

To test performance I mashed up the data in my PTest environment and put it in a single table. Of course, comparing the space on disk between the normalised, denormalised and single table approaches, there was a massive difference: the single table was by far the largest, followed by the denormalised model, with the normalised one the smallest. This is what we would expect, and it is one of the reasons for normalising in the first place.

In a database the single table would be very inefficient, since it would lead to lots of IO in many scenarios. However, when imported into PowerPivot, the sizes compare like this (variation from the denormalised model given in brackets):

PTest_Denormalised

5.5 Gb RAM
3.5 Gb File

PTest_Normalised

5.2 Gb RAM (-0.3Gb)
3.3 Gb File (-0.2Gb)

PTest_SingleTable

4.3 Gb RAM (-1.2Gb)
2.8 Gb File (-0.7Gb)

Obviously, the elimination of the many distinct key columns leads to significant space savings. In fact, the single table approach is by far the most efficient in terms of both memory utilisation and disk space (when the Excel file is saved).

After I did my standard slicer testing, I got extremely good performance out of my single table. Having no relationships whatsoever seems to be a fast approach in PowerPivot. So, if there were no other considerations, we could jump to the conclusion that a single table is the best possible option for PowerPivot. The next few sections of this post will show why this is actually not the case.

Usability

Let’s quickly add another hypothetical table to our model. What if we need some more data and we decide to ask our friendly DBA for another extract with more of the same? Now we have two massive tables and we want to analyse the data from both of them. We hit a serious problem – a slicer built from one of the tables does not slice the other one. This is an obvious scenario for SSAS developers: we need a relationship between the tables we want to slice. In other words, if we want to slice both tables by Date, we need a common table on which to base the slicer. Our model should look like the following diagram:

We do not have this relationship, and trying to add one directly between the two big tables fails because neither has a column with distinct values. This is the most obvious showstopper I can see when considering a single large table.
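If both extracts live in the same source database, one rough way around this is to build the common table on the database side before import and then relate both big tables to it in PowerPivot. A minimal T-SQL sketch – the extract and column names below are purely hypothetical:

-- build a distinct list of dates covering both extracts
SELECT [Date] FROM dbo.Extract1
UNION
SELECT [Date] FROM dbo.Extract2;

UNION (without ALL) removes duplicates, so the result can serve as the distinct-valued Date table to which both large tables relate.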

Size

When we have one table, memory utilisation seems very good. However, when we have more than one of these behemoths in memory, we can reasonably expect a problem: attribute data (e.g. our Product and Customer names) has to be stored a number of times, which is unnecessary duplication of the same data. This is especially true for high-cardinality attributes – those with many distinct values, like my sample Order Number attribute (20+ million distinct values). In such cases, separating these into their own “dimension table”, as sketched below, would save memory and disk space.
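A hypothetical T-SQL sketch of that separation – the table and column names are made up for illustration. The high-cardinality attribute (and any columns that describe it) is extracted once into its own lookup table, which is then imported and related to each large table instead of being repeated in both:

-- build the lookup table once from the big extract
SELECT DISTINCT [Order Number], [Order Channel], [Order Status]
INTO dbo.DimOrder
FROM dbo.BigExtract;

The big tables then keep only the [Order Number] column as a key, and the descriptive columns are dropped from them.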

DAX

How these models compare when building DAX calculations is a topic of its own, and I will soon show some comparisons which should answer two questions: which is the most convenient and intuitive model to work with, and which one is the fastest. For now it suffices to say that always working over one large set of values, rather than over smaller separate tables, could be expected to be the slowest (however, I have not done sufficient testing to confirm this yet).

In conclusion, when doing ad-hoc analytics over extracts which will definitely never need to be mixed with others, the single table approach works very well in PowerPivot, and it certainly extends the functionality Excel offers natively. However, if we are building extensible models which are to be shared and enhanced, and which would form the heart of our Team BI, we should avoid the single table because of the considerations listed above.

PowerPivot

 

Avoiding NULLs in SSAS Measures

January 20th, 2011

In a recent discussion I made the statement that many data marts allow and contain NULLs in their fact table measure columns. From the poll responses I am getting here, I can see that more than half of everyone voting does have NULLs in their measures. I thought the ratio would be smaller, because in many models we can avoid NULLs, and it could be considered good modelling practice to attempt to do so. There are a few techniques which reduce the need for NULL-able measures and can possibly even eliminate them completely. While converting NULLs to zeros can have a performance benefit because of certain optimisations, it can also hurt performance in many cases because calculations then operate over a larger set of values; in addition, the cost of retrieving and rendering zeros in reports and PivotTables may be unnecessary. Therefore, it is advisable to test the performance and usability implications of preserving NULLs versus converting them to zero (the Preserve and ZeroOrBlank settings). Thomas Ivarsson has written an interesting article about the NullProcessing measure property on his blog.

I am offering two approaches which should always be considered when dealing with measure columns containing NULLs.

Adding Extra Dimensions

Let’s assume that we have a fact table with a structure similar to this:

Date        Product    Actual Amount    Forecast Amount
201101      Bikes      100              NULL
201101      Bikes      NULL             105

By adding a Version dimension we can achieve a structure like this:

Date        Product    Version        Amount
201101      Bikes      Actual         100
201101      Bikes      Forecast       105

This way we eliminate the NULL measure values, and we get a maintainability and arguably a usability boost as well.
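As a rough illustration, this reshaping can be done in the relational layer before the fact table is loaded. The sketch below assumes a hypothetical wide source table dbo.FactFinance with the two measure columns from the example; CROSS APPLY unpivots them into a Version column and drops the NULLs in the process:

SELECT
    f.[Date],
    f.[Product],
    v.[Version],
    v.[Amount]
FROM dbo.FactFinance AS f
CROSS APPLY (VALUES
    ('Actual',   f.[Actual Amount]),
    ('Forecast', f.[Forecast Amount])
) AS v ([Version], [Amount])
WHERE v.[Amount] IS NOT NULL;

In a real data mart the Version column would, of course, become a foreign key to a small Version dimension table rather than a text label.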

Splitting Fact Tables

Another way to improve the data mart model is to have more fact tables, each containing a smaller number of measures. Let’s illustrate this concept with a simple example:

Date        Product    Amount    Exception Amount
201101      Bikes      100       NULL
201102      Bikes      120       10
201103      Bikes      110       NULL
201104      Bikes      130       NULL

In this case, Exception Amount could be an extra cost which is relevant only in rare cases. Thus, it may be better moved to a separate table, which contains less data and may lead to some space savings on large data volumes. It would also still allow for the same analytics in SSAS if we implement it as a separate measure group. For example, we could rebuild our model like this:

Date        Product    Amount
201101      Bikes      100
201102      Bikes      120
201103      Bikes      110
201104      Bikes      130

Date        Product    Exception Amount
201102      Bikes      10

Even though we now have two tables and, in this case, one extra row, depending on how rare the Exception Amount is, the saving of an entire column may well be worth it.
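A minimal T-SQL sketch of the split, assuming the wide table is called dbo.FactSales (all object names here are hypothetical):

-- main fact table keeps the always-populated measure
SELECT [Date], [Product], [Amount]
INTO dbo.FactSalesMain
FROM dbo.FactSales;

-- exception fact table keeps only the rows where the rare measure exists
SELECT [Date], [Product], [Exception Amount]
INTO dbo.FactSalesException
FROM dbo.FactSales
WHERE [Exception Amount] IS NOT NULL;

In SSAS the two tables would then back two measure groups sharing the Date and Product dimensions.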

Sometimes it is not possible, or not correct, to apply these modelling transformations. For example, we could have a perfectly valid design like this:

Date        Product    Sales Amount    Returns Count
201101      Bikes      100             10
201102      Bikes      120             30
201103      Bikes      110             NULL
201104      Bikes      130             20

Here, adding a Measure Type dimension would help us eliminate the NULL, but it would also mean storing two different data types – dollar amounts and counts – in the same measure column, which we should definitely avoid. Also, since we may have only a few months with NULLs in the Returns Count column, the ratio of NULLs to non-NULLs is low, and we would actually lose the benefits of the second approach – splitting off a separate fact table saves little when almost every row has a value.

From an Analysis Services point of view, it is better to use NULL, again depending on the NULL-to-non-NULL ratio we have. The more NULLs we have, the better it is not to convert them to zeros, as calculations which utilise NONEMPTY would otherwise have to work over a larger set of tuples. Here we may also notice the difference between NONEMPTY and EXISTS – the latter will not filter out tuples which have associated rows in the fact table, even when the measure values are NULL. It also depends on the usage scenario: if a user filters out empty cells in Excel, the Returns Count for 201103 will still appear, as 0, if we store zeros or if we have set the measure to convert NULLs to zeros (because of the way NON EMPTY, which Excel uses, works). In the opposite case (when we store NULLs and the NullProcessing property is set to Preserve) Excel will filter out the empty tuples.

Regardless of the scenario, it is always a good idea to try to remove NULLs by creating a good dimensional model. If we are in a situation where NULLs are better left in the measure column of the relational fact table, we should consider whether we need the default zeros in our measures, or if we would be better off avoiding them altogether.

I would be interested to hear whether, and why, you allow NULLs in your measure groups if the reason is not covered in this article. It would also be interesting to see whether there are other good ways to redesign a model in order to avoid storing NULL measure values.

SSAS

 

Ad-Hoc Ranges in SSAS

September 2nd, 2010

We can easily build ranges in MDX with the range operator “:”. We can also easily create a Range dimension in SSAS and use it to slice our data. This post is not about either of those. I would like to discuss the scenario where we need to restrict an ad-hoc query (e.g. PivotTable in Excel) by a range of values. Usually, we would tell our users to just select the values they need. This works. However, if our users do not want to be selecting/deselecting many values, we can provide an easier way to do this.

Let’s assume we have an Age dimension for a Kindergarten cube. The Age dimension contains the ages of the children, which can be from 1 to 10. Our requirement is to be able to select low and high limits of ages for a Pivot Table in Excel, so that the data in the Pivot Table is sliced for the range of ages between those limits.

To implement this in practice, we can build two extra dimensions – Age – Low Limit and Age – High Limit – which contain the same members as the Age dimension, and then use them to slice our cube. Because the data is the same for all three dimensions, we can use a couple of database views on top of the Age dimension table, thus ensuring that all three stay in sync:
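A minimal sketch of what those two views could look like, assuming a hypothetical dbo.DimAge table with AgeKey and Age columns (the names are illustrative only):

-- expose the same Age members under limit-specific names
CREATE VIEW dbo.DimAgeLowLimit AS
SELECT AgeKey AS AgeLowLimitKey, Age AS AgeLowLimit
FROM dbo.DimAge;
GO
CREATE VIEW dbo.DimAgeHighLimit AS
SELECT AgeKey AS AgeHighLimitKey, Age AS AgeHighLimit
FROM dbo.DimAge;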

After that, we build two bridging tables, BridgeLowLimit and BridgeHighLimit, between Age and Age – Low Limit and between Age and Age – High Limit respectively:

The data in these bridging tables maps each Low and High limit member to all Age members which are at or above it (for the Low limit) or at or below it (for the High limit):
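A hypothetical T-SQL sketch of populating the two bridge tables from the objects above (again, all names are illustrative):

-- each Low Limit member maps to every Age at or above it
SELECT l.AgeLowLimitKey, a.AgeKey
INTO dbo.BridgeLowLimit
FROM dbo.DimAgeLowLimit AS l
JOIN dbo.DimAge AS a ON a.Age >= l.AgeLowLimit;

-- each High Limit member maps to every Age at or below it
SELECT h.AgeHighLimitKey, a.AgeKey
INTO dbo.BridgeHighLimit
FROM dbo.DimAgeHighLimit AS h
JOIN dbo.DimAge AS a ON a.Age <= h.AgeHighLimit;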

 

Now we can define many-to-many relationships from FactAmount (our fact table), through the Age dimension and the bridge tables, to our limit dimensions as follows:

After this, we can hide the two measure groups for the Bridge tables from the users:

Now, we are ready to process our SSAS database. After that, we get the following in Excel:

If we place the Low and High limits in the Report Filter, Age on rows and our Amount measure on columns, we can limit the Age members displayed by changing the filter members. Note that only the lowest member selected in the Age – Low Limit dimension and the highest member selected in the Age – High Limit dimension matter – in the case of multi-selects, everything in between effectively gets ignored.

There are certain problems with this solution. If we place the limit dimensions on rows and we select multiple members from each dimension, we get the following:

This can be confusing for users who want distinct ranges like 1-3, 3-6 and 6-10. Unfortunately, it is not possible to build a cube calculation which hides the irrelevant members, as we do not know what the users have selected in Excel; we cannot determine which members are in the query scope, and therefore we cannot pick only the ones we need (e.g. the ones with the smallest distance between the two limit members).

If we place the two dimensions on Rows and Columns, we can get a nice matrix and this makes a bit more sense:

 

Also, for large dimensions (e.g. Date) this can be quite slow, as the number of rows in the bridge tables grows rapidly – each bridge holds roughly N(N+1)/2 rows for N members, since every limit member maps to all members on one side of it. For example, with 10 years (roughly 3,650 days) in our Date dimension, mapping them the way I just showed would produce approximately 6-7 million rows in each bridge table, which can be quite prohibitive from a performance point of view. Smaller dimensions are fine, though – in my opinion everything under 1,000 members is acceptable, as it generates at most around 500,000 rows in each bridge table. Therefore, if our users insist on this functionality – especially when they have a flat list of 100-1000 members and frequently select ranges out of it – we have a way of answering their need.

SSAS

 

Do Business Analysts make good dimensional modellers??

May 26th, 2010

Recently I had the (dis)pleasure of working with Business Analysts who also thought that they were good at dimensional modelling. So, I had to implement BI solutions (including cubes) on top of their database design. I will show an example (about 95% the same as the actual design) of why letting BAs stray into dev territory does not yield the best results:

 

This “dimensional model” was created by an experienced BA. A few of its “features” are not visible in the diagram:
1. The fact table had EffectiveFrom and EffectiveTo dates
2. The relationships between some dimension tables were 1:1 ?!
3. The Time dim (the only dimension properly implemented on its own – at the bottom of my example) had columns like DateTimeName nvarchar(100), DateTimeKey nvarchar(100), YearName nvarchar(100), etc. (see the sketch after this list for a more conventional alternative)
4. Some tables at the top had nothing to do with the rest (in fact, a colleague of mine reckons they are there to fill in the white space at the top of the A3 printout)
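For contrast with point 3, here is one conventional way a date dimension is typically typed – a hypothetical sketch, not the design from that project:

CREATE TABLE dbo.DimDate
(
    DateKey     int         NOT NULL PRIMARY KEY,  -- e.g. 20110526
    [Date]      date        NOT NULL,
    [Year]      smallint    NOT NULL,
    [Month]     tinyint     NOT NULL,
    [MonthName] varchar(10) NOT NULL
);

Integer keys and proper date/numeric columns compress and join far better than nvarchar(100) keys and names.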

Another design, which is better but still pretty bad, showed up after my training on dimensional modelling (one hour to go through EVERYTHING, including many-to-many relationships, parent-child hierarchies, Type 2 dimensions, etc.):

Obviously, the designer (a developer, actually) did grasp some concepts. However, my explanation of a star schema must not have been clear enough…

I hope you had some fun with these two diagrams, and I am sure many developers end up in a similar situation, especially when someone else designs their databases. But two points:

1. Ask the BAs to analyse the business and their requirements – not to design the database
2. One hour of training on dimensional modelling will not make you an expert

SSAS, T-SQL