The Single Table Model in PowerPivot

February 11th, 2011

In my last post I examined a normalised vs a denormalised model in PowerPivot. In many cases, though, users will simply avoid this de/normalisation “stuff” altogether and import a single table into PowerPivot. After all, PowerPivot targets Excel users, and Excel users are used to working with large workbooks – in most cases a large extract provided by their friendly DBA or database developer. This is why the scenario where PowerPivot becomes a tool for overcoming the one-million-row limit in Excel will be quite common. But how does it perform, and is it wise to do this – and when? I will try to answer some of these questions in this post.

Performance

To test performance I mashed up the data in my PTest environment and put it in a single table. Comparing the space on disk between the normalised, denormalised and single table approaches, there was, of course, a massive difference, with the single table being by far the largest, followed by the denormalised model, and the normalised one being the smallest. This is what we would expect, and it is one of the reasons for normalising in the first place.
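For reference, the flattening itself is just a set of joins in SQL. Below is a minimal sketch of what I mean, using illustrative table and column names loosely based on the PTest schema (the actual names in my environment differ):

-- Flatten the normalised PTest schema into a single wide table
-- suitable for importing into PowerPivot as one table.
-- All object names below are illustrative, not the real PTest definitions.
SELECT
    d.DateKey,
    d.YearName,
    d.MonthName,
    c.CustomerName,
    p.SubcategoryName,
    p.ProductName,
    oh.OrderNumber,
    od.SalesAmount
INTO dbo.SalesSingleTable
FROM dbo.OrderDetail AS od
JOIN dbo.OrderHeader AS oh ON oh.OrderHeaderID = od.OrderHeaderID
JOIN dbo.Customer    AS c  ON c.CustomerID     = oh.CustomerID
JOIN dbo.Product     AS p  ON p.ProductID      = od.ProductID
JOIN dbo.[Date]      AS d  ON d.DateKey        = oh.OrderDate;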

In a database the single table would be very inefficient, since it would lead to a lot of IO in many scenarios. However, when imported into PowerPivot, the sizes compare like this (variation from the denormalised model given in brackets):

PTest_Denormalised

5.5 Gb RAM
3.5 Gb File

PTest_Normalised

5.2 Gb RAM (-0.3Gb)
3.3 Gb File (-0.2Gb)

PTest_SingleTable

4.3 Gb RAM (-1.2Gb)
2.8 Gb File (-0.7Gb)

Obviously, the elimination of the many distinct key values leads to significant space savings. In fact, the single table approach is by far the most efficient of the three when it comes to memory utilisation and disk space (when saving the Excel file).

After my standard slicer testing, I got extremely good performance out of the single table. Having no relationships whatsoever seems to be a fast approach in PowerPivot. So, if there were no other considerations, we could jump to the conclusion that a single table is the best possible option for PowerPivot. The next few sections of this post will show why this is actually not the case.

Usability

Let’s quickly add another hypothetical table to our model. What if we need some more data and we ask our friendly DBA for another extract with more of the same? Now we have two massive tables and we want to analyse the data from both of them. We hit a serious problem – a slicer built from one of the tables does not filter the other one. This is an obvious scenario for SSAS developers: we need a relationship between the tables we use for slicing. In other words, if we want to slice both tables by Date, we need a common table on which to base the slicer. Our model should look like the following diagram:

We do not have this relationship, and trying to add one directly between the two big tables fails because neither of them has a column with distinct values. This is the most obvious showstopper I can see when considering a single large table.
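The usual workaround is a small shared lookup table holding one row per distinct value, to which both large tables can then be related. A minimal sketch of building such a Date table in SQL before import, assuming two hypothetical extracts named Extract1 and Extract2 which both carry a DateKey column:

-- Build a shared Date lookup table with one row per distinct date,
-- so that a single slicer based on it can filter both large extracts.
-- Extract1, Extract2 and DateKey are illustrative names.
SELECT DISTINCT DateKey
INTO dbo.DimDate
FROM (
    SELECT DateKey FROM dbo.Extract1
    UNION
    SELECT DateKey FROM dbo.Extract2
) AS AllDates;

Once DimDate is imported, a relationship from each extract’s DateKey column to DimDate gives us the diagram above and lets one Date slicer drive both tables.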

Size

When we have one table, memory utilisation seems very good. However, when we have more than one of these behemoths in memory, we can reasonably expect a problem due to unnecessary duplication of attribute data – for example, we would have to store our Product and Customer names several times over. This is especially true for high-cardinality attributes – those with many distinct values, like my sample Order Number attribute (20+ million distinct values). In such cases, separating these out into their own “dimension table” would save memory and disk space.
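A quick way to spot the attributes worth moving out into their own table is to check their cardinality against the total row count. A sketch, again with illustrative names matching the flattening example earlier:

-- Compare attribute cardinality with the total number of rows.
-- High-cardinality columns (such as OrderNumber) are the expensive ones
-- to duplicate across several large tables.
SELECT
    COUNT(*)                     AS TotalRows,
    COUNT(DISTINCT OrderNumber)  AS DistinctOrderNumbers,
    COUNT(DISTINCT ProductName)  AS DistinctProducts,
    COUNT(DISTINCT CustomerName) AS DistinctCustomers
FROM dbo.SalesSingleTable;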

DAX

How these models compare when building DAX calculations is a topic of its own, and I will soon show some comparisons which should answer two questions – which is the most convenient and intuitive model to work with, and which one is the fastest. For now it would suffice to say that always working over a large set of values could be expected to be the slowest approach (however, I have not done sufficient testing to confirm this yet).

In conclusion, for ad-hoc analytics over extracts which will definitely not need to be mixed with others, the single table approach works very well with PowerPivot, and it certainly extends the functionality Excel offers natively. However, if we are building extensible models which are to be shared and enhanced, and which would form the heart of our Team BI, we should avoid the single table because of the other considerations listed above.

PowerPivot

 

PowerPivot Data Modelling for Performance

January 30th, 2011

PowerPivot allows us more flexibility than SSAS UDM when it comes to data modelling. We are no longer constrained to dimensions and facts. As Thomas Ivarsson showed in his PowerPivot Schema Flexibility post, we can avoid certain data manipulations when building a model in PowerPivot. The same holds true for relational database scenarios as well: we can write SQL queries and analyse data without complying with any particular model. And yet we have the concepts of a data mart and a star schema, which are not based on SSAS UDM – rather, it is the other way around. In this post I will compare a normalised and a denormalised model in PowerPivot and reason about the relevance of data modelling and denormalisation with Microsoft in-memory technologies.

I prepared two SQL Server databases: PTest_Normalised and PTest_Denormalised. First I populated the normalised model with random data and then moved it into the denormalised one. The two data sets are identical, with the small exception of the date key, which is a datetime in the normalised model, while in the denormalised model I converted it to a meaningful integer.
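The conversion itself is trivial in T-SQL. A sketch of the transformation I mean, assuming a hypothetical OrderDate datetime column (the real column names in PTest differ):

-- Convert a datetime key into a meaningful yyyymmdd integer key
-- for the denormalised model. OrderDate is an illustrative column name.
SELECT
    CAST(CONVERT(char(8), OrderDate, 112) AS int) AS DateKey,
    OrderDate
FROM dbo.OrderHeader;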

The specifics of each database are as follows:

PTest_Normalised

PTest_Denormalised

The largest table in both models is the one containing Order Details, with exactly 50 million rows. The next largest is Order Header (or Order in the second model) with 20 million rows. The rest are significantly smaller, the next largest being the Customer table with 1 million rows.

After setting up the databases, I imported them into PowerPivot. Since I had implemented foreign keys and consistent column names, PowerPivot detected and created all relationships for me as they should be (see the diagrams above). Some applicable measurements of the two models in PowerPivot:

PTest_Normalised

5.2 Gb RAM
3.3 Gb File

PTest_Denormalised

5.5 Gb RAM
3.5 Gb File

What this tells us is that there is not much of a difference between the two when stored in memory and on disk. The denormalised model is slightly larger, but the difference is not even close to what we see in the relational database. This is no doubt because, with columnar compression in memory, the dimension key columns in the fact table compress so well that the advantages of normalisation become mostly imperceptible. Interestingly, if we want to save disk space and memory, normalising the model will do little to help us. Existing data marts which do not fit in memory will not get much smaller if normalised, and based on this case study I would advise against normalising them as a solution to that problem (while buying more RAM would definitely help).

Once the two models are in PowerPivot we can start the fun and build some pivot tables.

First I added my only measure, SalesAmount, to the Values area, placed YearName in Slicers Vertical, and put MonthName and DateName on Rows:

When I clicked on a particular year in the slicer the execution times were:

PTest_Normalised

Cold (first time click): 14s

Warm: (subsequent clicks on the same item): 3s

PTest_Denormalised

Cold: 2s

Warm: <1s

There is a large intermediary table (OrderHeader) in the normalised model, which would have influenced these results. To avoid this problem, I tried slicing the Sales Amount measure by Products instead. I modified the pivot table for my next test by substituting the product attributes for the date ones:

The results:

PTest_Normalised

Cold: 4s

Warm: 3s

PTest_Denormalised

Cold: 4s

Warm: 3s

Clearly, avoiding the large intermediary table helped the normalised model’s performance. Here, however, we see that performance is not all that bad if the chained relationships involve small tables; therefore, snowflake schemas seem to be quite alright with PowerPivot. In my observations, most of the 3-4 seconds of response time went on retrieving the long lists of Subcategory and Product Names. If I eliminate these:

PTest_Normalised

Cold: <1s

Warm: <1s

PTest_Denormalised

Cold: <1s

Warm: <1s

Chains seemingly do not matter much, as long as the tables in them do not contain too many rows. How many is too many? Well, it depends on your case, but I would be cautious with more than 1 million rows unless your hardware allows it. Of course, if you examine my normalised model you will notice that, because a lot of relationships go through the Order Header table, slicing by many of the “dimension” attributes results in very sluggish query responses. Speaking of hardware, all my testing was done on a laptop with 8Gb RAM, an i5 460M 2.53GHz CPU and a 7200 RPM hard disk.

In conclusion, I think the results from my little test case show that data marts are not irrelevant even at the speeds of in-memory technologies. Reading Kimball and understanding the concepts behind data marts remains important: data is data, and an optimal data model plays a significant (if not the most significant) part in a PowerPivot model just as it does in a relational database. While the memory required for importing the data into PowerPivot is almost the same, performance can suffer greatly when modelling is not taken seriously. The flexibility PowerPivot offers us is great, but in some scenarios it could potentially be a problem – especially when users decide to serve themselves too much data.

PowerPivot

 

Microsoft BI TechCenters: Learning and Resources

January 29th, 2011

Apart from subscribing to the blogs and reading the books listed in the Sites and Blogs and Books sections of this blog, you can also visit the following TechCenters on TechNet and MSDN:

SSAS – Multidimensional Data

http://msdn.microsoft.com/en-us/sqlserver/cc510300.aspx

SSAS – Data Mining

http://technet.microsoft.com/en-us/sqlserver/cc510301.aspx

PowerPivot for Excel

http://technet.microsoft.com/en-us/bi/ff604673.aspx

There you can find a good collection of whitepapers, articles and links to various resources.

PowerPivot, SSAS

 

Avoiding NULLs in SSAS Measures

January 20th, 2011

In a recent discussion I made the statement that many data marts allow and contain NULLs in their fact table measure columns. From the poll responses I am getting here, I can see that more than half of those voting do have NULLs in their measures. I thought the ratio would be smaller, because in many models we can avoid NULLs, and it could be considered a best modelling practice to attempt to do so. There are a few techniques which reduce the need for NULL-able measures and can possibly even eliminate them completely. While converting NULLs to zeros has a performance benefit because of certain optimisations, that benefit is often lost because calculations then operate over a larger set of values; in addition, the performance hit of retrieving and rendering zeros in reports and PivotTables may be unnecessary. Therefore, it is advisable to test the performance and usability implications of preserving NULLs versus converting them to zeros (the Preserve and ZeroOrBlank settings). Thomas Ivarsson has written an interesting article about the NullProcessing measure property on his blog.

I am offering two approaches which should always be considered when dealing with measure columns containing NULLs.

Adding Extra Dimensions

Let’s assume that we have a fact table with a structure similar to this:

Date        Product    Actual Amount    Forecast Amount
201101      Bikes      100              NULL
201101      Bikes      NULL             105

By adding a Version dimension we can achieve a structure like this:

Date        Product    Version        Amount
201101      Bikes      Actual         100
201101      Bikes      Forecast       105

This way we eliminate the NULL measure values, and we get a maintainability and arguably a usability boost as well.
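In SQL the restructuring amounts to unpivoting the two measure columns into rows. A minimal sketch against a hypothetical FactBudget table with the structure shown above:

-- Turn the Actual Amount and Forecast Amount columns into a Version attribute,
-- dropping the NULL measure values along the way.
-- FactBudget and its column names are illustrative.
SELECT [Date], Product, 'Actual' AS Version, ActualAmount AS Amount
FROM dbo.FactBudget
WHERE ActualAmount IS NOT NULL
UNION ALL
SELECT [Date], Product, 'Forecast' AS Version, ForecastAmount AS Amount
FROM dbo.FactBudget
WHERE ForecastAmount IS NOT NULL;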

Splitting Fact Tables

Another way to improve the data mart model is to ensure that we have more fact tables containing a smaller number of measures. Let’s illustrate this concept with a simple example:

Date        Product    Amount    Exception Amount
201101      Bikes      100       NULL
201102      Bikes      120       10
201103      Bikes      110       NULL
201104      Bikes      130       NULL

In this case, Exception Amount could be an extra cost which is relevant only in rare cases. Thus, it may be better moved to a separate table, which contains less data and may lead to some space savings on large data volumes. It would also still allow for the same analytics in SSAS if we implement it as a separate Measure Group. For example, we could rebuild our model like this:

Date        Product    Amount
201101      Bikes      100
201102      Bikes      120
201103      Bikes      110
201104      Bikes      130

Date        Product    Exception Amount
201102      Bikes      10

Even though we have two tables and one extra row in this case, depending on how rarely the Exception Amount occurs, the savings from dropping the extra column may be worth it.
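A sketch of the split in SQL, assuming a hypothetical FactSales source table shaped like the first example above:

-- Keep the dense measure in the main fact table...
SELECT [Date], Product, Amount
INTO dbo.FactSales_Main
FROM dbo.FactSales;

-- ...and move the sparse Exception Amount into its own fact table,
-- keeping only the rows where it actually has a value.
SELECT [Date], Product, ExceptionAmount
INTO dbo.FactSales_Exception
FROM dbo.FactSales
WHERE ExceptionAmount IS NOT NULL;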

Sometimes it is not possible, or not correct, to apply these modelling transformations. For example, we could have a perfectly valid design like this:

Date        Product    Sales Amount    Returns Count
201101      Bikes      100             10
201102      Bikes      120             30
201103      Bikes      110             NULL
201104      Bikes      130             20

Here, adding a Measure Type dimension would help us eliminate the NULL, but it would also mean storing two different data types – dollar amounts and counts – in the same measure column, which we should definitely avoid. Also, since we may have only a few months with NULLs in the Returns Count column, the ratio of NULLs to non-NULLs is low, and we would actually lose the benefits of the second approach.

From an Analysis Services point of view, it is better to use NULLs, again depending on the non-NULL-to-NULL ratio we have. The more NULLs we have, the better it is not to convert them to zeros, as calculations which utilise NONEMPTY would otherwise have to work over a larger set of tuples. Here we may also notice the difference between NONEMPTY and EXISTS – the latter will not filter out tuples which have associated rows in the fact table, even when the measure values are NULL. It also depends on the usage scenario: if a user filters out empty cells in Excel, the Returns Count for 201103 will still appear (as 0) if we store zeros, or if we have set the measure to convert NULLs to zeros, because of the way NON EMPTY, which Excel uses, works. In the opposite case (when we keep the NULLs and the NullProcessing property is set to Preserve), Excel will filter out the empty tuples.

Regardless of the scenario, it is always a good idea to try to remove NULLs by creating a good dimensional model. If we are in a situation where NULLs are better left in the measure column of the relational fact table, we should consider whether we need the default zeros in our measures, or if we would be better off avoiding them altogether.

I would be interested to hear whether, and why, you allow NULLs in your measure groups if the reason is not covered in this article. It would also be interesting to see whether there are other good ways to redesign a model in order to avoid storing NULL measure values.

SSAS

 

New Blog

January 17th, 2011

I just saw that John Simon (my boss at Avanade Melbourne) has started his own blog at:

http://jsimonbi.wordpress.com

The first four posts offer a description of various approaches to hierarchies. I am looking forward to more good content, which, given John’s career and knowledge in BI, should not be far away.

Other