Boyan Penev on Microsoft BI

Melbourne SQL Server Social Event: Short Notice!

A short notice for SQL Server enthusiasts/professionals in Melbourne. Stephen Few is in town and will attend the SQL Server Social Event tomorrow (17 Nov) at the Sherlock Holmes Inn in the CBD (415 Collins Street). Feel free to come and meet the community, talk about SQL Server and information visualisation..and possibly get your Stephen Few books signed.

Link for the event: http://sqlserversocial.eventbrite.com/

See you there!

Alternate Ordering of Attributes in SSAS

Sometimes we need to display attribute members in SSAS in a different order than the order of its name or key. For this purpose we have the option to use one of its attribute’s name or key. However, in some cases changing the order may break some calculation logic which depends on the initial order. The new ordering may also be inconvenient for writing MDX as using some functions of the language is easier (at least conceptually) when thinking of sets in ascending order. The best example which we can use to illustrate this problem is the Date dimension. While in most, if not all, cases the Date dimension is ordered in ascending order, sometimes users prefer to see the most recent date first and request us to change the order to descending. Doing so invalidates many time intelligence calculations like rolling and parallel periods, etc. Furthermore, fixing those requires inverting numbers to negative, or avoiding the use of functions like ClosingPeriod. All in all, a “small” change can lead to a big problem. We can, however, accommodate our ignorant users (which unknowingly get the benefit of reading default time series charts backwards – from right to left – when dragging-dropping descending dates in Excel, for example) without changing too much in our scripts. A little trick in the modelling can help and it is the reason for writing this post.

Let’s have a look at a simple Date dimension with one attribute – Date. Nothing unusual, with the Date being ordered by its Key (integer in this case) and with a name coming from another column in the Date table – DateName. When we create a simple slice in Excel we get the following:

Now we create a measure Rolling 3 Days Amount, which sums the last 3 days’ amount:

The MDX for this calculation is:

If we simply invert the order of the Date attribute by ordering it by another column in our Date table, which contains DateKey*-1 and refresh the Excel pivot table we get the following:

This is simply incorrect. A relatively small change in the MDX script can help us with this issue (e.g. changing the Lag to Lead), however in many cases we do not want to rebuild all the measures. Luckily, we can employ a different tactic. Instead of changing the script, we can change the structure of our dimension by adding an additional attribute which is not exposed to the users. (i.e. is hidden). This attribute will be based on the same column we use for our Date, but will not be ordered by the descending column. We can rename the original attribute (the one exposed to the users) to something like Date Desc, or a more user-friendly option, and hide the new one:

Everything else stays the same – our cube script does not need to be adjusted and its logic is correct:

A different approach could be to leave the old attribute named Date, so there is no change necessary in case of reports depending on the naming. This, however, requires a change of the cube script, which can be easily performed with using the BIDS Replace functionality (e.g. Ctrl+H).

Note that for this approach to work we need to make sure that the attribute exposed to the users is the dimension key attribute as changing its current member results in an (infamous) attribute overwrite where its related attributes, which are above it in the relationship chain) also change. If we expose the non-key date attribute our MDX logic will break as the changes to its current member will not affect the attributes below it (actually, it will set them to their All member).

Range Queries with Azure DataMarket Feeds

By default the Azure DataMarket does not allow range queries. In fact, the only way we can filter a data feed is through specifying one or more values for the “queryable” fields specified for it. There is not technical reason behind not allowing range queries as both the back-end (presumably SQL Azure, or SQL Server) and the OData protocol support them. Fortunately, there is a way to consume a range of the values in a column of a data feed in PowerPivot. It is not straight-forward and I do not think that the target audience of both self-service BI/PowerPivot and the DataMarket itself would appreciate the complexity, but it could be useful anyway.

If we want to pull all three tables from the DataMarket we can simply use https://api.datamarket.azure.com/BoyanPenev/DateStream/ as the URL in PowerPivot:

Otherwise, we can pick each one with a URL like (for the BasicCalendarEngish table):

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish

If we filter the data set on the DataMarket website to request only the data for 2010 we get the following URL:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=YearKey%20eq%202010

Note the last bit:

?$filter=YearKey%20eq%202010

This is simply the URL/OData encoded ?$filter=YearKey = 2010

In OData we can also use other operators, not just = (or eq). For ranges these are gt (greater than), ge (greater than or equal to), lt (less than) and le (less than or equal to). We can also use and and or operators to combine different predicates. For a more thorough list, please refer to http://www.odata.org/developers/protocols/uri-conventions. If we replace the ” = 2010″ with ” < 2010″ and then encode the URL, we do indeed get all years prior to 2010. Things get slightly more complicated when we have a more complex scenario. In example, when building a date table we may want to include all years between 2000 and 2030. To do that, we would have to write something like:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=YearKey >= 2000 and YearKey <= 2030

encoded, the same looks like this:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=YearKey%20ge%202000%20and%20YearKey%20le%202030

Here space is %20 and the math comparison operators have been replaced with the OData operators (in red).

If we paste this in PowerPivot and hit “Next”:

…we get exactly what we expect – a table with 30 years.

Things get more complicated if we include the datetime DateKey in the URL. For a single date (e.g. 1900-01-01), we have to use:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=DateKey = datetime’1900-01-01T00:00:00′

After Applying URL encoding we get:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=DateKey%20eq%20datetime%271900-01-01T00%3a00%3a00%27

Where %27 is apostrophe and %3a is a colon (for a list of ASCII characters and their URL encoded form we can refer to http://www.w3schools.com/tags/ref_urlencode.asp).

Now, to combine the two we would need to write:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=DateKey = datetime’1900-01-01T00:00:00′ or (YearKey >= 2000 and YearKey <= 2030)

Encoded this becomes:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=DateKey%20eq%20datetime%271900-01-01T00%3a00%3a00%27%20or%20%28YearKey%20ge%202000%20and%20YearKey%20le%202030%29

This monstrous-to-write URL string returns 30 years of data + 1 day.

I suppose this approach can be classified as a workaround, as I have not seen any documentation on PowerPivot referring to any options for filtering data from the Azure DataMarket. However, in my opinion, this should be a feature of the DataMarket itself as it would make it easier/possible for users with any tool to get just the data they need and even possibly reduce the load on the site service since it will no longer be necessary to export everything and then attempt to apply a filter.

Introducing Project DateStream (CodePlex)

I recently blogged about The Case for an Azure DataMarket Date Table. I finished the blog post with a bit of a critique of the DataMarket team at Microsoft, which I can now wholeheartedly apologise for. This is because since my last post I was contacted by Max Uritsky who is a Group Program Manager on the Windows Azure Marketplace DataMarket team (a long name for a team, yes). He and Belinda Tiberio managed to help me with creating and hosting a new Date feed. Not only they helped with making it available for free on the DataMarket website, but also gave me a 1Gb free SQL Azure database for the project. A big “thank you” goes to Julie Strauss from the SSAS team for making the contact, as well.

To summarise, the DateStream project is a free date table available as a feed and intended to be used by PowerPivot BI users. As most, if not all, BI projects include a date table, the goal is to provide an easy-to-use, correct and simple mechanism for creating such tables.

After some deliberations on the format of the feed we decided that it would be best to split the feed in a number of localised Basic tables and one Extended table. Currently we have only two Basic (US and EN) versions and a fairly straight-forward Extended one. However, the plan is to inlclude more and more column in the Extended one and provide extra localised (Italian, French, Spanish, German, Russian, etc.) tables. When I am saying “we” I mean fellow SSAS professionals, which I had the pleasure to discuss this idea (among which Marco Russo, Thomas Kejser, Willfried Faerber and Grant Paisley).

The CodePlex page of the project contains more details about the feed and the tables and also allows for commenting on existing features, as well as requesting new ones. It can be found at http://datestream.codeplex.com/.

The actual feed can be used directly from https://datamarket.azure.com/dataset/1542c52d-d466-4094-a801-9ef8fd7c6485.

Note the logo – it was created by Daniele Perilli from SQL BI with the assistance provided by Marco Russo. Thanks to them we did not have to resort to my graphic design skills, which definitely is a win for the DataMarket website.

One note – please let me know if the performance you get from the feed is not satisfactory (please remember that once pulled, the feed does not need to be refreshed as the data will remain valid forever). If many people agree that it is too slow I could potentially host it on an alternative location as well. It is possible to download a CSV version from the DataMarket as a workaround, which also allows removing unnecessary date ranges.

Two New Books Definitely Worth Buying

I will try to provide a concise and useful insight into two books from the world of SQL Server. The first one is available now, while the second one is coming out very soon (in the next couple of weeks).

MDX with Microsoft SQL Server 2008 R2 Analysis Services Cookbook (link)
by Tomislav Piasevoli

I received an invite to review Tomislav’s book a few weeks ago and despite the fact that I am quite busy these days, I decided to have a look mainly because I expected it to be very good as I had heard about it months ago by the author and the reviewers (Darren Gosbell, Chris Webb, Greg Galloway, Marco Russo and Deepak Puri); and because I was promised a free copy, so I wouldn’t have to buy the book myself (which I would have done anyway)J. Tomislav has been one of the most prominent SSAS MVPs, quite active on the MSDN Forums and writing interesting posts on his blog. I was not surprised that he has been able to deliver excellent practical advice for his readers in this cookbook and I expect to be using it often during SSAS implementations. Let’s see what you could expect from this book if you buy it.

Firstly, it is a cookbook – Tomislav has written recipes for a large amount of real-world scenarios. I must admit that I did not read absolutely all of them. I went through the ones I think I know well and compared Tomislav’s versions to the ones I would typically use. The topics are very well covered and follow a pattern of: Getting Ready -> How to do it -> How it works -> There’s more -> See also. First, we get an introduction, and then we prepare to get the work done. After that we get a step-by-step walkthrough for the actual solution. I liked the next section “How it works”. Here we get an explanation of why we get the result letting us get some more insight rather than blindly typing code. I find it a very nice touch and I applaud the author for spending the time to include this invaluable bit of information. The “There’s more” section after that expands a bit the topic, trying different options and showing what happens when we apply them. In the end of each chapter we have a section showing which other topics are related to the one we are currently reading. All in all, I think that Tomislav’s cookbook is very, very, well written.

In general, I would say that if you want to learn MDX and how it works, you can start with a textbook-style book, which shows the basics, explains them and then builds up on them. However, if you need to work with MDX now, or if you do have some knowledge but you lack practical skills, then this book is a real gem. I would definitely recommend buying Tomislav’s cookbook, and if you are interested in getting more theoretical detail on why things work the way they do, either do a lot of blog reading, or buying another MDX book to serve as a companion to this one.

In brief, this is the best MDX cookbook out there and offers excellent practical advice over a variety of topics.

MVP Deep Dives vol.2 (link)

The second book I would like to introduce is another “must buy”. Imagine the effort required to assemble a book from 53 different authors who live in different parts of the world and are not all native English speakers (like me). Well, Kalen Delaney has made the effort…twice…and this is one of the reasons for having this book published in the next couple of weeks. Another reason is the motivation which the authors found in donating their work to Operation Smile. We are not profiting from it and we hope that you will also buy the book not only because of the outstanding content, but also because the money you spend will be used to help children with facial deformities all around the world.

The list of authors speaks for itself and I am very flattered to be able to participate with a chapter entitled “Managing Context in MDX”. The book will be out by the beginning of PASS Summit 2011 and there will be a book signing session, where you can get your copies signed by some of the authors. Come, drop by if you are there to say hello and have a chat!

Building a Date Table

Date tables in themselves are nothing new. Every data warehouse and business intelligence project includes one of those and typically it is one of the first tables which we find on our list of implementation tasks. Surprisingly, however, I am finding that in many implementations the omnipresent date table cannot support some analytics. In this post I will outline a few considerations which we need to take into account when building date tables in general, and some which apply specifically to SSAS.

UPDATE (9th October 2011): If you are a PowerPivot user (or would not mind using an OData feed) there is a simple way to get a date/calendar table as there is a free one available on the Azure DataMarket. I recently blogged about this new project in my Introducing Project DateStream (CodePlex) post.

UPDATE 2 (16 October 2011): John Simon recently published a nice T-SQL script which can be used to create and populate a date table on his blog. Have a look if you need one!

Grain, Key and Format

In general, date tables are on a daily/date grain. This is because the lowest level of reporting we need to do is very often on full days. Even if we need to analyse on time intervals smaller than a day, we should still build a Date and a Time table to support this. Why? Because if we pack it all in the same table we end up with too many rows and this is something we typically want to avoid in any dimension. As we have a row per day, one of our columns is inevitably based on the calendar date. It is also typically the dimension key. As with all other dimension tables, we need to be able to join the table to our fact tables and this is done on our key. To keep our fact tables as small as possible, we need the smallest possible data type for our keys, and in the case of a Date this is usually an integer. However, unlike with other dimensions, we do not have to have a meaningless integer for our surrogate key because it does not give us any advantage. It is very easy to convert between a datetime and an int and in SQL Server datetime is 8 bytes, while int is 4. In SQL 2008+ we have another type suitable for our date table key – date, which is 3 bytes. However, for compatibility reasons, it is frequently better to stick to an int. Luckily, we also have the very convenient ANSI/ISO 8601 standard to follow in order to make our integer dates easier to work with. If we do encode them with the YYYYMMDD format, (thus the 15^th of January 2011 becomes 20110115), we can easily sort on the column and compare two dates with the standard > and < operators. Additionally, we also escape the ambiguities around date formats in the USA and the rest of the world.

Attributes

Building the date key column is a good start but to allow for better analytical capabilities we need to add other attributes. As in most cases we need to analyse our data not only on days, but also on larger intervals (e.g. Months, Quarters and Years), we need to add columns for each of them. A typical requirement is to support both Financial and Calendar date hierarchies. As with dimensions it is ok to have many columns, we can add one for each combination of the type of calendar and period (e.g. FinancialMonth and CalendarMonth, FinancialQuarter and CalendarQuarter, etc.). Other interesting attributes we may want to have materialised in the table are the likes of Weekends, Public Holidays (even though we need to maintain these), Solstice/Equinox and PhaseOfMoon (thanks for Thomas Kejser for mentioning these in an online conversation recently). Basically, for a fully functional date table we need to consider anything which could be of business value without going to extremes.

Of course, when working with SSAS we also want to have an integer key for each attribute and possibly a common name for it. This multiplies the number of columns we need by two – one Id and one Name column for each of the attributes we have decided to implement. Some attention needs to be spared when determining the formats of each attribute Id. If we are to be building a hierarchy out of a collection of attributes we want those to form a nice natural hierarchy. That is – each child member should not have two parents with different Ids. To illustrate the concept, let’s consider two different months – January 2011 and January 2012. While they are the first month of the calendar year, they represent different entities when reporting. We do not aggregate data to January only, but to January in each respective year. Therefore, if we were to write a SQL query to get data for January 2011 and our Ids for both of these members are 1, we would also need to use the Year attribute to get the values we need. Effectively, our hierarchy would have two children with identical Ids of 1 with different parents (with Ids of 2011 and 2012). This is a particular problem when working with SSAS as it definitely prefers unique Ids for each separate attribute member. There is a very easy solution – we can include both the Year and the Month number in the MonthId. In our case, January 2011 gets an Id of 201101 and January 2012 – 201201. Now we don’t have the same issue. Similarly, we must pay attention when we construct the Ids of other attributes which participate in our hierarchies like Quarter (YYYYQQ) and Half Year (YYYYHH).

Weeks

The week is a special case in the calendar. While we can say that Days roll up to Months, which roll up to Quarters, which in turn roll up to Half Years and Years, Weeks are a different story. If we are not using a special calendar like a 4-4-5 calendar, we are dealing with a real-world entity, which does not naturally belong to only one larger period. In most cases we do not have to worry about the week falling between Day and Month as business understands that this is not a very easy to work with hierarchy. However, business users very often are ignorant about the fact that weeks do not roll up nicely to years, too. We can again read ISO 8601, which also deals with the Week->Year problem. Basically, the ISO has come up with a simple solution – if we have a week which has dates in two years, we count it towards the year which contains more dates (or, the year which contains the Thursday of the week). Why is this important? Well, we can simply split a week in two, however, this means that certain weeks in our table contain less than 7 days. If we compare such weeks with a Weekly Growth calculation we will notice a problem – the amounts aggregated to them are smaller than usual. Similarly, on a graph showing weekly amounts, we have a dip as the two parts of the same week are plotted as separate points. If the users do not care, then it is not a problem, but it is something we should consider and ask about when building the week attribute in our date table.

Ranges

The date table has a limited amount of rows and we have to make a decision on how many are enough. Some developers build date tables with ranges all the way from 1900 to 2100, or even further. As we have roughly 365.25 days per year (note how 1900 is not a leap year, not is 2100 – something Excel doesn’t know), for 10 years of data we end up with ~3652 rows. With 100 years we have 36525 rows. It is quite doubtful that a data warehouse will contain data for all these years. In most cases it contains a bit of historical data and is designed to keep data for another 20-50 years. To optimise the performance of the queries using the date table it is a good idea to have a dynamic range of dates. In our ETL we can easily keep expanding the range as needed, or collapse it if we ever purge fact data. One thing which is often overlooked when picking the date range is the fact that many times queries depend on the completeness of the date range. A major mistake, which we must avoid are basing the date table on fact data and allowing gaps in the range, and having incomplete top-level periods.

The first problem is easy to explain. Zealously trying to keep the date dimension as small as possible by reducing the number of days (i.e. rows) to only the applicable ones for our fact data can introduce gaps in the overall table. In example, if we have no data for 26^th of January because our business was closed for Australia Day, and we “optimise” the date table to exclude this day from the date table, all analytics which depend on counting the number of dates in January will be skewed. An average calculation doing Sales/NumberOfDaysPerMonth will divide by 30, not 31. One of the reasons for having a date table at the first place is to avoid such problems. Hence, gaps in the date table must be avoided.

Secondly, we must also ensure that all top-level periods (i.e. Year usually is the top level in the Calendar Hierarchy) must be also complete. It is not acceptable to cut off the date range with a month, so we have a constant number of dates. Even if our fact data is implemented over a fixed window of 24 months (rolling), we should not do the same with the date table. In example, if we are in September 2011 and we have fact data for 24 months prior to this (e.g. September 2009 – September 2011), the date table should also include data for months prior to that (e.g. January 2009 – September 2011). The reason is the same as before – all calculations doing Amount/NumberOfPeriodsWithinAPeriod would be wrong for 2009. Furthermore, if we use something like window functions in SQL partitioning by year and counting the months within the period to compare equivalent ones will be incorrect. The first month of 2009 would be September and the first in 2010 – January. Because of such issues, it is best to keep all top level periods complete. A solution could be removing data from the date table once we are not interested in the complete top level period. In the case of the 24 rolling months, we can remove all of 2009 with all its periods/rows once we move to 2012 and we are interested in January 2010 – January 2012.

In SSAS all date functions like ParallelPeriod and ClosingPeriod work with relative periods. To get the ParallelPeriod for January 2010 in 2009, SSAS will determine that January 2010 is the first month in 2010, go back to 2009 and pick the first month there. If the first month is September 2009, we will get that returned.

Unknowns

As with other dimensions, we should also have an Unknown member in the date dimension. While with all other dimensions we usually pick a value like -1 for the Unknown member Id and Name of “Unknown”, with a date table things a slightly different as we often have columns which are some sort of a date/time data type and we cannot cast -1 to a date in a meaningful way. Therefore, in many cases we can pick a complete outlier, like 19000101, which we use for storing Unknowns. Depending on the scenario, we may not need to do that, but if we do, there is nothing wrong with doing that as long as we make sure that it is clear that it is a special case. If the data starts in 2001 and we pick 20000101 as an unknown value many users will be wondering why there is some data against that date.

Conclusion

The best date table is the one we don’t notice and take for granted. Given that it is not something that changes at all, the structure of the date table is the one common over many different implementations and whether in SQL Server, SSAS or PowerPivot, the fundamentals stay the same. It can be shared and is part of the base of virtually every BI solution. It, unfortunately, is built wrong very often, thus severely impairing the capability of the system it is a part of. With the arrival of self-service BI I could imagine an increasing need for practical advice on this specific data modelling technique and I hope this article helps with delivering a bit of it.

A Closer Look at PALO and GPGPU

Last week I had the pleasure to meet a friend of mine, who formed a company I wrote about a year or two ago. His business has grown nicely since then and they have become the number one PALO partner in Australia. For those who are not aware of Jedox and PALO, I would recommend visiting their website at www.jedox.com – it is an open source BI suite very similar to SQL Server, minus the relational part. Since I was given a private show (no, nothing immoral here) in their corporate setup, I thought it may be interesting to discuss what I saw here in this post.

There are a few interesting and vastly different aspects of PALO when compared to the SQL Server BI stack:

GPGPU

For me the best feature they have is the General-Purpose GPU support in the OLAP server. While the OLAP components can be queried through MDX much like SSAS, they solve query bottlenecks with raw power. As far as I am aware, PALO supports CUDA, or the NVIDIA implementation (ATI have their own) of the GPGPU vision. If this all sounds a bit foreign, have a look at Tom’s Hardware article “The Advent of GPGPU“, where the concept of using the GPU for computational purposes is explained in a fair bit of detail. In short, by harnessing the power of NVIDIA GPUs, the processing power of a PC jumps from a few GFLOPs (50-60 GFLOPs on my i7 2600K OC-ed to 4.5Ghz) to 1500-1800 GFLOPs on my NVIDIA GTX 570 GPU. This means that for GPU optimised calculations, a PC gets a boost of a factor of 30. Both NVIDIA and ATI can see the potential and have been working hard in the last few years to get better drivers and better support for such applications. PALO in particular prefers the NVIDIA Tesla GPU. Note that a Tesla does not even have video output – it is used only for calculations, supports ECC memory (thus making itself ready for enterprise environments), and has been designed from the ground up for CUDA.

In terms of PALO, I got told that when they have an optimised query performing badly, adding a new Tesla unit in the server solves the problem. Their experience shows that the servers scale up linearly with every new GPU, and since NVIDIA’s SLI allows multiple GPUs running in parallel, adding 2-4 such units is all it takes to create a very, very fast computational workhorse.

Tablet Apps

Another area where I was impressed was the way PALO does mobile. They have free apps for the iPad (which I saw in action), as well as the iPhone and Android. Their vision is that information dashboards are best seen, and mostly required on the go when BI users have limited ability to browse around and get a deeper insight. I tend to agree to some extent. In my experience, the information dashboard is a slightly overrated concept. Having it on your phone or tablet where you can easily connect to you corporate environment and check some numbers quickly is a nice idea and I hope we see it becoming a part of the Microsoft stack sooner rather than later. The application which PALO have is quite nice minus the pies, allows any form of touch experience (multi-touch included) and allows easy slicing and dicing of data – just how it should be.

Open Source Software Compatibility

The last bit I would offer as an impressive and different to other not-open source vendors is the openness and compatibility of PALO with other open-source tools. Their stack components are easily replaceable. The ETL component can be changed to Pentaho’s Kettle, or JasperSoft’s ETL software which can load data directly in PALO’s cubes. A bit like loading a SQL Server data mart with Informatica, but seemingly better and tighter as the interfaces between the components are, apparently, completely open.

Apart from these areas, I think that the Microsoft stack has a nicer UI, allows easier development, and is richer (with MDS, QDS coming up, Data Mining, etc.). PALO has its own ETL tool, which is not graphical and relies on drop-downs and various windows to get the work done, the OLAP server seems to support many features out of the box, allows querying through MDX and supports write-back, but in general seems quite barren from SSAS point of view. The front-end is either Excel through a plug-in allowing the creation of reports through formulas, Open Office, Libre Office, and PALO’s own web-based spreadsheet environment. Once a report is created in either of those it can be published to a web portal for sharing with other users.

All in all, PALO is a neat, free BI suite, which comes for very cheap initially. There is an enterprise version, which is not free and, of course, any new customers will have to pay for someone to install it, configure it, and implement their requirements which will add to the total cost but these expenses are there for any other set of tools (although, a decent argument can be lead on which suite allows faster and cheaper development). The features listed in this article definitely appeal to some and I am very impressed by the innovative GPGPU capability, which has a lot of potential and I can easily think of a few areas where a 30-fold improvement in computational power will benefit SQL Server BI.

SSAS Myths Dispelled

This post is an attempt to dispel a few myths which seems to get repeated over and over among SSAS developers. While the truths are nothing new and have been documented in multiple sources like BOL, SQL CAT whitepapers, books and blog posts, they seem to consistently escape the attention of the wider public.

1 SSAS pre-aggregates data by default

While it is true that SSAS can pre-aggregate data this does not happen by default. SSAS compresses data, indexes data and caches data but it does not pre-aggregate data unless we define aggregations. When I am saying pre-aggregate I mean that SSAS does not automatically know what the Internet Sales Amount for Australia in 2007 is. It needs to get the leaf-level data and sum it up; unless we have built an aggregation on Country and Year for the partition containing the Internet Sales Amount measure. In that case the data is pre-aggregated and ready for retrieval.

2 We can emulate in MDX at no cost Enterprise Edition functionality in Standard Edition

Well, we can’t. We can emulate certain features like LastNonEmpty aggregation functions for example, but it comes at a cost. The cost usually relates to Storage Engine (multi-threaded) vs Formula Engine (single-threaded) execution.

3 SSAS is always faster than SQL Server RDBMS

While it is true that SSAS is faster than SQL Server RDBMS in many cases, this does not always hold true. A particular area in which the relational engine beats SSAS is the retrieval and processing of low-level granular data. SSAS usually beats the RDBMS when it comes to ad-hoc access to aggregated data.

4 MOLAP is always faster than ROLAP

If you read SQL CAT’s “Analysis Services ROLAP for SQL Server Data Warehouses” whitepaper you can see that after careful tuning ROLAP can be faster than MOLAP. Not always, but sometimes – enough to claim that it is not true that MOLAP is always faster than ROLAP. This ties a bit to the previous myth and proves that a well tuned RDBMS can perform very well with aggregates.

From the paper:

“At last, SQLCAT’s redesign and optimization efforts paid off. The ROLAP cube was finally ready for performance testing, and thanks to the amazingly fast performance of the relational SQL Server engine on top of a super-fast storage subsystem, the results looked better than expected. To everybody’s surprise, the ROLAP cube outpaced the MOLAP cube in 45 percent of all queries right from the start (see Figure 14). Only 39 percent of the queries showed substantially slower response times in ROLAP mode (more than twice the amount of MOLAP time) and 16 percent showed moderate performance degradation (less than twice the amount of MOLAP time).”

5 Usage Based Optimisations do not work well

In SQL Server Analysis Services 2008 the Usage Based Optimisation (UBO) algorithm has been redesigned. Now it works, and it works well. It does not create redundant aggregations and in general performs much better. Building UBO aggregations has always been recommended by Microsoft and even more so now.

6 Rigid attribute relationships boost performance

Whether an attribute relationship is Rigid or Flexible does not actually improve performance at all. Not query performance. A wrong choice here only affects processing of partition indexes. If an attribute relationship is static, setting it to Rigid means that you do not have to process partition indexes when you update the dimension. This is all the benefit you get from Rigid relationships. Going too far and marking changing relationships to Rigid may have a very negative impact as a change will prompt a complete process of the partition data and indexes, which will take much longer than updating just the indexes. Either way, there is no difference during query execution.

7 MDX and DAX are hard

I believe that this particular myth stems from the fact that we get to compare MDX and DAX to sweet and fluffy languages like SQL and C#. It all depends on the vantage point. Take the following “Hello world!” program in Malbolge for comparison purposes:

(‘&%:9]!~}|z2Vxwv-,POqponl$Hjig%eB@@>}=<M:9wv6WsU2T|nm-,jcL(I&%$#”
`CB]V?Tx<uVtT`Rpo3NlF.Jh++FdbCBA@?]!~|4XzyTT43Qsqq(Lnmkj”Fhg${z@>

MDX is not all that bad from a Malbolge developer’s point of view, is it?

Custom Groups in Excel 2007 – Error

I just finished digging around a particular issue with Excel 2007 (some versions) and SSAS Pivot Tables. In brief, the issue was that a user could not use the custom groups functionality which Excel provides because she got an error saying:

“The query did not run, or the database table could not be opened. Check the database server or contact your administrator. Make sure the external database is available and hasn’t been moved or reorganized, then try the operation again.”

I added her to the server administrators, but the message persisted. After profiling I noticed that the MDX generated by Excel 2007 for this operation read:

CREATE SESSION CUBE [Cube_XL_GROUPING0] FROM [Cube] ( DIMENSION [Cube].[Agency].[Agency Hierarchy] HIDDEN AS _XL_GROUPING0,DIMENSION [Cube].[Agency].[Flag],DIMENSION [Cube].[Agency].[Region],DIMENSION [Cube].[Collection].[Application],DIMENSION [Cube].[Collection].[Application Code],DIMENSION [Cube].[Collection].[Data Collection Code],DIMENSION [Cube].[Data…

Error: “Parser: The syntax for ‘DIMENSION’ is incorrect.”

I have highlighted the problematic part – the MEASURE part of this expression was missing. A correct MDX statement issued by another instance of Excel 2007 running on a different machine showed:

CREATE SESSION CUBE [Cube_XL_GROUPING1] FROM [Cube] ( MEASURE [Cube].[Value – Data Integer Quarter] HIDDEN,MEASURE [Cube].[Value – Data Integer Semi] HIDDEN,MEASURE [Cube].[Value – Data Integer Annual] HIDDEN,MEASURE [Cube].[Value – Data Integer Month] HIDDEN,MEASURE [Cube].[Value – Data Real Quarter] HIDDEN,MEASURE [Cube].[Value – Data Real Month] HIDDEN,MEASURE [Cube].[Value – Data Real Annual] HIDDEN,MEASURE [Cube].[Value – Data Money Semi] HIDDEN,MEASURE [Cube].[Value – Data Money Month] HIDDEN,MEASURE [Cube].[Value – Data Real Semi] HIDDEN,MEASURE [Cube].[Value – Data Money Quarter] HIDDEN,MEASURE [Cube].[Value – Data Money Annual] HIDDEN,DIMENSION [Cube].[Agency].[Agency Hierarchy] HIDDEN AS _XL_GROUPING0,DIMENSION [Cube].[Agency].[Pub Pte Flag],DIMENSION [Cube].[Agency].[Region],DIMENSION [Cube].[Collection].[Application],DIMENSION [Cube].[Collection].[Application Code],DIMENSION [Cube].[Collection].[Collection Code],DIMENSION [Cube].[Element].[Common Name],DIMENSION [Cube].[Element].[Data Element Code],DIME…

Here we have the cube measures as a part of the CREATE SESSION CUBE statement and this makes it a valid one. The reason for this seems to be the fact that all the physical measures in the cube were hidden and only one calculated measure was shown to the users. Excel (2007 Enterprise) seemed unable to find them, so the fix was easy – creating a visible dummy physical measure and using a scope assignment to make it work like the calculated one. Now Excel merrily creates valid MDX and my user is happy.

I understand this will be a very rare problem, but it takes some time to investigate, so I hope the post may help someone out there.

MDX Subselects – Some Insight

Recently I answered a question on StackOverflow, which may be an interesting, common case and understanding it correctly helps understanding subselects better.

Let’s examine the following MDX query:

Here we effective place the All member of the Customer.Education hierarchy on columns, and an ordered set of the countries in the Customer dimension on rows. We have a subselect and a slicer. The slicer contains the Internet Order Count measure (which places is it in context) and the slicer restricts the query to look at customers with Partial High School education only.

When we execute the query we get exactly what we expect.

Now, let’s change this to:

The difference here is the All member in the tuple used as a second argument of the Order function. What we get is a different order (note that Germany and France are on different positions on rows). It seems like the set is ordered by the total amount for all customers without taking their education in consideration – the subselect does not make a difference. However, the measure value is the same as in the first query. So, what causes this?

To answer, we need first to understand what a subselect does. As Mosha posted a while ago, a subselect (which is really the same as a subcube) does implicit exists with the sets or set expressions on each axis and also applies visual totals “even within expressions if there are no coordinate overwrites”.

Instead of explaining the same as what Mosha has already spent the effort to explain, I will advise you to read this post thoroughly. Then, you would understand what the “implicit exists” does. In our case, it is not as important as the visual totals part, so I will concentrate on it.

The reason why the set gets ordered by orders made by customers with partial high school education is the visual totals. It takes place within the Order expression. The visual totals also restrict what we see in the query results, as it applies to the slicer axis measure, as well. However, in this case the catch is in the “if there are no coordinate overwrites” part of the story. The visual totals in Order does not get applied because we explicitly overwrite the Customer.Education hierarchy member within the tuple in the Order function:

Therefore, the set of countries gets ordered by the Internet Order Count for All Customers without taking into account the subselect. However, the measure on the slicer axis still gets the visual total logic applied to it, and this causes the result set to be for those customers who have Partial High School education.
If we re-write the second statement to:

We can see that the NON VISUAL keyword (SSMS can’t recognise this syntax and underlines with a red squiggly line) changes the way the measure is displayed. We see the total amount for the slicer, as well.

Similarly, if we re-write the first query with NON VISUAL we get:

Here both visual totals are excluded and both the Order and the result set are done for customers with any education.

Hopefully this will clarify the logic applied when we use subselects to some extent.