Avoiding Multiple Role-Playing Date Dimensions
If we investigate the Adventure Works SSAS solution, we’ll notice that whoever built it chose to use a role-playing Date dimension for the various dates (e.g. Date, Ship Date, Delivery Date, etc.). I am not saying that we should have multiple identical date dimensions instead; that would be a very bad idea. However, we could simplify things a bit by providing one Date dimension and building a separate “Event” dimension, which represents the various events during the lifetime of an order. Possible dimension members would be Ordered, Shipped, Delivered, etc.
Let’s see how this can benefit us as developers, as well as our end-users. Firstly, we often want our cube to answer a question like: “For this date, I want to see how many orders got ordered, shipped and delivered”. If we use a role-playing Date dimension like in Adventure Works, such questions are not easy to answer. If our users work with an ad-hoc query tool like an Excel pivot table, this is not possible out of the box: they have to pick each date dimension separately and replace it with the others in turn to get the different order counts. There is no way to get the data in one pivot table at the same time, because an order will usually get Ordered and then Shipped on different dates, so if we slice by both dimensions and choose the same date in both, we will get no results. To allow this, we can build a specific measure, for example “Orders Shipped”, like this:
CREATE MEMBER CURRENTCUBE.[Measures].[Orders Shipped] AS
(LinkMember([Date].[Calendar].CurrentMember, [Ship Date].[Calendar]),
[Measures].[Internet Order Count]);
This way our users can slice only by Date and the Orders Shipped measure will show the count of internet orders shipped on the selected date. This is, however, a bit hard on the users: they would need a similar measure for each role-playing date dimension they want to query like this, and they would need to know which instance of the dimension to use for slicing. Furthermore, LinkMember is slow and better avoided. I consider this sort of measure a “hack”, which should not be in a final implementation unless absolutely necessary. So far I have found that in most cases multiple Date dimensions sooner or later lead to a scenario where someone asks for something which requires LinkMember.
Another point against using multiple Date dimensions is that a typical date dimension has approximately 2000-5000 members (I will use 3000 for convenience). If we have 3 such dimensions, the complexity of our cube grows and the theoretical cube space expands to 3000^3 = 27,000,000,000, or 27 billion cells. You can see how it grows exponentially with each additional instance of our Date dimension. Of course, this space is mostly sparse, but it still impacts performance. In general, adding dimensions increases complexity, which reduces performance.
There is another shortcoming of this approach: the case when an order is Ordered but not yet Shipped or Delivered. More often than not, developers choose to leave the dimension key in the fact table for the Shipped and Delivered dates as NULL, which SSAS will either convert to Unknown or exclude altogether. I hate NULL dimension keys. Therefore, I always prefer to replace the NULL with something meaningful, in most cases with -1, but for a Date dimension this becomes a bit different. The common approaches are to replace the date either with an outlier value (e.g. 19000101 or 99991231), or with the first or last value in the dimension (e.g. 20000101 or 20200101). Querying this type of fact table can also cause headaches. Asking “How many orders are Ordered but not yet Shipped?” in MDX has to be replaced with asking “How many orders are Ordered but Shipped on 99991231?” For tricky developers like us this is ok, but I would hate to explain it to Excel users.
Lastly, I want to discuss the issue of adding an additional “event” for the orders. For example, if we now want to record when an order gets Delivered, we have to add one more dimension key to our fact table, change our cube to include it, and potentially change some of our MDX code. This is a major inconvenience which is better avoided.
I mentioned an alternative. This is one Date dimension and an Event dimension (it could be better to call it Status in many cases). The Event dimension can contain only a few members – in our case Ordered, Shipped and Delivered. In the fact table we record each event. Therefore, when an order gets placed, we insert one row with the applicable Date and Event (in my example – Ordered). Then, when an order gets Shipped, we insert another row for the same order but with a different date and with an Event of “Shipped”.
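To make the row-per-event idea concrete, here is a minimal sketch in plain Python (outside SSAS) of turning one-row-per-order data into event rows. The sample orders, the field names and the `to_event_rows` helper are all made up for illustration; in a real solution this would more likely happen in the ETL.

```python
from datetime import date

# Hypothetical "wide" rows, one per order, as in a role-playing design.
# None marks an event that has not happened yet.
wide_orders = [
    {"order_id": 1, "Ordered": date(2010, 1, 4),
     "Shipped": date(2010, 1, 6), "Delivered": date(2010, 1, 9)},
    {"order_id": 2, "Ordered": date(2010, 1, 6),
     "Shipped": None, "Delivered": None},
]

# Members of the (assumed) Event dimension.
EVENTS = ("Ordered", "Shipped", "Delivered")


def to_event_rows(orders):
    """Emit one (order_id, event, date) fact row per event that has happened."""
    rows = []
    for order in orders:
        for event in EVENTS:
            if order[event] is not None:
                rows.append((order["order_id"], event, order[event]))
    return rows


event_rows = to_event_rows(wide_orders)
for row in event_rows:
    print(row)
```

Order 2 contributes a single Ordered row; nothing is stored for events which have not happened yet.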
Let’s see if this eliminates the problems above (hint: yes, it does):
Now users can simply drop our count of orders measure in the data area in a pivot table, then filter by a date and place the Event dimension on rows – the pivot table displays the relevant counts for the selected date. Thus, we eliminated the ambiguous bit.
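As a rough illustration of what that pivot table computes, the following Python sketch counts orders per event for one selected date. The fact rows and the `order_counts_by_event` helper are invented for the example and only mimic the shape of the event fact table.

```python
from collections import Counter
from datetime import date

# Hypothetical event fact rows: (order_id, event, date).
fact_rows = [
    (1, "Ordered", date(2010, 1, 4)),
    (1, "Shipped", date(2010, 1, 6)),
    (2, "Ordered", date(2010, 1, 6)),
    (3, "Ordered", date(2010, 1, 6)),
    (1, "Delivered", date(2010, 1, 9)),
]


def order_counts_by_event(rows, on_date):
    """Count distinct orders per event for one date (Date filter, Event on rows)."""
    seen = {(order_id, event) for order_id, event, d in rows if d == on_date}
    return Counter(event for _, event in seen)


counts = order_counts_by_event(fact_rows, date(2010, 1, 6))
for event in ("Ordered", "Shipped", "Delivered"):
    print(event, counts[event])
```

A single filter on the date gives all event counts side by side, which is exactly what the role-playing design could not do without extra measures.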
There is also no need to use unusual dimension members for events which have not happened – we simply have no fact rows for them. Therefore, when we want to query the cube for orders which are ordered but not yet shipped, we can just ask for ones which are empty when sliced by the Shipped member of the Event dimension:
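CREATE MEMBER CURRENTCUBE.[Measures].[Ordered but not Shipped] AS
-- [Order].[Order].[Order] is an assumed order-level attribute; substitute your own
Filter([Order].[Order].[Order].Members,
([Event].[Event].[Event].[Ordered], [Measures].[Order Count]) > 0 AND
([Event].[Event].[Event].[Shipped], [Measures].[Order Count]) = 0).Count;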
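The same question can be sketched outside the cube, over event rows in plain Python. The sample data and the `ordered_but_not_shipped` helper are assumptions made for illustration; the point is only that “not yet shipped” becomes the absence of a row rather than a magic date.

```python
from datetime import date

# Hypothetical event fact rows: (order_id, event, date).
fact_rows = [
    (1, "Ordered", date(2010, 1, 4)),
    (1, "Shipped", date(2010, 1, 6)),
    (2, "Ordered", date(2010, 1, 6)),
    (3, "Ordered", date(2010, 1, 6)),
]


def ordered_but_not_shipped(rows):
    """Return the ids of orders with an Ordered row and no Shipped row."""
    ordered = {order_id for order_id, event, _ in rows if event == "Ordered"}
    shipped = {order_id for order_id, event, _ in rows if event == "Shipped"}
    return ordered - shipped


pending = ordered_but_not_shipped(fact_rows)
print(sorted(pending), len(pending))
```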
We also completely avoid using LinkMember and the performance issue around cube complexity, since we have only one Date dimension and a few Event dimension members. Instead of growing exponentially, the potential cube space grows linearly with each Event (instead of 3000^3 = 27,000,000,000 cells, we get 3000*3 = 9,000 cells for three Events).
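The arithmetic is easy to check with a few lines of Python. This is only a back-of-the-envelope sketch using the assumed 3000 date members; the two function names are invented for the example, and real cube space also depends on every other dimension in the cube.

```python
DATE_MEMBERS = 3000  # assumed size of the Date dimension, as in the text


def role_playing_space(date_members, date_roles):
    """Theoretical cells spanned by several role-playing copies of Date."""
    return date_members ** date_roles


def event_space(date_members, events):
    """Theoretical cells spanned by one Date dimension and one Event dimension."""
    return date_members * events


# Compare both designs as the number of tracked events grows.
for n in range(1, 5):
    print(n, role_playing_space(DATE_MEMBERS, n), event_space(DATE_MEMBERS, n))
```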
As for the last issue, when we want to add a new event, we can simply add one more Event dimension member and start inserting rows in the fact table with its key. As simple as that.
Therefore, yes, we have eliminated the problems I discussed.
Potential drawbacks (there are always some):
- The fact table will grow faster, as we would be storing multiple rows per order. This should not normally be a huge issue, and from a pure SSAS point of view it is a non-issue, but in some scenarios it should be considered. A mitigating factor is the smaller row size as a result of eliminating the multiple Date keys, but this is not likely to offset the extra rows in most cases.
- Storing amounts against orders becomes a bit more difficult, because we may have to duplicate the amounts for each event and use something like LastNonEmpty so the amounts are not summed over time, as well as make the Event dimension non-aggregatable, so amounts do not get aggregated across two different events. Alternatively, we could move the amounts to their own fact table.
- If we also query the fact table in T-SQL, this becomes a very inconvenient model, as simple tasks such as finding the number of days between an order being Ordered and Shipped cannot be done with a simple DATEDIFF() call on a single row. This is not really a problem in SSAS, and we are talking about SSAS in this post, but you should keep it in mind.
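To illustrate the last drawback, here is a small Python sketch of the extra work needed to get the days between two events once they live in separate rows. The sample rows and the `days_between` helper are hypothetical; in T-SQL the equivalent would typically need a self-join or a pivot before the dates can be compared.

```python
from datetime import date

# Hypothetical event fact rows: (order_id, event, date).
fact_rows = [
    (1, "Ordered", date(2010, 1, 4)),
    (1, "Shipped", date(2010, 1, 6)),
    (2, "Ordered", date(2010, 1, 6)),
]


def days_between(rows, order_id, start_event="Ordered", end_event="Shipped"):
    """Return the days between two events of one order, or None if either is missing."""
    # Bring the order's event rows back onto one "line" first.
    dates = {event: d for oid, event, d in rows if oid == order_id}
    if start_event not in dates or end_event not in dates:
        return None
    return (dates[end_event] - dates[start_event]).days


print(days_between(fact_rows, 1))
print(days_between(fact_rows, 2))
```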
Stefan Riedel, a colleague of mine, helped me identify some of these problems; thanks to him this post is not as incomplete as it could have been. There could be more, and I would welcome comments on this post with possible problems and solutions, which could help other readers implement this sort of scenario better.