Archive

Archive for the ‘PowerPivot’ Category

Range Queries with Azure DataMarket Feeds

October 17th, 2011

By default the Azure DataMarket does not allow range queries. In fact, the only way we can filter a data feed is through specifying one or more values for the “queryable” fields specified for it. There is not technical reason behind not allowing range queries as both the back-end (presumably SQL Azure, or SQL Server) and the OData protocol support them. Fortunately, there is a way to consume a range of the values in a column of a data feed in PowerPivot. It is not straight-forward and I do not think that the target audience of both self-service BI/PowerPivot and the DataMarket itself would appreciate the complexity, but it could be useful anyway.

If we want to pull all three tables from the DataMarket we can simply use https://api.datamarket.azure.com/BoyanPenev/DateStream/ as the URL in PowerPivot:

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Otherwise, we can pick each one with a URL like (for the BasicCalendarEngish table):

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish

If we filter the data set on the DataMarket website to request only the data for 2010 we get the following URL:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=YearKey%20eq%202010

Note the last bit:

?$filter=YearKey%20eq%202010

This is simply the URL/OData encoded ?$filter=YearKey = 2010

In OData we can also use other operators, not just = (or eq). For ranges these are gt (greater than), ge (greater than or equal to), lt (less than) and le (less than or equal to). We can also use and and or operators to combine different predicates. For a more thorough list, please refer to http://www.odata.org/developers/protocols/uri-conventions. If we replace the ” = 2010″ with ” < 2010″ and then encode the URL, we do indeed get all years prior to 2010. Things get slightly more complicated when we have a more complex scenario. In example, when building a date table we may want to include all years between 2000 and 2030. To do that, we would have to write something like:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=YearKey >= 2000 and YearKey <= 2030

encoded, the same looks like this:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=YearKey%20ge%202000%20and%20YearKey%20le%202030

Here space is %20 and the math comparison operators have been replaced with the OData operators (in red).

If we paste this in PowerPivot and hit “Next”:

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

…we get exactly what we expect – a table with 30 years.

Things get more complicated if we include the datetime DateKey in the URL. For a single date (e.g. 1900-01-01), we have to use:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=DateKey = datetime’1900-01-01T00:00:00′

After Applying URL encoding we get:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=DateKey%20eq%20datetime%271900-01-01T00%3a00%3a00%27

Where %27 is apostrophe and %3a is a colon (for a list of ASCII characters and their URL encoded form we can refer to http://www.w3schools.com/tags/ref_urlencode.asp).

Now, to combine the two we would need to write:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=DateKey = datetime’1900-01-01T00:00:00′ or (YearKey >= 2000 and YearKey <= 2030)

Encoded this becomes:

https://api.datamarket.azure.com/Data.ashx/BoyanPenev/DateStream/BasicCalendarEnglish?$filter=DateKey%20eq%20datetime%271900-01-01T00%3a00%3a00%27%20or%20%28YearKey%20ge%202000%20and%20YearKey%20le%202030%29

This monstrous-to-write URL string returns 30 years of data + 1 day.

I suppose this approach can be classified as a workaround, as I have not seen any documentation on PowerPivot referring to any options for filtering data from the Azure DataMarket. However, in my opinion, this should be a feature of the DataMarket itself as it would make it easier/possible for users with any tool to get just the data they need and even possibly reduce the load on the site service since it will no longer be necessary to export everything and then attempt to apply a filter.

PowerPivot , , ,

 

Introducing Project DateStream (CodePlex)

October 6th, 2011

I recently blogged about The Case for an Azure DataMarket Date Table. I finished the blog post with a bit of a critique of the DataMarket team at Microsoft, which I can now wholeheartedly apologise for. This is because since my last post I was contacted by Max Uritsky who is a Group Program Manager on the Windows Azure Marketplace DataMarket team (a long name for a team, yes). He and Belinda Tiberio managed to help me with creating and hosting a new Date feed. Not only they helped with making it available for free on the DataMarket website, but also gave me a 1Gb free SQL Azure database for the project. A big “thank you” goes to Julie Strauss from the SSAS team for making the contact, as well.

To summarise, the DateStream project is a free date table available as a feed and intended to be used by PowerPivot BI users. As most, if not all, BI projects include a date table, the goal is to provide an easy-to-use, correct and simple mechanism for creating such tables.

After some deliberations on the format of the feed we decided that it would be best to split the feed in a number of localised Basic tables and one Extended table. Currently we have only two Basic (US and EN) versions and a fairly straight-forward Extended one. However, the plan is to inlclude more and more column in the Extended one and provide extra localised (Italian, French, Spanish, German, Russian, etc.) tables. When I am saying “we” I mean fellow SSAS professionals, which I had the pleasure to discuss this idea (among which Marco Russo, Thomas Kejser, Willfried Faerber and Grant Paisley).

The CodePlex page of the project contains more details about the feed and the tables and also allows for commenting on existing features, as well as requesting new ones. It can be found at http://datestream.codeplex.com/.

The actual feed can be used directly from https://datamarket.azure.com/dataset/1542c52d-d466-4094-a801-9ef8fd7c6485.

Note the logo – it was created by Daniele Perilli from SQL BI with the assistance provided by Marco Russo. Thanks to them we did not have to resort to my graphic design skills, which definitely is a win for the DataMarket website.

One note - please let me know if the performance you get from the feed is not satisfactory (please remember that once pulled, the feed does not need to be refreshed as the data will remain valid forever). If many people agree that it is too slow I could potentially host it on an alternative location as well. It is possible to download a CSV version from the DataMarket as a workaround, which also allows removing unnecessary date ranges.

PowerPivot , , , ,

 

Building a Date Table

September 19th, 2011

Date tables in themselves are nothing new. Every data warehouse and business intelligence project includes one of those and typically it is one of the first tables which we find on our list of implementation tasks. Surprisingly, however, I am finding that in many implementations the omnipresent date table cannot support some analytics. In this post I will outline a few considerations which we need to take into account when building date tables in general, and some which apply specifically to SSAS.

UPDATE (9th October 2011): If you are a PowerPivot user (or would not mind using an OData feed) there is a simple way to get a date/calendar table as there is a free one available on the Azure DataMarket. I recently blogged about this new project in my Introducing Project DateStream (CodePlex) post.

UPDATE 2 (16 October 2011): John Simon recently published a nice T-SQL script which can be used to create and populate  a date table on his blog. Have a look if you need one!

Grain, Key and Format

In general, date tables are on a daily/date grain. This is because the lowest level of reporting we need to do is very often on full days. Even if we need to analyse on time intervals smaller than a day, we should still build a Date and a Time table to support this. Why? Because if we pack it all in the same table we end up with too many rows and this is something we typically want to avoid in any dimension. As we have a row per day, one of our columns is inevitably based on the calendar date. It is also typically the dimension key. As with all other dimension tables, we need to be able to join the table to our fact tables and this is done on our key. To keep our fact tables as small as possible, we need the smallest possible data type for our keys, and in the case of a Date this is usually an integer. However, unlike with other dimensions, we do not have to have a meaningless integer for our surrogate key because it does not give us any advantage. It is very easy to convert between a datetime and an int and in SQL Server datetime is 8 bytes, while int is 4. In SQL 2008+ we have another type suitable for our date table key – date, which is 3 bytes. However, for compatibility reasons, it is frequently better to stick to an int. Luckily, we also have the very convenient ANSI/ISO 8601 standard to follow in order to make our integer dates easier to work with. If we do encode them with the YYYYMMDD format, (thus the 15th of January 2011 becomes 20110115), we can easily sort on the column and compare two dates with the standard > and < operators. Additionally, we also escape the ambiguities around date formats in the USA and the rest of the world.

Attributes

Building the date key column is a good start but to allow for better analytical capabilities we need to add other attributes. As in most cases we need to analyse our data not only on days, but also on larger intervals (e.g. Months, Quarters and Years), we need to add columns for each of them. A typical requirement is to support both Financial and Calendar date hierarchies. As with dimensions it is ok to have many columns, we can add one for each combination of the type of calendar and period (e.g. FinancialMonth and CalendarMonth, FinancialQuarter and CalendarQuarter, etc.). Other interesting attributes we may want to have materialised in the table are the likes of Weekends, Public Holidays (even though we need to maintain these), Solstice/Equinox and PhaseOfMoon (thanks for Thomas Kejser for mentioning these in an online conversation recently). Basically, for a fully functional date table we need to consider anything which could be of business value without going to extremes.

Of course, when working with SSAS we also want to have an integer key for each attribute and possibly a common name for it. This multiplies the number of columns we need by two – one Id and one Name column for each of the attributes we have decided to implement. Some attention needs to be spared when determining the formats of each attribute Id. If we are to be building a hierarchy out of a collection of attributes we want those to form a nice natural hierarchy. That is – each child member should not have two parents with different Ids. To illustrate the concept, let’s consider two different months – January 2011 and January 2012. While they are the first month of the calendar year, they represent different entities when reporting. We do not aggregate data to January only, but to January in each respective year. Therefore, if we were to write a SQL query to get data for January 2011 and our Ids for both of these members are 1, we would also need to use the Year attribute to get the values we need. Effectively, our hierarchy would have two children with identical Ids of 1 with different parents (with Ids of 2011 and 2012). This is a particular problem when working with SSAS as it definitely prefers unique Ids for each separate attribute member. There is a very easy solution – we can include both the Year and the Month number in the MonthId. In our case, January 2011 gets an Id of 201101 and January 2012 – 201201. Now we don’t have the same issue. Similarly, we must pay attention when we construct the Ids of other attributes which participate in our hierarchies like Quarter (YYYYQQ) and Half Year (YYYYHH).

Weeks

The week is a special case in the calendar. While we can say that Days roll up to Months, which roll up to Quarters, which in turn roll up to Half Years and Years, Weeks are a different story. If we are not using a special calendar like a 4-4-5 calendar, we are dealing with a real-world entity, which does not naturally belong to only one larger period. In most cases we do not have to worry about the week falling between Day and Month as business understands that this is not a very easy to work with hierarchy. However, business users very often are ignorant about the fact that weeks do not roll up nicely to years, too. We can again read ISO 8601, which also deals with the Week->Year problem. Basically, the ISO has come up with a simple solution – if we have a week which has dates in two years, we count it towards the year which contains more dates (or, the year which contains the Thursday of the week). Why is this important? Well, we can simply split a week in two, however, this means that certain weeks in our table contain less than 7 days. If we compare such weeks with a Weekly Growth calculation we will notice a problem – the amounts aggregated to them are smaller than usual. Similarly, on a graph showing weekly amounts, we have a dip as the two parts of the same week are plotted as separate points. If the users do not care, then it is not a problem, but it is something we should consider and ask about when building the week attribute in our date table.

Ranges

The date table has a limited amount of rows and we have to make a decision on how many are enough. Some developers build date tables with ranges all the way from 1900 to 2100, or even further. As we have roughly 365.25 days per year (note how 1900 is not a leap year, not is 2100 – something Excel doesn’t know), for 10 years of data we end up with ~3652 rows. With 100 years we have 36525 rows. It is quite doubtful that a data warehouse will contain data for all these years. In most cases it contains a bit of historical data and is designed to keep data for another 20-50 years. To optimise the performance of the queries using the date table it is a good idea to have a dynamic range of dates. In our ETL we can easily keep expanding the range as needed, or collapse it if we ever purge fact data. One thing which is often overlooked when picking the date range is the fact that many times queries depend on the completeness of the date range. A major mistake, which we must avoid are basing the date table on fact data and allowing gaps in the range, and having incomplete top-level periods.

The first problem is easy to explain. Zealously trying to keep the date dimension as small as possible by reducing the number of days (i.e. rows) to only the applicable ones for our fact data can introduce gaps in the overall table. In example, if we have no data for 26th of January because our business was closed for Australia Day, and we “optimise” the date table to exclude this day from the date table, all analytics which depend on counting the number of dates in January will be skewed. An average calculation doing Sales/NumberOfDaysPerMonth will divide by 30, not 31. One of the reasons for having a date table at the first place is to avoid such problems. Hence, gaps in the date table must be avoided.

Secondly, we must also ensure that all top-level periods (i.e. Year usually is the top level in the Calendar Hierarchy) must be also complete. It is not acceptable to cut off the date range with a month, so we have a constant number of dates. Even if our fact data is implemented over a fixed window of 24 months (rolling), we should not do the same with the date table. In example, if we are in September 2011 and we have fact data for 24 months prior to this (e.g. September 2009 – September 2011), the date table should also include data for months prior to that (e.g. January 2009 – September 2011). The reason is the same as before – all calculations doing Amount/NumberOfPeriodsWithinAPeriod would be wrong for 2009. Furthermore, if we use something like window functions in SQL partitioning by year and counting the months within the period to compare equivalent ones will be incorrect. The first month of 2009 would be September and the first in 2010 – January. Because of such issues, it is best to keep all top level periods complete. A solution could be removing data from the date table once we are not interested in the complete top level period. In the case of the 24 rolling months, we can remove all of 2009 with all its periods/rows once we move to 2012 and we are interested in January 2010 – January 2012.

In SSAS all date functions like ParallelPeriod and ClosingPeriod work with relative periods. To get the ParallelPeriod for January 2010 in 2009, SSAS will determine that January 2010 is the first month in 2010, go back to 2009 and pick the first month there. If the first month is September 2009, we will get that returned.

Unknowns

As with other dimensions, we should also have an Unknown member in the date dimension. While with all other dimensions we usually pick a value like -1 for the Unknown member Id and Name of “Unknown”, with a date table things a slightly different as we often have columns which are some sort of a date/time data type and we cannot cast -1 to a date in a meaningful way. Therefore, in many cases we can pick a complete outlier, like 19000101, which we use for storing Unknowns. Depending on the scenario, we may not need to do that, but if we do, there is nothing wrong with doing that as long as we make sure that it is clear that it is a special case. If the data starts in 2001 and we pick 20000101 as an unknown value many users will be wondering why there is some data against that date.

Conclusion

The best date table is the one we don’t notice and take for granted. Given that it is not something that changes at all, the structure of the date table is the one common over many different implementations and whether in SQL Server, SSAS or PowerPivot, the fundamentals stay the same. It can be shared and is part of the base of virtually every BI solution. It, unfortunately, is built wrong very often, thus severely impairing the capability of the system it is a part of. With the arrival of self-service BI I could imagine an increasing need for practical advice on this specific data modelling technique and I hope this article helps with delivering a bit of it.

PowerPivot, SSAS ,

 

The Case for an Azure DataMarket Date Table

June 6th, 2011

Since the release of PowerPivot Excel pros and power users have been encouraged to learn and “play” with the add-in. There is one little thing from the world of BI which we (as long-standing BI professionals) are used to but apparently troubles our new friends from the Excel world – the Date table. In the SSAS Multidimensional world we have BIDS which can generate a date table in a variety of formats. Still, much more common is the custom Date table, which we build through a SQL script as it enables us to dynamically generate it for a range of dates. I have also used Excel in the past for quick and dirty solutions. Although all these scenarios are very “workable” for database professionals, when it comes to Excel power-users fiddling with databases is far from ideal. Luckily, there is a better way.

Azure DataMarket

The answer I am proposing is the new DataMarket, which as an added bonus (in cases when we cannot use third-party products) is Microsoft-owned. It is integrated very well within PowerPivot since the last update and allows selecting subsets of the data (e.g. we do not need to import everything available in the data set). Because the data is in a feed format we can connect and pull data we want anytime.

The DataMarket is the vehicle but it needs a good data set to transport. With the Microsoft SSAS team being busy with new releases and unable to chase this up with the Azure DataMarket team, I tried contacting the latter directly to no avail. It would be very simple to create a sample, test and if all goes well – we could easily expand the feed to include lots of necessary columns which could simplify any PowerPivot implementation. As a start, a minimum of a calendar hierarchy with 4-5 levels should suffice, but the possibilities are very exciting. We could have multiple financial/fiscal calendars, public holidays, weekends, leap years taken care of and many other Date properties built right into the feed. If customisation is required (as it probably will be in many cases), PowerPivot developers have the Excel and DAX to play with the data through formulas and change various properties like member names and formats.

In my opinion with a miniscule development effort Microsoft can win on both new fronts – PowerPivot and the Azure DataMarket. After all, both products need more exposure and a popular Date feed will definitely help in this direction (not to mention how much easier it would be for developers to “get into” PowerPivot-based BI implementations).

PS: I have emailed the DataMarket team a few months ago and I received no reply. This remains so even after Julie Strauss from the SSAS team followed up on this – it seems like someone is sleeping on the job…

PPS: I just created a Microsoft Connect suggestion – please vote if you feel like this is a good idea. Also, if you feel even more inclined to act you can email the DataMarket guys directly asking them to pay attention through the links provided on their Contact Us page.

PowerPivot , , ,

 

SSAS to BISM – Recent Developments

May 18th, 2011

There was a fair bit of FUD around the future of SSAS and just now it got officially dispelled by both TK Anand’s and Chris Webb’s posts on the roadmap ahead of SSAS. If you haven’t read these I would definitely recommend a thorough read through both (yes, TK’s is a bit long, but we did need all of it).

After the confusion has been more or less sorted out, a brief summary could go along the lines of: “We just got an enhanced SSAS“. I have been asked a number of times about the future of SSAS and UDM. So far I seem to have gotten it right – we don’t get a tremendously successful product replaced – instead we get a new and exciting addition to it. Rumours have it that the SSAS team will soon get down and dirty with more work on all of the components of SSAS – both multidimensional and tabular. What can come out of this? – a unique mix of on-disk and in-memory analytics. Edges may have to be smoothened, and in the end of the day we get more, not less.

What caused the confusion – well, in my opinion the SSAS team may be the greatest in analytical tools but in this particular case I think that the communication from Microsoft to us was not up to par. In all of their excitement about the new toys they are building for us they did not accurately draw a roadmap, which lead to the rumours. I hope that in the future this won’t happen and the recent posts by the SSAS team show a lot of improvement in this regard.

All in all we get a BI Semantic Model – encompassing both multidimensional (previously known as UDM) and the new tabular (in-memory) modelling. These two are integrated in one BISM, which allows us to pick and choose the tools we need to deliver the best possible results. All other tools in the stack will eventually work equally well with both models and the two models will integrate well together. Of course, this is a big job for the team and I hope that they succeed in their vision since the end result will be the best platform out there by leaps and bounds.

As of today – the future looks bright and I am very pleased with the news.

PowerPivot, SSAS , , ,

 

Switch to our mobile site