Archive for 2009

Data Mystification Techniques

October 15th, 2009

I am a fan of studies in Data Visualisation. It is a creative and dynamic field with a lot of room for experimentation. I consider report and dashboard design, and within that frame Data Visualisation, to be a form of practical art. Well-designed and well-built reports are critical for solution adoption and usability. In this post, however, I will concentrate on exactly the opposite topic: intentionally mystifying reports, obscuring the data and making it hard for the report consumers to reach informed conclusions. I will also show how we can make data visualisations as misleading as possible. This post is not as abstract as one may think, as it draws its examples from a very real project for which I had to build a report under heavy pressure.

Initially, I was asked to build a report based on a small data set (~450 rows) stored in an Excel workbook. The data was perfectly suitable for building a Pivot Table on top of it, so I did that and decided to use the pivot table as the source for my report. The users had one measure, Spending Amount, and a few dimensions, Category and Focus, for that spending. The request came in the form: “We want to see the Amount by Category and Focus in both graphical and numeric form”. So, I sat down and produced this prototype (with their colour theme):

As you can see from the screenshot, the report is fairly simple – the bar graphs on the top clearly show how the Amount is distributed per Category and Focus. Because of an explicit request, I also built the bar graph on the second row to show category and focus amounts combined, and in order to clarify the whole picture, I added the table at the bottom of the report. The report works for colour-blind people too, as the details for the Focus per Category Expenditure bar graph are shown directly below it in the data table.

I sent the report prototype for review, and the response was:

1. Remove the table.
2. Change the bar graphs to make them more “flat”. Meaning: there is too much difference between the bars.

The reasoning: it is “too obvious” that the data is unevenly distributed. As the report is supposed to be presented to higher-level management, discrepancies in the amount allocation would “look bad” for the people who asked me to create the report in the first place.

Adding more fake data was rejected as an option, so I was advised to prepare a new prototype report based on the new requirements. Here it is:

Now, Stephen Few would spew if presented with such a report. I did everything I could to obscure the report – I added 3D effects, presented the amounts in 100% stacked graphs where absolutely unnecessary and, of course, I had to add a Pie Chart. A 3D Pie Chart. The whole report is totally useless. It is impossible to deduce the actual numbers for the various category/focus amounts. Furthermore, the Pie Chart combines focus and category and uses virtually indistinguishable colours for the different slices. The 3D effect distorts the proportions of the slices, and if printed in black and white, or viewed by a colour-blind person, the report looks like this:
Since my goal of total mystification of the report was achieved, I sent the second prototype back.
The response was: “It looks much better, and we really like the top right bar graph and the Pie Chart. Would it be possible to leave just those two and change the Pie Chart to show data broken down just by Category?”

So, the users did not like my combined series. A point for them. I then decided to remove the 3D cone graph and all other 3D effects to make the report more readable, and created the following third prototype:

Here, we can see that the pie has the actual percentages displayed, and each category amount can be calculated from there. The stacked bar graph is again almost useless.

The users liked it, but still thought it was too revealing, and they were particularly concerned that a couple of categories (the ones with small percentages) were practically invisible on the Pie Chart. They had a new suggestion: change the pie chart back to a bar graph and play with the format so that even the smallest amount is visible. I offered an option, made another prototype, and it finally got approved. The exact words were: “The graphs are exactly what we want”. Here it is:

The Y axis in the first graph is interesting. Not only is there a “scale break”, but the scale is in fact totally useless, as it is manually adjusted because of the big gaps between the amounts. I needed two scale breaks, which for a series with 6 values is a bit too much. You can compare the normal linear scale of the first prototype I created with this one. However, it hit the mark – my users were satisfied.

Contrary to what I usually aim for, I consider this effort to be an exercise in “Data Mystification”. It is very easy to draw the wrong conclusions from the report, and it achieves the opposite of the goals of Business Intelligence: instead of empowering users, it could actually confuse them and lead them to wrong decisions.

Visualisation

 

A new BI company

September 22nd, 2009

Just a quick one – recently I learned that three former colleagues of mine have formed a new BI company in Australia – Naked Data. What makes it slightly different is that it is based on an open-source platform, and what makes this venture interesting for me is the people involved. All of them are exceptional professionals, and I believe that if anyone can make it, it is them.

I am curious!

Other

 

On the search for the perfect OLAP browser

September 22nd, 2009

Browsing Analysis Services cubes is typically done by power users in SQL Server Management Studio or Excel. However, because of a requirement stating that both intranet and non-intranet users need to be able to access and browse an OLAP cube, I had to go out there and try to find the best OLAP browser on the market. The client had been using Dundas OLAP Services with mixed success, and had purchased PerformancePoint with ProClarity but never used it because of infrastructure issues. That is why they used Dundas as an interim solution. Dundas’s OLAP Services is a user-friendly tool, but its performance is not ideal when it comes to large grids, so my client wanted something combining Dundas’s usability with Excel’s performance, which could be used both internally and externally.

I contacted Microsoft with a question: “What tools can I use?” They came back to me with the following suggestions:

  • IntelliMax Solutions OLAP Browser (an Australian partner)
  • BI Companion
  • Panorama
  • Dundas
  • ProClarity

I added PerformancePoint to the list, as well as Radar-Soft’s OLAP Browser, so my final list was:

  1. IntelliMax
  2. BI Companion
  3. Panorama
  4. Dundas
  5. ProClarity
  6. PerformancePoint
  7. Radar-Soft

IntelliMax Solutions

A sales guy told me that they would organise an online demo for me and then, if I was interested, an evaluation installation of their product on our servers. The demo happened half an hour ago, and it immediately became apparent that the product lacks one critical piece of functionality – it does not support attribute hierarchies. So, I cannot use anything but user-defined custom hierarchies because, according to the support person, including attribute hierarchies makes the grids and reports “too big”. I, however, definitely need these big grids/reports. Otherwise, the tool is nice and simple, with very obvious support for Custom Data – you can define this property from the UI, which makes it easier to work in non-Kerberos-enabled environments. It should also integrate with PerformancePoint and Reporting Services, but I did not test those parts because of the aforementioned problem with attribute hierarchies.

BI Companion

They were very friendly and responsive. I got an evaluation install, tried it out and it worked fine, apart from a small bug with non-aggregatable dimension hierarchies, which they fixed immediately. I was quite impressed with the product. It seemed to perform slightly better than Dundas, but it was also slightly less user-friendly. The interface imitates the Cube Browser and is very feature-rich. Unfortunately, the HTML version did not work with Firefox, so they suggested using a Silverlight version, which was not supported by my client. As one of my requirements was cross-browser operability (IE and Firefox at least), BI Companion became a no-go for me.

Panorama

Panorama’s NovaView product competed with PerformancePoint as a dashboard creation tool, and I have no idea why Microsoft recommended a competitor. I contacted their reseller in Australia and the sales person organised an online demo. I could play with their grid analytics components and they seemed OK. NovaView comes in two versions – a legacy Java version and a new Flash version still in development. The Flash version is quite nice, but still harder to use than BI Companion or Dundas. A big miss in the current version is that NovaView does not let the user add or remove dimension hierarchies on its grid columns. It can be done on rows, but not on columns, which is frustrating and ultimately a critical lack of functionality in my case. The Panorama support person told me that “they did not want to release all features yet” and that they have it working in their lab. He also advised me to use the Java version until the Flash version gets updated. I did, and found that the Java version is indeed fast, but not as fast as Excel and ProClarity. Furthermore, Panorama’s product is just too big for our needs, and all of its administrative functionality, which includes user management, security and what-not, is simply not necessary. Otherwise it looks nice.

Dundas

The Dundas OLAP Services grid is very usable – all the users need to do is drag and drop dimension hierarchies to browse the cubes they have access to. Unfortunately, it is a bit buggy and very slow when the number of cells to be displayed exceeds roughly 10,000. After performance testing, we found that about 50-80% of the processing time is spent on the client machine, where the browser renders a huge HTML table. It turns out that Dundas is not alone in this – every HTML OLAP tool I tested suffers from exactly the same problem, including Dundas, BI Companion and Radar-Soft, and there is nothing we can do about it. Dundas supports paging, so the results can be split into a number of pages. It performs better with that feature turned on, but exporting to Excel then requires a workaround, because by default the user needs to export all the pages individually and manually concatenate them in Excel in order to export the whole grid. Since the HTML table problem cannot be rectified by Dundas or by us, Dundas’s OLAP Services remains what it was intended to be – an interim solution.

ProClarity

Dated, but FAST. ProClarity still beats all other solutions with its performance. It was as fast as Excel and much, much faster than anything else. Although it lacks the apparent user-friendliness of some of the other solutions, it does allow users to conveniently browse OLAP cubes. Furthermore, it integrates well with PerformancePoint. Unfortunately, 6.2 is its last ever version. Whether its functionality gets implemented as part of SharePoint 2010 is still to be announced by Microsoft. The way things currently look, ProClarity is still the best solution for ad-hoc OLAP browsing. It is stable, and even though its installation is not quite smooth (e.g. it requires IIS to be running in 32-bit mode), it offers what most users would feel comfortable with. Its drawbacks: a tabbed interface, no drag and drop, and outdated graphics. Not perfect, but good enough. Oh, and it works with Firefox.

PerformancePoint

The Microsoft dashboarding tool offers an Analytic Grid part, which can be used for ad-hoc analysis. Unfortunately, it does not allow users to add more dimension hierarchies on rows or columns. What it does well is drilling up and down the hierarchies and slicing by other hierarchies. Unfortunately, this is hardly enough when it comes to full-featured ad-hoc reporting.

Radar-Soft

The small OLAP browser Radar-Soft offers seems fast at first glance. It would have been a nice little tool if it did not have one major problem – paging. It pages everything; it even creates a number of pages within cells. This improves performance, but makes it impossible to use if we want to export a whole grid to Excel. I guess that some of the paging can be disabled, but I am quite sure that, since it is HTML-based, it will have the aforementioned performance issues when it comes to displaying a relatively large HTML table.

Conclusion

The three HTML solutions – Radar-Soft, Dundas and BI Companion – all had the same issue with performance. My guess is that there is no HTML solution which is significantly faster, because in the end it comes down to browser rendering performance rather than the MDX or .NET implementation.

Panorama, with its Java and Flash versions, may be a good solution for full-featured dashboarding and reporting, but it is massive overkill for our circumstances. Also, it is not very user-friendly and not very fast.

PerformancePoint and ProClarity are my favourites, and I believe that Microsoft’s offering provides the best capabilities for our users, at least until something better comes out of the Microsoft workshop. They integrate nicely with SharePoint and perform very well. While PerformancePoint is very user-friendly, I would not say the same for ProClarity. Hopefully in the future we will see nicer OLAP browsing from Microsoft based on these two tools.

A word about Silverlight and Excel

While Silverlight OLAP browsers are available from various software houses (most notably, BI Companion has one in beta), it is a relatively new technology and was not an option for me, as it was not in my client’s SOE. Also, from my testing it does not seem to be much faster than HTML, but it definitely looks nicer.

Excel is, in my opinion, the best tool for browsing OLAP cubes inside the organisation. It is fast and provides a rich set of analytical capabilities. There is a lot of discussion on various blogs about Excel as a report authoring tool, but I believe that for ad-hoc reporting there is nothing out there that beats it. Despite this, I could not recommend it to my client, as the solution is exposed to external users, and letting them connect to our cubes directly through Excel is just impossible considering our security infrastructure. Also, Excel Services does not provide enough ad-hoc functionality, as it does not allow users to modify the Pivot Tables.

Wishes

I wish and hope that Microsoft releases a full-featured, user-friendly and performant OLAP browser as part of Visual Studio and .NET, or as part of SharePoint 2010. Such a tool is a must, and its absence is a serious gap in the Microsoft BI stack. The combination of Excel + PerformancePoint + ProClarity does the trick for now, but as ProClarity is being discontinued, something in the future toolset must take its place.

A few other bloggers wrote posts closely related to this topic:

Richard Lees – Which cube browser for Microsoft OLAP
Richard Lees – What’s preventing Excel from being the ubiquitous cube browser
Chris Webb – Proclarity Migration Roadmap (or lack thereof)

SSAS

 

Inferred Members Implementation Best Practice

August 30th, 2009

Just last week a colleague of mine and I did some very thorough research on the best implementation practices with regard to Inferred Members. We went through a few web sites and also had a look at previous SSIS implementations. There are quite a few resources on the subject, and the two best methods we found were based on Project REAL and a SQL CAT recommendation. In the end we agreed that we can recommend three approaches, each with its advantages and disadvantages, all of which can be used successfully when we need Inferred Member (late-arriving dimensions = early-arriving facts) support in a solution.

Our findings can be summarised in the following table:

Method                  | Pros                   | Cons
Stored procedures       | No need to use SSIS    | Slow; complicated development; need to link source and target data server instances
Script Component (SSIS) | Fast; high reusability | .NET skills required
Double Lookup (SSIS)    | Fast; high reusability | Performance issues with multiple dimensions

I will go through each of these methods and provide some more details about them.

1. Stored Procedures

If SSIS data flow tasks are not an option for our ETL, we can implement inferred member support through SQL code. The basic idea behind this method is:

  1. Find new dimension members in the fact data and add them to the dimension
  2. Move the fact data from source to target (i.e. staging to datamart) with surrogate keys from the dimension table

If we have our source and target data on separate instances of SQL Server, it becomes quite inconvenient to use this approach. We need to link the two instances in order to use tables from both of them in the same stored procedure. This is a major issue, and it is easily avoided by using SSIS.
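
As an illustration only, here is a minimal T-SQL sketch of the pattern; the staging, dimension and fact tables (stg.FactSales, dbo.DimState, dbo.FactSales) and their columns are hypothetical and assumed to live on the same instance:

-- Step 1: add state codes that appear in the fact data but not yet in the dimension,
-- flagged as inferred so a later dimension load can overwrite their attributes
INSERT INTO dbo.DimState (StateCode, StateName, IsInferred)
SELECT DISTINCT s.StateCode, 'Unknown', 1
FROM stg.FactSales AS s
WHERE NOT EXISTS (SELECT 1 FROM dbo.DimState AS d WHERE d.StateCode = s.StateCode);

-- Step 2: move the fact data across, resolving surrogate keys from the now complete dimension
INSERT INTO dbo.FactSales (StateSK, SalesAmount)
SELECT d.StateSK, s.SalesAmount
FROM stg.FactSales AS s
INNER JOIN dbo.DimState AS d ON d.StateCode = s.StateCode;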

2. SQL Server Integration Services

A general design of SSIS solutions can be represented with the following diagram:

(diagram: general SSIS design for inferred member support)

We can implement this in the following ways:

2.1. Script Component (Project REAL)

The Data Flow task for this approach is:

(diagram: Script Component data flow)

After looking up missing states, we pass them to a Script Component, which hashes the misses and hits the database only when a genuinely new miss occurs. We have a stored procedure in place which simply adds the new dimension member to the dimension table and returns the new surrogate key, which then gets sent to the fact table.

Using .NET we can efficiently hash all values in an object collection, and we can also handle both character and numeric data.

Because we are using a fully cached lookup, this data flow item is case-sensitive, so we should make sure we equalise the case in both the fact and reference table data before we compare the two. We should also make sure that the Script Component itself is case-insensitive, because if it is not, we will end up with multiple rows in our dimension table, one for each case variation of our inferred members.

Additionally, the Script Component task should either accept a parameter for the dimension name or (as in our implementation) read its own task name and take the first word of it to determine which dimension it is used for. In the above diagram, the Script Component task handles the State dimension, therefore its name starts with "State – ". This makes implementing multiple Script Components for multiple dimensions very easy – all we need to do is change the task name and it just works. There can also be some .NET code for auditing purposes. This is also fairly easy to implement, and entry-level .NET skills should be sufficient for development and maintenance. A sample of this approach can be found in Project REAL, and it is thoroughly discussed in the SSIS paper produced with it.

2.2. Double Lookup (SQL CAT)

(diagram: Double Lookup data flow)

Here we are doing the same as before – matching fact rows against a State dimension. If a mismatch is found in the first (fully cached) lookup, we pass the row to the second one. The New StateSK lookup is partially cached. Its purpose is similar to a hash table – it caches new rows, and when there is a brand new mismatch, it adds it to the database by executing a stored procedure. Then we Union All our inferred members with the rest of the fact data.

Because the second lookup task uses partial caching, it is case-insensitive, and case variations of the same character term will not lead to multiple dimension rows for the same member. However, the first, fully cached lookup is case-sensitive, so we should make sure that both source and reference data are in the same case, which is also more efficient. Furthermore, in SQL Server 2005 partial caching must have a specified memory limit, and if we have a large number of late-arriving dimension members, we may run out of memory for them. In that case SQL Server 2005 will start discarding the least used values from its cache, which may have performance implications. The latter problem is overcome in SQL Server 2008.

Another problem with this approach is the use of Union All tasks. These are semi-blocking and may impact performance when used multiple times in our ETL package.

For a much more detailed description, including samples, see:

SQL CAT – Assigning surrogate keys to early arriving facts using Integration Services

From our testing and research we reached the conclusion that using a Script Component is the best approach, closely followed by Double Lookups. The stored procedure approach is slow, hard to maintain, and may be impossible to implement in a production environment. Using SSIS with .NET proves to be efficient, convenient and fast. If avoiding .NET is preferable, the Double Lookup approach in SSIS is a very good alternative.

SSIS

 

Custom Rounding and Truncation of Numbers in MDX

August 2nd, 2009

Article published in SQL Server Central on 2009/03/26

In some scenarios we need to be able to round or truncate decimals to achieve correct calculation results. In SQL we have ROUND, which can do either of these. It rounds the way we are used to (0.5 rounds up to 1), it can round either up or down, and we rarely get a project where implementing our own rounding or truncation algorithm is part of the requirements.

However, in MDX we have Round(), which performs a “strange” operation – bankers’ rounding (round half to even), which our business users have usually not been exposed to; and if we decide to truncate to an integer through casting with Int or cInt, we also get some strange results. To illustrate the problem with MDX, please consider the values of these expressions:

Round(2.15, 1) = 2.2
Round(2.25, 1) = 2.2
Round(2.35, 1) = 2.4
Round(2.45, 1) = 2.4

Int(1.9) = 1
Int(-1.9) = -2
cInt(1.9) = 1
cInt(-1.9)= -2

These are usually considered wrong, because they are not the obvious results. Even though they are mathematically well founded, if we round 2.25 to 2.2, our users will come back at us with wrong numbers on their reports. Same goes for “trimming” -1.9 to -2.

To resolve the first problem with rounding, we can use our own math formula:

Fix([Measures].[???] * Factor + 0.5 * Sgn([Measures].[???])) / Factor

Where Factor is the rounding factor – 1 for 0 decimal places, 10 for 1 decimal place, and so on (the precision is 1/Factor). Of course, a Factor of 0 will give us a division-by-zero error. (Reference: http://support.microsoft.com/kb/196652)
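
As a minimal sketch, assuming a hypothetical cube and measure, the formula can be used in a query-scoped calculated member like this (Factor = 100, i.e. rounding to 2 decimal places):

WITH MEMBER [Measures].[Amount Rounded] AS
    // 0.5 always rounds away from zero; Factor = 100 gives 2 decimal places
    Fix([Measures].[Amount] * 100 + 0.5 * Sgn([Measures].[Amount])) / 100
SELECT { [Measures].[Amount], [Measures].[Amount Rounded] } ON COLUMNS
FROM [My Cube]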

If we have the Excel function libraries installed on our server, we can also simply use Excel!Round(), as Chris Webb advises.

As for the Int and cInt in MDX, we can use the Fix() VBA function to remove decimal places:

Fix(1.9) = 1
Fix(-1.9) = -1

Also, for truncation of values to a certain decimal point in MDX, we can use the following formula:

Fix(<value>*10^1)/10^1

All we need to adjust in order to change the number of decimal places is to replace 10^1 with another power of 10 – for example, to truncate 3.156 to 3.15 we can use: Fix(3.156*10^2)/10^2. To make things simpler, in all our formulas the power of 10 determines how many decimal places we round or truncate to; negative powers will give us rounding to tens, hundreds, thousands and so on, and ^0 will give us whole numbers.
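
Again as a minimal sketch, with the same hypothetical cube and measure, truncation to 2 decimal places could look like this:

WITH MEMBER [Measures].[Amount Truncated] AS
    // keep 2 decimal places, discarding the rest without rounding
    Fix([Measures].[Amount] * 10^2) / 10^2
SELECT { [Measures].[Amount], [Measures].[Amount Truncated] } ON COLUMNS
FROM [My Cube]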

Using these techniques we can avoid bankers’ rounding and the strange results of converting to integers in MDX.

 

SSAS