What I have learned in the name of unit testing of ETL, DWH, OLAP Cube and MDX

In one of our projects, we ‘somehow’ decided that every part of the system should be ‘unit test’-ed . It was a classical ETL->DWH->Cube project where web dashboard was built to show BI reports from cube (using MDX as reporting language).  Off course, unit testing was nothing new to our web developers. But, our data warehouse developers had to figure it out.

Let’s take a simple example; we have a function in our ETL process that joins two table and returns the result set. The only way to test this function is to verify its entire output data. But, unlike a programming language where expected output of a unit test remains fixed, this simple function’s expected output cannot remain fixed. Because it depends on input data. The only way we can keep it fixed is to keep the input data fixed. Hence came the idea of ‘frozen environment’. So, all our data should be frozen as long as we are doing development and this way we can pass the unit test. Now this is where it gets difficult. Because, we can never make our system ‘go live’ keeping every data in a frozen state.  It has to change every now and then. And, every time input data changes, we need to change expected output data as well.

Another thing is, it hardly happens that we are changing something in our ETL flow and not expecting any data change in the system. In ETL we mostly do data manipulation, isn’t it? What kind of change we can do in a lame ‘join’ where we do not expect any data change in the end? Yes, performance tweaking, CLRs, string formatters may be some of those but not much other than that. So, we end up changing the unit test almost every time we change anything in the flow. Otherwise, unit tests will fail. So, more and more it is looking like data testing rather than unit testing.

All of above is also applicable for data warehouse, olap cubes and MDX. Actually, unit testing is something meant for programming languages. Cube or sophisticated ETL tools are more like design tools doing mostly data manipulation; therefore will not behave like a programming language. Personally, I will always call it ‘good old data testing’, not ‘unit testing’ anymore.

Posted in SQA, Testing, SQL, Database | Tagged , , , | 1 Comment

Common Latest Period in SSAS Cube

Recently I came across a situation where data was being delivered at different times for different countries and client wanted to see data for latest common month in Region/Continent level reports. Data arrive on monthly basis, but it arrives at different dates for different countries. So, at one point in our data warehouse, we have Dec-2014  data for UK and Qatar. Rest of the countries have fallen behind them (having data for up to Nov-2014, Oct-2014 etc.). Let’s analyze in more details.

First, let’s go into the cube. We have a simple Geo hierarchy. Countries are at lowest level and its parent is Region (Europe, Mid East).

GeoHierarchy

We have a simple month dimension as well. Its primary key is the digit produced by concatenating year with month. So, Dec-2014’s key will be 201412.

MonthDimensionDimMonthData

Let’s browse cube and see how data looks like…

CubeData

Look at rows inside red colored zone. Different countries have different latest months. User requirement is to see latest month’s sales for all geo levels. It can be achieved in many ways. One way to achieve this is to create a calculated measure like this…

CREATE MEMBER CURRENTCUBE.[Measures].[Last Month Sales]

AS sum(

tail(nonempty([Dim Month].[Id].[Id],([Measures].[Sales],[Dim Geography].[Hierarchy].currentmember)),1),

[Measures].[Sales])

,VISIBLE = 1 ;

If we drag this measure in cube browser, country level and region level sales will be like this:

CountrySales1 RegionSales1

It’s perfectly alright for country level sales . But, it will be interesting to analyze Region level sales. Actually, for regions, most of the business users would like to see common latest month’s total sales. In our data, Europe’s last month’s sales 4830 is probably giving the business user a wrong picture (of course it depends…. ), because only UK  data is available for Dec-2014. So showing a total for Dec-2014 may not make much sense. Latest common month for Europe is Nov-2014 and latest common month for Mid East region is Oct-2014. So, it would always make much sense if we show Nov-2014 total for Europe and Oct-2014 total for Mid East. To support this requirement of showing common latest month’s sales I have introduced another fact. This fact is nothing but a storage for country-wise latest month.

CubeStructure2 NewFact

This Fact table’s name is FactCountryMaxPeriod and declared a simple count measure called [Measures].[Fact Country Max Period Count] from it. We are almost done. Just declare a new calculated measure called [Measures].[Common Latest Month Sales] and see it in cube browser.

CREATE MEMBER CURRENTCUBE.[Measures].[Common Latest Month Sales]

as

sum(nonempty([Dim Month].[Id].[Id], [Measures].[Fact Country Max Period Count]).item(0), [Measures].[Sales])

,VISIBLE = 1 ;

LastMonthSalesCountry LastMonthSalesRegion

This time, Europe and Mid East is showing total sales of Nov-2014 and Oct-2014 respectively.

Posted in MDX, SSAS (Analysis Service), Cube | Tagged , , , , , , , | Leave a comment

Slowly Changing Dimension in SSAS Cube

 

This is a simple example of SCD (Type-2) in OLAP cube.  We have a Dimension table for Employee and their Departments. This dimension is an SCD. One employee worked in different department over the course of time. EmpID is business key and SID is surrogate key.

DimEmpolyee table

112814_1040_SlowlyChang1.png

We have another dimension called DimTime. It’s a very simple Month Dimension table.

Fact table looks like this:

FactTable

Fact

Both business key (EmpID) and Surrogate key (EmpSID) is present in Fact table, but we will use only EmpSID key here.

Let’s have a look at the cube now. First, look at the Dimension called [Dim Employee]

Dimension

And it’s attribute relationship…..

AttrRelationship

See carefully that I did not use SID in hierarchy. I used employee name (attribute called ‘Employee’) for reporting purpose. Its key column and name column both are set to Employee column of the physical dimension table.

We are almost done. Let’s process the cube and write an MDX to see result:

MDXResult

 

You can clearly see that an employee named David sold different amounts in different months. David sold under Product Development Department for up to Aug-2013 and then he started selling under Android Apps department from Sep-2013.

Hope you will like it.

Posted in MDX, SSAS (Analysis Service), Cube | Tagged , , , , , , , | Leave a comment

5 Tips to avoid running out of space in SQL Server Database

Try to follow these tips. Each of these topics has been explained in different websites. So, I am not going to explain them here.

1. Design

Design your database efficiently (normalization, use of foreign keys, getting rid of unnecessary tables, selecting appropriate data type etc..). Clean up your database before your go LIVE!

2. Estimate Data Volume

Estimate the total data volume you need in your database. Don’t get surprise in future. Know your data and know the space required.

3. Estimate Log file size and Temp DB Size

Set back up plan for your database and then chose a logging mode (Simple, full or bulk logged). Try to find out the log file space required in the first few runs and then set log file size accordingly. If your process uses ‘#’tables (Temp tables), try to find out the space required there as well. Closely monitor TempDB size in first few runs and then set its size accordingly.

4. Compression

After a few months (could be few years depending upon the rate of change in your data), you can run Table compression in your tables. It’s good for reporting systems like DWH but might reduce performance in transactional system. Read pros and cons first.

5. Shrink (use carefully)

If any big sized Ad-hoc operations happen in your database, you can shrink your log files. Otherwise, shrinking of log files will not benefit you much. It will anyway take up the allotted space. Same goes for Database shrink. Read articles found on web. Database shrink will introduce fragmented index. In worst case, you can drop/recreate index rather than rebuild them.

Posted in SQL, Database | Tagged , , , , , , | Leave a comment

Ceiling in MDX

It’s just a simple solution to find ceiling of a measure with help of some string operation. We have a measure called [Measures].[Marks]. Now, we define a calculated measure like below:

CREATE MEMBER CURRENTCUBE.[Measures].[CeilingMarks]
AS CInt(Mid(cstr([Measures].[Marks]),0,InStr(cstr([Measures].[Marks]),“.”)-1))+1,
VISIBLE = 1 ;

It works like this;

Let’s say our number is 315.5, first it finds the decimal point’s position. In our case, it is 4. Then, it uses the MID function to find the number prior to that decimal point. In our case, it is 315. And then, it just adds one, so it becomes 316.

Posted in MDX, SSAS (Analysis Service), Cube | Tagged , , | Leave a comment

Pre order, Post order, Level order (BFS) traversing with MDX

A tree or hierarchy of my sample cube is being shown below:


BFS or Level Order Traversing:

with
member [Measures].[x]
as [Dim Category].[Parent].currentmember.level.ordinal

select
order(([Dim Category].[Parent].members),[Measures].[x],BASC)
on 1 ,
[Measures].[x] on 0
from SalesCube
And, here goes the output:


Pre-Order traversing:
By default, traversing in MDX is always in pre-order.
select [Dim Category].[Parent].members
on 1 ,
[Measures].[Sales] on 0
from SalesCube


Post-Order Traversing:
Using “Hierarchize” function, with parameter “POST”
select
hierarchize([Dim Category].[Parent].members,post)
on 1 ,
[Measures].[Sales] on 0
from SalesCube

Posted in MDX, SSAS (Analysis Service), Cube | Leave a comment

Leaf members do not sum up to parent – Part 2 (parent child hierarchy)

In part 1, we saw user defined hierarchies. For parent child hierarchy, we can use Unary Operator Column. Let’s take the same example from Part1. We have a simple cube with two dimensions called DimCompany and DimCategory, and only one measure called Sales. DimCategory table looks like this: Id    CategoryItems        Parent

1        Phone                            4

2         Tab                               4

3      Web Applications         5

4         Hardware                    6

5          Software                     6

6        AllCategory                NULL

If we browse the cube taking DimCategory to axis 1 and DimCompany to axis 0, we get to see the below result:
Now, imagine that we received an ad-hoc request from our source data provider that total Hardware Sales for Apple should be 9000 and that of Google should be 12000. There have been some unreported sales for both these companies which do not fit into the available sub categories (phone and tab). So, the requirement becomes likes this: For Apple, Phone sales is 1000 and Tab Sales is 7000, but total Hardware Sales is 9000 (1000 greater than children sum) For Google, Phone sales is 8000 and Tab Sales is 0, but total Hardware Sales is 12000 (4000 greater than children sum) To solve this, first we have to insert Hardware sales for all companies into Fact Table. Id for Hardware is 4. Newly inserted records are shown below:

Now, we need to add unary operator column in Dim Category table. So, the table looks like this: Id    CategoryItems              Parent        UnaryOperator

1               Phone                         4                        ~

2                Tab                            4                        ~

3            Web Applications       5                     NULL

4                  Hardware               6                     NULL

5                   Software                6                     NULL

6                AllCategory             NULL             NULL

Tilde (~) has been used to ignore these two members while calculating parent sum. Now, we need to change two properties (MembersWithData and UnaryOperatorColumn) of Parent attribute of our parent child dimension:

Now, process the cube and final result looks like this:

Yes, we could insert incremental value for Hardware sales and avoid using unary operator column. Result would be same.

Posted in MDX, SSAS (Analysis Service), Cube | Tagged , , , | Leave a comment