In one of our projects, we ‘somehow’ decided that every part of the system should be ‘unit test’-ed . It was a classical ETL->DWH->Cube project where web dashboard was built to show BI reports from cube (using MDX as reporting language). Off course, unit testing was nothing new to our web developers. But, our data warehouse developers had to figure it out.
Let’s take a simple example; we have a function in our ETL process that joins two table and returns the result set. The only way to test this function is to verify its entire output data. But, unlike a programming language where expected output of a unit test remains fixed, this simple function’s expected output cannot remain fixed. Because it depends on input data. The only way we can keep it fixed is to keep the input data fixed. Hence came the idea of ‘frozen environment’. So, all our data should be frozen as long as we are doing development and this way we can pass the unit test. Now this is where it gets difficult. Because, we can never make our system ‘go live’ keeping every data in a frozen state. It has to change every now and then. And, every time input data changes, we need to change expected output data as well.
Another thing is, it hardly happens that we are changing something in our ETL flow and not expecting any data change in the system. In ETL we mostly do data manipulation, isn’t it? What kind of change we can do in a lame ‘join’ where we do not expect any data change in the end? Yes, performance tweaking, CLRs, string formatters may be some of those but not much other than that. So, we end up changing the unit test almost every time we change anything in the flow. Otherwise, unit tests will fail. So, more and more it is looking like data testing rather than unit testing.
All of above is also applicable for data warehouse, olap cubes and MDX. Actually, unit testing is something meant for programming languages. Cube or sophisticated ETL tools are more like design tools doing mostly data manipulation; therefore will not behave like a programming language. Personally, I will always call it ‘good old data testing’, not ‘unit testing’ anymore.