When I started learning Spark, I decided to approach it the other way round. Rather than going through a step-by-step journey, I learned it from a BI reporting perspective: I took a table with some data and tried to answer some common reporting questions using Python and Spark. It is simple stuff, but I decided to post it on my blog as it may help other people from a BI/SQL background who are wondering where to start.
My table looks like the one below. Each territory has some bricks, and each brick has a sales amount and a target amount.
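If you want to follow along without the original file, a small DataFrame with the same shape can be built by hand. The column names (National, Territory, Brick, BrickSales, Brick Target) and the target figures are just my illustrative placeholders; the brick and sales values echo rows that appear later in the post:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative sample only: two bricks in two territories under one national code.
data = [('40012', 'Terr0032', 'AB16', 0.0, 120.0),
        ('40012', 'Terr0003', 'BA06', 374.0941667, 400.0)]
df = spark.createDataFrame(data, ['National', 'Territory', 'Brick', 'BrickSales', 'Brick Target'])
df.show()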
So, the first question is: which territory has the highest number of bricks? It's a simple GROUP BY/COUNT problem in SQL, and PySpark makes it just as simple.
df.groupBy('Territory').count().orderBy('count', ascending=False).show(1)
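For readers coming straight from SQL, the same question can also be asked through Spark SQL. A minimal sketch, assuming the DataFrame is registered as a temporary view called sales:

df.createOrReplaceTempView('sales')

spark.sql("""
    SELECT Territory, COUNT(*) AS brick_count
    FROM sales
    GROUP BY Territory
    ORDER BY brick_count DESC
    LIMIT 1
""").show()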
The second question is: which territory has the highest target attainment? Here, we have to calculate the sum of sales and the sum of targets for each territory, and then divide sales by target to find the attainment percentage. Again, PySpark has a simple solution:
from pyspark.sql.functions import sum as sum_, format_number

df.groupBy('Territory').agg((sum_('BrickSales') * 100 / sum_('Brick Target')).alias('attainment')) \
    .orderBy('attainment', ascending=False) \
    .select('Territory', format_number('attainment', 2).alias('Attainment %')).show(1)
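Two small details are worth noting here: importing Spark's sum as sum_ avoids shadowing Python's built-in sum, and the ordering is done on the numeric attainment column before format_number is applied, because format_number returns a string and sorting strings would compare them character by character.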
Finally, we will try to turn a flat table into a hierarchical one. Let's look at the input table.
We have a hierarchy here: National -> Territory -> Brick. We will now turn this table into a classic parent-child hierarchy table. So, our expected table will look like this:
Child       Parent      Sales
AB16        Terr0032    0
BA06        Terr0003    374.0941667
Terr0032    40012       XXXX
Terr002     40012       XXXX
To solve this problem, we will create two DataFrames, one for each level of the hierarchy, and then take the union of them.
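Here is a minimal sketch of that approach, reusing the assumed column names from the sample above. The brick rows keep their own sales, while the territory rows roll sales up (standing in for the XXXX placeholders) and take the national code as their parent:

from pyspark.sql.functions import col, sum as sum_

# Level 1: each brick is a child of its territory, keeping its own sales.
bricks = df.select(col('Brick').alias('Child'),
                   col('Territory').alias('Parent'),
                   col('BrickSales').alias('Sales'))

# Level 2: each territory is a child of the national code, with sales summed up.
territories = (df.groupBy('National', 'Territory')
                 .agg(sum_('BrickSales').alias('Sales'))
                 .select(col('Territory').alias('Child'),
                         col('National').alias('Parent'),
                         col('Sales')))

# union() matches columns by position, so both DataFrames use the same column order.
hierarchy = bricks.union(territories)
hierarchy.show()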