### From SQL to Spark

When I started learning Spark, I decided to approach it the other way round. Rather than following a step-by-step tutorial, I learned it from a BI reporting perspective: I took a table with some data, then tried to answer some common reporting queries using Python and Spark. It is simple stuff, but I decided to post it on my blog because it may help other people from a BI/SQL background who are wondering where to start.

My table looks like the one below. Each territory contains some bricks, and each brick has a sales amount and a target amount.

So, the first question is: which territory has the highest number of bricks?

It is a simple GROUP BY/COUNT problem in SQL, and it is quite simple in PySpark as well:

##### df.groupBy('Territory').count().orderBy('count', ascending=False).show(1)

The second question is: which territory has the highest target attainment? Here, we have to calculate the sum of sales and the sum of targets for each territory, and then divide sales by target to find the attainment percentage. Again, PySpark has a simple solution:

##### df.groupBy('Territory').agg((sum('BrickSales')*100/sum('Brick Target')).alias('Attainment')).orderBy('Attainment', ascending=False).select('Territory', format_number('Attainment', 2).alias('Attainment %')).show(1)

Finally, we will try to turn a flat table into a hierarchical one. Let's look at the input table.

We have a hierarchy here: National -> Territory -> Brick. We will now turn this table into a classic hierarchical parent-child table, so our expected output will look like this:

| Child    | Parent   | Sales       |
|----------|----------|-------------|
| AB16     | Terr0032 | 0           |
| BA06     | Terr0003 | 374.0941667 |
| Terr0032 | 40012    | XXXX        |
| Terr002  | 40012    | XXXX        |

To solve this problem, we will create two DataFrames, one for each level of the hierarchy, and then take the union of them.