From SQL to Spark

When I started learning spark, I decided to learn it from the other way round. That means, rather than going through a step by step journey, I decided to learn it from BI reporting perspective. I took a table with some data. Then, I tried to answer some common reporting queries by using python+Spark. It was simple stuff but I decided to post it in my blog as it may help other people from BI/SQL background who are wondering where to start from.

My table looks like below. Each territory has some bricks and each brick has some sales and target amount.


So, first question is which territory has the highest number of bricks?

It’s a simple group by/count problem in SQL. But pySpark is quite simple as well.

df.groupby(‘Territory’).count().orderBy(‘count’, asending=False).show(1)


Second question is, which Territory has the highest target attainment? Here, we have to calculate sum of sales and sum of target for each territory and then divide sales by target to find out attainment percentage. Again, pySpark has some simple solution:


from pyspark.sql.functions import sum
from pyspark.sql.functions import format_number
df.groupby(‘Territory’).agg(format_number((sum(“BrickSales”)*100/sum(“Brick Target”)),2).alias(‘Attainment %’) ).show(1)


Finally, we will try to make a flat structured table a hierarchical one. Let’s see the input table.


We have a hierarchy here; National->Territory->Brick. We will now make this table a classic hierarchical parent-child table. So, our expected table will be like this:

Child                            Parent                        Sales

AB16                             Terr0032                   0

BA06                             Terr0003                   374.0941667

Terr0032                      40012                          XXXX

Terr002                        40012                          XXXX

To solve this problem, we will create two dataframes and then make a union of them.

from pyspark.sql.functions import sum
df_1 = df.withColumnRenamed(“Brick”,”Child”).withColumnRenamed(“Territory”,”Parent”).withColumnRenamed(“BrickSales”,”Sales”)
df_2 = df.groupby(‘Territory’,’NATIONAL’).agg(sum(“BrickSales”).alias(‘Sales’) ).withColumnRenamed(“Territory”,”Child”).withColumnRenamed(“NATIONAL”,”Parent”)


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.