Using SQL to analyze data in Azure Data Lake (ADLS Gen 2)


As more and more data is ingested into data lakes, there is growing demand from data analysts to access that data quickly and drive insights. The biggest factor fueling this demand is the ability to use existing skills, such as SQL, to analyze the available data.

In Azure, the following are a few of the available options for making big data available to data analysts:

    1. Mounting Azure Data Lake (ADLS Gen 2) to Databricks.
    2. Implementing the Databricks Lakehouse pattern, built on top of Delta tables that are exposed as tables in Databricks.
    3. Loading data into an Azure Synapse workspace.

A good thing about all of the above options is that they enable data analysts to use their existing SQL skills to analyze the data, giving them the flexibility to build solutions and products quickly, which in turn enables "citizen development".

In this blog post we will quickly go through the first option, "Mounting Azure Data Lake (ADLS Gen 2) to Databricks", and also review some basic SQL constructs to refresh our fundamentals.

Prerequisites

  • ADLS Gen 2
  • Service Principal
  • Azure Databricks
  • ADLS Gen 2 already mounted to Azure Databricks.

So what does mounting mean? Mounting is the process of making the data lake act like a local file system: it abstracts the complexity of reading from and writing to the data lake away from the end user.
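
For reference, this is roughly how a mount is created with a service principal (a minimal sketch; the container, storage account, secret scope, and tenant values below are placeholders to replace with your own):

    %py
    # Minimal sketch: mount the "refined" ADLS Gen 2 container via a service principal.
    # All bracketed values are placeholders -- replace them with your own.
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": "<application-client-id>",
        "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<secret-key-name>"),
        "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
    }

    # Attach the container to the workspace file system at /mnt/refined
    dbutils.fs.mount(
        source="abfss://refined@<storage-account>.dfs.core.windows.net/",
        mount_point="/mnt/refined",
        extra_configs=configs
    )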

SQL Basics

First, we are going to review some basic SQL commands. This is not a complete list, but a list of commonly used SQL commands.

Commands

  1. SELECT:- Selects data from an object (table/view).
  2. FROM:- The object (table/view) to select the data from.
  3. WHERE:- Filters the query by applying a condition.
  4. JOIN:- Combines 2 or more objects (tables/views).
  5. GROUP BY:- Groups rows that have the same values into summary rows.
  6. ORDER BY:- Sorts the result.
  7. HAVING:- Filters the data, but is used with aggregate commands/functions.
  8. SUM:- Sum of a column.
  9. AVG:- Average of a column.
  10. MIN:- Minimum value of a column.
  11. MAX:- Maximum value of a column.
  12. COUNT:- Number of rows.
  13. AND:- Combines conditions in a query; all conditions have to be TRUE.
  14. OR:- Combines conditions in a query; only one of the conditions has to be TRUE.
  15. LIKE:- Used for searching for a pattern.
  16. IN:- Matches against multiple values.
  17. LIMIT:- Limits the number of rows returned.

Examples

  • Select all rows from a table

        SELECT * FROM table1;

  • Select the first 100 rows from a table

        SELECT * FROM table1 LIMIT 100;

  • Select 3 columns, but all rows, from a table

        SELECT column1, column2, column3 FROM table1;

  • Select all rows from a table with one filter applied

        SELECT * FROM table1 WHERE column1 = 10;

  • Select all rows from a table with multiple filters applied

        SELECT * FROM table1 WHERE column1 = 10 AND column2 = 'aaaaa';

  • Get the count of rows in a table

        SELECT COUNT(*) AS TableCount FROM table1;

  • Get the maximum value of a column

        SELECT MAX(column1) AS MaxValue FROM table1;

  • Get the minimum value of a column

        SELECT MIN(column1) AS MinValue FROM table1;

  • Get the average value of a column

        SELECT AVG(column1) AS AvgValue FROM table1;

  • Join two tables and select only the rows with matching values in both tables

        SELECT tbl1.*, tbl2.*
        FROM table1 AS tbl1
        INNER JOIN table2 AS tbl2
        ON tbl1.column1 = tbl2.column1;

  • Join two tables and select all rows from the first table plus the matching rows from the second

        SELECT tbl1.*, tbl2.*
        FROM table1 AS tbl1
        LEFT JOIN table2 AS tbl2
        ON tbl1.column1 = tbl2.column1;

  • Join two tables and select all rows from the second table plus the matching rows from the first

        SELECT tbl1.*, tbl2.*
        FROM table1 AS tbl1
        RIGHT JOIN table2 AS tbl2
        ON tbl1.column1 = tbl2.column1;

  • Aggregate the data and filter on the aggregate

        SELECT column1, SUM(column2) AS SumCol2
        FROM table1
        GROUP BY column1
        HAVING SUM(column2) > 1000;

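The Commands list above also includes LIKE, IN, and ORDER BY, which the examples skip; here are a few quick sketches using the same hypothetical table1 columns:

  • Search for a pattern

        SELECT * FROM table1 WHERE column2 LIKE 'aa%';

  • Match against multiple values

        SELECT * FROM table1 WHERE column1 IN (10, 20, 30);

  • Sort the result

        SELECT * FROM table1 ORDER BY column1 DESC;
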

Steps to Follow Along

1. Create a new Notebook in Databricks.
  • Name:- Anything you want.
  • Default Language:- SQL
  • Cluster:- Any available cluster.

2. Let's check what mounts are available.
  • Enter the following command in a new cell:
    %py dbutils.fs.mounts()
  • %py is a magic command that tells Databricks to execute the cell using Python.
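  • As a quick sketch, you can also loop over the mounts and print each mount point with its source, to confirm that /mnt/refined exists:
    %py
    # Print every mount point and the storage location it points to
    for m in dbutils.fs.mounts():
        print(m.mountPoint, "->", m.source)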
3. Now let's read the parquet files into a SQL view.
  • %py
    # Create a SQL view from a folder in ADLS Gen 2.
    # We are reading "folder_name" from the "refined" container and
    # registering it as a SQL view called "view1".
    # Replace "folder_name" with the name of the folder in your container.

    # Name of the SQL view
    view_name = 'view1'

    # Build a DataFrame that reads from ADLS Gen 2:
    #   /mnt/refined --> mount that has already been created
    #   folder_name  --> name of the folder in the refined container
    df = spark.read.parquet('/mnt/refined/folder_name/')

    # Register the DataFrame as a temporary SQL view.
    # The view lives for the duration of the Spark session, i.e. as long
    # as the cluster is running.
    df.createOrReplaceTempView(view_name)
4. Let's select the data from the view created in Step 3 ("view1").
  • -- As this is a SQL notebook, the default context is %sql
    SELECT * FROM view1
5. Let's get the number of rows in the view ("view1").
  • -- As this is a SQL notebook, the default context is %sql
    SELECT COUNT(*) AS NbrOfRows FROM view1
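6. Finally, any of the constructs from the SQL Basics section can be applied to the view. A quick sketch, assuming "view1" contains a hypothetical "category" column and a numeric "amount" column:
  • -- As this is a SQL notebook, the default context is %sql
    SELECT category, SUM(amount) AS TotalAmount
    FROM view1
    GROUP BY category
    HAVING SUM(amount) > 1000
    ORDER BY TotalAmount DESC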

Following the above steps should get you started with using SQL in Databricks to explore and drive insights from the data available in the data lake.
