Using SQL to analyze data in Azure Data Lake (ADLS Gen 2)


As more and more data is ingested into data lakes, there is growing demand from data analysts to access that data quickly and drive insights. The biggest factor fueling this demand is the ability to use existing skills, such as SQL, to analyze the available data.

In Azure, the following are a few of the available options for making big data available to data analysts:

    1. Mounting Azure Data Lake (ADLS Gen 2) to Databricks.
    2. Implementing the Databricks Lakehouse pattern, built on top of Delta tables that are exposed as tables in Databricks.
    3. Loading data into an Azure Synapse workspace.

A good thing about all of the above options is that they enable data analysts to use their existing SQL skills to analyze the data, giving them the flexibility to build solutions and products quickly, which in turn enables "citizen development".

In this blog post we will quickly go through the first option, "Mounting Azure Data Lake (ADLS Gen 2) to Databricks", and also review some basic SQL constructs to refresh our fundamentals.

Prerequisites

  • ADLS Gen 2
  • Service Principal
  • Azure Databricks
  • ADLS Gen 2 already mounted to Azure Databricks.

So what does mounting mean? Mounting is the process of making the data lake act like a local file system: it abstracts the complexity of reading from and writing to the data lake away from the end user.
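
For reference, this is roughly how a mount is created with a service principal (a minimal sketch; the container, storage account, secret scope, and tenant values below are placeholders to replace with your own):

    %py
    # Minimal sketch: mount the "refined" ADLS Gen 2 container via a service principal.
    # All bracketed values are placeholders -- replace them with your own.
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": "<application-client-id>",
        "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<secret-key-name>"),
        "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
    }

    # Attach the container to the workspace file system at /mnt/refined
    dbutils.fs.mount(
        source="abfss://refined@<storage-account>.dfs.core.windows.net/",
        mount_point="/mnt/refined",
        extra_configs=configs
    )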

SQL Basics

First, we are going to review some basic SQL commands. This is not a complete list, but a list of commonly used SQL commands.

Commands

  1. SELECT:- Selects data from an object (table/view).
  2. FROM:- The object (table/view) to select the data from.
  3. WHERE:- Filters the query by applying a condition.
  4. JOIN:- Combines 2 or more objects (tables/views).
  5. GROUP BY:- Groups rows that have the same values into summary rows.
  6. ORDER BY:- Sorts the result.
  7. HAVING:- Filters the data, but is used with aggregate commands/functions.
  8. SUM:- Sum of a column.
  9. AVG:- Average of a column.
  10. MIN:- Minimum value of a column.
  11. MAX:- Maximum value of a column.
  12. COUNT:- Number of rows.
  13. AND:- Combines conditions in a query; all conditions have to be TRUE.
  14. OR:- Combines conditions in a query; only one of the conditions has to be TRUE.
  15. LIKE:- Used for searching for a pattern.
  16. IN:- Matches against multiple values.
  17. LIMIT:- Limits the number of rows returned.

Examples

  • Select all rows from a table

        SELECT * FROM table1;

  • Select the first 100 rows from a table

        SELECT * FROM table1 LIMIT 100;

  • Select 3 columns, but all rows, from a table

        SELECT column1, column2, column3 FROM table1;

  • Select all rows from a table with one filter applied

        SELECT * FROM table1 WHERE column1 = 10;

  • Select all rows from a table with multiple filters applied

        SELECT * FROM table1 WHERE column1 = 10 AND column2 = 'aaaaa';

  • Get the count of rows in a table

        SELECT COUNT(*) AS TableCount FROM table1;

  • Get the maximum value of a column

        SELECT MAX(column1) AS MaxValue FROM table1;

  • Get the minimum value of a column

        SELECT MIN(column1) AS MinValue FROM table1;

  • Get the average value of a column

        SELECT AVG(column1) AS AvgValue FROM table1;

  • Join two tables and select only the rows with matching values in both tables

        SELECT tbl1.*, tbl2.*
        FROM table1 AS tbl1
        INNER JOIN table2 AS tbl2
        ON tbl1.column1 = tbl2.column1;

  • Join two tables and select all rows from the first table plus the matching rows from the second

        SELECT tbl1.*, tbl2.*
        FROM table1 AS tbl1
        LEFT JOIN table2 AS tbl2
        ON tbl1.column1 = tbl2.column1;

  • Join two tables and select all rows from the second table plus the matching rows from the first

        SELECT tbl1.*, tbl2.*
        FROM table1 AS tbl1
        RIGHT JOIN table2 AS tbl2
        ON tbl1.column1 = tbl2.column1;

  • Aggregate the data and filter on the aggregate

        SELECT column1, SUM(column2) AS SumCol2
        FROM table1
        GROUP BY column1
        HAVING SUM(column2) > 1000;

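The Commands list above also includes LIKE, IN, and ORDER BY, which the examples skip; here are a few quick sketches using the same hypothetical table1 columns:

  • Search for a pattern

        SELECT * FROM table1 WHERE column2 LIKE 'aa%';

  • Match against multiple values

        SELECT * FROM table1 WHERE column1 IN (10, 20, 30);

  • Sort the result

        SELECT * FROM table1 ORDER BY column1 DESC;
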

Steps to Follow Along

1. Create a new Notebook in Databricks.
  • Name:- Anything you want.
  • Default Language:- SQL
  • Cluster:- Any available cluster.

2. Let's check what mounts are available.
  • Enter the following command in a new cell:
    %py dbutils.fs.mounts()
  • %py is a magic command that tells Databricks to execute the cell using Python.
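  • As a quick sketch, you can also loop over the mounts and print each mount point with its source, to confirm that /mnt/refined exists:
    %py
    # Print every mount point and the storage location it points to
    for m in dbutils.fs.mounts():
        print(m.mountPoint, "->", m.source)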
3. Now let's read the parquet files into a SQL view.
  • %py
    # Create a SQL view from a folder in ADLS Gen 2.
    # We are reading "folder_name" from the "refined" container and
    # registering it as a SQL view called "view1".
    # Replace "folder_name" with the name of the folder in your container.

    # Name of the SQL view
    view_name = 'view1'

    # Build a DataFrame that reads from ADLS Gen 2:
    #   /mnt/refined --> mount that has already been created
    #   folder_name  --> name of the folder in the refined container
    df = spark.read.parquet('/mnt/refined/folder_name/')

    # Register the DataFrame as a temporary SQL view.
    # The view lives for the duration of the Spark session, i.e. as long
    # as the cluster is running.
    df.createOrReplaceTempView(view_name)
4. Let's select the data from the view created in Step 3 ("view1").
  • -- As this is a SQL notebook, the default context is %sql
    SELECT * FROM view1
5. Let's get the number of rows in the view ("view1").
  • -- As this is a SQL notebook, the default context is %sql
    SELECT COUNT(*) AS NbrOfRows FROM view1
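6. Finally, any of the constructs from the SQL Basics section can be applied to the view. A quick sketch, assuming "view1" contains a hypothetical "category" column and a numeric "amount" column:
  • -- As this is a SQL notebook, the default context is %sql
    SELECT category, SUM(amount) AS TotalAmount
    FROM view1
    GROUP BY category
    HAVING SUM(amount) > 1000
    ORDER BY TotalAmount DESC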

Following the above steps should get you started with using SQL in Databricks to explore and drive insights from the data available in the data lake.
