Beginner's Guide to Reading and Writing Data in Azure Databricks
This article is a continuation of my earlier article (Dumb Down Azure Databricks Delta Lake Architecture), and its primary goal is to demonstrate how to read and write data in Azure Databricks and Delta Lake.
Below are the steps we will cover in this article
1. Provision the required resources
2. Create the Cluster and Upload the sample file
3. Create a notebook and read the file
4. Create a data frame and a delta-lake table
5. Query the table using SQL script
1. Provision the Resources Required
Azure Databricks Workspace:-
→From the Azure portal, click Create a resource → Analytics → Databricks.
Enter the required details and click Review + Create
Create a Spark Cluster:-
Open the Azure Databricks Workspace and click New Cluster
Give the cluster a meaningful name, select the runtime version and worker type based on your preference, and click Create Cluster
2. Upload the Sample file to Databricks (DBFS)
Open the Databricks workspace and click on the ‘Import & Explore Data’
Click on ‘Drop files to upload…’ and select the file you want to process. For this article, I am using the Country Sales data from here
The Country sales data file is uploaded to the DBFS and ready to use
Click on the DBFS tab to see the uploaded file and the FileStore path
3. Read and Write the Data
Open the Azure Databricks workspace and create a notebook
Now it's time to write some Python code to read the ‘CountrySales.csv’ file and create a data frame
# File location and type
file_location = "/FileStore/tables/Country_Sales_Records.csv"
file_type = "csv"
# CSV options
infer_schema = "false"
first_row_is_header = "false"
delimiter = ","
# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
.option("inferSchema", infer_schema) \
.option("header", first_row_is_header) \
.option("sep", delimiter) \
.load(file_location)
display(df)
Copy and paste the above code into the cell, change the file name to your file name, and make sure the cluster is running and attached to the notebook
Now it's time to run the code and see the data. Run the cell by clicking Run Cell or pressing CTRL + ENTER
The code executed successfully, and I see the cluster created 2 Spark jobs to read and display the data from the ‘Country Sales’ data file
Also, if you notice, the schema is not exactly right: it shows String for all the columns, and the header doesn't seem right (_c0, _c1, etc.),
so let's see how we can fix the schema and the header
# CSV options
infer_schema = "true"
first_row_is_header = "true"
Change the two variables to "true" and run the cell again
Now the schema and header look much better
Create a table and query the data using SQL
Create a temporary view using the data frame and query the data using SQL language.
# Create a view or table
tblCountrySales = "Country_Sales"
df.createOrReplaceTempView(tblCountrySales)
Add a new cell to the notebook, paste the above code and then run the cell
%sql
select * from `Country_Sales`
Add a new cell, paste the above code, and run the cell (%sql is a magic command; we will talk about magic commands and more in the upcoming articles)
Now you can use the regular SQL scripting language on top of the temporary view and query the data in whatever way you want. But the view is temporary in nature, which means it will only be available to this particular notebook and will not be available once the cluster restarts.
I created a new notebook and tried to access the view that we just created, but it's not accessible from this notebook
So to make it available across notebooks and to all users, we have to create a permanent table. Let's create a permanent table by executing the below code
tbl_name = "tbl_Country_Sales"
df.write.format("delta").saveAsTable(tbl_name)
Now the permanent table is created and it will persist across cluster restarts as well as allow various users across different notebooks to query this data.
Let's try accessing it from another notebook and see whether it's accessible or not.
Cool, now we can access the table from other notebooks as well.
Hope you enjoyed reading this article. In my future article, we will see how we can build an end-to-end data engineering project using Azure Databricks. Until then, have a great time and stay warm.