Azure Databricks is a cloud-based big data and machine learning platform built on Apache Spark. With fully managed Spark clusters, it can process large data workloads and supports APIs for R, SQL, Python, Scala and Java.
Apache Spark is an open-source, distributed computing environment that can analyze big data using SQL (Spark SQL), machine learning (MLlib), graph processing (GraphX) and real-time data streaming.
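To give you an idea of the Spark SQL part, here is a minimal sketch of querying a DataFrame with SQL. The sample data is made up for illustration and is not project data; in a Databricks notebook, a SparkSession is already available as `spark`.

```python
# Generic Spark SQL illustration with made-up sample data (not project data).
data = [("Serena", 23), ("Venus", 7), ("Kim", 4)]
df = spark.createDataFrame(data, ["player", "grand_slams"])

# Register the DataFrame as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("players")
spark.sql("SELECT player FROM players WHERE grand_slams > 10").show()
```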
Databricks has native connectivity with Azure Blob Storage, Azure Data Lake, Azure Synapse, Apache Kafka and Hadoop.
We will use Databricks to cleanse and prep the CSV files that we have ingested into our data lake. We will grab the files from the data lake, cleanse and transform them, and write them back to the data lake, in the cleansed layer.
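As a preview of what the notebook posts will cover, a minimal PySpark sketch of that flow could look like the snippet below. The storage account, container names (raw, cleansed), file name (players.csv) and column name (first_name) are hypothetical placeholders; the actual files and transformations follow in the next posts.

```python
# Minimal sketch of the cleanse-and-write-back flow.
# Paths, file names and column names are placeholders, not the project's real ones.
from pyspark.sql.functions import col, trim

raw_path = "abfss://raw@<storageaccount>.dfs.core.windows.net/players.csv"
cleansed_path = "abfss://cleansed@<storageaccount>.dfs.core.windows.net/players"

# Read the raw CSV from the data lake, inferring the schema from the data.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(raw_path))

# Example cleansing steps: trim whitespace and drop fully duplicated rows.
cleansed = (df
            .withColumn("first_name", trim(col("first_name")))
            .dropDuplicates())

# Write the result back to the cleansed layer, e.g. as Parquet.
cleansed.write.mode("overwrite").parquet(cleansed_path)
```

Writing the cleansed output as Parquet keeps the schema together with the data, so downstream reads do not have to re-parse and re-infer the CSV.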
If you are new to this blog, please check out the Kickoff and ADF series first to learn about the WTA Insights project and the required Azure setup.
Posts in the Databricks series:
1. Creating a Databricks service, workspace and cluster
2. Notebooks with Python – part 1
3. Notebooks with Python – part 2