What is Azure Databricks?
•Databricks is a unified data and analytics platform built to enable all data personas: data engineers, data scientists and data analysts.
•It is a managed platform that gives data developers all the tools and infrastructure they need to focus on data analytics, without worrying about managing Databricks clusters, libraries, dependencies, upgrades, and other tasks that are not related to driving insights from data.
Benefits of Azure Databricks:
•Higher productivity and collaboration
•Integrates easily with the whole Microsoft stack
•Extensive list of data sources
•Suitable for small jobs too
What is Azure Machine Learning Workspace?
•Azure Machine Learning is a separate and modernized service that delivers a complete data science platform.
•It supports both code-first and low-code experiences.
•Azure Machine Learning studio is a web portal in Azure Machine Learning that contains low-code and no-code options for project authoring and asset management.
Benefits of Azure Machine Learning:
•Use machine learning as a service
•Easy and flexible building interface
•Wide range of supported algorithms
•Easy implementation of web services
Integrating Azure Databricks with Azure Machine Learning:
•Azure Databricks is ideal for running large-scale, intensive machine learning workflows on the scalable Apache Spark platform in the Azure cloud.
•It provides a collaborative notebook-based environment with CPU- or GPU-based compute clusters.
•Azure Databricks integrates with Azure Machine Learning and its AutoML capabilities.
Ways to use Azure Databricks with Azure ML:
1. Train a model using Spark MLlib and deploy the model to ACI/AKS.
2. Use automated machine learning capabilities through the Azure ML SDK.
3. Use Azure Databricks as a compute target in an Azure Machine Learning pipeline.
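As a rough illustration of the first option, the sketch below trains a Spark MLlib pipeline in a Databricks notebook and records the run with MLflow. The logistic regression model, the column names, and the `train_and_log` helper are assumptions chosen for illustration; they are not taken from the linked repository.

```python
def train_and_log(train_df, test_df, feature_cols, label_col="label", max_iter=20):
    """Fit a Spark MLlib pipeline and record the run with MLflow.

    Hypothetical helper: the model type and column names are illustrative only.
    """
    # Imported inside the function because these packages are available on
    # Databricks ML runtimes but may be missing elsewhere.
    import mlflow
    import mlflow.spark
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import VectorAssembler

    # Combine the raw feature columns into the single vector column MLlib expects.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    classifier = LogisticRegression(
        featuresCol="features", labelCol=label_col, maxIter=max_iter
    )

    with mlflow.start_run():
        model = Pipeline(stages=[assembler, classifier]).fit(train_df)
        # Area under the ROC curve on held-out data (the evaluator's default metric).
        evaluator = BinaryClassificationEvaluator(labelCol=label_col)
        auc = evaluator.evaluate(model.transform(test_df))
        mlflow.log_param("max_iter", max_iter)
        mlflow.log_metric("auc", auc)
        mlflow.spark.log_model(model, "model")
    return model, auc
```

In a notebook this would be called with two Spark DataFrames, for example the result of `df.randomSplit([0.8, 0.2])`, and a list of numeric feature column names.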
Summary Steps:
1. Refer to the GitHub link for the actual code.
2. Create a storage account and a Blob Storage container (used for this demo). Other data sources such as ADLS, Cosmos DB, SQL Database, or MySQL can also be used.
3. The main script for running this pipeline resides in Azure ML.
4. The pipeline is built in two steps:
5. Data preparation and model building run in Azure Databricks. I used MLflow for this; AutoML can be used similarly.
6. The metrics for evaluating model performance are computed by a Python script in Azure ML.
7. Add both steps to the Azure ML pipeline.
8. Execute the pipeline.
9. Track model performance through the logged metrics and the model registry, and monitor it in Azure ML.
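The summary above can be sketched with the Azure ML Python SDK (v1, `azureml-sdk`) roughly as follows. Every value in `CONFIG` (compute names, VM size, notebook path, cluster ID, experiment name) is a placeholder assumed for illustration, and the helper functions are hypothetical; the actual DatabricksStep_Job.py in the linked repository may be organized differently.

```python
# Placeholder settings; every value is an assumption to be replaced with your own.
CONFIG = {
    "databricks_compute": "databricks-compute",  # name of the attached Databricks compute
    "aml_compute": "cpu-cluster",                # name of the Azure ML compute cluster
    "vm_size": "STANDARD_DS3_V2",
    "notebook_path": "/Shared/data_prepartion",  # workspace path of the Databricks notebook
    "existing_cluster_id": "<databricks-cluster-id>",
    "evaluate_script": "Evaluate.py",
    "experiment": "databricks-aml-demo",
}


def get_compute_targets(ws, config=CONFIG):
    """Return (databricks_target, aml_target), creating the Azure ML cluster if needed."""
    # Imported lazily so this module loads even without azureml-sdk installed.
    from azureml.core.compute import AmlCompute, ComputeTarget
    from azureml.core.compute_target import ComputeTargetException

    # The Databricks workspace is assumed to be attached to Azure ML already.
    databricks_target = ComputeTarget(workspace=ws, name=config["databricks_compute"])
    try:
        aml_target = ComputeTarget(workspace=ws, name=config["aml_compute"])
    except ComputeTargetException:
        provisioning = AmlCompute.provisioning_configuration(
            vm_size=config["vm_size"], max_nodes=2
        )
        aml_target = ComputeTarget.create(ws, config["aml_compute"], provisioning)
        aml_target.wait_for_completion(show_output=True)
    return databricks_target, aml_target


def build_pipeline(ws, config=CONFIG):
    """Assemble the two-step pipeline: Databricks notebook, then evaluation script."""
    from azureml.pipeline.core import Pipeline
    from azureml.pipeline.steps import DatabricksStep, PythonScriptStep

    databricks_target, aml_target = get_compute_targets(ws, config)

    # Step 1: data preparation and model building in the Databricks notebook.
    prepare_step = DatabricksStep(
        name="data_preparation_and_training",
        notebook_path=config["notebook_path"],
        existing_cluster_id=config["existing_cluster_id"],
        run_name="databricks_notebook_run",
        compute_target=databricks_target,
        allow_reuse=False,
    )

    # Step 2: the evaluation script, run on Azure ML compute.
    evaluate_step = PythonScriptStep(
        name="evaluate_model",
        script_name=config["evaluate_script"],
        source_directory=".",
        compute_target=aml_target,
        allow_reuse=False,
    )
    # No data is passed between the steps here, so order them explicitly.
    evaluate_step.run_after(prepare_step)

    return Pipeline(workspace=ws, steps=[prepare_step, evaluate_step])


def run_pipeline(config=CONFIG):
    """Submit the pipeline as an experiment run and wait for it to finish."""
    from azureml.core import Experiment, Workspace

    ws = Workspace.from_config()  # reads the config.json downloaded from the Azure portal
    run = Experiment(ws, config["experiment"]).submit(build_pipeline(ws, config))
    run.wait_for_completion(show_output=True)
    return run
```

Calling `run_pipeline()` from a notebook on the Azure ML compute instance would submit both steps as one experiment run. Ordering the steps with `run_after` is a design choice made here because the sketch does not pass data between them.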
Steps with screenshots:
1. Create an Azure Databricks workspace in the Azure portal, as shown below.
2. In the Azure Databricks workspace, click “Link Azure ML workspace”; the UI shown below will pop up.
3. Here you can create a new Azure ML workspace or link an existing one. When linking the Azure ML workspace, make sure its subscription ID, resource group, and location are the same as those of the Azure Databricks workspace.
4. After completing the above step, Azure ML appears as linked in the Azure Databricks UI, as shown below.
5. Launch the Databricks workspace and create a cluster with the configuration you require.
6. Create a notebook and write the code for data preprocessing and model building. These can be kept in separate notebooks or in the same notebook, depending on the data and the requirements. (Refer to the Jupyter notebook named “data_prepartion”.)
7. Launch the Azure ML workspace and create a compute instance in the Compute section of Azure ML.
8. Launch Jupyter Notebook and create two script files, as your design requires.
(Refer to the Python script files “DatabricksStep_Job.py”, the main pipeline script, and “Evaluate.py”, which logs the metrics of the registered model.)
9. Refer to the DatabricksStep_Job.py script in Azure ML. It covers compute creation for both Azure ML and Azure Databricks.
Compute created for executing commands in Evaluate.py in Azure ML
Compute attached for executing commands in the data_prepartion ipynb file in Azure Databricks
10. After the pipeline has run successfully, you can trace all model metrics, logging information, and registered models in Azure ML.
Similarly, each step in the pipeline, including the Azure Databricks step, can be checked in the pipeline overview in Azure ML.
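To make the metric-tracking step more concrete, here is a minimal sketch of what an evaluation script such as Evaluate.py might contain. The choice of metrics, the helper names, and the sample labels are assumptions for illustration; the real script in the repository may compute and log different values.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision and recall for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def log_to_azure_ml(metrics):
    """Log each metric to the current Azure ML run so it appears in the studio."""
    # Imported lazily; azureml-core is needed only when running as a pipeline step.
    from azureml.core import Run

    run = Run.get_context()
    for name, value in metrics.items():
        run.log(name, value)


# Illustrative labels only, not results from the article's model.
example = classification_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(example)
```

Inside the pipeline step, the script would call `log_to_azure_ml(...)` with the metrics of the real model, which makes them visible on the run's metrics page in Azure ML.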
GitHub link: https://github.com/vigneshSs-07/Azure_DBK_with_AzureML
THANK YOU..!