Databricks AutoML + SQL = Streaming ML on the Lakehouse!

Avinash Sooriyarachchi
11 min read · Jun 23, 2022


Stack Overflow recently released their 2022 Developer Survey, and among professional developers, SQL came in ahead of Python as the third most widely used programming language. While some would dispute SQL’s standing as a ‘proper’ programming language (it is Turing complete, so…), it’s hard to deny its ubiquity and utility in the data world.

However, when it comes to AI and machine learning, SQL is still a second-class citizen, even compared to R or JavaScript. This is unfortunate, as SQL-savvy engineers and analysts typically know the data and the domain extremely well, which is crucial to successful machine learning in production.

On the other hand, in-stream model deployment is a largely unexplored deployment mechanism with a number of business-critical applications such as anomaly detection and predictive maintenance.

A means of enabling SQL analysts to build streaming machine learning systems would unlock significant value for many organizations across numerous industries. Unfortunately, streaming, machine learning, and SQL are hardly ever used in the same sentence!

Today, we are going to change that and see how you can use the low-code machine learning capabilities of Databricks AutoML and Delta Live Tables to perform machine learning inference on streaming data on the Databricks Lakehouse. In other words, we will train a machine learning model and put it into production to perform predictions on streaming data without writing a single line of Python!

In the diagram below, I’ve broken the end-to-end process down into five stages for the sake of clarity. All the transformations you will compose are in SQL, and they will be executed in a Delta Live Tables (DLT) pipeline. If you are unfamiliar with features such as DLT and Databricks AutoML, I will add more color as we go through the steps.

End-to-end process for streaming ML inference on Databricks using AutoML and DLT

Let’s take a closer look at the 5 steps.

  1. Load data and train the machine learning model using Databricks AutoML low-code interface.
  2. Copy the auto-generated Spark UDF code and requirements from the MLflow UI and paste them into the cells of a Databricks notebook.
  3. Write the SQL transformations you wish to perform on streaming data in a Databricks notebook with SQL as the default language.
  4. Create a DLT pipeline using the DLT interface and hit Run. Your streaming ML inference pipeline will start transforming records and scoring them against the ML model on the fly!
  5. Visualize the results using, you guessed it, SQL with Databricks SQL.

As you can see, although there are five steps, the process of creating this streaming pipeline is quite simple and well within the skill set of an eager SQL analyst.

This general pattern is applicable across multiple use cases. For the purpose of this blog, we will choose a predictive maintenance scenario where we train a machine learning model to predict machine failure and deploy the model to a stream of incoming machine sensor readings.

Now let’s dive in and start building!

I have placed all the code and data necessary for this entire exercise in this publicly accessible GitHub repo. The data we use in this example is a sanitized version of this machine predictive maintenance classification dataset from Kaggle.

Clone example code repository with Repos

Databricks provides seamless integration with a number of Git providers such as GitHub and GitLab through Databricks Repos. We can use Repos to clone the repository linked above to get started.

First, copy the URL of the Git repo from the GitHub UI.

Click on the Repos icon in the left sidebar of the Databricks workspace interface. Right-click in the opened tab and click Add Repo in the drop-down menu. Paste the URL into the Git Repository URL text box and click Create.

Upload the Data for training model with Databricks AutoML

You can import data into Databricks from any source system under the sun. However, for this example, the goal is to keep things simple. We will first upload the data and create a table.

Go to the Databricks SQL UI using the persona switcher.

Then open the Data Explorer to get to the data upload UI (this is only meant for small datasets for quick analysis).

Click on ‘Create table via upload’ and drag and drop the .csv file with the sample machine readings from the GitHub repo (download it locally first, then upload it using this UI).

You will see a sample of the data from the file.

Click the Create button. You now have a table that can be used by Databricks AutoML to train an ML model.

Train the machine learning model using Databricks AutoML

Databricks AutoML provides a low-code approach to automated machine learning. With the click of a button, you can trigger the system to search for the best machine learning model for a given problem. It performs an intelligent Bayesian hyperparameter and model search and provides the trained models along with the notebooks used to generate them. These notebooks are well commented and quite easy to modify if needed. This glass-box approach is in stark contrast to most black-box AutoML systems on the market today. The system also generates a data exploration notebook, which fully automates the exploratory data analysis process.

Now let’s get going! Go to the persona switcher again and choose the Machine Learning persona.

Then click on the Start AutoML link in the user interface.

When performing anything related to machine learning on Databricks, always use a cluster with a machine learning runtime. This example was created using a Databricks Runtime 10.5 ML cluster, and I encourage you to use the same for seamless reproduction of the results.

Leave the default ML problem type of Classification, deselect the failure type variable (to avoid data leakage), and select ‘Target’ as the prediction target. Then hit the bright blue button and AutoML will do all the work for you!

There is a lot more to AutoML than I’ll discuss here and it’s definitely worth checking out.

Once the training process is done, you will see the notebook corresponding to the top model, the data exploration notebook, and a list of models (with their corresponding notebooks) below it. Click on the model artifact link for the top model. In this case, a LightGBM model was the most performant model generated by Databricks AutoML. MLflow, an open source platform developed by Databricks and tightly integrated with the Lakehouse, works under the hood to track and log all the models generated by AutoML.

Thanks to MLflow, this interface shows everything required to deploy the model in any deployment system. Register the model by clicking the Register Model button, then copy the Spark UDF code and the dependencies in the requirements.txt file to a text editor.

Why would you write Python when you can copy, paste, and slightly modify it?

Create a Notebook for Registering the UDF

To use the machine learning model we obtained above in a streaming application, we need to apply it as an Apache Spark User Defined Function (UDF). The autogenerated code we copied wraps the model in a highly efficient vectorized UDF that can then be used to perform distributed inference across a cluster in either batch or streaming applications. How does this apply to Delta Live Tables? We will revisit that after putting together the notebook for registering the UDF.

Open a Python notebook and paste the dependencies you copied from the requirements.txt file. Prepend each line with ‘%pip install ’ so that each dependency is installed in the notebook’s environment.

Then, in the cell below, paste the autogenerated PySpark UDF code you copied. At the bottom of it, copy and paste the following line:

spark.udf.register("predict_failure", loaded_model)

Just so I can stick to my promise of you not manually writing any Python :)

The second cell should now contain the pasted UDF code followed by the registration line above.

To summarize, this snippet loads the model as a Spark UDF and registers it under the name predict_failure so that it can be called from SQL in a streaming pipeline.

What is Delta Live Tables?

Delta Live Tables (DLT) is a declarative ETL framework that lets you focus on writing the data transformation logic in SQL or Python while it automates and abstracts away the underlying complexity of building and maintaining production-grade data pipelines. With DLT, you can create streaming or batch data pipelines without any knowledge of Apache Spark or other frameworks. Beyond ETL, DLT can also be used to build machine learning feature engineering and inference pipelines.

In this example, we are combining ETL and model inference in a streaming pipeline. (There is a lot more to DLT, such as data quality monitoring, that I won’t discuss in this blog, so be sure to check out the documentation.)

Loading Data for Ingestion with Auto Loader in DLT

For a streaming pipeline, data needs to be continuously ingested as it is generated and lands in object storage. Databricks Auto Loader exists for this exact purpose. Auto Loader is used to ingest data into the DLT pipeline, and to simulate this process in this example, we will re-upload a sample of the same dataset to the Databricks File System (DBFS).

Specify a subdirectory and upload the data to DBFS: navigate to the DBFS upload interface by clicking the Data icon in the left sidebar of either the Machine Learning or the Data Science and Engineering persona view of the Databricks workspace.

Note this DBFS location; it will be used later in the SQL notebook.

Composing DLT transformation logic in SQL

Create a notebook with SQL as the default language. This will contain the ETL and model inference logic. You should be able to use the notebook in the cloned repo to reproduce this example; the only thing you need to change is the DBFS directory containing the data to be ingested by the pipeline.

Let’s examine the data ingestion code closely:
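Here is a minimal sketch of what this ingestion step looks like. The table name and DBFS path below are assumptions for illustration; the notebook in the repo may use different names, and the path should point to the directory you uploaded to.

CREATE STREAMING LIVE TABLE machine_readings_bronze
COMMENT "Raw machine sensor readings incrementally ingested from DBFS with Auto Loader"
TBLPROPERTIES ("quality" = "bronze")
AS SELECT *
FROM cloud_files(
  "/FileStore/tables/machine_sensor_data/",  -- assumed DBFS path; use your upload location
  "csv",
  map("cloudFiles.inferColumnTypes", "true")
);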

The above SQL statement creates a streaming live table in DLT. This syntax is specific to DLT and creates a table that will be continuously streamed into when the pipeline runs in continuous mode (we’ll get to this). Here, a comment is added to give additional information on the logic of this step. We are using Auto Loader (the cloud_files function) to incrementally ingest .csv files from the specified directory into a Bronze table while inferring the schema dynamically. In the Databricks medallion architecture, a Bronze table contains raw data as it is ingested from object storage; this is declared as a table property. Note that the comment and table property are optional here.

Then we perform some data cleaning and quality checks before passing records on to the model for inference.
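A sketch of that step might look like the following. The constraint names and the two checked columns (air_temperature and torque here) are illustrative assumptions; match them to the fields actually checked in the repo’s notebook.

CREATE STREAMING LIVE TABLE machine_readings_silver (
  CONSTRAINT valid_air_temperature EXPECT (air_temperature IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT valid_torque EXPECT (torque IS NOT NULL) ON VIOLATION DROP ROW
)
COMMENT "Cleaned sensor readings with null checks enforced as DLT expectations"
TBLPROPERTIES ("quality" = "silver")
AS SELECT *
FROM STREAM(LIVE.machine_readings_bronze);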

Here, we create yet another streaming live table, but with a twist: we add data quality constraints that monitor two fields flowing through this step for null values and drop records where these constraints are not met. The beauty of DLT is that metrics for these quality constraints are collected as the pipeline executes (which we will get to shortly). The table thus created is cleaner, hence a Silver table.

Finally, we get to the Gold table, where records are scored with the UDF registered in the previous notebook. We are making predictions with a LightGBM model on streaming data, in SQL!
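A sketch of the Gold table is below. The feature column names passed to predict_failure are assumptions; pass the columns your AutoML model was actually trained on, in the order shown in the autogenerated UDF example.

CREATE STREAMING LIVE TABLE machine_readings_gold
COMMENT "Sensor readings scored in-stream with the registered predict_failure UDF"
TBLPROPERTIES ("quality" = "gold")
AS SELECT
  *,
  predict_failure(air_temperature, process_temperature, rotational_speed, torque, tool_wear) AS failure_prediction
FROM STREAM(LIVE.machine_readings_silver);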

But wait, the code for registering the UDF lives in another notebook altogether. And if you try to run the SQL notebook cell by cell, nothing actually executes: Databricks simply reports that the Delta Live Tables query is syntactically valid and that a pipeline is needed to define and populate the tables.

In other words, running the notebook directly only provides syntax validation (which is very useful, but far short of a streaming ML inference pipeline!).

We need to create an actual DLT pipeline from these two notebooks.

Putting it all together with Delta Live Tables

Go to the Machine Learning or Data Science and Engineering persona view in the workspace interface and click on the Delta Live Tables tab.

Then click on the Create Pipeline button.

Enter a name of your choice for the pipeline and add both notebooks as notebook libraries in the appropriate section (order does not matter; DLT will figure out the dependencies). I recommend running the following line in a SQL notebook elsewhere, or in the Databricks SQL query editor, so there is a database in which to store the tables created by the DLT pipeline:

CREATE DATABASE IF NOT EXISTS Machine_predictions

Enter this database name in the ‘Target’ input field and select Continuous for Pipeline Mode. This creates a streaming pipeline; for batch processing, select Triggered instead. There are numerous other options, and I invite you to explore them, as they show the extent to which DLT automates the pain away from ETL.

Once your configuration looks right, click Create.

In a few minutes, you will have a streaming pipeline that performs predictions using machine learning on incoming records!

As you can see, this pipeline runs in development mode. In a production scenario, you can switch to production mode after testing, and features enabled by DLT, such as automated cluster restarts and pipeline retries, make the process seamless.

Note that the sample data used in this example is very small. DLT as a framework is engineered to handle data volumes from kilobyte to petabyte scale in real-world scenarios.

There’s much more to DLT than I can cover in this blog. I encourage you to watch this short video for a quick overview.

Querying Results with Databricks SQL

Now let’s use SQL to take a look at the predictions generated by this pipeline.

Switch over to Databricks SQL and run the following query in the query editor:
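Something along these lines works as a sketch, assuming the Machine_predictions target database from earlier and the illustrative table and column names used above; adjust the ID column to match your dataset.

SELECT record_id, failure_prediction
FROM machine_predictions.machine_readings_gold
ORDER BY record_id;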

You should see the record IDs and the model predictions we expect in the Gold table! You will also see the Bronze and Silver tables in the database, and they can be queried just as easily.

There you have it. We used the power of the low-code AutoML capabilities in Databricks to train a machine learning model and deployed it in-stream in a pipeline created with Delta Live Tables, just by pointing and clicking, using autogenerated code and straight SQL! This opens up a world of possibilities for data teams where SQL is the most readily available skill. I invite you to try out this example, tweak things, create your own, and hopefully use this pattern for real use cases in your day-to-day work!
