Don’t know SQL? Can you talk?

Natural language queries to SQL with Transformer based language models

Avinash Sooriyarachchi
7 min read · Mar 24, 2022

The technical skill gap between what’s required to effectively analyze data and what the average professional possesses has stood in the way of many organizations for years. The problem is only getting more pronounced as organizations seek to become more data driven. To address it, a large number of new products and even product categories have been created. One such category is drag-and-drop querying tools, which allow non-technical users to string together a process for extracting insight from data by dragging and dropping icons on a canvas. At the end of the process, this visual flow is compiled into a query plan in code, often SQL, and executed on a data processing engine.

The drag-and-drop process described above works well, but only for simple queries. Once you string together a flow for a complex query or process, you are left with a tangle that is hard to read and even harder to follow (at which point many begin to think they’d be better off learning to write SQL!). Unfortunately, these complex queries are all too common in most organizations.

This begs the question: if drag-and-drop query generators (marketed as ‘low-code analytics tools’) are ineffective at most real-world tasks, what could be the holy grail? In my opinion, as matters stand, natural language query (NLP speak for a ‘question’) to SQL would come very close. SQL queries can run on many distributed query platforms such as Apache Spark (the best tailor-made product to check out here, in my opinion, is Databricks SQL), any data warehouse (again, check out Databricks SQL) or Presto, so SQL provides a uniform interface to ‘interrogate’ data. All you have to do, as an end user, is know what to ask for; the system should handle the ambiguity inherent in the question, generate a SQL query and run it on the underlying data processing system.

With transformer-based language model architectures, there have been a few good demonstrations of this text-to-SQL approach. However, none have been foolproof, largely because of the data quality issues that plague enterprise-grade datasets and the limitations of transformer-based models themselves. Regardless, it’s still worth taking a closer look at what’s out there. Here I focus on transformer-based approaches in particular, although there is a rich body of research on the subject that extends beyond this particular approach.

GPT-3

Right off the bat, I’d like to start with the biggest language model most people know about: Generative Pre-trained Transformer 3 (GPT-3). With 175 billion parameters, the model itself is not open source, but it is available as an API as a service.

One of the demonstrative examples that OpenAI, the organization behind a multitude of NLP research and the commercialization of GPT-3, showcases on its website is, in fact, text-to-SQL conversion. To be precise, there are two such examples. Note that these examples use the largest model available at the time of publication of this blog, text-davinci-002.

The first approach is to generate simple SQL queries from a natural language query (a question). One such example that I tried is as follows:

Question/Prompt:

Create a SQL request to find all people named Jeoff who live in the state of LA:

It generated the following query:

SELECT * FROM people WHERE name = 'Jeoff' AND state = 'LA';

This is very good. Once I cranked up the complexity of the query, the results were suboptimal but still pretty good.

Question/Prompt:

Create a SQL request to find all people named Jeoff who live in the state of LA who make more than 50000:

Generated SQL query:

SELECT * FROM people WHERE name = 'Jeoff' AND state = 'LA' AND salary > 50000;

The problem here is somewhat subtle, and in all honesty it is present in both examples: how do we know the name is stored in a ‘name’ column, or that the annual salary is in a ‘salary’ column and not an ‘income’ column? We provide no information about the schema and let the model invent column names based on the data it was trained on. This is certain to fail in most real-world situations.

GPT-3 does have another, much more robust approach that includes the table schema in the prompt for the model.

An example is as follows (documented examples can be found in the SQL translate section of the OpenAI API examples: https://beta.openai.com/examples):

Question/Prompt:

### Database tables, with their properties:
#
# Student(id, name, department_id)
# Department(id, name, cost_center)
# Exam_Score(id, student_id, score, date)
#
### A query to list students who scored more than 80 in the finals exam in the mathematics department
SELECT

Generated SQL query:

SELECT student.name FROM student INNER JOIN department ON student.department_id=department.id INNER JOIN exam_score ON student.id=exam_score.student_id WHERE department.name="mathematics" AND exam_score.score>80 GROUP BY student.name HAVING COUNT(*)>1

This is not very concise but pretty impressive.
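For reference, a call with a schema-prefixed prompt like the one above looks roughly like the following, using the openai Python package as it existed at the time of writing. This is a minimal sketch, not OpenAI’s reference code: the stop sequences, temperature and token limit are my own assumptions.

```python
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# Schema-prefixed prompt, in the style of the 'SQL translate' example.
prompt = """### Database tables, with their properties:
#
# Student(id, name, department_id)
# Department(id, name, cost_center)
# Exam_Score(id, student_id, score, date)
#
### A query to list students who scored more than 80 in the finals exam in the mathematics department
SELECT"""

response = openai.Completion.create(
    model="text-davinci-002",
    prompt=prompt,
    temperature=0,      # deterministic output is preferable for SQL generation
    max_tokens=150,
    stop=["#", ";"],    # stop before the model starts a new comment or query
)

# The prompt ends with 'SELECT', so prepend it to the completion.
sql = "SELECT" + response["choices"][0]["text"]
print(sql)
```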

Even with this, as queries become more complex, the results become less and less impressive. Also, the model input can only contain a certain maximum number of tokens. Given that the schema information in metastores and elsewhere spans hundreds of tables, some preprocessing of the input prompt is needed for this to work on anything beyond a toy example in the real world. Another thing to keep in mind is cost. As mentioned before, the model used here is the largest (and hence the most expensive) GPT model available via the OpenAI API. In a production application with hundreds or thousands of concurrent ad-hoc queries, the cost is sure to balloon. In addition, some of the generated queries may not be valid, wasting compute on the underlying distributed query engine. This might necessitate a rather annoying, partially manual query validation process, at which point the intended user base shifts from data novices to people who would rather validate SQL than write it (although they can), which misses the point.

Perhaps with the fine-tuning feature that OpenAI has released (https://beta.openai.com/docs/guides/fine-tuning), the query quality can be improved, but some of the other problems, such as the input token limit, remain unaffected.
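At the time of writing, fine-tuning expects a JSONL file of prompt/completion pairs, so preparing text-to-SQL training data could look something like the sketch below. The example pairs are made up for illustration, and the separator and end token follow the general recommendations in OpenAI’s fine-tuning guide rather than anything specific to SQL.

```python
import json

# Hypothetical (question, SQL) pairs; in practice these would come from your own
# query logs or a labeled corpus such as WikiSQL.
pairs = [
    ("List all customers in California",
     "SELECT * FROM customers WHERE state = 'CA';"),
    ("How many orders were placed in 2021?",
     "SELECT COUNT(*) FROM orders WHERE YEAR(order_date) = 2021;"),
]

with open("text_to_sql_train.jsonl", "w") as f:
    for question, sql in pairs:
        # A fixed separator at the end of the prompt and a stop token at the end
        # of the completion help the fine-tuned model know where to stop.
        record = {"prompt": f"{question}\n\n###\n\n", "completion": f" {sql} END"}
        f.write(json.dumps(record) + "\n")
```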

T5: Encoder-Decoder Transformer based model

Paying for the OpenAI API is not the only way to convert text to SQL queries using transformers. A class of transformers called encoder-decoder models, most commonly used for sequence-to-sequence translation (which is pretty close to what we are trying to do here), can be used for this purpose.

A T5 model fine-tuned on a publicly available corpus of text-SQL samples, the WikiSQL dataset, is available on the Hugging Face model hub: https://huggingface.co/mrm8488/t5-base-finetuned-wikiSQL

The performance of this model at generating SQL queries from text input is rather limited and is similar to the first example of simple SQL queries we demonstrated with the OpenAI API. The inability to feed in a table schema and the poor performance on complex queries make this an interesting experiment, but one that is difficult to apply in most real-world scenarios without customization.

A simple query I prompted this fine tuned T5 model with is as follows:

Query/Prompt:

How many kangaroos were born in Australia in 2010?

Generated SQL query:

SELECT COUNT Kangaroo FROM table WHERE Year born = 2010 AND

This query is less than stellar. However, with better preprocessing and more fine-tuning, the model’s performance could be improved.
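For anyone who wants to try this themselves, loading and querying the fine-tuned model with the transformers library looks roughly like the following. This is a minimal sketch; the ‘translate English to SQL’ task prefix follows the usage shown on the model card and should be treated as an assumption.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "mrm8488/t5-base-finetuned-wikiSQL"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def text_to_sql(question: str) -> str:
    # The model was fine-tuned with a task prefix, so prepend it to the question.
    inputs = tokenizer("translate English to SQL: " + question, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(text_to_sql("How many kangaroos were born in Australia in 2010?"))
```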

Converting Speech to SQL queries

An example where I combined this model with the speech-to-text model silero_stt can be found at the following link: https://github.com/AviSoori1x/speech_to_text

The file is a Databricks .dbc archive that can be tried out with the Community Edition, which is free.

If you wish to build a real-time application with this, you could use MLflow serving within Databricks, wrapping the speech-to-text (silero_stt) and text-to-SQL (T5) inference steps in a custom class that inherits from mlflow.pyfunc.PythonModel, following the instructions here: https://mlflow.org/docs/latest/models.html#custom-python-models

You can then deploy it using the instructions here: https://docs.databricks.com/applications/mlflow/model-serving.html

All of this without having to write custom web servers, worry about hosting them, or deal with the baggage that comes with it, just by using Databricks!
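A minimal sketch of what such a custom pyfunc wrapper could look like is shown below. The model names follow the ones used above, but the audio handling and helper calls are simplified assumptions rather than the exact code in the linked notebook.

```python
import mlflow.pyfunc
import pandas as pd

class SpeechToSQL(mlflow.pyfunc.PythonModel):
    """Wraps silero_stt (speech to text) and the fine-tuned T5 (text to SQL)."""

    def load_context(self, context):
        import torch
        from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

        # Silero speech-to-text model from torch.hub (returns model, decoder, utils).
        self.stt_model, self.decoder, utils = torch.hub.load(
            "snakers4/silero-models", "silero_stt", language="en"
        )
        (self.read_batch, self.split_into_batches,
         self.read_audio, self.prepare_model_input) = utils

        self.tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-wikiSQL")
        self.t5 = AutoModelForSeq2SeqLM.from_pretrained("mrm8488/t5-base-finetuned-wikiSQL")

    def predict(self, context, model_input: pd.DataFrame) -> pd.Series:
        sql_queries = []
        for path in model_input["audio_path"]:
            # 1. Transcribe the audio file to text.
            batch = self.prepare_model_input([self.read_audio(path)])
            transcript = self.decoder(self.stt_model(batch)[0].cpu())
            # 2. Translate the transcript to SQL with the fine-tuned T5 model.
            inputs = self.tokenizer("translate English to SQL: " + transcript,
                                    return_tensors="pt")
            outputs = self.t5.generate(**inputs, max_length=128)
            sql_queries.append(self.tokenizer.decode(outputs[0], skip_special_tokens=True))
        return pd.Series(sql_queries)

# Log the wrapper so it can be served with MLflow model serving.
with mlflow.start_run():
    mlflow.pyfunc.log_model("speech_to_sql", python_model=SpeechToSQL())
```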

A dedicated exploration of this done by Romain Rigaux can be found here: https://medium.com/data-querying/serving-a-transformer-model-converting-text-to-sql-with-huggingface-and-mlflow-be831ae6213c

In the most trivial cases, this would allow the user to ‘ask’ a question, and the system would generate a SQL query that could be executed against a query engine to glean insights from structured data. For example, you could use the Databricks SQL REST API to execute the generated SQL queries against highly optimized, autoscaling Databricks SQL endpoints, which offer excellent price/performance (see: https://databricks.com/blog/2021/11/02/databricks-sets-official-data-warehousing-performance-record.html).
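As one concrete way of doing this from Python, the sketch below uses the databricks-sql-connector package (rather than raw REST calls) to submit a generated query to a Databricks SQL endpoint. The hostname, HTTP path, token and query are placeholders.

```python
from databricks import sql

# In practice this would come from the text-to-SQL model above.
generated_sql = "SELECT * FROM people WHERE name = 'Jeoff' AND state = 'LA'"

# Placeholders: take these from your SQL endpoint's connection details.
with sql.connect(
    server_hostname="<workspace-host>.cloud.databricks.com",
    http_path="/sql/1.0/endpoints/<endpoint-id>",
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute(generated_sql)
        for row in cursor.fetchall():
            print(row)
```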

Using Databricks SQL with Hugging Face Transformers and Silero STT for a speech to distributed SQL query execution system

Based on the very high-level exploration I’ve done here, the future looks promising. However, currently no out-of-the-box pre-trained or fine-tuned model generates 100% accurate SQL queries, even with the schema provided. With some clever engineering, this could change in the future, making the data ecosystem more accessible to people with varied skill levels.

You can get started with Databricks community edition at: https://databricks.com/try-databricks
