Salesforce is using AI to democratize SQL so anyone can query databases in natural language

SQL is about as easy as it gets in the world of programming, and yet its learning curve is still steep enough to prevent many people from interacting with relational databases. Salesforce’s AI research team took it upon itself to explore how machine learning might be able to open doors for those without knowledge of SQL.

Their recent paper, Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning, builds on sequence-to-sequence models typically employed in machine translation. A reinforcement learning twist allowed the team to obtain promising results translating natural language database queries into SQL.

In practice, this means that you could simply ask who the winningest team in college football is, and an appropriate database could be automatically queried to tell you that it is in fact the University of Michigan.

“We don’t actually have just one way of writing a query the correct way,” Victor Zhong, one of the Salesforce researchers who worked on the project, explained to me in an interview. “If I give a natural language question, there might be two or three ways to write the query. We use reinforcement learning to encourage the use of queries that obtain the same result.”
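To make that idea concrete, here is a minimal sketch of a result-based reward, assuming a SQLite database. The `execution_reward` function, the `teams` table, its win counts and the reward values are all made up for illustration and are not taken from Salesforce's code. The point is only that two differently written queries can earn the same reward when they return the same rows.

```python
import sqlite3

def execution_reward(conn, generated_sql, reference_sql):
    """Score a generated query by comparing its result with a reference query's result.

    The reward values (-2, -1, +1) are illustrative assumptions.
    """
    try:
        generated_rows = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return -2  # the generated text could not be executed as SQL
    reference_rows = conn.execute(reference_sql).fetchall()
    # Compare sorted rows so that row order does not affect the score
    if sorted(generated_rows) == sorted(reference_rows):
        return 1
    return -1

# Toy table with invented numbers, for demonstration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE teams (school TEXT, wins INTEGER)")
conn.executemany(
    "INSERT INTO teams VALUES (?, ?)",
    [("Michigan", 3), ("School B", 2), ("School C", 1)],
)

reference = "SELECT school FROM teams ORDER BY wins DESC LIMIT 1"
alternative = "SELECT school FROM teams WHERE wins = (SELECT MAX(wins) FROM teams)"
wrong = "SELECT school FROM teams ORDER BY wins ASC LIMIT 1"
invalid = "SELEC school FROM teams"

rewards = [execution_reward(conn, q, reference) for q in (alternative, wrong, invalid)]
print(rewards)
```

A plain word-for-word comparison against the reference query would penalize `alternative` even though it answers the question, which is the problem Zhong describes.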

You can imagine how machine translation problems can quickly become massively complex with large vocabularies. The more you can restrict the number of possible translations for each term, the simpler your problem becomes. To this end, Salesforce opted to limit its vocabulary to terms used in database labels, the words in the question being asked and the words typically used in SQL queries.
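A rough sketch of that restriction, under simplifying assumptions: the `build_output_vocabulary` helper, the tokenizer and the short `SQL_KEYWORDS` list below are hypothetical stand-ins, not the paper's actual implementation. It shows how the set of candidate output words shrinks to the three sources named above.

```python
import re

# A small illustrative subset of SQL terms; the real list is an assumption here
SQL_KEYWORDS = ["SELECT", "FROM", "WHERE", "AND", "COUNT", "MAX", "MIN", "SUM", "AVG", "=", ">", "<"]

def tokenize(text):
    """Split text into lowercase word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_output_vocabulary(column_names, question):
    """Collect the only tokens the model may emit: column labels, question words, SQL terms."""
    candidates = [tok for name in column_names for tok in tokenize(name)]
    candidates += tokenize(question)
    candidates += [kw.lower() for kw in SQL_KEYWORDS]
    vocab = []
    for tok in candidates:
        if tok not in vocab:  # keep first occurrence, preserve order
            vocab.append(tok)
    return vocab

vocab = build_output_vocabulary(["School", "Wins"], "Which school has the most wins?")
print(len(vocab), vocab)
```

Instead of choosing from every word in a language, the model in this sketch picks from a couple dozen tokens that are specific to the table and the question at hand.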

The idea of democratizing SQL isn’t new. Startups like ClearGraph, which was recently acquired by Tableau, have made it their business to open up data with English rather than SQL.

“Some models perform execution on a database itself,” added Zhong. “But there are potential privacy concerns if you’re asking a question about Social Security numbers.”

Outside of the paper itself, Salesforce’s biggest contribution here comes in the form of the WikiSQL data set it built to aid in constructing its model. First, HTML tables were collected from Wikipedia. These tables became the basis for randomly generated SQL queries. These queries were used to form questions that were then passed off to humans for paraphrasing over Amazon Mechanical Turk. Each paraphrase was verified twice with additional human review. The resulting data set is the largest of its kind in existence.
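The first two steps of that pipeline, generating a random query and turning it into a crude question, can be sketched roughly as follows. Everything here is an assumption made for illustration: the `random_query` function, the question template, the aggregation list and the toy table are hypothetical, and the value quoting is too naive for real data. The human paraphrasing and verification steps are not modeled.

```python
import random

# Empty string means "no aggregation"; the choice of operators is illustrative
AGGREGATIONS = ["", "COUNT", "MAX", "MIN"]

def random_query(table_name, columns, rows, rng):
    """Build a random single-condition SQL query and a templated question for it."""
    select_col = rng.choice(columns)
    agg = rng.choice(AGGREGATIONS)
    cond_col = rng.choice(columns)
    # Take the condition value from an actual row so the query can match something
    cond_value = rng.choice(rows)[columns.index(cond_col)]

    select_expr = f"{agg}({select_col})" if agg else select_col
    # repr() quotes strings and leaves numbers bare; fine for this toy data only
    sql = f"SELECT {select_expr} FROM {table_name} WHERE {cond_col} = {cond_value!r}"

    agg_words = f"{agg.lower()} " if agg else ""
    question = f"What is the {agg_words}{select_col} when {cond_col} is {cond_value}?"
    return sql, question

columns = ["school", "wins"]
rows = [("Michigan", 3), ("School B", 2), ("School C", 1)]  # invented values
rng = random.Random(0)
sql, question = random_query("teams", columns, rows, rng)
print(sql)
print(question)
```

The templated question is deliberately stilted. In the process described above, that is where human workers come in, rewriting it into something a person would actually ask.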
