
Using AI to write SQL and Terraform Code

AI and Machine Learning tools are finally coming into the hands of Data Engineers and business users. This opens up a multitude of possibilities! Could AI, for example, help us write better, more sustainable SQL or Python code in the future? Let's investigate the newest buzz in town, ChatGPT-3, and see how well it can write SQL and Terraform code.


Written by — Mika Heino, Data Architect

Everything starts with small talk

At Recordly, we have strong expertise in how Machine Learning and AI can help create better data solutions, and I am blessed to have worked with one of our top ML engineers, Rasmus Toivanen. He is renowned for his research on Finnish NLP/ASR models and has been featured for that work in Tivi, the Finnish IT trade paper (behind a paywall).

Rasmus has been a great help to me in my past Snowflake Machine Learning projects. After office hours, we've been testing the increasingly popular Stable Diffusion text-to-image solution at the office to generate images of us Recordly employees. Recently, I've been using the Colab Dreambooth text-to-image generator to create images for my blogs. It's a great way to get interesting visuals without having to purchase a license from a stock image service. For example, the illustration on this blog was created with the help of the latest Stable Diffusion 2.0 from Stability AI.

The last time Rasmus and I caught up, we started discussing Werner Vogels' introduction of Neural Radiance Fields (NeRF) at the AWS re:Invent keynote. I've been fascinated with NeRFs ever since I saw them featured on the popular YouTube series Corridor Crew. I think NeRFs are an incredible advancement in the field of image processing technology, but I haven't yet figured out how they could help us with real-life problems.


Rasmus told me about an even more interesting new development than NeRF, OpenAI's ChatGPT-3, which can actually help us right away. I was intrigued by ChatGPT-3 because Rasmus said it can write Terraform code so well that you could even set up a data platform with it. When I asked how it performs with SQL, the lingua franca of our data engineers, I was in for a surprise.

 

What is ChatGPT?

Before doing the actual testing, let's first understand what ChatGPT-3 is. OpenAI's ChatGPT belongs to the class of Conversational AI (CAI) solutions, which are part of artificial intelligence technology that enables machines and humans to have natural conversations.

Conversational AI combines natural language processing (NLP), machine learning (ML), and deep learning (DL) to build systems that can understand and respond to user input in a natural and human-like way. Conversational AI is used in applications such as chatbots, virtual assistants, and automated customer service systems.

With ChatGPT, you can have a realistic, two-way conversation that feels more like talking to a real person. What makes ChatGPT better than current CAI solutions is its use of reinforcement learning in training. To train ChatGPT, the OpenAI team used a combination of reinforcement learning from human feedback (RLHF) and supervised fine-tuning. In the RLHF process, human trainers played both sides of a conversation and were given access to model-generated suggestions to help them compose their responses. The team then created a reward model for reinforcement learning by collecting comparison data consisting of ranked responses from the chatbot. This data was used to fine-tune the model using Proximal Policy Optimization.

 

Writing SQL with ChatGPT-3

Let's then check out what ChatGPT-3 can do. The text under each prompt is the exact wording I gave to ChatGPT-3, and the reply is what ChatGPT-3 responded with, followed by my own comments on the quality of the reply.

The prompt 

Let's start with the easy stuff. I'll use the same SQL syntax that I used in the Sustainable SQL with Jaromir Jagr blog and see how ChatGPT-3 performs.

"I have Snowflake table, Show me how to query the Average and Min values from Players table where I have attribute Player Height"

The reply

To query the average and minimum values of the Player Height attribute in the Players table in Snowflake, you can use the following SQL statement:
SELECT AVG(PlayerHeight) AS AvgPlayerHeight, MIN(PlayerHeight) AS MinPlayerHeight FROM Players;
This statement will return the average and minimum values for Player Height in the Players table. You can adjust the query as needed to include any additional criteria or to include other attributes from the table. For example, if you only want to consider players who are taller than a certain height, you could add a WHERE clause to the query:
SELECT AVG(PlayerHeight) AS AvgPlayerHeight, MIN(PlayerHeight) AS MinPlayerHeight FROM Players WHERE PlayerHeight > X;
where X is the minimum height you want to consider.

Not bad. Not bad at all. I especially like the capitalized keyword notation and that ChatGPT even gives me further instructions on how to use a WHERE clause for further filtering.

The prompt 

Let's try SQL that needs grouping.

"I have Snowflake table Players which has attributes for Player Name, Player time on ice, Player team name. Can you create me a SQL query that shows the player from each team who has the max time on ice on each team?"

The reply

To create a SQL query that shows the player from each team who has the maximum time on ice for their team in the Players table in Snowflake, you can use the following statement:
SELECT PlayerName, PlayerTimeOnIce, PlayerTeamName FROM Players WHERE PlayerTimeOnIce = (SELECT MAX(PlayerTimeOnIce) FROM Players WHERE PlayerTeamName = Players.PlayerTeamName);

This query will return the player name, time on ice, and team name for each player who has the maximum time on ice for their team. You can adjust the query as needed to include other attributes or to add additional criteria. For example, if you only want to consider players who have played a certain number of games, you could add a HAVING clause to the query:

Again, not bad, even though my question was somewhat vague and can be understood in multiple ways. The SQL works, and ChatGPT again gives suggestions on how to improve the query.
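
As a side note, a common Snowflake idiom for this kind of per-group maximum is a window function combined with the QUALIFY clause. A minimal sketch, assuming the same hypothetical Players table and column names:

SELECT
    PlayerName,
    PlayerTimeOnIce,
    PlayerTeamName
FROM Players
-- Keep only the row(s) with the highest time on ice within each team
QUALIFY PlayerTimeOnIce = MAX(PlayerTimeOnIce) OVER (PARTITION BY PlayerTeamName);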

The prompt 

How about administrative SQL?

"I have a Snowflake table called users, I need to clone it a make a copy of it. Whats the SQL command for doing this?"

The reply

To create a copy of the users table in Snowflake, you can use the CREATE TABLE statement with the LIKE clause. The basic syntax for this statement is:
CREATE TABLE users_copy AS SELECT * FROM users;

This will create a new table called new_table_name with the same schema as the users table. You can then use the INSERT statement with a SELECT query to copy the data from the users table into the new table:

INSERT INTO new_table_name SELECT * FROM users;

This will insert all rows from the users table into the new new_table_name table. You can adjust the query as needed to include only certain columns or to add additional criteria for the data to be copied. For example, if you only want to copy users who have a certain attribute value, you could add a WHERE clause to the SELECT query:

INSERT INTO new_table_name
SELECT * FROM users
WHERE attribute = value;

where attribute is the name of the attribute you want to filter on, and value is the value you want to consider.

This works as well, though I was expecting the usage of the CLONE command, a Snowflake specialty. I was also surprised that the reply mentions the LIKE clause even though the example actually uses AS SELECT *. ChatGPT again goes beyond the original question and instructs on a proper INSERT INTO clause with filtering options.
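
For reference, the zero-copy clone I was expecting looks roughly like this (a sketch, assuming the same users table; the target table name is illustrative):

-- Zero-copy clone: the new table initially shares the underlying storage with the source
CREATE TABLE users_clone CLONE users;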

The prompt 

ChatGPT-3 can handle simple SQL queries, so let's try something more specific.

"I have Snowflake table which contains deliveries per month for months between 012022 to 122022. I need to PIVOT this table, how to do this?"

The reply

To pivot a table in Snowflake, you can use the PIVOT function. The basic syntax for this function is:
SELECT * FROM table_name PIVOT ( aggregation_function(column_to_aggregate) FOR pivot_column IN (list_of_values) );

where table_name is the name of the table you want to pivot, aggregation_function is the function you want to apply to the data (such as SUM or COUNT), column_to_aggregate is the column you want to aggregate, and pivot_column is the column you want to pivot on. The list_of_values should be a comma-separated list of the values you want to include in the pivot.

For example, if you have a table called deliveries with columns month, product, and quantity, and you want to pivot the data on the month column and sum the quantities for each product, you could use the following query:

SELECT * FROM deliveries PIVOT ( SUM(quantity) FOR month IN ('012022', '022022', '032022', '042022', '052022', '062022', '072022', '082022', '092022', '102022', '112022', '122022') );

This will pivot the data on the month column and return the total quantity for each product for each month. You can adjust the query as needed to include different columns or to use different aggregation functions.

Yep. Even though I didn't give ChatGPT enough information, it managed to create valid PIVOT SQL. This is looking pretty good. With these kinds of capabilities, you could easily build a prompter that helps with SQL writing.

Consider, for example, that you have to PIVOT the data but do not remember the syntax. Instead of going into the documentation, the prompter would build you a proper example of how to PIVOT data.

The prompt 

Let's try something that is always needed in data projects. Could I create the necessary calendar using ChatGPT-3?

"Create me Snowflake table D_CALENDAR -dimension table for joining with YYYY, MM-DD-YYYY and Day of the week attributes and insert few values into it"

The reply

To create a Snowflake table called D_CALENDAR with attributes for year, date, and day of the week, you can use the following CREATE TABLE statement:
CREATE TABLE D_CALENDAR ( YYYY INT, MM_DD_YYYY DATE, DAY_OF_WEEK VARCHAR(20) );
This will create a table with three columns for year, date, and day of the week. You can then use the INSERT statement to add some data to the table:
INSERT INTO D_CALENDAR (YYYY, MM_DD_YYYY, DAY_OF_WEEK) VALUES (2022, '01-01-2022', 'Saturday'), (2022, '01-02-2022', 'Sunday'), (2022, '01-03-2022', 'Monday'), (2022, '01-04-2022', 'Tuesday'), (2022, '01-05-2022', 'Wednesday'), (2022, '01-06-2022', 'Thursday'), (2022, '01-07-2022', 'Friday');
This will insert seven rows of data into the D_CALENDAR table for the first week of January 2022. You can adjust the INSERT statement as needed to include different data or to add additional rows.

That's D_CALENDAR, or at least the start of it. ChatGPT doesn't know why I need it (it doesn't have an understanding of the context), but it created a proper DDL clause.
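
In a real project you would rarely insert calendar rows by hand; a more typical pattern is to generate them. Here is a minimal sketch, assuming the same D_CALENDAR columns and a one-year range starting from 2022-01-01:

-- Generate one row per day with Snowflake's GENERATOR table function
CREATE OR REPLACE TABLE D_CALENDAR AS
WITH days AS (
    SELECT DATEADD(day, SEQ4(), '2022-01-01'::DATE) AS MM_DD_YYYY
    FROM TABLE(GENERATOR(ROWCOUNT => 365))
)
SELECT
    YEAR(MM_DD_YYYY)    AS YYYY,
    MM_DD_YYYY,
    DAYNAME(MM_DD_YYYY) AS DAY_OF_WEEK  -- returns abbreviated names such as 'Sat'
FROM days;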

The prompt 

Let's try asking the question in a different format.

"If I'm building a data warehouse solution, what's the best modeling option on getting the weekday (Mon, Tue) of a certain date in history?"

The reply

To get the weekday (e.g. Monday, Tuesday) for a certain date in a data warehouse solution, you can use a dimension table that contains the mapping between the numeric day of the week and the corresponding weekday name.

In a data warehouse, a dimension table is a table that contains descriptive attributes of the entities in the data warehouse, such as dates, product categories, or customer demographics. Dimension tables are typically used in combination with fact tables, which contain the quantitative data that is used to measure business performance.

To create a dimension table for the weekdays, you can use the following CREATE TABLE statement:

CREATE TABLE dim_weekdays ( dow INT, weekday VARCHAR(20) );

This will create a dimension table called dim_weekdays with two columns: dow for the numeric day of the week (1-7, where 1 is Monday and 7 is Sunday) and weekday for the corresponding weekday name. You can then use the INSERT statement to add the weekday names to the table:

INSERT INTO dim_weekdays (dow, weekday) VALUES (1, 'Monday'), (2, 'Tuesday'), (3, 'Wednesday'), (4, 'Thursday'), (5, 'Friday'), (6, 'Saturday'), (7, 'Sunday');

This will insert seven rows into the dim_weekdays table, one for each day of the week. You can then use a JOIN operation to link this table to a fact table that contains the dates for which you want to get the weekdays.

For example, if you have a fact table called sales with a column date that contains the dates of the sales transactions, you could use the following query to get the weekday for each transaction:

SELECT s.date, w.weekday FROM sales s JOIN dim_weekdays w ON s.date_part('isodow', s.date) = w.dow;

This will return the date and weekday for each transaction in the sales table, using the `dim

Holy moly. ChatGPT knows what a D_CALENDAR is, but it doesn't have the knowledge to link D_CALENDAR into the proper context here. Still, this is a roughly correct answer, and it covers what should be general knowledge for Data Engineers working on Data Warehouse projects.
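
As a side note, the join condition in the reply uses Postgres-style syntax; in Snowflake the same join would look more like the following sketch, assuming the same hypothetical sales and dim_weekdays tables:

SELECT s.date,
       w.weekday
FROM sales s
JOIN dim_weekdays w
  -- DAYOFWEEKISO returns 1 (Monday) through 7 (Sunday), matching the dow column
  ON DAYOFWEEKISO(s.date) = w.dow;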

The prompt 

Now, let's ask something that most of the Data Engineers I know might not know. I'm expecting a certain answer to this question.

"In the Snowflake database, what is the best table type for staging tables?"

The reply

In Snowflake, the best table type for staging tables is the TRANSIENT table. Transient tables are temporary tables that are automatically dropped when the session that created them ends. They are the ideal choice for staging tables because they do not consume storage space in the Snowflake account and are automatically cleaned up when they are no longer needed.

To create a transient table in Snowflake, you can use the CREATE TRANSIENT TABLE statement. The basic syntax for this statement is:

CREATE TRANSIENT TABLE table_name ( column1 datatype, column2 datatype, ... );

where table_name is the name of the table you want to create and column1, column2, etc. are the columns in the table along with their data types. For example, to create a transient table called staging_data with two columns, col1 and col2, you could use the following statement:

CREATE TRANSIENT TABLE staging_data ( col1 INT, col2 VARCHAR(100) );

This will create a transient table called staging_data with two columns, col1 and col2, of the specified data types. You can then use the INSERT statement to load data into the table. When the session ends, the table and its data will be automatically dropped.

Now I'm mind-blown. This is the answer I was looking for, and it's even better: ChatGPT even explains why the TRANSIENT table is the best option.

How about Terraform? How well does ChatGPT-3 perform on that?

The prompt 

Let's start with the basics. Here I am giving ChatGPT quite a lot of information, such as the general knowledge that MSSQL uses TCP port 1433.

"Create me a Terraform script to set up AWS network security groups and deploy the MSSQL database inside the network security group. Open 1433 firewall for MSSQL database"

The reply

Here is a simple example of a Terraform script that can create a AWS network security group and deploy an MSSQL database inside it.
# Configure the AWS provider
provider "aws" {
  region = "us-east-1"
}

# Create a new VPC
resource "aws_vpc" "my_vpc" {
  cidr_block = "10.0.0.0/16"
}

# Create a new network security group
resource "aws_security_group" "mssql_sg" {
  name        = "mssql-sg"
  description = "Security group for MSSQL database"
  vpc_id      = aws_vpc.my_vpc.id

  # Open the MSSQL database port
  ingress {
    from_port   = 1433
    to_port     = 1433
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# Create a new MSSQL database instance
resource "aws_db_instance" "mssql_db" {
  engine                  = "sqlserver-ex"
  engine_version          = "14.00.3223.3.v1"
  instance_class          = "db.t2.micro"
  name                    = "my_mssql_db"
  username                = "mssqluser"
  password                = "mssqlpass"
  storage_type            = "gp2"
  storage_encrypted       = true
  publicly_accessible     = true
  vpc_security_group_ids  = [aws_security_group.mssql_sg.id]
  backup_retention_period = 7
}

It's been a few years since I have written Terraform (I try to favor SaaS as much as possible), but to my eye this looks promising. There's the necessary VPC, security group, and database server.

The prompt 

Let's try something else. Could I set up a simple Linux server for testing out PyTorch?

"Create me an AWS Terraform script for deploying Linux virtual machine. Inside the Linux, I want to install the latest Python version and Pytorch and create the user pythontestuser"

The reply

Here is a simple example of a Terraform script that can create an AWS Linux virtual machine, install the latest version of Python and PyTorch, and create a new user named pythontestuser.
# Configure the AWS provider
provider "aws" {
  region = "us-east-1"
}

# Create a new Linux AMI
data "aws_ami" "ubuntu" {
  most_recent = true

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }

  owners = ["099720109477"] # Canonical
}

# Create a new EC2 instance
resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t2.micro"

  # Install the latest version of Python and PyTorch
  user_data = <<-EOF
    #!/bin/bash
    # Install Python and PyTorch
    apt-get update
    apt-get install -y python3-pip
    pip3 install torch torchvision
    # Create a new user
    useradd pythontestuser
  EOF
}

Ok, again, it's been a few years since I logged onto any Debian variant, but these look like proper apt-get commands.

The prompt 

As a final question: can I finally focus on creating data products instead of creating data pipelines?

"Create me an example data pipeline using using Azure Data Factory"

The reply

Here is a simple example of an Azure Data Factory data pipeline that copies data from a CSV file stored in Azure Blob Storage to a table in Azure SQL Database.

First, create a new Azure Data Factory instance and give it a name. Then, create a new pipeline and add the following steps:

  1. Get CSV file from Azure Blob Storage: Use the Get metadata and Copy data activities to retrieve the CSV file from Azure Blob Storage and store it in a Azure Data Lake Storage Gen2 account.
  2. Extract data from CSV file: Use the Extract data activity to extract the data from the CSV file and store it in a structured format in Azure Data Lake Storage Gen2.
  3. Load data into Azure SQL Database: Use the Copy data activity to copy the structured data from Azure Data Lake Storage Gen2 and load it into a table in Azure SQL Database.

    Here is the code for the pipeline:

I didn't even bother to paste the code for the pipeline, because you can already see that ChatGPT gives correct answers for creating a really simple data pipeline.

What I can say right away is that if solutions like ChatGPT-3 were added as helper tools for business users or Data Engineers, we could get data pipelines created much faster, and Data Engineers could spend their time solving actual problems with the data. The results that ChatGPT gives are not always 100% right, but they don't need to be, as they point in the right direction. The bigger question remains, though: how do you know whether ChatGPT gives correct answers? I knew the answers I was looking for, but Data Engineers at the start of their careers might not.

This has also raised concerns elsewhere, as ChatGPT has apparently already become so popular that Stack Overflow has temporarily banned its usage for producing coding-related answers, and OpenAI's ChatGPT is already seen as a rival for Google's search.

At the pace AI solutions are being created, I expect that by 2025 we will have AI solutions more and more embedded into data tools as well.

What solutions can we use in the future?

Conversational AI is set to revolutionize the way data engineers work in the future. With the ability to ask questions and receive answers in natural language, data engineers can quickly access and analyze data.


Moreover, conversational AI can be used to automate mundane tasks, freeing up data engineers to focus on more strategic projects and enabling them to work more efficiently and effectively.

With conversational AI, data engineers will be able to create more powerful data models and insights, transforming the way they work and paving the way for a brighter future. This is just in the field of Data Engineering.

Let's pose this question to ChatGPT and see what other answers we can get.

 
  1. Chatbots: ChatGPT can be used to build chatbots that can provide customer service, recommend products and services, answer FAQs, and more.  
  2. Automated customer support: ChatGPT can be used to create an automated customer support system that can provide answers to customer queries quickly and accurately.  
  3. Natural language processing: ChatGPT can be used to create natural language processing systems that can understand and interpret customer queries.  
  4. Text summarization: ChatGPT can be used to create text summarization systems that can generate summaries of customer queries.  
  5. Recommendation systems: ChatGPT can be used to create recommendation systems that can suggest products and services to customers based on their queries.  
  6. Automated marketing: ChatGPT can be used to create automated marketing systems that can generate targeted marketing messages based on customer queries.

 

Can I try this myself?

Yes, OpenAI provides a free site where you can try out its ChatGPT technology at chat.openai.com. All you need to do is create an account and you're ready to start conversing with it. Since ChatGPT is a conversational AI, you'll need to ask it the right questions to get the best results. 

So, here comes the more intriguing question: did I write this blog myself or not? The right answer is that I created the first version and structure, but I asked ChatGPT to rewrite it and make it better. After that, I modified the text again. Even this sentence isn't completely my own work; you can tell by the lack of Finnish language structure. This is just one example of how solutions like ChatGPT can help in the future. Whether you need help writing good SQL, working with Python script syntax, or simply creating more readable blog text, ChatGPT can help you out.

 


 

The use of AI tools such as ChatGPT and the currently very popular art generators, from DALL·E 2 to Lensa, does not come without issues. Far from it. Our team has been doing some testing with them, and we will be sharing our experiences and thoughts on the matter later. If you're interested in learning more about what you should take into consideration when (or if) using these tools, sign up:

 
