SQL Code Analysis
In the post, I used a simple SQL query to explain how certain things work in SQL. I also outlined problems with the query and potential ways to improve the code.
Abraham Enyo-one Musa
Structured Query Language (SQL) is one of the most common programming languages that enables one to store, retrieve, manipulate and organize data in relational databases such as MySQL, Oracle Database, and Microsoft SQL Server. This documentation contains a short SQL query example that combines two tables using two unique keys present in both tables.
On a high level, the example SQL code above "selects all columns from tables a and b, and joins both tables anywhere there's a match on the sk_applicant and hellofresh_week field. A NULL value is returned for non-match keys." Let's take a moment to understand what the query is doing.
In SQL, "SELECT" is a reserved keyword used to find and extract data from one or more database tables. Whenever you use a SELECT keyword in a query, what typically comes after is the column you want to select (In the example SQL code, "SELECT a.*, b.*" retrieves all columns in tables a and b). A database table stores data in a logically organized, row-and-column format and uses identifiers (i.e., primary and foreign keys) to uniquely identify table records and enhance referencing with other related tables.
In SQL, you can't select data in a vacuum. You always need to specify the table you want to retrieve data from. In the example query, we're fetching data from tables a and b. Another thing to note is that you can select columns from one table and JOIN the data retrieved with another table using some keywords like INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN e.t.c.
The example query uses a "FULL OUTER JOIN." A FULL OUTER JOIN returns both matching and unmatched rows and columns from "table a and b." If a record in "table a" doesn't have a match in "table b" on the join clauses, SQL fills the corresponding cell with "NULL." The same thing happens when a record in "table b" doesn't have a match in "table a."
Conventionally, programmers prefer to join tables using a single unique identifier or key. But, the example query uses two different identifiers for merging (i.e., sk_applicant and hellofresh_week). The second index further narrows down the data and ensures there's a unique match between sk_applicant and hellofresh_week in the "table a" with the sk_applicant and hellofresh_week in "table b." It's super important as a single applicant can have multiple hellofresh_week values.
The diagram below presents a high-level overview of how the code works:
But, there are several concerns with the SQL query:
- The table names aren't descriptive of the data they hold.
- The example query uses the exhaustive column retrieval method, which can be very expensive and cumbersome over tables with a larger number of columns, especially when you need to select only a few columns.
- If the two tables don't have too many matches on the keys selected, the merged table retrieved will have a high proportion of zero values.
The following can be done to improve the query:
- The names "a and b" should be replaced with a name that describes the kind of data in each table.
- If we only need to select a few columns, it's better to list the columns rather than using "a .* and b.*".
- After studying the code carefully, I think an INNER JOIN would do the job more efficiently and effectively than a FULL OUTER JOIN.
- If all the table columns must be selected, a better alternative to the original code example would be to use a single "*" to retrieve all the data from both tables as shown in the code block below:
Abraham Enyo-one Musa
Data Scientist || Machine Learning Engineer || Technical Content Writer
I'm a data scientist, machine learning engineer and technical content writer. You can either reach out to me via LinkedIn or shoot an email to firstname.lastname@example.org.