Engineering
-
January 13, 2023

Easy SQL tricks to clean messy data

Cleaning messy data can be time-consuming and tedious. In this tutorial, we will go over some easy SQL tricks to help you with the most common data-cleaning tasks.

Egor Gryaznov
Get Data Insights Delivered
Join hundreds of data professionals who subscribe to the Data Leaders Digest for actionable insights and expert advice.
Stay Informed
Sign up for the Data Leaders Digest and get the latest trends, insights, and strategies in data management delivered straight to your inbox.
Get the Best of Data Leadership
Subscribe to the Data Leaders Digest for exclusive content on data reliability, observability, and leadership from top industry experts.

Get the Best of Data Leadership

Subscribe to the Data Leaders Digest for exclusive content on data reliability, observability, and leadership from top industry experts.

Stay Informed

Sign up for the Data Leaders Digest and get the latest trends, insights, and strategies in data management delivered straight to your inbox.

Get Data Insights Delivered

Join hundreds of data professionals who subscribe to the Data Leaders Digest for actionable insights and expert advice.

Cleaning messy data can be time-consuming and tedious. In this tutorial, we will go over some easy SQL tricks to help you with the most common data-cleaning tasks.

Finding missing values in data

Missing values, also known as null values, can occur for a variety of reasons, such as data entry errors or missing information.

In SQL, you can use the IS NULL operator to find missing values in a table. For example, let's say we have a table called employees with the following structure:

IDNameSalaryDepartment1NULL45000Marketing2Jane5000Sales3JerryNULLNULL4Sarah4000HR

Missing values in specific columns

To find rows with missing values in the salary and department columns, we can use the following SQL query:

SELECT * FROM employees WHERE salary IS NULL OR name IS NULL;

This will return the following result:

IDNameSalaryDepartment1NULL45000Marketing3JerryNULLNULL

You can also use the COUNT function to count the number of missing values in a column. For example, to count the number of missing values in the salary column, we can use the following query:

SELECT COUNT(*) FROM employees WHERE salary IS NULL;

This will return the result 1, indicating that there is one missing value in the salary column.

Missing values in any column

To check if any column in a table contains a null value, you can use the following SQL query:

SELECT COUNT(*) AS count FROM employees WHERE COALESCE( (SELECT column_name FROM INFORMATION_SCHEMA.COLUMNS WHERE table_name = 'table' AND is_nullable = 'YES' FOR XML PATH('')), '' ) IS NOT NULL;

This query uses the INFORMATION_SCHEMA.COLUMNS table to get a list of all the nullable columns in the **employees**table, and then passes them as arguments to the COALESCE function. If any of the columns contain a null value, the COALESCE function will return a non-null value. The COUNT function then counts the number of rows that have a non-null value.

Patterns in missing values

To find patterns in the missing values, you can use the GROUP BY clause to group the rows by a certain column, and then use the COUNT function and the NULL keyword to count the number of missing values in each group. This can help you identify patterns in the missing values, such as whether certain groups have a higher or lower percentage of missing values. For example, to group the rows by department column and count the number of missing salary values in each department, you can use the following SQL query:

SELECT department, COUNT(*) AS count FROM employees WHERE salary IS NULL GROUP BY department;

Removing duplicates

Duplicate values can occur for a variety of reasons, such as data entry errors or data import issues. It is important to identify and remove duplicates before proceeding with any analysis, as they can significantly affect the results.

Remove duplicates from a specific column

In SQL, you can use the DISTINCT keyword to remove duplicates from the result set. For example, let's say we have a table called customers with the following structure:

IDNameEmail1Johnjohn@example.com2Janejane@example.com3Jerryjerry@example.com4Sarahsarah@example.com5Johnjohn@example.com6Jonjohn@example.com

To remove duplicates from the name column, we can use the following SQL query:

SELECT DISTINCT name FROM customers;

This will return the following result:

NameJohnJaneJerrySarah

Remove duplicates from the entire table

To remove duplicates from the entire table, not just from a single column, we can use the following SQL query:

SELECT DISTINCT * FROM customers;

This will return the following result:

IDNameEmail1Johnjohn@example.com2Janejane@example.com3Jerryjerry@example.com4Sarahsarah@example.com

Cleaning string variables

String variables, also known as character variables, are used to store text data. However, sometimes the data in these variables can be dirty or inconsistent, which can cause issues when trying to analyze or manipulate the data. Here are four ways to clean string variables in SQL:

1. Trimming whitespace

Whitespace, also known as leading or trailing spaces, can sometimes be added to the beginning or end of a string. These spaces can cause problems when comparing or joining strings, so it is important to remove them.

In SQL, you can use the TRIM function to remove leading and trailing spaces from a string. For example, let's say we have a table called products with the following structure:

IDNamePrice1iPhone9992Samsung8993Google Pixel7994"Pad"499

To remove the leading and trailing spaces from the name column, we can use the following SQL query:

SELECT id, TRIM(name) AS name, price FROM products;

This will return the following result:

IDNamePrice1iPhone9992Samsung8993Google Pixel7994iPad499

2. Replacing characters

Sometimes, there may be certain characters in a string that need to be replaced with another character. For example, you may want to replace all hyphens with underscores, or all commas with pipes.

In SQL, you can use the REPLACE function to replace a specific character or sequence of characters with another character or sequence of characters. For example, let's say we have a table called sales with the following structure:

IDDateSales12021-01-0110022021-01-0220032021-01-0330042021-01-04400

To replace the hyphens in the date column with underscores, we can use the following SQL query:

SELECT id, REPLACE(date, '-', '_') AS date, sales FROM sales;

This will return the following result:

IDDateSales12021_01_0110022021_01_0220032021_01_0330042021_01_04400

Sometimes, you don’t know the exact characters you want to replace in a string — only their rough form. For example, you may want to remove all special characters, or all numbers.

In SQL, you can use the REGEXP_REPLACE function to use regular expressions to define the pattern of characters to be replaced.

For example, let's say we have a table called contacts with the following structure:

IDNamePhone1John Smith(123) 456-78902Jane Doe(123) 456-78913Jerry Garcia(123) 456-78924Sarah Johnson(123) 456-7893

To remove all non-numeric characters from the phone column, we can use the regular expression '[^0-9]' that matches any character that is not a digit (0-9).

SELECT id, name, REGEXP_REPLACE(phone, '[^0-9]', '') AS phone FROM contacts;

This will return the following result:

IDNamePhone1John Smith12345678902Jane Doe12345678913Jerry Garcia12345678924Sarah Johnson1234567893

3. Extracting characters

Sometimes, you may want to extract a specific portion of a string. For example, you may want to extract the first name from a full name, or the domain from an email address. To do this, you can use the **SUBSTRING** or the **REGEX_SUBSTR** function.

The SUBSTRING function takes three arguments: the string to extract from, the starting position of the extraction, and the length of the extraction.

For example, let's say we have a table called emails with the following structure:

IDEmail1john@example.com2jane@example.com3jerry@example.com4sarah@example.com

To extract the domain from the email column, we can use the following SQL query:

SELECT id, SUBSTRING(email, LENGTH(email) - 9) AS domain FROM emails;

This will return the following result:

IDDomain1example.com2example.com3example.com4example.com

You can also use the REGEXP_SUBSTR function to extract substrings using regular expressions. For example, the query below extracts the domain from the emails in the emails table using a regular expression.

SELECT REGEXP_SUBSTR(email, '(@\w+\.)', 1, 1, 'i') AS domain FROM emails;

The REGEXP_SUBSTR function returns the first occurrence of the pattern, which is the domain in this case.

Converting values from one data type to another

Sometimes, you may need to convert values from one data type to another. For example, you may need to convert a string to a number, or a date to a string.

In SQL, you can use type casting functions to convert values from one data type to another. Here are some commonly used type-casting functions:

  • CAST(value AS type)
  • Converts the value to the specified data type.
  • TO_NUMBER(value)
  • Converts the value to a number.
  • TO_DATE(value)
  • Converts the value to a date.
  • TO_CHAR(value)
  • Converts the value to a string.

For example, let's say we have a table called orders with the following structure:

IDOrder dateOrder total12021-01-0110022021-01-0220032021-01-0330042021-01-04400

order_date and order_total are both strings. To convert them to a date and int, respectively, we can use the following SQL query:

SELECT id, CAST(order_date as DATE) AS order_date, CAST(order_total AS char(15)) AS order_total FROM orders;

You might be wondering, since it seems that CAST does the same thing as the TO_{} functions, with more generality, when you might want to use the latter over the former.

In general, you should use the TO_{} functions when you are sure that the value is of a certain type and can be safely converted to that data type. If you are not sure about the data type of the value or want to convert it to a different data type, you should use the CAST function.

TO_{} conversions and format masking

You might also want to use the TO_CHAR function with a format mask to convert a number to a string and specify the format of the resulting string. CAST does not support format masks.

For example, the following SQL query converts the order_total column to a string and displays it with the currency symbol and with two digits of specificity.

SELECT TO_CHAR(order_total, '$9,999.99') AS column FROM table;

In this example, the $ character represents the currency symbol, the 9 character represents a digit, and the , and .characters represent the thousand separator and the decimal point, respectively. The number of 9 characters determines the number of digits to display before and after the decimal point.

share this episode
Resource
Monthly cost ($)
Number of resources
Time (months)
Total cost ($)
Software/Data engineer
$15,000
3
12
$540,000
Data analyst
$12,000
2
6
$144,000
Business analyst
$10,000
1
3
$30,000
Data/product manager
$20,000
2
6
$240,000
Total cost
$954,000
Role
Goals
Common needs
Data engineers
Overall data flow. Data is fresh and operating at full volume. Jobs are always running, so data outages don't impact downstream systems.
Freshness + volume
Monitoring
Schema change detection
Lineage monitoring
Data scientists
Specific datasets in great detail. Looking for outliers, duplication, and other—sometimes subtle—issues that could affect their analysis or machine learning models.
Freshness monitoringCompleteness monitoringDuplicate detectionOutlier detectionDistribution shift detectionDimensional slicing and dicing
Analytics engineers
Rapidly testing the changes they’re making within the data model. Move fast and not break things—without spending hours writing tons of pipeline tests.
Lineage monitoringETL blue/green testing
Business intelligence analysts
The business impact of data. Understand where they should spend their time digging in, and when they have a red herring caused by a data pipeline problem.
Integration with analytics toolsAnomaly detectionCustom business metricsDimensional slicing and dicing
Other stakeholders
Data reliability. Customers and stakeholders don’t want data issues to bog them down, delay deadlines, or provide inaccurate information.
Integration with analytics toolsReporting and insights

Get the Best of Data Leadership

Subscribe to the Data Leaders Digest for exclusive content on data reliability, observability, and leadership from top industry experts.

Stay Informed

Sign up for the Data Leaders Digest and get the latest trends, insights, and strategies in data management delivered straight to your inbox.

Get Data Insights Delivered

Join hundreds of data professionals who subscribe to the Data Leaders Digest for actionable insights and expert advice.

Join the Bigeye Newsletter

1x per month. Get the latest in data observability right in your inbox.