A 3-step approach to tackle the problem of duplicates in databases
Duplicates are a recurring problem for any database user. There are several reasons why some duplicates may appear in a data set, and a sanity check is often necessary before any analysis can be conducted properly. This is why I want to share with you some lessons learned from my experience in dealing with duplicates using SQL.
Let’s set the scene. You just conducted a brilliant analysis to evaluate how many clients you or your company acquired over the past year. Your conclusion seems to confirm your intuition: the number of clients has tripled over the past 12 months! You are so proud of this result — although somewhat surprised — that you want to double-check it before presenting it to your boss. Digging into the data, you suddenly realize that some customers were counted twice… which clearly questions your previous finding.
How to deal with possible duplicate entries in your data set? How to work around the problem of duplicates using SQL? Let me guide you through some of the techniques I have been using as a Data Analyst to identify, avoid and remove duplicates from any type of data set.
Below is the sample table (named customers) that I will use throughout this article to illustrate my points:
1. How to Identify Duplicates
Regardless of how duplicates were created, the first step is to identify them within your data table. The most difficult part here is not technical: it is about acknowledging their existence. Too often one may be tempted to consider that a data table (especially within a company) does not contain any duplicate entry by default. While it is, fortunately, the case for the majority of them, sometimes duplicates were created unknowingly over time.
A simple SQL query allows you to retrieve all duplicates present in a data table. Looking at some particular examples of duplicate rows is a good way to get started. Five to ten concrete examples will give you a first idea of what is going on in your data, in order to better understand what could have led to the creation of duplicates.
Based on the example introduced previously, the following query allows to identify possible customers that are present more than once in the data table customers. In this example, I want to ensure that all the customers registered are unique. To do so, I use customers’ email address and date of birth, as they are supposed to be unique to each customer.
To apply this query to any data table, you simply have to replace the following fields:
- customers should be replaced by the data table in which you are looking for possible duplicates
- email_address and date_of_birth should be replaced by all fields (there can be one or several) that are supposed to be unique to each unit in your data table
If you get an empty table as an output, congratulations! It means that no duplicates were found in your table, based on the criteria you specified. If you get an output, you must investigate some of the duplicate rows.
Here, the columns that are not supposed to be unique (in this example, several customers may have the same customer_first_name and/or customer_last_name) will help you identify duplicates. They will pave the way for a better understanding of why there are duplicates in your table. In my example, I see that the customer Sarah Smith, whose email address is firstname.lastname@example.org and who is born May 23rd, 1983, has been doubled-counted in my data table.
2. How to Avoid Duplicates
As is often the case, the best cure is to avoid getting sick. Therefore, before jumping into the “removing duplicates” part, you should understand what could have led to the creation of duplicates in a data table. Possible causes can be operational (e.g. a salesperson registered the same customer multiple times), technical (e.g. an IT bug led to customer accounts being created twice) or related to data manipulations (e.g. an intermediary data table is built in such a way that there are duplicate rows).
Identifying the root cause of the creation of duplicates will help you tackle the problem at the right level. Hence, if the problem comes from operational or technical issues, you may turn to the department in charge (Operations or IT for the two first cases in my example). If the problem comes from data manipulation, there are some actions that you can take to fix this problem on your own. By doing so, not only will your analysis be more reliable, but it will raise your awareness — and possibly the awareness of your colleagues — to avoid this problem in the future.
As a general rule, better structuring your SQL code should prevent the majority of problems in unintentionally creating duplicate rows. In particular, if you clearly visualize your query output before even starting to write your code (as explained in this article), you should be able to avoid the creation of duplicates upfront.
3. How to Remove Duplicates
Now that you have identified the duplicates in your data set and that you have understood how they were created, you want to remove them from your output data table. By doing so, you will be able to conduct a correct analysis and make sure that your findings are reliable.
Based on my own experience, here are three techniques to remove duplicates from a given data table. Using the previous example, these pieces of code allow me to remove duplicates from the table customers, assuming that each customer is defined by a unique combination of their email address and their date of birth.
The “well-structured” technique: LIST
A preliminary query is written to retrieve the unique fields related to each customer (in this example, email_address and date_of_birth). Then this sub-query is placed into a WITH clause and is called list. This becomes the first data source of the main query, which is LEFT JOIN-ed on the initial table customers.
The “unique identifier” technique: PARTITION
This technique uses the function FIRST_VALUE combined with the PARTITION BY clause. For the initially defined fields (here, email_address and date_of_birth) these functions take the first value of the field customer_first_name (respectively customer_last_name) encountered within the data table customers.
The “super quick” technique: GROUP BY
Similarly to the previous technique, using the GROUP BY clause enables the user to define in which way the non-unique fields (here, customer_first_name and customer_last_name) must be aggregated for each customer. This technique is particularly useful for numeric fields, but it also works for string fields.
All three techniques lead to the same result: duplicates are removed from the data set. As they use different logical paths and SQL functions, you may find one method clearer or more practical than the other depending on your background and your proficiency level in SQL. Feel free to use the one you are most comfortable with!