Wednesday, January 11, 2012

Handy query to find duplicate rows in a MySQL DB

My most recent project has me dealing with huge (we're talking hundreds of megabytes) lists of email addresses. These email addresses have been collected over years and I found out the hard way that there were not only duplicates and triplicates, but some email addresses were listed in there more than 6 TIMES!!!

My first order of business was to clean up the email lists which reside in MySQL data tables. Here is the query to grab all the duplicates. This query will also tell you how many times the email address is duplicated:

SELECT email_address,
COUNT (email_address) AS NumOccurrences
FROM email_marketing_tbl
GROUP BY email_address
HAVING (count (email_address) > 1);

The above query will produce output like this:

| email_address              | NumOccurrences |
|         |              2 |
|            |              2 |
|         |              2 |
|         |              2 |

Once dupes are identified and reviewed the next step is to delete the duplicate records. We do so using the following SQL:

CREATE TEMPORARY TABLE email_temp AS SELECT * FROM email_marketing_tbl
GROUP BY email_address;
DELETE FROM email_marketing_tbl;
INSERT INTO email_marketing_tbl SELECT * FROM email_temp;

Below are the results of the above three queries:

Query OK, 539 rows affected
Records: 539  Duplicates: 0  Warnings: 0
Query OK, 548 rows affected

Query OK, 539 rows affected
Records: 539  Duplicates: 0  Warnings: 0
To recap:
1. Identify which records are duplicates
2. Create a temporary table from the output of the select of the email_marketing_tbl (the one with the dupes in it)
3. Delete all records from email_marketing_tbl
4. Insert all (cleaned up) records from email_temp into email_marketing_tbl

Thats IT!

Creative Commons License
VGP-Miami Web and Mobile Automation Blog by Alfred Vega is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.