Every now and then, a bug or some other issue introduces duplicate rows into a table, and it is usually not hard to dedupe the data and let the ETL/ELT process continue without interruption. In environments where rows can be identified by row IDs (Oracle), or where the table is not large (say, fewer than 100 million rows), simple techniques like copying the whole table to a temporary table and then removing the dups suffice. Data users (scientists, analysts, or execs) are then not negatively affected by either the data issue or downtime.
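For a small table, a minimal sketch of that simple whole-table approach might look like the following (the table and schema names are hypothetical, and SELECT DISTINCT assumes the rows are exact duplicates):
-- Copy one copy of each row to a temp table, then reload the original table
create table test.small_table_dedup as
select distinct * from small_table;
truncate table small_table;
insert into small_table
select * from test.small_table_dedup;
drop table test.small_table_dedup;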
But some cases are more interesting, say, where the table is large or very large, from a few hundred million rows to a few billion, and wide, with tens to hundreds of columns. One technique I use in a Vertica environment is copying just the duplicates to a temp table, generating a pseudo row_id, and then deleting the duplicates. Recently, a table with more than 150 million rows and 60 columns had just 2 duplicates and needed dedupe without any downtime. The steps I used were:
Step 1: Check for dups
select count(1) from large-table-1; -- Sanity check of the row count before beginning the dedupe
-- Find the duplicated keys
select key, count(1) from large-table-1
where where-conditions
group by 1
having count(1) > 1
order by count(1) desc;
Step 2: Dump the duplicates into a temp table
create table test.large-table-1-dups
like large-table-1;
alter table test.large-table-1-dups -- add row_num column (pseudo row_id)
add column row_num int;
insert into test.large-table-1-dups
select *, ROW_NUMBER() OVER(PARTITION BY key ORDER BY key) as row_num -- rows are exact dups, so any ordering works
from large-table-1
where key in ('k1', 'k2'); -- where, say, k1 has n and k2 has m exact duplicates
Step 3: Remove duplicates from the temp table
delete from test.large-table-1-dups
where row_num > 1;
select * from test.large-table-1-dups; -- Sanity test: should return exactly one row each for k1 and k2
Step 4: Delete all duplicates from main table...
delete from large-table-1
-- select * from large-table-1
where key in ('k1', 'k2'); -- key is like line_item_nk, etc.
Step 5: Insert data back into main table from temp dedupe data
alter table test.large-table-1-dups
drop column row_num;
insert into large-table-1
select * from test.large-table-1-dups;
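As a final sanity check (a sketch using the same placeholder names as above), confirm that the main table has the expected row count and that the affected keys are no longer duplicated before dropping the temp table:
select count(1) from large-table-1; -- should equal the original count minus the removed duplicate rows
select key, count(1)
from large-table-1
where key in ('k1', 'k2')
group by 1
having count(1) > 1; -- should return no rows
drop table test.large-table-1-dups; -- clean up once verified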