Removing duplicates in Vertica data sets

Every so often, whether due to a bug or otherwise, duplicate rows get introduced into tables, and it is usually not hard to dedupe them and let the ETL/ELT process continue without interruption.  In environments where rows can be identified by row IDs (Oracle) or the table is not large (say, fewer than 100 million rows), simple techniques like copying the whole table to a temporary table and then removing the dups suffice to handle the situation.  Data users (Scientists, Analysts or Execs) are thus not negatively affected by either the data issue or downtime.
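
For that small-table case, a minimal sketch of the copy-and-swap approach (placeholder names; it assumes the duplicates are exact copies across all columns) might look like this:

create table small_table_1_tmp as          -- temp copy holding only the distinct rows
select distinct * from small_table_1;

truncate table small_table_1;               -- note: truncate commits and cannot be rolled back

insert into small_table_1                   -- reload the deduped rows
select * from small_table_1_tmp;

drop table small_table_1_tmp;

commit;                                     -- needed if autocommit is off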

But some cases are more interesting, say, where the table is large or very large, from a few hundred million rows to a few billion, and wide, with tens to hundreds of columns.  One technique I use in a Vertica environment is copying just the duplicates to a temp table, generating a pseudo row_id, and then deleting the duplicates.  Recently a table with more than 150 million rows and 60 columns had just 2 duplicates and needed a dedupe without any downtime.  The steps I used were:

Step 1:  Check for dups
select count(1) from large-table-1;    -- Just some sanity check before beginning the dedupe

-- Find the duplicates
select keys, count(1) from large-table-1
where where-conditions                  -- 'where-conditions' is a placeholder for any filter predicates
group by 1
having count(1) > 1
order by count(1) desc;

Step 2:  Dump the duplicates into a temp table
create table test.large-table-1-dups
like large-table-1;

alter table test.large-table-1-dups     -- add row_num column (pseudo row_id)
add column row_num int;

insert into test.large-table-1-dups
select *, ROW_NUMBER() OVER(PARTITION BY key)
from large-table-1
where key in ('k1', 'k2');                    -- where, say, k1 has n and k2 has m exact duplicates

Step 3: Remove duplicates from the temp table
delete from test.large-table-1-dups
where row_num > 1;

select * from test.large-table-1-dups;    --  Sanity test.  Should have exactly 1 row each for k1 and k2

Step 4: Delete all duplicates from the main table
delete from large-table-1
-- select * from large-table-1
where key in ('k1', 'k2');                      -- key is like line_item_nk, etc.

Step 5: Insert the deduped data back into the main table from the temp table
alter table test.large-table-1-dups
drop column row_num;

insert into large-table-1
select * from test.large-table-1-dups;
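
A final verification pass (not part of the steps above; shown here only as a suggested follow-up using the same placeholder names) can confirm the duplicates are gone before cleaning up:

-- Re-run the duplicate check from Step 1; it should now return no rows
select key, count(1) from large-table-1
where key in ('k1', 'k2')
group by 1
having count(1) > 1;

select count(1) from large-table-1;        -- total should be down by exactly the number of duplicates removed

commit;                                    -- needed if autocommit is off

drop table test.large-table-1-dups;        -- clean up the temp table once satisfied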
