Check and set analyze statistics on tables/columns in Vertica

It is important to update statistics in any database for better query runs and query optimization.  Vertica provides ANALYZE_STATISTICS and ANALYZE_HISTOGRAM to collect stats, and in a big data warehouse environment it is quite possible that some tables do not have stats collected.  In that case you can run the following query to check for and identify the tables/columns that need stats collected.


SELECT T1.table_schema, T1.table_name, T2.row_ct, T2.wos_ct, T2.ros_ct
FROM (
    -- Get list of tables that don't have 'FULL' stats
    SELECT DISTINCT table_schema, table_name
    FROM projection_columns
    WHERE (statistics_type = 'ROWCOUNT' OR statistics_type = 'NONE')
      -- AND table_schema NOT LIKE 'v_%'
      AND (table_schema = 'schema_1' OR table_schema = 'schema_2')
    ORDER BY 1, 2
) T1
JOIN (
    -- Get list of tables with storage details
    SELECT anchor_table_schema, anchor_table_name,
           SUM(row_count) row_ct,
           SUM(wos_row_count) wos_ct,
           SUM(ros_row_count) ros_ct
    FROM projection_storage
    WHERE (anchor_table_schema = 'schema_1' OR anchor_table_schema = 'schema_2')
    GROUP BY 1, 2
) T2
  ON T1.table_schema = T2.anchor_table_schema
 AND T1.table_name = T2.anchor_table_name
ORDER BY 3 DESC;

From the list generated above, add those tables/columns to your regular stats-collection script, which you can run during off-peak hours.
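
For example, entries in such a script might look like the following; schema_1.fact_table, col1 and col2 are placeholder names, and the two-argument form that takes a comma-separated column list is assumed here:

-- Refresh statistics for the whole table
SELECT ANALYZE_STATISTICS('schema_1.fact_table');

-- Or only for the specific columns flagged by the query above
SELECT ANALYZE_STATISTICS('schema_1.fact_table', 'col1,col2');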

Notes:
Vertica's ANALYZE_HISTOGRAM returns similar data to ANALYZE_STATISTICS, except that it lets you set the percent of data that is sampled to generate the statistics; the latter uses a fixed 10% sample.
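
For example, to sample 20% of the data of a (hypothetical) table instead of the default:

-- Build statistics from a 20% sample of the table's rows
SELECT ANALYZE_HISTOGRAM('schema_1.fact_table', 20);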

WOS: Write Optimized Store is a memory-resident data structure used to support fast data loads, without data compression or indexing.
ROS: Read Optimized Store is disk-based storage with data compressed and sorted.  The Tuple Mover moves data out of WOS and into ROS in batches.

Removing duplicates in Vertica data sets

Many a time, due to a bug or otherwise, duplicate rows get introduced into tables, and it is not hard to perform a dedupe and let the ETL/ELT process continue without interruption.  In environments where rows can be identified by row_ids (Oracle) or the table is not large (say, less than 100 million rows), simple techniques like copying the whole table to a temporary table and then removing the dups suffice to handle the situation. Thus, data users (Scientists, Analysts or Execs) are not negatively affected either by the data issue or by downtime.
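
For that small-table case, a minimal sketch of the copy-and-swap approach (table names are hypothetical, and it assumes rows are exact duplicates across all columns) could be:

-- Keep one copy of each distinct row in a temp table
create table test.small_table_dedup as
select distinct * from small_table;

-- Swap the deduped data back in
truncate table small_table;

insert into small_table
select * from test.small_table_dedup;

drop table test.small_table_dedup;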

But some cases can be interesting, say, where the table is large or very large, from a few hundred million rows to a few billion rows, and the tables are wide with tens to hundreds of columns.  One technique that I use in a Vertica environment is copying the duplicates to a temp table, generating a pseudo row_id, and then deleting the duplicates.  Recently a table with more than 150 million rows and 60 columns had just 2 duplicates and needed a dedupe without any downtime.  The steps I used were:

Step 1:  Check for dups
select count(1) from large-table-1;    -- Just some sanity check before beginning the dedupe

-- Find the duplicates
select keys, count(1) from large-table-1
where where-conditions
group by 1
having count(1) > 1
order by count(1) desc  ;

Step 2:  Dump the duplicates into temp table
create table test.large-table-1-dups
like large-table-1;

alter table test.large-table-1-dups     -- add row_num column (pseudo row_id)
add column row_num int;

insert into test.large-table-1-dups
select *, ROW_NUMBER() OVER(PARTITION BY key)
from large-table-1
where key in ('k1', 'k2');                    -- where, say, k1 has n and k2 has m exact duplicates

Step 3: Remove duplicates from the temp table
delete from test.large-table-1-dups
where row_num > 1;

select * from test.large-table-1-dups;    --  Sanity test.  Should have 1 row each of k1 & k2 rows

Step 4: Delete all duplicates from main table...
delete from large-table-1
-- select * from large-table-1
where key in ('k1', 'k2');                      -- key is like line_item_nk, etc.

Step 5: Insert data back into main table from temp dedupe data
alter table test.large-table-1-dups
drop column row_num;

insert into large-table-1
select * from test.large-table-1-dups;

Yum metadata file does not match checksum error and fix

While using yum on CentOS 5.4 at a client site, I came across an error when trying to search for a package.  The error indicated a checksum mismatch:

Loaded plugins: fastestmirror
openx-artifactory-released/primary                                                  | 1.1 MB     00:00     
http://abcd.company.org:8081/artifactory/libs-release-local/repodata/primary.xml.gz: [Errno -1] Metadata file does not match checksum
Trying other mirror.
Error: failure: repodata/primary.xml.gz from openx-artifactory-released: [Errno 256] No more mirrors to try.

This happens when the repo metadata files are out of sync for some reason (including an outdated cache). One quick way to get past it, I figured, was to stop using yum's fastestmirror plugin altogether; yum also takes the -y option to skip the confirmation prompt (y(es) or N(o)).  So I ran

> sudo yum -y remove yum-fastestmirror

If one doesn't check the dependencies before removing a package, one can get bitten. And that is what I ended up with :(.

Before running the above command, I should have checked with

> yum deplist yum-fastestmirror
Excluding Packages from CentOS-5 - Plus
Finished
Finding dependencies: 
package: yum-fastestmirror.noarch 1.1.16-14.el5.centos.1
  dependency: config(yum-fastestmirror) = 1.1.16-14.el5.centos.1
   provider: yum-fastestmirror.noarch 1.1.16-14.el5.centos.1
  dependency: yum >= 3.0
   provider: yum.noarch 3.2.22-20.el5.centos
   provider: yum.noarch 3.2.22-20.el5.centos
package: yum-fastestmirror.noarch 1.1.16-13.el5.centos
  dependency: config(yum-fastestmirror) = 1.1.16-13.el5.centos
   provider: yum-fastestmirror.noarch 1.1.16-13.el5.centos
   provider: yum-fastestmirror.noarch 1.1.16-13.el5.centos
  dependency: yum >= 3.0
   provider: yum.noarch 3.2.22-20.el5.centos
   provider: yum.noarch 3.2.22-20.el5.centos

and also yum itself depends on yum-fastestmirror!
> yum deplist yum
...
  dependency: yum-fastestmirror
   provider: yum-fastestmirror.noarch 1.1.16-13.el5.centos
   provider: yum-fastestmirror.noarch 1.1.16-13.el5.centos
   provider: yum-fastestmirror.noarch 1.1.16-14.el5.centos.1
....

By removing yum-fastestmirror (to quickly get around the metadata checksum error), I ended up with a bigger problem: I had uninstalled yum itself!

> yum search python
-bash: yum: command not found

> rpm -qi yum
package yum is not installed

Now the only solution was to get a fresh yum package for this CentOS release and install it.  I couldn't wget or download it over HTTP directly on this system, which sits behind a firewall and proxy for network security.

> cat /etc/redhat-release
OpenX release 5.4 (based on CentOS 5.4)

I browsed to http://vault.centos.org/5.4/os/i386/CentOS/ and downloaded yum 3.2.22 (yum-3.2.22-20.el5.centos.noarch.rpm) along with yum-fastestmirror and yum-utils to my local laptop, then scp'ed the packages to the dev system.

On the dev box, then, I ran

> sudo rpm -ivh yum-3.2.22-20.el5.centos.noarch.rpm yum-fastestmirror-1.1.16-13.el5.centos.noarch.rpm 

Note: You need to provide both rpms at the same time (in a single command) because there is a cyclic dependency.  If you try installing one at a time, each will complain that the other package is missing!

Once that install went through, I could run

> yum search ctags

Purging Vertica tables/ partitions at regular intervals

First things first, especially when there is potential to misunderstand due to the definitions of specific words: purging vs. truncating vs. deleting data. In this article, my interest is in purging tables of deleted rows.

Truncate => Removes all storage associated with a table, while preserving the table definition.
Purge => Permanently removes deleted data from physical storage so that the disk space can be reused.  You can purge historical data up to and including the epoch in which the Ancient History Mark (AHM) is contained.
Delete => Marks tuples as no longer valid in the current epoch; it does not remove data from disk storage for base tables.

Vertica's purge_table and purge_partition functions purge all projections of the specified table and can temporarily take significant disk space while performing the purge.
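
They can also be invoked directly; here is a minimal sketch against a hypothetical table schema_1.fact_table (the partition key value is made up as well), along with a check of where the Ancient History Mark currently sits:

-- Deleted data up to and including the AHM epoch is purgeable
SELECT GET_AHM_EPOCH(), GET_AHM_TIME();

-- Purge all projections of a table
SELECT purge_table('schema_1.fact_table');

-- Purge a single partition of a partitioned table
SELECT purge_partition('schema_1.fact_table', '2014-01');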

The following query returns a list of statements that, when executed, purge the deleted data from each partitioned table that qualifies.

   -- Look into only large tables (TABLE_SIZE_OF_INTEREST), say, 1 billion rows,
   --  that have a good chunk of deletes (>20%)
   SELECT 'SELECT purge_partition('''|| P.projection_schema ||'.'|| P.anchor_table_name || ''','''||PRT.partition_key||''');'
    FROM projections P inner join
    (   SELECT projection_id, partition_key
        ,SUM(deleted_row_count)
        ,SUM(ros_row_count) as total_rows
        FROM partitions
        GROUP BY 1,2
        HAVING SUM(deleted_row_count)/SUM(ros_row_count) > 0.2
    ) PRT
    ON PRT.projection_id = P.projection_id
    GROUP BY PRT.partition_key, P.projection_schema, P.anchor_table_name
    HAVING max(total_rows) <= {TABLE_SIZE_OF_INTEREST}
    ORDER BY max(total_rows) DESC
    LIMIT 25 ;
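
Each row returned is a ready-to-run statement; with a hypothetical table and partition key, a generated row would look like:

    SELECT purge_partition('schema_1.fact_table','2014-01');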


Similarly, for tables that are not partitioned, use the following query.

    SELECT 'select purge_table('''|| projection_schema ||'.'|| anchor_table_name || ''');'
    FROM projections P inner join
    (   SELECT schema_name, projection_name ,
            sum(deleted_row_count) delete_rows ,
            sum(delete_vector_count) delete_vector_count ,
            sum(total_row_count) total_rows,
            sum(deleted_row_count)/sum(total_row_count) as delete_percent
        FROM storage_containers
        GROUP BY 1,2
        HAVING sum(deleted_row_count)/sum(total_row_count) > 0.2
    ) A
    ON A.schema_name = P.projection_schema
      AND A.projection_name = P.projection_name
    INNER JOIN tables T on P.anchor_table_id = T.table_id
    WHERE length(T.partition_expression) = 0
    GROUP BY 1
    HAVING max(total_rows) <= {TABLE_SIZE_OF_INTEREST}
    ORDER BY max(total_rows) DESC
    LIMIT 25 ;

With a wrapper around these queries in Python, Perl, Bash, etc., one can easily go through the list, executing each statement and cleaning up old deleted data for improved performance.