Yum metadata file does not match checksum error and fix

While using yum on CentOS 5.4 at a client site, I came across an error when trying to search for a package. The error indicated a checksum mismatch:

Loaded plugins: fastestmirror
openx-artifactory-released/primary                                                  | 1.1 MB     00:00     
http://abcd.company.org:8081/artifactory/libs-release-local/repodata/primary.xml.gz: [Errno -1] Metadata file does not match checksum
Trying other mirror.
Error: failure: repodata/primary.xml.gz from openx-artifactory-released: [Errno 256] No more mirrors to try.

This happens when the repo metadata files get out of sync for some reason (an outdated cache included). One quick way out, I figured, was to take yum's fastestmirror plugin out of the picture so it would stop cycling through mirrors on the checksum check; yum's -y option answers the y(es)/N(o) confirmation prompt automatically. So I ran

> sudo yum -y remove yum-fastestmirror

If you don't check a package's dependencies before removing it, you can get bitten. And that is exactly what happened to me :(.

Before running the above command, I should have checked with

> yum deplist yum-fastestmirror
Excluding Packages from CentOS-5 - Plus
Finished
Finding dependencies: 
package: yum-fastestmirror.noarch 1.1.16-14.el5.centos.1
  dependency: config(yum-fastestmirror) = 1.1.16-14.el5.centos.1
   provider: yum-fastestmirror.noarch 1.1.16-14.el5.centos.1
  dependency: yum >= 3.0
   provider: yum.noarch 3.2.22-20.el5.centos
   provider: yum.noarch 3.2.22-20.el5.centos
package: yum-fastestmirror.noarch 1.1.16-13.el5.centos
  dependency: config(yum-fastestmirror) = 1.1.16-13.el5.centos
   provider: yum-fastestmirror.noarch 1.1.16-13.el5.centos
   provider: yum-fastestmirror.noarch 1.1.16-13.el5.centos
  dependency: yum >= 3.0
   provider: yum.noarch 3.2.22-20.el5.centos
   provider: yum.noarch 3.2.22-20.el5.centos

and also that yum itself depends on yum-fastestmirror!
> yum deplist yum
...
  dependency: yum-fastestmirror
   provider: yum-fastestmirror.noarch 1.1.16-13.el5.centos
   provider: yum-fastestmirror.noarch 1.1.16-13.el5.centos
   provider: yum-fastestmirror.noarch 1.1.16-14.el5.centos.1
....
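
The rpm database can answer the same question directly; a quick reverse-dependency check I could have run instead (--whatrequires matches the package name as a capability):

> rpm -q --whatrequires yum-fastestmirror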

By removing yum-fastestmirror (to quickly sidestep the metadata checksum error), I ended up with the bigger problem of having uninstalled yum itself!

> yum search python
-bash: yum: command not found

> rpm -qi yum
package yum is not installed

Now the only solution was to get a fresh yum package for this CentOS release and install it by hand. I couldn't wget or fetch it over http directly, since this system sat behind a firewall and proxy on a security-restricted network.

> cat /etc/redhat-release
OpenX release 5.4 (based on CentOS 5.4)

So I browsed to http://vault.centos.org/5.4/os/i386/CentOS/ and downloaded yum 3.2.22 (yum-3.2.22-20.el5.centos.noarch.rpm) along with yum-fastestmirror and yum-utils to my local laptop, then scp'ed the packages over to the dev system.
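
In commands, that was roughly the following on the laptop side (the dev box user and hostname are placeholders):

> wget http://vault.centos.org/5.4/os/i386/CentOS/yum-3.2.22-20.el5.centos.noarch.rpm
> wget http://vault.centos.org/5.4/os/i386/CentOS/yum-fastestmirror-1.1.16-13.el5.centos.noarch.rpm
> scp yum-*.rpm user@devbox: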

Then, on the dev box, I ran

> sudo rpm -ivh yum-3.2.22-20.el5.centos.noarch.rpm yum-fastestmirror-1.1.16-13.el5.centos.noarch.rpm 

Note: You need to provide both rpms at the same time (in a single command) because of the cyclic dependency. If you try installing them one at a time, each will complain that the other package is missing!

Once that install went through cleanly, I could again perform

> yum search ctags
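
For the record, flushing yum's cached metadata and letting it rebuild is usually enough to clear this class of checksum error, with nothing uninstalled. That is what I should have tried first:

> sudo yum clean metadata
> yum makecache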

Purging Vertica tables/partitions at regular intervals

First things first, especially when there is potential to misunderstand the definitions of a few specific words: purging vs. truncating vs. deleting data. In this article, my interest is in purging tables of their deleted rows.

Truncate => Removes all storage associated with a table, while preserving the table definition.
Purge => Permanently removes deleted data from physical storage so that the disk space can be reused. You can purge historical data up to and including the epoch in which the Ancient History Mark is contained.
Delete => Marks tuples as no longer valid in the current epoch; it does not remove data from disk storage for base tables.
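
To make the distinction concrete, here is what the three operations look like in Vertica SQL (the schema and table names are made up for illustration):

    -- DELETE marks matching rows as deleted; disk space is NOT reclaimed yet
    DELETE FROM web.page_views WHERE view_date < '2013-01-01';

    -- PURGE_TABLE permanently removes the previously deleted rows from storage
    SELECT PURGE_TABLE('web.page_views');

    -- TRUNCATE drops all storage for the table, keeping only its definition
    TRUNCATE TABLE web.page_views;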

Vertica's purge_table and purge_partition functions purge all projections of the specified table, and a purge can temporarily take significant extra disk space while it runs.

The following query returns a list of statements that, when executed, purge the deleted data from each qualifying partitioned table.

   -- Consider only tables up to TABLE_SIZE_OF_INTEREST rows (say, 1 billion)
   --  that carry a good chunk of deletes (>20%)
   SELECT 'SELECT purge_partition('''|| P.projection_schema ||'.'|| P.anchor_table_name || ''','''||PRT.partition_key||''');'
    FROM projections P INNER JOIN
    (   SELECT projection_id, partition_key
        ,SUM(deleted_row_count) AS deleted_rows
        ,SUM(ros_row_count) AS total_rows
        FROM partitions
        GROUP BY 1,2
        HAVING SUM(deleted_row_count)/SUM(ros_row_count) > 0.2
    ) PRT
    ON PRT.projection_id = P.projection_id
    GROUP BY PRT.partition_key, P.projection_schema, P.anchor_table_name
    HAVING MAX(total_rows) <= {TABLE_SIZE_OF_INTEREST}
    ORDER BY MAX(total_rows) DESC
    LIMIT 25 ;
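
Each row of the result is itself an executable statement, along the lines of (schema, table, and partition key here are hypothetical):

    SELECT purge_partition('web.page_views','2013-06-01');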


Similarly, for those tables that are non-partitioned, use the following query.

    SELECT 'select purge_table('''|| projection_schema ||'.'|| anchor_table_name || ''');'
    FROM projections P inner join
    (   SELECT schema_name, projection_name ,
            sum(deleted_row_count) delete_rows ,
            sum(delete_vector_count) delete_vector_count ,
            sum(total_row_count) total_rows,
            sum(deleted_row_count)/sum(total_row_count) as delete_percent
        FROM storage_containers
        GROUP BY 1,2
        HAVING sum(deleted_row_count)/sum(total_row_count) > 0.2
    ) A
    ON A.schema_name = P.projection_schema
      AND A.projection_name = P.projection_name
    INNER JOIN tables T on P.anchor_table_id = T.table_id
    WHERE length(T.partition_expression) = 0
    GROUP BY 1
    HAVING max(total_rows) <= {TABLE_SIZE_OF_INTEREST}
    ORDER BY max(total_rows) DESC
    LIMIT 25 ;

With a wrapper around these queries in Python, Perl, Bash, etc., one can easily walk through the generated list, executing each statement in turn to clean up old deleted data for improved performance.
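
For instance, a minimal Bash sketch using vsql (the connection settings and the generator-query file name are placeholders to adapt):

    #!/bin/bash
    # Hypothetical connection settings -- adjust for your cluster.
    VSQL="vsql -h vertica-host -d mydb -U dbadmin -w secret -At"

    # purge_generator.sql holds one of the generator queries above;
    # -A (unaligned) and -t (tuples only) yield one bare statement per line.
    $VSQL -f purge_generator.sql | while read -r stmt; do
        echo "Running: $stmt"
        $VSQL -c "$stmt"
    done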