Yum metadata file does not match checksum error and fix

While using yum on CentOS 5.4 at a client site, I came across an error when trying to search for a package. The error indicated a checksum mismatch:

Loaded plugins: fastestmirror
openx-artifactory-released/primary                                                  | 1.1 MB     00:00     
http://abcd.company.org:8081/artifactory/libs-release-local/repodata/primary.xml.gz: [Errno -1] Metadata file does not match checksum
Trying other mirror.
Error: failure: repodata/primary.xml.gz from openx-artifactory-released: [Errno 256] No more mirrors to try.

This happens when the repo metadata files get out of sync for some reason (an outdated cache included). One quick way out, I figured, was to take yum's fastestmirror plugin out of the picture so it would stop cycling through mirrors on the checksum check; yum's -y option answers the y(es)/N(o) confirmation prompt automatically. So I ran

> sudo yum -y remove yum-fastestmirror

If you don't check a package's dependencies before removing it, you can get bitten. And that is exactly what happened to me :(.

Before running the above command, I should have checked with

> yum deplist yum-fastestmirror
Excluding Packages from CentOS-5 - Plus
Finished
Finding dependencies: 
package: yum-fastestmirror.noarch 1.1.16-14.el5.centos.1
  dependency: config(yum-fastestmirror) = 1.1.16-14.el5.centos.1
   provider: yum-fastestmirror.noarch 1.1.16-14.el5.centos.1
  dependency: yum >= 3.0
   provider: yum.noarch 3.2.22-20.el5.centos
   provider: yum.noarch 3.2.22-20.el5.centos
package: yum-fastestmirror.noarch 1.1.16-13.el5.centos
  dependency: config(yum-fastestmirror) = 1.1.16-13.el5.centos
   provider: yum-fastestmirror.noarch 1.1.16-13.el5.centos
   provider: yum-fastestmirror.noarch 1.1.16-13.el5.centos
  dependency: yum >= 3.0
   provider: yum.noarch 3.2.22-20.el5.centos
   provider: yum.noarch 3.2.22-20.el5.centos

and also that yum itself depends on yum-fastestmirror!
> yum deplist yum
...
  dependency: yum-fastestmirror
   provider: yum-fastestmirror.noarch 1.1.16-13.el5.centos
   provider: yum-fastestmirror.noarch 1.1.16-13.el5.centos
   provider: yum-fastestmirror.noarch 1.1.16-14.el5.centos.1
....
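
The rpm database can answer the same question directly; a quick reverse-dependency check I could have run instead (--whatrequires matches the package name as a capability):

> rpm -q --whatrequires yum-fastestmirror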

By removing yum-fastestmirror (to quickly sidestep the metadata checksum error), I ended up with the bigger problem of having uninstalled yum itself!

> yum search python
-bash: yum: command not found

> rpm -qi yum
package yum is not installed

Now the only solution was to get a fresh yum package for this CentOS release and install it by hand. I couldn't wget or fetch it over http directly, since this system sat behind a firewall and proxy on a security-restricted network.

> cat /etc/redhat-release
OpenX release 5.4 (based on CentOS 5.4)

So I browsed to http://vault.centos.org/5.4/os/i386/CentOS/ and downloaded yum 3.2.22 (yum-3.2.22-20.el5.centos.noarch.rpm) along with yum-fastestmirror and yum-utils to my local laptop, then scp'ed the packages over to the dev system.
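
In commands, that was roughly the following on the laptop side (the dev box user and hostname are placeholders):

> wget http://vault.centos.org/5.4/os/i386/CentOS/yum-3.2.22-20.el5.centos.noarch.rpm
> wget http://vault.centos.org/5.4/os/i386/CentOS/yum-fastestmirror-1.1.16-13.el5.centos.noarch.rpm
> scp yum-*.rpm user@devbox: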

Then, on the dev box, I ran

> sudo rpm -ivh yum-3.2.22-20.el5.centos.noarch.rpm yum-fastestmirror-1.1.16-13.el5.centos.noarch.rpm 

Note: You need to provide both rpms at the same time (in a single command) because of the cyclic dependency. If you try installing them one at a time, each will complain that the other package is missing!

Once that install went through cleanly, I could again perform

> yum search ctags
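
For the record, flushing yum's cached metadata and letting it rebuild is usually enough to clear this class of checksum error, with nothing uninstalled. That is what I should have tried first:

> sudo yum clean metadata
> yum makecache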

Purging Vertica tables/partitions at regular intervals

First things first, especially when there is potential to misunderstand the definitions of a few specific words: purging vs. truncating vs. deleting data. In this article, my interest is in purging tables of their deleted rows.

Truncate => Removes all storage associated with a table, while preserving the table definition.
Purge => Permanently removes deleted data from physical storage so that the disk space can be reused. You can purge historical data up to and including the epoch in which the Ancient History Mark is contained.
Delete => Marks tuples as no longer valid in the current epoch; it does not remove data from disk storage for base tables.
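
To make the distinction concrete, here is what the three operations look like in Vertica SQL (the schema and table names are made up for illustration):

    -- DELETE marks matching rows as deleted; disk space is NOT reclaimed yet
    DELETE FROM web.page_views WHERE view_date < '2013-01-01';

    -- PURGE_TABLE permanently removes the previously deleted rows from storage
    SELECT PURGE_TABLE('web.page_views');

    -- TRUNCATE drops all storage for the table, keeping only its definition
    TRUNCATE TABLE web.page_views;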

Vertica's purge_table and purge_partition functions purge all projections of the specified table, and a purge can temporarily take significant extra disk space while it runs.

The following query returns a list of statements that, when executed, purge the deleted data from each qualifying partitioned table.

   -- Consider only tables up to TABLE_SIZE_OF_INTEREST rows (say, 1 billion)
   --  that carry a good chunk of deletes (>20%)
   SELECT 'SELECT purge_partition('''|| P.projection_schema ||'.'|| P.anchor_table_name || ''','''||PRT.partition_key||''');'
    FROM projections P INNER JOIN
    (   SELECT projection_id, partition_key
        ,SUM(deleted_row_count) AS deleted_rows
        ,SUM(ros_row_count) AS total_rows
        FROM partitions
        GROUP BY 1,2
        HAVING SUM(deleted_row_count)/SUM(ros_row_count) > 0.2
    ) PRT
    ON PRT.projection_id = P.projection_id
    GROUP BY PRT.partition_key, P.projection_schema, P.anchor_table_name
    HAVING MAX(total_rows) <= {TABLE_SIZE_OF_INTEREST}
    ORDER BY MAX(total_rows) DESC
    LIMIT 25 ;
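
Each row of the result is itself an executable statement, along the lines of (schema, table, and partition key here are hypothetical):

    SELECT purge_partition('web.page_views','2013-06-01');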


Similarly, for those tables that are non-partitioned, use the following query.

    SELECT 'select purge_table('''|| projection_schema ||'.'|| anchor_table_name || ''');'
    FROM projections P inner join
    (   SELECT schema_name, projection_name ,
            sum(deleted_row_count) delete_rows ,
            sum(delete_vector_count) delete_vector_count ,
            sum(total_row_count) total_rows,
            sum(deleted_row_count)/sum(total_row_count) as delete_percent
        FROM storage_containers
        GROUP BY 1,2
        HAVING sum(deleted_row_count)/sum(total_row_count) > 0.2
    ) A
    ON A.schema_name = P.projection_schema
      AND A.projection_name = P.projection_name
    INNER JOIN tables T on P.anchor_table_id = T.table_id
    WHERE length(T.partition_expression) = 0
    GROUP BY 1
    HAVING max(total_rows) <= {TABLE_SIZE_OF_INTEREST}
    ORDER BY max(total_rows) DESC
    LIMIT 25 ;

With a wrapper around these queries in Python, Perl, Bash, etc., one can easily walk through the generated list, executing each statement in turn to clean up old deleted data for improved performance.
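
For instance, a minimal Bash sketch using vsql (the connection settings and the generator-query file name are placeholders to adapt):

    #!/bin/bash
    # Hypothetical connection settings -- adjust for your cluster.
    VSQL="vsql -h vertica-host -d mydb -U dbadmin -w secret -At"

    # purge_generator.sql holds one of the generator queries above;
    # -A (unaligned) and -t (tuples only) yield one bare statement per line.
    $VSQL -f purge_generator.sql | while read -r stmt; do
        echo "Running: $stmt"
        $VSQL -c "$stmt"
    done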