Recently I needed to load a single table from a transaction system with few hundred million rows into Greenplum/Postgresql from MySQL. MySQL schema didn't have many tables but one single table was large with around 50G size including data and index. Ended up testing with 2 different techniques below.
Technique 1: Using mysqldump and Postgresql inserts
In the beginning I thought it would be pretty straight forward with mysqldump I could be able to use postgres load utility
Another minor quirk was transaction systems was using Clustrix a specific vendor version of MySQL. It's dump creates a file that is not fully compatible with direct load into postgresql. Dump even with --compitable=postgresql option didn't help much.
One of the major issue while loading huge file with psql utility the "Out of memory" error even with reasonably small file, say 1G.
Example lines removed are below.
INSERT INTO line and ENABLE KEYS line.
Used a script to perform the filtering.
This gave me all the data I needed with only few hundred lines with each line as long as 10 or more MB! These are long lines with thousands and thousands of records. At certain intervals, 100,000 or so, Clustrix inserted new row with "INSERT INTO ...". I removed these extra inserts comands and split the records with perl simpel one liner
With continued error of "Out of memory" you will be kind of misled to believe that the Greenplum is slurping in all data into memory and trying to load which in first place shouldn't be the case. With INSERT INTO .... VALUES ( .... ) statement there is no need to do so. Next option was to find the possible error by splitting the file into smaller files and adding INSERT INTO statement at the beginning of each line and then removing the trailing "," at the end of last line.
After trying 10 million, 1 million and 0.5 million, Greenplum started throwing appropriate error like non-existing table (this is because the path was not set for postgresql), missing "," etc.
Split command used
Then loaded the table
Postgresql - PostgreSQL 8.2.14
MySQL - 5.0.45-clustrix-v4.1
Perl - 5.8.8 multithreaded
Bash
All running on linux x86_64 with 24G memory
There were more than 400 files with 0.5G data loaded in less than three hours. Still substantial but it is one time load and was acceptable.
Technique 2: Using mysqldump and Greenplum gpload
Greenplum's bulk loading utility (gpload) is an excellent one to load large data set. After dumping the data and cleaning, formatting it into a few files of 10G each, you can use gpload as below.
gpload -f $gpload_ctl_file
with control file created dynamically from a template. For example in the below table replace all place holders with respective values. With dynamically created control file (and no hard-coded values) the technique can be used for daily bulk loads as well.
VERSION: 1.0.0.1
DATABASE:
USER:
HOST:
PORT:
GPLOAD:
INPUT:
- SOURCE:
LOCAL_HOSTNAME:
-
PORT:
FILE:
-
- FORMAT: text
- DELIMITER: '|'
- NULL_AS: 'NULL'
- ERROR_LIMIT: 25
- ERROR_TABLE: sandbox_log.gpload_errors
- COLUMNS:
- timestamp: text
- priority: text
...
...
PRELOAD:
- TRUNCATE: false
OUTPUT:
- TABLE:
- MODE: INSERT
This is a much faster and efficient loading than technique 1.
HTH,
Shiva
Technique 1: Using mysqldump and Postgresql inserts
In the beginning I thought it would be pretty straight forward with mysqldump I could be able to use postgres load utility
> psql -h HOST -U USER -f FILENAMEbut it turned out be intersting challenge with many changes needed to load successfully.
Another minor quirk was transaction systems was using Clustrix a specific vendor version of MySQL. It's dump creates a file that is not fully compatible with direct load into postgresql. Dump even with --compitable=postgresql option didn't help much.
One of the major issue while loading huge file with psql utility the "Out of memory" error even with reasonably small file, say 1G.
ERROR: out of memory DETAIL: Cannot enlarge string buffer containing 0 bytes by 1414939983 more bytes.As a first step I removed all MySQL comments and anything other than data with INSERT INTO statement.
Example lines removed are below.
-- MySQL dump 10.13 Distrib 5.1.42, for krobix-linux-gnu (x86_64)
--
-- Host: localhost Database: supply_production
-- ------------------------------------------------------
-- Server version 5.0.45
/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
And retained any lines betweenINSERT INTO
Used a script to perform the filtering.
This gave me all the data I needed with only few hundred lines with each line as long as 10 or more MB! These are long lines with thousands and thousands of records. At certain intervals, 100,000 or so, Clustrix inserted new row with "INSERT INTO ...". I removed these extra inserts comands and split the records with perl simpel one liner
> perl -pi -e 's#\)\,\(#\)\,\n\(#g'thus inserting new line at the end of each record and the new file had around 200 million lines now.
With continued error of "Out of memory" you will be kind of misled to believe that the Greenplum is slurping in all data into memory and trying to load which in first place shouldn't be the case. With INSERT INTO .... VALUES ( .... ) statement there is no need to do so. Next option was to find the possible error by splitting the file into smaller files and adding INSERT INTO statement at the beginning of each line and then removing the trailing "," at the end of last line.
After trying 10 million, 1 million and 0.5 million, Greenplum started throwing appropriate error like non-existing table (this is because the path was not set for postgresql), missing "," etc.
Split command used
> split --lines=500000 FILENAMEAdding "INSERT INTO ...." to each of these files and instead of seeking to end of file and removing extra ",", I added a new dummy line which I can delete later from uploaded table.
> for fn in `ls x*`; do echo "Working on $fn"; echo "INSERT INTO schema.table VALUES " > "${fn}_r_l"; cat $fn >> "${fn}_r_l"; echo "(column_1, column_2, column_3,....column_N)" >> "${fn}_r_l" ; doneThis created for each split file corresponding file with "_r_l" suffix (ready_to_load).
Then loaded the table
> for fn in `ls xd*_r_l`; do echo "Loading $fn"; psql -h HOST -U USER -d DATABASE -f "FILENAME"; done
Systems and utilities used:
Greenplum DB - Greenplum Database 4.0.6.0 build 4Postgresql - PostgreSQL 8.2.14
MySQL - 5.0.45-clustrix-v4.1
Perl - 5.8.8 multithreaded
Bash
All running on linux x86_64 with 24G memory
There were more than 400 files with 0.5G data loaded in less than three hours. Still substantial but it is one time load and was acceptable.
Technique 2: Using mysqldump and Greenplum gpload
Greenplum's bulk loading utility (gpload) is an excellent one to load large data set. After dumping the data and cleaning, formatting it into a few files of 10G each, you can use gpload as below.
gpload -f $gpload_ctl_file
with control file created dynamically from a template. For example in the below table replace all place holders with respective values. With dynamically created control file (and no hard-coded values) the technique can be used for daily bulk loads as well.
VERSION: 1.0.0.1
DATABASE:
USER:
HOST:
PORT:
GPLOAD:
INPUT:
- SOURCE:
LOCAL_HOSTNAME:
-
PORT:
FILE:
-
- FORMAT: text
- DELIMITER: '|'
- NULL_AS: 'NULL'
- ERROR_LIMIT: 25
- ERROR_TABLE: sandbox_log.gpload_errors
- COLUMNS:
- timestamp: text
- priority: text
...
...
PRELOAD:
- TRUNCATE: false
OUTPUT:
- TABLE:
- MODE: INSERT
This is a much faster and efficient loading than technique 1.
HTH,
Shiva