Large data mysqldump to Greenplum

Recently I needed to load a single table from a transaction system with few hundred million rows into Greenplum/Postgresql from MySQL.  MySQL schema didn't have many tables but one single table was large with around 50G size including data and index.  Ended up testing with 2 different techniques below.

Technique 1: Using mysqldump and Postgresql inserts

In the beginning I thought it would be pretty straight forward with mysqldump I could be able to use postgres load utility
> psql -h HOST -U USER -f FILENAME   
but it turned out be intersting challenge with many changes needed to load successfully.
Another minor quirk was transaction systems was using Clustrix a specific vendor version of MySQL. It's dump creates a file that is not fully compatible with direct load into postgresql.  Dump even with --compitable=postgresql option didn't help much.

One of the major issue while loading huge file with psql utility the "Out of memory" error even with reasonably small file, say 1G.
ERROR:  out of memory
DETAIL:  Cannot enlarge string buffer containing 0 bytes by 1414939983 more bytes.
As a first step I removed all MySQL comments and anything other than data with INSERT INTO statement.

Example lines removed are below.
-- MySQL dump 10.13  Distrib 5.1.42, for krobix-linux-gnu (x86_64)
--
-- Host: localhost    Database: supply_production
-- ------------------------------------------------------
-- Server version 5.0.45

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
And retained any lines between
   INSERT INTO line and ENABLE KEYS line.  
Used a script to perform the filtering.

This gave me all the data I needed with only few hundred lines with each line as long as 10 or more MB!   These are long lines with thousands and thousands of records.  At certain intervals, 100,000 or so, Clustrix inserted new row with "INSERT INTO ...".  I removed these extra inserts comands and split the records with perl simpel one liner

>  perl -pi -e 's#\)\,\(#\)\,\n\(#g'  
thus inserting new line at the end of each record and the new file had around 200 million lines now.

With continued error of "Out of memory" you will be kind of misled to believe that the Greenplum is slurping in all data into memory and trying to load which in first place shouldn't be the case.  With INSERT INTO .... VALUES ( .... ) statement there is no need to do so.  Next option was to find the possible error by splitting the file into smaller files and adding INSERT INTO statement at the beginning of each line and then removing the trailing "," at the end of last line.

After trying 10 million, 1 million and 0.5 million, Greenplum started throwing appropriate error like non-existing table (this is because the path was not set for postgresql), missing "," etc.

Split command used
> split --lines=500000 FILENAME
Adding "INSERT INTO ...." to each of these files and instead of seeking to end of file and removing extra ",", I added a new dummy line which I can delete later from uploaded table.

> for fn in `ls x*`;
    do echo "Working on $fn";
      echo "INSERT INTO schema.table VALUES " > "${fn}_r_l";
      cat $fn >> "${fn}_r_l";
      echo "(column_1, column_2, column_3,....column_N)" >> "${fn}_r_l" ;
   done
This created for each split file corresponding file with "_r_l" suffix (ready_to_load).

Then loaded the table
> for fn in `ls xd*_r_l`;
    do
      echo "Loading $fn";
      psql -h HOST -U USER -d DATABASE -f "FILENAME";
    done


Systems and utilities used:

  Greenplum DB - Greenplum Database 4.0.6.0 build 4
  Postgresql - PostgreSQL 8.2.14
  MySQL - 5.0.45-clustrix-v4.1
  Perl - 5.8.8 multithreaded
  Bash
  All running on linux x86_64 with 24G memory

There were more than 400 files with 0.5G data loaded in less than three hours.  Still substantial but it is one time load and was acceptable.


Technique 2:   Using mysqldump and Greenplum gpload

Greenplum's bulk loading utility (gpload) is an excellent one to load large data set.  After dumping the data and cleaning, formatting it into a few files of 10G each, you can use gpload as below.

gpload  -f  $gpload_ctl_file

with control file created dynamically from a template.  For example in the below table replace all place holders with respective values.  With dynamically created control file (and no hard-coded values) the technique can be used for daily bulk loads as well.

VERSION: 1.0.0.1
DATABASE:
USER:  
HOST:  
PORT:  
GPLOAD:
   INPUT:
     - SOURCE:
         LOCAL_HOSTNAME:
           -
         PORT:
         FILE:
           -
     - FORMAT: text
     - DELIMITER: '|'
     - NULL_AS: 'NULL'
     - ERROR_LIMIT: 25
     - ERROR_TABLE: sandbox_log.gpload_errors
     - COLUMNS:
         - timestamp: text
         - priority: text
     ...
     ...
 PRELOAD:
    - TRUNCATE: false
   OUTPUT:
     - TABLE:
     - MODE: INSERT

This is a much faster and efficient loading than technique 1.

HTH,
Shiva

Recursion defined

Recursion, see Recursion.  :)

Something defined in terms itself.  Or sometimes CS scientists or programmers making point through
GNU - "GNU's Not Unix"
YAML - "YAML Ain't Markup Language"
Or beautiful Sierpinski Traingles

When a function call's itself some interesting things happen behind the scene like holding onto the variables which later used when computer execution unwinds the stack.  In a typical example of recursion in solving a factorial, one may write
#!/usr/bin/env perl

use strict;
sub factorial {
    my $v = shift;
    return 1 if $v == 1;
    return $v * factorial($v - 1);
}
factorial(5);
When a first call is made to factorial(5), the execution jumps to factorial function (subroutine) and gets to last line, where while evaluating encounters another function call to factorial ($v -1) which then again makes a call to function or subroutine. This pushes a new call frame on to stack (with args).  If a function returns it is pop-ed out of the stack and done (lost).

Few things are working together with call stack, heap, garbage collector (which removes any memory of any variable or func obj that doesn't have reference count 1 or more), and execution system.

Now to see more on recursion you can try the following


  1 #!/usr/bin/env  perl
  2 $! = 1;
  3 use strict;
  4 use IO::Handle;
  5 use Carp qw(cluck);
  6
  7 STDOUT->autoflush(1);      # Flush output immediately
  8 STDERR->autoflush(1);
  9
 10 sub factorial {
 11     my $v = shift;
 12  
 13     dummy_func();             # Sub that returns immediately printing call stack
 14     return 1 if $v == 1;
 15     print "Variable v value: $v and it's address:", \$v,
                     "\nCurrent sub factorial addr:", \&factorial, "\n","-"x40;
 16     return $v * factorial($v - 1);    # Builds on call for each func call
 17 }
 18  
 19 sub dummy_func {
 20     cluck;
 21 }
 22
 23 factorial(5);
Resulting output:
  1     main::dummy_func() called at ./t_recursion.pl line 13
  2     main::factorial(5) called at ./t_recursion.pl line 23
  3 Variable v value: 5 and its address:SCALAR(0x7ff6240546a0)
  4 Current sub factorial addr:CODE(0x7ff62402f2c0)
  5 ----------------------------------------
  6     main::dummy_func() called at ./t_recursion.pl line 13
  7     main::factorial(4) called at ./t_recursion.pl line 16
  8     main::factorial(5) called at ./t_recursion.pl line 23
  9 Variable v value: 4 and its address:SCALAR(0x7ff6240610e8)
 10 Current sub factorial addr:CODE(0x7ff62402f2c0)
 11 ----------------------------------------
 12     main::dummy_func() called at ./t_recursion.pl line 13
 13     main::factorial(3) called at ./t_recursion.pl line 16
 14     main::factorial(4) called at ./t_recursion.pl line 16
 15     main::factorial(5) called at ./t_recursion.pl line 23
 16 Variable v value: 3 and its address:SCALAR(0x7ff6240612f8)
 17 Current sub factorial addr:CODE(0x7ff62402f2c0)
 18 ----------------------------------------
 19     main::dummy_func() called at ./t_recursion.pl line 13
 20     main::factorial(2) called at ./t_recursion.pl line 16
 21     main::factorial(3) called at ./t_recursion.pl line 16
 22     main::factorial(4) called at ./t_recursion.pl line 16
 23     main::factorial(5) called at ./t_recursion.pl line 23
 24 Variable v value: 2 and its address:SCALAR(0x7ff624061538)
 25 Current sub factorial addr:CODE(0x7ff62402f2c0)
 26 ----------------------------------------
 27     main::dummy_func() called at ./t_recursion.pl line 13
 28     main::factorial(1) called at ./t_recursion.pl line 16
 29     main::factorial(2) called at ./t_recursion.pl line 16
 30     main::factorial(3) called at ./t_recursion.pl line 16
 31     main::factorial(4) called at ./t_recursion.pl line 16
 32     main::factorial(5) called at ./t_recursion.pl line 23

When recursion script is kicked-off, it pushes factorial(5) first frame on to the call stack (line 2 above) which calls another dummy_func which then goes on to the stack (line 1).   Hence when cluck is called in dummy_func there are two calls on the stack along with any arguments passed.

Then dummy_call returns and is pop-ed from the stack.  Program moves to line 15 (script above) and evaluates to false.  Then prints lines 3&4 output ($v and its location, factorial sub location).

Script line 16 calls factorial which pushes the new function call on to stack and at the point the value of $v is 5.  The function and this variable are in same scope and on stack.  So later when this function returns is multiplied with $v (value 5).

When factorial is called 2nd time (but first time at line 16 and pushed onto call stack) $v is reduced by 1 ($v -1) which is then copied and execution starts at top of this subroutine again.  Remember copy of definition of function always the same at some location (CODE(0x7ff62402f2c0)) in program memory.

This execution then calls dummy_func which spits out the call stack and as you expected now you have dummy_func at top, 2nd factorial in middle and 1st factorial call at bottom.  Stack is FILO (First In Last Out or LIFO - Last In First Out).  Then execution moves to lines 14 & 15.   Output looks like:

  6     main::dummy_func() called at ./t_recursion.pl line 13
  7     main::factorial(4) called at ./t_recursion.pl line 16
  8     main::factorial(5) called at ./t_recursion.pl line 23
  9  Variable v value: 4 and its address:SCALAR(0x7ff6240610e8)
 10 Current sub factorial addr:CODE(0x7ff62402f2c0)

At script line 16 the recursion continues and you get output lines 12 to 32.  At the last function the base or terminal condition of recursion is met ( return 1 if $v == 1; ) and it returns 1.

factorial of 1 => 1! = 1;

Now the stack rewinding begins, the return value of 1 (when factorial (1) returned) is multiplied with the variable $v (value 2) and results in 2 which is returned by  return $v * factorial($v - 1);  statement.

Finally,  5! = 120.

All this happen behind the scene and it might be just better to know and recognize the common pattern when this happen :).  I wouldn't worry about how the implementation is done when I run query like

SELECT column_N FROM table_X;
It is so darn simple but so much goes behind that SQL statement from mapping table to file and exact location in file to extract correct values.  It is all hidden from the application program.

For more details take a look at "Call Stack" or "Activation Record".

But if you like to dig deeper through debugging, try


> perl -d t_recursion.pl
Loading DB routines from perl5db.pl version 1.33
Editor support available.

Enter h or `h h' for help, or `man perldebug' for more help.

main::(t_recursion.pl:2):	$! = 1;
  DB<1> n
main::(t_recursion.pl:7):	STDOUT->autoflush(1);
  DB<1> n
main::(t_recursion.pl:8):	STDERR->autoflush(1);
  DB<1> n
main::(t_recursion.pl:23):	factorial(5);
  DB<1> s
main::factorial(t_recursion.pl:11):
11:	    my $v = shift;
  DB<1> s
main::factorial(t_recursion.pl:13):
13:	    dummy_func();
  DB<1> s
main::dummy_func(t_recursion.pl:20):
20:	    cluck;
  DB<1> T
. = main::dummy_func() called from file `t_recursion.pl' line 13
. = main::factorial(5) called from file `t_recursion.pl' line 23
  DB<1>