Data and Technology: August 2010

In my previous few posts on the blog, I covered how to convert IP addresses to corresponding integers and back. When performing a Data Warehouse (DW) tables join between IP addresses for location analysis joining on integer IP column is much faster than joining on string column. As noted in other posts, this article covers IPv4 but technique could be used for IPv6 addresses on using 2 BIGINT.

For geo location info, you can use MaxMind's Geo-City data that is provided in two entities - one with mapping IP block to a location ID and another one with location ID to city/state/country. It is one to many (1:M) relationship and to find an IP address location one need to find the matching IP block and then map it to city or country. There are posts that describe in detail how to perform it effeciently for a given IP. These techniques will work for an OLTP environment where you pass known IP address from UI or application layer to database and gets its city/country.

Fast MySQL Range Queries on MaxMind GeoIP Tables byAndy Skelton
MaxMind GeoIP Install Setup Tutorial by Bartomedia
Finding a user's country/region based on IP by Shlomi Noach
On efficiently geo-referencing IPs with MaxMind GeoIP and MySQL GIS by Jeremy Cole

Geo-City data has following fields seperated by comma and row terminated by new line (\n):

Location ID
Country Code
State/Region
City Name
Postal Code (US only)
Latitude
Longitude
Metro Code
Area Code

IP Block has:

Start IP Address
End IP Address
Location ID

While loading this dimensional data during ETL process, you should also do the IP address transformation to integer as explained in "Converting an IP address to an integer". And Geo-City dimensional table will also have partition range (the reason for this post) and surrogate key (id). You should also perform other data validation like non-overlapping ranges, valid country codes, etc. which are all typical standard activities of DW ETL process. The time, resource and effort put into cleaning and validating the data before pushing the data into final fact tables will pay well within short period of time. To fix a data issue and reconcile the reports in DW/BI environment always leads to lower ROI.

You can also denormalize the above tables and create a single table. In either case by joining on range of IP address for a given set of IPs will result in the same slow performance.

dim_geo_ip_city_location holds around 5.9million rows with following attributes:

CREATE TABLE dim_geo_ip_city_location (
    location_id int NOT NULL,
    start_ip varchar(16) NOT NULL,
    end_ip  varchar(16) NOT NULL,
    country char(2),
    region varchar(64), 
    city varchar(128),
    postal_code varchar(32),
    latitude decimal(8,4),
    longitude decimal(8,4),
    metro_code varchar(8),
    area_code varchar(8),
    ...
)ENIGNE=...

CREATE TABLE dim_ip_block (
    start_ip_num bigint NOT NULL,
    end_ip_num bigint NOT NULL,
    location_id int NOT NULL,
        ...
)ENGINE=...

In MySQL (5.0) optimizer performs range operation effeciently with a constant meaning that you have given the query an IP address like "WHERE ip_addr between 1230 and 1234", for example, but not with range operations without constants. For more info on using range optimization see mysql documentation and for indexing see "Mastering the Art of (MySQL) Indexing".

I did not find any article providing some insights into doing DW/BI analysis on IP addresses. In a DW environment the business user is interested in slicing and dicing the hundreds of thousands to million of IPs. Just trying to find few thousand IP address locations (See query 1 below) was taking more than an hour when joined between lookup table (a table similar to dim_geo_ip_city_location but without partition column (ip_range_partition) and id) and ip address join table even with appropriate indexes like index on (start_ip_num, end_ip_num) or start_ip_num and end_ip_num.

Query 1:

-- Simple query to select IP block
SELECT IA.ip_num, B.location_id, B.start_ip_num, B.end_ip_num
FROM ip_addresses IA use index(...)
JOIN dim_ip_block B use index(...)
WHERE IA.ip_num between B.start_ip_num and B.end_ip_num

This leads to MySQL performing full scans. And in the case of ip_block table (B) it doesn't use any index. For this to be avoided, I partitioned the dimension (lookup) table into 1000 parts and since this table has ~5.9 Million rows each partition will result in ~5,900 rows in each partition. For more on data set partitions see its wiki. The partitions should be numbered in the ascending order with respect to ascending range of ip blocks (start_ip_num). Then also create a another table (dim_ip_range_partition) with start_ip_num and end_ip_num for this each range basically creating super IP blocks.

CREATE TABLE dim_ip_range_partition (
     partition_id int NOT NULL AUTO_INCREMENT,
     start_ip_num bigint NOT NULL,
     end_ip_num bigint NOT NULL,
     ....
) ENGINE=InnoDB ;

Also, create a primary key on patition_id and index on (start_ip_num, end_ip_num) on this table.
The solution is based on two things that there is no overlapping IP block and IP blocks are ordered (in ascending order).

To populate the above table,

Add id (row_id) column to dim_ip_block
Add partition_id integer column to dim_ip_block
Update id from 1 to n (highest start_ip_num) in ascending order of start_ip_num
Partition "n" ids into P parts (n/P) where n > 1 and 1 < P < n
Assign each part/partition ascending integer value
Now update above table dim_ip_range_partition P_TABLE from dim_ip_block B_TABLE where P_TABLE partition id between 1 and P of B_TABLE and P_TABLE.start_ip_num is lowest B_TABLE.start_ip_num of the corresponding partition and P_TABLE.end_ip_num is highest of corresponding partition. In other words...

         INSERT INTO dim_ip_range_partition
            (partition_id, start_ip_num, end_ip_num, ...)
         SELECT partition_id, min(start_ip_num), max(end_ip_num), ...
         FROM dim_ip_block
         GROUP BY partition_id
         ORDER BY partition_id asc

By doing so, we have created a super IP blocks each having n/P blocks from the original ip_blocks.   And the following query is one way to extract city/country info for given set of IP addresses.

SELECT t.ip_num, t.ip, B.start_ip_num, B.start_ip, B.end_ip_num, B.end_ip
       ,L.city, L.country
FROM 
(
   --- Get the super ip block (partition) 
   SELECT C.ip_num, max(partition_id) partition_id, max(P.start_ip_num) start_ip_num
   FROM ip_addresses C use index(...)
   JOIN dim_ip_range_partition P 
   WHERE C.ip_num >= P.start_ip_num
   GROUP BY C.ip_num
) t
LEFT OUTER JOIN mktg_ip_block B use index(...)  
  ON  t.partition_id = B.partition_id     --- And join on this partition
  AND t.ip_num >= B.start_ip_num 
  AND t.ip_num <= B.end_ip_num
JOIN dim_geo_ip_city_location L          --- Get city/country
  ON B.location_id = L.location_id

The query that would have taken more than hour returns within 30 seconds now! You can further optimize the above design with say more denormalized relations, having a large number of partitions (this is a trade-off, with highest partition n = P, each partition is same as original table which will result in worse off performance!) on both lookup table and lookedup table. Or extending the partitioning concept to more levels (super super block, ....) you can reduce the number of joining rows with little bit more complex query.

The solution is a typical Computer Science way of handling the complex problem by dividing into smaller ones and solving it. In this it is more specific to MySQL solution for large data sets with range operations.

Cheers,
Shiva

DW solution - IP mapping to Geo Locations

Tools to convert IP address