For geo location info, you can use MaxMind's Geo-City data that is provided in two entities - one with mapping IP block to a location ID and another one with location ID to city/state/country. It is one to many (1:M) relationship and to find an IP address location one need to find the matching IP block and then map it to city or country. There are posts that describe in detail how to perform it effeciently for a given IP. These techniques will work for an OLTP environment where you pass known IP address from UI or application layer to database and gets its city/country.
- Fast MySQL Range Queries on MaxMind GeoIP Tables byAndy Skelton
- MaxMind GeoIP Install Setup Tutorial by Bartomedia
- Finding a user's country/region based on IP by Shlomi Noach
- On efficiently geo-referencing IPs with MaxMind GeoIP and MySQL GIS by Jeremy Cole
- Location ID
- Country Code
- State/Region
- City Name
- Postal Code (US only)
- Latitude
- Longitude
- Metro Code
- Area Code
- Start IP Address
- End IP Address
- Location ID
You can also denormalize the above tables and create a single table. In either case by joining on range of IP address for a given set of IPs will result in the same slow performance.
dim_geo_ip_city_location holds around 5.9million rows with following attributes:
CREATE TABLE dim_geo_ip_city_location ( location_id int NOT NULL, start_ip varchar(16) NOT NULL, end_ip varchar(16) NOT NULL, country char(2), region varchar(64), city varchar(128), postal_code varchar(32), latitude decimal(8,4), longitude decimal(8,4), metro_code varchar(8), area_code varchar(8), ... )ENIGNE=... CREATE TABLE dim_ip_block ( start_ip_num bigint NOT NULL, end_ip_num bigint NOT NULL, location_id int NOT NULL, ... )ENGINE=...
In MySQL (5.0) optimizer performs range operation effeciently with a constant meaning that you have given the query an IP address like "WHERE ip_addr between 1230 and 1234", for example, but not with range operations without constants. For more info on using range optimization see mysql documentation and for indexing see "Mastering the Art of (MySQL) Indexing".
I did not find any article providing some insights into doing DW/BI analysis on IP addresses. In a DW environment the business user is interested in slicing and dicing the hundreds of thousands to million of IPs. Just trying to find few thousand IP address locations (See query 1 below) was taking more than an hour when joined between lookup table (a table similar to dim_geo_ip_city_location but without partition column (ip_range_partition) and id) and ip address join table even with appropriate indexes like index on (start_ip_num, end_ip_num) or start_ip_num and end_ip_num.
Query 1:
-- Simple query to select IP block SELECT IA.ip_num, B.location_id, B.start_ip_num, B.end_ip_num FROM ip_addresses IA use index(...) JOIN dim_ip_block B use index(...) WHERE IA.ip_num between B.start_ip_num and B.end_ip_num
This leads to MySQL performing full scans. And in the case of ip_block table (B) it doesn't use any index. For this to be avoided, I partitioned the dimension (lookup) table into 1000 parts and since this table has ~5.9 Million rows each partition will result in ~5,900 rows in each partition. For more on data set partitions see its wiki. The partitions should be numbered in the ascending order with respect to ascending range of ip blocks (start_ip_num). Then also create a another table (dim_ip_range_partition) with start_ip_num and end_ip_num for this each range basically creating super IP blocks.
CREATE TABLE dim_ip_range_partition ( partition_id int NOT NULL AUTO_INCREMENT, start_ip_num bigint NOT NULL, end_ip_num bigint NOT NULL, .... ) ENGINE=InnoDB ;
Also, create a primary key on patition_id and index on (start_ip_num, end_ip_num) on this table.
The solution is based on two things that there is no overlapping IP block and IP blocks are ordered (in ascending order).
To populate the above table,
- Add id (row_id) column to dim_ip_block
- Add partition_id integer column to dim_ip_block
- Update id from 1 to n (highest start_ip_num) in ascending order of start_ip_num
- Partition "n" ids into P parts (n/P) where n > 1 and 1 < P < n
- Assign each part/partition ascending integer value
- Now update above table dim_ip_range_partition P_TABLE from dim_ip_block B_TABLE where P_TABLE partition id between 1 and P of B_TABLE and P_TABLE.start_ip_num is lowest B_TABLE.start_ip_num of the corresponding partition and P_TABLE.end_ip_num is highest of corresponding partition. In other words...
(partition_id, start_ip_num, end_ip_num, ...)
SELECT partition_id, min(start_ip_num), max(end_ip_num), ...
FROM dim_ip_block
GROUP BY partition_id
ORDER BY partition_id asc
By doing so, we have created a super IP blocks each having n/P blocks from the original ip_blocks. And the following query is one way to extract city/country info for given set of IP addresses.
SELECT t.ip_num, t.ip, B.start_ip_num, B.start_ip, B.end_ip_num, B.end_ip ,L.city, L.country FROM ( --- Get the super ip block (partition) SELECT C.ip_num, max(partition_id) partition_id, max(P.start_ip_num) start_ip_num FROM ip_addresses C use index(...) JOIN dim_ip_range_partition P WHERE C.ip_num >= P.start_ip_num GROUP BY C.ip_num ) t LEFT OUTER JOIN mktg_ip_block B use index(...) ON t.partition_id = B.partition_id --- And join on this partition AND t.ip_num >= B.start_ip_num AND t.ip_num <= B.end_ip_num JOIN dim_geo_ip_city_location L --- Get city/country ON B.location_id = L.location_id
The query that would have taken more than hour returns within 30 seconds now! You can further optimize the above design with say more denormalized relations, having a large number of partitions (this is a trade-off, with highest partition n = P, each partition is same as original table which will result in worse off performance!) on both lookup table and lookedup table. Or extending the partitioning concept to more levels (super super block, ....) you can reduce the number of joining rows with little bit more complex query.
The solution is a typical Computer Science way of handling the complex problem by dividing into smaller ones and solving it. In this it is more specific to MySQL solution for large data sets with range operations.
Cheers,
Shiva