Character set, Collation

Here is an interesting challenge that I ran into in the Data Warehouse environment. While debugging a particular business user query, I joined between a string built in a sub-query to a column in a particular fact table with some 20million rows. BTW, I was using MySQL 5.0 and Toad 4.5 to run this query and instead of getting any result I got the following error!

MySQL Database Error: Illegal mix of collations (utf8_general_ci,COERCIBLE) and (latin1_swedish_ci,IMPLICIT) for operation '='    22    0

The query was simple and similar to
SELECT s.str, s.d1, s.d2
FROM (
    SELECT distinct 'string' str, dimension1 d1, dimension 2 d2
    FROM table_1
    WHERE ...
) s
LEFT OUTER JOIN table_2 t2
ON ...
WHERE s.str = t2.str
AND ...




The db error was indicating me that I was using two different collations for string comparison and shouldn't be doing so. That left me with either forcing one collation to be converted to other by force or look closely at server, database, table, column character set or collations and change it appropriately, if needed.

By default Toad sets the character_set_client, character_set_collation to 'utf8' and the column that joined in table 2 was in 'latin1'. Here is the first solution:


SELECT s.str, s.d1, s.d2
FROM (
    SELECT distinct 'string' str, dimension1 d1, dimension 2 d2
    FROM table_1
    WHERE ...
) s
LEFT OUTER JOIN table_2 t2
ON ...
WHERE s.str COLLATE utf8_general_ci = t2.str
AND ...


Here the COLLATE is used in WHERE clause but it can also be used in SELECT or GROUP BY or other aggregate functions.



Then looking at server, database's related variables indicated that they were in default setting.

character_set_server = latin1
collation_server = latin_swedish_ci
character_set_database = latin1
collation_database = latin_swedish_ci

and so was the table 2.

By setting the collation_connection and character_set_client to latin, I could effectively use the latin collation. Here is solution #2.

SET character_set_client = 'latin1';
COMMIT;
SET collation_connection = 'latin1_swedish_ci';
COMMIT;


Note, by setting the collation_connection I also set the character_set_connection to respective character set. After this, the query ran successfully.

Enjoy,
Shiva





No comments:

Post a Comment