Encodings in MySQL

DONG Yuxuan @ Dec 24, 2019


Concepts

In MySQL, the concept of encoding contains two subconcepts, character set and collation. A character set defines how string are encoded and a collation defines how strings are compared. Every character set has a default collation.

MySQL supports users to set character sets and collations on different levels (Server, Database, Table, Column). For example, you can specify a database using UTF-8 as character set and using the default collation of UTF-8 as the collation of the database, but specify utf8mb4 as the character set of a specified table and using utf8mb4_bin as its collation.

Available Options

To list all character sets, run the following command in MySQL.

mysql> SHOW CHARACTER SET;
+----------+---------------------------------+---------------------+--------+
| Charset  | Description                     | Default collation   | Maxlen |
+----------+---------------------------------+---------------------+--------+
| big5     | Big5 Traditional Chinese        | big5_chinese_ci     |      2 |
| dec8     | DEC West European               | dec8_swedish_ci     |      1 |
| cp850    | DOS West European               | cp850_general_ci    |      1 |
| hp8      | HP West European                | hp8_english_ci      |      1 |
| koi8r    | KOI8-R Relcom Russian           | koi8r_general_ci    |      1 |
| latin1   | cp1252 West European            | latin1_swedish_ci   |      1 |
| latin2   | ISO 8859-2 Central European     | latin2_general_ci   |      1 |
| swe7     | 7bit Swedish                    | swe7_swedish_ci     |      1 |
| ascii    | US ASCII                        | ascii_general_ci    |      1 |

--- More ---

To list all collations, run the following command in MySQL.

mysql> SHOW COLLATION;
+--------------------------+----------+-----+---------+----------+---------+
| Collation                | Charset  | Id  | Default | Compiled | Sortlen |
+--------------------------+----------+-----+---------+----------+---------+
| big5_chinese_ci          | big5     |   1 | Yes     | Yes      |       1 |
| big5_bin                 | big5     |  84 |         | Yes      |       1 |
| dec8_swedish_ci          | dec8     |   3 | Yes     | Yes      |       1 |
| dec8_bin                 | dec8     |  69 |         | Yes      |       1 |
| cp850_general_ci         | cp850    |   4 | Yes     | Yes      |       1 |
| cp850_bin                | cp850    |  80 |         | Yes      |       1 |
| hp8_english_ci           | hp8      |   6 | Yes     | Yes      |       1 |
| hp8_bin                  | hp8      |  72 |         | Yes      |       1 |
| koi8r_general_ci         | koi8r    |   7 | Yes     | Yes      |       1 |
| koi8r_bin                | koi8r    |  74 |         | Yes      |       1 |
| latin1_german1_ci        | latin1   |   5 |         | Yes      |       1 |
| latin1_swedish_ci        | latin1   |   8 | Yes     | Yes      |       1 |
| latin1_danish_ci         | latin1   |  15 |         | Yes      |       1 |
| latin1_german2_ci        | latin1   |  31 |         | Yes      |       2 |
| latin1_bin               | latin1   |  47 |         | Yes      |       1 |

--- More ---

Encodings on Server Level

If we create a column without explicitly specify the encoding, it inherits from the table level settings. Also, table level settings inherit from database level settings.

However, if we create a database, where dose it inherit encoding settings from? The answer is the server level settings. They can be checked by the following commands.

mysql> SHOW VARIABLES LIKE 'character_set_server';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| character_set_server | utf8  |
+----------------------+-------+
1 row in set (0.00 sec)

mysql> SHOW VARIABLES LIKE 'collation_server';
+------------------+-----------------+
| Variable_name    | Value           |
+------------------+-----------------+
| collation_server | utf8_general_ci |
+------------------+-----------------+
1 row in set (0.00 sec)

Server character set and collation can be specified in the config file. In most installations it is /etc/my.cnf. In Ubuntu with apt-installed MySQL, it’s /etc/mysql/mysql.conf.d/mysqld.cnf.

[mysqld]
character-set-server = utf8mb4
collation-server = utf8mb4_bin

Except the config file, they can also be set while compiling MySQL, while starting mysqld, and by running commands in a MySQL client.

Encodings on Databse Level

We can specify the character set and the collation while creating the database.

CREATE DATABASE db_name
	[[DEFAULT] CHARACTER SET charset_name]
	[[DEFAULT] COLLATE collation_name]

An example is below.

mysql> CREATE DATABASE test CHARACTER SET utf8mb4;
Query OK, 1 row affected (0.01 sec)

After creating, we can also modify the original settings.

ALTER DATABASE db_name
	[[DEFAULT] CHARACTER SET charset_name]
	[[DEFAULT] COLLATE collation_name]

Checking the encoding the current databse uses is also simple.

mysql> SELECT @@character_set_database;
+--------------------------+
| @@character_set_database |
+--------------------------+
| utf8mb4                  |
+--------------------------+
1 row in set (0.00 sec)

mysql> SELECT @@collation_database;
+----------------------+
| @@collation_database |
+----------------------+
| utf8mb4_general_ci   |
+----------------------+
1 row in set (0.00 sec)

Encodings on Table Level

Setting is similar to database level.

CREATE TABLE tbl_name (columns)
	[[DEFAULT] CHARACTER SET charset_name]
	[[DEFAULT] COLLATE collation_name]
ALTER TABLE tbl_name
	[[DEFAULT] CHARACTER SET charset_name]
	[[DEFAULT] COLLATE collation_name]

To check the encoding a table uses, run SHOW CREATE TABLE tbl_name to check the table structure which contains the encoding.

Encodings on Column Level

Only columns of type CHAR, VARCHAR, and TEXT can be specified a character set and a collation.

The syntax is below.

column_name { CHAR | VARCHAR | TEXT } (length)
	[CHARACTER SET charset_name]
	[COLLATE collation_name]

Here’s an example.

mysql> CREATE TABLE person(name CHAR(255) CHARACTER SET utf8 COLLATE utf8_bin);
Query OK, 0 rows affected (0.02 sec)

Modification is also allowed.

mysql> ALTER TABLE person MODIFY name CHAR(255) CHARACTER SET utf8mb4;
Query OK, 0 rows affected (0.05 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> SHOW CREATE TABLE person;
+--------+--------------------------------------------------------------------------------------------------------------------+
| Table  | Create Table                                                                                                       |
+--------+--------------------------------------------------------------------------------------------------------------------+
| person | CREATE TABLE `person` (
  `name` char(255) CHARACTER SET utf8mb4 DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |
+--------+--------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Connection Encodings

When a client connects to the server, it indicates which character set it wants to use for communication with the server. The server uses this information to set the character_set_client, character_set_results, character_set_connection system variables to the character set, and collation_connection to the character set default collation. These variables will be discussed below.

The server takes the character_set_client system variable to be the character set in which statements are sent by the client.

mysql> SHOW VARIABLES LIKE 'character_set_client';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| character_set_client | utf8  |
+----------------------+-------+
1 row in set (0.00 sec)

After receiving statements from a client, MySQL converts them to the charset set by the system variable character_set_connection with the collation set by the system variable collation_connection.

mysql> SHOW VARIABLES LIKE 'character_set_connection';
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| character_set_connection | utf8  |
+--------------------------+-------+
1 row in set (0.00 sec)

mysql> SHOW VARIABLES LIKE 'collation_connection';
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
+----------------------+-----------------+
1 row in set (0.01 sec)

Retrieved data is converted to the charset set by the system variable character_set_results before shipping them back to the client.

mysql> SHOW VARIABLES LIKE 'character_set_results';
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| character_set_results | utf8  |
+-----------------------+-------+
1 row in set (0.13 sec)

Because character_set_client, character_set_connection, and character_set_results are set from the client when it connects, they are the same unless we explicitly modify.

To tell the server to perform no conversion of result sets or error messages, set character_set_results to NULL or binary.

SET character_set_results = NULL;
SET character_set_results = binary;

Handling Conflicts

Considering the following statement.

SELECT x FROM T WHERE x = 'Y';

Which collation should we take while comparing x with 'Y'? The collation of column x or the collation of literal string 'Y'?

Moreover, which charset and collation should we take for CONCAT(utf8_column, latin1_column)?

MySQL has a subtle rule for conflicts. You may check it carefully.

To consisely summarize the rule, not strictly, we may say Unicode dominates non-Unicode and _bin collations dominates other collations.

Common Problems

UTF8 vs. UTF8MB4

Nowadays, most programs use UTF-8 as the default encoding. However, in MySQL, the utf8 character set can only hold 3 bytes per character. Thus, 4-byte characters like Emoji emotions can’t be stored in a utf8 column. To solve this, we can specify the character set to utf8mb4.

Case Sensitivity

As we mentioned above, every character set has a default collation. If you check them, you will find they are with the suffix _ci. This means ‘case insensitive’. Default collations are case insensitive while comparing and this is not what we usually want. To make comparasion case sensitive, specify collation to one with then suffix _bin.

Integration with Python

If you use plain SQL with pyMySQL, the connection charset can be set while creating a connection.

conn = pymysql.connect(host='localhost',
	user='username',
	passwd='password',
	db='database',
	charset='utf8')

If you use SQLAlchemy with pyMySQL, the connection charset can be set in the connection string.

# In plain SQLAlchemy

engine = sqlalchemy.create_engine('mysql+pymysql://username:password@hostname/database?charset=utf8mb4')

# In Flask-SQLAlchemy

app.config["SQLALCHEMY_DATABASE_URI"] = 'mysql+pymysql://username:password@hostname/database?charset=utf8mb4'

In SQLAlchemy, the column charset/collation can be set like the following.

class MyTableName(Base):
	__tablename__ = 'mytablename2'

	test_column = Column(Integer, primary_key=True)
	test_column2 = Column(String(100, collation='utf8_bin'))