Go to the first, previous, next, last section, table of contents.

8 National Character Sets and Unicode in MySQL 4.1

Improved handling of character sets is one of the features added to MySQL in Version 4.1. This chapter explains:

What are character sets and collations
The multi-level default system
New syntax in MySQL 4.1
Affected functions and operations
The meaning of each individual character set and collation

The features described here are as implemented in MySQL 4.1.1. (MySQL 4.1.0 has some but not all of these features, and some of them are implemented differently.)

8.1 Character Sets and Collations in General

A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Let's make the distinction clear with an example of an imaginary character set.

Suppose we had an alphabet with four letters: `A', `B', `a', `b'. We give each letter a number: `A' = 0, `B' = 1, `a' = 2, `c' = 3. The letter `A' is a symbol, the number 0 is the encoding for `A', and the combination of all four letters and their encodings is a character set.

Now, suppose we want to compare two string values, `A' and `B'. The simplest way to do this is to look at the encodings -- 0 for `A' and 1 for `B' -- and because 0 is less than 1, we say `A' is less than `B'. Now, what we've just done is apply a collation to our character set. The collation is a set of rules (only one rule in this case): ``compare the encodings''. We call this simplest of all possible collations a binary collation.

But what if we want to say that the lowercase and uppercase letters are equivalent? Then we would have at least two rules: (1) treat the lowercase letters `a' and `b' as equivalent to `A' and `B'; (2) then compare the encodings. We call this a case insensitive collation. It's a little more complex than a binary collation.

In real life, most character sets have many characters: not just `A' and `B' but whole alphabets, sometimes multiple alphabets or eastern writing systems with thousands of characters, along with many special symbols and punctuation marks. Also in real life, most collations have many rules: not just case insensitivity but also accent insensitivity (an ``accent'' is a mark attached to a character as in German `Ö') and multiple-character mappings (such as the rule that `Ö' = `OE' in one of the two German collations).

MySQL 4.1 can do these things for you:

Store strings using a variety of character sets
Compare strings using a variety of collations
Mix strings with different character sets or collations in the same server, the same database, or even the same table
Allow specification of character set and collation at any level

In these respects, not only is MySQL 4.1 far more flexible than MySQL 4.0, it also is far ahead of other DBMSs. However, to use the new features effectively, you will need to learn what character sets and collations are available, how to change their defaults, and what the various string operators do with them.

8.2 Character Sets and Collations in MySQL

A character set always has at least one collation. It may have several collations.

For example, character set latin1 (``ISO-8859-1 West European'') has the following collations:

Collation	Meaning
`latin1_bin`	Binary according to `latin1` encoding
`latin1_danish_ci`	Danish/Norwegian
`latin1_german1_ci`	German DIN-1
`latin1_german2_ci`	German DIN-2
`latin1_swedish_ci`	Swedish/Finnish
`latin1_general_ci`	Multilingual

Notes:

Two different character sets can't have the same collation.
Each character set has one collation that is the default collation. For example, the default collation for latin1 is latin1_swedish_ci.

Notice that there is a convention for collation names: They start with the name of the character set they are associated with, they usually include a language name, and they end with _ci (case insensitive), _cs (case sensitive), or _bin (binary).

8.3 Determining The Default Character Set And Collation

There are default settings for character sets and collations at four levels: server, database, table, connection. The following description may appear complex, but it's been found in practice that multi-level defaulting leads to natural and obvious results.

8.3.1 Server Character Set and Collation

The MySQL Server has a server character set and a server collation, which may not be null.

MySQL determines the server character set and server collation thus:

According to the option settings in effect when the server starts up.

At this level, the decision is simple. The server character set and collation depend on the options that you use when you start mysqld. You can use --default-character-set=character_set_name for the character set, and along with it you can add --default-collation=collation_name for the collation. If you don't specify a character set, that is the same as saying --default-character-set=latin1. If you specify only a character set (for instance, latin1) but not a collation, that is the same as saying --default-charset=latin1 --collation=latin1_swedish_ci because latin1_swedish_ci is the default collation for latin1. Therefore the following three commands all have the same effect:

shell> mysqld
shell> mysqld --default-character-set=latin1
shell> mysqld --default-character-set=latin1
           --default-collation=latin1_swedish_ci

One way to change the settings is by recompiling. If you want to change the default server character set and collation when building from sources, use: --with-character-set and --with-collation as arguments for @command{configure}. For example:

shell> ./configure --with-character-set=latin1

shell> ./configure --with-character-set=latin1
           --with-collation=latin1_german1_ci

Both mysqld and configure check that the character set/collation combination is valid. Each program displays an error message and terminates if the combination is not valid.

8.3.2 Database Character Set and Collation

Every database has a database character set and a database collation, which may not be null. The CREATE DATABASE and ALTER DATABASE commands now have optional clauses for specifying the database character set and collation:

CREATE DATABASE db_name
   [CHARACTER SET character_set_name [COLLATE collation_name]]

ALTER DATABASE db_name
    [CHARACTER SET character_set_name [COLLATE collation_name]]

Example:

CREATE DATABASE db_name
   CHARACTER SET latin1 COLLATE latin1_swedish_ci;

MySQL chooses the database character set and database collation thus:

If both CHARACTER SET X and COLLATE Y were specified, then character set X and collation Y.
If CHARACTER SET X was specified without COLLATE, then character set X and its default collation.
Otherwise, the server character set and server collation.

MySQL's CREATE DATABASE ... CHARACTER SET ... syntax is analogous to the standard-SQL CREATE SCHEMA ... CHARACTER SET ... syntax. Because of this, it is possible to create databases with different character sets and collations, on the same MySQL server.

The database character set and collation are used as default values if the table character set and collation are not specified in CREATE TABLE statements. They have no other purpose.

8.3.3 Table Character Set and Collation

Every table has a table character set and a table collation, which may not be null. The CREATE TABLE and ALTER TABLE statements now have optional clauses for specifying the table character set and collation:

CREATE TABLE table_name ( column_list )
   [CHARACTER SET character_set_name [COLLATE collation_name]]

ALTER TABLE table_name
   [CHARACTER SET character_set_name] [COLLATE collation_name]

Example:

CREATE TABLE t1 ( ... ) CHARACTER SET latin1 COLLATE latin1_danish_ci;

MySQL chooses the table character set and collation thus:

If both CHARACTER SET X and COLLATE Y were specified, then character set X and collation Y.
If CHARACTER SET X was specified without COLLATE, then character set X and its default collation.
Otherwise, the database character set and collation.

The table character set and collation are used as default values, if the column character set and collation are not specified in individual column definitions. The table character set and collation are MySQL extensions; there are no such things in standard SQL.

8.3.4 Column Character Set and Collation

Every ``character'' column (that is, a column of type CHAR, VARCHAR, or TEXT) has a column character set and a column collation, which may not be null. Column definition syntax now has optional clauses for specifying the column character set and collation:

column_name {CHAR | VARCHAR | TEXT} (column_length)
    [CHARACTER SET character_set_name [COLLATE collation_name]]

Example:

CREATE TABLE Table1
(
   column1 VARCHAR(5) CHARACTER SET latin1 COLLATE latin1_german1_ci
);

MySQL chooses the column character set and collation thus:

If both CHARACTER SET X and COLLATE Y were specified, then character set X and collation Y.
If CHARACTER SET X was specified without COLLATE, then character set X and its default collation.
Otherwise, the table character set and collation.

The CHARACTER SET and COLLATE clauses are standard SQL.

8.3.5 Examples of Character Set and Collation Assignment

The following examples show how MySQL determines default character set and collation values.

Example 1: Table + Column Definition

CREATE TABLE t1
(
  c1 CHAR(10) CHARACTER SET latin1 COLLATE latin1_german1_ci
) CHARACTER SET latin2 COLLATE latin2_bin;

Here you have a column with a latin1 character set and a latin1_german1_ci collation. The definition is explicit, so that's straightforward. Notice that there's no problem storing a latin1 column in a latin2 table.

Example 2: Table + Column Definition

CREATE TABLE t1
(
   c1 CHAR(10) CHARACTER SET latin1
) CHARACTER SET latin1 COLLATE latin1_danish_ci;

This time we have a column with a latin1 character set and a default collation. Now, although it might seem natural, the default collation is not taken from the table level. Instead, because the default collation for latin1 is always latin1_swedish_ci, column c1 will have a collation of latin1_swedish_ci (not latin1_danish_ci).

Example 3: Table + Column Definition

CREATE TABLE t1
(
   c1 CHAR(10)
) CHARACTER SET latin1 COLLATE latin1_danish_ci;

We have a column with a default character set and a default collation. In this circumstance, MySQL looks up to the table level for inspiration in determining the column character set and collation. So the character set for column c1 is latin1 and its collation is latin1_danish_ci.

Example 4: Database + Table + Column Definition

CREATE DATABASE d1 CHARACTER SET latin2 COLLATE latin2_czech_ci;
USE d1;
CREATE TABLE t1
(
   c1 CHAR(10)
);

We create a column without specifying its character set and collation. We're also not specifying a character set and a collation at the table level. In this circumstance, MySQL looks up to the database level for inspiration. (The database's settings become the table's settings, and thereafter become the column's setting.) So the character set for column c1 is latin2 and its collation is latin2_czech_ci.

8.3.6 Connection Character Sets and Collations

Every connection has connection character sets and connection collations, which may not be null. There are actually two connection character sets, which we will call ``connection/literals'' and ``connection/results'' when it is necessary to distinguish them.

Consider what a ``connection'' is: It's what you make when you connect to the server. The client sends SQL statements, such as queries, over the connection to the server. The server sends responses, such as result sets, over the connection back to the client. This leads to several questions, such as: (a) what character set is the query in when it leaves the client? (b) what character set should the server translate a query to after receiving it? (c) what character set should the server translate to before shipping result sets or error messages back to the client? You can fine-tune the setting for these things, or you can depend on the defaults (in which case, you can skip this section).

There are two statements that affect the connection character sets:

SET NAMES character_set_name
SET CHARACTER SET character_set_name

SET NAMES indicates what is in the SQL statement that the client sends. Thus, SET NAMES cp1251 tells the server ``future incoming messages from this client will be in character set cp1251'' and the server is free to translate to its own character set, if appropriate.

SET CHARACTER SET indicates what is in the SQL statement that the client sends, and also what is in the result set that the server sends back to the client. Thus, SET CHARACTER SET includes SET NAMES, and also specifies what character set the column values will have if, for example, you use a SELECT statement.

EXAMPLE: Suppose that column1 is defined as CHAR(5) CHARACTER SET latin2. If you do not say SET CHARACTER SET, then for SELECT column1 FROM t the server will send back all the values for column1 using character set latin2. If on the other hand you say SET CHARACTER SET latin1 then the server will, just before sending back, convert the latin2 values to latin1. Such conversion is slow and may be lossy.

When you execute SET NAMES or SET CHARACTER SET, you are also changing the ``connection collation''. However, the connection collation exists for consistency only. Usually its value doesn't matter.

With the mysql client, it is not necessary to execute SET NAMES every time you start up. You can add the --default-character-set-name option setting to your mysql statement line, or in your option file. For example, the following option file setting will change the connection character set each time you run mysql:

[mysql]
default-character-set-name=character_set_name

8.3.7 Character String Literal Character Set and Collation

Every character string literal has a character set and a collation, which may not be null.

A character string literal may have an optional character set introducer and COLLATE clause:

[_character_set_name]'string' [COLLATE collation_name]

Examples:

SELECT 'string';
SELECT _latin1'string';
SELECT _latin1'string' COLLATE latin1_danish_ci;

The simple statement SELECT 'string' uses the connection/literal character set.

The _character_set_name expression is formally called an introducer. It tells the parser, ``the string that is about to follow is in character set X.'' Because this has confused people in the past, we emphasize that an introducer does not cause any conversion, it is strictly a signal that does not change the string's value. An introducer is also legal before standard hex literal and numeric hex literal notation (x'literal' and 0xnnnn), and before ? (parameter substitution when using prepared statements within a programming language interface).

Examples:

SELECT _latin1 x'AABBCC';
SELECT _latin1 0xAABBCC;
SELECT _latin1 ?;

MySQL determines a literal's character set and collation thus:

If both _X and COLLATE Y were specified then the literal character set is X and the literal collation is Y
If _X is specified but COLLATE is not specified, then the literal character set is X and the literal collation is X's default collation
Otherwise, the connection/literals character set and collation.

Examples:

A string with latin1 character set and latin1_german1_ci collation:
```
SELECT _latin1'Müller' COLLATE latin1_german1_ci;
```
A string with latin1 character set and its default collation, that is, latin1_swedish_ci:
```
SELECT _latin1'Müller';
```
A string with the connection/literals character set and collation:
```
SELECT 'Müller';
```

Character set introducers and the COLLATE clause are implemented according to standard-SQL specifications.

8.3.8 `COLLATE` Clause in Various Parts of an SQL Query

With the COLLATE clause you can override whatever the default collation is for a comparison. COLLATE may be used in various parts of SQL queries. Here are some examples:

With ORDER BY:

SELECT k
FROM t1
ORDER BY k COLLATE latin1_german2_ci;

With AS:

SELECT k COLLATE latin1_german2_ci AS k1
FROM t1
ORDER BY k1;

With GROUP BY:

SELECT k
FROM t1
GROUP BY k COLLATE latin1_german2_ci;

With aggregate functions:

SELECT MAX(k COLLATE latin1_german2_ci)
FROM t1;

With DISTINCT:

SELECT DISTINCT k COLLATE latin1_german2_ci
FROM t1;

With WHERE:

SELECT *
FROM t1
WHERE _latin1 'Müller' COLLATE latin1_german2_ci = k;

With HAVING:

SELECT k
FROM t1
GROUP BY k
HAVING k = _latin1 'Müller' COLLATE latin1_german2_ci;

8.3.9 `COLLATE` Clause Precedence

The COLLATE clause has high precedence (higher than ||), so the expression

x || y COLLATE z

is equivalent to:

x || (y COLLATE z)

8.3.10 `BINARY` Operator

The BINARY operator is a shorthand for a COLLATE clause. For example, BINARY 'x' is equivalent to 'x' COLLATE y, where y is the name of an appropriate binary collation. For example, assuming that column a is of character set latin1, these two queries have the same effect:

SELECT * FROM t1 ORDER BY BINARY a;
SELECT * FROM t1 ORDER BY a COLLATE latin1_bin;

Note: Every character set has a binary collation.

8.3.11 Some Special Cases Where the Collation Determination is Tricky

In the great majority of queries, it is obvious what collation MySQL uses to resolve a comparison operation. For example, in the following cases it should be clear that the collation will be ``the column collation of column x'':

SELECT x FROM T ORDER BY x;
SELECT x FROM T WHERE x = x;
SELECT DISTINCT x FROM T;

However, when multiple operands are involved, there can be ambiguity. For example:

SELECT x FROM T WHERE x = 'Y';

Should this query use the collation of the column x, or of the string literal 'Y'?

Standard SQL resolves such questions using what used to be called ``coercibility'' rules. The essence is: Because x and 'Y' both have collations, whose collation takes precedence? It's complex, but these rules would take care of most situations:

An explicit COLLATE clause has precedence 4.
A concatenation of two strings with different collations has precedence 3.
A column's collation has precedence 2.
A literal's collation has precedence 1.

Those rules resolve ambiguities thus:

Use the collation with the highest precedence.
If both sides have the same precedence, then it's an error if the collations aren't the same.

Examples:

`column1 = 'A'`	Use collation of `column1`
`column1 = 'A' COLLATE x`	Use collation of `'A'`
`column1 COLLATE x = 'A' COLLATE y`	Error

8.3.12 Collations Must Be for the Right Character Set

Recall that each character set has one or more collations, and each collation is associated with one and only one character set. Therefore, the following statement causes an error message because the latin2_bin collation is not legal with the latin1 character set:

mysql> SELECT _latin1 'x' COLLATE latin2_bin;
ERROR 1251: COLLATION 'latin2_bin' is not valid
for CHARACTER SET 'latin1'

8.3.13 An example of the Effect of Collation

Suppose column X in table T has these latin1 column values:

Muffler
Müller
MX Systems
MySQL

And suppose that the column values are retrieved using the following statement:

SELECT X FROM T ORDER BY X COLLATE collation_name;

The resulting order of the values for different collations is shown in this table:

`latin1_swedish_ci`	`latin1_german1_ci`	`latin1_german2_ci`
Muffler	Muffler	Müller
MX Systems	Müller	Muffler
Müller	MX Systems	MX Systems
MySQL	MySQL	MySQL

The table is an example that shows what the effect would be if we used different collations in an ORDER BY clause. The character that's causing the trouble in this example is the U with two dots over it, which the Germans call U-umlaut, but we'll call it U-diaeresis.

The first column shows the result of the SELECT using the Swedish/Finnish collating rule, which says that U-diaeresis sorts with Y.

The second column shows the result of the SELECT using the German DIN-1 rule, which says that U-diaeresis sorts with U.

The third column shows the result of the SELECT using the German DIN-2 rule, which says that U-diaeresis sorts with UE.

Three different collations, three different results. That's what MySQL is here to handle. By using the appropriate collation, you can choose the sort order you want.

8.4 Operations Affected by Character Set Support

This section describes operations that take character set information into account now.

8.4.1 Result Strings

MySQL has many operators and functions that return a string. This section answers the question: What is the character set and collation of such a string?

For simple functions that take a string input and return a string result as output, the output's character set and collation are the same as the principal input's. For example, UPPER(X) returns a string whose character string and collation are the same as that of X. The same applies for: INSTR(), LCASE(), LOWER(), LTRIM(), MID(), REPEAT(), REPLACE(), REVERSE(), RIGHT(), RPAD(), RTRIM(), SOUNDEX(), SUBSTRING(), TRIM(), UCASE(), UPPER(). (Also note: the REPLACE() function, unlike all other functions, ignores the collation of the string input and performs a case-insensitive comparison every time.)

For operations that combine multiple string inputs and return a single string output, SQL-99's ``aggregation rules'' apply. They are:

If explicit COLLATE X occurs, then use X
If explicit COLLATE X and COLLATE Y occur, then error
Otherwise, if all collations are X, then use X
Otherwise, the result has no collation

For example, with CASE ... WHEN a THEN b WHEN b THEN c COLLATE X END, the resultant collation is X. The same applies for: CONCAT(), GREATEST(), IF(), LEAST(), CASE, UNION, ||, ELT().

For operations that convert to character data, the result string's character set and collation are in the connection/literals character set and have the connection/literals collation. This applies for: CHAR(), CAST(), CONV(), FORMAT(). HEX(), SPACE().

8.4.2 `CONVERT()`

CONVERT() provides a way to convert data between different character sets. The syntax is:

CONVERT(expr USING transcoding_name)

In MySQL, transcoding names are the same as the corresponding character set names.

Examples:

SELECT CONVERT(_latin1'Müller' USING utf8);
INSERT INTO utf8table (utf8column)
   SELECT CONVERT(latin1field USING utf8) FROM latin1table;

CONVERT(... USING ...) is implemented according to the SQL-99 specification.

8.4.3 `CAST()`

You may also use CAST() to convert a string to a different character set. The new format is:

CAST ( character_string AS character_data_type
    CHARACTER SET character_set_name )

Example:

SELECT CAST(_latin1'test' AS CHAR CHARACTER SET utf8);

You may not use a COLLATE clause inside a CAST(), but you may use it outside, that is, CAST(... COLLATE ...) is illegal but CAST(...) COLLATE ... is legal.

Example:

SELECT CAST(_latin1'test' AS CHAR CHARACTER SET utf8) COLLATE utf8_bin;

If you use CAST() without specifying CHARACTER SET, then the resulting character set and collation are the connection/literal character set and its default collation. If you use CAST() with CHARACTER SET X, then the resulting character set is X and the resulting collation is X's default collation.

8.4.4 `SHOW CHARACTER SET`

The SHOW CHARACTER SET command shows all available character sets. It takes an optional LIKE clause that indicates which character set names to match.

For example:

mysql> SHOW CHARACTER SET LIKE 'latin%';
+---------+-----------------------------+-------------------+--------+
| Charset | Description                 | Default collation | Maxlen |
+---------+-----------------------------+-------------------+--------+
| latin1  | ISO 8859-1 West European    | latin1_swedish_ci |      1 |
| latin2  | ISO 8859-2 Central European | latin2_general_ci |      1 |
| latin5  | ISO 8859-9 Turkish          | latin5_turkish_ci |      1 |
| latin7  | ISO 8859-13 Baltic          | latin7_general_ci |      1 |
+---------+-----------------------------+-------------------+--------+
4 rows in set (0.00 sec)

Notes about the preceding listing:

The Maxlen column shows the maximum number of bytes used to store one character.

8.4.5 `SHOW COLLATION`

The output from SHOW COLLATION includes all available character sets. It takes an optional LIKE clause that indicates which collation names to match.

mysql> SHOW COLLATION LIKE 'latin1%';
+-------------------+---------+----+---------+----------+---------+
| Collation         | Charset | Id | Default | Compiled | Sortlen |
+-------------------+---------+----+---------+----------+---------+
| latin1_german1_ci | latin1  |  5 |         |          |       0 |
| latin1_swedish_ci | latin1  |  8 | Yes     | Yes      |       0 |
| latin1_danish_ci  | latin1  | 15 |         |          |       0 |
| latin1_german2_ci | latin1  | 31 |         | Yes      |       2 |
| latin1_bin        | latin1  | 47 |         | Yes      |       0 |
| latin1_general_ci | latin1  | 48 |         |          |       0 |
| latin1_general_cs | latin1  | 49 |         |          |       0 |
+-------------------+---------+----+---------+----------+---------+
7 rows in set (0.00 sec)

The Default column indicates whether a collation is the default for its character set. Compiled indicates whether or not the character set is compiled into the server. Sortlen is related to the amount of memory required to sort strings expressed in the character set.

8.4.6 `SHOW CREATE DATABASE`

The following query shows a CREATE DATABASE statement that will create the given database. The result includes all database options. DEFAULT CHARACTER SET and COLLATE are supported. All database options are stored in a text file that can be found in the database directory.

mysql> SHOW CREATE DATABASE a;
+----------+---------------------------------------------------------------------------------+
| Database | Create Database
|
+----------+---------------------------------------------------------------------------------+
| a        | CREATE DATABASE `a` /*!40100 DEFAULT CHARACTER SET macce
COLLATE macce_ci_ai */ |
+----------+---------------------------------------------------------------------------------+
1 row in set (0.00 sec)

8.4.7 `SHOW FULL COLUMNS`

The SHOW COLUMNS statement now displays the collations of a table's columns, when invoked as SHOW FULL COLUMNS. Columns with CHAR, VARCHAR, or TEXT datatypes have non-NULL collations. Numeric and other non-character types have NULL collations. For example:

mysql> SHOW FULL COLUMNS FROM a;
+-------+---------+-------------------+------+-----+---------+-------+
| Field | Type    | Collation         | Null | Key | Default | Extra |
+-------+---------+-------------------+------+-----+---------+-------+
| a     | char(1) | latin1_swedish_ci | YES  |     | NULL    |       |
| b     | int(11) | NULL              | YES  |     | NULL    |       |
+-------+---------+-------------------+------+-----+---------+-------+
2 rows in set (0.02 sec)

The character set is not part of the display.

8.5 Unicode Support

There are two new character sets for storing Unicode data: ucs2 (the UCS-2 Unicode character set) and utf8 (the UTF-8 encoding of the Unicode character set).

In UCS-2 (binary Unicode representation) every character is represented by a two-byte Unicode code with the most significant byte first. For example: "LATIN CAPITAL LETTER A" has the code 0x0041 and it's stored as a two byte sequence: 0x00 0x41. "CYRILLIC SMALL LETTER YERU" (Unicode 0x044B) is stored as a two byte sequence: 0x04 0x4B. For Unicode characters and their codes please refer to the Unicode Home Page. Temporary restriction: UCS-2 can't (yet) be used as a client character set. That means that SET NAMES ucs2 will not work.
The UTF8 character set (transform Unicode representation) is an alternative way to store Unicode data. It is implemented according to RFC2279. The idea of the UTF8 character set is that various Unicode characters fit into byte sequences of different lengths.
- Basic Latin letters, digits, and punctuation signs use one byte.
- Most European and Middle East script letters fit into a two-byte sequence: extended Latin letters (with tilde, macron, acute, grave and other accents), Cyrillic, Greek, Armenian, Hebrew, Arabic, Syriac, and others.
- Korean, Chinese and Japanese ideographs use three-byte sequences.
- Currently, MySQL UTF8 support does not include four-byte sequences.
Tip: To save space with UTF8, use VARCHAR instead of CHAR. Otherwise, MySQL has to reserve 30 bytes for a CHAR(10) CHARACTER SET utf8 column, because that's the maximum possible length.

8.6 UTF8 for Metadata

The metadata is the data about the data. Anything that describes the database, as opposed to being the contents of the database, is metadata. Thus column names, database names, user names, version names, and most of the string results from SHOW, are metadata.

All metadata must be in the same character set. (Otherwise, SHOW wouldn't work properly because different rows in the same column would be in different character sets.) On the other hand, metadata must include all characters in all languages. (Otherwise, users wouldn't be able to name columns and tables in their own languages.) In order to allow for both of these objectives, MySQL stores metadata in a Unicode character set, namely UTF8. This will not cause any disruption if you never use accented characters. But if you do, you should be aware that metadata is in UTF8.

This means that USER() (and its synonyms, SESSION_USER() and SYSTEM_USER()), CURRENT_USER(), and VERSION() functions will have the UTF8 character set by default.

This does NOT mean that the headers of columns and the results of DESCRIBE functions will be in the UTF8 character set by default. (When you say SELECT column1 FROM t the name column1 itself will be returned from the server to the client in the client's character set as determined by the SET NAMES statement.)

If you want the server to pass metadata results back in a non-UTF8 character set, then use SET CHARACTER SET to force the server to convert (see section 8.3.6 Connection Character Sets and Collations), or set the client to do the conversion. It is always more efficient to set the client to do the conversion, but this option will not be available for many clients until late in the MySQL 4.x product cycle.

If you are just using, for example, the USER() function for comparison or assignment within a single statement ... don't worry. MySQL will do some automatic conversion for you.

SELECT * FROM Table1 WHERE USER() = latin1_column;

This will work, because the contents of latin1_column are automatically converted to UTF8 before the comparison.

INSERT INTO Table1 (latin1_column) SELECT USER();

This will work, becaues the contents of USER() are automatically converted to latin1 before the assignment. Automatic conversion is not fully implemented yet, but should work correctly in a later version.

Although automatic conversion is not in the SQL standard, the SQL standard document does say that every character set is (in terms of supported characters) a ``subset'' of Unicode. Since it is a well-known principle that ``what applies to a superset can apply to a subset,'' we believe that a collation for Unicode can apply for comparisons with non-Unicode strings.

VERSION 4.1.1 NOTE: The `errmsg.txt' files will all be in UTF8 after this point. Conversion to the client character set will be automatic, as for metadata. Also: We may change the default behaviour for passing back result set metadata in the near future.

8.7 Compatibility with Other DBMSs

For SAP DB compatibility these two statements are the same:

CREATE TABLE t1 (f1 CHAR(n) UNICODE);
CREATE TABLE t1 (f1 CHAR(n) CHARACTER SET ucs2);

8.8 New Character Set Configuration File format

In MySQL 4.1, character set configuration is stored in XML files, one file per character set. (In previous versions, this information was stored in `.conf' files.)

8.9 National Character Set

In MySQL-4.x and earlier, NCHAR and CHAR were synonymous. ANSI defines NCHAR or NATIONAL CHAR as a way to define that a CHAR column should use some predefined character set. MySQL uses utf8 as that predefined character set. For example, these column type declarations are equivalent:

CHAR(10) CHARACTER SET utf8
NATIONAL CHARACTER(10)
NCHAR(10)

As are these:

VARCHAR(10) CHARACTER SET utf8
NATIONAL VARCHAR(10)
NCHAR VARCHAR(10)
NATIONAL CHARACTER VARYING(10)
NATIONAL CHAR VARYING(10)

You can use N'literal' to create a string in national character set.

These two statements are equivalent:

SELECT N'some text';
SELECT _utf8'some text';

8.10 Upgrading from MySQL 4.0

Now, what about upgrading from older versions of MySQL? MySQL 4.1 is almost upward compatible with MySQL 4.0 and earlier for the simple reason that almost all of the features are new, so there's nothing in earlier versions to conflict with. However, there are some differences and a few things to be aware of.

Most important: The ``MySQL 4.0 character set'' has the properties of both ``MySQL 4.1 character sets'' and ``MySQL 4.1 collations.'' You will have to unlearn this. Henceforth we will not bundle character set / collation properties in the same conglomerate object.

There is a special treatment of national character sets in MySQL 4.1. NCHAR is not the same as CHAR, and N'...' literals are not the same as '...' literals.

Finally, there is a different file format for storing information about character sets and collations. Make sure you have reinstalled the `/share/mysql/charsets/' directory containing the new configuration files.

If you want to start mysqld from a 4.1.x distribution with data created by MySQL 4.0, you should start the server with the same character set and collation. In this case you won't need to reindex your data.

There are two ways to do so:

shell> ./configure --with-character-set=... --with-collation=...
shell> ./mysqld --default-character-set=... --default-collation=...

If you used mysql with, for example, the MySQL 4.0 danish character set, you should now use the latin1 character set and the latin1_danish_ci collation:

shell> ./configure --with-character-set=latin1
           --with-collation=latin1_danish_ci
shell> ./mysqld --default-character-set=latin1
           --default-collation=latin1_danish_ci

Use the table shown in the next section to find old 4.0 character set names and their 4.1 character set/collation pair equivalents.

8.10.1 4.0 Character Sets and Corresponding 4.1 Character Set/Collation Pairs

ID	4.0 Character Set	4.1 Character Set	4.1 Collation
1	`big5`	`big5`	`big5_chinese_ci`
2	`czech`	`latin2`	`latin2_czech_ci`
3	`dec8`	`dec8`	`dec8_swedish_ci`
4	`dos`	`cp850`	`cp850_general_ci`
5	`german1`	`latin1`	`latin1_german1_ci`
6	`hp8`	`hp8`	`hp8_english_ci`
7	`koi8_ru`	`koi8r`	`koi8r_general_ci`
8	`latin1`	`latin1`	`latin1_swedish_ci`
9	`latin2`	`latin2`	`latin2_general_ci`
10	`swe7`	`swe7`	`swe7_swedish_ci`
11	`usa7`	`ascii`	`ascii_general_ci`
12	`ujis`	`ujis`	`ujis_japanese_ci`
13	`sjis`	`sjis`	`sjis_japanese_ci`
14	`cp1251`	`cp1251`	`cp1251_bulgarian_ci`
15	`danish`	`latin1`	`latin1_danish_ci`
16	`hebrew`	`hebrew`	`hebrew_general_ci`
17	`win1251`	(removed)	`(removed)`
18	`tis620`	`tis620`	`tis620_thai_ci`
19	`euc_kr`	`euckr`	`euckr_korean_ci`
20	`estonia`	`latin7`	`latin7_estonian_ci`
21	`hungarian`	`latin2`	`latin2_hungarian_ci`
22	`koi8_ukr`	`koi8u`	`koi8u_ukrainian_ci`
23	`win1251ukr`	`cp1251`	`cp1251_ukrainian_ci`
24	`gb2312`	`gb2312`	`gb2312_chinese_ci`
25	`greek`	`greek`	`greek_general_ci`
26	`win1250`	`cp1250`	`cp1250_general_ci`
27	`croat`	`latin2`	`latin2_croatian_ci`
28	`gbk`	`gbk`	`gbk_chinese_ci`
29	`cp1257`	`cp1257`	`cp1257_lithuanian_ci`
30	`latin5`	`latin5`	`latin5_turkish_ci`
31	`latin1_de`	`latin1`	`latin1_german2_ci`

8.11 The Character Sets and Collations that MySQL Supports

Here is an annotated list of character sets and collations that MySQL supports. Because options and installation settings differ, some sites will not have all items in the list, and some sites will have items that are not on the list because defining new character sets or collations is straightforward.

MySQL supports 70+ collations for 30+ character sets.

mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+--------+
| Charset  | Description                 | Default collation   | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |      2 |
| dec8     | DEC West European           | dec8_swedish_ci     |      1 |
| cp850    | DOS West European           | cp850_general_ci    |      1 |
| hp8      | HP West European            | hp8_english_ci      |      1 |
| koi8r    | KOI8-R Relcom Russian       | koi8r_general_ci    |      1 |
| latin1   | ISO 8859-1 West European    | latin1_swedish_ci   |      1 |
| latin2   | ISO 8859-2 Central European | latin2_general_ci   |      1 |
| swe7     | 7bit Swedish                | swe7_swedish_ci     |      1 |
| ascii    | US ASCII                    | ascii_general_ci    |      1 |
| ujis     | EUC-JP Japanese             | ujis_japanese_ci    |      3 |
| sjis     | Shift-JIS Japanese          | sjis_japanese_ci    |      2 |
| cp1251   | Windows Cyrillic            | cp1251_bulgarian_ci |      1 |
| hebrew   | ISO 8859-8 Hebrew           | hebrew_general_ci   |      1 |
| tis620   | TIS620 Thai                 | tis620_thai_ci      |      1 |
| euckr    | EUC-KR Korean               | euckr_korean_ci     |      2 |
| koi8u    | KOI8-U Ukrainian            | koi8u_general_ci    |      1 |
| gb2312   | GB2312 Simplified Chinese   | gb2312_chinese_ci   |      2 |
| greek    | ISO 8859-7 Greek            | greek_general_ci    |      1 |
| cp1250   | Windows Central European    | cp1250_general_ci   |      1 |
| gbk      | GBK Simplified Chinese      | gbk_chinese_ci      |      2 |
| latin5   | ISO 8859-9 Turkish          | latin5_turkish_ci   |      1 |
| armscii8 | ARMSCII-8 Armenian          | armscii8_general_ci |      1 |
| utf8     | UTF-8 Unicode               | utf8_general_ci     |      3 |
| ucs2     | UCS-2 Unicode               | ucs2_general_ci     |      2 |
| cp866    | DOS Russian                 | cp866_general_ci    |      1 |
| keybcs2  | DOS Kamenicky Czech-Slovak  | keybcs2_general_ci  |      1 |
| macce    | Mac Central European        | macce_general_ci    |      1 |
| macroman | Mac West European           | macroman_general_ci |      1 |
| cp852    | DOS Central European        | cp852_general_ci    |      1 |
| latin7   | ISO 8859-13 Baltic          | latin7_general_ci   |      1 |
| cp1256   | Windows Arabic              | cp1256_general_ci   |      1 |
| cp1257   | Windows Baltic              | cp1257_general_ci   |      1 |
| binary   | Binary pseudo charset       | binary              |      1 |
+----------+-----------------------------+---------------------+--------+
33 rows in set (0.01 sec)

NB: ALL CHARACTER SETS HAVE A BINARY COLLATION. WE HAVE NOT INCLUDED THE BINARY COLLATION IN ALL THE DESCRIPTIONS THAT FOLLOW.

8.11.1 The Unicode Character Sets

Of course there are our two Unicode character sets. You can store texts in about 650 languages using these character sets. We have not added a large number of collations for these two new sets yet, but that will be happening soon. Now they have default case-insensitive accent-insensitive collations, plus the binary collation.

+---------+-----------------+-------------------+--------+
| Charset | Description     | Default collation | Maxlen |
+---------+-----------------+-------------------+--------+
| utf8    | UTF-8 Unicode   | utf8_general_ci   |      3 |
| ucs2    | UCS-2 Unicode   | ucs2_general_ci   |      2 |
+---------+-----------------+-------------------+--------+

8.11.2 Platform Specific Character Sets

+----------+-----------------------------+---------------------+--------+
| Charset  | Description                 | Default collation   | Maxlen |
+----------+-----------------------------+---------------------+--------+
| dec8     | DEC West European           | dec8_swedish_ci     |      1 |
| hp8      | HP West European            | hp8_english_ci      |      1 |
+----------+-----------------------------+---------------------+--------+

8.11.3 Character Sets for South Europe and Middle East

+----------+-----------------------------+---------------------+--------+
| Charset  | Description                 | Default collation   | Maxlen |
+----------+-----------------------------+---------------------+--------+
| armscii8 | ARMSCII-8 Armenian          | armscii8_general_ci |      1 |
| cp1256   | Windows Arabic              | cp1256_general_ci   |      1 |
| hebrew   | ISO 8859-8 Hebrew           | hebrew_general_ci   |      1 |
| greek    | ISO 8859-7 Greek            | greek_general_ci    |      1 |
| latin5   | ISO 8859-9 Turkish          | latin5_turkish_ci   |      1 |
+----------+-----------------------------+---------------------+--------+

8.11.4 The Asian Character Sets

The Asian character sets that we support include Chinese, Japanese, Korean, and Thai. These can be complicated. For example, the Chinese sets have to allow for thousands of different characters.

+----------+-----------------------------+---------------------+--------+
| Charset  | Description                 | Default collation   | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |      2 |
| gb2312   | GB2312 Simplified Chinese   | gb2312_chinese_ci   |      2 |
| gbk      | GBK Simplified Chinese      | gbk_chinese_ci      |      2 |
| euckr    | EUC-KR Korean               | euckr_korean_ci     |      2 |
| ujis     | EUC-JP Japanese             | ujis_japanese_ci    |      3 |
| sjis     | Shift-JIS Japanese          | sjis_japanese_ci    |      2 |
| tis620   | TIS620 Thai                 | tis620_thai_ci      |      1 |
+----------+-----------------------------+---------------------+--------+

8.11.5 The Baltic Character Sets

The Baltic character sets cover Estonian, Latvian, and Lithuanian languages. There are two Baltic character sets currently supported:

latin7 (ISO 8859-13 Baltic):

+----------------------+----------+----+---------+----------+---------+
| Collation            | Charset  | Id | Default | Compiled | Sortlen |
+----------------------+----------+----+---------+----------+---------+
| latin7_estonian_cs   | latin7   | 20 |         |          |       0 |
| latin7_general_ci    | latin7   | 41 | Yes     |          |       0 |
| latin7_general_cs    | latin7   | 42 |         |          |       0 |
| latin7_bin           | latin7   | 79 |         |          |       0 |
+----------------------+----------+----+---------+----------+---------+

cp1257 (Windows Baltic):

+----------------------+----------+----+---------+----------+---------+
| Collation            | Charset  | Id | Default | Compiled | Sortlen |
+----------------------+----------+----+---------+----------+---------+
| cp1257_lithuanian_ci | cp1257   | 29 |         |          |       0 |
| cp1257_bin           | cp1257   | 58 |         |          |       0 |
| cp1257_general_ci    | cp1257   | 59 | Yes     |          |       0 |
+----------------------+----------+----+---------+----------+---------+

8.11.6 The Cyrillic Character Sets

Here are the Cyrillic character sets and collations for use with Belarusian, Bulgarian, Russian, Ukrainian languages.

cp1251 (Windows Cyrillic):

+----------------------+----------+----+---------+----------+---------+
| Collation            | Charset  | Id | Default | Compiled | Sortlen |
+----------------------+----------+----+---------+----------+---------+
| cp1251_bulgarian_ci  | cp1251   | 14 |         |          |       0 |
| cp1251_ukrainian_ci  | cp1251   | 23 |         |          |       0 |
| cp1251_bin           | cp1251   | 50 |         |          |       0 |
| cp1251_general_ci    | cp1251   | 51 | Yes     |          |       0 |
| cp1251_general_cs    | cp1251   | 52 |         |          |       0 |
+----------------------+----------+----+---------+----------+---------+

cp866 (DOS Russian):

+----------------------+----------+----+---------+----------+---------+
| Collation            | Charset  | Id | Default | Compiled | Sortlen |
+----------------------+----------+----+---------+----------+---------+
| cp866_general_ci     | cp866    | 36 | Yes     |          |       0 |
| cp866_bin            | cp866    | 68 |         |          |       0 |
+----------------------+----------+----+---------+----------+---------+

koi8r (KOI8-R Relcom Russian, primarily used in Russia on Unix):

+----------------------+----------+----+---------+----------+---------+
| Collation            | Charset  | Id | Default | Compiled | Sortlen |
+----------------------+----------+----+---------+----------+---------+
| koi8r_general_ci     | koi8r    |  7 | Yes     |          |       0 |
| koi8r_bin            | koi8r    | 74 |         |          |       0 |
+----------------------+----------+----+---------+----------+---------+

koi8u (KOI8-U Ukrainian, primarily used in Ukraine on Unix):

+----------------------+----------+----+---------+----------+---------+
| Collation            | Charset  | Id | Default | Compiled | Sortlen |
+----------------------+----------+----+---------+----------+---------+
| koi8u_general_ci     | koi8u    | 22 | Yes     |          |       0 |
| koi8u_bin            | koi8u    | 75 |         |          |       0 |
+----------------------+----------+----+---------+----------+---------+

8.11.7 The Central European Character Sets

We have some support for character sets used in The Czech Republic, Slovakia, Hungary, Romania, Slovenia, Croatia, and Poland.

cp1250 (Windows Central European):

+----------------------+----------+----+---------+----------+---------+
| Collation            | Charset  | Id | Default | Compiled | Sortlen |
+----------------------+----------+----+---------+----------+---------+
| cp1250_general_ci    | cp1250   | 26 | Yes     |          |       0 |
| cp1250_czech_ci      | cp1250   | 34 |         | Yes      |       2 |
| cp1250_bin           | cp1250   | 66 |         |          |       0 |
+----------------------+----------+----+---------+----------+---------+

cp852 (DOS Central European):

+----------------------+----------+----+---------+----------+---------+
| Collation            | Charset  | Id | Default | Compiled | Sortlen |
+----------------------+----------+----+---------+----------+---------+
| cp852_general_ci     | cp852    | 40 | Yes     |          |       0 |
| cp852_bin            | cp852    | 81 |         |          |       0 |
+----------------------+----------+----+---------+----------+---------+

macce (Mac Central European):

+----------------------+----------+----+---------+----------+---------+
| Collation            | Charset  | Id | Default | Compiled | Sortlen |
+----------------------+----------+----+---------+----------+---------+
| macce_general_ci     | macce    | 38 | Yes     |          |       0 |
| macce_bin            | macce    | 43 |         |          |       0 |
+----------------------+----------+----+---------+----------+---------+

latin2 (ISO 8859-2 Central European):

+----------------------+----------+----+---------+----------+---------+
| Collation            | Charset  | Id | Default | Compiled | Sortlen |
+----------------------+----------+----+---------+----------+---------+
| latin2_czech_ci      | latin2   |  2 |         | Yes      |       4 |
| latin2_general_ci    | latin2   |  9 | Yes     |          |       0 |
| latin2_hungarian_ci  | latin2   | 21 |         |          |       0 |
| latin2_croatian_ci   | latin2   | 27 |         |          |       0 |
| latin2_bin           | latin2   | 77 |         |          |       0 |
+----------------------+----------+----+---------+----------+---------+

keybcs2 (DOS Kamenicky Czech-Slovak):

+----------------------+----------+----+---------+----------+---------+
| Collation            | Charset  | Id | Default | Compiled | Sortlen |
+----------------------+----------+----+---------+----------+---------+
| keybcs2_general_ci   | keybcs2  | 37 | Yes     |          |       0 |
| keybcs2_bin          | keybcs2  | 73 |         |          |       0 |
+----------------------+----------+----+---------+----------+---------+

8.11.8 The West European Character Sets

West European Character Sets cover most West European languages, such as French, Spanish, Catalan, Basque, Portuguese, Italian, Albanian, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English.

latin1 (ISO 8859-1 West European):

+----------------------+----------+----+---------+----------+---------+
| Collation            | Charset  | Id | Default | Compiled | Sortlen |
+----------------------+----------+----+---------+----------+---------+
| latin1_german1_ci    | latin1   |  5 |         |          |       0 |
| latin1_swedish_ci    | latin1   |  8 | Yes     | Yes      |       0 |
| latin1_danish_ci     | latin1   | 15 |         |          |       0 |
| latin1_german2_ci    | latin1   | 31 |         | Yes      |       2 |
| latin1_bin           | latin1   | 47 |         | Yes      |       0 |
| latin1_general_ci    | latin1   | 48 |         |          |       0 |
| latin1_general_cs    | latin1   | 49 |         |          |       0 |
+----------------------+----------+----+---------+----------+---------+

The latin1_swedish_ci collation is the default that probably is used by the majority of MySQL customers. It is constantly stated that this is based on the Swedish/Finnish collation rules, but you will find Swedes and Finns who disagree with that statement. The latin1_german1_ci and latin1_german2_ci collations are based on the DIN-1 and DIN-2 standards, where DIN stands for Deutsches Institut Für Normung (that is, the German answer to ANSI). DIN-1 is called the dictionary collation and DIN-2 is called the phone-book collation.

latin1_german1_ci (dictionary) rules:

`Ä' = `A', `Ö' = `O', `Ü' = `U', `ß' = `s'

latin1_german2_ci (phone-book) rules:

`Ä' = `AE', `Ö' = `OE', `Ü' = `UE', `ß' = `ss'

macroman (Mac West European):

+----------------------+----------+----+---------+----------+---------+
| Collation            | Charset  | Id | Default | Compiled | Sortlen |
+----------------------+----------+----+---------+----------+---------+
| macroman_general_ci  | macroman | 39 | Yes     |          |       0 |
| macroman_bin         | macroman | 53 |         |          |       0 |
+----------------------+----------+----+---------+----------+---------+

cp850 (DOS West European):

+----------------------+----------+----+---------+----------+---------+
| Collation            | Charset  | Id | Default | Compiled | Sortlen |
+----------------------+----------+----+---------+----------+---------+
| cp850_general_ci     | cp850    |  4 | Yes     |          |       0 |
| cp850_bin            | cp850    | 80 |         |          |       0 |
+----------------------+----------+----+---------+----------+---------+

Go to the first, previous, next, last section, table of contents.