1    Introduction

Internationalization refers to the process of developing programs without prior knowledge of the language, cultural data, or character-encoding schemes that the programs are expected to handle. In system terms, internationalization refers to the provision of interfaces that let programs modify their behavior at run time for operation in a specific language environment. The mnemonic I18N is frequently used as an abbreviation for internationalization.

This manual describes Digital UNIX interfaces and utilities that help you develop internationalized programs. These interfaces and utilities conform to specifications in the X/Open UNIX standard. This standard allows for implementation-defined behavior in certain areas. This manual identifies those software characteristics that are vendor specific.


1.1    Language

An internationalized program makes no assumptions about the language of character data (text) that the program is designed to handle. The term data refers to data generated internally, data extracted from or written to files, and message text used for communication with the program's user.

Language has implications for processing text for such things as character handling and word ordering. Digital UNIX provides interfaces that allow internationalized programs to manipulate text according to the language requirements of individual users.

Language differences require the separation of message text from program code. Digital UNIX provides facilities that allow message text to be separated from the code, translated into different languages, and accessed by the program at run time. Chapter 3 explains how an internationalized program that uses the Worldwide Portability Interfaces (WPI) generates and accesses messages.

An internationalized program that uses X and Motif interfaces can separate message text from program code in the following ways:

For information about separating message text from program code for X and Motif interfaces, refer to the following books:


1.2    Cultural Data

Cultural data refers to the conventions of a geographic area or territory for such things as date, time, and currency formats.

An internationalized program cannot assume in advance how these formats are set; instead, it uses system facilities to determine the formats at run time. This capability is provided through a language information database that programs can query for the required formats of cultural data items.
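
The following is a minimal sketch of determining a date format at run time rather than coding it into the program. It assumes only the ISO C strftime() function; the function name format_date_for_locale is hypothetical and is not part of any system interface.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/*
 * Formats the date in *when using the date representation of the
 * locale currently bound to the program. The "%x" conversion asks
 * the library for the locale's own date format, so no format such
 * as month/day/year is coded into the program.
 * Returns the number of bytes written, or 0 if buf is too small.
 */
size_t format_date_for_locale(char *buf, size_t size, const struct tm *when)
{
    assert(buf != NULL && when != NULL);
    return strftime(buf, size, "%x", when);
}
```

A caller would typically fill a struct tm by using localtime() and pass a buffer of a few dozen bytes. The same call produces different text after the program binds to a different locale.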


1.3    Character Sets

A character set is a set of alphabetic or other characters used to construct the words and other elementary units of a native language or computer language. A coded character set (or codeset) is a set of unambiguous rules that establishes a character set and the one-to-one relationship between each character of the set and its bit representation.

For a program to be able to handle text recorded in different codesets, the program cannot make assumptions about the size or bit assignment of character encodings. In particular, the program cannot assume that any part of an area used to store a character is available for other uses.
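
As an illustration of not assuming a fixed character size, the following sketch counts the characters in a string by asking the library how many bytes each character occupies. It assumes the ISO C mblen() function; the name count_characters is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Returns the number of characters (not bytes) in s, as interpreted
 * in the codeset of the current locale, or -1 if s contains a byte
 * sequence that is not a valid character in that codeset.
 */
long count_characters(const char *s)
{
    long count = 0;
    size_t remaining;

    assert(s != NULL);
    remaining = strlen(s);

    (void)mblen(NULL, 0);              /* reset any shift state */
    while (remaining > 0) {
        int len = mblen(s, remaining); /* bytes in the next character */
        if (len < 0)
            return -1;                 /* invalid byte sequence */
        if (len == 0)
            break;                     /* null byte; not expected here */
        s += len;
        remaining -= (size_t)len;
        count++;
    }
    return count;
}
```

In a locale with a single-byte codeset the result equals strlen(s). In a locale with a multibyte codeset the result can be smaller, which is why byte counts and character counts must be kept separate.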


1.4    Localization

Localization refers to the process of implementing local requirements within a computer system. Some of these requirements are addressed by locales. Each locale is a set of data that supports a particular combination of native language, cultural data, and codeset. The type of information a locale can contain and the interfaces that use a locale are subject to standardization. However, where locales reside on the system and how they are named can vary from one vendor to another.

Locales do not solve all of the problems that localization must address. For example, the localization process also means making sure that translations are available for software messages; that appropriate fonts and measurement systems are supported and available for display and printing devices; and, in some cases, that additional software is written to handle local requirements.

The mnemonic L10N is frequently used as an abbreviation for localization.


1.4.1    Collating Sequence

The ordering of characters may be implicit in underlying hardware but can be defined for software to conform to the way language is used in a particular territory. Many languages have more complex rules for sorting than English. The following list shows why some English assumptions about character sorting do not apply to other languages:

Each locale contains collating-sequence information that informs string comparison functions about the relative ordering of characters defined in the associated codeset. Internationalized regular expressions also use the collating sequence for implementing character ranges, collating symbols, and equivalence classes.
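
The following sketch shows one way a program might sort strings according to the collating sequence of the current locale instead of by character code. It assumes the ISO C strcoll() and qsort() functions; the names compare_by_locale and sort_by_locale are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort() callback: orders two strings by the locale's collating sequence. */
static int compare_by_locale(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcoll(*sa, *sb);
}

/* Sorts an array of count string pointers in locale collating order. */
void sort_by_locale(const char **strings, size_t count)
{
    assert(strings != NULL);
    qsort(strings, count, sizeof strings[0], compare_by_locale);
}
```

Using strcmp() in the callback would order strings by encoded value only. With strcoll(), the same program produces the ordering expected in whatever locale is bound at run time.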


1.4.2    Character Classification

Character classification information provides details about the type of character associated with each valid character code; that is, whether the code defines an alphabetic, uppercase, lowercase, punctuation, control, space, or other kind of character. Both character classification functions and internationalized regular expressions use this information to determine character classes.
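
As a small sketch of using classification information, the following function counts the alphabetic characters in a wide-character string. It assumes the iswalpha() function from the wctype.h header; the name count_letters is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <wctype.h>

/*
 * Counts the characters in ws that the current locale classifies as
 * alphabetic. The program does not test for ranges such as 'a'..'z';
 * the locale's classification data makes the decision.
 */
size_t count_letters(const wchar_t *ws)
{
    size_t n = 0;

    assert(ws != NULL);
    for (; *ws != L'\0'; ws++) {
        if (iswalpha((wint_t)*ws))
            n++;
    }
    return n;
}
```

Range tests on character codes fail for letters outside the range assumed by the programmer, whereas the classification functions follow the bound locale.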


1.4.3    Case Conversion

Case conversion refers to information that identifies the possible alternative case of each valid character code. Case conversion functions use this information to change characters from uppercase to lowercase or from lowercase to uppercase. Note that case is not a characteristic of all of the letters, or even of any characters, in some languages.
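
The following sketch converts a wide-character string to uppercase by using the locale's case conversion information. It assumes the towupper() function from the wctype.h header; the name uppercase_in_place is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <wctype.h>

/*
 * Replaces each character of ws with its uppercase equivalent in the
 * current locale. Characters that have no uppercase form, including
 * all characters of scripts without case, are left unchanged.
 */
void uppercase_in_place(wchar_t *ws)
{
    assert(ws != NULL);
    for (; *ws != L'\0'; ws++)
        *ws = (wchar_t)towupper((wint_t)*ws);
}
```

Arithmetic on character codes, such as subtracting a fixed offset, is valid for only some codesets and should be avoided in internationalized programs.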


1.4.4    Language Information

Language information (or langinfo database) refers to localization data that describes the format and setting of cultural data that can vary from one locale to another. This information includes the appropriate formats and characters for date and time, currency, and numeric values.
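
As a sketch of querying language information directly, the following functions retrieve a weekday name and the radix (decimal point) character for the current locale. They assume the X/Open nl_langinfo() function and the item names declared in langinfo.h; the function names weekday_name and radix_character are hypothetical.

```c
#include <assert.h>
#include <langinfo.h>

/*
 * Returns the full name of a weekday in the current locale.
 * wday uses the struct tm convention: 0 is Sunday, 6 is Saturday.
 * The returned string may be overwritten by later nl_langinfo() calls.
 */
const char *weekday_name(int wday)
{
    static const nl_item day_items[7] = {
        DAY_1, DAY_2, DAY_3, DAY_4, DAY_5, DAY_6, DAY_7
    };

    assert(wday >= 0 && wday < 7);
    return nl_langinfo(day_items[wday]);
}

/* Returns the string used as the decimal point in numeric values. */
const char *radix_character(void)
{
    return nl_langinfo(RADIXCHAR);
}
```

A program that needs to keep a returned value should copy it before calling nl_langinfo() again or changing the locale.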


1.4.5    Message Catalogs

A message catalog is a file or storage area that contains program messages, command prompts, and responses to prompts for a particular language. Motif applications also use resource files and User Interface Language (UIL) files in addition to or in place of message catalogs for text and other values that can vary from one locale to another. Chapter 3 describes the messaging system.
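
The following sketch retrieves one message from a catalog and falls back to a default string when the catalog or message is not available. It assumes the X/Open catopen(), catgets(), and catclose() functions; the function name get_message and any catalog name passed to it are hypothetical. A production program would normally open the catalog once rather than for each message.

```c
#include <assert.h>
#include <nl_types.h>
#include <stdio.h>

/*
 * Copies message msg_id of set set_id from the named catalog into buf.
 * If the catalog cannot be opened, or the message is missing, the
 * fallback text is copied instead. Returns 1 if the catalog was
 * opened and 0 otherwise.
 */
int get_message(const char *catalog, int set_id, int msg_id,
                const char *fallback, char *buf, size_t size)
{
    nl_catd catd;
    int found_catalog;

    assert(catalog != NULL && fallback != NULL && buf != NULL && size > 0);

    /* NL_CAT_LOCALE selects the catalog by the message locale category. */
    catd = catopen(catalog, NL_CAT_LOCALE);
    found_catalog = (catd != (nl_catd)-1);

    if (found_catalog) {
        /* Copy before closing; the returned text belongs to the catalog. */
        snprintf(buf, size, "%s", catgets(catd, set_id, msg_id, fallback));
        catclose(catd);
    } else {
        snprintf(buf, size, "%s", fallback);
    }
    return found_catalog;
}
```

Supplying a default string in the program keeps it usable when no translated catalog is installed for the user's locale.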


1.5    Language Announcement

Language announcement is the mechanism by which language, cultural data, and codeset requirements are set for the system as a whole, by an application, or by individual users. Language announcement is performed by setting a locale name in a set of reserved environment variables. On Digital UNIX systems, system managers can set the default values for these variables for different shell environments; refer to the System Administration book for information about setting locale defaults for shells. Users can also set locale variables on a per-process basis.

Typically, internationalized programs read locale variables at run time and use them to attach a particular instance of localization data to the programs' operational environment. However, programs can also set these variables internally when appropriate. Therefore, the binding to a particular locale need not be general for all parts of a program. Within one execution cycle, different parts of the program can request different localizations.
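
The following sketch shows how a program might bind to the locale announced in the environment and later rebind a single category. It assumes the ISO C setlocale() function; the names announce_locale and set_category_locale are hypothetical.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/*
 * Binds every locale category to the values announced through the
 * locale environment variables. If those values do not name a usable
 * locale, the program falls back to the "C" locale.
 * Returns 1 if the environment's locale was bound and 0 otherwise.
 */
int announce_locale(void)
{
    if (setlocale(LC_ALL, "") != NULL)
        return 1;
    setlocale(LC_ALL, "C");
    return 0;
}

/*
 * Rebinds one category (for example, LC_COLLATE) to the named locale
 * and leaves the other categories unchanged.
 * Returns 1 on success and 0 if the locale is not available.
 */
int set_category_locale(int category, const char *locale_name)
{
    assert(locale_name != NULL);
    return setlocale(category, locale_name) != NULL;
}
```

A program usually calls a function like announce_locale() once at startup. Calls like set_category_locale() allow one part of the program to request a different localization for a specific purpose, such as sorting.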


1.6    Terms and Definitions

This section defines terms used extensively in this guide. Less common terms are defined when they first appear.


1.6.1    Characters and Strings

A character is a sequence of one or more bytes that represent a single graphic symbol or control code. Do not confuse the term character with the C programming language data type char, which represents an object large enough to store any member of the basic execution character set and which is usually mapped as an 8-bit value. Unlike the char data type in C, a character as used in this guide can be represented by a multibyte or single-byte value. The expression multibyte character is synonymous with the term character; that is, both refer to character values of any length, including single-byte values.

A character string or string is a contiguous sequence of bytes terminated by and including the null byte. A string is an array of type char in the C programming language. The null byte is a value with all bits set to zero (0).

A wide character is an integral type that is large enough to hold any member of the extended execution character set. In program terms, a wide character is an object of type wchar_t, which is defined in the header files /usr/include/stddef.h (for conformance to the X/Open UNIX standard) and /usr/include/stdlib.h (for conformance to the ANSI C standard). The file locations where this data type is defined are determined by standards organizations; however, the definition itself is implementation specific. For example, implementations that support only single-byte codesets (not the case for Digital UNIX) might define wchar_t as a byte value.

A wide-character string is a contiguous sequence of wide characters terminated by and including the null wide character. A wide-character string is an array of type wchar_t. The null wide character is a wchar_t value with all bits set to zero (0).
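
To relate the two string forms defined above, the following sketch converts a character string to a newly allocated wide-character string. It assumes the ISO C mbstowcs() function; the name to_wide_string is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Converts the character string s to a wide-character string by
 * using the codeset of the current locale. Returns a pointer that
 * the caller must release with free(), or NULL if memory cannot be
 * allocated or s contains an invalid character.
 */
wchar_t *to_wide_string(const char *s)
{
    size_t capacity;
    wchar_t *ws;

    assert(s != NULL);

    /* A string never holds more characters than bytes, so one
       element per byte plus the terminator is always sufficient. */
    capacity = strlen(s) + 1;
    ws = malloc(capacity * sizeof *ws);
    if (ws == NULL)
        return NULL;

    if (mbstowcs(ws, s, capacity) == (size_t)-1) {
        free(ws);                      /* invalid byte sequence in s */
        return NULL;
    }
    return ws;
}
```

After conversion, each element of the result holds exactly one character, so the program can index and step through the text without examining byte lengths.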

An empty string is a character string whose first element is the null byte. Similarly, an empty wide-character string is a wide-character string whose first element is the null wide character.


1.6.2    Portable Character Set

The Portable Character Set (PCS) is supported in both compile-time (source) and run-time (executable) environments. The PCS contains:

The Portable Character Set as defined by X/Open is similar to the basic source and basic execution character sets defined in ISO/IEC 9899:1990, except that the X/Open version also includes the dollar sign ($), commercial at sign (@), and grave accent (`) characters.

Some locales (for example, ISO 646 variants) may make substitutions for one or more of the preceding characters. In such cases, the substituted character has the same syntactic meaning as the character it replaces in the Portable Character Set. An example of a character substitution might be the British pound sign (£) for the number sign (#) that is the default.