
7    Creating Locales

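This chapter explains how to develop a locale, which provides information appropriate for a particular combination of language, territory, and codeset. You use the localedef command to create a locale from a character map (charmap) source file, a locale definition source file, and, for locales that need them, a methods file. The sections of this chapter describe each of these files.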



7.1    Creating a Character Map Source File for a Locale

A charmap file defines symbols for character binary encodings. The localedef command uses this file to map character symbols in a locale source file to the character encodings. Example 7-1 shows a fragment of the source file, ISO8859-1.cmap, used for the de_DE.ISO8859-1@example locale being developed in this chapter. Appendix B contains this file in its entirety.


Example 7-1: The charmap File for a Sample Locale
# Map file providing symbols for characters whose binary   (1)
# encodings are specified in the ISO Latin-1 codeset.   (1)
<code_set_name> "ISO8859-1"    (2)
<mb_cur_max> 1    (2)
<mb_cur_min> 1    (2)
<escape_char> \   (2)
<comment_char> #  (2)

CHARMAP    (3)
<NU>     \d000    (4)
<SH>     \d001
<SX>     \d002
<EX>     \d003
<ET>     \d004
<EQ>     \d005
<AK>     \d006
<BL>     \d007
<BS>     \d008

.
.
.
<0>      \d048    (4)
<1>      \d049
<2>      \d050
<3>      \d051
.
.
.
<A>      \d065    (4)
<B>      \d066
<C>      \d067
<D>      \d068
<E>      \d069
.
.
.
<X>      \d088    (4)
<Y>      \d089
<Z>      \d090
<<(>     \d091
<//>     \d092
<)\>>    \d093
<'\>>    \d094
<_>      \d095
<'!>     \d096
<a>      \d097
<b>      \d098
<c>      \d099
<d>      \d100
<e>      \d101
.
.
.
<x>      \d120    (4)
<y>      \d121
<z>      \d122
<(!>     \d123
<!!>     \d124
<!)>     \d125
<'?>     \d126
<DT>     \d127
.
.
.
<O:>     \d214    (4)
<U:>     \d220
.
.
.
<ss>     \d223    (4)
.
.
.
<o:>     \d246    (4)
.
.
.
<u:>     \d252    (4)
.
.
.
<backspace>          \d008    (5)
<tab>                \d009
<newline>            \d010
<vertical-tab>       \d011
<form-feed>          \d012
<carriage-return>    \d013
.
.
.
<space>              \d032    (5)
<exclamation-mark>   \d033
<quotation-mark>     \d034
<number-sign>        \d035
<dollar-sign>        \d036
END CHARMAP    (6)

  1. Comment line

    By default, the comment character is the number sign (#). You can override this default with a <comment_char> definition (see 2).

  2. Keyword declarations

    This example provides entries for all valid declarations and specifies default values for all but <code_set_name>. Usually, you specify a declaration only when you want to override its default value. In this example, the declarations for <comment_char> and <escape_char> specify the default values for the comment character and escape character, respectively. The value for <mb_cur_max>, the maximum length (in bytes) of a character, is 1 for this particular locale. The value for <mb_cur_min>, the minimum length (in bytes) of a character, must be 1 in all locales. (All locales include characters in the Portable Character Set, which defines single-byte characters.)

    The <code_set_name> value is the value returned by the nl_langinfo(CODESET) call made by applications that bind to the locale at run time.

  3. Header marking start of character maps

  4. Symbol-to-coding maps for characters

    Each character map consists of a symbolic name and encoding. The name and encoding are separated by one or more spaces.

    A symbolic name begins with the left angle bracket (<) and ends with the right angle bracket (>). The characters between the angle brackets can be any characters from the Portable Character Set, except for control and space characters. If the name includes more than one right angle bracket (>), all but the last one must be preceded by the value of <escape_char>. A symbolic name cannot exceed 128 bytes in length.

    An encoding can be one or more decimal, octal, or hexadecimal constants. (Multiple constants apply to multibyte encodings.) The constants have the following formats:

  5. Additional maps for characters

    You can create multiple symbolic names for the same character (encoding). In this source file, for example, the backspace character (value \d008) has two symbolic names, <BS> and <backspace>. When more than one symbolic name exists for a character, you can specify any of them in locale definition source files to refer to the character.

  6. Trailer marking end of character maps
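The <code_set_name> declaration described in item 2 can be observed from an application. The following is a minimal sketch, assuming a POSIX system that provides setlocale() and nl_langinfo(); the helper name codeset_of is hypothetical and is not part of any system library.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper: bind the LC_CTYPE category to the named locale and
 * return the codeset name recorded in that locale, or NULL if the locale
 * cannot be selected on this system.
 */
const char *codeset_of(const char *locale_name)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return NULL;
    return nl_langinfo(CODESET);
}
```

If the sample locale built from Example 7-1 were installed, a call such as codeset_of("de_DE.ISO8859-1@example") would be expected to return the string given for <code_set_name>. The string returned for other locales, including the C locale, depends on the system.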

The source files for codesets with multibyte characters have more complex character maps. Example 7-2 shows a subset of character map entries from a source file for the Japanese SJIS codeset. This source file specifies entries from several character sets that must be supported within the same codeset.


Example 7-2: Fragment from a charmap File for a Multibyte Codeset
# SJIS charmap
#
<code_set_name> "SJIS"   (1)
<mb_cur_min>    1    (2)
<mb_cur_max>    2    (3)
CHARMAP
#
# CS0: ASCII
#

.
.
.
<commercial-at>      \x40    (4)
<A>                  \x41    (4)
<B>                  \x42    (4)
.
.
.
#
# CS1: JIS X0208-1983 for ShiftJIS.
#
<zenkaku-space>      \x81\x40    (5)
<j0101>...<j0163>    \x81\x40    (5)
<j0164>...<j0194>    \x81\x80    (5)
.
.
.
#
# UDC Area in JIS X0208 plane
#
<u8501>...<u8563>    \xeb\x40    (6)
<u8564>...<u8594>    \xeb\x80    (6)
<u8601>...<u8663>    \xeb\x9f    (6)
.
.
.
#
# CS2: JIS X0201 (so-called Hankaku-Kana)
#
<kana-fullstop>      \xa1    (7)
.
.
.
<kana-conjunctive>   \xa5    (7)
<kana-WO>            \xa6    (7)
<kana-a>             \xa7    (7)
.
.
.
END CHARMAP

  1. Codeset name

  2. Minimum number of bytes per character

    This value must be 1.

  3. Maximum number of bytes per character

    In SJIS, the largest multibyte character is 2 bytes in length.

  4. Symbols and encodings for ASCII characters

  5. Symbols and encodings for SJIS characters

    Note how character symbols are specified as a range and how two hexadecimal values determine the encoding for a 2-byte character.

    When symbols are specified as a range of symbol values, the specified character encoding applies to the first symbol in the range. The localedef command automatically increments both the symbol value and the encoding value to create symbols and encodings for all characters in the range.

  6. Maps for user-defined characters within the SJIS codeset

    These maps establish ranges of encodings for which users can later define characters.

  7. Maps for the single-byte characters of the Hankaku-Kana character set
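The range expansion described in item 5 can be sketched in C as follows. This is an illustration only, not the localedef implementation: the function names are hypothetical, and the sketch assumes that the multibyte encoding is incremented as a single big-endian number, which is sufficient for the ranges shown in Example 7-2.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: build the symbol for one member of a range such as
 * <j0101>...<j0163>. prefix is the leading text ("j") and width is the
 * number of digits (4). Returns 0 on success, -1 if buf is too small.
 */
int range_symbol(char *buf, size_t buf_len, const char *prefix,
                 int width, int number)
{
    int n = snprintf(buf, buf_len, "<%s%0*d>", prefix, width, number);
    return (n < 0 || (size_t)n >= buf_len) ? -1 : 0;
}

/*
 * Hypothetical helper: compute the encoding of the symbol that lies
 * `offset` positions after the first symbol in a range.
 * ASSUMPTION: the bytes are treated as one big-endian integer, so a carry
 * out of the last byte propagates into the preceding byte.
 * Returns 0 on success, -1 if the result does not fit in len bytes.
 */
int range_encoding(const unsigned char *first, size_t len,
                   unsigned int offset, unsigned char *out)
{
    unsigned int carry = offset;
    size_t i;

    for (i = len; i > 0; i--) {
        unsigned int sum = first[i - 1] + carry;
        out[i - 1] = (unsigned char)(sum & 0xffu);
        carry = sum >> 8;
    }
    return (carry != 0) ? -1 : 0;
}
```

For the entry <j0101>...<j0163> \x81\x40, the symbol <j0163> lies 62 positions after <j0101>, so under this assumption its encoding is \x81\x7e.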

Refer to the charmap(4) reference page for a complete list of rules that apply to character map source files.


Note

The symbolic names for characters in character map source files are in the process of becoming standardized. A future revision of the X/Open UNIX standard will likely specify both long and short symbolic names for characters.

The symbolic names for characters shown in this example are not necessarily the names being proposed for adoption by any standards group.




7.2    Creating Locale Definition Source Files

A locale definition source file defines data that is specific to a particular language and territory. The source file is organized into sections, one for each category of locale data being defined. Example 7-3 shows the structure of a locale definition source file in pseudocode. The sections for locale categories are discussed in more detail following the example.


Example 7-3: Structure of Locale Source Definition File
# comment-line    (1)

comment_char      <char_symbol1>   (2)
escape_char       <char_symbol2>   (3)

CATEGORY_NAME    (4)

category_definition-statement   (5)
category_definition-statement   (5)

.
.
.
END CATEGORY_NAME (6)
.
.
.
(7)

  1. Comment line

    The number sign (#) is the default comment character. You can specify comments as entire lines by entering the comment character in the first column of the line. You cannot specify comments on the same lines as definition statements in locale source files. In this respect, locale source files differ from character map source files.

  2. Redefinition of comment character

    You can override the default comment character with an entry line that begins with the comment_char keyword, followed by the symbol for the desired character. The character symbol is defined in the character map (charmap) source file for the locale.

  3. Redefinition of escape character

    The escape character, by default the backslash (\), is used in decimal, hexadecimal, and octal constants and to indicate when definition statements are continued to the next line of the source file. You can override the default escape character with an entry line that begins with the escape_char keyword, followed by one or more blank characters, then the symbol for the desired character. The character symbol is defined in the character map source file for the locale.

  4. Header for locale category section

    Section headers correspond to category names, which are LC_CTYPE, LC_COLLATE, LC_NUMERIC, LC_MONETARY, LC_MESSAGES, and LC_TIME.

  5. Definition statement for the category

    The format of these statements varies from one category to the next. In general, a statement begins with a keyword, followed by one or more spaces or tabs, then the definition itself.

  6. Trailer for locale category section

    Section trailers start with the keyword END, followed by the category name.

  7. You can include sections for all locale categories or only a subset of categories. If you omit a section for a locale category from the source file, the definition for the omitted category is the same as defined for the POSIX, or C, locale.



7.2.1    Defining the LC_CTYPE Locale Category

The LC_CTYPE section defines character classes and character attributes used in operations such as case conversion. Example 7-4 shows the definition for this section.


Example 7-4: LC_CTYPE Category Definition
LC_CTYPE   (1)

upper <A>;<A:>;<B>;<C>;<D>;<E>;<F>;<G>;<H>;<I>;<J>;<K>;<L>;<M>;<N>;<O>;\
      <O:>;<P>;<Q>;<R>;<S>;<T>;<U>;<U:>;<V>;<W>;<X>;<Y>;<Z>  (2)
lower <a>;<a:>;<b>;<c>;<d>;<e>;<f>;<g>;<h>;<i>;<j>;<k>;<l>;<m>;<n>;<o>;\
        <o:>;<p>;<q>;<r>;<s>;<ss>;<t>;<u>;<u:>;<v>;<w>;<x>;<y>;<z>  (2)

alpha <A>;<A:>;<B>;<C>;<D>;<E>;<F>;<G>;<H>;<I>;<J>;<K>;<L>;<M>;<N>;<O>;\
      <O:>;<P>;<Q>;<R>;<S>;<T>;<U>;<U:>;<V>;<W>;<X>;<Y>;<Z>;<a>;<a:>;<b>;\
      <c>;<d>;<e>;<f>;<g>;<h>;<i>;<j>;<k>;<l>;<m>;<n>;<o>;<o:>;<p>;<q>;<r>;\
      <s>;<ss>;<t>;<u>;<u:>;<v>;<w>;<x>;<y>;<z>   (2)
space <tab>;<newline>;<vertical-tab>;<form-feed>;<carriage-return>;<space>;\
      <NS>  (2)
cntrl <NUL>;...;<IS1>;<DEL>;...;<AC>   (2)

.
.
.
toupper (<a>,<A>);(<a:>,<A:>);(<b>,<B>);(<c>,<C>);(<d>,<D>);(<e>,<E>);\
        (<f>,<F>);(<g>,<G>);(<h>,<H>);(<i>,<I>);(<j>,<J>);(<k>,<K>);\
        (<l>,<L>);(<m>,<M>);(<n>,<N>);(<o>,<O>);(<o:>,<O:>);(<p>,<P>);\
        (<q>,<Q>);(<r>,<R>);(<s>,<S>);(<t>,<T>);(<u>,<U>);(<u:>,<U:>);\
        (<v>,<V>);(<w>,<W>);(<x>,<X>);(<y>,<Y>);(<z>,<Z>)   (3)
.
.
.
END LC_CTYPE (4)

  1. Section header

  2. Definition of character class

    These definitions start with a keyword that stands for the character class, followed by one or more blank characters, then a list of symbols for all characters in that class. You can substitute the character's encoding for its symbol; however, specifying characters by their encodings diminishes the readability of the locale source file and makes it impossible to use the file with more than one codeset.

    As shown in the definition of the cntrl class, you can specify a horizontal ellipsis (...) to represent a range of characters. In the string <NUL>;...;<IS1>, for example, the ellipsis represents all characters whose encodings are between the character whose symbol is <NUL> and the character whose symbol is <IS1>. The symbols and their encodings are specified in the charmap file for the locale.

    The standard character classes are represented by the following keywords:

    • upper (uppercase letter characters)

    • lower (lowercase letter characters)

    • alpha (all letter characters)

    • space (white-space characters)

    • cntrl (control characters)

    • punct (punctuation characters)

    • digit (numeric digits)

    • xdigit (hexadecimal digits)

    • blank (blank characters)

    • graph (printable characters, excluding the space character)

    • print (printable characters, including the space character)

    From the application standpoint, there is also the class alnum. This class is not defined in a locale; it is by definition a combination of characters in the alpha and digit classes.

  3. Definitions of case conversion for letter characters

    These definitions, which begin with the keywords toupper and tolower, list symbols in pairs rather than individually. In the toupper definition shown here, the first symbol in the pair is the symbol for a lowercase letter and the second symbol is the symbol for that letter's uppercase equivalent. This definition determines what a letter is converted to when functions perform case conversion on text data.

  4. Section trailer

The preceding example does not completely illustrate all the options you can use when defining the LC_CTYPE category. You can:

Applications can use the wctype() and iswctype() functions to determine and test all character classes (including user-defined ones). Applications can use the class-specific functions iswalpha(), iswpunct(), and so forth to test the standard character classes.
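As a minimal sketch of how an application might use these functions, the following C fragment tests a wide character against a class selected by name and converts a character to uppercase. The helper names in_class and upper_of are hypothetical; wctype(), iswctype(), and towupper() are the standard interfaces. The results depend on the LC_CTYPE category of the locale that is in effect when the functions are called.

```c
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: returns 1 if wc belongs to the named character class
 * in the current locale, 0 if it does not, and -1 if the current locale
 * does not define a class with that name.
 */
int in_class(wint_t wc, const char *class_name)
{
    wctype_t desc = wctype(class_name);

    if (desc == (wctype_t)0)
        return -1;
    return iswctype(wc, desc) != 0;
}

/*
 * Hypothetical helper: applies the toupper mapping of the current locale;
 * characters without an uppercase equivalent are returned unchanged.
 */
wint_t upper_of(wint_t wc)
{
    return towupper(wc);
}
```

Because the class is selected by name, the same in_class() call works for a user-defined class, provided that the locale in effect defines it.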

Refer to the locale(4) reference page for additional rules and restrictions that apply to the LC_CTYPE category definition.



7.2.2    Defining the LC_COLLATE Locale Category

The LC_COLLATE section specifies how characters and strings are collated. Example 7-5 shows part of an LC_COLLATE section.


Example 7-5: LC_COLLATE Category Definition
LC_COLLATE    (1)
order_start forward;forward;backward    (2)

.
.
.
<o> <o>;<o>;<o> (3)
.
.
.
<o:> <o>;<o>;<o:> (3)
.
.
.
<O> <o>;<O>;<O> (3)
.
.
.
<O:> <o>;<O>;<O:> (3)
.
.
.
<Z> <z>;<Z>;<Z> (3)
.
.
.
UNDEFINED    IGNORE;IGNORE;IGNORE    (4)
order_end    (5)
END LC_COLLATE    (6)

  1. Section header

  2. An order_start keyword that marks the beginning of a section with statements that assign collating weights to elements

    Following the order_start keyword on the same line are sort directives, separated by semicolons (;), one for each comparison level. Sort directives can include the following keywords:

    • forward, which specifies that the comparison operation proceeds from the start of the string towards the end of the string

    • backward, which specifies that the comparison operation proceeds from the end of the string towards the start of the string

    • position, which specifies that the comparison operation considers the relative position of characters in the string that are not subject to the collating weight IGNORE (in other words, ensures that nonignored characters that are the shortest distance from the start (forward,position) or end (backward,position) of the string collate first)

      When a sort directive includes two keywords, the position keyword combined with either forward or backward, the two keywords are separated by a comma (,). The position keyword by itself is equivalent to the directive forward,position.

    The number of sort directives corresponds to the number of weights each collating element is assigned in subsequent statements.

    Each sort directive and its associated set of weights specify information for one pass, or level, of string comparison. The first directive applies when the string comparison operation applies the primary weight, the second when the string comparison operation applies the secondary weight, and so on. The number of levels required to collate strings correctly depends on language and cultural requirements and therefore varies from one locale to another. There is also a level number maximum, associated with the COLL_WEIGHTS_MAX setting in the limits.h and sys/localedef.h files. On Digital UNIX systems, you are limited to six collation levels (sort directives).

    The backward directive is used for many languages to ensure that accented characters sort after unaccented characters only if the compared strings are otherwise equivalent.

    The position directive is frequently used to handle characters, such as the hyphen (-) in Western European languages, whose significance can be relative to word position. For example, assume you want the word "o-ring" to collate in a word list before the word "or-ing", but do not want the hyphen to be considered until after strings are sorted by letters alone. You would need two sort directives and associated sets of weight specifiers to implement this order. For the first comparison operation, you specify forward as the sort directive, letters as the first weights for all letter characters, and IGNORE as the weight for the hyphen character. For the second, or a later, comparison operation, you specify forward,position as the sort directive, IGNORE as the weight for all letter characters, and the hyphen as the weight for the hyphen character.

    If you do not specify a sort directive, the default is forward.

  3. Collation order statements for elements

    These statements specify a character symbol, followed by one or more blank characters (spaces or tabs), then the symbols for characters that have the same weight at each stage of the sort. For example, the lowercase character o, lowercase character o umlaut, uppercase character O, and uppercase character O umlaut, whose symbols are <o>, <o:>, <O>, and <O:>, respectively, are grouped together (have the same weight) at the first sort level. At the secondary sort level, lowercase o is grouped with lowercase o umlaut and uppercase O is grouped with uppercase O umlaut. The four characters have distinct weights at the tertiary sort level.

  4. Collation order statement for undefined characters

    The UNDEFINED keyword begins a collation order statement to be applied to all characters that are defined in the locale's charmap file but not specified in other collation order statements. This statement indicates that such characters are to be ignored during collation for all weight comparisons.

    You should include a collation order statement that begins with the UNDEFINED keyword. If this statement is absent, the localedef command includes undefined characters at the end of the collating order and issues a warning.

    Furthermore, if you place an UNDEFINED statement as the last collation order statement, the localedef command can sometimes compress all undefined characters into one entry. This action can reduce the size of the locale.

  5. Trailer to indicate the end of collation order statements

  6. Trailer to indicate the end of the LC_COLLATE section
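The way sort directives and per-level weights interact can be illustrated with a small model. The following sketch is not how the C library implements collation; it assumes hypothetical integer weights, three levels, and a weight of 0 standing for IGNORE, and it does not model the position keyword. Each level is compared in its own direction, and a later level is consulted only when all earlier levels compare equal.

```c
#include <stddef.h>

#define LEVELS 3
#define IGNORE 0                 /* weight meaning "skip at this level" */

enum direction { FORWARD, BACKWARD };

/* One collating element with a hypothetical weight for each level. */
typedef struct {
    int weight[LEVELS];
} element_t;

/* Weight of the i-th element, counted from the start or from the end. */
static int weight_at(const element_t *s, size_t n, size_t i,
                     int level, enum direction dir)
{
    size_t idx = (dir == FORWARD) ? i : n - 1 - i;
    return s[idx].weight[level];
}

/* Compare two element strings at one level; returns -1, 0, or 1. */
static int compare_level(const element_t *a, size_t na,
                         const element_t *b, size_t nb,
                         int level, enum direction dir)
{
    size_t ia = 0, ib = 0;

    for (;;) {
        /* Elements whose weight is IGNORE take no part at this level. */
        while (ia < na && weight_at(a, na, ia, level, dir) == IGNORE)
            ia++;
        while (ib < nb && weight_at(b, nb, ib, level, dir) == IGNORE)
            ib++;
        if (ia == na || ib == nb)
            return (ia == na) ? ((ib == nb) ? 0 : -1) : 1;

        int wa = weight_at(a, na, ia, level, dir);
        int wb = weight_at(b, nb, ib, level, dir);
        if (wa != wb)
            return (wa < wb) ? -1 : 1;
        ia++;
        ib++;
    }
}

/* Apply the levels in order; the first level that differs decides. */
int collate(const element_t *a, size_t na, const element_t *b, size_t nb,
            const enum direction dirs[LEVELS])
{
    for (int level = 0; level < LEVELS; level++) {
        int r = compare_level(a, na, b, nb, level, dirs[level]);
        if (r != 0)
            return r;
    }
    return 0;
}
```

With weights modeled on Example 7-5, for instance {1,1,1} for <o> and {1,1,2} for <o:>, the element strings "<o:><o>" and "<o><o:>" are equal at the first two levels. The third level, compared backward, then places "<o:><o>" first because its last element has the smaller weight.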

The preceding example shows only a few of the options that you can specify when defining the LC_COLLATE category. You can also use:

Refer to the locale(4) reference page for more detailed information on the LC_COLLATE category definition.
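Applications do not read the LC_COLLATE weights directly; they compare strings through functions such as strcoll(), which applies the collation rules of the current locale. The following minimal sketch sorts an array of strings in that order. The names by_collation and sort_words are hypothetical; qsort() and strcoll() are standard C library functions.

```c
#include <stdlib.h>
#include <string.h>

/* qsort() comparison callback that defers to the locale's collation. */
static int by_collation(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;

    return strcoll(*sa, *sb);
}

/* Hypothetical helper: sort words in the current locale's collating order. */
void sort_words(const char **words, size_t count)
{
    qsort(words, count, sizeof words[0], by_collation);
}
```

The application is assumed to have called setlocale() for the LC_COLLATE category (or LC_ALL) beforehand. If it has not, the C locale is in effect and strings sort by character encoding.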



7.2.3    Defining the LC_MESSAGES Locale Category

The LC_MESSAGES section defines strings that are valid for affirmative and negative responses from users. Example 7-6 shows an LC_MESSAGES section.


Example 7-6: LC_MESSAGES Category Definition
LC_MESSAGES    (1)
yesexpr         "^[<j><J>][[:alpha:]]*"    (2)
noexpr          "^[<n><N>][[:alpha:]]*"    (3)
yesstr          "<j>"   (4)
nostr           "<n>"   (5)
END LC_MESSAGES    (6)

  1. Section header

  2. Definition of an expression for a valid "yes" response

    This entry consists of the yesexpr keyword, followed by one or more spaces or tabs, and an extended regular expression that is delimited by double quotation marks.

    In German, an affirmative response is "ja". The expression specified for yesexpr defines a valid response as being j or J or a string that begins with j or J and is followed by any number of letter characters. Note that the regular expression for yesexpr specifies individual characters by their symbols as defined in the locale's charmap file.

  3. Definition of an expression for a valid "no" response

    This entry consists of the noexpr keyword, followed by one or more spaces or tabs, and an extended regular expression that is delimited by double quotation marks.

    In German, "nein" is the negative response. The definition of noexpr is similar to the one for yesexpr, except that the only or initial character of the user's response must be the letter n or N.

  4. Definition of a string for a valid "yes" response

    This entry consists of the yesstr keyword, followed by one or more spaces or tabs, and a string that is delimited by double quotation marks.

    The yesstr entry is marked for removal from the X/Open UNIX standard; however, some applications and systems software might still use yesstr rather than yesexpr. To ensure that your locale works correctly with such software, it is a good idea to define yesstr in your locale.

  5. Definition of a string for a valid "no" response

    This entry consists of the nostr keyword, followed by one or more spaces or tabs, and a string that is delimited by double quotation marks.

    The nostr entry is marked for removal from the X/Open UNIX standard; however, some applications and systems software might still use nostr rather than noexpr. To ensure that your locale works correctly with such software, it is a good idea to define nostr in your locale.

  6. Section trailer

As an alternative to specifying symbol definitions, you can use the copy statement between the section header and trailer to duplicate an existing locale's definition of LC_MESSAGES. The copy statement represents a complete definition of the category and cannot be used along with explicit symbol definitions.
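An application can apply the yesexpr definition by retrieving it with nl_langinfo(YESEXPR) and matching the user's response against it as an extended regular expression. The following is a minimal sketch, assuming a POSIX system with <langinfo.h> and <regex.h>; the helper name is_affirmative is hypothetical.

```c
#include <langinfo.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if response matches the current locale's
 * yesexpr, 0 if it does not, and -1 if the expression cannot be compiled.
 */
int is_affirmative(const char *response)
{
    regex_t re;
    int matched;

    if (regcomp(&re, nl_langinfo(YESEXPR), REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    matched = (regexec(&re, response, 0, NULL, 0) == 0);
    regfree(&re);
    return matched;
}
```

The expression that is returned depends on the LC_MESSAGES category of the locale in effect. With the sample German locale selected through setlocale(), a response such as "ja" would be expected to match; in the C locale, the expression typically accepts responses that begin with y or Y.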



7.2.4    Defining the LC_MONETARY Locale Category

The LC_MONETARY section of the locale source file defines the rules and symbols used to format monetary values. Application developers use the localeconv() and nl_langinfo() functions to determine the information defined in this section and apply formatting rules through the strfmon() function. Example 7-7 shows an LC_MONETARY section.


Example 7-7: LC_MONETARY Category Definition
LC_MONETARY    (1)
int_curr_symbol                 "<D><M>"  (2)
currency_symbol                 "<D><M>"  (2)
mon_decimal_point               "<,>"     (2)
mon_thousands_sep               "<.>"     (2)
mon_grouping                    3         (2)
positive_sign                   ""        (2)
negative_sign                   "<->"     (2)

.
.
.
END LC_MONETARY (3)

  1. Section header

  2. Symbol definitions

    The entries in the example specify the following:

    • The international and local currency symbols are the string DM (for Deutsche Mark).

    • The decimal point is the comma (,).

    • The separator grouping digits to the left of the decimal point is the period (.).

    • The number of digits in groups separated by periods is 3.

    • The positive sign is null.

    • The negative sign is the minus (-) character.

  3. Section trailer

The following list describes all the symbol names you can define in the LC_MONETARY section:

As an alternative to specifying symbol definitions, you can use the copy statement between the section header and trailer to duplicate an existing locale's definition of LC_MONETARY. The copy statement represents a complete definition of the category and cannot be used along with explicit symbol definitions.

Refer to the locale(4) reference page for complete information about specifying LC_MONETARY symbol definitions.
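The following minimal sketch shows how an application might use the LC_MONETARY definitions, assuming a system that provides strfmon() in <monetary.h>. The helper names format_amount and local_currency are hypothetical; the %n conversion requests the locale's national currency format.

```c
#include <locale.h>
#include <monetary.h>
#include <sys/types.h>

/*
 * Hypothetical helper: format amount with the national currency format of
 * the current locale. Returns the number of bytes written to buf, or -1 if
 * the result does not fit in len bytes.
 */
ssize_t format_amount(char *buf, size_t len, double amount)
{
    return strfmon(buf, len, "%n", amount);
}

/*
 * Hypothetical helper: return the currency_symbol value of the current
 * locale (an empty string in the C locale).
 */
const char *local_currency(void)
{
    return localeconv()->currency_symbol;
}
```

The output of format_amount() depends entirely on the locale in effect, so an application is expected to call setlocale() for the LC_MONETARY category (or LC_ALL) before formatting values.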



7.2.5    Defining the LC_NUMERIC Locale Category

The LC_NUMERIC section of the locale source file defines the rules and symbols used to format numeric data. You can use the localeconv() and nl_langinfo() functions to access this formatting information. Example 7-8 shows this section.


Example 7-8: LC_NUMERIC Category Definition
LC_NUMERIC   (1)
decimal_point                   "<,>"   (2)
thousands_sep                   "<.>"   (3)
grouping                        3    (4)
END LC_NUMERIC   (5)

  1. Category header

  2. Definition of radix character (decimal point)

  3. Definition of character used to separate groups of digits to the left of the radix character

  4. The size of each group of digits to the left of the radix character

  5. Category trailer

The preceding example shows all of the symbols you can define in the LC_NUMERIC section. In place of any symbol definitions, you can specify a copy statement between the section header and trailer to include this section from another locale.

Refer to the locale(4) reference page for detailed rules about symbol definitions.
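To illustrate what the thousands_sep and grouping definitions mean for formatted output, the following sketch inserts a separator between fixed-size groups of digits. The function group_digits is hypothetical and takes the separator and group size as arguments; an application could obtain the corresponding values for the current locale from localeconv(). The sketch handles only a single, repeated group size.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write value to out with sep inserted between groups
 * of `group` digits, counted from the right. Returns the length of the
 * result, or -1 if group is not positive or out is too small.
 */
int group_digits(unsigned long value, const char *sep, int group,
                 char *out, size_t out_len)
{
    char digits[32];
    int ndigits = snprintf(digits, sizeof digits, "%lu", value);
    size_t sep_len = strlen(sep);
    size_t pos = 0;
    int i;

    if (ndigits < 0 || group <= 0)
        return -1;
    for (i = 0; i < ndigits; i++) {
        int remaining = ndigits - i;

        /* A separator precedes every digit that starts a full group. */
        if (i > 0 && remaining % group == 0) {
            if (pos + sep_len >= out_len)
                return -1;
            memcpy(out + pos, sep, sep_len);
            pos += sep_len;
        }
        if (pos + 1 >= out_len)
            return -1;
        out[pos++] = digits[i];
    }
    out[pos] = '\0';
    return (int)pos;
}
```

With the values from Example 7-8, a period as the separator and a group size of 3, the value 1234567 is written as 1.234.567.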



7.2.6    Defining the LC_TIME Locale Category

The LC_TIME section defines the interpretation of field descriptors supported by the date command. This category section also affects the behavior of the strftime(), wcsftime(), strptime(), and nl_langinfo() functions. Example 7-9 shows some of the symbols defined for the sample German locale.


Example 7-9: LC_TIME Category Definition
LC_TIME    (1)

abday    "<S><o>";"<M><o>";"<D><i>";"<M><i>";"<D><o>";\
         "<F><r>";"<S><a>"    (2)

day      "<S><o><n><n><t><a><g>";"<M><o><n><t><a><g>";\
         "<D><i><e><n><s><t><a><g>";\
         "<M><i><t><t><w><o><c><h>";\
         "<D><o><n><n><e><r><s><t><a><g>";\
         "<F><r><e><i><t><a><g>";"<S><a><m><s><t><a><g>"    (3)

abmon    "<J><a><n>";"<F><e><b>";"<M><a:><r>";\
         "<A><p><r>";"<M><a><i>";"<J><u><n>";\
         "<J><u><l>";"<A><u><g>";"<S><e><p>";\
         "<O><k><t>";"<N><o><v>";"<D><e><z>"    (4)

mon      "<J><a><n><u><a><r>";"<F><e><b><r><u><a><r>";\
         "<M><a:><r><z>";"<A><p><r><i><l>";"<M><a><i>";\
         "<J><u><n><i>";"<J><u><l><i>";\
         "<A><u><g><u><s><t>";\
         "<S><e><p><t><e><m><b><e><r>";\
         "<O><k><t><o><b><e><r>";\
         "<N><o><v><e><m><b><e><r>";\
         "<D><e><z><e><m><b><e><r>"    (5)

d_t_fmt  "%d.%B %Y %H:%M:%S"    (6)

.
.
.
END LC_TIME (7)

  1. Section header

  2. Abbreviated names for days of the week

    Use the %a conversion specifier to include this string in formats.

  3. Full names for days of the week

    Use the %A conversion specifier to include this string in formats.

  4. Abbreviated names for months of the year

    Use the %b conversion specifier to include this string in formats.

  5. Full names for months of the year

    Use the %B conversion specifier to include this string in formats.

  6. Format for combined date and time information

    Use this format to combine field descriptors (whose first character is the percent sign (%)) and symbols for characters. You can specify characters from the Portable Character Set (PCS), such as the period (.) and ASCII space, explicitly as characters rather than implicitly through symbols; however, use symbols to specify all other characters.

    The specified format includes the field descriptors for the day of the month (%d), the full name of the month (%B), the full representation of the year (%Y), the number of hours in a 24-hour period (%H), the number of minutes (%M), and the number of seconds (%S). If the date were December 12, 1993, and the time 29 seconds after 12 o'clock in the afternoon, the format specified in this example would cause the date command to display 12.Dezember 1993 12:00:29.

  7. Section trailer

The preceding example includes only some of the symbol definitions that are standard for the LC_TIME category. The following definitions are also standard:

As is true for other category sections, you can specify a copy statement to include all LC_TIME definitions from another locale. Note that Digital UNIX supports symbols and field descriptors in addition to those described here. Refer to the locale(4) reference page for more complete information.
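The effect of the LC_TIME definitions can be seen by passing the field descriptors from the d_t_fmt entry in Example 7-9 to strftime(). The following is a minimal sketch; the helper name format_timestamp is hypothetical, and the format string is written out explicitly rather than retrieved from the locale.

```c
#include <time.h>

/*
 * Hypothetical helper: format a broken-down time with the field descriptors
 * used for d_t_fmt in Example 7-9. Returns the number of bytes written,
 * or 0 if the result does not fit in len bytes.
 */
size_t format_timestamp(char *buf, size_t len, const struct tm *when)
{
    return strftime(buf, len, "%d.%B %Y %H:%M:%S", when);
}
```

The month name substituted for %B comes from the mon definition of the locale in effect. For December 12, 1993 at 12:00:29, the C locale yields "12.December 1993 12:00:29"; with the sample German locale selected, the month name would be taken from that locale's mon entry instead.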



7.3    Building Libraries to Convert Multibyte/Wide-Character Encodings

C library routines rely on a set of special interfaces to convert characters to and from data file encoding and wide-character encoding (internal process code). By default, the C library routines use interfaces that handle only single-byte characters. However, many are defined with entry points that permit use of alternative interfaces for handling multibyte characters. The interfaces that can be tailored to a locale's codeset are called methods.

Only locales with multibyte codesets must use methods. When a locale uses methods, there are some methods that the locale must supply and other methods that it can optionally supply. A method is required when the corresponding interface is converting characters between data formats and needs codeset-specific logic to do that operation correctly. A method is optional when the corresponding interface is working with data after it has been converted to wide-character format and can apply logic that is valid for both single-byte and multibyte characters.

Methods must be available on the system in a shareable library. This library and the functions that implement each method in the library are made known to the localedef command through a methods file. When the localedef command processes the methods file along with the charmap and locale source files, the resulting locale includes pointers to all methods that are supplied with the locale, along with pointers to default implementations for optional methods that are not supplied with the locale. When you set the LANG variable to the newly built locale and run a command or application, methods are used wherever they have been enabled in the system software.
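From the application's point of view, the conversion between multibyte and wide-character formats is reached through standard functions such as mbstowcs(). The following minimal sketch shows such a call; the wrapper name to_wide is hypothetical, and whether a locale-supplied method is invoked underneath depends on the system and on the locale in effect.

```c
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper: convert the multibyte string mbs to wide characters
 * using the current locale. Returns the number of wide characters stored,
 * not counting the terminator, or (size_t)-1 if the string contains an
 * invalid sequence or does not fit, with its terminator, in out_len
 * elements.
 */
size_t to_wide(const char *mbs, wchar_t *out, size_t out_len)
{
    size_t n = mbstowcs(out, mbs, out_len);

    if (n == (size_t)-1 || n >= out_len)
        return (size_t)-1;
    return n;
}
```

In the C locale each byte of a portable character converts to one wide character. In a locale with a multibyte codeset, a character occupying several bytes in mbs would be expected to produce a single wide character in out.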



7.3.1    Required Methods

If your locale uses methods, it must supply the following methods; without these methods, it is impossible for C library functions to convert data between multibyte and wide-character formats:



7.3.1.1    Writing the __mbstopcs Method for the fgetws Function
The fgetws() function uses the __mbstopcs method to convert the bytes in the standard I/O (stdio) buffer to a wide-character string. The function that implements this method must return the number of wide characters converted by the call.

This method is similar to the one for mbstowcs (see Section 7.3.1.6) but contains additional parameters to meet the needs of fgetws(). By convention, a C source file for this method has the file name __mbstopcs_codeset.c, where codeset identifies the codeset for which the method is tailored. Example 7-10 shows the file __mbstopcs_sdeckanji.c that defines the __mbstopcs method used with the ja_JP.sdeckanji locale.


Example 7-10: The __mbstopcs_sdeckanji Method for the ja_JP.sdeckanji Locale
#include <stdlib.h>  (1)
#include <wchar.h>   (1)
#include <sys/localedef.h>   (1)

int __mbstopcs_sdeckanji(
        wchar_t *pwcs,   (2)
        size_t pwcs_len,   (3)
        const char *s,   (4)
        size_t s_len,   (5)
        int stopchr,   (6)
        char **endptr,   (7)
        int *err,   (8)
        _LC_charmap_t *handle )   (9)
{
    int cnt = 0;   (10)
    int pwcs_cnt = 0;   (10)
    int s_cnt = 0;   (10)

    *err = 0;   (11)

    while (1) {   (12)
        if (pwcs_cnt >= pwcs_len || s_cnt >= s_len) {
            *endptr = (char *)&(s[s_cnt]);
            break;
        }   (13)
        if ((cnt = __mbtopc_sdeckanji(&(pwcs[pwcs_cnt]),
            &(s[s_cnt]), (s_len - s_cnt), err)) == 0) {
            *endptr = (char *)&(s[s_cnt]);
            break;
        }   (14)
        pwcs_cnt++;   (15)
        if ( s[s_cnt] == (char) stopchr) {
            *endptr = (char *)&(s[s_cnt+1]);
            break;
        }   (16)
        s_cnt += cnt;   (17)
    }   (18)
    return (pwcs_cnt);   (19)
}

  1. Includes header files that contain constants and structures required for this method.

  2. Points, through pwcs, to a buffer that stores the wide-character string.

  3. Defines a variable, pwcs_len, to store the size of the pwcs buffer.

  4. Points, through s, to a buffer that stores the multibyte-character string being converted.

  5. Defines a variable, s_len, to store the number of bytes of data in the s buffer.

    This parameter is needed because the fgetws() function reads from the standard I/O buffer, which does not contain null-terminated strings.

  6. Defines a variable, stopchr, to contain a byte value that would force conversion to stop.

    This value, typically \n, is passed to the method on the call from the fgetws() function, which handles only one line of input per call.

  7. Defines a variable, endptr, that points to the byte following the last byte converted.

    This pointer is needed to specify the starting character in the standard I/O buffer for the next call to fgetws().

  8. Points, through err, to a variable that stores execution status for the call made by this method to the mbtopc method.

  9. Points, through handle, to a structure that points to the methods that parse character maps for this locale.

    The localedef command creates and stores values in the _LC_charmap_t structure.

  10. Initializes variables that indicate the number of bytes that a character uses in multibyte format (supplied by the mbtopc method) and the byte or character position in buffers that the fgetws() function uses.

  11. Sets err to zero (0) to indicate success.

  12. Starts the while loop that converts the multibyte string.

  13. Sets endptr and breaks out of the loop when there is either no more space in the buffer that stores wide-character data or no more data in the buffer that stores multibyte data.

  14. Calls the mbtopc method to convert a character from multibyte format to wide-character format; breaks out of the loop and sets endptr to the first byte of the character that could not be converted if the mbtopc method fails to convert a character and returns an error.

    The err variable contains the return status of the call to the mbtopc method:

    • 0 indicates success.

    • -1 indicates an invalid character.

    • A value greater than 0 indicates that too few bytes remain in the multibyte-character buffer to form a valid character.

      In this case, the return is the number of bytes required to form a valid character. The fgetws() function can then refill the buffer and try again.

  15. Increments the character position in the buffer that stores the wide-character data.

  16. Sets endptr to the character following the character stored in stopchr if the stopchr character is encountered in the multibyte data.

  17. Increments the byte position in the buffer that contains multibyte data.

  18. Ends the while loop.

  19. Returns the number of characters in the buffer that contains wide-character data.
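
The endptr and stopchr behavior described in the preceding list can be seen in isolation with a simplified stand-in. The following sketch is not the sdeckanji method. The names toy_mbtopc, toy_mbstopcs, and count_lines are hypothetical, the converter assumes a codeset in which every byte is one character, and the handle parameter is omitted so that the code compiles without <sys/localedef.h>.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for an mbtopc method: one byte is one character. */
static int toy_mbtopc(wchar_t *pwc, const char *s, size_t maxlen, int *err)
{
    if (maxlen < 1) {
        *err = 1;                 /* 1 more byte is needed */
        return 0;
    }
    *pwc = (wchar_t)(unsigned char)s[0];
    *err = 0;
    return 1;
}

/* Same control flow as Example 7-10, with the toy converter substituted. */
static int toy_mbstopcs(wchar_t *pwcs, size_t pwcs_len, const char *s,
                        size_t s_len, int stopchr, char **endptr, int *err)
{
    size_t pwcs_cnt = 0;
    size_t s_cnt = 0;
    int cnt;

    *err = 0;
    for (;;) {
        if (pwcs_cnt >= pwcs_len || s_cnt >= s_len) {
            *endptr = (char *)&s[s_cnt];      /* out of space or out of data */
            break;
        }
        cnt = toy_mbtopc(&pwcs[pwcs_cnt], &s[s_cnt], s_len - s_cnt, err);
        if (cnt == 0) {
            *endptr = (char *)&s[s_cnt];      /* first unconverted byte */
            break;
        }
        pwcs_cnt++;
        if (s[s_cnt] == (char)stopchr) {
            *endptr = (char *)&s[s_cnt + 1];  /* byte after the stop character */
            break;
        }
        s_cnt += (size_t)cnt;
    }
    return (int)pwcs_cnt;
}

/* Counts the calls needed to consume a buffer that is not null terminated,
   resuming each call at the position returned through endptr. */
static int count_lines(const char *buf, size_t len)
{
    wchar_t line[64];
    const char *p = buf;
    const char *end = buf + len;
    int lines = 0;

    while (p < end) {
        char *next;
        int err;
        int n = toy_mbstopcs(line, 64, p, (size_t)(end - p), '\n', &next, &err);

        if (n == 0 || err != 0)
            break;
        lines++;
        p = next;
    }
    return lines;
}
```

Each call stops after the first newline it converts, so a caller that restarts at the returned endptr position consumes one line per call, which matches the way the text describes fgetws() using this method.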



7.3.1.2    Writing the __mbtopc Method for the getwc() Function
The getwc() or fgetwc() function calls the __mbtopc method to convert a multibyte character to a wide character. The method returns the number of bytes in the multibyte character that is converted. This method is similar to the one for mbtowc (see Section 7.3.1.7) but contains an additional parameter that getwc() needs. By convention, a C source file for this method has the file name __mbtopc_codeset.c, where codeset identifies the codeset for which this method is tailored. Example 7-11 shows the file __mbtopc_sdeckanji.c that defines the __mbtopc method used with the ja_JP.sdeckanji locale.


Example 7-11: The __mbtopc_sdeckanji Method for the ja_JP.sdeckanji Locale
#include <stdlib.h>  (1)
#include <wchar.h>   
#include <sys/localedef.h>   

/*
The algorithm for this conversion is:
s[0] < 0x9f:  PC = s[0]
s[0] = 0x8e:  PC = s[1] + 0x5f;
s[0] = 0x8f   PC = (((s[1] - 0xa1) << 7) | (s[2] - 0xa1)) + 0x303c
s[0] > 0xa1:0xa1 < s[1] < 0xfe
              PC = (((s[0] - 0xa1) << 7) | (s[1] - 0xa1)) + 0x15e
            0x21 < s[1] < 0x7e
              PC = (((s[0] - 0xa1) << 7) | (s[1] - 0x21)) + 0x5f1a
+-----------------+-----------+-----------+-----------+
|  process code   |   s[0]    |   s[1]    |   s[2]    |
+-----------------+-----------+-----------+-----------+
| 0x0000 - 0x009f | 0x00-0x9f |    --     |    --     |
| 0x00a0 - 0x00ff |   --      |    --     |    --     |
| 0x0100 - 0x015d | 0x8e      | 0xa1-0xfe |    --     | JIS X0201 RH
| 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe |    --     | JIS X0208
| 0x303c - 0x5f19 | 0x8f      | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
| 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0x7e |    --     | UDC
+-----------------+-----------+-----------+-----------+
*/   (2)
int  __mbtopc_sdeckanji(
        wchar_t *pwc,   (3)
        char *ts,   (4)
        size_t maxlen,   (5)
        int *err,   (6)
        _LC_charmap_t *handle )   (7)
{
    wchar_t dummy;   (8)
    unsigned char *s = (unsigned char *)ts;   (9)
    if (s == NULL)
        return(0);   (10)
    if (pwc == (wchar_t *)NULL)
        pwc = &dummy;   (11)
    *err = 0;   (12)
    if (s[0] <= 0x8d) {
        if (maxlen < 1) {
            *err = 1;
            return(0);
        }
        else {
            *pwc = (wchar_t) s[0];
            return(1);
        }
    }   (13)
    else if (s[0] == 0x8e) {
        if (maxlen >= 2) {
            if (s[1] >=0xa1 && s[1] <=0xfe) {
                *pwc = (wchar_t) (s[1] + 0x5f);
                return(2);
            }
        }
        else {
            *err = 2;
            return(0);
        }
    }   (14)
    else if (s[0] == 0x8f) {
        if (maxlen >= 3) {
            if ((s[1] >=0xa1 && s[1] <=0xfe) &&
                (s[2] >=0xa1 && s[2] <= 0xfe)) {
                *pwc = (wchar_t) (((s[1] - 0xa1) << 7) |
                       (wchar_t) (s[2] - 0xa1)) + 0x303c;
                return(3);
            }
        }
        else {
            *err = 3;
            return(0);
        }
    }   (15)
    else if (s[0] <= 0x9f) {
        if (maxlen < 1) {
            *err = 1;
            return(0);
        }
        else {
            *pwc = (wchar_t) s[0];
            return(1);
        }
    }   (16)
    else if (s[0] >= 0xa1 && s[0] <= 0xfe) {
        if (maxlen >= 2) {
            if  (s[1] >=0xa1 && s[1] <= 0xfe) {
                *pwc = (wchar_t) (((s[0] - 0xa1) << 7) |
                       (wchar_t) (s[1] - 0xa1)) + 0x15e;
                return(2);
            } else if  (s[1] >=0x21 && s[1] <= 0x7e) {
                *pwc = (wchar_t) (((s[0] - 0xa1) << 7) |
                       (wchar_t) (s[1] - 0x21)) + 0x5f1a;
                return(2);
            }
        }
        else {
            *err = 2;
            return(0);
        }
    }   (17)
    *err = -1;
    return(0);   (18)
}

  1. Includes header files that contain constants and structures required for this method.

  2. Describes the algorithm used to determine the number of bytes and valid byte combinations for the different character sets that the codeset supports.

    The codeset supports several character sets and each set contains characters of only one length. The value in the first byte indicates the character set and therefore the character length. For character sets with multibyte characters, one or more additional bytes must be examined to determine whether the value sequence identifies a character or is invalid.

  3. Points, through pwc, to a buffer that stores the wide character.

  4. Points, through ts, to a buffer that stores the bytes that are passed to the method from the calling function.

  5. Declares a variable, maxlen, that stores the maximum number of bytes in the multibyte data.

    This value is passed by the calling function.

  6. Points, through err, to a buffer that stores execution status.

  7. Points, through handle, to a structure that contains pointers to the methods that parse the character maps for this locale.

  8. Declares a variable, dummy, to which pwc can be set to ensure a valid address.

  9. Casts ts (an array of signed characters) to s (an array of unsigned characters).

    This operation prevents problems when integer values are stored in the array and then referenced by index. Compilers apply sign extension when comparing a value of a small signed data type, such as char, to a value of a larger data type, such as int. Sign extension means that the high bit of the value in the small data type is used to fill in the bits that remain when the value is converted to the larger data type for comparison. For example, if s[0] is the value 0x8e, sign extension would cause it to be treated as 0xffffff8e. In this case, a condition like the following one would be evaluated as true when you would expect it to be false:

    if (s[0] <= 0x8d)

  10. Returns zero (0) if s is a null pointer.

  11. Points pwc to dummy if the calling function passed a null pointer for pwc.

    This operation ensures that pwc always points to a valid address; otherwise, an application could produce a segmentation fault by referring to this pointer when a wide character has not been stored in pwc.

  12. Initializes err to zero (0) to indicate success.

  13. Determines if the character is one of the single-byte characters that the codeset defines for values equal to or less than 0x8d.

    If maxlen is less than 1, returns zero (0) to indicate that no bytes were converted and sets err to 1 to indicate that 1 byte is needed to form a valid character.

    If the byte value is in the range being tested, moves the associated process code value to pwc and returns 1 to indicate the number of bytes converted.

  14. Determines if the character is one of the double-byte characters that the codeset defines for the value 0x8e (first byte) and the value range 0xa1 to 0xfe (second byte).

    If yes, moves the associated process code value to the pwc buffer and returns 2 to indicate the number of bytes converted. If fewer than 2 bytes are available, returns zero (0) to indicate that no conversion took place and sets err to 2 to specify that at least 2 bytes are needed to form a valid character.

  15. Determines if the character is one of the triple-byte characters that the codeset defines for the value 0x8f (first byte), the range 0xa1 to 0xfe (second byte), and the range 0xa1 to 0xfe (third byte).

    If yes, moves the associated process code value to pwc and returns 3 to indicate the number of bytes converted. If fewer than 3 bytes are available, sets err to 3 to indicate that at least 3 bytes are needed and returns zero (0) to indicate that no character was converted.

  16. Determines if the character is one of the single-byte characters that the codeset defines for the range 0x90 to 0x9f.

    If maxlen is less than 1, returns zero (0) to indicate that no bytes were converted and sets err to 1 to indicate that at least 1 byte is needed to form a valid character.

    If the byte value is in the defined range, moves the associated process code value to pwc and returns 1 to indicate the number of bytes converted.

  17. Determines if the character is one of the double-byte characters that the codeset defines for the range 0xa1 to 0xfe (first byte) and either 0xa1 to 0xfe or 0x21 to 0x7e (second byte).

    If yes, moves the associated process code value to the pwc buffer and returns 2 to indicate the number of bytes converted. If fewer than 2 bytes are available, sets err to 2 to indicate that at least 2 bytes are needed to form a valid character and returns zero (0) to indicate that no bytes were converted.

  18. Sets err to -1 to indicate that an invalid multibyte sequence was encountered and returns zero (0) to indicate that no bytes were converted.

    These statements execute if the multibyte data in s satisfies none of the preceding if conditions.
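
The process code ranges in the comment table of Example 7-11 follow directly from the shift-and-offset arithmetic. The following sketch restates that arithmetic so the range boundaries can be checked by hand. The helper names (pack_pc, pc_x0201, pc_x0208, pc_x0212, pc_udc) are hypothetical and are not part of the locale source, and the helpers assume that the caller has already validated the byte ranges.

```c
#include <stdint.h>

/* Packs a row byte and a column byte into a process code. The row is offset
   from 0xa1, the column from lo_base, and the result is moved into its
   character set's range by adding offset. */
static uint32_t pack_pc(unsigned hi, unsigned lo, unsigned lo_base,
                        uint32_t offset)
{
    return (((uint32_t)(hi - 0xa1) << 7) | (uint32_t)(lo - lo_base)) + offset;
}

/* JIS X0201 RH: 0x8e followed by s1 in 0xa1-0xfe. */
static uint32_t pc_x0201(unsigned s1)
{
    return (uint32_t)s1 + 0x5f;
}

/* JIS X0208: s0 and s1 both in 0xa1-0xfe. */
static uint32_t pc_x0208(unsigned s0, unsigned s1)
{
    return pack_pc(s0, s1, 0xa1, 0x15e);
}

/* JIS X0212: 0x8f followed by s1 and s2, both in 0xa1-0xfe. */
static uint32_t pc_x0212(unsigned s1, unsigned s2)
{
    return pack_pc(s1, s2, 0xa1, 0x303c);
}

/* UDC: s0 in 0xa1-0xfe, s1 in 0x21-0x7e. */
static uint32_t pc_udc(unsigned s0, unsigned s1)
{
    return pack_pc(s0, s1, 0x21, 0x5f1a);
}
```

The largest packed value is (0x5d << 7) | 0x5d = 0x2edd, so each two-dimensional character set occupies 0x2ede consecutive process codes. For example, pc_x0208(0xfe, 0xfe) is 0x2edd + 0x15e = 0x303b, the last JIS X0208 process code, and the JIS X0212 range begins at the next value, 0x303c.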



7.3.1.3    Writing the __pcstombs Method for the fputws() Function
The fputws() function first calls the __pcstombs method to convert a string of characters from process (wide-character) code to multibyte code. If this method returns -1 to indicate no support by the locale, fputws() then calls putwc() for each wide character in the string being converted. By convention, a C source file for this method has the file name __pcstombs_codeset.c, where codeset identifies the codeset for which this method is tailored. Example 7-12 shows the file __pcstombs_sdeckanji.c that defines the __pcstombs method used with the ja_JP.sdeckanji locale.


Example 7-12: The __pcstombs_sdeckanji Method for the ja_JP.sdeckanji Locale
int __pcstombs_sdeckanji()
{
        return -1;   (1)
}

  1. Returns -1 to indicate that the locale does not support the method.

    This return causes the fputws() function to use multiple calls to putwc() to convert wide characters in the string.

If you choose to implement this method fully rather than writing it to return -1, your function implementation returns the number of wide characters converted and must include header files and parameters as shown in the following example:

#include <stdlib.h>
#include <wchar.h>
#include <sys/localedef.h>

int __pcstombs_newcodeset(
        wchar_t *pcsbuf,   (1)
        size_t pcsbuf_len,   (2)
        char *mbsbuf,  (3)
        size_t mbsbuf_len,  (4)
        char **endptr,  (5)
        int *err,   (6)
        _LC_charmap_t *handle )  (7)

  1. Specifies a pointer to a buffer that contains the wide-character string.

  2. Specifies a variable with the length of the wide-character buffer.

    This value is passed to the method on the call from fputws().

  3. Specifies a pointer to a buffer that contains the multibyte-character string.

  4. Specifies a variable with the length of the multibyte-character buffer.

    This value is passed to the method on the call from fputws().

  5. Points, through endptr, to a pointer to the byte position in the multibyte-character buffer where the next character would begin if multiple calls to fputws() are required to convert all the wide-character data.

  6. Specifies a pointer to the execution status return.

    If this method calls the wctomb method to perform the character conversion, the wctomb method sets this status. Otherwise, this method must incorporate the logic to perform wide-character to multibyte-character conversion and set the status directly.

    In any event, the fputws() function expects the following values:

    • 0 for success

    • -1 to indicate that the wide-character value is invalid and therefore cannot be converted

    • A positive value to indicate that the multibyte-character buffer contains too few bytes after the last character to store the next character

      In this case, the value is the number of bytes required to store the next character. The fputws() function can then empty the multibyte-character buffer and try again.

  7. Specifies a pointer to the _LC_charmap_t structure that stores pointers to the methods used with this locale.
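
As an illustration of the parameter and status conventions in the preceding list, the following is a minimal sketch of what a full implementation might look like. It is not taken from the sdeckanji locale source. The names pc_to_mb and sketch_pcstombs are hypothetical, the encoding is the assumed inverse of the mapping in the comment table of Example 7-11, and the handle parameter is omitted so that the code compiles without <sys/localedef.h>.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: encodes one process code into out.
   Returns the number of bytes written. On failure returns 0 and sets *err
   to -1 (invalid process code) or to the number of bytes required. */
static int pc_to_mb(wchar_t wc, char *out, size_t room, int *err)
{
    unsigned long pc = (unsigned long)wc;
    unsigned long v = 0;
    unsigned char b[3];
    int n;

    *err = 0;
    if (pc <= 0x9f && pc != 0x8e && pc != 0x8f) {   /* single byte */
        b[0] = (unsigned char)pc;
        n = 1;
    } else if (pc >= 0x100 && pc <= 0x15d) {        /* JIS X0201 RH */
        b[0] = 0x8e;
        b[1] = (unsigned char)(pc - 0x5f);
        n = 2;
    } else if (pc >= 0x15e && pc <= 0x303b) {       /* JIS X0208 */
        v = pc - 0x15e;
        b[0] = (unsigned char)((v >> 7) + 0xa1);
        b[1] = (unsigned char)((v & 0x7f) + 0xa1);
        n = 2;
    } else if (pc >= 0x303c && pc <= 0x5f19) {      /* JIS X0212 */
        v = pc - 0x303c;
        b[0] = 0x8f;
        b[1] = (unsigned char)((v >> 7) + 0xa1);
        b[2] = (unsigned char)((v & 0x7f) + 0xa1);
        n = 3;
    } else if (pc >= 0x5f1a && pc <= 0x8df7) {      /* UDC */
        v = pc - 0x5f1a;
        b[0] = (unsigned char)((v >> 7) + 0xa1);
        b[1] = (unsigned char)((v & 0x7f) + 0x21);
        n = 2;
    } else {
        *err = -1;
        return 0;
    }
    if ((v & 0x7f) > 0x5d) {      /* column outside the 94 valid values */
        *err = -1;
        return 0;
    }
    if ((size_t)n > room) {       /* caller must empty the buffer and retry */
        *err = n;
        return 0;
    }
    memcpy(out, b, (size_t)n);
    return n;
}

/* Converts up to pcsbuf_len wide characters and returns the number converted.
   *endptr receives the position where the next multibyte character would go. */
static int sketch_pcstombs(const wchar_t *pcsbuf, size_t pcsbuf_len,
                           char *mbsbuf, size_t mbsbuf_len,
                           char **endptr, int *err)
{
    size_t i = 0;
    size_t used = 0;

    *err = 0;
    while (i < pcsbuf_len) {
        int n = pc_to_mb(pcsbuf[i], mbsbuf + used, mbsbuf_len - used, err);

        if (n == 0)
            break;
        used += (size_t)n;
        i++;
    }
    *endptr = mbsbuf + used;
    return (int)i;
}
```

The sketch rejects the process codes 0x8e and 0x8f because, under the mapping in Example 7-11, those byte values only introduce multibyte characters and never stand for a character themselves.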

The __pcstombs method performs the reverse of the operation that the __mbstopcs method described in Section 7.3.1.1 performs. Because of the direction of the data conversion, the __pcstombs method:



7.3.1.4    Writing a __pctomb Method
C Library functions currently do not use the __pctomb interface. The putwc() function, for example, calls the wctomb method to convert a character from wide-character to multibyte-character format. Nonetheless, the localedef command requires a method for this function when your locale supplies methods. By convention, a C source file for this method has the file name __pctomb_codeset.c, where codeset identifies the codeset for which this method is tailored. Example 7-13 shows the file __pctomb_sdeckanji.c that defines the __pctomb method used with the ja_JP.sdeckanji locale.


Example 7-13: The __pctomb_sdeckanji Method for the ja_JP.sdeckanji Locale
int __pctomb_sdeckanji()
{
        return -1;   (1)
}

  1. Returns -1 to indicate that the locale does not support this method.



7.3.1.5    Writing a Method for the mblen Function
The mblen() function uses the mblen method to return the number of bytes in a multibyte character. By convention, a C source file for this method has the file name __mblen_codeset.c, where codeset identifies the codeset for which this method is tailored. Example 7-14 shows the file __mblen_sdeckanji.c that defines the mblen method used with the ja_JP.sdeckanji locale.


Example 7-14: The __mblen_sdeckanji Method for the ja_JP.sdeckanji Locale
#include <stdlib.h>   (1)
#include <wchar.h>   
#include <sys/errno.h>   
#include <sys/localedef.h>   

/*
The algorithm for this conversion is:

s[0] < 0x9f:  1 byte
s[0] = 0x8e:  2 bytes
s[0] = 0x8f   3 bytes
s[0] > 0xa1   2 bytes

+-----------------+-----------+-----------+-----------+
|  process code   |   s[0]    |   s[1]    |   s[2]    |
+-----------------+-----------+-----------+-----------+
| 0x0000 - 0x009f | 0x00-0x9f |    --     |    --     |
| 0x00a0 - 0x00ff |   --      |    --     |    --     |
| 0x0100 - 0x015d | 0x8e      | 0xa1-0xfe |    --     | JIS X0201 RH
| 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe |    --     | JIS X0208
| 0x303c - 0x5f19 | 0x8f      | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
| 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0x7e |    --     | UDC
+-----------------+-----------+-----------+-----------+
*/   (2)

int __mblen_sdeckanji(
        char *fs,   (3)
        size_t maxlen,   (4)
        _LC_charmap_t *handle )   (5)
{
    const unsigned char *s = (void *) fs;   (6)

    if (s == NULL || *s == '\0')
        return(0);   (7)

    if (maxlen < 1) {
        _Seterrno(EILSEQ);
        return((size_t)-1);
    }   (8)

    if (s[0] <= 0x8d)
        return(1);   (9)

    else if (s[0] == 0x8e) {
        if (maxlen >= 2 && s[1] >=0xa1 && s[1] <=0xfe)
            return(2);
    }   (10)

    else if (s[0] == 0x8f) {
        if(maxlen >=3 && (s[1] >=0xa1 && s[1] <=0xfe) &&
            (s[2] >=0xa1 && s[2] <= 0xfe))
            return(3);
    }   (11)

    else if (s[0] <= 0x9f)
        return(1);   (12)

    else if (s[0] >= 0xa1) {
            if (maxlen >=2 && (s[0] <= 0xfe) )
                    if ( (s[1] >=0xa1 && s[1] <= 0xfe) ||
                       (s[1] >=0x21 && s[1] <= 0x7e) )
                        return(2);
    }   (13)

    _Seterrno(EILSEQ);
    return((size_t)-1);   (14)
}

  1. Includes header files that contain constants and structures required by this method.

  2. Describes the algorithm used to determine the number of bytes in the character and whether it is a valid byte sequence.

    The codeset supports several character sets and each set contains characters of only one length. The value in the first byte indicates the character set and therefore the character length. For character sets with multibyte characters, one or more additional bytes must be examined to determine whether the value sequence identifies a character or is invalid.

  3. Points, through fs, to a buffer that stores the byte string to be examined.

  4. Defines a variable, maxlen, that stores the maximum length of a multibyte character.

    This value is passed to the method by the mblen() function.

  5. Points, through handle, to a structure that stores pointers to the methods that parse character maps for this locale.

  6. Casts fs (an array of signed characters) to s (an array of unsigned characters).

    This operation prevents problems when integer values are stored in the array and then referenced by index. Compilers apply sign extension when comparing a value of a small signed data type, such as char, to a value of a larger data type, such as int. Sign extension means that the high bit of the value in the small data type is used to fill in the bits that remain when the value is converted to the larger data type for comparison. For example, if s[0] is the value 0x8e, sign extension would cause it to be treated as 0xffffff8e. In this case, a condition like the following one would be evaluated as true when you would expect it to be false:

    if (s[0] <= 0x8d)

  7. Returns zero (0) to indicate that the character length is zero (0) bytes if s is a null pointer or points to a null character.

  8. Returns -1 and sets errno to EILSEQ (invalid character sequence) if maxlen (the maximum number of bytes to consider) is 0.

    To set errno in a way that works correctly with multithreaded applications, use _Seterrno rather than an assignment statement.

  9. Determines if the first byte identifies a single-byte character whose value is equal to or less than 0x8d.

    If yes, returns 1 to indicate that the character length is 1 byte.

  10. Determines if the first byte identifies a double-byte character whose first byte contains the value 0x8e and second byte contains a value in the range 0xa1 to 0xfe.

    If yes, returns 2 to indicate that the character length is 2 bytes.

  11. Determines if the first byte identifies a triple-byte character whose first byte contains the value 0x8f and whose second and third bytes contain a value in the range 0xa1 to 0xfe.

    If yes, returns 3 to indicate that the character length is 3 bytes.

  12. Determines if the first byte identifies a single-byte character whose value is equal to or less than 0x9f.

    If yes, returns 1 to indicate that the character length is 1 byte.

  13. Determines if the first byte identifies a double-byte character whose first byte contains a value in the range 0xa1 to 0xfe and whose second byte contains a value in the range 0xa1 to 0xfe or in the range 0x21 to 0x7e.

    If yes, returns 2 to indicate that the character length is 2 bytes.

  14. Returns -1 and sets errno to EILSEQ to indicate an invalid multibyte sequence.

    These statements execute if the multibyte data in the s buffer satisfies none of the preceding if conditions.
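
The sign-extension problem that the cast to unsigned char avoids can be reproduced with a few lines. The following sketch uses hypothetical function names and reads the same byte through a signed char pointer and through an unsigned char pointer. It uses signed char explicitly because whether plain char is signed depends on the compiler and platform, and it assumes a two's complement representation.

```c
/* Reads the first byte as a signed value. For the byte 0x8e the value is
   -114 after promotion to int, so the comparison with 0x8d (141) is true. */
static int first_byte_le_8d_signed(const void *buf)
{
    const signed char *s = buf;

    return s[0] <= 0x8d;
}

/* Reads the first byte as an unsigned value. For the byte 0x8e the value is
   142 after promotion to int, so the comparison with 0x8d (141) is false. */
static int first_byte_le_8d_unsigned(const void *buf)
{
    const unsigned char *s = buf;

    return s[0] <= 0x8d;
}
```

With a buffer whose first byte is 0x8e, the signed version reports that the byte is in the single-byte range, which would cause a method to misclassify the first byte of a JIS X0201 character; the unsigned version does not.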



7.3.1.6    Writing a Method for the mbstowcs Function
The mbstowcs() function uses the mbstowcs method to convert a multibyte character string to process (wide-character) code and to return the number of resultant wide characters. By convention, a C source file for this method has the file name __mbstowcs_codeset.c, where codeset identifies the codeset for which this method is tailored. Example 7-15 shows the file __mbstowcs_sdeckanji.c that defines the mbstowcs method used with the ja_JP.sdeckanji locale.


Example 7-15: The __mbstowcs_sdeckanji Method for the ja_JP.sdeckanji Locale
#include <stdlib.h>   (1)
#include <wchar.h>   
#include <sys/localedef.h>  

size_t __mbstowcs_sdeckanji(
        wchar_t *pwcs,   (2)
        const char *s,   (3)
        size_t n,   (4)
        _LC_charmap_t *handle )   (5)
{
    int len = n;   (6)
    int rc;   (7)
    int cnt;   (8)
    wchar_t *pwcs0 = pwcs;   (9)
    int mb_cur_max;   (10)

    if (s == NULL)
        return (0);   (11)

    mb_cur_max = MB_CUR_MAX;   (12)

    if (pwcs == (wchar_t *)NULL) {
        cnt = 0;
        while (*s != '\0') {
             if ((rc = __mblen_sdeckanji(s, mb_cur_max, handle)) == -1)
                return(-1);
             cnt++  ;
             s += rc;
        }
        return(cnt);
    }   (13)

    while (len-- > 0) {
        if ( *s == '\0') {
            *pwcs = (wchar_t) '\0';
            return (pwcs - pwcs0);
        }
        if ((cnt = __mbtowc_sdeckanji(pwcs, s, mb_cur_max, handle)) < 0)
            return(-1);
        s += cnt;
        ++pwcs;
    }   (14)

    return (n);   (15)
}

  1. Includes header files that contain constants and structures required for this method.

  2. Points, through pwcs, to a buffer that contains the wide-character string.

  3. Points, through s, to a buffer that contains the multibyte-character string.

  4. Defines a variable, n, that contains the maximum number of wide characters that the pwcs buffer can store.

  5. Points, through handle, to a structure that stores pointers to the methods that parse character maps for this locale.

  6. Assigns the number of wide characters in the pwcs buffer (the n value supplied by the calling function) to len.

  7. Defines a variable, rc, that stores the return count from a call this method makes to the mblen function.

  8. Defines a variable, cnt, that counts the bytes used by characters in the s buffer.

  9. Saves the start of the wide-character string passed by the calling function in the pwcs0 variable.

  10. Defines a variable, mb_cur_max, that is later set to MB_CUR_MAX and used in a call to the mblen method.

  11. Returns zero (0) if s is null. A method should return zero (0) if the locale's character encoding is stateless and a nonzero value if the locale's character encoding is stateful.

  12. Assigns the value defined for MB_CUR_MAX to mb_cur_max for use on the following call to the mblen method.

  13. Checks to see if a null pointer was passed from the calling function and, if yes, calls the mblen method to calculate the size of the wide-character string.

    The programmer can request the size of the pwcs buffer (for memory allocation purposes) by passing a null pointer as the pwcs parameter in the call to mbstowcs(). The programmer can then use the return value to efficiently allocate memory space for the application's wide-character buffer before calling mbstowcs() again to actually convert the multibyte string.

  14. Converts bytes in the multibyte-character buffer by calling the __mbtowc method until a null character (end-of-string) is encountered.

    Stops processing and returns the number of wide characters in the pwcs buffer if a null character is encountered; increments the byte position in the multibyte character buffer by an appropriate number each time a character is successfully converted.

    This while loop uses the condition len-- > 0 to ensure that processing stops when the pwcs buffer is full. The first if condition in the loop makes sure that, if the multibyte string in the s buffer is null terminated, the associated null terminator in the pwcs buffer is not included in the wide-character count that the mbstowcs() function returns to the application.

  15. Returns the value in n to indicate the resultant number of wide characters in the pwcs buffer.

    This statement executes if the pwcs buffer runs out of space before a null character is encountered in the s buffer.
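
From the application side, the sizing call that item 13 supports might be used as in the following sketch. The to_wide function is a hypothetical helper, not part of the C Library. The sketch relies on mbstowcs() accepting a null pointer for its first argument, which POSIX specifies and which the method in Example 7-15 implements, but which is an extension to ISO C.

```c
#include <stdlib.h>
#include <wchar.h>

/* Converts a multibyte string to a newly allocated wide-character string.
   Returns NULL if the string is invalid in the current locale or if memory
   cannot be allocated. The caller frees the result. */
static wchar_t *to_wide(const char *s, size_t *out_len)
{
    size_t n = mbstowcs(NULL, s, 0);    /* first call: count characters only */
    wchar_t *w;

    if (n == (size_t)-1)
        return NULL;

    w = malloc((n + 1) * sizeof *w);    /* one extra element for the terminator */
    if (w == NULL)
        return NULL;

    if (mbstowcs(w, s, n + 1) == (size_t)-1) {   /* second call: convert */
        free(w);
        return NULL;
    }
    if (out_len != NULL)
        *out_len = n;
    return w;
}
```

Passing n + 1 as the limit on the second call leaves room for the null terminator, which mbstowcs() stores but does not include in its return value.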



7.3.1.7    Writing a Method for the mbtowc Function
The mbtowc() function uses the mbtowc method to convert a multibyte character to a wide character and to return the number of bytes in the multibyte character that was converted. By convention, a C source file for this method has the file name __mbtowc_codeset.c, where codeset identifies the codeset for which this method is tailored. Example 7-16 shows the file __mbtowc_sdeckanji.c that defines the mbtowc method used with the ja_JP.sdeckanji locale.


Example 7-16: The __mbtowc_sdeckanji Method for the ja_JP.sdeckanji Locale
#include <stdlib.h>   (1)
#include <wchar.h>   
#include <sys/errno.h>   
#include <sys/localedef.h>   

/*
The algorithm for this conversion is:

s[0] < 0x9f:  PC = s[0]
s[0] = 0x8e:  PC = s[1] + 0x5f;
s[0] = 0x8f   PC = (((s[1] - 0xa1) << 7) | (s[2] - 0xa1)) + 0x303c
s[0] > 0xa1:0xa1 < s[1] < 0xfe
              PC = (((s[0] - 0xa1) << 7) | (s[1] - 0xa1)) + 0x15e
            0x21 < s[1] < 0x7e
              PC = (((s[0] - 0xa1) << 7) | (s[1] - 0x21)) + 0x5f1a

+-----------------+-----------+-----------+-----------+
|  process code   |   s[0]    |   s[1]    |   s[2]    |
+-----------------+-----------+-----------+-----------+
| 0x0000 - 0x009f | 0x00-0x9f |    --     |    --     |
| 0x00a0 - 0x00ff |   --      |    --     |    --     |
| 0x0100 - 0x015d | 0x8e      | 0xa1-0xfe |    --     | JIS X0201 RH
| 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe |    --     | JIS X0208
| 0x303c - 0x5f19 | 0x8f      | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
| 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0x7e |    --     | UDC
+-----------------+-----------+-----------+-----------+
*/   (2)
int __mbtowc_sdeckanji(
        wchar_t *pwc,   (3)
        const char *ts,   (4)
        size_t maxlen,   (5)
        _LC_charmap_t *handle )   (6)
{
    unsigned char *s = (unsigned char *)ts;   (7)
    wchar_t dummy;   (8)

    if (s == NULL)
        return(0);   (9)

    if (maxlen < 1) {
        _Seterrno(EILSEQ);
        return((size_t)-1);
    }   (10)

    if (pwc == (wchar_t *)NULL)
        pwc = &dummy;   (11)

    if (s[0] <= 0x8d) {
        *pwc = (wchar_t) s[0];
        if (s[0] != '\0')
            return(1);
        else
            return(0);
    }   (12)

    else if (s[0] == 0x8e) {
        if ( (maxlen >= 2) && ((s[1] >=0xa1) && (s[1] <=0xfe))) {
            *pwc = (wchar_t) (s[1] + 0x5f); /* 0x100 - 0xa1 */
            return(2);
        }
    }   (13)

    else if (s[0] == 0x8f) {
        if((maxlen >= 3) && (((s[1] >=0xa1) && (s[1] <=0xfe))
           && ((s[2] >=0xa1) && (s[2] <= 0xfe)))) {
                *pwc = (wchar_t) (((s[1] - 0xa1) << 7) |
                   (wchar_t) (s[2] - 0xa1)) + 0x303c;
           return(3);
        }
    }   (14)

    else if (s[0] <= 0x9f) {
        *pwc = (wchar_t) s[0];
        if (s[0] != '\0')
            return(1);
        else
            return(0);
    }   (15)

    else if (((s[0] >= 0xa1) && (s[0] <= 0xfe)) && (maxlen >= 2)){
            if (((s[1] >=0xa1) && (s[1] <= 0xfe))){
                    *pwc = (wchar_t) (((s[0] - 0xa1) << 7) |
                              (wchar_t)(s[1] - 0xa1)) + 0x15e;
                    return(2);
            } else if (((s[1] >=0x21) && (s[1] <= 0x7e))){
                    *pwc = (wchar_t) (((s[0] - 0xa1) << 7) |
                              (wchar_t)(s[1] - 0x21)) + 0x5f1a;
                    return(2);
            }
    }   (16)
    _Seterrno(EILSEQ);
    return(-1);   (17)
}

  1. Includes header files that contain constants and structures required for this method.

  2. Describes the algorithm used to determine the number of bytes in the character and whether it is a valid byte sequence.

    The codeset supports several character sets and each set contains characters of only one length. The value in the first byte indicates the character set and therefore the character length. For character sets with multibyte characters, one or more additional bytes must be examined to determine whether the value sequence identifies a character or is invalid.

  3. Points, through pwc, to a buffer that contains the wide character.

  4. Points, through ts, to a buffer that contains values in multibyte-character format.

  5. Defines a variable, maxlen, that stores the maximum length of a multibyte character.

    This value is passed from the calling function; the value will have been set to MB_CUR_MAX on the original call made by the application programmer.

  6. Points, through handle, to a structure that stores pointers to the methods that parse character maps for this locale.

  7. Casts ts (an array of signed characters) to s (an array of unsigned characters).

    This operation prevents problems when integer values are stored in the array and then referenced by index. Compilers apply sign extension when comparing a value of a small signed data type, such as char, to a value of a larger data type, such as int. Sign extension means that the high bit of the value in the small data type is used to fill in the bits that remain when the value is converted to the larger data type for comparison. For example, if s[0] is the value 0x8e, sign extension would cause it to be treated as 0xffffff8e. In this case, a condition like the following one would be evaluated as true when you would expect it to be false:

    if (s[0] <= 0x8d)

  8. Defines a variable, dummy, that can be assigned to pwc to ensure pwc points to a valid address.

  9. Returns zero (0) to indicate that the locale's character encoding is stateless if s contains or points to NULL.

    If passed a null pointer, this method should return a value to indicate whether the locale's character encoding is stateful or stateless. Return a nonzero value if your locale's character encoding is stateful.

  10. Returns -1 cast to size_t and sets errno to EILSEQ (invalid byte sequence) if the multibyte data buffer is less than 1 byte in length.

  11. Stores the contents of dummy in the wide-character buffer if the ts buffer contains or points to NULL.

    This operation ensures that pwc always points to a valid address; otherwise, an application could produce a segmentation fault by referring to this pointer when a wide character has not been stored in pwc.

  12. Determines if the first byte identifies a single-byte character whose value is equal to or less than 0x8d.

    If yes, stores the associated process code value in the pwc buffer and returns 1 to indicate that the character length is 1 byte.

  13. Determines if the first byte identifies a double-byte character whose first byte contains the value 0x8e and second byte contains a value in the range 0xa1 to 0xfe.

    If yes, stores the associated process code value in the pwc buffer and returns 2 to indicate that the character length is 2 bytes.

  14. Determines if the first byte identifies a triple-byte character whose first byte contains the value 0x8f and whose second and third bytes contain a value in the range 0xa1 to 0xfe.

    If yes, stores the associated process code value in the pwc buffer and returns 3 to indicate that the character length is 3 bytes.

  15. Determines if the first byte identifies a single-byte character whose value is equal to or less than 0x9f.

    If yes, stores the associated process code value in the pwc buffer and returns 1 to indicate that the character length is 1 byte.

  16. Determines if the first byte identifies a double-byte character whose first byte contains a value in the range 0xa1 to 0xfe and whose second byte contains a value in the range 0x21 to 0x7e.

    If yes, stores the associated process code value in the pwc buffer and returns 2 to indicate that the character length is 2 bytes.

  17. Returns -1 and sets errno to EILSEQ to indicate that an invalid multibyte sequence was encountered.

    These statements execute if the multibyte data in the s buffer satisfies none of the preceding if conditions.



7.3.1.8    Writing a Method for the wcstombs Function
The wcstombs() function calls the wcstombs method to convert a wide-character string to a multibyte-character string and to return the number of bytes in the resultant multibyte-character string. By convention, a C source file for this method has the file name __wcstombs_codeset.c, where codeset identifies the codeset for which this method is tailored. Example 7-17 shows the file __wcstombs_sdeckanji.c that defines the wcstombs method used with the ja_JP.sdeckanji locale.


Example 7-17: The __wcstombs_sdeckanji Method for the ja_JP.sdeckanji Locale
#include <stdlib.h>   (1)
#include <wchar.h>   
#include <limits.h>   
#include <sys/localedef.h>   

size_t __wcstombs_sdeckanji(
        char *s,   (2)
        const wchar_t *pwcs,   (3)
        size_t n,   (4)
        _LC_charmap_t *handle )   (5)
{
    int cnt=0;   (6)
    int len=0;   (7)
    int i=0;   (8)
    char tmps[MB_LEN_MAX+1];   (9)

    if ( s == (char *)NULL) {
        cnt = 0;
        while (*pwcs != (wchar_t)'\0') {
            if ((len = __wctomb_sdeckanji(tmps, *pwcs, handle)) == -1)
                    return(-1);
            cnt += len;
            pwcs++;
        }
        return(cnt);
    }   (10)

    if (*pwcs == (wchar_t)'\0') {
        *s = '\0';
        return(0);
    }   (11)

    while (1) {   (12)

        if ((len = __wctomb_sdeckanji(tmps, *pwcs, handle)) == -1)
            return(-1);   (13)

        else if (cnt+len > n) {
            *s = '\0';
            break;
        }   (14)

        if (tmps[0] == '\0') {
            *s = '\0';
            break;
        }   (15)

        for (i=0; i<len; i++) {
            *s = tmps[i];
            s++;
        }   (16)

        cnt += len;   (17)

        if (cnt == n)
            break;   (18)

        pwcs++;   (19)
    }   (20)

    if (cnt == 0)
        cnt = len;   (21)
    return (cnt);   (22)
}

  1. Includes header files that contain constants and structures required for this method.

  2. Points, through s, to a buffer that stores the multibyte-character string that this method passes to the calling function.

  3. Points, through pwcs, to a buffer that stores the wide-character string that is being converted.

  4. Defines a variable, n, that stores the maximum number of bytes in the multibyte-character string buffer.

    This value is supplied by the calling function.

  5. Points, through handle, to a structure that points to the methods that parse character maps for this locale.

  6. Initializes a variable, cnt, that is incremented by the number of bytes (len) of each converted character.

  7. Initializes a variable, len, that stores the length of each converted character.

  8. Initializes a variable, i, that is used to index the bytes in each multibyte character when moving a converted character from temporary storage to s.

  9. Defines a temporary buffer, tmps, that stores the multibyte character returned to this method from a call to the wctomb method.

  10. Checks whether the calling function passed a null pointer in the s parameter.

    If yes, calls the wctomb method to calculate the number of bytes required for converted characters (excluding the null terminator) in the multibyte-character buffer.

    The programmer can request the size of the s buffer (for memory allocation purposes) by passing a null pointer in the s parameter on the call to wcstombs(). The programmer can then use the return value to efficiently allocate memory space for the application's multibyte-character buffer before calling wcstombs() again to actually convert the wide-character string.

  11. Stores a null terminator in s and returns zero (0) to indicate that no multibyte characters resulted if pwcs points to an empty string (a null wide character).

  12. Starts a while loop to process characters in the wide-character string.

  13. Converts characters in the wide-character buffer by calling the wctomb method; returns -1 to indicate an invalid character if wctomb returns -1.

  14. Terminates s with a null byte and breaks out of the while loop if there is no room in s for the character just converted by wctomb.

  15. Moves a null terminator to s and breaks out of the while loop when the converted character in tmps is the null character, which marks the end of the wide-character string.

  16. Appends each byte in tmps to s if the current wide character is not a null.

  17. Increments cnt by the number of bytes (len) occupied by this character in multibyte format.

  18. Breaks out of the while loop without adding a null terminator if the number of bytes processed equals n (the maximum number of bytes in s).

  19. Increments pwcs to point to the next wide character to be converted.

  20. Ends the while loop that converts each wide character.

  21. Ensures that zero (0) is returned if s does not contain enough space for even one character.

  22. Returns the number of bytes in the resultant multibyte-character string.



7.3.1.9    Writing a Method for the wctomb Function
The wctomb() function calls the wctomb method to convert a wide character to a multibyte character and to return the number of bytes in the resultant multibyte character. By convention, a C source file for this method has the file name __wctomb_codeset.c, where codeset identifies the codeset for which this method is tailored. Example 7-18 shows the file __wctomb_sdeckanji.c that defines the wctomb method for the ja_JP.sdeckanji locale.


Example 7-18: The __wctomb_sdeckanji Method for the ja_JP.sdeckanji Locale
#include <stdlib.h>   (1)
#include <wchar.h>   
#include <sys/errno.h>   
#include <sys/localedef.h>   

/*
  The algorithm for this conversion is:

PC <= 0x009f:                 s[0] = PC
PC >= 0x0100 and PC <=0x015d: s[0] = 0x8e
                              s[1] = PC - 0x005f
PC >= 0x015e and PC <=0x303b: s[0] = ((PC - 0x015e) >> 7) + 0x00a1
                              s[1] = ((PC - 0x015e) & 0x007f) + 0x00a1
PC >= 0x303c and PC <=0x5f19: s[0] = 0x8f
                              s[1] = ((PC - 0x303c) >> 7) + 0x00a1
                              s[2] = ((PC - 0x303c) & 0x007f) + 0x00a1
PC >= 0x5f1a and PC <=0x8df7: s[0] = ((PC - 0x5f1a) >> 7) + 0x00a1
                              s[1] = ((PC - 0x5f1a) & 0x007f) + 0x0021

+-----------------+-----------+-----------+-----------+
|  process code   |   s[0]    |   s[1]    |   s[2]    |
+-----------------+-----------+-----------+-----------+
| 0x0000 - 0x009f | 0x00-0x9f |    --     |    --     |
| 0x00a0 - 0x00ff |   --      |    --     |    --     |
| 0x0100 - 0x015d | 0x8e      | 0xa1-0xfe |    --     | JIS X0201 RH
| 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe |    --     | JIS X0208
| 0x303c - 0x5f19 | 0x8f      | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
| 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0xfe |    --     | UDC
+-----------------+-----------+-----------+-----------+
*/   (2)

int __wctomb_sdeckanji(
        char *s,    (3)
        wchar_t wc,    (4)
        _LC_charmap_t *handle )    (5)
{
    if (s == (char *)NULL)
        return(0);    (6)

    if (wc <= 0x9f) {
        s[0] = (char) wc;
        return(1);
    }    (7)

    else if ((wc >= 0x0100) && (wc <= 0x015d)) {
        s[0] = 0x8e;
        s[1] = wc - 0x5f;
        return(2);
    }    (8)

    else if ((wc >=0x015e) && (wc <= 0x303b)) {
        s[0] = (char) (((wc - 0x015e) >> 7) + 0x00a1);
        s[1] = (char) (((wc - 0x015e) & 0x007f) + 0x00a1);
        return(2);
    }    (9)

    else if ((wc >=0x303c) && (wc <= 0x5f19)) {
        s[0] = 0x8f;
        s[1] = (char) (((wc - 0x303c) >> 7) + 0x00a1);
        s[2] = (char) (((wc - 0x303c) & 0x007f) + 0x00a1);
        return(3);
    }    (10)

    else if ((wc >=0x5f1a) && (wc <= 0x8df7)) {
        s[0] = (char) (((wc - 0x5f1a) >> 7) + 0x00a1);
        s[1] = (char) (((wc - 0x5f1a) & 0x007f) + 0x0021);
        return(2);
    }    (11)

    _Seterrno(EILSEQ);
    return(-1);    (12)
}

  1. Includes header files that contain constants and structures required for this method.

  2. Describes the conversion algorithm that this method uses.

    Each character set supported by the codeset corresponds to a unique range of wide-character (process code) values and, within each character set, multibyte characters are of uniform length (1, 2, or 3 bytes). Therefore, the range in which each wide-character value falls indicates the number of bytes required for the character in multibyte format; the wide-character value itself determines the specific byte value or values for the character in multibyte format.

  3. Points, through s, to a buffer that stores the multibyte character.

  4. Defines the wc variable that stores the wide character.

  5. Points, through handle, to a structure that stores pointers to the methods that parse the character maps for this locale.

  6. Returns zero (0) if s is a null pointer; no character is converted, and the zero value indicates that the locale's character encoding is stateless.

  7. If the wide-character value is equal to or less than 0x9f, moves that value into the first byte of the s array and returns 1 to indicate that the converted character is 1 byte in length.

  8. If the wide-character value is in the range 0x0100 to 0x015d, moves the value 0x8e to the first byte and a calculated value to the second byte of the s array; returns 2 to indicate that the converted character is 2 bytes in length.

  9. If the wide-character value is in the range 0x015e to 0x303b, moves calculated values to the first and second bytes of the s array and returns 2 to indicate that the converted character is 2 bytes in length.

  10. If the wide-character value is in the range 0x303c to 0x5f19, moves 0x8f to the first byte and calculated values to the second and third bytes of the s array; returns 3 to indicate that the converted character is 3 bytes in length.

  11. If the wide-character value is in the range 0x5f1a to 0x8df7, moves calculated values to the first and second bytes of the s array, and returns 2 to indicate that the converted character is 2 bytes in length.

  12. Sets errno to EILSEQ and returns -1 to indicate that the wide-character value is invalid.

    These statements execute if the wide-character value satisfies none of the preceding conditions.



7.3.1.10    Writing a Method for the wcswidth Function
The wcswidth() function uses the wcswidth method to determine the number of columns required to display a wide-character string. By convention, a C source file for this method has the file name __wcswidth_codeset.c, where codeset identifies the codeset for which this method is tailored. Example 7-19 shows the file __wcswidth_sdeckanji.c that defines the wcswidth method used for the ja_JP.sdeckanji locale.


Example 7-19: The __wcswidth_sdeckanji Method for the ja_JP.sdeckanji Locale
#include <stdlib.h>   (1)
#include <wchar.h>   
#include <sys/localedef.h>   

/*
The algorithm for this conversion is:

PC <= 0x009f:                 s[0] = PC
PC >= 0x0100 and PC <=0x015d: s[0] = 0x8e
                              s[1] = PC - 0x005f
PC >= 0x015e and PC <=0x303b: s[0] = ((PC - 0x015e) >> 7) + 0x00a1
                              s[1] = ((PC - 0x015e) & 0x007f) + 0x00a1
PC >= 0x303c and PC <=0x5f19: s[0] = 0x8f
                              s[1] = ((PC - 0x303c) >> 7) + 0x00a1
                              s[2] = ((PC - 0x303c) & 0x007f) + 0x00a1
PC >= 0x5f1a and PC <=0x8df7: s[0] = ((PC - 0x5f1a) >> 7) + 0x00a1
                              s[1] = ((PC - 0x5f1a) & 0x007f) + 0x0021

+-----------------+-----------+-----------+-----------+
|  process code   |   s[0]    |   s[1]    |   s[2]    |
+-----------------+-----------+-----------+-----------+
| 0x0000 - 0x009f | 0x00-0x9f |    --     |    --     |
| 0x00a0 - 0x00ff |   --      |    --     |    --     |
| 0x0100 - 0x015d | 0x8e      | 0xa1-0xfe |    --     | JIS X0201 RH
| 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe |    --     | JIS X0208
| 0x303c - 0x5f19 | 0x8f      | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
| 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0xfe |    --     | UDC
+-----------------+-----------+-----------+-----------+
*/   (2)

int __wcswidth_sdeckanji(
        const wchar_t *wcs,   (3)
        size_t n,   (4)
        _LC_charmap_t *hdl )   (5)
{
    int len;   (6)
    int i;   (7)

    if (wcs == (wchar_t *)NULL || *wcs == (wchar_t)NULL)
        return(0);   (8)

    len = 0;   (9)
    for (i=0; wcs[i] != (wchar_t)NULL && i<n; i++) {   (10)

        if (wcs[i] <= 0x9f)
             len += 1;   (11)

        else if ((wcs[i] >= 0x0100) && (wcs[i] <= 0x015d))
             len += 1;   (12)

        else if ((wcs[i] >=0x015e) && (wcs[i] <= 0x303b))
             len += 2;   (13)

        else if ((wcs[i] >=0x303c) && (wcs[i] <= 0x5f19))
            len += 2;   (14)

        else if ((wcs[i] >=0x5f1a) && (wcs[i] <= 0x8df7))
            len += 2;   (15)

        else
            return(-1);   (16)
    }   (17)

    return(len);   (18)
}

  1. Includes header files that contain constants and structures required for this method.

  2. Describes the algorithm used to determine the required display width.

    Note that each character's display width is either 1 or 2 columns, depending on the character set to which a character belongs. Display width is different from the size of the character in multibyte format; for example, triple-byte characters require 2 display columns and double-byte characters can require either 1 or 2 display columns.

  3. Points, through wcs, to a buffer that stores the wide-character string for which display width information is requested.

  4. Defines a variable, n, that stores the maximum size of the wcs buffer.

  5. Points, through hdl, to a structure that stores pointers to the methods that parse character maps for this locale.

  6. Defines a variable, len, that stores the display width in columns.

  7. Defines a variable, i, that functions as a loop counter.

  8. Returns zero (0) if wcs is a null pointer or points to an empty string.

  9. Initializes len to zero (0).

  10. Begins a for loop that processes each wide character in the wcs buffer, up to a maximum of n characters, and increments the index i.

  11. Increments len by 1 if the value of the current wide character is less than or equal to 0x9f.

  12. Increments len by 1 if the value of the current wide character is in the range 0x0100 to 0x015d.

  13. Increments len by 2 if the value of the current wide character is in the range 0x015e to 0x303b.

  14. Increments len by 2 if the value of the current wide character is in the range 0x303c to 0x5f19.

  15. Increments len by 2 if the value of the current wide character is in the range 0x5f1a to 0x8df7.

  16. Returns -1 to indicate that the string contains an invalid wide character.

    This statement executes if a value that satisfies none of the preceding conditions is encountered in the string. The calling function, wcswidth(), also returns -1 if the wide character is nonprintable; however, this condition is evaluated at the level of the calling function and does not need to be evaluated by the method.

  17. Ends the for loop that processes wide characters in the wcs buffer.

  18. Returns len to indicate the number of columns required to display the wide-character string.



7.3.1.11    Writing a Method for the wcwidth Function
The wcwidth() function uses the wcwidth method to determine the number of columns required to display a wide character. By convention, a C source file for this method has the file name __wcwidth_codeset.c, where codeset identifies the codeset for which this method is tailored. Example 7-20 shows the file __wcwidth_sdeckanji.c that defines the wcwidth method used with the ja_JP.sdeckanji locale.


Example 7-20: The __wcwidth_sdeckanji Method for the ja_JP.sdeckanji Locale
#include <stdlib.h>   (1)
#include <wchar.h>   
#include <sys/localedef.h>   

/*
The algorithm for this conversion is:

PC <= 0x009f:                 s[0] = PC
PC >= 0x0100 and PC <=0x015d: s[0] = 0x8e
                              s[1] = PC - 0x005f
PC >= 0x015e and PC <=0x303b: s[0] = ((PC - 0x015e) >> 7) + 0x00a1
                              s[1] = ((PC - 0x015e) & 0x007f) + 0x00a1
PC >= 0x303c and PC <=0x5f19: s[0] = 0x8f
                              s[1] = ((PC - 0x303c) >> 7) + 0x00a1
                              s[2] = ((PC - 0x303c) & 0x007f) + 0x00a1
PC >= 0x5f1a and PC <=0x8df7: s[0] = ((PC - 0x5f1a) >> 7) + 0x00a1
                              s[1] = ((PC - 0x5f1a) & 0x007f) + 0x0021

+-----------------+-----------+-----------+-----------+
|  process code   |   s[0]    |   s[1]    |   s[2]    |
+-----------------+-----------+-----------+-----------+
| 0x0000 - 0x009f | 0x00-0x9f |    --     |    --     |
| 0x00a0 - 0x00ff |   --      |    --     |    --     |
| 0x0100 - 0x015d | 0x8e      | 0xa1-0xfe |    --     | JIS X0201 RH
| 0x015e - 0x303b | 0xa1-0xfe | 0xa1-0xfe |    --     | JIS X0208
| 0x303c - 0x5f19 | 0x8f      | 0xa1-0xfe | 0xa1-0xfe | JIS X0212
| 0x5f1a - 0x8df7 | 0xa1-0xfe | 0x21-0xfe |    --     | UDC
+-----------------+-----------+-----------+-----------+
*/   (2)

int __wcwidth_sdeckanji(
        wint_t wc,   (3)
        _LC_charmap_t *hdl )   (4)
{

    if (wc == 0)
        return(0);   (5)

    if (wc <= 0x9f)
        return(1);   (6)

    else if ((wc >= 0x0100) && (wc <= 0x015d))
        return(1);   (7)

    else if ((wc >=0x015e) && (wc <= 0x303b))
        return(2);   (8)

    else if ((wc >=0x303c) && (wc <= 0x5f19))
        return(2);   (9)

    else if ((wc >=0x5f1a) && (wc <= 0x8df7))
        return(2);   (10)

    return(-1);   (11)
}

  1. Includes header files that contain constants and structures required for this method.

  2. Describes the algorithm used to determine the required display width.

    Note that a character's display width is either 1 or 2 columns, depending on the character set to which a character belongs. Display width is different from the size of the character in multibyte format; for example, triple-byte characters require 2 display columns and double-byte characters can require either 1 or 2 display columns.

  3. Defines the wc variable that stores the wide character for which display width information is requested.

  4. Points, through hdl, to a structure that stores pointers to the methods that parse character maps for this locale.

  5. Returns zero (0) if the wide character is the null character.

  6. Returns 1 if the wide-character value is less than or equal to 0x009f.

  7. Returns 1 if the wide-character value is in the range 0x0100 to 0x015d.

  8. Returns 2 if the wide-character value is in the range 0x015e to 0x303b.

  9. Returns 2 if the wide-character value is in the range 0x303c to 0x5f19.

  10. Returns 2 if the wide-character value is in the range 0x5f1a to 0x8df7.

  11. Returns -1 if the wide-character value is invalid.

    The calling function, wcwidth(), also returns -1 if the wide character is nonprintable; however, this condition is evaluated at the level of the calling function and does not need to be evaluated by the method.



7.3.2    Optional Methods

A locale can include methods in addition to those discussed in Section 7.3.1. If your locale uses methods but does not supply any for the functions associated with particular locale categories or some other locale-related functions, the localedef command applies default methods that handle process code for both single-byte and multibyte characters. The following list names the optional methods:

Writing optional methods requires detailed information about the internal interfaces to C library routines. This information is proprietary to Digital and may be subject to change. In the rare cases where your locale must include an optional method, contact your Digital technical support representative to request information.



7.3.3    Building a Shareable Library to Use with a Locale

Example 7-21 shows the compiler and linker command lines that are required to build the method source files into a shareable library that is used with the ja_JP.sdeckanji locale.


Example 7-21: Building a Library of Methods Used with the ja_JP.sdeckanji Locale
cc -std0 -c \
   __mblen_sdeckanji.c __mbstopcs_sdeckanji.c \
   __mbstowcs_sdeckanji.c __mbtopc_sdeckanji.c \
   __mbtowc_sdeckanji.c __pcstombs_sdeckanji.c \
   __pctomb_sdeckanji.c __wcstombs_sdeckanji.c \
   __wcswidth_sdeckanji.c __wctomb_sdeckanji.c \
   __wcwidth_sdeckanji.c

ld -shared -set_version osf.1 -soname libsdeckanji.so \
   -no_archive -o libsdeckanji.so \
   __mblen_sdeckanji.o __mbstopcs_sdeckanji.o \
   __mbstowcs_sdeckanji.o __mbtopc_sdeckanji.o \
   __mbtowc_sdeckanji.o __pcstombs_sdeckanji.o __pctomb_sdeckanji.o \
   __wcstombs_sdeckanji.o __wcswidth_sdeckanji.o __wctomb_sdeckanji.o \
   __wcwidth_sdeckanji.o \
   -lc

Refer to the cc(1) and ld(1) reference pages for more information about the cc and ld commands and how you build shared libraries.



7.3.4    Creating a methods File for a Locale

The methods file contains an entry for each function that is defined in the methods shared library for use with the locale. The operation performed by the function is identified by a method keyword, followed by quoted strings with the name of the function and the path to the shared library that contains the function.

Example 7-22 shows the section of a methods file for the methods used with the ja_JP.sdeckanji locale. Because there is a mandatory list of methods that you must define if you want to override any C library interfaces, your methods file must always specify an entry for each of the required methods as shown in this example. The ja_JP.sdeckanji locale relies on default implementations for all optional methods, so Example 7-22 does not contain entries for any of the optional methods.


Example 7-22: The methods File for the ja_JP.sdeckanji Locale
# sdeckanji.m   (1)
# <method_keyword> "<entry>" "<package>" "<library_path>"   (1)

METHODS    (2)

__mbstopcs "__mbstopcs_sdeckanji" "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so"  (3)
__mbtopc   "__mbtopc_sdeckanji"   "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so"  (3)
__pcstombs "__pcstombs_sdeckanji" "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so"  (3)
__pctomb   "__pctomb_sdeckanji"   "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so"  (3)
mblen      "__mblen_sdeckanji"    "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so"  (3)
mbstowcs   "__mbstowcs_sdeckanji" "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so"  (3)
mbtowc     "__mbtowc_sdeckanji"   "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so"  (3)
wcstombs   "__wcstombs_sdeckanji" "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so"  (3)
wcswidth   "__wcswidth_sdeckanji" "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so"  (3)
wctomb     "__wctomb_sdeckanji"   "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so"  (3)
wcwidth    "__wcwidth_sdeckanji"  "libsdeckanji.so" \
"/usr/shlib/libsdeckanji.so"  (3)

END METHODS    (4)

  1. Comment lines

    These lines specify the name of the methods file and the format of method entries. Note that the field identified in the format as <package> is ignored, but you must specify some string for this field in order to specify a library path.

  2. Header to mark start of method entries

  3. Entries for required methods

  4. Trailer to mark end of method entries

Refer to the localedef(1) reference page for detailed information about methods file entries.



7.4    Building and Testing the Locale

Use the localedef command to build a locale from its source files. Example 7-23 shows the command line needed to build the German locale used in most examples in this chapter. Assume for this example that all source files reside in the user's default directory and that the resulting locale is also created in that directory.


Example 7-23: Building the de_DE.ISO8859-1@example Locale
% localedef -f ISO8859-1.cmap \    (1)
-i de_DE.ISO8859-1.lscr \   (2)
de_DE.ISO8859-1@example   (3)

  1. The -f option specifies the character map source file.

  2. The -i option specifies the locale definition source file.

  3. The final argument to the command is the name of the locale.

When you are testing locales, particularly ones that are similar to standard locales installed on the system, you should add an extension to the locale name. Varying names with the at (@) extension allows you to specify the standard strings for language, territory, and codeset and still be sure that the test locale is uniquely identified. This is important if you later decide to move the locale to the directory /usr/lib/nls/loc where other locales reside.

Example 7-23 shows only one form and a few options for the localedef command. The localedef(1) reference page provides a complete description of the command. The following is a summary of some important rules and options:

By default, locales must reside in the /usr/lib/nls/loc directory to be found. If you want to test your locale before moving it to the /usr/lib/nls/loc directory, you can define the LOCPATH variable to specify the directory where your locale is located. You can then define the LANG environment variable to be your new locale and interactively test the locale with commands and applications.

Example 7-24 uses the date command to test the date/time format.


Example 7-24: Setting the LOCPATH Variable and Testing a Locale

% setenv LOCPATH ~harry/locales

% setenv LANG de_DE.ISO8859-1@example

% date
12.Dezember 1993 09:18:11


Note

The LOCPATH variable is an extension to specifications in the X/Open UNIX standard and therefore may not be recognized on all systems that conform to this standard.


Some programs have support files that are installed in system directories with names that exactly match the names of standard locales. In such cases, application software, system software, or both might use the value of the LANG environment variable to determine the locale-specific directory in which the support files reside. If assigned directly to the LANG or LC_ALL environment variable, locale file names with an at (@) suffix may result in invalid search paths for some applications. The following example shows how you can work around this problem by assigning the standard locale name to the LANG variable and the name of your variant locale to the locale category variables. You need to make assignments only to those category variables that represent areas where your locale differs from the locale on which it is based.


% setenv LANG de_DE.ISO8859-1

% setenv LC_CTYPE de_DE.ISO8859-1@example

% setenv LC_COLLATE de_DE.ISO8859-1@example

.
.
.
% setenv LC_TIME de_DE.ISO8859-1@example