The familiar char type is sometimes called a narrow character, as opposed to wchar_t, which is a wide character. The key difference between a narrow and wide character is that a wide character can represent any single character in any character set that an implementation supports. A narrow character, on the other hand, might be too small to represent all characters, so multiple narrow char objects can make up a single, logical character called a multibyte character.
Beyond some minimal requirements for the character sets (see Chapter 1), the C++ standard is purposely open-ended and imposes few restrictions on an implementation. One basic behavioral requirement is that converting a narrow character to a wide character must produce an equivalent character, and converting back to a narrow character must restore the original character. The open nature of the standard gives compiler and library vendors wide latitude. For example, a compiler for Japanese customers might support a variety of Japanese Industrial Standard (JIS) character sets, but not any European character sets. Another vendor might support multiple ISO 8859 character sets for Western and Eastern Europe, but not any Asian multibyte character sets. Although the standard defines universal characters in terms of the Unicode (ISO/IEC 10646) standard, it does not require any support for Unicode character sets.
This section discusses some of the broad issues in dealing with wide and multibyte characters, but the details of specific characters and character sets are implementation-defined.
A program that must deal with international character sets might work entirely with wide characters. Although wide characters usually require more memory than narrow characters, they are usually easier to use. Searching for substrings in a wide string is easy because you never have the problem of matching partial characters (which can happen with multibyte characters).
A common implementation of wchar_t is to use Unicode UTF-32 encoding, which means each wide character is 32 bits and represents a single Unicode character. Suppose you want to declare a wide string that contains the Greek letter pi (π). You can specify the string with a universal name (see Chapter 1):
wchar_t wpi[] = L"\u03c0";
Using UTF-32, the string would contain L"\x03c0". With a different wchar_t implementation, the wpi string would contain different values.
The standard wstring class supports wide strings, and all the I/O streams support wide characters (e.g., wistream, wostream).
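As a minimal sketch of why searching a wide string is straightforward, the following uses std::wstring::find to locate a single wide character. The helper names find_wide and sample_text are hypothetical and chosen only for illustration. The sketch assumes the implementation's wchar_t can hold π in one element, which is true of the common UTF-16 and UTF-32 implementations but is not guaranteed by the standard.

```cpp
#include <cstddef>
#include <string>

// Return the index of the first element equal to `needle`, or
// std::wstring::npos if there is none. Every element of a wide string
// is a whole character, so a match can never be a partial character.
std::size_t find_wide(const std::wstring& haystack, wchar_t needle)
{
    return haystack.find(needle);
}

// Hypothetical sample text: '2', the Greek letter pi, and 'r'.
std::wstring sample_text()
{
    return L"2\u03c0r";
}
```

Under the stated assumption, find_wide(sample_text(), L'\u03c0') returns the index of π directly, with no further check needed to confirm that the match is a complete character.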
A multibyte character represents a single character as a series of one or more bytes, or narrow characters. Because a single character might occupy multiple bytes, working with multibyte strings is more difficult than working with wide strings. For example, if you search a multibyte string for the character '\x20', when you find a match, you must test whether the matching character is actually part of a multibyte character and is therefore not actually a match for the single character you want to find.
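The false-match problem can be illustrated with a short sketch. It assumes, purely for illustration, an ISO-2022-JP style encoding in which the two bytes of a double-byte character fall in the printable ASCII range. The names jis_pi and naive_find are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Assumed ISO-2022-JP bytes for the single character pi:
// ESC $ B selects the double-byte set, & P encodes pi, and
// ESC ( B returns to the single-byte set.
const std::string jis_pi = "\x1B$B&P\x1B(B";

// Naive byte-wise search that ignores multibyte structure.
std::size_t naive_find(const std::string& text, char c)
{
    return text.find(c);
}
```

Here naive_find(jis_pi, 'P') reports a match at index 4, yet the string contains no letter P. The matching byte is the second half of the double-byte encoding of π, so a correct search must first determine where each multibyte character begins and ends.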
Consider the problem of comparing multibyte strings. Suppose you need to sort the strings in ascending order. If one string starts with the character '\xA1' and another starts with '\xB2', it seems that the first is smaller than the second and therefore should come before the second. On the other hand, these characters might be the first of multibyte character sequences, so the strings cannot be compared until you have analyzed the strings for multibyte character sequences.
Multibyte character sets abound, and a particular C++ compiler and library might support only one or just a few. Some multibyte character sets specifically support a particular language, such as the Chinese Big5 character set. The UTF-8 character set supports all Unicode characters using one to four narrow characters.
For example, consider how an implementation might encode the Greek letter pi (π), which has a Unicode value of 0x03C0:
char pi[] = "\u03c0";
If the implementation's narrow character set is ISO 8859-7 (8-bit Greek), the encoding is 0xF0, so pi[] contains "\xf0". If the narrow character set is UTF-8 (8-bit Unicode), the representation is a multibyte character, and pi[] would contain "\xcf\x80". Many character sets do not have any encoding for π, in which case the contents of pi[] might be "?", or some other implementation-defined marker for unknown characters.
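To make the UTF-8 case concrete, here is a small sketch of a UTF-8 encoder for a single code point. The function name to_utf8 is hypothetical. The sketch assumes the argument is a valid code point no larger than 0x10FFFF and does not reject surrogate values.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8 (one to four bytes).
// Assumes cp <= 0x10FFFF; surrogates are not checked in this sketch.
std::string to_utf8(unsigned long cp)
{
    std::string out;
    if (cp < 0x80) {
        // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For π, the code point 0x03C0 falls in the two-byte range: the upper five bits (01111) go into the lead byte 0xCF and the lower six bits (000000) into the continuation byte 0x80, so to_utf8(0x03C0) yields the two bytes 0xCF 0x80.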
You can convert a multibyte character sequence to a wide character and back using the functions in <cwchar>. When performing such conversions, the library might need to keep track of state information during the conversion. This is known as the shift state and is stored in an mbstate_t object (also defined in <cwchar>).
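The following sketch shows one way to drive such a conversion with std::mbrtowc from <cwchar>, carrying the shift state in an mbstate_t object from one call to the next. The function name widen is hypothetical. The result for anything other than plain ASCII depends on the current locale, and the sketch simply stops at the first invalid or incomplete sequence instead of reporting an error.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// Convert a multibyte string to a wide string, one character at a time.
std::wstring widen(const std::string& mb)
{
    std::mbstate_t state = std::mbstate_t();   // initial shift state
    std::wstring result;
    const char* p = mb.data();
    std::size_t remaining = mb.size();

    while (remaining > 0) {
        wchar_t wc;
        std::size_t n = std::mbrtowc(&wc, p, remaining, &state);
        if (n == static_cast<std::size_t>(-1) ||
            n == static_cast<std::size_t>(-2))
            break;              // invalid or incomplete sequence: stop
        if (n == 0)
            n = 1;              // null character: assume it used one byte
        result += wc;
        p += n;
        remaining -= n;
    }
    return result;
}
```

Because the same state object is passed to every call, the library can remember the current shift state between characters. In the default "C" locale, widen("abc") produces L"abc".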
For example, the Japanese Industrial Standard (JIS) encodes single-byte characters and double-byte characters. A 3-byte character sequence shifts from single- to double-byte mode, and another sequence shifts back. The shift state keeps track of the current mode. The initial shift state is single-byte. Thus, the multibyte string "\x1B$B&P\x1B(B" represents one wide character, namely, the Greek letter pi (π). The first three characters switch to double-byte mode. The next two characters encode the character, and the final three characters restore single-byte mode.
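The role of the shift state can be sketched with a hand-written scanner that counts logical characters. This is an illustration only, not how a library implements mbstate_t. The function name count_characters is hypothetical, and the sketch recognizes just the two escape sequences described above (ESC $ B and ESC ( B). A truncated double-byte character at the end of the string is still counted as one character.

```cpp
#include <cstddef>
#include <string>

// Count logical characters in a string that uses ESC $ B to enter
// double-byte mode and ESC ( B to return to single-byte mode.
std::size_t count_characters(const std::string& s)
{
    bool double_byte = false;       // initial shift state is single-byte
    std::size_t count = 0;
    std::size_t i = 0;

    while (i < s.size()) {
        if (s.compare(i, 3, "\x1B$B") == 0) {
            double_byte = true;     // shift to double-byte mode
            i += 3;
        } else if (s.compare(i, 3, "\x1B(B") == 0) {
            double_byte = false;    // shift back to single-byte mode
            i += 3;
        } else {
            i += double_byte ? 2 : 1;   // consume one logical character
            ++count;
        }
    }
    return count;
}
```

The eight bytes of "\x1B$B&P\x1B(B" yield a count of one, because the escape sequences only change the shift state and the two bytes between them form a single character.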
Shift states are especially important when performing I/O. By definition, file I/O uses multibyte characters. That is, a file is treated as a sequence of narrow characters. When reading a wide-character stream, the narrow characters are converted to wide characters, and when writing a wide stream, wide characters are converted back to multibyte characters. Seeking to a new position in a file might seek to a position that falls in the middle of a multibyte sequence. Therefore, a file position is required to keep track of a shift state in addition to a byte position in the file. See <ios> in Chapter 13.