All source code is divided into a stream of tokens. The compiler tries to collect as many contiguous characters as it can to build a valid token. (This is sometimes called the "max munch" rule.) It stops when the next character it would read cannot possibly be part of the token it is reading.
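As a minimal sketch of the max munch rule, the following evaluates x+++y. The names MunchResult, max_munch_demo, and check_max_munch are made up for this illustration. Because the compiler builds the longest valid token first, it should read the expression as x ++ + y, so only x is incremented.

```cpp
#include <cassert>

// Hypothetical helper type: records the outcome of evaluating x+++y.
struct MunchResult {
    int sum;      // value of the whole expression
    int x_after;  // x after evaluation
    int y_after;  // y after evaluation
};

// Evaluates x+++y on local copies of the arguments.
MunchResult max_munch_demo(int x, int y) {
    int sum = x+++y;  // tokenized as x ++ + y, that is, (x++) + y
    MunchResult r = { sum, x, y };
    return r;
}

// Self-check: only x changes, so the ++ was attached to x, not to y.
void check_max_munch() {
    MunchResult r = max_munch_demo(1, 2);
    assert(r.sum == 3 && r.x_after == 2 && r.y_after == 2);
}
```

Had the compiler read x + ++y instead, the sum would be 4 and y would end up as 3.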
A token can be an identifier, a reserved keyword, a literal, or an operator or punctuation symbol. Each kind of token is described later in this section.
Step 3 of the compilation process reads preprocessor tokens. These tokens are converted automatically to ordinary compiler tokens as part of the main compilation in Step 7. The differences between a preprocessor token and a compiler token are small:
The preprocessor and the compiler might use different encodings for character and string literals.
The compiler treats integer and floating-point literals differently; the preprocessor does not.
The preprocessor recognizes <header> as a single token (for #include directives); the compiler does not.
An identifier is a name that you define or that is defined
in a library. An identifier begins with a nondigit character and is
followed by any number of digits and nondigits. A nondigit character
is a letter, an underscore, or one of a set of universal characters.
The exact set of nondigit universal characters is defined in the C++ standard and
in ISO/IEC PDTR 10176. Basically, this set contains the universal
characters that represent letters. Most programmers restrict
themselves to the characters a...z, A...Z, and underscore, but the standard permits letters in other languages.
Not all compilers support universal characters in identifiers.
Certain identifiers are reserved for use by the standard library:
Any identifier that contains two consecutive underscores (like__this) is reserved; that is, you cannot use such an identifier for macros, class members, global objects, or anything else.
Any identifier that starts with an underscore followed by a capital letter (A-Z) is reserved.
Any identifier that starts with an underscore is reserved in the global namespace. You can use such names in other contexts (such as class members and local names).
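The three rules above can be sketched as follows. The namespace app, the class Counter, and every variable name are hypothetical; the reserved spellings appear only in comments so that the example stays legal.

```cpp
#include <cassert>

// int __total;  // reserved everywhere: two consecutive underscores
// int _Total;   // reserved everywhere: underscore followed by a capital letter
// int _total;   // reserved here: leading underscore in the global namespace

namespace app {      // hypothetical namespace for this sketch
    int _total = 0;  // allowed: not in the global namespace
}

class Counter {      // hypothetical class
public:
    Counter() : _count(0) {}
    void bump() { ++_count; ++app::_total; }
    int count() const { return _count; }
private:
    int _count;      // allowed: class member, underscore plus lowercase letter
};

// Self-check: the underscore-prefixed names behave like any other names.
void check_counter() {
    Counter c;
    c.bump();
    assert(c.count() == 1);
}
```

Many style guides avoid leading underscores altogether, which sidesteps the need to remember which contexts are safe.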
The C standard reserves some identifiers for future use. These identifiers fall into two categories: function names and macro names. Function names are reserved and should not be used as global function or object names; you should also avoid using them as "C" linkage names in any namespace. Note that the C standard reserves these names regardless of which headers you #include. The reserved function names are:
Names that start with is followed by a lowercase letter, such as isblank
Names that start with mem followed by a lowercase letter, such as memxyz
Names that start with str followed by a lowercase letter, such as strtof
Names that start with to followed by a lowercase letter, such as toxyz
Names that start with wcs followed by a lowercase letter, such as wcstof
Names in <cmath> with f or l appended, such as cosf and sinl
Macro names are reserved in all contexts. Do not use any of the following reserved macro names:
Identifiers that start with E followed by a digit or an uppercase letter
Identifiers that start with LC_ followed by an uppercase letter
Identifiers that start with SIG or SIG_ followed by an uppercase letter
A keyword is an identifier that is reserved in all contexts for special use by the language. The following is a list of all the reserved keywords. (Note that some compilers do not implement all of the reserved keywords; these compilers allow you to use certain keywords as identifiers. See Section 1.5 later in this chapter for more information.)
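and | and_eq | asm | auto | bitand |
bitor | bool | break | case | catch |
char | class | compl | const | const_cast |
continue | default | delete | do | double |
dynamic_cast | else | enum | explicit | export |
extern | false | float | for | friend |
goto | if | inline | int | long |
mutable | namespace | new | not | not_eq |
operator | or | or_eq | private | protected |
public | register | reinterpret_cast | return | short |
signed | sizeof | static | static_cast | struct |
switch | template | this | throw | true |
try | typedef | typeid | typename | union |
unsigned | using | virtual | void | volatile |
wchar_t | while | xor | xor_eq |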
A literal is an integer, floating-point, Boolean, character, or string constant.
An integer literal can be a decimal, octal, or hexadecimal constant. A prefix specifies the base or radix: 0x or 0X for hexadecimal, 0 for octal, and nothing for decimal. An integer literal can also have a suffix that is a combination of U and L, for unsigned and long, respectively. The suffix can be uppercase or lowercase and can be in any order. The suffix and prefix are interpreted as follows:
If the suffix is UL (or ul, LU, etc.), the literal's type is unsigned long.
If the suffix is L, the literal's type is long or unsigned long, whichever fits first. (That is, if the value fits in a long, the type is long; otherwise, the type is unsigned long. An error results if the value does not fit in an unsigned long.)
If the suffix is U, the type is unsigned or unsigned long, whichever fits first.
Without a suffix, a decimal integer has type int or long, whichever fits first.
An octal or hexadecimal literal has type int, unsigned, long, or unsigned long, whichever fits first.
Some compilers offer other suffixes as extensions to the standard. See Appendix A for examples.
Here are some examples of integer literals:
314      // Legal
314u     // Legal
314LU    // Legal
0xFeeL   // Legal
0ul      // Legal
078      // Illegal: 8 is not an octal digit
032UU    // Illegal: cannot repeat a suffix
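The legal examples can be checked against the suffix rules at compile time. This sketch assumes a compiler that supports static_assert, decltype, and the <type_traits> header (C++11 or later), none of which the text relies on; the constant and function names are hypothetical. The values are small enough to fit in the first candidate type on any conforming implementation.

```cpp
#include <cassert>
#include <type_traits>

// Each check compares the type the compiler gives a literal
// with the type predicted by the prefix and suffix rules.
static_assert(std::is_same<decltype(314), int>::value, "no suffix, fits in int");
static_assert(std::is_same<decltype(314u), unsigned>::value, "U suffix, fits in unsigned");
static_assert(std::is_same<decltype(314LU), unsigned long>::value, "LU suffix");
static_assert(std::is_same<decltype(0xFeeL), long>::value, "L suffix, fits in long");
static_assert(std::is_same<decltype(0ul), unsigned long>::value, "ul suffix");

// The prefix changes only the base, not the value: all three are twenty-six.
const int kDecimal = 26;
const int kOctal = 032;
const int kHex = 0x1A;

// Self-check of the three spellings.
void check_bases() {
    assert(kOctal == kDecimal && kHex == kDecimal);
}
```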
A floating-point literal has an integer part, a decimal point, a fractional part, and an exponent part. You must include the decimal point, the exponent, or both. You must include the integer part, the fractional part, or both. The signed exponent is introduced by e or E. The literal's type is double unless there is a suffix: F for type float and L for long double. The suffix can be uppercase or lowercase.
Here are some examples of floating-point literals:
3.14159      // Legal
.314159F     // Legal
314159E-5L   // Legal
314.         // Legal
314E         // Illegal: incomplete exponent
314f         // Illegal: no decimal or exponent
.e24         // Illegal: missing integer or fraction
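The types of the legal examples can be confirmed in the same way. This is a sketch that assumes C++11 or later for static_assert and decltype; nearly_equal and check_float_literals are hypothetical helpers, and the tolerance is an arbitrary choice for the illustration.

```cpp
#include <cassert>
#include <cmath>
#include <type_traits>

static_assert(std::is_same<decltype(3.14159), double>::value, "no suffix: double");
static_assert(std::is_same<decltype(.314159F), float>::value, "F suffix: float");
static_assert(std::is_same<decltype(314159E-5L), long double>::value, "L suffix: long double");
static_assert(std::is_same<decltype(314.), double>::value, "fraction part omitted: still double");

// Hypothetical helper: true when two values differ by less than a small tolerance.
bool nearly_equal(long double a, long double b) {
    return std::fabs(a - b) < 1e-9L;
}

// Self-check: different spellings of the same number produce the same value.
void check_float_literals() {
    assert(nearly_equal(314159E-5L, 3.14159L));
    assert(314. == 314.0);
}
```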
Character literals are enclosed in single quotes. If the literal begins with L (uppercase only), it is a wide character literal (e.g., L'x'). Otherwise, it is a narrow character literal (e.g., 'x'). Narrow characters are used more frequently than wide characters, so the "narrow" adjective is usually dropped.
The value of a narrow or wide character literal is the value of the character's encoding in the execution character set. If the literal contains more than one character, the literal value is implementation-defined. Note that a character might have different encodings in different locales. Consult your compiler's documentation to learn which encoding it uses for character literals.
A narrow character literal with a single character has type char. With more than one character, the type is int (e.g., 'abc'). The type of a wide character literal is always wchar_t.
In C, a character literal always has type int. C++ changed the type of character literals to support overloading, especially for I/O (e.g., cout << '\n' starts a new line and does not print the integer value of the newline character).
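The effect on overloading can be sketched with a string stream in place of cout. The helper to_text and the check function are hypothetical, and the static_assert lines assume C++11 or later.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <type_traits>

static_assert(std::is_same<decltype('x'), char>::value, "narrow literal has type char");
static_assert(std::is_same<decltype(L'x'), wchar_t>::value, "wide literal has type wchar_t");
static_assert(sizeof('x') == 1, "in C this would be sizeof(int)");

// Hypothetical helper: formats a value with operator<< and returns the text.
template <typename T>
std::string to_text(T value) {
    std::ostringstream out;
    out << value;  // overload resolution picks character or integer output
    return out.str();
}

// Self-check: a char prints as a character, an int prints as digits.
void check_char_output() {
    assert(to_text('A') == "A");
    assert(to_text(static_cast<int>('A')) != "A");
}
```

The digits produced for the int case depend on the execution character set, so the sketch checks only that they differ from the character itself.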
A character literal can be a plain character (e.g., 'x'), an escape sequence (e.g., '\b'), or a universal character (e.g., '\u03C0'). Table 1-1 lists the possible escape sequences. Note that you must use an escape sequence for a backslash or single-quote character literal. Using an escape for a double quote or question mark is optional. Only the characters shown in Table 1-1 are allowed in an escape sequence. (Some compilers extend the standard and recognize other escape sequences.)
Table 1-1. Character escape sequences
Escape sequence | Meaning |
---|---|
\\ | \ character |
\' | ' character |
\" | " character |
\? | ? character |
\a | Alert or bell |
\b | Backspace |
\f | Form feed |
\n | Newline |
\r | Carriage return |
\t | Horizontal tab |
\v | Vertical tab |
\ooo | Octal number of one to three digits |
\xhh... | Hexadecimal number of one or more digits |
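A few of these escapes in use, as a rough sketch. All names are hypothetical. The octal and hexadecimal lines assume an ASCII-compatible execution character set, which the standard does not guarantee, and the static_assert lines assume C++11 or later.

```cpp
#include <cassert>

// Backslash and single quote must be escaped inside a character literal.
const char kBackslash = '\\';
const char kSingleQuote = '\'';

// Escaping a double quote or question mark is optional in a character literal.
static_assert('"' == '\"', "escaped and plain double quote are the same character");
static_assert('?' == '\?', "escaped and plain question mark are the same character");

// Octal and hexadecimal escapes name a character by its numeric value.
// Assumption: ASCII-compatible execution character set, where A is 65.
const char kOctalA = '\101';
const char kHexA = '\x41';

// Each escape sequence counts as one character: a, tab, b, newline.
const char kFourChars[] = "a\tb\n";

// Self-check of the ASCII assumption and of the array length.
void check_escapes() {
    assert(kOctalA == 'A' && kHexA == 'A');
    assert(sizeof(kFourChars) == 5);  // four characters plus the null
}
```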
String literals are enclosed in double quotes. A string contains characters that are similar to character literals: plain characters, escape sequences, and universal characters. A string cannot cross a line boundary in the source file, but it can contain escaped line endings (backslash followed by newline).
A wide string literal is prefaced with L (always uppercase). In a wide string literal, a single universal character always maps to a single wide character. In a narrow string literal, the implementation determines whether a universal character maps to one or multiple characters (called a multibyte character). See Chapter 8 for more information on multibyte characters.
Two adjacent string literals (possibly separated by whitespace, including new lines) are concatenated at compile time into a single string. This is often a convenient way to break a long string across multiple lines. Do not try to combine a narrow string with a wide string in this way.
After concatenating adjacent strings, the null character ('\0' or L'\0') is automatically appended after the last character in the string literal.
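The following sketch spells one string three ways: as a single literal, with an escaped line ending, and as adjacent literals. The constant names and the same_text helper are hypothetical. Note that the line after the backslash must start in the first column, because any leading spaces would become part of the string.

```cpp
#include <cassert>
#include <cstring>

// A single literal.
const char* const kPlain = "hello, reader";

// An escaped line ending: the backslash and the newline are removed.
const char* const kSpliced = "hello, \
reader";

// Adjacent literals, separated by whitespace and a newline, are concatenated.
const char* const kJoined = "hello, " "rea"
                            "der";

// Hypothetical helper: true when two C strings have identical contents.
bool same_text(const char* a, const char* b) {
    return std::strcmp(a, b) == 0;
}

// Self-check: one null character is appended after the concatenated text.
void check_concatenation() {
    assert(same_text(kPlain, kJoined));
    assert(kJoined[13] == '\0');
}
```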
Here are some examples of string literals. Note that the first three form identical strings.
"hello, reader"
"hello, \
reader"
"hello, " "rea" "der"
"Alert: \a; ASCII tab: \011; portable tab: \t"
"illegal: unterminated string
L"string with \"quotes\""
A string literal's type is an array of const char. For example, "string"'s type is const char[7]. Wide string literals are arrays of const wchar_t. All string literals have static lifetimes (see Chapter 2 for more information about lifetimes).
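The array type and the static lifetime can be checked with a short sketch. It assumes C++11 or later for static_assert and decltype, and the functions greeting and check_greeting are hypothetical.

```cpp
#include <cassert>
#include <type_traits>

// Six letters plus the terminating null character give seven elements.
static_assert(sizeof("string") == 7, "const char[7]");
static_assert(sizeof(L"string") == 7 * sizeof(wchar_t), "const wchar_t[7]");

// A string literal is an lvalue, so decltype yields a reference to the array.
static_assert(std::is_same<decltype("string"), const char (&)[7]>::value,
              "narrow literal is an array of const char");
static_assert(std::is_same<decltype(L"string"), const wchar_t (&)[7]>::value,
              "wide literal is an array of const wchar_t");

// Static lifetime: the returned pointer stays valid after the function returns.
const char* greeting() {
    return "string";
}

// Self-check: the seventh element is the null character.
void check_greeting() {
    const char* text = greeting();
    assert(text[0] == 's' && text[6] == '\0');
}
```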
As with an array of const
anything, the compiler can automatically convert the array to a
pointer to the array's first element. You can, for example, assign a
string literal to a suitable pointer object:
const char* ptr;
ptr = "string";
As a special case, you can also convert a string literal to a
non-const
pointer. Attempting to
modify the string results in undefined behavior. This conversion is
deprecated, and well-written code does not rely on it.
Nonalphabetic symbols are used as operators and as punctuation (e.g., statement terminators). Some symbols are made of multiple adjacent characters. The following are all the symbols used for operators and punctuation:
{ } [ ] # ## ( ) ; :
... ? :: . .* + - * / %
^ & | ~ ! = < > += -=
*= /= %= ^= &= |= << >> >>= <<=
== != <= >= && || ++ -- , ->*
->
You cannot insert whitespace between characters that make up a symbol, and C++ always collects as many characters as it can to form a symbol before trying to interpret the symbol. Thus, an expression such as x+++y is read as x ++ + y. A common error when first using templates is to omit a space between closing angle brackets in a nested template instantiation. The following is an example with that space:
std::list<std::vector<int> > list;
                          ↑ Note the space here.
The example is incorrect without the space character because the adjacent greater-than signs would be interpreted as a single right-shift operator, not as two separate closing angle brackets. Another, slightly less common, error is instantiating a template with a template argument that uses the global scope operator:
::std::list< ::std::list<int> > list;
            ↑                ↑ Space here and here
Again, a space is needed, this time between the angle bracket (<) and the scope operator (::), to prevent the compiler from seeing the first token as <: rather than <. The <: token is an alternative token, as described in Section 1.5 later in this chapter.
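Both spacing rules are shown together in the sketch below. The typedef names and the total_size helper are hypothetical. Later revisions of the language relax both rules, but the spaced form is accepted by every compiler, which makes it the portable spelling.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Space between the closing angle brackets, so they are not read as >>.
typedef std::list<std::vector<int> > VectorList;

// Space after the opening bracket, so < and :: are not read as <: and :.
typedef ::std::list< ::std::list<int> > ListOfLists;

// Hypothetical helper: total number of ints stored in all the vectors.
std::size_t total_size(const VectorList& lists) {
    std::size_t total = 0;
    for (VectorList::const_iterator it = lists.begin(); it != lists.end(); ++it) {
        total += it->size();
    }
    return total;
}

// Self-check: an empty list holds no elements.
void check_nested_templates() {
    VectorList empty;
    assert(total_size(empty) == 0);
}
```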