@c -*-texinfo-*-
@c This is part of the GNU Emacs Lisp Reference Manual.
@c Copyright (C) 1998, 1999, 2001, 2002, 2003, 2004,
@c   2005, 2006, 2007, 2008  Free Software Foundation, Inc.
@c See the file elisp.texi for copying conditions.
@setfilename ../../info/characters
@node Non-ASCII Characters, Searching and Matching, Text, Top
@chapter Non-@acronym{ASCII} Characters
@cindex multibyte characters
@cindex characters, multi-byte
@cindex non-@acronym{ASCII} characters

  This chapter covers the special issues relating to characters and
how they are stored in strings and buffers.

@menu
* Text Representations::    How Emacs represents text.
* Converting Representations::  Converting unibyte to multibyte and vice versa.
* Selecting a Representation::  Treating a byte sequence as unibyte or multi.
* Character Codes::         How unibyte and multibyte relate to
                                codes of individual characters.
* Character Sets::          The space of possible character codes
                                is divided into various character sets.
* Scanning Charsets::       Which character sets are used in a buffer?
* Translation of Characters::   Translation tables are used for conversion.
* Coding Systems::          Coding systems are conversions for saving files.
* Input Methods::           Input methods allow users to enter various
                                non-ASCII characters without special keyboards.
* Locales::                 Interacting with the POSIX locale.
@end menu

@node Text Representations
@section Text Representations
@cindex text representation

  Emacs buffers and strings support a large repertoire of characters
from many different scripts, allowing users to type and display text
in almost any known written language.

@cindex character codepoint
@cindex codespace
@cindex Unicode
  To support this multitude of characters and scripts, Emacs closely
follows the @dfn{Unicode Standard}.  The Unicode Standard assigns a
unique number, called a @dfn{codepoint}, to each and every character.
The range of codepoints defined by Unicode, or the Unicode
@dfn{codespace}, is @code{0..10FFFF} (in hex), inclusive.  Emacs
extends this range with codepoints in the range @code{110000..3FFFFF},
which it uses for representing characters that are not unified with
Unicode and raw 8-bit bytes that cannot be interpreted as characters
(the latter occupy the range @code{3FFF80..3FFFFF}).  Thus, a
character codepoint in Emacs is a 22-bit integer number.

@cindex internal representation of characters
@cindex characters, representation in buffers and strings
@cindex multibyte text
  To conserve memory, Emacs does not hold fixed-length 22-bit numbers
that are codepoints of text characters within buffers and strings.
Rather, Emacs uses a variable-length internal representation of
characters that stores each character as a sequence of 1 to 5 8-bit
bytes, depending on the magnitude of its codepoint@footnote{
This internal representation is based on one of the encodings defined
by the Unicode Standard, called @dfn{UTF-8}, for representing any
Unicode codepoint, but Emacs extends UTF-8 to represent the additional
codepoints it uses for raw 8-bit bytes and characters not unified with
Unicode.}.
For example, any @acronym{ASCII} character takes up only 1 byte, a
Latin-1 character takes up 2 bytes, etc.  We call this representation
of text @dfn{multibyte}, because it uses several bytes for each
character.

  Outside Emacs, characters can be represented in many different
encodings, such as ISO-8859-1, GB-2312, Big-5, etc.  Emacs converts
between these external encodings and the internal representation, as
appropriate, when it reads text into a buffer or a string, or when it
writes text to a disk file or passes it to some other process.

  Occasionally, Emacs needs to hold and manipulate encoded text or
binary non-text data in its buffers or strings.  For example, when
Emacs visits a file, it first reads the file's text verbatim into a
buffer, and only then converts it to the internal representation.
Before the conversion, the buffer holds encoded text.

@cindex unibyte text
  Encoded text is not really text, as far as Emacs is concerned, but
rather a sequence of raw 8-bit bytes.  We call buffers and strings
that hold encoded text @dfn{unibyte} buffers and strings, because
Emacs treats them as a sequence of individual bytes.  In particular,
Emacs usually displays unibyte buffers and strings as octal codes such
as @code{\237}.  We recommend that you never use unibyte buffers and
strings except for manipulating encoded text or binary non-text data.

  In a buffer, the buffer-local value of the variable
@code{enable-multibyte-characters} specifies the representation used.
The representation for a string is determined and recorded in the string
when the string is constructed.

@defvar enable-multibyte-characters
This variable specifies the current buffer's text representation.
If it is non-@code{nil}, the buffer contains multibyte text; otherwise,
it contains unibyte encoded text or binary non-text data.

You cannot set this variable directly; instead, use the function
@code{set-buffer-multibyte} to change a buffer's representation.
@end defvar
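
  As a minimal sketch, assuming Emacs was not started with
@samp{--unibyte} (so that new buffers default to multibyte), you can
watch the variable change when @code{set-buffer-multibyte} is called
in a temporary buffer:

@example
@group
(with-temp-buffer
  ;; Read the variable before and after switching to unibyte.
  (list enable-multibyte-characters
        (progn (set-buffer-multibyte nil)
               enable-multibyte-characters)))
     @result{} (t nil)
@end group
@end example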

@defvar default-enable-multibyte-characters
This variable's value is entirely equivalent to @code{(default-value
'enable-multibyte-characters)}, and setting this variable changes that
default value.  Setting the local binding of
@code{enable-multibyte-characters} in a specific buffer is not allowed,
but changing the default value is supported, and it is a reasonable
thing to do, because it has no effect on existing buffers.

The @samp{--unibyte} command line option does its job by setting the
default value to @code{nil} early in startup.
@end defvar

@defun position-bytes position
Buffer positions are measured in character units.  This function
returns the byte-position corresponding to buffer position
@var{position} in the current buffer.  This is 1 at the start of the
buffer, and counts upward in bytes.  If @var{position} is out of
range, the value is @code{nil}.
@end defun

@defun byte-to-position byte-position
Return the buffer position, in character units, corresponding to given
@var{byte-position} in the current buffer.  If @var{byte-position} is
out of range, the value is @code{nil}.  In a multibyte buffer, an
arbitrary value of @var{byte-position} may fall not on a character
boundary but inside a multibyte sequence representing a single
character; in this case, this function returns the buffer position of
the character whose multibyte sequence includes @var{byte-position}.
In other words, the value does not change for all byte positions that
belong to the same character.
@end defun
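
  The following sketch illustrates @code{position-bytes} and
@code{byte-to-position} in a temporary multibyte buffer.  It assumes
the default multibyte setting; the character code @code{#xE9} (a
Latin-1 letter) is chosen only for illustration, and it occupies two
bytes in the internal representation.

@example
@group
(with-temp-buffer
  (insert (string #xE9 ?a))    ; a 2-byte character, then a 1-byte one
  (list (position-bytes 1)     ; start of buffer
        (position-bytes 2)     ; the second character starts at byte 3
        (byte-to-position 2)   ; byte 2 lies inside the first character
        (byte-to-position 3)))
     @result{} (1 3 1 2)
@end group
@end example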

@defun multibyte-string-p string
Return @code{t} if @var{string} is a multibyte string, @code{nil}
otherwise.
@end defun

@defun string-bytes string
@cindex string, number of bytes
This function returns the number of bytes in @var{string}.
If @var{string} is a multibyte string, this can be greater than
@code{(length @var{string})}.
@end defun

@defun unibyte-string &rest bytes
This function concatenates all its argument @var{bytes} and makes the
result a unibyte string.
@end defun
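
  As an illustration of the three functions above, this sketch builds
a multibyte string and a unibyte string from the same two numbers and
compares them.  The variable names @code{multi} and @code{uni} are
ours, chosen for the example.

@example
@group
(let ((multi (string #xE9 ?a))          ; two characters
      (uni (unibyte-string #xE9 ?a)))   ; two raw bytes
  (list (multibyte-string-p multi) (length multi) (string-bytes multi)
        (multibyte-string-p uni) (length uni) (string-bytes uni)))
     @result{} (t 2 3 nil 2 2)
@end group
@end example

  The multibyte string needs three bytes for its two characters,
whereas the unibyte string always has as many bytes as characters.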

@node Converting Representations
@section Converting Text Representations

  Emacs can convert unibyte text to multibyte; it can also convert
multibyte text to unibyte, provided that the multibyte text contains
only @acronym{ASCII} and 8-bit raw bytes.  In general, these
conversions happen when inserting text into a buffer, or when putting
text from several strings together in one string.  You can also
explicitly convert a string's contents to either representation.

  Emacs chooses the representation for a string based on the text that
it is constructed from.  The general rule is to convert unibyte text to
multibyte text when combining it with other multibyte text, because the
multibyte representation is more general and can hold whatever
characters the unibyte text has.

  When inserting text into a buffer, Emacs converts the text to the
buffer's representation, as specified by
@code{enable-multibyte-characters} in that buffer.  In particular, when
you insert multibyte text into a unibyte buffer, Emacs converts the text
to unibyte, even though this conversion cannot in general preserve all
the characters that might be in the multibyte text.  The other natural
alternative, to convert the buffer contents to multibyte, is not
acceptable because the buffer's representation is a choice made by the
user that cannot be overridden automatically.

  Converting unibyte text to multibyte text leaves @acronym{ASCII} characters
unchanged, and converts bytes with codes 128 through 255 to the
multibyte representation of raw eight-bit bytes.

  Converting multibyte text to unibyte converts all @acronym{ASCII}
and eight-bit characters to their single-byte form, but loses
information for non-@acronym{ASCII} characters by discarding all but
the low 8 bits of each character's codepoint.  Converting unibyte text
to multibyte and back to unibyte reproduces the original unibyte text.

The next two functions either return the argument @var{string}, or a
newly created string with no text properties.

@defun string-to-multibyte string
This function returns a multibyte string containing the same sequence
of characters as @var{string}.  If @var{string} is a multibyte string,
it is returned unchanged.  The function assumes that @var{string}
includes only @acronym{ASCII} characters and raw 8-bit bytes; the
latter are converted to their multibyte representation corresponding
to the codepoints in the @code{3FFF80..3FFFFF} area (@pxref{Text
Representations, codepoints}).
@end defun

@defun string-to-unibyte string
This function returns a unibyte string containing the same sequence of
characters as @var{string}.  It signals an error if @var{string}
contains a non-@acronym{ASCII} character.  If @var{string} is a
unibyte string, it is returned unchanged.  Use this function for
@var{string} arguments that contain only @acronym{ASCII} and eight-bit
characters.
@end defun
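
  Here is a small sketch of a round trip through
@code{string-to-multibyte} and @code{string-to-unibyte}.  The names
@code{raw} and @code{multi} are ours; the second form uses a Greek
letter (code @code{#x3B1}) merely as an example of a character that
has no unibyte form.

@example
@group
(let* ((raw (unibyte-string ?a #x80))
       (multi (string-to-multibyte raw)))
  (list (multibyte-string-p multi)
        (aref multi 1)                 ; #x3FFF80, a raw-byte codepoint
        (string-bytes multi)           ; 1 byte + 2 bytes
        (equal (string-to-unibyte multi) raw)))
     @result{} (t 4194176 3 t)
@end group
@group
;; A non-ASCII, non-eight-bit character cannot be converted.
(condition-case nil
    (string-to-unibyte (string #x3B1))
  (error 'failed))
     @result{} failed
@end group
@end example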

@defun multibyte-char-to-unibyte char
This function converts the multibyte character @var{char} to a unibyte
character.  If @var{char} is a character that is neither
@acronym{ASCII} nor eight-bit, the value is -1.
@end defun

@defun unibyte-char-to-multibyte char
This function converts the unibyte character @var{char} to a multibyte
character, assuming @var{char} is either @acronym{ASCII} or a raw 8-bit
byte.
@end defun
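
  For instance, the following sketch converts an @acronym{ASCII}
character and the raw byte @code{#x80} in each direction; the byte
value is an arbitrary choice for illustration.

@example
@group
(list (unibyte-char-to-multibyte ?a)
      (unibyte-char-to-multibyte #x80)        ; becomes #x3FFF80
      (multibyte-char-to-unibyte #x3FFF80))   ; and back again
     @result{} (97 4194176 128)
@end group
@end example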

@node Selecting a Representation
@section Selecting a Representation

  Sometimes it is useful to examine an existing buffer or string as
multibyte when it was unibyte, or vice versa.

@defun set-buffer-multibyte multibyte
Set the representation type of the current buffer.  If @var{multibyte}
is non-@code{nil}, the buffer becomes multibyte.  If @var{multibyte}
is @code{nil}, the buffer becomes unibyte.

This function leaves the buffer contents unchanged when viewed as a
sequence of bytes.  As a consequence, it can change the contents
viewed as characters; a sequence of three bytes which is treated as
one character in multibyte representation will count as three
characters in unibyte representation.  Eight-bit characters
representing raw bytes are an exception.  They are represented by one
byte in a unibyte buffer, but when the buffer is set to multibyte,
they are converted to two-byte sequences, and vice versa.

This function sets @code{enable-multibyte-characters} to record which
representation is in use.  It also adjusts various data in the buffer
(including overlays, text properties and markers) so that they cover the
same text as they did before.

You cannot use @code{set-buffer-multibyte} on an indirect buffer,
because indirect buffers always inherit the representation of the
base buffer.
@end defun
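
  As a rough sketch of the effect described above, assuming a
temporary buffer that starts out multibyte, a character stored in
three bytes (here the Euro sign, code @code{#x20AC}, picked for
illustration) turns into three characters once the buffer becomes
unibyte:

@example
@group
(with-temp-buffer
  (insert (string #x20AC))          ; one character, three bytes
  (let ((before (buffer-size)))
    (set-buffer-multibyte nil)      ; same bytes, viewed one by one
    (list before (buffer-size))))
     @result{} (1 3)
@end group
@end example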

@defun string-as-unibyte string
This function returns a string with the same bytes as @var{string} but
treating each byte as a character.  This means that the value may have
more characters than @var{string} has.  Eight-bit characters
representing raw bytes are an exception: each one of them is converted
to a single byte.

If @var{string} is already a unibyte string, then the value is
@var{string} itself.  Otherwise it is a newly created string, with no
text properties.
@end defun

@defun string-as-multibyte string
This function returns a string with the same bytes as @var{string} but
treating each multibyte sequence as one character.  This means that
the value may have fewer characters than @var{string} has.  If a byte
sequence in @var{string} is invalid as a multibyte representation of a
single character, each byte in the sequence is treated as a raw 8-bit
byte.

If @var{string} is already a multibyte string, then the value is
@var{string} itself.  Otherwise it is a newly created string, with no
text properties.
@end defun
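
  The two string functions above can be sketched as a round trip.
This example is ours, not a recommended idiom; it reinterprets the
three bytes of one character (code @code{#x20AC}, chosen for
illustration) as three unibyte characters and then back:

@example
@group
(let* ((multi (string #x20AC))
       (bytes (string-as-unibyte multi)))
  (list (length bytes)
        (equal (string-as-multibyte bytes) multi)))
     @result{} (3 t)
@end group
@end example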

@node Character Codes
@section Character Codes
@cindex character codes

  The unibyte and multibyte text representations use different
character codes.  The valid character codes for unibyte representation
range from 0 to 255---the values that can fit in one byte.  The valid
character codes for multibyte representation range from 0 to 4194303
(#x3FFFFF).  In this code space, values 0 through 127 are for
@acronym{ASCII} characters, and values 128 through 4194175 (#x3FFF7F)
are for non-@acronym{ASCII} characters.  Values 0 through 1114111
(#x10FFFF) correspond to Unicode characters of the same codepoint,
while values 4194176 (#x3FFF80) through 4194303 (#x3FFFFF) are for
representing eight-bit raw bytes.

@defun characterp charcode
This returns @code{t} if @var{charcode} is a valid character, and
@code{nil} otherwise.

@example
@group
(characterp 65)
     @result{} t
@end group
@group
(characterp 4194303)
     @result{} t
@end group
@group
(characterp 4194304)
     @result{} nil
@end group
@end example
@end defun

@cindex maximum value of character codepoint
@cindex codepoint, largest value
@defun max-char
This function returns the largest value that a valid character
codepoint can have.

@example
@group
(characterp (max-char))
     @result{} t
@end group
@group
(characterp (1+ (max-char)))
     @result{} nil
@end group
@end example
@end defun

@defun get-byte pos &optional string
This function returns the byte at character position @var{pos} in the
current buffer.  If the current buffer is unibyte, this is literally the
byte at that position.  If the buffer is multibyte, byte values of
@acronym{ASCII} characters are the same as character codepoints,
whereas eight-bit raw bytes are converted to their 8-bit codes.  The
function signals an error if the character at @var{pos} is
non-@acronym{ASCII}.

The optional argument @var{string} means to get a byte value from that
string instead of the current buffer.
@end defun
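
  A brief sketch of @code{get-byte} with the @var{string} argument
follows.  It assumes that @var{pos} is then a zero-based index into
the string; the byte value @code{#xC8} is arbitrary.

@example
@group
(list (get-byte 0 "a")
      ;; A raw byte in a unibyte string is returned as is.
      (get-byte 0 (unibyte-string #xC8))
      ;; A raw byte in a multibyte string is mapped back to its 8-bit code.
      (get-byte 0 (string-to-multibyte (unibyte-string #xC8))))
     @result{} (97 200 200)
@end group
@end example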

@node Character Sets
@section Character Sets
@cindex character sets

@cindex charset
@cindex coded character set
An Emacs @dfn{character set}, or @dfn{charset}, is a set of characters
in which each character is assigned a numeric code point.  (The
Unicode standard calls this a @dfn{coded character set}.)  Each Emacs
charset has a name which is a symbol.  A single character can belong
to any number of different character sets, but it will generally have
a different code point in each charset.  Examples of character sets
include @code{ascii}, @code{iso-8859-1}, @code{greek-iso8859-7}, and
@code{windows-1255}.  The code point assigned to a character in a
charset is usually different from its code point used in Emacs buffers
and strings.

@cindex @code{emacs}, a charset
@cindex @code{unicode}, a charset
@cindex @code{eight-bit}, a charset
  Emacs defines several special character sets.  The character set
@code{unicode} includes all the characters whose Emacs code points are
in the range @code{0..10FFFF}.  The character set @code{emacs}
includes all @acronym{ASCII} and non-@acronym{ASCII} characters.
Finally, the @code{eight-bit} charset includes the 8-bit raw bytes;
Emacs uses it to represent raw bytes encountered in text.

@defun charsetp object
Returns @code{t} if @var{object} is a symbol that names a character set,
@code{nil} otherwise.
@end defun

@defvar charset-list
The value is a list of all defined character set names.
@end defvar

@defun charset-priority-list &optional highestp
This function returns a list of all defined character sets ordered by
their priority.  If @var{highestp} is non-@code{nil}, the function
returns a single character set of the highest priority.
@end defun

@defun set-charset-priority &rest charsets
This function makes @var{charsets} the highest priority character sets.
@end defun
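
  As a small consistency check, sketched here without assuming any
particular priority order, the single charset returned when
@var{highestp} is non-@code{nil} should be the first element of the
full list:

@example
@group
(eq (car (charset-priority-list))
    (charset-priority-list t))
     @result{} t
@end group
@end example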

@defun char-charset character
This function returns the name of the character set of highest
priority that @var{character} belongs to.  @acronym{ASCII} characters
are an exception: for them, this function always returns @code{ascii}.
@end defun
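
  For example, the sketch below tests two symbols with
@code{charsetp} and asks for the charset of an @acronym{ASCII}
character.  The symbol @code{no-such-charset} is a made-up name,
assumed not to be defined as a charset.

@example
@group
(list (charsetp 'unicode)
      (charsetp 'no-such-charset)   ; hypothetical, undefined name
      (char-charset ?a))
     @result{} (t nil ascii)
@end group
@end example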

@defun charset-plist charset
This function returns the property list of the character set
@var{charset}.  Although @var{charset} is a symbol, this is not the
same as the property list of that symbol.  Charset properties include
important information about the charset, such as its documentation
string, short name, etc.
@end defun

@defun put-charset-property charset propname value
This function sets the @var{propname} property of @var{charset} to the
given @var{value}.
@end defun

@defun get-charset-property charset propname
This function returns the value of @var{charset}'s property
@var{propname}.
@end defun

@deffn Command list-charset-chars charset
This command displays a list of characters in the character set
@var{charset}.
@end deffn

  Emacs can convert between its internal representation of a character
and the character's codepoint in a specific charset.  The following
two functions support these conversions.

@c FIXME: decode-char and encode-char accept and ignore an additional
@c argument @var{restriction}.  When that argument actually makes a
@c difference, it should be documented here.
@defun decode-char charset code-point
This function decodes a character that is assigned a @var{code-point}
in @var{charset}, to the corresponding Emacs character, and returns
it.  If @var{charset} doesn't contain a character of that code point,
the value is @code{nil}.  If @var{code-point} doesn't fit in a Lisp
integer (@pxref{Integer Basics, most-positive-fixnum}), it can be
specified as a cons cell @code{(@var{high} . @var{low})}, where
@var{low} are the lower 16 bits of the value and @var{high} are the
high 16 bits.
@end defun

@defun encode-char char charset
This function returns the code point assigned to the character
@var{char} in @var{charset}.  If the result does not fit in a Lisp
integer, it is returned as a cons cell @code{(@var{high} . @var{low})}
that fits the second argument of @code{decode-char} above.  If
@var{charset} doesn't have a codepoint for @var{char}, the value is
@code{nil}.
@end defun
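
  The two conversions can be sketched as follows.  This example
assumes that the @code{iso-8859-7} and @code{iso-8859-1} charsets are
defined, as they are in a standard Emacs, and uses the Greek small
letter alpha (Emacs codepoint @code{#x3B1}, code point @code{#xE1} in
ISO 8859-7) purely for illustration.

@example
@group
(list (decode-char 'iso-8859-7 #xE1)    ; charset code -> Emacs character
      (encode-char #x3B1 'iso-8859-7)   ; Emacs character -> charset code
      (encode-char #x3B1 'iso-8859-1))  ; alpha is not in Latin-1
     @result{} (945 225 nil)
@end group
@end example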

@node Scanning Charsets
@section Scanning for Character Sets

  Sometimes it is useful to find out, for characters that appear in a
certain part of a buffer or a string, to which character sets they
belong.  One use for this is in determining which coding systems
(@pxref{Coding Systems}) are capable of representing all of the text
in question; another is to determine the font(s) for displaying that
text.

@defun charset-after &optional pos
This function returns the charset of highest priority containing the
character in the current buffer at position @var{pos}.  If @var{pos}
is omitted or @code{nil}, it defaults to the current value of point.
If @var{pos} is out of range, the value is @code{nil}.
@end defun

@defun find-charset-region beg end &optional translation
This function returns a list of the character sets of highest priority
that contain characters in the current buffer between positions
@var{beg} and @var{end}.

The optional argument @var{translation} specifies a translation table to
be used in scanning the text (@pxref{Translation of Characters}).  If it
is non-@code{nil}, then each character in the region is translated
through this table, and the value returned describes the translated
characters instead of the characters actually in the buffer.
@end defun

@defun find-charset-string string &optional translation
This function returns a list of the character sets of highest priority
that contain characters in @var{string}.  It is just like
@code{find-charset-region}, except that it applies to the contents of
@var{string} instead of part of the current buffer.
@end defun
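
  As a minimal sketch, a pure @acronym{ASCII} string yields just the
@code{ascii} charset.  For other text the answer depends on the
current charset priorities, so no result is shown for the second form
(whose Greek letter is an arbitrary choice).

@example
@group
(find-charset-string "abc")
     @result{} (ascii)
@end group
@group
;; The value here varies with the charset priority list.
(find-charset-string (string #x3B1))
@end group
@end example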

@node Translation of Characters
@section Translation of Characters
@cindex character translation tables
@cindex translation tables

  A @dfn{translation table} is a char-table (@pxref{Char-Tables}) that
specifies a mapping of characters into characters.  These tables are
used in encoding and decoding, and for other purposes.  Some coding
systems specify their own particular translation tables; there are
also default translation tables which apply to all other coding
systems.

  A translation table has two extra slots.  The first is either
@code{nil} or a translation table that performs the reverse
translation; the second is the maximum number of characters to look up
for translating sequences of characters (see the description of
@code{make-translation-table-from-alist} below).

@defun make-translation-table &rest translations
This function returns a translation table based on the argument
@var{translations}.  Each element of @var{translations} should be a
list of elements of the form @code{(@var{from} . @var{to})}; this says
to translate the character @var{from} into @var{to}.

The arguments and the forms in each argument are processed in order,
and if a previous form already translates @var{to} to some other
character, say @var{to-alt}, @var{from} is also translated to
@var{to-alt}.
@end defun
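
  The ordering rule can be illustrated with a small sketch.  Since a
translation table is a char-table, we look up entries with
@code{aref}; the particular letters are arbitrary.

@example
@group
(let ((table (make-translation-table '((?a . ?b) (?c . ?a)))))
  (list (aref table ?a)     ; ?a is translated to ?b
        (aref table ?c)     ; ?a already maps to ?b, so ?c does too
        (aref table ?z)))   ; no translation recorded
     @result{} (98 98 nil)
@end group
@end example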

  During decoding, the translation table's translations are applied to
the characters that result from ordinary decoding.  If a coding system
has property @code{:decode-translation-table}, that specifies the
translation table to use, or a list of translation tables to apply in
sequence.  (This is a property of the coding system, as returned by
@code{coding-system-get}, not a property of the symbol that is the
coding system's name.  @xref{Coding System Basics,, Basic Concepts of
Coding Systems}.)  Finally, if
@code{standard-translation-table-for-decode} is non-@code{nil}, the
resulting characters are translated by that table.

  During encoding, the translation table's translations are applied to
the characters in the buffer, and the result of translation is
actually encoded.  If a coding system has property
@code{:encode-translation-table}, that specifies the translation table
to use, or a list of translation tables to apply in sequence.  In
addition, if the variable @code{standard-translation-table-for-encode}
is non-@code{nil}, it specifies the translation table to use for
translating the result.

@defvar standard-translation-table-for-decode
This is the default translation table for decoding.  If a coding
system specifies its own translation tables, the table that is the
value of this variable, if non-@code{nil}, is applied after them.
@end defvar

@defvar standard-translation-table-for-encode
This is the default translation table for encoding.  If a coding
system specifies its own translation tables, the table that is the
value of this variable, if non-@code{nil}, is applied after them.
@end defvar

@defun make-translation-table-from-vector vec
This function returns a translation table made from @var{vec}, an
array of 256 elements that maps byte values 0 through 255 to
characters.  Elements may be @code{nil} for untranslated bytes.  The
returned table has a translation table for reverse mapping in the
first extra slot, and the value @code{1} in the second extra slot.

This function provides an easy way to make a private coding system
that maps each byte to a specific character.  You can specify the
returned table and the reverse translation table using the properties
@code{:decode-translation-table} and @code{:encode-translation-table}
respectively in the @var{props} argument to
@code{define-coding-system}.
@end defun
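
  Here is a minimal sketch of building such a table.  The single
mapping in it (byte 128 to the Euro sign, code @code{#x20AC}) is a
hypothetical choice for illustration and does not describe any
particular coding system.

@example
@group
(let ((vec (make-vector 256 nil))
      table)
  (aset vec #x80 #x20AC)        ; hypothetical mapping for byte 128
  (setq table (make-translation-table-from-vector vec))
  (list (aref table #x80)
        ;; The first extra slot holds the reverse table.
        (aref (char-table-extra-slot table 0) #x20AC)
        (char-table-extra-slot table 1)))
     @result{} (8364 128 1)
@end group
@end example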

@defun make-translation-table-from-alist alist
This function is similar to @code{make-translation-table} but returns
a complex translation table rather than a simple one-to-one mapping.
Each element of @var{alist} is of the form @code{(@var{from}
. @var{to})}, where @var{from} and @var{to} are either a character or
a vector specifying a sequence of characters.  If @var{from} is a
character, that character is translated to @var{to} (i.e.@: to a
character or a character sequence).  If @var{from} is a vector of
characters, that sequence is translated to @var{to}.  The returned
table has a translation table for reverse mapping in the first extra
slot, and the maximum length of all the @var{from} character sequences
in the second extra slot.
@end defun
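
  As a sketch only, the following form builds a table with one
character-to-character entry and one sequence-to-character entry, then
retrieves the two extra slots described above.  The letters are
arbitrary, and we show no result because the printed form of a
char-table is not informative.

@example
@group
(let ((table (make-translation-table-from-alist
              '((?a . ?b)           ; single character
                ([?x ?y] . ?z)))))  ; two-character sequence
  (list (char-table-extra-slot table 0)    ; reverse translation table
        (char-table-extra-slot table 1)))  ; longest FROM sequence
@end group
@end example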

@node Coding Systems
@section Coding Systems

@cindex coding system
  When Emacs reads or writes a file, and when Emacs sends text to a
subprocess or receives text from a subprocess, it normally performs
character code conversion and end-of-line conversion as specified
by a particular @dfn{coding system}.

  How to define a coding system is an arcane matter, and is not
documented here.

@menu
* Coding System Basics::        Basic concepts.
* Encoding and I/O::            How file I/O functions handle coding systems.
* Lisp and Coding Systems::     Functions to operate on coding system names.
* User-Chosen Coding Systems::  Asking the user to choose a coding system.
* Default Coding Systems::      Controlling the default choices.
* Specifying Coding Systems::   Requesting a particular coding system
                                    for a single file operation.
* Explicit Encoding::           Encoding or decoding text without doing I/O.
* Terminal I/O Encoding::       Use of encoding for terminal I/O.
* MS-DOS File Types::           How DOS "text" and "binary" files
                                    relate to coding systems.
@end menu

@node Coding System Basics
@subsection Basic Concepts of Coding Systems

@cindex character code conversion
  @dfn{Character code conversion} involves conversion between the
internal representation of characters used inside Emacs and some other
encoding.  Emacs supports many different encodings, in that it can
convert to and from them.  For example, it can convert text to or from
encodings such as Latin 1, Latin 2, Latin 3, Latin 4, Latin 5, and
several variants of ISO 2022.  In some cases, Emacs supports several
alternative encodings for the same characters; for example, there are
three coding systems for the Cyrillic (Russian) alphabet: ISO,
Alternativnyj, and KOI8.
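
  As a rough illustration of character code conversion, the sketch
below encodes the same one-character text in two encodings and decodes
it again.  It relies on @code{encode-coding-string} and
@code{decode-coding-string}, which convert strings without doing any
I/O; the variable names are ours.

@example
@group
(let* ((text (string #xE9))    ; one Latin-1 letter
       (latin1 (encode-coding-string text 'iso-8859-1))
       (utf8 (encode-coding-string text 'utf-8)))
  (list (string-bytes latin1)
        (string-bytes utf8)
        (equal (decode-coding-string latin1 'iso-8859-1) text)
        (equal (decode-coding-string utf8 'utf-8) text)))
     @result{} (1 2 t t)
@end group
@end example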

@c I think this paragraph is no longer correct.
@ignore
  Most coding systems specify a particular character code for
conversion, but some of them leave the choice unspecified---to be chosen
heuristically for each file, based on the data.
@end ignore

  In general, a coding system doesn't guarantee roundtrip identity:
decoding a byte sequence using a coding system, then encoding the
resulting text in the same coding system, can produce a different byte
sequence.  But some coding systems do guarantee that the byte sequence
will be the same as what you originally decoded.  Here are a few
examples:
Glenn Morris's avatar
Glenn Morris committed
629 630

@quotation
iso-8859-1, utf-8, big5, shift_jis, euc-jp
@end quotation

  Encoding buffer text and then decoding the result can also fail to
reproduce the original text.  For instance, if you encode a character
with a coding system which does not support that character, the result
is unpredictable, and thus decoding it using the same coding system
may produce a different text.  Currently, Emacs can't report errors
that result from encoding unsupported characters.

@cindex EOL conversion
@cindex end-of-line conversion
@cindex line end conversion
  @dfn{End of line conversion} handles three different conventions
used on various systems for representing end of line in files.  The
Unix convention, used on GNU and Unix systems, is to use the linefeed
character (also called newline).  The DOS convention, used on
MS-Windows and MS-DOS systems, is to use a carriage-return and a
linefeed at the end of a line.  The Mac convention is to use just
carriage-return.

@cindex base coding system
@cindex variant coding system
  @dfn{Base coding systems} such as @code{latin-1} leave the end-of-line
conversion unspecified, to be chosen based on the data.  @dfn{Variant
coding systems} such as @code{latin-1-unix}, @code{latin-1-dos} and
@code{latin-1-mac} specify the end-of-line conversion explicitly as
well.  Most base coding systems have three corresponding variants whose
names are formed by adding @samp{-unix}, @samp{-dos} and @samp{-mac}.

  The coding system @code{raw-text} is special in that it prevents
character code conversion, and causes the buffer visited with that
coding system to be a unibyte buffer.  It does not specify the
end-of-line conversion, allowing that to be determined as usual by the
data, and has the usual three variants which specify the end-of-line
conversion.  @code{no-conversion} is equivalent to @code{raw-text-unix}:
it specifies no conversion of either character codes or end-of-line.

@vindex emacs-internal@r{ coding system}
  The coding system @code{emacs-internal} specifies that the data is
represented in the internal Emacs encoding.  This is like
@code{raw-text} in that no code conversion happens, but different in
that the result is multibyte data.

@defun coding-system-get coding-system property
This function returns the specified property of the coding system
@var{coding-system}.  Most coding system properties exist for internal
purposes, but one that you might find useful is @code{:mime-charset}.
That property's value is the name used in MIME for the character coding
which this coding system can read and write.  Examples:

@example
(coding-system-get 'iso-latin-1 :mime-charset)
     @result{} iso-8859-1
(coding-system-get 'iso-2022-cn :mime-charset)
     @result{} iso-2022-cn
(coding-system-get 'cyrillic-koi8 :mime-charset)
     @result{} koi8-r
@end example

The value of the @code{:mime-charset} property is also defined
as an alias for the coding system.
@end defun

@node Encoding and I/O
@subsection Encoding and I/O

  The principal purpose of coding systems is for use in reading and
writing files.  The function @code{insert-file-contents} uses
a coding system for decoding the file data, and @code{write-region}
uses one to encode the buffer contents.

  You can specify the coding system to use either explicitly
(@pxref{Specifying Coding Systems}), or implicitly using a default
mechanism (@pxref{Default Coding Systems}).  But these methods may not
completely specify what to do.  For example, they may choose a coding
system such as @code{undecided} which leaves the character code
conversion to be determined from the data.  In these cases, the I/O
operation finishes the job of choosing a coding system.  Very often
you will want to find out afterwards which coding system was chosen.

@defvar buffer-file-coding-system
This buffer-local variable records the coding system used for saving the
buffer and for writing part of the buffer with @code{write-region}.  If
the text to be written cannot be safely encoded using the coding system
specified by this variable, these operations select an alternative
encoding by calling the function @code{select-safe-coding-system}
(@pxref{User-Chosen Coding Systems}).  If selecting a different encoding
requires asking the user to specify a coding system,
@code{buffer-file-coding-system} is updated to the newly selected coding
system.

@code{buffer-file-coding-system} does @emph{not} affect sending text
to a subprocess.
@end defvar

@defvar save-buffer-coding-system
This variable specifies the coding system for saving the buffer (by
overriding @code{buffer-file-coding-system}).  Note that it is not used
for @code{write-region}.

When a command to save the buffer starts out to use
@code{buffer-file-coding-system} (or @code{save-buffer-coding-system}),
and that coding system cannot handle
the actual text in the buffer, the command asks the user to choose
another coding system (by calling @code{select-safe-coding-system}).
After that happens, the command also updates
@code{buffer-file-coding-system} to represent the coding system that
the user specified.
@end defvar

@defvar last-coding-system-used
I/O operations for files and subprocesses set this variable to the
coding system name that was used.  The explicit encoding and decoding
functions (@pxref{Explicit Encoding}) set it too.

@strong{Warning:} Since receiving subprocess output sets this variable,
it can change whenever Emacs waits; therefore, you should copy the
value shortly after the function call that stores the value you are
interested in.
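
For instance, the following sketch copies the value immediately after
reading a file, before Emacs has a chance to wait for subprocess
output.  The function name @code{my-file-coding-system} is
hypothetical, used only for illustration:

@example
(defun my-file-coding-system (filename)
  "Return the coding system Emacs used to decode FILENAME."
  (with-temp-buffer
    (insert-file-contents filename)
    ;; Copy the value right away; later I/O may change it.
    last-coding-system-used))
@end example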
@end defvar

  The variable @code{selection-coding-system} specifies how to encode
selections for the window system.  @xref{Window System Selections}.

@defvar file-name-coding-system
The variable @code{file-name-coding-system} specifies the coding
system to use for encoding file names.  Emacs encodes file names using
that coding system for all file operations.  If
@code{file-name-coding-system} is @code{nil}, Emacs uses a default
coding system determined by the selected language environment.  In the
default language environment, any non-@acronym{ASCII} characters in
file names are not encoded specially; they appear in the file system
using the internal Emacs representation.
@end defvar

  @strong{Warning:} if you change @code{file-name-coding-system} (or
the language environment) in the middle of an Emacs session, problems
can result if you have already visited files whose names were encoded
using the earlier coding system and are handled differently under the
new coding system.  If you try to save one of these buffers under the
visited file name, saving may use the wrong file name, or it may get
an error.  If such a problem happens, use @kbd{C-x C-w} to specify a
new file name for that buffer.

@node Lisp and Coding Systems
@subsection Coding Systems in Lisp

  Here are the Lisp facilities for working with coding systems:

@defun coding-system-list &optional base-only
This function returns a list of all coding system names (symbols).  If
@var{base-only} is non-@code{nil}, the value includes only the
base coding systems.  Otherwise, it includes alias and variant coding
systems as well.
@end defun

@defun coding-system-p object
This function returns @code{t} if @var{object} is a coding system
name or @code{nil}.
@end defun

@defun check-coding-system coding-system
This function checks the validity of @var{coding-system}.  If that is
valid, it returns @var{coding-system}.  If @var{coding-system} is
@code{nil}, the function returns @code{nil}.  For any other values, it
signals an error whose @code{error-symbol} is @code{coding-system-error}
(@pxref{Signaling Errors, signal}).
@end defun
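
  As an illustration, a program that accepts a coding system name from
an untrusted source might validate it with @code{check-coding-system}
as follows.  This is only a sketch; the function name
@code{my-valid-coding-system} is hypothetical:

@example
(defun my-valid-coding-system (name)
  "Return NAME if it is a valid coding system, otherwise nil."
  (condition-case nil
      (check-coding-system name)
    ;; Invalid names signal `coding-system-error'.
    (coding-system-error nil)))

(my-valid-coding-system 'utf-8)
     @result{} utf-8
(my-valid-coding-system 'no-such-coding-system)
     @result{} nil
@end example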

@defun coding-system-eol-type coding-system
This function returns the type of end-of-line (a.k.a.@: @dfn{eol})
conversion used by @var{coding-system}.  If @var{coding-system}
specifies a certain eol conversion, the return value is an integer 0,
1, or 2, standing for @code{unix}, @code{dos}, and @code{mac},
respectively.  If @var{coding-system} doesn't specify eol conversion
explicitly, the return value is a vector of coding systems, each one
with one of the possible eol conversion types, like this:

@lisp
(coding-system-eol-type 'latin-1)
     @result{} [latin-1-unix latin-1-dos latin-1-mac]
@end lisp

@noindent
If this function returns a vector, Emacs will decide, as part of the
text encoding or decoding process, what eol conversion to use.  For
decoding, the end-of-line format of the text is auto-detected, and the
eol conversion is set to match it (e.g., DOS-style CRLF format will
imply @code{dos} eol conversion).  For encoding, the eol conversion is
taken from the appropriate default coding system (e.g.,
@code{default-buffer-file-coding-system} for
@code{buffer-file-coding-system}), or from the default eol conversion
appropriate for the underlying platform.
@end defun

@defun coding-system-change-eol-conversion coding-system eol-type
This function returns a coding system which is like @var{coding-system}
except for its eol conversion, which is specified by @var{eol-type}.
@var{eol-type} should be @code{unix}, @code{dos}, @code{mac}, or
@code{nil}.  If it is @code{nil}, the returned coding system determines
the end-of-line conversion from the data.

@var{eol-type} may also be 0, 1 or 2, standing for @code{unix},
@code{dos} and @code{mac}, respectively.
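
For example, assuming the @code{utf-8} coding system and its usual
three variants are defined (as they are in a standard Emacs build):

@example
(coding-system-change-eol-conversion 'utf-8 'dos)
     @result{} utf-8-dos
(coding-system-eol-type
 (coding-system-change-eol-conversion 'utf-8 1))
     @result{} 1
@end example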
@end defun

@defun coding-system-change-text-conversion eol-coding text-coding
This function returns a coding system which uses the end-of-line
conversion of @var{eol-coding}, and the text conversion of
@var{text-coding}.  If @var{text-coding} is @code{nil}, it returns
@code{undecided}, or one of its variants according to @var{eol-coding}.
@end defun

@defun find-coding-systems-region from to
This function returns a list of coding systems that could be used to
encode a text between @var{from} and @var{to}.  All coding systems in
the list can safely encode any multibyte characters in that portion of
the text.

If the text contains no multibyte characters, the function returns the
list @code{(undecided)}.
@end defun

@defun find-coding-systems-string string
This function returns a list of coding systems that could be used to
encode the text of @var{string}.  All coding systems in the list can
safely encode any multibyte characters in @var{string}.  If the text
contains no multibyte characters, this returns the list
@code{(undecided)}.
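
As a small illustration, a pure @acronym{ASCII} string yields
@code{(undecided)}, while for other text you can test whether a given
base coding system appears in the returned list.  The helper name
@code{my-string-encodable-p} is hypothetical, and this is only a
sketch:

@example
(find-coding-systems-string "abc")
     @result{} (undecided)

(defun my-string-encodable-p (string coding-system)
  "Return non-nil if CODING-SYSTEM can safely encode STRING."
  (let ((candidates (find-coding-systems-string string)))
    ;; `(undecided)' means any coding system will do.
    (or (equal candidates '(undecided))
        (memq (coding-system-base coding-system) candidates))))
@end example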
@end defun

@defun find-coding-systems-for-charsets charsets
This function returns a list of coding systems that could be used to
encode all the character sets in the list @var{charsets}.
@end defun

@defun detect-coding-region start end &optional highest
This function chooses a plausible coding system for decoding the text
from @var{start} to @var{end}.  This text should be a byte sequence,
i.e.@: unibyte text or multibyte text with only @acronym{ASCII} and
eight-bit characters (@pxref{Explicit Encoding}).

Normally this function returns a list of coding systems that could
handle decoding the text that was scanned.  They are listed in order of
decreasing priority.  But if @var{highest} is non-@code{nil}, then the
return value is just one coding system, the one that is highest in
priority.

If the region contains only @acronym{ASCII} characters except for such
ISO-2022 control characters as @code{ESC}, the value is
@code{undecided} or @code{(undecided)}, or a variant specifying
end-of-line conversion, if that can be deduced from the text.
@end defun

@defun detect-coding-string string &optional highest
This function is like @code{detect-coding-region} except that it
operates on the contents of @var{string} instead of bytes in the buffer.
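
For example, here is a sketch of decoding a unibyte string whose
encoding is not known in advance.  The function name
@code{my-decode-bytes} is hypothetical, and since the detection is
heuristic, the result may not be what the producer of the bytes
intended:

@example
(defun my-decode-bytes (bytes)
  "Decode the unibyte string BYTES with the most plausible coding system."
  ;; A non-nil HIGHEST argument yields a single coding system.
  (decode-coding-string bytes (detect-coding-string bytes t)))
@end example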
@end defun

  @xref{Coding systems for a subprocess,, Process Information}, in
particular the description of the functions
@code{process-coding-system} and @code{set-process-coding-system}, for
how to examine or set the coding systems used for I/O to a subprocess.

@node User-Chosen Coding Systems
@subsection User-Chosen Coding Systems

@cindex select safe coding system
@defun select-safe-coding-system from to &optional default-coding-system accept-default-p file
This function selects a coding system for encoding specified text,
asking the user to choose if necessary.  Normally the specified text
is the text in the current buffer between @var{from} and @var{to}.  If
@var{from} is a string, the string specifies the text to encode, and
@var{to} is ignored.

If @var{default-coding-system} is non-@code{nil}, that is the first
coding system to try; if that can handle the text,
@code{select-safe-coding-system} returns that coding system.  It can
also be a list of coding systems; then the function tries each of them
one by one.  After trying all of them, it next tries the current
buffer's value of @code{buffer-file-coding-system} (if it is not
@code{undecided}), then the value of
@code{default-buffer-file-coding-system} and finally the user's most
preferred coding system, which the user can set using the command
@code{prefer-coding-system} (@pxref{Recognize Coding,, Recognizing
Coding Systems, emacs, The GNU Emacs Manual}).

If one of those coding systems can safely encode all the specified
text, @code{select-safe-coding-system} chooses it and returns it.
Otherwise, it asks the user to choose from a list of coding systems
which can encode all the text, and returns the user's choice.

@var{default-coding-system} can also be a list whose first element is
@code{t} and whose other elements are coding systems.  Then, if no coding
system in the list can handle the text, @code{select-safe-coding-system}
queries the user immediately, without trying any of the three
alternatives described above.

The optional argument @var{accept-default-p}, if non-@code{nil},
should be a function to determine whether a coding system selected
without user interaction is acceptable.  @code{select-safe-coding-system}
calls this function with one argument, the base coding system of the
selected coding system.  If @var{accept-default-p} returns @code{nil},
@code{select-safe-coding-system} rejects the silently selected coding
system, and asks the user to select a coding system from a list of
possible candidates.

@vindex select-safe-coding-system-accept-default-p
If the variable @code{select-safe-coding-system-accept-default-p} is
non-@code{nil}, its value overrides the value of
@var{accept-default-p}.

As a final step, before returning the chosen coding system,
@code{select-safe-coding-system} checks whether that coding system is
consistent with what would be selected if the contents of the region
were read from a file.  (If not, this could lead to data corruption in
a file subsequently re-visited and edited.)  Normally,
@code{select-safe-coding-system} uses @code{buffer-file-name} as the
file for this purpose, but if @var{file} is non-@code{nil}, it uses
that file instead (this can be relevant for @code{write-region} and
similar functions).  If it detects an apparent inconsistency,
@code{select-safe-coding-system} queries the user before selecting the
coding system.
@end defun

  Here are two functions you can use to let the user specify a coding
system, with completion.  @xref{Completion}.

@defun read-coding-system prompt &optional default
This function reads a coding system using the minibuffer, prompting with
string @var{prompt}, and returns the coding system name as a symbol.  If
the user enters null input, @var{default} specifies which coding system
to return.  It should be a symbol or a string.
@end defun

@defun read-non-nil-coding-system prompt
This function reads a coding system using the minibuffer, prompting with
string @var{prompt}, and returns the coding system name as a symbol.  If
the user tries to enter null input, it asks the user to try again.
@xref{Coding Systems}.
@end defun

@node Default Coding Systems
@subsection Default Coding Systems

  This section describes variables that specify the default coding
system for certain files or when running certain subprograms, and the
function that I/O operations use to access them.

  The idea of these variables is that you set them once and for all to the
defaults you want, and then do not change them again.  To specify a
particular coding system for a particular operation in a Lisp program,
don't change these variables; instead, override them using
@code{coding-system-for-read} and @code{coding-system-for-write}
(@pxref{Specifying Coding Systems}).

@defvar auto-coding-regexp-alist
This variable is an alist of text patterns and corresponding coding
systems.  Each element has the form @code{(@var{regexp}
. @var{coding-system})}; a file whose first few kilobytes match
@var{regexp} is decoded with @var{coding-system} when its contents are
read into a buffer.  The settings in this alist take priority over
@code{coding:} tags in the files and the contents of
@code{file-coding-system-alist} (see below).  The default value is set
so that Emacs automatically recognizes mail files in Babyl format and
reads them with no code conversions.
@end defvar

@defvar file-coding-system-alist
This variable is an alist that specifies the coding systems to use for
reading and writing particular files.  Each element has the form
@code{(@var{pattern} . @var{coding})}, where @var{pattern} is a regular
expression that matches certain file names.  The element applies to file
names that match @var{pattern}.

The @sc{cdr} of the element, @var{coding}, should be either a coding
system, a cons cell containing two coding systems, or a function name (a
symbol with a function definition).  If @var{coding} is a coding system,
that coding system is used for both reading the file and writing it.  If
@var{coding} is a cons cell containing two coding systems, its @sc{car}
specifies the coding system for decoding, and its @sc{cdr} specifies the
coding system for encoding.

If @var{coding} is a function name, the function should take one
argument, a list of all arguments passed to
@code{find-operation-coding-system}.  It must return a coding system
or a cons cell containing two coding systems.  This value has the same
meaning as described above.

If @var{coding} (or what is returned by the above function) is
@code{undecided}, the normal code-detection is performed.
@end defvar

@defvar process-coding-system-alist
This variable is an alist specifying which coding systems to use for a
subprocess, depending on which program is running in the subprocess.  It
works like @code{file-coding-system-alist}, except that @var{pattern} is
matched against the program name used to start the subprocess.  The coding
system or systems specified in this alist are used to initialize the
coding systems used for I/O to the subprocess, but you can specify
other coding systems later using @code{set-process-coding-system}.
@end defvar

  @strong{Warning:} Coding systems such as @code{undecided}, which
determine the coding system from the data, do not work entirely reliably
with asynchronous subprocess output.  This is because Emacs handles
asynchronous subprocess output in batches, as it arrives.  If the coding
system leaves the character code conversion unspecified, or leaves the
end-of-line conversion unspecified, Emacs must try to detect the proper
conversion from one batch at a time, and this does not always work.

  Therefore, with an asynchronous subprocess, if at all possible, use a
coding system which determines both the character code conversion and
the end of line conversion---that is, one like @code{latin-1-unix},
rather than @code{undecided} or @code{latin-1}.

@defvar network-coding-system-alist
This variable is an alist that specifies the coding system to use for
network streams.  It works much like @code{file-coding-system-alist},
with the difference that the @var{pattern} in an element may be either a
port number or a regular expression.  If it is a regular expression, it
is matched against the network service name used to open the network
stream.
@end defvar

@defvar default-process-coding-system
This variable specifies the coding systems to use for subprocess (and
network stream) input and output, when nothing else specifies what to
do.

The value should be a cons cell of the form @code{(@var{input-coding}
. @var{output-coding})}.  Here @var{input-coding} applies to input from
the subprocess, and @var{output-coding} applies to output to it.
@end defvar

@defvar auto-coding-functions
This variable holds a list of functions that try to determine a
coding system for a file based on its undecoded contents.

Each function in this list should be written to look at text in the
current buffer, but should not modify it in any way.  The buffer will
contain undecoded text of parts of the file.  Each function should
take one argument, @var{size}, which tells it how many characters to
look at, starting from point.  If the function succeeds in determining
a coding system for the file, it should return that coding system.
Otherwise, it should return @code{nil}.

If a file has a @samp{coding:} tag, that takes precedence, so these
functions won't be called.
@end defvar

@defun find-operation-coding-system operation &rest arguments
This function returns the coding system to use (by default) for
performing @var{operation} with @var{arguments}.  The value has this
form:

@example
(@var{decoding-system} . @var{encoding-system})
@end example

The first element, @var{decoding-system}, is the coding system to use
for decoding (in case @var{operation} does decoding), and
@var{encoding-system} is the coding system for encoding (in case
@var{operation} does encoding).

The argument @var{operation} is a symbol, one of @code{write-region},
@code{start-process}, @code{call-process}, @code{call-process-region},
@code{insert-file-contents}, or @code{open-network-stream}.  These are
the names of the Emacs I/O primitives that can do character code and
eol conversion.

The remaining arguments should be the same arguments that might be given
to the corresponding I/O primitive.  Depending on the primitive, one
of those arguments is selected as the @dfn{target}.  For example, if
@var{operation} does file I/O, whichever argument specifies the file
name is the target.  For subprocess primitives, the process name is the
target.  For @code{open-network-stream}, the target is the service name
or port number.

Depending on @var{operation}, this function looks up the target in
@code{file-coding-system-alist}, @code{process-coding-system-alist},
or @code{network-coding-system-alist}.  If the target is found in the
alist, @code{find-operation-coding-system} returns its association in
the alist; otherwise it returns @code{nil}.

If @var{operation} is @code{insert-file-contents}, the argument
corresponding to the target may be a cons cell of the form
@code{(@var{filename} . @var{buffer})}.  In that case, @var{filename}
is a file name to look up in @code{file-coding-system-alist}, and
@var{buffer} is a buffer that contains the file's contents (not yet
decoded).  If @code{file-coding-system-alist} specifies a function to
call for this file, and that function needs to examine the file's
contents (as it usually does), it should examine the contents of
@var{buffer} instead of reading the file.
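
As a sketch of the lookup, the following binds
@code{file-coding-system-alist} temporarily (only for illustration;
the file name @file{notes.txt} is hypothetical) and asks which coding
systems would apply by default when reading that file:

@example
(let ((file-coding-system-alist
       '(("\\.txt\\'" . utf-8))))
  (find-operation-coding-system
   'insert-file-contents "notes.txt"))
     @result{} (utf-8 . utf-8)
@end example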
@end defun

@node Specifying Coding Systems
@subsection Specifying a Coding System for One Operation

  You can specify the coding system for a specific operation by binding
the variables @code{coding-system-for-read} and/or
@code{coding-system-for-write}.

@defvar coding-system-for-read
If this variable is non-@code{nil}, it specifies the coding system to
use for reading a file, or for input from a synchronous subprocess.

It also applies to any asynchronous subprocess or network stream, but in
a different way: the value of @code{coding-system-for-read} when you
start the subprocess or open the network stream specifies the input
decoding method for that subprocess or network stream.  It remains in
use for that subprocess or network stream unless and until overridden.

The right way to use this variable is to bind it with @code{let} for a
specific I/O operation.  Its global value is normally @code{nil}, and
you should not globally set it to any other value.  Here is an example
of the right way to use the variable:

@example
;; @r{Read the file with no character code conversion.}
;; @r{Assume @acronym{crlf} represents end-of-line.}
(let ((coding-system-for-read 'emacs-mule-dos))
  (insert-file-contents filename))
@end example

When its value is non-@code{nil}, this variable takes precedence over
all other methods of specifying a coding system to use for input,
including @code{file-coding-system-alist},
@code{process-coding-system-alist} and
@code{network-coding-system-alist}.
@end defvar

@defvar coding-system-for-write
This works much like @code{coding-system-for-read}, except that it
applies to output rather than input.  It affects writing to files,
as well as sending output to subprocesses and net connections.

When a single operation does both input and output, as do
@code{call-process-region} and @code{start-process}, both
@code{coding-system-for-read} and @code{coding-system-for-write}
affect it.
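
For example, here is a minimal sketch that writes the current
buffer's text to a file as UTF-8 with Unix line endings, regardless of
the value of @code{buffer-file-coding-system}.  The function name
@code{my-write-buffer-as-utf-8} is hypothetical:

@example
(defun my-write-buffer-as-utf-8 (filename)
  "Write the current buffer to FILENAME as UTF-8 with LF line ends."
  ;; Bind the variable only around this one operation.
  (let ((coding-system-for-write 'utf-8-unix))
    (write-region (point-min) (point-max) filename)))
@end example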
@end defvar

@defvar inhibit-eol-conversion
When this variable is non-@code{nil}, no end-of-line conversion is done,
no matter which coding system is specified.  This applies to all the
Emacs I/O and subprocess primitives, and to the explicit encoding and
decoding functions (@pxref{Explicit Encoding}).
@end defvar

@node Explicit Encoding
@subsection Explicit Encoding and Decoding
@cindex encoding in coding systems
@cindex decoding in coding systems

  All the operations that transfer text in and out of Emacs have the
ability to use a coding system to encode or decode the text.
You can also explicitly encode and decode text using the functions
in this section.

  The result of encoding, and the input to decoding, are not ordinary
text.  They logically consist of a series of byte values; that is, a
series of @acronym{ASCII} and eight-bit characters.  In unibyte
buffers and strings, these characters have codes in the range 0
through 255.  In a multibyte buffer or string, eight-bit characters
have character codes higher than 255 (@pxref{Text Representations}),
but Emacs transparently converts them to their single-byte values when
you encode or decode such text.

  The usual way to read a file into a buffer as a sequence of bytes, so
you can decode the contents explicitly, is with
@code{insert-file-contents-literally} (@pxref{Reading from Files});
alternatively, specify a non-@code{nil} @var{rawfile} argument when
visiting a file with @code{find-file-noselect}.  These methods result in
a unibyte buffer.

  The usual way to use the byte sequence that results from explicitly
encoding text is to copy it to a file or process---for example, to write
it with @code{write-region} (@pxref{Writing to Files}), and suppress
encoding by binding @code{coding-system-for-write} to
@code{no-conversion}.
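
  A minimal sketch of that pattern follows.  The function name
@code{my-write-encoded} is hypothetical; it encodes a string explicitly
and then writes the resulting bytes without any further conversion:

@example
(defun my-write-encoded (string coding-system filename)
  "Encode STRING with CODING-SYSTEM and write the bytes to FILENAME."
  (let ((bytes (encode-coding-string string coding-system))
        ;; The text is already encoded, so suppress encoding here.
        (coding-system-for-write 'no-conversion))
    (write-region bytes nil filename)))
@end example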

  Here are the functions to perform explicit encoding or decoding.  The
encoding functions produce sequences of bytes; the decoding functions
are meant to operate on sequences of bytes.  All of these functions
discard text properties.  They also set @code{last-coding-system-used}
to the precise coding system they used.

@deffn Command encode-coding-region start end coding-system &optional destination
This command encodes the text from @var{start} to @var{end} according
to coding system @var{coding-system}.  Normally, the encoded text
replaces the original text in the buffer, but the optional argument
@var{destination} can change that.  If @var{destination} is a buffer,
the encoded text is inserted in that buffer after point (point does
not move); if it is @code{t}, the command returns the encoded text as
a unibyte string without inserting it.

If encoded text is inserted in some buffer, this command returns the
length of the encoded text.

The result of encoding is logically a sequence of bytes, but the
buffer remains multibyte if it was multibyte before, and any 8-bit
bytes are converted to their multibyte representation (@pxref{Text
Representations}).
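
For example, the following sketch encodes the contents of a temporary
buffer holding the single character U+03B1 and returns the bytes as a
string, leaving the buffer text untouched because @var{destination} is
@code{t}:

@example
(with-temp-buffer
  (insert (string #x3b1))
  (encode-coding-region (point-min) (point-max) 'utf-8 t))
     @result{} "\316\261"
@end example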
@end deffn

@defun encode-coding-string string coding-system &optional nocopy buffer
This function encodes the text in @var{string} according to coding
system @var{coding-system}.  It returns a new string containing the
encoded text, except when @var{nocopy} is non-@code{nil}, in which
case the function may return @var{string} itself if the encoding
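operation is trivial.  The result of encoding is a unibyte string.

If optional argument @var{buffer} specifies a buffer, the encoded text
is inserted in that buffer after point (point does not move).  In this
case, the return value is the length of the encoded text.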
@end defun
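
  The examples below sketch how the same one-character string encodes
under two coding systems, and confirm that the result is unibyte.  The
sample character (U+00E9) is arbitrary.

@example
@group
(encode-coding-string "\u00e9" 'utf-8)
     @result{} "\303\251"
(encode-coding-string "\u00e9" 'iso-latin-1)
     @result{} "\351"
(multibyte-string-p (encode-coding-string "\u00e9" 'utf-8))
     @result{} nil
@end group
@end example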

@deffn Command decode-coding-region start end coding-system &optional destination
This command decodes the text from @var{start} to @var{end} according
to coding system @var{coding-system}.  To make explicit decoding
useful, the text before decoding ought to be a sequence of byte
values, but both multibyte and unibyte buffers are acceptable (in the
multibyte case, the raw byte values should be represented as eight-bit
characters).  Normally, the decoded text replaces the original text in
the buffer, but the optional argument @var{destination} can change
that.  If @var{destination} is a buffer, the decoded text is inserted
in that buffer after point (point does not move); if it is @code{t},
the command returns the decoded text as a multibyte string without
inserting it.

If decoded text is inserted in some buffer, this command returns the
length of the decoded text.
@end deffn
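
  As a sketch, the following decodes two raw bytes held in a unibyte
temporary buffer and returns the result as a string, leaving the
buffer unchanged.  The byte values are the UTF-8 encoding of U+00E9,
picked only for illustration.

@example
@group
(with-temp-buffer
  (set-buffer-multibyte nil)
  (insert "\303\251")
  (string= (decode-coding-region (point-min) (point-max) 'utf-8 t)
           "\u00e9"))
     @result{} t
@end group
@end example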

@defun decode-coding-string string coding-system &optional nocopy buffer
This function decodes the text in @var{string} according to
@var{coding-system}.  It returns a new string containing the decoded
text, except when @var{nocopy} is non-@code{nil}, in which case the
function may return @var{string} itself if the decoding operation is
trivial.  To make explicit decoding useful, the contents of
@var{string} ought to be a unibyte string with a sequence of byte
values, but a multibyte string is also acceptable (assuming it
contains 8-bit bytes in their multibyte form).

If optional argument @var{buffer} specifies a buffer, the decoded text
is inserted in that buffer after point (point does not move).  In this
case, the return value is the length of the decoded text.
@end defun
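
  The following sketch decodes a two-byte unibyte string, first
returning the decoded string and then inserting it into a buffer.  The
byte values (the UTF-8 encoding of U+00E9) and the temporary buffer
are assumptions chosen for illustration.

@example
@group
(string= (decode-coding-string "\303\251" 'utf-8) "\u00e9")
     @result{} t
;; With a BUFFER argument, the value is the decoded length.
(with-temp-buffer
  (decode-coding-string "\303\251" 'utf-8 nil (current-buffer)))
     @result{} 1
@end group
@end example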

@defun decode-coding-inserted-region from to filename &optional visit beg end replace
This function decodes the text from @var{from} to @var{to} as if it
had been read from file @var{filename} by @code{insert-file-contents},
called with the remaining arguments.

The normal way to use this function is after reading text from a file
without decoding, if you decide you would rather have decoded it.
Instead of deleting the text and reading it again, this time with
decoding, you can call this function.
@end defun
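
  A sketch of that usage follows.  The function name
@code{my-insert-then-decode} is hypothetical, and the sketch assumes
the current buffer is multibyte, so that each undecoded byte occupies
one character position.

@example
@group
(defun my-insert-then-decode (file)
  "Insert FILE at point without decoding, then decode that text."
  (let* ((start (point))
         ;; The second element of the value is the inserted length.
         (len (nth 1 (insert-file-contents-literally file))))
    (decode-coding-inserted-region start (+ start len) file)))
@end group
@end example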

@node Terminal I/O Encoding
@subsection Terminal I/O Encoding

  Emacs can decode keyboard input using a coding system, and encode
terminal output.  This is useful for terminals that transmit or
display text using a particular encoding such as Latin-1.  Emacs does
not set @code{last-coding-system-used} for encoding or decoding of
terminal I/O.

@defun keyboard-coding-system
This function returns the coding system that is in use for decoding
keyboard input---or @code{nil} if no coding system is to be used.
@end defun

@deffn Command set-keyboard-coding-system coding-system
This command specifies @var{coding-system} as the coding system to
use for decoding keyboard input.  If @var{coding-system} is @code{nil},
that means do not decode keyboard input.
@end deffn

@defun terminal-coding-system
This function returns the coding system that is in use for encoding
terminal output---or @code{nil} for no encoding.
@end defun

@deffn Command set-terminal-coding-system coding-system
This command specifies @var{coding-system} as the coding system to use
for encoding terminal output.  If @var{coding-system} is @code{nil},
that means do not encode terminal output.
@end deffn
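
  As a rough sketch, the four functions above can be combined to
inspect and then change both settings.  The choice of @code{latin-1}
is an assumption for illustration (a terminal that sends and displays
Latin-1); substitute whatever encoding the terminal really uses.

@example
@group
;; Show the coding systems currently in effect.
(message "keyboard: %S  terminal: %S"
         (keyboard-coding-system)
         (terminal-coding-system))

;; Assumed for illustration: a Latin-1 terminal.
(set-keyboard-coding-system 'latin-1)
(set-terminal-coding-system 'latin-1)
@end group
@end example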

@node MS-DOS File Types
@subsection MS-DOS File Types
@cindex DOS file types
@cindex MS-DOS file types
@cindex Windows file types
@cindex file types on MS-DOS and Windows
@cindex text files and binary files
@cindex binary files and text files

  On MS-DOS and Microsoft Windows, Emacs guesses the appropriate
end-of-line conversion for a file by looking at the file's name.  This
feature classifies files as @dfn{text files} and @dfn{binary files}.  By
``binary file'' we mean a file of literal byte values that are not
necessarily meant to be characters; Emacs does no end-of-line conversion
and no character code conversion for such files.  On the other hand, the bytes
in a text file are intended to represent characters; when you create a
new file whose name implies that it is a text file, Emacs uses DOS
end-of-line conversion.

@defvar buffer-file-type
This variable, automatically buffer-local in each buffer, records the
file type of the buffer's visited file.  When a buffer does not specify
a coding system with @code{buffer-file-coding-system}, this variable is
used to determine which coding system to use when writing the contents
of the buffer.  It should be @code{nil} for text, @code{t} for binary.
If it is @code{t}, the coding system is @code{no-conversion}.
Otherwise, @code{undecided-dos} is used.

Normally this variable is set by visiting a file; it is set to
@code{nil} if the file was visited without any actual conversion.
@end defvar

@defopt file-name-buffer-file-type-alist
This variable holds an alist for recognizing text and binary files.
Each element has the form @code{(@var{regexp} . @var{type})}, where
@var{regexp} is matched against the file name, and @var{type} may be
@code{nil} for text, @code{t} for binary, or a function to call to
compute which.  If it is a function, then it is called with a single
argument (the file name) and should return @code{t} or @code{nil}.

When running on MS-DOS or MS-Windows, Emacs checks this alist to decide
which coding system to use when reading a file.  For a text file,
@code{undecided-dos} is used.  For a binary file, @code{no-conversion}
is used.

If no element in this alist matches a given file name, then
@code{default-buffer-file-type} says how to treat the file.
@end defopt
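
  For example, an element of this alist could be added as sketched
below.  The @samp{.dat} rule is hypothetical, and the @code{boundp}
guard reflects the assumption that the variable may be undefined on
systems other than MS-DOS and MS-Windows.

@example
@group
;; Hypothetical rule: treat files ending in ".dat" as binary.
(when (boundp 'file-name-buffer-file-type-alist)
  (add-to-list 'file-name-buffer-file-type-alist
               '("\\.dat\\'" . t)))
@end group
@end example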

@defopt default-buffer-file-type
This variable says how to handle files for which
@code{file-name-buffer-file-type-alist} says nothing about the type.

If this variable is non-@code{nil}, then these files are treated as
binary: the coding system @code{no-conversion} is used.  Otherwise,
nothing special is done for them---the coding system is deduced solely
from the file contents, in the usual Emacs fashion.
@end defopt

@node Input Methods
@section Input Methods
@cindex input methods

  @dfn{Input methods} provide convenient ways of entering non-@acronym{ASCII}
characters from the keyboard.  Unlike coding systems, which translate
non-@acronym{ASCII} characters to and from encodings meant to be read by
programs, input methods provide human-friendly commands.  (@xref{Input
Methods,,, emacs, The GNU Emacs Manual}, for information on how users
use input methods to enter text.)  How to define input methods is not
yet documented in this manual, but here we describe how to use them.

  Each input method has a name, which is currently a string;
in the future, symbols may also be usable as input method names.

@defvar current-input-method
This variable holds the name of the input method now active in the
current buffer.  (It automatically becomes local in each buffer when set
in any fashion.)  It is @code{nil} if no input method is active in the
buffer now.
@end defvar

@defopt default-input-method
This variable holds the default input method for commands that choose an
input method.  Unlike @code{current-input-method}, this variable is
normally global.
@end defopt

@deffn Command set-input-method input-method
This command activates input method @var{input-method} for the current
buffer.  It also sets @code{default-input-method} to @var{input-method}.
If @var{input-method} is @code{nil}, this command deactivates any input
method for the current buffer.
@end deffn
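
  The sketch below activates an input method in a temporary buffer and
then inspects @code{current-input-method}.  It assumes that an input
method named @code{"latin-1-prefix"} is available in your Emacs;
substitute any name found in @code{input-method-alist}.  Note that the
call also changes @code{default-input-method}.

@example
@group
(with-temp-buffer
  (set-input-method "latin-1-prefix")
  current-input-method)
     @result{} "latin-1-prefix"
@end group
@end example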

@defun read-input-method-name prompt &optional default inhibit-null
This function reads an input method name with the minibuffer, prompting
with @var{prompt}.  If @var{default} is non-@code{nil}, it is returned
when the user enters empty input.  However, if @var{inhibit-null} is
non-@code{nil}, empty input signals an error.

The returned value is a string.
@end defun

@defvar input-method-alist
This variable defines all the supported input methods.
Each element defines one input method, and should have the form:

@example
(@var{input-method} @var{language-env} @var{activate-func}
 @var{title} @var{description} @var{args}...)
@end example

Here @var{input-method} is the input method name, a string;
@var{language-env} is another string, the name of the language
environment this input method is recommended for.  (That serves only for
documentation purposes.)

@var{activate-func} is a function to call to activate this method.  The
@var{args}, if any, are passed as arguments to @var{activate-func}.  All
told, the arguments to @var{activate-func} are @var{input-method} and
the @var{args}.

@var{title} is a string to display in the mode line while this method is
active.  @var{description} is a string describing this method and what
it is good for.
@end defvar
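
  Because each element is an ordinary list, its fields can be picked
out with @code{assoc} and @code{nth}, as in this sketch.  The input
method name @code{"latin-1-prefix"} is an assumption; the form returns
@code{nil} if no such method is defined.

@example
@group
(let ((entry (assoc "latin-1-prefix" input-method-alist)))
  (when entry
    (list (nth 1 entry)     ; language environment
          (nth 3 entry)     ; mode-line title
          (nth 4 entry))))  ; description
@end group
@end example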

  The fundamental interface to input methods is through the
variable @code{input-method-function}.  @xref{Reading One Event},
and @ref{Invoking the Input Method}.

@node Locales
@section Locales
@cindex locale

  POSIX defines a concept of ``locales'' that determine the language
used in language-related features.  The following Emacs variables
control how Emacs interacts with these features.

@defvar locale-coding-system
@cindex keyboard input decoding on X
This variable specifies the coding system to use for decoding system
error messages and (on the X Window System only) keyboard input, for
encoding the format argument to @code{format-time-string}, and for
decoding the return value of @code{format-time-string}.
@end defvar

@defvar system-messages-locale
This variable specifies the locale to use for generating system error
messages.  Changing the locale can cause messages to come out in a
different language or in a different orthography.  If the variable is
@code{nil}, the locale is specified by environment variables in the
usual POSIX fashion.
@end defvar

@defvar system-time-locale
This variable specifies the locale to use for formatting time values.
Changing the locale can cause times to be formatted according to the
conventions of a different language.  If the variable is @code{nil}, the
locale is specified by environment variables in the usual POSIX fashion.
@end defvar
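
  As a small sketch, binding @code{system-time-locale} around a call
to @code{format-time-string} selects the locale for that call only.
The @code{"C"} locale and the sample time value (the start of the
epoch, in Universal Time) are chosen for illustration.

@example
@group
(let ((system-time-locale "C"))
  (format-time-string "%A" '(0 0) t))
     @result{} "Thursday"
@end group
@end example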

@defun locale-info item
This function returns locale data @var{item} for the current POSIX
locale, if available.  @var{item} should be one of these symbols:

@table @code
@item codeset
Return the character set as a string (locale item @code{CODESET}).

@item days
Return a 7-element vector of day names (locale items
@code{DAY_1} through @code{DAY_7}).

@item months
Return a 12-element vector of month names (locale items @code{MON_1}
through @code{MON_12}).

@item paper
Return a list @code{(@var{width} @var{height})} for the default paper
size measured in millimeters (locale items @code{PAPER_WIDTH} and
@code{PAPER_HEIGHT}).
@end table

If the system can't provide the requested information, or if
@var{item} is not one of those symbols, the value is @code{nil}.  All
strings in the return value are decoded using
@code{locale-coding-system}.  @xref{Locales,,, libc, The GNU Libc Manual},
for more information about locales and locale items.
@end defun
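
  The following sketch queries two items.  The values depend on the
current locale and on what the system can report, so no results are
shown; the second form guards against a @code{nil} value.

@example
@group
;; Name of the locale's character set, or nil if unavailable.
(locale-info 'codeset)

;; First day name in the locale, or nil if unavailable.
(let ((days (locale-info 'days)))
  (and days (aref days 0)))
@end group
@end example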

@ignore
   arch-tag: be705bf8-941b-4c35-84fc-ad7d20ddb7cb
@end ignore