Commit 97d8273f authored by Chong Yidong

* nonascii.texi (Text Representations): Copyedits.

(Coding System Basics): Also mention utf-8-emacs.
(Converting Representations, Selecting a Representation)
(Scanning Charsets, Translation of Characters, Encoding and I/O):
Copyedits.
(Character Codes): Mention role of codepoints 1114112 to 4194175.
parent c872c51e
+2009-04-10 Chong Yidong <cyd@stupidchicken.com>
+* nonascii.texi (Text Representations): Copyedits.
+(Coding System Basics): Also mention utf-8-emacs.
+(Converting Representations, Selecting a Representation)
+(Scanning Charsets, Translation of Characters, Encoding and I/O):
+Copyedits.
+(Character Codes): Mention role of codepoints 1114112 to 4194175.
 2009-04-09 Chong Yidong <cyd@stupidchicken.com>
 * text.texi (Yank Commands): Note that yank uses push-mark.
......
@@ -36,8 +36,8 @@ how they are stored in strings and buffers.
 @cindex text representation
 Emacs buffers and strings support a large repertoire of characters
-from many different scripts. This is so users could type and display
-text in most any known written language.
+from many different scripts, allowing users to type and display text
+in most any known written language.
 @cindex character codepoint
 @cindex codespace
@@ -65,15 +65,13 @@ This internal representation is based on one of the encodings defined
 by the Unicode Standard, called @dfn{UTF-8}, for representing any
 Unicode codepoint, but Emacs extends UTF-8 to represent the additional
 codepoints it uses for raw 8-bit bytes and characters not unified with
-Unicode.}.
-For example, any @acronym{ASCII} character takes up only 1 byte, a
-Latin-1 character takes up 2 bytes, etc. We call this representation
-of text @dfn{multibyte}, because it uses several bytes for each
-character.
+Unicode.}. For example, any @acronym{ASCII} character takes up only 1
+byte, a Latin-1 character takes up 2 bytes, etc. We call this
+representation of text @dfn{multibyte}.
 Outside Emacs, characters can be represented in many different
 encodings, such as ISO-8859-1, GB-2312, Big-5, etc. Emacs converts
-between these external encodings and the internal representation, as
+between these external encodings and its internal representation, as
 appropriate, when it reads text into a buffer or a string, or when it
 writes text to a disk file or passes it to some other process.
@@ -87,9 +85,9 @@ Before the conversion, the buffer holds encoded text.
 Encoded text is not really text, as far as Emacs is concerned, but
 rather a sequence of raw 8-bit bytes. We call buffers and strings
 that hold encoded text @dfn{unibyte} buffers and strings, because
-Emacs treats them as a sequence of individual bytes. In particular,
-Emacs usually displays unibyte buffers and strings as octal codes such
-as @code{\237}. We recommend that you never use unibyte buffers and
+Emacs treats them as a sequence of individual bytes. Usually, Emacs
+displays unibyte buffers and strings as octal codes such as
+@code{\237}. We recommend that you never use unibyte buffers and
 strings except for manipulating encoded text or binary non-text data.
 In a buffer, the buffer-local value of the variable
@@ -165,10 +163,10 @@ conversions happen when inserting text into a buffer, or when putting
 text from several strings together in one string. You can also
 explicitly convert a string's contents to either representation.
-Emacs chooses the representation for a string based on the text that
-it is constructed from. The general rule is to convert unibyte text to
-multibyte text when combining it with other multibyte text, because the
-multibyte representation is more general and can hold whatever
+Emacs chooses the representation for a string based on the text from
+which it is constructed. The general rule is to convert unibyte text
+to multibyte text when combining it with other multibyte text, because
+the multibyte representation is more general and can hold whatever
 characters the unibyte text has.
 When inserting text into a buffer, Emacs converts the text to the
@@ -181,9 +179,9 @@ alternative, to convert the buffer contents to multibyte, is not
 acceptable because the buffer's representation is a choice made by the
 user that cannot be overridden automatically.
-Converting unibyte text to multibyte text leaves @acronym{ASCII} characters
-unchanged, and converts bytes with codes 128 through 159 to the
-multibyte representation of raw eight-bit bytes.
+Converting unibyte text to multibyte text leaves @acronym{ASCII}
+characters unchanged, and converts bytes with codes 128 through 159 to
+the multibyte representation of raw eight-bit bytes.
 Converting multibyte text to unibyte converts all @acronym{ASCII}
 and eight-bit characters to their single-byte form, but loses
@@ -214,9 +212,9 @@ characters.
 @end defun
 @defun multibyte-char-to-unibyte char
-This convert the multibyte character @var{char} to a unibyte
-character. If @var{char} is a character that is neither
-@acronym{ASCII} nor eight-bit, the value is -1.
+This converts the multibyte character @var{char} to a unibyte
+character, and returns that character. If @var{char} is neither
+@acronym{ASCII} nor eight-bit, the function returns -1.
 @end defun
 @defun unibyte-char-to-multibyte char
@@ -238,9 +236,9 @@ is @code{nil}, the buffer becomes unibyte.
 This function leaves the buffer contents unchanged when viewed as a
 sequence of bytes. As a consequence, it can change the contents
-viewed as characters; a sequence of three bytes which is treated as
-one character in multibyte representation will count as three
-characters in unibyte representation. Eight-bit characters
+viewed as characters; for instance, a sequence of three bytes which is
+treated as one character in multibyte representation will count as
+three characters in unibyte representation. Eight-bit characters
 representing raw bytes are an exception. They are represented by one
 byte in a unibyte buffer, but when the buffer is set to multibyte,
 they are converted to two-byte sequences, and vice versa.
@@ -256,28 +254,24 @@ base buffer.
 @end defun
 @defun string-as-unibyte string
-This function returns a string with the same bytes as @var{string} but
-treating each byte as a character. This means that the value may have
-more characters than @var{string} has. Eight-bit characters
-representing raw bytes are an exception: each one of them is converted
-to a single byte.
-
-If @var{string} is already a unibyte string, then the value is
-@var{string} itself. Otherwise it is a newly created string, with no
+If @var{string} is already a unibyte string, this function returns
+@var{string} itself. Otherwise, it returns a new string with the same
+bytes as @var{string}, but treating each byte as a separate character
+(so that the value may have more characters than @var{string}); as an
+exception, each eight-bit character representing a raw byte is
+converted into a single byte. The newly-created string contains no
 text properties.
 @end defun
 @defun string-as-multibyte string
-This function returns a string with the same bytes as @var{string} but
-treating each multibyte sequence as one character. This means that
-the value may have fewer characters than @var{string} has. If a byte
-sequence in @var{string} is invalid as a multibyte representation of a
-single character, each byte in the sequence is treated as raw 8-bit
-byte.
-
-If @var{string} is already a multibyte string, then the value is
-@var{string} itself. Otherwise it is a newly created string, with no
-text properties.
+If @var{string} is a multibyte string, this function returns
+@var{string} itself. Otherwise, it returns a new string with the same
+bytes as @var{string}, but treating each multibyte sequence as one
+character. This means that the value may have fewer characters than
+@var{string} has. If a byte sequence in @var{string} is invalid as a
+multibyte representation of a single character, each byte in the
+sequence is treated as a raw 8-bit byte. The newly-created string
+contains no text properties.
 @end defun
 @node Character Codes
@@ -291,9 +285,10 @@ character codes for multibyte representation range from 0 to 4194303
 (#x3FFFFF). In this code space, values 0 through 127 are for
 @acronym{ASCII} charcters, and values 129 through 4194175 (#x3FFF7F)
 are for non-@acronym{ASCII} characters. Values 0 through 1114111
-(#10FFFF) corresponds to Unicode characters of the same codepoint,
-while values 4194176 (#x3FFF80) through 4194303 (#x3FFFFF) are for
-representing eight-bit raw bytes.
+(#10FFFF) correspond to Unicode characters of the same codepoint;
+values 1114112 (#110000) through 4194175 (#x3FFF7F) represent
+characters that are not unified with Unicode; and values 4194176
+(#x3FFF80) through 4194303 (#x3FFFFF) represent eight-bit raw bytes.
 @defun characterp charcode
 This returns @code{t} if @var{charcode} is a valid character, and
@@ -334,9 +329,9 @@ codepoint can have.
 @end defun
 @defun get-byte pos &optional string
-This function returns the byte at current buffer's character position
-@var{pos}. If the current buffer is unibyte, this is literally the
-byte at that position. If the buffer is multibyte, byte values of
+This function returns the byte at character position @var{pos} in the
+current buffer. If the current buffer is unibyte, this is literally
+the byte at that position. If the buffer is multibyte, byte values of
 @acronym{ASCII} characters are the same as character codepoints,
 whereas eight-bit raw bytes are converted to their 8-bit codes. The
 function signals an error if the character at @var{pos} is
@@ -360,13 +355,11 @@ of character properties. In particular, Emacs supports the
 Model}, and the Emacs character property database is derived from the
 Unicode Character Database (@acronym{UCD}). See the
 @uref{http://www.unicode.org/versions/Unicode5.0.0/ch04.pdf, Character
-Properties chapter of the Unicode Standard}, for detailed description
-of Unicode character properties and their meaning. This section
-assumes you are already familiar with that chapter of the Unicode
-Standard, and want to apply that knowledge to Emacs Lisp programs.
-
-The facilities documented in this section are useful for setting and
-retrieving properties of characters.
+Properties chapter of the Unicode Standard}, for a detailed
+description of Unicode character properties and their meaning. This
+section assumes you are already familiar with that chapter of the
+Unicode Standard, and want to apply that knowledge to Emacs Lisp
+programs.
 In Emacs, each property has a name, which is a symbol, and a set of
 possible values, whose types depend on the property; if a character
@@ -378,8 +371,8 @@ replacing each @samp{_} character with a dash @samp{-}. For example,
 @code{canonical-combining-class}. However, sometimes we shorten the
 names to make their use easier.
-Here's the full list of value types for all the character properties
-that Emacs knows about:
+Here is the full list of value types for all the character
+properties that Emacs knows about:
 @table @code
 @item name
@@ -428,7 +421,7 @@ corresponding number.
 @item numeric-value
 Corresponds to the Unicode @code{Numeric_Value} property for
 characters whose @code{Numeric_Type} is @samp{Numeric}. The value of
-this property is an integer of a floating-point number. Examples of
+this property is an integer or a floating-point number. Examples of
 characters that have this property include fractions, subscripts,
 superscripts, Roman numerals, currency numerators, and encircled
 numbers. For example, the value of this property for the character
@@ -656,16 +649,15 @@ or last codepoint of @var{charset}, respectively.
 @node Scanning Charsets
 @section Scanning for Character Sets
-Sometimes it is useful to find out, for characters that appear in a
-certain part of a buffer or a string, to which character sets they
-belong. One use for this is in determining which coding systems
-(@pxref{Coding Systems}) are capable of representing all of the text
-in question; another is to determine the font(s) for displaying that
-text.
+Sometimes it is useful to find out which character set a particular
+character belongs to. One use for this is in determining which coding
+systems (@pxref{Coding Systems}) are capable of representing all of
+the text in question; another is to determine the font(s) for
+displaying that text.
 @defun charset-after &optional pos
 This function returns the charset of highest priority containing the
-character in the current buffer at position @var{pos}. If @var{pos}
+character at position @var{pos} in the current buffer. If @var{pos}
 is omitted or @code{nil}, it defaults to the current value of point.
 If @var{pos} is out of range, the value is @code{nil}.
 @end defun
@@ -675,15 +667,15 @@ This function returns a list of the character sets of highest priority
 that contain characters in the current buffer between positions
 @var{beg} and @var{end}.
-The optional argument @var{translation} specifies a translation table to
-be used in scanning the text (@pxref{Translation of Characters}). If it
-is non-@code{nil}, then each character in the region is translated
+The optional argument @var{translation} specifies a translation table
+to use for scanning the text (@pxref{Translation of Characters}). If
+it is non-@code{nil}, then each character in the region is translated
 through this table, and the value returned describes the translated
 characters instead of the characters actually in the buffer.
 @end defun
 @defun find-charset-string string &optional translation
-This function returns a list of the character sets of highest priority
+This function returns a list of character sets of highest priority
 that contain characters in @var{string}. It is just like
 @code{find-charset-region}, except that it applies to the contents of
 @var{string} instead of part of the current buffer.
@@ -721,7 +713,7 @@ character, say @var{to-alt}, @var{from} is also translated to
 During decoding, the translation table's translations are applied to
 the characters that result from ordinary decoding. If a coding system
-has property @code{:decode-translation-table}, that specifies the
+has the property @code{:decode-translation-table}, that specifies the
 translation table to use, or a list of translation tables to apply in
 sequence. (This is a property of the coding system, as returned by
 @code{coding-system-get}, not a property of the symbol that is the
@@ -779,8 +771,8 @@ respectively in the @var{props} argument to
 This function is similar to @code{make-translation-table} but returns
 a complex translation table rather than a simple one-to-one mapping.
 Each element of @var{alist} is of the form @code{(@var{from}
-. @var{to})}, where @var{from} and @var{to} are either a character or
-a vector specifying a sequence of characters. If @var{from} is a
+. @var{to})}, where @var{from} and @var{to} are either characters or
+vectors specifying a sequence of characters. If @var{from} is a
 character, that character is translated to @var{to} (i.e.@: to a
 character or a character sequence). If @var{from} is a vector of
 characters, that sequence is translated to @var{to}. The returned
@@ -891,10 +883,13 @@ end-of-line conversion.
 codes or end-of-line.
 @vindex emacs-internal@r{ coding system}
-The coding system @code{emacs-internal} specifies that the data is
-represented in the internal Emacs encoding. This is like
-@code{raw-text} in that no code conversion happens, but different in
-that the result is multibyte data.
+@vindex utf-8-emacs@r{ coding system}
+The coding system @code{utf-8-emacs} specifies that the data is
+represented in the internal Emacs encoding (@pxref{Text
+Representations}). This is like @code{raw-text} in that no code
+conversion happens, but different in that the result is multibyte
+data. The name @code{emacs-internal} is an alias for
+@code{utf-8-emacs}.
 @defun coding-system-get coding-system property
 This function returns the specified property of the coding system
@@ -924,9 +919,9 @@ This function returns the list of aliases of @var{coding-system}.
 @subsection Encoding and I/O
 The principal purpose of coding systems is for use in reading and
-writing files. The function @code{insert-file-contents} uses
-a coding system for decoding the file data, and @code{write-region}
-uses one to encode the buffer contents.
+writing files. The function @code{insert-file-contents} uses a coding
+system to decode the file data, and @code{write-region} uses one to
+encode the buffer contents.
 You can specify the coding system to use either explicitly
 (@pxref{Specifying Coding Systems}), or implicitly using a default
......
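Note on the Character Codes hunk: the revised text splits the code space 0 through 4194303 (#x3FFFFF) into four ranges. The following is a minimal Python sketch of that partition, not Emacs code. The function name `classify_char_code` and the range labels are hypothetical. The sketch also assumes that non-ASCII values start at 128, whereas the manual text in the diff says 129.

```python
# Boundaries taken from the revised (Character Codes) text in this commit.
ASCII_MAX = 0x7F            # 127
UNICODE_MAX = 0x10FFFF      # 1114111
NON_UNIFIED_MAX = 0x3FFF7F  # 4194175
MAX_CHAR = 0x3FFFFF         # 4194303


def classify_char_code(code: int) -> str:
    """Return a label for the range that a character code falls in.

    The labels are illustrative only; Emacs itself does not use them.
    """
    if not 0 <= code <= MAX_CHAR:
        raise ValueError(f"not a valid character code: {code}")
    if code <= ASCII_MAX:
        return "ascii"
    if code <= UNICODE_MAX:
        # Assumption: 128 is treated as the first non-ASCII Unicode value.
        return "unicode"
    if code <= NON_UNIFIED_MAX:
        return "non-unified"
    return "raw-byte"


if __name__ == "__main__":
    # Print the label at each boundary mentioned in the text.
    for code in (0, 127, 128, 1114111, 1114112, 4194175, 4194176, 4194303):
        print(f"{code:#x}: {classify_char_code(code)}")
```

The range 1114112 through 4194175, which this commit documents for the first time, is the "non-unified" branch above.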