Commit aa5ccb01 authored by Eli Zaretskii's avatar Eli Zaretskii
Browse files

src/bidi.c: Describe the design of reordering engine in the commentary.

parent c8e7f832
......@@ -22,9 +22,16 @@ along with GNU Emacs. If not, see <http://www.gnu.org/licenses/>. */
A sequential implementation of the Unicode Bidirectional algorithm,
(UBA) as per UAX#9, a part of the Unicode Standard.
Unlike the reference and most other implementations, this one is
designed to be called once for every character in the buffer or
string.
Unlike the Reference Implementation and most other implementations,
this one is designed to be called once for every character in the
buffer or string. That way, we can leave intact the design of the
Emacs display engine, whereby an iterator object is used to
traverse buffer or string text character by character, and generate
the necessary data for displaying each character in 'struct glyph'
objects. (See xdisp.c for the details of that iteration.) The
functions on this file replace the original linear iteration in the
logical order of the text with a non-linear iteration in the visual
order, i.e. in the order characters should be shown on display.
The main entry point is bidi_move_to_visually_next. Each time it
is called, it finds the next character in the visual order, and
......@@ -52,7 +59,182 @@ along with GNU Emacs. If not, see <http://www.gnu.org/licenses/>. */
A note about references to UAX#9 rules: if the reference says
something like "X9/Retaining", it means that you need to refer to
rule X9 and to its modifications described in the "Implementation
Notes" section of UAX#9, under "Retaining Format Codes". */
Notes" section of UAX#9, under "Retaining Format Codes".
Here's the overview of the design of the reordering engine
implemented by this file.
Basic implementation structure
------------------------------
The sequential processing steps described by UAX#9 are implemented
as recursive levels of processing, all of which examine the next
character in the logical order. This hierarchy of processing looks
as follows, from the innermost (deepest) to the outermost level,
omitting some subroutines used by each level:
bidi_fetch_char -- fetch next character
bidi_resolve_explicit -- resolve explicit levels and directions
bidi_resolve_weak -- resolve weak types
bidi_resolve_neutral -- resolve neutral types
bidi_level_of_next_char -- resolve implicit levels
Each level calls the level below it, and works on the result
returned by the lower level, including all of its sub-levels.
Unlike all the levels below it, bidi_level_of_next_char can return
the information about either the next or previous character in the
logical order, depending on the current direction of scanning the
buffer or string. For the next character, it calls all the levels
below it; for the previous character, it uses the cache, described
below.
Thus, the result of calling bidi_level_of_next_char is the resolved
level of the next or the previous character in the logical order.
Based on this information, the function bidi_move_to_visually_next
finds the next character in the visual order and updates the
direction in which the buffer is scanned, either forward or
backward, to find the next character to be displayed. (Text is
scanned backwards when it needs to be reversed for display, i.e. if
the visual order is the inverse of the logical order.) This
implements the last, reordering steps of the UBA, by successively
calling bidi_level_of_next_char until the character of the required
embedding level is found; the scan direction is dynamically updated
as a side effect. See the commentary before the 'while' loop in
bidi_move_to_visually_next, for the details.
Fetching characters
-------------------
In a nutshell, fetching the next character boils down to calling
STRING_CHAR_AND_LENGTH, passing it the address of a buffer or
string position. See bidi_fetch_char. However, if the next
character is "covered" by a display property of some kind,
bidi_fetch_char returns the u+FFFC "object replacement character"
that represents the entire run of text covered by the display
property. (The ch_len and nchars members of 'struct bidi_it'
reflect the length in bytes and characters of that text.) This is
so we reorder text on both sides of the display property as
appropriate for an image or embedded string. Similarly, text
covered by a display spec of the form '(space ...)', is replaced
with the u+2029 paragraph separator character, so such display
specs produce the same effect as a TAB under UBA. Both these
special characters are not actually displayed -- the display
property is displayed instead -- but just used to compute the
embedding level of the surrounding text so as to produce the
required effect.
Bidi iterator states
--------------------
The UBA is highly context dependent in some of its parts,
i.e. results of processing a character can generally depend on
characters very far away. The UAX#9 description of the UBA
prescribes a stateful processing of each character, whereby the
results of this processing depend on various state variables, such
as the current embedding level, level stack, and directional
override status. In addition, the UAX#9 description includes many
passages like this (from rule W2 in this case):
Search backward from each instance of a European number until the
first strong type (R, L, AL, or sos) is found. If an AL is found,
change the type of the European number to Arabic number.
To support this, we use a bidi iterator object, 'struct bidi_it',
which is a sub-structure of 'struct it' used by xdisp.c (see
dispextern.h for the definition of both of these structures). The
bidi iterator holds the entire state of the iteration required by
the UBA, and is updated as the text is traversed. In particular,
the embedding level of the current character being resolved is
recorded in the iterator state. To avoid costly searches backward
in support of rules like W2 above, the necessary character types
are also recorded in the iterator state as they are found during
the forward scan, and then used when such rules need to be applied.
(Forward scans cannot be avoided in this way; they need to be
performed at least once, and the results recorded in the iterator
state, to be reused until the forward scan oversteps the recorded
position.)
In this manner, the iterator state acts as a mini-cache of
contextual information required for resolving the level of the
current character by various UBA rules.
Caching of bidi iterator states
-------------------------------
As described above, the reordering engine uses the information
recorded in the bidi iterator state in order to resolve the
embedding level of the current character. When the reordering
engine needs to process the next character in the logical order, it
fetches it and applies to it all the UBA levels, updating the
iterator state as it goes. But when the buffer or string is
scanned backwards, i.e. in the reverse order of buffer/string
positions, the scanned characters were already processed during the
preceding forward scan (see bidi_find_other_level_edge). To avoid
costly re-processing of characters that were already processed
during the forward scan, the iterator states computed while
scanning forward are cached.
The cache is just a linear array of 'struct bidi_it' objects, which
is dynamically allocated and reallocated as needed, since the size
of the cache depends on the text being processed. We only need the
cache while processing embedded levels higher than the base
paragraph embedding level, because these higher levels require
changes in scan direction. Therefore, as soon as we are back to
the base embedding level, we can free the cache; see the calls to
bidi_cache_reset and bidi_cache_shrink, for the conditions to do
this.
The cache maintains the index of the next unused cache slot -- this
is where the next iterator state will be cached. The function
bidi_cache_iterator_state saves an instance of the state in the
cache and increments the unused slot index. The companion function
bidi_cache_find looks up a cached state that corresponds to a given
buffer/string position. All of the cached states must correspond
1:1 to the buffer or string region whose processing they reflect;
bidi.c will abort if it finds cache slots that violate this 1:1
correspondence.
When the parent iterator 'struct it' is pushed (see push_it in
xdisp.c) to pause the current iteration and start iterating over a
different object (e.g., a 'display' string that covers some buffer
text), the bidi iterator cache needs to be "pushed" as well, so
that a new empty cache could be used while iterating over the new
object. Later, when the new object is exhausted, and xdisp.c calls
pop_it, we need to "pop" the bidi cache as well and return to the
original cache. See bidi_push_it and bidi_pop_it for how this is
done.
Some functions of the display engine save copies of 'struct it' in
local variables, and restore them later. For examples, see
pos_visible_p and move_it_in_display_line_to in xdisp.c, and
window_scroll_pixel_based in window.c. When this happens, we need
to save and restore the bidi cache as well, because conceptually
the cache is part of the 'struct it' state, and needs to be in
perfect sync with the portion of the buffer/string that is being
processed. This saving and restoring of the cache state is handled
by bidi_shelve_cache and bidi_unshelve_cache, and the helper macros
SAVE_IT and RESTORE_IT defined on xdisp.c.
Note that, because reordering is implemented below the level in
xdisp.c that breaks glyphs into screen lines, we are violating
paragraph 3.4 of UAX#9. which mandates that line breaking shall be
done before reordering each screen line separately. However,
following UAX#9 to the letter in this matter goes against the basic
design of the Emacs display engine, and so we choose here this
minor deviation from the UBA letter in preference to redesign of
the display engine. The effect of this is only seen in continued
lines that are broken into screen lines in the middle of a run
whose direction is opposite to the paragraph's base direction.
Important design and implementation note: when the code needs to
scan far ahead, be sure to avoid such scans as much as possible
when the buffer/string doesn't contain any RTL characters. Users
of left-to-right scripts will never forgive you if you introduce
some slow-down due to bidi in situations that don't involve any
bidirectional text. See the large comment near the beginning of
bidi_resolve_neutral, for one situation where such shortcut was
necessary. */
#include <config.h>
#include <stdio.h>
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment