    Support casing characters which map into multiple code points (bug#24603) · b3b9b258
    Michal Nazarewicz authored
    Implement unconditional special casing rules defined in the Unicode
    standard.
    
    Among other things, they deal with cases where a single code point is
    replaced by multiple ones because a single-character equivalent does
    not exist (e.g. the ‘ﬁ’ ligature turning into ‘FI’) or is not commonly
    used (e.g. ß turning into SS).
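
A minimal sketch of such a one-to-many mapping, for illustration only.  The struct `special_case`, the table `special_upper` and the function `special_upcase` are hypothetical names, not the data layout or API Emacs uses; the three table entries match the uppercase mappings listed in SpecialCasing.txt.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: one source code point and up to three
   replacement code points.  */
struct special_case {
  uint32_t from;
  uint32_t to[3];
  size_t len;
};

static const struct special_case special_upper[] = {
  { 0x00DF, { 'S', 'S' }, 2 },  /* ß -> SS */
  { 0xFB01, { 'F', 'I' }, 2 },  /* ﬁ ligature -> FI */
  { 0xFB02, { 'F', 'L' }, 2 },  /* ﬂ ligature -> FL */
};

/* Store the uppercase form of CP in OUT and return its length in
   code points.  Anything not in the table falls back to a plain
   ASCII one-to-one mapping, which is a simplification.  */
size_t special_upcase(uint32_t cp, uint32_t out[3])
{
  size_t n = sizeof special_upper / sizeof special_upper[0];
  for (size_t i = 0; i < n; i++)
    if (special_upper[i].from == cp) {
      for (size_t j = 0; j < special_upper[i].len; j++)
        out[j] = special_upper[i].to[j];
      return special_upper[i].len;
    }
  out[0] = (cp >= 'a' && cp <= 'z') ? cp - 'a' + 'A' : cp;
  return 1;
}
```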
    
    * admin/unidata/SpecialCasing.txt: New data file pulled from the
    Unicode standard distribution.
    * admin/unidata/README: Mention SpecialCasing.txt.
    
    * admin/unidata/unidata-gen.el (unidata-gen-table-special-casing,
    unidata-gen-table-special-casing--do-load): New functions generating
    the ‘special-uppercase’, ‘special-lowercase’ and ‘special-titlecase’
    Unicode character properties, built from the SpecialCasing.txt Unicode
    data file.
    
    * src/casefiddle.c (struct casing_str_buf): New structure for
    representing short strings used to handle one-to-many character
    mappings.
    
    (case_character_impl): New function which can handle one-to-many
    character mappings.
    (case_character, case_single_character): Wrappers for the above
    function.  The former may map one character to multiple (or no)
    code points, while the latter does what the former used to do (i.e.
    handles one-to-one mappings only).
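
The arrangement described above (a short string buffer, one core routine, and two thin wrappers) can be sketched roughly as follows.  All names ending in `_sketch` are hypothetical, the buffer fields are an assumption since this message does not list the fields of struct casing_str_buf, and only ASCII plus ß is handled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical short buffer holding the result of casing one
   character; not the actual layout of struct casing_str_buf.  */
struct str_buf_sketch {
  uint32_t cp[3];
  size_t len;
};

/* Core routine.  If BUF is non-null and CP has a one-to-many
   mapping, fill BUF and return true.  Otherwise store the
   one-to-one result in *SINGLE and return false.  */
static bool upcase_impl_sketch(uint32_t cp, struct str_buf_sketch *buf,
                               uint32_t *single)
{
  if (buf && cp == 0x00DF) {    /* ß -> SS */
    buf->cp[0] = 'S';
    buf->cp[1] = 'S';
    buf->len = 2;
    return true;
  }
  *single = (cp >= 'a' && cp <= 'z') ? cp - 'a' + 'A' : cp;
  return false;
}

/* Wrapper that may produce several code points.  */
void upcase_multi_sketch(uint32_t cp, struct str_buf_sketch *buf)
{
  uint32_t single;
  if (!upcase_impl_sketch(cp, buf, &single)) {
    buf->cp[0] = single;
    buf->len = 1;
  }
}

/* Wrapper restricted to one-to-one mappings: passing a null buffer
   disables the one-to-many path, so ß is returned unchanged.  */
uint32_t upcase_single_sketch(uint32_t cp)
{
  uint32_t single;
  upcase_impl_sketch(cp, NULL, &single);
  return single;
}
```

Callers that can only store a single character (for example when casing an integer character code) would use the one-to-one wrapper; callers rewriting strings or buffers would use the multi-code-point one.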
    
    (do_casify_natnum, do_casify_unibyte_string,
    do_casify_unibyte_region): Use case_single_character.
    (do_casify_multibyte_string, do_casify_multibyte_region): Support new
    features of case_character.
    (do_casify_region): Update to reflect do_casify_multibyte_string
    changes.
    
    (casify_word): Handle the situation where the length of a word in
    characters can change, affecting where the end of the word is.
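
To illustrate why the end of a word can move, here is a rough sketch operating on an array of code points rather than on an Emacs buffer.  The function `upcase_word_sketch` and its parameters are hypothetical and only ß is expanded; it is not how casify_word is implemented.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Upcase the code points in TEXT[START, END).  *LEN is the number of
   code points in use and CAP the capacity of TEXT.  Return the new
   end of the word, which moves right by one each time ß (U+00DF)
   expands to "SS", or (size_t) -1 if TEXT has no room left.  */
size_t upcase_word_sketch(uint32_t *text, size_t *len, size_t cap,
                          size_t start, size_t end)
{
  size_t i = start;
  while (i < end) {
    if (text[i] == 0x00DF) {
      if (*len + 1 > cap)
        return (size_t) -1;
      /* Shift everything after ß right by one slot.  */
      memmove(&text[i + 2], &text[i + 1],
              (*len - i - 1) * sizeof text[0]);
      text[i] = 'S';
      text[i + 1] = 'S';
      ++*len;
      ++end;                    /* the word grew, so its end moves */
      i += 2;
    } else {
      if (text[i] >= 'a' && text[i] <= 'z')
        text[i] = text[i] - 'a' + 'A';
      i++;
    }
  }
  return end;
}
```

A caller that cached the original end position before casing would afterwards point one character short of the real word boundary, which is the situation the entry above describes.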
    
    (upcase, capitalize, upcase-initials): Update documentation to mention
    limitations when working on characters.
    
    * test/src/casefiddle-tests.el (casefiddle-tests-char-properties):
    Add test cases for the newly introduced character properties.
    (casefiddle-tests-casing): Update test cases which are now passing.
    
    * test/lisp/char-fold-tests.el (char-fold--ascii-upcase,
    char-fold--ascii-downcase): New functions which behave like old ‘upcase’
    and ‘downcase’.
    (char-fold--test-match-exactly): Use the new functions.  This is needed
    because otherwise ‘ﬁ’ and similar characters are turned into their
    multi-character representation.
    
    * doc/lispref/strings.texi: Describe issue with casing characters versus
    strings.
    * doc/lispref/nonascii.texi: Describe the new character properties.