Texis Version 6 has improved Unicode
(international/foreign/hi-bit/UTF-8) character support. Two new
settings were introduced: textsearchmode and
stringcomparemode. Both have the same set of possible values,
and offer more flexibility in how text searches and string comparisons
(respectively) are handled. Some features:
- Full Unicode case-insensitivity
By default, text searches (e.g. the
LIKE operator) are
case-insensitive in version 6 for the entire Unicode 5.1
locale-independent character set, not just the given operating
system's locale (which may be inconsistent and does not support
characters beyond U+00FF).
- UTF-8 support
UTF-8 is the expected character set, though ISO-8859-1 is
still accepted. (Other character sets are converted automatically.)
- Full-width ASCII
Full-width ASCII characters (used in CJK contexts) match their
normal/half-width ASCII counterparts.
- Diacritics ignored
Diacritic marks - umlauts, accents, etc. - are ignored, so
that e.g. "für" matches "fur".
- Ligatures expanded
Ligatures are expanded to match their expanded counterparts,
e.g. "œ" (U+0153) will match "oe".
All of these behaviors can be controlled with the
textsearchmode and stringcomparemode apicp
settings (see the Vortex manual for details).
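The matching behaviors listed above can be approximated outside Texis with a short Python sketch. This is an illustration under stated assumptions, not Texis's implementation: the `fold` and `matches` helpers and the `LIGATURES` table are hypothetical names introduced here, and the table covers only the "œ" example from the list.

```python
import unicodedata

# Hypothetical ligature table, for illustration only. NFKD does not
# decompose "œ", so it needs an explicit mapping.
LIGATURES = {"\u0153": "oe", "\u0152": "OE"}


def fold(text: str) -> str:
    """Approximate the folding described above (not Texis's own code)."""
    # Expand ligatures that have no Unicode decomposition.
    text = "".join(LIGATURES.get(ch, ch) for ch in text)
    # NFKD maps full-width ASCII to plain ASCII and splits accented
    # letters into a base letter plus combining marks.
    text = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, i.e. ignore diacritics.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Case-fold so the comparison is case-insensitive beyond ASCII.
    return text.casefold()


def matches(a: str, b: str) -> bool:
    """Return True when two strings are equal after folding."""
    return fold(a) == fold(b)
```

With these helpers, `matches("f\u00fcr", "fur")` and `matches("\u0153", "oe")` both hold, mirroring the diacritic and ligature examples in the list.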
Caveat: A version 5 or earlier Texis should not access or modify
a regular (B-tree) or Metamorph index originally created by a version
6 or later Texis, unless stringcomparemode was set to
ctype, respectcase, iso-8859-1 (regular indices) or
textsearchmode was set to ctype, ignorecase, iso-8859-1
(Metamorph indices) at creation. Otherwise, if hi-bit/UTF-8/Unicode
characters exist in the data, modifications made by Texis 5 may
corrupt the index.
The stringcomparemode setting also affects the functions
<xtree>, <strstr>, <strstri>, <substr>,
<strcmp>, <strcmpi>, <strncmp>,
<strnicmp>, <strlen>, <strrev>, <upper>,
<lower>, <sort>, <uniq>, upper(),
lower(), initcap(), text2mm() and
length(). The length()/<strlen> functions
count characters in the current character set (e.g. UTF-8 characters), not bytes.
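The difference between counting characters and counting bytes can be seen in a small Python sketch. Python's `len` on a `str` counts Unicode code points, which is used here only as an analogy for the character counting described above; it is an assumption that this corresponds to what length() returns for UTF-8 data.

```python
# "für", written with an escape so the source encoding does not matter.
text = "f\u00fcr"

# Counting characters: each code point counts once.
char_count = len(text)

# Counting bytes: "ü" takes two bytes in UTF-8 but one in ISO-8859-1.
utf8_byte_count = len(text.encode("utf-8"))
latin1_byte_count = len(text.encode("iso-8859-1"))

print(char_count, utf8_byte_count, latin1_byte_count)  # prints: 3 4 3
```

A byte-based count would report 4 for the UTF-8 form of this string, while a character-based count reports 3.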
Version 5 and earlier behavior can be made the default by setting,
in conf/texis.ini, [Apicp] Text Search Mode
to ctype, ignorecase, iso-8859-1 and [Apicp] String Compare Mode
to ctype, respectcase, iso-8859-1.
Copyright © Thunderstone Software Last updated: Mon Feb 18 10:28:15 EST 2013