Character Encodings

There are two separate topics to discuss regarding encodings: reading and writing XML, and working with data internally.

The XML API can read and write many character encodings, using the GNU libiconv library. The reading and writing encodings need not be the same. For example, a SHIFT_JIS document can be written as UTF-8, and vice versa.

When working within the library, everything is UTF-8 regardless of what character encoding it was read from or will be written to. This deserves stressing:

  • The XML API can parse many character encodings.

  • All data extracted through the XML API is UTF-8.

  • All data changed or inserted through the XML API must be UTF-8.

  • The XML API can serialize documents in many character encodings.

This means that a document may exist on disk in ISO-8859-1, but when the XML API parses it and you call xmlTreeGetContent() to get the text from an element, you'll get UTF-8 data. If the file exists on disk in ASCII, calling xmlTreeGetContent() will still give UTF-8 data.

Similarly, regardless of whether a document will be output in BIG5, ISO-8859-7, UTF-32, etc., when adding a new element with xmlTreeNewElement(), the name and contents must be given in UTF-8.

This may sound restrictive, but it is actually liberating: in your code, you never have to worry about which encoding the file was read from or which encoding it will be written in. Always use UTF-8.
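The parse, edit, and serialize flow described above can be illustrated with a rough analogy outside Vortex. The sketch below uses Python's standard xml.etree.ElementTree module, not the Vortex XML API, so it does not show the syntax of xmlTreeGetContent() or xmlTreeNewElement(). The sample document and the variable names are assumptions made for illustration. Python also hands back Unicode strings rather than UTF-8 bytes, so the explicit encode step stands in for the UTF-8 data the XML API would return.

```python
import xml.etree.ElementTree as ET

# A document as it might exist on disk: declared and encoded as ISO-8859-1.
source_bytes = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<menu><item>café</item></menu>"
).encode("iso-8859-1")

# Parsing decodes the on-disk bytes; the caller never sees ISO-8859-1.
root = ET.fromstring(source_bytes)
text = root.find("item").text

# Encoded as UTF-8, "é" is two bytes, not the single 0xE9 byte in the file.
text_utf8 = text.encode("utf-8")

# New content is supplied in the internal form, regardless of output encoding.
new_item = ET.SubElement(root, "item")
new_item.text = "naïve"

# Only at serialization time is an output encoding chosen.
output_bytes = ET.tostring(root, encoding="utf-16")
```

The code between parsing and serializing never mentions ISO-8859-1 or UTF-16. That is the same separation the XML API provides by standardizing on UTF-8 internally.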

The default encoding when working in Vortex is already UTF-8 (unless manually changed with <urlcp charsettxt>). If you have data that you need to convert to UTF-8, you can use the <urlutil charsetconv> Vortex function.
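As a rough picture of what converting data to UTF-8 involves, here is a minimal Python sketch. It is an analogy only: to_utf8 is a hypothetical helper, not a Vortex function, and it does not reproduce the syntax or options of <urlutil charsetconv>. It assumes the source encoding of the data is already known.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Transcode bytes from a known source encoding to UTF-8 (hypothetical helper)."""
    # Decode to Unicode text first, then re-encode as UTF-8.
    return data.decode(source_encoding).encode("utf-8")


# Sample data held in a legacy encoding (Shift_JIS here, chosen for illustration).
legacy = "日本語".encode("shift_jis")

# Convert before handing the data to any API that expects UTF-8.
converted = to_utf8(legacy, "shift_jis")
```

The conversion happens once, at the boundary where the legacy data enters the program, so everything downstream can assume UTF-8.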


Copyright © Thunderstone Software     Last updated: Apr 15 2024