Tuesday, October 11, 2011

Internationalization tips for XML

To make an XML document's Unicode encoding explicit, place the following declaration at the very start of the file, before any other content:

<?xml version="1.0" encoding="UTF-8"?>
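
As a rough illustration, the declaration can also be produced programmatically rather than typed by hand. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `to_utf8_xml` and the `greeting` element are made up for this example, not part of any existing tool.

```python
import io
import xml.etree.ElementTree as ET


def to_utf8_xml(root):
    """Serialize an element tree to UTF-8 bytes with an XML declaration."""
    buf = io.BytesIO()
    # encoding="UTF-8" makes the writer emit UTF-8 bytes;
    # xml_declaration=True forces the <?xml ... ?> line to be written first.
    ET.ElementTree(root).write(buf, encoding="UTF-8", xml_declaration=True)
    return buf.getvalue()


# Hypothetical sample document containing non-ASCII text.
root = ET.Element("greeting")
root.text = "こんにちは"
data = to_utf8_xml(root)
```

ElementTree writes the declaration with single quotes, which is equivalent as far as XML parsers are concerned, and it does not add a BOM.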


Apart from this, there is a BOM issue when saving XML files that contain Unicode characters. Many Windows-based text editors add the bytes 0xEF, 0xBB, 0xBF at the start of any document saved in UTF-8. This byte sequence is the UTF-8 encoding of the Unicode byte-order mark (BOM), although in UTF-8 it has nothing to do with byte order. The BOM can also appear when text in another encoding that uses a BOM is converted to UTF-8 without stripping it.
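
To make those three bytes concrete, here is a small Python sketch. It assumes Python's `utf-8-sig` codec as a stand-in for what such editors do on save, and it reproduces the "converted without stripping" case with a hand-built UTF-16 byte string; the sample text is arbitrary.

```python
import codecs

text = '<?xml version="1.0" encoding="UTF-8"?><a/>'

plain = text.encode("utf-8")         # no BOM
with_bom = text.encode("utf-8-sig")  # BOM + the same bytes

# The UTF-8 BOM is exactly the three bytes mentioned above.
bom = codecs.BOM_UTF8                # b"\xef\xbb\xbf"

# A UTF-16 file carries its own BOM. Decoding it with a codec that does
# not strip the BOM keeps it as the character U+FEFF, and re-encoding
# to UTF-8 turns that character into EF BB BF.
utf16_file = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
converted = utf16_file.decode("utf-16-le").encode("utf-8")
```

Here `with_bom` and `converted` both start with the BOM, while `plain` does not.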

The presence of the UTF-8 BOM may cause interoperability problems with existing software that could otherwise handle UTF-8, for example:

  • Older text editors may display the BOM as "ï»¿" at the start of the document, even if the UTF-8 file contains only ASCII and would otherwise display correctly.
  • Programming language parsers can often handle UTF-8 in string constants and comments, but cannot parse the BOM at the start of the file.
  • Programs that identify file types by their leading characters may fail to identify a file if a BOM is present, even if the consumer of the file could skip the BOM. Conversely, they may identify the file when the consumer cannot handle the BOM. An example is the UNIX shebang syntax.
  • Programs that insert information at the start of a file produce a file with the BOM somewhere in the middle of it (this is also a problem with the UTF-16 BOM). One example is offline browsers that add the originating URL to the start of the file.
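
Two of the problems listed above are easy to reproduce. The following is only a sketch: `looks_like_script` is a deliberately naive, hypothetical file-type check, and the "saved from" header is an invented stand-in for what an offline browser might prepend.

```python
import codecs

BOM = codecs.BOM_UTF8  # b"\xef\xbb\xbf"


def looks_like_script(data: bytes) -> bool:
    """Naive file-type sniffing: a script must start with the bytes '#!'."""
    return data.startswith(b"#!")


script = b"#!/bin/sh\necho hello\n"
script_with_bom = BOM + script  # same script, saved with a BOM

# A tool that prepends a header pushes the BOM into the middle of the file.
header = b"<!-- saved from url=(hypothetical) -->\n"
prefixed = header + BOM + b"<root/>"

# "utf-8-sig" strips a BOM only at the very start, so this one survives
# as the character U+FEFF inside the decoded text.
text = prefixed.decode("utf-8-sig")
bom_position = text.find("\ufeff")
```

The check accepts `script` but rejects `script_with_bom`, and `bom_position` ends up equal to the length of the header, right where the original file used to begin.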

If compatibility with existing programs is not important, the BOM could be used to tell a UTF-8 file from one in a legacy encoding. This is still problematic, because the BOM is often added or removed without the encoding actually changing, and files in different encodings are sometimes concatenated together. Checking whether the text is valid UTF-8 is more reliable than relying on the BOM. It is better to omit the BOM when saving Unicode files. One of the solutions and some discussion surrounding the problem can be found here
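
Both recommendations can be sketched in a few lines of Python. The function names `is_valid_utf8` and `strip_utf8_bom` are illustrative only, and the validity check simply relies on Python's strict UTF-8 decoder.

```python
import codecs


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8, with or without a BOM."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM, if present, before saving."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


utf8_bytes = "café".encode("utf-8")
latin1_bytes = "café".encode("latin-1")  # 0xE9 alone is not valid UTF-8
cleaned = strip_utf8_bom(codecs.BOM_UTF8 + b"<a/>")
```

A check like this can still be fooled by legacy text that happens to be pure ASCII, but ASCII decodes identically in UTF-8, so the result is harmless in that case.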
