Sunday, 13 July 2014

Reading XML Files with Foreign Characters and Different Encodings in Informatica

Issue: XML Reader: Error [XMLException_Fatal] occurred while parsing:[FATAL: Error at (file EMPTY, line 1, char 1 ): An exception occurred! Type:UTFDataFormatException, Message:invalid byte 2 (c) of a 2-byte sequence..]

Have you ever tried reading an XML file with French characters in Informatica and struggled? You are probably not the first. Anything to do with XML in Informatica takes a lot of trial and error.
This is how I finally succeeded in parsing the XML.

The XML file I was reading was generated by a web service and carried no information about its encoding. The XML parser failed every time with the UTFDataFormatException error shown above. The DTD that was used to create the XML was generated by XMLSpy.

The first thing to check, obviously, is the code page used on the Informatica server. Everything I found recommended using UTF-8 (Unicode) encoding. However, the code page the server was already using, MS Windows Latin 1 (ANSI), superset of Latin 1, was apparently good enough to parse French characters.

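The second thing to check is the encoding of the XML itself. Apparently the XML was encoded in ISO-8859-1, and the XML parser was not able to detect this. An XML file with no encoding declaration is treated as UTF-8 by default, so the single ISO-8859-1 bytes used for accented characters are read as broken UTF-8 sequences.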
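A quick way to confirm that a file is not valid UTF-8 is to try decoding it and look at where the decode fails. The following is a minimal Python sketch, not part of the Informatica setup; the function name find_invalid_utf8 and the sample data are made up for illustration. It only shows that the bytes are not UTF-8. It cannot prove that they are ISO-8859-1, since any byte sequence decodes as ISO-8859-1.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, byte value) of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


# Hypothetical sample: "é" stored as ISO-8859-1 is the single byte 0xE9,
# which is not a complete UTF-8 sequence on its own.
sample = "<name>Café au lait</name>".encode("iso-8859-1")
problem = find_invalid_utf8(sample)

if problem is None:
    print("File decodes cleanly as UTF-8")
else:
    offset, value = problem
    print(f"Invalid UTF-8 at byte offset {offset}: 0x{value:02X}")
```

For a real file, read it in binary mode (open(path, "rb").read()) and pass the bytes to the same function.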

I finally found that there are a couple of ways to deal with this issue:

1) Add the <?xml version="1.0" encoding="ISO-8859-1"?> declaration as the first line of the XML, change the DTD to use encoding ISO-8859-1, and recreate the XML parser. I ran the parser and the job succeeded.

2) Change the encoding on the DTD to UTF-8 and use the iconv tool to convert the XML from ISO-8859-1 to UTF-8. The command to convert the XML is:

iconv -f ISO-8859-1 -t UTF-8 yourxmlfile.xml > yourxmlfile_utf8.xml

Either of the two approaches above lets the XML parser handle foreign characters.
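Both approaches can be tried outside Informatica before touching the mapping. The sketch below is in Python and assumes the input has no XML declaration, as was the case with my file. The helper names add_declaration and to_utf8 and the sample data are hypothetical. It uses Python's standard xml.etree parser as a stand-in to check that the result parses; it says nothing about how the Informatica XML parser behaves internally.

```python
import xml.etree.ElementTree as ET

DECLARATION = b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'


def add_declaration(data: bytes) -> bytes:
    """Approach 1: label the bytes as ISO-8859-1 without changing them."""
    if data.lstrip().startswith(b"<?xml"):
        return data  # a declaration is already present; leave it alone
    return DECLARATION + data


def to_utf8(data: bytes, source_encoding: str = "iso-8859-1") -> bytes:
    """Approach 2: re-encode the bytes as UTF-8, similar to iconv -f ... -t UTF-8.

    Assumes the input carries no encoding declaration that would
    contradict the new encoding.
    """
    return data.decode(source_encoding).encode("utf-8")


# Hypothetical sample standing in for the web service output.
raw = "<name>Café</name>".encode("iso-8859-1")

labelled = ET.fromstring(add_declaration(raw))   # parsed as declared ISO-8859-1
converted = ET.fromstring(to_utf8(raw))          # parsed as default UTF-8

print(labelled.text, converted.text)
```

Parsing the raw bytes directly with ET.fromstring(raw) fails with a parse error, which mirrors the original problem: without a declaration the parser assumes UTF-8.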
