XML Encoding, UTF-8 / UTF-16 Confusion

Posted on 2011/02/03 by shanmoon

Here’s a frustrating little problem I found when a service I deal with (we’ll call it SystemA for “Awesome”) suddenly changed character encoding… My app started getting parse exceptions for XML messages after an upgrade to SystemA was deployed to a test environment. A peek at my logs showed the XML response looked funky, with extra spaces all throughout it… no wonder my XML API went blooey:

< ? x m l v e r s i o n = " 1 . 0 " e n c o d i n g = " U T F - 8 " ? >

I blinked a little, then tried to copy and paste from the log file into a bug note and got this little gem from TextPad:

Cannot cut, copy, or drag and drop text containing null (code = 0) characters.


Sweet!

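I opened the file up in a hex editor, and lo and behold there were extra null characters all through it. Even though the XML declaration specified UTF-8, it looked like the file was actually encoded in UTF-16. That explains the “spaces” in my log: UTF-16 stores each plain ASCII character as two bytes, one of which is zero, and the log viewer was rendering those zero bytes as blanks.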
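If you'd rather not reach for a hex editor every time, the same check can be roughed out in code. This is only a minimal sketch in Python (not what I actually ran): the function name `sniff_encoding` is made up for illustration, it looks only at the first few bytes (a byte order mark, or the tell-tale null next to the opening `<`), and it doesn't try to handle UTF-32 or anything more exotic.

```python
import codecs


def sniff_encoding(data: bytes) -> str:
    """Guess the encoding of an XML payload from its first bytes (hypothetical helper)."""
    # A byte order mark, if present, is the strongest hint.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # Python's "utf-16" codec reads the BOM to pick the byte order
    # No BOM: an XML document starts with "<", so a null byte right next to it
    # means each character is taking up two bytes.
    if data.startswith(b"<\x00"):
        return "utf-16-le"
    if data.startswith(b"\x00<"):
        return "utf-16-be"
    # Otherwise assume the usual default.
    return "utf-8"


# A payload like the one in my logs: it claims UTF-8 but is really UTF-16.
sample = '<?xml version="1.0" encoding="UTF-8"?><msg/>'.encode("utf-16-le")
print(sniff_encoding(sample))  # prints "utf-16-le"
```

The point is just that the bytes themselves tell you more than the declaration does.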

A quick hack to my code to override the encoding and use UTF-16, and the XML was readable again… Looks like SystemA swapped encodings without informing anyone, and without updating their code to write the correct encoding into the encoding attribute of the XML declaration. Makes me wonder if they string all their XML together by hand with String concatenations instead of using an XML library (shudder). Just goes to show: just ’cause the encoding attribute says one thing, the bytes might actually be something else. Never presume that good coding practices are followed inside the black box that is someone else’s application.
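For the curious, the override hack boils down to something like the following. Again, this is a hedged sketch in Python rather than my real code: `parse_mislabelled_xml` and its `actual_encoding` parameter are names invented for the example, and it assumes you already know (or have sniffed) what the bytes really are. The trick is to decode the bytes yourself and hand the parser text, so the bogus encoding attribute never gets a say.

```python
import xml.etree.ElementTree as ET


def parse_mislabelled_xml(data: bytes, actual_encoding: str) -> ET.Element:
    """Parse XML whose declared encoding is wrong (hypothetical helper).

    The caller supplies the encoding the bytes are really in; the value in
    the XML declaration is not trusted.
    """
    # Decode with the real encoding first...
    text = data.decode(actual_encoding)
    # ...then parse the resulting str, not the raw bytes.
    return ET.fromstring(text)


# Simulated SystemA response: declares UTF-8, is actually UTF-16 (little endian).
payload = '<?xml version="1.0" encoding="UTF-8"?><msg>hello</msg>'.encode("utf-16-le")
root = parse_mislabelled_xml(payload, "utf-16-le")
print(root.tag, root.text)  # prints "msg hello"
```

It's a workaround, not a fix. The real fix is for the sender to make the declaration match the bytes.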