Wednesday, December 16, 2009

Encoding strings to UTF-8 in Java

The following code encodes a string as UTF-8.
 
String s;
// ...
byte[] b = null;
try {
    b = s.getBytes("UTF-8"); //$NON-NLS-1$
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}

However, small errors can lead to interesting consequences. To type uppercase letters I hold down the Shift key; I rarely use Caps Lock. While writing the "UTF-8" part of the code above, I accidentally typed "UTF_8", because I was still holding down Shift when I pressed the -/_ key. I use IBM JREs (1.4.2, 1.5 and 1.6) and tried this with all three of them, and it worked like a charm. Then someone else tried the same code with another VM and it failed. Strange, I thought...

Digging a bit deeper, http://java.sun.com/j2se/1.3/docs/guide/intl/encoding.doc.html lists the correct names as UTF8 and UTF-8, so UTF_8 should not work. Digging a bit more: IBM JRE 1.4.2 contains sun.io.CharacterEncoding in core.jar, and the alias table in that class also contains only UTF8 and UTF-8. However, the class also has a private method 'replaceDash(String)', which seems to replace '_' with '-' (I was only looking at the class file, so I can't be too sure). I could not find any other reason why UTF_8 was working.
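One way to see how a particular JRE treats each spelling is to probe it through the same String.getBytes(String) call used above. The snippet below is only a rough sketch: the class name CharsetNameProbe and the helper canEncodeWith are made up for illustration, and the result for "UTF_8" depends on the JRE it runs on.

```java
import java.io.UnsupportedEncodingException;

public class CharsetNameProbe {

    /**
     * Hypothetical helper: returns true if this JRE accepts the given
     * charset name in String.getBytes(String), false if it rejects it.
     */
    public static boolean canEncodeWith(String charsetName) {
        try {
            "probe".getBytes(charsetName);
            return true;
        } catch (UnsupportedEncodingException e) {
            // The JRE does not recognise this name or alias
            return false;
        }
    }

    public static void main(String[] args) {
        // "UTF-8" must be supported everywhere; the other results may vary by JRE
        String[] candidates = { "UTF-8", "UTF8", "UTF_8" };
        for (int i = 0; i < candidates.length; i++) {
            System.out.println(candidates[i] + " accepted: " + canEncodeWith(candidates[i]));
        }
    }
}
```

Running this on each JRE you need to support shows quickly whether a name is accepted everywhere or only happens to work on one vendor's VM.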

It seems a bit silly, even dangerous, for the IBM JRE to accept non-standard aliases, because code that relies on them fails on other JREs.
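On Java 7 and later, this kind of typo can be avoided by not spelling the charset name in a string at all. The sketch below assumes such a JRE is available (it does not apply to the 1.4.2 to 1.6 JREs mentioned above); the class name Utf8Constant and the helper toUtf8 are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Constant {

    /**
     * Hypothetical helper: encodes s as UTF-8 using the predefined
     * Charset constant instead of a charset name string.
     */
    public static byte[] toUtf8(String s) {
        // No checked exception here: the charset is resolved at compile time
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] b = toUtf8("h\u00e9"); // 'h' followed by e-acute
        System.out.println(b.length + " bytes");
    }
}
```

A misspelled constant is a compile error rather than a runtime failure on some other VM, and the try/catch for UnsupportedEncodingException goes away.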

See the standard charset names in the IANA registry
