Unicode and Encoding: Python vs Java Shootout, part 2

So, here we are at the second part; if you missed the first one, go read it first. Everything that follows refers to Sun Java 6, but most details, if not all of them, still hold for Java 7 and 8.

Let's see how Java handles Unicode issues and why it's usually less problematic than Python, or at least why it looks less problematic than Python.

String literals in Java, just like in Python 3, are simply Unicode objects, so you never need to wonder whether you're holding a "binary string" or a "unicode object". A binary blob will most likely show up as a byte[], and that type alone should trigger a 'here be dragons' reaction about encoding.

Nevertheless, whenever you want to print a string, what you actually need is binary data: raw bytes! How does Java find out what to do?

System.out, which is an instance of java.io.PrintStream, internally uses a java.io.OutputStreamWriter initialized with Charset.defaultCharset(); that charset is then used for every conversion between strings and raw bytes:

Encoding.java

package eu.franzoni.examples.pythonvsjavaencoding;

import java.nio.charset.Charset;

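public class Encoding {
    public static void main(String[] args) throws Exception {
        String myString = "àèìòù";
        // System.out is a PrintStream, which turns chars into bytes on the way out
        System.out.println(System.out.getClass());
        // the charset the JVM detected at startup
        System.out.println(Charset.defaultCharset());
        // encoded implicitly by the PrintStream
        System.out.println(myString);
        // encoded explicitly, then written as raw bytes
        System.out.write(myString.getBytes(Charset.defaultCharset()));
        System.out.println("");
    }
}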

result:

class java.io.PrintStream  
UTF-8  
àèìòù
àèìòù

The default charset is detected by the JVM at startup, and it can vary across OSes and language environment settings. On Sun Java you can even set it at JVM launch time through the file.encoding property, but that's discouraged because it can be unpredictable and unreliable. Instead, you should always use the constructors and methods that accept a charset parameter whenever you convert to or from raw bytes, such as String.getBytes(Charset charset) or OutputStreamWriter(OutputStream outputStream, Charset charset).
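As a minimal sketch of that advice, the class below always names its charset instead of relying on the platform default. The class and method names (ExplicitCharset, encode, encodeWithWriter, decode) are made up for illustration, and the string is written with Unicode escapes so the sketch itself does not depend on how the source file is encoded.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.util.Arrays;

public class ExplicitCharset {
    // Looked up by name so the sketch also compiles on Java 6.
    static final Charset UTF8 = Charset.forName("UTF-8");

    // String -> bytes, with the charset spelled out.
    static byte[] encode(String text) {
        return text.getBytes(UTF8);
    }

    // Same conversion through a Writer built with an explicit charset.
    static byte[] encodeWithWriter(String text) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(sink, UTF8);
        writer.write(text);
        writer.close(); // flushes the encoded bytes into the sink
        return sink.toByteArray();
    }

    // bytes -> String, again with the charset spelled out.
    static String decode(byte[] raw) {
        return new String(raw, UTF8);
    }

    public static void main(String[] args) throws IOException {
        // a, e, i, o, u with grave accents, written as escapes
        String text = "\u00e0\u00e8\u00ec\u00f2\u00f9";
        byte[] raw = encode(text);
        System.out.println(raw.length); // 10: two bytes per char in UTF-8
        System.out.println(Arrays.equals(raw, encodeWithWriter(text))); // true
        System.out.println(text.equals(decode(raw))); // true
    }
}
```

Whatever the default charset of the machine happens to be, these three methods should behave the same way, which is the whole point of passing the charset explicitly.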

So far, so good. But what happens if the output stream is unable to handle the chars you'd like it to write?

CantEncode.java

package eu.franzoni.examples.pythonvsjavaencoding;

import java.io.FileOutputStream;  
import java.io.OutputStream;  
import java.io.PrintStream;

public class CantEncode {
    public static void main(String[] args) throws Exception {
        // replace System.out with a file-backed PrintStream that encodes as ASCII
        OutputStream outputStream = new FileOutputStream("eu.franzoni.examples/src/eu/franzoni/examples/pythonvsjavaencoding/CantEncode.output");
        PrintStream printStream = new PrintStream(outputStream, true, "ASCII");
        System.setOut(printStream);

        String myString = "àèìòù";

        System.out.println(System.out.getClass());
        // none of these chars can be represented in ASCII
        System.out.println(myString);
    }
}

result:

class java.io.PrintStream  
?????

ASCII cannot represent the accented chars, so Java replaces them with question marks instead of raising an error, the way Python does with its dreaded UnicodeEncodeError.

Which approach is best? That depends on the context. If you expect somebody to actually read your text, the Java approach might let an error slip by unnoticed for a long time, while the Python approach would trigger an immediate 'heads up!'. On the other hand, if you're just logging something to the console (or to any other stream), an exception might disrupt your otherwise well-working program and force an unnecessary quit. The Java approach just looks easier for the non-charset-aware because it throws fewer errors around.
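If you do want the Python-style 'heads up!' in Java, one option is to bypass PrintStream's silent substitution and encode through a java.nio.charset.CharsetEncoder configured to report problems. The following is only a sketch: StrictEncode, encodeStrict and canEncode are hypothetical names, and the string is written with Unicode escapes to keep the example independent of the source file encoding.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

public class StrictEncode {
    // Encode text, failing loudly instead of substituting '?'.
    static byte[] encodeStrict(String text, String charsetName)
            throws CharacterCodingException {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer encoded = encoder.encode(CharBuffer.wrap(text));
        byte[] raw = new byte[encoded.remaining()];
        encoded.get(raw);
        return raw;
    }

    // True if every char of text can be represented in the given charset.
    static boolean canEncode(String text, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // a, e, i, o, u with grave accents, written as escapes
        String text = "\u00e0\u00e8\u00ec\u00f2\u00f9";
        System.out.println(canEncode(text, "US-ASCII")); // false
        try {
            encodeStrict(text, "US-ASCII");
        } catch (CharacterCodingException e) {
            // Java's rough counterpart of UnicodeEncodeError
            System.out.println("cannot encode: " + e);
        }
    }
}
```

Checking with canEncode first is handy when you'd rather branch than catch; encodeStrict suits code paths where a lossy write should never go unnoticed.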

But there's another situation where Java falls short of Python: source file encoding. Python allows setting it on a per-file basis and, when the coding directive is missing, falls back to a fixed default (ASCII on Python 2, UTF-8 on Python 3) that doesn't depend on the machine. In Java you use the javac -encoding flag to set it globally, per compilation, to tell the compiler which encoding the source files use; if you don't pass that option, the platform default converter is used.

What's the problem with this approach? Consider a very common situation: you work on a project on Linux, your platform defaults to UTF-8 and everything just works.

Then you check out your project on Windows and compile it: the default charset there is windows-1252, and all your string literals print as garbage. That's what happens, in fact:

echo "utf8 compiled:"  
javac -encoding utf8 eu/franzoni/examples/pythonvsjavaencoding/Encoding.java && java eu.franzoni.examples.pythonvsjavaencoding.Encoding  
echo "windows-1252 compiled:"  
javac -encoding windows-1252 eu/franzoni/examples/pythonvsjavaencoding/Encoding.java && java eu.franzoni.examples.pythonvsjavaencoding.Encoding  

result:

utf8 compiled:  
class java.io.PrintStream  
UTF-8  
àèìòù
àèìòù
windows-1252 compiled:  
class java.io.PrintStream  
UTF-8  
Ã Ã¨Ã¬Ã²Ã¹
Ã Ã¨Ã¬Ã²Ã¹

As you can see, the JVM has no encoding issue at runtime: the default charset is UTF-8 in both runs. It's the parser, at compile time, that is told the wrong encoding for the string literals in your source code. There's no way around this!
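The mismatch can be reproduced at runtime, without recompiling anything, by encoding a string with one charset and decoding the bytes with another; that is roughly what the compiler does when it reads a UTF-8 file as windows-1252. This is just a sketch: SourceEncodingMismatch and misread are illustrative names, the string uses Unicode escapes so the example does not fall into the very trap it describes, and it assumes your JVM ships the windows-1252 charset.

```java
import java.nio.charset.Charset;

public class SourceEncodingMismatch {
    // Bytes written to disk in fileCharset, then read back as assumedCharset.
    static String misread(String literal, String fileCharset, String assumedCharset) {
        byte[] onDisk = literal.getBytes(Charset.forName(fileCharset));
        return new String(onDisk, Charset.forName(assumedCharset));
    }

    public static void main(String[] args) {
        // a, e, i, o, u with grave accents, written as escapes
        String intended = "\u00e0\u00e8\u00ec\u00f2\u00f9";
        String seen = misread(intended, "UTF-8", "windows-1252");

        System.out.println(intended.length());     // 5
        System.out.println(seen.length());         // 10: each UTF-8 byte became a char
        System.out.println(seen.equals(intended)); // false
    }
}
```

Once the literal has been compiled into the class file as ten wrong chars, the runtime charset has no way of knowing what you originally meant.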

My recommendation is to ALWAYS set the project.build.sourceEncoding property when you compile with Maven, and to ALWAYS set the encoding attribute of the javac task when you use Ant. This will save you a lot, lot, lot of headaches.

Alan Franzoni


Trieste, Italy