Unicode and encoding: Python vs Java shootout, part 1

Before going on with this post, be sure you've read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) - otherwise, I won't be able to solve your issues.

Sometimes Java fans just say that Python's Unicode support sucks. While there's a bit of truth in that assertion (for Python 2.x at least, since 3.x solved the problem at its root), the real problem is that many programmers don't know what's going on "under the hood", and that Python's default behaviour is a bit unforgiving.

First things first: Python 2.x has two distinct string types: the so-called "byte strings" (str type) and unicode objects (unicode type). Unicode objects are much like Java strings; they're an internal abstraction of Python, and need to be converted to/from byte strings (encoded/decoded) whenever printing, reading from or writing to a file, etc.
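To make the two-type split concrete, here is a minimal sketch. It is written for Python 3 so that it runs on a current interpreter, which is an assumption on my side: there, bytes plays the role of the Python 2 str type and str plays the role of the Python 2 unicode type.

```python
# Python 3 sketch: bytes stands in for Python 2's str,
# and str stands in for Python 2's unicode.
text = "àèìòù"                  # abstract text: five code points
raw = text.encode("utf-8")      # concrete bytes, as written to a file or terminal

print(type(text).__name__, len(text))
print(type(raw).__name__, len(raw))

# Decoding with the same encoding gives the original text back.
round_trip = raw.decode("utf-8")
print(round_trip == text)
```

The text object has no encoding of its own; an encoding only comes into play when converting to or from bytes.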

Sometimes such conversion "just works" (most probably if you're dealing with ASCII-only text), but when it fails, errors occur: Python's infamous and dreaded UnicodeDecodeError can happen in many places and may simply puzzle the programmer:

# -*- coding: utf-8 -*-
import sys

print "current default encoding: " + sys.getdefaultencoding()

"àèìòù" + u"asd"

javapythonunicode$ python unicode_concat.py
current default encoding: ascii
Traceback (most recent call last):
  File "unicode_concat.py", line 6, in <module>
    "àèìòù" + u"asd"
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

This happens when trying to concatenate a unicode object with a byte string. What really happens here, behind the scenes, is spelled out in unicode_concat_explained.py:

# -*- coding: utf-8 -*-
import sys

print "current default encoding: " + sys.getdefaultencoding()

"àèìòù".decode(sys.getdefaultencoding()) + u"asd"

When a byte string (not a unicode object) is concatenated with a unicode object, Python first tries to convert the byte string to unicode. Since no encoding was specified, Python just uses the default one (ascii here), which cannot decode the UTF-8 bytes of the accented characters.
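The implicit step can be replayed on a current interpreter. This is only a sketch under assumptions: Python 3 refuses to mix bytes and text at all, so the hypothetical helper concat_like_python2 spells out the decode that Python 2 performs silently, with "ascii" passed in by hand to stand for the Python 2 default.

```python
# What a Python 2 byte-string literal holds when the source file is UTF-8.
raw = "àèìòù".encode("utf-8")

def concat_like_python2(byte_string, text, default_encoding="ascii"):
    """Hypothetical helper: decode with the default encoding, then concatenate."""
    return byte_string.decode(default_encoding) + text

try:
    concat_like_python2(raw, "asd")
    failure = None
except UnicodeDecodeError as exc:
    failure = exc
    print(exc)

# ASCII-only byte strings decode fine, which is why the problem
# often goes unnoticed until accented characters show up.
print(concat_like_python2(b"abc", "asd"))
```

The failure is reported at position 0 because the very first byte of the UTF-8 encoded "à" is 0xc3, which is outside the ASCII range.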

So, if you're planning to mix unicode and byte string objects (which is not a good idea), always remember to explicitly convert the byte strings to unicode with their decode() method. Or, if you're sure the encoding is always the same, you could add a sitecustomize.py to your interpreter or your project's PYTHONPATH, and set the default encoding there. Beware that setting it system-wide may lead to unexpected results when running your project on another machine.
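One way to follow the first piece of advice is to decode explicitly at the point where byte strings enter your code. The helper below, to_text, is a hypothetical name of mine, and the sketch again uses Python 3 types (bytes for the byte string, str for the unicode object); the "utf-8" default is an assumption about your data, not something Python can guess for you.

```python
def to_text(value, encoding="utf-8"):
    """Hypothetical helper: return text, decoding byte strings explicitly."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

raw = "àèìòù".encode("utf-8")   # bytes coming from a file, a socket, etc.
combined = to_text(raw) + to_text("asd")
print(combined)
```

Because the encoding is named in the code, the result no longer depends on the interpreter's default encoding or on the machine the project runs on.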

Also, remember that the coding directive at the beginning of the file does not change unicode encoding/decoding behaviour: it's a directive to the parser, telling it the encoding of the source file. That information is neither retained nor used at runtime.
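A small experiment can illustrate the parser-only role of the directive. This sketch assumes Python 3, where compile() accepts source code as bytes and honours the coding line; the two in-memory "files" below are made up for the demonstration and contain the same two bytes inside a string literal.

```python
import sys

# Two made-up source files holding the same bytes (0xC3 0xA0) in a literal,
# differing only in the declared source encoding.
utf8_source = b"# -*- coding: utf-8 -*-\ns = '\xc3\xa0'\n"
latin1_source = b"# -*- coding: latin-1 -*-\ns = '\xc3\xa0'\n"

default_before = sys.getdefaultencoding()

utf8_ns, latin1_ns = {}, {}
exec(compile(utf8_source, "<utf8 example>", "exec"), utf8_ns)
exec(compile(latin1_source, "<latin1 example>", "exec"), latin1_ns)

print(repr(utf8_ns["s"]), len(utf8_ns["s"]))
print(repr(latin1_ns["s"]), len(latin1_ns["s"]))

# The runtime default is untouched by either declaration.
print(default_before == sys.getdefaultencoding())
```

The directive decides how the bytes of the literal are read (one character in one case, two in the other), while sys.getdefaultencoding() stays exactly as it was.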


Let's go for another much dreaded error now: UnicodeEncodeError!

# -*- coding: utf-8 -*-
import sys
print "current stdout encoding: " + str(sys.stdout.encoding)
print "current default encoding: " + sys.getdefaultencoding()

print u"àèìòù"

results in this output:

javapythonunicode$ python unicode_print.py
current stdout encoding: UTF-8
current default encoding: ascii
àèìòù

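Everything seems to work (as long as your terminal supports a charset, like UTF-8 or ISO-8859-1, which can display accented chars)! But now let's go for some "black magic": run the very same script with its output redirected to a file or a pipe instead of the terminal, and the print fails with a UnicodeEncodeError.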


That may be puzzling, but the sys.stdout.encoding printout should give you a hint.

Whenever output goes to a terminal, Python autodetects the terminal's encoding and sets sys.stdout.encoding accordingly. When sys.stdout.encoding is set, any unicode object that gets printed is encoded with it. When output does NOT go to a terminal (a file or a pipe, for example), Python does not autodetect anything: sys.stdout.encoding is None, and any unicode object that is printed is encoded according to sys.getdefaultencoding().

So, what really happens here is:

# -*- coding: utf-8 -*-
import sys
print "current stdout encoding: " + str(sys.stdout.encoding)
print "current default encoding: " + sys.getdefaultencoding()

sys.stdout.write(
    u"àèìòù".encode(sys.stdout.encoding or sys.getdefaultencoding())
)

Since accented characters can be encoded to UTF-8 but not to ASCII, such UnicodeEncodeErrors arise.
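The terminal and non-terminal cases can be compared side by side without touching the real stdout. What follows is a sketch under assumptions: it is written for Python 3, FakeStdout and write_text are hypothetical names of mine, and "ascii" is passed in by hand to stand for the Python 2 default encoding (the Python 3 default is utf-8).

```python
import sys

class FakeStdout:
    """Stand-in for a byte-oriented stdout; encoding is None when "redirected"."""
    def __init__(self, encoding=None):
        self.encoding = encoding
        self.data = b""

    def write(self, chunk):
        self.data += chunk

def write_text(stream, text, default_encoding=None):
    """Hypothetical helper: encode with the stream's encoding, else the default."""
    encoding = stream.encoding or default_encoding or sys.getdefaultencoding()
    stream.write(text.encode(encoding))
    return encoding

# Terminal case: the detected encoding is used and the write succeeds.
terminal = FakeStdout("UTF-8")
used_for_terminal = write_text(terminal, "àèìòù", default_encoding="ascii")

# Redirected case: no stream encoding, so the ASCII default is used and fails.
redirected = FakeStdout(None)
try:
    write_text(redirected, "àèìòù", default_encoding="ascii")
    redirect_error = None
except UnicodeEncodeError as exc:
    redirect_error = exc

# Encoding explicitly before writing avoids relying on either setting.
explicit = FakeStdout(None)
explicit.write("àèìòù".encode("utf-8"))
```

The last three lines show the general way out: encode to a known encoding yourself, so the result is the same whether the output goes to a terminal, a file or a pipe.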

Also beware of bug 4947 - it may hit you if you're using Python 2.6 or older.


You can find the second part of this article here