Don’t mix unicode and encoded strings

The key to my unicode-in-python problems was that using a unicode object within Python feels the same as using a string object. But they are really not the same (as pointed out by Diez Roggisch and Just van Rossum), and you see this when you start talking to C extensions. My original solution to working with elementtidy involved making elementtidy understand unicode objects.

The cleaner solution is to pass in UTF-8 encoded strings. That’s the trick: a unicode object in Python is a “perfect” representation of a unicode string, whereas UTF-8 is a way to represent a unicode string as a bunch of bytes. I’m pretty sure I read this somewhere in my recent travels, but now I really see the effects of this when it comes to talking with C modules.
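To make the distinction concrete, here is a minimal sketch in Python 3 terms, where `str` plays the role of the unicode object and `bytes` is the encoded form. The sample text and variable names are only illustrative.

```python
# A unicode string: a sequence of code points, independent of any byte layout.
text = "na\u00efve caf\u00e9"

# Encoding turns the code points into concrete bytes; the result depends on the codec.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# Same text, different byte sequences: the two non-ASCII characters
# take two bytes each in UTF-8 and one byte each in Latin-1.
same_bytes = utf8_bytes == latin1_bytes

# Decoding with the matching codec recovers the original string.
decoded = utf8_bytes.decode("utf-8")
```

A C extension only ever sees something like `utf8_bytes`, so it also has to be told (or assume) which codec produced them.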

This didn’t change the fact that elementtidy was hardcoded to encode as ASCII. The solution to this is simpler than the unicode change I made the other day. Taking a stock _elementtidy.c from 1.0a3, here is the breakdown of changes:

/* snip */
static PyObject*
elementtidy_fixup(PyObject* self, PyObject* args, PyObject* kwdict)
/* snip */
    char* encoding = "ascii";
    static char* kwlist[] = {"data", "encoding", NULL};

    char* text;
    if (!PyArg_ParseTupleAndKeywords(args, kwdict, "s|s", kwlist, &text,
                                     &encoding))
        return NULL;

    doc = tidyCreate();

    /* options for nice XHTML output */
    tidyOptSetValue(doc, TidyCharEncoding, encoding);

/* snip */
static PyMethodDef _functions[] = {
    {"fixup", (PyCFunction) elementtidy_fixup, METH_VARARGS | METH_KEYWORDS, 
            "Use HTML Tidy to convert to XHTML"},
    {NULL, NULL}
};

These changes are very straightforward. Add an “encoding” keyword argument to the fixup function, and pass that in to tidylib, rather than the previous hardcoded “ascii”. The default is still “ascii” though. I also made minor changes to TidyHTMLTreeBuilder to add the encoding option (and use cElementTree, if available).
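From the calling side, the pattern looks roughly like the sketch below. Since the patched extension is not assumed to be installed, `fixup_stub` is a hypothetical pure-Python stand-in that mirrors only the `data` and `encoding` keywords; it does not run HTML Tidy, and it is written in Python 3 terms with `bytes` as the encoded string.

```python
def fixup_stub(data: bytes, encoding: str = "ascii") -> str:
    """Hypothetical stand-in for the patched fixup(): same keywords, no tidying."""
    if not isinstance(data, bytes):
        raise TypeError("data must be an encoded byte string")
    # The real extension hands `encoding` to tidylib. Here we only decode,
    # to show that the declared encoding has to match the bytes passed in.
    return data.decode(encoding)


text = "caf\u00e9 \u2603"
encoded = text.encode("utf-8")  # encode once, at the boundary to the C code

# Declaring the encoding explicitly makes the round trip work.
round_tripped = fixup_stub(encoded, encoding="utf8")

# Relying on the "ascii" default fails for non-ASCII input.
try:
    fixup_stub(encoded)
    default_failed = False
except UnicodeDecodeError:
    default_failed = True
```

The point of the sketch is the division of labour: the caller owns the encode step, and the extension only needs to know which encoding was used.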

This solution works great, and nothing more is needed to make it work. Unicode in Python looks quite well designed, especially given the needs of interfacing with C code that has its own ideas about text encoding and multibyte characters.

3 thoughts on “Don’t mix unicode and encoded strings”

  1. I’ve seen that one, but the link is a good reminder. I actually became fairly unicode-aware in my last job, but I missed some of the nuances of how Python’s implementation works. My previous unicode experience was all in Java, and you tend to do very little interfacing with C code and external processes in Java. Not that I’d want to go back 🙂

  2. Also note that while XML uses Unicode on the “inside”, the XML “serialization” format always uses an encoding. An XML parser is designed to deal with encoded data, not with an abstract stream of “perfect” Unicode characters.

    (and thanks to character references, the encoding doesn’t in any way limit what characters you can store in an XML file)
