Don’t mix unicode and encoded strings

Feb 3, 2005 15:52

The root of my unicode-in-Python problems was that a unicode object feels just like a string object when you use it within Python. They are really not the same, though (as pointed out by Diez Roggisch and Just van Rossum), and the difference shows up as soon as you start talking to C extensions. My original solution for working with elementtidy involved making elementtidy understand unicode objects.

The cleaner solution is to pass in UTF-8 encoded strings. That’s the trick: a unicode object in Python is a “perfect” representation of a unicode string, whereas UTF-8 is a way to represent a unicode string as a bunch of bytes. I’m pretty sure I read this somewhere in my recent travels, but now I really see the effects of this when it comes to talking with C modules.
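To make the distinction concrete, here is a minimal sketch. It assumes a modern Python 3 interpreter, where `str` plays the role of the unicode object and `bytes` plays the role of the encoded string; the sample text and the helper name `describe` are made up for illustration.

```python
# Python 3 sketch: str stands in for the post's "unicode object",
# bytes stands in for the "encoded string".
text = "caf\u00e9"                  # four code points, the last is e-acute
utf8_bytes = text.encode("utf-8")   # five bytes, e-acute takes two of them


def describe(value):
    """Return (type name, length) so the two objects can be compared."""
    return type(value).__name__, len(value)


text_info = describe(text)
bytes_info = describe(utf8_bytes)

# Decoding with the same encoding recovers the original text.
round_trip = utf8_bytes.decode("utf-8")

# Mixing the two kinds of object is an error in Python 3.
try:
    text + utf8_bytes
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
```

The lengths differ because one counts characters and the other counts bytes, which is exactly the gap a C module sees when it is handed one and expects the other.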

This didn’t change the fact that elementtidy was hardcoded to encode as ASCII. The solution to this is simpler than the unicode change I made the other day. Taking a stock _elementtidy.c from 1.0a3, here is the breakdown of changes:

/* snip */
static PyObject*
elementtidy_fixup(PyObject* self, PyObject* args, PyObject* kwdict)
/* snip */
    char* encoding = "ascii";
    static char* kwlist[] = {"data", "encoding", NULL};

    char* text;
    if (!PyArg_ParseTupleAndKeywords(args, kwdict, "s|s", kwlist, &text,
                                     &encoding))
        return NULL;

    doc = tidyCreate();

    /* options for nice XHTML output */
    tidyOptSetValue(doc, TidyCharEncoding, encoding);

/* snip */
static PyMethodDef _functions[] = {
    {"fixup", (PyCFunction) elementtidy_fixup, METH_VARARGS | METH_KEYWORDS, 
            "Use HTML Tidy to convert to XHTML"},
    {NULL, NULL}
};

These changes are very straightforward. Add an “encoding” keyword argument to the fixup function, and pass that in to tidylib, rather than the previous hardcoded “ascii”. The default is still “ascii” though. I also made minor changes to TidyHTMLTreeBuilder to add the encoding option (and use cElementTree, if available).
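From the Python side, the calling pattern looks roughly like the sketch below. It is written for Python 3 and does not import elementtidy: `fixup_stand_in` is a hypothetical stand-in that only checks whether the bytes match the declared encoding, so its return value is not what the real `fixup` returns. The point is just the shape of the call: encode first, name the encoding, decode afterwards.

```python
def fixup_stand_in(data, encoding="ascii"):
    """Hypothetical stand-in for the patched fixup(data, encoding="ascii").

    It only verifies that the bytes are valid in the declared encoding
    and hands them back unchanged; the real function runs HTML Tidy.
    """
    if not isinstance(data, bytes):
        raise TypeError("pass an encoded byte string, not a unicode object")
    data.decode(encoding)  # raises UnicodeDecodeError on a mismatch
    return data


html = "<p>caf\u00e9</p>"
encoded = html.encode("utf-8")

# Encode first, tell the function which encoding was used, decode the result.
result = fixup_stand_in(encoded, encoding="utf8").decode("utf-8")

# With the default of "ascii", the same bytes are rejected.
try:
    fixup_stand_in(encoded)
    default_ok = True
except UnicodeDecodeError:
    default_ok = False
```

Keeping “ascii” as the default means existing callers see no change, while anyone with non-ASCII input opts in explicitly.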

This solution works great, and nothing else is needed to make it work. Unicode in Python looks quite well designed, especially given the needs of interfacing with C code that has its own ideas about text encoding and multibyte characters.