Using Unicode with ElementTidy

I want to do some HTML scraping, and there are at least two Python packages to help with that: BeautifulSoup and ElementTidy. For what I’m doing, the input and output need to be UTF-8. I have had some success if I set my default encoding to utf8 via sitecustomize.py, but it would be nicer if I could explicitly set the encoding as needed.

I’m going to talk mostly about ElementTidy in this posting. Briefly, the problem I had with BeautifulSoup was that passing in a unicode string resulted in tags getting glommed together. For example, if I had two <a> tags, the content of the first tag would include the second tag, and the second tag would also appear separately. I didn’t spend any significant time looking at this problem.

With ElementTidy, the problems have been clearer, but not necessarily easy to fix. I still don’t have it working fully without using a sitecustomize.py file. What happens with ElementTidy is that the unicode string is being coerced to ASCII, causing an exception whenever there is a character that is not compatible with 7-bit ASCII. The first step along the way brought me to the _elementtidy.c module. This is the one that calls the HTML Tidy library to generate clean XHTML for ElementTree to work with. The first clue that something is amiss is that tidylib is called with an encoding of “ascii”. Sadly, just setting that to utf8 won’t do it, because the input from Python and the output to Python both need to be unicode objects, rather than standard strings. Here’s how I did it (for brevity, I’ll highlight the changes):

static PyObject*
elementtidy_fixup(PyObject* self, PyObject* args)
{
/* snip */
    char* text;
    if (!PyArg_ParseTuple(args, "es:fixup", "utf8", &text))
        return NULL;

    doc = tidyCreate();

    /* options for nice XHTML output */
    tidyOptSetValue(doc, TidyOutCharEncoding, "utf8");
    tidyOptSetValue(doc, TidyInCharEncoding, "utf8");

/* snip */
    pyout = PyUnicode_DecodeUTF8(out.bp ? out.bp : "", out.size, NULL);
    if (pyout)
        pyerr = PyUnicode_DecodeUTF8(err.bp ? err.bp : "", err.size, NULL);
/* snip */
    PyMem_Free(text);
/* snip */

What I had to do was pretty simple, but I haven’t looked at Python extensions in a few years, so I had to do a bit of reading to make these few minor changes. The first change is to tell PyArg_ParseTuple that we’re looking for utf8 coming in. By doing that, the incoming text will not be coerced to ASCII, and the text buffer will contain UTF-8 text. Then, in the options to tidylib, we need to specify that utf8 is our input and output format of choice. Once the response comes back, we create unicode objects rather than string objects. Thankfully, there is a function to take a UTF-8 encoded byte buffer and generate a Python unicode object. Finally, don’t forget to free the incoming argument, because the “es” option to PyArg_ParseTuple allocates a new buffer for the encoded string.
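The same encode-on-the-way-in, decode-on-the-way-out idea can be sketched in plain Python. This is only an illustration, not the _elementtidy code: `clean_bytes` is a hypothetical stand-in for the tidylib call, the function names are made up, and the sketch assumes a current Python 3, where `str` is the unicode type and `bytes` is the encoded form.

```python
def clean_bytes(data: bytes) -> bytes:
    """Hypothetical stand-in for a byte-oriented C library call such as tidylib."""
    # A real cleaner would repair the markup; this one only trims whitespace.
    return data.strip()


def fixup(text: str) -> str:
    """Encode to UTF-8 on the way in, decode from UTF-8 on the way out."""
    raw = text.encode("utf-8")      # plays the role of the "es" + "utf8" parse step
    cleaned = clean_bytes(raw)      # the library only ever sees UTF-8 bytes
    return cleaned.decode("utf-8")  # plays the role of PyUnicode_DecodeUTF8


def ascii_fixup(text: str) -> str:
    """Roughly what an ASCII-only round trip amounts to."""
    return clean_bytes(text.encode("ascii")).decode("ascii")


sample = "  <p>caf\u00e9</p>  "
result = fixup(sample)

# The ASCII version fails as soon as a non-ASCII character shows up.
try:
    ascii_fixup(sample)
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

The point of the design is that the encoding is named explicitly at both edges of the byte-oriented call, so nothing depends on the interpreter's default encoding.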

I’m pretty sure there is a way to get the unicode object directly out of the tuple so that no additional buffer needs to be created, but I haven’t looked into that at all. Also, this function now only does UTF-8 in and out (though passing it an ASCII string still works, since ASCII is a subset of UTF-8). This is not ideal, but it meets my needs.

That works fine, and it would have been nice if that was the end of the story. Of course, it’s not, otherwise I would’ve stopped typing. By making the changes above, _elementtidy.fixup will nicely do UTF-8 in and out. Unfortunately, I then ran into problems with ElementTree coercing my document to ASCII. Looking at the Python implementation, I see that no encoding is passed to the expat parser. The docs for that package say that it will try to determine from the document what encoding to use. So, I made sure that I had a <?xml version="1.0" encoding="utf-8" ?> declaration, but that didn’t do the trick either.

I’m hoping there’s some way other than sitecustomize.py to pass along that the object needs to stay in unicode form. I’m sure I can do it if I hack at ElementTree a little bit, but I really don’t like altering third party packages unless I can send in a useful patch that will get integrated.
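One approach that avoids touching the default encoding is to hand the parser encoded bytes that match the XML declaration, and only deal in unicode on the other side. The following is a minimal sketch under some loud assumptions: it uses the `xml.etree.ElementTree` module from a current Python 3 standard library rather than the standalone ElementTree package discussed here, so behaviour may differ from the version in this post, and the sample document is invented.

```python
import xml.etree.ElementTree as ET

# Invented sample document with a UTF-8 declaration and one non-ASCII character.
document = '<?xml version="1.0" encoding="utf-8"?><p>caf\u00e9</p>'

# Encode once, at the boundary, so the parser receives bytes that match
# the encoding named in the XML declaration.
encoded = document.encode("utf-8")
root = ET.fromstring(encoded)

# Element text comes back as a unicode string.
text = root.text

# Serialising with an explicit encoding yields encoded bytes again.
serialised = ET.tostring(root, encoding="utf-8")
```

The parser never has to guess: it gets bytes plus a declaration that says how to read them, and everything it hands back is unicode until it is explicitly encoded again.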

6 thoughts on “Using Unicode with ElementTidy”

  1. Hi Kevin,

    I’ve been banging my head against similar problems with HTML and also with processing syndicated feeds with feedparser. Some, but not all of them, seem to have been resolved by editing sitecustomize.py.

    I’m nowhere near your level of gurudom (I don’t grok C much), but could you explain why you think editing sitecustomize.py is a bad idea? I understand that in some cases (like on a hosted web server), the file may not be accessible to users without root, but isn’t it fair to say that utf-8 *should* be the default encoding, if at all possible?

    Interesting post, cheers…

  2. Whether or not sitecustomize is a good solution for you probably depends on your app. The nice thing about sitecustomize is that it can be anywhere in your pythonpath, so you generally don’t need root access to use it.

    If you’re using sitecustomize on an app for which you completely control the environment, that should work just fine. But, if you’re distributing the app, it’s really a lot better if your application can run properly in whatever environment it gets dropped into.

    By the way, with some help from the python-list, I solved my problem:

    http://www.blueskyonmars.com/archives/2005/02/03/dont_mix_unicode_and_encoded_strings.html

    Take a look at that posting, and you may be able to get around using a sitecustomize file as well.

    Kevin

  3. Hi again, thanks for the response.

    This is actually the part where I’m not sure what best practice is, because I’m under the impression that in fact you *can’t* add sitecustomize.py anywhere in your path, because of some programming oddities. These are mentioned in Martin Doudoroff’s article (which you recently blogged about):

    “You want this application to handle multi-lingual text, so you’re going to take advantage of Unicode. The first thing you will probably want to do is set up a sitecustomize.py file in the Lib directory of your python installation and designate a Unicode encoding (probably UTF-8) as the default encoding for Python.

    import sys
    sys.setdefaultencoding("utf-8")

    Important: as of Python 2.2, as far as I can tell, you can only call the setdefaultencoding method from within sitecustomize.py. You cannot perform this step from within your application! I don’t understand why Guido set it up this way, but I’m sure he had his reasons. ”

    I’m really glad you brought that article to my attention, because it’s the first time I’ve seen someone just come out and say that setting utf-8 as default in sitecustomize.py is the way to go. The number of headaches that it seems to solve, imho, is far greater than the number involved in editing the file.

    I see what you’re saying about creating apps for distribution. That’s something I never really do, though; the only “distribution” I do is by way of cgi apps or something like that.

    Thanks for the discussion!

  4. sitecustomize.py can be anywhere on your PYTHONPATH. (I know, because I’ve tried it out of desperation and it does work.) The point made in that article is that you can’t, at some random point in your program, just import sys and set the encoding. It has to happen at startup time, so it has to be on the path at startup time.

    sitecustomize.py is definitely the easiest thing to do when you’re in complete control of what the PYTHONPATH will be at startup.

  5. sys.setdefaultencoding() was added for experimentation during Unicode development, and should not be used in production code. All sorts of ugliness can happen if you mess around with the conversion rules (especially if you use a variable-width encoding). It’s not that hard to write encoding-aware code, really.

    As for ElementTidy, an encoding attribute was added to the recent 1.0b1 release.
