Deferring SQLObject database insertion until later

SQLObject automatically inserts entries in the database as soon as you instantiate them. It also runs updates as soon as you modify attributes. This is not always desirable, and Peter Butler has a recipe that lets you choose when new objects are inserted and updated. It’s actually not a bad solution, because the object you’re holding is not truly one of your domain objects, which makes it clearer that something more needs to be done with it. SQLObject could implement this functionality directly with a flag passed to __init__. There is already a flag to defer the updates: set _lazyUpdate = True on the object, then call syncUpdate or sync to save the data.
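To make the idea of deferred insertion concrete, here is a minimal sketch using only the standard library's sqlite3 module. It is not SQLObject code and not the recipe mentioned above: the PendingRow class, its insert method, and the person table are hypothetical names invented for illustration. The holder object collects values and touches the database only when insert is called.

```python
import sqlite3


class PendingRow:
    """Hypothetical holder: keeps column values in memory until insert()."""

    def __init__(self, table, **values):
        self.table = table
        self.values = dict(values)
        self.rowid = None  # stays None until the row is actually written

    def insert(self, conn):
        # Table and column names are trusted in this sketch; values are bound.
        columns = ", ".join(self.values)
        marks = ", ".join("?" for _ in self.values)
        cur = conn.execute(
            f"INSERT INTO {self.table} ({columns}) VALUES ({marks})",
            tuple(self.values.values()),
        )
        self.rowid = cur.lastrowid
        return self.rowid


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, age INTEGER)")

pending = PendingRow("person", name="Ada")
pending.values["age"] = 36  # still no database traffic

count_before = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]
pending.insert(conn)        # the single INSERT happens here
count_after = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]
```

Because the holder is a different type from the persisted row, it is hard to mistake it for an object that has already been saved, which is the point made above about it not being a true domain object.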

Don’t mix unicode and encoded strings

The key to my unicode-in-python problems was that using unicode objects within Python feels the same as using a string object. But, they are really not the same (as pointed out by Diez Roggisch and Just van Rossum), and you see this when you start talking to C extensions. My original solution to working with elementtidy involved making elementtidy understand unicode objects.

The cleaner solution is to pass in UTF-8 encoded strings. That’s the trick: a unicode object in Python is a “perfect” representation of a unicode string, whereas UTF-8 is a way to represent a unicode string as a bunch of bytes. I’m pretty sure I read this somewhere in my recent travels, but now I really see the effects of this when it comes to talking with C modules.
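The distinction is easy to see in a few lines. This sketch assumes current Python 3, where the type this post calls a unicode object is named str and an encoded string is bytes. The variable names are only for illustration.

```python
text = "na\u00efve caf\u00e9"   # str: a sequence of Unicode code points
data = text.encode("utf-8")     # bytes: one concrete byte representation

char_count = len(text)          # counts characters
byte_count = len(data)          # counts bytes; the two accented letters take 2 each

round_trip = data.decode("utf-8")  # decoding recovers the original text

# Python 3 refuses to mix the two kinds of string implicitly.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

The text and its UTF-8 bytes have different lengths, and the only way between them is an explicit encode or decode. That explicit step is what a C module needs from the caller.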

This didn’t change the fact that elementtidy was hardcoded to encode as ASCII. The solution to this is simpler than the unicode change I made the other day. Taking a stock _elementtidy.c from 1.0a3, here is the breakdown of changes:

/* snip */
static PyObject*
elementtidy_fixup(PyObject* self, PyObject* args, PyObject* kwdict)
{
/* snip */
    char* encoding = "ascii";
    
    static char* kwlist[] = {"data", "encoding", NULL};

    char* text;
    if (!PyArg_ParseTupleAndKeywords(args, kwdict, "s|s", kwlist, &text,
        &encoding))
        return NULL;

    doc = tidyCreate();

    /* options for nice XHTML output */
    tidyOptSetValue(doc, TidyCharEncoding, encoding);

/* snip */
static PyMethodDef _functions[] = {
    {"fixup", (PyCFunction) elementtidy_fixup, METH_VARARGS | METH_KEYWORDS, 
            "Use HTML Tidy to convert to XHTML"},
    {NULL, NULL}
};

These changes are very straightforward. Add an “encoding” keyword argument to the fixup function, and pass that in to tidylib, rather than the previous hardcoded “ascii”. The default is still “ascii” though. I also made minor changes to TidyHTMLTreeBuilder to add the encoding option (and use cElementTree, if available).

This solution works great and there is no sitecustomize.py needed to make it work. Unicode in Python looks quite well designed, especially given the needs of interfacing with C code that has its own ideas about text encoding and multibyte characters.

Using Unicode with ElementTidy

I want to do some HTML scraping, and there are at least two Python packages to help with that: BeautifulSoup and ElementTidy. For what I’m doing, the input and output need to be UTF-8. I have had some success if I set my default encoding to utf8 via sitecustomize.py, but it would be nicer if I could explicitly set the encoding as needed.

I’m going to talk mostly about ElementTidy in this posting. Briefly, the problem I had with BeautifulSoup was that passing in a unicode string resulted in tags getting glommed together. For example, if I had two <a> tags, the content of the first tag would include the second tag, and the second tag would also appear separately. I didn’t spend any significant time looking at this problem.

With ElementTidy, the problems have been clearer, but not necessarily easy to fix. I still don’t have it working fully without using a sitecustomize.py file. What happens with ElementTidy is that the unicode string is being coerced to ASCII, causing an exception whenever there is a character that is not compatible with 7-bit ASCII. The first step along the way brought me to the _elementtidy.c module. This is the one that calls the HTML Tidy library to generate clean XHTML for ElementTree to work with. The first clue that something is amiss is that tidylib is called with an encoding of “ascii”. Sadly, just setting that to utf8 won’t do it, because the input from Python and the output to Python both need to be unicode objects, rather than standard strings. Here’s how I did it (for brevity, I’ll highlight the changes):

static PyObject*
elementtidy_fixup(PyObject* self, PyObject* args)
{
/* snip */
    char* text;
    if (!PyArg_ParseTuple(args, "es:fixup", "utf8", &text))
        return NULL;

    doc = tidyCreate();

    /* options for nice XHTML output */
    tidyOptSetValue(doc, TidyOutCharEncoding, "utf8");
    tidyOptSetValue(doc, TidyInCharEncoding, "utf8");

/* snip */
    pyout = PyUnicode_DecodeUTF8(out.bp ? out.bp : "", out.size, NULL);
    if (pyout)
        pyerr = PyUnicode_DecodeUTF8(err.bp ? err.bp : "", err.size, NULL);
/* snip */
    PyMem_Free(text);
/* snip */

What I had to do was pretty simple, but I haven’t looked at Python extensions in a few years, so I had to do a bit of reading to make these few minor changes. The first change is to tell PyArg_ParseTuple that we’re looking for utf8 coming in. By doing that, the incoming text will not be coerced to ASCII, and the text buffer will contain UTF-8 text. Then, in the options to tidylib, we need to specify that utf8 is our input and output format of choice. Once the response comes back, we create unicode objects rather than string objects. Thankfully, there is a function to take a UTF-8 encoded byte buffer and generate a Python unicode object. Finally, don’t forget to free the incoming argument, because the “es” option to PyArg_ParseTuple allocates a new buffer for the encoded string.
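Seen from the Python side, the conversions at the C boundary look roughly like the sketch below. It assumes Python 3 naming (str and bytes) and does not call tidylib: the pass-through assignment is a stand-in for the real tidy call, and all variable names are hypothetical.

```python
page = "<p>caf\u00e9</p>"  # text containing one non-ASCII character

# What the unpatched path effectively attempted: coerce to 7-bit ASCII.
try:
    page.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc  # records where the offending character sits

# What the "es" format with "utf8" does on the way in: encode to UTF-8 bytes.
buffer_in = page.encode("utf-8")

# Stand-in for the tidy call; here the bytes simply pass through unchanged.
buffer_out = buffer_in

# What PyUnicode_DecodeUTF8 does on the way out: bytes back to a text object.
result = buffer_out.decode("utf-8")
```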

I’m pretty sure that there would be a way to directly get the unicode object out of the tuple so that no additional buffer needs to be created. I haven’t looked into that at all. Secondly, this function now only does utf-8 in and out (though if you pass it an ASCII string, that will still work). This is not ideal, but it meets my needs.

That works fine, and it would have been nice if that was the end of the story. Of course, it’s not, otherwise I would’ve stopped typing. By making the changes above, _elementtidy.fixup will nicely do UTF-8 in and out. Unfortunately, I then ran into problems with ElementTree coercing my document to ASCII. Looking at the Python implementation, I see that no encoding is passed to the expat parser. The docs for that package say that it will try to determine from the document what encoding to use. So, I made sure that I had a <?xml version="1.0" encoding="utf-8" ?> declaration, but that didn’t do the trick either.
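For comparison, here is the behavior I was hoping for, sketched with the xml.etree.ElementTree module that ships in the standard library of current Python 3. This is an assumption-laden stand-in: it is not the ElementTree release discussed here, so it does not reproduce the ASCII coercion problem. It only shows UTF-8 bytes with an encoding declaration going in and text coming out.

```python
import xml.etree.ElementTree as ET

# UTF-8 encoded bytes with an explicit encoding declaration.
xml_bytes = (
    '<?xml version="1.0" encoding="utf-8" ?><p>caf\u00e9</p>'.encode("utf-8")
)

root = ET.fromstring(xml_bytes)  # the parser reads the declared encoding
text = root.text                 # element text comes back as a str

# Serializing with an explicit encoding yields UTF-8 bytes again.
serialized = ET.tostring(root, encoding="utf-8")
```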

I’m hoping there’s some way other than sitecustomize.py to pass along that the object needs to stay in unicode form. I’m sure I can do it if I hack at ElementTree a little bit, but I really don’t like altering third party packages unless I can send in a useful patch that will get integrated.

Pyro: remote objects for Python

I’m sure that this is old news for everyone else, but it’s new to me. Look at the features of Pyro, Python Remote Objects. Wow. If you think Python would save you a lot of code and hassle compared to Java, that’s nothing compared to the pain that Pyro would save you over Java’s RMI! I don’t think I need Pyro right this second, but I have no doubt that I’ll be using this at some point. It also looks like Pyro already handles what Py.Execnet is trying to do with remote code running. The Pyro features list says that it can send objects across the wire, even including the Python bytecode if necessary. (The security implications there are staggering… but, the feature would be very powerful in a closed environment.)

(updated to fix idiotic typos)

Don’t import inside of PyObjC actions

Bob Ippolito came to my rescue today. I started using Cheetah for some template work, and it turns out that doing:

from Cheetah.Template import Template

from within your action is bad. (Bus Error bad, not simple exception bad.) So, if you’re using PyObjC, make sure you do your imports at the module level, not in a given block of code.

Update: Bob tracked it down. Don’t import Cheetah inside of an action, because it will import PyObjC’s WebKit (thinking that it’s getting WebWare’s WebKit) and importing PyObjC’s WebKit inside of an action has some issue at present. So, go ahead and do imports wherever you want again, as long as it doesn’t involve WebKit 🙂

Jason Orendorff, where are you?

I’ve been reading lots of good things about Jason Orendorff’s path module lately. I’ve been sprucing up my app’s build/packaging process and using os.path a lot. I wish I had Jason’s path module, but his site seems to have fallen over. That seems to be the only place to get the module, short of begging people who already have it (something I’m not above doing 🙂).

ClearSilver built for Windows

I don’t like make. There, I’ve said it.

But, of course, there’s always more to say. I like make plenty when I don’t personally have to work with makefiles. Typing “./configure ; make” is a fine process for building software. But, it wasn’t that easy when I was trying to build ClearSilver for Windows with Python 2.4. There are instructions that came in the ClearSilver tarball that got me part of the way. For some reason, though, the configure script just wasn’t getting my Python include and libs directories right. The simple solution was to just edit rules.mk to point to the correct directory, which worked just fine.

I’m quite thankful for the work that the MinGW people have put in. This kind of thing would be far harder without it.

Python dynamic code replacement

It’s interesting the stuff you see when you’re looking at an aggregator. Michael Lucas-Smith of Software WithStyle writes about how updating live server code is so easy in Smalltalk. You replace the code, and all of the instances of those classes immediately get the new behavior.

Not so in Python, laments Chui Tey. He says that Twisted has a metaclass trick to do this and mentions one on ASPN. I’m not sure if this is the one, but the RubyClass hack does look like it’s designed for code replacement.
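A small sketch shows why Python needs a trick here. It is neither Twisted's code nor the ASPN recipe; the Greeter class and variable names are made up. Patching a method on the existing class object reaches live instances, but re-running the class statement (roughly what reloading a module does) creates a new class that old instances know nothing about.

```python
class Greeter:
    def greet(self):
        return "hello"


g = Greeter()  # a "live" instance created before any code change


# Case 1: patch the method on the existing class object.
def greet_v2(self):
    return "hi"


Greeter.greet = greet_v2
after_patch = g.greet()  # the existing instance picks up the new method


# Case 2: re-run the class statement, which binds the name to a NEW class.
class Greeter:
    def greet(self):
        return "hey"


after_redefine = g.greet()       # g still belongs to the old class
is_new = isinstance(g, Greeter)  # the old instance is not a new Greeter

# One possible workaround: repoint the instance at the new class.
g.__class__ = Greeter
after_rebind = g.greet()
```

Code replacement schemes have to automate that last step, finding every live instance and repointing it, which Smalltalk does for you.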

There was debate following Guido’s discussion of optional static typing in Python about whether adding static typing is a good use of time. While Python’s metaclasses do allow addition of all sorts of interesting behavior, I do think that supporting Smalltalk-like live system updating is a bigger win for Python developers than static typing would be.

Update: I’ve corrected the mention of Michael Lucas-Smith’s employer based on the comment from James Robertson.