Archive for January, 2005

I want to do some HTML scraping, and there are at least two Python packages to help with that: BeautifulSoup and ElementTidy. For what I’m doing, the input and output need to be UTF-8. I have had some success if I set my default encoding to utf8 via sitecustomize.py, but it would be nicer if I could explicitly set the encoding as needed.

I’m going to talk mostly about ElementTidy in this posting. Briefly, the problem I had with BeautifulSoup was that passing in a unicode string resulted in tags getting glommed together. For example, if I had two <a> tags, the content of the first tag would include the second tag, and the second tag would also appear separately. I didn’t spend any significant time looking at this problem.

With ElementTidy, the problems have been clearer, but not necessarily easy to fix. I still don’t have it working fully without using a sitecustomize.py file. What happens with ElementTidy is that the unicode string is being coerced to ASCII, causing an exception whenever there is a character that is not compatible with 7-bit ASCII. The first step along the way brought me to the _elementtidy.c module. This is the one that calls the HTML Tidy library to generate clean XHTML for ElementTree to work with. The first clue that something is amiss is that tidylib is called with an encoding of “ascii”. Sadly, just setting that to utf8 won’t do it, because the input from Python and the output to Python both need to be unicode objects, rather than standard strings. Here’s how I did it (for brevity, I’ll highlight the changes):

static PyObject*
elementtidy_fixup(PyObject* self, PyObject* args)
{
/* snip */
    char* text;
    if (!PyArg_ParseTuple(args, "es:fixup", "utf8", &text))
        return NULL;

    doc = tidyCreate();

    /* options for nice XHTML output */
    tidyOptSetValue(doc, TidyOutCharEncoding, "utf8");
    tidyOptSetValue(doc, TidyInCharEncoding, "utf8");

/* snip */
    pyout = PyUnicode_DecodeUTF8(out.bp ? out.bp : "", out.size, NULL);
    if (pyout)
        pyerr = PyUnicode_DecodeUTF8(err.bp ? err.bp : "", err.size, NULL);
/* snip */
    PyMem_Free(text);
/* snip */

What I had to do was pretty simple, but I haven’t looked at Python extensions in a few years, so I had to do a bit of reading to make these few minor changes. The first change is to tell PyArg_ParseTuple that we’re looking for utf8 coming in. By doing that, the incoming text will not be coerced to ASCII, and the text buffer will contain UTF-8 text. Then, in the options to tidylib, we need to specify that utf8 is our input and output format of choice. Once the response comes back, we create unicode objects rather than string objects. Thankfully, there is a function to take a UTF-8 encoded byte buffer and generate a Python unicode object. Finally, don’t forget to free the incoming argument, because the “es” option to PyArg_ParseTuple allocates a new buffer for the encoded string.
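The same encode-on-the-way-in, decode-on-the-way-out pattern can be sketched in pure Python. This is only an illustration in modern Python 3, not the Python 2 extension code above: `fixup_utf8` is a hypothetical stand-in for a C library that takes and returns UTF-8 bytes, and `fixup` is an assumed wrapper name.

```python
def fixup_utf8(markup: bytes) -> bytes:
    """Hypothetical stand-in for a C library that reads and writes UTF-8 bytes."""
    return b"<p>" + markup + b"</p>"


def fixup(text: str) -> str:
    # Encode explicitly, much as the "es" format with "utf8" does for the C function.
    encoded = text.encode("utf-8")
    cleaned = fixup_utf8(encoded)
    # Decode the result back to a unicode string, as PyUnicode_DecodeUTF8 does.
    return cleaned.decode("utf-8")


result = fixup("caf\u00e9")

# Coercing the same text to ASCII fails, which is the original problem.
try:
    "caf\u00e9".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

Keeping the encoding explicit at both boundaries means nothing depends on the interpreter's default encoding.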

I’m pretty sure that there would be a way to directly get the unicode object out of the tuple so that no additional buffer needs to be created. I haven’t looked into that at all. Secondly, this function now only does utf-8 in and out (though if you pass it an ASCII string, that will still work). This is not ideal, but it meets my needs.

That works fine, and it would have been nice if that was the end of the story. Of course, it’s not, otherwise I would’ve stopped typing. By making the changes above, _elementtidy.fixup will nicely do UTF-8 in and out. Unfortunately, I then ran into problems with ElementTree coercing my document to ASCII. Looking at the Python implementation, I see that no encoding is passed to the expat parser. The docs for that package say that it will try to determine from the document what encoding to use. So, I made sure that I had a <?xml version="1.0" encoding="utf-8" ?> declaration, but that didn’t do the trick either.
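For comparison, here is a minimal sketch of the behavior I was hoping for, written against the ElementTree that ships in the modern Python 3 standard library (xml.etree.ElementTree), not the 2005 package discussed here. The sample document and variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A small document containing a non-ASCII character, with an explicit declaration.
document = (
    '<?xml version="1.0" encoding="utf-8"?>'
    "<html><body><a>caf\u00e9</a></body></html>"
)

# Hand the parser UTF-8 bytes so it can honor the encoding declaration.
data = document.encode("utf-8")
root = ET.fromstring(data)

# Text comes back as a unicode string.
link_text = root.find("body/a").text

# Ask for UTF-8 explicitly when serializing; the result is a bytes object.
serialized = ET.tostring(root, encoding="utf-8")
```

Here the encoding is named at both the parse and the serialize step, so no default-encoding setting is involved.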

I’m hoping there’s some way other than sitecustomize.py to pass along that the object needs to stay in unicode form. I’m sure I can do it if I hack at ElementTree a little bit, but I really don’t like altering third party packages unless I can send in a useful patch that will get integrated.

Comments 6 Comments »

Facing the threat of being shut out of government bids, Microsoft is opening up the file formats for Office 2003. This is a big step, because it means that other software products that want to make use of Office files will be able to do so with far greater compatibility than in the past. Open source projects like AbiWord, Gnumeric and OpenOffice.org will really gain from this. Microsoft even designed the license to ensure that open products can make use of the formats, even if a Microsoft patent covers the use of the format. Good deal!

People paying close attention did notice that there is an attribution clause in the license (which means that you have to credit Microsoft for the file format package if you use it). This is not compatible with the GPL, which means that GPL’ed software will likely need to do some tricks like having a plugin for the MS Office formats. We’ll see how people deal with that.

Comments No Comments »

There are some fuzzy statistics going on here, but it’s still a funny article: Excluding Bill Gates, Dec. incomes seen rising 2.6%

Excluding Gates’ $3.3 billion gain, personal incomes probably rose by 2.6 percent, with 0.3 percentage point of the increase going to just one really smart but not so tall man.

Microsoft paid out $32 billion of its massive cash hoard to stockholders. Guess who owns 10% of Microsoft?

Comments No Comments »

Found via Ksenia Marasanova: the css-discuss wiki is a companion to the css-discuss mailing list and is full of good info. Particularly given the variety of CSS support in browsers, this is a great resource. Lots of links to resources all over the place.

Comments No Comments »

Actually, that’s not true. For many software developers, it is an ASCII world. But, if you’re making software for other developers to use, please keep character encodings in mind. I’ve had to jump through hoops with many packages along the way to get them using UTF-8.

Luckily, when you use open source libraries, the lack of non-ASCII support is usually not a big problem, because it’s easy enough to add the support. Not so with a closed-source third party library. Though, anyone offering a commercial library would have to deal with encodings once they start offering the product internationally.

Comments 3 Comments »

Creating an LLC or a small corporation is a very well-understood problem. If you’re getting VC, you should be talking to a lawyer. If you’re just bootstrapping your own business, you should check out Nolo. For anything you do that is standard, boilerplate kind of stuff, Nolo can save you a lot of money. They provide books and software on a whole bunch of topics, including forming an LLC or corporation. A lawyer may charge you $600 to form your company. An incorporation service may charge you $150. At Nolo, you can pick up a book for $30-40. They even have an LLC Maker software package that will create Articles of Organization and an Operating Agreement for you.

By the way, I know that using the Nolo products will take more of your time than just answering a couple of questions from your lawyer and having them do the rest. I also know that bootstrappers are strapped for time. However, we’re also strapped for cash. Most good lawyers are going to bill at well over $100 an hour. Just as with programming tasks, estimate how long it will take and then choose the best of the two approaches.

The state of Michigan has a great website. You step through the items on the form and voila! You get a PDF of the form, filled out and ready to file. You can file by fax and have a response back in 2-3 days. In fact, if your state has something like Michigan’s website, I wouldn’t recommend getting Nolo’s LLC Maker software, because some of the value is replaced by your state’s free services. The Articles of Organization for an LLC are very easy. The Operating Agreement, which is often not filed with the state, has more detail and is especially critical if your organization has multiple owners. I’m sure you can find samples online, and there are samples in Nolo’s books. Using Nolo’s products and the spiffy Michigan website, setting up my LLC took just a couple of hours in total.

Nolo also offers a book that tells you how to maintain your LLC records and ensure that your business entity stays real in the eyes of the law.

Even if your needs are out of the ordinary, Nolo can save you some money by making you aware of the issues up front and giving you boilerplate to work with. You can always start with the boilerplate, add in notes of what you think needs to be different, and have your lawyer review that.

Nolo’s website also offers a fair bit of free information. I’ve used Nolo products for a number of years now and have been happy with them. I’d be curious to hear of alternatives as well.

Comments No Comments »

Today, my new company was officially born as a Michigan LLC. The name, which you’re likely to see here quite a bit more in the future, is Blazing Things LLC. Woo hoo!

Comments No Comments »

Default Passwords has the out-of-the-box passwords for hundreds of items. You never know when that will come in handy. Via Tug’s Blog.

Comments No Comments »

I’m sure that this is old news for everyone else, but it’s new to me. Look at the features of Pyro, Python Remote Objects. Wow. If you think Python would save you a lot of code and hassle compared to Java, that’s nothing compared to the pain that Pyro would save you over Java’s RMI! I don’t think I need Pyro right this second, but I have no doubt that I’ll be using this at some point. It also looks like Pyro already handles what Py.Execnet is trying to do with remote code running. The Pyro features list says that it can send objects across the wire, even including the Python bytecode if necessary. (The security implications there are staggering… but, the feature would be very powerful in a closed environment.)

(updated to fix idiotic typos)

Comments No Comments »

Ever since Comcast upgraded us to 3 Mbps, our cable modem has frozen up from time to time (usually during large transfers). I had always assumed that it was because our modem is ancient. We got the Comcast service as soon as they (or rather, Media One as it was at the time) offered two-way service. That was probably about 5 years ago.

The past few days, our service has gotten spotty. I had read about Comcast moving to 4 Mbps, so it might be related to that. Today, a cheerful Comcast service guy came out and replaced our modem with a sleek, new Motorola model. That’s the nice thing about paying the $5/month for the modem. It’s their responsibility!

Sadly, after the upgrade, our old Linksys BEFSR41 router refused to get an address via DHCP. My Mac could do it just fine, so I knew the cable modem was OK. I gave a firmware upgrade a go, but that didn’t do the trick either.

This was one of those points where I could have chosen to try one thing after another to make a really old, inexpensive router work properly. I value my time more than that, so I ran out to Best Buy and got a Linksys WRT54G, conveniently $50 after rebate. That’s the spiffy Linux-based model. So, we get an 802.11g upgrade in the process.

Setup for that was easy enough, but it’s pretty obnoxious that Linksys doesn’t include any instructions for setting the router up on a non-Windows machine. I went through the setup on our Windows box and then discovered that I probably could have just plugged the box in and pointed a browser at 192.168.1.1. Oh well.

Anyhow, that’s done and working now. Sadly, my net connection still seems to be moving slowly. It looks like there’s some nasty packet loss somewhere within Comcast’s network, so I guess I’ll be back on the phone with them.

Other than this little episode, though, I have to give Comcast a big thumbs up. Over the years, our service has been very reliable and very fast. Assuming they get this problem fixed soon, I’d still be quite happy to recommend them (at least here in Ann Arbor).

Comments No Comments »