ZODB vs. pysqlite with SQLObject

I have found only two good options for an embedded, transactional database that can be conveniently used in a non-free Python application: the ZODB and sqlite. I started out with sqlite and then recently started doing some heavy duty tinkering with the ZODB.

Concurrency

For many desktop applications, sqlite is great. The data is all conveniently stored in one file and current versions of sqlite even eliminate the need for “vacuuming” up deleted data. Add SQLObject to the mix, and you can get fairly transparent access to the database once you’ve set up your classes.

The problem that I ran into with SQLite is with concurrency. Concurrency is not much of a problem for many desktop apps: you have one user at the computer doing work, and that’s it. For my application, the app is doing things in the background and updating the database while the user is using the software. This effectively brings in some of the usually issues you have in a multiuser app.

sqlite has a simple locking strategy: if anyone starts writing executing data update statements in a transaction, the whole database is locked. So, while the background process was updating the database, the GUI would get stuck waiting for it. Ick.

The ZODB, particularly ZODB 3.4, has a great concurrency picture. The current ZODB implementation uses multi-version concurrency control (MVCC). This means that rather than locking the data when a write is occurring, other threads will just get older data. The ZODB doesn’t do locking at all. Instead, of two threads try to make updates to the same object a Conflict exception is raised for the one that didn’t make it through. The great thing about this strategy is that you can often just capture that exception and replay the transaction at the application level. This is exactly what Zope does.

In the work that I’ve been doing with it of late, ZODB 3.4 really does handle concurrency beautifully.

APIs

The other great thing about the ZODB is that it is transparent. There are only a very few things that you would do differently using the ZODB than what you do normally in Python. To save an object to the database, you just need to attach it to an object that is already hanging somewhere off of the root of the database.

SQLObject is a great object-relational mapper (ORM). Once you set up your classes, using SQLObject is about as transparent as an ORM can get. The downside is that you’re still doing mapping so you can’t completely forget about the fact that there’s a relational database back there.

With the ZODB, you can store any objects you need to in there, and you can do convenient Pythonic things, like adding new variables to instances that are stored in the database.

Since it sits on a relational database, SQLObject gives you excellent querying capability out-of-the-box. With the ZODB, you need to use a “catalog” to do anything other than what is effectively a “primary key search”. The catalog maintains indexes of the objects based on whatever attributes you need to index and will run queries against those indexes. Using the persistent BTrees that come with the ZODB, these indexes are very easy to create.

Note, however, that you need to create these indexes yourself and keep them up to date. IndexedCatalog can provide a little more of the relational database ease-of-use for searching.

The ZODB comes with a full-text search index, which sqlite does not.

The downside of ZODB

The ZODB, particularly with MVCC, works great for the application for which is was developed: web sites. The biggest trouble that I have with it for a desktop application is the packing operation required by the FileStorage. FileStorage is by far the most used (and maintained) way to store the database data. I don’t believe there is presently a maintained storage that does not require packing.

The good thing about packing is that it can happen in the background. The bad thing is that packing a 250MB database file down to 195MB takes many minutes (on my 1.25GHz PowerBook) with the Python process grabbing as much CPU as it can. It’s fine for a website to do a pack during off-peak hours, but sucking up a desktop user’s CPU for a good long time is unpleasant.

The ZODB tends to grow more quickly than a relational database does. The ZODB stores pickles of the objects. If you look at how pickles are stored, everything from an object’s dict gets dropped directly into the pickle with both the key and value. This makes sense, because any instance can have any set of attributes in its dict. I noticed that using slots does not seem to gain you any extra space efficiency in your pickles. Since a relational database has no such flexibility in storage, it can be far more efficient in space. That 195MB ZODB I mentioned is about 75MB in sqlite. Luckily, disk space is cheap.

ZODB (with FileStorage) uses a bit more memory than sqlite. I believe the current figure is about 8 bytes per persistent object to keep track of where the objects are in the big .fs database file. This is not too onerous. 1 million objects means 8MB of memory used, and machines can readily handle that.

It’s just worth pointing out that the ZODB is more resource hungry than sqlite.

Conclusion

Isn’t computer science great? There’s very rarely a “perfect” solution to a problem. I figured I’d write this up to help anyone who might be thinking about how they want to store their data in their application. sqlite and the ZODB have some significant differences and choosing one over the other really means taking into account what is most important for your application.

27 thoughts on “ZODB vs. pysqlite with SQLObject”

  1. Another alternative (perhaps not for your use) is Durus, which is a lighter-weight implementation of the same base functionality ZODB gives you; indeed the api is modelled after ZODB. As I understand it the motivation for writing Durus was simplicity and in fact most Python programmers can grasp what is happening under the covers – the code for Durus is fairly small. Durus does not offer MVCC.

    http://www.mems-exchange.org/software/durus/

    Pycon 2005 Durus presentation:
    http://www.python.org/pycon/2005/papers/17/Durus.html

  2. Oh yes, I forgot entirely to mention that I did look at Firebird as well.

    Firebird sounded great, but it has one significant problem: the “embedded” style database does not exist outside of Windows. If my app was Windows-only, Firebird would be an excellent option.

    On the Mac, you can get a Superserver build of Firebird going (there are some instructions out there for doing so), but you still have to get a separate database process going.

  3. I am aware of Durus as well… you’re right that it’s not appropriate for my app, but may be for some other folks. Personally, I find ZODB quite easy to work with from the get-go and it can scale up pretty well.

    Durus also requires packing, as I understand it.

  4. I think ZODB is an excellent choice where people need the additional features it offers – MVCC being among them.

    Durus is a good choice for those seeking simplicity, or are using Quixote/Dulcinea web framework. It scales fairly well too – having a ZEO like architecture. One specific feature I like in Durus that I am not sure exists today in ZODB is the “ComputedAttribute” subclass of Persistent.

    http://www.python.org/pycon/2005/papers/17/Durus.html

    It makes caching the output of a complex operation easy to keep synchronized with other connections (processes).

    Other things I like about Durus are not so much Durus specific as they are add-on bits available in Dulcinea, which is a collection of Quixote web helpers (user classes, login functionality, etc) and “site” management tools (start/stop apps, access the app’s database from the command line and manipulate it, manage a set of development, test and production “servers”).

    I liked your comment about the state of computer science and availability of multiple solutions which is why I popped these notes your comments.

  5. Yes, there are always tradeoffs. Mentioning Durus was a helpful addition here, especially given that I started the article with “there are two choices for embedded databases”… Firebird on Windows and Durus are two more choices. So, I guess there are now 4 choices 🙂

    I’ve looked at a whole bunch of different storage options, but for my application ZODB and sqlite got the most attention because they were the most applicable.

  6. By the way, I think you’re correct that the ZODB doesn’t have any specific equivalent to Durus’ ComputedAttribute. ZODB Persistent objects can have _v_ attributes (volatile) which do not persist and need to be regenerated whenever the object is pulled from the database. I’m not certain that the semantics or lifecycle are exactly the same as they are in Durus, however.

  7. Err, you wouldn’t. Isn’t “with” higher in precedence than “vs.”? 🙂

    The headline should be: ZODB vs. (pysqlite with SQLObject). sqlite is not OO, so SQLObject comes in handy. With the ZODB, you most certainly don’t (can’t) use SQLObject.

  8. How about interrupting the background process when the user wants to do something to the databse.
    I’d definitely go with SQLite if you can’t use Firebird.

  9. I did look at Metakit some months back… That is another option that is available for embedding a database in a non-free application. If memory serves, there was some scalability issues with it and the interface wasn’t quite as pleasant as the ZODB. I’ll have to see if I can find my notes on that.

    Regarding sqlite and the background jobs: that was exactly the route I was taking. I was moving the write jobs into a queue and would halt processing of that queue when the user is taking action.

    I’m using the past tense in that paragraph, though, because it turns out that Firebird was an option and it’s working great. I’m a bit busy with code right now, but I do plan to do a follow up article.

  10. Note that since version 3, SQLite no longer needs an exclusive lock for the entire duration of a write operation (instead, it’s only needed briefly, during the final stage of the commit). In other words, you *can* still read from the database without blocking, while a lengthy write operation is going in the background.

  11. That is true, and a good point. But, you do have to carefully consider how you do things. Here’s the sqlite page about locking:

    http://www.sqlite.org/lockingv3.html

    Section 7 (all the way at the end) tells the most relevant part of the locking story.

    pysqlite2’s handling of transactions is done well, because it doesn’t start a transaction until you run a statement that does some kind of database modification. That avoids getting a SHARED lock too early.

    Thanks for bringing up the point of exactly how sqlite handles locks. It is very relevant for anyone considering it.

  12. Given that Durus has already been mentioned above, i’d like to further add a few comments regarding (a) Durus itself, (b) one of the things you said in the article concerning sqlite: “The downside is that you’re still doing mapping so you can’t completely forget about the fact that there’s a relational database back there.”, and (c) regarding the lack, on the other hand, of out of the box “querying ability” of ZODB).

    To facilitate using Durus, I have made a small light layer, moellus, that adds precisely some very useful “relational-like” characteristics to the database’s organization… this way, I can help benefit from these (having, for example, some querying abilities out of the box), but still have a perfectly fine object graph with all of my objects. The Durus feature mentioned above, ComputedAttribute, is also used heavily to automate things such as the inverse of a many-to-one relationship, that can therefore be available simply by declaration.

    However, there is object-relational mapper in sight, as there is none! A pure object graph, with some model and organizational impositions…

    I have actually released a first version recently, that can be found at:
    http://scoopics.com/software/moellus/

    And, there is now also a presentation, just hot off the europython2005 press:
    http://www.python-in-business.org/ep2005/talk.chtml?talk=2092&track=692

  13. Anyone read about Gadfly?. A simple, in memory SQL python database. Not really a “proffesional” database, but good enough for “light-weight” desktop projects (much better than using access for example), and have not problems to store small images or files through Python pickle. I have used it in tens of projects with zero problems to recover from power failures up to date.

  14. Probably the only reason to choose Gadfly over sqlite is if you want to distribute one set of .py files, instead of having platform-specific binaries.

    While working with the ZODB, I did take a look at AdvancedQuery. Nice stuff, and you’re probably right that it would work outside of Zope with a little work. I didn’t mention it in the main article, because ZCatalog is enough to get you querying…

  15. Re: ComputedAttributes and _v_ attributes.
    Unlinke _v_ attributes in ZODB, ComputedAttributes provide a mechanism
    for invalidating data in other connections to the database. This makes it
    possible to maintain consistency of information derived from other persistent
    instances.

    The ComputedAttribute class in Durus is simple and doesn’t rely on any feature of Durus that is not also present in ZODB. You could do the same thing in a ZODB application

  16. probably you should also have a look at Atop

    “””
    ATOP, the Atomic Transactional Object Persistor, is a Python object database implemented atop the Berkeley DB and bsddb python module, with functional similarities to other python packages such as ZODB and COG.
    “””
    from the Atop project site – http://www.divmod.org/projects/atop

Comments are closed.