PyORQ - Python Object Relational binding with Queries

Home
Introduction
Features/TODO
Tutorial
Documentation
Implementation

SF Summary
Download
Contact

Implementation

PyORQ relies heavily on new features in Python 2.2/2.3, including properties, metaclasses (2.2) and the datetime module (2.3). I want to highlight some parts of PyORQ that illustrate nicely what you can do with current Python. I also want to explain some of the gory details of the query-to-SQL translation. Simply because I'm rather proud of them.

I assume you have some basic understanding of the new features Python 2.3. You can read up on them here.

Object identities

The key ingredient of any object-relational mapping is the notion of object identity. PyORQ uses the shorthand oid. In addition, each persistent class has a corresponding table with table identity tid.

At the Python side of PyORQ, each instance i can be uniquely identified with a (cls, oid) pair. If we know the oid of an instance of class cls, we can retrieve that instance using:

i = cls(oid=oid)
i.oid is oid        # This will now be True

Of course, we normally don't know the object identity, and that's why we use queries to instantiate objects. However, PyORQ does use this instantiation mechanism internally.

At the SQL side each reference can be described by a (tid, oid) pair. We need the tid, because references can point to subclasses.

The ptype metaclass

Modifying the semantics of class creation and object instantiation are the canonical examples of the things you can do with metaclasses. The metaclass for persistent objects is rather simple:

class ptype(type):
    def __init__(cls, name, bases, attributes):
        """Called after class has been created"""

        ...do some bookkeeping on persistent properties...

        cls.database.register_class(cls)

    def __call__(cls, *args, **kwargs)
        """Called when object is instantiated"""
        return cls.database.get_instance(cls, *args, **kwargs)

The ptype.__init__() method is the constructor of the ptype instance, which is a class. It passes the newly created class object to the database for registration. The database can then decide if it needs to create a new table for the class.

The ptype.__call__() methods is called when an instance of a persistent class is created (when a persistent class is 'called'). This job is also passed to the database. The database can decide if a new instance must be created or if an existing instance should be retrieved from either a cache or the backend.

The main purpose of pobject, the base class of all persistent classes, is to make these classes instances of ptype.

class pobject(object):
    __metaclass__ = ptype

The full definition of ptype and pobject can be found in module pyorq.ptype. The abstract base class defining the database interface can be found in module pyorq.interface.db_base.

Persistent properties

Properties are objects that define __get__(), __set__() and __del__() methods to get, set, and delete instance attributes.

The module pyorq.pprop defines persistent value properties (pval objects) and persistent references (pref objects). The different types of persistent values (pint, pfloat, etc.) derive from pval.

The __set__() method for persistent properties is very straightforward. It performs a type check on the value, stores the value in the instance __dict__, and sets the dirty bit.

The __get__() method for persistent values is a little more interesting, because it gets called when we get the attribute for an instance, as well as when we get the attribute for the class:

class pval(persistent):
    ...
    def __get__(self, instance, cls=None):
        if instance is None:
	    # called as cls.attr
            return prel.value_attr(cls, self)
        else:
            # called as instance.attr
            return instance.__dict__.get(self.key, self.default)

If we do the attribute access on an instance, we return the value from the __dict__ or a default. If we do an attribute access on a class, we create a new object value_attr, which is where the query-mechanism starts. I will come back to that later.

The __get__() method for persistent references is also interesting, because it solves a small problem we have with retrieving references from the database.

If we retrieve an instance, and also, recursively, retrieve the instances it refers to, we could end up with a huge number of retrieval operations. We therefore initially don't retrieve these referred objects, but rather store the (cls, oid) pair. Only when we access a reference do we instantiate this object. In code:

class pref(persistent):
    ...
    def __get__(self, instance, cls=None):
        if instance is None:
            return prel.ref_attr(cls, self)
        else:
            val = instance.__dict__.get(self.key, self.default)
            if isinstance(val, tuple):
	        val = val[0](oid=val[1])
            instance.__dict__[self.key] = val
            return val

Note that accessing a class attribute returns a prel.ref_attr object if that attribute is a pref object.

Queries

The module pyorq.prel, defines the relational algebra for PyORQ. As we saw earlier, attribute access on classes may return value_attr and ref_attr objects for persistent values and persistent references respectively. These are both subtypes of the expression object, defined in this module.

Expressions define arithmetic operators which return other types of expression object, depending on the operator. Expressions also define comparison operators, which return relation objects. Relations define logical operators (using bitwise operators), that return new relation objects.

For example, the query

(A.x>=7) & (A.x + 1 < A.y.z)

returns an object

rel_and(comp_ge(value_attr(A, A_x),
	        7),
        comp_lt(expr_add(value_attr(A, A_x), 
                         1),
                value_attr(ref_attr(A, A_y),
                           Y_z))

where A_x and Y_z are pval objects, and A_y is a pref object.

Translating queries into SQL

The relation object in module pyorq.prel defines an iterator method (__iter__()) that generates instances that satisfy the query.

I distinguish between queries that that don't contain references, and queries that do. Let's first consider simple queries without references (or bound variables, as I call them).

Given a class hierarchy:

class A(pobject):
    x = pint()

class B(A):
    pass

and the queries

q1 = (B.x == 3)  # q1.free_variable() is B
q2 = (A.x == 3)  # q2.free_variable() is A

The query q1 should only generate instances of type B, since the free variable B has no subclasses. The query q2 should generate instances of types A and/or B, since in both cases the instance i, may satisfy (isinstance(i, A) and i.x==3).

Here is the __iter__ method for relation objects:

class relation(object):
     ...
     def __iter__(self):
        cls = self.free_variable()
        db = cls.database

        for i in cls.all_subclasses():
            for j in db.query_generator(i, self):
                yield j

The class method all_subclasses is defined in ptype.ptype. It recursively generates the class and all of its subclasses.

The query_generator method of the database interface is responsible for generating the the SQL query, and retrieving the instances. In the simple case, without references, it uses this code-path:

def query_generator(self, cls, relation)
    cmnd = "SELECT _x.oid FROM %s _x WHERE %s" % (cls.table_name,
                                              relation.sql_repr(self))
    oids = self.fetchall(cmnd)
    for oid in oids:
         yield cls(oid=oid)

Note that I use _x as an alias for the free variable. The sql_repr method of the relation object knows about this aliasing convention, and can thus easily produce a WHERE-clause.

Queries become more complicated if they contain references. Consider the following addition to our database schema:

class C(pobject):
    ref1 = pref(A)
    ref2 = pref(A)

class D(pobject):
    ref = pref(C)

If we want to translate the query

C.ref1.x == 3

we have to take into account that C.ref1 refers to objects of type A or B. To do this, the query generator first asks the relation object for an inventory of bound variables and their 'reference chains'. In this case there is one bound variable (A) with 'chain' ('_x', 'ref1').

The reference chain is used to generate aliases for the bound variable, in this case '_x_ref1'. They are also used to define join clauses. In this case _x.ref1_oid = _x_ref1.oid

We now have to loop over all permutations that are generated by considering all subclasses of each bound variable. In this case, that means we execute the query twice, Once with the table for A aliased to '_x_ref1', and once with the table for B, using the same alias. In SQL:

SELECT * FROM c _x, a _x_ref1 
         WHERE _x.ref1_oid=_x_ref1.oid AND (_x_ref1.x = 3)
SELECT * FROM c _x, b _x_ref1 
         WHERE _x.ref1_oid=_x_ref1.oid AND (_x_ref1.x = 3)

Note that the WHERE clause does not change. Only the table that is aliased to _x_ref1 is changed.

The query

(C.ref1.x == 3) | (C.ref2.x==3)

has two bound variables, that happen to be of the same type (A). The reference chains are ('_x', 'ref1') and ('_x', 'ref2') respectively. This means that we have to evaluate the query for the the following permutations of the bound variables: (A, A), (A, B), (B, A), and (B, B). Or, four queries of the form:

SELECT * FROM c _x, a _x_ref1, a _x_ref2
         WHERE _x.ref1_oid=_x_ref1.oid AND _x.ref2_oid = _x_ref2.oid 
               AND ((_x_ref1.x = 3) OR (_x_ref2.x = 3))

with four permutations of the tables that refer to _x_ref1 and _x_ref2 respectively.

Queries containing references to references, introduce extra bound variables. For example, the query

D.ref.ref1.x == 3

has two bound variables C and A, with reference chains ('_x', 'ref') and ('_x', 'ref', 'ref1') respectively. We have to execute the following query with permutations (C, A) and (C, B).

SELECT * FROM d _x, c _x_ref, a _x_ref_ref1
         WHERE _x.ref_oid = _x_ref.oid AND _x_ref.ref1_oid = _x_ref_ref1.oid
               AND (_x_ref_ref.x = 3)

The code path for the query generator with references is:

def query_generator(self, cls, relation):

    # get the list of bound variables and their reference chains
    # use a dict mapping chains to bound variables to avoid duplicate joins
    bound_variables = {}
    relation.update_bound_variables(bound_variables)
    bound_variables = bound_variables.items()
    
    cmnd_format = "SELECT _x.oid FROM %s _x,%s WHERE %s AND %s"

    # loop over all permutations of the bound variables set
    for current_vars in self.generate_bound_variables(bound_variables):
        aliases = self.create_aliases(current_vars)
        join_clause = self.create_join_clause(current_vars)

        cmnd = cmnd_format % (cls.table_name, aliases, 
                              join_clause, select_clause)
        oids = self.fetchall(cmnd)
        for oid in oids:
            yield cls(oid=oid)