Dynamic XML Library with Python

Here is a high-level overview of a pattern I see at work all too often:

A new XML type is conceived and a schema definition document (.xsd) is introduced
XML files are created based on the new .xsd
Code files are written to abstract away the low-level details of reading/writing XML data and give an easy-to-use API for accessing the data.

Let’s look at a very simple example:

<!-- ConfigFile.xml -->
<ConfigFile>
    <Parameters
        timeout="1000"
        runtimeDataPath="/path/to/data" />

</ConfigFile>

<!-- ConfigFile.xsd -->
<!--
    Describe the allowable structure of <ConfigFile> and <Parameters>
-->

# configfile.py
class ConfigFile:
    def __init__(self, path_to_xml_file):
        self.parameters = {}

        '''
        ... XML parsing code  to fill in self.parameters...
        '''

    @property
    def timeout(self):
        return self.parameters.get('timeout', 2000)

    @timeout.setter
    def timeout(self, value):
        self.parameters['timeout'] = value

    @property
    def runtime_data_path(self):
        return self.parameters.get(
            'runtimeDataPath',
            '/default/path/to/runtime/data'
        )

    @runtime_data_path.setter
    def runtime_data_path(self, value):
        self.parameters['runtimeDataPath'] = value

# main.py
from configfile import ConfigFile

config = ConfigFile('/path/to/config')

print(config.runtime_data_path)
print(config.timeout)

Let’s break down what’s going on:

There exists “ConfigFile.xml” which describes parameters to control our script
There exists “ConfigFile.xsd” which describes the allowed format of the “ConfigFile” type
There exists “ConfigFile.py” which has a class that gives the rest of our code easy access to config file data

Here’s what I hate about this pattern

If a property of the .xsd needs to be changed for any reason, changes to the source code are almost always needed.
It may not be obvious, but the “ConfigFile” type is defined in two locations: “ConfigFile.xsd” and “ConfigFile.py”. They are tightly coupled because they both reflect the actual definition of the “ConfigFile” type, albeit in different file formats.
For example, if runtime_data_path changes to runtime_data_location, both the .xsd and the Python code files need to change.

You can imagine how tedious this becomes on codebases that have hundreds of different XML file types. Our coding standards at work mandate that APIs must (MUST) be created to read from and write to our configuration files. Thankfully, there are some tools out there to help with this problem. Here are a few:

generateDS
- Autogenerates code from schema definition files
PyXB
- Autogenerates code from schema definition files

These two tools are outstanding and are most likely good enough for the majority of use cases. The code autogeneration is fast and the penalty for changing the .xsd files is low, usually only re-committing the autogenerated code. The point of this post, however, is to remove the duplication completely. And the autogenerated code is still duplication no matter how small the perceived cost is to maintain it.

The trick to this is overriding two special methods on the class: __getattr__ and __setattr__. If you aren’t familiar with these, here’s a brief overview:

__getattr__ is called whenever a member lookup on a class instance fails. For example, if a caller invokes obj.foo and foo is not a member of obj, __getattr__ is invoked. If the class does not define __getattr__, the failed lookup simply raises an AttributeError.
__setattr__ is called whenever a member of a class instance is being set to a value.
There is a slight subtlety here that will need to be considered. __setattr__ is ALWAYS called, whereas __getattr__ is only called after a member lookup fails.

Let’s look at a very basic example of this concept

#xmlnode.py
class XmlNode:
    def __init__(self):
        self.data = {
            'a': 1,
            'b': 2,
            'c': 3,
        }

    # This only gets called when member lookup fails.
    def __getattr__(self, key):
        # Before raising, see if key is in data.
        if key not in self.data:
            raise AttributeError(key)

        # it's a valid key, return the data
        return self.data[key]

    # this always gets called when a member is being set, whether it
    # exists or not.
    def __setattr__(self, key, value):
        # Defer to our super if it's in __dict__, as its existence in
        # __dict__ implies it's a valid member of the class.
        # Check for 'data' specifically for the edge case that
        # this is the first time data is set, as it won't be in __dict__.
        # In general, this extra list contains all hand-defined members
        # of a class.
        if key in self.__dict__ or key in ['data']:
            super().__setattr__(key, value)
            return

        # The user gave us an invalid key, just raise
        if key not in self.data:
            raise AttributeError(key)

        self.data[key] = value

root = XmlNode()

# This won't raise, even though none of these members are defined in
# code!
print(root.a)
print(root.b)
print(root.c)
root.c = 10
print(root.c)

# this will raise AttributeError
print(root.d)

Let’s break this down:

There exists a dictionary data within the class instance that contains some data
__getattr__ and __setattr__ are overridden to catch “bad” invocations of non-existent members of the XmlNode class
When root.a is invoked, the interpreter makes a call into XmlNode.__getattr__ because a does not exist
Before deciding to raise an exception, see if a is in data. If it is, return the value pointed to by a. If it’s not, raise AttributeError, just like what would happen in the default implementation of __getattr__.
Same goes with __setattr__. If the member being set is an actual defined member of the class, then call super().__setattr__(). A list of pre-defined instance members needs to be maintained to avoid entering an infinite recursive loop.
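
The infinite recursion is easy to trigger by accident, so here is a small sketch of it. Unguarded and Guarded are hypothetical names for illustration. Guarded shows one alternative to keeping a list of names, namely setting the hand-defined member through object.__setattr__ so the override is never consulted for it. That is a different technique from the list used above, not a replacement for it.

```python
class Unguarded:
    def __init__(self):
        # Calls our __setattr__ before 'data' exists on the instance.
        self.data = {"a": 1}

    def __getattr__(self, key):
        # 'self.data' is missing, so this lookup re-enters __getattr__.
        if key not in self.data:
            raise AttributeError(key)
        return self.data[key]

    def __setattr__(self, key, value):
        # No special case for 'data', so we touch self.data too early.
        if key not in self.data:
            raise AttributeError(key)
        self.data[key] = value


try:
    Unguarded()
    recursed = False
except RecursionError:
    recursed = True


class Guarded(Unguarded):
    def __init__(self):
        # Skip the override for the one hand-defined member.
        object.__setattr__(self, "data", {"a": 1})


node = Guarded()
node.a = 5       # routed into data by __setattr__
print(recursed, node.a)
```

With many hand-defined members, the explicit list is probably easier to read than a string of object.__setattr__ calls.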

So, let’s change the XmlNode class slightly and make the point of the post obvious.

# xmlnode.py
class XmlNode:
    def __init__(self, path_to_xml_file, attr_prefix='attr_'):
        # prefix that marks a lookup as an XML attribute
        self.attr_prefix = attr_prefix

        # child nodes: tag name -> list of XmlNode
        self.nodes = {}

        # attributes of the node
        self.attributes = {}

        # text of the node
        self.text = None

        # tag of the node
        self.tag = None

        '''
        ... XML Data Parsing ...
        '''

    def __getattr__(self, key):
        # caller is trying to get an attribute
        if key.startswith(self.attr_prefix):
            # trim the prefix (slice instead of replace() so only the
            # leading occurrence is removed)
            key = key[len(self.attr_prefix):]

            # invalid key
            if key not in self.attributes:
                raise AttributeError(key)

            return self.attributes[key]

        # caller is trying to get a node
        else:
            # invalid key
            if key not in self.nodes:
                raise AttributeError(key)

            return self.nodes[key]

    def __setattr__(self, key, value):
        # catch hand-defined members here and defer to super
        if key in self.__dict__ or \
            key in ['attr_prefix', 'nodes', 'attributes', 'text', 'tag']:
            super().__setattr__(key, value)
            return

        # don't support setting anything other than attributes
        if not key.startswith(self.attr_prefix):
            raise AttributeError(key)

        # trim the prefix
        key = key[len(self.attr_prefix):]

        # invalid key
        if key not in self.attributes:
            raise AttributeError(key)

        self.attributes[key] = value

# main.py
from xmlnode import XmlNode

root = XmlNode('/path/to/ConfigFile.xml')

# Node access returns a list, and .xsd file guarantees there must be one
# and only one.  So it's safe to access index 0.
parameterNode = root.Parameters[0]

# direct attribute access
print(parameterNode.attr_timeout)
print(parameterNode.attr_runtimeDataPath)

parameterNode.attr_timeout = "2000"
print(parameterNode.attr_timeout)

First, create an XmlNode object which parses the provided configuration file path to create a dictionary of attributes and a dictionary of child nodes which are also of type XmlNode. To access attributes of a node, prepend the member name with attr_. To access child nodes, simply use the name of the node and a list of child nodes gets returned. If you compare this with the previous version of main.py you’ll notice it’s very similar. Yet in this new version, there is no ConfigFile.py and no dependencies on tools that generate ConfigFile.py.
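
The parsing step is left out of the class above. As a rough sketch of how the two dictionaries could be filled, here is one possibility that leans on the standard library's xml.etree.ElementTree. ParsedNode is a hypothetical, read-only stand-in for XmlNode that takes an already-parsed element instead of a file path, and ATTR_PREFIX is a module-level constant I introduced to keep it short.

```python
import xml.etree.ElementTree as ET

ATTR_PREFIX = "attr_"  # same naming convention as XmlNode


class ParsedNode:
    """Hypothetical read-only stand-in for XmlNode."""

    def __init__(self, element):
        self.tag = element.tag
        self.text = element.text
        self.attributes = dict(element.attrib)
        self.nodes = {}
        for child in element:
            # Group children by tag so each tag maps to a list of nodes.
            self.nodes.setdefault(child.tag, []).append(ParsedNode(child))

    def __getattr__(self, key):
        # Pick the dictionary to search based on the prefix.
        if key.startswith(ATTR_PREFIX):
            table, name = self.attributes, key[len(ATTR_PREFIX):]
        else:
            table, name = self.nodes, key
        if name not in table:
            raise AttributeError(key)
        return table[name]


XML = (
    '<ConfigFile>'
    '<Parameters timeout="1000" runtimeDataPath="/path/to/data" />'
    '</ConfigFile>'
)

root = ParsedNode(ET.fromstring(XML))
params = root.Parameters[0]
print(params.attr_timeout)
print(params.attr_runtimeDataPath)
```

Swapping ET.fromstring(XML) for ET.parse(path).getroot() would read from a file instead.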

The above example is still more pseudo-code than anything else, so let’s look at an implementation that uses etree (xml.etree.ElementTree). etree is very easy to use and extend, and its underlying data structure is very similar to what was used in the previous example. Other libraries like lxml could be easily adapted.

Head over to dynamic-xml to see the fully working implementation and examples. The high-level details:

DynamicXmlParser exists to inject DynamicTreeBuilder. It’s not worth discussing.
DynamicTreeBuilder exists to inject DynamicElement. It’s also not worth discussing.
DynamicElement is the workhorse of the library and has the __getattr__ and __setattr__ implementations described in this post. The implementation is very similar to XmlNode, with added care to handle small details introduced by extending xml.etree.ElementTree.
dynamicxml.py is the main entry point and the only file that needs to be imported for use.
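
To give a feel for the injection chain without opening the repository, here is a minimal sketch built only from standard-library pieces. It is not the library's actual code. DynamicElementSketch and parse_string are hypothetical names, and the sketch covers read access only. It relies on TreeBuilder accepting an element_factory callable and XMLParser accepting a target, both of which are part of xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

ATTR_PREFIX = "attr_"  # assumed prefix, matching the XmlNode example


class DynamicElementSketch(ET.Element):
    """Hypothetical Element subclass with dynamic read access."""

    def __getattr__(self, key):
        # Only reached when Element itself has no such member.
        if key.startswith(ATTR_PREFIX):
            name = key[len(ATTR_PREFIX):]
            if name not in self.attrib:
                raise AttributeError(key)
            return self.attrib[name]

        # Otherwise treat the name as a child tag.
        children = self.findall(key)
        if not children:
            raise AttributeError(key)
        return children


def parse_string(xml_text):
    # The builder creates every element through our factory, and the
    # parser hands its events to the builder.
    builder = ET.TreeBuilder(element_factory=DynamicElementSketch)
    parser = ET.XMLParser(target=builder)
    parser.feed(xml_text)
    return parser.close()


root = parse_string(
    '<ConfigFile>'
    '<Parameters timeout="1000" runtimeDataPath="/path/to/data" />'
    '</ConfigFile>'
)
print(root.Parameters[0].attr_timeout)
```

Because every node is still a regular Element, the usual etree methods such as find and iter remain available next to the dynamic access.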

One deficiency with the implementation is that it does not consider the .xsd document that backs the XML file. The assumption is that the XML file has already been validated with an offline tool and can therefore be accessed without concern of reading erroneous data. The only trouble arises with optional attributes and nodes, which the library does not consider.
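
One way a caller could cope with optional attributes, assuming the node raises AttributeError for missing names the way XmlNode does, is Python's built-in getattr with a default. The Node class below is a hypothetical stand-in so the sketch runs on its own, and the defaults are the ones from the hand-written ConfigFile class at the top of the post.

```python
class Node:
    """Hypothetical stand-in with XmlNode-style attribute lookup."""

    def __init__(self, attributes):
        self.attributes = attributes

    def __getattr__(self, key):
        prefix = "attr_"
        if not key.startswith(prefix):
            raise AttributeError(key)
        name = key[len(prefix):]
        if name not in self.attributes:
            raise AttributeError(key)
        return self.attributes[name]


# runtimeDataPath is optional here and has been left out of the file.
node = Node({"timeout": "1000"})

# getattr() falls back to the default when AttributeError is raised.
timeout = getattr(node, "attr_timeout", "2000")
data_path = getattr(
    node, "attr_runtimeDataPath", "/default/path/to/runtime/data"
)

print(timeout)
print(data_path)
```

This pushes the defaults back into the calling code, so it is a workaround rather than a fix for ignoring the .xsd.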
