Here is a high-level overview of a pattern I see at work all too often:
- A new XML type is conceived and a schema definition document (.xsd) is introduced
- XML files are created based on the new .xsd
- Code files are written to abstract away the low-level details of reading/writing XML data and give an easy-to-use API for accessing data.
Let’s look at a very simple example
<!-- ConfigFile.xml -->
<ConfigFile>
    <Parameters
        timeout="1000"
        runtimeDataPath="/path/to/data" />
</ConfigFile>
<!-- ConfigFile.xsd -->
<!--
Describe the allowable structure of <ConfigFile> and <Parameters>
-->
# configfile.py
class ConfigFile:
    def __init__(self, path_to_xml_file):
        self.parameters = {}
        '''
        ... XML parsing code to fill in self.parameters...
        '''

    @property
    def timeout(self):
        return self.parameters.get('timeout', 2000)

    @timeout.setter
    def timeout(self, value):
        self.parameters['timeout'] = value

    @property
    def runtime_data_path(self):
        return self.parameters.get(
            'runtimeDataPath',
            '/default/path/to/runtime/data'
        )

    @runtime_data_path.setter
    def runtime_data_path(self, value):
        self.parameters['runtimeDataPath'] = value
# main.py
from configfile import ConfigFile
config = ConfigFile('/path/to/config')
print(config.runtime_data_path)
print(config.timeout)
Let’s break down what’s going on
- There exists “ConfigFile.xml” which describes parameters to control our script
- There exists “ConfigFile.xsd” which describes the allowed format of the “ConfigFile” type
- There exists “ConfigFile.py” which has a class that gives the rest of our code easy access to config file data
Here’s what I hate about this pattern
- If a property of the .xsd needs to be changed for any reason, changes to the source code are almost always needed
- It may not be obvious, but the “ConfigFile” type is defined in two locations: “ConfigFile.xsd” and “ConfigFile.py”. They are tightly coupled because both reflect the actual definition of the “ConfigFile” type, albeit in different file formats.
- For example, if runtime_data_path changes to runtime_data_location, changes to both the .xsd and Python code files need to be made.
You can imagine how tedious this becomes on codebases that have hundreds of different XML file types. Our coding standards at work mandate that APIs must (MUST) be created to read from and write to our configuration files. Thankfully, there are some tools out there to help with this problem. Here are a few:
- generateDS
  - Autogenerates code from schema definition files
- PyXB
  - Autogenerates code from schema definition files
These two tools are outstanding and are most likely good enough for the majority of use cases. The code autogeneration is fast and the penalty for changing the .xsd files is low, usually only re-committing the autogenerated code. The point of this post, however, is to remove the duplication completely. And the autogenerated code is still duplication no matter how small the perceived cost is to maintain it.
The trick to this is overriding two special methods: __getattr__ and __setattr__. If you aren’t familiar with these, here’s a brief overview:
- __getattr__ is called whenever a member lookup of a class instance fails. For example, if a caller invokes obj.foo and foo is not a member of obj, __getattr__ is invoked. If a class does not define __getattr__, the failed lookup simply raises an AttributeError.
- __setattr__ is called whenever a member of a class instance is being set to a value.
- There is a slight subtlety here that will need to be considered. __setattr__ is ALWAYS called, whereas __getattr__ is only called after a member lookup fails.
Let’s look at a very basic example of this concept
# xmlnode.py
class XmlNode:
    def __init__(self):
        self.data = {
            'a': 1,
            'b': 2,
            'c': 3,
        }

    # This only gets called when member lookup fails.
    def __getattr__(self, key):
        # Before raising, see if key is in data.
        if key not in self.data:
            raise AttributeError(key)
        # It's a valid key, return the data.
        return self.data[key]

    # This always gets called when a member is being set, whether it
    # exists or not.
    def __setattr__(self, key, value):
        # Defer to our super if it's in __dict__, as its existence in
        # __dict__ implies it's a valid member of the class.
        # Check for 'data' specifically for the edge case that
        # this is the first time data is set, as it won't be in __dict__.
        # In general, this extra list contains all hand-defined members
        # of a class.
        if key in self.__dict__ or key in ['data']:
            super().__setattr__(key, value)
            return
        # The user gave us an invalid key, just raise.
        if key not in self.data:
            raise AttributeError(key)
        self.data[key] = value


root = XmlNode()
# This won't raise, even though none of these members are defined in
# code!
print(root.a)
print(root.b)
print(root.c)
root.c = 10
print(root.c)
# This will raise AttributeError.
print(root.d)
Let’s break this down:
- There exists a dictionary data within the class instance that contains some data
- __getattr__ and __setattr__ are overridden to catch “bad” invocations of non-existent members of the XmlNode class
- When root.a is invoked, the interpreter makes a call into XmlNode.__getattr__ because a does not exist
- Before deciding to raise an exception, see if a is in data. If it is, return the value pointed to by a. If it’s not, raise AttributeError, just like what happens when __getattr__ is not overridden.
- Same goes with __setattr__. If the member being set is an actual defined member of the class, then call super().__setattr__(). A list of pre-defined instance members needs to be maintained to avoid entering an infinite recursive loop.
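The infinite loop mentioned in the last point is easy to reproduce. The sketch below uses two hypothetical classes: Naive drops the guard for data, and Bypass shows one alternative to maintaining a list, namely writing the member through object.__setattr__ in __init__. That alternative is a swapped-in technique for comparison, not the approach used in this post.

```python
class Naive:
    """Hypothetical variant of XmlNode without the 'data' guard."""

    def __init__(self):
        self.data = {'a': 1}  # calls __setattr__('data', ...)

    def __getattr__(self, key):
        # 'data' is not in __dict__ yet, so self.data re-enters __getattr__.
        if key not in self.data:
            raise AttributeError(key)
        return self.data[key]

    def __setattr__(self, key, value):
        # The first touch of self.data here starts the loop.
        if key not in self.data:
            raise AttributeError(key)
        self.data[key] = value


class Bypass(Naive):
    """Same hooks, but __init__ skips them via object.__setattr__."""

    def __init__(self):
        # Writes straight into the instance, so no hook runs.
        object.__setattr__(self, 'data', {'a': 1})


try:
    Naive()
    recursed = False
except RecursionError:
    recursed = True

node = Bypass()
node.a = 5
print(recursed, node.a)
```

Constructing Naive never finishes: the interpreter gives up with a RecursionError. Bypass behaves like the guarded version because data is already in __dict__ by the time either hook looks for it.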
So, let’s change the XmlNode class slightly and make the point of the post obvious.
# xmlnode.py
class XmlNode:
    def __init__(self, path_to_xml_file, attr_prefix='attr_'):
        # prefix that marks attribute access, e.g. node.attr_timeout
        self.attr_prefix = attr_prefix
        # child nodes: maps a tag name to a list of XmlNode objects
        self.nodes = {}
        # attributes of the node
        self.attributes = {}
        # text of the node
        self.text = None
        # tag of the node
        self.tag = None
        '''
        ... XML Data Parsing ...
        '''

    def __getattr__(self, key):
        # caller is trying to get an attribute
        if key.startswith(self.attr_prefix):
            # trim the prefix (only the leading occurrence)
            key = key[len(self.attr_prefix):]
            # invalid key
            if key not in self.attributes:
                raise AttributeError(key)
            return self.attributes[key]
        # caller is trying to get a node
        else:
            # invalid key
            if key not in self.nodes:
                raise AttributeError(key)
            return self.nodes[key]

    def __setattr__(self, key, value):
        # catch valid members here and defer to super
        if key in self.__dict__ or \
                key in ['attr_prefix', 'nodes', 'attributes', 'text', 'tag']:
            super().__setattr__(key, value)
            return
        # don't support setting anything other than attributes
        if not key.startswith(self.attr_prefix):
            raise AttributeError(key)
        # trim the prefix (only the leading occurrence)
        key = key[len(self.attr_prefix):]
        # invalid key
        if key not in self.attributes:
            raise AttributeError(key)
        self.attributes[key] = value
# main.py
from xmlnode import XmlNode
root = XmlNode('/path/to/ConfigFile.xml')
# Node access returns a list, and .xsd file guarantees there must be one
# and only one. So it's safe to access index 0.
parameterNode = root.Parameters[0]
# direct attribute access
print(parameterNode.attr_timeout)
print(parameterNode.attr_runtimeDataPath)
parameterNode.attr_timeout = "2000"
print(parameterNode.attr_timeout)
First, create an XmlNode object which parses the provided configuration file path to create a dictionary of attributes and a dictionary of child nodes which are also of type XmlNode. To access attributes of a node, prepend the member name with attr_. To access child nodes, simply use the name of the node and a list of child nodes gets returned. If you compare this with the previous version of main.py you’ll notice it’s very similar. Yet in this new version, there is no ConfigFile.py and no dependencies on tools that generate ConfigFile.py.
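For readers who want something runnable, here is a minimal sketch of the same idea with the parsing step filled in. It makes two assumptions that differ from the listing in this post: the constructor takes an already-parsed xml.etree.ElementTree element instead of a file path, so children can be wrapped recursively, and the hand-defined member names live in a module-level tuple called _MEMBERS (a made-up name).

```python
import xml.etree.ElementTree as ET

# Hand-defined members that must bypass the dynamic lookup.
_MEMBERS = ('attr_prefix', 'nodes', 'attributes', 'text', 'tag')


class XmlNode:
    def __init__(self, element, attr_prefix='attr_'):
        # Assumption: takes an already-parsed Element, not a file path.
        self.attr_prefix = attr_prefix
        self.tag = element.tag
        self.text = element.text
        self.attributes = dict(element.attrib)
        # Maps a child tag name to the list of children with that tag.
        self.nodes = {}
        for child in element:
            wrapped = XmlNode(child, attr_prefix)
            self.nodes.setdefault(child.tag, []).append(wrapped)

    def __getattr__(self, key):
        if key in _MEMBERS:
            # A hand-defined member that is not set yet; do not recurse.
            raise AttributeError(key)
        if key.startswith(self.attr_prefix):
            name = key[len(self.attr_prefix):]
            if name not in self.attributes:
                raise AttributeError(key)
            return self.attributes[name]
        if key not in self.nodes:
            raise AttributeError(key)
        return self.nodes[key]

    def __setattr__(self, key, value):
        if key in _MEMBERS:
            super().__setattr__(key, value)
            return
        # Only existing XML attributes may be set dynamically.
        if not key.startswith(self.attr_prefix):
            raise AttributeError(key)
        name = key[len(self.attr_prefix):]
        if name not in self.attributes:
            raise AttributeError(key)
        self.attributes[name] = value


CONFIG = """
<ConfigFile>
    <Parameters timeout="1000" runtimeDataPath="/path/to/data" />
</ConfigFile>
"""

root = XmlNode(ET.fromstring(CONFIG))
params = root.Parameters[0]
print(params.attr_timeout)
params.attr_timeout = "2000"
print(params.attr_timeout)
```

Note that attribute values come back as strings, exactly as they appear in the XML; any conversion to int or similar is left to the caller.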
The above example is still more pseudo-code-ish than anything else, so let’s look at an implementation that uses etree. etree is very easy to use and extend, and its underlying data structure is very similar to what was used in the previous example. Other libraries like lxml could be easily adapted.
Head over to dynamic-xml to see the fully working implementation and examples. The high-level details:
- DynamicXmlParser exists to inject DynamicTreeBuilder. It’s not worth discussing.
- DynamicTreeBuilder exists to inject DynamicElement. It’s also not worth discussing.
- DynamicElement is the workhorse of the library and has the __getattr__ and __setattr__ implementations described in this post. The implementation is very similar to XmlNode, with added care to handle small details introduced by extending xml.etree.ElementTree.
- dynamicxml.py is the main entry point and the only file that needs to be imported for use.
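The following is a rough sketch of the injection idea, not the dynamic-xml code. It relies only on two standard-library hooks: the element_factory argument of xml.etree.ElementTree.TreeBuilder and the target argument of XMLParser. SketchElement and parse_string are hypothetical names invented for this example, and the real library may be structured quite differently.

```python
import xml.etree.ElementTree as ET


class SketchElement(ET.Element):
    """Hypothetical stand-in; not the DynamicElement from dynamic-xml."""

    ATTR_PREFIX = 'attr_'

    def __getattr__(self, key):
        # Reached only when Element itself has no such member.
        if key.startswith(self.ATTR_PREFIX):
            name = key[len(self.ATTR_PREFIX):]
            if name in self.attrib:
                return self.attrib[name]
            raise AttributeError(key)
        # Otherwise treat the name as a child tag.
        children = self.findall(key)
        if not children:
            raise AttributeError(key)
        return children

    def __setattr__(self, key, value):
        if key.startswith(self.ATTR_PREFIX):
            name = key[len(self.ATTR_PREFIX):]
            if name not in self.attrib:
                raise AttributeError(key)
            self.set(name, value)
            return
        # tag, text, tail and attrib are real Element members, and the
        # tree builder may assign some of them, so pass them through.
        super().__setattr__(key, value)


def parse_string(xml_text):
    # The builder creates every node through our factory.
    builder = ET.TreeBuilder(element_factory=SketchElement)
    parser = ET.XMLParser(target=builder)
    return ET.fromstring(xml_text, parser=parser)


root = parse_string("""
<ConfigFile>
    <Parameters timeout="1000" runtimeDataPath="/path/to/data" />
</ConfigFile>
""")
params = root.Parameters[0]
print(params.attr_timeout)
params.attr_timeout = "2000"
print(params.get('timeout'))
```

One of the “small details” shows up in __setattr__: because the override sees every assignment, ordinary Element members such as text and tail have to be passed through to the base class.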
One deficiency of the implementation is that it does not consider the .xsd document that backs the XML file. The assumption is that the XML file has already been validated with an offline tool and can therefore be accessed without concern about erroneous data. The only trouble arises with optional attributes and nodes, which the library does not consider.
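One way a caller could cope with optional attributes, assuming the node raises AttributeError for anything missing as the XmlNode examples in this post do, is the built-in getattr with a default. The Node class below is a hypothetical, read-only stand-in used only to keep the example self-contained; whether the library behaves the same way is not something this sketch confirms.

```python
class Node:
    """Hypothetical minimal node: attributes only, read access only."""

    def __init__(self, attributes):
        self.attributes = attributes

    def __getattr__(self, key):
        if key == 'attributes':
            # Not set yet; fail instead of recursing.
            raise AttributeError(key)
        if not key.startswith('attr_'):
            raise AttributeError(key)
        name = key[len('attr_'):]
        if name not in self.attributes:
            raise AttributeError(key)
        return self.attributes[name]


# The optional timeout attribute is omitted from this node.
params = Node({'runtimeDataPath': '/path/to/data'})

# getattr swallows the AttributeError and returns the fallback instead.
timeout = int(getattr(params, 'attr_timeout', 2000))
path = getattr(params, 'attr_runtimeDataPath',
               '/default/path/to/runtime/data')
print(timeout, path)
```

The fallback values now live at the call site rather than in a wrapper class, which is a trade-off: nothing is duplicated from the .xsd, but each caller has to know which attributes are optional.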