XML data transfers between Drupal sites
With the launch of the KSA Digital Library last month, and even more sites anticipated in the very near future, we can finally begin working on something I’ve wanted to tackle for a long time – the use of XML feeds to transfer data between Drupal sites. This post is an attempt to map out that process for myself, hopefully help others who are thinking about similar questions, and get feedback from those who have already put this in place.
Our example: Wexner Center for the Arts
The clearest example of this process is the one which I’m tackling as a proof-of-concept. Two of the sites we maintain are the Digital Library and the Herrick Building Archives. The former is a collection of images, videos, documents and other media that learners can use to study buildings, landscapes, and other phenomena. The latter records the history of every structure built on the Columbus campus of Ohio State…ever.
Now, it isn’t hard to see how these two sites overlap. To speak in terms of a specific building, compare each site’s record for the Wexner Center for the Arts.
If you look at the bottom of the Herrick record on the Wexner Center, you’ll see some images of the building that were provided by the OSU Archives. Thus far, those images have had to be added manually to each record – a laborious process to maintain, especially given that the school also maintains the Digital Library.
The goal, then, is to allow the Digital Library to serve as the image repository for other KSA websites, with sites like Herrick, our external-facing site, and others able to pull images from the library as needed without duplicating records in multiple places.
Before getting into the specifics of modules, here is how the handoff between the Digital Library and Herrick works conceptually:
- The Digital Library assembles a body of media, classifying each item with terms from several vocabularies to aid discovery via browsing and search. There are currently in the neighborhood of 35,000 such items. For any given work there is a pool of media documenting that example (in this case, the Wexner Center).
- From the pool of media about the Wexner Center, not every asset should be exported – sometimes due to rights restrictions, sometimes due to the content of a given asset. So a subset of the potential pool must be curated, and made available from the Digital Library in an XML format for ingestion elsewhere.
- The Herrick archives record for the Wexner Center must be pointed toward the Digital Library’s XML feed for that building, allowing the first harvesting of data.
- Because of the variety of sites coming to the Digital Library for information, however, it is possible that the Herrick Archives may not be interested in every image in the XML feed. The ingesting site, then, should have the ability to further curate the pool of supplied records before they are displayed alongside the core Wexner Center record.
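To make the handoff concrete, here is a minimal Python sketch of the four steps, entirely outside Drupal. Everything named in it is an assumption for illustration: the `exportable` flag, the `guid` values, the example.org URLs, and the `build_feed` / `ingest_feed` functions are hypothetical and do not reflect the Digital Library's actual schema or any module's API.

```python
import xml.etree.ElementTree as ET

# Hypothetical media pool for one work; field names are illustrative only.
MEDIA_POOL = [
    {"guid": "dl-101", "title": "Wexner Center, north facade",
     "url": "http://example.org/media/101.jpg", "exportable": True},
    {"guid": "dl-102", "title": "Wexner Center, rights-restricted photo",
     "url": "http://example.org/media/102.jpg", "exportable": False},
    {"guid": "dl-103", "title": "Wexner Center, site plan",
     "url": "http://example.org/media/103.jpg", "exportable": True},
]


def build_feed(work_title, pool):
    """Source side: serialize only the curated (exportable) assets as RSS."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = work_title
    for asset in pool:
        if not asset["exportable"]:
            continue  # first round of curation happens at the source
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "guid").text = asset["guid"]
        ET.SubElement(item, "title").text = asset["title"]
        ET.SubElement(item, "link").text = asset["url"]
    return ET.tostring(rss, encoding="unicode")


def ingest_feed(feed_xml, locally_excluded):
    """Ingesting side: parse the feed, then drop items this site opted out of."""
    root = ET.fromstring(feed_xml)
    records = []
    for item in root.findall("./channel/item"):
        guid = item.findtext("guid")
        if guid in locally_excluded:
            continue  # second round of curation happens at the consumer
        records.append({
            "guid": guid,
            "title": item.findtext("title"),
            "url": item.findtext("link"),
        })
    return records


feed_xml = build_feed("Wexner Center for the Arts", MEDIA_POOL)
herrick_images = ingest_feed(feed_xml, locally_excluded={"dl-103"})
```

The point of the sketch is that curation happens twice: once at the source, where restricted assets never reach the feed, and once at the consumer, where each site can decline items it doesn't want to display.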
Our solution: ????
My intended solution for this process is to use the Views, Feeds, and Flag modules. On the Digital Library side, we have already built a view, with a feed display, showing all the assets for a given work. I’ve begun implementing the Feeds module on a development Drupal site, and have built a harvester to take a given work’s feed into a separate content type.
Still on the development plate:
- Currently, nodes which are ingested from a feed do not have any node references. I need to tweak this so that they harvest their work node.
- When I’ve tried this process previously, there has been a rampant problem of duplication – the same nodes being imported repeatedly, potentially every time cron runs.
- Looking at the source of the feed out of the Digital Library, I’m thinking that something will need to be tweaked – either via overrides or (as a last resort) by generating the feed with something other than Views.
- Similar to the previous item, I’m probably going to have to write my own Mapper for the Feeds module, because the content types I’m importing will be a bit more specific than the generic options provided by default.
More as I work it out.
Update 1: Since publishing this post this morning, I’ve made a fair amount of progress – but one lingering issue remains. Specifically:
- Node references are still a problem. The feed node in this case is a Building node, which creates Image nodes from the specified RSS feed. I want each created Image node to reference the Building node – but this has not been possible thus far.
- The duplication problem was solved by re-reading the documentation and declaring which of the imported fields need to be unique. Pretty simple, in hindsight.
- The feed out of the Digital Library has not, thus far, needed to be tweaked. I’d still like to do so – for example to place the image into an <enclosure> tag, or citation information into a <source> tag. So far, however, this hasn’t stopped progress.
- I have been completely stymied in my attempts to write a mapper – but have worked around this by implementing the Feeds XPath Parser module. This is also what has let me avoid rewriting the feed coming out of the Digital Library. XPath parsing gives me greater access to elements within the feed, such as the source URL. Ultimately, however, XPath doesn’t help me get access to the feed node on the ingesting side (see the first note in this list).
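The two ideas that carried this update, path queries to pull fields out of each feed item and a unique field to stop duplicates, can be sketched in plain Python. This is not the Feeds or Feeds XPath Parser API. The sample feed (including the `<enclosure>` and `<source>` tags the real feed does not yet emit), the choice of `guid` as the unique field, the `import_items` function, and the `building_nid` value are all assumptions for illustration. The `building` key shows the kind of back-reference to the feed node that the real importer has not yet produced.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed; the real Digital Library feed may be shaped differently.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Wexner Center for the Arts</title>
  <item>
    <guid>dl-101</guid>
    <title>North facade</title>
    <enclosure url="http://example.org/media/101.jpg" type="image/jpeg" length="0"/>
    <source url="http://example.org/archives">OSU Archives</source>
  </item>
  <item>
    <guid>dl-103</guid>
    <title>Site plan</title>
    <enclosure url="http://example.org/media/103.jpg" type="image/jpeg" length="0"/>
    <source url="http://example.org/archives">OSU Archives</source>
  </item>
</channel></rss>"""


def import_items(feed_xml, store, building_nid):
    """Import feed items into `store`, keyed on guid so re-runs add nothing."""
    root = ET.fromstring(feed_xml)
    created = 0
    # ElementTree supports a limited XPath subset, enough for this query.
    for item in root.findall("./channel/item"):
        guid = item.findtext("guid")
        if guid in store:
            continue  # unique field already seen: skip instead of duplicating
        enclosure = item.find("enclosure")
        source = item.find("source")
        store[guid] = {
            "title": item.findtext("title"),
            "image_url": enclosure.get("url"),
            "source_url": source.get("url"),
            "credit": source.text,
            "building": building_nid,  # reference back to the feed's Building
        }
        created += 1
    return created


store = {}
first_run = import_items(SAMPLE_FEED, store, building_nid=42)
second_run = import_items(SAMPLE_FEED, store, building_nid=42)  # next cron run
```

Running the import twice against the same feed creates records only the first time, which is the behavior that declaring a unique field is meant to give.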
Again, more progress as it happens.