XML

XML documents are a series of nested tags, possibly with attributes. An example is the MSigDB XML file, which contains curated gene sets stored following a format specification.

Download a copy to your AMI, and store it in a directory ~/xml/.

The first few lines of this file look like

    [1] <?xml version="1.0" encoding="UTF-8"?>
    [2]
    [3] <MSIGDB NAME="msigdb" VERSION="4.0" BUILD_DATE="May 31, 2013">
    [4]    <GENESET STANDARD_NAME="NUCLEOPLASM" ...></GENESET>
    ...
[10299] </MSIGDB>

Line 1 gives the version of XML used in the document and the character encoding. Line 3 opens the MSIGDB node, which has several attributes (NAME, VERSION, BUILD_DATE), as described in the format specification. Nested inside the MSIGDB node is the first of many GENESET nodes; the MSIGDB node itself terminates on the final line of the file, with </MSIGDB>. The GENESET node has several attributes (of which only one is shown), an empty body, and terminates with </GENESET>.
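To make the node and attribute vocabulary concrete, here is a minimal sketch that parses a tiny hand-written document instead of the real MSigDB file. The toy content (the names SET_A and SET_B, the version 0.0) is invented for illustration; the functions xmlParse(), xmlRoot(), xmlName(), xmlAttrs() and xmlSize() come from the XML package.

```r
library(XML)

## Toy document (invented content), written without line breaks so that
## no whitespace-only text nodes are involved
toy <- paste0(
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<MSIGDB NAME="toy" VERSION="0.0">',
    '<GENESET STANDARD_NAME="SET_A" ORGANISM="Homo sapiens"></GENESET>',
    '<GENESET STANDARD_NAME="SET_B" ORGANISM="Danio rerio"></GENESET>',
    '</MSIGDB>')

doc <- xmlParse(toy, asText = TRUE)   # parse from a string, not a file
root <- xmlRoot(doc)                  # the top-level MSIGDB node

xmlName(root)                         # tag name of the root node
rootAttrs <- xmlAttrs(root)           # attributes as a named character vector
nChildren <- xmlSize(root)            # number of child nodes

first <- root[["GENESET"]]            # first child named GENESET
firstAttrs <- xmlAttrs(first)
bodySize <- xmlSize(first)            # an empty body has no children
```

The attributes come back as an ordinary named character vector, so the usual R subsetting (e.g., rootAttrs[["NAME"]]) applies.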

Interacting with XML: XPath

Load the database into R

library(XML)
xml <- xmlTreeParse("~/xml/msigdb_v4.0.xml", useInternalNodes=TRUE)

Don't bother printing xml; it will scroll across your screen for quite a while.

Elements of XML can be addressed using XPath. The idea is to specify the path from the root of the document to the node(s) or attributes that you're interested in. The path is like a Linux file path, starting with /. Attributes are specified with @ before their name. We can subset the xml object using this language, e.g.,

xml[["/MSIGDB/@NAME"]]

An alternative is to use the xmlAttrs function to extract the attributes of the node we're interested in

xmlAttrs(xml[["/MSIGDB"]])

There is only one NAME attribute of MSIGDB, but there are many GENESET child nodes. Here we create a node set of all of these

sets <- xml["/MSIGDB/GENESET"]
class(sets)
length(sets)

XPath provides a convenient syntax for querying nested paths: //GENESET says to start at the root and find all GENESET nodes, however deeply they are nested.
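The difference between an absolute path and // only shows up when nodes of the same name sit at different depths. The sketch below uses an invented toy document in which one GENESET is wrapped in a hypothetical GROUP node (this nesting is not part of the MSigDB format); getNodeSet() and xmlGetAttr() are from the XML package.

```r
library(XML)

## Toy document: SET_B is nested one level deeper, inside a made-up GROUP node
toy <- paste0(
    '<MSIGDB NAME="toy">',
    '<GENESET STANDARD_NAME="SET_A"/>',
    '<GROUP><GENESET STANDARD_NAME="SET_B"/></GROUP>',
    '</MSIGDB>')
doc <- xmlParse(toy, asText = TRUE)

## Absolute path: only GENESET nodes that are direct children of MSIGDB
direct <- getNodeSet(doc, "/MSIGDB/GENESET")

## Abbreviated path: GENESET nodes at any depth
anywhere <- getNodeSet(doc, "//GENESET")

length(direct)
length(anywhere)
setNames <- sapply(anywhere, xmlGetAttr, "STANDARD_NAME")
```

In the real MSigDB file every GENESET is a direct child of MSIGDB, so both queries select the same nodes there.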

We could manipulate sets at the R level, e.g., selecting the second element and viewing the first four attributes

head(xmlAttrs(sets[[2]]), 4)

but it's more fun to formulate this query using XPath to select all attributes of the second gene set

head(xml["//GENESET[2]/@*"], 4)

Notice that this gene set has a STANDARD_NAME attribute. We can use this to select the gene set

yy <- xml[["//GENESET[@STANDARD_NAME = 'EXTRINSIC_TO_PLASMA_MEMBRANE']"]]
xmlAttrs(yy)[1:4]

There are many gene sets in our document; we might like to visit them all and extract a particular element, e.g., the ORGANISM attribute. We can do this by iterating over the node set in R

organism <- sapply(sets, function(elt) xmlAttrs(elt)["ORGANISM"])

but again a fun way to do this is to use an sapply-like formulation on the XML document itself

organism <- xpathSApply(xml, "//GENESET/@ORGANISM")
table(organism)
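xpathSApply() also accepts a function that is applied to each matched node, which sits between the two approaches above: the XPath selects nodes, and R extracts what is needed from each. A minimal sketch on an invented toy document (the set names are made up):

```r
library(XML)

## Toy document with three gene sets (invented content)
toy <- paste0(
    '<MSIGDB NAME="toy">',
    '<GENESET STANDARD_NAME="SET_A" ORGANISM="Homo sapiens"/>',
    '<GENESET STANDARD_NAME="SET_B" ORGANISM="Danio rerio"/>',
    '<GENESET STANDARD_NAME="SET_C" ORGANISM="Homo sapiens"/>',
    '</MSIGDB>')
doc <- xmlParse(toy, asText = TRUE)

## Select the attribute directly in the XPath expression...
viaPath <- xpathSApply(doc, "//GENESET/@ORGANISM")

## ...or select the nodes and apply xmlGetAttr() to each one
viaFun <- xpathSApply(doc, "//GENESET", xmlGetAttr, "ORGANISM")

tbl <- table(viaFun)
```

The function form is handy when the per-node work is more than a single attribute lookup.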

The XPath specification includes functions that are useful for, e.g., string matching and counting. A simple example is to count the number of gene sets in our document

xml[["count(//GENESET)"]]
xml[["count(//GENESET[@ORGANISM='Homo sapiens'])"]]
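The string functions work the same way inside a predicate. The sketch below, on an invented toy document, uses the XPath 1.0 functions starts-with() and contains(); the set names are made up, apart from NUCLEOPLASM, which is borrowed from the listing at the top.

```r
library(XML)

## Toy document (invented content)
toy <- paste0(
    '<MSIGDB NAME="toy">',
    '<GENESET STANDARD_NAME="NUCLEOPLASM" ORGANISM="Homo sapiens"/>',
    '<GENESET STANDARD_NAME="SET_B" ORGANISM="Danio rerio"/>',
    '<GENESET STANDARD_NAME="SET_C" ORGANISM="Mus musculus"/>',
    '</MSIGDB>')
doc <- xmlParse(toy, asText = TRUE)

## ORGANISM of the gene sets whose STANDARD_NAME starts with 'NUCLEO'
org <- xpathSApply(doc,
    "//GENESET[starts-with(@STANDARD_NAME, 'NUCLEO')]/@ORGANISM")

## Number of gene sets whose ORGANISM contains the substring 'mus'
nMus <- getNodeSet(doc, "count(//GENESET[contains(@ORGANISM, 'mus')])")
```

XPath string comparisons are case-sensitive, so 'mus' here matches only inside "musculus", not the capitalised "Mus".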

Section 2.5 Abbreviated Syntax of the XPath specification is a very handy introduction to the flexibility of XPath queries.

Exercise: Use an XPath query to select the 5 gene sets that have ORGANISM equal to 'Danio rerio'. Use a single XPath query to determine the STANDARD_NAME of these gene sets.

(Advanced) XML event parsing

Scenario: very large XML file
Solution: iterative processing
Implementation: XML package xmlEventParse()

xmlEventParse()

Provide a 'call-back' function to process data each time a node of a particular type is encountered
Implement the call-back as a closure, e.g., by using a 'factory' function that returns a function that retains state across calls.
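The closure idea can be sketched on a deliberately different task from the exercise: counting how often each tag name occurs. The factory name tagCounter, its result() accessor and the toy document are all invented for illustration. The sketch assumes the generic startElement handler of xmlEventParse(), which is called with the tag name and its attributes each time an element opens.

```r
library(XML)

## Factory function: each call creates a fresh 'counts' variable that the
## returned functions share and update across call-backs
tagCounter <- function() {
    counts <- integer()
    startElement <- function(name, attrs, ...) {
        n <- if (name %in% names(counts)) counts[[name]] else 0L
        counts[name] <<- n + 1L       # update state in the enclosing environment
    }
    list(startElement = startElement,
         result = function() counts)  # accessor for the accumulated state
}

## Toy document (invented content)
toy <- '<SETS><SET ID="a"/><SET ID="b"/><OTHER/></SETS>'

counter <- tagCounter()
## Only the call-back is handed to the parser; the accessor stays with us
xmlEventParse(toy, handlers = list(startElement = counter$startElement),
              asText = TRUE)
counts <- counter$result()
```

Because the document is never held in memory as a tree, only the small counts vector grows as parsing proceeds; the same pattern applies when asText is dropped and a file name is supplied.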

Example: from StackOverflow

Advanced exercise: implement event parsing to retrieve the STANDARD_NAME and DESCRIPTION_BRIEF attributes from all GENESET nodes.