Thursday, January 13, 2011

Parse XML with Java - SAX and DOM parser

Parsing XML Documents

To process the data contained in XML documents, you need to write an application program (in a programming language such as Java or JavaScript). The program uses an XML parser to tokenize the XML documents and retrieve the data/objects they contain. An XML parser is the software that sits between the application and the XML documents, shielding the application developer from the intricacies of the XML syntax. The parser reads a raw XML document, ensures that it is well-formed, and may validate the document against a DTD or schema.
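As a minimal sketch of the well-formedness check described above, the following uses the JAXP SAX parser with an empty DefaultHandler, so that the parser does nothing except read the input and reject it if it is not well-formed. The class name WellFormedCheck and the method isWellFormed() are made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of any library.
class WellFormedCheck {
   // Returns true if the parser accepts the text as well-formed XML.
   static boolean isWellFormed(String xml) throws Exception {
      SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
      try {
         // DefaultHandler ignores all content events and rethrows fatal errors
         parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
         return true;
      } catch (SAXException e) {
         // A fatal parsing error means the document is not well-formed
         return false;
      }
   }

   public static void main(String[] args) throws Exception {
      System.out.println(isWellFormed("<title>Intro C</title>"));
      System.out.println(isWellFormed("<title>Intro C/title>"));
   }
}
```

This only checks well-formedness. Validation against a DTD or schema has to be switched on separately on the factory.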
There are two standard APIs for parsing XML documents:
  1. SAX (Simple API for XML)
  2. DOM (Document Object Model)
SAX is an event-driven API. The SAX API defines a number of callback methods, which are called when events occur during parsing. The SAX parser reads an XML document and generates events as it finds elements, attributes, or data in the document. There are events for document start, document end, element start-tags, element end-tags, attributes, text content, entities, processing instructions, comments and others.
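To make the callback idea concrete, here is a small sketch that records the SAX events for a short document. The class name SaxEventLogger and the method collectEvents() are invented for this example; the handler extends org.xml.sax.helpers.DefaultHandler and overrides only the callbacks it cares about. Note that attributes arrive as part of the start-tag event, and that a parser is allowed to deliver one piece of text in several characters() calls.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of any library.
class SaxEventLogger {
   // Parses the XML text and returns one log line per SAX event received.
   static List<String> collectEvents(String xml) throws Exception {
      final List<String> events = new ArrayList<>();
      DefaultHandler handler = new DefaultHandler() {
         @Override
         public void startDocument() { events.add("startDocument"); }

         @Override
         public void endDocument() { events.add("endDocument"); }

         @Override
         public void startElement(String uri, String localName, String qName, Attributes attrs) {
            // Attributes are delivered together with the start-tag event
            StringBuilder sb = new StringBuilder("startElement " + qName);
            for (int i = 0; i < attrs.getLength(); i++) {
               sb.append(" ").append(attrs.getQName(i)).append("=").append(attrs.getValue(i));
            }
            events.add(sb.toString());
         }

         @Override
         public void endElement(String uri, String localName, String qName) {
            events.add("endElement " + qName);
         }

         @Override
         public void characters(char[] ch, int start, int length) {
            // Skip whitespace-only text; a long text may arrive in several chunks
            String text = new String(ch, start, length).trim();
            if (!text.isEmpty()) {
               events.add("characters " + text);
            }
         }
      };
      SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
      parser.parse(new InputSource(new StringReader(xml)), handler);
      return events;
   }

   public static void main(String[] args) throws Exception {
      String xml = "<bookstore><book ISBN=\"012345600\"><title>Java</title></book></bookstore>";
      for (String event : collectEvents(xml)) {
         System.out.println(event);
      }
   }
}
```

Because the parser never builds a tree, the application has to keep any state it needs (here, the list of events) by itself.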

DOM is an object-oriented API. The DOM parser explicitly builds an object model, in the form of a tree structure, to represent an XML document. Your application can then manipulate the nodes in the tree. DOM is a platform- and language-independent interface for processing XML documents. The DOM API defines the mechanism for querying, traversing and manipulating the object model built.

JAXP (Java API for XML Processing) provides a common interface for creating, parsing and manipulating XML documents using the standard SAX, DOM and XSLT APIs.
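As an illustration of the "creating" side of JAXP, the sketch below builds a tiny document with the DOM API and writes it out as text with a javax.xml.transform.Transformer created without a stylesheet (an identity transformation). The class name JaxpCreateAndSerialize and the method buildBookstoreXml() are assumptions made for this example.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example class, not part of any library.
class JaxpCreateAndSerialize {
   // Builds <bookstore><book ISBN="..."><title>Java</title></book></bookstore>
   // in memory and returns it serialized as a String.
   static String buildBookstoreXml() throws Exception {
      // Create an empty DOM document and populate it
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
      Element bookstore = doc.createElement("bookstore");
      doc.appendChild(bookstore);
      Element book = doc.createElement("book");
      book.setAttribute("ISBN", "012345600");
      bookstore.appendChild(book);
      Element title = doc.createElement("title");
      title.setTextContent("Java");
      book.appendChild(title);

      // A Transformer without a stylesheet copies the source to the result unchanged
      Transformer transformer = TransformerFactory.newInstance().newTransformer();
      transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
      StringWriter out = new StringWriter();
      transformer.transform(new DOMSource(doc), new StreamResult(out));
      return out.toString();
   }

   public static void main(String[] args) throws Exception {
      System.out.println(buildBookstoreXml());
   }
}
```

Passing an XSLT stylesheet to newTransformer() instead would apply a real transformation to the same DOMSource.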

DOM (Document Object Model)
DOM is a platform- and language-independent API for processing XML documents. The DOM parser loads the XML document and builds an object model in memory, in the form of a tree composed of nodes. The DOM API defines the mechanism for querying and traversing the tree, and for adding, modifying and deleting elements and other nodes.
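The adding, modifying and deleting operations mentioned above can be sketched as follows. The class name DomEditSketch and its methods are invented for this illustration, and the XML is a shortened inline string rather than a file, so that the example is self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical example class, not part of any library.
class DomEditSketch {
   // Parses XML text into a DOM tree held in memory.
   static Document parse(String xml) throws Exception {
      DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
      return builder.parse(new InputSource(new StringReader(xml)));
   }

   // Applies one addition, one modification and one deletion to the first <book>.
   static void edit(Document doc) {
      Element book = (Element) doc.getElementsByTagName("book").item(0);

      // Add: create a new <edition> element and append it to <book>
      Element edition = doc.createElement("edition");
      edition.setTextContent("8");
      book.appendChild(edition);

      // Modify: change the text of <price> and the value of the ISBN attribute
      Element price = (Element) book.getElementsByTagName("price").item(0);
      price.setTextContent("21.99");
      book.setAttribute("ISBN", "012345601");

      // Delete: remove the <year> element from its parent
      Element year = (Element) book.getElementsByTagName("year").item(0);
      year.getParentNode().removeChild(year);
   }

   public static void main(String[] args) throws Exception {
      Document doc = parse("<bookstore><book ISBN=\"012345600\">"
            + "<title>Java</title><year>2009</year><price>19.99</price></book></bookstore>");
      edit(doc);
      Element book = (Element) doc.getElementsByTagName("book").item(0);
      System.out.println("ISBN:    " + book.getAttribute("ISBN"));
      System.out.println("Price:   " + book.getElementsByTagName("price").item(0).getTextContent());
      System.out.println("Edition: " + book.getElementsByTagName("edition").item(0).getTextContent());
      System.out.println("<year> elements left: " + doc.getElementsByTagName("year").getLength());
   }
}
```

All changes affect only the in-memory tree; the source document is not touched unless the tree is written back out.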

Example XML file (bookStore.xml) to parse with the Java DOM parser:
<bookstore>
  <book ISBN="012345600">
    <title>Java</title>
    <author>CBG</author>
    <category>Programming</category>
    <year>2009</year>
    <edition>7</edition>
    <price>19.99</price>
  </book>
  <book ISBN="012345602">
    <title>Intro C</title>
    <author>BalaGuruswamy</author>
    <category>Programming</category>
    <year>2008</year>
    <price>25.99</price>
  </book>
  <book ISBN="0123456010">
    <title>The Complete Guide to Movies </title>
    <author>me</author>
    
    <author>Mary</author>
    <category>Movies</category>
    <category>Leisure</category>
    <language>Telugu</language>
    <year>2000</year>
    <edition>2</edition>
    <price>49.99</price>
  </book>
</bookstore>


import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
/**
 * Use DOM Parser to display all books: isbn, title and authors.
 */
public class DOMParserBookStore1 {
   public static void main(String[] args) throws Exception {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      DocumentBuilder docBuilder = factory.newDocumentBuilder();
      File file = new File("bookStore.xml");
      Document doc = docBuilder.parse(file);
   
      // Get a list of all <book> elements in the document
      NodeList bookNodes = doc.getElementsByTagName("book");
      for (int i = 0; i < bookNodes.getLength(); i++) {
         Element bookElement = (Element)bookNodes.item(i); // <book> element
         System.out.println("BOOK " + (i+1));
         String isbn = bookElement.getAttribute("ISBN"); // <book> attribute
         System.out.println("\tISBN:\t" + isbn);
   
         // Get the child elements <title> of <book>, only one
         NodeList titleNodes = bookElement.getElementsByTagName("title");
         Element titleElement = (Element)titleNodes.item(0);
         System.out.println("\tTitle:\t" + titleElement.getTextContent());
   
         // Get the child elements <author> of <book>, one or more
         NodeList authorNodes = bookElement.getElementsByTagName("author");
         for (int author = 0; author < authorNodes.getLength(); author++) {
            Element authorElement = (Element)authorNodes.item(author);
            System.out.println("\tAuthor:\t" + authorElement.getTextContent());
         }
      }
   }
}
We first get a new instance of DocumentBuilderFactory, and then obtain an instance of DocumentBuilder from the factory (both in package javax.xml.parsers). After that, we can use the parse() method to parse an XML document (given as a File, InputStream, InputSource, or a URI String) and build a DOM tree to represent it. The parse() method returns a Document object (of package org.w3c.dom). Check the API for package org.w3c.dom for the various interfaces used in DOM, such as Element, Node, NodeList, Text, etc.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = factory.newDocumentBuilder();
Document doc = docBuilder.parse(new File("bookStore.xml"));
The DOM tree for the bookstore example is illustrated below:
[Figure: DOM tree for the bookstore example]

You can use the following method to get the root element of the document:
Element root = doc.getDocumentElement();   // return the root element
You can search for elements by tag name as follows:
NodeList bookNodes = doc.getElementsByTagName("book"); // return all the book elements as NodeList
NodeList allNodes = doc.getElementsByTagName("*");     // return all the elements as NodeList,
                                                       //   wild card * matches all elements
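The following sketch puts the root-element and tag-name lookups together on a small inline document. The class name DomSearchSketch and its helper methods are assumptions for this example. It also shows one detail worth knowing: calling getElementsByTagName() on the Document searches every element including the root, while calling it on an Element searches only that element's descendants.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical example class, not part of any library.
class DomSearchSketch {
   static final String XML = "<bookstore>"
         + "<book><title>Java</title></book>"
         + "<book><title>Intro C</title></book>"
         + "</bookstore>";

   static Document parse(String xml) throws Exception {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
   }

   // Number of elements with the given tag name anywhere in the document.
   static int countInDocument(Document doc, String tagName) {
      return doc.getElementsByTagName(tagName).getLength();
   }

   // Number of matching elements below (not including) the given element.
   static int countBelow(Element element, String tagName) {
      return element.getElementsByTagName(tagName).getLength();
   }

   public static void main(String[] args) throws Exception {
      Document doc = parse(XML);
      Element root = doc.getDocumentElement();
      System.out.println("Root element: " + root.getTagName());
      System.out.println("<book> elements: " + countInDocument(doc, "book"));
      // The document-level search counts the root element itself
      System.out.println("All elements in document: " + countInDocument(doc, "*"));
      // The element-level search counts descendants only
      System.out.println("All elements below root: " + countBelow(root, "*"));
   }
}
```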
The org.w3c.dom.Node interface defines constants for the various types of nodes, such as Node.ELEMENT_NODE, Node.ATTRIBUTE_NODE, Node.COMMENT_NODE, Node.ENTITY_NODE, Node.ENTITY_REFERENCE_NODE, Node.PROCESSING_INSTRUCTION_NODE, Node.TEXT_NODE, etc.
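A common use of these constants is to switch on getNodeType() while walking the tree. The sketch below (the class name NodeTypeSketch and its methods are made up for this illustration) describes the root element, its attributes and its direct children. Attributes are not child nodes, so they are read through getAttributes() rather than getChildNodes().

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class, not part of any library.
class NodeTypeSketch {
   // Returns a short description of a node based on its node type constant.
   static String describe(Node node) {
      switch (node.getNodeType()) {
         case Node.ELEMENT_NODE:
            return "ELEMENT " + node.getNodeName();
         case Node.ATTRIBUTE_NODE:
            return "ATTRIBUTE " + node.getNodeName() + "=" + node.getNodeValue();
         case Node.TEXT_NODE:
            return "TEXT " + node.getNodeValue();
         case Node.COMMENT_NODE:
            return "COMMENT " + node.getNodeValue();
         default:
            return "OTHER (type " + node.getNodeType() + ")";
      }
   }

   // Describes the root element, then its attributes, then its direct children.
   static List<String> describeRoot(String xml) throws Exception {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
      Element root = doc.getDocumentElement();
      List<String> lines = new ArrayList<>();
      lines.add(describe(root));
      NamedNodeMap attrs = root.getAttributes();
      for (int i = 0; i < attrs.getLength(); i++) {
         lines.add(describe(attrs.item(i)));
      }
      NodeList children = root.getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
         lines.add(describe(children.item(i)));
      }
      return lines;
   }

   public static void main(String[] args) throws Exception {
      String xml = "<book ISBN=\"012345600\"><!--bestseller--><title>Java</title></book>";
      for (String line : describeRoot(xml)) {
         System.out.println(line);
      }
   }
}
```

With an indented file such as bookStore.xml, the child list would also contain whitespace-only TEXT nodes between the elements, which is why code that walks getChildNodes() usually checks the node type before casting to Element.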
