SAX Pre-Processing in xCommerce
by exteNd Composer Product Team, eBusiness Integration Products, Novell
Date Created: 2001-04-30 09:07:00.000
  Introduction
  What is SAX?
  Document-Processing Use Cases
  xCommerce Use Cases
  The File I/O Case
  Does It Really Work?
  Conclusion
  For More Information
Introduction

Performance is a key concern in all types of applications, but in J2EE-based enterprise applications it can be especially pressing. With XML integration apps, one factor that always figures prominently in performance is the size of the document being manipulated. Simply put, a large in-memory DOM can impose huge demands not only on system memory but also on CPU time.

In SilverStream xCommerce, as with many DOM processors, processing time grows much faster than linearly with increasing DOM size. If you pass a large XML document in to xCommerce, it will parse that file into a DOM roughly two and a half times larger, which has to fit in memory. Parse time also quickly becomes a problem: On a system where a 50K XML file takes a couple hundred milliseconds to parse, a 500K file takes ten seconds or more.

Fortunately, there's a way around the "large DOM" problem. It's called SAX processing, and it's a technique you should definitely consider if you routinely deal with large (100K or more) XML files.

What is SAX?

SAX stands for Simple API for XML. It's an open-source library of event-based parsing classes for XML handling, developed collaboratively by the members of the XML-DEV mailing list (hosted by http://www.oasis-open.org). SAX functionality is included as part of the Apache Xerces package (see http://xml.apache.org/xerces-j/), and because Xerces is incorporated into xCommerce, you already have the power of SAX at your fingertips. (The SAX parser context is com.sssw.b2b.xerces.parsers.SAXParser.)

Space obviously precludes a detailed discussion of SAX internals here, but the general gist is this: Rather than parsing an entire XML document into one large, in-memory DOM all at once, SAX allows an application to process an XML document as it is being read. In SAX, a ContentHandler can be associated with a parser. At key times during parsing, the parser calls the ContentHandler's callback routines, which have names like startDocument(), startElement(), endElement(), and so on. Basically, key "parse points" are treated as events, and the ContentHandler handles those events.

ContentHandler is actually just one of four handler interfaces defined by SAX (the others being DTDHandler, EntityResolver, and ErrorHandler). If your goal were to use SAX to "filter" a large incoming XML doc so as to retrieve just the portion(s) of the DOM you're interested in, you would write a Java class that implements the ContentHandler interface. After instantiating a SAXParser (either in your own class or some other class), you would attach your custom handler class to the parser via the parser's setContentHandler() method. Then you would call parse() on the URI of the XML file, and let SAX do its magic.
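The steps just described can be sketched in a few lines of Java. This is a minimal illustration, not xCommerce code: it assumes the standard JAXP factory (javax.xml.parsers.SAXParserFactory) rather than the bundled com.sssw.b2b.xerces.parsers.SAXParser, extends the SAX helper class DefaultHandler (which implements ContentHandler) instead of implementing the interface from scratch, and parses an in-memory string so that it is self-contained. The class name ElementCounter is made up for the example.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: counts element names as the parser reports them. No DOM is built. */
class ElementCounter extends DefaultHandler {
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    /** Callback invoked by the parser each time it reads a start tag. */
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        counts.merge(qName, 1, Integer::sum);
    }

    /** Attaches a handler to a SAX parser, runs the parse, and returns the counts. */
    static Map<String, Integer> count(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        ElementCounter handler = new ElementCounter();
        reader.setContentHandler(handler);
        // For a real document, reader.parse(String systemId) accepts a URI instead.
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.counts;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Invoice><LineItem/><LineItem/><Total/></Invoice>";
        System.out.println(count(xml)); // prints {Invoice=1, LineItem=2, Total=1}
    }
}
```

The handler only ever holds the running counts, so its memory use does not grow with the size of the document being read.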

Document-Processing Use Cases

In terms of document handling, you can think of SAX processing as being used in two possible ways:

  1. If only selected portions of a large document are relevant to the application, SAX pre-processing can be used to pull out just the needed information, ignoring the rest. For example, the document might contain a particular set of line items that need to be processed by the application. Through SAX, you could gather together the line items and package them in one (much smaller) DOM, which could in turn be handled by an xCommerce Web Service.
  2. The entire incoming XML document contains relevant information that must be processed, but the document is too large to fit in memory (or would bog down performance). SAX processing might be useful for processing the file's entire contents one small portion at a time, assuming the application logic can deal with any dependencies spanning various nodes in the doc.
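The first of these two approaches (keeping only the line items) might look roughly like the following sketch. It is an illustration under stated assumptions, not xCommerce code: the class name LineItemFilter, the element name LineItem, and the wrapper element Items are all invented for the example, the parser comes from the standard JAXP factory, and the serialization is deliberately simple (no namespaces, comments, or processing instructions).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical filter: copies only the target elements (and their contents) to a small document. */
class LineItemFilter extends DefaultHandler {
    private final String target;
    private final StringBuilder out = new StringBuilder();
    private int depth = 0; // greater than zero while inside a target element

    LineItemFilter(String target) { this.target = target; }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth == 0 && !qName.equals(target)) return; // outside any target: ignore
        depth++;
        out.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            out.append(' ').append(atts.getQName(i))
               .append("=\"").append(escape(atts.getValue(i))).append('"');
        }
        out.append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) out.append(escape(new String(ch, start, length)));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth == 0) return;
        out.append("</").append(qName).append('>');
        depth--;
    }

    /** Re-escapes characters that the parser has already decoded. */
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Returns a small XML document holding only the target elements of the input. */
    static String extract(String xml, String target) throws Exception {
        LineItemFilter handler = new LineItemFilter(target);
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return "<Items>" + handler.out + "</Items>";
    }
}
```

Calling LineItemFilter.extract(xml, "LineItem") on an order document returns a string containing just the LineItem elements, which is the only part that would then need to be parsed into a DOM.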

As you are writing your XML integration application, you will want to think about which of these categories your input document(s) might fall into, and what the implications might be for data flow. Some of the possible implications vis-à-vis xCommerce are discussed next.

xCommerce Use Cases

In an xCommerce application, there are three use cases where the SAX pre-processing strategy can be useful: 

  1. A large inbound XML document is received by an xCommerce Web Service through one of the standard service trigger mechanisms.
  2. A large XML document is received through an XML Interchange action.
  3. A large file needs to be read from disk and processed by a service or component.

The first case requires a bit of work to handle, because xCommerce has (at present) no direct mechanism for invoking SAX handling at the entry point to a Web Service. To apply SAX logic here requires writing a custom service trigger using the framework provided for this purpose in the xCommerce Server installation. (The use of this framework is explained in the xCommerce Server documentation.) In your custom trigger, the processRequest() method of your InputConversion class would have to do SAX processing before the service is actually invoked. Otherwise, the service will be dealing with an in-memory DOM (which we want to avoid).

Fortunately, the first case is not apt to be a common one, for the simple reason that HTTP isn't generally used for moving large documents around. But "large" is a relative term. On some systems, an XML document that's as small as 50 Kbytes in size will introduce significant latency to a DOM-oriented app. And 50K files are transported via HTTP all the time.

The second use case outlined above (involving XML Interchange) poses a challenge in that there is no way, at present, to intercept a large incoming XML doc obtained via an XML Interchange action and keep it from being parsed into a DOM by xCommerce. The normal situation looks something like this:

 

 

An XML Interchange action (in this example) pulls an XML file from a URI, but as it is coming in, the XML Interchange action automatically converts it to a DOM, which swells the effective document size.

One workaround for this situation is to call a custom Java class from an ECMAScript function (via a Function action) and let the Java class pull the XML document in from a URL and SAX-process it. (See below.) If your service or component needs to take action on individual parts of the XML document inside a loop, you can design your Java object in such a way that it acts like a node iterator, returning nodesets of interest on each call until there are no more nodesets.

 

 

Another workaround makes sense when the incoming XML document contains logical groupings of data that are best handled by independent services (i.e., one service per grouping). In that case, a custom Java object can pull the XML doc from the appropriate URL, break it (using SAX) into multiple small XML documents, and fire multiple xCommerce Web Services (labeled Service B, C, and D in the following diagram) using standard triggers:

 

One benefit of this strategy is that it exploits multithreading, since each Web Service runs in its own thread.

The File I/O Case

The use case involving a large XML file on disk (number 3 in our list further above) can be handled in a straightforward way: Write a SAX handler (custom Java class) that can do the necessary document parsing, and call the class from ECMAScript in a Function action. Depending on your needs, you could have the class return an array of Strings (representing the useful parts of the XML document), return a single consolidated String, or act as an iterator, returning one XML doclet at a time until nothing is left to process.

Fortunately, a tool exists that can make your life easier with respect to the last option.

The SplitSAXWriter Class

An upcoming release of xCommerce will offer a special Java class, currently called SplitSAXWriter (based on the SAXWriter class available in the standard Xerces/SAX package), that splits an XML document into pieces according to an XPath-like description of the node pattern you wish to process. The class lets you obtain nodesets in groups of arbitrary size. For example, you might want to obtain 20 nodes at a time of type "Root/InvoiceBatch/Invoice". With SplitSAXWriter, you can do:

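// Split huge.xml into batches of 20 nodes matching the given pattern
SplitSAXWriter aWriter =
        new SplitSAXWriter("huge.xml", 20, "Root/InvoiceBatch/Invoice");

// Fetch the current batch as a String
String chunkOf20 = aWriter.getCurrentBatchContent();

// ... do something with chunk ...

// Advance to the next batch
aWriter.processNextBatch();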

 

The following illustration shows what an actual Action Model that uses the SplitSAXWriter class looks like:

 

 

Before the loop is called, a Function action is used to set up a loop counter variable, and another Function action instantiates a SplitSAXWriter object. The ECMAScript expression for the latter looks like:

var mySaxParser = new Packages.SplitSAXWriter("file:///D:\\xCommerce\\Samples\\ING_orig\\tango\\NRE1150.xml", 100, "XML-S/Statement")

The SplitSAXWriter returns nodes meeting the "XML-S/Statement" pattern, in blocks of 100. A Repeat While loop calls getCurrentBatchContent() at the top (and checks that the result is non-null), and calls processNextBatch() at the bottom of the loop. In the body of the loop, a Map action is used to bring a node-batch into a Temp DOM, and two external components (a Map component and a JDBC component) are called to operate on the DOM.

Note that if you choose to write your own SAX pre-processor, you should take care to avoid having it return large chunks of the document in an Array (or List or Vector). It's better to return one chunk at a time and have the code block while you process the chunk, before continuing to parse. After all, if your pre-processor simply divides the input file into 10 or 20 (or however many) segments and hands you back an array of segments, you haven't accomplished any memory savings. All the nodes are in memory at once. That's exactly what you want to avoid.
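One way to get this "one chunk at a time, parser waits in between" behavior is to run the SAX parse on its own thread and hand each chunk to the caller through a zero-capacity queue. The sketch below shows that idea only; it is not how SplitSAXWriter is implemented, and the names BlockingChunkSource, open(), and next() are invented for the example. It assumes the standard JAXP parser factory, hands back one matching element per call rather than a batch, and takes the XML as a string purely to stay self-contained (a real pre-processor would parse from a file or URL).

```java
import java.io.StringReader;
import java.util.concurrent.SynchronousQueue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical pre-processor: returns one target element per next() call. */
class BlockingChunkSource extends DefaultHandler {
    private static final String END = new String("END"); // sentinel, compared by identity
    private final SynchronousQueue<String> handoff = new SynchronousQueue<>();
    private final StringBuilder chunk = new StringBuilder();
    private final String target;
    private int depth = 0;               // greater than zero while inside a target element
    private volatile Exception failure;  // set if the background parse fails
    private boolean done = false;

    private BlockingChunkSource(String target) { this.target = target; }

    /** Starts parsing on a background thread; the parser pauses at each completed chunk. */
    static BlockingChunkSource open(String xml, String target) {
        BlockingChunkSource source = new BlockingChunkSource(target);
        Thread parserThread = new Thread(() -> {
            try {
                XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
                reader.setContentHandler(source);
                reader.parse(new InputSource(new StringReader(xml)));
            } catch (Exception e) {
                source.failure = e;
            } finally {
                try { source.handoff.put(END); }
                catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            }
        });
        parserThread.setDaemon(true); // an abandoned parse must not keep the JVM alive
        parserThread.start();
        return source;
    }

    /** Returns the next chunk, or null once the document is exhausted. */
    String next() throws InterruptedException {
        if (done) return null;
        String s = handoff.take();
        if (s != END) return s;
        done = true;
        if (failure != null) throw new IllegalStateException("parse failed", failure);
        return null;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth == 0 && !qName.equals(target)) return;
        depth++;
        chunk.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            chunk.append(' ').append(atts.getQName(i))
                 .append("=\"").append(escape(atts.getValue(i))).append('"');
        }
        chunk.append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) chunk.append(escape(new String(ch, start, length)));
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (depth == 0) return;
        chunk.append("</").append(qName).append('>');
        if (--depth > 0) return;
        try {
            handoff.put(chunk.toString()); // blocks here until the consumer calls next()
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new SAXException("interrupted while handing off a chunk", e);
        }
        chunk.setLength(0);
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }
}
```

A caller would loop with something like: for (String c = source.next(); c != null; c = source.next()) { ... }. Because the queue has no capacity, at most one finished chunk exists at any moment, which is the memory behavior the paragraph above recommends.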

Does It Really Work?

Preliminary performance testing of the SplitSAXWriter class confirms that processing time with SAX increases only linearly with increasing document size, rather than accelerating as it does with DOM processing. The graph below summarizes test results obtained by running a test document through ordinary DOM processing versus SAX processing. The test document contained nodes that were replicated as needed to give a desired file size. Documents thus created varied in size from 20 nodesets to 3,300 (with 2,500 being equivalent to a half-megabyte file).

In the graph, document size is plotted on the x-axis, going from zero to 3,290 nodesets. Parsing time is plotted on the y-axis, going from zero to 16.3 seconds.

 

 

The red curve clearly shows that SAX parsing, while incurring a penalty for small XML documents, rapidly becomes superior to all-at-once DOM parsing for files greater than about 75K in size. The DOM curve climbs ever more steeply rather than following a straight line. (Note: Curves were fitted using a second-degree polynomial.) This accelerating growth means that with very large DOMs, you run into a performance bottleneck very quickly.

Conclusion
Incremental parsing and processing of large XML files by means of SAX is a powerful tool for achieving better runtime performance (and economical use of RAM) in XML integration applications. Since SAX classes are included in the Xerces package (which in turn is used by xCommerce), SAX processing is already available in the xCommerce environment if you want to use it. Currently, to take advantage of that, you have to write your own ContentHandler (or subclass the SAXWriter class of Xerces), but soon you won't even have to do that: an upcoming release of xCommerce will include a custom SAX handler class, which you can call directly from ECMAScript in a Function action.
For More Information

You can find out everything you want to know about SAX by visiting http://www.megginson.com/SAX/. For the Xerces package, see http://xml.apache.org/xerces-j/. An excellent introduction to the use of SAX classes can be found in Brett McLaughlin's Java and XML (O'Reilly & Associates), ISBN 0-596-00016-2, available wherever Java books are sold.

If you would like to use the SplitSAXWriter class discussed in this article, contact the eBusiness Integration Products Division at ebizintegration@silverstream.com for details.