introduction |
 |
 |
 |
|
Performance is a key concern in all types of applications, but in J2EE-based
enterprise applications it can be especially pressing. With XML integration
apps, one factor that always figures prominently in performance is the size of
the document being manipulated. Simply put, a large in-memory DOM can impose
huge demands not only on system memory but also on CPU time.
In SilverStream xCommerce, as with many DOM processors, performance degrades
faster than linearly with increasing DOM size. If you pass a large XML document
into xCommerce, it will parse that file into a DOM roughly two and a half times
larger, which has to fit in memory. Parse time also quickly becomes a problem:
on a system where a 50K XML file takes a couple hundred milliseconds to parse,
a 500K file takes ten seconds or more.
Fortunately, there's a way around the "large DOM" problem. It's called SAX
processing, and it's a technique you should definitely consider if you
routinely deal with large (100K or more) XML files.
|
 |
 |
what is SAX? |
 |
 |
 |
|
SAX stands for Simple API for XML. It's a freely available, event-based API for
XML parsing, developed collaboratively by the members of the XML-DEV mailing
list (hosted by http://www.oasis-open.org). SAX functionality is included as
part of the Apache Xerces package (see http://xml.apache.org/xerces-j/), and
because Xerces is incorporated into xCommerce, you already have the power of
SAX at your fingertips. (The SAX parser class is
com.sssw.b2b.xerces.parsers.SAXParser.)
Space precludes a detailed discussion of SAX internals here, but the gist is
this: Rather than parsing an entire XML document into one large, in-memory DOM
all at once, SAX allows an application to process an XML document as it is
being read. In SAX, a ContentHandler can be associated with a parser. At key
times during parsing, the parser calls the ContentHandler's callback routines,
which have names like startDocument(), startElement(), endElement(), and so on.
Basically, key "parse points" are treated as events, and the ContentHandler
handles those events.
ContentHandler is actually just one of four handler interfaces defined by SAX
(the others being DTDHandler, EntityResolver, and ErrorHandler). If your goal
were to use SAX to "filter" a large incoming XML doc so as to retrieve just the
portion(s) of the document you're interested in, you would write a Java class
that implements the ContentHandler interface. After instantiating a SAXParser
(either in your own class or some other class), you would attach your custom
handler class to the parser via the parser's setContentHandler() method. Then
you would call parse() on the URI of the XML file, and let SAX do its magic.
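The steps just described can be sketched as follows. This is a minimal illustration, not xCommerce code: it assumes the standard JAXP and SAX2 classes shipped with the JDK rather than the Xerces build bundled with xCommerce, and the class name ElementTextCollector and its collect() helper are hypothetical. The handler extends DefaultHandler (a convenience class that implements ContentHandler with empty methods) and keeps only the text of elements with a given name.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical filter: keeps the text of every element named wantedName. */
class ElementTextCollector extends DefaultHandler {
    private final String wantedName;
    private final List<String> found = new ArrayList<>();
    private StringBuilder current;   // non-null only while inside a wanted element

    ElementTextCollector(String wantedName) {
        this.wantedName = wantedName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals(wantedName)) {
            current = new StringBuilder();   // start capturing text
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one text node in several calls, so append.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && qName.equals(wantedName)) {
            found.add(current.toString());
            current = null;                  // stop capturing
        }
    }

    List<String> getFound() {
        return found;
    }

    /** Parses the XML text and returns the text of each wantedName element. */
    static List<String> collect(String xml, String wantedName) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            ElementTextCollector handler = new ElementTextCollector(wantedName);
            reader.setContentHandler(handler);                     // attach the handler
            reader.parse(new InputSource(new StringReader(xml)));  // events fire during this call
            return handler.getFound();
        } catch (Exception e) {
            throw new IllegalStateException("SAX parse failed", e);
        }
    }
}
```

To read from a file or URL instead of a String, the same XMLReader also accepts a system identifier, as in reader.parse(uri). Only the captured strings are retained; the rest of the document is discarded as it streams past.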
|
 |
 |
xCommerce Use Cases |
 |
 |
 |
|
In an xCommerce application, there are three use cases where the SAX
pre-processing strategy can be useful:
- A large inbound XML document is received by an xCommerce Web Service through
one of the standard service trigger mechanisms.
- A large XML document is received through an XML Interchange action.
- A large file needs to be read from disk and processed by a service or
component.
The first case requires a bit of work to handle, because xCommerce has (at
present) no direct mechanism for invoking SAX handling at the entry point to a
Web Service. Applying SAX logic here requires writing a custom service trigger
using the framework provided for this purpose in the xCommerce Server
installation. (The use of this framework is explained in the xCommerce Server
documentation.) In your custom trigger, the processRequest() method of your
InputConversion class would have to do SAX processing before the service is
actually invoked. Otherwise, the service will be dealing with an in-memory DOM
(which we want to avoid).
Fortunately, the first case is not apt to be a common one, for the simple
reason that HTTP isn't generally used for moving large documents around. But
"large" is a relative term. On some systems, an XML document as small as 50
Kbytes will introduce significant latency to a DOM-oriented app. And 50K files
are transported via HTTP all the time.
The second use case outlined above (involving XML Interchange) poses a
challenge in that there is no way, at present, to intercept a large incoming
XML doc obtained via an XML Interchange action and keep it from being parsed
into a DOM by xCommerce. The normal situation looks something like this:
An XML Interchange action (in this example) pulls an XML file from a URI, but
as it is coming in, the XML Interchange action automatically converts it to a
DOM, which swells the effective document size.
One workaround to this situation is to call a custom Java class from an
ECMAScript function (via a Function action) and let the Java class pull the XML
document in from a URL and SAX-process it. (See below.) If your service or
component needs to take action on individual parts of the XML document inside a
loop, you can design your Java object in such a way that it acts like a node
iterator, returning nodesets of interest on each call until there are no more
nodesets.
Another workaround makes sense when the incoming XML document contains logical
groupings of data that are best handled by independent services (i.e., one
service per grouping): have a custom Java object pull the XML doc from the
appropriate URL, break it (using SAX) into multiple small XML documents, and
fire multiple xCommerce Web Services (labeled Service B, C, and D in the
following diagram) using standard triggers:
One benefit of this strategy is that it exploits multithreading, since each Web
Service runs in its own thread.
|
 |
 |
The File I/O Case |
 |
 |
 |
|
The use case involving a large XML file on disk (the third in the list above)
can be handled in a straightforward way: Write a SAX handler (custom Java
class) that can do the necessary document parsing, and call the class from
ECMAScript in a Function action. Depending on your needs, you could have the
class return an array of Strings (representing the useful parts of the XML
document), return a single consolidated String, or act as an iterator,
returning one XML doclet at a time until nothing is left to process.
Fortunately, a tool exists that can make your life easier with respect to the
last option.
The SplitSAXWriter Class
An upcoming release of xCommerce will offer a special Java class, currently
called SplitSAXWriter (based on the SAXWriter class available in the standard
Xerces/SAX package), that splits an XML document into pieces according to an
XPath-like description of the node pattern you wish to process. The class lets
you obtain nodesets in groups of arbitrary size. For example, you might want to
obtain 20 nodes at a time of type "Root/InvoiceBatch/Invoice". With
SplitSAXWriter, you can do:
    SplitSAXWriter aWriter =
        new SplitSAXWriter("huge.xml", 20, "Root/InvoiceBatch/Invoice");
    String chunkOf20 = aWriter.getCurrentBatchContent();
    //
    // do something with chunk
    //
    aWriter.processNextBatch();
The following illustration shows what an actual Action Model that uses the
SplitSAXWriter class looks like:
Before the loop is called, a Function action is used to set up a loop counter
variable, and another Function action instantiates a SplitSAXWriter object. The
ECMAScript expression for the latter looks like:
    var mySaxParser = new Packages.SplitSAXWriter(
        "file:///D:\\xCommerce\\Samples\\ING_orig\\tango\\NRE1150.xml",
        100, "XML-S/Statement")
The SplitSAXWriter returns nodes meeting the "XML-S/Statement" pattern, in
blocks of 100. A Repeat While loop calls getCurrentBatchContent() at the top
(and checks for a non-null result), and calls processNextBatch() at the bottom
of the loop. In the guts of the loop, a Map action is used to bring a node-batch
into a Temp DOM, and two external components (a Map component and a JDBC
component) are called to operate on the DOM.
Note that if you choose to write your own SAX pre-processor, you should take
care to avoid having it return large chunks of the document in an Array (or
List or Vector). It's better to return one chunk at a time and have the code
block while you process the chunk, before continuing to parse. After all, if
your pre-processor simply divides the input file into 10 or 20 (or however
many) segments and hands you back an array of segments, you haven't
accomplished any memory savings. All the nodes are in memory at once. That's
exactly what you want to avoid.
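One way to get this one-chunk-at-a-time behavior from a hand-written pre-processor is sketched below. It is not the SplitSAXWriter implementation (which is not shown in this article); the class name BatchingHandler, its split() helper, and the use of a java.util.function.Consumer as the chunk receiver are all assumptions made for illustration, using the standard JAXP and SAX2 classes. The handler tracks the current element path, re-serializes each subtree that matches the target path, and hands every full batch to the consumer. Because SAX callbacks run on the parsing thread, the parser cannot advance until the consumer returns, so at most one batch is held in memory.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical splitter: emits matching subtrees in batches of batchSize. */
class BatchingHandler extends DefaultHandler {
    private final String targetPath;        // e.g. "Root/InvoiceBatch/Invoice"
    private final int batchSize;
    private final Consumer<String> sink;    // receives one batch at a time
    private final Deque<String> path = new ArrayDeque<>();
    private final StringBuilder batch = new StringBuilder();
    private int inBatch = 0;                // matched elements in current batch
    private int depth = 0;                  // > 0 while inside a matching subtree

    BatchingHandler(String targetPath, int batchSize, Consumer<String> sink) {
        this.targetPath = targetPath;
        this.batchSize = batchSize;
        this.sink = sink;
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        path.addLast(qName);
        if (depth > 0 || String.join("/", path).equals(targetPath)) {
            depth++;
            batch.append('<').append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                batch.append(' ').append(atts.getQName(i)).append("=\"")
                     .append(escape(atts.getValue(i))).append('"');
            }
            batch.append('>');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            batch.append(escape(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 0) {
            batch.append("</").append(qName).append('>');
            depth--;
            if (depth == 0 && ++inBatch == batchSize) {
                flush();                    // parser waits here until the sink returns
            }
        }
        path.removeLast();
    }

    @Override
    public void endDocument() {
        if (inBatch > 0) {
            flush();                        // final, possibly short, batch
        }
    }

    private void flush() {
        sink.accept(batch.toString());
        batch.setLength(0);
        inBatch = 0;
    }

    /** Parses the source and feeds batches of matching elements to the sink. */
    static void split(InputSource source, String targetPath, int batchSize, Consumer<String> sink) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new BatchingHandler(targetPath, batchSize, sink));
            reader.parse(source);
        } catch (Exception e) {
            throw new IllegalStateException("SAX parse failed", e);
        }
    }
}
```

Each batch is a run of sibling elements without a common root, so a caller that needs a well-formed document would wrap it in an enclosing element first. Note that this sketch is push-style (the parser calls you); a pull-style interface like getCurrentBatchContent() and processNextBatch() would need the parse to run on a separate thread, or a different mechanism, because a SAX parse cannot be paused from outside.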
|
 |
 |
does it really work? |
 |
 |
 |
|
Preliminary performance testing of the SplitSAXWriter class confirms that the
processing time with SAX increases linearly (rather than faster than linearly)
with increasing document size. The graph below summarizes test results obtained
by running a test document through ordinary DOM processing versus SAX
processing. The test document contained nodes that were replicated as needed to
give a desired file size. Documents thus created varied in size from 20
nodesets to 3,300 (with 2,500 being equivalent to a half-megabyte file).
In the graph, document size is plotted on the x-axis, going from zero to 3,290
nodesets. Parsing time is plotted on the y-axis, going from zero to 16.3
seconds.
The red curve clearly shows that SAX parsing, while incurring a penalty for
small XML documents, rapidly becomes superior to all-at-once DOM parsing for
files larger than about 75K. The DOM curve rises faster than linearly. (Note:
Curves were fitted using a second-degree polynomial.) This superlinear growth
means that with very large DOMs, you fall into a performance bottleneck very
quickly.
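A comparison of this kind can be approximated with a small harness like the one below. It is only a sketch and does not reproduce the test described above: it uses the JDK's built-in JAXP parsers rather than xCommerce, the class name ParseTimer and the Invoice/Id/Amount node layout are invented for illustration, and the generated nodes are much smaller than those in the original test document. It builds documents by replicating a node, then times a full DOM parse against a SAX pass that merely counts the same elements.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical timing harness: DOM parse versus SAX pass over the same bytes. */
class ParseTimer {
    /** Builds a test document by replicating one (invented) node layout. */
    static String makeDocument(int nodesets) {
        StringBuilder sb = new StringBuilder("<Root>");
        for (int i = 0; i < nodesets; i++) {
            sb.append("<Invoice><Id>").append(i)
              .append("</Id><Amount>10.00</Amount></Invoice>");
        }
        return sb.append("</Root>").toString();
    }

    /** Parses the whole document into a DOM, then counts Invoice elements. */
    static int countWithDom(byte[] xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(new ByteArrayInputStream(xml));
            return doc.getElementsByTagName("Invoice").getLength();
        } catch (Exception e) {
            throw new IllegalStateException("DOM parse failed", e);
        }
    }

    /** Streams the document through SAX, counting Invoice start events. */
    static int countWithSax(byte[] xml) {
        final int[] count = {0};
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    if (qName.equals("Invoice")) {
                        count[0]++;
                    }
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException("SAX parse failed", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        for (int n : new int[] {20, 500, 2500, 3300}) {
            byte[] xml = makeDocument(n).getBytes(StandardCharsets.UTF_8);
            long t0 = System.nanoTime();
            int dom = countWithDom(xml);
            long t1 = System.nanoTime();
            int sax = countWithSax(xml);
            long t2 = System.nanoTime();
            System.out.printf("%d nodesets: DOM %d ms, SAX %d ms (%d/%d found)%n",
                    n, (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, dom, sax);
        }
    }
}
```

Timings from a single run like this are rough (there is no warm-up and no repetition), and the shape of the curves on a current JVM and parser may differ from the figures reported here, so treat the output as a starting point for your own measurements rather than a confirmation of them.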
|
 |
 |
conclusion |
 |
 |
 |
|
Incremental parsing and processing of large XML files by means of SAX is a powerful tool for achieving better runtime performance (and economical use of RAM) in XML integration applications. Since SAX classes are included in the Xerces package (which in turn is used by xCommerce), SAX processing is already available in the xCommerce environment if you want to use it. Currently, to take advantage of that, you have to write your own ContentHandler (or subclass the SAXWriter class of Xerces), but soon you won't even have to do that: an upcoming release of xCommerce will include a custom SAX handler class, which you can call directly from ECMAScript in a Function action.