Regular Expressions in ECMAScript: A Guide for xCommerce Users
by exteNd Composer Product Team, eBusiness Integration Products, Novell
Date Created: 2001-04-30 08:36:00.000
  Introduction
  The RegExp Object
  Where Can You Use Regular Expressions?
  Global Search and Replace
  Tokenizing with split()
  Limitations
  Conclusion
introduction
Regular expressions, a powerful and versatile shorthand for pattern matching, have long been a favorite tool among power users of scripting languages like Perl and ECMAScript (not to mention users of the Unix grep utility). Often, a single well-crafted regular expression can accomplish string-parsing tasks that would otherwise require dozens (or even hundreds) of lines of procedural code. Regular expressions, or regexes (as they're known to power users), are of potentially enormous importance in any application that processes large amounts of text (in particular, "flat files" exported from databases).

Starting with version 2.0, xCommerce made regex functionality a part of its ECMAScript binding, greatly expanding the power and flexibility of custom scripts and expressions. If you haven't yet tapped into this power, you should take a moment to acquaint yourself with ECMAScript's regular expression syntax and its uses in xCommerce.

the RegExp object
As you probably know, most implementations of ECMAScript or JavaScript have a built-in RegExp object. xCommerce is no different. The main thing to remember about regexes in xCommerce is that they always begin life as Strings. That is, to create a regular expression, you need to pass the string representation of your expression to the RegExp constructor:
    var pattern = "[a-z]+";
    var regex = new RegExp(pattern);

    // or, equivalently:
    var regex = new RegExp("[a-z]+");

In this example, the pattern to be matched is one or more sequential instances of any lowercase letter from 'a' to 'z'. (The pattern would match "cript" in ECMAScript.) Passing the string "[a-z]+" to the RegExp constructor causes a new RegExp object to be assigned to the variable regex. This variable can then be passed as a parameter to any ECMAScript method that can take regular-expression arguments. (See further below.)

Notice that xCommerce does not let you create a RegExp object using literal notation. That is, you cannot do:

    var pattern = /[a-z]+/;     // illegal in xCommerce

Keep in mind that because you are passing a String representation of your regex to the constructor, you must escape (with a backslash) any characters that need escaping, including:

  • Quotation marks or any characters that normally need escaping inside an ECMAScript string (such as non-printing characters, literal backslashes, and so on).
  • Any characters that have special meaning inside a regex. For example, periods and plus signs should be escaped if you wish to match literal periods or pluses in your target.
  • For character-class shorthand expressions, such as \s for a whitespace character and \S for a non-whitespace character, the backslash itself should be escaped.

An example of a character-class regex is:

    var myRegex = new RegExp("\\d+");

In this example, the regular expression means "match one or more serial instances of digit characters." The literal notation for this (in Perl, for instance) would normally be \d+. But to form a valid string out of that, you need two backslashes.
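
The escaping rules above can be checked with a short sketch. The variable names and sample strings are invented for illustration; only the RegExp constructor and test() are used, as elsewhere in this article.

```javascript
// Each pattern is written as a string, so every regex backslash is doubled.
var digits = new RegExp("\\d+");         // regex: \d+    (one or more digits)
var literalDot = new RegExp("3\\.14");   // regex: 3\.14  (a literal period)
var unescapedDot = new RegExp("3.14");   // regex: 3.14   (period matches any character)
var quoted = new RegExp("\"[a-z]+\"");   // regex: "[a-z]+" (quotes escaped for the string only)

var digitsMatch = digits.test("order 474");                   // true
var dotMatchesLiteral = literalDot.test("pi is 3.14");        // true
var dotRejectsOther = literalDot.test("pi is 3x14");          // false
var looseDotAcceptsOther = unescapedDot.test("pi is 3x14");   // true
var quotedMatches = quoted.test("say \"hello\" now");         // true
```

The unescaped period is the easy one to miss: the pattern still matches the intended text, but it also matches text you did not intend.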

where can you use regular expressions?
In xCommerce's ECMAScript engine, as in most other implementations, you can use RegExp objects in the String methods match(), replace(), and split(), as well as the RegExp methods exec() and test(). Documentation for these methods can be found in any good JavaScript reference, so we won't rehash the basics here. Nevertheless, a quick example or two will give you an idea of the kinds of things you can do using these methods.

The test() method is handy for checking whether a particular pattern exists in a target string. This method belongs to the regular expression object itself and returns a Boolean. Example:

    var myString = "There were 474 orders.";
    var myRegex = new RegExp("\\d+"); // match one or more numeric characters
    var match = myRegex.test( myString ); // true

The test matches the substring "474" in the target. It would also have matched "2" or "2000" (although not "2,000"). If you're accustomed to using the indexOf() method of the String object to perform substring tests, you can appreciate how much more powerful the RegExp test() method is. The RegExp method matches patterns, not just static substrings. For example, a regular expression of "apples|oranges" will match apples or oranges. (The pipe character is a logical-OR operator inside a regex.)
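
As a rough comparison of the two approaches, the sketch below runs the "apples|oranges" pattern against two made-up order strings and then performs the same check with indexOf(). The variable names and sample data are assumptions for illustration.

```javascript
var fruitRegex = new RegExp("apples|oranges");

var order1 = "Ship 12 crates of oranges";
var order2 = "Ship 12 crates of pears";

// One regex test covers both alternatives.
var hasFruit1 = fruitRegex.test(order1);   // true
var hasFruit2 = fruitRegex.test(order2);   // false

// The indexOf() equivalent needs one call per substring.
var hasFruit1ByIndexOf =
    order1.indexOf("apples") != -1 || order1.indexOf("oranges") != -1;   // true
```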

Of course, the indexOf() method has the mitigating benefit of providing a string offset (the index of the first match) as a return value. But here again, a more powerful regex alternative that does the same thing is available:

    var match = myRegex.exec( myString );  // match[0] == '474'
    var index = match.index; // index == 11

The exec() method returns an object that has an index property containing the offset of the first character of the match within the target string. But be aware that if no match occurs, the return value of exec() is null, and trying to inspect the index property will cause an error.
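
One way to guard against the null case is to wrap exec() in a small helper that mimics the indexOf() convention of returning -1 when nothing is found. The helper name firstMatchIndex is hypothetical, not part of xCommerce or ECMAScript.

```javascript
// Hypothetical helper: returns the offset of the first match, or -1 if none.
function firstMatchIndex(regex, target) {
    var match = regex.exec(target);
    if (match == null) {
        return -1;              // no match: there is no index property to read
    }
    return match.index;
}

var numberRegex = new RegExp("\\d+");
var found = firstMatchIndex(numberRegex, "There were 474 orders.");    // 11
var missing = firstMatchIndex(numberRegex, "There were no orders.");   // -1
```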

global search and replace
One of the handiest uses for regular expressions is performing global replacement of tokens in strings. For this, you will use the String object's replace() method.

If you want truly global replacement of matched patterns, rather than simply replacement of the first occurrence of a match, you need to set the global flag on your RegExp instance:

    myRegex.global = true;

This way, any operations involving the myRegex pattern will be applied to all matches throughout the target string.

You may also want to apply your operation in case-insensitive manner:

    myRegex.ignoreCase = true;

The default for global and ignoreCase properties is false.

Let's look at an example of using replace(). Suppose you have an HTML document (as a String) that you want to remove IMG (image) and HR (horizontal rule) tags from. You could do this as follows:

    var regex = new RegExp("<IMG>|<HR>");
    regex.ignoreCase = true;
    regex.global = true;

    myHTML = myHTML.replace(regex, "");

The replace operation substitutes an empty string for every occurrence of an IMG or HR tag, in case-insensitive manner (which is important, since HTML tags can be upper or lower case). Notice that because the replace method returns a new string (leaving the original alone), we must capture the return value. In this case we just stuff the return value back into myHTML.
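
For readers who want to try this outside xCommerce: in standard ECMAScript engines the global and ignoreCase properties are read-only, and the flags are instead passed as a second argument to the RegExp constructor. The sketch below uses that standard form; whether xCommerce accepts the flags argument is not stated here, so treat it as an assumption. The pattern is also widened with [^>]* so that tags carrying attributes (such as SRC) are matched, and the sample HTML is invented.

```javascript
// Standard ECMAScript form: flags passed as the second constructor argument.
// "g" = global, "i" = ignore case. (Assumption: not confirmed for xCommerce.)
var tagRegex = new RegExp("<IMG[^>]*>|<HR[^>]*>", "gi");

var sampleHTML = "<p>Logo: <img src=\"logo.gif\"></p><hr><P>Text</P><HR>";

// replace() returns a new string; sampleHTML itself is unchanged.
var cleaned = sampleHTML.replace(tagRegex, "");
// cleaned == "<p>Logo: </p><P>Text</P>"
```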

To eliminate every tag from an HTML document, you could specify an expression of:

    new RegExp("<[^>]+>");

which literally means "any occurrence of a left angle bracket followed by one or more non-right-angle-bracket characters, followed by a right angle bracket."
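
A minimal sketch of tag stripping follows, using the standard flags argument to request global replacement (an assumption for xCommerce, where the article sets the global property instead). It also shows one design choice: replacing each tag with a space and then collapsing runs of whitespace keeps words from adjacent elements from running together. The sample markup and variable names are invented.

```javascript
var anyTag = new RegExp("<[^>]+>", "g");
var runsOfSpace = new RegExp("\\s+", "g");
var page = "<h1>Orders</h1><p>474 <b>new</b> orders.</p>";

// Deleting tags outright glues neighboring text together.
var joined = page.replace(anyTag, "");
// joined == "Orders474 new orders."

// Replacing tags with a space, then collapsing whitespace, keeps words apart.
var spaced = page.replace(anyTag, " ").replace(runsOfSpace, " ");
// spaced == " Orders 474 new orders. "
```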

tokenizing with split()
One of the most powerful regex operations available in ECMAScript is the String object's built-in split() method. This method breaks a string into substrings at every occurrence of the specified delimiter, be it a single character or a complex pattern. The return value is an Array object containing the substrings created by the operation.

Suppose you are processing a flat file (or EDI document, etc.) in which you want to break a string into substrings at every occurrence of * (asterisk) or | (pipe). You could do it this way:

    var regex = new RegExp("\\*|\\|");

    var arrayOfFields = record.split( regex );

Whether record uses asterisks or pipes as delimiters, it will be split into substrings at every occurrence of the delimiter. Notice that in our regular expression string, we must escape not only the asterisk but also the pipe character. This is because the asterisk has special meaning inside a regex: it means "match zero or more occurrences of the preceding pattern." Likewise, the pipe character has special meaning (logical OR), as explained earlier.
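
To make the behavior concrete, here is the same delimiter pattern applied to two made-up records, one delimited with asterisks and one with pipes. The record contents are illustrative only.

```javascript
var delimiter = new RegExp("\\*|\\|");   // regex: \*|\|  (literal asterisk OR literal pipe)

var starRecord = "PO1*474*EA*9.95";
var pipeRecord = "PO1|474|EA|9.95";

var starFields = starRecord.split(delimiter);
var pipeFields = pipeRecord.split(delimiter);
// Both arrays contain: "PO1", "474", "EA", "9.95"
```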

limitations
You will probably discover a few limitations to regexes in xCommerce. The most noteworthy are the lack of support for interval quantifiers (curly braces enclosing numbers, such as {3}) and no caching of hits into variables $1, $2, etc. (Regex aficionados know what this means. If you don't know, you won't miss the missing functionality.)
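
Where a curly-brace quantifier such as \d{3} is unavailable, one possible workaround is to write the repeated element out the required number of times. The sketch below does that with a small helper; repeatPattern is a hypothetical name, and the ZIP+4 pattern is only an example.

```javascript
// Hypothetical helper: returns atom repeated count times,
// e.g. repeatPattern("\\d", 3) yields the regex text \d\d\d (same as \d{3}).
// The atom should be a single element (a character, class, or shorthand).
function repeatPattern(atom, count) {
    var result = "";
    for (var i = 0; i < count; i++) {
        result += atom;
    }
    return result;
}

// Equivalent of the regex \d{5}-\d{4}, written without curly braces.
var zipPattern = repeatPattern("\\d", 5) + "-" + repeatPattern("\\d", 4);
var zipRegex = new RegExp(zipPattern);

var validZip = zipRegex.test("84606-6194");   // true
var shortZip = zipRegex.test("8460-6194");    // false: only four digits before the hyphen
```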

The only other thing to be wary of is that complicated and/or poorly crafted regular expressions can be costly in CPU cycles, especially if declared in loops. But this is just as true for regexes in Perl and other languages as it is for regexes in xCommerce ECMAScript.
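
One simple way to limit that cost is to construct the RegExp once, before the loop, and reuse it on every pass. The sketch below assumes a made-up array of asterisk-delimited records and pulls out the second field of each.

```javascript
var records = ["PO1*474*EA", "PO2*12*CS", "PO3*7*EA"];

// Build the RegExp once, outside the loop, and reuse it for every record.
var delimiter = new RegExp("\\*");

var quantities = [];
for (var i = 0; i < records.length; i++) {
    var fields = records[i].split(delimiter);
    quantities.push(fields[1]);   // second field of each record
}
// quantities contains: "474", "12", "7"
```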

One thing is for sure. Once you start using regular expressions in your scripts, you won't want to be without them ever again. They bring tremendous power and flexibility in string manipulation. Be sure to use that power for good, not evil.

conclusion
There are several good online tutorials devoted to regular expression usage. A good starting point is the tutorial at http://www.webreference.com/js/column5/, which specifically addresses regular expressions in JavaScript (as opposed to Perl) and starts at an easy level. An excellent Java-based tutorial can be found at http://javaregex.com/cgi-bin/pat/tutorial.asp.

Incredibly, as of this writing the Java language has no built-in regex capability. The http://www.javaregex.com site offers a nice Java regex package, however, for as low as $14.95.

Jeffrey Friedl's Mastering Regular Expressions (ISBN 1565922573) is generally considered the standard work on regex syntax, usage, and performance. Like most O'Reilly titles, it can be found at any bookstore (click or brick).