XML is a popular data format for several reasons: it is human readable, self-describing, and portable. Unfortunately, many Java-based XML parsers are very large; for example, Sun Microsystems' jaxp.jar and parser.jar libraries are 1.4 MB each. If you are running with limited memory (for example, in a J2ME (Java 2 Platform, Micro Edition) environment), or bandwidth is at a premium (for example, in an applet), using those large parsers might not be a viable solution.
Those libraries are large partly because they offer a lot of functionality, perhaps more than you require. They validate XML against DTDs (document type definitions), possibly against schemas, and more. However, you might already know that your application will receive valid XML, and you might have already decided that you need only the UTF-8 character set. In that case, all you really want is event-based processing of XML elements and translation of the standard XML entities: a nonvalidating parser.
Note: You can download this article's source code in Resources.
Why not just use SAX?
You could implement SAX (Simple API for XML) interfaces with limited functionality, throwing an exception named NotImplemented when you encounter something unnecessary. Undoubtedly, you could develop something much smaller than the 1.4 MB jaxp.jar/parser.jar libraries that way. But you can cut down the code size even more by defining your own classes. In fact, the package we construct here will be considerably smaller than the jar file containing the SAX interface definitions.
Our quick-and-dirty parser is event-based like the SAX parser. Also like the SAX parser, it lets you implement an interface to catch and process events corresponding to attributes and start/end element tags. Hopefully, those of you who have used SAX will find this parser familiar.
Limit XML functionality
Many people want XML's simple, self-describing textual data format. They want to easily pick out elements, attributes and their values, and elements' textual content. With that in mind, let's consider what functionality we need to preserve.
Our simple parsing package has just one class, QDParser, and one interface, DocHandler. QDParser itself has one public static method, parse(DocHandler,Reader), which we will implement as a finite state machine.
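To make the finite state machine idea concrete, here is a minimal sketch. It is not the article's QDParser: the class name TinyTagScanner, its three states, and the string-based event format are all assumptions made for illustration. It takes a String where the real parser takes a Reader, and it ignores attributes, comments, entities, and CDATA sections.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the article's QDParser: a three-state machine
// that reports start tags, end tags, and text as plain strings.
class TinyTagScanner {
    private enum State { TEXT, TAG_START, IN_TAG }

    static List<String> scan(String xml) {
        List<String> events = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        State state = State.TEXT;
        boolean closing = false;
        for (int i = 0; i < xml.length(); i++) {
            char ch = xml.charAt(i);
            switch (state) {
                case TEXT:
                    if (ch == '<') {
                        // A tag begins: flush any text collected so far.
                        if (buf.length() > 0) {
                            events.add("text:" + buf);
                            buf.setLength(0);
                        }
                        closing = false;
                        state = State.TAG_START;
                    } else {
                        buf.append(ch);
                    }
                    break;
                case TAG_START:
                    // The first character after '<' decides start versus end tag.
                    if (ch == '/') {
                        closing = true;
                    } else {
                        buf.append(ch);
                    }
                    state = State.IN_TAG;
                    break;
                case IN_TAG:
                    if (ch == '>') {
                        events.add((closing ? "end:" : "start:") + buf);
                        buf.setLength(0);
                        state = State.TEXT;
                    } else {
                        buf.append(ch);
                    }
                    break;
            }
        }
        if (state != State.TEXT) {
            throw new IllegalArgumentException("Input ended inside a tag");
        }
        if (buf.length() > 0) {
            events.add("text:" + buf);  // trailing text after the last tag
        }
        return events;
    }
}
```

A full parser adds more states (for attributes, quoted values, comments, and entities) and calls handler methods instead of collecting strings, but the loop keeps the same shape: read one character, switch on the current state, and decide the next state.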
Our limited-functionality parser treats the DTD declaration <!DOCTYPE> and processing instructions such as <?xml version="1.0"?> simply as comments, so it is not confused by their presence and does not use their content.
Because we won't process DOCTYPE, our parser cannot read custom entity definitions. Only the standard ones are available: &amp;, &lt;, &gt;, &apos;, and &quot;. If this is a problem, you can insert code to expand custom definitions, as the source code shows. Alternatively, you could preprocess the document, replacing custom entities with their expanded text before handing the document to the parser.
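As a rough illustration of entity expansion, the sketch below decodes the five predefined entities and numeric character references, and lets the caller register extra entities by hand in place of the DTD declarations the parser never reads. The class EntityDecoder and its methods are hypothetical names, not code from QDParser.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not taken from QDParser: expands predefined
// entities, numeric character references, and caller-registered entities.
class EntityDecoder {
    private final Map<String, String> entities = new HashMap<>();

    EntityDecoder() {
        entities.put("amp", "&");
        entities.put("lt", "<");
        entities.put("gt", ">");
        entities.put("apos", "'");
        entities.put("quot", "\"");
    }

    // Stands in for an <!ENTITY name "value"> declaration we never parse.
    void define(String name, String value) {
        entities.put(name, value);
    }

    String decode(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char ch = s.charAt(i);
            if (ch != '&') {
                out.append(ch);
                i++;
                continue;
            }
            int semi = s.indexOf(';', i);
            if (semi < 0) {
                throw new IllegalArgumentException("Unterminated entity at index " + i);
            }
            String name = s.substring(i + 1, semi);
            if (name.startsWith("#x")) {
                // Hexadecimal character reference such as &#x42;
                out.appendCodePoint(Integer.parseInt(name.substring(2), 16));
            } else if (name.startsWith("#")) {
                // Decimal character reference such as &#65;
                out.appendCodePoint(Integer.parseInt(name.substring(1)));
            } else if (entities.containsKey(name)) {
                out.append(entities.get(name));
            } else {
                throw new IllegalArgumentException("Unknown entity: &" + name + ";");
            }
            i = semi + 1;
        }
        return out.toString();
    }
}
```

Calling define("co", "ACME Corp") before decoding makes &co; expand to that text, which is roughly what hand-inserted custom definitions would achieve.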
Our parser also cannot support conditional sections; for example, <![INCLUDE[ ... ]]> or <![IGNORE[ ... ]]>. Without the ability to define custom entity definitions in DOCTYPE, we don't really need this functionality anyway. We could process such sections, if any, before the data is sent to our limited-space application.
Because we won't process any attribute declarations, the XML specification requires that we consider all attribute types to be CDATA. Thus, we can simply use java.util.Hashtable instead of org.xml.sax.AttributeList to hold an element's attribute list. A Hashtable holds only name/value information, but we don't need a getType() method because it would always return CDATA.
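The following sketch shows why a plain name/value table is enough when every attribute is CDATA. It parses the attribute portion of a start tag into a Hashtable, accepting single or double quotes. AttributeSketch is a hypothetical class written for this illustration; the real parser handles attributes inside its state machine rather than in a separate pass.

```java
import java.util.Hashtable;

// Hypothetical sketch: turns text like  id="7" title='x'  into a Hashtable.
// Every value is stored as a plain string, since all types are CDATA.
class AttributeSketch {
    static Hashtable<String, String> parse(String s) {
        Hashtable<String, String> attrs = new Hashtable<>();
        int n = s.length();
        int i = 0;
        while (true) {
            // Skip white space between attributes.
            while (i < n && Character.isWhitespace(s.charAt(i))) {
                i++;
            }
            if (i >= n) {
                break;
            }
            int eq = s.indexOf('=', i);
            if (eq < 0) {
                throw new IllegalArgumentException("Missing '=' after index " + i);
            }
            String name = s.substring(i, eq).trim();
            if (name.isEmpty()) {
                throw new IllegalArgumentException("Missing attribute name at index " + i);
            }
            i = eq + 1;
            while (i < n && Character.isWhitespace(s.charAt(i))) {
                i++;
            }
            if (i >= n) {
                throw new IllegalArgumentException("Missing value for " + name);
            }
            char quote = s.charAt(i);
            if (quote != '"' && quote != '\'') {
                throw new IllegalArgumentException("Value for " + name + " is not quoted");
            }
            // The value runs to the next occurrence of the same quote character.
            int close = s.indexOf(quote, i + 1);
            if (close < 0) {
                throw new IllegalArgumentException("Unterminated value for " + name);
            }
            attrs.put(name, s.substring(i + 1, close));
            i = close + 1;
        }
        return attrs;
    }
}
```

A handler then reads a value with an ordinary lookup such as attrs.get("id"), and there is no type information to consult.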
The lack of attribute declarations has other consequences as well. For example, the parser won't supply default attribute values. In addition, we can't automatically reduce white space using an NMTOKENS declaration. However, we could handle both issues when preparing our XML document, so the extra programming could be excluded from the application using the parser.
In fact, all the missing functionality can be compensated for by preparing the document appropriately. You can offload all the work associated with the missing features (if you want them) from the quick-and-dirty parser to the document preparation step.
Enough about what the parser cannot do. What can it do?
- It recognizes all the elements' start tags and end tags
- It lists attributes, where attribute values can be enclosed in single or double quotes
- It recognizes the <![CDATA[ ... ]]> construct
- It recognizes the standard entities &amp;, &lt;, &gt;, &quot;, and &apos;, as well as numeric entities
- It maps line endings of \r\n and \r to \n on input, in accordance with the XML Specification, Section 2.11
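The line-ending rule in the last item is small enough to show in full. Section 2.11 of the XML specification asks a parser to translate each \r\n pair, and any \r not followed by \n, into a single \n before further processing. The class and method names below are hypothetical, and this sketch works on a whole string rather than character by character from a Reader.

```java
// Hypothetical sketch of XML 1.0 end-of-line handling (section 2.11).
class LineEndings {
    static String normalize(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch == '\r') {
                out.append('\n');
                // Swallow the '\n' of a "\r\n" pair so it is not emitted twice.
                if (i + 1 < s.length() && s.charAt(i + 1) == '\n') {
                    i++;
                }
            } else {
                out.append(ch);
            }
        }
        return out.toString();
    }
}
```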
The parser does only minimal error checking and throws an Exception if it encounters unexpected syntax, such as unknown entities. Again, however, this parser does not validate; it assumes the XML document it receives is valid.
How to use this package
Using the quick-and-dirty XML parser is simple. First, implement the DocHandler interface. Then parse a file named config.xml:
DocHandler doc = new MyDocHandler();
QDParser.parse(doc, new FileReader("config.xml"));
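The article does not list the DocHandler methods, so the sketch below assumes a SAX-like set of callbacks. The interface is named DocHandlerSketch to make clear that it is a stand-in; the real DocHandler ships with the source code and may declare different methods or exceptions. The handler records the path to each element it sees.

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;

// Assumed shape of the callback interface; the real DocHandler is in the
// article's source code and may differ.
interface DocHandlerSketch {
    void startElement(String tag, Hashtable<String, String> attrs);
    void endElement(String tag);
    void text(String str);
}

// Records each element's path (for example "config/server") with its attributes.
class MyDocHandler implements DocHandlerSketch {
    private final List<String> open = new ArrayList<>();
    final List<String> seen = new ArrayList<>();

    public void startElement(String tag, Hashtable<String, String> attrs) {
        open.add(tag);
        seen.add(String.join("/", open) + " " + attrs);
    }

    public void endElement(String tag) {
        // Pop the innermost open element and check that the tags match.
        String last = open.remove(open.size() - 1);
        if (!last.equals(tag)) {
            throw new IllegalStateException("Expected </" + last + "> but saw </" + tag + ">");
        }
    }

    public void text(String str) {
        String trimmed = str.trim();
        if (!trimmed.isEmpty()) {
            seen.add("text: " + trimmed);
        }
    }
}
```

Because the handler keeps its own stack of open elements, it can report where in the document each event occurred without the parser building a tree.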
The source code includes two examples that provide full DocHandler implementations. The first, Reporter, simply reports all events to System.out as it reads them. You can test Reporter with the sample XML file.
The second and more complex example, Conf, updates fields on an existing data structure that resides in memory. Conf uses the java.lang.reflect package to locate fields and objects described in config.xml. If you run this program, it will print diagnostic information telling you what objects it is updating and how. It prints error messages if the config file asks it to update nonexistent fields.
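The reflection step can be sketched as follows. This is not Conf's actual code: ServerSettings and FieldUpdater are hypothetical names, and the sketch handles only public int and String fields. It shows the general pattern of looking up a field by the name found in the XML, converting the text value, and reporting a problem when the field does not exist.

```java
import java.lang.reflect.Field;

// Hypothetical in-memory structure; Conf's real target is in the source code.
class ServerSettings {
    public String host = "localhost";
    public int port = 80;
}

// Hypothetical helper: sets a public field by name from a string value.
class FieldUpdater {
    // Returns true if the field was found and updated.
    static boolean set(Object target, String fieldName, String value) {
        try {
            Field f = target.getClass().getField(fieldName);
            if (f.getType() == int.class) {
                f.setInt(target, Integer.parseInt(value));
            } else if (f.getType() == String.class) {
                f.set(target, value);
            } else {
                System.err.println("Unsupported type for field " + fieldName);
                return false;
            }
            System.out.println("Updated " + fieldName + " to " + value);
            return true;
        } catch (NoSuchFieldException e) {
            System.err.println("No such field: " + fieldName);
            return false;
        } catch (IllegalAccessException e) {
            System.err.println("Cannot access field: " + fieldName);
            return false;
        }
    }
}
```

A handler could call FieldUpdater.set from its start-element callback, passing an attribute name and value. A value that is not a valid integer makes Integer.parseInt throw NumberFormatException, which this sketch leaves to the caller.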
Modify this package
You'll likely want to modify this package for your own application. You might add custom entity definitions: line 180 in QDParser.java contains an "Insert custom entity definitions here" comment.
You could also add to the finite state machine's functionality, restoring functionality I have excluded here. If so, the source code's small size should make this task relatively easy.
Keep it small
The QDParser class occupies around 3 KB after you compile and pack it into a jar file. The source code itself, with comments, is just over 300 lines. This should be small enough for most space-constrained applications, while retaining enough of the XML specification to offer most of its useful features.
Learn more about this topic
- The source code for this tip
- The XML specification at the W3C
- The SAX Website
- The JAXP Website
- The J2ME Website
- Browse the Java and XML section of JavaWorld's Topical Index
- View all previous Java Tips and submit your own
- Learn Java from the ground up in JavaWorld's Java 101 column
- Java experts answer your toughest Java questions in JavaWorld's Java Q&A column
- Browse the Core Java section of JavaWorld's Topical Index
- Stay on top of our Tips 'N Tricks by subscribing to JavaWorld's free weekly email newsletters
- Learn the basics of client-side Java in JavaWorld's Java Beginner discussion. Core topics include the Java language, the Java Virtual Machine, APIs, and development tools
- You'll find a wealth of IT-related articles from our sister publications at IDG.net