I needed an XML parser for my experiments with Rust (parsing a gigantic 2 GB Wikivoyage XML dump), and there is no native one, so I would need a wrapper around a C implementation. At first I used libxml2 because, besides its very appealing name, it has a very convenient XmlTextReader API that lets you control the parsing loop yourself (no callbacks, unlike SAX).
I wondered how it compared to Java's StAX API, so I wrote a simple test that prints the names of all the nodes (to /dev/null). Surprise: StAX was noticeably faster:
- StAX (openjdk-7, Ubuntu 14.04, amd64)
  - Time: 35 seconds
  - Memory usage: 30 MB
- XmlTextReader API of libxml2 (2.9.1), g++ 4.8.2
  - Time: 50 seconds
  - Memory usage: 3.5 MB
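For reference, the StAX side of such a test can be sketched roughly as follows. This is not the exact code behind the numbers above: the class name `StaxNames`, the `printNames` helper, and the choice to print only element names (StAX has no name for text nodes) are assumptions made for illustration.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.PrintStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical test class: pull-parses a document and prints element names.
class StaxNames {
    // Prints the local name of every start element to `out`
    // and returns how many start elements were seen.
    static long printNames(InputStream in, PrintStream out) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        long count = 0;
        try {
            // Pull parsing: our own loop asks the parser for the next event.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    out.println(reader.getLocalName());
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to the XML file to parse.
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            long n = printNames(in, System.out);
            System.err.println(n + " elements");
        }
    }
}
```

It would be run as `java StaxNames dump.xml > /dev/null`, where `dump.xml` stands in for the actual dump file.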
Then I ran tests with SAX, comparing the default Java implementation with libxml2 and, finally, libexpat 2.1:
- Java: 30 s (39 MB)
- libxml2: 34 s (2 MB)
- libexpat: 21 s (180 KB)
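The Java SAX variant might look something like the sketch below. Again, this is an illustration rather than the code that produced the timings: `SaxNames` and `printNames` are hypothetical names, and printing the qualified name of each element is an assumption about what the test did. Note how the loop is gone: the parser drives, and calls back into the handler.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical test class: push-parses a document and prints element names.
class SaxNames {
    // Prints the qualified name of every start element to `out`
    // and returns how many start elements were seen.
    static long printNames(InputStream in, final PrintStream out)
            throws ParserConfigurationException, SAXException, IOException {
        final long[] count = {0}; // one-element array so the handler can update it
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(in, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                // The default factory is not namespace-aware, so use qName.
                out.println(qName);
                count[0]++;
            }
        });
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to the XML file to parse.
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            long n = printNames(in, System.out);
            System.err.println(n + " elements");
        }
    }
}
```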
Conclusions:
- the Java implementation seems quite good (at least in this particular scenario)
- libexpat is quite fast, but it only offers a SAX (callback) API, not the pull parsing that I find so convenient
- if you are processing huge XML files, forget about libxml2