Microbenchmark: Expat outperforms Java and libxml2 in SAX parsing

I needed an XML parser for my experiments with Rust (parsing a gigantic 2 GB Wikivoyage XML dump), and there was no native one, so I would need a wrapper around a C implementation. At first I used libxml2 because, besides its very appealing name, it has a very convenient XmlTextReader API that lets you control the parsing loop yourself (no callbacks as in SAX).

I wondered how it compared to the StAX API of Java, so I wrote a simplistic test that prints the names of all the nodes (to /dev/null). Surprise: StAX was much faster:

  • StAX (openjdk-7, Ubuntu 14.04, amd64)
    • Time: 35 seconds
    • Memory usage: 30 MB
  • XmlTextReader API of libxml2 (2.9.1), g++ 4.8.2
    • Time: 50 seconds
    • Memory usage: 3.5 MB
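
For reference, a pull-parsing test of this kind could look roughly like the following in Java. This is a minimal sketch and not the code that produced the numbers above: the class name `StaxNames` and the helper `printElementNames` are made up for illustration, and it assumes that "names of all the nodes" means the names of start elements.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.PrintStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical test class: names are illustrative, not from the original benchmark.
class StaxNames {
    // Pull-parses the stream and prints the local name of every start element.
    // Returns the number of start elements seen.
    static long printElementNames(InputStream in, PrintStream out) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        long count = 0;
        try {
            // The caller drives the loop: we ask for the next event when we want it.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    out.println(reader.getLocalName());
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java StaxNames.java dump.xml > /dev/null");
            System.exit(1);
        }
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            // Buffer stdout so that printing does not dominate the measurement.
            PrintStream out = new PrintStream(new BufferedOutputStream(System.out, 1 << 16));
            printElementNames(in, out);
            out.flush();
        }
    }
}
```

Timing it with something like `time java StaxNames.java dump.xml > /dev/null` (single-file launch needs Java 11+; otherwise compile with `javac` first) gives a comparable setup, although JVM startup and warm-up are included in the wall-clock time.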

Then I ran the same test with SAX, comparing the default Java implementation with libxml2 and finally also libexpat 2.1 (time, with memory usage in parentheses):

  • Java: 30 s (39 MB)
  • libxml2: 34 s (2 MB)
  • libexpat: 21 s (180 kB)
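
The SAX variant of the test inverts the control flow: the parser owns the loop and calls back into a handler. A minimal Java sketch, again with made-up names (`SaxNames`, `NamePrinter`) and under the same assumption that only element names are printed, might be:

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical test class: names are illustrative, not from the original benchmark.
class SaxNames {
    // Handler invoked by the parser; prints each element name as it is opened.
    static class NamePrinter extends DefaultHandler {
        private final PrintStream out;
        long count = 0;

        NamePrinter(PrintStream out) {
            this.out = out;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            // The default factory is not namespace-aware, so qName carries the element name.
            out.println(qName);
            count++;
        }
    }

    // Push-parses the stream and returns the number of start elements seen.
    static long printElementNames(InputStream in, PrintStream out)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        NamePrinter handler = new NamePrinter(out);
        parser.parse(in, handler); // the parser drives the loop and calls the handler
        return handler.count;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java SaxNames.java dump.xml > /dev/null");
            System.exit(1);
        }
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            PrintStream out = new PrintStream(new BufferedOutputStream(System.out, 1 << 16));
            printElementNames(in, out);
            out.flush();
        }
    }
}
```

Any state needed across events (here just a counter) has to live in the handler object, which is the main ergonomic difference from the pull style.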

Conclusions:

  • the Java implementation seems quite good (at least in this particular scenario)
  • libexpat is the fastest and uses very little memory, but it only offers a SAX-style callback API, not pull-parsing (which is much more convenient)
  • if you are processing huge XML files, forget about libxml2
