

{"id":1358,"date":"2014-07-01T00:00:02","date_gmt":"2014-06-30T23:00:02","guid":{"rendered":"http:\/\/fabsk.eu\/blog\/?p=1358"},"modified":"2015-07-11T13:55:23","modified_gmt":"2015-07-11T11:55:23","slug":"microbenchmark-stax-vs-libxml2xmltextreader-java-wins","status":"publish","type":"post","link":"https:\/\/fabsk.eu\/blog\/2014\/07\/01\/microbenchmark-stax-vs-libxml2xmltextreader-java-wins\/","title":{"rendered":"Microbenchmark: Expat outperforms Java and libxml2 in SAX parsing"},"content":{"rendered":"<p>I needed an XML parser for my tests with <a href=\"http:\/\/www.rust-lang.org\/\">Rust<\/a> (parsing a gigantic 2 GB <a href=\"http:\/\/dumps.wikimedia.org\/\">Wikivoyage XML dump<\/a>) and there is no native one. So I needed a wrapper around a C implementation. At first, I used libxml2, because <del>it has a very appealing name<\/del> it has a very convenient <a href=\"http:\/\/xmlsoft.org\/html\/libxml-xmlreader.html\">XmlTextReader<\/a> API where you control the parsing loop (no callbacks, unlike SAX).<\/p>\n<p>I wondered how it compared to the <a href=\"https:\/\/en.wikipedia.org\/wiki\/StAX\"><strong>StAX<\/strong><\/a> API of Java, and made a simplistic test printing (to \/dev\/null) the names of all the nodes. 
And surprise, StAX was way faster:<\/p>\n<ul>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/StAX\"><strong>StAX<\/strong><\/a> (openjdk-7, Ubuntu 14.04, amd64)\n<ul>\n<li><strong>35<\/strong> seconds<\/li>\n<li>Memory usage: 30 MB<\/li>\n<\/ul>\n<\/li>\n<li><a href=\"http:\/\/xmlsoft.org\/html\/libxml-xmlreader.html\">XmlTextReader<\/a> API of <strong>libxml2<\/strong> (2.9.1), g++ 4.8.2\n<ul>\n<li><strong>50<\/strong> seconds<\/li>\n<li>Memory usage: 3.5 MB<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>Then I made tests with SAX, comparing the default Java implementation with libxml2 and finally also <a href=\"http:\/\/www.libexpat.org\/\">libexpat<\/a> 2.1:<\/p>\n<ul>\n<li>Java: <strong>30s<\/strong> (39 MB)<\/li>\n<li>libxml2: <strong>34s<\/strong> (2 MB)<\/li>\n<li>libexpat: <strong>21s<\/strong> (180 kB)<\/li>\n<\/ul>\n<p>Conclusions:<\/p>\n<ul>\n<li>the Java implementation seems quite good (at least in this particular scenario)<\/li>\n<li>libexpat is quite fast, but it&rsquo;s a SAX parser, not a pull parser (and pull-parsing is very convenient)<\/li>\n<li>if you are processing huge XML files, forget about libxml2<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>I needed an XML parser for my tests with Rust (parsing a gigantic 2 GB Wikivoyage XML dump) and there is no native one. So I needed a wrapper around a C implementation. 
At first, I used libxml2, because it has a very convenient XmlTextReader API where you are [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[24,6,28],"tags":[],"class_list":["post-1358","post","type-post","status-publish","format-standard","hentry","category-dev","category-informatique","category-java","\"lang=\"en"],"_links":{"self":[{"href":"https:\/\/fabsk.eu\/blog\/wp-json\/wp\/v2\/posts\/1358","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fabsk.eu\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fabsk.eu\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fabsk.eu\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fabsk.eu\/blog\/wp-json\/wp\/v2\/comments?post=1358"}],"version-history":[{"count":9,"href":"https:\/\/fabsk.eu\/blog\/wp-json\/wp\/v2\/posts\/1358\/revisions"}],"predecessor-version":[{"id":1368,"href":"https:\/\/fabsk.eu\/blog\/wp-json\/wp\/v2\/posts\/1358\/revisions\/1368"}],"wp:attachment":[{"href":"https:\/\/fabsk.eu\/blog\/wp-json\/wp\/v2\/media?parent=1358"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fabsk.eu\/blog\/wp-json\/wp\/v2\/categories?post=1358"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fabsk.eu\/blog\/wp-json\/wp\/v2\/tags?post=1358"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}