i18n: text expansion for several languages

From the KDE translations, I compiled statistics on text expansion, following the model presented by IBM on this page. For example, they tell you that if you want to translate an English text between 11 and 20 characters long into French:

  • if you want 50% of such texts to fit, you will need 24% of additional space (for an English text of 100 pixels, you will need 124 pixels)
  • if you want 80%, you will need 57% of additional space
  • for 90%: 77%
  • for 95%: 98%
  • for 99%: 150%
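
To make the arithmetic concrete, here is a minimal sketch in Go that turns the figures above into a width to reserve. The names `expansion` and `requiredWidth` are made up for this illustration and are not part of the generator; the table only covers the English to French, 11 to 20 character case quoted above.

```go
package main

import "fmt"

// expansion maps the share of texts that must fit (in percent) to the
// additional space needed (in percent). Figures are the ones listed above
// for English -> French texts of 11 to 20 characters.
var expansion = map[int]int{50: 24, 80: 57, 90: 77, 95: 98, 99: 150}

// requiredWidth (hypothetical helper) returns the width to reserve for the
// translation of an English text of the given width. The boolean is false
// when the requested coverage is not in the table.
func requiredWidth(englishWidth, coverage int) (int, bool) {
	extra, ok := expansion[coverage]
	if !ok {
		return 0, false
	}
	// Round up so that the reserved space is never too small.
	return (englishWidth*(100+extra) + 99) / 100, true
}

func main() {
	for _, coverage := range []int{50, 80, 90, 95, 99} {
		width, _ := requiredWidth(100, coverage)
		fmt.Printf("%d%% of texts fit in %d pixels\n", coverage, width)
	}
}
```

For a 100-pixel English text and 50% coverage, this gives the 124 pixels mentioned in the first bullet.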

Sources and data are available. Please excuse the poor HTML design (written in HTML 5, looks like HTML 1.0 ☺); I did not put any effort into it (not my cup of tea), but at least it does the job.

I also wrote another version of the generator using the Go language. I was a little bit disappointed because:

  • I did not get a formidable boost in performance (I should have: I have a 4-core CPU, and Python cannot take full advantage of it because of its GIL). In fact, the file parsing part (the most time-consuming one) gave similar performance; only the statistics production part was notably faster
  • The GTK/Pango binding for Go (which I used to get text lengths in pixels) was incomplete and tedious to complete (even though Go has really nice support for writing C bindings). I also had some crashes (related to GObject reference counting).
  • There is a port of Freetype written in Go, but the code to compute the length of a text was incomplete (it did not cover a given case)
  • At the time I wrote this version, I thought about integrating Fontconfig directly in the program, but it would have taken more time than I wanted to spend on it.

Golang: lack of generics bothers me

I think I will give up Go, mainly because of the lack of generics. What bothers me is that I can’t see how to write all-purpose algorithm functions like the ones C++ has (I love them), for example «std::remove_if». Without them, you have to write the same little pieces of code again, and again, and again. Or use casts everywhere (not great for a language with strong static typing).
The built-in functions (like «copy») can do such magic, but you, the developer, can’t.
Oh, it’s possible to do as the sort package does: pass the function an interface that performs the operations on the data (like «Swap», «Len», «Less»). But if I want to implement my own «remove_if», implementing such an interface will be a drag.
Same problem if you want to create a generic data structure, like a «set» or a b-tree of anything (interface or native type), and keep the type-safety.
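
To show what the sort-style workaround looks like in practice, here is a rough sketch of a «remove_if» built on an interface. `Remover`, `RemoveIf` and `IntSlice` are names invented for this example (they are not part of the standard library), and the predicate works on indexes because the function cannot know the element type.

```go
package main

import "fmt"

// Remover is a hypothetical interface modelled on sort.Interface: the
// algorithm only sees indexes, never the elements themselves.
type Remover interface {
	Len() int
	Swap(i, j int)
}

// RemoveIf moves the elements to keep to the front, preserving their order,
// and returns how many were kept. The caller truncates the data afterwards.
func RemoveIf(data Remover, remove func(i int) bool) int {
	kept := 0
	for i := 0; i < data.Len(); i++ {
		if !remove(i) {
			data.Swap(i, kept)
			kept++
		}
	}
	return kept
}

// IntSlice is the boilerplate that has to be written again for every
// element type you want to use with RemoveIf.
type IntSlice []int

func (s IntSlice) Len() int      { return len(s) }
func (s IntSlice) Swap(i, j int) { s[i], s[j] = s[j], s[i] }

func main() {
	values := IntSlice{1, 2, 3, 4, 5, 6}
	// Remove the even numbers.
	n := RemoveIf(values, func(i int) bool { return values[i]%2 == 0 })
	values = values[:n]
	fmt.Println(values)
}
```

It works, but every new element type needs its own `Len` and `Swap` methods, and the truncation step is left to the caller, which is exactly the kind of repetition complained about above.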

That’s a pity, Go has some great features. Maybe I will try Rust.

Golang, Openstreetmap, threads

In my previous attempt at parsing Openstreetmap data, I found parsing XML data slow. Fortunately, I realized that the data was also available in Google Protocol Buffer format, and

  • it’s 30-40% smaller than bzip2 XML (and you know, bzip2 requires a fair amount of CPU power)
  • it’s much faster to parse: with Osmosis, reading a PBF file on my quad-core took 14 seconds, while reading the bzip2-compressed XML with the help of lbunzip2 (a multi-threaded decompressor) took 1 min 50 s. Ouch!

There is a Go library to handle Protocol Buffers, so I tried to write a PBF reader in this language to see how efficient it would be. My program worked like this:

  • the main thread would read blocks from the file and pass them to worker threads using a channel
  • each worker (a goroutine) would decompress the block, unmarshal the data and process it
  • once there were no blocks left to process, the results of the workers would be merged into a single image
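
The three steps above can be sketched roughly as follows. This is not the actual PBF reader: `processBlock` is a hypothetical stand-in for «decompress, unmarshal and process», the blocks are plain byte slices, and the merged result is a simple sum instead of an image.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// processBlock is a hypothetical placeholder for the real work done on a
// block (decompression, unmarshalling, processing).
func processBlock(block []byte) int {
	return len(block)
}

// run hands the blocks to the workers through a channel, then merges the
// partial result of each worker into a single value.
func run(blocks [][]byte, workers int) int {
	jobs := make(chan []byte)
	// Buffered so that each worker can deliver its result without blocking.
	results := make(chan int, workers)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			partial := 0
			for block := range jobs {
				partial += processBlock(block)
			}
			results <- partial
		}()
	}

	// The main goroutine plays the role of the file reader.
	for _, block := range blocks {
		jobs <- block
	}
	close(jobs) // no block left: the workers leave their loops
	wg.Wait()
	close(results)

	total := 0
	for partial := range results {
		total += partial
	}
	return total
}

func main() {
	blocks := [][]byte{[]byte("node"), []byte("way"), []byte("relation")}
	// Number of cores + 1 workers, as in the measurements of this post.
	fmt.Println(run(blocks, runtime.NumCPU()+1))
}
```

Each worker keeps its own partial result, so nothing is shared until the final merge and no lock is needed.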

So what kind of performance did I get? I got the best result with a number of workers equal to the number of cores + 1 (so 5 workers): about 28 seconds. I cannot compare this result with Osmosis (not all the cases are handled), but it’s quite acceptable.

I find Go a nice language to use, and it compiles very, very quickly. I struggled a little bit with some points, and not everything is clear to me yet. It feels strange not to program in an OO way. And I can’t tell whether I should spawn tons of goroutines or use a pool of workers, or whether I should pass callback functions or channels.

It’s also a pity that the tremendous performance is not there yet. It’s supposed to be «close to the metal», «a language for system programming», but for the moment (after 3 years) it is not as fast as Java.

Go XML sax-like parsing is slow

I wanted to write an Openstreetmap XML processor in Go, hoping to get a performance boost over my Python implementation. It ended up being slower. Python uses Expat (written in C), and maybe the Go module «encoding/xml» is not state-of-the-art in terms of optimization.

I wrote simple programs handling the «start element» event. In 10 seconds, I could parse the following amounts of XML data (on an Athlon II X4 620):

  • PyPy: did not run because of a bug (no progressive parsing)
  • Go: 70 MB
  • Python 2.7: 210 MB
  • Python 3.2: 215 MB
  • Java 7: 460 MB
  • C++ / libxml: 675 MB

I tried to use Expat or Libxml in Go, but for the moment it is just too complicated. In Go code, it’s easy to call C functions located in shared libraries, but if you need to pass callback functions written in Go to a library written in C, you will have to do dirty things (create wrappers in a Go module containing C code).

It’s a pity, because the Go compiler automatically generates C wrappers for your exported Go functions, but you cannot get a raw pointer to these wrappers (that way I would have been able to pass my callbacks to LibXML or Expat)… See you later, Go.

ua-site-switch, a per-site User-Agent switcher for Firefox

There are many user-agent switcher add-ons for Firefox. Most of them change the user agent globally. There is one called UAControl that lets you specify it on a per-site basis (and does it well).

I wanted one based on Jetpack, the new framework for Firefox add-ons. At least I would learn something. So I made ua-site-switch, which is not as complete as its nice competitor yet. The code is there on GitHub (git and GitHub are so nice to use!).