Important MO file optimisation for en_* locales, and partly others

During GUADEC, Tomas Frydrych gave a talk on exmap-console, a cut-down version of exmap that can work well on mobile devices.

During the presentation, Tomas showed how to use the tool to find the culprits in memory (ab)use on the GNOME desktop. One issue that came up was that the MO files taking up space though the desktop showed English. Why would the MO translation files loaded in memory be so big in size?

gtk20.mo                             : VM   61440  B, M   61440  B, S   61440  B

atk10.mo                      	     : VM    8192  B, M    8192  B, S    8192  B

libgnome-2.0.mo			: VM   28672  B, M   24576  B, S   24576  B

glib20.mo			     : VM   20480  B, M   16384  B, S   16384  B

gtk20-properties.mo           : VM     128 KB, M     116 KB, S     116 KB

launchpad-integration.mo  : VM    4096  B, M    4096  B, S    4096  B

A translation file looks like

msgid “File”

msgstr “”

When translated to Greek it is

msgid “File”

msgstr “Αρχείο”

In the English UK translation it would be

msgid “File”

msgstr “File”

This actually is not necessary because if you leave those messags untranslated, the system will use the original messages that are embedded in the executable file.

However, for the purposes of the English UK, English Canadian, etc teams, it makes sense to copy the same messages in the translated field because it would be an indication that the message was examined by the translation. Any new messages would appear as untranslated and the same process would continue.

Now, the problem is that the gettext tools are not smart enough when they compile such translation files; they replicate without need those messages occupying space in the generated MO file.

Apart from the English variants, this issue is also present in other languages when the message looks like

msgid “GConf”

msgstr “GConf”

Here, it does not make much sense to translate the message in the locale language. However, the generated MO file contains now more than 10 bytes (5+5) , plus some space for the index.

Therefore, what’s the solution for this issue?

One solution is to add to msgattrib the option to preprocess a PO file and remove those unneeded copies. Here is a patch,

— src.ORIGINAL/msgattrib.c 2007-07-18 17:17:08.000000000 +0100
+++ src/msgattrib.c 2007-07-23 01:20:35.000000000 +0100
@@ -61,7 +61,8 @@
REMOVE_FUZZY = 1 << 2,
REMOVE_NONFUZZY = 1 << 3,
REMOVE_OBSOLETE = 1 << 4,
– REMOVE_NONOBSOLETE = 1 << 5
+ REMOVE_NONOBSOLETE = 1 << 5,
+ REMOVE_COPIED = 1 << 6
};
static int to_remove;

@@ -90,6 +91,7 @@
{ “help”, no_argument, NULL, ‘h’ },
{ “ignore-file”, required_argument, NULL, CHAR_MAX + 15 },
{ “indent”, no_argument, NULL, ‘i’ },
+ { “no-copied”, no_argument, NULL, CHAR_MAX + 19 },
{ “no-escape”, no_argument, NULL, ‘e’ },
{ “no-fuzzy”, no_argument, NULL, CHAR_MAX + 3 },
{ “no-location”, no_argument, &line_comment, 0 },
@@ -314,6 +316,10 @@
to_change |= REMOVE_PREV;
break;

+ case CHAR_MAX + 19: /* –no-copied */
+ to_remove |= REMOVE_COPIED;
+ break;
+
default:
usage (EXIT_FAILURE);
/* NOTREACHED */
@@ -436,6 +442,8 @@
–no-obsolete remove obsolete #~ messages\n”));
printf (_(“\
–only-obsolete keep obsolete #~ messages\n”));
+ printf (_(“\
+ –no-copied remove copied messages\n”));
printf (“\n”);
printf (_(“\
Attribute manipulation:\n”));
@@ -536,6 +544,21 @@
: to_remove & REMOVE_NONOBSOLETE))
return false;

+ if (to_remove & REMOVE_COPIED)
+ {
+ if (!strcmp(mp->msgid, mp->msgstr) && strlen(mp->msgstr)+1 >= mp->msgstr_len)
+ {
+ return false;
+ }
+ else if ( strlen(mp->msgstr)+1 < mp->msgstr_len )
+ {
+ if ( !strcmp(mp->msgstr + strlen(mp->msgstr)+1, mp->msgid_plural) )
+ {
+ return false;
+ }
+ }
+ }
+
return true;
}
However, if we only change msgattrib, we would need to adapt the build system for all packages.

Apparently, it would make sense to change the default behaviour of msgfmt, the program that compiles PO files into MO files.

An e-mail was sent to the email address for the development team of gettext regarding the issue. The development team does not appear to have a Bugzilla to record these issues. If you know of an alternative contact point, please notify me.

Update #1 (23Jul07): As an indication of the file size savings, the en_GB locale on Ubuntu in the installation CD occupies about 424KB where in practice it should have been 48KB.

A full installation of Ubuntu with some basic KDE packages (only for the basic libraries, i.e. KBabel – (ls k* | wc -l = 499)) occupies about 26MB of space just for the translation files. When optimising in the MO files, the translation files occupy only 7MB. This is quite important because when someone installs for example the en_CA locale, all en_?? locales are added.

The reason why the reduction is more has to do with the message types that KDE uses. For example,

msgid “”
“_: Unknown State\n”
“Unknown”
msgstr “Unknown”

I cannot see a portable way to code the gettext-tools so that they understand that the above message can be easily omitted. For the above reduction to 7MB, KDE applications (k*) occupy 3.6MB. The non-KDE applications include GNOME, XFCE and GNU traditional tools. The biggest culprits in KDE are kstars (386KB) and kgeography (345KB).

Update #2 (23Jul07): (Thanks Deniz for the comment below on gweather!) The po-locations translations (gnome-applets/gweather) of all languages are combined together to generate a big XML file that can be found at usr/share/gnome-applets/gweather/Locations.xml (~15MB).

This file is not kept in memory while the gweather applet is running.
However, the file is parsed when the user opens the properties dialog to change the location.
I would say that the main problem here is the file size (15.8MB) that can be easily reduced when stripping copied messages. This file is included in any Linux distribution, whatever the locale.

The po-locations directory currently occupies 107MB and when copied messages are eliminated it occupies 78MB (a difference of 30MB). The generated XML file is in any case smaller (15.8MB without optimisation) because it does not include repeatedly the msgid lines for each language.

I regenerated the Locations.xml file with the optimised PO files and the resulting file is 7.6MB. This is a good reduction in file space and also in packaging size.

Update #3 (25Jul07): Posted a patch for gettext-tools/msgattrib.c. Sent an e-mail to the kde-i18n-doc mailing list and got good response and a valid argument for the proposed changes. Specifically, there is a case when one gives custom values to the LANGUAGE variable. This happens when someone uses the LANGUAGE variable with a value such as “es:fr” which means show me messages in Spanish and if something is untranslated show me in French. If a message has msgid==msgstr for Spanish but not for French, then it would show in French if we go along with the proposed optimisation.

1 comment

  • Deniz Koçak

    Some pot files like the gnome-applet-locations also contains more than 4000 line of string and almost all of them are written like the same in English. I have done just what you mentioned in your blog entry (copy&paste the original string). Now I am not sure if it is a good idea or not, to copy&paste the original strings if these are same in other languages??

Leave a Reply

%d bloggers like this: