Mon 12 May 2025

Software Localisation

American websites format the date as MM/DD/YYYY and this can be confusing for Europeans. If I see the date 05/03/2025, I can't be sure if we're dealing with March or May.
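To make the ambiguity concrete, here is a small sketch using Python's standard datetime module. The same string parses without error under both conventions, and the two results are two months apart. The variable names are only for illustration.

```python
from datetime import datetime

raw = "05/03/2025"

# American convention: month first.
as_american = datetime.strptime(raw, "%m/%d/%Y")
# Common European convention: day first.
as_european = datetime.strptime(raw, "%d/%m/%Y")

print(as_american.date().isoformat())  # 2025-05-03
print(as_european.date().isoformat())  # 2025-03-05
```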

Localisation extends further than just the format of dates. Many things require localisation, and the most obvious is language. If your site does not support the dominant language within a geography, you're creating a language barrier between you and your customers. In Typeform's case, that means their customers' customers.

Translating and localising your software opens your business to new markets where relying on English won't cut it. Providing your system in a locale that's familiar to the user allows your system to feel natural and trustworthy. Luckily for us, the internet has been around for more than 40 years, and this is sort of an old problem. A lot of effort has gone into enabling multilingual support.

ID or string?

When translating software, the first thing you'll need to determine is how to identify text that requires translation.

There are two ways you can do this:

  1. Mapping a key to the text. This key will be used to look up the correct message given the user's preferred language. Something like this:

    message_key: "MISSING_NAME_TEXT"

  2. Alternatively, provide the text as is and use it as the message key:

    message: "You're missing your first name"

Systems have been written using both styles so there's no consensus on which one you should pick (I wish you luck in driving consensus in your own place of work). Here are some things to consider.

Subtle punctuation can change the whole meaning of a sentence. This is why systems tend to favour the entire sentence as the key for translation. Updating the sentence, even if you're just adding punctuation, should invalidate the translation or at least flag the translation so that it can be double checked.

It's also useful to keep the full text in the context where it's used, so the developer or engineer can judge for themselves whether it makes sense. It is harder to tell whether you're using the correct message if you're relying on message keys like "missing.name_text" and "missing.text_name". The full text gives a clearer indication of the output.

Scaling the message keys can also be tricky, as you'll need to avoid name clashes. The best thing to do is to namespace them, e.g. "signup.error.missing_name", and to define a separate key for every use-case, even if the full text ends up being the same. This allows you to change each text independently.
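The two styles can be sketched with plain dictionaries. This is only an illustration: the catalogue contents, the locale codes and the function names are made up for the example and do not come from any particular library.

```python
# Style 1: look up by a namespaced key. Each use-case owns its key,
# even when the text happens to be identical.
BY_KEY = {
    "en": {"signup.greeting": "Welcome!", "login.greeting": "Welcome!"},
    "no": {"signup.greeting": "Velkommen!", "login.greeting": "Velkommen!"},
}

# Style 2: the source text is the key, so the source language needs no catalogue.
BY_TEXT = {
    "no": {"Welcome!": "Velkommen!"},
}


def translate_by_key(key, locale, default="en"):
    catalogue = BY_KEY.get(locale, BY_KEY[default])
    # Fall back to the default locale, then to the key itself so a gap is visible.
    return catalogue.get(key, BY_KEY[default].get(key, key))


def translate_by_text(text, locale):
    # An unknown locale or an unknown string falls back to the source text.
    return BY_TEXT.get(locale, {}).get(text, text)


print(translate_by_key("signup.greeting", "no"))  # Velkommen!
print(translate_by_text("Welcome!", "no"))        # Velkommen!
print(translate_by_text("Welcome", "no"))         # Welcome
```

The last call shows the invalidation behaviour described above: dropping the exclamation mark changes the key, so the old translation no longer matches.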

Localisation built in

For those of us gifted enough to be using a Unix-based system, you might have access to gettext and xgettext on the command line. These are tools used to translate "natural language messages into the user's language, by looking up the translation in a message catalog".1

Python has some built-in libraries which allow you to manage internationalisation and localisation. I think very few people are aware that gettext exists unless they've dealt with localisation.

Localization for Python Applications

The Python gettext library provides an interface which allows you to write your program in a core language and use a separate message catalogue to look up translations. As an example, we can mark a message that requires localisation like so:

from gettext import gettext as _

_("Welcome!")

Using xgettext we can construct a .pot file, which will be used as a template for our language catalogues.

xgettext -o messages.pot --language=Python src/*.py

The .pot file should contain an entry like this after running xgettext (alongside a generated header):

#: main.py:3
msgid "Welcome!"
msgstr ""

It's pretty neat that it has provided us with the file name and the line number for the text. Although this is more useful in larger codebases, we can use it to track redundant translation strings. You'll also notice that it's using the full string as the msgid instead of assigning it a code or number.
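As a rough sketch of how those location comments could be put to work, the snippet below groups the references in a .pot file by msgid. It only handles single-line msgids, and the signup.py entry is a made-up example rather than output from a real project.

```python
from collections import defaultdict

POT_TEXT = """\
#: main.py:3
msgid "Welcome!"
msgstr ""

#: signup.py:10 signup.py:42
msgid "You're missing your first name"
msgstr ""
"""


def references_by_msgid(pot_text):
    """Map each single-line msgid to the source locations listed above it."""
    refs = defaultdict(list)
    pending = []
    for line in pot_text.splitlines():
        if line.startswith("#: "):
            # A reference line may list several file:line locations.
            pending.extend(line[3:].split())
        elif line.startswith("msgid "):
            # Drop the keyword and the surrounding double quotes.
            msgid = line[len("msgid "):].strip()[1:-1]
            refs[msgid].extend(pending)
            pending = []
    return dict(refs)


for msgid, locations in references_by_msgid(POT_TEXT).items():
    print(msgid, "->", locations)
```

A msgid whose locations all point at files that no longer exist would be a candidate for removal from the catalogue.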

From this we create .po files (unrelated to the Teletubby).2 These are the concrete versions of the .pot file which contain the translations. If we were to make a .po file for Norwegian, it would look like this:

#: main.py:3
msgid "Welcome!"
msgstr "Velkommen!"

Now that we have a localised form of our language catalogue we can use msgfmt to compile a binary version of our .po file, like so:

msgfmt -o messages.mo no_NO/messages.po

This command takes our no_NO (Norwegian) messages, compiles a precomputed hash table of msgid -> msgstr and writes it to the .mo file. These files are stored in a binary format, so they're not human readable, but they are efficient to load into the application at start-up.

When you start accumulating a lot of these catalogues, they might require their own system to manage. These systems also give the person providing the translation a useful interface. Examples include POEditor and Lokalise.

People that work in localisation and translation will be familiar with .po files since they're often the file format used with translation software.

Translation within context

If our app supports more than one localisation we have to indicate which localisation should be returned to a user. For an API we can set the user's locale within the context of a request.

Flask has an extension called Flask-Babel which allows you to set this locale. So if a Norwegian was to hit our API, we'd have the client set the header Content-Language: no_NO on the request.3 When the response is returned, all the strings marked with gettext will be translated into Norwegian.
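The decision a locale selector has to make can be sketched in plain Python. The supported locales, the default and the function name below are assumptions for the example. With Flask-Babel, a function like this would read the header from the request and be registered as the locale selector, and the exact registration call depends on the Flask-Babel version.

```python
SUPPORTED_LOCALES = {"en_GB", "no_NO"}  # hypothetical list for this sketch
DEFAULT_LOCALE = "en_GB"


def pick_locale(header_value):
    """Return a supported locale for the header value, or the default."""
    if not header_value:
        return DEFAULT_LOCALE
    # Headers often carry "no-NO" while catalogues are usually named "no_NO".
    candidate = header_value.strip().replace("-", "_")
    return candidate if candidate in SUPPORTED_LOCALES else DEFAULT_LOCALE


print(pick_locale("no_NO"))  # no_NO
print(pick_locale("fr_FR"))  # en_GB
```

Falling back to a default keeps the API usable for clients that send no header or ask for a locale without a catalogue.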

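There are some cases where you'll need to switch the locale context mid-request or mid-process, for example when a Norwegian user triggers an alert to an English user. Flask-Babel provides a context manager, force_locale, which translates strings to a specified locale:

from flask_babel import force_locale

def handler(to_user):
    # <norwegian scope>: strings use the requesting user's locale
    with force_locale(to_user.locale):
        # <english scope>: strings use the recipient's locale
        # send_email is assumed to be defined elsewhere in the app
        send_email(to_user)
    # <norwegian scope>: the original locale is restored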
Plural Forms

Language is weird and there's an edge case for nearly everything. One of the cases gettext supports is defining rules for plural forms. For example, in English we might say "one apple" and "two apples". However, in a language like Hebrew the plural form for two apples can't be used for three apples. To account for this, gettext provides ngettext, which is used like so:

from gettext import ngettext as n_

n_("%(num)d apple", "%(num)d apples", 3) % {"num": 3}

This allows gettext to pull the correct plural form given the int 3, and then formats the returned string, replacing %(num)d with 3.
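A quick way to see the selection happen is to call ngettext with different counts. The sketch below assumes no catalogue is installed, in which case Python's ngettext falls back to the English rule: the singular for 1 and the plural otherwise. For other languages the rule comes from the Plural-Forms header of the catalogue. The describe_apples helper is a name made up for the example.

```python
from gettext import ngettext as n_


def describe_apples(count):
    # ngettext picks the form; the % formatting then fills in the number.
    return n_("%(num)d apple", "%(num)d apples", count) % {"num": count}


print(describe_apples(1))  # 1 apple
print(describe_apples(3))  # 3 apples
```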

Lazy strings

If you're reusing the same string across your application and defining it at module level, the string will be translated as soon as the module is imported. At that point no user locale is known, so the lookup falls back to your app's default locale and the string will not be translated for the user. To get around this we use something called lazy_gettext. This allows us to define the string once and reuse it across the application, as lazy_gettext keeps a reference to the msgid and defers translation until the text is needed.
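The deferral can be illustrated with a small wrapper class. This is a simplified sketch of the idea, not the implementation used by any library. LazyString, the catalogue and the active_locale holder are all hypothetical.

```python
class LazyString:
    """Hold a msgid and translate only when the text is actually rendered."""

    def __init__(self, msgid, translate):
        self.msgid = msgid
        self._translate = translate

    def __str__(self):
        return self._translate(self.msgid)


# Hypothetical stand-ins for a request-scoped locale and a message catalogue.
active_locale = {"value": "en_GB"}
CATALOGUE = {"no_NO": {"Welcome!": "Velkommen!"}}


def translate(msgid):
    return CATALOGUE.get(active_locale["value"], {}).get(msgid, msgid)


# Defined once at module level, before any user locale is known.
GREETING = LazyString("Welcome!", translate)

# Later, inside a request, the locale is set and the string is rendered.
active_locale["value"] = "no_NO"
print(str(GREETING))  # Velkommen!
```

Because the lookup runs inside __str__, the same module-level object renders differently depending on the locale that is active when it is used.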

Flask-Babel exposes this as lazy_gettext, and you can see the same idea in the Django documentation, where the function is called gettext_lazy.

Wikipedia

Wikipedia manages content in over 300 languages. Numerous volunteers help to translate the wiki into other languages, and they do this through an interface called translatewiki.net. There's an entire team managing the infra and tools used in localisation.

Similar to Wikipedia, I've seen interfaces used to manage and update translation files as well as a process that can be triggered automatically or on a schedule to update the .mo files that a service references. After updating the .mo file you can automatically roll out a deployment. The new deployment should then load the new .mo files into memory when the service is instantiated.

You don't need to be working on a system at the scale of Wikipedia to include translations. You can rely on a user's system locale to translate CLI tools. My fiancée's system is set to Norwegian; if I ever write a CLI for her, I think it would be fun to provide a Norwegian interface.


  1. $ man gettext 

  2. I'd like to draw attention to the fact that I stole this joke from myself. I don't want to draw attention to the poorly performed lightning talk I did. It was my first time and I tried to fit this entire post into 5 minutes. Link for posterity

  3. Mozilla: Content-Language 
