Mon 12 May 2025
Software Localisation
American websites format the date as MM/DD/YYYY, and this can be confusing for Europeans. If I see the date 05/03/2025, I can't be sure if we're dealing with March or May.
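To make the ambiguity concrete, here is a small sketch that parses the same string under both conventions using Python's standard library. The variable names are only for illustration.

```python
from datetime import datetime

raw = "05/03/2025"

# The same characters, read under two regional conventions.
as_american = datetime.strptime(raw, "%m/%d/%Y")  # month first
as_european = datetime.strptime(raw, "%d/%m/%Y")  # day first

print(as_american.date().isoformat())  # 2025-05-03
print(as_european.date().isoformat())  # 2025-03-05
```

Both parses succeed, which is exactly the problem: nothing in the string itself tells you which one was meant.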
Localisation extends further than just the format of dates. Many things require localisation; the most obvious is language. If your site does not support the dominant language within a geography, you're creating a language barrier between you and your customers. In Typeform's case, it's their customers' customers.
Translating and localising your software opens your business to new markets where relying on English won't cut it. Providing your system in a locale that's familiar to the user allows your system to feel natural and trustworthy. Luckily for us, the internet has been around for more than 40 years and this is sort of an old problem. There's been a good effort put towards enabling multilingual support.
ID or string?
When translating software, the first thing you'll need to determine is how to identify text that requires translation.
There are two ways you can do this:
- Mapping a key to the text. This key will be used to look up the correct message given the user's preferred language. Something like this:
message_key: "MISSING_NAME_TEXT"
- Alternatively, provide the text as is and use that as the message key:
message: "You're missing your first name"
Systems have been written using both styles so there's no consensus on which one you should pick (I wish you luck in driving consensus in your own place of work). Here are some things to consider.
Subtle punctuation can change the whole meaning of a sentence. This is why systems tend to favour the entire sentence as the key for translation. Updating the sentence, even if you're just adding punctuation, should invalidate the translation or at least flag the translation so that it can be double checked.
It's also useful to keep the full text within the context of where it's being used. This way the developer or engineer can determine for themselves if it makes sense. It is harder to tell whether you're using the correct message if you're relying on message keys like "missing.name_text" and "missing.text_name"; the full text provides a clearer indication of the output.
Scaling the message keys can also be tricky as you'll need to
avoid name clashes. The best thing to do is use them with
some sort of namespacing, e.g. "signup.error.missing_name",
and redefine the key for every use-case, even if the full
text ends up being the same. This allows you to change each
text independently.
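The trade-off between the two styles can be sketched with plain dictionaries. Everything here is hypothetical: the catalogue names, the `translate` helper and the Norwegian sentence (which is illustrative and has not been reviewed by a native speaker).

```python
# Hypothetical Norwegian catalogues, one per identification style.
BY_KEY = {"signup.error.missing_name": "Du mangler fornavnet ditt"}
BY_TEXT = {"You're missing your first name": "Du mangler fornavnet ditt"}


def translate(catalog: dict[str, str], msgid: str) -> str:
    """Return the translation, or the msgid itself when none exists."""
    return catalog.get(msgid, msgid)


# Key style: the English wording can change without touching the key,
# so an out-of-date translation keeps being served without anyone noticing.
print(translate(BY_KEY, "signup.error.missing_name"))

# Text style: editing the sentence changes the lookup key, so the old
# translation stops matching and the text falls back to the source string.
print(translate(BY_TEXT, "You're missing your first name"))
print(translate(BY_TEXT, "You're missing your first name!"))
```

The last call returns the English sentence unchanged, which is the "invalidate the translation" behaviour described above: adding one exclamation mark is enough to flag the string for a second look.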
Localisation built in
For those of us gifted enough to be using a Unix-based system,
you might have access to gettext and xgettext in the
command line. These are tools used to translate
"natural language messages into the user's language, by
looking up the translation in a message catalog".[1]
Python has some built-in libraries which allow you to manage
internationalisation and localisation. I think very few
people who haven't dealt with localisation are aware of the
existence of gettext.
Localization for Python Applications
The Python gettext library provides an interface which
allows you to define your program in a core language and
uses a separate message catalog to look up message
translations. As an example, we can define a message that
requires localisation like so:
from gettext import gettext as _
_("Welcome!")
Using xgettext we can construct a .pot file, which
will be used as a template for our language catalogues.
xgettext -o messages.pot --language=Python src/*.py
The .pot file should look like this after running xgettext:
#: main.py:3
msgid "Welcome!"
msgstr ""
It's pretty neat that it has provided us the file name and
the line number for the text. Although this is more useful in
larger codebases, we can use it to track redundant translation
strings. You'll also notice that it's using the full string
as the msgid instead of assigning it to a code or number.
From this we create .po files (unrelated to the
Teletubby)[2]. These are the concrete
versions of the .pot file which contain the translations.
If we were to make a .po file for Norwegian, it would look
like:
#: main.py:3
msgid "Welcome!"
msgstr "Velkomst"
Now that we have a localised form of our language catalogue,
we can use msgfmt to compile a binary version of our .po
file, like so:
msgfmt -o messages.mo no_NO/messages.po
This command takes our no_NO (Norwegian) messages,
compiles a precomputed hash table of msgid -> msgstr and
outputs it to the .mo file. These files are stored in a
binary format, so they're not human readable, but they are
efficient to load into the application at start up.
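To get a feel for what ends up inside a .mo file, here is a rough sketch that packs a catalogue into the GNU .mo layout by hand and loads it back with Python's gettext module. `build_mo` is a hypothetical helper written for this post, not part of any library, and it leaves the optional hash table empty, so it is a simplification of what msgfmt produces.

```python
import gettext
import io
import struct


def build_mo(catalog):
    """Pack msgid -> msgstr pairs into the GNU .mo layout (hash table left empty)."""
    # The empty msgid holds catalogue metadata, including the charset.
    entries = {"": "Content-Type: text/plain; charset=UTF-8\n", **catalog}
    pairs = sorted((k.encode("utf-8"), v.encode("utf-8")) for k, v in entries.items())

    count = len(pairs)
    ids_index_at = 7 * 4                      # header: seven 32-bit integers
    strs_index_at = ids_index_at + count * 8  # one (length, offset) pair per msgid
    blob_at = strs_index_at + count * 8       # string data follows both index tables

    index, blob = [], b""
    for column in (0, 1):                     # all msgids first, then all msgstrs
        for pair in pairs:
            raw = pair[column]
            index.append(struct.pack("<2I", len(raw), blob_at + len(blob)))
            blob += raw + b"\0"               # strings are NUL-terminated

    header = struct.pack("<7I", 0x950412DE, 0, count, ids_index_at, strs_index_at, 0, 0)
    return header + b"".join(index) + blob


mo_bytes = build_mo({"Welcome!": "Velkomst"})
norwegian = gettext.GNUTranslations(io.BytesIO(mo_bytes))
print(norwegian.gettext("Welcome!"))  # Velkomst
```

The reader only has to follow a table of lengths and offsets to find each string, which is why loading these files at start up is cheap compared with parsing the text-based .po format.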
When you start accumulating a lot of these catalogues they might require their own system to manage. These systems are also useful interfaces for the person providing the translation. Examples include POEditor and Lokalise.
People that work in localisation and translation will be
familiar with .po files since they're often the file
format used with translation software.
Translation within context
If our app supports more than one localisation we have to indicate which localisation should be returned to a user. For an API we can set the user's locale within the context of a request.
Flask has an extension called Flask-Babel which allows you to set this locale.
So if a Norwegian were to hit our API, we'd have the client set
a header on the request: Content-Language: no_NO [3]. On
returning the response, all the strings instantiated
with gettext will be translated into Norwegian.
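The selection step might look something like the sketch below. The supported locales, the default and the `select_locale` name are all assumptions for illustration; in Flask-Babel you would register a function along these lines as the locale selector and read the header from the request object. The header name follows this post, though many clients send Accept-Language for the same purpose.

```python
SUPPORTED_LOCALES = {"en_GB", "no_NO"}  # hypothetical set for this sketch
DEFAULT_LOCALE = "en_GB"                # hypothetical app default


def select_locale(headers: dict[str, str]) -> str:
    """Pick a supported locale from the request headers, else the default."""
    requested = headers.get("Content-Language", "").strip()
    # Accept both the "no-NO" and "no_NO" spellings.
    requested = requested.replace("-", "_")
    return requested if requested in SUPPORTED_LOCALES else DEFAULT_LOCALE


print(select_locale({"Content-Language": "no_NO"}))  # no_NO
print(select_locale({"Content-Language": "fr_FR"}))  # en_GB
print(select_locale({}))                             # en_GB
```

Falling back to a default rather than failing means an unsupported or missing header still produces a usable response.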
There are some cases where you'll need to switch the locale context mid request or mid process, for example when a Norwegian user triggers an alert to an English user. We can instantiate a context manager with Flask-Babel, which will translate the strings to a specified locale:
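from flask_babel import force_locale

def handler(to_user):
    # <norwegian scope>: strings here use the request's locale
    with force_locale(to_user.locale):
        # <english scope>: strings here use the recipient's locale
        send_email(to_user)  # send_email is defined elsewhere in the app
    # <norwegian scope>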
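The idea behind a locale-forcing context manager can be sketched with the standard library alone. This is not Flask-Babel's implementation, just a minimal illustration of the pattern; `current_locale` and `force_locale_sketch` are names invented for the example.

```python
from contextlib import contextmanager
from contextvars import ContextVar

# Locale for the current request; "no_NO" stands in for the Norwegian caller.
current_locale = ContextVar("current_locale", default="no_NO")


@contextmanager
def force_locale_sketch(locale: str):
    """Temporarily override the active locale, restoring it on exit."""
    token = current_locale.set(locale)
    try:
        yield
    finally:
        current_locale.reset(token)


seen = [current_locale.get()]
with force_locale_sketch("en_GB"):
    seen.append(current_locale.get())
seen.append(current_locale.get())
print(seen)  # ['no_NO', 'en_GB', 'no_NO']
```

Restoring the previous value in a `finally` block matters: if sending the email raises, the rest of the request still runs under the caller's locale.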
Plural Forms
Language is weird and there's nearly an edge-case for
everything. One of these cases that gettext supports is
defining rules for plural forms. For example, in English we
might say "one apple" and "two apples". However, in a
language like Hebrew the plural form for two apples can't be
used for three apples, so to account for this gettext
provides ngettext, which is used like so:
from gettext import ngettext as n_
n_("%(num)d apple", "%(num)d apples", 3) % {"num": 3}
This allows gettext to pull the correct plural form given
the int 3 and then formats the returned string,
replacing %(num)d with 3.
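Languages with more than two forms are where this pays off. The sketch below uses Polish as a stand-in, with the plural rule commonly listed for it in gettext catalogues rewritten as plain Python. The function and list names are invented for the example; a real catalogue would carry the rule in its header and the forms in the .po file.

```python
def polish_plural_index(n: int) -> int:
    """Plural rule commonly listed for Polish in gettext catalogues."""
    if n == 1:
        return 0
    if 2 <= n % 10 <= 4 and not 10 <= n % 100 < 20:
        return 1
    return 2


# Forms indexed by the rule above; a real catalogue stores these per msgid.
APPLE_FORMS = ["%(num)d jabłko", "%(num)d jabłka", "%(num)d jabłek"]


def apples(n: int) -> str:
    """Pick the plural form for n, then substitute the number."""
    return APPLE_FORMS[polish_plural_index(n)] % {"num": n}


for n in (1, 2, 5, 12, 22):
    print(apples(n))
```

Note that 2 and 22 share a form while 12 does not, which is why the count has to be passed to ngettext rather than choosing between singular and plural yourself.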
Lazy strings
If you're reusing the same string across your application
and defining it at module level, this string will be translated
as soon as the module is imported. At that point there is no
request locale yet, so the lookup will always fall back to your
app's default locale and your strings will not be translated
for the user. To get around this we use something called
lazy_gettext. This allows us to define the string and reuse it
across the application, as lazy_gettext will keep a reference
to the msgid and defer translation until the text is needed.
You can see support for lazy translation in the Django documentation, where the function is called gettext_lazy.
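A minimal sketch of the difference between eager and lazy lookup, using an in-memory catalogue instead of a real framework. `LazyString`, `translate`, `state` and the default locale are all hypothetical names for this example; the Norwegian string is the one from the .po file earlier.

```python
CATALOGS = {"no_NO": {"Welcome!": "Velkomst"}}
state = {"locale": "en_GB"}  # hypothetical app default, changed per request


def translate(msgid: str) -> str:
    """Look the msgid up under whichever locale is active right now."""
    return CATALOGS.get(state["locale"], {}).get(msgid, msgid)


class LazyString:
    """Keep the msgid and only translate when the text is rendered."""

    def __init__(self, msgid: str):
        self.msgid = msgid

    def __str__(self) -> str:
        return translate(self.msgid)


EAGER_GREETING = translate("Welcome!")  # resolved now, under the default locale
LAZY_GREETING = LazyString("Welcome!")  # resolved each time it is rendered

state["locale"] = "no_NO"  # a Norwegian request arrives later
print(EAGER_GREETING)      # Welcome!
print(str(LAZY_GREETING))  # Velkomst
```

The eager constant is frozen in English because it was looked up before any request existed, while the lazy one follows whatever locale is active when it is finally turned into text.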
Wikipedia
Wikipedia manages content in over 300 languages. There are numerous volunteers who help to translate the wiki into other languages, and they do this through an interface called translatewiki.net. There's an entire team managing the infra and tools used in localisation.
Similar to Wikipedia, I've seen interfaces used to
manage and update translation files, as well as a process
that can be triggered automatically or on a schedule to
update the .mo files that a service references. After
updating the .mo file you can automatically roll out a
deployment. The new deployment should then load the new
.mo files into memory when the service is instantiated.
You don't need to be working on a system the scale of Wikipedia to include translations. You can rely on a user's system locale to translate CLI tools. My fiancée's system is set to Norwegian; if I ever write a CLI for her, I think it would be fun to provide a Norwegian interface.
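For a CLI, the system locale usually arrives through environment variables. The sketch below mirrors the order in which Python's gettext module consults them when no language is given explicitly; `preferred_languages` is a hypothetical helper that takes the environment as a parameter so it is easy to try out.

```python
import os


def preferred_languages(env=None) -> list[str]:
    """Return candidate languages in the order the gettext module checks for them."""
    env = os.environ if env is None else env
    languages = []
    for name in ("LANGUAGE", "LC_ALL", "LC_MESSAGES", "LANG"):
        value = env.get(name)
        if value:
            # LANGUAGE may hold several colon-separated preferences.
            languages = value.split(":")
            break
    if "C" not in languages:
        languages.append("C")  # final fallback: untranslated messages
    return languages


print(preferred_languages({"LANG": "nb_NO.UTF-8"}))
print(preferred_languages({"LANGUAGE": "nb:en", "LANG": "en_GB.UTF-8"}))
```

In practice you rarely need to write this yourself: calling gettext.translation without a languages argument performs the same lookup against the real environment.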
[1] $ man gettext
[2] I'd like to draw attention to the fact that I stole this joke from myself. I don't want to draw attention to the poorly performed lightning talk I did. It was my first time and I tried to fit this entire post into 5 minutes. Link for posterity