Localization plays a central role in adapting an open source project to the needs of users around the world. Besides coding, language translation is one of the main ways people around the world contribute to and engage with open source projects.
There are tools specific to the language services industry (surprised to hear that’s a thing?) that enable a smooth, high-quality localization process. Localization tools fall into several categories:
Computer-assisted translation (CAT) tools
Machine translation (MT) engines
Translation management systems (TMS)
Terminology management tools
Localization automation tools
The proprietary versions of these tools can be quite expensive. A single license for SDL Trados Studio (the leading CAT tool) can cost thousands of euros, and even then it is only useful for one individual, and the customizations are limited (and psst, they cost more, too). Open source projects that want to localize into many languages and streamline their localization processes should look to open source tools, which save money and offer the flexibility to customize. I’ve compiled this high-level survey of many of the open source localization tools out there to help you decide what to use.
The OmegaT CAT tool. Here you see the translation memory (Fuzzy Matches) and terminology recall (Glossary) features at work. OmegaT is licensed under the GNU General Public License version 3+.
CAT tools are a staple of the language services industry. As the name implies, CAT tools help translators perform the tasks of translation, bilingual review, and monolingual review as quickly as possible and with the highest possible consistency through reuse of translated content (also known as translation memory). Translation memory and terminology recall are two central features of CAT tools. They enable a translator to reuse previously translated content from old projects in new projects. This allows them to translate a high volume of words in a shorter amount of time while maintaining a high level of quality through terminology and style consistency. This is especially handy for localization, as much of the text in software and web UIs is the same across platforms and applications. CAT tools are standalone pieces of software, though, requiring translators who use them to work locally and merge their work to a central repository.
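To make the translation memory idea concrete, here’s a minimal sketch in Python that scores a new source segment against previously translated segments and surfaces anything above a “fuzzy match” threshold. The segments and the 75% threshold are invented for illustration; real CAT tools use far more sophisticated matching and storage.

```python
from difflib import SequenceMatcher

# A tiny, invented translation memory: previously translated source segments
# mapped to their stored target translations (English -> Spanish here).
translation_memory = {
    "Save your changes before closing the window.":
        "Guarda tus cambios antes de cerrar la ventana.",
    "Your changes could not be saved.":
        "No se pudieron guardar tus cambios.",
}

def fuzzy_matches(new_segment, tm, threshold=0.75):
    """Return (score, source, target) tuples for stored segments similar to new_segment."""
    matches = []
    for source, target in tm.items():
        score = SequenceMatcher(None, new_segment.lower(), source.lower()).ratio()
        if score >= threshold:
            matches.append((score, source, target))
    return sorted(matches, reverse=True)

# A new segment that is close to, but not identical with, a stored one.
for score, source, target in fuzzy_matches(
        "Save your changes before closing this window.", translation_memory):
    print(f"{score:.0%} match: {source!r} -> {target!r}")
```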
MT engines automate the transfer of text from one language to another. MT is broken up into three primary methodologies: rules-based, statistical, and neural (which is the new player). The most widespread MT methodology is statistical, which (in very brief terms) draws conclusions about the interconnectedness of a pair of languages by running statistical analyses over annotated bilingual corpus data using n-gram models. When a new source language phrase is introduced to the engine for translation, it looks within its analyzed corpus data to find statistically relevant equivalents, which it produces in the target language. MT can be useful as a productivity aid to translators, changing their primary task from translating a source text to a target text to post-editing the MT engine’s target language output. I don’t recommend using raw MT output in localizations, but if your community is trained in the art of post-editing, MT can be a useful tool to help them make large volumes of contributions.
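If you are curious what that statistical lookup amounts to, here’s a toy sketch in Python. The phrase table and probabilities are invented for illustration; a real engine derives them from enormous aligned corpora and also has to segment and reorder sentences, which this sketch skips entirely.

```python
# A toy phrase table of the kind a statistical engine derives from aligned
# bilingual corpora: source phrase -> candidate translations with estimated
# probabilities. Every figure here is invented for illustration only.
phrase_table = {
    "open source": [("código abierto", 0.82), ("fuente abierta", 0.18)],
    "the project": [("el proyecto", 0.95), ("el plan", 0.05)],
}

def translate_phrases(phrases):
    """Greedily pick the highest-probability candidate for each known phrase."""
    output = []
    for phrase in phrases:
        candidates = phrase_table.get(phrase, [(f"<untranslated: {phrase}>", 1.0)])
        best, _probability = max(candidates, key=lambda c: c[1])
        output.append(best)
    return " ".join(output)

# Segmenting a sentence into phrases is itself a hard problem, so it is skipped here.
print(translate_phrases(["open source", "the project"]))  # -> código abierto el proyecto
```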
Mozilla’s Pontoon translation management system user interface. With WYSIWYG editing, you can translate content in context and simultaneously perform translation and quality assurance. Pontoon is licensed under the BSD 3-clause New or Revised License.
TMS tools are web-based platforms that allow you to manage a localization project and enable translators and reviewers to do what they do best. Most TMS tools aim to automate many manual parts of the localization process by including version control system (VCS) integrations, cloud services integrations, and project reporting, as well as the standard translation memory and terminology recall features. These tools are best suited to community localization or translation projects, as they allow large groups of translators and reviewers to contribute to a project. Some also use a WYSIWYG editor to give translators context for their translations. This added context improves translation accuracy and shortens the gap between translating a string and reviewing it in the user interface.
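To give a flavor of the kind of VCS-to-TMS automation these platforms enable, here’s a rough sketch that pushes new source strings from a repository checkout to a TMS web API. The endpoint, token, and payload shape are entirely hypothetical; a real platform such as Pontoon defines its own API, so treat this as an outline rather than working integration code.

```python
import json

import requests  # third-party HTTP client: pip install requests

# Hypothetical settings -- substitute your TMS's real API and credentials.
TMS_ENDPOINT = "https://tms.example.com/api/projects/my-project/strings"
API_TOKEN = "replace-me"

def push_new_strings(strings_path):
    """Send source strings from a repository checkout to a (hypothetical) TMS API."""
    with open(strings_path, encoding="utf-8") as f:
        strings = json.load(f)  # e.g. {"save_button": "Save", ...}

    response = requests.post(
        TMS_ENDPOINT,
        json={"strings": strings},
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()
    print(f"Uploaded {len(strings)} source strings for translation.")

# Typically run from a commit hook or CI job whenever source strings change:
# push_new_strings("locales/en/strings.json")
```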
Brigham Young University’s BaseTerm tool displays the new-term entry dialogue window. BaseTerm is licensed under the Eclipse Public License.
Terminology management tools give you a GUI to create terminology resources (known as termbases) to add context and ensure translation consistency. These resources are consumed by CAT tools and TMS platforms to aid translators in the process of translation. For languages in which a term could be either a noun or a verb depending on the context, terminology management tools allow you to add metadata for a term that labels its gender, part of speech, and monolingual definition, as well as context clues. Terminology management is often an underserved, but no less important, part of the localization process. In both the open source and proprietary ecosystems, only a small handful of options are available.
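As a small illustration of the metadata a termbase entry carries, here’s a sketch using a Python dataclass. The fields mirror the ones mentioned above (part of speech, gender, definition, context clues), and the example entry is invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TermEntry:
    """One invented termbase entry carrying the metadata fields discussed above."""
    term: str
    language: str
    part_of_speech: str            # e.g. "noun" or "verb", to disambiguate usage
    gender: Optional[str] = None   # relevant for gendered target languages
    definition: str = ""           # monolingual definition
    context: str = ""              # an example sentence showing the term in use
    forbidden_variants: List[str] = field(default_factory=list)

entry = TermEntry(
    term="build",
    language="en",
    part_of_speech="noun",
    definition="A compiled, runnable version of the software.",
    context="Download the latest build from the releases page.",
    forbidden_variants=["compilation"],
)
print(entry.term, "-", entry.part_of_speech)
```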
The Ratel and Rainbow components of the Okapi Framework. Photo courtesy of the Okapi Framework. The Okapi Framework is licensed under the Apache License version 2.0.
Localization automation tools streamline the way you process localization data. This can include text extraction, file format conversion, tokenization, VCS synchronization, term extraction, pre-translation, and various quality checks over common localization file formats. In some tool suites, such as the Okapi Framework, you can create automation pipelines for performing various localization tasks. These pipelines are useful in a variety of situations, but their main value is the time they save by automating repetitive tasks. They can also move you closer to a continuous localization process.
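Here’s a sketch of what one small step in such a pipeline might look like: take extracted source strings, pre-translate the ones already in a translation memory, and hand the rest off for human translation. The file layout and TM contents are invented; frameworks like the Okapi Framework implement much richer versions of each of these steps.

```python
import json

# Invented translation memory of previously approved translations (EN -> ES).
translation_memory = {
    "Save": "Guardar",
    "Cancel": "Cancelar",
}

def pretranslate(source_path, out_path):
    """Split extracted strings into pre-translated and still-to-translate sets."""
    with open(source_path, encoding="utf-8") as f:
        source_strings = json.load(f)  # e.g. {"save_button": "Save", ...}

    translated, pending = {}, {}
    for key, text in source_strings.items():
        if text in translation_memory:
            translated[key] = translation_memory[text]  # exact TM hit
        else:
            pending[key] = text                         # needs a human translator

    with open(out_path, "w", encoding="utf-8") as f:
        json.dump({"translated": translated, "pending": pending}, f,
                  ensure_ascii=False, indent=2)
    print(f"Pre-translated {len(translated)} strings; {len(pending)} still to do.")

# pretranslate("locales/en/strings.json", "locales/es/pretranslated.json")
```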
Localization is most powerful and effective when done in the open. These tools should give you and your communities the power to localize your projects into as many languages as humanly possible.
He’s been called “punctuation’s answer to Banksy”. A self-styled grammar vigilante who spends his nights surreptitiously correcting apostrophes on shop signs and billboards. The general consensus is that he’s a modern-day hero – a mysterious crusader against the declining standards of English. But his exploits represent an altogether darker reality.
The man himself is not particularly offensive. In a BBC Radio 4 report, he comes across as a reasonable person who simply feels a compulsion to quietly make a difference to what matters to him. He doesn’t ridicule, he doesn’t court publicity, he simply goes out and adds or removes apostrophes as required. And he does it with care, usually.
So what’s the problem? The problem lies in what this kind of behaviour represents and therefore normalises. In championing our vigilante, we are saying that it’s okay to pull people up on their use of language. It gives people the confidence to unleash their own pet peeves onto the world, however linguistically dubious.
The grammar vigilante himself appears to have a specific type of target, and his approach is nothing if not considerate. However, there is another type of pedant who is not so subtle or self-aware. Some people think nothing of highlighting inconsistent punctuation wherever they might see it, however innocuous or irrelevant it might be (apostrophes rarely actually disambiguate – after all, we get along fine without them in speech).
Never mind that it’s a handwritten notice in a shop window, written by someone for whom English is a second (or third, or fourth) language. Never mind that it’s a leaflet touting for work from someone who didn’t get the chance to complete their education. They need to be corrected and/or posted online for others to see. Otherwise, how will anybody learn?
After all, apostrophes are easy. If people would just take a bit of time to learn the rules, then there wouldn’t be any mistakes. For example, everybody knows that apostrophes are used to indicate possession. So the car belongs to Lynda, the car is Lynda’s. But what about the car belongs to her, the car is her’s? Of course not, we don’t use apostrophes with pronouns (although this was quite common in Shakespeare’s time) as they each have a possessive form of their own. Except for one, that is, which still needs one: one does one’s duty. “It” doesn’t need one though – “it’s” is something else entirely.
Then there’s the question of showing possession with nouns already ending in “s”: Chris’s cat or Chris’ cat? Jess’s decision or Jess’ decision? Or plural nouns ending in “s”: The princesses’s schedule or the princesses’ schedule? I don’t remember it being this difficult in the 1980’s/1980s/’80s/80s/80’s.
We definitely don’t use apostrophes to indicate plurals, something that routinely trips up the fabled greengrocer’s with its potato’s (although it was once seen as correct to use apostrophes with some words ending in a vowel). But what about when we need to refer to dotting the i’s and crossing the t’s, or someone makes a sign saying CD’S £5.00?
Clever clogs
The point is, while some are clear, many of the rules around apostrophes are not as transparent as some people would have us believe. This is largely because they are not actually rules at all, but conventions. And conventions change over time (see David Crystal’s excellent book for a detailed history).
When things are open to change, there will inevitably be inconsistencies and contradictions. These inconsistencies surround us every day – just look at the London Underground stations of Earl’s Court and Barons Court, or St James’s Park in London, and St James’ Park in Newcastle. Or business names such as McDonald’s, Lloyds Bank, and Sainsbury’s. Is it any surprise people are confused?
Of course, all of these conventions are learnable or available to be looked up. But if people haven’t had the opportunity to learn them, or do not have the skills or awareness to look them up, what gives other people the right to criticise? Are those who point out mistakes really doing it to educate, or are they doing it to highlight their own superior knowledge? Are they judging the non-standard punctuation or the sub-standard person?
Picking on someone because of their language is always a cowardly attack. Linguist Deborah Cameron makes the point that this is still the case even when highlighting the poor linguistic skills of bigots and racists on social media. Tempting as it is to call out a racist on their inability to spell or punctuate, by doing so we are simply replacing one prejudice with another, and avoiding the actual issue. As she puts it: “By all means take issue with bigots – but for their politics, not their punctuation.”
Apostrophes matter, at least in certain contexts. Society deems it important that job applications, essays, notices and the like adhere to the current conventions of apostrophe usage. For this reason, it is right that we teach and learn these conventions.
But fetishising the apostrophe as if its rules are set in stone, and then fostering an environment in which it is acceptable to take pleasure in uncovering other people’s linguistic insecurities is not okay. The grammar (punctuation?) vigilante of Bristol is relatively harmless. But he is the unassuming face of a much less savoury world of pedantry.
ARAB newspapers have a reputation, partly deserved, for tamely taking the official line. On any given day, for example, you might read that “a source close to the Iranian Foreign Ministry told Al-Hayat that ‘Tehran will continue to abide by the terms of the nuclear agreement as long as the other side does the same.’” But the exceptional thing about this unexceptional story is that, thanks to Google, English-speaking readers can now read this in the Arab papers themselves.
In the past few months free online translators have suddenly got much better. This may come as a surprise to those who have tried to make use of them in the past. But in November Google unveiled a new version of Translate. The old version, called “phrase-based” machine translation, worked on chunks of a sentence separately, with an output that was usually choppy and often inaccurate.
The new system still makes mistakes, but these are now relatively rare, where once they were ubiquitous. It uses an artificial neural network, linking digital “neurons” in several layers, each one feeding its output to the next layer, in an approach that is loosely modelled on the human brain. Neural-translation systems, like the phrase-based systems before them, are first “trained” on huge volumes of text translated by humans. But the neural version takes each word, and uses the surrounding context to turn it into a kind of abstract digital representation. It then tries to find the closest matching representation in the target language, based on what it has learned before. Neural translation handles long sentences much better than previous versions did.
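As a greatly simplified illustration of that “closest matching representation” idea, here is a short Python sketch that represents a handful of words as small vectors and picks the target-language word whose vector is nearest by cosine similarity. The numbers are invented toy values; a real neural system learns its representations from vast amounts of data and builds them from whole sentences, not single words.

```python
from math import sqrt

# Invented toy "representations". A real system learns these from data and
# conditions them on the whole surrounding sentence, not on isolated words.
source_vectors = {"bank (river)": [0.9, 0.1, 0.2], "bank (money)": [0.1, 0.9, 0.3]}
target_vectors = {"rive": [0.88, 0.15, 0.22],     # French: river bank
                  "banque": [0.12, 0.85, 0.35]}   # French: financial bank

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def nearest_target(source_word):
    """Pick the target word whose representation is closest to the source's."""
    vec = source_vectors[source_word]
    return max(target_vectors, key=lambda word: cosine(vec, target_vectors[word]))

print(nearest_target("bank (river)"))   # -> rive
print(nearest_target("bank (money)"))   # -> banque
```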
The new Google Translate began by translating eight languages to and from English, most of them European. It is much easier for machines (and humans) to translate between closely related languages. But Google has also extended its neural engine to languages like Chinese (included in the first batch) and, more recently, to Arabic, Hebrew, Russian and Vietnamese, an exciting leap forward for these languages that are both important and difficult. On April 25th Google extended neural translation to nine Indian languages. Microsoft also has a neural system for several hard languages.
Google Translate does still occasionally garble sentences. The introduction to a Haaretz story in Hebrew had text that Google translated as: “According to the results of the truth in the first round of the presidential elections, Macaron and Le Pen went to the second round on May 7. In third place are Francois Peyon of the Right and Jean-Luc of Lanschon on the far left.” If you don’t know what this is about, it is nigh on useless. But if you know that it is about the French election, you can see that the engine has badly translated “samples of the official results” as “results of the truth”. It has also given odd transliterations for (Emmanuel) Macron and (François) Fillon (P and F can be the same letter in Hebrew). And it has done something particularly funny with Jean-Luc Mélenchon’s surname. “Me-” can mean “of” in Hebrew. The system is “dumb”, having no way of knowing that Mr Mélenchon is a French politician. It has merely been trained on lots of text previously translated from Hebrew to English.
Such fairly predictable errors should gradually be winnowed out as the programmers improve the system. But some “mistakes” from neural-translation systems can seem mysterious. Users have found that typing in random characters in languages such as Thai, for example, results in Google producing oddly surreal “translations” like: “There are six sparks in the sky, each with six spheres. The sphere of the sphere is the sphere of the sphere.”
Although this might put a few postmodern poets out of work, neural-translation systems aren’t ready to replace humans any time soon. Literature requires far too supple an understanding of the author’s intentions and culture for machines to do the job. And for critical work—technical, financial or legal, say—small mistakes (of which even the best systems still produce plenty) are unacceptable; a human will at the very least have to be at the wheel to vet and edit the output of automatic systems.
Online translating is of great benefit to the globally curious. Many people long to see what other cultures are reading and talking about, but have no time to learn the languages. Though still finding its feet, the new generation of translation software dangles the promise of being able to do just that.