R&D notes

The PanLex hunch

Jonathan Robert Pool

The insight behind a universal translator

Objective

Automated translation among human languages works best for widely used languages, because they are the most thoroughly analyzed and produce the largest corpora of monolingual and translation data.

It is commonly estimated that about 7,000 human languages exist, but the major translation engines cover up to about a hundred of them.

What if one insisted on automatically translating among all human languages? What compromises or sacrifices would that requirement entail? Could a translation engine that claims to translate from any language into any other language produce anything useful?

Those were the questions addressed by Oren Etzioni and other researchers at the Turing Center of the University of Washington in Seattle starting in 2005.

Hunch

The Etzioni team had a hunch: The most effective strategy for translating automatically among all languages would be lexical and agrammatical. That is, it would translate words without grammar.

Specifically, it would disregard not only syntax (the grammar of sentences), but also morphology (the grammar of words). It would not try to translate a sentence, clause, or phrase. And the only words it would translate would be lemmas, namely the basic or dictionary forms of words. When a language inflects its words or modifies its sentences for such grammatical properties as tense (go, went, gone), person or number (go, goes), aspect (went, was going), or mood (You go, Do you go?, Go!, If you were to go), the system would disregard the differences (leaving only go).
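To make the idea of reducing inflected forms to a lemma concrete, here is a minimal sketch. It assumes a tiny hand-built lookup table (the name LEMMAS and its entries are illustrative, not part of any system described here); a real lemmatizer would need far broader coverage and language-specific rules.

```python
# Hypothetical hand-built table mapping inflected English forms to lemmas.
# The entries are illustrative only.
LEMMAS = {
    "went": "go",
    "gone": "go",
    "goes": "go",
    "going": "go",
    "was": "be",
    "were": "be",
}


def lemmatize(word: str) -> str:
    """Return the lemma of a word, or the lowercased word itself if unknown."""
    token = word.strip(".,!?").lower()
    return LEMMAS.get(token, token)


def lemmatize_text(text: str) -> list[str]:
    """Lemmatize each whitespace-separated word, ignoring sentence structure."""
    return [lemmatize(word) for word in text.split()]


# All inflected forms collapse to the same lemma.
print({lemmatize(w) for w in ["go", "went", "gone", "goes", "going"]})  # {'go'}
# Mood is lost: the question keeps its words but not its punctuation.
print(lemmatize_text("Do you go?"))  # ['do', 'you', 'go']
```

The output keeps what each word is about and discards tense, person, number, aspect, and mood, which is exactly the sacrifice the hunch proposes.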

Since a lexicon and a grammar are both fundamental components of any language, grammar may seem like a preposterous thing to sacrifice.

Rationale

If you give it some thought, this lexical, and more specifically lemmatic, hunch made sense as the basis for a pilot project.

Lemmatic data are relatively abundant. Lexicographers, anthropologists, linguists, tour guides, and aficionados have compiled translation lexicons for many more languages than have been described grammatically.

And what if you were forced to sacrifice either the lexicon or the grammar? Which would be the less damaging loss? You can simulate both of these scenarios. Take the Turkish Wikipedia sentence "Avrupa bir devrimler yarımadasıdır." If you factor its English translation into its grammatical and lemmatic aspects, you get:

The grammatical translation gives you a hint, but unless you can probe and learn what those somethings are you cannot do anything with it.

The lemmatic translation leaves properties unspecified (e.g., is it a question or a statement?), but it tells you what the original sentence is about. You might at least guess that the original means that Europe is a revolutionary peninsula. And you would be basically correct: A literal translation is "Europe is a peninsula of revolutions."
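A lemmatic translation of the Turkish example can be sketched as two table lookups: one that reduces each word to its lemma, and one that maps each lemma to its English equivalents. Both tables below (TR_LEMMAS and TR_EN) are hypothetical and cover only this one sentence. They are assumptions made for illustration, not data from the project.

```python
# Hypothetical table reducing the inflected words of the example to lemmas.
TR_LEMMAS = {
    "avrupa": "Avrupa",
    "bir": "bir",
    "devrimler": "devrim",        # plural suffix -ler removed
    "yarımadasıdır": "yarımada",  # possessive -sı and copula -dır removed
}

# Hypothetical Turkish-to-English lemma lexicon.
TR_EN = {
    "Avrupa": ["Europe"],
    "bir": ["one", "a"],
    "devrim": ["revolution"],
    "yarımada": ["peninsula"],
}


def lemmatic_translate(sentence: str) -> list[list[str]]:
    """Translate word by word via lemmas; unknown lemmas yield an empty list."""
    result = []
    for token in sentence.split():
        form = token.strip(".,!?")
        lemma = TR_LEMMAS.get(form.lower(), form)
        result.append(TR_EN.get(lemma, []))
    return result


print(lemmatic_translate("Avrupa bir devrimler yarımadasıdır."))
# [['Europe'], ['one', 'a'], ['revolution'], ['peninsula']]
```

The result carries no information about how the words relate to one another, but it is enough for a reader to guess the topic.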

In an interactive context, where there is an opportunity to follow a translation with a response, only lemmatic translations give the parties opportunities to proceed with clarifications.

One should also remember that lemmatic translations may be imperfect, but full translations can be, too. First, both automated and human translators make mistakes. Second, if the original is wrong, the translation is likely to be wrong, too. This holds for originals that are mendacious, misleading, meaningless, ambiguous, vague, or otherwise defective. If you start with an erroneously self-contradicting sentence such as "Use clear and straightforward communication by avoiding metaphors and easy-to-understand vocabulary" (advice once given by UpWork), a full translation might be similarly defective while a lemmatic translation might impart the desired understanding.

Results

The lemmatic hunch laid a foundation for several years of work at the Turing Center, and later work by the PanLex project, now located at The Long Now Foundation in San Francisco.

These projects have experimented with lemmatic communication among human volunteers and developed algorithms for inferring lemmatic translations on the basis of links between intermediate languages. And they have laboriously compiled data from thousands of sometimes obscure translation dictionaries to bring any-to-any translation—at least lemmatic translation—ever closer to reality.
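The idea of inferring a translation through intermediate languages can be illustrated with a naive sketch. It is not the algorithm these projects actually use. It simply counts how many distinct intermediate expressions connect a source lemma to each candidate in the target language. The LINKS data, the three-letter language codes, and the function names are all assumptions made for the example.

```python
from collections import defaultdict

# An expression is a (language code, lemma) pair.
Expr = tuple[str, str]

# Hypothetical attested translation links, as if gathered from
# several bilingual dictionaries.
LINKS: list[tuple[Expr, Expr]] = [
    (("tur", "ilkbahar"), ("eng", "spring")),
    (("tur", "ilkbahar"), ("deu", "Frühling")),
    (("eng", "spring"), ("spa", "primavera")),
    (("eng", "spring"), ("spa", "resorte")),  # "spring" as a mechanical part
    (("deu", "Frühling"), ("spa", "primavera")),
]


def build_graph(links: list[tuple[Expr, Expr]]) -> dict[Expr, set[Expr]]:
    """Build an undirected graph of attested translations."""
    graph: dict[Expr, set[Expr]] = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def infer_translations(graph, source: Expr, target_lang: str) -> list[tuple[str, int]]:
    """Rank target-language lemmas by the number of intermediate expressions."""
    scores: dict[str, int] = defaultdict(int)
    for pivot in graph.get(source, ()):
        if pivot[0] == target_lang:
            continue  # a direct link is attested, not inferred
        for candidate in graph.get(pivot, ()):
            if candidate[0] == target_lang:
                scores[candidate[1]] += 1
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


graph = build_graph(LINKS)
print(infer_translations(graph, ("tur", "ilkbahar"), "spa"))
# [('primavera', 2), ('resorte', 1)]
```

In this toy data, the ambiguous English word "spring" proposes two Spanish candidates, but only "primavera" is also supported through German, so it ranks first. Agreement among several intermediate languages is one plausible way to filter out candidates introduced by polysemy.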