Google Translate's Hebrew-Arabic-Persian war of words

Google translate Photo: Shutterstock

Why does Google Translate turn a text on gender equality into a greeting for the 14th anniversary of Al Qaeda in Israel?

Adv. Keren Greenblatt, executive director of the Partners coalition for promoting economic equality for women, has in recent days focused on a new digital campaign promoting a website to encourage businesses founded by women. Greenblatt wanted to appeal to everyone, so she included text in both Hebrew and Arabic.

Greenblatt told "Globes" that she used Google's translation services for the campaign. "My Arabic isn't especially good, so I put one of the texts for the campaign into Google Translate so that I could adapt it to the corresponding text in Hebrew. Google Translate classified the text as a section in Farsi (Persian), not Arabic, and the translated text that I got included all sorts of words relating to Israel and Al Qaeda. These things weren't in the original text in Arabic and wouldn't have appeared in it had it been written in Farsi."

The translated text was a sequence of sentences that seem logical at first glance: their syntax is correct and they appear to follow one another. At the same time, there are unbridgeable gaps between the paragraphs, and the subjects that the text seemingly deals with are far apart. The result is confusing, but at first glance it looks like an ordinary mediocre translation taken out of context. Many of the texts from Google Translate look like that, but how did words and phrases not appearing in the original text get there?

The paragraph begins with a vague statement about the right of a person to private dwellings in his country, and things go downhill from there: "Before I talk about partnerships… I will greet the 14th anniversary of the founding of Al Qaeda in Israel. In order to attain the objectives of cooperation in peace and security, we are committed to promoting and extending our relations with Israel." Following this bizarre statement, the text starts talking about the formation of international Internet and telecommunications networks. The paragraph ends with a link to the project website, while promising the reader, "You can find the product most suitable to your home. For absolutely free downloading! Only five minutes. Thank you for visiting."

From a victory of Bnei Sakhnin to war against the unbelievers

We looked for additional examples besides Greenblatt's text. We selected several texts in Arabic, defined the original language as Farsi, and tried to translate them, this time to English. The results were similar: instead of returning an error message that the text in Farsi was meaningless, Google Translate proposed completely imaginary translations. A passage written by MK Ayman Odeh (Joint Arab List) dealing with a victory by the Bnei Sakhnin soccer team, for example, was translated into a garbled manifesto criticizing the fall of Islam and abandonment of the Koran and condemning unbelievers. What all of these translations had in common was the use of words taken from political and religious contexts: Islamic organizations, the US Department of Defense, the State Department, the Islamic Republic of Iran, the United Arab Emirates, and the Koran appeared repeatedly.

Dr. Thamar Eilam Gindin from Shalem College and the Ezri Center for Iran and Persian Studies confirms that there is no connection between any of the original texts and the translation obtained, even if read in Farsi. She says that it is possible that the similar vocabulary in Arabic and Farsi is what fooled Google Translate, but the two languages are completely different: "The Farsi language is an Indo-European language from the same family as English, French, and Russian. On the other hand, Iran accepted Islam more than 1,300 years ago, together with the Koran and the Arabic language. Farsi absorbed and continues to absorb many Arabic influences, especially borrowed words. Iranians pronounce these words completely differently and their meaning has sometimes changed over time, but in writing, the words are the same. Arabic has also borrowed words from Farsi, but the two languages remain distinct in syntax and morphology, i.e. in how they construct words and sentences."

Why does Google Translate turn a meaningless text in Farsi into a text that refers directly to political crises with Iran? A hint can be found in the period during which Google added Farsi to its translation services. "Google doesn't really say what they are doing behind their translation engines, but in general, we know that they used a statistical approach in the past," says Dr. Omri Abend of the Hebrew University Department of Cognitive Science. "According to this method, they do not insert rules in order to construct the translation system; they use learning from examples. The main source on which they usually train the translation systems is called a parallel text corpus. They supply existing translations to the system, in our case texts in Farsi and English, and the system deduces the rules from that." According to Abend, if they trained the system using texts of a certain kind, this is likely to affect the translation.
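To illustrate the idea of learning from examples rather than from hand-written rules, here is a minimal sketch in Python. It is not Google's method: it simply counts which target words appear alongside which source words in a tiny parallel corpus. The source "words" (aa, bb, cc, dd), the sentence pairs, and the function names are all invented for illustration.

```python
from collections import Counter, defaultdict

def train_cooccurrence(pairs):
    """Count how often each source word appears alongside each target word."""
    table = defaultdict(Counter)
    for source_sentence, target_sentence in pairs:
        for s in source_sentence.split():
            for t in target_sentence.split():
                table[s][t] += 1
    return table

def best_translation(table, word):
    """Return the target word seen most often with `word`, or None if unseen."""
    if word not in table:
        return None
    return table[word].most_common(1)[0][0]

# Hypothetical parallel corpus: invented source tokens paired with English.
corpus = [
    ("aa bb", "red house"),
    ("aa cc", "red car"),
    ("dd bb", "big house"),
]
table = train_cooccurrence(corpus)
print(best_translation(table, "aa"))  # "red" co-occurs with "aa" twice
print(best_translation(table, "bb"))  # "house" co-occurs with "bb" twice
```

Nothing in the sketch knows what any word means; it only reflects what the example translations contain. If the examples come mostly from one kind of text, the learned associations come from that kind of text too.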

The Farsi language was added to Google Translate during the 2009 crisis in Iran. That was an election year in the Islamic Republic, and many demonstrations took place throughout the country and aroused great interest in the West. In addition, concern that Iran would develop nuclear weapons reached a peak. Google added the option of translation between English and Farsi earlier than it had planned because of the turbulent political situation at the time. Its intentions were good: to make possible rapid translation between the languages at a time when it was greatly needed. It can be assumed that even afterwards, many English speakers around the world were inclined to translate texts from Farsi to English, especially at times of political crisis. These translations probably constitute the main text corpus on which the translation is based, and so a completely meaningless text becomes a threatening one.

The system reverts to what it recognizes

A biased text corpus, however, is only part of the possible explanation. Another factor is the translation technologies used by Google and the way they have changed over the years. "Up until a few years ago, translation systems were based on an attempt to find word sequences in one language and match them with words and word sequences in another language. When the system encounters a word it does not recognize, it does not supply a translation, because the word is not in the dictionary," Abend says. "Today, the translation systems are based on a different approach called neural machine translation."

This translation mechanism, to which Google switched in September, is capable of translating words that it does not recognize through letter-based patterns and educated guesses according to context. "Among other things, this mechanism relies on letter sequences, not just words and sentences. The system can also find patterns below the complete word level, for example based on similarity between words. Instead of simply not translating a word that it doesn't know, such systems can lead to unexpected results."
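A rough sense of what "patterns below the complete word level" can mean is given by the Python sketch below. It is an assumption-laden toy, not the neural mechanism Abend describes: it breaks words into three-letter sequences and maps an unknown word to the most similar known word. The function names and the small vocabulary are hypothetical.

```python
def char_ngrams(word, n=3):
    """Return the set of n-letter sequences in a word, with boundary markers."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def closest_known(word, vocabulary, n=3):
    """Pick the vocabulary word whose letter sequences overlap most with `word`."""
    target = char_ngrams(word, n)

    def jaccard(other):
        grams = char_ngrams(other, n)
        return len(target & grams) / len(target | grams)

    return max(vocabulary, key=jaccard)

# Hypothetical vocabulary; "translaton" is a misspelling the system never saw.
vocabulary = ["translation", "nation", "banana"]
print(closest_known("translaton", vocabulary))  # translation
```

Note that `closest_known` always returns some word, even when the input shares almost nothing with the vocabulary. A system built this way never says "I don't know", which is one route to the unexpected results mentioned above.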

When the system encounters an ordinary text in Farsi, for example, it is able to identify the broader context and the style of the text. On this basis, it assigns the text to a specific set of concepts, from which it derives the likeliest translation for the words it does not recognize. But what happens when no word is identified? "If you put in an obviously unclear text and the system is unable to identify its characteristics, as in this case, the system will revert to the common patterns in the data fed into it," Abend says. In other words, the system goes back to what it knows well, in this case, the examples of translations from Farsi into English relating to political crises.
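The reversion to common patterns can be caricatured in a few lines of Python. This is a deliberately crude sketch under stated assumptions, not a description of any real neural system, where the fallback emerges from learned probabilities rather than an explicit rule. The lexicon, the training sentences, and the function names are invented.

```python
from collections import Counter

def train_prior(target_sentences):
    """Count how often each word appears on the target side of the training data."""
    return Counter(word for s in target_sentences for word in s.split())

def translate(tokens, lexicon, prior):
    """Translate known tokens; fill every unknown token with the most common word."""
    fallback = prior.most_common(1)[0][0]
    return [lexicon.get(token, fallback) for token in tokens]

# Hypothetical training data skewed toward political text.
prior = train_prior(["sanctions talks", "sanctions vote", "bread recipe"])
lexicon = {"aa": "bread"}  # the only source token this toy system knows

print(translate(["zz", "aa", "qq"], lexicon, prior))
# ['sanctions', 'bread', 'sanctions']
```

When most of the input is unrecognized, the output is dominated by whatever was most frequent in the training data, regardless of what the input actually said.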

The errors in Arabic and Farsi are, of course, not the first time that Google has made translation mistakes. In fact, a simple search will yield innumerable articles, most of them in English, with titles such as "Lost in Translation" and "15 Times that Google Translate Embarrassed Us." The many examples include cases in which Google translated the word "kibush" (conquest) as "kivsha" (sheep). In addition to innocent errors, Google Translate has been a platform for trolling for years, among other things because it relied on suggestions for improvement from the public. For example, in May, someone who tried to translate the sentence "I am a flat Earther" into French got "Je suis un fou" ("I am insane"). Bing, Microsoft's competing translation service, is also not free of error. When Benjamin Netanyahu congratulated Netta Barzilai for winning the Eurovision Song Contest, his sentence was mistranslated as "Netta, you're a real cow."

Diversifying with poetry and cookbooks

Google responded in this case as it has to all of its translation mistakes up until now: "Google Translate works by studying patterns from many dictionaries of sample translations all over the Internet. Unfortunately, some of these patterns can lead to translation mistakes. The error has been reported and we are working on a correction." In this case, however, the problem is not an individual translation; it involves a more basic problem in the way that the Farsi language is perceived by the translation mechanisms.

Services based on artificial intelligence and machine learning have already been shown to have bias towards specific groups. In 2015, Google's image recognition service made headlines by labeling black people as gorillas. In the past year, a group of MIT researchers revealed that facial analysis algorithms are 99% accurate for white men but provide less accurate results for people of color and women. For black women, the error rate was as high as 35%.

As in the case of Google Translate, the reason for this inaccuracy lies in the databases on which the technology is based. One database, for example, included more than 75% men and more than 80% white people, many of them programmers and their friends and relatives. The study's conclusion was that the only way to avoid human racist patterns in such technologies was to diversify the models used to try out the algorithms and the programmers themselves. Similarly, the way to avoid belligerent texts is to feed additional examples into the Google text corpus in Farsi, perhaps from literature, poetry, or traditional Iranian recipes. Then, of course, it is necessary to make sure that the system recognizes the difference between Farsi and Arabic.

Published by Globes [online], Israel business news - www.globes-online.com - on June 20, 2018

© Copyright of Globes Publisher Itonut (1983) Ltd. 2018
