Potato, potato. Toma6to, I'm going to kill you... How a typo can turn an AI translator against us


Neural-network-based language translators can be tricked into deleting words from sentences or dramatically changing the meaning of a phrase, by strategically inserting typos and numbers.

Just as twiddling pixels in a photo, or placing a specially crafted sticker near an object, can make image-recognition systems mistake bananas for toasters, tweaking the input can alter the translation of a sentence.

This isn't like altering "The black cat" to "The black cap", and making an English-to-French translation AI change its output from "Le chat noir" to "Le chapeau noir." That change is to be expected.

No, we're talking about, for example, tweaking "Er ist Geigenbauer und Psychotherapeut" (He is a violin maker and psychotherapist) to "Er ist Geigenbauer und Psy6hothearpeiut", and getting the translation: "He is a brick maker and a psychopath."

Disaster

The typos are apparent to a human fluent in the language, who can forgive the blunders unaware of the malicious intent behind the errors, whereas software is none the wiser it is being hoodwinked. A well-placed spelling mistake could, in effect, cause a machine-learning program to misunderstand a command or point of information, and do something rather erratic or disastrous.

It's also not hard to imagine a government surveillance agency, or a corporation's human resources department, using an AI to transcribe monitored phone calls into English, and getting the translation completely wrong to the point where alarms are raised over innocuous statements.

This is the work of a group of computer scientists at the University of Oregon in the US, who were able to change letters in sentences to derail machine translators. Their results are presented in this paper, which was emitted over the weekend.

“Machine translation is used to connect people and share information, but when the translation is wrong, the opposite can happen,” Daniel Lowd, coauthor of the paper and an associate professor at the University of Oregon, told The Register.

"This could be used maliciously to arouse suspicion on someone else.

“For example, suppose I used a pseudonym and wrote to an intended victim: 'Good morning, any plans today?' with a few typos added in. To the victim, it looks like I'm saying good morning. To law enforcement using translation software, it might look like I'm saying: 'Any plans to attack them today?' If the victim responds, 'Yes, of course!' then it looks like they're a participant in some malicious activity."

White box

The researchers trained a recurrent neural network to translate between languages using the IWSLT 2016 dataset, which is made up of hundreds of thousands of words of paired translations taken from TED talks. It covers three pairs of languages: German to English, Czech to English, and French to English.

It was this neural-network software the team set out to trick. This was a white box attack in that they knew the model and its parameters, as opposed to a black box with completely mysterious inner workings. In the real world, people seeking to attack machine translators will most likely face black boxes – Google doesn't exactly share the blueprints to its translation systems, for example – however, a white box setting is not too unrealistic a scenario, Lowd said.

“In most cases, companies will keep those details secret," he said. "However, many systems, such as Google Translate, are available to the public who can experiment with them as much as they want.”

As such, determined miscreants can perhaps get a good idea of what's happening inside a translator by playing with it externally.

The attack

The researchers experimented with replacing characters, swapping them around, deleting them, and inserting new ones. For example, replacing the "h" in the German word "nacht" with a "k" will change it to "nackt", and the English translation switches from the word night to naked. The aim was to make changes that would affect the output without making it completely incoherent – no alteration to the text should be too great or too mild.
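To get a feel for how many candidate typos a single word offers, here is a minimal Python sketch that lists every string one character-level edit away from a word. The function name `single_char_edits` and the lowercase-only alphabet are our own assumptions for illustration, not the researchers' code.

```python
import string

def single_char_edits(text, alphabet=string.ascii_lowercase):
    """Return every string reachable from `text` by one character-level edit."""
    edits = set()
    for i in range(len(text) + 1):
        # Insertion: add one character before position i.
        for ch in alphabet:
            edits.add(text[:i] + ch + text[i:])
    for i in range(len(text)):
        # Deletion: drop the character at position i.
        edits.add(text[:i] + text[i + 1:])
        # Replacement: put a different character at position i.
        for ch in alphabet:
            if ch != text[i]:
                edits.add(text[:i] + ch + text[i + 1:])
    for i in range(len(text) - 1):
        # Swap: exchange two adjacent characters that differ.
        if text[i] != text[i + 1]:
            edits.add(text[:i] + text[i + 1] + text[i] + text[i + 2:])
    return edits

candidates = single_char_edits("nacht")
print("nackt" in candidates)
print(len(candidates))
```

Even a five-letter word yields a few hundred candidates, which is why trying each one against a full translation model quickly gets expensive.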

To avoid wasting time randomly and aimlessly editing strings of text – or vectors of numbers as the software sees them – the team analyzed the effects within the neural network when characters were adjusted, allowing them to home in on worthwhile changes.
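The article does not spell out how that analysis works. One common approach, assumed here purely for illustration, is a first-order estimate: take the gradient of the model's loss with respect to the one-hot encoded characters, then read off the predicted effect of each replacement as a difference of two gradient entries. The toy "model" below, with its made-up `WEIGHTS` table and squared-sum loss, is hypothetical and stands in for a real translator.

```python
ALPHABET = "abc"  # tiny alphabet to keep the example readable

# Hypothetical per-position, per-character weights of a toy "model".
WEIGHTS = [
    {"a": 0.1, "b": 0.4, "c": -0.2},
    {"a": 0.3, "b": -0.1, "c": 0.5},
    {"a": -0.3, "b": 0.2, "c": 0.0},
]

def loss(text):
    """Toy loss: the square of the summed weights of the chosen characters."""
    total = sum(WEIGHTS[i][ch] for i, ch in enumerate(text))
    return total ** 2

def gradient(text):
    """Gradient of the toy loss with respect to each one-hot input entry."""
    total = sum(WEIGHTS[i][ch] for i, ch in enumerate(text))
    return [{ch: 2 * total * WEIGHTS[i][ch] for ch in ALPHABET}
            for i in range(len(text))]

def estimate_all_replacements(text):
    """Score every single-character replacement from one gradient computation."""
    grad = gradient(text)
    scores = {}
    for i, old in enumerate(text):
        for new in ALPHABET:
            if new != old:
                # First-order estimate of the loss change from old -> new.
                scores[(i, new)] = grad[i][new] - grad[i][old]
    return scores

text = "aab"
scores = estimate_all_replacements(text)
(i, new), estimated_gain = max(scores.items(), key=lambda kv: kv[1])
flipped = text[:i] + new + text[i + 1:]
print(flipped, estimated_gain, loss(flipped) - loss(text))
```

The estimate and the true change in loss will generally differ, since the gradient only describes the model's behavior near the original input. The payoff is that one gradient computation scores every replacement, rather than one model run per candidate.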

“In our attacks, we consider all possible individual changes – inserting, deleting, replacing, or swapping a single character somewhere in the source sentence,” Lowd said.

"However, trying every possibility one at a time could be slow, and a single change might have no noticeable effect on the translation. To get around this, we use the structure of the neural network to estimate the effect of every possible change all at once. This approximation is orders of magnitude faster than the naive approach.

“To come up with a set of changes, we make one change at a time until the translation has been corrupted ‘enough’ for the adversary. To get a slightly better attack, we can consider a few runners-up as well, since they might be even better when combined with other character changes. To get a faster attack to use in our training method, we find the worst character to change in each word and change all of them."
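The "one change at a time until it is corrupted enough" loop Lowd describes can be sketched in a few lines of Python. Everything here is a stand-in: `TOY_LEXICON` is a made-up word-for-word lookup rather than a neural translator, `damage` is a hypothetical score that simply counts changed output words, and only replacements are tried. In the researchers' setting, the scoring would come from the network-based estimate instead.

```python
import string

# Toy stand-in for a translator: a word-for-word lookup table (hypothetical).
TOY_LEXICON = {"gute": "good", "nacht": "night", "nackt": "naked"}

def toy_translate(sentence):
    """Translate word by word, emitting <unk> for anything not in the table."""
    return [TOY_LEXICON.get(word, "<unk>") for word in sentence.split()]

def damage(original, candidate):
    """Count output words that differ from the clean translation."""
    clean, noisy = toy_translate(original), toy_translate(candidate)
    return sum(a != b for a, b in zip(clean, noisy)) + abs(len(clean) - len(noisy))

def replacements(text, alphabet=string.ascii_lowercase):
    """Yield every string made by replacing one non-space character."""
    for i, old in enumerate(text):
        if old == " ":
            continue
        for ch in alphabet:
            if ch != old:
                yield text[:i] + ch + text[i + 1:]

def greedy_attack(sentence, enough, max_edits=5):
    """Apply the single most damaging replacement until `enough` is reached."""
    current, edits = sentence, 0
    while damage(sentence, current) < enough and edits < max_edits:
        current = max(replacements(current), key=lambda c: damage(sentence, c))
        edits += 1
    return current, edits

adversarial, edits = greedy_attack("gute nacht", enough=2)
print(adversarial, toy_translate(adversarial), edits)
```

Keeping a handful of runners-up at each step, rather than only the top pick, would turn this greedy loop into the slightly stronger search Lowd mentions.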

It can be difficult to assess how powerful these attacks are, and how likely they are to succeed in the real world outside of a lab. Unlike image classification systems, where performance can be measured with an accuracy percentage, text is more nuanced.

Instead, the researchers suggest evaluating adversarial examples by seeing if the changes fulfill a specific goal, whether that be removing a name from a sentence or hiding an activity for privacy reasons, or more malicious goals like sending threats that are understood by the recipient but not by the translation model.


Here are some of the results for sentences translated from German to English. The first example is a controlled attack; the second and third show the second-most and 100th-most likely targeted attacks, respectively. Image credit: Ebrahimi et al.

It’s not all bad news. The paper also showed that training translation models with adversarial examples can minimize the system’s weaknesses: by showing an AI malicious sentences, it can learn to ignore or work around them. This takes about three times as long as regular training, we're told, but improves the robustness.
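As a rough illustration of the idea, the Python sketch below pads a parallel corpus with typo-ridden copies of each source sentence, each paired with the unchanged, correct translation. The helper names `perturb_word` and `augment` are ours, and the typos are picked at random for simplicity, whereas the researchers choose theirs using the model itself.

```python
import random
import string

def perturb_word(word, rng):
    """Replace one randomly chosen character with a different lowercase letter."""
    if not word:
        return word
    i = rng.randrange(len(word))
    choices = [c for c in string.ascii_lowercase if c != word[i]]
    return word[:i] + rng.choice(choices) + word[i + 1:]

def augment(pairs, rng, rate=0.5):
    """Return the clean pairs plus noisy-source copies with the clean target."""
    augmented = list(pairs)
    for source, target in pairs:
        if rng.random() < rate:
            noisy = " ".join(perturb_word(w, rng) for w in source.split())
            augmented.append((noisy, target))
    return augmented

pairs = [("gute nacht", "good night"), ("guten morgen", "good morning")]
training_set = augment(pairs, random.Random(0), rate=1.0)
for source, target in training_set:
    print(source, "->", target)
```

A model trained on the enlarged set sees that a mangled source should still map to the intended translation, at the cost of more training data to chew through.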

In real models, pairs of languages rarely translated into one another will be more susceptible to attacks, Lowd said.

“If the translation is unreliable to begin with, then an attacker can exploit that," he said. "Machine translation models with less training data, less language expertise, or less engineering and testing effort are more likely to be vulnerable.”

Feeding adversarial examples into neural networks is becoming increasingly trendy. The most common examples are forcing image classifiers to identify objects incorrectly by adding pockets of noise or changing a few pixels, so that a cat is mistaken for guacamole. ®



