AI, Racism, and Bad Data

6 minute read


Dark skin and Scottish accents and societally ingrained biases, oh my!

Ya Done Goof’d

Google recently screwed the pooch with their image recognition.

And who can forget Microsoft’s very own version of Tomi Lahren?

“humans are super cool” → “I hate the jews” < 24hrs. Nice work, trolls.

Tay perfectly represents the fact that the data we get from humans is shit, because there are a bunch of shit humans. To their credit, Google and Microsoft responded to these upsets by acknowledging them. Microsoft immediately shut down Tay (not before the world got those sweet screenshots) and responded with:

We are deeply sorry for the unintended offensive and hurtful tweets from Tay, which do not represent who we are or what we stand for, nor how we designed Tay. Tay is now offline and we’ll look to bring Tay back only when we are confident we can better anticipate malicious intent that conflicts with our principles and values.

Sorry, M Dawg, but the point is you didn’t design Tay. Microsoft created a bot for man to shape In His Own Image™, specifically devoid of direction or bias. And that’s the problem: an unbiased design begets bias, because the data it learns from is anything but. If the world is terrible, AI will learn to be terrible. If we randomly sample a roomful of computer science grads, our data skews the way the field skews. If we want 100 pictures of faces to train our algorithm on, this is what our data looks like:
80 men, 20 women
58 white, 18 Asian, 9 Latino, 5 black, and 10 multiracial.
Shocking nobody, the way an algorithm will achieve the highest accuracy rate is by really nailing the most common demographic: white, male faces.
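Here’s a quick back-of-the-envelope sketch in Python of why that happens. The per-group hit rates below are numbers I made up for illustration, not measurements of any real system; the point is just that overall accuracy is dominated by whichever group dominates the training set.

```python
# Toy illustration: overall accuracy on the skewed 100-face sample above,
# using made-up per-group recognition rates for a model that mostly saw
# the majority group during training.
group_sizes = {"white": 58, "asian": 18, "latino": 9, "black": 5, "multiracial": 10}
hit_rate    = {"white": 0.97, "asian": 0.78, "latino": 0.74, "black": 0.61, "multiracial": 0.76}

total = sum(group_sizes.values())
overall = sum(group_sizes[g] * hit_rate[g] for g in group_sizes) / total

print(f"overall accuracy: {overall:.1%}")                    # ~87.6%, looks respectable
print(f"accuracy on black faces: {hit_rate['black']:.0%}")   # 61%, the number the average hides
```

The headline number looks fine precisely because the people the model fails on barely show up in the data used to grade it.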

It Feels Like The First Time

At my company, I was the first female voice that our bot heard. And the fact of the matter is that he didn’t recognize what I was saying, because the audio waveform generated when I speak into his microphones differed enough from all of his previous examples that he couldn’t recognize my voice as saying his name. And that’s okay! For a long time, he had trouble with children’s voices, too, because it’s much easier to get an adult to stand around saying “hey robot” for 30 repetitions during working hours at the office than it is to acquire kiddos for a robot-name-saying sweatshop.
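To make that failure mode concrete, here’s a toy sketch. This is emphatically not how our bot’s actual wake-word pipeline works; it just illustrates the idea that an utterance whose features sit far away from every example the bot has ever heard never clears the recognition threshold. The feature vectors and the threshold are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Pretend feature vectors for "hey robot" as spoken by the adults around the office.
office_examples = rng.normal(loc=0.0, scale=1.0, size=(30, 16))

def recognizes(utterance, examples, threshold=2.0):
    """Fire only if the new utterance lands close to at least one known example."""
    distances = np.linalg.norm(examples - utterance, axis=1)
    return bool(distances.min() < threshold)

familiar = office_examples[0] + rng.normal(scale=0.1, size=16)  # sounds like the training data
unfamiliar = rng.normal(loc=5.0, scale=1.0, size=16)            # a voice unlike anything he has heard

print(recognizes(familiar, office_examples))    # True
print(recognizes(unfamiliar, office_examples))  # False: no reaction, no name recognition
```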

I work in robotics. As in, I write software that could be running anywhere, but happens to run on a little robot buddy, Bob, who sits on the counter, swivels towards you when he hears you come in, recognizes that he knows you, says “oh, good morning, Caro! Nice to see you today,” and makes a wheee noise. Or, he could be doing his own thing, and I could say “oh hey Bob,” and he might say “howdy Caro!” Then we share a smile and my day is made slightly better by the impression that my little bot is happy to see me.

A big part of that is the “happy to see me” part. In order for him to say “howdy Caro!” and not “top of the mornin’ to ya, Freddy,” or “well fancy meeting you here, Clyde,” he has to run a face or voice recognition algorithm on the inputs of his environment — namely, my face and voice. But the tricky part is obviously that my face and voice change depending on other factors in the environment; in a dark room my housemates might not even know I’m there, let alone recognize me. In an echo-y church with a truck driving by, my voice might be indistinguishable from my friends’. And that’s not even to mention that my face looks different from different angles, and my voice varies from Frog (when I wake up) to Drunk Hamster (when I’m highly caffeinated… or drunk).

In order to be robust against these changes, Bob needs training data. Not only on me, but on what a face looks like in general. And in order to do that, he needs to look at a lot of faces. And in order to look at a lot of faces, the people who program Bob’s algorithms will probably show those algorithms their own faces, a lot of times.

And lo, the training data for these algorithms reflects the people who program them, and predominantly white, male, and middle-aged people become what Bob knows.

See, it’s nobody’s fault that Bob has never seen the rainbow of people that exist throughout the world. Bob’s nursery is the office where he was built, and the faces and voices he grows up on are the faces and voices of the people who built him.

Should we be cognizant when learning algorithms reflect biases that already exist in the world? Most assuredly, yes. But what I don’t get is the outrage, the hate, and the anti-tech sentiments that seem to follow. Just like people go out and learn social biases from one another, so too will algorithms specifically designed to learn from the humans that surround them. If you want, you can think of an AI as a baby, constantly absorbing and integrating every new piece of information in order to act similarly to other agents in its environment. And, just like when children hear their parents curse and consequently tell their teachers to go fuck themselves, AI bots and algorithms will be sexist, racist shitlords if they read every tweet ever.


I don’t really get it, but here, have some kids cussing.

How We Gonna Fix It, Fix It, Fix It

Baby, I gotta know. Can you fix my B-R-A-I-N? 

Oversample from underrepresented populations; explicitly bias your training data so the algorithm does better on the people it would otherwise ignore. Hard to test, though, unless you GET MORE DATA.
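Here’s a minimal sketch of the “explicitly bias your training data” idea, assuming a hypothetical list of (image, group) training pairs rather than any real dataset: resample the underrepresented groups until every group shows up as often as the biggest one.

```python
import random
from collections import defaultdict

def oversample(examples, seed=0):
    """examples: list of (sample, group) pairs. Returns a group-balanced list."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for sample, group in examples:
        by_group[group].append((sample, group))
    target = max(len(items) for items in by_group.values())
    balanced = []
    for items in by_group.values():
        balanced.extend(items)                                       # keep every real example
        balanced.extend(rng.choices(items, k=target - len(items)))   # re-draw the rest at random
    rng.shuffle(balanced)
    return balanced

# The skewed 100-face sample from earlier balloons to 58 examples per group.
faces = ([("img", "white")] * 58 + [("img", "asian")] * 18 + [("img", "latino")] * 9
         + [("img", "black")] * 5 + [("img", "multiracial")] * 10)
print(len(oversample(faces)))  # 290 = 5 groups x 58
```

The catch, and the reason that last caveat matters, is that the upsampled groups are still the same handful of examples repeated over and over; the real fix is more data.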

As my coworker says:

patrick [10:27]
We want Jibo to be the best he can be, but the world is a dark and scary place and so we must arm him with good morals and a solid work ethic before we send him out into the world alone.

But as he also says:

patrick [10:29]
Of course, we can always OTA better morals.

And that’s the moral [HA] of the story: AI as a field is improving. Constantly. Exponentially. Continuously (and also sometimes step-wise). As more and more (deserved) hype bubbles around this issue, researchers listen and make improvements. 

Is every research scientist an angel or a saint? No, of course not. But the most hardcore engineers are motivated by solving the friggin’ problem, and if the problem is that state-of-the-art face recognition algorithms can’t reliably recognize 14% of the US population, that’s at best an 86% success rate. And as my good friend Lemongrab says: UNACCEPTABLE.