As if we didn’t have enough in-person racism to deal with, it turns out the chatbots might be racist too. Troubling reporting from the Washington Post revealed that one of the most widely used data sets for training AI chatbots contains a ton of right-wing content.
For folks unfamiliar with artificial intelligence, AI programs like ChatGPT can’t literally think for themselves. Instead, companies feed AI programs a massive amount of data scraped from all over the internet. The AI uses this data set to mimic human thought. So if your robot friend starts trying to share 9/11 conspiracy theories, chances are the data set had a little too much of Alex Jones.
And that, my friends, is precisely where the problem begins. According to the Washington Post investigation, the news websites in one of the most widely used AI data sets include a ton of far-right and non-reputable sources. The data set in question is Google’s C4 data set, which powers some of the largest AI models in the world, including models from Facebook and Google.
So where exactly are they getting their news from? Well, Breitbart is definitely on the list. The Russian-state propaganda website RT.com is also on there, alongside the anti-immigration group Vdare.com.
You don’t have to take our word for whether Breitbart is pushing racism. In 2016, right-wing commentator Ben Shapiro expressed disdain for the website, saying it pushed “white ethno-nationalism” content. And if you’re too far right for Ben Shapiro... you might want to start asking some tough questions. A massive concern is that AI programs don’t always cite their sources, which means you could ask an AI a question and have no idea that the answer is coming from a right-wing site spewing hate.
MSNBC’s Sarah Posner, who covers the right, called attention to just how dangerous having these inputs in the algorithm can be:
Anyone who has searched the web for information on a topic knows that it can sometimes land them on a site spewing bigoted content or disinformation. The building blocks of chatbots have been scraped from the same internet. An offended user can navigate away from a toxic site in disgust. But because the data collection for LLMs is automated, such content gets included in the “instruction” for them. So if an LLM includes information from sites like Breitbart and VDare, which publish transphobic, anti-immigrant and racist content, that information — or disinformation — could be incorporated in a chatbot’s responses to your questions or requests for help.
The problem with AI (other than the inevitable day it takes over the world) is that it’s a product of our own biases and judgments. And until we get a much better handle on that or at least put up better guard rails, the racist chatbots might be here to stay.