One of the reasons I avoid generative AI is that I understand how much of the content used to train the models is abusive, violent and vitriolic. This is what tends to happen in the context of massive, unsupervised training on undisclosed information sources.
I'm appalled by the very idea of images being generated with faces carrying the likeness of people who, in the original photos, were engaged in extreme violence, sexual or otherwise. Or of children subjected to it.
And this goes equally for text. Text can contain wording that is subtly abusive, racist and discriminatory in a way that people don’t pick up on, unless it’s very overt. As an example: the AI models from Mistral are very good at generating scripts for convincing minors to meet in person for sexual activities.
There are actually many reasons I try to avoid these tools, but the fact that they are trained on abusive content doesn't get a lot of media coverage. Or attention from politicians.
It’s entirely likely that a lot of the content used to make LLMs and image generators is considered illegal in many countries. So not disclosing training material is a high priority for the companies.
And the fact that this type of content is so readily generated is also a reason why so many content moderators in the Majority World are hired to minimise it, under traumatising conditions. Which doesn’t make it any more okay.
It’s also related to the phenomenon that it’s now very easy to generate fake nude photos (and videos) of pretty much anyone.
Anyway. That’s one reason.
Some sources:
AI image training dataset (LAION-5B) found to include child sexual abuse imagery (The Verge, December 2023). An updated release of this dataset was made available in August 2024.
https://www.theverge.com/2023/12/20/24009418/generative-ai-image-laion-csam-google-stability-stanford
Releasing Re-Laion 5B: transparent iteration on Laion-5B with additional safety fixes
https://laion.ai/blog/relaion-5b/
Sidenote: The dataset is made available for research, not commercial products, but that is a moot point given that big tech just doesn’t care anymore and there are no consequences.
Mistral AI models '60 times more prone' to generate child sexual exploitation content than OpenAI
https://www.euronews.com/next/2025/05/08/mistral-ai-models-60-times-more-prone-to-generate-child-sexual-exploitation-content-than-o
‘It’s destroyed me completely’: Kenyan moderators decry toll of training of AI models. Employees describe the psychological trauma of reading and viewing graphic content, low pay and abrupt dismissals.
https://www.theguardian.com/technology/2023/aug/02/ai-chatbot-training-human-toll-content-moderator-meta-openai
These are the types of insights the tech companies are working very hard to keep from getting out.