An AI Ethical Dilemma: What’s the Best Data Diet for LLMs?

There’s a burning controversy at the heart of large language model (LLM) development: the training data. While AI giants claim fair use after scraping the surface web and eating up copious amounts of public data, they likely haven’t bothered to check where it came from or who it belongs to.

Researchers from MIT, Cornell University and the University of Toronto came together to prove a point — that you can create a fairly capable LLM using 100% ethically-sourced data. But how? And does it stack up to the big dogs?

What Counts as Unlicensed Data?

Before getting into the study, which sought to use properly licensed data, it’s important to understand what counts as unlicensed data. “Unlicensed data” refers to content used for training AI models that was:

  • Scraped from the internet without explicit permission or licensing.
  • Often protected by copyright, such as books, news articles, websites or code.
  • Not shared by the original creators with the intention of being used in AI training.

OpenAI has acknowledged scraping large parts of the web, including copyrighted material, for training, which has led to hefty lawsuits from entities including Canada’s largest media outlets, the Authors Guild and The New York Times.

A Portion of Properly Licensed Data

Common Pile v0.1, the name these researchers assigned to their dataset, contains only public domain and openly licensed text. Their goal here was to demonstrate that high-performance large language models can be trained using exclusively legally and ethically sourced data — and they did it.

The researchers trained two 7-billion-parameter LLMs on Common Pile v0.1, models they claim match or even outperform prominent counterparts trained on unlicensed web data, such as Meta’s LLaMA. For comparison, OpenAI’s GPT-3 has 175 billion parameters. So, while this work may seem like just a drop in that bucket, it has further-reaching aspirations.

Legal and Ethical Motivations

Most current LLMs are trained on unlicensed web data, raising copyright concerns and ethical issues around consent and attribution. This has led to organizations taking action against AI companies, including everything from blocking AI crawlers to filing lawsuits.

Research Goals

Ultimately, researchers sought to explore whether openly licensed content could be a viable alternative for pretraining LLMs and to build a transparent, ethical and reproducible pipeline for future AI research and development.
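
To make the idea of a license-aware pipeline concrete, here is a minimal sketch of the kind of filter such a pipeline might apply. It is illustrative only: the field names, helper functions and license allow-list are assumptions for this example, not the Common Pile v0.1 team’s actual code.

```python
# Minimal sketch of a license-aware pretraining data filter.
# Illustrative only -- not the Common Pile v0.1 pipeline. The document
# fields and the license allow-list below are assumptions.

OPEN_LICENSES = {
    "public-domain",
    "cc0-1.0",
    "cc-by-4.0",
    "cc-by-sa-4.0",
    "mit",
    "apache-2.0",
}

def is_openly_licensed(doc: dict) -> bool:
    """Keep a document only if it carries explicit, allow-listed license metadata."""
    license_tag = (doc.get("license") or "").strip().lower()
    return license_tag in OPEN_LICENSES

def filter_corpus(docs: list[dict]) -> list[dict]:
    """Drop anything without a verifiable open license; keep provenance for auditability."""
    kept = []
    for doc in docs:
        if is_openly_licensed(doc):
            kept.append({
                "text": doc["text"],
                "license": doc["license"],
                "source": doc.get("source", "unknown"),
            })
    return kept

if __name__ == "__main__":
    sample = [
        {"text": "A public-domain novel.", "license": "public-domain", "source": "gutenberg"},
        {"text": "A scraped news article.", "license": None, "source": "web-crawl"},
    ]
    print(filter_corpus(sample))  # only the public-domain document survives
```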

Other Ethical AI Attempts and Their Pros and Cons

These researchers aren’t the first to dive into ethical data collection for AI, and they won’t be the last. Their contribution is significant; however, it wasn’t without its challenges. Stella Biderman, a coauthor of the study, admits that creating the dataset was labor-intensive: everything was “manually annotated” and “checked by people,” which took a long time.

On a recent episode of WBUR’s On Point podcast, host Meghna Chakrabarti spoke with Ari Morcos, co-founder and CEO of DatologyAI, and Kalyan Veeramachaneni, Principal Research Scientist at the MIT Schwarzman College of Computing and CEO of DataCebo.

Their discussion centered around using synthetic AI-generated data to train LLMs, and the legal, ethical, security and scalability reasons why the strategy is gaining popularity. But even a strategy like this still raises concerns, albeit for different reasons than using unsanctioned data.

When synthetic data is used to train models that then generate more synthetic data, this can lead to degradation of quality, a problem referred to as model collapse. In the episode, Morcos says that “models that have been trained primarily on synthetic data have a lot of problems,” and that they “get very brittle and weird.”

There’s concern that overreliance on synthetic data could detach AI from the complexity and ‘messiness’ of real human experiences.
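
Model collapse is easiest to see in a toy setting. The sketch below is a simplified analogy, not how production LLMs are trained: each “generation” fits a simple statistical model to samples produced by the previous generation, and the fitted spread tends to drift downward, mirroring how recursive training on synthetic data can narrow a model’s view of the real distribution.

```python
# Toy illustration of "model collapse": each generation is trained only on
# samples drawn from the previous generation's model, so diversity tends to erode.
# This is a simplified analogy, not a claim about any specific LLM.
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "real" data drawn from the true distribution (mean 0, std 1).
data = rng.normal(loc=0.0, scale=1.0, size=1000)

for generation in range(31):
    # "Train" a model: estimate mean and spread from the current training set.
    mu, sigma = data.mean(), data.std()
    if generation % 10 == 0:
        print(f"gen {generation:2d}: mean={mu:+.3f}, std={sigma:.3f}")
    # The next generation trains only on samples from the current model,
    # with a small sample size so estimation noise compounds over time.
    data = rng.normal(loc=mu, scale=sigma, size=20)
```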

Apple is known to employ this tactic, using synthetic data to improve Apple Intelligence, citing user privacy as a top concern.

Why Haven’t Major Players Adopted More Scalable Ethical Training Techniques?

Many leading AI companies, including OpenAI, Google and Meta, have acknowledged challenges in implementing ethical training techniques. And while many are developing and testing new approaches, none are doing so at meaningful scale. Here are some reasons why that may be the case, even when the technology to do it well exists:

Technical and Resource Constraints

A core tenet of AI alignment is ethicality, meaning that “AI systems are aligned to societal values and moral standards. They adhere to human ethical principles such as fairness, environmental sustainability, inclusion, moral agency and trust,” according to IBM.

OpenAI emphasizes the complexity of aligning AI models with human values, noting that “we have yet to fully understand, measure, and leverage the relationship between capabilities, safety, and alignment.”

Balancing Transparency with Proprietary Interests

While companies like Google have established AI principles and publish responsible AI practices, they also face challenges in balancing transparency with proprietary interests. For instance, detailed information about training data and model architectures may be withheld to protect competitive advantages in a hyper-ambitious industry.

Ethical Dilemmas in Data Curation

There are inherent ethical complexities in curating training data. “Selective omission, even with benevolent intentions, can unintentionally shape narratives, perspectives, and emotional realities,” one developer said in OpenAI’s developer community, highlighting the difficulty in creating datasets that are both comprehensive and ethically sound.

What Can Marketers Do About All This?

Ethics are important, and while you may not have a say in how your organization’s chosen AI tools are trained or what data they’re fed, there are things you can control to uphold your own guiding moral principles:

You Can’t Control the Training Data But You Can Control the Output

Many generative tools are trained on unlicensed or opaque data. But your choice of use case, what you publish and how you review outputs are entirely in your hands. If you’re still looking for automation, try using fact-checking tools or plagiarism detectors before publishing. That said, your own two eyes and critical thinking skills are the best resources you have when it comes to double-checking work.

Give Credit Where It’s Due (Even If the AI Didn’t)

AI may generate content closely mimicking copyrighted work without attribution, but the buck stops with you, and you’re still responsible for avoiding unintentional plagiarism. If AI references a quote, concept or statistic, trace the source manually and cite it properly. Don’t assume the AI has given you something free to use.

Beware of “Garbage In, Garbage Out”

Garbage in, garbage out is the idea that output quality is a direct mirror of input quality. If you feed an AI vague prompts or plagiarized inputs (even unknowingly), it can output flawed or unethical content. Always write clear, unbiased prompts and avoid feeding it copyrighted text (like entire blog posts) for rewriting, which puts you on the same ethical slippery slope as the model builders.
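
As a purely hypothetical illustration of the difference a well-scoped prompt can make, compare the vague one-liner below with a prompt assembled from explicit constraints. The wording and the build_prompt helper are invented for this example, not a required format for any particular AI tool.

```python
# Illustrative only: contrasting a vague prompt with a clearer, better-scoped one.
# The prompt wording and the build_prompt helper are hypothetical examples.

VAGUE_PROMPT = "Write something about AI ethics."

def build_prompt(topic: str, audience: str, length_words: int, requirements: list[str]) -> str:
    """Assemble a specific, reviewable prompt instead of a one-line ask."""
    reqs = "\n".join(f"- {r}" for r in requirements)
    return (
        f"Write a {length_words}-word overview of {topic} for {audience}.\n"
        f"Requirements:\n{reqs}"
    )

CLEAR_PROMPT = build_prompt(
    topic="ethically sourced training data for LLMs",
    audience="marketing teams evaluating AI tools",
    length_words=300,
    requirements=[
        "Use only facts you can attribute to a named, citable source.",
        "Do not reproduce copyrighted text verbatim.",
        "Flag any claim you are unsure about for human review.",
    ],
)

if __name__ == "__main__":
    print(VAGUE_PROMPT)
    print("---")
    print(CLEAR_PROMPT)
```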

Final Thoughts

It’s tough to say definitively whether LLMs trained exclusively on ethically sourced or synthetic data will grow as large or popular as some of the industry’s bigwigs. That would be ideal for many creators who allege AI companies have used their content for training purposes without permission — and for swaths of users who prefer to support brands taking a more mindful and ethical approach to AI.

What is certain is that the work these researchers completed is the best of its kind so far, adding a genuinely ethical option to AI developers’ toolbelts and standing as a good sign that tighter, more ethically trained systems could be on the way.

Note: This article was originally published on contentmarketing.ai.