Ethics of Using Public Data to Train AI Models: Protecting Publicly Posted Data
- Neel Ramachandran
Article designed by: Neel Ramachandran & Sanvi Desai
AI systems rely heavily on large amounts of data to work well. Every time you use social media, upload photos, write posts, or interact online, you generate information that can help train these models. Public datasets, websites, forums, and images all play a huge role in shaping how AI learns patterns and makes predictions. This data allows AI to produce more accurate and useful outputs, and high-quality data has become one of the most important resources for developing modern AI tools.
While this approach allows AI to grow quickly, it also raises important ethical questions. Many people don’t realize that the content they post online may end up being used to train AI models. Anything from a tweet to an image can be used, and this lack of awareness creates concerns about consent and ownership. At the same time, platforms like Reddit, news outlets, and online forums want to protect the information created by their communities, since that content is what makes their sites valuable. These competing interests often lead to disagreements between AI companies looking for data and the platforms trying to protect it. As AI continues to expand, the debate over who controls this online content and how it is used is becoming increasingly important.

Image by TechHQ
While this problem is serious, there are several steps that can be taken to create a more transparent and responsible system for AI data use. First, online platforms should clearly explain how user data might be used in AI training and offer easy-to-understand options to opt in or out. This gives people more control and ensures they are informed about how their data is being handled.
AI companies also need to take responsibility by forming licensing agreements with websites instead of scraping data without permission. Agreements like these make it clear who owns the data and how it can be used, building trust between companies, creators, and users.

Image by FinTechWeekly
Finally, governments should enact policies that set standards for ethical data collection. Regulations could require companies to be transparent about where their training data comes from and how that data is anonymized or protected. By combining these three steps, it becomes possible to create an AI ecosystem that respects privacy while still allowing for innovation.
Works Cited
Dehshiri, Ameneh. “Addressing GDPR’s Shortcomings in AI Training Data Transparency with the AI Act.” Tech Policy Press, 31 July 2025, www.techpolicy.press/addressing-gdprs-shortcomings-in-ai-training-data-transparency-with-the-ai-act/.
Denison, George. “AI Data Scraping: Ethics and Data Quality Challenges.” Prolific, 24 Oct. 2023, www.prolific.com/resources/ai-data-scraping-ethics-and-data-quality-challenges.
Drenik, Gary. “Data Privacy and Ownership to Remain Key Concerns in Web Scraping Industry next Year.” Forbes, 18 Dec. 2023, www.forbes.com/sites/garydrenik/2023/12/18/data-privacy-and-ownership-to-remain-key-concerns-in-web-scraping-industry-next-year/.
PBS News. “Reddit Sues AI Company over Alleged ‘Industrial-Scale’ Scraping of Its Users’ Comments.” PBS News, 22 Oct. 2025, pbs.org/newshour/nation/reddit-sues-ai-company-over-alleged-industrial-scale-scraping-of-its-users-comments. Accessed 23 Nov. 2025.
