Matt Vetter (Department of Language, Literature, and Writing), with co-authors Jialei Jiang and Zachary McDowell, recently published an article titled “An Endangered Species: How LLMs Threaten Wikipedia’s Sustainability” in the journal AI & Society.
This research article investigates the intricate relationship between Wikipedia and large language models (LLMs), highlighting potential threats to Wikipedia's sustainability as its vast, openly accessible data is used to train AI. Through expert interviews, the study uncovers concerns about the unclear role and value attributed to Wikipedia in LLM training, the ethical implications of this usage for contributor expectations and data provenance, and the risk that systemic biases present in Wikipedia will be perpetuated in AI systems. Vetter et al. also examine how LLMs can act as intermediaries between users and Wikipedia, potentially diminishing information quality and undermining the encyclopedia's discoverability and community engagement. Ultimately, the authors call for accountability from big tech, advocating for collaborative frameworks that prioritize ethical considerations and the long-term health of the digital commons. They also explore potential solutions involving licensing, financial support, and technical advances in attribution, such as explainable AI (XAI).
Abstract
As a collaboratively edited and open-access knowledge archive, Wikipedia offers a vast dataset for training artificial intelligence applications and models, enhancing data accessibility and access to information. However, reliance on the crowd-sourced encyclopedia raises ethical issues related to data provenance, knowledge production, curation, and digital labor. Drawing on critical data studies, feminist posthumanism, and recent research at the intersection of Wikimedia and AI, this study employs problem-centered expert interviews to investigate the relationship between Wikipedia and large language models. Key findings include the unclear role of Wikipedia in LLM training, ethical issues, and potential solutions for systemic biases and sustainability challenges. By foregrounding these concerns, this study contributes to ongoing discourses on the responsible use of AI in digital knowledge production and information management. Ultimately, this article calls for greater transparency and accountability in how big tech entities use open-access datasets like Wikipedia, advocating for collaborative frameworks prioritizing ethical considerations and equitable representation.