Recent research suggests AI may be a threat to Reddit, Q&A platforms
A newly published study suggests that generative artificial intelligence and large language models could be a threat to question and answer platforms, including Reddit (NYSE:RDDT).
The study, published in the scientific journal PNAS Nexus this month, noted that widespread large language model adoption significantly reduced the amount of public sharing on Stack Overflow, a Q&A website for programmers.
The study found a 25% decline in sharing on Stack Overflow in the first six months after ChatGPT's release, compared with Russian and Chinese counterpart sites, where use of ChatGPT is “limited.”
The study’s authors said the decline may be the “lower bound of the true impact of ChatGPT on Stack Overflow,” and that the decline is larger for posts related to the most widely used programming languages.
“Thus, LLMs are not only displacing duplicate, low-quality, or beginner-level content,” the authors of the study wrote in the abstract. “Our findings suggest that the rapid adoption of LLMs reduces the production of public data needed to train them, with significant consequences.”
ChatGPT was created by OpenAI, which is backed in part by Microsoft (NASDAQ:MSFT).
Broader implications
While the study itself focused on Stack Overflow, its authors suggested the findings carry broader, wide-ranging implications for the open web.
“This substitution threatens the future of the open web, as interactions with AI models are not added to the shared pool of online knowledge,” the authors wrote. “Moreover, this phenomenon could weaken the quality of training data for future models, as machine-generated content likely cannot fully replace human creativity and insight. This shift could have significant consequences for both the public Internet and the future of AI.”
That dynamic may wind up impacting Reddit (RDDT), which went public in March.
The study’s authors noted that other research examining “the evolution of activity on Stack Overflow and Reddit found similar results to ours.”
However, the study’s authors also noted that the widespread adoption of ChatGPT could “ironically make it difficult to train future models,” and that ChatGPT itself may not “effectively replace its most important input: data derived from human activity.”
“Though researchers have already expressed concerns about running out of data for training AI models, our results show that the use of LLMs can slow down the creation of new (open) data. Given the growing evidence that data generated by LLMs are unlikely to effectively train new LLMs, modelers face the real problem of running out of useful data. While research on using synthetic data and mixed data to train LLMs is still ongoing, current results show that use of synthetic training data can degrade performance and may even amplify biases in models. Human input and guidance can mitigate these issues to some extent, but in general it is still unclear if synthetic data can power continued advances in LLM capabilities. If ChatGPT truly is a “blurry JPEG” of the web, then in the long run, it cannot effectively replace its most important input: data derived from human activity. Indeed, OpenAI’s recent strategic partnerships with Stack Overflow and Reddit demonstrate the value of this kind of data for the continued training of LLMs.”
Reddit has content licensing deals with Alphabet’s (GOOG) (GOOGL) Google, as well as OpenAI.
Reddit has not yet responded to a request for comment from Seeking Alpha.