Stack Overflow Will Charge AI Giants for Training Data

The programmer Q&A site joins Reddit in demanding compensation when its data is used to train algorithms and ChatGPT-style bots

Developing the AI systems behind tools such as ChatGPT and the image generator Dall-E costs hundreds of millions of dollars—and it’s about to get more expensive.

OpenAI, Google, and other companies building large-scale AI projects have traditionally paid nothing for much of their training data, scraping it from the web. But Stack Overflow, a popular internet forum for computer programming help, plans to begin charging large AI developers as soon as the middle of this year for access to the 50 million questions and answers on its service, CEO Prashanth Chandrasekar says. The site has more than 20 million registered users.

Stack Overflow’s decision to seek compensation from companies tapping its data, part of a broader generative AI strategy, has not been previously reported. It follows an announcement by Reddit this week that it will begin charging some AI developers to access its own content starting in June.

The two community sites are not alone in wanting a share. The News/Media Alliance, a US trade group of publishers, including Condé Nast, which owns WIRED, today unveiled principles calling on generative AI developers to negotiate any use of their data for training and other purposes and respect their right to fair compensation.

Meta, Google, and OpenAI—maker of ChatGPT—all have developed AI systems using data sets that culled content from thousands of online sources, including Stack Overflow and Reddit, according to outside analyses and their own disclosures. Feeding text from online banter or expert discussions about programming into machine learning algorithms known as large language models, or LLMs, can help AI text generators or chatbots be more fluent and knowledgeable. Using LLMs to generate programming code is viewed as one of the technology's biggest opportunities, with Microsoft charging as much as $19 a month per person for its code generator GitHub Copilot.

“Community platforms that fuel LLMs absolutely should be compensated for their contributions so that companies like us can reinvest back into our communities to continue to make them thrive,” Stack Overflow’s Chandrasekar says. “We're very supportive of Reddit’s approach.”

Chandrasekar described the potential additional revenue as vital to ensuring Stack Overflow can keep attracting users and maintaining high-quality information. He argues that will also help future chatbots, which need “to be trained on something that's progressing knowledge forward. They need new knowledge to be created.” But fencing off valuable data also could deter some AI training and slow improvement of LLMs, which are a threat to any service that people turn to for information and conversation. Chandrasekar says proper licensing will only help accelerate development of high-quality LLMs.

Every AI developer is seeking to bring down the huge costs of developing large-scale AI systems, which take enormous numbers of expensive computers to power. Having to pay for data they once grabbed for free could extend the already unclear timelines to turning a profit on their emerging technologies. OpenAI did not respond to a request for comment, and Meta and Google did not have immediate comment.

Large language models can generate strings of text based on word patterns learned from the web pages, books, and other bodies of text in their training data. Besides ChatGPT, the programs make up the guts of search chatbots such as Microsoft Bing chat and Google’s Bard, and they underlie a growing number of applications that produce professional and creative copy in a flash. Their counterparts that generate AI-composed illustrations and videos draw on patterns from image datasets such as photos gathered from Pinterest and Flickr.

Often, data sets used in AI development are built through unofficial means such as dispatching software that scrapes content from websites. In the US that is typically considered legal, though copyright issues and websites’ terms of use against the practice have left it in dispute.

A few websites such as Reddit and Stack Overflow have been more inviting. They offer downloadable “data dumps” or real-time data portals, known as APIs, that help software access their content. In Stack Overflow’s case, LLM developers are getting their hands on data through a mix of dumps, APIs, and scraping, Chandrasekar says, all of which today can be done for free.

But Chandrasekar says that LLM developers are violating Stack Overflow’s terms of service. Users own the content they post on Stack Overflow, as outlined in its TOS, but it all falls under a Creative Commons license that requires anyone later using the data to mention where it came from. When AI companies sell their models to customers, they “are unable to attribute each and every one of the community members whose questions and answers were used to train the model, thereby breaching the Creative Commons license,” Chandrasekar says.

Neither Stack Overflow nor Reddit has released pricing information. “We're working on that as we speak,” Reddit spokesperson Tim Rathschmidt says, “and will share more with partners in the coming weeks.” Stack Overflow will study Reddit’s strategy and consult with its own potential customers, some of whom have already reached out about data access, Chandrasekar says. 

A potential roadmap to pricing could come from Elon Musk, who this month hiked prices for access to Twitter data. They start at $42,000 per month for access to 50 million tweets. About three times the volume of tweets had been previously available for free. In a tweet this week, Musk accused Microsoft, a major AI developer and close partner of OpenAI, of training algorithms “illegally using Twitter data.” Without elaboration, he added, “Lawsuit time.”

Both Stack Overflow and Reddit will continue to license data for free to some people and companies. Chandrasekar says Stack Overflow wants remuneration only from companies developing LLMs for big, commercial purposes. “When people start charging for products that are built on community-built sites like ours, that's where it's not fair use,” he says.

Reddit CEO Steve Huffman told The New York Times this week that he didn’t want to give a freebie to the world’s largest companies. “Crawling Reddit, generating value and not returning any of that value to our users is something we have a problem with,” he said.

As expectations surge that ChatGPT-style bots and other products built on LLMs will reap huge profits, other companies with stocks of content needed to train machine learning algorithms also want to be paid. Some news publishers have been wary of how Microsoft’s new Bing chatbot handles their content.  

But so far only a few public deals over access to training data have been announced, such as photo bank Shutterstock agreeing to license content to OpenAI. Its rival Getty Images is suing Stability AI, an OpenAI competitor, for not seeking a license before allegedly using over 12 million photos. The AI startup’s response is due in US federal court next week.

AI developers are not under all-out pressure to pay yet. Some companies with large volumes of academic text or casual conversations say they have no plans to start charging for their APIs or similar data portals. PLOS, a publisher of scientific research whose content has been leveraged in AI training, is “not likely” to change its fairly unrestrictive terms of use, spokesperson David Knutson says. Online community platform Discord has no plans to modify its API offerings, which are free and provided under terms that forbid AI training, says spokesperson Swaleha Carlson.

At Stack Overflow, charging for its API is just one part of a broader AI strategy that the company expects to unveil in a few months. About 10 percent of Stack Overflow's nearly 600 staff are focused on the initiative, which includes developing its own generative AI services. For example, an assistant function could help guide people as they compose questions to post.

To date, the Stack Overflow community’s primary action has been to ban users from posting AI-generated responses. Chandrasekar says a spike in inaccurate answers following the release of ChatGPT had created a challenge for the company’s several hundred or so moderators.

Launched in 2008, Stack Overflow generates about equal parts of its revenue from selling ads and licensing Q&A software as a subscription to more than 1,200 organizations for internal use. The company’s sales grew 33 percent to $45 million during the six months ended September 30, 2022, the most recent data available, compared with the year-earlier period. About 200,000 new users registered on average each month during that span.

Those users could reasonably clamor for their own compensation if Stack Overflow succeeds in licensing to AI makers the questions and answers they write for free. Chandrasekar says, “There's absolutely thought going into how best to make sure that our community members and the people that make the site what it is today—how we are going to take care of them in the context of what's happening here.”