Data revolts break out against artificial intelligence

For more than 20 years, Kit Loffstadt has written fan fiction that explores alternate universes of “Star Wars” heroes and “Buffy the Vampire Slayer” villains, sharing her stories for free online.

But in May, Ms. Loffstadt stopped publishing her creations after learning that a data company had copied her stories and fed them into the artificial intelligence technology behind ChatGPT, the viral chatbot. Horrified, she hid her writing behind a locked account.

Ms. Loffstadt also helped organize an act of rebellion last month against AI systems. Along with dozens of other fan fiction writers, she published a flood of irreverent stories online to overwhelm and confuse the data-collection services that feed writers’ work into AI technology.

“Each of us has to do everything we can to show them that the output of our creativity is not for machines to harvest as they like,” said Ms. Loffstadt, a 42-year-old voice actress from South Yorkshire in Britain.

Fan fiction writers are just one group now staging revolts against AI systems as a tech frenzy grips Silicon Valley and the world. In recent months, social media companies like Reddit and Twitter, news organizations including The New York Times and NBC News, and authors like the novelist Paul Tremblay and the actress Sarah Silverman have taken a stand against AI systems sucking up their data without permission.

Their protests have taken various forms. Writers and artists are locking their files to protect their work or boycotting websites that publish AI-generated content, while companies like Reddit want to charge for access to their data. At least 10 lawsuits have been filed this year against AI companies, accusing them of training their systems on the creative work of artists without consent. Last week, Ms. Silverman and the authors Christopher Golden and Richard Kadrey sued OpenAI, the maker of ChatGPT, and others over the use of their work in artificial intelligence systems.

At the heart of the rebellions is a newfound understanding that online information — stories, artwork, news articles, message board posts and photos — can have great untapped value.

The new wave of artificial intelligence — known as “generative AI” for the text, images and other content it produces — is built on top of complex systems such as large language models, which are capable of producing humanlike prose. These models are trained on vast troves of all kinds of data so they can answer people’s questions, mimic writing styles or churn out comedy and poetry.

That has set off a hunt by tech companies for even more data to feed their AI systems. Google, Meta and OpenAI have mainly drawn on information from all over the internet, including large databases of fan fiction, troves of news articles and collections of books, much of which was freely available online. In tech industry parlance, this is known as “scraping” the internet.

OpenAI’s GPT-3, an AI system released in 2020, was trained on 500 billion “tokens,” each representing pieces of words found mostly online. Some AI models are trained on more than a trillion tokens.

The practice of scraping the internet is long-standing and was largely disclosed by the companies and nonprofit organizations that did it. But it was not well understood or seen as especially problematic by the companies that owned the data. That changed after ChatGPT debuted in November and the public learned more about the underlying AI models that power such chatbots.

“What’s happening here is a fundamental realignment of the value of data,” said Brandon Duderstadt, founder and CEO of Nomic, an artificial intelligence company. “Before, the idea was that you get value out of data by making it open to everyone and showing ads. Now, the idea is that you lock down your data, because you can extract more value when you use it as input into your AI.”

The data protests may have little effect in the long run. Deep-pocketed tech giants like Google and Microsoft already sit on mountains of proprietary information and have the resources to license more. But as the era of easily scraped content draws to a close, smaller AI upstarts and nonprofits that hoped to compete with the big companies may be unable to obtain enough content to train their systems.

In a statement, OpenAI said that ChatGPT was trained on “licensed content, publicly available content, and content created by human AI trainers.” The company added, “We respect the rights of creators and authors and look forward to continuing to work with them to protect their interests.”

Google said in a statement that it was involved in discussions about how publishers might manage their content in the future. “We believe everyone benefits from a vibrant content ecosystem,” the company said. Microsoft did not respond to a request for comment.

The data revolts erupted last year after ChatGPT became a global phenomenon. In November, a group of programmers filed a proposed class-action lawsuit against Microsoft and OpenAI, alleging that the companies infringed their copyrights after their code was used to train an AI-powered programming assistant.

In January, Getty Images, which provides images and videos, sued Stability AI, an artificial intelligence company that generates images from text descriptions, claiming that the startup had used copyrighted images to train its systems.

Then in June, Clarkson, a Los Angeles law firm, filed a 151-page proposed class-action lawsuit against OpenAI and Microsoft, describing how OpenAI had collected data from minors and saying that web scraping violated copyright law and constituted “theft.” On Tuesday, the firm filed a similar lawsuit against Google.

“The data rebellion we’re seeing across the country is society’s way of responding to this notion that Big Tech is simply entitled to take any and all information from any source whatsoever, and make it their own,” said Ryan Clarkson, the firm’s founder.

Eric Goldman, a professor at Santa Clara University School of Law, said the lawsuit’s arguments were expansive and unlikely to be accepted by the court. But he said the wave of litigation was just beginning, with a “second and third wave” of lawsuits coming that would define the future of AI.

Big companies are also pushing back against AI scrapers. In April, Reddit said it wanted to charge for access to its application programming interface, or API, the means by which third parties can download and analyze the social network’s vast trove of person-to-person conversations.

Reddit’s chief executive, Steve Huffman, said at the time that his company “doesn’t need to give all of that value to some of the largest companies in the world for free.”

In the same month, Stack Overflow, a question-and-answer site for computer programmers, said it would also require AI companies to pay for data. The site contains nearly 60 million questions and answers. Its move was reported earlier by Wired.

News organizations are also fighting back against AI systems. In an internal memo on the use of generative AI in June, The Times said AI companies should “respect our intellectual property.” A Times spokesperson declined to elaborate.

For individual artists and writers, resisting AI systems means rethinking where to publish.

Nicholas Cole, 35, an artist in Vancouver, British Columbia, was alarmed at how his distinct art style could be replicated by an AI system and suspected that the technology had scraped his work. He plans to keep posting his creations on Instagram, Twitter and other social media sites to attract clients, but he has stopped publishing on sites like ArtStation that host AI-generated content alongside human-made work.

“It’s just willful plagiarism of me and other artists,” Mr. Cole said. “It puts a pit of existential dread in my stomach.”

On Archive of Our Own, a fan fiction database of more than 11 million stories, writers have increasingly pressured the site to ban data scraping and AI-generated stories.

In May, when some Twitter accounts shared examples of ChatGPT mimicking the style of popular fan fiction posted on Archive of Our Own, dozens of writers rose up in arms. They locked their stories and wrote subversive content to mislead the AI scrapers. They also pushed Archive of Our Own’s leaders to stop allowing AI-generated content.

Betsy Rosenblatt, who provides legal advice to Archive of Our Own and is a professor at the University of Tulsa School of Law, said the site has a policy of “maximum inclusivity” and does not want to be in the position of policing which stories were written with AI.

For Ms. Loffstadt, the fan fiction writer, the battle against AI came to a head when she was writing a story about “Horizon Zero Dawn,” a video game in which humans battle AI-powered robots in a postapocalyptic world. In the game, she said, some of the robots were good and others were bad.

But in the real world, she said, “thanks to corporate hubris and greed, they are being twisted into doing bad things.”