AI scraping has become its own media business

There are several dimensions to the ongoing legal war between the media industry and AI companies over copyright, and one of the major ones is the question of outputs. Which is to say: Scraping content without permission may be detestable, but if the party doing the scraping isn’t doing anything with it that would compete with the content creator, it’s difficult to prove harm. And many legal proceedings, especially civil claims, depend on showing the actions were harmful.

One of the earlier rulings in this area exemplifies the point. A group of authors, including comedienne Sarah Silverman, sued OpenAI way back in 2023 for appropriating their books without compensation. A judge later dismissed several of the authors’ claims because the lawsuit didn’t identify specific outputs that were direct copies. It turns out just pointing out that a large language model (LLM) was trained on your material isn’t enough—you have to show it’s creating outputs that take business away from you.

The output problem

Copyright lawsuits like the Silverman case often depend on showing specific instances of scraping and reproduction. The problem is, much of this activity is in the realm of bots: scraping done quickly, silently, and at scale. And while the outputs of big, public-facing AI services like ChatGPT, Gemini, and Perplexity are there for everyone to see, there’s a whole shadow industry of mass AI scraping that isn’t.

It’s been an open secret that AI companies sometimes obtain data from third-party brokers, and media industry analyst Matthew Scott Goldstein recently published an extensive report on them. The conclusions, as reported in Digiday, are eye-opening: At least 21 companies, several funded to the tune of hundreds of millions of dollars, routinely scrape publisher content without paying for it, and sell their “data services” to customers that include OpenAI, Amazon, and even other publishers like The Telegraph.

The report shows what “outputs” are when scraping is allowed at scale: multimillion-dollar companies built around parsing internet data for bots and agents, indexing that content, and selling it. These aren’t famous companies; they have names like Parallel AI, Exa, and Bright Data. Goldstein points out that they aren’t shy about what they’re doing: While a recent Wall Street Journal profile describes Parallel AI as a platform “dedicated to servicing AI agents,” he characterizes it as a “scraper company with better branding.”

As the saying goes, show me the incentives, and I’ll show you the outcome. Given the setbacks in copyright cases before the courts, not to mention the current administration’s dismissal of copyright concerns, the message is clear: There are few, if any, consequences for unauthorized scraping, and the legal and technical mechanisms governing it generally default to greater access for AI systems.

Block the bots, or build for them?

This reality creates an existential dilemma for media companies. Do you aggressively block bots from accessing your content, or do you let them in? The latter means essentially conceding the fight (or at least letting others fight it for you), but it also gets you out of the game of whack-a-mole with AI scrapers. More importantly, it frees you up to build a business around the idea that AI ingests and repurposes your content.

I actually don’t believe these two perspectives are as contradictory as they may seem. Yes, copyright holders should assert their intellectual property rights, but they also need to contend with a future where AI engines are an essential part of content strategy. AI is a distribution channel, an intermediary, and an audience, all at the same time.

What does a considered approach to the scraping ecosystem look like? I see five components, not all of which will be available to every media company:

Get better at blocking bots: Protecting your IP requires both technical and legal components. Most major publishers are blocking bots, at least on paper, though being aggressive about it means going beyond adjustments to the robots exclusion protocol (the instructions every site has for bots trying to scrape their site—which are often ignored). For instance, People Inc. CEO Neil Vogel has said his company has needed to become highly sophisticated at blocking unauthorized bots.
Most publishers don’t have the same resources. However, there are technical partners that can help, and infrastructure companies like Cloudflare have moved toward copyright-protecting defaults. Even if sophisticated blocking tech isn’t an option, you can still gather intel. Don’t just look at the bot traffic to your site; you should regularly audit AI systems to find where your content has been appropriated and misused.
Practice good GEO: It might seem counterintuitive, but regardless of whether your site is being scraped, you should make your content as friendly to AI scrapers as possible. Access is a separate, binary question: either a bot should be scraping or it shouldn’t. The problem with ignoring generative engine optimization (GEO) is that content that’s hard for bots to interpret is hard for authorized and unauthorized bots alike.
There are several advantages to practicing good GEO. For starters, scraping is happening regardless, so you should compete for a place in AI summaries, even if you don’t like being there without compensation. You may as well get the visibility and the (small) amount of qualified traffic that results. Also, it creates a paper trail for your proactive auditing, and potentially helps prove your value in any legal proceedings. Finally, it will be essential if you build an in-house agent or MCP server for your content.
Shift your business model: I’ve written about this extensively, but the reality is the media model of the Google era is rapidly diminishing. That means any business that’s primarily based on monetizing anonymous traffic is shrinking. New revenue streams need to be nurtured, including events, subscriptions, data and more. I know—easier said than done, but diversifying revenue needs to become religion among ad-dependent publishers.
Sue: This is not an option for everyone, obviously. Very few media companies have the resources to take on an OpenAI or a Perplexity in court. But the report on the shadow market of industrial-scale scraping opens up a group of companies that have been largely invisible up until now. Given what they’re openly doing, how much money is involved, and the stakes for publishers, it would be surprising if more legal action didn’t result.
Lobby for regulation: While regulation at the federal level seems unlikely in the current environment, many states are attempting to regulate AI, including through training-data transparency and disclosure rules. And it may not even require a wholesale updating of copyright law. The mere requirement for bots to properly identify themselves would ensure some bots couldn’t effectively impersonate humans, allowing for much more robust governance mechanisms.
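To make the first component a little more concrete, here is a minimal sketch of how a publisher might check what its robots exclusion rules actually say about a given crawler. It uses Python's standard `urllib.robotparser` module. The rules, the crawler names (`ExampleAIBot`, `SomeOtherBot`), and the URL are all hypothetical, chosen only for illustration. As noted above, these rules are advisory, so this shows what a site is asking for, not what scrapers actually do.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the crawler names are made up for this example.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def check_access(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return, for each user agent, whether the rules permit fetching the URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = check_access(
    ROBOTS_TXT,
    ["ExampleAIBot", "SomeOtherBot"],
    "https://example.com/news/story",
)
print(result)
```

Comparing a report like this against server logs is one way to spot bots that request pages the rules tell them to avoid, which is the kind of intel the auditing advice above calls for.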

Reasserting agency

As AI bots continue to “eat the internet,” publishers may feel a sense of helplessness, as if scraping is just another brutal inevitability to be endured. There’s some truth to that. But inevitability shouldn’t become an excuse for paralysis. In a world increasingly dominated by agents, publishers need to reassert their own agency: protecting what they can, adapting where they must, and refusing to let the future of their work be decided entirely by the same companies that scraped it.
