There are many dimensions to the ongoing legal conflict between the media industry and AI companies over copyright, and one of the major ones is the question of outputs. Which is to say: Scraping content without permission may be detestable, but if the party doing the scraping isn't doing anything with it that competes with the content creator, it's difficult to prove harm. And many legal proceedings, especially civil claims, depend on showing the actions were harmful.
One of the earlier rulings in this area exemplifies the point. A group of authors, including comedian Sarah Silverman, sued OpenAI back in 2023 for appropriating their books without compensation. A judge later dismissed several of the authors' claims because the lawsuit didn't identify specific outputs that were direct copies. It turns out simply stating that a large language model (LLM) was trained on your material isn't enough; you have to show it's creating outputs that take business away from you.
The output problem
Copyright lawsuits like the Silverman case often depend on showing specific instances of scraping and reproduction. The problem is, much of this activity is in the realm of bots: scraping done quickly, silently, and at scale. And while the outputs of big, public-facing AI services like ChatGPT, Gemini, and Perplexity are there for everyone to see, there's a whole shadow industry of mass AI scraping that isn't.
It's been an open secret that AI companies often obtain data from third-party brokers, and media industry analyst Matthew Scott Goldstein recently published an extensive report on them. The conclusions, as reported in Digiday, are eye-opening: At least 21 companies, several funded to the tune of hundreds of millions of dollars, routinely scrape publisher content without paying for it, and sell their "data services" to customers that include OpenAI, Amazon, and even other publishers like The Telegraph.
The report shows what "outputs" look like when scraping is allowed at scale: multimillion-dollar companies built around parsing web data for bots and agents, indexing that content, and selling it. These aren't well-known companies; they have names like Parallel AI, Exa, and Bright Data. Goldstein points out that they aren't shy about what they're doing: While a recent Wall Street Journal profile describes Parallel AI as a platform "dedicated to servicing AI agents," he characterizes it as a "scraper company with better branding."
As the saying goes, show me the incentives, and I'll show you the outcome. Given the setbacks in copyright cases before the courts, not to mention the current administration's dismissal of copyright concerns, the message is clear: There are little to no consequences for unauthorized scraping, and generally the legal and technical mechanisms governing it default to greater access for AI systems.
Block the bots, or build for them?
This reality creates an existential dilemma for media companies. Do you aggressively block bots from accessing your content, or do you let them do it? The latter means essentially conceding the battle (or at least letting others fight it for you), but it also gets you out of the game of whack-a-mole with AI scrapers. More importantly, it frees you up to build a business around the idea that AI ingests and repurposes your content.
I actually don't believe these two views are as contradictory as they might seem. Yes, copyright holders should assert their intellectual property rights, but they also need to contend with a future where AI engines are an essential part of content strategy. AI is a distribution channel, an intermediary, and an audience, all at the same time.
What does a considered approach to the scraping ecosystem look like? I see five components, not all of which will be available to every media company:
- Get better at blocking bots: Protecting your IP requires both technical and legal components. Most major publishers are blocking bots, at least on paper, though being aggressive about it means going beyond adjustments to the robots exclusion protocol (the instructions every website publishes for bots attempting to scrape it, which are often ignored). For instance, People Inc. CEO Neil Vogel has said his company has needed to become extremely sophisticated at blocking unauthorized bots. Most publishers don't have the same resources. Still, there are technical partners that can help, and infrastructure companies like Cloudflare have moved toward copyright-protecting defaults. Even if sophisticated blocking tech isn't an option, you can still gather intel. Don't just look at the bot traffic to your site; you should regularly audit AI systems to find where your content has been appropriated and misused.
- Practice good GEO: It may seem counterintuitive, but regardless of whether your site is being scraped, you should make your content as friendly to AI scrapers as possible. The question of access is a binary: either they should be scraping you or they shouldn't. The problem with ignoring generative engine optimization (GEO) is that, if your content is hard for bots to interpret, that applies to both authorized and unauthorized bots. There are several advantages to practicing good GEO. For starters, there's the reality that scraping is happening, so you should compete in AI summaries, even if you don't like being there without compensation. You may as well get the visibility and the (small) qualified traffic that results. It also creates a paper trail for your proactive auditing, and potentially helps prove your value in any legal proceedings. Finally, it will be essential if you build an in-house agent or MCP server for your content.
- Shift your business model: I've written about this extensively, but the reality is the media model of the Google era is rapidly diminishing. That means any business based primarily on monetizing anonymous traffic is shrinking. New revenue streams need to be nurtured, including events, subscriptions, data, and more. I know: easier said than done. But diversifying revenue needs to become religion among ad-dependent publishers.
- Sue: This isn't an option for everyone, obviously. Very few media companies have the resources to take on an OpenAI or a Perplexity in court. But the report on the shadow market of industrial-scale scraping exposes a group of companies that have been largely invisible until now. Given what they're openly doing, how much money is involved, and the stakes for publishers, it would be surprising if more legal action didn't result.
- Lobby for regulation: While regulation at the federal level seems unlikely in the current environment, many states are trying to regulate AI, including through training-data transparency and disclosure rules. And it may not even require a wholesale updating of copyright law. The mere requirement that bots properly identify themselves would ensure they couldn't effectively impersonate humans, allowing for much more robust governance mechanisms.
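On the intel-gathering point above: because many AI crawlers do announce themselves in their user-agent strings, even a small publisher can audit server logs for them. Below is a minimal sketch of that idea. The user-agent tokens are the published names of real crawlers (GPTBot, ClaudeBot, PerplexityBot, CCBot, etc.), but the log format (Apache/nginx "combined") and the helper functions are assumptions for illustration, not a production tool; sophisticated scrapers can and do spoof user agents, so this only catches the honest ones.

```python
import re

# Published user-agent substrings for some well-known AI crawlers
# (non-exhaustive; spoofable, so treat matches as a floor, not a ceiling).
AI_BOT_TOKENS = [
    "GPTBot",         # OpenAI training crawler
    "OAI-SearchBot",  # OpenAI search crawler
    "ClaudeBot",      # Anthropic
    "PerplexityBot",  # Perplexity
    "CCBot",          # Common Crawl
    "Bytespider",     # ByteDance
    "Amazonbot",      # Amazon
]

def detect_ai_bot(user_agent: str):
    """Return the first matching AI crawler token, or None."""
    ua = user_agent.lower()
    for token in AI_BOT_TOKENS:
        if token.lower() in ua:
            return token
    return None

def audit_log(lines):
    """Count hits per AI crawler across combined-format access log lines."""
    counts = {}
    for line in lines:
        # In combined log format, the user agent is the last quoted field.
        quoted = re.findall(r'"([^"]*)"', line)
        ua = quoted[-1] if quoted else ""
        bot = detect_ai_bot(ua)
        if bot:
            counts[bot] = counts.get(bot, 0) + 1
    return counts

# Hypothetical sample log lines for illustration.
sample = [
    '1.2.3.4 - - [01/Jan/2025:00:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
    '5.6.7.8 - - [01/Jan/2025:00:00:01 +0000] "GET /b HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (Windows NT 10.0) Chrome/120.0 Safari/537.36"',
]
print(audit_log(sample))  # prints {'GPTBot': 1}
```

A report like this, kept over time, is exactly the kind of paper trail that supports both a blocking strategy and any eventual legal claim.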
Reasserting agency
As AI bots continue to "eat the web," publishers may feel a sense of helplessness: that scraping is just another brutal inevitability to be endured. There's some truth to that. But inevitability shouldn't become an excuse for paralysis. In a world increasingly dominated by agents, publishers need to reassert their own agency: protecting what they can, adapting where they must, and refusing to let the future of their work be decided entirely by the same companies that scraped it.

