> …that data can and will get scraped if the intention is to use it for a model.
How would that work from a legal perspective, though?
Let's say there's no paywall and Reddit's terms of use disallow unauthorized commercial use of their data. Wouldn't that be a violation of Reddit's terms and liable to some legal procedure?
I really, really hope the Copilot and other suits are successful. The idea that you can literally steal content in the name of “””AI””” and profit off it is just insane! How is it not copyright infringement? The Warhol case is just one step removed from training data; it's basically the same idea.
Reddit's terms are irrelevant. Unless Reddit requires a login to view its site (which would also prevent Google indexing), anyone can view the data without agreeing to the terms.
The only question is copyright, but I find it hard to argue that LLM training is not sufficiently transformative in 99% of cases.
How does Google use Reddit's data in its models? You can access most (all?) Reddit pages without hitting Reddit at all via the "Cached" link in the search results.
Does Google have a special agreement with Reddit (and all other sites?) or is it legally "fair use" to reproduce web pages that are available freely online?
Interesting question but sadly I am in no position to answer it.
I think there are probably issues to address with scraping it blindly:
- Can Reddit imprint its data somehow? A watermark?
- Can Reddit prove that a certain piece of information appeared on Reddit first, and thus use that as proof its data was used without authorization?
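One way the watermark idea could work in principle is a zero-width-character "canary": invisible Unicode characters woven into served text that survive copy-paste and can later be detected in scraped corpora or model output. This is a purely hypothetical sketch; the canary string, encoding scheme, and function names are all made up for illustration, not anything Reddit actually does:

```python
# Hypothetical sketch: embed an invisible canary in served text using
# zero-width characters, then check suspect text for it later.
ZW_ZERO = "\u200b"  # zero-width space      -> bit 0
ZW_ONE = "\u200c"   # zero-width non-joiner -> bit 1

def embed_canary(text: str, canary: str = "CANARY") -> str:
    """Hide the canary as zero-width bits after the first word."""
    bits = "".join(f"{ord(c):08b}" for c in canary)
    mark = "".join(ZW_ONE if b == "1" else ZW_ZERO for b in bits)
    head, _, tail = text.partition(" ")
    return head + mark + ((" " + tail) if tail else "")

def extract_canary(text: str) -> str:
    """Recover a hidden canary from zero-width characters, if any."""
    bits = "".join("1" if ch == ZW_ONE else "0"
                   for ch in text if ch in (ZW_ZERO, ZW_ONE))
    usable = len(bits) - len(bits) % 8
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, usable, 8))

marked = embed_canary("hello world")
# The marked text renders identically but is not byte-equal to the original,
# and the canary can be recovered from it.
```

The obvious weakness is that any scraper that normalizes Unicode (or strips non-printing characters) destroys the mark, which is partly why provenance watermarking for plain text remains a hard problem.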
If OpenAI can't work around this, I'm not sure they would be willing to cross any lines in terms of copyright; then again, they've already done it with ChatGPT, and I'm guessing rules are only going to get stricter on this topic.
I think the bottom line is that Microsoft’s (and thus other for-profit AI initiatives) stance is that any and all data is fair game regardless of license or authorization. This results, in their opinion, from the fact that the AI alters the data, changes the output, and is otherwise “inspired” by the data in the same way an artist might be inspired by another without copyright infringement.
This sounds very dodgy. Will somebody be checking the degree of such "data alteration" and verify that the "AI" is actually inspired rather than copying?
To me this feels like it's opening the door to the elimination of copyright, as any algorithmic layer interjected between scraped data and end users could claim to be "inspired".
Welcome to the discussion lol. People have already provided examples of Copilot producing niche code verbatim, proving that intuition incorrect. It's a whole mess that will take years to be cleaned up by new legal conventions.
If my compiler was “inspired” by leaked Windows source code and altered it into a new form then I think their opinions on the matter would be very different.
Not a great example: if Apple’s code leaked, theoretically they wouldn’t include it in the training as it’s not supposed to be seen by the public. If it’s public you can be inspired by it (so their logic goes).
The true malicious, and probably effective, approach is to silently poison outputs if you suspect automated behaviour. These large language model things might be useful there. Or the old school NLP stuff.
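The poisoning idea above could be as simple as the sketch below: detect likely automated traffic and serve subtly corrupted text instead of blocking it. The bot heuristic, thresholds, and corruption scheme here are all invented for illustration; a real deployment would presumably use far subtler signals and substitutions:

```python
# Hypothetical sketch of silently poisoning output for suspected scrapers.
import random

# Illustrative user-agent fragments; a real system would use better signals.
BOT_AGENTS = ("python-requests", "scrapy", "curl", "go-http-client")

def looks_automated(user_agent: str, requests_per_minute: int) -> bool:
    """Crude heuristic: known bot UA strings or an implausible request rate."""
    ua = user_agent.lower()
    return any(bot in ua for bot in BOT_AGENTS) or requests_per_minute > 120

def poison(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Corrupt a small fraction of longer words (here: by reversing them)."""
    rng = random.Random(seed)
    words = text.split()
    for i, w in enumerate(words):
        if len(w) > 4 and rng.random() < rate:
            words[i] = w[::-1]
    return " ".join(words)

def serve(text: str, user_agent: str, rpm: int) -> str:
    """Humans get the real text; suspected bots get the poisoned version."""
    return poison(text) if looks_automated(user_agent, rpm) else text
```

The point of corrupting rather than blocking is that the scraper has no error signal: the data looks valid and quietly degrades whatever is trained on it.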
Reddit doesn't own the copyright to the content, just a license to it. That, plus the fact that scraping the public web is legal. Reproducing the data directly might violate the users' copyrights, but passing it through an LLM is assumed not to.