> …that data can and will get scraped if the intention is to use it for a model.
How would that work from a legal perspective, though?
Let's say there's no paywall and Reddit's terms of use disallow unauthorized commercial use of their data. Wouldn't that be a violation of Reddit's terms and liable to some legal procedure?
I really, really hope the Copilot and other suits are successful. The idea that you can literally steal content in the name of “””AI””” and profit off it is just insane! How is it not copyright infringement? The Warhol case is just one step removed from training data; it's basically the same idea.
Reddit's terms are irrelevant. Unless Reddit requires a login to view its site (which would also prevent Google indexing), anyone can view the data without agreeing to the terms.
The only question is copyright, but I find it hard to argue that LLM training is not sufficiently transformative in 99% of cases.
How does Google use Reddit's data in its models? You can access most (all?) Reddit pages without hitting Reddit at all via the "Cached" link in the search results.
Does Google have a special agreement with Reddit (and all other sites?) or is it legally "fair use" to reproduce web pages that are available freely online?
Interesting question but sadly I am in no position to answer it.
I think there are probably issues to address with scraping it blindly:
- Can Reddit imprint its data somehow? A watermark?
- Can Reddit prove that a certain piece of information appeared on Reddit first, and thus use that as proof its data was used without authorization?
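One way the watermark idea could work in principle is a zero-width-character "canary": invisible Unicode characters woven into served text that survive copy-paste and can later be detected in scraped corpora or model output. This is a purely hypothetical sketch; the canary string, encoding scheme, and function names are all made up for illustration, not anything Reddit actually does:

```python
# Hypothetical sketch: embed an invisible canary in served text using
# zero-width characters, then check suspect text for it later.
ZW_ZERO = "\u200b"  # zero-width space      -> bit 0
ZW_ONE = "\u200c"   # zero-width non-joiner -> bit 1

def embed_canary(text: str, canary: str = "CANARY") -> str:
    """Hide the canary as zero-width bits after the first word."""
    bits = "".join(f"{ord(c):08b}" for c in canary)
    mark = "".join(ZW_ONE if b == "1" else ZW_ZERO for b in bits)
    head, _, tail = text.partition(" ")
    return head + mark + ((" " + tail) if tail else "")

def extract_canary(text: str) -> str:
    """Recover a hidden canary from zero-width characters, if any."""
    bits = "".join("1" if ch == ZW_ONE else "0"
                   for ch in text if ch in (ZW_ZERO, ZW_ONE))
    usable = len(bits) - len(bits) % 8
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, usable, 8))

marked = embed_canary("hello world")
# The marked text renders identically but is not byte-equal to the original,
# and the canary can be recovered from it.
```

The obvious weakness is that any scraper that normalizes Unicode (or strips non-printing characters) destroys the mark, which is partly why provenance watermarking for plain text remains a hard problem.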
If OpenAI can't work around this, I'm not sure they would be willing to cross any lines in terms of copyright; then again, they've already done it with ChatGPT, and I'm guessing rules are only going to get stricter on this topic.
I think the bottom line is that Microsoft’s (and thus other for-profit AI initiatives) stance is that any and all data is fair game regardless of license or authorization. This results, in their opinion, from the fact that the AI alters the data, changes the output, and is otherwise “inspired” by the data in the same way an artist might be inspired by another without copyright infringement.
This sounds very dodgy. Will somebody be checking the degree of such "data alteration" and verify that the "AI" is actually inspired rather than copying?
To me this feels like it's opening the door to the elimination of copyright, as any algorithmic layer interjected between scraped data and end users could claim to be "inspired".
Welcome to the discussion lol. People have already provided examples of Copilot producing niche code verbatim, proving that intuition incorrect. It's a whole mess that will take years to be cleaned up by new legal conventions.
If my compiler was “inspired” by leaked Windows source code and altered it into a new form then I think their opinions on the matter would be very different.
Not a great example: if Apple’s code leaked, theoretically they wouldn’t include it in the training as it’s not supposed to be seen by the public. If it’s public you can be inspired by it (so their logic goes).
The true malicious, and probably effective, approach is to silently poison outputs if you suspect automated behaviour. These large language model things might be useful there. Or the old school NLP stuff.
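The poisoning idea above could be as simple as the sketch below: detect likely automated traffic and serve subtly corrupted text instead of blocking it. The bot heuristic, thresholds, and corruption scheme here are all invented for illustration; a real deployment would presumably use far subtler signals and substitutions:

```python
# Hypothetical sketch of silently poisoning output for suspected scrapers.
import random

# Illustrative user-agent fragments; a real system would use better signals.
BOT_AGENTS = ("python-requests", "scrapy", "curl", "go-http-client")

def looks_automated(user_agent: str, requests_per_minute: int) -> bool:
    """Crude heuristic: known bot UA strings or an implausible request rate."""
    ua = user_agent.lower()
    return any(bot in ua for bot in BOT_AGENTS) or requests_per_minute > 120

def poison(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Corrupt a small fraction of longer words (here: by reversing them)."""
    rng = random.Random(seed)
    words = text.split()
    for i, w in enumerate(words):
        if len(w) > 4 and rng.random() < rate:
            words[i] = w[::-1]
    return " ".join(words)

def serve(text: str, user_agent: str, rpm: int) -> str:
    """Humans get the real text; suspected bots get the poisoned version."""
    return poison(text) if looks_automated(user_agent, rpm) else text
```

The point of corrupting rather than blocking is that the scraper has no error signal: the data looks valid and quietly degrades whatever is trained on it.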
Reddit doesn't own the copyright to the content, just a license to it. That, plus the fact that scraping the public web is legal. Reproducing the data directly might violate the users' copyrights, but passing it through an LLM is assumed not to.