Reddit has a new AI training deal to sell user content

L4sBot@lemmy.world · 2 years ago

Lmaydev@programming.dev · 2 years ago

I’d be very surprised if people weren’t already scraping Reddit for this.

NeatNit@discuss.tchncs.de · edit-2 2 years ago

it’s all but guaranteed. Reminds me of this Computerphile video: https://youtu.be/WO2X3oZEJOA?t=874 TL;DW: there were “glitch tokens” in GPT (and therefore ChatGPT) which undeniably came from Reddit usernames.

Note, there’s no proof that these Reddit usernames were in the training data (and there are even reasons to assume that they weren’t; watch the video for context), but there’s no doubt that OpenAI had already scraped Reddit data at some point prior to training, probably mixed in with all the rest of their text data. I see no reason to assume they completely removed all Reddit text before training. The video suggests reasons and evidence that they removed certain subreddits, not all of Reddit.

Verserk@lemmy.dbzer0.com · 2 years ago

That was the real reason for the API changes last year; apps just got caught in the crossfire.

fuckwit_mcbumcrumble@lemmy.world · 2 years ago

Yeah, I thought that was pretty much the established consensus on the thing. People questioning it confuses me, honestly.

Lvxferre [he/him]@mander.xyz · edit-2 2 years ago

For anyone looking for a gibberish generator to replace their Reddit content with, here’s one. This shit is like poison for those large models.

For automatic editing, I’m not sure what people can use nowadays; just before the APIcalypse I used Power Delete Suite, but I’m not sure if it still works, and I’m not creating a Reddit account just to test it out.
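
For anyone who would rather generate the filler text locally, a minimal sketch in Python could look like the one below. This is not how the linked generator works; the function name `gibberish`, the word-length range, and the sentence shape are all assumptions made up for illustration.

```python
import random
import string


def gibberish(n_words=30, seed=None):
    """Return n_words random lowercase pseudo-words joined into one sentence.

    Hypothetical helper for illustration only; the 2-9 letter word length
    is an arbitrary assumption, not taken from any existing tool.
    """
    rng = random.Random(seed)  # private RNG, so a fixed seed is repeatable
    words = []
    for _ in range(n_words):
        length = rng.randint(2, 9)  # inclusive bounds
        word = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        words.append(word)
    text = " ".join(words)
    # Capitalize the first letter and end with a period so it looks like a sentence.
    return text[0].upper() + text[1:] + "."


if __name__ == "__main__":
    print(gibberish(20))
```

Calling `gibberish(20)` prints a different 20-word string on each run; passing a `seed` makes the output repeatable. Uniform random letters are the crudest possible filler and easy to filter out, so treat this as a starting point, not a tested poisoning method.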

greaprr@sh.itjust.works · 2 years ago

Not that I’m against telling Reddit to fuck off in no uncertain terms, but won’t providing this kind of poisoning to AI training just make it more resilient to exactly this kind of thing?