Reddit’s API is effectively dead for archival. Third-party apps are gone. Reddit has threatened to cut off access to the Pushshift dataset multiple times. But 3.28TB of Reddit history exists as a torrent right now, and I built a tool to turn it into something you can browse on your own hardware.
The key point: This doesn’t touch Reddit’s servers. Ever. Download the Pushshift dataset, run my tool locally, get a fully browsable archive. Works on an air-gapped machine. Works on a Raspberry Pi serving your LAN. Works on a USB drive you hand to someone.
What it does: Takes compressed data dumps from Reddit (.zst), Voat (SQL), and Ruqqus (.7z) and generates static HTML. No JavaScript, no external requests, no tracking. Open index.html and browse. Want search? Run the optional Docker stack with PostgreSQL – still entirely on your machine.
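For context, the core loop is conceptually simple: stream-decompress the ndjson dump and render each record through a template. Here is a minimal sketch of that idea; it is not redd-archiver's actual code, and the filename, field names, and template are assumptions (Pushshift field names vary by dump era):

```python
# Minimal sketch, NOT the actual redd-archiver code.
# Streams a Pushshift-style .zst ndjson dump and emits one static HTML list.
import json
import zstandard             # pip install zstandard
from jinja2 import Template  # pip install Jinja2

POST_TMPL = Template("<li><a href='{{ url }}'>{{ title }}</a></li>")

def stream_records(path):
    # Pushshift dumps are compressed with a long zstd window,
    # so the decompressor needs an enlarged max_window_size.
    dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
    with open(path, "rb") as fh, dctx.stream_reader(fh) as reader:
        buf = b""
        while chunk := reader.read(2**20):
            buf += chunk
            *lines, buf = buf.split(b"\n")
            for line in lines:
                if line.strip():
                    yield json.loads(line)
        if buf.strip():  # final record with no trailing newline
            yield json.loads(buf)

with open("index.html", "w", encoding="utf-8") as out:
    out.write("<ul>\n")
    for post in stream_records("r_example_submissions.zst"):  # assumed filename
        out.write(POST_TMPL.render(title=post.get("title", ""),
                                   url=post.get("permalink", "")) + "\n")
    out.write("</ul>\n")
```

Everything is written once at generation time, which is why the output needs no JavaScript or server.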
API & AI Integration: Full REST API with 30+ endpoints – posts, comments, users, subreddits, full-text search, aggregations. Also ships with an MCP server (29 tools) so you can query your archive directly from AI tools.
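As a flavor of what querying that API might look like, here is a hypothetical call; the base address, route, and parameters are guesses for illustration, and the project's docs list the real endpoints:

```python
# Hypothetical example of hitting the archive's local REST API.
# BASE and the /search route are assumptions, not confirmed endpoints.
import requests  # pip install requests

BASE = "http://localhost:8000/api"

resp = requests.get(f"{BASE}/search", params={"q": "data hoarding", "limit": 5})
resp.raise_for_status()
for post in resp.json().get("results", []):
    print(post.get("title"))
```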
Self-hosting options:
- USB drive / local folder (just open the HTML files)
- Home server on your LAN
- Tor hidden service (2 commands, no port forwarding needed; see the torrc sketch after this list)
- VPS with HTTPS
- GitHub Pages for small archives
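For the Tor option, a typical hidden service needs only two torrc lines plus a tor restart; a sketch, assuming the archive is already being served on 127.0.0.1:8080 (the author's exact two commands may differ):

```
# additions to /etc/tor/torrc; assumes a local web server on port 8080
HiddenServiceDir /var/lib/tor/redd-archive/
HiddenServicePort 80 127.0.0.1:8080
```

After restarting tor, the hostname file inside HiddenServiceDir contains the generated .onion address.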
Why this matters: Once you have the data, you own it. No API keys, no rate limits, no ToS changes can take it away.
Scale: Tens of millions of posts per instance. PostgreSQL backend keeps memory constant regardless of dataset size. For the full 2.38B post dataset, run multiple instances by topic.
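The constant-memory claim follows from the import pattern: batch rows into PostgreSQL and discard them, rather than materializing the dataset. A sketch of that pattern with psycopg2 (the table and columns are illustrative, not redd-archiver's real schema; stream_records is the generator from the earlier sketch):

```python
# Sketch of constant-memory bulk loading; schema names are assumptions.
import psycopg2                      # pip install psycopg2-binary
from psycopg2.extras import execute_values

INSERT_SQL = ("INSERT INTO posts (id, title, created_utc) VALUES %s "
              "ON CONFLICT (id) DO NOTHING")

conn = psycopg2.connect("dbname=archive user=archive")
batch = []
with conn, conn.cursor() as cur:     # commits the transaction on clean exit
    for post in stream_records("r_example_submissions.zst"):
        batch.append((post.get("id"), post.get("title"), post.get("created_utc")))
        if len(batch) >= 5_000:      # flush in fixed-size batches
            execute_values(cur, INSERT_SQL, batch)
            batch.clear()
    if batch:                        # flush the remainder
        execute_values(cur, INSERT_SQL, batch)
```

Memory use is bounded by the batch size, so a 100GB dump and a 1GB dump cost the same RAM.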
How I built it: Python, PostgreSQL, Jinja2 templates, Docker. Used Claude Code throughout as an experiment in AI-assisted development. Learned that the workflow is “trust but verify” – it accelerates the boring parts but you still own the architecture.
Live demo: https://online-archives.github.io/redd-archiver-example/
GitHub: https://github.com/19-84/redd-archiver (Public Domain)
Pushshift torrent: https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4
How long does it take to download this 3TB torrent?
week(s)
Eww, Voat and Ruqqus.
Thanks. This is great for mining data and urls.
It would be neat for someone to migrate this dataset to a Lemmy instance.
It would be inviting a lawsuit for sure. I like the essence of the idea, but it’s probably more trouble than it’s worth for all but the most fanatic.
Might be easiest to set up an instance in a country that doesn’t give a fuck about western IP law, then others can federate to it.
So yeah, fanatic levels of effort.
Brb, setting up a Lemmy server in Red Star OS
(The machine with the only Steam account active in North Korea ~~would like to~~ already knows your location)
Now this is a good idea.
Can anyone figure out what the minimum process is to just use the SSG function? I’m having a really hard time trying to understand the documentation.
so kinda like Kiwix but for Reddit. That is so cool
You should be very proud of this project!! Thank you for sharing.
This seems especially handy for anyone who wants a snapshot of Reddit from the pre-enshittification, pre-AI era, when content was more authentic and less driven by bots and commercial manipulation of opinion. Just choose the cutoff date you want and stick with that dataset.
Just so you’re aware, it is very noticeable that you also used AI to help write this post and its use of language can throw a lot of people off.
Not to detract from your project, which looks cool!
Yes I used AI, English is not my first language. Thank you for the kind words!
You’re awesome. AI is fun and there’s nothing wrong with using it, especially how you did. Lemmy was hit hard with anti-AI propaganda. China is probably trying to stop its growth and development in other countries, or some stupid shit like that. But you’re good. Fuck them.
Yup, if there was ever a decent use for AI, this is it. Lemmy can (and will) hate the shit out of it, but it took a little burden off the shoulders of someone doing us a great service.
Removed by mod
Would love to see you learn an entire foreign language just so you are able to communicate with the world without being laughed at by people as hostile as yourself.
They said it wasn’t their “first” language, which leads me to believe that they do speak English. If that’s the case, then they indeed are kind of lazy. There have already been studies on the impact of AI when used for communication, and the results are not positive.
This isn’t something I’d point out and criticize; it’s just something I wouldn’t do personally. Take the time to express your own ideas in your own words. The long-term cost is higher than the short-term gains.
Hey I drove to the library, picked up all these things you needed, got dinner here ya go, free!
You drove? man that’s lazy…
He used AI to clean up translation and save time after he spent a fuck ton of time curating and delivering us a helpful product. Calling him out as lazy is an awful take.
there are the so-called activists who complain a lot, and then there are the activists who deliver projects and code… enough said
I have A1 and A2 levels in a couple of non-first languages. Technically I can speak them; realistically I can’t, and I won’t be able to communicate anything more complex than “here, take a look.”
So I don’t agree with your absolutist stance.
I can’t even learn my own language!
Bruh, you do not seem like a nice person to be around.
Spread love and kindness, not hate.
I hope you have a better rest of your day.
Shut the fuck up loser.
Yu mussi bawn backacow (“You must’ve been born behind a cow”)
I fucking hate lemmy sometimes.
What is the time range of the dataset? Up through which date does it run?
However, the data from 2025-12 has already been released; it just needs to be split and reprocessed for 2025 by Watchful1. Once that happens, you can host an archive up through the end of 2025. I will probably add support for importing data from the Arctic Shift dumps instead, so that archives can be updated monthly.
Thank you very much, very cool.
It literally says so in the link. Go to the link and it’s in the title.
Oh I didn’t see it. I’m sorry I asked.
And only a 3.28 TB database? Oh, because it’s compressed. Includes comments too, though.
Yes! Too many comments to count in a reasonable amount of time!
Yeah, it should inflate to 15TB or more, I think.
If only I had the space and bandwidth. I would host a mirror via Lemmy and drag the traffic away.
Actually, isn’t there a way to decentralize this so it can be accessed from regular browsers on the internet? Live content here, archive everywhere.
Someone could format it into essentially static pages and publish it on IPFS. That would probably be the easiest “decentralized hosting” method that remains browsable
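For what it’s worth, with a generated static archive this is close to a one-liner using the kubo (go-ipfs) CLI; the directory name here is assumed:

```
# assumes the kubo CLI with a running daemon; ./site is the generated archive
ipfs add -r ./site        # prints a root CID; add pins it locally by default
ipfs pin add <root-CID>   # anyone else can mirror it by pinning that CID
```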
People will do anything to use Reddit instead of just letting go.
This is just an archive. No different from using the wayback machine or any other archive of web content.
You still use Reddit in some capacity.
Or would you deny watching a movie just because you watched it on your local Jellyfin folder instead of watching it on Netflix or the cinema?
“Stop talking to my clone, I specifically requested you never contact me again”
It’s an archive of reddit, not reddit
There is a ton of useful info on Reddit. I don’t use it anymore either but I’ll be downloading this project.
I never said I am not using it.
But that feels like a compromise to keep using it as natively as possible.
If it was just for research purposes, accessing archive.org would suffice. I think the idea here is to have it offline in the event of further fascist control of the internet. There is really so much useful information on there on a wide variety of topics. I don’t care about backing up memes and bot drivel.
that was exactly the idea, thanks for understanding…
also reddit’s ban on VPNs, also reddit’s mandatory ID verification
and the list goes on…
Wow, great idea. So much useful information and discussion that users have contributed. Looking forward to checking this out.
thank you!!! i built on great ideas from others! i can’t take all the credit 😋
PLEASE SHARE ON REDDIT!!! I have never had a reddit account and they will NOT let me post about this!!
We can’t share this on Reddit, but we can share it on other platforms. Basically, what you have done is scrape tons of data for AI learning. Something like “create your own AI Redditor”. And greedy Reddit management will dislike it very much, even if you tell them it is for cultural inheritance. Your work is great anyway. Sadly, I do not have enough free space to download and store all this data.
Anyone doing this will be banned on that platform.
Fuck Reddit and Fuck Spez.
You know what would be a good way to do it? Take all that content and throw it on a federated service like ours. Publicly visible. No bullshit. And no reason to visit Reddit to get that content. Take their traffic away.
Where would it be hosted so that Conde Nast lawyers can’t touch it?
What would they say? It’s information that’s freely available: no payment required, no accounts needed to simply read it, no copyrights. Where’s the legal problem in hosting a duplicate of the content?
It might fall under the same concept that recipes do - you can’t copyright a recipe, but a collection of recipes (such as a book) is copyrightable.
In any case, they have a lot more money to pay lawyers than you or I do, I’ll bet, so even if you are right, that doesn’t mean you’ll have the money to actually win.