Now how do I get all the owners of good soups to hand me their RSS link? 🤔

@rixx good news: found the password for my soup account again. bad news: where is the link to the settings where I can see my RSS link? (It seems some JS or CSS doesn't load because of cross-origin stuff...?)

@daniel_bohrer Alternatively, a naive scraper will probably net you even more images, because some people used to post images as text posts with links instead of image posts. With their current response times, though, you'd just have to hope to get done before it's over.

@daniel_bohrer (I might be doing this to get the content of some of the good soups as a backup)

@rixx I already tried wget --mirror, but it obviously doesn't load the JS and execute it, so no content is loaded and therefore no content is downloaded at all :-/

@daniel_bohrer Oh huh, that's weird. Did you use a custom domain? JS execution should not be necessary at all.

@rixx Yeah, that's what I remember too, but even after login I cannot see this button.
Ah well, maybe I should just let it go.

@daniel_bohrer Nooo! I can give you a very mediocre semi-tested scraper if that helps?

@rixx maybe that would help. I've tried the export RSS now with the soup-backup script, but it always times out… The other option was wget --mirror --page-requisites, which only gave me the index.html without any images...

but inspecting the index.html, I see now that yes, there is indeed no need for javascript.
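
Since the image links are already in the static HTML, they can be pulled out without running any JavaScript. A minimal sketch using only the Python standard library follows; `ImageCollector`, `extract_image_urls` and the example hostnames are hypothetical names for illustration and are not taken from any script mentioned in this thread.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect the src of every <img> tag in a static HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # page URL, used to resolve relative paths
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Relative paths become absolute; absolute URLs pass through unchanged.
            self.image_urls.append(urljoin(self.base_url, src))


def extract_image_urls(html, base_url):
    """Return all image URLs found in html, in document order."""
    parser = ImageCollector(base_url)
    parser.feed(html)
    return parser.image_urls
```

If the images turn out to live on a different host than the page, that might also explain why a mirror restricted to the original host came back with only index.html, though that is a guess.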

@daniel_bohrer drop.rixx.de/wkE/ with `pip install requests beautifulsoup4`. You'll have to make a "data" directory and touch some files in there first, though, and of course replace "rixx". It should put a bunch of files with URLs in the data directory (made so that you can start downloading while it's still scraping). Files are in the format "<url> <post_id>" to help you retain some sort of ID/ordering.

@daniel_bohrer Does a vague sort of resume on error, though you might end up with some duplicate images in any case. `uniq` if you care etc etc.
Doesn't download images because piping into curl is probably the best thing to do here.
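
One caveat on the `uniq` suggestion: `uniq` only drops adjacent duplicate lines, so duplicates left behind by a resumed run may survive unless the input is sorted first, which loses the scrape order. A rough sketch of an order-preserving alternative for the "<url> <post_id>" lines is below; `parse_line` and `unique_entries` are hypothetical helpers, not part of the scraper linked above.

```python
def parse_line(line):
    """Split one '<url> <post_id>' line; return None for blank or malformed lines."""
    parts = line.split()
    if len(parts) != 2:
        return None
    return parts[0], parts[1]


def unique_entries(lines):
    """Return (url, post_id) pairs, keeping only the first occurrence of each URL."""
    seen = set()
    entries = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None or parsed[0] in seen:
            continue
        seen.add(parsed[0])
        entries.append(parsed)
    return entries
```

The URLs from the resulting pairs can then be printed one per line and handed to curl or any other downloader, with the post ID kept around for naming or ordering the files.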

@rixx thank you. but I don't think I need another BeautifulSoup when I have my soup… :)

@daniel_bohrer I just improved the script to be less annoying, gonna publish in like five minutes, if you're still interested.

@rixx mine times out with a 503 when you request it, unfortunately; it doesn't matter how often or when you try

@bongo I suspect your soup is too large and you'll have to fall back on web scraping.

chaos.social

chaos.social – a Fediverse instance for & by the Chaos community