# HG changeset patch
# User Paper
# Date 1769285405 18000
# Node ID 0c3cd90e91f7a1956580171bad494c1c66d86a40
# Parent c27afe8ead5fd760b38ec58666bc5e7710d9dec2
*: add sanae.site scraper scripts

diff -r c27afe8ead5f -r 0c3cd90e91f7 sanae.site/README
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/sanae.site/README Sat Jan 24 15:10:05 2026 -0500
@@ -0,0 +1,31 @@
+These two scripts were used to scrape files and metadata off sanae.site, a
+short-lived file-upload service ran by vhoda. He completely killed off any
+access from the sanae.site domain, but it was still accessible via
+donsebagay.mom, and that's what the ROOT variables contain.
+
+The first script "scrape.py" was used just to download all of the files.
+These were the most important to save after all. After this, I wrote
+"guess.py" which scraped all of the metadata (as the filename implies,
+this was originally going to "guess" the file extension for each file,
+but the servers were still up so I just scraped the metadata)
+
+The "guess.py" script requires the lxml package, which you will probably
+already have installed. This is only used to strip off