diff sanae.site/README @ 135:0c3cd90e91f7 default tip
*: add sanae.site scraper scripts
| author | Paper <paper@tflc.us> |
|---|---|
| date | Sat, 24 Jan 2026 15:10:05 -0500 |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/sanae.site/README Sat Jan 24 15:10:05 2026 -0500
@@ -0,0 +1,31 @@
+These two scripts were used to scrape files and metadata off sanae.site, a
+short-lived file-upload service ran by vhoda. He completely killed off any
+access from the sanae.site domain, but it was still accessible via
+donsebagay.mom, and that's what the ROOT variables contain.
+
+The first script "scrape.py" was used just to download all of the files.
+These were the most important to save after all. After this, I wrote
+"guess.py" which scraped all of the metadata (as the filename implies,
+this was originally going to "guess" the file extension for each file,
+but the servers were still up so I just scraped the metadata)
+
+The "guess.py" script requires the lxml package, which you will probably
+already have installed. This is only used to strip off <script> and <style>
+tags from the file.
+
+The resulting files from these scripts should be of the format:
+  "[id] - [filename w/ extension]"
+  "[id] - [filename w/ extension].json"
+
+Of which the latter is a JSON object that may or may not contain any of the
+following fields:
+  "filename" -- original filename, HTTP 'Content-Disposition' header
+  "date" -- the date and time of upload, ISO format
+  "visibility" -- "public" if accessible from a user's page, "unlisted"
+                  if not. private videos cannot be accessed as this
+                  script has no login details nor cookies.
+  "yturl" -- the original YouTube URL, if this is a YouTube download
+  "username" -- the username of the uploader; this includes the "!"
+                prefix. this will be "Anonymous" if the website
+                provided it as that.
+                for some files, e.g. FLAC, this is not available :(
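
The output layout described in the README can be read back with a short script. What follows is a minimal sketch, not part of this changeset: the function name load_entries, the FIELDS tuple and the NAME_RE pattern are hypothetical, and it assumes that the id is everything before the first " - " separator and that no two scraped files share an id. Because every field is optional, missing ones come back as None.

```python
import json
import re
from pathlib import Path

# Assumption: the id is the text before the first " - " separator.
NAME_RE = re.compile(r"^(?P<id>.+?) - (?P<filename>.+)$")

# Fields listed in the README; any of them may be absent from a sidecar.
FIELDS = ("filename", "date", "visibility", "yturl", "username")


def load_entries(directory):
    """Yield (id, data file path, metadata dict) for each scraped file."""
    directory = Path(directory)
    names = {p.name for p in directory.iterdir() if p.is_file()}
    for name in sorted(names):
        # "X.json" is treated as a sidecar only when "X" also exists, so an
        # uploaded file that is itself named "something.json" is not skipped.
        if name.endswith(".json") and name[: -len(".json")] in names:
            continue
        match = NAME_RE.match(name)
        if match is None:
            continue  # not in the "[id] - [filename]" format
        meta = {}
        sidecar = directory / (name + ".json")
        if sidecar.is_file():
            with sidecar.open(encoding="utf-8") as fh:
                meta = json.load(fh)
        # Missing fields become None rather than raising KeyError.
        yield match["id"], directory / name, {f: meta.get(f) for f in FIELDS}
```

Calling list(load_entries("some/output/dir")) on a directory written by the two scripts should give one tuple per downloaded file. A file with no sidecar, or with a sidecar lacking "username" (the FLAC case the README mentions), simply reports None for the missing fields.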
