These two scripts were used to scrape files and metadata off sanae.site, a short-lived file-upload service run by vhoda. He completely killed off access via the sanae.site domain, but the service was still reachable through donsebagay.mom, which is what the ROOT variables point at.

The first script, "scrape.py", was used just to download all of the files; those were the most important thing to save, after all. After that I wrote "guess.py", which scraped all of the metadata. (As the filename implies, it was originally going to "guess" the file extension for each file, but since the servers were still up I just scraped the metadata instead.)

"guess.py" requires the lxml package, which you will probably already have installed. It is only used to strip <script> and <style> tags out of each page (see the sketch at the very end of this README).

The resulting files from these scripts should be of the format:

    [id] - [filename w/ extension]
    [id] - [filename w/ extension].json

The latter is a JSON object that may or may not contain any of the following fields:

    "filename"   -- the original filename, taken from the HTTP 'Content-Disposition' header
    "date"       -- the date and time of upload, in ISO format
    "visibility" -- "public" if the file is accessible from a user's page, "unlisted" if not.
                    Private videos cannot be accessed, since this script has no login details or cookies.
    "yturl"      -- the original YouTube URL, if the file is a YouTube download
    "username"   -- the username of the uploader, including the "!" prefix. This will be
                    "Anonymous" if the website reported it as that. For some files, e.g. FLAC,
                    this is not available :(
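
If you want to poke at the output, here is a minimal sketch of reading one of the "[id] - [filename w/ extension].json" sidecar files and printing whichever of the above fields it happens to contain. This is not part of the scripts themselves, just an illustration.

    import json
    import sys

    # Usage: python3 show_meta.py "<id> - <filename>.json"
    # Load one sidecar metadata file produced by guess.py.
    with open(sys.argv[1], encoding="utf-8") as fp:
        meta = json.load(fp)

    # Every field is optional, so only print the ones that are present.
    for key in ("filename", "date", "visibility", "yturl", "username"):
        if key in meta:
            print(f"{key}: {meta[key]}")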

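And since lxml is only pulled in to strip <script> and <style> tags, here is a minimal sketch of that kind of stripping. The function name and the exact calls are illustrative; "guess.py" may do it slightly differently.

    import lxml.html
    from lxml import etree

    def strip_script_and_style(html_text):
        # Parse the page and drop <script> and <style> elements entirely,
        # including any text tails that would otherwise be left behind.
        tree = lxml.html.fromstring(html_text)
        etree.strip_elements(tree, "script", "style", with_tail=False)
        # Serialize what remains back to a string.
        return lxml.html.tostring(tree, encoding="unicode")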