diff sanae.site/README @ 135:0c3cd90e91f7 default tip

*: add sanae.site scraper scripts
author Paper <paper@tflc.us>
date Sat, 24 Jan 2026 15:10:05 -0500
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/sanae.site/README	Sat Jan 24 15:10:05 2026 -0500
@@ -0,0 +1,31 @@
+These two scripts were used to scrape files and metadata off sanae.site, a
+short-lived file-upload service run by vhoda. He completely killed off any
+access from the sanae.site domain, but it was still accessible via
+donsebagay.mom, and that's what the ROOT variables contain.
+
+The first script, "scrape.py", was used just to download all of the files.
+These were the most important to save, after all. After this, I wrote
+"guess.py", which scraped all of the metadata. (As the filename implies,
+this was originally going to "guess" the file extension for each file,
+but the servers were still up, so I just scraped the metadata.)
+
+The "guess.py" script requires the lxml package, which you will probably
+already have installed. This is only used to strip off <script> and <style>
+tags from the file.
+
+The files produced by these scripts should be named as follows:
+	"[id] - [filename w/ extension]"
+	"[id] - [filename w/ extension].json"
+
+The latter is a JSON object in which each of the following fields is
+optional:
+	"filename"   -- original filename, HTTP 'Content-Disposition' header
+	"date"       -- the date and time of upload, ISO format
+	"visibility" -- "public" if accessible from a user's page, "unlisted"
+	                if not. Private videos cannot be accessed, as this
+	                script has no login details or cookies.
+	"yturl"      -- the original YouTube URL, if this is a YouTube download
+	"username"   -- the username of the uploader, including the "!"
+	                prefix. This will be "Anonymous" if the website
+	                provided it as that.
+	                For some files, e.g. FLAC, this is not available :(
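
For anyone consuming the scraped output, here is a minimal sketch of how the file/metadata pairs described in the README might be read back. It is not part of scrape.py or guess.py: the helper names `load_scraped` and `describe` are hypothetical, and it assumes the id never contains " - " and that the JSON files are UTF-8. Every field is read with `.get()`, since the README says any of them may be missing.

```python
import json
from pathlib import Path


def load_scraped(directory):
    """Yield (file_id, data_path, metadata) for each scraped file.

    Hypothetical helper, not part of scrape.py or guess.py.
    """
    for meta_path in sorted(Path(directory).glob("* - *.json")):
        # "[id] - [filename].json" sits next to "[id] - [filename]".
        data_path = meta_path.with_name(meta_path.name[: -len(".json")])
        if not data_path.is_file():
            continue  # no matching download, so this is not a sidecar
        # Assumption: the id itself never contains " - ".
        file_id, _, _ = data_path.name.partition(" - ")
        with open(meta_path, encoding="utf-8") as f:
            metadata = json.load(f)
        yield file_id, data_path, metadata


def describe(metadata):
    """Return the documented fields, with None for any that are absent."""
    fields = ("filename", "date", "visibility", "yturl", "username")
    return {name: metadata.get(name) for name in fields}
```

A caller could loop over `load_scraped("some/output/dir")` and pass each metadata object to `describe`; a FLAC upload without a username would simply come back with `"username": None`.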