These two scripts were used to scrape files and metadata off sanae.site, a short-lived file-upload service run by vhoda. He completely killed off access via the sanae.site domain, but the service was still reachable through donsebagay.mom, which is what the ROOT variables point at.

The first script, "scrape.py", was used just to download all of the files; those were the most important thing to save, after all. After that I wrote "guess.py", which scraped all of the metadata. (As the filename implies, it was originally going to "guess" the file extension for each file, but since the servers were still up I just scraped the metadata instead.)

"guess.py" requires the lxml package, which you will probably already have installed. It is only used to strip <script> and <style> tags out of each page (see the sketch at the very end of this README).

The resulting files from these scripts should be of the format:

    [id] - [filename w/ extension]
    [id] - [filename w/ extension].json

The latter is a JSON object that may or may not contain any of the following fields:

    "filename"   -- the original filename, taken from the HTTP 'Content-Disposition' header
    "date"       -- the date and time of upload, in ISO format
    "visibility" -- "public" if the file is accessible from a user's page, "unlisted" if not.
                    Private videos cannot be accessed, since this script has no login details or cookies.
    "yturl"      -- the original YouTube URL, if the file is a YouTube download
    "username"   -- the username of the uploader, including the "!" prefix. This will be
                    "Anonymous" if the website reported it as that. For some files, e.g. FLAC,
                    this is not available :(
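
If you want to poke at the output, here is a minimal sketch of reading one of the "[id] - [filename w/ extension].json" sidecar files and printing whichever of the above fields it happens to contain. This is not part of the scripts themselves, just an illustration.

    import json
    import sys

    # Usage: python3 show_meta.py "<id> - <filename>.json"
    # Load one sidecar metadata file produced by guess.py.
    with open(sys.argv[1], encoding="utf-8") as fp:
        meta = json.load(fp)

    # Every field is optional, so only print the ones that are present.
    for key in ("filename", "date", "visibility", "yturl", "username"):
        if key in meta:
            print(f"{key}: {meta[key]}")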

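And since lxml is only pulled in to strip <script> and <style> tags, here is a minimal sketch of that kind of stripping. The function name and the exact calls are illustrative; "guess.py" may do it slightly differently.

    import lxml.html
    from lxml import etree

    def strip_script_and_style(html_text):
        # Parse the page and drop <script> and <style> elements entirely,
        # including any text tails that would otherwise be left behind.
        tree = lxml.html.fromstring(html_text)
        etree.strip_elements(tree, "script", "style", with_tail=False)
        # Serialize what remains back to a string.
        return lxml.html.tostring(tree, encoding="unicode")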