comparison sanae.site/README @ 135:0c3cd90e91f7 default tip
*: add sanae.site scraper scripts
| author | Paper <paper@tflc.us> |
|---|---|
| date | Sat, 24 Jan 2026 15:10:05 -0500 |
| parents | |
| children | |
| 134:c27afe8ead5f | 135:0c3cd90e91f7 |
|---|---|
| 1 These two scripts were used to scrape files and metadata off sanae.site, a | |
| 2 short-lived file-upload service run by vhoda. He completely killed off any | |
| 3 access through the sanae.site domain, but the service was still accessible via | |
| 4 donsebagay.mom, and that's what the ROOT variables contain. | |
| 5 | |
| 6 The first script "scrape.py" was used just to download all of the files. | |
| 7 These were the most important to save, after all. After this, I wrote | |
| 8 "guess.py", which scraped all of the metadata (as the filename implies, | |
| 9 this was originally going to "guess" the file extension for each file, | |
| 10 but the servers were still up, so I just scraped the metadata). | |
| 11 | |
| 12 The "guess.py" script requires the lxml package, which you will probably | |
| 13 already have installed. It is only used to strip <script> and <style> | |
| 14 tags from the file. | |
| 15 | |
| 16 The files produced by these scripts should be named in the format: | |
| 17 "[id] - [filename w/ extension]" | |
| 18 "[id] - [filename w/ extension].json" | |
| 19 | |
| 20 The latter is a JSON object that may or may not contain any of the | |
| 21 following fields: | |
| 22 "filename" -- original filename, from the HTTP 'Content-Disposition' header | |
| 23 "date" -- the date and time of upload, ISO format | |
| 24 "visibility" -- "public" if accessible from a user's page, "unlisted" | |
| 25 if not. private videos cannot be accessed, as this | |
| 26 script has no login details or cookies. | |
| 27 "yturl" -- the original YouTube URL, if this is a YouTube download | |
| 28 "username" -- the username of the uploader; this includes the "!" | |
| 29 prefix. this will be "Anonymous" if the website | |
| 30 provided it as that. | |
| 31 for some files, e.g. FLAC, this is not available :( | |
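
The output format described above can be read back with a few lines of Python. This is only a rough sketch and not part of scrape.py or guess.py: the names split_name, load_metadata and iter_scraped are made up for illustration, and it assumes that the id never contains " - " and that each metadata file sits next to its data file. Any field missing from the JSON object comes back as None.

```python
import json
from pathlib import Path

# Fields the README lists; every one of them is optional in the JSON object.
FIELDS = ("filename", "date", "visibility", "yturl", "username")


def split_name(entry_name):
    """Split "[id] - [filename w/ extension]" into (id, filename).

    Assumption: the id never contains " - ", so the first occurrence
    is the separator. Returns None for names that do not fit the format.
    """
    ident, sep, filename = entry_name.partition(" - ")
    if not sep or not ident or not filename:
        return None
    return ident, filename


def load_metadata(data_path):
    """Read "<data file>.json" and return all known fields, None where absent."""
    sidecar = data_path.with_name(data_path.name + ".json")
    if not sidecar.is_file():
        return dict.fromkeys(FIELDS)
    with sidecar.open(encoding="utf-8") as f:
        raw = json.load(f)
    return {key: raw.get(key) for key in FIELDS}


def iter_scraped(directory):
    """Yield (id, filename, metadata) for every scraped file in a directory."""
    for path in sorted(Path(directory).iterdir()):
        # Skip metadata files: a ".json" whose name minus ".json" is also present.
        if path.suffix == ".json" and path.with_suffix("").is_file():
            continue
        parts = split_name(path.name)
        if parts is None or not path.is_file():
            continue
        yield parts[0], parts[1], load_metadata(path)
```

Calling list(iter_scraped("some/dir")) on a directory of scraped files (the path is a placeholder) gives one tuple per data file. The "date" value is left as a string here, since the exact ISO variant the site used is not stated above.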
