These two scripts were used to scrape files and metadata off sanae.site, a
short-lived file-upload service run by vhoda. He completely killed off any
access through the sanae.site domain, but the service was still accessible
via donsebagay.mom, and that's what the ROOT variables contain.

The first script, "scrape.py", was used just to download all of the files.
These were the most important to save, after all. After this, I wrote
"guess.py", which scraped all of the metadata (as the filename implies,
this was originally going to "guess" the file extension for each file,
but the servers were still up, so I just scraped the metadata instead).

The "guess.py" script requires the lxml package, which you will probably
already have installed. It is only used to strip <script> and <style>
tags from the file.
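
For illustration, here is a minimal sketch of how <script> and <style>
elements can be stripped with lxml. This is not the code from "guess.py";
the function name and the sample page are made up for the example, and it
assumes the input is an HTML page held in a string.

```python
from lxml import etree, html


def strip_script_and_style(page: str) -> str:
    """Return the HTML in `page` with <script> and <style> elements removed."""
    tree = html.fromstring(page)
    # with_tail=False keeps any text that directly follows a removed element.
    etree.strip_elements(tree, "script", "style", with_tail=False)
    return html.tostring(tree, encoding="unicode")


# Hypothetical sample page, not taken from the real site.
sample = (
    "<html><head><style>p { color: red; }</style></head>"
    "<body><p>hello</p><script>alert(1)</script></body></html>"
)
cleaned = strip_script_and_style(sample)
print(cleaned)
```

etree.strip_elements() removes the matching elements along with everything
inside them, which is why it fits here better than etree.strip_tags(), which
would leave the script and stylesheet text behind.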

The resulting files from these scripts should be of the format:
    "[id] - [filename w/ extension]"
    "[id] - [filename w/ extension].json"

The latter is a JSON object that may contain any of the following fields
(none of them are guaranteed to be present):
    "filename"   -- original filename, from the HTTP 'Content-Disposition'
                    header
    "date"       -- the date and time of upload, ISO format
    "visibility" -- "public" if accessible from a user's page, "unlisted"
                    if not. private videos cannot be accessed, as this
                    script has neither login details nor cookies.
    "yturl"      -- the original YouTube URL, if this is a YouTube download
    "username"   -- the username of the uploader; this includes the "!"
                    prefix. this will be "Anonymous" if the website
                    provided it as that.
                    for some files, e.g. FLAC, this is not available :(
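
Since every field is optional, anything reading these files should not
assume a key exists. Below is a rough sketch of one way to do that. The
helper names, the "abc123" id, and the sample metadata are all made up for
the example, and it assumes the id itself never contains " - ".

```python
import json
import tempfile
from pathlib import Path

# Field names listed above; any of them may be missing from a given file.
FIELDS = ("filename", "date", "visibility", "yturl", "username")


def split_name(name: str) -> tuple[str, str]:
    """Split '[id] - [filename]' into (id, filename).

    Assumes the id never contains ' - ', so the first separator wins.
    """
    file_id, _, filename = name.partition(" - ")
    return file_id, filename


def load_metadata(data_file: Path) -> dict:
    """Load the '.json' file next to data_file, using None for absent fields."""
    sidecar = data_file.with_name(data_file.name + ".json")
    raw = {}
    if sidecar.exists():
        raw = json.loads(sidecar.read_text(encoding="utf-8"))
    return {field: raw.get(field) for field in FIELDS}


# Demo with made-up data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "abc123 - song.flac"
    data.write_bytes(b"")
    sidecar = data.with_name(data.name + ".json")
    sidecar.write_text(
        json.dumps({"filename": "song.flac", "visibility": "unlisted"}),
        encoding="utf-8",
    )
    meta = load_metadata(data)
    print(split_name(data.name), meta)
```

Using dict.get() means a missing "username" or "yturl" simply comes back as
None instead of raising a KeyError.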