annotate channeldownloader.py @ 18:05e71dd6b6ca default tip
no more ia python library
| author | Paper <paper@tflc.us> |
|---|---|
| date | Sat, 28 Feb 2026 22:31:59 -0500 |
| parents | 0d10b2ce0140 |
| children |
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# channeldownloader.py - scrapes youtube videos from a channel from
# a variety of sources

# Copyright (c) 2021-2026 Paper <paper@tflc.us>
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

# Okay, this is a bit of a clusterfuck.
#
# This originated as a script that simply helped me scrape a bunch
# of videos off some deleted channels (in fact, that's still its main
# purpose) and was very lackluster (hardcoded shite everywhere).
# Fortunately in recent times I've cleaned up the code and added some
# other mirrors, as well as improved the archive.org scraper to not
# shoot itself when it encounters an upload that's not from tubeup.
#
# Nevertheless, I still consider much of this file to be dirty hacks,
# especially some of the HTTP stuff.

"""
Usage:
  channeldownloader.py <url>... (--database <file>)
                                [--output <folder>]
  channeldownloader.py -h | --help

Arguments:
  <url>                        YouTube channel URL to download from

Options:
  -h --help                    Show this screen
  -o --output <folder>         Output folder, relative to the current directory
                               [default: .]
  -d --database <file>         yt-dlp style database of videos. Should contain
                               an array of yt-dlp .info.json data. For example,
                               FinnOtaku's YTPMV metadata archive.
"""

# Built-in python stuff (no possible missing dependencies)
from __future__ import print_function
import os
import re
import time
import urllib.request
import urllib.parse
import ssl
import io
import shutil
import xml.etree.ElementTree as XmlET
import enum
from urllib.error import HTTPError
from pathlib import Path

# Third-party and required (docopt is not part of the standard library)
import docopt

# We can utilize special simdjson features if it is available
simdjson = False

try:
    import simdjson as json
    simdjson = True
    print("INFO: using simdjson")
except ImportError:
    try:
        import ujson as json
        print("INFO: using ujson")
    except ImportError:
        try:
            import orjson as json
            print("INFO: using orjson")
        except ImportError:
            import json
            print("INFO: using built-in json (slow!)")

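The nested try/except chain above can also be written as a loop over candidate module names. This is only a minimal sketch, not the script's own code: `first_importable` is a hypothetical helper, and the candidate order simply mirrors the chain above.

```python
import importlib


def first_importable(names):
    """Return (name, module) for the first module in `names` that imports."""
    for name in names:
        try:
            return name, importlib.import_module(name)
        except ImportError:
            # Not installed; fall through to the next candidate
            continue
    raise ImportError("none of %r could be imported" % (names,))


# Same preference order as the chain above; the standard-library json
# module is always present, so the final candidate acts as the fallback.
json_name, json = first_importable(["simdjson", "ujson", "orjson", "json"])
print("INFO: using %s" % json_name)
```

Note that these libraries are not drop-in replacements for each other in every respect (for example, `orjson.dumps` returns `bytes` rather than `str`), so code using the selected module should stick to the calls they share, such as `loads`.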
ytdlp_works = False

try:
    import yt_dlp as youtube_dl
    from yt_dlp.utils import sanitize_filename, DownloadError
    ytdlp_works = True
except ImportError:
    print("failed to import yt-dlp!")
    print("downloading from YouTube directly will not work.")

zipfile_works = False

try:
    import zipfile
    zipfile_works = True
except ImportError:
    print("failed to import zipfile!")
    print("loading the database from a .zip file will not work.")


##############################################################################
## DOWNLOADERS

# All downloaders should be a function under this signature:
#    dl(video: dict, basename: str, output: str) -> DownloaderStatus
# where:
#    'video': the .info.json scraped from the YTPMV metadata archive.
#    'basename': the basename output to write as.
#    'output': the output directory.
# yes, it's weird, but I don't care ;)

class DownloaderStatus(enum.Enum):
    # Download finished successfully.
    SUCCESS = 0
    # Download failed.
    # Note that this should NOT be used for when the video is unavailable
    # (i.e. error 404); it should only be used when the video cannot be
    # downloaded *at this time*, indicating a server problem. This is very
    # common for the Internet Archive, not sure about others.
    ERROR = 1
    # Video is unavailable from this provider.
    UNAVAILABLE = 2

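To make the distinction between ERROR and UNAVAILABLE concrete, here is a rough sketch of how an HTTP status code could be mapped onto the three outcomes. `Status` is a stand-in for `DownloaderStatus`, `classify_http_status` is a hypothetical helper, and the choice of which codes count as "try again later" (408, 429 and all 5xx) is an assumption, not something this file defines.

```python
import enum


class Status(enum.Enum):
    # Stand-in for DownloaderStatus so the sketch is self-contained
    SUCCESS = 0
    ERROR = 1
    UNAVAILABLE = 2


def classify_http_status(code: int) -> Status:
    """Map an HTTP status code onto the three outcomes described above."""
    if 200 <= code < 300:
        return Status.SUCCESS
    # Assumption: 408, 429 and every 5xx are treated as "try again later"
    if code in (408, 429) or 500 <= code < 600:
        return Status.ERROR
    # Everything else (404, 403, 410, ...) is "not available here"
    return Status.UNAVAILABLE


for code in (200, 404, 503):
    print(code, classify_http_status(code).name)
```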
def download_file(url: str, path: str, guessext: bool = False, length: int = None) -> DownloaderStatus:
    """
    Downloads a file from `url` to `path`, and prints the progress to the
    screen.
    """
    # Download in 32KiB chunks
    CHUNK_SIZE = 32768

    # Don't exceed 79 chars.
    try:
        with urllib.request.urlopen(url) as http:
            if length is None:
                # Check whether the URL gives us Content-Length.
                # If so, we call f.truncate (once the output file is open)
                # to tell the filesystem how much we will be downloading
                # before we start writing.
                #
                # This is also useful for displaying how much we've
                # downloaded overall as a percent.
                length = http.getheader("Content-Length", default=None)
                try:
                    if length is not None:
                        length = int(length)
                except ValueError:
                    # Malformed header; carry on without a known length
                    length = None

            if guessext:
                # Guess file extension from MIME type
                mime = http.getheader("Content-Type", default=None)
                if not mime:
                    return DownloaderStatus.ERROR

                if mime == "video/mp4":
                    path += ".mp4"
                elif mime == "video/webm":
                    path += ".webm"
                else:
                    return DownloaderStatus.ERROR

            # dirname is empty when path has no directory component
            par = os.path.dirname(path)
            if par and not os.path.isdir(par):
                os.makedirs(par)

            with open(path, "wb") as f:
                if length is not None:
                    # Preallocate the file now that it is open
                    f.truncate(length)

                # Download the entire file
                while True:
                    data = http.read(CHUNK_SIZE)
                    if not data:
                        break

                    f.write(data)
                    print("\r downloading to %s, " % path, end="")
                    if length:
                        print("%.2f%%" % (f.tell() / length * 100.0), end="")
                    else:
                        print("%.2f MiB" % (f.tell() / (1 << 20)), end="")

                print("\r downloaded to %s " % path)

                if length is not None and length != f.tell():
                    # Server lied about what the length was?
                    print(" INFO: HTTP server's Content-Length header lied??")
    except TimeoutError:
        return DownloaderStatus.ERROR
    except HTTPError:
        return DownloaderStatus.UNAVAILABLE
    except Exception as e:
        print(" unknown error downloading video;", e)
        return DownloaderStatus.ERROR

    return DownloaderStatus.SUCCESS
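
The chunked copy with preallocation used by download_file can be tried without a network connection. The sketch below is an illustration under assumptions, not part of the script: `copy_stream` is a hypothetical helper, an in-memory `io.BytesIO` stands in for the HTTP response, and trimming the file when the announced length turns out to be wrong is an extra step that download_file itself does not perform.

```python
import io
import os
import tempfile


def copy_stream(src, path, length=None, chunk_size=32768):
    """Copy the file-like `src` to `path` in chunks; return bytes written."""
    with open(path, "wb") as f:
        if length is not None:
            # Reserve the expected size up front; writes still start at 0
            f.truncate(length)

        while True:
            data = src.read(chunk_size)
            if not data:
                break
            f.write(data)

        written = f.tell()
        if length is not None and written != length:
            # Drop the zero padding if the announced length was too large
            f.truncate(written)
        return written


payload = b"x" * 100000
target = os.path.join(tempfile.mkdtemp(), "video.bin")
written = copy_stream(io.BytesIO(payload), target, length=len(payload))
print(written, os.path.getsize(target))
```

The key ordering constraint is that `truncate` can only be called once the output file object exists, which is why the preallocation sits directly after `open`.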


# Basic downloader template.
#
# This does a brute-force of all extensions within vexts and iexts
# in an attempt to find a working video link.
#
# linktemplate is a template to be created using the video ID and
# extension. For example:
#    https://cdn.ytarchiver.com/%s.%s
def basic_dl_template(video: dict, basename: str, output: str,
        linktemplate: str, vexts: list, iexts: list) -> DownloaderStatus:
    # actual downloader
    def basic_dl_impl(vid: str, ext: str) -> DownloaderStatus:
        url = (linktemplate % (vid, ext))
        return download_file(url, "%s.%s" % (basename, ext))

    for exts in [vexts, iexts]:
        if not exts:
            # nothing to fetch for this category; without this check an
            # empty list would fall straight into the else branch below
            continue

        for ext in exts:
            r = basic_dl_impl(video["id"], ext)
            if r == DownloaderStatus.SUCCESS:
                break  # done!
            elif r == DownloaderStatus.ERROR:
                # timeout; try again later?
                return DownloaderStatus.ERROR
            elif r == DownloaderStatus.UNAVAILABLE:
                continue
        else:
            # we did not break out of the loop
            # which means all extensions were unavailable
            return DownloaderStatus.UNAVAILABLE

    # video was downloaded successfully
    return DownloaderStatus.SUCCESS
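
The template above relies on Python's `for ... else`, where the `else` branch runs only if the loop ends without hitting `break`. A minimal standalone sketch of that pattern follows; `first_available` and the fake `exists` set are made up for illustration and do not appear in the script.

```python
def first_available(candidates, fetch):
    """Return the first candidate for which fetch() is truthy, else None."""
    for candidate in candidates:
        if fetch(candidate):
            break  # found one; skips the else clause below
    else:
        # Runs only when the loop finished without `break`,
        # including when `candidates` is empty
        return None
    return candidate


# Pretend only the .webm file exists on the mirror
exists = {"webm"}
print(first_available(["mp4", "webm", "mkv"], lambda ext: ext in exists))
print(first_available([], lambda ext: True))
```

The second call shows the edge case worth remembering: an empty candidate list never breaks, so the `else` branch always runs for it.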


# GhostArchive, basic...
def ghostarchive_dl(video: dict, basename: str, output: str) -> DownloaderStatus:
    return basic_dl_template(video, basename, output,
        "https://ghostvideo.b-cdn.net/chimurai/%s.%s",
        ["mp4", "webm", "mkv"],
        [] # none
    )


# media.desirintoplaisir.net
#
# holds PRIMARILY popular videos (i.e. no niche internet microcelebrities)
# or weeb shit, however it seems to be growing to other stuff.
#
# there isn't really a proper API; I've based the scraping off of the HTML
# and the public source code.
def desirintoplaisir_dl(video: dict, basename: str, output: str) -> DownloaderStatus:
    return basic_dl_template(video, basename, output,
        "https://media.desirintoplaisir.net/content/%s.%s",
        ["mp4", "webm", "mkv"],
        ["webp"]
    )


# Internet Archive's Wayback Machine
#
# Internally, IA's javascript routines forward to the magic
# URL used here.
#
# TODO: Download thumbnails through the CDX API:
# https://github.com/TheTechRobo/youtubevideofinder/blob/master/lostmediafinder/finder.py
# the CDX API is pretty slow though, so it should be used as a last resort.
def wayback_dl(video: dict, basename: str, output: str) -> DownloaderStatus:
    PREFIX = "https://web.archive.org/web/2oe_/http://wayback-fakeurl.archive.org/yt/"
    return download_file(PREFIX + video["id"], basename, True)


# Also captures the ID for comparison
IA_REGEX = re.compile(r"(?:(?P<date>\d{8}) - )?(?P<title>.+?)?(?:-| \[)?(?:(?P<id>[A-Za-z0-9_\-]{11})]?|(?: \((?P<format>(?:(?:(?P<resolution>\d+)p_(?P<fps>\d+)fps_(?P<vcodec>H264)-)?(?P<abitrate>\d+)kbit_(?P<acodec>AAC|Vorbis))|BQ|Description)\)))\.(?P<extension>mp4|info\.json|description|annotations\.xml|webp|mkv|webm|jpg|jpeg|ogg|txt|m4a)$")


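Because the pattern above is dense, a cut-down version may help show how the named groups pull the video ID out of the different filename layouts. `NAME_RE` below is a simplified stand-in written for illustration (it handles fewer layouts and extensions than IA_REGEX), and the sample filenames are invented. One detail worth knowing when writing the ID character class: the range `A-z` also covers the punctuation between `Z` and `a` (such as `[`, `]` and `^`), so the sketch spells out `A-Za-z`.

```python
import re

# Simplified stand-in for IA_REGEX: optional date, optional title, 11-char ID
NAME_RE = re.compile(
    r"(?:(?P<date>\d{8}) - )?"          # YYYYMMDD prefix
    r"(?:(?P<title>.+?)(?:-| \[))?"     # TITLE followed by "-" or " ["
    r"(?P<id>[A-Za-z0-9_-]{11})\]?"     # the video ID, optionally closed by "]"
    r"\.(?P<extension>mp4|webm|mkv|info\.json)$"
)

for name in (
    "dQw4w9WgXcQ.mp4",
    "Some Title-dQw4w9WgXcQ.webm",
    "20200101 - Some Title [dQw4w9WgXcQ].mkv",
    "__ia_thumb.jpg",
):
    m = NAME_RE.match(name)
    print(name, "->", m.group("id") if m else None)
```
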
|
14
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
278 # Internet Archive (tubeup) |
| 18 | 279 # |
| 280 # NOTE: We don't actually need the python library anymore; we already | |
| 281 # explicitly download the file listing using our own logic, so there's | |
| 282 # really nothing stopping us from going ahead and downloading everything | |
| 283 # else using the download_file function. | |
| 284 def ia_dl(video: dict, basename: str, output: str) -> DownloaderStatus: | |
| 16 | 285 def ia_file_legit(f: str, vidid: str, vidtitle: str) -> bool: |
|
14
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
286 # FIXME: |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
287 # |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
288 # There are some items on IA that combine the old tubeup behavior |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
289 # (i.e., including the sanitized video name before the ID) |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
290 # and the new tubeup behavior (filename only contains the video ID) |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
291 # hence we will download the entire video twice. |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
292 # |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
293 # This isn't much of a problem anymore (and hasn't been for like 3 |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
294 # years), since I contributed code to not upload something if there |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
295 # is already something there. However we should handle this case |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
296 # anyway. |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
297 # |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
298 # Additionally, there are some items that have duplicate video files |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
299 # (from when the owners changed the title). We should ideally only |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
300 # download unique files. IA seems to provide SHA1 hashes... |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
301 # |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
302 # We should also check whether the copy on IA is higher quality |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
303 # than a local copy... :) |
| 16 | 304 |
| 305 IA_ID = "youtube-%s" % vidid | |
| 306 | |
| 307 # Ignore IA generated thumbnails | |
| 308 if f.startswith("%s.thumbs/" % IA_ID) or f == "__ia_thumb.jpg": | |
| 309 return False | |
| 310 | |
| 311 for i in ["_archive.torrent", "_files.xml", "_meta.sqlite", "_meta.xml"]: | |
| 312 if f == (IA_ID + i): | |
| 313 return False | |
| 314 | |
| 315 # Try to match with our known filename regex | |
| 316 # This properly matches: | |
| 317 # ??????????? - YYYYMMDD - TITLE [ID].EXTENSION | |
| 318 # old tubeup - TITLE-ID.EXTENSION | |
| 319 # tubeup - ID.EXTENSION | |
| 320 # JDownloader - TITLE (FORMAT).EXTENSION | |
| 321 # (Possibly we should match other filenames too??) | |
| 322 m = re.match(IA_REGEX, f) | |
| 323 if m is None: | |
|
14
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
324 return False |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
325 |
| 16 | 326 if m.group("id"): |
| 327 return (m.group("id") == vidid) | |
| 328 elif m.group("title") is not None: | |
| 329 def asciify(s: str) -> str: | |
| 330 | # Replace chars outside 0x20..0x7F (and path separators) with underscores, then strip leading/trailing whitespace | |
| 331 return ''.join([i if ord(i) >= 0x20 and ord(i) < 0x80 and i not in "/\\" else '_' for i in s]).strip() | |
| 332 | |
| 333 if asciify(m.group("title")) == asciify(vidtitle): | |
| 334 return True # Close enough | |
| 335 | |
| 336 # Uh oh | |
| 337 return False | |
|
14
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
338 |
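The title fallback above treats two titles as equal when they match after a lossy ASCII normalisation. A minimal standalone sketch of that comparison follows; `titles_match` and the sample strings in the usage note are hypothetical and only illustrate the idea, they are not part of the script.

```python
def asciify(s: str) -> str:
    # Keep characters in 0x20..0x7F except path separators; replace
    # everything else with an underscore, then strip surrounding whitespace.
    return "".join(
        c if 0x20 <= ord(c) < 0x80 and c not in "/\\" else "_" for c in s
    ).strip()


def titles_match(a: str, b: str) -> bool:
    # Hypothetical helper: two titles "match" if they are equal after asciify().
    return asciify(a) == asciify(b)
```

Because every replaced character becomes the same underscore, titles that differ only in accents or separators compare equal, which is presumably why the original marks the result as "close enough".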
| 16 | 339 def ia_get_original_files(identifier: str) -> typing.Optional[list]: |
| 340 def ia_xml(identifier: str) -> typing.Optional[str]: | |
| 341 for _ in range(1, 9999): | |
| 342 try: | |
| 343 with urllib.request.urlopen("https://archive.org/download/%s/%s_files.xml" % (identifier, identifier)) as req: | |
| 344 return req.read().decode("utf-8") | |
| 345 except HTTPError as e: | |
| 346 if e.code == 404 or e.code == 503: | |
| 347 return None | |
| 348 time.sleep(5) | |
|
14
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
349 |
| 16 | 350 d = ia_xml(identifier) |
| 351 if d is None: | |
| 352 return None | |
| 353 | |
| 354 try: | |
| 18 | 355 r = [] |
| 356 | |
| 16 | 357 # Now parse the XML and make a list of each original file |
| 18 | 358 for x in filter(lambda x: x.attrib["source"] == "original", XmlET.fromstring(d)): |
| 359 l = {"name": x.attrib["name"]} | |
| 360 | |
| 361 sz = x.find("size") | |
| 362 if sz is not None: | |
| 363 l["size"] = int(sz.text) | |
| 364 | |
| 365 r.append(l) | |
| 366 | |
| 367 return r | |
| 368 | |
| 16 | 369 except Exception as e: |
| 370 print(e) | |
| 371 return None | |
| 372 | |
| 18 | 373 IA_IDENTIFIER = "youtube-%s" % video["id"] |
| 374 | |
| 375 originalfiles = ia_get_original_files(IA_IDENTIFIER) | |
| 16 | 376 if not originalfiles: |
| 18 | 377 return DownloaderStatus.UNAVAILABLE |
|
14
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
378 |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
379 flist = [ |
| 16 | 380 f |
| 381 for f in originalfiles | |
| 18 | 382 if ia_file_legit(f["name"], video["id"], video["title"] if not "fulltitle" in video else video["fulltitle"]) |
|
14
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
383 ] |
|
15
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
384 |
| 16 | 385 if not flist: |
| 18 | 386 return DownloaderStatus.UNAVAILABLE # ?????? |
| 16 | 387 |
| 18 | 388 for i in flist: |
| 389 for _ in range(1, 10): | |
| 390 path = "%s/%s" % (IA_IDENTIFIER, i["name"]) | |
| 391 r = download_file("https://archive.org/download/" + urllib.parse.quote(path, encoding="utf-8"), path, False, None if not "size" in i else i["size"]) | |
| 392 if r == DownloaderStatus.SUCCESS: | |
| 393 break | |
| 394 elif r == DownloaderStatus.ERROR: | |
| 395 # sleep for a bit and retry | |
| 396 time.sleep(1.0) | |
| 397 continue | |
| 398 elif r == DownloaderStatus.UNAVAILABLE: | |
| 399 return DownloaderStatus.UNAVAILABLE | |
|
14
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
400 |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
401 # Newer versions of tubeup save only the video ID. |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
402 # Account for this by replacing it. |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
403 # |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
404 # paper/2025-08-30: fixed a bug where video IDs with hyphens |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
405 # would incorrectly truncate |
|
15
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
406 # |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
407 # paper/2026-02-27: an update in the IA python library changed |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
408 # the way destdir works, so it just gets entirely ignored. |
| 18 | 409 for f in flist: |
| 16 | 410 def getext(s: str, vidid: str) -> typing.Optional[str]: |
| 411 # special cases | |
| 412 for i in [".info.json", ".annotations.xml"]: | |
| 413 if s.endswith(i): | |
| 414 return i | |
|
15
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
415 |
| 16 | 416 # Handle JDownloader "TITLE (Description).txt" |
| 417 if s.endswith(" (Description).txt"): | |
| 418 return ".description" | |
| 419 | |
| 420 # Catch-all for remaining extensions | |
| 421 spli = os.path.splitext(s) | |
| 422 if spli is None or len(spli) != 2: | |
|
15
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
423 return None |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
424 |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
425 return spli[1] |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
426 |
| 18 | 427 ondisk = "youtube-%s/%s" % (video["id"], f["name"]) |
| 16 | 428 |
| 429 if not os.path.exists(ondisk): | |
|
14
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
430 continue |
|
15
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
431 |
| 18 | 432 ext = getext(f["name"], video["id"]) |
|
15
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
433 if ext is None: |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
434 continue |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
435 |
| 16 | 436 os.replace(ondisk, "%s%s" % (basename, ext)) |
| 437 | |
| 438 shutil.rmtree("youtube-%s" % video["id"]) | |
| 439 | |
| 18 | 440 return DownloaderStatus.SUCCESS |
|
10
8969930a9fa4
*: major cleanup
Paper <37962225+mrpapersonic@users.noreply.github.com>
parents:
9
diff
changeset
|
441 |
|
11
1ac85f6f40c4
channeldownloader: insane memory optimizations
Paper <37962225+mrpapersonic@users.noreply.github.com>
parents:
10
diff
changeset
|
442 |
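The FIXME note earlier suggests using the SHA1 hashes that IA seems to provide to skip duplicate uploads. One possible way to combine that with the `_files.xml` parsing is sketched below. It assumes each `<file>` element carries a `<sha1>` child, which the script itself never reads; the sample XML and the name `unique_original_files` are made up for illustration.

```python
import xml.etree.ElementTree as XmlET

# Hypothetical sample shaped like an IA "<identifier>_files.xml" document.
# The <sha1> child element is an assumption based on the FIXME note.
SAMPLE = """<files>
  <file name="Old Title-abc.mp4" source="original"><size>10</size><sha1>aaa</sha1></file>
  <file name="New Title-abc.mp4" source="original"><size>10</size><sha1>aaa</sha1></file>
  <file name="abc.info.json" source="original"><size>3</size><sha1>bbb</sha1></file>
  <file name="abc.ogv" source="derivative"><size>7</size><sha1>ccc</sha1></file>
</files>"""


def unique_original_files(xml_text: str) -> list:
    seen = set()
    result = []
    for node in XmlET.fromstring(xml_text):
        # .get() avoids a KeyError when an entry has no "source" attribute.
        if node.get("source") != "original":
            continue
        entry = {"name": node.get("name")}
        size = node.find("size")
        if size is not None and size.text:
            entry["size"] = int(size.text)
        sha1 = node.findtext("sha1")
        if sha1 is not None:
            if sha1 in seen:
                continue  # same content already listed under another name
            seen.add(sha1)
        result.append(entry)
    return result


files = unique_original_files(SAMPLE)
```

Entries without a hash are kept rather than dropped, so the sketch degrades to the current behaviour when the assumption about `<sha1>` does not hold.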
| 18 | 443 def ytdlp_dl(video: dict, basename: str, output: str) -> DownloaderStatus: |
|
14
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
444 # intentionally ignores all messages besides errors |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
445 class MyLogger(object): |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
446 def debug(self, msg): |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
447 pass |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
448 |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
449 def warning(self, msg): |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
450 pass |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
451 |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
452 def error(self, msg): |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
453 print(" " + msg) |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
454 pass |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
455 |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
456 |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
457 def ytdl_hook(d) -> None: |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
458 if d["status"] == "finished": |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
459 print(" downloaded %s: 100%% " % (os.path.basename(d["filename"]))) |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
460 if d["status"] == "downloading": |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
461 print(" downloading %s: %s\r" % (os.path.basename(d["filename"]), |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
462 d["_percent_str"]), end="") |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
463 if d["status"] == "error": |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
464 print("\n an error occurred downloading %s!" |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
465 % (os.path.basename(d["filename"]))) |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
466 |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
467 ytdl_opts = { |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
468 "retries": 100, |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
469 "nooverwrites": True, |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
470 "call_home": False, |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
471 "quiet": True, |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
472 "writeinfojson": True, |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
473 "writedescription": True, |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
474 "writethumbnail": True, |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
475 "writeannotations": True, |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
476 "writesubtitles": True, |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
477 "allsubtitles": True, |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
478 "addmetadata": True, |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
479 "continuedl": True, |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
480 "embedthumbnail": True, |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
481 "format": "bestvideo+bestaudio/best", |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
482 "restrictfilenames": True, |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
483 "no_warnings": True, |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
484 "progress_hooks": [ytdl_hook], |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
485 "logger": MyLogger(), |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
486 "ignoreerrors": False, |
| 18 | 487 # yummy |
|
14
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
488 "outtmpl": output + "/%(title)s-%(id)s.%(ext)s", |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
489 } |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
490 |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
491 with youtube_dl.YoutubeDL(ytdl_opts) as ytdl: |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
492 try: |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
493 ytdl.extract_info("https://youtube.com/watch?v=%s" % video["id"]) |
| 18 | 494 return DownloaderStatus.SUCCESS |
|
14
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
495 except DownloadError: |
| 18 | 496 return DownloaderStatus.UNAVAILABLE |
|
14
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
497 except Exception as e: |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
498 print(" unknown error downloading video!\n") |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
499 print(e) |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
500 |
| 18 | 501 return DownloaderStatus.ERROR |
|
14
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
502 |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
503 |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
504 # TODO: There are multiple other youtube archival websites available. |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
505 # Most notable is https://findyoutubevideo.thetechrobo.ca . |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
506 # This combines a lot of sparse youtube archival services, and has |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
507 # a convenient API we can use. Nice! |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
508 # |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
509 # There is also the "Distributed YouTube Archive" which is totally |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
510 # useless because there's no way to automate it... |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
511 |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
512 ############################################################################## |
|
11
1ac85f6f40c4
channeldownloader: insane memory optimizations
Paper <37962225+mrpapersonic@users.noreply.github.com>
parents:
10
diff
changeset
|
513 |
|
1ac85f6f40c4
channeldownloader: insane memory optimizations
Paper <37962225+mrpapersonic@users.noreply.github.com>
parents:
10
diff
changeset
|
514 |
|
10
8969930a9fa4
*: major cleanup
Paper <37962225+mrpapersonic@users.noreply.github.com>
parents:
9
diff
changeset
|
515 def main(): |
|
14
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
516 def load_split_files(path: str): |
|
15
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
517 def cruft(isdir: bool, listdir, openf): |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
518 # build the path list |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
519 if not isdir: |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
520 list_files = [path] |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
521 else: |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
522 list_files = filter(lambda x: re.search(r"vids[0-9\-]+?\.json", x), listdir()) |
|
14
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
523 |
|
15
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
524 # now open each as a json |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
525 for fi in list_files: |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
526 print(fi) |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
527 with openf(fi, "r") as infile: |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
528 if simdjson: |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
529 # Using this is a lot faster in SIMDJSON, since instead |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
530 # of converting all of the JSON key/value pairs into |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
531 # native Python objects, they stay in an internal state. |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
532 # |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
533 # This means we only get the stuff we absolutely need, |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
534 # which is the uploader ID, and copy everything else |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
535 # if the ID is one we are looking for. |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
536 parser = json.Parser() |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
537 yield parser.parse(infile.read()) |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
538 del parser |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
539 else: |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
540 yield json.load(infile) |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
541 |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
542 |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
543 try: |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
544 if not zipfile_works or os.path.isdir(path): |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
545 raise Exception |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
546 |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
547 with zipfile.ZipFile(path, "r") as myzip: |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
548 yield from cruft(True, lambda: myzip.namelist(), lambda f, m: io.TextIOWrapper(myzip.open(f, mode=m), encoding="utf-8")) |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
549 except Exception as e: |
|
615e1ca0212a
*: add support for loading the split db from a zip file
Paper <paper@tflc.us>
parents:
14
diff
changeset
|
550 yield from cruft(os.path.isdir(path), lambda: os.listdir(path), lambda f, m: open(os.path.join(path, f) if os.path.isdir(path) else f, m, encoding="utf-8")) |
|
14
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
551 |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
552 |
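The loader above reads the split database from a zip archive, a directory, or a single file through one generator. A simplified sketch of the same idea is given below. It swaps the exception-driven fallback for an explicit `zipfile.is_zipfile` check and always uses the standard `json` module; `iter_json` and the temporary file names are hypothetical.

```python
import io
import json
import os
import re
import tempfile
import zipfile


def iter_json(path: str):
    # Hypothetical simplified loader: yields one parsed JSON document per
    # matching "vids*.json" member, from either a zip archive or a directory.
    pattern = re.compile(r"vids[0-9\-]+?\.json")
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path, "r") as z:
            for name in filter(pattern.search, z.namelist()):
                with io.TextIOWrapper(z.open(name, "r"), encoding="utf-8") as f:
                    yield json.load(f)
    elif os.path.isdir(path):
        for name in filter(pattern.search, sorted(os.listdir(path))):
            with open(os.path.join(path, name), "r", encoding="utf-8") as f:
                yield json.load(f)
    else:
        # A single JSON file is read as-is, without any name filtering.
        with open(path, "r", encoding="utf-8") as f:
            yield json.load(f)


# Tiny self-check with temporary files.
with tempfile.TemporaryDirectory() as tmp:
    zpath = os.path.join(tmp, "db.zip")
    with zipfile.ZipFile(zpath, "w") as z:
        z.writestr("vids0-1.json", json.dumps({"videos": [{"id": "abc"}]}))
        z.writestr("README.txt", "ignored")
    from_zip = list(iter_json(zpath))
```

Keeping this a generator means only one split file is parsed at a time, which matches the memory-conscious approach the script takes.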
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
553 def write_metadata(i: dict, basename: str) -> None: |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
554 # ehhh |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
555 if not os.path.exists(basename + ".info.json"): |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
556 with open(basename + ".info.json", "w", encoding="utf-8") as jsonfile: |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
557 try: |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
558 # orjson outputs bytes |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
559 jsonfile.write(json.dumps(i).decode("utf-8")) |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
560 except AttributeError: |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
561 # everything else outputs a string |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
562 jsonfile.write(json.dumps(i)) |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
563 print(" saved %s" % os.path.basename(jsonfile.name)) |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
564 if not os.path.exists(basename + ".description"): |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
565 with open(basename + ".description", "w", |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
566 encoding="utf-8") as descfile: |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
567 descfile.write(i["description"]) |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
568 print(" saved %s" % os.path.basename(descfile.name)) |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
569 |
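`write_metadata` has to cope with `dumps` returning bytes (orjson) or a string (the standard library). A small sketch of one alternative to the try/except approach checks the return type directly; `dumps_text` and its `dumps` parameter are hypothetical names used only here.

```python
import json


def dumps_text(obj, dumps=json.dumps) -> str:
    # Serialise with whichever dumps() is in use, then normalise to str:
    # bytes-returning implementations are decoded as UTF-8.
    out = dumps(obj)
    return out.decode("utf-8") if isinstance(out, bytes) else out
```

Checking the type keeps an unrelated `AttributeError` raised inside the serialiser from being mistaken for the "string output" case.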
|
10
8969930a9fa4
*: major cleanup
Paper <37962225+mrpapersonic@users.noreply.github.com>
parents:
9
diff
changeset
|
570 args = docopt.docopt(__doc__) |
|
8969930a9fa4
*: major cleanup
Paper <37962225+mrpapersonic@users.noreply.github.com>
parents:
9
diff
changeset
|
571 |
|
8969930a9fa4
*: major cleanup
Paper <37962225+mrpapersonic@users.noreply.github.com>
parents:
9
diff
changeset
|
572 if not os.path.exists(args["--output"]): |
|
8969930a9fa4
*: major cleanup
Paper <37962225+mrpapersonic@users.noreply.github.com>
parents:
9
diff
changeset
|
573 os.mkdir(args["--output"]) |
|
8969930a9fa4
*: major cleanup
Paper <37962225+mrpapersonic@users.noreply.github.com>
parents:
9
diff
changeset
|
574 |
|
14
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
575 channels = dict() |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
576 |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
577 for url in args["<url>"]: |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
578 chn = url.split("/")[-1] |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
579 channels[chn] = {"output": "%s/%s" % (args["--output"], chn)} |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
580 |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
581 for channel in channels.values(): |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
582 if not os.path.exists(channel["output"]): |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
583 os.mkdir(channel["output"]) |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
584 |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
585 # find videos in the database. |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
586 # |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
587 # despite how it may seem, this is actually really fast, and fairly |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
588 # memory efficient too (but really only if we're using simdjson...) |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
589 videos = [ |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
590 i if not simdjson else i.as_dict() |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
591 for f in load_split_files(args["--database"]) |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
592 for i in (f if not "videos" in f else f["videos"]) # logic is reversed kinda, python is weird |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
593 if "uploader_id" in i and i["uploader_id"] in channels |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
594 ] |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
595 |
|
03c8fd4069fb
*: big refactor, switch to GPLv2, and add README
Paper <paper@tflc.us>
parents:
13
diff
changeset
|
596 while True: |
|
|
597 if len(videos) == 0: |
|
|
598 break |
|
|
599 |
|
|
600 videos_copy = list(videos) # snapshot: videos is mutated inside the loop below |
|
|
601 |
|
|
602 for i in videos_copy: |
|
|
603 channel = channels[i["uploader_id"]] |
|
|
604 |
|
|
605 # precalculated for speed |
|
|
606 output = channel["output"] |
|
10
|
607 |
|
14
|
608 print("%s:" % i["id"]) |
|
|
609 basename = "%s/%s-%s" % (output, sanitize_filename(i["title"], |
|
|
610 restricted=True), i["id"]) |
| 16 | 611 def filenotworthit(f) -> bool: |
| 612 try: | |
| 613 return bool(os.path.getsize(f)) | |
| 614 except OSError: | |
| 615 return False | |
| 616 | |
| 617 pathoutput = Path(output) | |
| 618 | |
| 619 # This is terrible: glob each container extension for this video ID and keep only non-empty files | |
| 620 files = list(filter(filenotworthit, [y | |
|
14
|
621 for p in ["mkv", "mp4", "webm"] |
| 16 | 622 for y in pathoutput.glob(("*-%s." + p) % i["id"])])) |
|
14
|
623 if files: |
|
|
624 print(" video already downloaded!") |
|
|
625 videos.remove(i) |
|
|
626 write_metadata(i, basename) |
|
|
627 continue |
|
|
628 |
|
|
629 # high level "download" function. |
|
|
630 def dl(video: dict, basename: str, output: str): |
|
|
631 dls = [] |
|
|
632 |
|
|
633 if ytdlp_works: |
|
|
634 dls.append({ |
|
|
635 "func": ytdlp_dl, |
|
|
636 "name": "using yt-dlp", |
|
|
637 }) |
|
10
|
638 |
| 18 | 639 dls.append({ |
| 640 "func": ia_dl, | |
| 641 "name": "from the Internet Archive", | |
| 642 }) | |
|
14
|
643 |
|
|
644 dls.append({ |
|
|
645 "func": desirintoplaisir_dl, |
|
|
646 "name": "from LMIJLM/DJ Plaisir's archive", |
|
|
647 }) |
|
|
648 dls.append({ |
|
|
649 "func": ghostarchive_dl, |
|
|
650 "name": "from GhostArchive" |
|
|
651 }) |
|
|
652 dls.append({ |
|
|
653 "func": wayback_dl, |
|
|
654 "name": "from the Wayback Machine" |
|
|
655 }) |
|
|
656 |
|
|
657 for dl in dls: |
|
|
658 print(" attempting to download %s" % dl["name"]) |
|
|
659 r = dl["func"](video, basename, output) |
| 18 | 660 if r == DownloaderStatus.SUCCESS: |
|
14
|
661 # all good, video's downloaded |
| 18 | 662 return DownloaderStatus.SUCCESS |
| 663 elif r == DownloaderStatus.UNAVAILABLE: | |
|
14
|
664 # video is unavailable here |
|
|
665 print(" oops, video is not available there...") |
|
10
|
666 continue |
| 18 | 667 elif r == DownloaderStatus.ERROR: |
|
14
|
668 # error while downloading; likely temporary. |
|
|
669 # TODO we should save which downloader the video |
|
|
670 # was on, so we can continue back at it later. |
| 18 | 671 return DownloaderStatus.ERROR |
| 672 | |
| 673 return DownloaderStatus.UNAVAILABLE | |
|
14
|
674 |
|
|
675 r = dl(i, basename, output) |
| 18 | 676 if r == DownloaderStatus.ERROR: |
|
14
|
677 continue |
|
|
678 |
|
|
679 # video is downloaded, or it's totally unavailable, so |
|
|
680 # remove it from being checked again. |
|
|
681 videos.remove(i) |
|
|
682 # ... and then dump the metadata, if there isn't any on disk. |
|
|
683 write_metadata(i, basename) |
|
|
684 |
| 18 | 685 if r == DownloaderStatus.SUCCESS: |
|
14
|
686 # video is downloaded |
|
|
687 continue |
|
|
688 |
|
|
689 # video is unavailable everywhere; only the metadata was written. |
|
|
690 print(" video is unavailable everywhere; dumping out metadata only") |
|
6
|
691 |
|
10
|
692 |
|
|
693 if __name__ == "__main__": |
|
|
694 main() |
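
The download loop in the listing tries each source in order, stops on success or on a (likely temporary) error, and falls through to the next source when a video is unavailable. A minimal standalone sketch of that pattern follows. It is not the script's own code: `try_sources`, `retry_until_settled`, the enum values, and the `max_passes` bound are all assumptions made for illustration (the original loops with `while True` and defines its own `DownloaderStatus` elsewhere in the file).

```python
from enum import Enum


class DownloaderStatus(Enum):
    # Stand-in for the script's DownloaderStatus; these values are assumed.
    SUCCESS = 0
    UNAVAILABLE = 1
    ERROR = 2


def try_sources(video, sources):
    """Try each (name, func) pair in order, like the loop inside dl()."""
    for name, func in sources:
        status = func(video)
        if status is DownloaderStatus.SUCCESS:
            return status, name
        if status is DownloaderStatus.ERROR:
            # Likely temporary: stop here so the caller can retry later.
            return status, name
        # UNAVAILABLE: fall through to the next source.
    return DownloaderStatus.UNAVAILABLE, None


def retry_until_settled(videos, sources, max_passes=3):
    """Outer loop: iterate over a snapshot and drop videos that are settled.

    max_passes is a hypothetical safety bound, not part of the original.
    """
    pending = list(videos)
    for _ in range(max_passes):
        if not pending:
            break
        # Iterate over a copy so removing from `pending` skips nothing.
        for video in list(pending):
            status, _name = try_sources(video, sources)
            if status is not DownloaderStatus.ERROR:
                pending.remove(video)
    return pending
```

Iterating over `list(pending)` rather than `pending` itself matters here: removing items from a list while iterating over that same list causes the iterator to skip the element after each removal.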
