changeset 14:03c8fd4069fb default tip
*: big refactor, switch to GPLv2, and add README
Okay: now, we use a modular approach for downloaders. Each downloader
is provided through a single function (which does the fetching).
Additionally, the internetarchive library is optional now if the user
does not want to install it.
yt-dlp is still necessary, though, for its sanitize_filename function.
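
As a rough illustration of the optional-dependency idea above, the sketch below imports a module only if it is installed and records the result in a flag. The helper name `optional_import` is hypothetical and does not appear in this changeset; the script itself uses plain `try`/`except ImportError` blocks that set flags such as `ia_works`.

```python
import importlib


def optional_import(name: str):
    """Return the imported module, or None if it is not installed.

    Hypothetical helper for illustration; not part of the changeset.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# "internetarchive" is the optional library named in the commit message.
internetarchive = optional_import("internetarchive")
ia_works = internetarchive is not None

if not ia_works:
    # Callers can check the flag and skip this source entirely.
    print("internetarchive is not installed; skipping that source")
```

A downloader that depends on the library can then test `ia_works` before doing any work, so the rest of the script keeps running without it.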
If and when I get to adding vanity features (such as finding the best
possible source by comparing resolution and bitrate), I'll probably
separate out all of the downloaders into different files.
I also moved this project to a separate repository from 'codedump',
keeping all of the relevant commit history :)
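
The modular approach described in the message above could be driven by a loop along these lines. This is a minimal sketch, not the code from the commit: the downloader signature and the return codes 0/1/2 follow the comments in channeldownloader.py, while `try_downloaders`, the constant names, and the stand-in downloaders are assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Return codes as documented in channeldownloader.py in this changeset:
# 0 = downloaded, 1 = error (may work later), 2 = unavailable at this source.
OK, RETRY_LATER, UNAVAILABLE = 0, 1, 2

# Every downloader takes (video, basename, output) and returns an int.
Downloader = Callable[[Dict, str, str], int]


def try_downloaders(video: Dict, basename: str, output: str,
                    downloaders: List[Downloader]) -> int:
    """Try each source in order (hypothetical helper, not from the commit)."""
    saw_retry = False
    for dl in downloaders:
        result = dl(video, basename, output)
        if result == OK:
            return OK            # stop at the first source that works
        if result == RETRY_LATER:
            saw_retry = True     # remember that a retry might succeed
    # Nothing worked: report whether trying again later is worthwhile.
    return RETRY_LATER if saw_retry else UNAVAILABLE


# Stand-in downloaders, for illustration only.
def always_gone(video: Dict, basename: str, output: str) -> int:
    return UNAVAILABLE


def flaky(video: Dict, basename: str, output: str) -> int:
    return RETRY_LATER


def works(video: Dict, basename: str, output: str) -> int:
    return OK
```

Because each source is just a function with the same signature, adding or reordering sources only means editing the list passed to the loop.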
| author | Paper <paper@tflc.us> |
|---|---|
| date | Sat, 30 Aug 2025 17:09:56 -0400 |
| parents | 2e7a3725ad21 |
| children | |
| files | LICENSE README channeldownloader.py |
| diffstat | 3 files changed, 796 insertions(+), 166 deletions(-) [+] |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/LICENSE Sat Aug 30 17:09:56 2025 -0400 @@ -0,0 +1,338 @@ + GNU GENERAL PUBLIC LICENSE + Version 2, June 1991 + + Copyright (C) 1989, 1991 Free Software Foundation, Inc., + <https://fsf.org/> + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +License is intended to guarantee your freedom to share and change free +software--to make sure the software is free for all its users. This +General Public License applies to most of the Free Software +Foundation's software and to any other program whose authors commit to +using it. (Some other Free Software Foundation software is covered by +the GNU Lesser General Public License instead.) You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +this service if you wish), that you receive source code or can get it +if you want it, that you can change the software or use pieces of it +in new free programs; and that you know you can do these things. + + To protect your rights, we need to make restrictions that forbid +anyone to deny you these rights or to ask you to surrender the rights. +These restrictions translate to certain responsibilities for you if you +distribute copies of the software, or if you modify it. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must give the recipients all the rights that +you have. You must make sure that they, too, receive or can get the +source code. And you must show them these terms so they know their +rights. 
+ + We protect your rights with two steps: (1) copyright the software, and +(2) offer you this license which gives you legal permission to copy, +distribute and/or modify the software. + + Also, for each author's protection and ours, we want to make certain +that everyone understands that there is no warranty for this free +software. If the software is modified by someone else and passed on, we +want its recipients to know that what they have is not the original, so +that any problems introduced by others will not reflect on the original +authors' reputations. + + Finally, any free program is threatened constantly by software +patents. We wish to avoid the danger that redistributors of a free +program will individually obtain patent licenses, in effect making the +program proprietary. To prevent this, we have made it clear that any +patent must be licensed for everyone's free use or not licensed at all. + + The precise terms and conditions for copying, distribution and +modification follow. + + GNU GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License applies to any program or other work which contains +a notice placed by the copyright holder saying it may be distributed +under the terms of this General Public License. The "Program", below, +refers to any such program or work, and a "work based on the Program" +means either the Program or any derivative work under copyright law: +that is to say, a work containing the Program or a portion of it, +either verbatim or with modifications and/or translated into another +language. (Hereinafter, translation is included without limitation in +the term "modification".) Each licensee is addressed as "you". + +Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. 
The act of +running the Program is not restricted, and the output from the Program +is covered only if its contents constitute a work based on the +Program (independent of having been made by running the Program). +Whether that is true depends on what the Program does. + + 1. You may copy and distribute verbatim copies of the Program's +source code as you receive it, in any medium, provided that you +conspicuously and appropriately publish on each copy an appropriate +copyright notice and disclaimer of warranty; keep intact all the +notices that refer to this License and to the absence of any warranty; +and give any other recipients of the Program a copy of this License +along with the Program. + +You may charge a fee for the physical act of transferring a copy, and +you may at your option offer warranty protection in exchange for a fee. + + 2. You may modify your copy or copies of the Program or any portion +of it, thus forming a work based on the Program, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) You must cause the modified files to carry prominent notices + stating that you changed the files and the date of any change. + + b) You must cause any work that you distribute or publish, that in + whole or in part contains or is derived from the Program or any + part thereof, to be licensed as a whole at no charge to all third + parties under the terms of this License. + + c) If the modified program normally reads commands interactively + when run, you must cause it, when started running for such + interactive use in the most ordinary way, to print or display an + announcement including an appropriate copyright notice and a + notice that there is no warranty (or else, saying that you provide + a warranty) and that users may redistribute the program under + these conditions, and telling the user how to view a copy of this + License. 
(Exception: if the Program itself is interactive but + does not normally print such an announcement, your work based on + the Program is not required to print an announcement.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Program, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Program, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Program. + +In addition, mere aggregation of another work not based on the Program +with the Program (or with a work based on the Program) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. 
You may copy and distribute the Program (or a work based on it, +under Section 2) in object code or executable form under the terms of +Sections 1 and 2 above provided that you also do one of the following: + + a) Accompany it with the complete corresponding machine-readable + source code, which must be distributed under the terms of Sections + 1 and 2 above on a medium customarily used for software interchange; or, + + b) Accompany it with a written offer, valid for at least three + years, to give any third party, for a charge no more than your + cost of physically performing source distribution, a complete + machine-readable copy of the corresponding source code, to be + distributed under the terms of Sections 1 and 2 above on a medium + customarily used for software interchange; or, + + c) Accompany it with the information you received as to the offer + to distribute corresponding source code. (This alternative is + allowed only for noncommercial distribution and only if you + received the program in object code or executable form with such + an offer, in accord with Subsection b above.) + +The source code for a work means the preferred form of the work for +making modifications to it. For an executable work, complete source +code means all the source code for all modules it contains, plus any +associated interface definition files, plus the scripts used to +control compilation and installation of the executable. However, as a +special exception, the source code distributed need not include +anything that is normally distributed (in either source or binary +form) with the major components (compiler, kernel, and so on) of the +operating system on which the executable runs, unless that component +itself accompanies the executable. 
+ +If distribution of executable or object code is made by offering +access to copy from a designated place, then offering equivalent +access to copy the source code from the same place counts as +distribution of the source code, even though third parties are not +compelled to copy the source along with the object code. + + 4. You may not copy, modify, sublicense, or distribute the Program +except as expressly provided under this License. Any attempt +otherwise to copy, modify, sublicense or distribute the Program is +void, and will automatically terminate your rights under this License. +However, parties who have received copies, or rights, from you under +this License will not have their licenses terminated so long as such +parties remain in full compliance. + + 5. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Program or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Program (or any work based on the +Program), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Program or works based on it. + + 6. Each time you redistribute the Program (or any work based on the +Program), the recipient automatically receives a license from the +original licensor to copy, distribute or modify the Program subject to +these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties to +this License. + + 7. 
If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Program at all. For example, if a patent +license would not permit royalty-free redistribution of the Program by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Program. + +If any portion of this section is held invalid or unenforceable under +any particular circumstance, the balance of the section is intended to +apply and the section as a whole is intended to apply in other +circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system, which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 8. 
If the distribution and/or use of the Program is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Program under this License +may add an explicit geographical distribution limitation excluding +those countries, so that distribution is permitted only in or among +countries not thus excluded. In such case, this License incorporates +the limitation as if written in the body of this License. + + 9. The Free Software Foundation may publish revised and/or new versions +of the General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + +Each version is given a distinguishing version number. If the Program +specifies a version number of this License which applies to it and "any +later version", you have the option of following the terms and conditions +either of that version or of any later version published by the Free +Software Foundation. If the Program does not specify a version number of +this License, you may choose any version ever published by the Free Software +Foundation. + + 10. If you wish to incorporate parts of the Program into other free +programs whose distribution conditions are different, write to the author +to ask for permission. For software which is copyrighted by the Free +Software Foundation, write to the Free Software Foundation; we sometimes +make exceptions for this. Our decision will be guided by the two goals +of preserving the free status of all derivatives of our free software and +of promoting the sharing and reuse of software generally. + + NO WARRANTY + + 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY +FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. 
EXCEPT WHEN +OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES +PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED +OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS +TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE +PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, +REPAIR OR CORRECTION. + + 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR +REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, +INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING +OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED +TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY +YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER +PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE +POSSIBILITY OF SUCH DAMAGES. + + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +convey the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. 
+ + <one line to give the program's name and a brief idea of what it does.> + Copyright (C) <year> <name of author> + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License along + with this program; if not, see <https://www.gnu.org/licenses/>. + +Also add information on how to contact you by electronic and paper mail. + +If the program is interactive, make it output a short notice like this +when it starts in an interactive mode: + + Gnomovision version 69, Copyright (C) year name of author + Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. + This is free software, and you are welcome to redistribute it + under certain conditions; type `show c' for details. + +The hypothetical commands `show w' and `show c' should show the appropriate +parts of the General Public License. Of course, the commands you use may +be called something other than `show w' and `show c'; they could even be +mouse-clicks or menu items--whatever suits your program. + +You should also get your employer (if you work as a programmer) or your +school, if any, to sign a "copyright disclaimer" for the program, if +necessary. Here is a sample; alter the names: + + Yoyodyne, Inc., hereby disclaims all copyright interest in the program + `Gnomovision' (which makes passes at compilers) written by James Hacker. 
+ + <signature of Moe Ghoul>, 1 April 1989 + Moe Ghoul, President of Vice + +This General Public License does not permit incorporating your program into +proprietary programs. If your program is a subroutine library, you may +consider it more useful to permit linking proprietary applications with the +library. If this is what you want to do, use the GNU Lesser General +Public License instead of this License.
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/README Sat Aug 30 17:09:56 2025 -0400 @@ -0,0 +1,3 @@ +this is a simple script for scraping videos off of a channel. +it is primarily meant for usage with FinnOtaku's metadata archive, +but it realistically can be used for other purposes as well.
--- a/channeldownloader.py Fri Apr 14 23:53:37 2023 -0400 +++ b/channeldownloader.py Sat Aug 30 17:09:56 2025 -0400 @@ -1,4 +1,22 @@ #!/usr/bin/env python3 +# -*- coding: utf-8 -*- +# channeldownloader.py - scrapes youtube videos from a channel from +# a variety of sources + +# Copyright (c) 2021-2025 Paper <paper@tflc.us> +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. + """ Usage: channeldownloader.py <url>... (--database <file>) @@ -12,218 +30,489 @@ -h --help Show this screen -o --output <folder> Output folder, relative to the current directory [default: .] - -d --database <file> YTPMV_Database compatible JSON file + -d --database <file> yt-dlp style database of videos. Should contain + an array of yt-dlp .info.json data. For example, + FinnOtaku's YTPMV metadata archive. 
""" + +# Built-in python stuff (no possible missing dependencies) from __future__ import print_function import docopt -import internetarchive -try: - import orjson as json -except ImportError: - import json import os import re import time import urllib.request -import requests # need this for ONE (1) exception -import yt_dlp as youtube_dl +import os +import ssl from urllib.error import HTTPError -from yt_dlp.utils import sanitize_filename, DownloadError from pathlib import Path -from requests.exceptions import ConnectTimeout + +# We can utilize special simdjson features if it is available +simdjson = False +try: + import simdjson as json + simdjson = True + print("INFO: using simdjson") +except ImportError: + try: + import ujson as json + print("INFO: using ujson") + except ImportError: + try: + import orjson as json + print("INFO: using orjson") + except ImportError: + import json + print("INFO: using built-in json (slow!)") -class MyLogger(object): - def debug(self, msg): - pass - - def warning(self, msg): - pass +ytdlp_works = False - def error(self, msg): - print(" " + msg) - pass +try: + import yt_dlp as youtube_dl + from yt_dlp.utils import sanitize_filename, DownloadError + ytdlp_works = True +except ImportError: + print("failed to import yt-dlp!") + print("downloading from YouTube directly will not work.") +ia_works = False -def ytdl_hook(d) -> None: - if d["status"] == "finished": - print(" downloaded %s: 100%% " % (os.path.basename(d["filename"]))) - if d["status"] == "downloading": - print(" downloading %s: %s\r" % (os.path.basename(d["filename"]), - d["_percent_str"]), end="") - if d["status"] == "error": - print("\n an error occurred downloading %s!" 
- % (os.path.basename(d["filename"]))) +try: + import internetarchive + from requests.exceptions import ConnectTimeout + ia_works = True +except ImportError: + print("failed to import the Internet Archive's python library!") + print("downloading from IA will not work.") + +############################################################################## +## DOWNLOADERS + +# All downloaders should be a function under this signature: +# dl(video: dict, basename: str, output: str) -> int +# where: +# 'video': the .info.json scraped from the YTPMV metadata archive. +# 'basename': the basename output to write as. +# 'output': the output directory. +# yes, it's weird, but I don't care ;) +# +# Magic return values: +# 0 -- all good, video is downloaded +# 1 -- error downloading video; it may still be available if we try again +# 2 -- video is proved totally unavailable here. give up -def load_split_files(path: str): - if not os.path.isdir(path): - yield json.load(open(path, "r", encoding="utf-8")) - for fi in os.listdir(path): - if re.search(r"vids[0-9\-]+?\.json", fi): - with open(path + "/" + fi, "r", encoding="utf-8") as infile: - print(fi) - yield json.load(infile) +# Basic downloader template. +# +# This does a brute-force of all extensions within vexts and iexts +# in an attempt to find a working video link. +# +# linktemplate is a template to be created using the video ID and +# extension. 
For example: +# https://cdn.ytarchiver.com/%s.%s +def basic_dl_template(video: dict, basename: str, output: str, + linktemplate: str, vexts: list, iexts: list) -> int: + # actual downloader + def basic_dl_impl(vid: str, ext: str) -> int: + url = (linktemplate % (vid, ext)) + try: + with urllib.request.urlopen(url) as headers: + with open("%s.%s" % (basename, ext), "wb") as f: + f.write(headers.read()) + print(" downloaded %s.%s" % (basename, ext)) + return 0 + except TimeoutError: + return 1 + except HTTPError: + return 2 + except Exception as e: + print(" unknown error downloading video!") + print(e) + return 1 + for exts in [vexts, iexts]: + for ext in exts: + r = basic_dl_impl(video["id"], ext) + if r == 0: + break # done! + elif r == 1: + # timeout; try again later? + return 1 + elif r == 2: + continue + else: + # we did not break out of the loop + # which means all extensions were unavailable + return 2 -def reporthook(count: int, block_size: int, total_size: int) -> None: - global start_time - if count == 0: - start_time = time.time() - return - percent = int(count * block_size * 100 / total_size) - print(" downloading %d%% \r" % (percent), end="") + # video was downloaded successfully + return 0 -def write_metadata(i: dict, basename: str) -> None: - if not os.path.exists(basename + ".info.json"): - with open(basename + ".info.json", "w", encoding="utf-8") as jsonfile: - try: - jsonfile.write(json.dumps(i).decode("utf-8")) - except AttributeError: - jsonfile.write(json.dumps(i)) - print(" saved %s" % os.path.basename(jsonfile.name)) - if not os.path.exists(basename + ".description"): - with open(basename + ".description", "w", - encoding="utf-8") as descfile: - descfile.write(i["description"]) - print(" saved %s" % os.path.basename(descfile.name)) +# GhostArchive, basic... 
+def ghostarchive_dl(video: dict, basename: str, output: str) -> int: + return basic_dl_template(video, basename, output, + "https://ghostvideo.b-cdn.net/chimurai/%s.%s", + ["mp4", "webm", "mkv"], + [] # none + ) -def wayback_machine_dl(video: dict, basename: str) -> int: +# media.desirintoplaisir.net +# +# holds PRIMARILY popular videos (i.e. no niche internet microcelebrities) +# or weeb shit, however it seems to be growing to other stuff. +# +# there isn't really a proper API; I've based the scraping off of the HTML +# and the public source code. +def desirintoplaisir_dl(video: dict, basename: str, output: str) -> int: + return basic_dl_template(video, basename, output, + "https://media.desirintoplaisir.net/content/%s.%s", + ["mp4", "webm", "mkv"], + ["webp"] + ) + + +# Internet Archive's Wayback Machine +# +# Internally, IA's javascript routines forward to the magic +# URL used here. +# +# TODO: Download thumbnails through the CDX API: +# https://github.com/TheTechRobo/youtubevideofinder/blob/master/lostmediafinder/finder.py +# the CDX API is pretty slow though, so it should be used as a last resort. 
+def wayback_dl(video: dict, basename: str, output: str) -> int: try: - url = ''.join(["https://web.archive.org/web/2oe_/http://wayback-fakeu", - "rl.archive.org/yt/%s"]) - headers = urllib.request.urlopen(url % video["id"]) - contenttype = headers.getheader("Content-Type") - if contenttype == "video/webm": - ext = "webm" - elif contenttype == "video/mp4": - ext = "mp4" - else: - raise HTTPError(url=None, code=None, msg=None, - hdrs=None, fp=None) - urllib.request.urlretrieve(url % video["id"], "%s.%s" % (basename, ext), - reporthook) + url = ("https://web.archive.org/web/2oe_/http://wayback-fakeurl.archiv" + "e.org/yt/%s" % video["id"]) + with urllib.request.urlopen(url) as headers: + contenttype = headers.getheader("Content-Type") + if contenttype == "video/webm" or contenttype == "video/mp4": + ext = contenttype.split("/")[-1] + else: + raise HTTPError(url=None, code=None, msg=None, + hdrs=None, fp=None) + with open("%s.%s" % (basename, ext), "wb") as f: + f.write(headers.read()) print(" downloaded %s.%s" % (basename, ext)) return 0 except TimeoutError: return 1 except HTTPError: - print(" video not available on the Wayback Machine!") - return 0 + # dont keep trying + return 2 except Exception as e: - print(" unknown error downloading video!\n") + print(" unknown error downloading video!") print(e) - return 0 - - -def ia_file_legit(path: str, vidid: str) -> bool: - return True if re.search(''.join([r"((?:.+?-)?", vidid, r"\.(?:mp4|jpg|web" - r"p|mkv|webm|info\\.json|description|annotations.xml" - "))"]), - path) else False + return 1 -def internet_archive_dl(video: dict, basename: str, output: str) -> int: - if internetarchive.get_item("youtube-%s" % video["id"]).exists: - flist = [f.name for f in internetarchive.get_files("youtube-%s" % video["id"]) if ia_file_legit(f.name, video["id"])] - while True: - try: - internetarchive.download("youtube-%s" % video["id"], - files=flist, verbose=True, - destdir=output, - no_directory=True, - ignore_existing=True, - 
retries=9999)
-                break
-            except ConnectTimeout:
-                continue
-            except Exception as e:
-                print(e)
-                return 0
-        if flist[0][:len(video["id"])] == video["id"]:
-            for fname in flist:
-                if os.path.exists("%s/%s" % (output, fname)):
-                    os.replace("%s/%s" % (output, fname),
-                               "%s-%s" % (basename.rsplit("-", 1)[0],
-                                          fname))
-        return 1
+# Internet Archive (tubeup)
+def ia_dl(video: dict, basename: str, output: str) -> int:
+    def ia_file_legit(file: internetarchive.File, vidid: str) -> bool:
+        # FIXME:
+        #
+        # There are some items on IA that combine the old tubeup behavior
+        # (i.e., including the sanitized video name before the ID)
+        # and the new tubeup behavior (filename only contains the video ID)
+        # hence we will download the entire video twice.
+        #
+        # This isn't much of a problem anymore (and hasn't been for like 3
+        # years), since I contributed code to not upload something if there
+        # is already something there. However we should handle this case
+        # anyway.
+        #
+        # Additionally, there are some items that have duplicate video files
+        # (from when the owners changed the title). We should ideally only
+        # download unique files. IA seems to provide SHA1 hashes...
+        #
+        # We should also check whether the copy on IA is higher quality
+        # than a local copy... :)
+        if not re.search(r"((?:.+?-)?" + vidid + r"\.(?:mp4|jpg|webp|mkv|w"
+                         r"ebm|info\.json|description|annotations\.xml))",
+                         file.name):
+            return False
+
+        # now, check the metadata
+        print(file)
+        return True
+
+
+    if not internetarchive.get_item("youtube-%s" % video["id"]).exists:
+        return 2
+
+    flist = [
+        f.name
+        for f in internetarchive.get_files("youtube-%s" % video["id"])
+        if ia_file_legit(f, video["id"])
+    ]
+    while True:
+        try:
+            internetarchive.download("youtube-%s" % video["id"], files=flist,
+                                     verbose=True, destdir=output,
+                                     no_directory=True, ignore_existing=True,
+                                     retries=9999)
+            break
+        except ConnectTimeout:
+            time.sleep(1)
+            continue
+        except Exception as e:
+            print(e)
+            return 1
+
+    # Newer versions of tubeup save only the video ID.
+    # Account for this by replacing it.
+    #
+    # paper/2025-08-30: fixed a bug where video IDs with hyphens
+    # would incorrectly truncate
+    for fname in flist:
+        # ignore any files whose names are not simply the ID
+        if os.path.splitext(fname)[0] != video["id"]:
+            continue
+
+        if os.path.exists("%s/%s" % (output, fname)):
+            # splitext() keeps the leading dot in the extension
+            os.replace("%s/%s" % (output, fname),
+                       "%s%s" % (basename, os.path.splitext(fname)[1]))
     return 0
-ytdl_opts = {
-    "retries": 100,
-    "nooverwrites": True,
-    "call_home": False,
-    "quiet": True,
-    "writeinfojson": True,
-    "writedescription": True,
-    "writethumbnail": True,
-    "writeannotations": True,
-    "writesubtitles": True,
-    "allsubtitles": True,
-    "addmetadata": True,
-    "continuedl": True,
-    "embedthumbnail": True,
-    "format": "bestvideo+bestaudio/best",
-    "restrictfilenames": True,
-    "no_warnings": True,
-    "progress_hooks": [ytdl_hook],
-    "logger": MyLogger(),
-    "ignoreerrors": False,
-}
+def ytdlp_dl(video: dict, basename: str, output: str) -> int:
+    # intentionally ignores all messages besides errors
+    class MyLogger(object):
+        def debug(self, msg):
+            pass
+
+        def warning(self, msg):
+            pass
+
+        def error(self, msg):
+            print(" " + msg)
+            pass
+
+
+    def ytdl_hook(d) -> None:
+        if d["status"] == "finished":
+            print(" downloaded %s: 100%% " % (os.path.basename(d["filename"])))
+        if d["status"] == "downloading":
+            print(" downloading %s: %s\r" % (os.path.basename(d["filename"]),
+                                             d["_percent_str"]), end="")
+        if d["status"] == "error":
+            print("\n an error occurred downloading %s!"
+                  % (os.path.basename(d["filename"])))
+
+    ytdl_opts = {
+        "retries": 100,
+        "nooverwrites": True,
+        "call_home": False,
+        "quiet": True,
+        "writeinfojson": True,
+        "writedescription": True,
+        "writethumbnail": True,
+        "writeannotations": True,
+        "writesubtitles": True,
+        "allsubtitles": True,
+        "addmetadata": True,
+        "continuedl": True,
+        "embedthumbnail": True,
+        "format": "bestvideo+bestaudio/best",
+        "restrictfilenames": True,
+        "no_warnings": True,
+        "progress_hooks": [ytdl_hook],
+        "logger": MyLogger(),
+        "ignoreerrors": False,
+
+        #mm, output template
+        "outtmpl": output + "/%(title)s-%(id)s.%(ext)s",
+    }
+
+    with youtube_dl.YoutubeDL(ytdl_opts) as ytdl:
+        try:
+            ytdl.extract_info("https://youtube.com/watch?v=%s" % video["id"])
+            return 0
+        except DownloadError:
+            return 2
+        except Exception as e:
+            print(" unknown error downloading video!\n")
+            print(e)
+
+    return 1
+
+
+# TODO: There are multiple other youtube archival websites available.
+# Most notable is https://findyoutubevideo.thetechrobo.ca .
+# This combines a lot of sparse youtube archival services, and has
+# a convenient API we can use. Nice!
+#
+# There is also the "Distributed YouTube Archive" which is totally
+# useless because there's no way to automate it...
+
+##############################################################################
 def main():
+    # generator; creates a list of files, and returns the parsed form of
+    # each. note that the parser is not necessarily
+    def load_split_files(path: str):
+        list_files = []
+
+        # build the path list
+        if not os.path.isdir(path):
+            list_files.append(path)
+        else:
+            for fi in os.listdir(path):
+                if re.search(r"vids[0-9\-]+?\.json", fi):
+                    list_files.append(path + "/" + fi)
+
+        # now open each as a json
+        for fi in list_files:
+            print(fi)
+            with open(fi, "r", encoding="utf-8") as infile:
+                if simdjson:
+                    # Using this is a lot faster in SIMDJSON, since instead
+                    # of converting all of the JSON key/value pairs into
+                    # native Python objects, they stay in an internal state.
+                    #
+                    # This means we only get the stuff we absolutely need,
+                    # which is the uploader ID, and copy everything else
+                    # if the ID is one we are looking for.
+                    parser = json.Parser()
+                    yield parser.parse(infile.read())
+                    del parser
+                else:
+                    yield json.load(infile)
+
+
+    def write_metadata(i: dict, basename: str) -> None:
+        # ehhh
+        if not os.path.exists(basename + ".info.json"):
+            with open(basename + ".info.json", "w", encoding="utf-8") as jsonfile:
+                try:
+                    # orjson outputs bytes
+                    jsonfile.write(json.dumps(i).decode("utf-8"))
+                except AttributeError:
+                    # everything else outputs a string
+                    jsonfile.write(json.dumps(i))
+                print(" saved %s" % os.path.basename(jsonfile.name))
+        if not os.path.exists(basename + ".description"):
+            with open(basename + ".description", "w",
+                      encoding="utf-8") as descfile:
+                descfile.write(i["description"])
+                print(" saved %s" % os.path.basename(descfile.name))
+
     args = docopt.docopt(__doc__)
     if not os.path.exists(args["--output"]):
         os.mkdir(args["--output"])
-    for f in load_split_files(args["--database"]):
-        for i in f:
-            uploader = i["uploader_id"] if "uploader_id" in i else None
-            for url in args["<url>"]:
-                channel = url.split("/")[-1]
+    channels = dict()
+
+    for url in args["<url>"]:
+        chn = url.split("/")[-1]
+        channels[chn] = {"output": "%s/%s" % (args["--output"], chn)}
+
+    for channel in channels.values():
+        if not os.path.exists(channel["output"]):
+            os.mkdir(channel["output"])
+
+    # find videos in the database.
+    #
+    # despite how it may seem, this is actually really fast, and fairly
+    # memory efficient too (but really only if we're using simdjson...)
+    videos = [
+        i if not simdjson else i.as_dict()
+        for f in load_split_files(args["--database"])
+        for i in (f if not "videos" in f else f["videos"]) # logic is reversed kinda, python is weird
+        if "uploader_id" in i and i["uploader_id"] in channels
+    ]
+
+    while True:
+        if len(videos) == 0:
+            break
+
+        # iterate over a real copy, since we remove from videos below
+        videos_copy = videos.copy()
+
+        for i in videos_copy:
+            channel = channels[i["uploader_id"]]
+
+            # precalculated for speed
+            output = channel["output"]
-                output = "%s/%s" % (args["--output"], channel)
-                if not os.path.exists(output):
-                    os.mkdir(output)
-                ytdl_opts["outtmpl"] = output + "/%(title)s-%(id)s.%(ext)s"
+            print("%s:" % i["id"])
+            basename = "%s/%s-%s" % (output, sanitize_filename(i["title"],
+                                     restricted=True), i["id"])
+            files = [y
+                     for p in ["mkv", "mp4", "webm"]
+                     for y in Path(output).glob(("*-%s." + p) % i["id"])]
+            if files:
+                print(" video already downloaded!")
+                videos.remove(i)
+                write_metadata(i, basename)
+                continue
+
+            # high level "download" function.
+            def dl(video: dict, basename: str, output: str):
+                dls = []
+
+                if ytdlp_works:
+                    dls.append({
+                        "func": ytdlp_dl,
+                        "name": "using yt-dlp",
+                    })
-                if uploader == channel:
-                    print(uploader, channel)
-                    print("%s:" % i["id"])
-                    basename = "%s/%s-%s" % (output, sanitize_filename(i["title"],
-                                             restricted=True), i["id"])
-                    files = [y for p in ["mkv", "mp4", "webm"] for y in list(Path(output).glob(("*-%s." + p) % i["id"]))]
-                    if files:
-                        print(" video already downloaded!")
-                        write_metadata(i, basename)
+                if ia_works:
+                    dls.append({
+                        "func": ia_dl,
+                        "name": "from the Internet Archive",
+                    })
+
+                dls.append({
+                    "func": desirintoplaisir_dl,
+                    "name": "from LMIJLM/DJ Plaisir's archive",
+                })
+                dls.append({
+                    "func": ghostarchive_dl,
+                    "name": "from GhostArchive"
+                })
+                dls.append({
+                    "func": wayback_dl,
+                    "name": "from the Wayback Machine"
+                })
+
+                for dl in dls:
+                    print(" attempting to download %s" % dl["name"])
+                    r = dl["func"](i, basename, output)
+                    if r == 0:
+                        # all good, video's downloaded
+                        return 0
+                    elif r == 2:
+                        # video is unavailable here
+                        print(" oops, video is not available there...")
                         continue
-                    # this code is *really* ugly... todo a rewrite?
-                    with youtube_dl.YoutubeDL(ytdl_opts) as ytdl:
-                        try:
-                            ytdl.extract_info("https://youtube.com/watch?v=%s"
-                                              % i["id"])
-                            continue
-                        except DownloadError:
-                            print(" video is not available! attempting to find In"
-                                  "ternet Archive pages of it...")
-                        except Exception as e:
-                            print(" unknown error downloading video!\n")
-                            print(e)
-                    if internet_archive_dl(i, basename, output): # if we can't download from IA
-                        continue
-                    print(" video does not have a Internet Archive page! attem"
-                          "pting to download from the Wayback Machine...")
-                    while True:
-                        if wayback_machine_dl(i, basename) == 0: # success
-                            break
-                        time.sleep(5)
-                        continue
-                    write_metadata(i, basename)
+                    elif r == 1:
+                        # error while downloading; likely temporary.
+                        # TODO we should save which downloader the video
+                        # was on, so we can continue back at it later.
+                        return 1
+                # video is unavailable everywhere
+                return 2
+
+            r = dl(i, basename, output)
+            if r == 1:
+                continue
+
+            # video is downloaded, or it's totally unavailable, so
+            # remove it from being checked again.
+            videos.remove(i)
+            # ... and then dump the metadata, if there isn't any on disk.
+            write_metadata(i, basename)
+
+            if r == 0:
+                # video is downloaded
+                continue
+
+            # video is unavailable; write out the metadata.
+            print(" video is unavailable everywhere; dumping out metadata only")
 if __name__ == "__main__":
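
The modular approach in this changeset relies on a small return-code convention: every downloader takes (video, basename, output) and returns 0 when the video was downloaded, 1 on a (probably temporary) error, and 2 when the video is not available from that source. The following is a minimal stand-alone sketch of that dispatch loop, not the code from the changeset itself. The names try_downloaders, fake_unavailable, fake_ok and fake_error are hypothetical and exist only for illustration; the stand-in downloaders do no network access.

```python
from typing import Dict, List

# Return codes shared by the downloaders in this changeset:
# 0 = downloaded, 1 = (probably temporary) error, 2 = not available here.
OK, ERROR, UNAVAILABLE = 0, 1, 2


def try_downloaders(video: dict, basename: str, output: str,
                    downloaders: List[Dict]) -> int:
    """Hypothetical stand-alone version of the nested dl() helper."""
    for entry in downloaders:
        print(" attempting to download %s" % entry["name"])
        result = entry["func"](video, basename, output)
        if result == OK:
            return OK
        if result == ERROR:
            # Stop early so the caller can retry this video later.
            return ERROR
        # Anything else is treated as "unavailable": try the next source.
        print(" oops, video is not available there...")
    # No source had the video.
    return UNAVAILABLE


# Stand-in downloaders, for illustration only.
def fake_unavailable(video: dict, basename: str, output: str) -> int:
    return UNAVAILABLE


def fake_ok(video: dict, basename: str, output: str) -> int:
    return OK


def fake_error(video: dict, basename: str, output: str) -> int:
    return ERROR


if __name__ == "__main__":
    sources = [
        {"func": fake_unavailable, "name": "from a source that lacks it"},
        {"func": fake_ok, "name": "from a source that has it"},
    ]
    print(try_downloaders({"id": "abc"}, "out/title-abc", "out", sources))
```

Keeping the sources in an ordered list of {"func", "name"} entries means a new archive can be added by appending one entry, which matches the commit message's description of one function per downloader.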
