--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/LICENSE	Sat Aug 30 17:09:56 2025 -0400
@@ -0,0 +1,338 @@
+                    GNU GENERAL PUBLIC LICENSE
+                       Version 2, June 1991
+
+ Copyright (C) 1989, 1991 Free Software Foundation, Inc.,
+ <https://fsf.org/>
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+                            Preamble
+
+  The licenses for most software are designed to take away your
+freedom to share and change it.  By contrast, the GNU General Public
+License is intended to guarantee your freedom to share and change free
+software--to make sure the software is free for all its users.  This
+General Public License applies to most of the Free Software
+Foundation's software and to any other program whose authors commit to
+using it.  (Some other Free Software Foundation software is covered by
+the GNU Lesser General Public License instead.)  You can apply it to
+your programs, too.
+
+  When we speak of free software, we are referring to freedom, not
+price.  Our General Public Licenses are designed to make sure that you
+have the freedom to distribute copies of free software (and charge for
+this service if you wish), that you receive source code or can get it
+if you want it, that you can change the software or use pieces of it
+in new free programs; and that you know you can do these things.
+
+  To protect your rights, we need to make restrictions that forbid
+anyone to deny you these rights or to ask you to surrender the rights.
+These restrictions translate to certain responsibilities for you if you
+distribute copies of the software, or if you modify it.
+
+  For example, if you distribute copies of such a program, whether
+gratis or for a fee, you must give the recipients all the rights that
+you have.  You must make sure that they, too, receive or can get the
+source code.  And you must show them these terms so they know their
+rights.
+
+  We protect your rights with two steps: (1) copyright the software, and
+(2) offer you this license which gives you legal permission to copy,
+distribute and/or modify the software.
+
+  Also, for each author's protection and ours, we want to make certain
+that everyone understands that there is no warranty for this free
+software.  If the software is modified by someone else and passed on, we
+want its recipients to know that what they have is not the original, so
+that any problems introduced by others will not reflect on the original
+authors' reputations.
+
+  Finally, any free program is threatened constantly by software
+patents.  We wish to avoid the danger that redistributors of a free
+program will individually obtain patent licenses, in effect making the
+program proprietary.  To prevent this, we have made it clear that any
+patent must be licensed for everyone's free use or not licensed at all.
+
+  The precise terms and conditions for copying, distribution and
+modification follow.
+
+                    GNU GENERAL PUBLIC LICENSE
+   TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
+
+  0. This License applies to any program or other work which contains
+a notice placed by the copyright holder saying it may be distributed
+under the terms of this General Public License.  The "Program", below,
+refers to any such program or work, and a "work based on the Program"
+means either the Program or any derivative work under copyright law:
+that is to say, a work containing the Program or a portion of it,
+either verbatim or with modifications and/or translated into another
+language.  (Hereinafter, translation is included without limitation in
+the term "modification".)  Each licensee is addressed as "you".
+
+Activities other than copying, distribution and modification are not
+covered by this License; they are outside its scope.  The act of
+running the Program is not restricted, and the output from the Program
+is covered only if its contents constitute a work based on the
+Program (independent of having been made by running the Program).
+Whether that is true depends on what the Program does.
+
+  1. You may copy and distribute verbatim copies of the Program's
+source code as you receive it, in any medium, provided that you
+conspicuously and appropriately publish on each copy an appropriate
+copyright notice and disclaimer of warranty; keep intact all the
+notices that refer to this License and to the absence of any warranty;
+and give any other recipients of the Program a copy of this License
+along with the Program.
+
+You may charge a fee for the physical act of transferring a copy, and
+you may at your option offer warranty protection in exchange for a fee.
+
+  2. You may modify your copy or copies of the Program or any portion
+of it, thus forming a work based on the Program, and copy and
+distribute such modifications or work under the terms of Section 1
+above, provided that you also meet all of these conditions:
+
+    a) You must cause the modified files to carry prominent notices
+    stating that you changed the files and the date of any change.
+
+    b) You must cause any work that you distribute or publish, that in
+    whole or in part contains or is derived from the Program or any
+    part thereof, to be licensed as a whole at no charge to all third
+    parties under the terms of this License.
+
+    c) If the modified program normally reads commands interactively
+    when run, you must cause it, when started running for such
+    interactive use in the most ordinary way, to print or display an
+    announcement including an appropriate copyright notice and a
+    notice that there is no warranty (or else, saying that you provide
+    a warranty) and that users may redistribute the program under
+    these conditions, and telling the user how to view a copy of this
+    License.  (Exception: if the Program itself is interactive but
+    does not normally print such an announcement, your work based on
+    the Program is not required to print an announcement.)
+
+These requirements apply to the modified work as a whole.  If
+identifiable sections of that work are not derived from the Program,
+and can be reasonably considered independent and separate works in
+themselves, then this License, and its terms, do not apply to those
+sections when you distribute them as separate works.  But when you
+distribute the same sections as part of a whole which is a work based
+on the Program, the distribution of the whole must be on the terms of
+this License, whose permissions for other licensees extend to the
+entire whole, and thus to each and every part regardless of who wrote it.
+
+Thus, it is not the intent of this section to claim rights or contest
+your rights to work written entirely by you; rather, the intent is to
+exercise the right to control the distribution of derivative or
+collective works based on the Program.
+
+In addition, mere aggregation of another work not based on the Program
+with the Program (or with a work based on the Program) on a volume of
+a storage or distribution medium does not bring the other work under
+the scope of this License.
+
+  3. You may copy and distribute the Program (or a work based on it,
+under Section 2) in object code or executable form under the terms of
+Sections 1 and 2 above provided that you also do one of the following:
+
+    a) Accompany it with the complete corresponding machine-readable
+    source code, which must be distributed under the terms of Sections
+    1 and 2 above on a medium customarily used for software interchange; or,
+
+    b) Accompany it with a written offer, valid for at least three
+    years, to give any third party, for a charge no more than your
+    cost of physically performing source distribution, a complete
+    machine-readable copy of the corresponding source code, to be
+    distributed under the terms of Sections 1 and 2 above on a medium
+    customarily used for software interchange; or,
+
+    c) Accompany it with the information you received as to the offer
+    to distribute corresponding source code.  (This alternative is
+    allowed only for noncommercial distribution and only if you
+    received the program in object code or executable form with such
+    an offer, in accord with Subsection b above.)
+
+The source code for a work means the preferred form of the work for
+making modifications to it.  For an executable work, complete source
+code means all the source code for all modules it contains, plus any
+associated interface definition files, plus the scripts used to
+control compilation and installation of the executable.  However, as a
+special exception, the source code distributed need not include
+anything that is normally distributed (in either source or binary
+form) with the major components (compiler, kernel, and so on) of the
+operating system on which the executable runs, unless that component
+itself accompanies the executable.
+
+If distribution of executable or object code is made by offering
+access to copy from a designated place, then offering equivalent
+access to copy the source code from the same place counts as
+distribution of the source code, even though third parties are not
+compelled to copy the source along with the object code.
+
+  4. You may not copy, modify, sublicense, or distribute the Program
+except as expressly provided under this License.  Any attempt
+otherwise to copy, modify, sublicense or distribute the Program is
+void, and will automatically terminate your rights under this License.
+However, parties who have received copies, or rights, from you under
+this License will not have their licenses terminated so long as such
+parties remain in full compliance.
+
+  5. You are not required to accept this License, since you have not
+signed it.  However, nothing else grants you permission to modify or
+distribute the Program or its derivative works.  These actions are
+prohibited by law if you do not accept this License.  Therefore, by
+modifying or distributing the Program (or any work based on the
+Program), you indicate your acceptance of this License to do so, and
+all its terms and conditions for copying, distributing or modifying
+the Program or works based on it.
+
+  6. Each time you redistribute the Program (or any work based on the
+Program), the recipient automatically receives a license from the
+original licensor to copy, distribute or modify the Program subject to
+these terms and conditions.  You may not impose any further
+restrictions on the recipients' exercise of the rights granted herein.
+You are not responsible for enforcing compliance by third parties to
+this License.
+
+  7. If, as a consequence of a court judgment or allegation of patent
+infringement or for any other reason (not limited to patent issues),
+conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License.  If you cannot
+distribute so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you
+may not distribute the Program at all.  For example, if a patent
+license would not permit royalty-free redistribution of the Program by
+all those who receive copies directly or indirectly through you, then
+the only way you could satisfy both it and this License would be to
+refrain entirely from distribution of the Program.
+
+If any portion of this section is held invalid or unenforceable under
+any particular circumstance, the balance of the section is intended to
+apply and the section as a whole is intended to apply in other
+circumstances.
+
+It is not the purpose of this section to induce you to infringe any
+patents or other property right claims or to contest validity of any
+such claims; this section has the sole purpose of protecting the
+integrity of the free software distribution system, which is
+implemented by public license practices.  Many people have made
+generous contributions to the wide range of software distributed
+through that system in reliance on consistent application of that
+system; it is up to the author/donor to decide if he or she is willing
+to distribute software through any other system and a licensee cannot
+impose that choice.
+
+This section is intended to make thoroughly clear what is believed to
+be a consequence of the rest of this License.
+
+  8. If the distribution and/or use of the Program is restricted in
+certain countries either by patents or by copyrighted interfaces, the
+original copyright holder who places the Program under this License
+may add an explicit geographical distribution limitation excluding
+those countries, so that distribution is permitted only in or among
+countries not thus excluded.  In such case, this License incorporates
+the limitation as if written in the body of this License.
+
+  9. The Free Software Foundation may publish revised and/or new versions
+of the General Public License from time to time.  Such new versions will
+be similar in spirit to the present version, but may differ in detail to
+address new problems or concerns.
+
+Each version is given a distinguishing version number.  If the Program
+specifies a version number of this License which applies to it and "any
+later version", you have the option of following the terms and conditions
+either of that version or of any later version published by the Free
+Software Foundation.  If the Program does not specify a version number of
+this License, you may choose any version ever published by the Free Software
+Foundation.
+
+  10. If you wish to incorporate parts of the Program into other free
+programs whose distribution conditions are different, write to the author
+to ask for permission.  For software which is copyrighted by the Free
+Software Foundation, write to the Free Software Foundation; we sometimes
+make exceptions for this.  Our decision will be guided by the two goals
+of preserving the free status of all derivatives of our free software and
+of promoting the sharing and reuse of software generally.
+
+                            NO WARRANTY
+
+  11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
+FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW.  EXCEPT WHEN
+OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
+PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
+OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.  THE ENTIRE RISK AS
+TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU.  SHOULD THE
+PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
+REPAIR OR CORRECTION.
+
+  12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
+REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
+INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
+OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
+TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
+YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
+PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGES.
+
+                     END OF TERMS AND CONDITIONS
+
+            How to Apply These Terms to Your New Programs
+
+  If you develop a new program, and you want it to be of the greatest
+possible use to the public, the best way to achieve this is to make it
+free software which everyone can redistribute and change under these terms.
+
+  To do so, attach the following notices to the program.  It is safest
+to attach them to the start of each source file to most effectively
+convey the exclusion of warranty; and each file should have at least
+the "copyright" line and a pointer to where the full notice is found.
+
+    <one line to give the program's name and a brief idea of what it does.>
+    Copyright (C) <year>  <name of author>
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License along
+    with this program; if not, see <https://www.gnu.org/licenses/>.
+
+Also add information on how to contact you by electronic and paper mail.
+
+If the program is interactive, make it output a short notice like this
+when it starts in an interactive mode:
+
+    Gnomovision version 69, Copyright (C) year name of author
+    Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
+    This is free software, and you are welcome to redistribute it
+    under certain conditions; type `show c' for details.
+
+The hypothetical commands `show w' and `show c' should show the appropriate
+parts of the General Public License.  Of course, the commands you use may
+be called something other than `show w' and `show c'; they could even be
+mouse-clicks or menu items--whatever suits your program.
+
+You should also get your employer (if you work as a programmer) or your
+school, if any, to sign a "copyright disclaimer" for the program, if
+necessary.  Here is a sample; alter the names:
+
+  Yoyodyne, Inc., hereby disclaims all copyright interest in the program
+  `Gnomovision' (which makes passes at compilers) written by James Hacker.
+
+  <signature of Moe Ghoul>, 1 April 1989
+  Moe Ghoul, President of Vice
+
+This General Public License does not permit incorporating your program into
+proprietary programs.  If your program is a subroutine library, you may
+consider it more useful to permit linking proprietary applications with the
+library.  If this is what you want to do, use the GNU Lesser General
+Public License instead of this License.
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/README	Sat Aug 30 17:09:56 2025 -0400
@@ -0,0 +1,3 @@
+this is a simple script for scraping videos off of a channel.
+it is primarily meant for usage with FinnOtaku's metadata archive,
+but it realistically can be used for other purposes as well.
--- a/channeldownloader.py	Fri Apr 14 23:53:37 2023 -0400
+++ b/channeldownloader.py	Sat Aug 30 17:09:56 2025 -0400
@@ -1,4 +1,22 @@
 #!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+# channeldownloader.py - scrapes youtube videos from a channel from
+# a variety of sources
+
+# Copyright (c) 2021-2025 Paper <paper@tflc.us>
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
 """
 Usage:
   channeldownloader.py <url>... (--database <file>)
@@ -12,218 +30,489 @@
   -h --help                    Show this screen
   -o --output <folder>         Output folder, relative to the current directory
                                [default: .]
-  -d --database <file>         YTPMV_Database compatible JSON file
+  -d --database <file>         yt-dlp style database of videos. Should contain
+                               an array of yt-dlp .info.json data. For example,
+                               FinnOtaku's YTPMV metadata archive.
 """
+
+# Standard library imports, plus docopt (third-party, required)
 from __future__ import print_function
 import docopt
-import internetarchive
-try:
-    import orjson as json
-except ImportError:
-    import json
 import os
 import re
 import time
 import urllib.request
-import requests  # need this for ONE (1) exception
-import yt_dlp as youtube_dl
+import os
+import ssl
 from urllib.error import HTTPError
-from yt_dlp.utils import sanitize_filename, DownloadError
 from pathlib import Path
-from requests.exceptions import ConnectTimeout
+
+# We can utilize special simdjson features if it is available
+simdjson = False

+try:
+    import simdjson as json
+    simdjson = True
+    print("INFO: using simdjson")
+except ImportError:
+    try:
+        import ujson as json
+        print("INFO: using ujson")
+    except ImportError:
+        try:
+            import orjson as json
+            print("INFO: using orjson")
+        except ImportError:
+            import json
+            print("INFO: using built-in json (slow!)")
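The fallback chain above binds whichever JSON library is installed to the name `json`, but these libraries do not all return the same type from `dumps`: `orjson.dumps` returns `bytes`, while the built-in `json.dumps` returns `str`. A minimal sketch of a wrapper that normalizes the result follows; the name `dumps_text` and its `dumps` parameter are hypothetical and not part of this changeset.

```python
import json  # stand-in for whichever backend the import chain selected


def dumps_text(obj, dumps=json.dumps):
    """Serialize obj and always return str, whatever the backend returns."""
    out = dumps(obj)
    # orjson-style backends hand back UTF-8 encoded bytes
    if isinstance(out, bytes):
        return out.decode("utf-8")
    return out
```

Passing the serializer as a parameter keeps the sketch independent of which library was actually imported.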

-class MyLogger(object):
-    def debug(self, msg):
-        pass
-
-    def warning(self, msg):
-        pass
+ytdlp_works = False

-    def error(self, msg):
-        print(" " + msg)
-        pass
+try:
+    import yt_dlp as youtube_dl
+    from yt_dlp.utils import sanitize_filename, DownloadError
+    ytdlp_works = True
+except ImportError:
+    print("failed to import yt-dlp!")
+    print("downloading from YouTube directly will not work.")

+ia_works = False

-def ytdl_hook(d) -> None:
-    if d["status"] == "finished":
-        print(" downloaded %s:    100%% " % (os.path.basename(d["filename"])))
-    if d["status"] == "downloading":
-        print(" downloading %s: %s\r" % (os.path.basename(d["filename"]),
-                                         d["_percent_str"]), end="")
-    if d["status"] == "error":
-        print("\n an error occurred downloading %s!"
-              % (os.path.basename(d["filename"])))
+try:
+    import internetarchive
+    from requests.exceptions import ConnectTimeout
+    ia_works = True
+except ImportError:
+    print("failed to import the Internet Archive's python library!")
+    print("downloading from IA will not work.")
+
+##############################################################################
+## DOWNLOADERS
+
+# All downloaders should be a function under this signature:
+#    dl(video: dict, basename: str, output: str) -> int
+# where:
+#    'video': the .info.json scraped from the YTPMV metadata archive.
+#    'basename': the basename output to write as.
+#    'output': the output directory.
+# yes, it's weird, but I don't care ;)
+#
+# Magic return values:
+#  0 -- all good, video is downloaded
+#  1 -- error downloading video; it may still be available if we try again
+#  2 -- video is proved totally unavailable here. give up
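As an illustration of the return-value convention described above, here is a minimal sketch of how a caller might walk a list of downloaders. The function `try_downloaders`, its `max_retries` parameter, and the dict-of-callables layout are assumptions made for this example; the changeset's own dispatch code is not shown in this part of the diff.

```python
def try_downloaders(video, basename, output, downloaders, max_retries=3):
    """Try each downloader in order; return the name of the one that
    succeeded, or None if every source failed.

    'downloaders' maps a display name to a function with the signature
    dl(video: dict, basename: str, output: str) -> int.
    """
    for name, dl in downloaders.items():
        for _ in range(max_retries):
            r = dl(video, basename, output)
            if r == 0:
                return name  # all good, video is downloaded
            if r == 2:
                break  # unavailable at this source; try the next one
            # r == 1: transient error, so retry the same source
    return None
```

Treating 1 as "retry here" and 2 as "move on" keeps the retry policy in one place instead of inside each downloader.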


-def load_split_files(path: str):
-    if not os.path.isdir(path):
-        yield json.load(open(path, "r", encoding="utf-8"))
-    for fi in os.listdir(path):
-        if re.search(r"vids[0-9\-]+?\.json", fi):
-            with open(path + "/" + fi, "r", encoding="utf-8") as infile:
-                print(fi)
-                yield json.load(infile)
+# Basic downloader template.
+#
+# This does a brute-force of all extensions within vexts and iexts
+# in an attempt to find a working video link.
+#
+# linktemplate is a template to be created using the video ID and
+# extension. For example:
+#    https://cdn.ytarchiver.com/%s.%s
+def basic_dl_template(video: dict, basename: str, output: str,
+        linktemplate: str, vexts: list, iexts: list) -> int:
+    # actual downloader
+    def basic_dl_impl(vid: str, ext: str) -> int:
+        url = (linktemplate % (vid, ext))
+        try:
+            with urllib.request.urlopen(url) as headers:
+                with open("%s.%s" % (basename, ext), "wb") as f:
+                    f.write(headers.read())
+            print(" downloaded %s.%s" % (basename, ext))
+            return 0
+        except TimeoutError:
+            return 1
+        except HTTPError:
+            return 2
+        except Exception as e:
+            print(" unknown error downloading video!")
+            print(e)
+            return 1

+    for exts in filter(None, [vexts, iexts]):  # skip empty extension lists
+        for ext in exts:
+            r = basic_dl_impl(video["id"], ext)
+            if r == 0:
+                break  # done!
+            elif r == 1:
+                # timeout; try again later?
+                return 1
+            elif r == 2:
+                continue
+        else:
+            # we did not break out of the loop
+            # which means all extensions were unavailable
+            return 2

-def reporthook(count: int, block_size: int, total_size: int) -> None:
-    global start_time
-    if count == 0:
-        start_time = time.time()
-        return
-    percent = int(count * block_size * 100 / total_size)
-    print(" downloading %d%%        \r" % (percent), end="")
+    # video was downloaded successfully
+    return 0
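The template above relies on Python's `for`/`else`: the `else` branch runs whenever the loop finishes without hitting `break`, and that includes the case where the iterable is empty. A small standalone sketch of the idiom follows; `first_available` and `is_available` are hypothetical names used only for illustration.

```python
def first_available(candidates, is_available):
    """Return the first candidate for which is_available() is true,
    or None if there is none."""
    for c in candidates:
        if is_available(c):
            break  # found one; the else branch is skipped
    else:
        # reached when the loop never hit 'break', which also
        # happens when 'candidates' is empty
        return None
    return c
```

An empty extension list therefore behaves the same as a list in which every extension failed, which matters when a source has no thumbnail extensions to try.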


-def write_metadata(i: dict, basename: str) -> None:
-    if not os.path.exists(basename + ".info.json"):
-        with open(basename + ".info.json", "w", encoding="utf-8") as jsonfile:
-            try:
-                jsonfile.write(json.dumps(i).decode("utf-8"))
-            except AttributeError:
-                jsonfile.write(json.dumps(i))
-            print(" saved %s" % os.path.basename(jsonfile.name))
-    if not os.path.exists(basename + ".description"):
-        with open(basename + ".description", "w",
-                  encoding="utf-8") as descfile:
-            descfile.write(i["description"])
-            print(" saved %s" % os.path.basename(descfile.name))
+# GhostArchive, basic...
+def ghostarchive_dl(video: dict, basename: str, output: str) -> int:
+    return basic_dl_template(video, basename, output,
+        "https://ghostvideo.b-cdn.net/chimurai/%s.%s",
+        ["mp4", "webm", "mkv"],
+        [] # none
+    )


-def wayback_machine_dl(video: dict, basename: str) -> int:
+# media.desirintoplaisir.net
+#
+# holds PRIMARILY popular videos (i.e. no niche internet microcelebrities)
+# or weeb shit, however it seems to be growing to other stuff.
+#
+# there isn't really a proper API; I've based the scraping off of the HTML
+# and the public source code.
+def desirintoplaisir_dl(video: dict, basename: str, output: str) -> int:
+    return basic_dl_template(video, basename, output,
+        "https://media.desirintoplaisir.net/content/%s.%s",
+        ["mp4", "webm", "mkv"],
+        ["webp"]
+    )
+
+
+# Internet Archive's Wayback Machine
+#
+# Internally, IA's javascript routines forward to the magic
+# URL used here.
+#
+# TODO: Download thumbnails through the CDX API:
+# https://github.com/TheTechRobo/youtubevideofinder/blob/master/lostmediafinder/finder.py
+# the CDX API is pretty slow though, so it should be used as a last resort.
+def wayback_dl(video: dict, basename: str, output: str) -> int:
     try:
-        url = ''.join(["https://web.archive.org/web/2oe_/http://wayback-fakeu",
-                       "rl.archive.org/yt/%s"])
-        headers = urllib.request.urlopen(url % video["id"])
-        contenttype = headers.getheader("Content-Type")
-        if contenttype == "video/webm":
-            ext = "webm"
-        elif contenttype == "video/mp4":
-            ext = "mp4"
-        else:
-            raise HTTPError(url=None, code=None, msg=None,
-                            hdrs=None, fp=None)
-        urllib.request.urlretrieve(url % video["id"], "%s.%s" % (basename, ext),
-                                   reporthook)
+        url = ("https://web.archive.org/web/2oe_/http://wayback-fakeurl.archiv"
+               "e.org/yt/%s" % video["id"])
+        with urllib.request.urlopen(url) as headers:
+            contenttype = headers.getheader("Content-Type")
+            if contenttype == "video/webm" or contenttype == "video/mp4":
+                ext = contenttype.split("/")[-1]
+            else:
+                raise HTTPError(url=None, code=None, msg=None,
+                                hdrs=None, fp=None)
+            with open("%s.%s" % (basename, ext), "wb") as f:
+                f.write(headers.read())
         print(" downloaded %s.%s" % (basename, ext))
         return 0
     except TimeoutError:
         return 1
     except HTTPError:
-        print(" video not available on the Wayback Machine!")
-        return 0
+        # don't keep trying
+        return 2
     except Exception as e:
-        print(" unknown error downloading video!\n")
+        print(" unknown error downloading video!")
         print(e)
-        return 0
-
-
-def ia_file_legit(path: str, vidid: str) -> bool:
-    return True if re.search(''.join([r"((?:.+?-)?", vidid, r"\.(?:mp4|jpg|web"
-                          r"p|mkv|webm|info\\.json|description|annotations.xml"
-                          "))"]),
-                         path) else False
+        return 1


-def internet_archive_dl(video: dict, basename: str, output: str) -> int:
-    if internetarchive.get_item("youtube-%s" % video["id"]).exists:
-        flist = [f.name for f in internetarchive.get_files("youtube-%s" % video["id"]) if ia_file_legit(f.name, video["id"])]
-        while True:
-            try:
-                internetarchive.download("youtube-%s" % video["id"],
-                                         files=flist, verbose=True,
-                                         destdir=output,
-                                         no_directory=True,
-                                         ignore_existing=True,
-                                         retries=9999)
-                break
-            except ConnectTimeout:
-                continue
-            except Exception as e:
-                print(e)
-                return 0
-        if flist[0][:len(video["id"])] == video["id"]:
-            for fname in flist:
-                if os.path.exists("%s/%s" % (output, fname)):
-                    os.replace("%s/%s" % (output, fname),
-                               "%s-%s" % (basename.rsplit("-", 1)[0],
-                                          fname))
-        return 1
+# Internet Archive (tubeup)
+def ia_dl(video: dict, basename: str, output: str) -> int:
+    def ia_file_legit(f: internetarchive.File, vidid: str) -> bool:
+        # FIXME:
+        #
+        # There are some items on IA that combine the old tubeup behavior
+        # (i.e., including the sanitized video name before the ID)
+        # and the new tubeup behavior (filename only contains the video ID)
+        # hence we will download the entire video twice.
+        #
+        # This isn't much of a problem anymore (and hasn't been for like 3
+        # years), since I contributed code to not upload something if there
+        # is already something there. However we should handle this case
+        # anyway.
+        #
+        # Additionally, there are some items that have duplicate video files
+        # (from when the owners changed the title). We should ideally only
+        # download unique files. IA seems to provide SHA1 hashes...
+        #
+        # We should also check whether the copy on IA is higher quality
+        # than a local copy... :)
+        if not re.search(r"((?:.+?-)?" + re.escape(vidid) + r"\.(?:mp4|jpg|"
+                         r"webp|mkv|webm|info\.json|description|annotations"
+                         r"\.xml))", f.name):
+            return False
+
+        # now, check the metadata
+        print(f)
+        return True
+
+
+    if not internetarchive.get_item("youtube-%s" % video["id"]).exists:
+        return 2
+
+    flist = [
+        f.name
+        for f in internetarchive.get_files("youtube-%s" % video["id"])
+        if ia_file_legit(f, video["id"])
+    ]
+    while True:
+        try:
+            internetarchive.download("youtube-%s" % video["id"], files=flist,
+                                     verbose=True, destdir=output,
+                                     no_directory=True, ignore_existing=True,
+                                     retries=9999)
+            break
+        except ConnectTimeout:
+            time.sleep(1)
+            continue
+        except Exception as e:
+            print(e)
+            return 1
+
+    # Newer versions of tubeup save only the video ID.
+    # Account for this by replacing it.
+    #
+    # paper/2025-08-30: fixed a bug where video IDs with hyphens
+    # would incorrectly truncate
+    for fname in flist:
+        # ignore any files whose names are not simply the ID followed by
+        # an extension (which may be compound, e.g. ".info.json")
+        if not fname.startswith(video["id"] + "."):
+            continue
+
+        # everything after the ID, including the leading dot
+        ext = fname[len(video["id"]):]
+        if os.path.exists("%s/%s" % (output, fname)):
+            os.replace("%s/%s" % (output, fname), basename + ext)
     return 0


-ytdl_opts = {
-    "retries": 100,
-    "nooverwrites": True,
-    "call_home": False,
-    "quiet": True,
-    "writeinfojson": True,
-    "writedescription": True,
-    "writethumbnail": True,
-    "writeannotations": True,
-    "writesubtitles": True,
-    "allsubtitles": True,
-    "addmetadata": True,
-    "continuedl": True,
-    "embedthumbnail": True,
-    "format": "bestvideo+bestaudio/best",
-    "restrictfilenames": True,
-    "no_warnings": True,
-    "progress_hooks": [ytdl_hook],
-    "logger": MyLogger(),
-    "ignoreerrors": False,
-}
+def ytdlp_dl(video: dict, basename: str, output: str) -> int:
+    # intentionally ignores all messages besides errors
+    class MyLogger(object):
+        def debug(self, msg):
+            pass
+
+        def warning(self, msg):
+            pass
+
+        def error(self, msg):
+            print(" " + msg)
+
+
+    def ytdl_hook(d) -> None:
+        if d["status"] == "finished":
+            print(" downloaded %s:    100%% " % (os.path.basename(d["filename"])))
+        if d["status"] == "downloading":
+            print(" downloading %s: %s\r" % (os.path.basename(d["filename"]),
+                                             d["_percent_str"]), end="")
+        if d["status"] == "error":
+            print("\n an error occurred downloading %s!"
+                  % (os.path.basename(d["filename"])))
+
+    ytdl_opts = {
+        "retries": 100,
+        "nooverwrites": True,
+        "call_home": False,
+        "quiet": True,
+        "writeinfojson": True,
+        "writedescription": True,
+        "writethumbnail": True,
+        "writeannotations": True,
+        "writesubtitles": True,
+        "allsubtitles": True,
+        "addmetadata": True,
+        "continuedl": True,
+        "embedthumbnail": True,
+        "format": "bestvideo+bestaudio/best",
+        "restrictfilenames": True,
+        "no_warnings": True,
+        "progress_hooks": [ytdl_hook],
+        "logger": MyLogger(),
+        "ignoreerrors": False,
+
+        # output template
+        "outtmpl": output + "/%(title)s-%(id)s.%(ext)s",
+    }
+
+    with youtube_dl.YoutubeDL(ytdl_opts) as ytdl:
+        try:
+            ytdl.extract_info("https://youtube.com/watch?v=%s" % video["id"])
+            return 0
+        except DownloadError:
+            return 2
+        except Exception as e:
+            print(" unknown error downloading video!\n")
+            print(e)
+
+    return 1
+
+
+# TODO: There are multiple other youtube archival websites available.
+# Most notable is https://findyoutubevideo.thetechrobo.ca .
+# This combines a lot of sparse youtube archival services, and has
+# a convenient API we can use. Nice!
+#
+# There is also the "Distributed YouTube Archive" which is totally
+# useless because there's no way to automate it...
+
+##############################################################################


 def main():
+    # generator; creates a list of files, and yields the parsed form of
+    # each. note that the parser is not necessarily the stdlib json
+    # module (it may be simdjson)
+    def load_split_files(path: str):
+        list_files = []
+
+        # build the path list
+        if not os.path.isdir(path):
+            list_files.append(path)
+        else:
+            for fi in os.listdir(path):
+                if re.search(r"vids[0-9\-]+?\.json", fi):
+                    list_files.append(path + "/" + fi)
+
+        # now open each as a json
+        for fi in list_files:
+            print(fi)
+            with open(fi, "r", encoding="utf-8") as infile:
+                if simdjson:
+                    # Using this is a lot faster in SIMDJSON, since instead
+                    # of converting all of the JSON key/value pairs into
+                    # native Python objects, they stay in an internal state.
+                    #
+                    # This means we only get the stuff we absolutely need,
+                    # which is the uploader ID, and copy everything else
+                    # if the ID is one we are looking for.
+                    parser = json.Parser()
+                    yield parser.parse(infile.read())
+                    del parser
+                else:
+                    yield json.load(infile)
+
+
+    def write_metadata(i: dict, basename: str) -> None:
+        # ehhh
+        if not os.path.exists(basename + ".info.json"):
+            with open(basename + ".info.json", "w", encoding="utf-8") as jsonfile:
+                try:
+                    # orjson outputs bytes
+                    jsonfile.write(json.dumps(i).decode("utf-8"))
+                except AttributeError:
+                    # everything else outputs a string
+                    jsonfile.write(json.dumps(i))
+                print(" saved %s" % os.path.basename(jsonfile.name))
+        if not os.path.exists(basename + ".description"):
+            with open(basename + ".description", "w",
+                      encoding="utf-8") as descfile:
+                descfile.write(i["description"])
+                print(" saved %s" % os.path.basename(descfile.name))
+
     args = docopt.docopt(__doc__)

     if not os.path.exists(args["--output"]):
         os.mkdir(args["--output"])

-    for f in load_split_files(args["--database"]):
-        for i in f:
-            uploader = i["uploader_id"] if "uploader_id" in i else None
-            for url in args["<url>"]:
-                channel = url.split("/")[-1]
+    channels = dict()
+
+    for url in args["<url>"]:
+        chn = url.split("/")[-1]
+        channels[chn] = {"output": "%s/%s" % (args["--output"], chn)}
+
+    for channel in channels.values():
+        if not os.path.exists(channel["output"]):
+            os.mkdir(channel["output"])
+
+    # find videos in the database.
+    #
+    # despite how it may seem, this is actually really fast, and fairly
+    # memory efficient too (but really only if we're using simdjson...)
+    videos = [
+        i if not simdjson else i.as_dict()
+        for f in load_split_files(args["--database"])
+        for i in (f["videos"] if "videos" in f else f)
+        if "uploader_id" in i and i["uploader_id"] in channels
+    ]
+
+    while True:
+        if len(videos) == 0:
+            break
+
+        # iterate over a snapshot, since we remove from `videos` as we go
+        videos_copy = list(videos)
+
+        for i in videos_copy:
+            channel = channels[i["uploader_id"]]
+
+            # precalculated for speed
+            output = channel["output"]

-                output = "%s/%s" % (args["--output"], channel)
-                if not os.path.exists(output):
-                    os.mkdir(output)
-                ytdl_opts["outtmpl"] = output + "/%(title)s-%(id)s.%(ext)s"
+            print("%s:" % i["id"])
+            basename = "%s/%s-%s" % (output, sanitize_filename(i["title"],
+                                     restricted=True), i["id"])
+            files = [y
+                     for p in ["mkv", "mp4", "webm"]
+                     for y in Path(output).glob(("*-%s." + p) % i["id"])]
+            if files:
+                print(" video already downloaded!")
+                videos.remove(i)
+                write_metadata(i, basename)
+                continue
+
+            # high level "download" function.
+            def dl(video: dict, basename: str, output: str):
+                dls = []
+
+                if ytdlp_works:
+                    dls.append({
+                        "func": ytdlp_dl,
+                        "name": "using yt-dlp",
+                    })

-                if uploader == channel:
-                    print(uploader, channel)
-                    print("%s:" % i["id"])
-                    basename = "%s/%s-%s" % (output, sanitize_filename(i["title"],
-                                             restricted=True), i["id"])
-                    files = [y for p in ["mkv", "mp4", "webm"] for y in list(Path(output).glob(("*-%s." + p) % i["id"]))]
-                    if files:
-                        print(" video already downloaded!")
-                        write_metadata(i, basename)
+                if ia_works:
+                    dls.append({
+                        "func": ia_dl,
+                        "name": "from the Internet Archive",
+                    })
+
+                dls.append({
+                    "func": desirintoplaisir_dl,
+                    "name": "from LMIJLM/DJ Plaisir's archive",
+                })
+                dls.append({
+                    "func": ghostarchive_dl,
+                    "name": "from GhostArchive"
+                })
+                dls.append({
+                    "func": wayback_dl,
+                    "name": "from the Wayback Machine"
+                })
+
+                for d in dls:
+                    print(" attempting to download %s" % d["name"])
+                    r = d["func"](video, basename, output)
+                    if r == 0:
+                        # all good, video's downloaded
+                        return 0
+                    elif r == 2:
+                        # video is unavailable here
+                        print(" oops, video is not available there...")
                         continue
-                    # this code is *really* ugly... todo a rewrite?
-                    with youtube_dl.YoutubeDL(ytdl_opts) as ytdl:
-                        try:
-                            ytdl.extract_info("https://youtube.com/watch?v=%s"
-                                              % i["id"])
-                            continue
-                        except DownloadError:
-                            print(" video is not available! attempting to find In"
-                                  "ternet Archive pages of it...")
-                        except Exception as e:
-                            print(" unknown error downloading video!\n")
-                            print(e)
-                    if internet_archive_dl(i, basename, output):  # if we can't download from IA
-                        continue
-                    print(" video does not have a Internet Archive page! attem"
-                          "pting to download from the Wayback Machine...")
-                    while True:
-                        if wayback_machine_dl(i, basename) == 0:  # success
-                            break
-                        time.sleep(5)
-                        continue
-                    write_metadata(i, basename)
+                    elif r == 1:
+                        # error while downloading; likely temporary.
+                        # TODO we should save which downloader the video
+                        # was on, so we can continue back at it later.
+                        return 1
+                # video is unavailable everywhere
+                return 2
+
+            r = dl(i, basename, output)
+            if r == 1:
+                continue
+
+            # video is downloaded, or it's totally unavailable, so
+            # remove it from being checked again.
+            videos.remove(i)
+            # ... and then dump the metadata, if there isn't any on disk.
+            write_metadata(i, basename)
+
+            if r == 0:
+                # video is downloaded
+                continue
+
+            # video is unavailable; only the metadata was written above.
+            print(" video is unavailable everywhere; dumping out metadata only")


 if __name__ == "__main__":