Build and configure Heritrix to use ExtractorYoutubeDL

h/t @ldko

Install Maven and a Java version that are compatible with each other, with your system, and with Heritrix (for the versions Heritrix is built against, see https://github.com/internetarchive/heritrix3/blob/master/.github/workflows/maven.yml):

sudo dnf install maven-openjdk17
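To confirm that the Java you installed is the one actually on your PATH, a small check like the following can help. This is only a sketch: the helper names java_major and java_at_least are made up for illustration, and the threshold of 17 is an assumption taken from the package installed above.

```shell
# Print the major version from a Java version string.
# Handles both the legacy "1.8.0_392" scheme and the modern "17.0.9" scheme.
java_major() {
  case "$1" in
    1.*) v=${1#1.}; printf '%s\n' "${v%%.*}" ;;
    *)   printf '%s\n' "${1%%[.+-]*}" ;;
  esac
}

# Succeed when the version string ($1) is at least the required major version ($2).
java_at_least() {
  [ "$(java_major "$1")" -ge "$2" ]
}

# Example use (commented out so that sourcing this block has no side effects):
# ver=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
# java_at_least "$ver" 17 && echo "Java $ver looks new enough"
# mvn -version
```

mvn -version also reports which Java it is running on, which is worth comparing against the output of java -version.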

Build the latest Heritrix. In this example, I put Heritrix distributions in /usr/local/ and keep a symlink named h3 that points to the one I currently use, i.e. /usr/local/h3 is a symlink pointing to /usr/local/heritrix-3.4.0-SNAPSHOT-20190523.

cd /tmp
git clone https://github.com/internetarchive/heritrix3.git
cd heritrix3
mvn package
cd dist/target/
tar xvzf heritrix-3.4.0-SNAPSHOT-dist.tar.gz
# Name the snapshot by the current date and put it in /usr/local next to any previous Heritrix installation
sudo cp -r heritrix-3.4.0-SNAPSHOT /usr/local/heritrix-3.4.0-SNAPSHOT-20190523
cd /usr/local/heritrix-3.4.0-SNAPSHOT-20190523
# Change all of the heritrix files to be owned by my user (replace "myusername" with your user name on the system)
sudo chown -R myusername ../heritrix-3.4.0-SNAPSHOT-20190523/
# Assuming the symlink /usr/local/h3 points to the Heritrix instance I used most recently, copy its startup script and password file to the new instance.
# If you don't have these files to copy, see below for what to put into the files when creating them
cp ../h3/heritrix-start.sh ../h3/heritrix_pass.txt .
# Heritrix creates a jobs/ directory that is initially empty. You can use it as is, but I delete it and put a symlink in its place that points to the directory where I keep all of my Heritrix job configurations
ls jobs/   # just verifying it is empty
rm -r jobs
ln -s /data01/jobs/ jobs  # replace /data01/jobs with an existing directory where you want Heritrix to create new jobs
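The date-stamped directory name used above can be generated rather than typed by hand. A minimal sketch, assuming the same /usr/local layout as this walkthrough; the function name heritrix_install_dir is hypothetical:

```shell
# Build the install path for a Heritrix snapshot, stamped with a date.
# $1 = version label
# $2 = date stamp (defaults to today in YYYYMMDD form)
# $3 = parent directory (defaults to /usr/local, as in this walkthrough)
heritrix_install_dir() {
  version=$1
  stamp=${2:-$(date +%Y%m%d)}
  parent=${3:-/usr/local}
  printf '%s/heritrix-%s-%s\n' "$parent" "$version" "$stamp"
}

# Example use (commented out so that sourcing this block has no side effects):
# dest=$(heritrix_install_dir 3.4.0-SNAPSHOT)
# sudo cp -r heritrix-3.4.0-SNAPSHOT "$dest"
```

Passing the stamp explicitly reproduces the directory name used in the commands above.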

Before proceeding, stop any running instance of Heritrix through the UI if you are still running another version of Heritrix. Then update your h3 symlink to use the new Heritrix.

cd /usr/local
sudo rm h3
sudo ln -s heritrix-3.4.0-SNAPSHOT-20190523/ h3
# If you want to use ExtractorYoutubeDL or something else from contrib, copy the contrib jars into the lib directory of the new Heritrix
cd /tmp/heritrix3/contrib/target
tar xvzf heritrix-contrib-3.4.0-SNAPSHOT-dist.tar.gz
cp heritrix-contrib-3.4.0-SNAPSHOT/lib/heritrix-contrib-3.4.0-SNAPSHOT.jar heritrix-contrib-3.4.0-SNAPSHOT/lib/gson-2.8.9.jar /usr/local/h3/lib
cd /usr/local/h3
# Hopefully Heritrix starts for you now (see info about what is in the heritrix-start.sh and password files below)
./heritrix-start.sh
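The rm followed by ln -s above can also be done with a single ln call. The sketch below wraps that in a hypothetical helper, switch_link, which additionally refuses to touch anything that is not already a symlink; treat it as one possible approach rather than the required one.

```shell
# Point a symlink ($2) at a new target ($1), replacing an existing link
# with a single ln invocation.
# -f replaces the existing link; -n stops ln from following the old link
# into the directory it points to (which would create the new link inside it).
switch_link() {
  target=$1
  link=$2
  if [ -e "$link" ] && [ ! -L "$link" ]; then
    echo "refusing to replace $link: it exists and is not a symlink" >&2
    return 1
  fi
  ln -sfn "$target" "$link"
}

# Example use (commented out so that sourcing this block has no side effects):
# cd /usr/local && sudo ln -sfn heritrix-3.4.0-SNAPSHOT-20190523 h3
# readlink h3   # shows where the link points now
```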

In /usr/local/h3/heritrix-start.sh you can set your environment variables. The command that starts Heritrix takes the hostname to bind to (the hostname of the server running Heritrix, which is the one I use when connecting to the UI in the browser), the path to the file with the password, and the port Heritrix will listen on.

# contents of /usr/local/h3/heritrix-start.sh
export JAVA_OPTS="-Xmx5120m"
export HERITRIX_HOME=/usr/local/h3
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk
$HERITRIX_HOME/bin/heritrix -b heritrix.server.hostname.com -a @$HERITRIX_HOME/heritrix_pass.txt -p 9443

In /usr/local/h3/heritrix_pass.txt there is one line with the username and password, separated by a colon:

username:passwordhere
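Since this file holds the UI credentials in plain text, you may want it readable only by the user that runs Heritrix. This is an optional precaution, not something Heritrix requires as far as I know. A sketch with a hypothetical helper, write_pass_file:

```shell
# Write a "username:password" credentials file readable only by its owner.
# $1 = file path, $2 = username, $3 = password
write_pass_file() {
  file=$1
  user=$2
  pass=$3
  (
    umask 077                                  # a newly created file gets mode 600
    printf '%s:%s\n' "$user" "$pass" > "$file"
  )
  chmod 600 "$file"                            # also tighten a file that already existed
}

# Example use (commented out so that sourcing this block has no side effects):
# write_pass_file /usr/local/h3/heritrix_pass.txt username passwordhere
```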

If you want to use ExtractorYoutubeDL, install yt-dlp (https://github.com/yt-dlp/yt-dlp). Internet Archive now uses yt-dlp, see commit.

sudo pip install yt-dlp
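As far as I understand, Heritrix runs yt-dlp as an external command, so it needs to be on the PATH of the user that starts Heritrix. A quick way to check, using a hypothetical helper named have_cmd:

```shell
# Return success if the named command can be found on the current PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Example use (commented out so that sourcing this block has no side effects):
# if have_cmd yt-dlp; then
#   yt-dlp --version
# else
#   echo "yt-dlp not found on PATH" >&2
# fi
```

Run the check as the same user that runs heritrix-start.sh, because a pip install under another account may put yt-dlp somewhere that user's PATH does not include.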

Configure Heritrix to use ExtractorYoutubeDL on only certain pages (or leave out the shouldProcessRule block if you want it to always run). Add something like the following to your Heritrix config, after your other extractors:

<bean id="extractorYoutubedl" class="org.archive.modules.extractor.ExtractorYoutubeDL">
   <property name="shouldProcessRule">
     <bean class="org.archive.modules.deciderules.DecideRuleSequence">
       <property name="rules">
         <list>
           <bean class="org.archive.modules.deciderules.RejectDecideRule">
           </bean>
           <bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
             <property name="decision" value="ACCEPT"/>
             <property name="regexList">
               <list>
                <value>^https?://([^/]*\.)?host-to-check-for-embedded-videos\.org.*$</value>
                <value>^https?://www\.otherexamplesite\.org.*$</value>
               </list>
             </property>
           </bean>
         </list>
       </property>
     </bean>
   </property>
</bean>
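In the rule sequence above, the RejectDecideRule makes rejection the default and the regex rule then accepts only URLs matching one of the listed patterns. Before starting a crawl, it can be handy to try a few URLs against the patterns. The sketch below uses grep -E as a stand-in for Java's regex engine, which is an assumption that holds for simple patterns like these two but not for every Java regex. The function name url_in_scope is made up for illustration.

```shell
# Succeed if the URL ($1) matches either pattern from the regexList above.
url_in_scope() {
  printf '%s\n' "$1" | grep -Eq \
    -e '^https?://([^/]*\.)?host-to-check-for-embedded-videos\.org.*$' \
    -e '^https?://www\.otherexamplesite\.org.*$'
}

# Example use (commented out so that sourcing this block has no side effects):
# for u in "https://media.host-to-check-for-embedded-videos.org/watch/1" \
#          "https://otherexamplesite.org/"; do
#   if url_in_scope "$u"; then echo "ACCEPT $u"; else echo "REJECT $u"; fi
# done
```

Note that the second pattern only matches the www host, so the bare otherexamplesite.org URL in the example is rejected.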

Add the extractorYoutubedl bean defined above to the list of processors for the FetchChain:

<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
  <property name="processors">
   <list>
      ...other processors...
    <!-- ...try for videos on target hosts... -->
    <ref bean="extractorYoutubedl"/>
   </list>
  </property>
</bean>
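It is easy to define the bean and forget to reference it in the chain, or the other way around. A rough check, assuming the job config is the usual crawler-beans.cxml and that you kept the bean id used above; ytdl_wired is a hypothetical helper name:

```shell
# Succeed only if the job config ($1) both defines the extractorYoutubedl
# bean and references it somewhere (for example in the FetchChain list).
ytdl_wired() {
  cfg=$1
  grep -q 'id="extractorYoutubedl"' "$cfg" &&
    grep -q '<ref bean="extractorYoutubedl"' "$cfg"
}

# Example use (commented out; "myjob" is a placeholder job name):
# ytdl_wired /data01/jobs/myjob/crawler-beans.cxml && echo "extractor is wired in"
```

This only looks for the two strings. It does not validate the XML or confirm that the reference sits in the right chain.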

Notes:

URLs with videos identified by ExtractorYoutubeDL will be logged to your dir_for_job/latest/logs/extractorYoutubedl.log.
As far as I know, there is no publicly available viewer that can, by default, replay the videos you download with Heritrix and ExtractorYoutubeDL.
If you aren't careful about crawling slowly enough, you may trigger bad responses from video hosting sites for making too many requests.
Running ExtractorYoutubeDL for every URL uses a lot of system resources, so keep this in mind when configuring your maxToeThreads. You may need to reduce the number of toe threads to leave enough resources for yt-dlp.
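To keep an eye on how much the extractor is finding during a crawl, you can count the lines in the log mentioned in the first note. This sketch deliberately assumes nothing about the layout of a log line, so the count is only a rough indicator of activity. The function name ytdl_log_lines is hypothetical.

```shell
# Print the number of non-empty lines in the given log file ($1),
# or 0 if the file does not exist yet.
ytdl_log_lines() {
  log=$1
  if [ -f "$log" ]; then
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
    grep -c . "$log" || true
  else
    echo 0
  fi
}

# Example use (commented out; replace dir_for_job with your job directory):
# ytdl_log_lines dir_for_job/latest/logs/extractorYoutubedl.log
```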
