Indexing
This text is WIP!
Please note that the steps described here must be carried out in the specified order.
- Recurse into Files and Folders
- Optional: Recurse into Archives
- General File Format Recognition and Characterization
- Specific File Format Characterization
- Match against the National Software Reference Library (NSRL)
- Build or enrich the SOLR Index
Make sure the image files or folders to process are accessible under their names specified in the session table.
The first step is to make Indexer recurse into the file hierarchy to ingest and fill the file table with basic technical metadata, comparable to a directory listing.
Log in as root and start or attach to a tmux session, so you can safely detach from long-running jobs.
Enter the directory where recurse.php lives.
It may be a good idea to capture all subsequent console output in a log file, so you can check for errors or success more easily. To do that, start a log file with
# script recurse.php.log.txt
or whatever filename you prefer.
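Once a run has finished, the captured log can be scanned for suspicious lines. The following is a minimal sketch, not part of Indexer: the helper names are hypothetical, and the keywords "error", "warning" and "fatal" are an assumption about what problems look like in the output, since the actual messages are not documented here.

```shell
#!/bin/sh
# Hypothetical helpers for checking a log captured with script(1).
# The keyword list is an assumption, not a documented recurse.php output format.

# Print suspicious lines, prefixed with their line numbers.
check_log() {
    # -n: line numbers, -i: ignore case, -E: extended regular expression
    grep -n -i -E 'error|warning|fatal' "$1"
}

# Print the number of suspicious lines (0 if there are none).
count_log_problems() {
    grep -c -i -E 'error|warning|fatal' "$1" || true
}

# Usage: count_log_problems recurse.php.log.txt
```

A count of 0 only means that none of the assumed keywords occurred; it is still worth looking at the end of the log to confirm that the run completed.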
Initial ingest is done with the recurse.php script. The general syntax is:
Usage: php recurse.php <target> [update]

Recursively process <target> and ingest (new) files and folders.
Modifies the "file" table.

<target>  A group of volumes like "fd" or "hd", a single volume/session
          like "2018" or "all" (see "session" table for valid groups or
          sessions). By default skips every session which is already
          present in the "file" table with at least one file or folder.
update    Don't skip sessions already present and look for new files.
          Useful if a previous run was interrupted.

See also: clearsession.php
For example, to process all your optical disks you may run:
# php recurse.php od
To process a single volume, specify its sessionid from the session table:
# php recurse.php 2018
(See your session table for valid groups and session IDs.)
While running, recurse.php displays status information like this:
[...]
dir: /mnt/usr/kde/3.5/share/apps/kgpg/icons/crystalsvg/16x16
storeDir( 2018, usr/kde/3.5/share/apps/kgpg/icons/crystalsvg, usr/kde/3.5/share/apps/kgpg/icons/crystalsvg/16x16, 16x16, 3908524, 8 )
recurse( .'/'.usr/kde/3.5/share/apps/kgpg/icons/crystalsvg/16x16, 3908525, 9 )
#442049 2018: /mnt // actions // usr/kde/3.5/share/apps/kgpg/icons/crystalsvg/16x16/actions
----
dir: /mnt/usr/kde/3.5/share/apps/kgpg/icons/crystalsvg/16x16/actions
storeDir( 2018, usr/kde/3.5/share/apps/kgpg/icons/crystalsvg/16x16, usr/kde/3.5/share/apps/kgpg/icons/crystalsvg/16x16/actions, actions, 3908525, 9 )
recurse( .'/'.usr/kde/3.5/share/apps/kgpg/icons/crystalsvg/16x16/actions, 3908526, 10 )
#442050 2018: /mnt // kgpg_key1.png // usr/kde/3.5/share/apps/kgpg/icons/crystalsvg/16x16/actions/kgpg_key1.png
----
[...]
You can also watch the file table grow.
While processing, copies of the ingested files are placed in the localpath cache folder, so further processing can take place without datapath being mounted or accessible.
After recurse.php has finished, you should verify that the number of new records and new sessions in the file table matches the expected results.
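One way to obtain an expected number is to count the entries on the mounted volume itself. The following is a minimal sketch with a hypothetical helper name. It assumes the volume is mounted at /mnt (as in the sample status output) and that one record is stored per file and per folder; symlinks and special files are not counted here, and Indexer may treat them differently.

```shell
#!/bin/sh
# Hypothetical helper: count files and folders below a directory,
# excluding the directory itself. Requires find with -mindepth and -print0
# (GNU or BSD find).
count_entries() {
    # Emit one NUL byte per entry and count the NUL bytes, so that
    # file names containing newlines are still counted once.
    find "$1" -mindepth 1 \( -type f -o -type d \) -print0 \
        | tr -cd '\0' | wc -c | tr -d ' '
}

# Usage: count_entries /mnt
```

If the result differs from the number of new records, the log output is the first place to look for entries that were skipped.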
There will also be some new performance data in the throughput view. Depending on your setup, recurse.php will ingest about 100 files per second.
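The throughput figures on this page can be used to estimate how long a job will run, which helps when deciding whether to detach from the tmux session. The following is a minimal sketch; the helper name and the volume size of 1,000,000 files are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: estimate the run time in seconds for a number of
# files at a given rate in files per second. Uses integer arithmetic and
# rounds up to the next full second.
estimate_seconds() {
    files="$1"
    rate="$2"
    echo $(( (files + rate - 1) / rate ))
}

# Usage: estimate_seconds 1000000 100
```

At about 100 files per second, a volume of 1,000,000 files would take roughly 10,000 seconds, or a little under three hours.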
Optional: Recurse into Archives: tbd, skipped for now.
For general file format recognition and characterization, run the following scripts:
- php index.php libmagic <target>
  - General technical metadata and MIME info, based on the file command
  - Modifies the info_libmagic table.
  - Runs at about 170 files per second (at DLA Marbach; your mileage may vary)
- php index.php gvfsinfo <target>
  - General technical metadata and MIME info, based on the GNOME gvfs-info command
  - Modifies the info_gvfs_info table.
  - Runs at about 100 files per second.
- php index.php tika <target>
  - General MIME info and full text extraction, based on Apache Tika
  - Modifies the info_tika table.
  - Runs at about 170 files per second.
- php index.php siegfried <target>
For specific file format characterization, full text extraction and thumbnail generation, run the following scripts:
- php index.php imagemagick <target>
  - Technical metadata and thumbnails for graphical file formats, based on ImageMagick
  - Modifies the info_imagick table.
  - Runs at about 9 files per second.
- php index.php avconv <target>
  - Technical metadata for audio and video file formats, based on ??
  - Modifies the info_avconv table.
  - Runs at about 16 files per second.
- php index.php antiword <target>
  - Self-descriptive metadata and full text extraction for old MS-Word files, based on ??
  - Modifies the info_antiword table.
  - Runs at about 80 files per second.
A local copy of the NSRL must be set up as described in …
To match all files against the NSRL, run the following scripts:
- php index.php md5 <target>
  - Adds MD5 checksums to the file table.
  - Runs at about 220 files per second.
- php index.php nsrl <target>
  - Looks up the NSRL ProductCode for files whose MD5 checksums match
  - Modifies the info_nsrl table.
  - Runs at about … files per second.
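The matching idea can be illustrated outside of Indexer. The following is a minimal sketch, not a description of how index.php nsrl works: the helper names are hypothetical, and the list of known hashes is assumed to be a plain text file with one hex MD5 checksum per line, a stand-in for the local NSRL copy rather than its real format. It relies on md5sum from GNU coreutils.

```shell
#!/bin/sh
# Hypothetical helpers illustrating an MD5 lookup against a list of known hashes.

# Print the MD5 checksum of a file as a lowercase hex digest.
file_md5() {
    # Reading from stdin avoids md5sum's escaping of unusual file names.
    md5sum < "$1" | cut -d ' ' -f 1
}

# Succeed if the checksum of file "$1" appears in hash list "$2".
is_known_file() {
    hash=$(file_md5 "$1")
    # -q: quiet, -x: match whole lines, -F: fixed string,
    # -i: ignore case, in case the list stores hashes in uppercase
    grep -q -x -F -i "$hash" "$2"
}

# Usage: is_known_file somefile known-md5.txt && echo "known"
```

A linear search with grep is only practical for small lists; for the full NSRL, an indexed database lookup is the more plausible approach.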
For faster search, all relevant data is transferred into a SOLR index. To do this, run the following script:
php solr.php <target>
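Because the steps must be carried out in the specified order, it can be convenient to wrap them in a small script. The following is a minimal sketch with hypothetical function names, using only the commands listed on this page. It assumes that the scripts exit with a non-zero status on failure, which is not confirmed here, and that the order within each group follows the listings above. The optional archive step is left out.

```shell
#!/bin/sh
# Hypothetical wrapper: run all indexing steps for one target in order.
# With DRY_RUN=1 the commands are only printed, not executed.

run_step() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@" || return 1
    fi
}

index_target() {
    target="$1"
    # Step 1: ingest files and folders into the "file" table.
    run_step php recurse.php "$target" || return 1
    # Steps 3 to 5: general and specific characterization, then NSRL matching.
    for module in libmagic gvfsinfo tika siegfried \
                  imagemagick avconv antiword md5 nsrl; do
        run_step php index.php "$module" "$target" || return 1
    done
    # Step 6: build or enrich the SOLR index.
    run_step php solr.php "$target"
}

# Usage, from the directory containing the scripts:
#   DRY_RUN=1 index_target 2018
```

Running with DRY_RUN=1 first shows the exact command sequence before anything is written to the database.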
Sometimes you may want to clear a volume's data from the database and from the SOLR index, for example to start over with a better image file or because the volume turned out to be irrelevant.
To do this, run the following script:
php clearsession.php <session>
Please visit our homepage DLA Marbach