Indexing

Heinz Werner Kramski edited this page Aug 16, 2019 · 18 revisions

This text is a work in progress.

Please note that the steps described here must be carried out in the specified order.

  1. Recurse into Files and Folders
  2. Optional: Recurse into Archives
  3. General File Format Recognition and Characterization
  4. Specific File Format Characterization
  5. Match against the National Software Reference Library (NSRL)
  6. Build or enrich the SOLR Index

Recurse into Files and Folders

Prerequisites

Make sure the image files or folders to be processed are accessible under the names specified in the session table.

Ingest

The first step is to have Indexer recurse into the file hierarchy and fill the file table with basic technical metadata, comparable to a directory listing.

Log in as root and start or attach to a tmux session, so you can safely detach from long-running jobs.

Enter the directory where recurse.php lives.

It may be a good idea to capture all subsequent console output in a log file, so you can check for errors or success more easily. To do that, start a log file with

# script recurse.php.log.txt

or whatever filename you prefer.

Initial ingest is done with the recurse.php script. The general syntax is:

Usage: php recurse.php <target> [update]
Recursively process <target> and ingest (new) files and folders. 
Modifies the "file" table.
  <target>              A group of volumes like "fd" or "hd", a single volume/session like "2018" or "all"
                          (see "session" table for valid groups or sessions).
                          By default skips every session which is already present in the "file" table with at
                          least one file or folder.
  update                Don't skip sessions already present and look for new files.
                          Useful if a previous run was interrupted.
See also: clearsession.php

For example, to process all your optical disks you may run:

# php recurse.php od

To process a single volume, specify its sessionid from the session table:

# php recurse.php 2018

(See your session table for valid groups and session IDs.)

While running, recurse.php displays status information like this:

[...]
dir: /mnt/usr/kde/3.5/share/apps/kgpg/icons/crystalsvg/16x16
storeDir( 2018, usr/kde/3.5/share/apps/kgpg/icons/crystalsvg, usr/kde/3.5/share/apps/kgpg/icons/crystalsvg/16x16, 16x16, 3908524, 8 )
recurse( .'/'.usr/kde/3.5/share/apps/kgpg/icons/crystalsvg/16x16, 3908525, 9 )
#442049 2018: /mnt // actions // usr/kde/3.5/share/apps/kgpg/icons/crystalsvg/16x16/actions ----
dir: /mnt/usr/kde/3.5/share/apps/kgpg/icons/crystalsvg/16x16/actions
storeDir( 2018, usr/kde/3.5/share/apps/kgpg/icons/crystalsvg/16x16, usr/kde/3.5/share/apps/kgpg/icons/crystalsvg/16x16/actions, actions, 3908525, 9 )
recurse( .'/'.usr/kde/3.5/share/apps/kgpg/icons/crystalsvg/16x16/actions, 3908526, 10 )
#442050 2018: /mnt // kgpg_key1.png // usr/kde/3.5/share/apps/kgpg/icons/crystalsvg/16x16/actions/kgpg_key1.png ----
[...]
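If the console output was captured with script as suggested above, the status lines can be tallied per session afterwards. The following is a minimal sketch, assuming that every ingested file or folder produces exactly one line starting with "#<counter> <session>:" as in the sample output; the function name count_ingested is made up for illustration.

```shell
# Sketch: tally ingested entries per session from a captured console log.
# Assumption: every ingested file or folder yields exactly one status line
# shaped like "#442050 2018: /mnt // name // path ----" (as in the sample).
count_ingested() {
    # $1: path to the log file (e.g. the one written by script)
    grep -E '^#[0-9]+ [^ ]+: ' "$1" |
        awk '{ sub(/:$/, "", $2); n[$2]++ } END { for (s in n) print s, n[s] }' |
        sort
}
```

Calling count_ingested recurse.php.log.txt prints one line per session, consisting of the session ID and the number of status lines found for it.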

You can also watch the file table grow.

While processing, copies of the files ingested are placed in the localpath cache folder, so further processing can take place without datapath being mounted/accessible.

After recurse.php has finished, you should verify that the number of new records and new sessions in the file table matches the expected results.

There will also be some new performance data in the throughput view. Depending on your setup, recurse.php will ingest about 100 files per second.
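Throughput figures like this one can be turned into a rough run-time estimate before starting a long job. This is only a sketch: the function name estimate_seconds is made up, and the real rate depends on your hardware and data.

```shell
# Sketch: rough run-time estimate from a file count and a throughput figure.
estimate_seconds() {
    # $1: number of files, $2: files per second (positive integer)
    files=$1
    rate=$2
    if [ "$rate" -le 0 ]; then
        echo "rate must be positive" >&2
        return 1
    fi
    # Integer ceiling division, so partial seconds round up.
    echo $(( (files + rate - 1) / rate ))
}
```

For a hypothetical volume of 100000 files, estimate_seconds 100000 100 prints 1000, i.e. about 17 minutes at 100 files per second.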

Recurse into Archives (Containers)

tbd, skipped for now.

General File Format Recognition and Characterization

For general file format recognition and characterization, run the following scripts:

  1. php index.php libmagic <target>
    • General technical metadata and MIME info, based on the file command
    • Modifies the info_libmagic table.
    • Runs at about 170 files per second (at DLA Marbach – your mileage may vary)
  2. php index.php gvfsinfo <target>
    • General technical metadata and MIME info, based on the GNOME gvfs-info command
    • Modifies the info_gvfs_info table.
    • Runs at about 100 files per second.
  3. php index.php tika <target>
    • General MIME info and full text extraction, based on Apache Tika
    • Modifies the info_tika table.
    • Runs at about 170 files per second.
  4. php index.php siegfried <target>
    • General MIME info and PRONOM IDs, based on Siegfried
    • Modifies the info_siegfried table.
    • Runs at about 12 files per second.
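The four calls above can be chained so that a later module only starts once the previous one has finished. The helper below is a sketch, not part of Indexer: run_modules and the PHP_BIN override are made-up names, and it assumes that index.php sits in the current directory and exits with a non-zero status on failure, which this page does not confirm.

```shell
# Sketch: run several index.php modules for one target, in the given order.
# Assumptions: index.php lives in the current directory and exits with a
# non-zero status on failure (not confirmed by this page).
# PHP_BIN is a hypothetical override, handy for a dry run (PHP_BIN=echo).
run_modules() {
    target=$1
    shift
    for module in "$@"; do
        echo ">>> $module $target"
        "${PHP_BIN:-php}" index.php "$module" "$target" || {
            echo "module $module failed for $target" >&2
            return 1
        }
    done
}
```

For example, run_modules od libmagic gvfsinfo tika siegfried processes the group od with all four modules; setting PHP_BIN=echo beforehand only prints the commands that would be run.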

Specific File Format Characterization

For specific file format characterization, full text extraction and thumbnail generation, run the following scripts:

  1. php index.php imagemagick <target>
    • Technical metadata and thumbnails for graphical file formats, based on ImageMagick
    • Modifies the info_imagick table.
    • Runs at about 9 files per second.
  2. php index.php avconv <target>
    • Technical metadata for audio and video file formats, based on ??
    • Modifies the info_avconv table.
    • Runs at about 16 files per second.
  3. php index.php antiword <target>
    • Self-descriptive metadata and full text extraction for old MS-Word files, based on ??
    • Modifies the info_antiword table.
    • Runs at about 80 files per second.

Match against the National Software Reference Library (NSRL)

Prerequisites

A local copy of the NSRL must be set up as described in …

Matching

To match all files against the NSRL, run the following scripts:

  1. php index.php md5 <target>
    • Adds MD5 checksums to the file table.
    • Runs at about 220 files per second.
  2. php index.php nsrl <target>
    • Looks up the NSRL ProductCode for files whose MD5 checksums match
    • Modifies the info_nsrl table.
    • Runs at about … files per second.

Build or enrich the SOLR Index

For faster search, all relevant data is transferred into a SOLR index. To do this, run the following script:

php solr.php <target>

Clear a Volume’s (Session’s) Data

Sometimes you may want to clear a volume’s data from both the database and the SOLR index, for example to start over with a better image file or because the volume turned out to be irrelevant.

To do this, run the following script:

php clearsession.php <session>
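When the goal is to start over with a better image file, clearing and re-ingesting can be combined. The helper below is a sketch under several assumptions: reingest_session and PHP_BIN are made-up names, both scripts are expected in the current directory, and clearsession.php is assumed to run without interactive prompts and to exit with a non-zero status on failure.

```shell
# Sketch: clear one session and ingest it again from scratch.
# Assumptions: both scripts are in the current directory, exit non-zero on
# failure, and clearsession.php runs without interactive prompts.
# PHP_BIN is a hypothetical override for dry runs (PHP_BIN=echo).
reingest_session() {
    session=${1:-}
    if [ -z "$session" ]; then
        echo "usage: reingest_session <session>" >&2
        return 2
    fi
    # Only re-ingest if clearing succeeded.
    "${PHP_BIN:-php}" clearsession.php "$session" &&
        "${PHP_BIN:-php}" recurse.php "$session"
}
```

After reingest_session 2018 has finished, the characterization, NSRL and SOLR steps described above have to be repeated for that session in the specified order.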