A utility for the automated generation of digital objects based on the digital signatures documented in the PRONOM database maintained by The National Archives, UK.
The skeleton-test-suite-generator addresses a gap: the community needs a corpus of digital objects for validating and evaluating format identification tools and techniques.
The generated suite is intended to complement skeleton files that signature developers also create manually.
The research paper this work led to can be found here: IJDC.08.01.2013.
The container skeleton suite requires some different technologies to run. It is hosted in a separate repository.
Richard Lehane's builder builds skeleton suites with each new PRONOM release and includes both the standard binary skeletons and the container suite. It is a must-have for all file format signature developers.
The tool takes a signature specified for a digital object in PRONOM and constructs a digital object that will match its footprint. For example, given the signature:
CAFED00D{4}CAFEBABE(0D|0D0A)
The hex sequences comprising digital objects that will match this signature in DROID will look like the following:
CA FE D0 0D 00 00 00 00 CA FE BA BE 0D
Or:
CA FE D0 0D 00 00 00 00 CA FE BA BE 0D 0A
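The expansion above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code in skeletongenerator.py: the helper name `expand_signature` and its `fill` and `choice` parameters are hypothetical, and only literal hex bytes, `{n}` gaps and `(a|b)` alternatives are handled.

```python
import re

# Hypothetical helper for illustration; not taken from skeletongenerator.py.
TOKEN = re.compile(
    r"(?P<hex>[0-9A-Fa-f]{2})"       # a literal byte, e.g. CA
    r"|\{(?P<gap>\d+)\}"             # a fixed-length gap, e.g. {4}
    r"|\((?P<alts>[0-9A-Fa-f|]+)\)"  # alternatives, e.g. (0D|0D0A)
)


def expand_signature(signature, fill=0x00, choice=0):
    """Return bytes that should match a simple PRONOM-style signature.

    Gaps are padded with `fill`; `choice` selects which alternative to emit.
    """
    out = bytearray()
    pos = 0
    while pos < len(signature):
        match = TOKEN.match(signature, pos)
        if match is None:
            raise ValueError(f"unsupported syntax at offset {pos}")
        if match.group("hex"):
            out.append(int(match.group("hex"), 16))
        elif match.group("gap"):
            out.extend([fill] * int(match.group("gap")))
        else:
            options = match.group("alts").split("|")
            out.extend(bytes.fromhex(options[choice]))
        pos = match.end()
    return bytes(out)


signature = "CAFED00D{4}CAFEBABE(0D|0D0A)"
print(expand_signature(signature).hex(" ").upper())
print(expand_signature(signature, choice=1).hex(" ").upper())
```

With `choice=0` the sketch emits the first alternative (`0D`), and with `choice=1` the second (`0D 0A`), matching the two byte sequences shown above.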
The scripts take an export of the PRONOM database in XML, extract the internal signature information belonging to each format record and generate the digital objects - creating the 'skeleton test suite'.
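As a rough sketch of the extraction step, the snippet below pulls byte sequences out of a hand-written XML fragment. The element names (`FileFormat`, `FormatName`, `InternalSignature`, `ByteSequence`, `PositionType`, `ByteSequenceValue`) are assumptions for illustration; the real PRONOM export may differ and may use XML namespaces. The function `internal_signatures` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample. Element names are assumptions for illustration and
# may not match the real PRONOM export.
SAMPLE = """\
<FileFormat>
  <FormatName>Example Format</FormatName>
  <InternalSignature>
    <ByteSequence>
      <PositionType>Absolute from BOF</PositionType>
      <ByteSequenceValue>CAFED00D{4}CAFEBABE(0D|0D0A)</ByteSequenceValue>
    </ByteSequence>
  </InternalSignature>
</FileFormat>
"""


def internal_signatures(xml_text):
    """Yield (format name, position type, sequence) for each byte sequence."""
    root = ET.fromstring(xml_text)
    name = root.findtext("FormatName", default="")
    for seq in root.iter("ByteSequence"):
        yield (
            name,
            seq.findtext("PositionType", default=""),
            seq.findtext("ByteSequenceValue", default=""),
        )


for record in internal_signatures(SAMPLE):
    print(record)
```

Each tuple carries enough information to decide where in the skeleton file a sequence belongs before it is expanded into bytes.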
The objects can be used for:
- Understanding where signatures in the PRONOM database will conflict, therefore generating multiple identifications for some files.
- Creating signatures purely based on format specifications where getting sample files or making them available to those able to create signatures is extremely difficult.
- Incorporation into the DROID unit test-suite to ensure modifications to the identification engine do not impact identification capability.
- Testing the stability of signature files over time.
Other benefits include a small footprint: zipped, the suite is just over 150 KB in size; unzipped, it is approximately 390 KB.
The suite also does not suffer from issues relating to IPR and copyright. Both the suite and the generator tool are licensed under CC BY-SA (see below).
The tool is still a prototype and does not yet handle every sequence in PRONOM. Signatures with multiple BOF sequences, for example, will not generate correctly. While the team working on PRONOM can correct for this, these are legitimate sequences that should be handled by the tool.
python skeletongenerator.py
Easy as. The scripts require the existence of the 'pronom-export' folder generated by the scripts in the pronom-xml-export repository: https://github.com/exponential-decay/pronom-xml-export
The input and output locations can be configured by modifying the accompanying cfg file skeletonsuite.cfg.
By default, files are generated using NULL bytes to 'fill' the gaps dictated by a signature. This can be configured in the cfg file by setting the character value (0 to 255) of the desired fill byte, or a value below 0 or above 255 for random bytes.
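The fill rule can be illustrated with a minimal sketch. The function name `fill_bytes` and its parameters are hypothetical and are not taken from the generator's source or from skeletonsuite.cfg; the sketch only mirrors the behaviour described above.

```python
import os


def fill_bytes(length, fill_value=0):
    """Return `length` filler bytes.

    A fill_value in 0..255 repeats that byte; anything outside that range
    produces random bytes instead.
    """
    if 0 <= fill_value <= 255:
        return bytes([fill_value]) * length
    return os.urandom(length)


print(fill_bytes(4))            # NULL fill (the default)
print(fill_bytes(4, 0x20))      # fill with spaces
print(len(fill_bytes(4, -1)))   # random fill, so only the length is shown
```

Random fill can be useful for checking that a signature does not accidentally depend on the padding bytes between its sequences.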
Version information can be displayed by running:
python skeletongenerator.py --version
I completed two reports on the Skeleton Test Suite back in 2012/2013. They document testing of the files on DROID and explore reasons why some files do or do not work. The reports and links to the test-suites used for testing can be found on the repo wiki.
- Handle multiples of sequence types, e.g. multiple non-colliding BOF sequences.
- Understand the requirements for metadata to be associated with files, e.g. should the internal structure of files be self-describing?
- Create a repository on GitHub to host the first non-prototypical output of this generator and the test-suite henceforth.
- Understand what we need to do with multiple combinations of byte sequences; currently we always 'turn left'.
- Add unit tests for signature2bytegenerator.py and filewriter.py as a priority.
- Incorporate suite into unit tests for DROID and FIDO
- Together understand if we can adapt this approach for the UNIX File utility
- Talk about this tool and potential approach and help to understand how to refine it!
- Sit tight as we build an infrastructure to host the suite itself online.
Copyright (c) 2012 Ross Spencer
This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software.
Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:
- The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.
- Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.
- This notice may not be removed or altered from any source distribution.
PRONOM data, which is not owned by this repository, is licensed under the Open Government Licence (OGL).