Collection API Technical Specification

National Museum of Australia, February 2018

1. Overview

Collection Explorer is the National Museum of Australia’s public interface into its online collection. It was first released in August 2014, with multiple design improvements and usability enhancements since then.

The Museum is committed to providing open and easy access to its online collection wherever possible, introducing Creative Commons licensing of suitable object images and API access.

The APIs are RESTful, based on a Solr database, and return JSON-formatted results.

The two APIs are:

  1. Public API - open to all users, so it is limited to data already available to the public via Collection Explorer, excluding restricted content such as Indigenous material. The aim is to encourage the sharing of our collection with other institutions and members of the general public.
  2. Internal API - for the Museum’s internal use, so it includes more data than the public API, such as objects that are on loan. The aim is to share the collection across our website and other digital products; it may also be used for discrete external data feeds to sites such as Trove.

The public API provides data and images to:

  • Allow external developers and the public programmatic access to our collection data
  • Allow the Museum to easily contribute data to online projects and collections, for example, the National Library of Australia’s Trove
  • Provide a framework to power interaction between other applications and collection data, for example, public users creating their own collections.

The internal API provides data and images to:

  • Display content within the Museum’s main website http://nma.gov.au
  • Display content inside Museum mobile apps and on digital signage
  • Display content on devices in galleries.

The two APIs are delivered via the same API endpoints using different levels of security.
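
As a hedged illustration of this shared endpoint model, the sketch below calls a public /object endpoint with Python's requests library. The base URL, the apikey header and the envelope field name are placeholders for illustration, not the published interface.

```python
import requests

BASE_URL = "https://api.example.museum"  # placeholder host, not the published API address

# The same endpoint serves both APIs; supplying an API key (if held)
# unlocks the extra internal data. Header and parameter names are illustrative.
response = requests.get(
    f"{BASE_URL}/object",
    params={"text": "bark canoe", "limit": 10},
    headers={"apikey": "YOUR_KEY"},  # omit for unauthenticated trial use
    timeout=30,
)
response.raise_for_status()
results = response.json()                                 # results are returned as JSON
print(len(results.get("data", [])), "objects returned")   # envelope name is an assumption
```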

2. Business requirements

  1. Exposure - Increase the exposure of our collection by providing open access to our data and by encouraging collaboration with individuals, companies, and other collections
  2. Enhancements - Grow and add to the API over time by ensuring it is expandable and flexible, making it easy to modify as data needs change, and future-proofing where possible
  3. Engagement - Engage with the developer community by making it easy to get started, adhering to API conventions, and encouraging feedback from developers
  4. Robust - Build confidence through reliability by having a process for deployment that allows updates to the API with minimal downtime
  5. Self-supporting - Require as little support from the Museum as possible by having an API that is easy to use and well documented

3. User requirements

Potential user groups

Public API

  • Web and app developers, hackathon participants, data enthusiasts, digital labs
  • Teachers, students, researchers (academic, family history, digital humanities)
  • Other museums and Government agencies
  • Aggregators (e.g. Trove, MOAD), bots, data.gov, stock libraries
  • Special interest groups, communities of interest
  • Journalists

Internal API

  • NMA digital team
  • NMA gallery redevelopment team, curatorial team, other staff
  • Web and app developers (for internal apps)
  • Business systems

High priority user needs

A user requirements workshop was held at the Museum in 2017; the highest priority need identified was to build a strong foundation for the API.

Public API priority needs:

  • Documentation - Well documented structure and methods, including code examples of how the API can be used. This was considered more important than additional features, as overall usability, and understanding the API well enough to implement it, would drive adoption.
  • Data Quality - Including consistency, interesting and useful data, and high-resolution imagery. Consistency was considered an important factor in determining overall data quality.
  • Robust - High performing, with high availability, high speed access, and stable URLs. For developers to invest in building with an API, reliability will boost their confidence and trust in the service.

Internal API high priority needs:

  • Completeness - Access to all fields available, the inclusion of objects not in the public API (such as loan objects), and the desire to push and pull data. The Museum is also considering how the API could be used to add value to the collection data itself, in the future.

Potential future enhancements:

  • Content relationships - Connecting objects and capturing relationships, for example through the use of groups or lists, categories and other taxonomies. This would help drive ‘related objects’ and other serendipitous recommendations.
  • Augmented content - More specific attributes for objects for the Museum’s collection itself, such as object colour or popularity, detailed physical location, and when the objects have been displayed in the Museum.

4. Functional requirements

MVP (Minimum Viable Product) for initial Public API release (March 2018)

Data sources

  • EMu & Piction: weekly full reindex
  • Ability to make ad hoc record pull-downs

Data scope

  • NMA records released for Public API:
  • Object records are included, where the API status field contains "Public" or "Public Restricted"
  • Narrative records are included, where the narrative purpose field contains "Collection Explorer publish"
  • Restricted objects are excluded, e.g. Indigenous content. Determined by AcsCCStatus=Restricted
  • Multimedia/parties/sites records are included, where they are linked to a public released object or narrative record
  • Multimedia files are included, where the linked object record contains a valid licence (PD/CC-BY/CC-BY-NC). Determined by AcsCCStatus value of "Public Domain", "Creative Commons Commercial Use" or "Creative Commons Non-commercial use".
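
The inclusion rules above can be read as a filter applied to each exported record. The sketch below restates them in Python; the dictionary keys (api_status and so on) are placeholders rather than real EMu column names.

```python
# Licence values that permit a multimedia file to be released publicly.
VALID_LICENCES = {
    "Public Domain",
    "Creative Commons Commercial Use",
    "Creative Commons Non-commercial use",
}

def include_object(record: dict) -> bool:
    """Object records: the API status must be Public or Public Restricted,
    and restricted (e.g. Indigenous) content is excluded."""
    return (
        record.get("api_status") in {"Public", "Public Restricted"}
        and record.get("AcsCCStatus") != "Restricted"
    )

def include_media_file(linked_object: dict) -> bool:
    """Multimedia files: the linked object record must carry a valid licence."""
    return include_object(linked_object) and (
        linked_object.get("AcsCCStatus") in VALID_LICENCES
    )
```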

Search

  • Free-text using keywords
  • Boolean operators
  • Limit by object type
  • Limit by custom fields: e.g. title, object type, date, image list (see the request sketch after the search results list)

Search results

  • Ordering: by relevance only
  • Result count
  • Pagination: limit, offset
  • Parent/child records: duplicated in results
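
A sketch of how the search and pagination options above might be combined in one request. The parameter names (text, type, limit, offset) and the total field are assumptions; the actual names are documented in the appendices.

```python
import requests

BASE_URL = "https://api.example.museum"  # placeholder host

# Free-text search with a boolean operator, limited by object type,
# and paged with limit/offset. All parameter names are illustrative.
params = {
    "text": "bark AND canoe",
    "type": "watercraft",
    "limit": 20,    # results per page
    "offset": 40,   # skip the first two pages
}
page = requests.get(f"{BASE_URL}/object", params=params, timeout=30).json()

# The MVP orders results by relevance only and returns a result count
# for building pagination controls (field name is an assumption).
print(page.get("total"), "matching objects")
```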

Output

  • JSON format
  • Related objects: excluded
  • Nested objects: IDs only
  • Image URLs: thumbnail, preview, hi-res
  • Basic image metadata
  • Dates: in ISO 8601 format in the UTC timezone (e.g. date, created, modified); see the sketch after this list
  • Error details
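
Dates in the output are ISO 8601 strings in UTC. A small sketch of producing and parsing such values in Python (the modified field name is only an example):

```python
from datetime import datetime, timezone

# Producing an ISO 8601 timestamp in UTC, like those in the API output.
stamp = datetime.now(timezone.utc).isoformat()
# e.g. '2018-02-15T03:22:41.123456+00:00'

# Parsing a 'modified' value taken from a record (field name illustrative).
modified = datetime.fromisoformat("2018-02-15T03:22:41+00:00")
print(modified.year, modified.tzinfo)
```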

Architecture

  • Metadata standards: Linked Art, and a simplified, Dublin Core-based schema [what for specimens?] [what about field qualifiers]
  • Hosted: on NMA servers, with scaling strategy defined
  • Endpoints: /object, /narrative, https, IRNs used for IRI identifiers
  • API versioning: strategy defined
  • Authentication
    • No authentication for basic usage: for users trying the API out - no key, usage is throttled by IP, data is limited as it excludes records with "Public Restricted" as the API status. (TBC if possible)
    • Access key authentication: automated API key signup via web form, usage throttled by key and IP (TBD)
  • Metrics: basic usage tracked

Community portal

  • User documentation:
    • Endpoints: list, operations, parameters
    • Sample records
    • Getting started
    • Delivery: github wiki
  • Email list
  • Issues: form to raise issue or provide feedback

Full Public API release (April 2018)

Data sources

  • Daily updates
  • Changes to entity records trigger updates to all related object records

Data scope

  • Trove licence scope

Architecture

  • Endpoints: /option, /party, /place, /media (TBD)
  • API versioning: operational

Search

  • Image metadata: has image, has hi-res, licence type
  • Limit by deleted status (for harvesting)
  • Limit by modified date range, for incremental harvesting (see the harvesting sketch after this list)
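
A sketch of an incremental harvest using the filters above: page through records modified in a date range, including deleted records so harvesters can remove them locally. The parameter and envelope names are assumptions.

```python
import requests

BASE_URL = "https://api.example.museum"  # placeholder host

def harvest_changes(since: str, until: str):
    """Yield records modified between two ISO 8601 dates, deleted ones included.
    Parameter names (modified_after, include_deleted, ...) are illustrative."""
    offset, limit = 0, 100
    while True:
        page = requests.get(
            f"{BASE_URL}/object",
            params={
                "modified_after": since,
                "modified_before": until,
                "include_deleted": "true",
                "limit": limit,
                "offset": offset,
            },
            timeout=30,
        ).json()
        records = page.get("data", [])   # envelope field name is an assumption
        if not records:
            break
        yield from records
        offset += limit
```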

Search results

  • Ordering: customisable, e.g. relevance, title, type, date

Output

  • XML
  • Related objects: full related records included (related objects, sub-narratives)
  • Nested objects: full nested record data (don't need additional calls)
  • Select custom fields only (rather than having to receive full records)

Internal API release (May 2018)

Data scope

  • NMA records released for Internal API:
  • Object records are included, where the API status field contains "Public", "Public Restricted" or "Internal"
  • All narrative records are included.
  • Restricted objects are included, e.g. Indigenous content
  • Multimedia/parties/sites records are included, where they are linked to a public/internal released object or narrative record
  • Multimedia files are included, where the linked object record contains a valid licence (PD/CC-BY/CC-BY-NC)

Architecture

  • Authentication
    • Access key authentication: manually created by NMA staff

Future possible enhancements

Data scope

  • Geo location polygons for places
  • Flag indicating whether an item is featured
  • Event data combined (e.g. birth date AND place)
  • Semantic relationships

Search

  • SPARQL endpoint
  • OAI-PMH harvest
  • Search by geo location - near location, within bounding polygon

Output

  • CSV
  • NMA exhibition status: gallery, module

Architecture

  • Authentication
    • Communities can access community content by key scope

5. Data requirements

Data scope

The data consists of:

  • Object records - these contain information about a collection item, sourced from KE EMu (EMu). These appear on Collection Explorer as object details.
  • Narratives - these group a number of object records into a ‘set’ of objects that may contain sub-narratives, sourced from EMu. These appear on Collection Explorer as 'Sets'.
  • Object images - images associated with an object, sourced from EMu or Piction (the Museum’s Digital Asset Manager).

Object records and narratives are exported from EMu nightly. Images from Piction are also exported nightly into a separate export file, and matched with the relevant object record based on a unique identifying number.

| Data type | Public API | Internal API |
| --- | --- | --- |
| Object records | All published to Collection Explorer | Plus objects on loan to the Museum |
| Narratives | All published to Collection Explorer | Plus narratives flagged for internal use only, e.g. in-house gallery |
| Object images | Any associated to an included object record with an image licence of Public Domain or Creative Commons | All images associated to an included object record, plus images associated to objects on loan to the Museum |
| Image sizes | 1600px (full), 640px (preview) and 200px (thumbnail) | Plus other sizes |

Data fields

Initially records in the Public API will contain all fields currently in Collection Explorer. More fields from EMu will be added over time. The Internal API will contain additional non-public fields.

The data fields are detailed in the appendices.

API resource endpoints

  • /object – returns a list of objects based upon search parameters
  • /object/{ID} – returns a specific object
  • /narrative – returns a list of narratives based upon search parameters
  • /narrative/{ID} – returns a specific narrative
  • /party – returns a list of parties (people or organisations) based upon search parameters
  • /party/{ID} – returns a specific party
  • /place – returns a list of places based upon search parameters
  • /place/{ID} – returns a specific place
  • /option – returns a list of options for a data field
  • /media – returns a list of multimedia objects (TBD)
  • /media/{ID} – returns a specific multimedia object (TBD)

The endpoint methods and parameters are detailed in the appendices.
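
The endpoints follow a uniform list/item pattern, so a thin client wrapper is straightforward. The sketch below assumes a placeholder base URL and an apikey header; it is illustrative only.

```python
import requests

class CollectionClient:
    """Minimal sketch of a client for the resource endpoints listed above."""

    def __init__(self, base_url="https://api.example.museum", api_key=None):
        self.base_url = base_url                       # placeholder host
        self.headers = {"apikey": api_key} if api_key else {}

    def _get(self, path, **params):
        r = requests.get(f"{self.base_url}{path}",
                         params=params, headers=self.headers, timeout=30)
        r.raise_for_status()
        return r.json()

    # List endpoints: /object, /narrative, /party, /place, /option, /media
    def search(self, resource, **params):
        return self._get(f"/{resource}", **params)

    # Item endpoints: /object/{ID}, /narrative/{ID}, /party/{ID}, ...
    def get(self, resource, irn):
        return self._get(f"/{resource}/{irn}")

# Usage sketch: CollectionClient().get("object", 12345)
```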

Resource logic (controlling resource responses)

  1. API endpoints use consistent conventions for manipulation of search results via query parameters
  2. Endpoints that return multiple results offer paging options (number of results per page, page number) and sort options (field, direction)
  3. Free-text searching is supported, including wildcard functionality
  4. At least JSON formatted output is supported, other formats may be provided in the future
  5. Default parameter values minimise network traffic and server load, in case of accidental misuse.

API versioning

  1. The APIs support multiple concurrent API versions, so that new features can be rolled out without adversely affecting anyone using previous versions of the API
  2. Previous API versions are specified using a version number (major and minor) in the API endpoint IRIs, as illustrated in the sketch after this list
  3. The current API version does not specify a version in the API endpoint IRIs, so unaffected applications do not need to make changes after each version release
  4. When a new major version is released, a deprecation message is added to the response for requests to old major versions of the API, to encourage developers to upgrade and to notify them of the date that the version will become unsupported (once a date has been decided).
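
An illustrative sketch of the versioning scheme described in points 2 and 3; the exact path layout is an assumption, not the published format.

```python
# Current version: no version segment in the IRI, so unaffected applications
# keep working after each release.
CURRENT = "https://api.example.museum/object/12345"

# Previous versions: pinned with a major.minor version segment in the IRI.
PINNED_V1_2 = "https://api.example.museum/v1.2/object/12345"
```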

Metadata standards

  1. Field names follow best practice where possible
  2. At least one response format contains fields from the Dublin Core metadata schema (http://purl.org/dc/terms).
  3. Mapping of source system (EMu/Piction) field names to API field names ideally occurs once, so tracking data journeys forwards and backwards is less complicated
  4. Data manipulation happens during ingest into the core database, so that the core database is the single complete data source and the API shim delivery layer is not slowed down by complicated transformations
  5. Field name changes must be accompanied by a major version number change so as to not break previous implementations of the API
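
As a sketch of point 2, a simplified Dublin Core-based response could be produced by mapping source fields onto dcterms names once at ingest (points 3 and 4). The source field names on the left are placeholders; only the dcterms terms are real.

```python
# Source field names (left) are placeholders; the Dublin Core terms (right)
# come from http://purl.org/dc/terms.
DC_FIELD_MAP = {
    "object_title": "dcterms:title",
    "object_type":  "dcterms:type",
    "maker":        "dcterms:creator",
    "date_made":    "dcterms:created",
    "description":  "dcterms:description",
    "irn":          "dcterms:identifier",
    "licence":      "dcterms:rights",
}

def to_dublin_core(record: dict) -> dict:
    """Apply the mapping once, at ingest time, so the API shim serves it as-is."""
    return {dc: record[src] for src, dc in DC_FIELD_MAP.items() if src in record}
```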

6. Security and authentication

  1. All software and data resides on NMA servers within the NMA DMZ
  2. API traffic uses HTTPS, HTTP traffic is redirected to HTTPS
  3. The Public and Internal APIs share the same endpoint IRIs; extra functionality is accessed by using API keys with special privileges
  4. Public API endpoints do not expose any restricted content or content fields
  5. Public API endpoints provide basic access without API key authorisation (to allow people to evaluate the service and build simple applications), more advanced usage is restricted with API key authorisation
  6. Anyone is able to sign up for a public API key with their email address, with no administrator approval required. The system stores a record of the API key and email address pair.
  7. Private API content and content fields are restricted with API key authorisation
  8. Private API keys are manually generated by NMA staff for trusted users only
  9. API keys can be invalidated by an administrator.
  10. Rate limiting is used to reduce the risk of accidental or malicious denial of service
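
A client-side sketch of how the key and rate-limiting rules above might be handled: send an API key when one is held, and back off when throttled. The apikey header name and the use of HTTP 429 for throttling are assumptions.

```python
import time
import requests

BASE_URL = "https://api.example.museum"  # placeholder host; all traffic is HTTPS

def get_with_backoff(path, api_key=None, **params):
    """Call the API, sending a key if held, and retry politely when rate limited."""
    headers = {"apikey": api_key} if api_key else {}   # header name is an assumption
    for attempt in range(5):
        r = requests.get(f"{BASE_URL}{path}", params=params,
                         headers=headers, timeout=30)
        if r.status_code != 429:          # 429 = throttled (assumed response code)
            r.raise_for_status()
            return r.json()
        time.sleep(2 ** attempt)          # exponential backoff before retrying
    raise RuntimeError("rate limit not cleared after retries")
```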

7. Hardware and network

ETL (Extract-Transform-Load)

Data is extracted from the EMu collection management database and the Piction digital asset management system, merged together into a core graph database, then extracted and stored in Solr indexes for delivery.

ETL data flow:

  1. Nightly export of data files from EMu and Piction
  2. EMu and Piction exports are processed by a custom import script and merged in the core graph database on the server
  3. Merged records are extracted and processed by a custom conversion script and loaded into the Solr database on the server
  4. The API application XProc 'shim' queries the Solr index to return data.
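
A highly simplified sketch of steps 2 and 3, omitting the intermediate graph database and assuming line-delimited JSON exports and Solr's standard JSON update handler. The file names, core name and field names are placeholders.

```python
import json
import requests

# Placeholder Solr core and host; the real index lives on the NMA servers.
SOLR_UPDATE = "http://localhost:8983/solr/collection/update?commit=true"

def load_export(path):
    """Read a nightly export file (assumed format: one JSON record per line)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Step 1: nightly exports from EMu and Piction (file names are placeholders).
objects = load_export("emu_objects.jsonl")
media = load_export("piction_media.jsonl")

# Step 2: merge media onto object records via the shared identifying number
# ('irn' / 'object_irn' are placeholder field names).
media_by_object = {}
for m in media:
    media_by_object.setdefault(m["object_irn"], []).append(m["image_url"])
for rec in objects:
    rec["image_urls"] = media_by_object.get(rec["irn"], [])

# Step 3: load the merged records into the Solr index queried by the API shim.
requests.post(SOLR_UPDATE, json=objects, timeout=120).raise_for_status()
```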

Server architecture

  1. One or more identical production Linux Ubuntu servers, each running an Apache Solr index of the data along with the API shim. Additional servers can be added when extra capacity is required
  2. The production servers run inside the NMA DMZ and are only accessible via a public facing load balancer
  3. One production server also hosts the ETL process, consisting of the graph database and associated conversion scripts
  4. An additional identical server acts as a staging version, for previewing changes to any of the components
  5. A reverse proxy caches traffic to the API. The cache is invalidated and warmed each night after the nightly data import. The proxy is TBD (either cloud based or a local Varnish server).
  6. Image files are stored on a separate server and mounted via Samba shares
  7. The core graph database and Solr indexes are updated nightly via a custom import script that runs on EMu XML exports.

Deployment and updating

  1. Development environment - there is no outside access to the NMA servers so developers run their own copies of the server for development and testing purposes
  2. Staging environment - all development is deployed to the NMA's staging server first to be reviewed before deployment to the production server/s
  3. Change request process - NMA IT change request forms (including detailed instructions for the deployment process) are completed for all deployments to the production servers
  4. Deployment process - deployment is scripted where possible, and the scripts are also used when deploying to the staging server (to test the deployment scripts themselves)
  5. Service interruptions - interruptions are minimised by diverting all traffic to a second server (i.e. a second production server or staging server) while the first is unavailable, and then vice versa
  6. Snapshotting - prior to each deployment a snapshot of the server is taken, and in the case of a catastrophic deployment failure the server will be rolled back to the snapshot.

Appendices

A1. Data schema

TODO

A2. API endpoints

TODO