Collection API Technical Specification

National Museum of Australia, February 2018

1. Overview

Collection Explorer is the National Museum of Australia’s public interface into its online collection. It was first released in August 2014, with multiple design improvements and usability enhancements since then.

The Museum is committed to providing open and easy access to its online collection wherever possible, introducing Creative Commons licensing of suitable object images and API access.

The APIs are RESTful, based on a Solr database, and return JSON-formatted results.

The two APIs are:

  1. Public API - open to all users, so it is limited to data already available to the public via Collection Explorer, excluding restricted content such as Indigenous material. The aim is to encourage the sharing of our collection with other institutions and members of the general public.
  2. Internal API - for the Museum’s internal use, so it includes more data than the public API, such as objects that are on loan. The aim is to share the collection across our website and other digital products; it may also be used for discrete external data feeds to sites such as Trove.

The public API provides data and images to:

  • Allow external developers and the public programmatic access to our collection data
  • Allow the Museum to easily contribute data to online projects and collections, for example, the National Library of Australia’s Trove
  • Provide a framework to power interaction between other applications and collection data, for example, public users creating their own collections.

The internal API provides data and images to:

  • Display content within the Museum’s main website http://nma.gov.au
  • Display content inside Museum mobile apps and on digital signage
  • Display content on devices in galleries.

The two APIs are delivered via the same API endpoints using different levels of security.
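
As a hedged illustration of this shared endpoint model, the sketch below calls a public /object endpoint with Python's requests library. The base URL, the apikey header and the envelope field name are placeholders for illustration, not the published interface.

```python
import requests

BASE_URL = "https://api.example.museum"  # placeholder host, not the published API address

# The same endpoint serves both APIs; supplying an API key (if held)
# unlocks the extra internal data. Header and parameter names are illustrative.
response = requests.get(
    f"{BASE_URL}/object",
    params={"text": "bark canoe", "limit": 10},
    headers={"apikey": "YOUR_KEY"},  # omit for unauthenticated trial use
    timeout=30,
)
response.raise_for_status()
results = response.json()                                 # results are returned as JSON
print(len(results.get("data", [])), "objects returned")   # envelope name is an assumption
```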

2. Business requirements

  1. Exposure - Increase the exposure of our collection by providing open access to our data and by encouraging collaboration with individuals, companies, and other collections
  2. Enhancements - Grow and add to the API over time by ensuring it is expandable and flexible, making it easy to modify as data needs change, and future-proofing where possible
  3. Engagement - Engage with the developer community by making it easy to get started, adhering to API conventions, and encouraging feedback from developers
  4. Robust - Build confidence through reliability by having a process for deployment that allows updates to the API with minimal downtime
  5. Self-supporting - Require as little support from the Museum as possible by having an API that is easy to use and well documented

3. User requirements

Potential user groups

Public API

  • Web and app developers, hackathon participants, data enthusiasts, digital labs
  • Teachers, students, researchers (academic, family history, digital humanities)
  • Other museums and Government agencies
  • Aggregators (e.g. Trove, MOAD), bots, data.gov, stock libraries
  • Special interest groups, communities of interest
  • Journalists

Internal API

  • NMA digital team
  • NMA gallery redevelopment team, curatorial team, other staff
  • Web and app developers (for internal apps)
  • Business systems

High priority user needs

A user requirements workshop was held at the Museum in 2017; the highest priority need identified was to build a strong foundation for the API.

Public API priority needs:

  • Documentation - Well documented structure and methods, including code examples of how the API can be used. This was considered more important than additional features, as overall usability, and understanding the API well enough to implement it, would drive adoption.
  • Data Quality - Including consistency, interesting and useful data, and high-resolution imagery. Consistency was considered an important factor in determining overall data quality.
  • Robust - High performing, with high availability, high speed access, and stable URLs. For developers to invest in building with an API, reliability will boost their confidence and trust in the service.

Internal API high priority needs:

  • Completeness - Access to all fields available, the inclusion of objects not in the public API (such as loan objects), and the desire to push and pull data. The Museum is also considering how the API could be used to add value to the collection data itself, in the future.

Potential future enhancements:

  • Content relationships - Connecting objects and capturing relationships, for example through the use of groups or lists, categories and other taxonomies. This would help drive ‘related objects’ and other serendipitous recommendations.
  • Augmented content - More specific attributes for objects for the Museum’s collection itself, such as object colour or popularity, detailed physical location, and when the objects have been displayed in the Museum.

4. Functional requirements

MVP (Minimum Viable Product) for initial Public API release (March 2018)

Data sources

  • EMu & Piction: weekly full reindex
  • Ability to make ad hoc record pull-downs

Data scope

  • NMA records released for Public API:
  • Object records are included, where the API status field contains "Public" or "Public Restricted"
  • Narrative records are included, where the narrative purpose field contains "Collection Explorer publish"
  • Restricted objects are excluded, e.g. Indigenous content. Determined by AcsCCStatus=Restricted
  • Multimedia/parties/sites records are included, where they are linked to a public released object or narrative record
  • Multimedia files are included, where the linked object record contains a valid licence (PD/CC-BY/CC-BY-NC). Determined by AcsCCStatus value of "Public Domain", "Creative Commons Commercial Use" or "Creative Commons Non-commercial use".
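
The inclusion rules above can be read as a filter applied to each exported record. The sketch below restates them in Python; the dictionary keys (api_status and so on) are placeholders rather than real EMu column names.

```python
# Licence values that permit a multimedia file to be released publicly.
VALID_LICENCES = {
    "Public Domain",
    "Creative Commons Commercial Use",
    "Creative Commons Non-commercial use",
}

def include_object(record: dict) -> bool:
    """Object records: the API status must be Public or Public Restricted,
    and restricted (e.g. Indigenous) content is excluded."""
    return (
        record.get("api_status") in {"Public", "Public Restricted"}
        and record.get("AcsCCStatus") != "Restricted"
    )

def include_media_file(linked_object: dict) -> bool:
    """Multimedia files: the linked object record must carry a valid licence."""
    return include_object(linked_object) and (
        linked_object.get("AcsCCStatus") in VALID_LICENCES
    )
```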

Search

  • Free-text using keywords
  • Boolean operators
  • Limit by object type
  • Limit by custom fields: e.g. title, object type, date, image list (see the request sketch after the search results list)

Search results

  • Ordering: by relevance only
  • Result count
  • Pagination: limit, offset
  • Parent/child records: duplicated in results
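
A sketch of how the search and pagination options above might be combined in one request. The parameter names (text, type, limit, offset) and the total field are assumptions; the actual names are documented in the appendices.

```python
import requests

BASE_URL = "https://api.example.museum"  # placeholder host

# Free-text search with a boolean operator, limited by object type,
# and paged with limit/offset. All parameter names are illustrative.
params = {
    "text": "bark AND canoe",
    "type": "watercraft",
    "limit": 20,    # results per page
    "offset": 40,   # skip the first two pages
}
page = requests.get(f"{BASE_URL}/object", params=params, timeout=30).json()

# The MVP orders results by relevance only and returns a result count
# for building pagination controls (field name is an assumption).
print(page.get("total"), "matching objects")
```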

Output

  • JSON format
  • Related objects: excluded
  • Nested objects: IDs only
  • Image URLs: thumbnail, preview, hi-res
  • Basic image metadata
  • Dates: in ISO 8601 format in the UTC timezone (e.g. date, created, modified); see the sketch after this list
  • Error details
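
Dates in the output are ISO 8601 strings in UTC. A small sketch of producing and parsing such values in Python (the modified field name is only an example):

```python
from datetime import datetime, timezone

# Producing an ISO 8601 timestamp in UTC, like those in the API output.
stamp = datetime.now(timezone.utc).isoformat()
# e.g. '2018-02-15T03:22:41.123456+00:00'

# Parsing a 'modified' value taken from a record (field name illustrative).
modified = datetime.fromisoformat("2018-02-15T03:22:41+00:00")
print(modified.year, modified.tzinfo)
```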

Architecture

  • Metadata standards: Linked Art, and a simplified, Dublin Core-based schema [what for specimens?] [what about field qualifiers]
  • Hosted: on NMA servers, with scaling strategy defined
  • Endpoints: /object, /narrative, https, IRNs used for IRI identifiers
  • API versioning: strategy defined
  • Authentication
    • No authentication for basic usage: for users trying the API out - no key, usage is throttled by IP, data is limited as it excludes records with "Public Restricted" as the API status. (TBC if possible)
    • Access key authentication: automated API key signup via web form, usage throttled by key and IP (TBD)
  • Metrics: basic usage tracked

Community portal

  • User documentation:
    • Endpoints: list, operations, parameters
    • Sample records
    • Getting started
    • Delivery: github wiki
  • Email list
  • Issues: form to raise issue or provide feedback

Full Public API release (April 2018)

Data sources

  • Daily updates
  • Changes to entity records trigger updates to all related object records

Data scope

  • Trove licence scope

Architecture

  • Endpoints: /option, /party, /place, /media (TBD)
  • API versioning: operational

Search

  • Image metadata: has image, has hi-res, licence type
  • Limit by deleted status (for harvesting)
  • Limit by modified date range, for incremental harvesting (see the harvesting sketch after this list)
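
A sketch of an incremental harvest using the filters above: page through records modified in a date range, including deleted records so harvesters can remove them locally. The parameter and envelope names are assumptions.

```python
import requests

BASE_URL = "https://api.example.museum"  # placeholder host

def harvest_changes(since: str, until: str):
    """Yield records modified between two ISO 8601 dates, deleted ones included.
    Parameter names (modified_after, include_deleted, ...) are illustrative."""
    offset, limit = 0, 100
    while True:
        page = requests.get(
            f"{BASE_URL}/object",
            params={
                "modified_after": since,
                "modified_before": until,
                "include_deleted": "true",
                "limit": limit,
                "offset": offset,
            },
            timeout=30,
        ).json()
        records = page.get("data", [])   # envelope field name is an assumption
        if not records:
            break
        yield from records
        offset += limit
```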

Search results

  • Ordering: customisable, e.g. relevance, title, type, date

Output

  • XML
  • Related objects: full related records included (related objects, sub-narratives)
  • Nested objects: full nested record data (don't need additional calls)
  • Select custom fields only (rather than having to receive full records)

Internal API release (May 2018)

Data scope

  • NMA records released for Internal API:
  • Object records are included, where the API status field contains "Public", "Public Restricted" or "Internal"
  • All narrative records are included.
  • Restricted objects are included, e.g. Indigenous content
  • Multimedia/parties/sites records are included, where they are linked to a public/internal released object or narrative record
  • Multimedia files are included, where the linked object record contains a valid licence (PD/CC-BY/CC-BY-NC)

Architecture

  • Authentication
    • Access key authentication: manually created by NMA staff

Future possible enhancements

Data scope

  • Geo location polygons for places
  • Flag indicating whether an item is featured
  • Event data combined (e.g. birth date AND place)
  • Semantic relationships

Search

  • SPARQL endpoint
  • OAI-PMH harvest
  • Search by geo location - near location, within bounding polygon

Output

  • CSV
  • NMA exhibition status: gallery, module

Architecture

  • Authentication
    • Communities can access community content by key scope

5. Data requirements

Data scope

The data consists of:

  • Object records - these contain information about a collection item, sourced from KE EMu (EMu). These appear on Collection Explorer as object details.
  • Narratives - these group a number of object records into a ‘set’ of objects that may contain sub-narratives, sourced from EMu. These appear on Collection Explorer as 'Sets'.
  • Object images - images associated with an object, sourced from EMu or Piction (the Museum’s Digital Asset Manager).

Object records and narratives are exported from EMu nightly. Images from Piction are also exported nightly into a separate export file, and matched with the relevant object record based on a unique identifying number.

| Data type | Public API | Internal API |
| --- | --- | --- |
| Object records | All published to Collection Explorer | Plus objects on loan to the Museum |
| Narratives | All published to Collection Explorer | Plus narratives flagged for internal use only, e.g. in-house gallery |
| Object images | Any associated to an included object record with an image licence of Public Domain or Creative Commons | All images associated to an included object record, plus images associated to objects on loan to the Museum |
| Image sizes | 1600px (full), 640px (preview) and 200px (thumbnail) | Plus other sizes |

Data fields

Initially records in the Public API will contain all fields currently in Collection Explorer. More fields from EMu will be added over time. The Internal API will contain additional non-public fields.

The data fields are detailed in the appendices.

API resource endpoints

  • /object – returns a list of objects based upon search parameters
  • /object/{ID} – returns a specific object
  • /narrative – returns a list of narratives based upon search parameters
  • /narrative/{ID} – returns a specific narrative
  • /party – returns a list of parties (people or organisations) based upon search parameters
  • /party/{ID} – returns a specific party
  • /place – returns a list of places based upon search parameters
  • /place/{ID} – returns a specific place
  • /option – returns a list of options for a data field
  • /media – returns a list of multimedia objects (TBD)
  • /media/{ID} – returns a specific multimedia object (TBD)

The endpoint methods and parameters are detailed in the appendices.
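
The endpoints follow a uniform list/item pattern, so a thin client wrapper is straightforward. The sketch below assumes a placeholder base URL and an apikey header; it is illustrative only.

```python
import requests

class CollectionClient:
    """Minimal sketch of a client for the resource endpoints listed above."""

    def __init__(self, base_url="https://api.example.museum", api_key=None):
        self.base_url = base_url                       # placeholder host
        self.headers = {"apikey": api_key} if api_key else {}

    def _get(self, path, **params):
        r = requests.get(f"{self.base_url}{path}",
                         params=params, headers=self.headers, timeout=30)
        r.raise_for_status()
        return r.json()

    # List endpoints: /object, /narrative, /party, /place, /option, /media
    def search(self, resource, **params):
        return self._get(f"/{resource}", **params)

    # Item endpoints: /object/{ID}, /narrative/{ID}, /party/{ID}, ...
    def get(self, resource, irn):
        return self._get(f"/{resource}/{irn}")

# Usage sketch: CollectionClient().get("object", 12345)
```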

Resource logic (controlling resource responses)

  1. API endpoints use consistent conventions for manipulation of search results via query parameters
  2. Endpoints that return multiple results offer paging options (number of results per page, page number) and sort options (field, direction)
  3. Free-text searching is supported, including wildcard functionality
  4. At least JSON formatted output is supported, other formats may be provided in the future
  5. Default parameter values minimise network traffic and server load, in case of accidental misuse.

API versioning

  1. The APIs support multiple concurrent API versions, so that new features can be rolled out without adversely affecting anyone using previous versions of the API
  2. Previous API versions are specified using a version number (major and minor) in the API endpoint IRIs, as illustrated in the sketch after this list
  3. The current API version does not specify a version in the API endpoint IRIs, so unaffected applications do not need to make changes after each version release
  4. When a new major version is released, a deprecation message is added to the response for requests to old major versions of the API, to encourage developers to upgrade and to notify them of the date that the version will become unsupported (once a date has been decided).
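
An illustrative sketch of the versioning scheme described in points 2 and 3; the exact path layout is an assumption, not the published format.

```python
# Current version: no version segment in the IRI, so unaffected applications
# keep working after each release.
CURRENT = "https://api.example.museum/object/12345"

# Previous versions: pinned with a major.minor version segment in the IRI.
PINNED_V1_2 = "https://api.example.museum/v1.2/object/12345"
```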

Metadata standards

  1. Field names follow best practice where possible
  2. At least one response format contains fields from the Dublin Core metadata schema (http://purl.org/dc/terms).
  3. Mapping of source system (EMu/Piction) field names to API field names ideally occurs once, so tracking data journeys forwards and backwards is less complicated
  4. Data manipulation happens during ingest into the core database, so that the core database is the single complete data source and the API shim delivery layer is not slowed down by complicated transformations
  5. Field name changes must be accompanied by a major version number change so as to not break previous implementations of the API
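
As a sketch of point 2, a simplified Dublin Core-based response could be produced by mapping source fields onto dcterms names once at ingest (points 3 and 4). The source field names on the left are placeholders; only the dcterms terms are real.

```python
# Source field names (left) are placeholders; the Dublin Core terms (right)
# come from http://purl.org/dc/terms.
DC_FIELD_MAP = {
    "object_title": "dcterms:title",
    "object_type":  "dcterms:type",
    "maker":        "dcterms:creator",
    "date_made":    "dcterms:created",
    "description":  "dcterms:description",
    "irn":          "dcterms:identifier",
    "licence":      "dcterms:rights",
}

def to_dublin_core(record: dict) -> dict:
    """Apply the mapping once, at ingest time, so the API shim serves it as-is."""
    return {dc: record[src] for src, dc in DC_FIELD_MAP.items() if src in record}
```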

6. Security and authentication

  1. All software and data resides on NMA servers within the NMA DMZ
  2. API traffic uses HTTPS, HTTP traffic is redirected to HTTPS
  3. The Public and Internal APIs share the same endpoint IRIs; extra functionality is accessed by using API keys with special privileges
  4. Public API endpoints do not expose any restricted content or content fields
  5. Public API endpoints provide basic access without API key authorisation (to allow people to evaluate the service and build simple applications), more advanced usage is restricted with API key authorisation
  6. Anyone is able to sign up for a public API key with their email address, with no administrator approval required. The system stores a record of the API key and email address pair.
  7. Private API content and content fields are restricted with API key authorisation
  8. Private API keys are manually generated by NMA staff for trusted users only
  9. API keys can be invalidated by an administrator.
  10. Rate limiting is used to reduce the risk of accidental or malicious denial of service
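
A client-side sketch of how the key and rate-limiting rules above might be handled: send an API key when one is held, and back off when throttled. The apikey header name and the use of HTTP 429 for throttling are assumptions.

```python
import time
import requests

BASE_URL = "https://api.example.museum"  # placeholder host; all traffic is HTTPS

def get_with_backoff(path, api_key=None, **params):
    """Call the API, sending a key if held, and retry politely when rate limited."""
    headers = {"apikey": api_key} if api_key else {}   # header name is an assumption
    for attempt in range(5):
        r = requests.get(f"{BASE_URL}{path}", params=params,
                         headers=headers, timeout=30)
        if r.status_code != 429:          # 429 = throttled (assumed response code)
            r.raise_for_status()
            return r.json()
        time.sleep(2 ** attempt)          # exponential backoff before retrying
    raise RuntimeError("rate limit not cleared after retries")
```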

7. Hardware and network

ETL (Extract-Transform-Load)

Data is extracted from the EMu collection management database and the Piction digital asset management system, merged together into a core graph database, then extracted and stored in Solr indexes for delivery.

ETL data flow:

  1. Nightly export of data files from EMu and Piction
  2. EMu and Piction exports are processed by a custom import script and merged in the core graph database on the server
  3. Merged records are extracted and processed by a custom conversion script and loaded into the Solr database on the server
  4. The API application XProc 'shim' queries the Solr index to return data.
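
A highly simplified sketch of steps 2 and 3, omitting the intermediate graph database and assuming line-delimited JSON exports and Solr's standard JSON update handler. The file names, core name and field names are placeholders.

```python
import json
import requests

# Placeholder Solr core and host; the real index lives on the NMA servers.
SOLR_UPDATE = "http://localhost:8983/solr/collection/update?commit=true"

def load_export(path):
    """Read a nightly export file (assumed format: one JSON record per line)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Step 1: nightly exports from EMu and Piction (file names are placeholders).
objects = load_export("emu_objects.jsonl")
media = load_export("piction_media.jsonl")

# Step 2: merge media onto object records via the shared identifying number
# ('irn' / 'object_irn' are placeholder field names).
media_by_object = {}
for m in media:
    media_by_object.setdefault(m["object_irn"], []).append(m["image_url"])
for rec in objects:
    rec["image_urls"] = media_by_object.get(rec["irn"], [])

# Step 3: load the merged records into the Solr index queried by the API shim.
requests.post(SOLR_UPDATE, json=objects, timeout=120).raise_for_status()
```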

Server architecture

  1. One or more identical production Linux Ubuntu servers, each running an Apache Solr index of the data along with the API shim. Additional servers can be added when extra capacity is required
  2. The production servers run inside the NMA DMZ and are only accessible via a public facing load balancer
  3. One production server also hosts the ETL process, consisting of the graph database and associated conversion scripts
  4. An additional identical server acts as a staging version, for previewing changes to any of the components
  5. A reverse proxy caches traffic to the API. The cache is invalidated and warmed each night after the nightly data import. The proxy is TBD (either cloud based or a local Varnish server).
  6. Image files are stored on a separate server and mounted via Samba shares
  7. The core graph database and Solr indexes are updated nightly via a custom import script that runs on EMu XML exports.

Deployment and updating

  1. Development environment - there is no outside access to the NMA servers so developers run their own copies of the server for development and testing purposes
  2. Staging environment - all development is deployed to the NMA's staging server first to be reviewed before deployment to the production server/s
  3. Change request process - NMA IT change request forms (including detailed instructions for the deployment process) are completed for all deployments to the production servers
  4. Deployment process - deployment is scripted where possible, and the scripts are also used when deploying to the staging server (to test the deployment scripts themselves)
  5. Service interruptions - interruptions are minimised by diverting all traffic to a second server (i.e. a second production server or staging server) while the first is unavailable, and then vice versa
  6. Snapshotting - prior to each deployment a snapshot of the server is taken, and in the case of a catastrophic deployment failure the server will be rolled back to the snapshot.

Appendices

A1. Data schema

TODO

A2. API endpoints

TODO