
Harvesting in bulk from the Collection API

Conal Tuohy edited this page Feb 26, 2019 · 14 revisions

The Collection API allows users to "harvest" data in bulk, by making repeated queries.

The API also provides a feature to allow users to efficiently maintain a local cache of the data that interest them over time, without having to harvest all that data again and again, by retrieving only changes to the data.

Specify a set of records to harvest

To harvest records, you must specify some search criteria. To harvest all records, use the text query parameter with a value of "*"; this matches any record which contains any text at all (i.e. all records). For example, to harvest all objects, use the URL:

/object?text=*

Different base URLs can be used to harvest not just object records, but also place, party, collection and narrative records. For example, to harvest all place records, use the URL:

/place?text=*

Pagination

For performance reasons, the API does not always immediately return all the records which match a query. Instead, the matching records are divided into "pages" containing a certain number of records, and the initial query returns only the first such page of matching records. To retrieve the second and subsequent pages, it's necessary to make more API calls.

Page size

By default, the number of records returned in a single page is 50, but it can be increased (up to a maximum of 100) by using the limit parameter. e.g.

/object?text=*&limit=100

For a harvester, it usually makes sense to increase the page size to the maximum.
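As a minimal sketch, a harvester written in Python might assemble such a URL as follows. The function name build_harvest_url is illustrative only, the host is taken from the examples later on this page, and the lower bound of 1 on limit is an assumption (this page only states the maximum of 100).

```python
from urllib.parse import urlencode

# Host taken from the example URLs on this page.
BASE_URL = "https://data.nma.gov.au"


def build_harvest_url(record_type: str, text: str = "*", limit: int = 100) -> str:
    """Return a search URL for one record type with the given page size."""
    # The upper bound of 100 is documented above; the lower bound is an assumption.
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    # safe="*" keeps the wildcard readable instead of encoding it as %2A.
    query = urlencode({"text": text, "limit": limit}, safe="*")
    return f"{BASE_URL}/{record_type}?{query}"


if __name__ == "__main__":
    print(build_harvest_url("object"))
    print(build_harvest_url("place", limit=50))
```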

Retrieving subsequent pages

To retrieve a sequence of pages, you would first issue an API request for the records which you wish to harvest, e.g.

https://data.nma.gov.au/object?text=bark

The result would be a JSON document in one of the two data formats which the API offers: a "simple" JSON conforming to the JSON-API specification, or a "Linked Data" JSON-LD format. The preferred way to specify a format is to send an HTTP Accept header: either application/vnd.api+json or application/json for the "simple" JSON format, or application/ld+json for the JSON-LD format. Alternatively, a format parameter can be appended to the query URL, with the value simple or json-ld.
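The Accept header approach could be sketched in Python like this, using only the standard library. The names ACCEPT_HEADERS and make_request are illustrative and not part of the API; the code only builds the request and does not send it.

```python
from urllib.request import Request

# Media types as listed above, keyed by the values of the "format" parameter.
ACCEPT_HEADERS = {
    "simple": "application/vnd.api+json",
    "json-ld": "application/ld+json",
}


def make_request(url: str, data_format: str = "simple") -> Request:
    """Build (but do not send) a request asking for the given data format."""
    return Request(url, headers={"Accept": ACCEPT_HEADERS[data_format]})


if __name__ == "__main__":
    request = make_request("https://data.nma.gov.au/object?text=bark", "json-ld")
    print(request.get_header("Accept"))
```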

When the "simple" JSON format is used, the response will contain a data array containing the actual records (the contents of the array are replaced below with "...", for clarity). If there are more pages of data, the response will include a "links" object containing a "next" string whose value is a relative URL to request the next page of data.

This query URL asks for all objects which contain the word bark:

https://data.nma.gov.au/object?text=bark

The "simple" JSON response to this query would be:

{
  "data": [ ... ],
  "meta": {
    "results": 2410
  },
  "links": {
    "next": "object?text=bark&offset=50"
  }
}

To retrieve the second page, you would issue an HTTP request using the next URL:

https://data.nma.gov.au/object?text=bark&offset=50

To complete the harvest, repeat this process until you receive an API response which does not include a next link:

https://data.nma.gov.au/object?text=bark&offset=2400

{
  "data": [ ... ],
  "meta": {
    "results": 2410
  }
}
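The loop described above for the "simple" format could be sketched in Python as below. This is one possible shape rather than a reference implementation: the names fetch_simple and harvest_simple are hypothetical, and error handling, retries and rate limiting are left out. The fetch function is passed in as a parameter so the loop can be tried without network access.

```python
import json
from typing import Callable, Iterator
from urllib.parse import urljoin
from urllib.request import Request, urlopen


def fetch_simple(url: str) -> dict:
    """Fetch one page in the "simple" JSON format."""
    request = Request(url, headers={"Accept": "application/vnd.api+json"})
    with urlopen(request) as response:
        return json.load(response)


def harvest_simple(start_url: str,
                   fetch: Callable[[str], dict] = fetch_simple) -> Iterator[dict]:
    """Yield every record, following links.next until it is absent."""
    url = start_url
    while url:
        page = fetch(url)
        yield from page.get("data", [])
        next_link = page.get("links", {}).get("next")
        # The next link is relative, so resolve it against the current URL.
        url = urljoin(url, next_link) if next_link else None
```

A harvester would then iterate over harvest_simple("https://data.nma.gov.au/object?text=bark&limit=100") and store each record as it arrives.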

When using the JSON-LD format, the response will contain an aggregates array containing the actual records (the contents of the array are replaced below with "...", for clarity). If there are more pages of data, the response will include a "next" string whose value is a relative URL to request the next page of data.

This query URL asks for all objects which contain the word bark:

https://data.nma.gov.au/object?text=bark

The JSON-LD response to this query would be:

{
  "context": "/context.json",
  "id": "object?text=bark",
  "type": "Aggregation",
  "next": "object?text=bark&offset=50",
  "entities": 2410,
  "aggregates": [ ... ]
}

To retrieve the second page, you would issue an HTTP request using the next URL:

https://data.nma.gov.au/object?text=bark&offset=50

To complete the harvest, repeat this process until you receive an API response which does not include a next link:

https://data.nma.gov.au/object?text=bark&offset=2400

{
  "context": "/context.json",
  "id": "object?text=bark",
  "type": "Aggregation",
  "entities": 2410,
  "aggregates": [ ... ]
}
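For the JSON-LD format the same loop applies, except that records are read from aggregates and the next link sits at the top level of the response. A minimal Python sketch follows; the names fetch_json_ld and harvest_json_ld are hypothetical, and the fetch function is injectable so the loop can be exercised offline.

```python
import json
from typing import Callable, Iterator
from urllib.parse import urljoin
from urllib.request import Request, urlopen


def fetch_json_ld(url: str) -> dict:
    """Fetch one page in the JSON-LD format."""
    request = Request(url, headers={"Accept": "application/ld+json"})
    with urlopen(request) as response:
        return json.load(response)


def harvest_json_ld(start_url: str,
                    fetch: Callable[[str], dict] = fetch_json_ld) -> Iterator[dict]:
    """Yield every entity, following the top-level next link until it is absent."""
    url = start_url
    while url:
        page = fetch(url)
        yield from page.get("aggregates", [])
        next_link = page.get("next")
        # The next link is relative, so resolve it against the current URL.
        url = urljoin(url, next_link) if next_link else None
```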

Incremental harvesting

The datestamp field records the date on which the API resource last changed. Resources in the API may contain data drawn from a number of internal data sources; the datestamp field reflects the most recently updated one of those sources. For instance an object resource may change when the internal object record has changed, or when a new photograph of the object is added, or when one of the people related to the object has an updated description. NB the datestamp field is different to the modified field, which records the date at which only the primary internal database record for the API resource was changed.

Harvester applications using the API to maintain a cache of the data can use the datestamp field to request data which has been updated since the previous time the harvest was run. To request data which changed on a specific day, specify the day as the value of the datestamp parameter, like so:

datestamp=2018-08-13

To retrieve records which changed between a pair of dates, a harvester should join the two dates with +TO+ (or equivalently %20TO%20; NB both + and %20 are used in a URL to represent a space) and enclose the pair in square brackets. For example, to retrieve records updated from the 13th of August 2018 until the year 2999, a harvester could use datestamp=[2018-08-13%20TO%202999], as in:

https://data.nma.gov.au/object?text=*&datestamp=[2018-08-13%20TO%202999]

or

https://data.nma.gov.au/object?text=*&datestamp=[2018-08-13+TO+2999]
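A harvester that remembers the date of its previous run could build this kind of query as sketched below in Python. The function names and the choice of 2999 as an open-ended upper bound simply mirror the example above; they are illustrative, not part of the API.

```python
from datetime import date
from urllib.parse import urlencode


def datestamp_range(since: date, until: str = "2999") -> str:
    """Format a date range as [YYYY-MM-DD TO until], before URL encoding."""
    return f"[{since.isoformat()} TO {until}]"


def incremental_url(record_type: str, since: date) -> str:
    """Build a URL for records whose datestamp falls on or after `since`."""
    params = {"text": "*", "datestamp": datestamp_range(since)}
    # Spaces are encoded as "+"; "*", "[" and "]" are left as in the examples above.
    query = urlencode(params, safe="*[]")
    return f"https://data.nma.gov.au/{record_type}?{query}"


if __name__ == "__main__":
    print(incremental_url("object", date(2018, 8, 13)))
```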

Deleted records

On very rare occasions, records may be deleted from the Collection API. e.g.

https://data.nma.gov.au/object/47157#

Making an HTTP request using the identifier of a deleted record will yield a response with an HTTP status code of 410 ("Gone").
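A harvester could check a single cached record for deletion along these lines. This Python sketch assumes the standard library's urlopen, which raises HTTPError for a 410 response; the function name is_deleted and the opener parameter (there only so the check can be tried without network access) are hypothetical.

```python
from typing import Callable
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def is_deleted(url: str, opener: Callable = urlopen) -> bool:
    """Return True if the API answers 410 ("Gone") for this record URL."""
    request = Request(url, headers={"Accept": "application/vnd.api+json"})
    try:
        with opener(request):
            return False  # any successful response means the record still exists
    except HTTPError as error:
        if error.code == 410:
            return True
        raise  # other errors (404, 500, ...) are not treated as deletions
```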

Deleted records are also included in the results of a search, for a query such as:

https://data.nma.gov.au/object?text=*

It's also possible to harvest a list of only deleted records, by specifying the status_code search field in the query URL, with a value of 410. e.g.

https://data.nma.gov.au/object?status_code=410

In the "simple" JSON format, the deletion list will look like:

{
  "data": [
    {
      "id": "47157",
      "type": "object",
      "_meta": {
        "modified": "2018-09-03",
        "statusCode": "410",
        "reason": "Gone"
      }
    },
    {
      "id": "230014",
      "type": "object",
      "_meta": {
        "modified": "2018-09-03",
        "statusCode": "410",
        "reason": "Gone"
      }
    }
  ],
  "meta": {
    "results": 2
  }
}
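To apply such a deletion list to a local cache, a harvester might do something like the following Python sketch. It assumes the cache is a plain dictionary keyed by record id; the function names deleted_ids and purge_deleted are illustrative.

```python
def deleted_ids(page: dict) -> list[str]:
    """Return the ids of records marked with statusCode 410 in a "simple" page."""
    return [
        record["id"]
        for record in page.get("data", [])
        if record.get("_meta", {}).get("statusCode") == "410"
    ]


def purge_deleted(cache: dict, page: dict) -> int:
    """Remove deleted records from the cache; return how many were removed."""
    removed = 0
    for record_id in deleted_ids(page):
        if cache.pop(record_id, None) is not None:
            removed += 1
    return removed
```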

In JSON-LD format, the same list would look like this:

{
  "context": "/context.json",
  "id": "object?status_code=410",
  "type": "Aggregation",
  "entities": 2,
  "aggregates": [
    {
      "@context": "/context.json",
      "id": "http://data.nma.gov.au/object/47157#",
      "type": "PhysicalObject",
      "identified_by": {
        "id": "http://data.nma.gov.au/object/47157#repositorynumber",
        "type": "Identifier",
        "classified_as": {
          "id": "http://vocab.getty.edu/aat/300404621",
          "type": "Type",
          "label": "repository number"
        },
        "value": "47157"
      },
      "documented_in": {
        "id": "http://data.nma.gov.au/object/47157",
        "type": "http://www.cidoc-crm.org/cidoc-crm/E31_Document",
        "modified": "2018-09-03",
        "subject_to": {
          "id": "http://data.nma.gov.au/term/metadata-rights",
          "type": "Right",
          "component": {
            "id": "https://creativecommons.org/licenses/by-nc/4.0/",
            "type": "Right",
            "label": "CC BY-NC"
          },
          "label": "Copyright National Museum of Australia / CC BY-NC"
        },
        "response": {
          "type": "Response",
          "reason_phrase": "Gone",
          "status_code_value": "410"
        }
      }
    },
    {
      "@context": "/context.json",
      "id": "http://data.nma.gov.au/object/230014#",
      "type": "PhysicalObject",
      "identified_by": {
        "id": "http://data.nma.gov.au/object/230014#repositorynumber",
        "type": "Identifier",
        "classified_as": {
          "id": "http://vocab.getty.edu/aat/300404621",
          "type": "Type",
          "label": "repository number"
        },
        "value": "230014"
      },
      "documented_in": {
        "id": "http://data.nma.gov.au/object/230014",
        "type": "http://www.cidoc-crm.org/cidoc-crm/E31_Document",
        "modified": "2018-09-03",
        "subject_to": {
          "id": "http://data.nma.gov.au/term/metadata-rights",
          "type": "Right",
          "component": {
            "id": "https://creativecommons.org/licenses/by-nc/4.0/",
            "type": "Right",
            "label": "CC BY-NC"
          },
          "label": "Copyright National Museum of Australia / CC BY-NC"
        },
        "response": {
          "type": "Response",
          "reason_phrase": "Gone",
          "status_code_value": "410"
        }
      }
    }
  ]
}
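The equivalent extraction from the JSON-LD format could look like the Python sketch below. It assumes that identified_by and documented_in are single objects, as in the sample above; other records may be structured differently, so treat this as illustrative only. The function name deleted_ids_json_ld is hypothetical.

```python
def deleted_ids_json_ld(page: dict) -> list[str]:
    """Return identifier values of entities whose response status is 410."""
    ids = []
    for entity in page.get("aggregates", []):
        # Assumes documented_in and identified_by are objects, as in the sample.
        response = entity.get("documented_in", {}).get("response", {})
        if response.get("status_code_value") == "410":
            ids.append(entity["identified_by"]["value"])
    return ids
```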