Harvesting in bulk from the Collection API
The Collection API allows users to "harvest" data in bulk, by making repeated queries.
The API also provides a feature to allow users to efficiently maintain a local cache of the data that interest them over time, without having to harvest all that data again and again, by retrieving only changes to the data.
To harvest records, you must specify some search criteria. If you wish to harvest all records, you should use the text query parameter with a value of "*"; this will return any record which contains any text at all (i.e. all records). For example, to harvest all objects, use the URL:
/object?text=*
Different base URLs can be used to harvest not just object records, but also place, party, collection and narrative records. For example, to harvest all place records, use the URL:
/place?text=*
For performance reasons, the API does not always immediately return all the records which match a query. Instead, the matching records are divided into "pages" containing a certain number of records, and the initial query returns only the first such page of matching records. To retrieve the second and subsequent pages, it's necessary to make more API calls.
By default, the number of records returned in a single page is 50, but it can be increased (up to a maximum of 100) by using the limit parameter, e.g.
/object?text=*&limit=100
For a harvester, it usually makes sense to increase the page size to the maximum.
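A harvester might build its starting URLs with a small helper like the one below. This is a minimal sketch, not part of the API: the function name harvest_url is hypothetical, and the host https://data.nma.gov.au is assumed from the example URLs elsewhere on this page.

```python
from urllib.parse import urlencode

# Host assumed from the example URLs on this page.
BASE_URL = "https://data.nma.gov.au"


def harvest_url(record_type, text="*", limit=100):
    """Build a search URL for one record type (hypothetical helper).

    record_type is one of: object, place, party, collection, narrative.
    limit is the page size; the documented maximum is 100.
    """
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    # Keep "*" unescaped so the URL matches the documented form.
    query = urlencode({"text": text, "limit": limit}, safe="*")
    return f"{BASE_URL}/{record_type}?{query}"


print(harvest_url("object"))
# https://data.nma.gov.au/object?text=*&limit=100
```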
To retrieve a sequence of pages, you would first issue an API request for the records which you wish to harvest, e.g.
https://data.nma.gov.au/object?text=bark
The result would be a JSON document in one of the two data formats which the API offers: a "simple" JSON conforming to the JSON-API specification, or a "Linked Data" JSON-LD format. The preferred way to specify a format is to send an HTTP Accept header of either application/vnd.api+json or application/json for the "simple" JSON format, or application/ld+json for the JSON-LD format. Alternatively, a format parameter can be appended to the query URL, with the value simple or json-ld.
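As an illustration, the Accept header could be set with Python's standard urllib.request module. This is only a sketch: the ACCEPT mapping and the build_request helper are hypothetical names, and the request is constructed here without being sent.

```python
from urllib.request import Request

# Media types as documented above, keyed by the values of the format parameter.
ACCEPT = {
    "simple": "application/vnd.api+json",
    "json-ld": "application/ld+json",
}


def build_request(url, fmt="simple"):
    """Return a Request that asks for the given format via the Accept header."""
    return Request(url, headers={"Accept": ACCEPT[fmt]})


req = build_request("https://data.nma.gov.au/object?text=bark", "json-ld")
print(req.get_header("Accept"))
# application/ld+json
```

Passing the resulting Request to urllib.request.urlopen would issue the call.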
When the "simple" JSON format is used, the response will contain a data array containing the actual records (the contents of the array are replaced below with "...", for clarity). If there are more pages of data, the response will include a "links" object containing a "next" string whose value is a relative URL to request the next page of data.
This query URL asks for all objects which contain the word bark:
https://data.nma.gov.au/object?text=bark
The "simple" JSON response to this query would be:
{
"data": [ ... ],
"meta": {
"results": 2410
},
"links": {
"next": "object?text=bark&offset=50"
}
}
To retrieve the second page, you would issue an HTTP request using the next URL:
https://data.nma.gov.au/object?text=bark&offset=50
To complete the harvest, repeat this process until you receive an API response which does not include a next link:
https://data.nma.gov.au/object?text=bark&offset=2400
{
"data": [ ... ],
"meta": {
"results": 2410
}
}
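The paging loop described above for the "simple" JSON format could be sketched as follows. The names fetch_simple and harvest_simple are hypothetical, and error handling and rate limiting are left out. The relative next link is resolved against the URL of the page that contained it.

```python
import json
from urllib.parse import urljoin
from urllib.request import Request, urlopen


def fetch_simple(url):
    """Fetch one page in the "simple" JSON format (hypothetical helper)."""
    req = Request(url, headers={"Accept": "application/vnd.api+json"})
    with urlopen(req) as resp:
        return json.load(resp)


def harvest_simple(start_url, fetch=fetch_simple):
    """Yield every record, following links.next until it is absent."""
    url = start_url
    while url:
        page = fetch(url)
        yield from page.get("data", [])
        next_link = page.get("links", {}).get("next")
        # "next" is a relative URL, so resolve it against the current page URL.
        url = urljoin(url, next_link) if next_link else None
```

A harvester would then iterate over harvest_simple("https://data.nma.gov.au/object?text=bark&limit=100"). The fetch argument exists so that the loop can be exercised without network access.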
When using the JSON-LD format, the response will contain an aggregates array containing the actual records (the contents of the array are replaced below with "...", for clarity). If there are more pages of data, the response will include a "next" string whose value is a relative URL to request the next page of data.
This query URL asks for all objects which contain the word bark:
https://data.nma.gov.au/object?text=bark
The JSON-LD response to this query would be:
{
"context": "/context.json",
"id": "object?text=bark",
"type": "Aggregation",
"next": "object?text=bark&offset=50",
"entities": 2410,
"aggregates": [ ... ]
}
To retrieve the second page, you would issue an HTTP request using the next URL:
https://data.nma.gov.au/object?text=bark&offset=50
To complete the harvest, repeat this process until you receive an API response which does not include a next link:
https://data.nma.gov.au/object?text=bark&offset=2400
{
"context": "/context.json",
"id": "object?text=bark",
"type": "Aggregation",
"entities": 2410,
"aggregates": [ ... ]
}
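For the JSON-LD format, the same loop reads the records from the aggregates array and the link from the top-level next string. Again this is a sketch under assumptions: fetch_json_ld and harvest_json_ld are hypothetical names, and no retry or error handling is shown.

```python
import json
from urllib.parse import urljoin
from urllib.request import Request, urlopen


def fetch_json_ld(url):
    """Fetch one page in the JSON-LD format (hypothetical helper)."""
    req = Request(url, headers={"Accept": "application/ld+json"})
    with urlopen(req) as resp:
        return json.load(resp)


def harvest_json_ld(start_url, fetch=fetch_json_ld):
    """Yield every record, following the top-level "next" until it is absent."""
    url = start_url
    while url:
        page = fetch(url)
        yield from page.get("aggregates", [])
        next_link = page.get("next")
        # "next" is a relative URL, so resolve it against the current page URL.
        url = urljoin(url, next_link) if next_link else None
```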
The datestamp field records the date on which the API resource last changed. Resources in the API may contain data drawn from a number of internal data sources; the datestamp field reflects the most recently updated one of those sources. For instance, an object resource may change when the internal object record has changed, or when a new photograph of the object is added, or when one of the people related to the object has an updated description. NB the datestamp field is different to the modified field, which records the date at which only the primary internal database record for the API resource was changed.
Harvester applications using the API to maintain a cache of the data can use the datestamp field to request data which has been updated since the previous time the harvest was run. To request data which changed on a specific day, specify the day as the value of the datestamp parameter, like so:
datestamp=2018-08-13
To retrieve records which changed between a pair of dates, a harvester should format the two dates by inserting +TO+ (or equivalently %20TO%20; NB both + and %20 are used in a URL to represent a space) between them, and enclosing the pair in square brackets. For example, to retrieve records updated from the 13th of August 2018 until the year 2999, a harvester could use this query parameter: datestamp=[2018-08-13%20TO%202999]
The full query URL would be:
https://data.nma.gov.au/object?text=*&datestamp=[2018-08-13%20TO%202999]
or
https://data.nma.gov.au/object?text=*&datestamp=[2018-08-13+TO+2999]
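An incremental harvester could generate such a URL from the date of its previous run. The sketch below uses hypothetical helper names (datestamp_range, changes_url) and follows the example above in using the year 2999 as an open-ended upper bound.

```python
from datetime import date


def datestamp_range(since, until="2999"):
    """Format a datestamp range such as [2018-08-13+TO+2999]."""
    return f"[{since.isoformat()}+TO+{until}]"


def changes_url(record_type, since):
    """Build a URL for all records of one type changed on or after `since`."""
    return (
        f"https://data.nma.gov.au/{record_type}"
        f"?text=*&datestamp={datestamp_range(since)}"
    )


print(changes_url("object", date(2018, 8, 13)))
# https://data.nma.gov.au/object?text=*&datestamp=[2018-08-13+TO+2999]
```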
On very rare occasions, records may be deleted from the Collection API. e.g.
https://data.nma.gov.au/object/47157#
Making an HTTP request using the identifier of a deleted record will yield a response with an HTTP status code of 410 ("Gone").
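A harvester that checks records one at a time might detect the 410 status as shown below. This is a sketch only: is_deleted is a hypothetical helper built on Python's standard urllib, which raises HTTPError for 4xx responses; the opener argument is there so the check can be tried without network access.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def is_deleted(record_url, opener=urlopen):
    """Return True if the API answers 410 ("Gone") for this record URL."""
    try:
        with opener(record_url):
            return False
    except HTTPError as err:
        if err.code == 410:
            return True
        raise  # any other HTTP error is not a deletion
```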
Deleted records are also included in the results of a search query such as:
https://data.nma.gov.au/object?text=*
It's also possible to harvest a list of only deleted records, by specifying the status_code search field in the query URL, with a value of 410, e.g.
https://data.nma.gov.au/object?status_code=410
In the "simple" JSON format, the deletion list will look like:
{
"data": [
{
"id": "47157",
"type": "object",
"_meta": {
"modified": "2018-09-03",
"statusCode": "410",
"reason": "Gone"
}
},
{
"id": "230014",
"type": "object",
"_meta": {
"modified": "2018-09-03",
"statusCode": "410",
"reason": "Gone"
}
}
],
"meta": {
"results": 2
}
}
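A deletion list in the "simple" JSON format could be applied to a local cache roughly as follows. The cache here is assumed to be a plain dictionary keyed by record id, and both function names are hypothetical.

```python
def deleted_ids(page):
    """Return the ids of records marked with statusCode "410" in one page."""
    return [
        record["id"]
        for record in page.get("data", [])
        if record.get("_meta", {}).get("statusCode") == "410"
    ]


def apply_deletions(cache, page):
    """Remove deleted records from a dict-based cache keyed by record id."""
    for record_id in deleted_ids(page):
        cache.pop(record_id, None)  # ignore ids that were never cached
    return cache
```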
In JSON-LD format, the same list would look like this:
{
"context": "/context.json",
"id": "object?status_code=410",
"type": "Aggregation",
"entities": 2,
"aggregates": [
{
"@context": "/context.json",
"id": "http://data.nma.gov.au/object/47157#",
"type": "PhysicalObject",
"identified_by": {
"id": "http://data.nma.gov.au/object/47157#repositorynumber",
"type": "Identifier",
"classified_as": {
"id": "http://vocab.getty.edu/aat/300404621",
"type": "Type",
"label": "repository number"
},
"value": "47157"
},
"documented_in": {
"id": "http://data.nma.gov.au/object/47157",
"type": "http://www.cidoc-crm.org/cidoc-crm/E31_Document",
"modified": "2018-09-03",
"subject_to": {
"id": "http://data.nma.gov.au/term/metadata-rights",
"type": "Right",
"component": {
"id": "https://creativecommons.org/licenses/by-nc/4.0/",
"type": "Right",
"label": "CC BY-NC"
},
"label": "Copyright National Museum of Australia / CC BY-NC"
},
"response": {
"type": "Response",
"reason_phrase": "Gone",
"status_code_value": "410"
}
}
},
{
"@context": "/context.json",
"id": "http://data.nma.gov.au/object/230014#",
"type": "PhysicalObject",
"identified_by": {
"id": "http://data.nma.gov.au/object/230014#repositorynumber",
"type": "Identifier",
"classified_as": {
"id": "http://vocab.getty.edu/aat/300404621",
"type": "Type",
"label": "repository number"
},
"value": "230014"
},
"documented_in": {
"id": "http://data.nma.gov.au/object/230014",
"type": "http://www.cidoc-crm.org/cidoc-crm/E31_Document",
"modified": "2018-09-03",
"subject_to": {
"id": "http://data.nma.gov.au/term/metadata-rights",
"type": "Right",
"component": {
"id": "https://creativecommons.org/licenses/by-nc/4.0/",
"type": "Right",
"label": "CC BY-NC"
},
"label": "Copyright National Museum of Australia / CC BY-NC"
},
"response": {
"type": "Response",
"reason_phrase": "Gone",
"status_code_value": "410"
}
}
}
]
}
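In the JSON-LD format, the status code sits deeper in each record, under documented_in and response. A sketch of extracting the deleted identifiers, assuming the structure shown in the example above (the function name is hypothetical):

```python
def deleted_ids_json_ld(page):
    """Return repository numbers of records whose response status is "410"."""
    ids = []
    for entity in page.get("aggregates", []):
        response = entity.get("documented_in", {}).get("response", {})
        if response.get("status_code_value") == "410":
            # identified_by.value carries the bare identifier, e.g. "47157"
            ids.append(entity["identified_by"]["value"])
    return ids
```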