Get the data structure of a particular database #197
-
I previously used the OECD R package, but it seems development has stopped and the OECD API has changed. I'm now trying to rebuild its functionality to work with the new API. So far, I've successfully retrieved the available datasets using this approach: # Libraries ----
library(httr2)
library(xml2)
library(dplyr)
get_data_sets <- function(description = FALSE) {
# Structure queries ----
## Syntax ----
### Check: https://github.com/sdmx-twg/sdmx-rest/blob/develop/doc/structures.md
#### protocol://ws-entry-point/structure/{artefactType}/{agencyID}/{resourceID}/{version}?{detail}&{references}
### Endpoint ----
# sdmx_v2.1_registry <- "https://sdmx.oecd.org/public"
restful_api <- "https://sdmx.oecd.org/public/rest"
restful_api_version <- "v2"
structure <- "structure"
artefact_type <- "dataflow"
agency_id <- "*" # All
resource_id <- "*" # All
version <- "+" # Latest stable version
detail <- "full"
references <- "none"
## Request ----
request <- httr2::request(base_url = restful_api) |>
httr2::req_url_path_append(restful_api_version) |>
httr2::req_url_path_append(structure) |>
httr2::req_url_path_append(artefact_type) |>
httr2::req_url_path_append(agency_id) |>
httr2::req_url_path_append(resource_id) |>
httr2::req_url_path_append(version) |>
httr2::req_url_query(detail = detail) |>
httr2::req_url_query(references = references)
## HTTP content negotiation ----
# sdmx_xml_v2.0 <- "application/vnd.sdmx.structure+xml; version=2.0; charset=utf-8"
sdmx_xml_v2.1 <- "application/vnd.sdmx.structure+xml; version=2.1; charset=utf-8"
# json <- "application/json; charset=utf-8"
# sdmx_json_v1.0 <- "application/vnd.sdmx.structure+json; version=1.0; charset=utf-8"
# sdmx_json_v1.0.0wd <- "application/vnd.sdmx.structure+json; version=1.0.0-wd; charset=utf-8"
# xml_v2.1 <- "application/xml; version=2.1; charset=utf-8"
# rdf_xml <- "application/rdf+xml; charset=utf-8"
request <- request |>
httr2::req_headers("Accept" = sdmx_xml_v2.1)
## Response ----
response <- request |>
httr2::req_perform()
## Extract body from response ----
xml_data <- response |>
httr2::resp_body_xml()
# Get datasets ----
structure_dataflow <- xml_data |>
xml2::xml_find_all(xpath = ".//structure:Dataflow[starts-with(@agencyID, 'OECD')]")
attributes <- c("id", "agencyID", "version")
data_sets <- structure_dataflow |>
xml2::xml_attrs() |>
dplyr::bind_rows() |>
dplyr::select(dplyr::all_of(x = attributes)) |>
setNames(nm = c("dataflow_id", "agency_id", "dataflow_version")) |>
dplyr::relocate(agency_id, .before = dataflow_id)
dataflow_name <- structure_dataflow |>
# https://github.com/r-lib/xml2/issues/237
xml2::xml_find_first(xpath = "./common:Name[@xml:lang='en']") |>
xml2::xml_text()
data_sets <- data_sets |>
dplyr::mutate(dataflow_name = dataflow_name)
if (description) {
dataflow_description <- structure_dataflow |>
# https://github.com/r-lib/xml2/issues/237
xml2::xml_find_first(xpath = "./common:Description[@xml:lang='en']") |>
xml2::xml_text()
data_sets <- data_sets |>
dplyr::mutate(dataflow_description = dataflow_description)
}
return(data_sets)
}
data_sets <- get_data_sets() Created on 2024-07-03 with reprex v2.1.0 However, I haven't yet succeeded in building a function to extract the data structure of a specific dataset. Here's an example from the package vignettes, using a particular dataset and a variable: library(rsdmx)
library(tibble)
library(dplyr)
dsdUrl <- "https://sdmx.oecd.org/public/rest/v2/structure/dataflow/OECD.CTP.TPS/DSD_REV_COMP_LAC@DF_RSLAC/+?detail=full&references=all"
dsd <- readSDMX(dsdUrl)
cls <- slot(dsd, "codelists")
codelists <- sapply(slot(cls, "codelists"), function(x) slot(x, "id"))
codelist <- as.data.frame(slot(dsd, "codelists"), codelistId = "CL_AREA")
codelist |> as_tibble() |>
select(id, label.en, name.en, en)
#> # A tibble: 558 × 4
#> id label.en name.en en
#> <chr> <chr> <chr> <chr>
#> 1 AUS Australia Australia <NA>
#> 2 AUT Austria Austria <NA>
#> 3 BEL Belgium Belgium <NA>
#> 4 CAN Canada Canada <NA>
#> 5 CHL Chile Chile <NA>
#> 6 COL Colombia Colombia <NA>
#> 7 CRI Costa Rica Costa Rica <NA>
#> 8 CZE Czechia Czechia <NA>
#> 9 DNK Denmark Denmark <NA>
#> 10 EST Estonia Estonia <NA>
#> # ℹ 548 more rows Created on 2024-07-03 with reprex v2.1.0 My initial approach was incorrect, as the dataset I referenced only pertains to Latin America and the Caribbean. I also attempted using This successfully extracted the relevant variables, but I lost the ability to retrieve essential metadata like I would greatly appreciate any guidance or resources on how to correctly extract the complete data structure, including metadata, from a specific dataset. |
Beta Was this translation helpful? Give feedback.
Replies: 2 comments
-
Hello, I've updated the embedded OECD service provider (see #198). You can reinstall rsdmx from Github to use it. To get the list of dataflows, you can do: sdmx.dataflows <- readSDMX(providerId = "OECD", resource = "dataflow")
dataflows <- as.data.frame(sdmx.dataflows) You will see in this list that each To get a datastructure, you should pick this dsdRef <- dataflows[dataflows$id == "DSD_REV_COMP_LAC@DF_RSLAC",]$dsdRef
sdmx.dsd <- readSDMX(providerId = "OECD", resource = "datastructure", resourceId = dsdRef)
#codelists
cls <- slot(sdmx.dsd, "codelists")
codelists <- sapply(slot(cls, "codelists"), function(x) slot(x, "id"))
cl_area = as.data.frame(cls, codelistId = "CL_AREA")
The DSD you get here is the one that the service provider associated with the dataflow and AFAIK it is fixed to a dataflow. In case of a regional dataset targeting LAC, it doesn't mean necessarily that the DSD the service associated is focused on LAC. I suppose their choice was to define regional dataflows, but to stick with a common global DSD, where structure is the same, although reference data (codelists/concepts) coverage may differ. Hope this helps |
Beta Was this translation helpful? Give feedback.
-
Thank you Emmanuel Blondel for your help. |
Beta Was this translation helpful? Give feedback.
Hello, I've updated the embedded OECD service provider (see #198). You can reinstall rsdmx from Github to use it.
To get the list of dataflows, you can do:
You will see in this list that each
dataflow
is associated with a data structure definition (DSD) identified by adsdRef
To get a datastructure, you should pick this
dsdRef
and then query the datastructure. In your case, you interrogate dataflowDSD_REV_COMP_LAC@DF_RSLAC
, the DSD isDSD_REV_COMP_LAC