diff --git a/DESCRIPTION b/DESCRIPTION index 9c62d4d6..f9ee4b23 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -39,10 +39,16 @@ Suggests: httptest2, httpuv, knitr, - leaflet, rmarkdown, testthat (>= 3.0.0), withr +Config/Needs/website: + DT, + ggplot2, + leaflet, + lvaudor/sequins, + sf, + tidyr VignetteBuilder: knitr, rmarkdown @@ -51,3 +57,4 @@ Encoding: UTF-8 LazyData: true Roxygen: list(markdown = TRUE) RoxygenNote: 7.2.3 + diff --git a/_pkgdown.yml b/_pkgdown.yml index 5987af1d..8e210220 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -55,7 +55,7 @@ navbar: href: articles/explore.html - text: glitter for dataBNF href: articles/glitter_for_dataBNF.html - - text: Bibliometry with HAL - href: articles/glitter_for_hal.html + - text: Bibliometry with HAL (French) + href: articles/glitter_bibliometry.html - text: Learn more about how glitter works href: articles/internals.html diff --git a/vignettes/articles/explore.Rmd b/vignettes/articles/explore.Rmd index d8567fd5..541854f2 100644 --- a/vignettes/articles/explore.Rmd +++ b/vignettes/articles/explore.Rmd @@ -1,5 +1,5 @@ --- -title: "How to explore a new base with glitter" +title: "How to explore a new database with glitter" --- ```{r, include = FALSE} @@ -20,13 +20,13 @@ Let's go through an example. ## A word of caution -Depending on the dataset you're working with, some queries might just ask _too much_ of the service so proceed with caution. +Depending on the dataset (or triplestore, in our context) you're working with, some queries might just ask _too much_ of the service so proceed with caution. When in doubt, add a `spq_head()` in your query pipeline, to ask less at a time, or use `spq_count()` to get a sense of how many results there are in total. ## Asking for a subset of all triples In the code below we'll ask for 10 triples. -Note that we use the `endpoint` argument of `spq_perform()` to indicate where to send the query, as well as the `request_type` argument. 
+Note that we use the `endpoint` argument of `spq_init()` to indicate where to send the query, as well as the `request_type` argument. How can one know whether a service needs `request_type = "body-form"`? @@ -56,6 +56,9 @@ Its results however can be... more or less helpful. ### Find which classes are declared +The **classes** occurring in the database will provide information as to **the kind of data** you will find there. +This can be as varied (across triplestores, or even in a single triplestore) as people, places, buildings, trees, or even things that are more abstract like concepts, philosophical currents, historical periods, etc. + At this point you might think you need to use some prefixes in your query. If these prefixes are present in `glitter::usual_prefixes`, you don't need to do anything. If they're not, use `glitter::spq_prefix()`. @@ -72,13 +75,17 @@ How many classes are defined in total? This query might be too big for the service. ```{r} -query_basis %>% +nclasses = query_basis %>% spq_add("?class a rdfs:Class") %>% spq_count() %>% spq_perform() + +nclasses ``` -We can do the same query for owl classes instead. +There are `r nclasses$n` classes declared in the triplestore. +Not so many that we could not get them all in one query, but definitely too many to show them all here! +Let us examine a few of these classes: ```{r} query_basis %>% @@ -92,12 +99,15 @@ Until now we could still be very in the dark as to what the service provides. ### Which classes have instances? +A class might be declared although **very few or even no items fall under it**. +Getting classes which do have instances actually corresponds to another triple pattern, "?item is an instance of ?class", a.k.a. 
"?item a ?class": + ```{r} query_basis %>% spq_add("?instance a ?class") %>% + spq_select(- instance) %>% spq_arrange(class) %>% spq_head(n = 10) %>% - spq_select(- instance) %>% spq_select(class, .spq_duplicate = "distinct") %>% spq_perform() %>% knitr::kable() @@ -105,11 +115,13 @@ query_basis %>% ### Which classes have the most instances? +The number of items falling into each class actually gives an even better overview of the contents of a triplestore: + ```{r} query_basis %>% spq_add("?instance a ?class") %>% spq_select(class, .spq_duplicate = "distinct") %>% - spq_count(class, sort = TRUE) %>% + spq_count(class, sort = TRUE) %>% # count items falling under class spq_head(20) %>% spq_perform() %>% knitr::kable() @@ -121,8 +133,8 @@ In this case the class names are quite self explanatory but if they were not we query_basis %>% spq_add("?instance a ?class") %>% spq_select(class, .spq_duplicate = "distinct") %>% - spq_label(class) %>% - spq_count(class, class_label, sort = TRUE) %>% + spq_label(class) %>% # label class to get class_label + spq_count(class, class_label, sort = TRUE) %>% # group by class and class_label to count spq_head(20) %>% spq_perform() %>% knitr::kable() @@ -153,6 +165,8 @@ query_basis %>% ### What properties are used? +Similarly to counting instances of classes, we wish to get a sense of the **properties that are actually used in the triplestore**. + ```{r} query_basis %>% spq_add("?s ?property ?o") %>% @@ -194,9 +208,12 @@ query_basis %>% knitr::kable() ``` -## What data is stored about a class's instance? +## What data is stored about a class's instances? + +The items falling into a given class are likely to be the subject (or object) of a common set of properties. +One might wish to explore the **properties actually associated with a class**. -For each organization, what data is there? +For instance, in LINDAS, which properties is the schema:Organization class associated with? 
```{r} query_basis %>% @@ -208,7 +225,7 @@ query_basis %>% knitr::kable() ``` -And for each postal address? +And which properties is the schema:PostalAddress class associated with? ```{r} query_basis %>% @@ -222,6 +239,8 @@ query_basis %>% ## Which data or property name includes a certain substring? +Let us examine whether LINDAS contains some data related to water, by searching for the strings "hydro" or "Hydro": + ```{r} query_basis %>% spq_add("?s ?p ?o") %>% @@ -234,7 +253,7 @@ query_basis %>% ## An example query based on what we now know - +To wrap it up, let us now use the LINDAS triplestore for an actual data query: we could for instance try and collect all organizations that have "swiss" in their name: ```{r} query_basis %>% diff --git a/vignettes/articles/glitter_bibliometry.Rmd b/vignettes/articles/glitter_bibliometry.Rmd new file mode 100644 index 00000000..8727f634 --- /dev/null +++ b/vignettes/articles/glitter_bibliometry.Rmd @@ -0,0 +1,235 @@ +--- +title: "glitter for HAL (en français)" +author: "Lise Vaudor" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{glitter for HAL} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +``` + +This article deals with queries on the [HAL](https://hal.science/) (for "Hyper Articles en Ligne") triplestore, [dataHAL](https://data.hal.science/). **HAL** is the **bibliographic open archive** chosen by French research institutions, universities, and grandes écoles. Hence, we suppose this article will only be useful to French-speaking R users and will provide it **in French** from now on... + +Les données contenues dans le triplestore HAL sont a priori utilisables pour générer des **rapports bibliographiques** pour une **personne**, une **organisation** (UMR par exemple), en triant par exemple par **année** ou par **période**. 
+ +On peut ainsi imaginer utiliser ces données pour **générer automatiquement et de manière reproductible** un certain nombre de **tables ou graphiques** en lien avec les **évaluations** du personnel publiant des établissements de recherche. + +Dans la suite de cet article je montrerai comment explorer et exploiter ces données à l'aide de R, et (notamment) du paquet glitter (pour créer et réaliser les requêtes) et du paquet [sequins](https://lvaudor.github.io/sequins/) (pour visualiser les requêtes). + +```{r libs, message=FALSE} +library(glitter) +library(sequins) +library(dplyr) +library(tidyr) +library(ggplot2) +``` + + +# Entrée par auteur·rice + +Essayons par exemple d'examiner s'il existe dans la base quelqu'un qui s'appelle (tout à fait au hasard) "Lise Vaudor": + +```{r test_LV} +test_LV=spq_init("hal") %>% + spq_add("?personne foaf:name 'Lise Vaudor'") %>% # récupère les personnes appelées "Lise Vaudor" + spq_perform() +DT::datatable(test_LV) +``` + +Il existe bien une personne ayant ce nom dans la base de données, qui fait l'objet d'une [fiche consultable](`r test_LV$personne[1]`). + +La consultation de cette page montre que deux propriétés sont souvent renseignées: **foaf:interest** et **foaf:topic_interest**. Cette dernière propriété semble regrouper des mots-clés issus de l'ensemble des publications de l'auteur alors que foaf:interest correspond à des centres d'intérêt déclarés (probablement lors de la création du profil HAL: à vrai dire je ne m'en souviens plus!). 
+ +Quoi qu'il en soit, l'information relative aux centres d'intérêt est accessible comme suit: + +```{r interet_LV, fig.width=7,fig.height=7} +requete = spq_init("hal") %>% + spq_add("?personne foaf:name 'Lise Vaudor'") %>% + spq_add("?personne foaf:interest ?interet") %>% # récupère les centres d'intérêt + spq_add("?interet skos:prefLabel ?interet_label") %>% # étiquette les centres d'intérêt + spq_filter(lang(interet_label) == 'fr') + +sequins::graph_query(requete, layout = "tree") +``` + + +```{r interet_LV_run} +interet_LV = requete %>% # garde seulement les étiquettes en français + spq_perform() +DT::datatable(interet_LV) +``` + +# Documents d'un·e auteur·rice + +Une des petites subtilités du modèle de données HAL consiste à considérer que **un document a un créateur·rice -- ou auteur·rice -- et un·e créateur·rice correspond à une personne**. + +## Affiliations + +Par exemple, l'article "How sampling influences the statistical power to detect changes in abundance: an application to river restoration" a pour créatrice (entre autres personnes) "Lise Vaudor à l'époque du Cemagref", qui correspond à la personne "Lise Vaudor" qui elle est intemporelle 😉. + +Ainsi, c'est en considérant les créateurs de documents que l'on va récupérer les affiliations: **l'affiliation est une information qui se récupère en adoptant une entrée par document plutôt que par auteur·rice**. 
+ + +```{r orga_LV_prep, fig.width=7, fig.height=8} +requete = spq_init("hal") %>% + spq_add("?doc dcterms:creator ?createur") %>% # documents créés par créateur + spq_add("?createur hal:structure ?affil") %>% # créateur correspond à une affiliation + spq_add("?createur hal:person ?personne") %>% # créateur correspond à une personne + spq_add("?personne foaf:name 'Lise Vaudor'") %>% + spq_add("?affil skos:prefLabel ?affiliation") %>% # étiquette affiliation + spq_group_by(affiliation) %>% # groupe par affiliation + spq_summarise(n = n()) %>% + spq_arrange(desc(n)) + +requete + +sequins::graph_query(requete, layout="tree") +``` + +```{r orga_LV_run} +orga_LV = requete %>% # renvoie le nombre d'enregistrements + spq_perform() + +DT::datatable(orga_LV) +``` + +## Documents + +Si l'on ne s'intéresse pas aux affiliations mais aux **documents** eux-mêmes: + +```{r docs_LV} +docs_LV = spq_init(endpoint = "hal") %>% + spq_add("?doc dcterms:creator ?createur") %>% + spq_add("?createur hal:structure ?affil") %>% + spq_add("?createur hal:person ?personne") %>% + spq_add("?personne foaf:name 'Lise Vaudor'") %>% + spq_add("?affil skos:prefLabel ?affiliation") %>% + spq_add("?doc dcterms:type ?type") %>% + spq_add("?type skos:prefLabel ?type_label") %>% + spq_filter(lang(type_label) == 'fr') %>% + spq_add("?doc dcterms:bibliographicCitation ?citation") %>% + spq_add("?doc dcterms:issued ?date") %>% + spq_mutate(date = str_sub(as.character(date), 1, 4)) %>% + spq_group_by(citation, type_label, date) %>% + spq_summarise(affiliation = str_c(affiliation, sep = ", ")) %>% + spq_perform() + +docs_LV +``` + +Cette requête renvoie une table comptant `r nrow(docs_LV)` lignes. Voici les 20 documents les plus récents: + +```{r docs_LV_recents} +docs_LV %>% + arrange(desc(date)) %>% + head(20) %>% + DT::datatable() +``` + + +# Entrée par laboratoire + + +## Identification du laboratoire + +Intéressons-nous maintenant aux **publications issues d'un laboratoire**. 
Ici, nous avons choisi le laboratoire "Environnement Ville Société", alias "EVS" ou encore "UMR 5600". + +Essayons de le retrouver dans la base de données: + +```{r labo_EVS} +labo_EVS = spq_init(endpoint = "hal") %>% + spq_add("?labo skos:prefLabel ?labo_label") %>% + spq_add("?labo dcterms:identifier ?labo_id", .required = FALSE) %>% + spq_filter(str_detect(labo_label,"EVS|(UMR 5600)|(Environnement Ville Soc)")) %>% + spq_perform() +labo_EVS +``` + +Bon! Eh bien, étant donné la diversité des formats dans la dénomination d'EVS, un petit tri manuel s'impose. + +```{r labo_EVS_filter} +labo_EVS = labo_EVS %>% + unique() %>% + mutate(num = 1:n()) %>% + filter(!(num %in% c(1,2,3,18))) %>% # ici je retire les labos qui ne correspondent pas à UMR 5600 / EVS + select(-num) +DT::datatable(labo_EVS) +``` + +Créons maintenant une fonction qui permet de récupérer l'**ensemble des documents pour chacune de ces dénominations de laboratoire**. + +```{r get_docs_lab} +get_docs_lab = function(lab){ + lab = paste0("<",lab,">") + result = spq_init(endpoint = "hal") %>% + spq_add(glue::glue("?createur hal:structure {lab}")) %>% + spq_add("?createur hal:person ?personne") %>% + spq_add("?personne foaf:name ?auteur") %>% + spq_add("?doc dcterms:creator ?createur") %>% + spq_select(-createur) %>% + spq_add("?doc dcterms:type ?type") %>% # récupère le type de document + spq_add("?type skos:prefLabel ?type_label") %>% # étiquette le type de document + spq_filter(lang(type_label) == 'fr') %>% # ... 
en français + spq_add("?doc dcterms:bibliographicCitation ?citation") %>% # récupère la citation + spq_add("?doc dcterms:issued ?date") %>% + spq_perform() %>% + mutate(date = stringr::str_sub(date,1,4)) %>% + select(auteur, type = type_label, date, citation) + return(result) +} +``` + +Appliquons maintenant cette fonction à chacune des dénominations possibles pour le labo EVS: + +```{r apply_get_docs_lab} +docs_EVS = labo_EVS %>% + group_by(labo, labo_label) %>% + tidyr::nest() %>% + mutate(data = purrr::map(labo, get_docs_lab)) %>% + tidyr::unnest(cols="data") + +dim(docs_EVS) +``` + +Cette table compte de nombreux enregistrements (`r nrow(docs_EVS)`). On montre ci-dessous les plus récents (à partir de 2020): + +```{r docs_EVS_show} +docs_EVS_show = docs_EVS %>% + select(-labo) %>% + filter(date >= 2020) %>% + unique() %>% + select(auteur, date, type, citation) %>% + ungroup() + +dim(docs_EVS_show) + +DT::datatable(docs_EVS_show) +``` + +# Rendus graphiques + +On peut dès lors utiliser ces données pour produire un **certain nombre de graphiques** permettant d'apprécier la production du laboratoire au cours du temps: + +```{r plot_datecitation} +docs_datecitation = docs_EVS %>% + group_by(type) %>% + mutate(ntype = n()) %>% + ungroup() %>% + mutate(ntot = n()) %>% + mutate(proptype = ntype/ntot) %>% + filter(proptype > 0.05) %>% + group_by(date, citation, type) %>% + summarise(n = n()) %>% + filter(date > 2015) + +ggplot(docs_datecitation, aes(x = date, y = n, fill = type)) + + geom_bar(stat = "identity") + + facet_grid(rows = vars(type)) +``` + +Il ne s'agit là que d'un exemple (parmi beaucoup d'autres possibilités comme l'exploitation de mots-clés, de statistiques par journal, de réseaux d'auteurs) pour exploiter ces données. Néanmoins, ces méthodes allant au-delà du "scope" du package glitter, nous n'irons pas plus loin en termes d'analyse des résultats des requêtes dans ce document. 
+ diff --git a/vignettes/articles/glitter_for_Wikidata.Rmd b/vignettes/articles/glitter_for_Wikidata.Rmd index 23503f92..63e9f474 100644 --- a/vignettes/articles/glitter_for_Wikidata.Rmd +++ b/vignettes/articles/glitter_for_Wikidata.Rmd @@ -18,178 +18,228 @@ knitr::opts_chunk$set( ```{r setup} library(glitter) library(WikidataR) +library(dplyr) +library(tidyr) +library(sf) +library(leaflet) ``` -This first vignette shows how to use `glitter` to extract data from the **Wikidata SPARQL endpoint**. - +This first vignette shows how to use `glitter` to extract data from the **Wikidata SPARQL endpoint**. We imagine here a case study in which one is interested in the **Wikidata items available regarding the Lyon metro network**. # Find items and properties to build your query -To find the identifiers of items and properties of interest, you can: - -- browse [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page) -- use package `WikidataR` (functions `find_item()`,`find_property()`). - -Then `glitter` functions might be used to start exploring data. +To find the identifiers of items and properties of interest for a particular case study, you can: -# Example 1: Lyon Metro +- browse [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page) +- use package `WikidataR` (functions `WikidataR::find_item()`, `WikidataR::find_property()`). Here, we will explore that second option -Imagine you are interested in the Wikidata available regarding the **Lyon Metro network**. - -Let's try and see if there are Wikidata about it: +Let's try and find the Wikidata identifier for the Lyon metro network: ```{r} -find_item("Metro Lyon") +WikidataR::find_item("Metro Lyon") ``` -So you'd be interested, for instance, in all the subway stations that are part of this network. +So you'd be interested, for instance, in all the subway stations that are part of this network. 
Let's try and find the property identifier that corresponds to this notion: ```{r} -find_property("part of") +WikidataR::find_property("part of") ``` -So you're looking for all the stations that are part of ("wd:P361") the Lyon metro network ("wd:Q1552"). +So you're looking for all the stations that are part of ("wdt:P16") the Lyon metro network ("wd:Q1552"). -You could access this information through: +# Use glitter functions to start exploring data -```{r} -stations=get_triple("?items wdt:P16 wd:Q1552") -stations %>% head() +The `glitter` functions might now be used to start exploring data. + +We're looking for items (the "unknown" in our query below, hence the use of a "?") which are part of the Lyon metro network: + +```{r single_line} +stations = spq_init() %>% + spq_add("?items wdt:P16 wd:Q1552") %>% + spq_perform() + +head(stations) ``` -Notice that we do not have values yet for the stations (that's what we're looking for) hence the use of "?" at the beginning of the subject string. +To also get the labels for stations, we can use `spq_label()`: -To also get the labels for stations, we can use the argument `label`: +```{r single_line_with_label} +stations = spq_init() %>% + spq_add("?items wdt:P16 wd:Q1552") %>% + spq_label(items) %>% + spq_perform() -```{r} -parts_metro_Lyon=get_triple("?items wdt:P16 wd:Q1552", label="?items") -parts_metro_Lyon %>% head() +head(stations) ``` -For now, we get 50 items, not only stations but also other types of items such as metro lines. Let's have a look at the item "Place Guichard - Bourse du Travail" (Q59855) which we know correspond to a station. +## Labelling -Property "wdt:P31" should enable us to collect only stations ("wd:Q928830") instead of all parts of the Lyon Metro network. +The query above, with `spq_label(items)`, will return a table comprising both `items` (with the Wikidata identifiers) and `items_label` (with the human-readable label corresponding to these items). 
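At this point, one may also want to look at the SPARQL that such a pipeline generates, without sending anything to the endpoint. The following is a sketch, assuming the installed glitter version exports `spq_assemble()` (a helper not used elsewhere in this vignette):

```r
library(glitter)

# Build the same query as above, but assemble the SPARQL string
# instead of sending it to the Wikidata endpoint.
spq_init() %>%
  spq_add("?items wdt:P16 wd:Q1552") %>%
  spq_label(items) %>%
  spq_assemble() %>%
  cat()
```

Reading the assembled query is a cheap way to check what a step like `spq_label()` adds to the query before paying the cost of an actual request.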
-We can enrich and refine our query incrementally with `spq_filter()` before sending it (whereas `get_triple()` was a shortcut to build and send simple requests) : +If the Wikidata unique identifier is not particularly useful, one can use the argument `.overwrite = TRUE` so that only labels will be returned, under the shorter name `items`: -```{r} -stations_metro_Lyon=spq_init() %>% - spq_add("?stations wdt:P361 wd:Q1552", .label = "?stations") %>% - spq_add("?stations wdt:P31 wd:Q928830") %>% +```{r overwrite_labelling} +stations=spq_init() %>% + spq_add("?items wdt:P16 wd:Q1552") %>% + spq_label(items, .overwrite = TRUE) %>% spq_perform() -head(stations_metro_Lyon) + +head(stations) ``` -Or we could use glitter DSL (domain-specific language) a bit more, making use of `spq_set()` and the `spq_filter()` fonction. +# Detail query -```{r} -stations_metro_Lyon = spq_init() %>% - spq_set(stations_code = "wd:Q1552") %>% - spq_filter(stations == wdt::P361(stations_code), .label="?stations") %>% - spq_set(lyon_code = "wd:Q928830") %>% - spq_filter(stations == wdt::P31(lyon_code)) %>% - spq_select(- stations_code, - lyon_code) %>% +## Add another triple pattern + +As it turns out, for now we get `r nrow(stations)` items, which actually correspond not only to stations but also to other types of items such as metro lines. +Let's have a look at the item "Place Guichard - Bourse du Travail" ("wd:Q599865") which we know corresponds to a station. +We can do that e.g. through [the Wikidata URL associated with this item](https://www.wikidata.org/wiki/Q599865){target="_blank"}. + +Hence, the property called "wdt:P31" ("is an instance of") should enable us to collect only stations ("wd:Q928830") instead of any part of the Lyon metro network. 
```{r stations} +stations = spq_init() %>% + spq_add("?station wdt:P16 wd:Q1552") %>% + spq_add("?station wdt:P31 wd:Q928830") %>% # added instruction + spq_label(station, .overwrite = TRUE) %>% spq_perform() -head(stations_metro_Lyon) + +dim(stations) +head(stations) ``` -We now get 42 stations that are part of the Lyon metro network. +## Get coordinates -If we wanted to get other properties and associated values for these stations (for instance their location ("wdt:P625")) we could proceed this way: +If we want to get the geographical coordinates of these stations (property "wdt:P625"), we can proceed this way: -```{r} -stations_metro_Lyon = spq_init() %>% - spq_set(stations_code = "wd:Q1552") %>% - spq_filter(stations == wdt::P361(stations_code), .label="?stations") %>% - spq_set(lyon_code = "wd:Q928830") %>% - spq_filter(stations == wdt::P31(lyon_code)) %>% - spq_select(- stations_code, - lyon_code) %>% - spq_mutate(coords = wdt::P625(stations)) %>% +```{r add_coords} +stations_coords = spq_init() %>% + spq_add("?station wdt:P16 wd:Q1552") %>% + spq_add("?station wdt:P31 wd:Q928830") %>% + spq_add("?station wdt:P625 ?coords") %>% # added instruction + spq_label(station, .overwrite = TRUE) %>% spq_perform() -head(stations_metro_Lyon) + +dim(stations_coords) +head(stations_coords) ``` -This tibble can be easily transform into a Simple feature collection (sfc) : +This tibble can be transformed into a Simple feature collection (sfc) object using package `sf`: -```{r, eval=FALSE} -stations_metro_Lyon.shp=sf::st_as_sf(stations_metro_Lyon, wkt = "coords") -plot(stations_metro_Lyon.shp$coords) +```{r stations_as_sf} +stations_sf = st_as_sf(stations_coords, wkt = "coords") +head(stations_sf) ``` +The resulting object may then be used easily with (for instance) package `leaflet`: -`glitter` provides functions to clean and transform "raw" Wikidata tibbles into tibbles that are easier to use in R. 
- +```{r leaflet_stations} +leaflet(stations_sf) %>% + addTiles() %>% + addCircles(popup = ~station) +``` +# Add property qualifiers -Function `transform_wikidata_coords()` get the coordinates as longitude (`lng`) and latitude (`lat`) based on the Wikidata WKT formatting of spatial coordinates ("Point(*lng* *lat*)"). +Now, we would like not only to view the stations but also the **connecting lines**. +One property is of particular interest for this purpose: P197, which indicates **which other stations one station is connected to**. +To form connecting lines, this information about the connection to other stations needs to be complemented by *the involved line* and *direction* of that connection. +Hence, we are not only interested in the **values** of the property P197, but also in the **property qualifiers** corresponding to the connecting line (P81) and direction (P5051). -```r -stations_metro_Lyon=stations_metro_Lyon %>% - transform_wikidata_coords("coords") -``` +We can thus complete our query this way: -The resulting table may then be used easily with (for instance) package `leaflet`: +```{r query_prop_qualifiers} +stations_adjacency=spq_init() %>% + spq_add("?station wdt:P16 wd:Q1552") %>% + spq_add("?station wdt:P31 wd:Q928830") %>% + spq_add("?station wdt:P625 ?coords") %>% + spq_add("?station p:P197 ?statement") %>% # added instruction + spq_add("?statement ps:P197 ?adjacent") %>% # added instruction + spq_add("?statement pq:P81 ?line") %>% # added instruction + spq_add("?statement pq:P5051 ?direction") %>% # added instruction + spq_label("station", "adjacent", "line", "direction",.overwrite = TRUE) %>% + spq_select(-statement) %>% + spq_perform() %>% + na.omit() %>% + select(coords,station,adjacent,line,direction) -```r -leaflet::leaflet(stations_metro_Lyon) %>% - leaflet::addTiles() %>% - leaflet::addCircles(popup=~stations_label) +head(stations_adjacency) ``` +Now, we would like to put the stations **in the right order** so that we will be able to form the 
connecting lines. -# Example 2 : cities around Lyon +This **data-wrangling part is a bit tricky**, though not directly due to any glitter-related operation. -Now, let's imagine that we are interested in the cities in a 200km radius around Lyon. +We define a function `form_line()` which will put the rows in the table of stations in the correct order. -```{r} -find_item("Lyon") -find_item("city") +```{r form_line} +form_line = function(adjacencies, direction) { + N = nrow(adjacencies) + num = rep(NA,N) + ind = which(adjacencies$adjacent == direction) + i = N + num[ind] = i + while (i>1) { + indnew = which(adjacencies$adjacent == adjacencies$station[ind]) + ind = indnew + i = i-1 + num[ind] = i + } + adjacencies = adjacencies %>% + mutate(num = num) %>% + arrange(num) + adjacencies = c(adjacencies$station, direction) + return(adjacencies) +} ``` -We could start exploring Wikidata with this query which finds all items that are instances ("wdt:P31") of "city" or of any subclass ("wdt:P279") of "city" . This query might return many items so that it seems reasonable to limit the number of items retrieved for now with the argument `limit` +Now let's **apply this function to all lines and directions possible**. +Making full use of the tidyverse, we can apply this function iteratively, while keeping the table-like structure of our data, using a combination of tidyr::nest() and purrr::map(). 
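The nest-and-map pattern itself can be illustrated on toy data, independently of the metro query (a minimal sketch using only dplyr, tidyr and purrr):

```r
library(dplyr)
library(purrr)
library(tidyr)

# Toy data: two groups with a few values each.
toy = tibble(g = c("a", "a", "b", "b"), x = 1:4)

toy %>%
  group_by(g) %>%
  nest() %>%                                     # one sub-tibble per group, in column "data"
  mutate(total = map_dbl(data, ~ sum(.x$x))) %>% # map a function over each sub-tibble
  select(-data) %>%
  ungroup()
# g = "a" -> total 3 ; g = "b" -> total 7
```

The real query below follows the same shape, except that the mapped function returns the reordered stations, which are then unnested back into a long table.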
-```{r} -spq_init() %>% - spq_add("?city wdt:P31/wdt:P279* wd:Q515", .label="?city") %>% - spq_head(n = 10) %>% - spq_perform() +```{r calc_lines} +stations_lines = stations_adjacency %>% + sf::st_drop_geometry() %>% # make this a regular tibble, not sf + group_by(direction,line) %>% + na.omit() %>% + tidyr::nest(.key = "adj") %>% # have nested "adj" table for each direction-line + mutate(station = purrr::map(.x = adj, .y = direction, + ~form_line(.x,.y))) %>% + tidyr::unnest(cols = "station") %>% + ungroup() ``` -Now, let's get the location ("wdt:P625") of the cities +We use left_join() to complete the table ordering the stations into lines with the coordinates of stations: -```{r} -spq_init() %>% - spq_add("?city wdt:P31/wdt:P279* wd:Q515", .label="?city") %>% - spq_mutate(coords = wdt::P625(city)) %>% - spq_head(n = 10) %>% - spq_perform() +```{r join_coords} +stations_lines=stations_lines %>% + left_join(unique(stations_coords), # get corresponding coordinates + by=c("station")) %>% + na.omit() +head(stations_lines) ``` -We can refine this query, stating that we want cities (or items of subclasses of city) in a radius of 5km around Lyon (which has lat-long coordinates ~ 45.76 and 4.84). 
We will use the argument `within_distance`: - +`stations_lines` is now an **sf points object** which is properly formatted to be transformed into an **sf lines object** (the stations are in the right order for each line-direction, and the associated coordinates are provided in the table): -```{r} -cities_around_Lyon = spq_init() %>% - spq_add("?city wdt:P31/wdt:P279* wd:Q486972") %>% - spq_label(city) %>% - spq_mutate(coords = wdt::P625(city), - .within_distance=list(center=c(long=4.84,lat=45.76), - radius=5)) %>% - spq_perform() -head(cities_around_Lyon) +```{r stations_lines_sf} +stations_lines_sf=stations_lines %>% + sf::st_as_sf(wkt="coords") %>% + group_by(direction,line) %>% + summarise(do_union = FALSE) %>% # for each group, and keeping order of points, + sf::st_cast("LINESTRING") # form a linestring geometry +stations_lines_sf ``` - +We can now use this new object to **display the Lyon metro lines on a leaflet map**: + +```{r leaflet_lines} +factpal <- colorFactor(topo.colors(8), + unique(stations_lines$line)) +leaflet(data=stations_sf) %>% + addTiles() %>% + addCircles(popup=~station) %>% + addPolylines(data=stations_lines_sf, + color=~factpal(line), popup=~line) +``` - - - - - - - - - - diff --git a/vignettes/articles/glitter_for_hal.Rmd b/vignettes/articles/glitter_for_hal.Rmd deleted file mode 100644 index 2aa62037..00000000 --- a/vignettes/articles/glitter_for_hal.Rmd +++ /dev/null @@ -1,257 +0,0 @@ ---- -title: "glitter for HAL" -author: "Lise Vaudor" -date: "04/11/2021" -output: rmarkdown::html_vignette -vignette: > - %\VignetteIndexEntry{glitter for HAL} - %\VignetteEngine{knitr::rmarkdown} - %\VignetteEncoding{UTF-8} ---- - -```{r setup, include=FALSE} -knitr::opts_chunk$set(echo = TRUE) -``` - - -```{r} -library(glitter) -``` - -Cette vignette reprend les **requêtes proposées (dans le langage SPARQL)** dans [cette présentation](https://fr.slideshare.net/lespetitescases/dcouverte-du-sparql-endpoint-de-hal) de **Gautier Poupeau** et vous montre 
how to formulate them in R using the glitter package.
-
-# Searching by document
-
-We are interested in this document: haldoc:inria-00362381.
-
-## Retrieve all the information associated with this document
-
-```{r}
-tib_info = spq_init(endpoint = "hal") %>%
-  spq_add("haldoc:inria-00362381 dcterms:hasVersion ?version") %>% # This doc has versions ?version
-  spq_add("?version ?p ?object") %>% # ?version has properties ?p linking it to ?object
-  spq_perform()
-
-head(tib_info)
-```
-
-## Retrieve the URI, the title, the link to the PDF, and the authors of [all versions of] this document
-
-```{r}
-query_doc = spq_init(endpoint = "hal") %>%
-  spq_add("haldoc:inria-00362381 dcterms:hasVersion ?version") %>% # This doc has versions ?version
-  spq_add("?version dcterms:title ?title") %>% # ?version has title ?title
-  spq_add(". dcterms:creator ?creator") %>% # ...... and creator ?creator
-  spq_add(". ore:aggregates ?pdf") %>% # ...... and this link to a ?pdf
-  spq_add("?creator hal:person ?person") %>% # ?creator is a person ?person
-  spq_add("?person foaf:name ?name") # ?person has name ?name
-
-tib_doc = spq_perform(query_doc)
-
-tib_doc
-```
-
-Note: the values returned for a variable can be concatenated (here the author names, for instance):
-
-```{r}
-query_doc_autConcat = query_doc %>%
-  spq_group_by(version, title, pdf) %>% # Group the results by ?version, ?title, ?pdf
-  spq_summarise(authors = str_c(name, sep = ', ')) # Concatenate the author names into ?authors
-
-tib_doc_autConcat = spq_perform(query_doc_autConcat)
-
-tib_doc_autConcat
-```
-
-The results returned by the query can also be **aggregated/summarised**, for instance by **retrieving the number of authors of a document** (per version):
-
-```{r}
-tib_nbAutDoc = spq_init(endpoint = "hal") %>%
-  spq_add("haldoc:inria-00362381 dcterms:hasVersion ?version") %>% # This doc has versions ?version
-  spq_add("?version dcterms:creator ?creator") %>% # ?version has creator ?creator
-  spq_add("?creator hal:person ?person") %>% # ?creator is a person ?person
-  spq_group_by(version) %>% # Group by ?version
-  spq_summarise(nbperson = n(unique(person))) %>% # Summarise: ?nbperson = number of distinct ?person
-  spq_perform()
-
-tib_nbAutDoc
-```
-
-## Retrieve the URI and the types (with their associated labels) of [all versions of] this document
-
-We look for the types associated with the document versions. These types come with labels (in several languages) that make it possible to understand what they are "in plain language".
-
-```{r}
-query_docType = spq_init(endpoint = "hal") %>%
-  spq_add("haldoc:inria-00362381 dcterms:hasVersion ?version") %>% # This doc has versions ?version
-  spq_add("?version dcterms:type ?type") %>% # ?version is a document of type ?type
-  spq_add("?type skos:prefLabel ?label") # ?type has label ?label
-
-tib_docType = spq_perform(query_docType)
-
-tib_docType
-```
-
-
-The **results can be filtered** to display only the rows for which the label is in French:
-
-```{r}
-query_docTypeFr = query_docType %>%
-  spq_filter(lang(label) == 'fr') # Filter to keep the French labels
-
-tib_docTypeFr = spq_perform(query_docTypeFr)
-
-tib_docTypeFr
-```
-
-
-# Searching by author form
-
-## Display the documents associated with an author form
-
-Let us consider one of Fabien Gandon's "author forms", [https://data.archives-ouvertes.fr/author/827904](https://data.archives-ouvertes.fr/author/827904) (seen in the results of the previous queries).
-
-```{r}
-fabien_gandon = "<https://data.archives-ouvertes.fr/author/827904>" # fabien_gandon = this author form
-
-query_foAut = spq_init(endpoint = "hal") %>%
-  spq_add("?document dcterms:hasVersion ?version") %>% # ?document has versions ?version
-  spq_add("?version dcterms:creator ?creator") %>% # ?version has creator ?creator
-  spq_add("?creator hal:person {fabien_gandon}") %>% # ?creator is fabien_gandon ({object defined in R})
-  spq_add("?version dcterms:type ?type") %>% # ?version has type ?type
-  spq_add("?type skos:prefLabel ?label") %>% # ?type has label ?label
-  spq_filter(lang(label) == 'fr') # Filter to keep the French ?label
-tib_foAut = spq_perform(query_foAut)
-
-head(tib_foAut)
-```
-
-## Enriching/simplifying the query
-
-The **queried results can be summarised**, for instance by displaying the number of documents per type (with the associated label):
-
-
-```{r}
-query_foAutNbDoc = query_foAut %>%
-  spq_group_by(type, label) %>% # Group by ?type and ?label
-  spq_summarise(nbdoc = n(unique(document)))
-
-tib_foAutNbDoc = spq_perform(query_foAutNbDoc)
-
-tib_foAutNbDoc
-```
-
-These results can be displayed with the rows **ordered** by decreasing number of documents:
-
-```{r}
-query_foAutNbDocOrd = query_foAutNbDoc %>%
-  spq_arrange(desc(nbdoc)) # Order by decreasing ?nbdoc
-
-tib_foAutNbDocOrd = spq_perform(query_foAutNbDocOrd)
-
-tib_foAutNbDocOrd
-```
-
-We can also look at publication **dates**. For instance, here we retrieve the publication date and the corresponding year, and summarise the information by computing the number of documents per year.
-
-```{r}
-tib_dat = spq_init(endpoint = "hal") %>%
-  spq_add("?document dcterms:hasVersion ?version") %>%
-  spq_add("?version dcterms:creator ?creator") %>%
-  spq_add(". dcterms:issued ?date") %>% # ?version was issued on ?date
-  spq_add("?creator hal:person {fabien_gandon}") %>%
-  spq_mutate(year = year(date)) %>% # Create ?year = year of ?date
-  spq_group_by(year) %>% # Group by ?year
-  spq_summarise(nbdoc = n(unique(document))) %>%
-  spq_arrange(year) %>%
-  spq_perform()
-
-tib_dat
-```
-
-# Searching by author
-
-One person = one IdHAL = several author forms
-
-## Display the URIs of all the author forms associated with the "Fabien Gandon" form via the IdHAL
-
-```{r}
-query_aut = spq_init(endpoint = "hal") %>%
-  spq_add("{fabien_gandon} ore:isAggregatedBy ?o") %>% # {fabien_gandon} corresponds to ?o
-  spq_add("?forme ore:isAggregatedBy ?o") # ?forme corresponds to ?o
-
-tib_aut = spq_perform(query_aut)
-
-tib_aut
-```
-
-## Enriching/summarising the data
-
-... by counting the **number of documents per year**:
-
-```{r}
-query_autNbDocYear = query_aut %>%
-  spq_add("?version dcterms:creator ?creator") %>% # ?version has creator ?creator
-  spq_add("?creator hal:person ?forme") %>% # ?creator is a person ?forme
-  spq_add("?version dcterms:issued ?date") %>% # ?version was issued on ?date
-  spq_add("?document dcterms:hasVersion ?version") %>% # ?document has versions ?version
-  spq_mutate(year = year(date)) %>% # Add ?year, the year of ?date
-  spq_group_by(year) %>% # Group by ?year
-  spq_arrange(year) %>% # Order by increasing ?year
-  spq_summarise(nbdoc = n(unique(document)))
-
-tib_autNbDocYear = spq_perform(query_autNbDocYear)
-
-tib_autNbDocYear
-```
diff --git a/vignettes/articles/img/hal_author.png b/vignettes/articles/img/hal_author.png
new file mode 100644
index 00000000..6fe41f1b
Binary files /dev/null and b/vignettes/articles/img/hal_author.png differ
diff --git a/vignettes/articles/img/hal_document.png b/vignettes/articles/img/hal_document.png
new file mode 100644
index 00000000..063091d5
Binary files /dev/null and b/vignettes/articles/img/hal_document.png differ