update some vignettes (Wikidata & other triplestores, complex queries) #157

Merged
merged 44 commits into from
Sep 28, 2023
Changes from 12 commits
fd99de5
update Wikidata vignette
lvaudor Aug 30, 2023
9fee428
update Wikidata vignette
lvaudor Aug 30, 2023
8b34ca8
Merge branch 'dokeydoc' of https://github.com/lvaudor/glitter into do…
lvaudor Aug 30, 2023
c5d30d4
Lyon metro lines
lvaudor Sep 5, 2023
1db04d8
improve sf explaining
lvaudor Sep 8, 2023
453700d
glitter for bibliometric data with HAL
lvaudor Sep 8, 2023
34e1960
Merge branch 'main' into dokeydoc
lvaudor Sep 8, 2023
088dea7
suggestions for explore article
lvaudor Sep 11, 2023
29c784d
sequins in deps
lvaudor Sep 15, 2023
3801e6e
change menu
lvaudor Sep 15, 2023
863d50b
Merge branch 'main' into dokeydoc
lvaudor Sep 15, 2023
936f81d
Merge branch 'main' into dokeydoc
lvaudor Sep 21, 2023
4d8f8d4
Update DESCRIPTION
lvaudor Sep 28, 2023
7848b75
Update DESCRIPTION
lvaudor Sep 28, 2023
3052973
Update vignettes/articles/explore.Rmd
lvaudor Sep 28, 2023
c612a4a
Update vignettes/articles/explore.Rmd
lvaudor Sep 28, 2023
4965acb
Update vignettes/articles/explore.Rmd
lvaudor Sep 28, 2023
4a9720d
Update vignettes/articles/explore.Rmd
lvaudor Sep 28, 2023
b843072
Update vignettes/articles/glitter_bibliometry.Rmd
lvaudor Sep 28, 2023
ef68835
Update vignettes/articles/glitter_bibliometry.Rmd
lvaudor Sep 28, 2023
6ee062a
Update vignettes/articles/glitter_bibliometry.Rmd
lvaudor Sep 28, 2023
bb4c605
Update vignettes/articles/glitter_for_Wikidata.Rmd
lvaudor Sep 28, 2023
d3c4db4
Update vignettes/articles/glitter_bibliometry.Rmd
lvaudor Sep 28, 2023
b583cd8
Update vignettes/articles/glitter_for_Wikidata.Rmd
lvaudor Sep 28, 2023
2efe3d2
Update vignettes/articles/glitter_bibliometry.Rmd
lvaudor Sep 28, 2023
876e395
take into account Maelle's dokeydoc review
lvaudor Sep 28, 2023
633cee5
fix article
maelle Sep 28, 2023
7fe2e51
oops
maelle Sep 28, 2023
b23f4ac
Merge branch 'main' into dokeydoc
maelle Sep 28, 2023
0cb3106
register sequins in DESCRIPTION
maelle Sep 28, 2023
29fd25e
add true
maelle Sep 28, 2023
3456c54
show requete text
maelle Sep 28, 2023
b949151
fix article
maelle Sep 28, 2023
5d431d5
tweaks
maelle Sep 28, 2023
d844b7a
rm outdated vignette
maelle Sep 28, 2023
cc0f212
don't use tidyverse package in pkg
maelle Sep 28, 2023
5eb9372
oops
maelle Sep 28, 2023
b5352ef
ouch
maelle Sep 28, 2023
9a6bae5
fix config
maelle Sep 28, 2023
257eb24
oops
maelle Sep 28, 2023
0b44631
cleaner deps
maelle Sep 28, 2023
ebb2994
,
maelle Sep 28, 2023
e0a98d6
oops
maelle Sep 28, 2023
991ce5b
try
maelle Sep 28, 2023
2 changes: 2 additions & 0 deletions DESCRIPTION
@@ -44,6 +44,8 @@ Suggests:
rmarkdown,
testthat (>= 3.0.0),
withr
Remotes:
lvaudor/glitter
VignetteBuilder:
knitr,
rmarkdown
Binary file added vignettes/articles/docs_EVS.RDS
Binary file not shown.
34 changes: 23 additions & 11 deletions vignettes/articles/explore.Rmd
@@ -1,5 +1,5 @@
---
title: "How to explore a new base with glitter"
title: "How to explore a new database with glitter"
---

```{r, include = FALSE}
@@ -20,7 +20,7 @@ Let's go through an example.

## A word of caution

Depending on the dataset you're working with, some queries might just ask _too much_ of the service so proceed with caution.
Depending on the dataset (or triplestore, in our context) you're working with, some queries might just ask _too much_ of the service so proceed with caution.
When in doubt, add a `spq_head()` in your query pipeline, to ask less at a time, or use `spq_count()` to get a sense of how many results there are in total.
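
As a rough, glitter-free illustration of what these two safeguards amount to, the sketch below builds plain SPARQL strings in base R: one capped with `LIMIT`, one that only counts matches. The helper names `limited_query()` and `count_query()` are hypothetical and not part of glitter, and the SPARQL that glitter actually generates for `spq_head()` and `spq_count()` may be worded differently.

```r
# Hypothetical helpers (NOT part of glitter): build raw SPARQL strings.

# Ask for at most `n` results matching a triple pattern.
limited_query <- function(pattern, n = 10) {
  sprintf("SELECT * WHERE { %s } LIMIT %d", pattern, as.integer(n))
}

# Ask only for the number of matches, not the matches themselves.
count_query <- function(pattern) {
  sprintf("SELECT (COUNT(*) AS ?n) WHERE { %s }", pattern)
}

q_head <- limited_query("?s ?p ?o", n = 10)
q_count <- count_query("?s ?p ?o")
cat(q_head, q_count, sep = "\n")
```

Either form keeps the answer small: the first bounds the number of rows returned, the second returns a single number whatever the size of the triplestore.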

## Asking for a subset of all triples
@@ -56,6 +56,8 @@ Its results however can be... more or less helpful.

### Find which classes are declared

The **classes** occurring in the database will provide information as to **the kind of data** you will find there. This can be as varied (across triplestores, or even in a single triplestore) as people, places, buildings, trees, or even things that are more abstract like concepts, philosophical currents, historical periods, etc.

At this point you might think you need to use some prefixes in your query.
If these prefixes are present in `glitter::usual_prefixes`, you don't need to do anything.
If they're not, use `glitter::spq_prefix()`.
@@ -78,7 +80,7 @@ query_basis %>%
spq_perform()
```

We can do the same query for owl classes instead.
There are `r nclasses$n` classes declared in the triplestore. Not so many that we could not get them all in one query, but definitely too many to show them all here! Let us examine a few of these classes:

```{r}
query_basis %>%
@@ -92,24 +94,28 @@ Until now we could still be very in the dark as to what the service provides.

### Which classes have instances?

A class might be declared although **very few or even no items fall under it**. Getting the classes that do have instances corresponds to another triple pattern, "?item is an instance of ?class", a.k.a. "?item a ?class":

```{r}
query_basis %>%
spq_add("?instance a ?class") %>%
spq_arrange(class) %>%
spq_head(n = 10) %>%
spq_select(- instance) %>%
spq_select(- item) %>%
spq_select(class, .spq_duplicate = "distinct") %>%
spq_perform() %>%
knitr::kable()
```

### Which classes have the most instances?

The number of items falling into each class actually gives an even better overview of the contents of a triplestore:

```{r}
query_basis %>%
spq_add("?instance a ?class") %>%
spq_select(class, .spq_duplicate = "distinct") %>%
spq_count(class, sort = TRUE) %>%
spq_count(class, sort = TRUE) %>% # count items falling under class
spq_head(20) %>%
spq_perform() %>%
knitr::kable()
@@ -121,8 +127,8 @@ In this case the class names are quite self explanatory but if they were not we
query_basis %>%
spq_add("?instance a ?class") %>%
spq_select(class, .spq_duplicate = "distinct") %>%
spq_label(class) %>%
spq_count(class, class_label, sort = TRUE) %>%
spq_label(class) %>% # label class to get class_label
spq_count(class, class_label, sort = TRUE) %>% # group by class and class_label to count
spq_head(20) %>%
spq_perform() %>%
knitr::kable()
@@ -153,6 +159,8 @@ query_basis %>%

### What properties are used?

Similarly to counting instances for classes, we wish to get a sense of the **properties that are actually used in the triplestore**.

```{r}
query_basis %>%
spq_add("?s ?property ?o") %>%
@@ -194,9 +202,11 @@ query_basis %>%
knitr::kable()
```

## What data is stored about a class's instance?
## What data is stored about a class's instances?

For each organization, what data is there?
The items falling into a given class are likely to be the subject (or object) of a common set of properties. One might wish to explore the **properties actually associated with a class**.

For instance, in LINDAS, which properties is the schema:Organization class associated with?

```{r}
query_basis %>%
@@ -208,7 +218,7 @@ query_basis %>%
knitr::kable()
```

And for each postal address?
And what about the properties that the schema:PostalAddress class is associated with?

```{r}
query_basis %>%
@@ -222,6 +232,8 @@ query_basis %>%

## Which data or property name includes a certain substring?

Let us examine whether LINDAS holds some data related to water, by searching for the string "hydro" or "Hydro":

```{r}
query_basis %>%
spq_add("?s ?p ?o") %>%
@@ -234,7 +246,7 @@ query_basis %>%

## An example query based on what we now know


To wrap up, let us now use the LINDAS triplestore for an actual data query: we could for instance try to collect all organizations which have "swiss" in their name:

```{r}
query_basis %>%
220 changes: 220 additions & 0 deletions vignettes/articles/glitter_bibliometry.Rmd
@@ -0,0 +1,220 @@
---
title: "glitter for HAL"
author: "Lise Vaudor"
date: "04/11/2021"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{glitter for HAL}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

This article deals with queries on the [HAL](https://hal.science/) triplestore, [dataHAL](https://data.hal.science/). **HAL** is the **bibliographic open archive** chosen by French research institutions, universities, and grandes écoles. Hence, we suppose this article will mostly be useful to French-speaking R users.

The data in the HAL triplestore can in principle be used to generate **bibliographic reports** for a **person** or an **organization** (a joint research unit, or UMR, for instance), sorted for example by **year** or by **period**.

One can thus imagine using these data to **automatically and reproducibly generate** a number of **tables or plots** related to the **evaluations** of the publishing staff of research institutions.

In the rest of this article I will show how to explore and use these data with R and, in particular, the glitter package.

```{r libs}
library(glitter)
library(sequins)
library(tidyverse)
```


# Entry by author

Let us check, for example, whether the database contains someone called (completely at random) "Lise Vaudor":

```{r test_LV}
test_LV=spq_init() %>%
spq_add("?personne foaf:name 'Lise Vaudor'") %>% # get the people named "Lise Vaudor"
spq_perform("hal")
DT::datatable(test_LV)
```

There is indeed a person with this name in the database, with a record that can be viewed [here](`r test_LV$personne[1]`).

That page shows that two properties are often filled in: **foaf:interest** and **foaf:topic_interest**. The latter seems to gather keywords from all of the author's publications, whereas foaf:interest corresponds to declared interests (probably entered when the HAL profile was created; to be honest, I no longer remember!).

In any case, the information about interests can be accessed as follows:

```{r interet_LV, fig.width=7,fig.height=7}
query = spq_init() %>%
spq_add("?personne foaf:name 'Lise Vaudor'") %>%
spq_add("?personne foaf:interest ?interet") %>% # get the interests
spq_add("?interet skos:prefLabel ?interet_label") %>% # label the interests
spq_filter(lang(interet_label) == 'fr') # keep only the French labels
graph_query(query, layout="tree")
```


```{r interet_LV_run}
interet_LV = query %>% # run the query (only the French labels are kept)
spq_perform("hal")
DT::datatable(interet_LV)
```

# Documents by an author

One of the subtleties of the HAL data model is that **a document has a creator (or author), and a creator corresponds to a person**.

## Affiliations

For example, the article "How sampling influences the statistical power to detect changes in abundance: an application to river restoration" has as one of its creators "Lise Vaudor at the time of Cemagref", who corresponds to the person "Lise Vaudor", who is timeless ;-). So it is by looking at document creators that we retrieve affiliations: **affiliation is information retrieved through a document-based entry rather than an author-based one**.
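
To make this three-level model (document, creator, person) more concrete, here is a small offline sketch in base R. All the data are made up, and the table and column names are only chosen to echo the variables used in the queries of this article; this is not how HAL stores anything, just a picture of the relationships.

```r
# Toy data (made up) mimicking the three levels of the HAL data model.
docs <- data.frame(
  doc = c("doc1", "doc2", "doc3"),
  createur = c("creator_A", "creator_B", "creator_B"),
  stringsAsFactors = FALSE
)

# A "creator" ties one person to one affiliation at the time of writing.
createurs <- data.frame(
  createur = c("creator_A", "creator_B"),
  personne = c("person_1", "person_1"),
  affiliation = c("Lab X", "Lab Y"),
  stringsAsFactors = FALSE
)

# Join documents to creators, then count documents per affiliation
# for a single person.
joined <- merge(docs, createurs, by = "createur")
person_docs <- joined[joined$personne == "person_1", ]
n_by_affil <- as.data.frame(
  table(affiliation = person_docs$affiliation),
  responseName = "n",
  stringsAsFactors = FALSE
)
n_by_affil <- n_by_affil[order(-n_by_affil$n), ]
print(n_by_affil)
```

The same person appears under two creators, one per affiliation, which is why the affiliation can only be reached by going through the documents.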


```{r orga_LV_prep, fig.width=7, fig.height=8}
query=spq_init() %>%
spq_add("?doc dcterms:creator ?createur") %>% # documents created by a creator
spq_add("?createur hal:structure ?affil") %>% # the creator corresponds to an affiliation
spq_add("?createur hal:person ?personne") %>% # the creator corresponds to a person
spq_add("?personne foaf:name 'Lise Vaudor'") %>%
spq_add("?affil skos:prefLabel ?affiliation") %>% # label the affiliation
spq_group_by(affiliation) %>% # group by affiliation
spq_summarise(n=n())
graph_query(query, layout="tree")
```

```{r orga_LV_run}
orga_LV=query %>% # returns the number of records
spq_perform("hal") %>%
arrange(desc(n))

DT::datatable(orga_LV)
```

## Documents

If we are interested not in affiliations but in the **documents** themselves:

```{r docs_LV}
docs_LV=spq_init() %>%
spq_add("?doc dcterms:creator ?createur") %>%
spq_add("?createur hal:structure ?affil") %>%
spq_add("?createur hal:person ?personne") %>%
spq_add("?personne foaf:name 'Lise Vaudor'") %>%
spq_add("?affil skos:prefLabel ?affiliation") %>%
spq_add("?doc dcterms:type ?type") %>% # get the document type
spq_add("?type skos:prefLabel ?type_label") %>% # label the document type
spq_filter(lang(type_label) == 'fr') %>% # ... in French
spq_add("?doc dcterms:bibliographicCitation ?citation") %>% # get the citation
spq_add("?doc dcterms:issued ?date") %>% # and the publication date
spq_perform("hal") %>%
mutate(date=stringr::str_sub(date,1,4)) %>% # simplify the date to keep only the year
group_by(citation,type_label,date) %>% # for each citation, type, and date
summarise(affiliation=paste(affiliation, collapse=", ")) # collapse the affiliations into one string

dim(docs_LV)
DT::datatable(docs_LV %>% arrange(desc(date)) %>% head(20)) # show the 20 most recent documents
```
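
The post-processing in the chunk above (keep the year, then collapse the affiliations of each document into one string) can be tried offline. The sketch below uses base R rather than dplyr and a made-up result table, so it only mirrors the logic, not the actual HAL output.

```r
# Toy result (made up): one row per document/affiliation pair.
res <- data.frame(
  citation = c("Paper A", "Paper A", "Paper B"),
  type_label = c("Article", "Article", "Poster"),
  date = c("2019-05-01", "2019-05-01", "2021-11-23"),
  affiliation = c("Lab X", "Lab Y", "Lab X"),
  stringsAsFactors = FALSE
)

# Keep only the year (first four characters of the date string).
res$date <- substr(res$date, 1, 4)

# One row per citation/type/year, with affiliations pasted together.
docs <- aggregate(
  affiliation ~ citation + type_label + date,
  data = res,
  FUN = function(x) paste(x, collapse = ", ")
)
print(docs)
```

Without the collapsing step, a document written under several affiliations would appear several times in the table.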


# Entry by laboratory


## Identifying the laboratory

Let us now turn to the publications of a laboratory. Here we chose the laboratory "Environnement Ville Société", also known as "EVS" or "UMR 5600".

Let us try to find it in the database:

```{r labo_EVS}
labo_EVS=spq_init() %>%
spq_add("?labo skos:prefLabel ?labo_label") %>%
spq_add("?labo dcterms:identifier ?labo_id", .required=FALSE) %>%
spq_filter(str_detect(labo_label,"EVS|(UMR 5600)|(Environnement Ville Soc)")) %>%
spq_perform("hal")
labo_EVS
```

Well! Given the variety of formats used to name EVS, a bit of manual sorting is in order.
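
Why a pattern this loose brings back unwanted rows can be seen on a few made-up labels. The sketch below uses base R's `grepl()` as a stand-in; the query itself is evaluated by the triplestore's own regular-expression engine, which may differ in details, and none of these labels are taken from HAL.

```r
# Made-up labels, for illustration only.
labels <- c(
  "Environnement Ville Société",
  "UMR 5600 EVS",
  "EVS-ISTHME",
  "PREVSOC research group"
)

# Same pattern as in the query above.
pattern <- "EVS|(UMR 5600)|(Environnement Ville Soc)"
hits <- grepl(pattern, labels)
print(data.frame(label = labels, matched = hits))
```

The last label matches only because "EVS" occurs inside another word, which is the kind of false positive that has to be weeded out by hand.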

```{r labo_EVS_filter}
labo_EVS= labo_EVS %>%
unique() %>%
mutate(num=1:n()) %>%
filter(!(num %in% c(1,2,3,18))) %>% # here I remove the labs that do not correspond to UMR 5600 / EVS
select(-num)
DT::datatable(labo_EVS)
```

Let us now write a function that retrieves all documents for each of these laboratory names.

```{r get_docs_lab}
get_docs_lab=function(lab){
lab=paste0("<",lab,">")
result=spq_init() %>%
spq_add(glue::glue("?createur hal:structure {lab}")) %>%
spq_add("?createur hal:person ?personne") %>%
spq_add("?personne foaf:name ?auteur") %>%
spq_add("?doc dcterms:creator ?createur") %>%
spq_select(-createur) %>%
spq_add("?doc dcterms:type ?type") %>% # get the document type
spq_add("?type skos:prefLabel ?type_label") %>% # label the document type
spq_filter(lang(type_label) == 'fr') %>% # ... in French
spq_add("?doc dcterms:bibliographicCitation ?citation") %>% # get the citation
spq_add("?doc dcterms:issued ?date") %>%
spq_perform("hal") %>%
mutate(date=stringr::str_sub(date,1,4)) %>%
select(auteur, type=type_label, date, citation)
return(result)
}
```

Let us now apply this function to each of the possible names of the EVS lab:

```{r apply_get_docs_lab}
if(!file.exists("docs_EVS.RDS")){
docs_EVS=labo_EVS %>%
group_by(labo,labo_label) %>%
tidyr::nest() %>%
mutate(data=purrr::map(labo,get_docs_lab)) %>%
tidyr::unnest(cols="data")
saveRDS(docs_EVS,"docs_EVS.RDS")
}
docs_EVS=readRDS("docs_EVS.RDS")
dim(docs_EVS)
```

This table contains many records (`r nrow(docs_EVS)`). Below we show the most recent ones (from 2020 onwards):

```{r docs_EVS_show}
docs_EVS_show=docs_EVS %>%
ungroup() %>% # drop the labo/labo_label grouping before removing labo
select(-labo) %>%
filter(date>=2020) %>%
unique() %>%
select(auteur,date,type,citation) # columns as returned by get_docs_lab()

dim(docs_EVS_show)
DT::datatable(docs_EVS_show)
```

# Graphical output

These data can then be used to produce a number of plots showing the laboratory's output over time:

```{r plot_datecitation}
docs_datecitation=docs_EVS %>%
group_by(type) %>%
mutate(ntype=n()) %>%
ungroup() %>%
mutate(ntot=n()) %>%
mutate(proptype=ntype/ntot) %>%
filter(proptype>0.05) %>%
group_by(date,citation,type) %>%
summarise(n=n()) %>%
filter(date>2015)

ggplot(docs_datecitation, aes(x=date,y=n, fill=type)) +
geom_bar(stat="identity")+
facet_grid(rows=vars(type))
```
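
The filtering step in the chunk above keeps only the document types that account for more than 5% of all records before counting. A simplified offline version of that logic, in base R and on made-up data (counting per year and type only, without the citation level), could look like this:

```r
# Toy data (made up): one row per document.
docs <- data.frame(
  type = c(rep("Article", 18), rep("Poster", 6), "Patent"),
  date = c(rep("2016", 12), rep("2018", 13)),
  stringsAsFactors = FALSE
)

# Share of each type among all records.
prop_type <- table(docs$type) / nrow(docs)

# Keep only the types that make up more than 5% of the records.
kept_types <- names(prop_type)[prop_type > 0.05]
docs_kept <- docs[docs$type %in% kept_types, ]

# Number of documents per year and type, ready for a bar chart.
counts <- as.data.frame(
  table(date = docs_kept$date, type = docs_kept$type),
  responseName = "n"
)
print(counts)
```

Dropping rare types first keeps the faceted plot readable, at the cost of hiding a small share of the output.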
