Inability to Conduct Detailed Analysis on the CR Column of the Bibliometrix Data Frame #228

Open
elexingyu opened this issue Apr 7, 2024 · 6 comments
Labels
question Further information is requested

Comments

@elexingyu

Here is my code for batch fetching all the paper metadata of a specific journal and then converting it into a bibliometrix data frame for analysis. However, I'm facing an issue where the CR column in the output bibliometrix data frame contains OpenAlex IDs, which prevents the analysis of cited references' authors. Are there any future updates planned to address this issue, enabling the CR column in the bibliometrix data frame obtained from OpenAlex to follow the same format as WoS data?

OpenAlexR generated data frame:

results <- oa_fetch(
     entity = "works",
     locations.source.issn = "0140-6736",
     verbose = TRUE
)
biblio_data <- oa2bibliometrix(results)

> biblio_data[1:1, "CR", drop = FALSE]

CR
JOSEPH P. SIMMONS, 2011, PSYCHOLOGICAL SCIENCE W2016336925;W2038552289;W2050551731;W2127151679;W2149893809;W2169244873;W4236183769;W4254129346

WoS data frame:

> bib_analysis[1:1, "CR", drop = FALSE]

RAMACHANDRAN M, 2016, ANALYSIS-UK ANONYMOUS, 1981, PHILOSOPHICAL EXPLANATIONS;ANONYMOUS, 2009, REFLECTIVE KNOWLEDGE: APT BELIEF AND REFLECTIVE KNOWLEDGE;ANONYMOUS, AUSTRALIASIAN J PHIL;COHEN S, 2002, PHILOS PHENOMEN RES, V65, P309, DOI 10.1111/J.1933-1592.2002.TB00204.X;KALLESTRUP J, 2012, SYNTHESE, V189, P395, DOI 10.1007/S11229-011-9990-9;OLIN DORIS., 2003, PARADOX;SORENSEN RA, 1982, AUSTRALAS J PHILOS, V60, P355, DOI 10.1080/00048408212340761;VAN CLEVE J., 2005, SCEPTICS CONT ESSAYS, P45;VOGEL J, 2000, J PHILOS, V97, P602, DOI 10.2307/2678454;VOGEL J, 2008, J PHILOS, V105, P518, DOI 10.5840/JPHIL2008105931;WILLIAMSON T, 1992, MIND, V101, P217, DOI 10.1093/MIND/101.402.217;WILLIAMSON TIMOTHY., 2000, KNOWLEDGE AND ITS LIMITS
@trangdata
Collaborator

@elexingyu thank you for this feedback! Could you give me a more specific example, please? For example, what would the CR column of the WoS dataframe for W3001118548 be?

@elexingyu
Author

elexingyu commented Apr 10, 2024

@trangdata Thank you for your response! Below is the Web of Science record for W3001118548, which I converted into a bibliographic data frame with bibliometrix::convert2df before examining the CR column. From what I can see, each CR entry typically includes the first author's name, the publication year, the source title, the volume and page, and the DOI. These details make it possible to run co-citation network analyses on authors and sources in bibliometrix.

You can also refer to the following webpage: A brief introduction to bibliometrix, which contains detailed information about the CR column in the section "Analysis of Cited References".
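To make that structure concrete, here is a minimal base R sketch that splits a WoS-style CR string into fields. The helper name parse_cr is hypothetical (it is not part of bibliometrix), and it assumes references are separated by ";" and fields by ", " in the order author, year, source. Entries that lack a year, such as some ANONYMOUS references, would shift the positions, so treat it as an illustration only.

```r
# Split a WoS-style CR string into one row per cited reference.
# Assumption: ";" separates references and ", " separates fields,
# in the order author, year, source (as in the examples in this thread).
parse_cr <- function(cr) {
  refs <- trimws(strsplit(cr, ";", fixed = TRUE)[[1]])
  fields <- strsplit(refs, ", ", fixed = TRUE)
  # Pick the i-th field of each reference, or NA if it is missing
  pick <- function(i) {
    vapply(
      fields,
      function(f) if (length(f) >= i) f[i] else NA_character_,
      character(1)
    )
  }
  # Everything after "DOI " is the DOI; NA when the reference has none
  doi <- ifelse(
    grepl("DOI ", refs, fixed = TRUE),
    sub(".*DOI ", "", refs),
    NA_character_
  )
  data.frame(
    author = pick(1), year = pick(2), source = pick(3), doi = doi,
    stringsAsFactors = FALSE
  )
}

cr <- paste(
  "COHEN S, 2002, PHILOS PHENOMEN RES, V65, P309, DOI 10.1111/J.1933-1592.2002.TB00204.X",
  "VOGEL J, 2000, J PHILOS, V97, P602, DOI 10.2307/2678454",
  "OLIN DORIS., 2003, PARADOX",
  sep = ";"
)
parsed <- parse_cr(cr)
parsed[, c("author", "year", "source")]
```

The author and source columns are the parts that the author and source co-citation networks rely on, which is why a CR column containing only OpenAlex IDs cannot support them.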


> data <- bibliometrix::convert2df("savedrecs.txt", dbsource = "wos", format = "plaintext")

> data$CR
[1] "ANONYMOUS, 2020, PEDIATR MED RODZ, V16, P9, DOI 10.15557/PIMR.2020.0003;ANONYMOUS, NOVEL CORONAVIRUS GE;ANONYMOUS, NOVEL CORONAVIR 0114;ANONYMOUS, NOVEL CORONAVIR 0112;ANONYMOUS, NOVEL CORONAVIR 0117;ANONYMOUS, NOVEL CORONAVIR 0121;ARABI YM, 2018, AM J RESP CRIT CARE, V197, P757, DOI 10.1164/RCCM.201706-1172OC;ARABI YM, 2018, TRIALS, V19, DOI 10.1186/S13063-017-2427-0;ASSIRI A, 2013, LANCET INFECT DIS, V13, P752, DOI 10.1016/S1473-3099(13)70204-4;CDC, 2020, 1 TRAV REL CAS 2019;CHU CM, 2004, THORAX, V59, P252, DOI 10.1136/THORAX.2003.012658;CUI J, 2019, NAT REV MICROBIOL, V17, P181, DOI 10.1038/S41579-018-0118-9;DE GROOT RJ, 2013, J VIROL, V87, P7790, DOI 10.1128/JVI.01244-13;DROSTEN C, 2003, NEW ENGL J MED, V348, P1967, DOI 10.1056/NEJMOA030747;ECKARDT KU, 2012, KIDNEY INT SUPPL, V2, P7, DOI 10.1038/KISUP.2012.8;FALZARANO D, 2013, NAT MED, V19, P1313, DOI 10.1038/NM.3362;FAURE E, 2014, PLOS ONE, V9, DOI 10.1371/JOURNAL.PONE.0088716;GAO C, 2020, CRIT CARE MED, V48, P451, DOI 10.1097/CCM.0000000000004207;GARNER JS, 1988, AM J INFECT CONTROL, V16, P128, DOI 10.1016/0196-6553(88)90053-3;GE XY, 2013, NATURE, V503, P535, DOI 10.1038/NATURE12711;HE L, 2006, J PATHOL, V210, P288, DOI 10.1002/PATH.2067;KSIAZEK TG, 2003, NEW ENGL J MED, V348, P1953, DOI 10.1056/NEJMOA030781;KUIKEN T, 2003, LANCET, V362, P263, DOI 10.1016/S0140-6736(03)13967-0;LANSBURY L, 2019, COCHRANE DB SYST REV, DOI 10.1002/14651858.CD010406.PUB3;LEE N, 2003, NEW ENGL J MED, V348, P1986, DOI 10.1056/NEJMOA030685;MAHALLAWI WH, 2018, CYTOKINE, V104, P8, DOI 10.1016/J.CYTO.2018.01.025;PERLMAN S, 2009, NAT REV MICROBIOL, V7, P439, DOI 10.1038/NRMICRO2147;RICHMAN DD, 2016, CLIN VIROLOGY;SANZ F, 2011, EUR RESP J S55, V38, P2492;SHEAHAN TP, 2020, NAT COMMUN, V11, DOI 10.1038/S41467-019-13940-6;SHEAHAN TP, 2017, SCI TRANSL MED, V9, DOI 10.1126/SCITRANSLMED.AAL3653;STOCKMAN LJ, 2006, PLOS MED, V3, P1525, DOI 10.1371/JOURNAL.PMED.0030343;WANG MANLI, 2013, VIROLOGICA SINICA, V28, P315, DOI 
10.1007/S12250-013-3402-X;WHO, 2004, SUMM PROB SARS CAS O;WHO, 2019, MIDDLE E RESP SYNDRO;WONG CK, 2004, CLIN EXP IMMUNOL, V136, P95, DOI 10.1111/J.1365-2249.2004.02415.X;ZAKI AM, 2012, NEW ENGL J MED, V367, P1814, DOI 10.1056/NEJMOA1211721"

Below is the co-citation network analysis using the dataset from openalexR, where, unfortunately, it's impossible to display the authors' names and only the OpenAlex IDs are shown:
[Image: Co_citationNetwork-_2024-04-10173030 479655]

Below is the co-citation network analysis using the WoS data, which successfully analyzed the author networks. However, it also has a drawback: it assigns all missing paper author information to 'anonymous.' OpenAlex should be able to avoid this issue due to its more comprehensive data:
[Image: Co_citationNetwork-_2024-04-10173501 775222]

@trangdata
Collaborator

trangdata commented Apr 10, 2024

@elexingyu Thank you for the explanation. I will need @massimoaria's input since he's more familiar with the internals of bibliometrix.

In the meantime, however, you can try manually modifying the CR column. For example:

library(openalexR)
biblio_data <- oa_fetch(identifier = c("W3001118548", "W2015795623")) |>
  oa2bibliometrix()
get_cr <- function(cr) {
  r <- oa_fetch(identifier = strsplit(cr, ";")[[1]])
  auths <- show_works(r, identity)[["first_author"]]
  paste(auths, collapse = ";")
}
biblio_data$CR <- sapply(biblio_data$CR, get_cr)
str(biblio_data$CR)
#>  chr [1:2] "Douglas G. Altman;Jaswinder Gill;Harriet G. Oldham;J. A. Tytler;G. L. Serfontein" ...

Created on 2024-04-10 with reprex v2.0.2

From here, your co-citation network analysis should show first authors' names instead of OpenAlex IDs.

To get a CR column more similar to the output from bibliometrix::convert2df (with first author's name, publication year, title, location, DOI), you can edit the get_cr function:

library(openalexR)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
biblio_data <- oa_fetch(identifier = c("W3001118548", "W2015795623")) |>
  oa2bibliometrix()
shorten_doi <- function(doi) {
  gsub("^https://doi.org/", "", doi)
}
get_cr <- function(cr) {
  r <- oa_fetch(
    identifier = strsplit(cr, ";")[[1]],
    options = list(select = c(
      "authorships", "display_name", "publication_year", 
      "primary_location", "doi"
    ))
  )
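  # First author of each cited work. get_auth_position is an unexported
  # openalexR helper (hence `:::`), so it may change between versions.
  auths <- vapply(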
    r$author, openalexR:::get_auth_position, character(1),
    position = "first"
  )
  r |>
    mutate(
      first_aut = auths,
      doi = paste("DOI", shorten_doi(doi)),
      o = paste(first_aut, publication_year, display_name, doi, so, sep = ", ")
    ) |>
    pull(o) |>
    paste(collapse = ";") |>
    toupper()
}
biblio_data$CR <- sapply(biblio_data$CR, get_cr)
str(biblio_data$CR)
#>  chr [1:2] "DOUGLAS G. ALTMAN, 1983, MEASUREMENT IN MEDICINE: THE ANALYSIS OF METHOD COMPARISON STUDIES, DOI 10.2307/298793"| __truncated__ ...

Created on 2024-04-10 with reprex v2.0.2
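Note that get_cr calls oa_fetch once per row, so a whole journal can mean many API requests, and works cited by several papers get fetched repeatedly. One possible refinement, sketched below under stated assumptions, is to collect the distinct IDs first, fetch them once, and map the formatted strings back. The helpers unique_cr_ids and map_cr and the lookup vector are hypothetical names for illustration; in practice lookup would be built from a single oa_fetch(identifier = ids) call, which may need batching for very long ID lists.

```r
# Collect the distinct OpenAlex IDs that appear anywhere in a CR column.
unique_cr_ids <- function(cr_col) {
  ids <- unlist(strsplit(cr_col[!is.na(cr_col)], ";", fixed = TRUE))
  unique(ids[nzchar(ids)])
}

# Replace each ID in one CR string with its formatted reference.
# `lookup` is a named character vector (hypothetical): names are
# OpenAlex IDs, values are the formatted reference strings.
map_cr <- function(cr, lookup) {
  if (is.na(cr) || !nzchar(cr)) return(NA_character_)
  ids <- strsplit(cr, ";", fixed = TRUE)[[1]]
  refs <- unname(lookup[ids])
  refs <- refs[!is.na(refs)]  # drop IDs that could not be resolved
  if (length(refs) == 0) return(NA_character_)
  paste(refs, collapse = ";")
}

# Toy data standing in for biblio_data$CR
cr_col <- c("W1;W2", "W2;W3", NA)
ids <- unique_cr_ids(cr_col)

# Stand-in for the result of one API call over `ids`; made-up values
lookup <- c(W1 = "A ONE, 2001, J A", W2 = "B TWO, 2002, J B")

new_cr <- vapply(
  cr_col, map_cr, character(1),
  lookup = lookup, USE.NAMES = FALSE
)
```

Here W2 is resolved once even though two rows cite it, and W3, which is missing from the lookup, is dropped rather than turned into an "NA" label.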

@elexingyu
Author

elexingyu commented Apr 11, 2024

@trangdata Fantastic! Thank you for your reply and your code! I am touched by your selfless spirit in answering questions for others! The code basically works now, and it generates beautiful results for the authors' co-citation network! However, co-citation network analysis can be run on papers, authors, and sources, and each of these depends on how the CR column is laid out, so I made some modifications to the code you generously provided:

  1. To handle cases where the CR column is NA, I added a detection mechanism;

  2. I removed the display name because including it in the CR column affects the final analysis of the paper's Co-citation Network, resulting in overly long node labels in the generated images, which affects aesthetics.

Code:

library(openalexR)
library(dplyr)
library(bibliometrix)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

openalex_data <- oa_fetch(
  entity = "works",
  locations.source.issn = "0956-7976",
  to_publication_date = "1990-12-31",
  verbose = TRUE
)

biblio_data <- oa2bibliometrix(openalex_data)

# Retrieve CR column
shorten_doi <- function(doi) {
  gsub("^https://doi.org/", "", doi)
}

get_cr <- function(cr) {
  if (is.na(cr) || cr == "") {
    return(NA)  # Handle NA or empty strings
  }
  # Split the CR field, ensuring to pass a character vector to oa_fetch
  cr_ids <- unlist(strsplit(cr, ";"))
  cr_ids <- cr_ids[cr_ids != ""]  # Filter out empty strings
  if (length(cr_ids) == 0) {
    return(NA)  # Return NA if no valid CR identifiers are found
  }
  
  # Fetch data using OpenAlex API
  tryCatch({
    r <- oa_fetch(
      identifier = cr_ids,
      options = list(select = c(
        "authorships", "publication_year", 
        "primary_location", "doi", "display_name"
      ))
    )
    if (is.null(r) || nrow(r) == 0) {
      return(NA)  # Return NA if query results are empty
    }
    auths <- vapply(
      r$author, openalexR:::get_auth_position, character(1),
      position = "first"
    )
    r |>
      dplyr::mutate(
        first_aut = auths,
        doi = paste("DOI", shorten_doi(doi)),
        o = paste(first_aut, publication_year, doi, so, sep = ", ")
      ) |>
      dplyr::pull(o) |>
      paste(collapse = ";") |>
      toupper()
  }, error = function(e) {
    return(NA)  # Handle any error situations
  })
}

biblio_data$CR <- sapply(biblio_data$CR, get_cr)

Before the modification, the result of the Papers-Co-citation Network analysis:

[Image: Co_citationNetwork-_2024-04-11211520 386339]

After the modification, the result of the Papers-Co-citation Network analysis:
[Image: Co_citationNetwork-_2024-04-11211757 111365]

However, there is a minor issue now; when analyzing the source-Co-citation Network, an error occurs. I tried to analyze the original code of this function in bibliometrix, but it is quite complex, and difficult for a newbie in R like me.

The error message is as follows:

[1] "Field CR_SO is not a column name of input data frame"
Warning: Error in crossprod: requires numeric/complex matrix/vector arguments
  148: base::crossprod
  147: crossprod
  145: biblioNetwork
  144: intellectualStructure [utils.R#1003]
  143: eventReactiveValueFunc [/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/bibliometrix/biblioshiny/server.R#4973]
   99: COCITnetwork
   98: shinyRenderWidget [/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/bibliometrix/biblioshiny/server.R#4982]
   97: func
   84: renderFunc
   83: output$cocitPlot
    2: runApp
    1: biblioshiny

@trangdata
Collaborator

Looks like you need the CR_SO column in your input data frame for your source co-citation network analysis. So essentially you need a function similar to get_cr that extracts only the so column from r, or you could modify get_cr to return CR_SO as well. @massimoaria do we need to add this column to the output of oa2bibliometrix?
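A rough sketch of that idea, assuming r has a so column with the source name (as in the get_cr examples in this thread) and assuming bibliometrix will accept a precomputed, semicolon-separated CR_SO column. That second assumption is inferred from the error message and has not been checked against the bibliometrix source. The helper name build_cr_so is hypothetical.

```r
# Build a semicolon-separated source string from a data frame of cited works.
# Assumption: `r` has a `so` column holding the source (journal) name.
build_cr_so <- function(r) {
  so <- r$so
  so <- so[!is.na(so) & nzchar(so)]  # skip works with no known source
  if (length(so) == 0) return(NA_character_)
  paste(toupper(so), collapse = ";")
}

# Stand-in for the result of oa_fetch(); real data would come from the API.
r <- data.frame(
  so = c("The Lancet", NA, "Nature"),
  stringsAsFactors = FALSE
)
cr_so <- build_cr_so(r)
```

Inside get_cr, the same data frame r could feed both the CR string and this CR_SO string, so no extra API calls would be needed.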

@tanchangde

when analyzing the source-Co-citation Network, an error occurs

@elexingyu Please provide a minimal reproducible example.

@trangdata trangdata added the question Further information is requested label Jul 3, 2024