Show "Audit compliance" #50

Open
3 tasks
cziaarm opened this issue Jan 30, 2020 · 8 comments

Comments

@cziaarm
Contributor

cziaarm commented Jan 30, 2020

Ideas for showing "audit compliance". For items to be auditable by RE for REF, they will need to be available via the core.ac.uk aggregator, and to verify openness they will need to be in the Unpaywall DB. They will also need to be available for text and data mining (TDM).

Additions to REF CC could give ticks for these to show "audit compliance", i.e. that they are auditable.

(With thanks to Jennifer Smith)

@jesusbagpuss
Contributor

I'm guessing these aren't meant to be human-tickable boxes - the machines will ask the questions?

Would we need to store both the state (yes/no) and a date-last-checked?

@wfyson
Contributor

wfyson commented Jan 31, 2020

A link to the audit guidance which has some paragraphs to back up the above checklist: https://www.ref.ac.uk/media/1164/ref-2019_04-audit-guidance.pdf

@cziaarm
Contributor Author

cziaarm commented Jan 31, 2020

@jesusbagpuss No, you are correct, they would very much be tick icons :)

Agreed on the stamp for those that are reliant on checking third party systems (unpaywall example below)

Whether or not something is available for TDM is within the ken of EPrints so no need to stamp that I think?

But it does highlight that this is a different notion. We are checking the status of something rather than a fact... for the other REF CC data it just needed to be true once... For the audit stuff, the question is more "Will this work?" rather than "Is this true?"

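```perl
{
  name => "unpaywall",
  type => "compound",
  fields => [
    {
      # whether the item was found in Unpaywall
      sub_name => "present",
      type => "set",
      options => [ qw/yes no/ ],
    },
    {
      # when the third-party system was last checked
      sub_name => "last_checked",
      type => "time",
    },
  ],
}
```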

@jesusbagpuss
Contributor

jesusbagpuss commented Jan 31, 2020

We might want to cache slightly more data from unpaywall...

  • is the item oa (is_oa == true) - based entirely on the DOI
  • is this repository listed as a source? (one oa_locations.pmh_id matches our repo OAI identifier)
  • are we the best place (does best_oa_location.url_for_pdf point to our repository?)
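A minimal sketch of how those three checks might be derived from an Unpaywall DOI record that has already been fetched and parsed from JSON. The function name, the `oai_prefix` and `repo_host` parameters, and the idea of matching `pmh_id` by prefix are assumptions for illustration, not anything EPrints or Unpaywall prescribes.

```python
from urllib.parse import urlparse


def summarise_unpaywall(record, oai_prefix, repo_host):
    """Reduce a parsed Unpaywall DOI record to the three flags listed above.

    oai_prefix and repo_host are hypothetical parameters, e.g.
    "oai:eprints.example.ac.uk:" and "eprints.example.ac.uk".
    """
    locations = record.get("oa_locations") or []
    best = record.get("best_oa_location") or {}
    best_pdf = best.get("url_for_pdf") or ""
    return {
        # is the item OA at all (based on the DOI)?
        "is_oa": bool(record.get("is_oa")),
        # does any location carry a pmh_id from our OAI identifier space?
        "repo_listed": any(
            (loc.get("pmh_id") or "").startswith(oai_prefix) for loc in locations
        ),
        # does the best location's PDF URL point at our host?
        "repo_is_best": urlparse(best_pdf).hostname == repo_host,
    }
```

The three booleans could then be cached alongside the date-last-checked discussed earlier in the thread.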

@cziaarm
Contributor Author

cziaarm commented Jan 31, 2020

Cheers @wfyson (and yes @jesusbagpuss), so from that doc the specific things are:

  • unpaywall (first stop... is there a doi in EPrints?)
    • is_oa
    • url_for_pdf (of any description)
    • oa_locations.pmh_id (is this repo there at all? is the reference correct at the article level?)
    • best_oa_location.url_for_pdf (is our repo the best place?... how much does that matter?)
  • CORE
    • datePublished
    • repositories.repositoryDocument.depositedDate

Now we all know the depositedDate is not available via rioxx, so as far as I'm aware CORE are scrabbling it out of the abstract page (and the deposit date is not necessarily available nor a compliant deposit date).

Ideally we'd just do a rioxx implementation that makes depositedDate available (and put the proper FCD in there)... then CORE and "the rioxx team" would catch up with their harvesting and standard writing.

But in lieu of that happening there may be value in the "Audit" tab (it's become a tab in my head now) to report on the data that CORE does have, and how it compares to what EPrints thinks it should be. (Traffic lights anyone? Green: present and matches; yellow: present but doesn't match; red: not present.)

Based on the guidance I think we can probably dial back the TDM, as the only mention of the word "text" (let alone "mining") is "with searchable text", and that is in reference to the url_for_pdf.
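The traffic-light comparison could be as small as the following sketch. The function name and the assumption that both values have already been normalised to the same format (for example ISO date strings) are illustrative only.

```python
def traffic_light(core_value, eprints_value):
    """Compare what CORE reports with what EPrints holds for one field.

    Assumes both values are already normalised to the same representation.
    """
    if core_value is None or core_value == "":
        return "red"      # not present in CORE
    if core_value == eprints_value:
        return "green"    # present and matches
    return "yellow"       # present but doesn't match
```

For instance, `traffic_light(core_deposited_date, eprints_fcd)` would give the status for the deposit date, where both argument names are hypothetical.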

@jesusbagpuss
Contributor

> Now we all know the depositedDate is not available via rioxx

The DOI isn't even directly available in RIOXX, something that Sheffield noted recently, as searching for things by DOI wasn't returning the expected records.
Also (in case it's of use/interest), the CORE search is case-sensitive for DOIs, which should be case-agnostic!? This may cause things to not be returned when actually they are there!
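Since DOIs are meant to be compared case-insensitively, one way to sidestep the problem on our side is to normalise DOIs before comparing records. This is a rough sketch; the function name and the set of prefixes stripped are assumptions, and it does not change how the CORE search itself behaves.

```python
import re

# Prefixes that commonly wrap a bare DOI (an illustrative, not exhaustive, list).
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:)", re.IGNORECASE)


def normalise_doi(value):
    """Strip a resolver or 'doi:' prefix and lower-case the DOI for comparison."""
    doi = _DOI_PREFIX.sub("", value.strip())
    return doi.lower()
```

Two records can then be matched with `normalise_doi(a) == normalise_doi(b)` regardless of how each system capitalises the DOI.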

> Ideally we'd just do a rioxx implementation

I have seriously mulled over the idea of creating a RIOXX3-alpha OAI-PMH metadata profile that addresses some of the stuff suggested here: https://github.com/antleaf/rioxx/issues ...

@MickEadie

One use-case we have is to: 'show me what unpaywall / core / crossref say about things we have actually selected for REF'. I wonder if a report could be set up as part of the REF Support plugin?

We have put together a script that looks for unpaywall data for a bunch of eprintids (screenshot attached). I could put the code for this on github (but it's very experimental!)
[screenshot: enlighten_unpaywall]

it used Martin Braendle's unpaywall plugin code as a starting point: https://github.com/eprintsug/OpenAccessType

Also on the CORE deposit date: we have started making the HOA_DATE_FCD, HOA_DATE_ACC and HOA_DATE_PUB dates public, and called FCD 'Date Deposited' as per CORE's guidelines https://core.ac.uk/ref-audit/#recommendations. There is an example in the Deposit and Record Details section of the summary page here: http://eprints.gla.ac.uk/174607/

We haven't checked yet whether CORE is now picking this up properly, but it is a short term fix. Also, not every site is keen to publish these dates, so ideally RIOXX would do this.

@jesusbagpuss
Contributor

From a thread (CORE/ unpaywall) on UK-CORR, the Unpaywall data from the API might be a bit 'stale' in some cases:

> Our dataset changes every day, for a number of reasons:
>   • New articles are published.
>   • Authors self-archive new OA copies to repositories for existing articles.
>   • Articles become OA as embargoes expire.
>   • Publisher-hosted "Bronze OA" articles (free-to-read but without an open license) are moved from OA to toll-access.
> For many use-cases, especially in enterprise contexts, it's important to stay up-to-date with these changes. The snapshot is a poor fit for this, since it's only updated a few times a year. Likewise, the API works poorly for this since it takes many months to scroll through the whole of DOI-space polling for changes.
> So, we built the Data Feed to address this issue.

We might want to capture the oa_location.updated data for our repository, and compare that to the data in EPrints. We can indicate that the Unpaywall data is stale, especially when their updated date is before our FOA date. (It would be nice to be able to poke Unpaywall in this situation to say "come and get this record again - it's got more good stuff in it").
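A hedged sketch of that staleness test: it assumes the `updated` value on the location is an ISO-style timestamp string whose first ten characters are the date, and that the FOA date is available from EPrints as `YYYY-MM-DD`. The function name and the choice to treat a missing `updated` value as stale are assumptions.

```python
from datetime import date


def unpaywall_is_stale(location_updated, foa_date):
    """Return True if Unpaywall last looked at our copy before our FOA date.

    location_updated: the 'updated' string from our oa_location (may be None).
    foa_date: our first-open-access date as 'YYYY-MM-DD' (hypothetical input).
    """
    if not location_updated:
        return True  # assumption: no timestamp means we cannot trust the data
    seen = date.fromisoformat(location_updated[:10])  # keep only the date part
    return seen < date.fromisoformat(foa_date)
```

A stale result could be shown next to the cached Unpaywall flags so that a "no" is not read as a definitive failure.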

For reference, unpaywall data format is described here: https://unpaywall.org/data-format
