Static data imports #370

Draft: wants to merge 3 commits into master

Contributor

@joeroe joeroe commented Dec 17, 2024

This PR represents a rethinking of the import pipeline with the aim of getting data into XRONOS sooner. Basically, I dropped the idea of creating a generalised architecture or UI for imports, which greatly complicated the problem and has been holding up imports for a long time.

The new approach is simply to write a rake task for each resource to be imported. Advantages of this approach:

  • It's reproducible – the source data is saved in db/import and the rake script shows exactly how it was imported
  • The scripts can also be used to populate new instances of XRONOS with 'real data' (such as dev environments!)
  • We can deal with the peculiarities of each dataset without having to build a generalised infrastructure every time (e.g. in the proof-of-concept, cleaning up the punctuation of citations)
  • We can update and re-run scripts if relevant parts of the data model change
  • It can be expanded to deal with versioned and live resources fairly easily, i.e. just re-run the task when there's new data
  • We can use existing tools to schedule imports to manage server load
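
As a rough illustration of the per-resource idea (this is not the script in this PR), the dataset-specific part of a task could boil down to a couple of small functions that parse the saved source file and tidy each row before anything touches the database. The tab-separated layout, the column names, the sample row and the helpers `clean_citation` and `parse_rows` are all hypothetical.

```ruby
# Hypothetical sketch of dataset-specific cleaning that a rake task could call.
# Column names and the sample data are invented for illustration only.

# Tidy citation punctuation: collapse runs of whitespace, drop stray
# trailing separators, and end with exactly one full stop.
def clean_citation(raw)
  text = raw.strip.gsub(/\s+/, " ")
  text = text.sub(/[\s.,;:]+\z/, "")
  "#{text}."
end

# Parse tab-separated text into an array of hashes keyed by the header row.
def parse_rows(tsv)
  header, *lines = tsv.lines.map(&:chomp).reject(&:empty?)
  keys = header.split("\t")
  lines.map { |line| keys.zip(line.split("\t")).to_h }
end

# Stand-in for a file that would live under db/import.
SAMPLE = "labnr\tbp\tcitation\nXX-1\t5000\tAuthor  2014 ,;\n"

rows = parse_rows(SAMPLE).map do |row|
  row.merge("citation" => clean_citation(row["citation"]))
end
```

Inside a real task the string would presumably be read from a file in db/import, and the cleaned rows handed to the models; keeping the cleaning in plain functions like these makes it easy to adjust and re-run when the data model changes.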

The major disadvantage is of course that bulk imports can only be done by sysadmins. However, I am no longer convinced that opening bulk imports to other users was ever a viable option. Even with a relatively small and clean dataset like the one below, there are too many potential problems to guard against, and a poorly specified import can create a huge mess.

I have written a proof-of-concept of this approach with Wang et al.'s (2014) database of Chinese radiocarbon dates. These are already in the production version of XRONOS via p3k14c, and so this will create more duplicates, but it illustrates the advantages of the scripted approach because we pick up richer data on the context and sample, as well as full(ish) bibliographic references. If you agree with this approach @MartinHinz, I'll add more scripts for other static resources (https://github.com/xronos-ch/xronos-import/issues/5, prioritising https://github.com/xronos-ch/xronos-import/issues/54) before finalising the PR and running them in production.

To perform the import you need to set the environment variable `ADMIN_USER_ID` and then run `bin/rails import:wang_et_al-2014`.
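
For illustration, a task could read and validate that variable along the following lines before doing any work. This is a sketch under assumptions, not the code in this PR: the helper name `admin_user_id_from` and the error message are hypothetical, and it assumes the variable holds a numeric user id.

```ruby
# Hypothetical helper: fetch ADMIN_USER_ID and fail early with a clear message.
# Accepts any hash-like object so it can be exercised without touching ENV.
def admin_user_id_from(env = ENV)
  raw = env.fetch("ADMIN_USER_ID") do
    raise ArgumentError, "ADMIN_USER_ID must be set before running the import"
  end
  Integer(raw, 10) # raises ArgumentError if the value is not a whole number
end
```

Failing before the import starts avoids discovering a missing or malformed id halfway through a long run.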

Using the concer. Previously they just had `has_paper_trail`, which
meant changes were recorded with paper trail but there was no
`revision_comment`.
@joeroe joeroe marked this pull request as draft December 17, 2024 13:39
END

PaperTrail.request(whodunnit: admin_user_id) do
  ActiveRecord::Base.transaction do
Contributor Author

Note everything is wrapped in a transaction, so if there are any errors the whole import is aborted.

@MartinHinz
Collaborator

Agreed, and in that case I will proceed in the same way for the dendro imports.
