Static data imports #370

joeroe · 2024-12-17T13:39:37Z

This PR rethinks the import pipeline with the aim of getting data into XRONOS sooner. Basically, I dropped the idea of creating a generalised architecture or UI for imports, which greatly complicated the problem and has been holding up imports for a long time.

The new approach is simply to write a rake task for each resource to be imported. Advantages of this approach:

- It's reproducible – the source data is saved in db/import and the rake script shows exactly how it was imported
- The scripts can also be used to populate new instances of XRONOS with 'real data' (such as dev environments!)
- We can deal with the peculiarities of each dataset without having to build a generalised infrastructure every time (e.g. in the proof-of-concept, cleaning up the punctuation of citations)
- We can update and re-run scripts if relevant parts of the data model change
- It can be expanded to deal with versioned and live resources fairly easily, i.e. just re-run the task when there's new data
- We can use existing tools to schedule imports to manage server load
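To illustrate the kind of dataset-specific handling meant here, a per-resource script can carry its own small cleanup helpers. This is only a minimal sketch: `clean_citation`, `prepare_rows`, the `:citation` key and the cleanup rules themselves are assumptions for illustration, not the code in this PR.

```ruby
# Hypothetical cleanup rules for one dataset's citation strings.
# The rules actually used in the PR are not shown here; these are assumptions.
def clean_citation(raw)
  raw.strip
     .gsub(/\s+/, " ")            # collapse runs of whitespace
     .gsub(/\s+([,.;:])/, '\1')   # drop spaces before punctuation
     .sub(/[,;:]+\z/, "")         # drop dangling trailing separators
end

# Per-dataset step: clean each row's citation and skip rows left blank.
# Rows are plain hashes with a (hypothetical) :citation key.
def prepare_rows(rows)
  rows.map { |row| row.merge(citation: clean_citation(row.fetch(:citation))) }
      .reject { |row| row[:citation].empty? }
end
```

Because the helpers live next to the task for one resource, they can be as specific to that dataset as needed without affecting other imports.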

The major disadvantage is of course that bulk imports can only be done by sysadmins. However, I am no longer convinced that opening bulk imports to other users was ever a viable option. Even with a relatively small and clean dataset like the one below, there are too many potential problems to guard against, and a poorly specified import can create a huge mess.

I have written a proof-of-concept of this approach with Wang et al.'s (2014) database of Chinese radiocarbon dates. These are already in the production version of XRONOS via p3k14c, and so this will create more duplicates, but it illustrates the advantages of the scripted approach because we pick up richer data on the context and sample, as well as full(ish) bibliographic references. If you agree with this approach @MartinHinz, I'll add more scripts for other static resources (https://github.com/xronos-ch/xronos-import/issues/5, prioritising https://github.com/xronos-ch/xronos-import/issues/54) before finalising the PR and running them in production.

To perform the import, you need to set the environment variable `ADMIN_USER_ID` and then run `bin/rails import:wang_et_al-2014`.
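A task that depends on this variable would probably want to fail early, with a clear message, when it is missing or malformed. A minimal sketch of such a guard follows; the helper name `admin_user_id_from_env` and the error text are hypothetical and not taken from the PR.

```ruby
# Hypothetical helper: read ADMIN_USER_ID and fail early with a clear message.
# `env` defaults to the process environment; a plain Hash works too.
def admin_user_id_from_env(env = ENV)
  raw = env.fetch("ADMIN_USER_ID") do
    raise ArgumentError, "Set ADMIN_USER_ID to the id of the importing admin user"
  end
  Integer(raw, 10) # raises ArgumentError for blank or non-numeric values
end
```

Passing the environment as an argument keeps the guard easy to exercise without touching the real process environment.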

joeroe · 2024-12-17T13:41:03Z

lib/tasks/import.rake

+    END
+
+    PaperTrail.request(whodunnit: admin_user_id) do
+      ActiveRecord::Base.transaction do


Note everything is wrapped in a transaction, so if there are any errors the whole import is aborted.
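The all-or-nothing behaviour can be illustrated in plain Ruby. This is only a simulation under stated assumptions: `FakeStore`, `import_all` and the `:lab_code` field are hypothetical stand-ins, and the real task relies on `ActiveRecord::Base.transaction` rather than anything like this in-memory snapshot.

```ruby
# Hypothetical in-memory stand-in for a database table, used only to
# illustrate all-or-nothing behaviour. Not part of the PR.
class FakeStore
  attr_reader :records

  def initialize
    @records = []
  end

  # Run the block; if it raises, restore the records as they were before
  # the block started, then re-raise the error.
  def transaction
    snapshot = @records.dup
    yield
  rescue StandardError
    @records = snapshot
    raise
  end

  # Reject rows without a (hypothetical) lab code, otherwise store them.
  def insert!(record)
    raise ArgumentError, "missing lab code" if record[:lab_code].to_s.empty?
    @records << record
  end
end

# Import every row inside one transaction: one bad row aborts them all.
def import_all(store, rows)
  store.transaction do
    rows.each { |row| store.insert!(row) }
  end
end
```

With this shape, a failure on the last row leaves the store exactly as it was before the import started, which is the property the comment above describes.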

MartinHinz · 2024-12-17T14:34:23Z

Agreed, and in that case I will proceed in the same way for the dendro imports.

joeroe added 3 commits December 17, 2024 14:11

Make contexts, materials, and typos versioned

5a1e310

Using the concern. Previously they just had `has_paper_trail`, which meant changes were recorded with PaperTrail but there was no `revision_comment`.

Bump schema version

fb98657

Add task to import Wang et al. 2014

85fe216

joeroe marked this pull request as draft December 17, 2024 13:39
