On detecting deleted files across versions #525

marcolarosa · 2021-01-12T00:23:22Z

In #522 I talked about a way we might use S3 as a backend. One point I made is that versioning would require pulling the entire object down from S3, which would be terrible for very large objects (TB-sized objects, though even a few GB would make updates slow).

In his reply @pwinckles stated that the most recent inventory is likely the only thing needed.

This ticket is about thinking through how to detect that a file has been deleted in the next version without needing the whole object (I can't see how that is possible, hence why I'm asking for help!).

Consider the following:

  v1                                     v2
  |- File A - hash X                     |- File A - hash X

No change; do not create new version.

  v1                                     v2
  |- File A - hash X                     |- File A - hash X
                                         |- File B - hash Y

New file; create new version referencing File A -> v1 and File B -> v2

  v1                                     v2
  |- File A - hash X                     |- File A - hash Z
  |- File B - hash Y                     |- File B - hash Y
    
File changed (File A); create new version referencing File B -> v1, File A -> v2

Our library works by digesting the path (walking the object tree and producing pairs of files + hashes) and then comparing the new tree to the existing tree. In all of the cases above we get the expected behaviour. However, if we didn't have the whole dataset available when creating the new version, then changing one file would result in all of the other data being removed from the next version.

  v1                                     v2
  |- File A - hash X                     |- File A - hash Z
  |- File B - hash Y                     
    
File changed (File A); create new version referencing File A -> v2, but File B ends up removed from the new version.
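To make the failure mode concrete, here is a minimal sketch of the digest-and-compare approach described above. It is not the library's actual code: `diff_trees` and the flat path -> hash dictionaries are assumptions made purely for illustration.

```python
from typing import Dict, List

# Hypothetical representation: logical path -> content hash.
Tree = Dict[str, str]


def diff_trees(old: Tree, new: Tree) -> Dict[str, List[str]]:
    """Compare two path -> hash digests and classify every path."""
    return {
        "added": sorted(p for p in new if p not in old),
        "changed": sorted(p for p in new if p in old and new[p] != old[p]),
        "removed": sorted(p for p in old if p not in new),
        "unchanged": sorted(p for p in new if p in old and new[p] == old[p]),
    }


v1 = {"File A": "X", "File B": "Y"}

# Whole tree supplied for v2: only File A is reported as changed.
full_v2 = {"File A": "Z", "File B": "Y"}
print(diff_trees(v1, full_v2))

# Only the changed file supplied: File B looks exactly like a deletion.
partial_v2 = {"File A": "Z"}
print(diff_trees(v1, partial_v2))
```

With a tree-to-tree comparison, "not supplied" and "deleted" are the same signal, which is why the partial case drops File B.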

So, if I've thought this through correctly, comparing against the latest inventory rather than a full digest means we will pick up file changes and file additions. But we won't be able to remove a file from one version to the next, because we would always need to include everything that is referenced in the latest version.
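A rough sketch of what that inventory-based comparison might look like, under some loud assumptions: the latest inventory's state is simplified to a path -> hash dictionary (a real OCFL inventory keys its state by digest), and `next_state` and its `removed` parameter are hypothetical names that nothing in this thread or in the library defines.

```python
from typing import Dict, Iterable

# Simplified, hypothetical view of a version state: logical path -> content hash.
State = Dict[str, str]


def next_state(latest: State, uploaded: State, removed: Iterable[str] = ()) -> State:
    """Build the next version's state from the latest inventory state alone.

    Paths missing from `uploaded` are carried forward, so a file only
    disappears when the caller names it in `removed` (a hypothetical
    parameter invented for this sketch).
    """
    state = dict(latest)
    state.update(uploaded)  # additions and changed hashes
    for path in removed:
        state.pop(path, None)  # explicit deletions only
    return state


latest = {"File A": "X", "File B": "Y"}

# Only File A is supplied: File B is carried forward, never inferred as deleted.
print(next_state(latest, {"File A": "Z"}))

# A deletion has to be stated explicitly; it cannot be detected from absence.
print(next_state(latest, {"File A": "Z"}, removed=["File B"]))
```

The point of the sketch is the trade-off: absence can no longer mean deletion, so removing a file would need some explicit signal from the caller.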

Is there another way to detect file deletions across versions without needing all of the object data up to that point?


marcolarosa · 2021-01-13T07:21:04Z

After a good long discussion with my colleague @ptsefton it has become clear to me that what I'm discussing is an implementation detail of my library. Accordingly, I'm moving this ticket to my library but I'll link it here so that anyone who wants to follow along can.

Discussion and ideas about handling object operations in a sensible manner: CoEDL/ocfl-js#3

That said, this ticket can be closed if necessary.

awoods · 2021-01-13T14:54:43Z

Thanks for closing the loop on this, @marcolarosa. It is very exciting to see and follow how you are implementing OCFL for your use case. Please keep everyone posted, and feel free to join the community meetings whenever it makes sense.

marcolarosa mentioned this issue Jan 13, 2021: On detecting deleted files across versions CoEDL/ocfl-js#3 (Open)

awoods closed this as completed Jan 13, 2021

rosy1280 added this to the 1.1 milestone May 18, 2021
