On detecting deleted files across versions #525

Closed
marcolarosa opened this issue Jan 12, 2021 · 2 comments

Comments

@marcolarosa

marcolarosa commented Jan 12, 2021

In #522 I talked about a way we might use S3 as a backend. One point I made is that versioning would require pulling the entire object down from S3, which would be terrible for very large objects (TB-sized objects, though even a few GB would make updates slow).

In his reply @pwinckles stated that the most recent inventory is likely the only thing needed.

This ticket is about thinking through how to detect that a file has been deleted in the next version without needing the whole object (which I can't see how to do, hence why I'm asking for help!).

Consider the following:

  v1                                     v2
  |- File A - hash X                     |- File A - hash X

No change; do not create new version.

  v1                                     v2
  |- File A - hash X                     |- File A - hash X
                                         |- File B - hash Y

New file; create new version referencing File A -> v1 and File B -> v2

  v1                                     v2
  |- File A - hash X                     |- File A - hash Z
  |- File B - hash Y                     |- File B - hash Y

File changed (File A); create new version referencing File B -> v1, File A -> v2
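
To make the "File B -> v1, File A -> v2" referencing concrete, here is a minimal Python sketch. It assumes a simplified manifest that maps each content hash to the first place that content was stored; the function name `add_version` and the `vN/content/...` path layout are illustrative assumptions, not the library's API and not the exact OCFL inventory format.

```python
def add_version(manifest, version, files):
    """Record `files` (logical path -> content hash) as version `version`.

    `manifest` (content hash -> stored content path) is updated in place.
    Returns the version state: logical path -> stored content path.
    Hypothetical helper for illustration only.
    """
    state = {}
    for path, digest in sorted(files.items()):
        # Content is stored under this version only if its hash is new.
        if digest not in manifest:
            manifest[digest] = f"{version}/content/{path}"
        # Known hashes keep pointing at the version that first stored them.
        state[path] = manifest[digest]
    return state


manifest = {}
v1_state = add_version(manifest, "v1", {"File A": "X", "File B": "Y"})
v2_state = add_version(manifest, "v2", {"File A": "Z", "File B": "Y"})
```

In the third case above, `v2_state` points File A at `v2/content/File A` and File B at `v1/content/File B`, so only the changed content is stored again.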

Our library works by digesting the path (walking the object tree and producing pairs of files + hashes) and then comparing the new tree to the existing tree. In all of the cases above we get the expected behaviour. However, if the whole dataset were not available when creating the new version, then changing one file would result in all of the other files being removed from the next version.

  v1                                     v2
  |- File A - hash X                     |- File A - hash Z
  |- File B - hash Y

File changed (File A); create new version referencing File A -> v2 but File B ends up removed from the new version.
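
The digest-and-compare approach described above can be sketched roughly as follows. This is a Python illustration, not the library's actual code: the names `digest_tree` and `diff_trees` are hypothetical, and SHA-512 is assumed as the hash.

```python
import hashlib
import os


def digest_tree(root):
    """Walk `root` and return {relative path: SHA-512 hex digest}."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            hasher = hashlib.sha512()
            with open(full, "rb") as handle:
                # Read in chunks so large files are not loaded into memory.
                for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                    hasher.update(chunk)
            digests[os.path.relpath(full, root)] = hasher.hexdigest()
    return digests


def diff_trees(previous, current):
    """Compare two {path: hash} mappings the way a full-tree diff would."""
    return {
        "added": sorted(p for p in current if p not in previous),
        "changed": sorted(
            p for p in current if p in previous and current[p] != previous[p]
        ),
        # Anything absent from the new tree is treated as deleted.
        "removed": sorted(p for p in previous if p not in current),
    }


v1 = {"File A": "X", "File B": "Y"}
full_diff = diff_trees(v1, {"File A": "Z", "File B": "Y"})
partial_diff = diff_trees(v1, {"File A": "Z"})
```

With the whole tree present, `full_diff` reports only File A as changed. With only the changed file present, `partial_diff` also lists File B under `removed`, which is the unwanted result in the last diagram.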

So, if I've thought this through correctly, comparing against the latest inventory rather than a full digest means we will pick up file changes and file additions. However, we won't be able to remove a file from one version to the next, because the new version would always have to include everything that is referenced in the latest version.
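
A rough Python sketch of that inventory-only comparison, assuming the latest inventory can be reduced to a `{path: hash}` mapping and that only the staged (new or modified) files are available locally. The function name `next_state_from_inventory` is hypothetical.

```python
def next_state_from_inventory(latest_state, staged):
    """Overlay staged files on the state recorded in the latest inventory.

    Both arguments map logical path -> content hash. Hypothetical helper:
    it only illustrates the limitation discussed above.
    """
    next_state = dict(latest_state)  # carry everything forward by default
    added, changed = [], []
    for path, digest in staged.items():
        if path not in latest_state:
            added.append(path)
        elif latest_state[path] != digest:
            changed.append(path)
        next_state[path] = digest
    # A path missing from `staged` is ambiguous: untouched or deleted?
    # This overlay always treats it as untouched.
    return next_state, sorted(added), sorted(changed)


latest = {"File A": "X", "File B": "Y"}
staged = {"File A": "Z", "File C": "W"}
state, added, changed = next_state_from_inventory(latest, staged)
```

Here File A is detected as changed and File C as added, while File B is carried into the new version unchanged. The staged files alone give no way to say that File B should be dropped.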

Is there another way to detect file deletions across versions without needing all of the object data up to that point?

@marcolarosa
Author

marcolarosa commented Jan 13, 2021

After a good long discussion with my colleague @ptsefton it has become clear to me that what I'm discussing is an implementation detail of my library. Accordingly, I'm moving this ticket to my library but I'll link it here so that anyone who wants to follow along can.

Discussion and ideas about handling object operations in a sensible manner: CoEDL/ocfl-js#3

That said, this ticket can be closed if necessary.

@awoods
Member

awoods commented Jan 13, 2021

Thanks for closing the loop on this, @marcolarosa. It is very exciting to see and follow how you are implementing OCFL for your use case. Please keep everyone posted, and feel free to join the community meetings whenever it makes sense.

@awoods awoods closed this as completed Jan 13, 2021
@rosy1280 rosy1280 added this to the 1.1 milestone May 18, 2021