
How to model crash transitions ? #77

Open
damnMeddlingKid opened this issue Aug 21, 2024 · 6 comments
damnMeddlingKid commented Aug 21, 2024

I want to model a serial action that can crash. Here's an example:

fair action DoWork:
    if current_state == State.Start:
        lock_service.acquire_lock()
        do_something()
        do_another_thing()
        lock_service.release_lock()

It is OK to yield anywhere in this action, but if the action crashes I would like to clean up some state and release the lock.

Right now I'm using a workaround where I make the action atomic and use something like

if MaybeFail():
   return

to place failures between lines.
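A rough sketch of what that looks like (MaybeFail is my own helper; it nondeterministically picks whether to crash at that point, so the checker explores both branches — exact syntax may differ):

```
func MaybeFail():
    # Nondeterministic choice; the model checker explores both outcomes.
    failed = any [True, False]
    return failed

atomic fair action DoWork:
    if current_state == State.Start:
        lock_service.acquire_lock()
        if MaybeFail():
            lock_service.release_lock()  # cleanup on the simulated crash
            return
        do_something()
        if MaybeFail():
            lock_service.release_lock()
            return
        do_another_thing()
        lock_service.release_lock()
```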

damnMeddlingKid changed the title from "How to handle crash transitions ?" to "How to model crash transitions ?" Aug 21, 2024
jayaprabhakar (Collaborator) commented:

I was considering adding Python's try-except-else-finally clause, or supporting other types of flows that can yield but never crash. But for most real examples they were not required, so I didn't bother to add it.

Can you let me know the exact case when you would need this? Based on the name lock_service, I assume, the lock is managed by a separate microservice/database.

In that case, no same-action cleanup solution would really work when the model is translated into real code during development. The real solution would require a separate fair action to detect and clean up.

damnMeddlingKid (Author) commented:

> Can you let me know the exact case when you would need this? Based on the name lock_service, I assume, the lock is managed by a separate microservice/database.

Yeah exactly. We have a process that acquires a lock from the lock service and then that process can die while holding the lock. In production we have a periodic health check that can detect that the process has died and then it performs some cleanup.

But I'm not sure how to model that detection.


jayaprabhakar commented Aug 21, 2024

This will depend on how much you want to model, and on what happens when the node is still working and thinks it holds the lock, but the lock service thinks the node failed and has released the lock for some other process to acquire.

In the real world, the implementation will most likely put a timeout on the lock so it gets released after the timeout, and the node holding it will send periodic heartbeats that act as lease renewals of the lock. The heartbeat interval will be significantly shorter than the lock-release timeout. Before each critical operation, the node must check whether it still holds the lock (that is, whether the lock was last renewed less than a heartbeat interval ago).

Since in the system you are modeling you are not interested in how lock renewal and leases are implemented, you don't need to model the heartbeats, renewals, etc.; all you need to know is whether the node still holds the lock.

You can simply maintain a holds_lock boolean variable. Before each critical step, atomically check holds_lock and execute the step. A separate fair action, LockTimeout, releases the lock.

fair action DoWork:
    if current_state == State.Start:
        lock_service.acquire_lock()
        holds_lock = True
        do_something()
        do_another_thing()
        # These two steps are optional for correctness, but good to state in the spec.
        holds_lock = False
        lock_service.release_lock()

func do_something():
    atomic:
        if not holds_lock:
            pass # Just to ensure it marks this block as enabled
            return
        ....

func do_another_thing():
    # some steps
    atomic:
        if not holds_lock:
            pass
            return
        critical_section

    # more steps

fair action LockTimeout:
    holds_lock = False
    # Calling lock_service.release_lock() is optional; that is an
    # implementation optimization rather than something needed for modeling correctness.

jayaprabhakar (Collaborator) commented:

Btw, the spec might be slightly simpler if lock_service.acquire_lock() returned a lock that you store as a state variable. The holds_lock flag would then be part of the lock record. But for now, start with whatever is simplest.
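Roughly something like this (a sketch only; it assumes acquire_lock() can return a record with a held field, which is a hypothetical shape, not an existing API):

```
fair action DoWork:
    if current_state == State.Start:
        lock = lock_service.acquire_lock()  # hypothetical: returns a record with lock.held = True
        do_something()
        do_another_thing()
        lock.held = False
        lock_service.release_lock()

fair action LockTimeout:
    if lock != None:
        lock.held = False
```

The helper funcs would then check lock.held instead of a standalone holds_lock variable.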


damnMeddlingKid commented Aug 22, 2024

Thanks! It's similar to what I've been doing. I didn't realize that I need a pass to activate the block, but it makes sense that some statement must execute for the block to be enabled.

In my actual model there are several more lines and I'd like to model failing at any point, so it gets a bit tedious to check holds_lock everywhere. Implementing try/except makes sense to me, though maybe with different terminology. It's up to the author to think about whether the model reflects reality.


jayaprabhakar commented Sep 3, 2024

In a real implementation, isn't the lock always checked before any critical operation (either explicitly, to verify the lock hasn't timed out, or implicitly, where the individual operations have a much shorter timeout than the lock timeout)?

In most cases require holds_lock should be sufficient. In that case, the spec stays short and makes the implicit assumption explicit.
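For example, each helper could shrink to something like this (a sketch; with require, the step is simply not taken when holds_lock is false, instead of firing and returning early as in the pass/return version):

```
func do_something():
    atomic:
        require holds_lock
        critical_section
```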
