basic architecture doc #19
Comments
Hi @gregwebs, thanks for your comment. We have such a document internally, and I plan on formatting it for external use. (I also need to make sure it's up to date with reality.) |
Also, to address some of your specific questions:
|
Thanks for the detailed response! Everything makes sense here except for the first one about async submission. What happens if I tell Reflow to run and then press Ctrl-C? |
Reflow behaves just like any other program interpreter: Ctrl-C stops execution. Of course, because Reflow is incremental and intermediate reductions are cached, if you start it up again, it's (most of the time) able to pick up from where it stopped. |
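To illustrate the restart behavior described above, here is a minimal sketch of caching intermediate results by a digest of each step's name and input. It is not Reflow's implementation: the `Cache` type, the `run` function, the three toy steps, and the `budget` parameter (which stands in for an interruption such as Ctrl-C) are all hypothetical, and an in-memory map stands in for whatever persistent store a real system would use.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// Cache maps a digest of (step name, input) to that step's result.
// Assumption: an in-memory map stands in for a persistent cache.
type Cache map[string]string

// key derives a cache key from a step name and its input.
func key(name, input string) string {
	sum := sha256.Sum256([]byte(name + "\x00" + input))
	return hex.EncodeToString(sum[:])
}

// step is a hypothetical named unit of work.
type step struct {
	name string
	fn   func(string) string
}

// run evaluates a fixed three-step pipeline. Steps whose results are
// already cached are skipped. budget limits how many steps may actually
// execute, which mimics the run being interrupted part-way through.
func run(c Cache, input string, budget int) (out string, executed int, done bool) {
	steps := []step{
		{"upper", strings.ToUpper},
		{"exclaim", func(s string) string { return s + "!" }},
		{"repeat", func(s string) string { return s + s }},
	}
	out = input
	for _, s := range steps {
		k := key(s.name, out)
		if v, ok := c[k]; ok {
			out = v // cache hit: reuse the earlier result
			continue
		}
		if executed == budget {
			return out, executed, false // "interrupted" before this step
		}
		out = s.fn(out)
		c[k] = out
		executed++
	}
	return out, executed, true
}

// demo interrupts a first run after two steps, then restarts it
// against the same cache.
func demo() (first, second int, result string) {
	c := Cache{}
	_, first, _ = run(c, "hi", 2)
	result, second, _ = run(c, "hi", 10)
	return first, second, result
}

func main() {
	first, second, result := demo()
	fmt.Println("steps executed in first run:", first)
	fmt.Println("steps executed after restart:", second)
	fmt.Println("result:", result)
}
```

In this sketch the restarted run executes only the step that had not finished before the interruption; the two earlier steps are served from the cache.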
Also, to clarify: when you hit Ctrl-C, Reflow stops maintaining keepalives to the worker instances; this causes them to terminate automatically after 10 minutes of idleness. This is also noted in the README:
|
Does it catch Ctrl-C and send a stop request to the agent, or is Reflow sending keepalives to the agent? Thanks for explaining. Idle instance termination is really a separate concern from this question, since another job may be scheduled to the instance, so I don't think the README statement explains this aspect. |
Reflow maintains a keepalive to all of its resource allocations. Thus, if the job is killed in any way, or loses connectivity, the keepalive is no longer maintained. An instance is considered idle if none of its allocs are alive. |
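The keepalive rule described above can be sketched as a lease that the client must keep renewing. This is only an illustration under stated assumptions, not Reflow's actual code or API: the `Alloc` and `Instance` types, their methods, the one-minute lease, and the 30-second renewal interval are all hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Alloc is a hypothetical resource allocation whose lease must be
// renewed periodically by the client that owns it.
type Alloc struct {
	ID      string
	Expires time.Time
}

// Keepalive extends the lease to now+ttl.
func (a *Alloc) Keepalive(now time.Time, ttl time.Duration) {
	a.Expires = now.Add(ttl)
}

// Alive reports whether the lease is still valid at time now.
func (a *Alloc) Alive(now time.Time) bool {
	return now.Before(a.Expires)
}

// Instance is a hypothetical worker instance hosting allocs.
type Instance struct {
	Allocs []*Alloc
}

// Idle reports whether none of the instance's allocs are alive at now.
func (i *Instance) Idle(now time.Time) bool {
	for _, a := range i.Allocs {
		if a.Alive(now) {
			return false
		}
	}
	return true
}

// simulate renews a lease every 30 seconds for two minutes, then stops
// renewing, as if the client had been killed or lost connectivity.
func simulate() (idleWhileRenewing, idleAfterStop bool) {
	const ttl = time.Minute // assumed lease duration
	start := time.Unix(0, 0)
	alloc := &Alloc{ID: "alloc-1"}
	inst := &Instance{Allocs: []*Alloc{alloc}}

	now := start
	for ; now.Before(start.Add(2 * time.Minute)); now = now.Add(30 * time.Second) {
		alloc.Keepalive(now, ttl)
	}
	// The last renewal is still within its lease at this point.
	idleWhileRenewing = inst.Idle(now)
	// No further renewals arrive, so the lease lapses one ttl later.
	idleAfterStop = inst.Idle(now.Add(ttl))
	return idleWhileRenewing, idleAfterStop
}

func main() {
	renewing, stopped := simulate()
	fmt.Println("idle while renewing:", renewing)
	fmt.Println("idle after client stops:", stopped)
}
```

Note that the server side never needs an explicit stop request in this scheme: the absence of renewals is enough for the instance to be considered idle, which is why it covers both a killed job and lost connectivity.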
Thanks for all the explanations! I will close this, but you could re-open it if it serves as a useful reminder to add an architecture doc. I am quite surprised by some aspects of the current operational architecture, although I see how it was probably the easiest way to get things to a good enough state. I use a more centralized job flow controller that I could try connecting this to in order to achieve an async workflow, etc. |
The design may be unusual, but it's not done out of ease or convenience. This design lets us very easily control the whole compute stack and provision resources as they are needed; the cloud is elastic after all, and Reflow fully exploits this. With respect to synchronous vs. asynchronous execution, I view this as an orthogonal concern: an external system can be responsible for managing individual Reflow jobs. (This is what we do at GRAIL.) This split of responsibilities keeps things both orthogonal and simple: Reflow interprets and executes Reflow programs on a cloud provider; a separate system is responsible for managing a set of jobs. |
Sorry, I may have misunderstood your description; going off an architecture doc would make things clearer :) It also makes a lot more sense that you do have a system for async execution. It would be great to see the architecture of that as well, even if no implementation is provided. |
The code is beautiful. Just want to ask if you plan to publish an architecture doc soon. Thanks! |
Hi @chenpengcheng, thanks! Yes, this is getting closer to the top of my list... |
@mariusae Is this still on the list and still supposed to go in Reflow core? As always, very interesting project! |
@hchauvin yep, @prasadgopal is working on this actively right now :-) |
Awesome, thanks! |
Thanks for opening this up and the nice docs so far. I am wondering if a basic architecture doc focused on infrastructure and runtime can be produced. Things that come to mind: