Remake input data placement upon site list changes #12040
PS: based on implementation details, we might have to break this down into 2 or 3 issues.
@amaltaro as a logical continuation of the effort on delivering the dynamic SiteLists changes, I decided it would be good to take this issue as well and start looking into/thinking about the eventual logic while we go through the review process of #12099. Please let me know if you think this was not a good idea and I will step down.
@todor-ivanov there are a couple of decisions that are still not clear to me on this development; they are:
I wonder what your thoughts are on this? UPDATE: just to add, we have a very similar challenge with ticket #12039, where we are also considering an external messaging queue (NATS).
Hi @amaltaro. Indeed, those are really important questions. On:
I can 100% agree this is the place. No need for yet another service to deal with this.
This is tricky. But let's think logically. We have basically two options:
NOTE: by "change of conditions" here I refer to the change of input data location based on the update of the SiteWhitelist and SiteBlacklist at the workflow level.
Thinking from the perspective of those two options, I'd definitely be in favor of a single action triggered on a user request, which is a moment in time we already know. Following the path of regular updates would sooner or later lead to spawning yet another service (about which I already expressed my skepticism). So, following this line of thought, I'd suggest we develop a dedicated API call in the MSTransferor microservice, through which we trigger the proper transfer rule updates based on the user request to change the SiteLists at the workflow level, with the rest of the code placed in MSTransferor.
This would of course mean we may have to rethink (eventually) the transfer rule retention policy, in case it turns out we do not keep the information long enough. If this leads to a change in the concept of MSTransferor, or to changing/adding a different backend database, etc. (I mean, we are not using MongoDB in MSTransferor, which would be the perfect database for this), then I am OK with that too, but it may cost some more work. If we are not ready to pay the price of a conceptual change in MSTransferor (or, even better, if we find a way to avoid it), we must think of all these things from the perspective of the current MSTransferor design and add any functionality or workaround there as needed. (Honestly, I hope the current design is already flexible enough that we avoid such drastic changes.)
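To make the idea concrete, here is a minimal Python sketch of the logic such a dedicated MSTransferor call could implement, assuming the service already knows the rule ids of the original input data placement. All names here (buildRseExpression, updateSiteLists, the example workflow and rule ids) are hypothetical illustrations, not existing WMCore code.

```python
# Hypothetical sketch only: the logical flow of a dedicated MSTransferor call
# triggered when the user updates the SiteLists of a workflow. None of these
# names exist in WMCore; they illustrate the proposal, not an implementation.

def buildRseExpression(siteWhitelist, siteBlacklist):
    """Translate the workflow site lists into a Rucio-style RSE expression."""
    allowed = [site for site in siteWhitelist if site not in siteBlacklist]
    return "|".join(allowed)


def updateSiteLists(workflowName, siteWhitelist, siteBlacklist, previousRuleIds):
    """
    Triggered on user request (e.g. by ReqMgr2 once the spec update is done).
    Returns the list of rule updates that would have to be applied in Rucio.
    """
    rseExpr = buildRseExpression(siteWhitelist, siteBlacklist)
    updates = []
    for ruleId in previousRuleIds:
        # in the real service, this is where the transfer rule would be
        # re-placed/moved against the new RSE expression
        updates.append({"workflow": workflowName,
                        "rule_id": ruleId,
                        "new_rse_expression": rseExpr})
    return updates


if __name__ == "__main__":
    print(updateSiteLists("example_workflow",
                          ["T1_US_FNAL", "T2_DE_DESY", "T2_CH_CERN"],
                          ["T2_CH_CERN"],
                          ["rule-id-1", "rule-id-2"]))
```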
Just to add here: there is a minor difference between what we need to do here and what needs to be done in #12039. The difference is simple: we do not need to preserve previous state, because at this stage everything is still retained at the central services, we know the exact moment a SiteList change happens, and we also know whether there was an actual change or the lists stayed the same. So we can still develop a mechanism based on a set of REST calls between the central services and trigger the proper actions. This is in contrast to how we propagate information between the agents and the central services, where everything is done mostly through CouchDB, which implies the need to know and be able to compare previous and current states.
I spoke with Todor via MM channel and will post his suggestion here. @amaltaro, at this moment I need your feedback on the following questions before I assign this issue to myself and work on it:
Todor's proposal:
But in order to avoid code conflicts we must converge with Alan on dmwm/WMCore#12099.
Hi @vkuznet
As recently discussed, NATS is out of the game. With this approach, we need to be mindful of concurrency between updating the spec in ReqMgr2 and fetching an up-to-date spec in MSTransferor. If we have that covered, then the REST endpoint can be "static", where we simply provide the name of the workflow for which we want to trigger a data (re-)placement. An alternative would be an endpoint that receives 3 parameters: workflow name, site whitelist and site blacklist.
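For illustration, the two endpoint alternatives could look like this; the field names and the workflow name are placeholders, not an agreed API.

```python
# Illustrative request payloads for the two alternatives above; keys and the
# workflow name are assumptions, not a final API definition.

# Alternative 1: "static" endpoint - only the workflow name is sent and
# MSTransferor fetches the up-to-date spec (site lists) from ReqMgr2 itself.
payload_workflow_only = {"workflow": "example_workflow_name"}

# Alternative 2: the caller passes the new site lists explicitly, which avoids
# the race with the ReqMgr2 spec update at the cost of trusting the caller.
payload_with_site_lists = {
    "workflow": "example_workflow_name",
    "SiteWhitelist": ["T1_US_FNAL", "T2_DE_DESY"],
    "SiteBlacklist": ["T2_CH_CERN"],
}
```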
Yes, it is correct that this feature depends on ReqMgr2. However, most of the implementation relies solely on the MSTransferor codebase, so there is no need to wait for anything and development can proceed concurrently. To summarize: we can start working on the MSTransferor code right away, and development-level validation can also be performed without any changes in ReqMgr2. Once we are happy with that, we can look into integrating this feature into ReqMgr2 (still within this same ticket).
@amaltaro, if I understand you correctly, the following changes should be made outside ReqMgr2:
Please provide these details to proceed. |
Perhaps
I would be in favor of providing solely the workflow name. But it really depends on how we can re-trigger the data placement, which is covered in your next question.
I would be in favor of having a model similar to the data migration in DBS, where the workflow would be: The reason I prefer an asynchronous data placement is that this call can take many seconds, as we have to re-execute the data discovery (what are the final blocks that need to be transferred) for the workflow in question. Another alternative for this data discovery would be to fetch the Rucio rule ids from the previous data placement, grab the DIDs in those rules, and use those for the next data placement. However, I fear that blocks could be invalidated during the lifetime of a workflow, which would lead Rucio to return a DIDNotFound exception.
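As a rough illustration of that second alternative, the sketch below reuses previously persisted rule ids to recover the DIDs and create a new rule at the new location. The Rucio client calls are believed to exist but should be treated as assumptions, and the DataIdentifierNotFound handling mirrors the invalidated-block concern above.

```python
# Hedged sketch of the "reuse previous rule ids" alternative. Rucio client
# call names/signatures should be double-checked; the overall flow is the point.
from rucio.client import Client
from rucio.common.exception import DataIdentifierNotFound


def replaceFromPreviousRules(previousRuleIds, newRseExpression):
    """Re-place the DIDs of previous transfer rules against a new RSE expression."""
    rucio = Client()
    newRuleIds = []
    for ruleId in previousRuleIds:
        rule = rucio.get_replication_rule(ruleId)
        did = {"scope": rule["scope"], "name": rule["name"]}
        try:
            # one copy of the same DID, but now at the new RSE expression
            newRuleIds.extend(
                rucio.add_replication_rule([did], copies=1,
                                           rse_expression=newRseExpression))
        except DataIdentifierNotFound:
            # the block/container was invalidated during the workflow lifetime;
            # it would have to be skipped or re-discovered from scratch
            continue
    return newRuleIds
```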
Impact of the new feature
MSTransferor
Is your feature request related to a problem? Please describe.
This is a sub-task of this meta-issue: #8323
Towards providing a feature that allows workflow site lists to be changed while workflows are active in the system.
Describe the solution you'd like
Whenever the site lists of an active workflow change, we need to revisit the input data placement rules - if any - and update their RSE expressions accordingly. That will involve at least the following:
This can be implemented either:
a) synchronously with the status transition, but that might make the client HTTP request just too expensive (ReqMgr2 + GlobalQueue + Rucio changes all in a single client call)
b) asynchronous, but then we need a mechanism to flag/identify workflows that need data re-placement.
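For option b), the flagging mechanism could be as simple as a queue (or a flag in the request document) that an MSTransferor cycle drains asynchronously. This is only a sketch with made-up names, not a proposal for a specific backend.

```python
# Minimal illustration of option b): flag workflows at status-transition time
# (cheap for the client) and process them later in an asynchronous cycle.
# The in-memory queue stands in for whatever persistent flag store is chosen.
import queue

replacementQueue = queue.Queue()


def flagForReplacement(workflowName):
    """Called synchronously when the site lists change - returns immediately."""
    replacementQueue.put(workflowName)


def replacementCycle(handleWorkflow):
    """Runs periodically (e.g. inside MSTransferor) and drains flagged workflows."""
    while not replacementQueue.empty():
        handleWorkflow(replacementQueue.get())
```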
Describe alternatives you've considered
There are a few ways to implement this feature, like:
a) trigger a new data placement with the new site lists
b) once the new rule creation is successful, we could delete the previous, superseded rule
c) similar to a), but instead of making a new rule, we could consume the rule ids already persisted in the database (via MSTransferor/MSMonitor) and update their RSE expression accordingly.
Hasan made the following observation for case c): "You cannot update (update-rule) the rse expression of a rule and keep the same rule id. You can change the rse expression by "moving" (move-rule) a rule which creates a new rule."
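Given Hasan's observation, option c) effectively becomes a rule "move": Rucio creates a new rule at the new RSE expression, and the new rule id then has to replace the old one in the MSTransferor/MSMonitor bookkeeping. Below is a minimal sketch assuming the Rucio client's move_replication_rule call; its exact signature and return value vary across Rucio versions, so treat it as an assumption.

```python
# Sketch only: verify move_replication_rule's signature against the deployed
# Rucio client before relying on this.
from rucio.client import Client


def moveRule(oldRuleId, newRseExpression):
    rucio = Client()
    # "moving" a rule supersedes the old one and yields a brand-new rule id,
    # which then has to be updated in the service's bookkeeping
    newRuleId = rucio.move_replication_rule(oldRuleId, newRseExpression)
    return newRuleId
```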
Additional context
NOTE that the best approach would be to let MSTransferor take care of this data re-placement, such that it considers everything (campaign configuration, pileup configuration, RSE quota, etc.).