
MSOutput: set output datasets to VALID in DBS3 before announcing a standard workflow #9837

Open

amaltaro opened this issue Jul 13, 2020 · 8 comments · May be fixed by dmwm/deployment#1044 or #10394

@amaltaro (Contributor)

Impact of the new feature
ReqMgr2MS (MSOutput)

Is your feature request related to a problem? Please describe.
This has been discussed with Todor and Sharad, and it looks like we could accommodate a few more actions in the MSOutput, such that these functionalities can be deprecated in Unified.

Describe the solution you'd like
In addition to the output data placement performed by MSOutput (for wfs in closed-out and announced), we should also mark all those datasets as VALID in DBS3.
The DBS client API is:

```python
from dbs.apis.dbsClient import DbsApi

dbsapi = DbsApi(url=url3)
dbsapi.updateDatasetType(dataset=DATASET_NAME, dataset_access_type='VALID')
```

but if we decide to run requests concurrently, then we could use the equivalent REST API (ask Yuyi for details).
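If we do go concurrent, the same client call can also be fanned out with a thread pool instead of the raw REST API. A minimal sketch, assuming a list of output dataset names; `FakeDbsApi` is a stand-in defined here only so the snippet is self-contained (the real client is `dbs.apis.dbsClient.DbsApi` with the same `updateDatasetType` signature):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

class FakeDbsApi:
    """Stand-in for dbs.apis.dbsClient.DbsApi so this sketch is runnable;
    it only records the calls it receives."""
    def __init__(self):
        self.updated = []

    def updateDatasetType(self, dataset, dataset_access_type):
        self.updated.append((dataset, dataset_access_type))

def set_valid_concurrently(dbsapi, datasets, max_workers=4):
    """Mark every dataset VALID, issuing the client calls in parallel."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(dbsapi.updateDatasetType, dataset=ds,
                               dataset_access_type='VALID'): ds
                   for ds in datasets}
        for fut in as_completed(futures):
            ds = futures[fut]
            try:
                fut.result()
                results[ds] = True
            except Exception:
                results[ds] = False
    return results

dbsapi = FakeDbsApi()
outcome = set_valid_concurrently(dbsapi, ["/PrimDs/ProcDs-v1/AODSIM",
                                          "/PrimDs/ProcDs-v1/MINIAODSIM"])
```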

The best order of actions, in my opinion, would be:

  1. mark output datasets as VALID
  2. if previous step was successful, then perform the output data placement

NOTE: RelVal workflows are not announced in standalone mode, but in batches. In order to ease this transition, we should only mark output datasets as VALID for standard workflows; Unified will keep taking care of RelVals for a little longer.
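The proposed ordering, including the RelVal exclusion, can be sketched as below. This is only an illustration: the `SubRequestType` check is an assumption about the request document, and `set_valid` / `place_output` are hypothetical callables, not the actual MSOutput code.

```python
def announce_outputs(workflow, set_valid, place_output):
    """Apply the proposed ordering to a single workflow:
    1. mark every output dataset VALID in DBS;
    2. only if step 1 fully succeeded, perform the data placement.
    RelVals are skipped and left to Unified (the SubRequestType
    check is an assumption about the request document)."""
    if workflow.get("SubRequestType") == "RelVal":
        return "skipped-relval"
    if not all(set_valid(ds) for ds in workflow["OutputDatasets"]):
        return "dbs-update-failed"  # do not place data after a DBS failure
    for ds in workflow["OutputDatasets"]:
        place_output(ds)
    return "announced"

wf = {"RequestName": "test_wf", "OutputDatasets": ["/A/B-v1/AODSIM"]}
placed = []
status = announce_outputs(wf, set_valid=lambda ds: True,
                          place_output=placed.append)
```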

Describe alternatives you've considered
Keep workflow announcement in Unified

Additional context
Here we have a more complete description of the workflow announcement process:
#8921
but some of those have already been implemented, and others might be slightly modified.

@haozturk

Hi @amaltaro, I was thinking about this issue. I did not fully get why you suggested dealing with RelVals specially. As far as I understand from the Unified code, RelVals are created in batches, but they are announced in standalone mode. The relevant Unified module sets output datasets to VALID regardless of the workflow type: [1]
[1] https://github.com/CMSCompOps/WmAgentScripts/blob/master/Unified/closor.py#L494

@amaltaro (Contributor, Author)

RelVal announcement only happens in batch - when all workflows within the batch have completed - and the scenario I wanted to avoid was to have a batch getting rejected, or something like that, while some of its workflows have their output datasets marked as VALID and others are still marked as PRODUCTION.

This DBS dataset status change would likely happen when workflows are set to closed-out. If you consider this not to be a real issue, then we can just deal with them in a standard way.

@haozturk

I understand your point. Let's try to think about cases where a problem can occur if we set RelVal outputs as VALID in a standalone mode in MSOutput:

Let's say, in a batch some RelVal outputs are completely produced and set as VALID while others are still in PRODUCTION. If the batch is rejected at this moment, then all the outputs (both VALID and PRODUCTION ones) are going to be invalidated, so no problem. Can you think of a problem here?

IMO, the only critical item here is that when Unified is going to announce the batch, all outputs should be VALID. (Otherwise, we tell people that they can use these datasets when they can't, right?) So, somehow, MSOutput should be quicker than Unified. However, this argument applies to output data placement as well, i.e. Unified announces the workflows without making sure that the Rucio rules are created. Since we haven't seen any issue with output data placement, I believe we will not see an issue with setting the outputs as VALID, either.

Please let me know if I am missing something.

@haozturk

@amaltaro @todor-ivanov Besides my statement in the previous comment, I have another point to discuss: Is there a specific reason for having the following order of actions?

  1. Set the output datasets as VALID on DBS.
  2. If previous step is successful, do the output data placement.

What if the 1st step is successful and the 2nd one is not? Then we're telling users that they can use this dataset while we haven't put any protection in place for that data. The other way around makes more sense to me. What do you think?

@amaltaro (Contributor, Author)

Let's say, in a batch some RelVal outputs are completely produced and set as VALID while others are still in PRODUCTION. If the batch is rejected at this moment, then all the outputs (both VALID and PRODUCTION ones) are going to be invalidated, so no problem. Can you think of a problem here?

If all outputs are invalidated, then indeed it should not be a problem.

So, somehow, MSOutput should be quicker than Unified.

This is something we cannot guarantee! Those services are asynchronous. For instance, if Unified moves the last workflow from a batch to closed-out, then in the next minute it starts moving all the workflows in the batch to announced, it's very likely that some workflows will still not have their final output data placement made by MSOutput.

What if the 1st step is successful and 2nd one is not. Then we're saying the users that you can use this dataset while we don't put a protection for that data. The other way around makes more sense to me. What do you think?

This is a good question. The reason we should have the DBS status change first is that nothing changes if we (try to) set a dataset that is already VALID to VALID. Unlike a Rucio rule, the Tape destination is evaluated within the polling cycle, so we would risk creating multiple output data placements if we ordered them as you suggested.
Well, unless we always persist the output of the Rucio output data placement, even if other actions failed... if I'm not wrong, this is how the service is implemented at the moment.

In short, it could be that changing the order would have no negative effect :-D
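To make the retry argument above concrete, here is a toy sketch (all names hypothetical, not MSOutput code): repeating the DBS update is harmless because it is idempotent, while repeating the placement step creates another rule unless the first outcome was persisted and checked first.

```python
# Toy state illustrating the retry argument; names are hypothetical.
dbs_status = {"/A/B-v1/AODSIM": "PRODUCTION"}
placement_rules = []

def set_valid(dataset):
    # Idempotent: setting an already-VALID dataset to VALID changes nothing.
    dbs_status[dataset] = "VALID"

def place_output(dataset):
    # Not idempotent: each call creates another rule unless the first
    # outcome was persisted and checked before retrying.
    placement_rules.append(dataset)

# Simulate a retry after a partial failure:
set_valid("/A/B-v1/AODSIM")
set_valid("/A/B-v1/AODSIM")
place_output("/A/B-v1/AODSIM")
place_output("/A/B-v1/AODSIM")
```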

@haozturk

I am afraid I did not get your point. How could we create multiple output data placement rules if we use the order that I am suggesting?

@amaltaro (Contributor, Author)

Right. The order you suggest is:

  1. Run the output data placement.
  2. If successful, then set the output datasets as VALID on DBS.

Correct?

In this order, there is more code to be executed before we can persist the MongoDB document changes, which I believe happens in this method:

```python
def docUploader(self, msOutDoc, update=False, keys=None, stride=None):
```

whereas performing the data placement as the final stage (the order I suggested) would be the step right before persisting the Mongo changes.

However, from a quick look at the code, I believe the correct thing will be to carry out those actions sequentially, regardless of the status of the previous step. This is how I understand the concept of pipeline processing which Todor introduced in MSOutput. Of course, we need to keep the final state of each step, such that at the end of the pipeline we know the final outcome for a given workflow.
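The "run every step, record every outcome" idea described above could look roughly like this. It is only a sketch of the concept; step names and the document layout are hypothetical, not the actual MSOutput implementation.

```python
def run_pipeline(doc, steps):
    """Run every step in order regardless of earlier failures, and record
    each step's outcome so the final document shows the full picture."""
    outcomes = {}
    for name, func in steps:
        try:
            func(doc)
            outcomes[name] = "ok"
        except Exception as exc:
            outcomes[name] = "failed: %s" % exc
    doc["stepOutcomes"] = outcomes
    return doc

def broken_placement(doc):
    raise RuntimeError("rucio unavailable")

doc = run_pipeline({"workflow": "test_wf"}, [
    ("setDatasetsValid", lambda d: None),
    ("outputDataPlacement", broken_placement),
    ("docUploader", lambda d: None),
])
```

Note that `docUploader` still runs and persists the document even though the placement step failed, which matches the behaviour discussed above.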

@amaltaro (Contributor, Author)

@haozturk Hi Hasan, given the lack of activity on this issue and the fact that we actually provide a mechanism at the workflow spec level to decide which status the DBS data should be injected with, see:
#11236

I wonder whether we can now consider this issue and the initial developments no longer relevant. Please let us know your thoughts in the coming week.
